In this post, I’ll walk through how I extended the base Sitecore XM 10.4 AKS setup with a modern observability stack using Grafana, Alloy, Loki, and Prometheus. This setup provides deep insights into both infrastructure and application health, with powerful log aggregation and visualization.
Project Overview
- Base: Sitecore XM 10.4 running on Azure Kubernetes Service (AKS)
- Enhancements: Added a full Grafana observability stack:
- Grafana for dashboards and visualization
- Alloy (Grafana Alloy, the successor to Promtail) for log collection and multiline parsing
- Loki for log aggregation and querying
- Prometheus for metrics collection
All configuration files and setup scripts are available in my public GitHub repo.
Why Add Grafana, Alloy, Loki, and Prometheus?
Sitecore’s default AKS setup provides only the most basic health and logging capabilities. Out of the box, log files are stored on a persistent volume as plain .txt files. There is no built-in UI for searching, filtering, or correlating these logs across pods or services, which makes troubleshooting and monitoring in a distributed Kubernetes environment extremely challenging. With the Grafana stack, you gain:
- Centralized log aggregation
- Powerful, flexible dashboards
- Metrics-based alerting and troubleshooting
Component Breakdown
1. Grafana
Grafana is the visualization layer. It connects to both Loki (for logs) and Prometheus (for metrics), letting you build dashboards that combine infrastructure and application data.
Example dashboard:
- Top: CPU and memory usage for Windows node (akswin000005)
- Bottom: Real-time Sitecore CD and CM logs (warnings and errors), parsed and searchable
2. Alloy (Grafana Alloy, the successor to Promtail)
Alloy is the log collector and shipper. It runs as a DaemonSet on both Windows and Linux nodes, tails log files, and forwards them to Loki.
Key detail: Multiline log parsing
Sitecore logs are not simple one-line entries. For example, an ERROR log often includes a stack trace and nested exceptions, spanning many lines. If you simply tail these files line by line, you lose the context of the error. That’s why Grafana Alloy is configured with a custom multiline parser (see alloy-configmap.yaml in the repo). This parser recognizes the start of a new log entry (using a regex for the timestamp or thread prefix) and groups all subsequent lines until the next entry, ensuring each log event is captured in full.
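To make the grouping idea concrete, here is a minimal Python sketch of the same technique. It is not the Alloy configuration from the repo: the regex and the sample lines are assumptions made up for illustration, based only on the "thread prefix, then timestamp" shape described above.

```python
import re

# Assumed first-line pattern: a thread prefix (a numeric id or a name such as
# "ManagedPoolThread #3"), then HH:MM:SS, then a log level.
ENTRY_START = re.compile(
    r"^(?:\d+|[\w ]*#\d+) \d{2}:\d{2}:\d{2} (?:DEBUG|INFO|WARN|ERROR|FATAL)\b"
)

def group_entries(lines):
    """Group raw lines into entries; a line matching ENTRY_START opens a new entry."""
    entries, current = [], []
    for line in lines:
        if ENTRY_START.match(line) and current:
            # A new entry begins, so flush the one collected so far.
            entries.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries

# Made-up sample lines, for illustration only.
sample = [
    "4120 10:15:02 INFO  Cache created",
    "4120 10:15:03 ERROR Application error.",
    "Exception: System.InvalidOperationException",
    "Message: Illustrative failure",
    "   at Example.Pipeline.Process()",
    "ManagedPoolThread #3 10:15:04 INFO  Job ended",
]
entries = group_entries(sample)
print(len(entries))  # 3 entries; the ERROR entry keeps its three continuation lines
```

The continuation lines never match the start pattern, so they stay attached to the ERROR line that precedes them. A multiline stage in a log collector applies the same rule to the stream it tails.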
In Grafana, each error then appears as a single Loki log entry:
As logs are parsed, Alloy extracts key fields such as timestamp and log level, and attaches extra labels like job=sitecore, role=cm or role=cd, etc. This makes it easy to filter logs in Grafana by environment, service, or severity.
Example log labels in Loki/Grafana:
job="sitecore"
role="cm" or role="cd"
level="ERROR" or level="INFO"
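As a rough illustration of how such labels can be derived, the Python sketch below parses the first line of an entry and builds a label set. The line layout ("thread, time, level, message"), the regex, and the `label_entry` helper are assumptions for this example; the actual extraction lives in the Alloy configuration.

```python
import re

# Assumed layout of an entry's first line: "<thread> HH:MM:SS LEVEL message"
FIRST_LINE = re.compile(
    r"^(?P<thread>.+?) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+)\s+(?P<message>.*)$"
)

def label_entry(entry, role):
    """Return Loki-style labels for one (possibly multiline) log entry."""
    first_line = entry.split("\n", 1)[0]
    labels = {"job": "sitecore", "role": role}
    match = FIRST_LINE.match(first_line)
    if match:
        # Only the level becomes a label; the message stays in the log line.
        labels["level"] = match.group("level")
    return labels

# Made-up entry, for illustration only.
entry = "4120 10:15:03 ERROR Application error.\nException: System.InvalidOperationException"
print(label_entry(entry, role="cm"))
```

Keeping only low-cardinality values such as role and level as labels matches how Loki indexes streams. Free text is better searched in the log line itself.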
3. Loki
Loki is the log aggregation backend. It stores logs from Sitecore pods, indexed by labels (e.g., role=cm, role=cd, job=sitecore).
Log querying in Grafana:
- Filter logs by labels (e.g., role=cm or role=cd)
- Search for errors, warnings, or specific text
Example Loki Role Filter:
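A role filter like this corresponds to a LogQL stream selector, optionally followed by a line filter. The small Python helper below shows how such a query string is put together. The function name `logql_query` is hypothetical, and it does not escape quotes inside values, so treat it as a sketch only.

```python
def logql_query(labels, contains=None):
    """Build a LogQL query: a label selector plus an optional line filter."""
    selector = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    query = "{" + selector + "}"
    if contains:
        # |= keeps only log lines that contain the given text
        query += f' |= "{contains}"'
    return query

print(logql_query({"job": "sitecore", "role": "cm"}))
print(logql_query({"job": "sitecore", "role": "cd"}, contains="Exception"))
```

The resulting strings can be pasted into the query field of Grafana's Explore view with the Loki data source selected.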
4. Prometheus
Prometheus scrapes metrics from your cluster (including Windows nodes via the Windows exporter). It provides data for CPU, memory, and custom application metrics.
Example Prometheus query:
100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100)
This shows CPU usage percentage per Windows node.
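The arithmetic behind that query is simple: rate() over the idle counter gives the fraction of each second that a core spent idle, avg by (instance) averages that fraction across the node's cores, and subtracting from 100 turns it into a usage percentage. The Python sketch below mirrors the calculation with made-up per-core idle fractions; the function name is hypothetical.

```python
def cpu_usage_percent(idle_rates_per_core):
    """Mirror of: 100 - avg(rate(idle seconds)) * 100 for one node."""
    avg_idle = sum(idle_rates_per_core) / len(idle_rates_per_core)
    return 100 - avg_idle * 100

# Made-up idle fractions for a 4-core node over the 5-minute window.
usage = cpu_usage_percent([0.75, 0.5, 1.0, 0.75])
print(usage)  # 25.0
```

An average idle fraction of 0.75 therefore shows up as 25% CPU usage for that node.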
Visualization:
How to Set Up
1. Clone the repo:
git clone https://github.com/dogabenli/grafana-sitecore-demo.git
2. Deploy all components:
Deploy Sitecore AKS first, followed by the Grafana components.
3. Access Grafana:
Expose the Grafana service and log in to start exploring dashboards.
Conclusion
By adding Grafana, Alloy, Loki, and Prometheus to your Sitecore AKS environment, you gain:
- Centralized, searchable logs (with multiline support)
- Real-time dashboards for both infrastructure and application health
- The ability to troubleshoot and correlate issues quickly
All code and configuration are available on GitHub: https://github.com/dogabenli/grafana-sitecore-demo