
Modern Observability for Sitecore 10.4 on AKS: Grafana, Alloy, Loki, and Prometheus


In this post, I’ll walk through how I extended the base Sitecore XM 10.4 AKS setup with a modern observability stack using Grafana, Alloy, Loki, and Prometheus. This setup provides deep insights into both infrastructure and application health, with powerful log aggregation and visualization.

Project Overview

  • Base: Sitecore XM 10.4 running on Azure Kubernetes Service (AKS)
  • Enhancements: a full Grafana observability stack:
      • Grafana for dashboards and visualization
      • Alloy (Grafana Alloy, the successor to Promtail) for log collection and multiline parsing
      • Loki for log aggregation and querying
      • Prometheus for metrics collection

All configuration files and setup scripts are available in my public GitHub repo.


Why Add Grafana, Alloy, Loki, and Prometheus?

Sitecore’s default AKS setup provides only the most basic health and logging capabilities. Out of the box, log files are stored on a persistent volume as plain .txt files. There is no built-in UI for searching, filtering, or correlating these logs across pods or services, which makes troubleshooting and monitoring in a distributed Kubernetes environment extremely challenging. With the Grafana stack, you gain:

  • Centralized log aggregation 
  • Powerful, flexible dashboards
  • Metrics-based alerting and troubleshooting

Component Breakdown

1. Grafana

Grafana is the visualization layer. It connects to both Loki (for logs) and Prometheus (for metrics), letting you build dashboards that combine infrastructure and application data. 

Example dashboard:


  • Top: CPU and memory usage for the Windows node (akswin000005)
  • Bottom: Real-time Sitecore CD and CM logs (warnings and errors), parsed and searchable

2. Alloy (Grafana Alloy, the successor to Promtail)

Alloy is the log collector and shipper. It runs as a DaemonSet on both Windows and Linux nodes, tails log files, and forwards them to Loki.

Key detail: Multiline log parsing

Sitecore logs are not simple one-line entries. For example, an ERROR log often includes a stack trace and nested exceptions, spanning many lines.

If you simply tail these files line-by-line, you lose the context of the error. That’s why Grafana Alloy is configured with a custom multiline parser (see alloy-configmap.yaml in the repo). This parser recognizes the start of a new log entry (using a regex for the timestamp or thread prefix) and groups all subsequent lines until the next entry, ensuring each log event is captured in full.
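To make the grouping idea concrete, here is a minimal Python sketch of first-line-based multiline parsing. It is not the Alloy configuration itself (in Alloy this kind of grouping is typically done with a multiline stage driven by a first-line regex), and both the FIRST_LINE pattern and the sample log lines are assumptions for illustration; the actual regex is the one in alloy-configmap.yaml.

```python
import re
from typing import Iterable, Iterator

# Assumed first-line pattern: a thread name or PID, then HH:MM:SS and a log level.
# Illustrative only; the real pattern lives in alloy-configmap.yaml.
FIRST_LINE = re.compile(
    r"^\S.*?\s\d{2}:\d{2}:\d{2}\s+(?:DEBUG|INFO|WARN|ERROR|FATAL)\b"
)


def group_entries(lines: Iterable[str]) -> Iterator[str]:
    """Group raw lines into entries; a new entry starts at each FIRST_LINE match."""
    current: list[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if FIRST_LINE.match(line) and current:
            # A new entry begins, so flush the one collected so far.
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)


# Hypothetical sample lines, shaped like a Sitecore log.
SAMPLE = [
    "4624 10:15:30 INFO  Cache created: 'web[data]'",
    "ManagedPoolThread #3 10:15:32 ERROR Publish failed",
    "Exception: System.InvalidOperationException",
    "Message: Item not found",
    "   at Sitecore.Demo.Publisher.Run()",
    "4624 10:15:33 WARN  Slow request",
]

entries = list(group_entries(SAMPLE))
for entry in entries:
    print(repr(entry))
```

With this sample, the ERROR line and the three lines that follow it end up in one entry, so the stack trace travels with the error it belongs to.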

The full error message, stack trace included, then shows up as a single log entry in Grafana Loki.

As logs are parsed, Alloy extracts key fields such as timestamp and log level, and attaches extra labels like job=sitecore, role=cm or role=cd, etc. This makes it easy to filter logs in Grafana by environment, service, or severity.

Example log labels in Loki/Grafana:

job="sitecore"

role="cm" or role="cd"

level="ERROR" or level="INFO"
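The field extraction and labeling step can be sketched in Python as follows. This is an illustration of the idea rather than the repo's Alloy pipeline: the ENTRY regex, the label_entry helper, and the sample entry are hypothetical, and only the label names (job, role, level) come from the setup described above.

```python
import re

# Assumed entry layout: "<thread or PID> HH:MM:SS LEVEL message...".
# Illustrative only; the real extraction rules are in the Alloy config.
ENTRY = re.compile(
    r"^(?P<thread>\S[^\n]*?)\s(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR|FATAL)\s+(?P<message>[\s\S]*)"
)


def label_entry(entry: str, role: str) -> dict:
    """Return static labels plus fields parsed from the entry's first line."""
    labels = {"job": "sitecore", "role": role}
    match = ENTRY.match(entry)
    if match is None:
        # Unparsed entries keep only the static labels.
        return {"labels": labels, "time": None, "message": entry}
    labels["level"] = match.group("level")
    return {
        "labels": labels,
        "time": match.group("time"),
        "message": match.group("message"),
    }


# Hypothetical multiline entry from a CM pod.
result = label_entry(
    "ManagedPoolThread #3 10:15:32 ERROR Publish failed\n   at X.Run()", "cm"
)
print(result["labels"])
```

Keeping the label set small (job, role, level) matters because Loki indexes by label; high-cardinality values such as the message text stay in the log line itself.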


3. Loki

Loki is the log aggregation backend. It stores logs from Sitecore pods, indexed by labels (e.g., role=cm, role=cd, job=sitecore).

Log querying in Grafana:


  • Filter logs by labels (e.g., role=cm or role=cd)
  • Search for errors, warnings, or specific text

Example Loki Role Filter:






4. Prometheus

Prometheus scrapes metrics from your cluster (including Windows nodes via the Windows exporter). It provides data for CPU, memory, and custom application metrics.

Example Prometheus query:

100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100)

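This shows the CPU usage percentage per Windows node: the per-core idle rate is averaged for each instance, converted to a percentage, and subtracted from 100.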
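The arithmetic behind that query can be checked with a small Python sketch. The per-core idle rates below are made-up numbers; in practice they are what rate(windows_cpu_time_total{mode="idle"}[5m]) returns for each core, in idle seconds per second.

```python
def cpu_usage_percent(idle_rates: list[float]) -> float:
    """Mirror the PromQL expression: 100 - (avg(idle rate) * 100)."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical idle rates for a 4-core node (1.0 means the core was fully idle).
cores = [0.5, 0.75, 1.0, 0.75]
usage = cpu_usage_percent(cores)
print(f"{usage:.1f}% CPU used")
```

Here the average idle rate is 0.75, so the node is reported as 25% busy.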





How to Set Up

  1. Clone the repo:

git clone https://github.com/dogabenli/grafana-sitecore-demo.git

  2. Deploy all components:

Deploy Sitecore AKS first, followed by Grafana components.

  3. Access Grafana:

Expose the Grafana service and log in to start exploring dashboards.


Conclusion

By adding Grafana, Alloy, Loki, and Prometheus to your Sitecore AKS environment, you gain:

  • Centralized, searchable logs (with multiline support)
  • Real-time dashboards for both infrastructure and application health
  • The ability to troubleshoot and correlate issues quickly

All code and configuration are available on GitHub: https://github.com/dogabenli/grafana-sitecore-demo



