Prometheus Recording Rules

Recording rules pre-aggregate common queries so the carbon dashboard loads quickly. Without them, each API call runs expensive increase() queries across all time series in real time.

The rules file is in ada-carbon-monitoring-api: recording_rules.yml

Production Prometheus: https://host-172-16-100-248.nubes.stfc.ac.uk

Why Recording Rules?

The carbon monitoring API needs to calculate CPU usage over time periods (hourly, daily). The raw Prometheus query for this is:

sum by (cloud_project_name) (
  increase(node_cpu_seconds_total{mode!="idle", cloud_project_name="IDAaaS"}[1h])
)

This query scans every node_cpu_seconds_total sample in the last hour, computes the increase for each time series, filters by mode, and sums by project. With hundreds of machines and multiple CPU cores each, this is slow.

Recording rules run these queries on a schedule and store the results as new time series. The API then queries the pre-computed result directly:

ada:cpu_busy_seconds_increase_1h:by_project{cloud_project_name="IDAaaS"}

This returns instantly because the value is already computed.

Setup

1. Copy the rules file

Copy recording_rules.yml to the Prometheus server, in the same directory as prometheus.yml.

2. Add to prometheus.yml

rule_files:
  - "recording_rules.yml"

3. Reload Prometheus

# Option 1: Send SIGHUP
kill -HUP $(pidof prometheus)

# Option 2: HTTP reload (if --web.enable-lifecycle is enabled)
curl -X POST http://localhost:9090/-/reload

# Option 3: Restart the service
sudo systemctl restart prometheus

4. Verify

Open the Prometheus UI at https://host-172-16-100-248.nubes.stfc.ac.uk/rules or query the API:

curl -s https://host-172-16-100-248.nubes.stfc.ac.uk/api/v1/rules | jq '.data.groups | length'
# Expected: 3

Rule Groups

There are 3 groups with 16 rules total, each computing busy and idle CPU seconds at different granularities.

Group 1: CPU Aggregations

Name: ada_carbon_cpu_aggregations Evaluation interval: every 1 minute

Pre-aggregated CPU totals. These sum node_cpu_seconds_total across all CPU cores and modes.

Rule	Labels	Description
`ada:cpu_busy_seconds_total:by_project`	cloud_project_name	Busy CPU across all machines in a project
`ada:cpu_idle_seconds_total:by_project`	cloud_project_name	Idle CPU across all machines in a project
`ada:cpu_busy_seconds_total:by_project_machine`	cloud_project_name, machine_name	Busy CPU per machine type
`ada:cpu_idle_seconds_total:by_project_machine`	cloud_project_name, machine_name	Idle CPU per machine type
`ada:cpu_busy_seconds_total:by_project_machine_host`	cloud_project_name, machine_name, host	Busy CPU per individual host
`ada:cpu_idle_seconds_total:by_project_machine_host`	cloud_project_name, machine_name, host	Idle CPU per individual host

Busy means all modes except idle (user, system, nice, irq, softirq, steal, iowait). Idle means the idle mode only.

Group 2: Hourly Increases

Name: ada_carbon_hourly_increases Evaluation interval: every 5 minutes

These compute increase(...[1h]) - the number of CPU seconds added in the last hour. The carbon API uses these directly for energy and carbon calculations.

Rule	Labels	Description
`ada:cpu_busy_seconds_increase_1h:by_project`	cloud_project_name	Hourly busy increase per project
`ada:cpu_idle_seconds_increase_1h:by_project`	cloud_project_name	Hourly idle increase per project
`ada:cpu_busy_seconds_increase_1h:by_project_machine`	cloud_project_name, machine_name	Hourly busy per machine type
`ada:cpu_idle_seconds_increase_1h:by_project_machine`	cloud_project_name, machine_name	Hourly idle per machine type
`ada:cpu_busy_seconds_increase_1h:by_project_machine_host`	cloud_project_name, machine_name, host	Hourly busy per host
`ada:cpu_idle_seconds_increase_1h:by_project_machine_host`	cloud_project_name, machine_name, host	Hourly idle per host

Group 3: Daily Increases

Name: ada_carbon_daily_increases Evaluation interval: every 15 minutes

These compute increase(...[1d]) for daily summary views and the heatmap.

Rule	Labels	Description
`ada:cpu_busy_seconds_increase_1d:by_project`	cloud_project_name	Daily busy increase per project
`ada:cpu_idle_seconds_increase_1d:by_project`	cloud_project_name	Daily idle increase per project
`ada:cpu_busy_seconds_increase_1d:by_project_machine`	cloud_project_name, machine_name	Daily busy per machine type
`ada:cpu_idle_seconds_increase_1d:by_project_machine`	cloud_project_name, machine_name	Daily idle per machine type

How the API Uses These Rules

The carbon monitoring API queries these recording rules to calculate energy and carbon:

1. Query: ada:cpu_busy_seconds_increase_1h:by_project{cloud_project_name="IDAaaS"}
   Result: 8313.1 busy CPU seconds in the last hour

2. Query: ada:cpu_idle_seconds_increase_1h:by_project{cloud_project_name="IDAaaS"}
   Result: 28564580 idle CPU seconds in the last hour

3. Calculate energy:
   busy_kwh = 12W x 8313.1s / 3,600,000 = 0.0277 kWh
   idle_kwh = 1W x 28564580s / 3,600,000 = 7.93 kWh
   total_kwh = 7.96 kWh

4. Get carbon intensity: 185 gCO2/kWh (from UK Grid API)

5. Calculate carbon: 7.96 kWh x 185 gCO2/kWh = 1472.6 gCO2eq

Label Reference

The recording rules use these Prometheus labels from node_cpu_seconds_total:

Label	Description	Examples
`cloud_project_name`	OpenStack project	IDAaaS, CDAaaS, DDAaaS
`machine_name`	Machine type within a project	Muon, Laser, Analysis, SANS
`host`	Individual machine hostname	172.16.100.50, workspace-abc-muon-0
`mode`	CPU mode	user, system, idle, iowait, nice, irq, softirq, steal

Naming Convention

Recording rule names follow the Prometheus convention:

ada:metric_name:aggregation_level

ada: - namespace prefix
cpu_busy_seconds_total or cpu_busy_seconds_increase_1h - what is being measured
by_project, by_project_machine, by_project_machine_host - aggregation level