
If you run end-to-end (E2E) tests on a Kubernetes operator, you've seen the pattern: a test that passes 80% of the time still fails often enough to block continuous integration (CI), waste developer hours, and train your team to reflexively `/retest`. Without historical data, you can't distinguish a flaky test from a regression. Without automation, the only remedy is a human noticing and filing a ticket.

This guide shows you how to build a complete quarantine system backed by a Prometheus-compatible time-series database and Grafana, running on a long-lived cluster that provides continuous observability into your test suite's health.

What you'll build

A Grafana dashboard showing per-test health with automated quarantine decisions, Jira ticket creation, and a self-healing feedback loop, all powered by industry-standard Prometheus metrics.

Prerequisites:

  • A long-lived OpenShift/Kubernetes cluster (or any cluster that stays up)
  • Periodic E2E test runs producing JUnit XML (Prow periodic jobs or scheduled GitHub Actions)
  • `helm` and `kubectl` or `oc` access to the cluster
  • A Jira project for tracking quarantined tests (optional but recommended)

Step 1: Deploy Prometheus

Deploy a dedicated Prometheus instance for test analytics. On OpenShift you might already have a cluster monitoring stack, but a separate instance keeps test data isolated and gives you control over retention.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/prometheus \
  --namespace e2e-analytics --create-namespace \
  --set server.retention=90d \
  --set server.persistentVolume.size=20Gi \
  --set server.resources.requests.memory=512Mi \
  --set server.resources.requests.cpu=250m \
  --set alertmanager.enabled=false \
  --set prometheus-node-exporter.enabled=false \
  --set kube-state-metrics.enabled=false \
  --set prometheus-pushgateway.enabled=false \
  --set 'server.extraFlags[0]=web.enable-remote-write-receiver' \
  --set 'serverFiles.prometheus\.yml.storage.tsdb.out_of_order_time_window=720h' \
  --set server.securityContext.runAsNonRoot=true \
  --set server.securityContext.runAsUser=null \
  --set server.securityContext.fsGroup=null \
  --set server.containerSecurityContext.allowPrivilegeEscalation=false \
  --set "server.containerSecurityContext.capabilities.drop={ALL}" \
  --set server.containerSecurityContext.runAsNonRoot=true \
  --set server.containerSecurityContext.seccompProfile.type=RuntimeDefault
```

The `--web.enable-remote-write-receiver` flag enables the remote-write endpoint so our ingester can push data in. The `out_of_order_time_window` storage config allows the ingester to backfill historical data (required when first loading past results).

OpenShift SCC note

The default Prometheus Helm chart sets `runAsUser: 65534` and `fsGroup: 65534`, which are rejected by OpenShift's `restricted-v2` SCC. These `securityContext` overrides clear these defaults so the container runs with the UID assigned by OpenShift.

Verify it's running:

```bash
kubectl -n e2e-analytics get pods -l app.kubernetes.io/name=prometheus
kubectl -n e2e-analytics port-forward svc/prometheus-server 9090:80 &
```

Test the query endpoint:

```bash
curl -s 'http://localhost:9090/api/v1/query?query=up'
```

The primary endpoints include:

  • Remote-write ingest: `http://prometheus-server.e2e-analytics.svc:80/api/v1/write`
  • PromQL query: `http://prometheus-server.e2e-analytics.svc:80/api/v1/query`
  • Range query: `http://prometheus-server.e2e-analytics.svc:80/api/v1/query_range`

Alternative (OpenShift)

If you're on OpenShift 4.x, you can use the built-in user workload monitoring instead. Enable it in the `cluster-monitoring-config` ConfigMap, and your metrics are automatically available via the Thanos Querier at `https://thanos-querier.openshift-monitoring.svc:9091`. This gives you Prometheus without deploying anything extra.

Step 2: Define the metric schema

Instead of SQL tables, we define Prometheus metrics with labels. This is the interface contract—any component that writes these metrics (GCS scraper, push gateway, future sources) is compatible.

Metrics:

| Metric name | Type | Labels | Description |
|---|---|---|---|
| `e2e_test_result` | Gauge (0/1) | `test`, `suite`, `job`, `build_id`, `commit_sha`, `branch` | 1 = passed, 0 = failed |
| `e2e_test_duration_seconds` | Gauge | `test`, `suite`, `job`, `build_id`, `branch` | Test execution duration |
| `e2e_test_error_info` | Gauge (1) | `test`, `suite`, `error_category`, `error_message` | Error classification (info metric) |

Label schema:

```
e2e_test_result{
  test="TestOperator/components/group_1/dashboard/validate_config",
  suite="e2e-operator",
  job="periodic-ci-operator-main-e2e",
  build_id="1234567890",
  commit_sha="abc123f",
  branch="main"
} 0  # 0 = failed, 1 = passed
```

Each test execution produces one `e2e_test_result` sample per test case. The timestamp is the run time. This gives us a time series of pass/fail per test that PromQL can aggregate over any window.
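To make the 1/0 encoding concrete, here is a minimal Go sketch of the arithmetic a window aggregation performs over one test's samples: the sum of the samples is the pass count and the number of samples is the run count. The function name `flakeRate` and the sample data are illustrative only and are not part of the ingester.

```go
package main

import "fmt"

// flakeRate mirrors 1 - sum_over_time/count_over_time for a single test:
// with 1 = pass and 0 = fail, the sum of the samples is the number of passes.
// It returns false when there are no samples to judge.
func flakeRate(samples []float64) (float64, bool) {
	if len(samples) == 0 {
		return 0, false
	}
	var passes float64
	for _, s := range samples {
		passes += s
	}
	return 1 - passes/float64(len(samples)), true
}

func main() {
	// Hypothetical window: four runs, one failure.
	rate, ok := flakeRate([]float64{1, 1, 0, 1})
	fmt.Println(rate, ok) // prints "0.25 true"
}
```

Returning a second value for the empty case avoids reporting a 0% flake rate for a test that has not run at all.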

Step 3: Build the JUnit ingester

Create a Go binary that parses JUnit XML, converts results to Prometheus metrics, and pushes them via remote-write to Prometheus.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"net/http"
	"time"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// Note: prompb types use gogoproto and have their own Marshal() method.
// Do NOT use google.golang.org/protobuf/proto: it requires ProtoReflect(),
// which gogoproto types don't implement.

type JUnitTestSuite struct {
	XMLName    xml.Name        `xml:"testsuite"`
	Name       string          `xml:"name,attr"`
	Timestamp  string          `xml:"timestamp,attr"`
	TestCases  []JUnitTestCase `xml:"testcase"`
	Properties []Property      `xml:"properties>property"`
}

type JUnitTestCase struct {
	Name string  `xml:"name,attr"`
	Time float64 `xml:"time,attr"`
	// Pointers, so an absent <failure> or <error> element decodes to nil.
	Failure *JUnitFailure `xml:"failure"`
	Error   *JUnitFailure `xml:"error"`
}

type JUnitFailure struct {
	Message string `xml:"message,attr"`
	Body    string `xml:",chardata"`
}

type Property struct {
	Name  string `xml:"name,attr"`
	Value string `xml:"value,attr"`
}

// extractProperty returns the value of the named <property>, or "" if absent.
func extractProperty(props []Property, name string) string {
	for _, p := range props {
		if p.Name == name {
			return p.Value
		}
	}
	return ""
}

// parseTimestamp parses the suite's timestamp attribute. JUnit writers differ
// on whether they include a zone offset, so both layouts are tried. Falling
// back to the current time is an assumption of this listing.
func parseTimestamp(s string) time.Time {
	for _, layout := range []string{time.RFC3339, "2006-01-02T15:04:05"} {
		if t, err := time.Parse(layout, s); err == nil {
			return t
		}
	}
	return time.Now()
}

func junitToTimeSeries(suite JUnitTestSuite, prowJob, buildID string) []prompb.TimeSeries {
	commitSHA := extractProperty(suite.Properties, "commit.sha")
	branch := extractProperty(suite.Properties, "branch")
	if branch == "" {
		branch = "main"
	}

	runTS := parseTimestamp(suite.Timestamp)
	tsMs := runTS.UnixMilli()

	var series []prompb.TimeSeries
	for _, tc := range suite.TestCases {
		passed := tc.Failure == nil && tc.Error == nil
		var resultValue float64
		if passed {
			resultValue = 1
		}

		// e2e_test_result metric
		series = append(series, prompb.TimeSeries{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "e2e_test_result"},
				{Name: "test", Value: tc.Name},
				{Name: "suite", Value: suite.Name},
				{Name: "job", Value: prowJob},
				{Name: "build_id", Value: buildID},
				{Name: "commit_sha", Value: commitSHA},
				{Name: "branch", Value: branch},
			},
			Samples: []prompb.Sample{
				{Value: resultValue, Timestamp: tsMs},
			},
		})

		// e2e_test_duration_seconds metric
		series = append(series, prompb.TimeSeries{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "e2e_test_duration_seconds"},
				{Name: "test", Value: tc.Name},
				{Name: "suite", Value: suite.Name},
				{Name: "job", Value: prowJob},
				{Name: "build_id", Value: buildID},
				{Name: "branch", Value: branch},
			},
			Samples: []prompb.Sample{
				{Value: tc.Time, Timestamp: tsMs},
			},
		})
	}
	return series
}

func remoteWrite(endpoint string, series []prompb.TimeSeries) error {
	req := &prompb.WriteRequest{Timeseries: series}
	data, err := req.Marshal()
	if err != nil {
		return fmt.Errorf("marshaling write request: %w", err)
	}
	compressed := snappy.Encode(nil, data)

	httpReq, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(compressed))
	if err != nil {
		return fmt.Errorf("creating request: %w", err)
	}
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return fmt.Errorf("sending remote write: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusOK {
		return fmt.Errorf("remote write returned %d", resp.StatusCode)
	}
	return nil
}

// main (artifact discovery and flag parsing) is not shown in this listing.
```
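The metric schema also lists an `e2e_test_error_info` series with `error_category` and `error_message` labels. One way to derive those labels from a JUnit failure message might look like the sketch below. The category names, the matched substrings, and the helper names `classifyError` and `truncate` are all hypothetical and should be tuned to your own suite.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyError maps a JUnit failure message to a coarse category.
// The categories and substrings here are hypothetical examples.
func classifyError(msg string) string {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "timed out"), strings.Contains(m, "deadline exceeded"):
		return "timeout"
	case strings.Contains(m, "connection refused"), strings.Contains(m, "no such host"):
		return "infrastructure"
	case strings.Contains(m, "expected"):
		return "assertion"
	default:
		return "unknown"
	}
}

// truncate bounds the label length (in runes) so that long failure
// messages do not inflate label values.
func truncate(s string, max int) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max])
}

func main() {
	msg := "context deadline exceeded while waiting for Deployment"
	fmt.Println(classifyError(msg), "|", truncate(msg, 16))
}
```

Keeping the category set small and the message short limits label cardinality, which matters because every distinct label value creates a new series.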

Deploy as a CronJob that fetches JUnit artifacts from your Google Cloud Storage (GCS) bucket:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: junit-ingester
  namespace: e2e-analytics
spec:
  schedule: "0 */4 * * *"  # every four hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: ingester
              image: quay.io/your-org/junit-ingester:latest
              env:
                - name: REMOTE_WRITE_ENDPOINT
                  value: "http://prometheus-server.e2e-analytics.svc:80/api/v1/write"
                - name: GCS_BUCKET
                  value: "test-platform-results"
                - name: PROW_JOB
                  value: "periodic-ci-operator-main-e2e"
          restartPolicy: OnFailure
```

Quick validation

After the first ingestion, verify data is flowing.

Query for any test results:

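```bash
curl -s 'http://localhost:9090/api/v1/query?query=e2e_test_result' | jq '.data.result | length'
```

Check a specific test (PromQL regexes are fully anchored, so wrap the substring in `.*`, and let curl URL-encode the braces and quotes):

```bash
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=e2e_test_result{test=~".*dashboard.*"}' | jq .
```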

Step 4: Set up Grafana

Deploy Grafana and point it at Prometheus as the data source.

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install grafana grafana/grafana \
  --namespace e2e-analytics \
  --set persistence.enabled=true \
  --set persistence.size=5Gi \
  --set adminPassword="$(openssl rand -base64 16)" \
  --set "datasources.datasources\\.yaml.apiVersion=1" \
  --set "datasources.datasources\\.yaml.datasources[0].name=E2E Metrics" \
  --set "datasources.datasources\\.yaml.datasources[0].type=prometheus" \
  --set "datasources.datasources\\.yaml.datasources[0].url=http://prometheus-server.e2e-analytics.svc:80" \
  --set "datasources.datasources\\.yaml.datasources[0].access=proxy" \
  --set "datasources.datasources\\.yaml.datasources[0].isDefault=true" \
  --set securityContext.runAsNonRoot=true \
  --set securityContext.runAsUser=null \
  --set securityContext.fsGroup=null \
  --set containerSecurityContext.allowPrivilegeEscalation=false \
  --set "containerSecurityContext.capabilities.drop={ALL}" \
  --set containerSecurityContext.runAsNonRoot=true \
  --set containerSecurityContext.seccompProfile.type=RuntimeDefault \
  --set initChownData.enabled=false
```

OpenShift SCC note

As with Prometheus, clear the default `runAsUser`/`fsGroup` and disable the init `chown` container (which tries to run as root). On non-OpenShift clusters these overrides are harmless.

Expose Grafana

On OpenShift:

oc -n e2e-analytics create route edge grafana --service=grafana --port=3000

Or port-forward for local access:

kubectl -n e2e-analytics port-forward svc/grafana 3000:80 &

Dashboard panels (PromQL)

Panel 1: Per-test flake rate (30-day rolling window).

```promql
# Since e2e_test_result is 1 (pass) or 0 (fail), sum_over_time counts passes
sort_desc(
  1 - (
    sum by (test) (sum_over_time(e2e_test_result{branch="main"}[30d]))
    /
    sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d]))
  )
)
```

Select Table as the panel type and configure the following columns: Test Name, Flake Rate, and Total Runs.

PromQL note

An earlier version of this query used `count_over_time(e2e_test_result{...} == 1 [30d])` to count passes. This is invalid PromQL because the `== 1` comparison produces an instant vector, and `count_over_time` requires a range vector selector. Because `e2e_test_result` uses 1/0 encoding, `sum_over_time` directly gives the pass count, making the query both correct and simpler.

Panel 2: Flake rate time series (per test, daily resolution).

```promql
# Daily flake rate for a specific test (use $test variable)
1 - (
  sum by (test) (sum_over_time(e2e_test_result{test="$test", branch="main"}[1d]))
  /
  sum by (test) (count_over_time(e2e_test_result{test="$test", branch="main"}[1d]))
)
```

Display as a Time Series panel. Add a threshold line at 0.2 (20%) to show the quarantine boundary.

Panel 3: Test health heatmap.

```promql
# Pass rate per test per day (for heatmap)
sum by (test) (sum_over_time(e2e_test_result{branch="main"}[1d]))
/
sum by (test) (count_over_time(e2e_test_result{branch="main"}[1d]))
```

Panel 4: Regression detection.

```promql
# Tests with 0% pass rate in the last 4 days (potential regression, not flake)
(
  sum by (test) (sum_over_time(e2e_test_result{branch="main"}[4d]))
  /
  sum by (test) (count_over_time(e2e_test_result{branch="main"}[4d]))
) == 0
```

Tests matching this pattern that previously had a low flake rate are regressions—the code broke, not the test.

Panel 5: Test duration trends.

```promql
# Average duration per test over time
avg by (test) (avg_over_time(e2e_test_duration_seconds{branch="main"}[1d]))
```

Alert rules

Configure Grafana alerting to fire when a test crosses the quarantine threshold:

```yaml
# Grafana alert rule (configured via UI or provisioning)
name: Test Flake Rate Exceeded
condition: flake_rate > 0.2
expr: |
  (
    1 - (
      sum by (test) (sum_over_time(e2e_test_result{branch="main"}[30d]))
      /
      sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d]))
    )
  ) > 0.2
  and
  sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d])) >= 10
for: 0m
labels:
  severity: warning
annotations:
  summary: "Test {{ $labels.test }} flake rate exceeded 20%"
```
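The alert combines two conditions: a flake rate above 20% and at least 10 runs in the window. As a rough illustration of that decision on raw counts, consider the sketch below. The function name `crossesThreshold` and the example counts are made up for illustration, while the two constants mirror the values in the alert expression.

```go
package main

import "fmt"

const (
	flakeThreshold     = 0.20 // same boundary as the alert expression
	minRunsForDecision = 10   // ignore tests with too little history
)

// crossesThreshold applies the alert's two conditions to raw counts:
// enough runs to decide, and a failure share strictly above the threshold.
func crossesThreshold(passes, runs int) bool {
	if runs < minRunsForDecision {
		return false
	}
	flakeRate := 1 - float64(passes)/float64(runs)
	return flakeRate > flakeThreshold
}

func main() {
	fmt.Println(crossesThreshold(13, 20)) // 35% failures over 20 runs: true
	fmt.Println(crossesThreshold(2, 5))   // 60% failures but only 5 runs: false
	fmt.Println(crossesThreshold(18, 20)) // 10% failures: false
}
```

The minimum-run guard keeps a brand-new test with two unlucky runs from being flagged before there is enough history to judge it.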

Step 5: Build the quarantine controller (Go)

The quarantine controller queries Prometheus via PromQL, identifies flaky tests, excludes regressions, and outputs a quarantine config.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"math"
	"net/http"
	"net/url"
	"os"
	"time"
)

const (
	flakeThreshold         = 0.20
	minRunsForDecision     = 10
	windowDays             = 30
	quarantineDurationDays = 30
)

type PromQueryResult struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

type QuarantineEntry struct {
	Name          string  `json:"name"`
	Reason        string  `json:"reason"`
	FlakeRate     float64 `json:"flake_rate"`
	TotalRuns     int     `json:"total_runs"`
	FailedRuns    int     `json:"failed_runs"`
	Jira          string  `json:"jira,omitempty"`
	QuarantinedAt string  `json:"quarantined_at"`
	ReEnableAfter string  `json:"re_enable_after"`
}

type QuarantineConfig struct {
	Version int                        `json:"version"`
	Updated string                     `json:"updated"`
	Tests   map[string]QuarantineEntry `json:"tests"`
}

// promQuery runs an instant PromQL query and decodes the JSON response.
// It replaces three copies of the same request/decode code and honors ctx.
func promQuery(ctx context.Context, promURL, query string) (*PromQueryResult, error) {
	u := fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query))
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, fmt.Errorf("creating request: %w", err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("querying Prometheus: %w", err)
	}
	defer resp.Body.Close()

	var result PromQueryResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, fmt.Errorf("decoding response: %w", err)
	}
	return &result, nil
}

func queryFlakeRates(ctx context.Context, promURL string) (map[string]float64, error) {
	query := fmt.Sprintf(`
		1 - (
			sum by (test) (sum_over_time(e2e_test_result{branch="main"}[%dd]))
			/
			sum by (test) (count_over_time(e2e_test_result{branch="main"}[%dd]))
		)
	`, windowDays, windowDays)

	result, err := promQuery(ctx, promURL, query)
	if err != nil {
		return nil, fmt.Errorf("querying flake rates: %w", err)
	}

	rates := make(map[string]float64)
	for _, r := range result.Data.Result {
		testName := r.Metric["test"]
		// Value is [timestamp, "value_string"]
		if valStr, ok := r.Value[1].(string); ok {
			var val float64
			fmt.Sscanf(valStr, "%f", &val)
			rates[testName] = val
		}
	}
	return rates, nil
}

func queryRunCounts(ctx context.Context, promURL string) (map[string]int, error) {
	query := fmt.Sprintf(`sum by (test) (count_over_time(e2e_test_result{branch="main"}[%dd]))`, windowDays)

	result, err := promQuery(ctx, promURL, query)
	if err != nil {
		return nil, fmt.Errorf("querying run counts: %w", err)
	}

	counts := make(map[string]int)
	for _, r := range result.Data.Result {
		testName := r.Metric["test"]
		if valStr, ok := r.Value[1].(string); ok {
			var val int
			fmt.Sscanf(valStr, "%d", &val)
			counts[testName] = val
		}
	}
	return counts, nil
}

func isRegression(ctx context.Context, promURL, testName string) (bool, error) {
	// A regression = 0% pass rate in the recent window (all runs failed).
	// Uses sum_over_time/count_over_time instead of a last_over_time subquery,
	// which is unreliable with high-cardinality build_id labels.
	query := fmt.Sprintf(
		`(sum(sum_over_time(e2e_test_result{test="%s", branch="main"}[4d])) / sum(count_over_time(e2e_test_result{test="%s", branch="main"}[4d])))`,
		testName, testName,
	)

	result, err := promQuery(ctx, promURL, query)
	if err != nil {
		return false, err
	}

	// If the pass rate is 0, all recent runs failed: likely a regression.
	for _, r := range result.Data.Result {
		if valStr, ok := r.Value[1].(string); ok {
			var val float64
			fmt.Sscanf(valStr, "%f", &val)
			if val == 0 {
				return true, nil
			}
		}
	}
	return false, nil
}

func buildQuarantineConfig(ctx context.Context, promURL string) (*QuarantineConfig, error) {
	flakeRates, err := queryFlakeRates(ctx, promURL)
	if err != nil {
		return nil, err
	}
	runCounts, err := queryRunCounts(ctx, promURL)
	if err != nil {
		return nil, err
	}

	now := time.Now().UTC()
	cfg := &QuarantineConfig{
		Version: 1,
		Updated: now.Format(time.RFC3339),
		Tests:   make(map[string]QuarantineEntry),
	}

	for testName, rate := range flakeRates {
		runs := runCounts[testName]
		if rate < flakeThreshold || runs < minRunsForDecision {
			continue
		}

		regression, err := isRegression(ctx, promURL, testName)
		if err != nil {
			return nil, fmt.Errorf("checking regression for %s: %w", testName, err)
		}
		if regression {
			continue // Don't quarantine regressions
		}

		// Round rather than truncate so floating-point error cannot drop a failure.
		failedRuns := int(math.Round(rate * float64(runs)))
		cfg.Tests[testName] = QuarantineEntry{
			Name:          testName,
			Reason:        fmt.Sprintf("Flake rate %.0f%% over %dd (%d/%d failed)", rate*100, windowDays, failedRuns, runs),
			FlakeRate:     rate,
			TotalRuns:     runs,
			FailedRuns:    failedRuns,
			QuarantinedAt: now.Format(time.RFC3339),
			ReEnableAfter: now.AddDate(0, 0, quarantineDurationDays).Format(time.RFC3339),
		}
	}
	return cfg, nil
}

func main() {
	promURL := os.Getenv("PROMETHEUS_URL")
	if promURL == "" {
		promURL = "http://prometheus-server.e2e-analytics.svc:80"
	}

	ctx := context.Background()
	cfg, err := buildQuarantineConfig(ctx, promURL)
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		os.Exit(1)
	}

	data, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(data))
}
```

Deploy as a daily CronJob (yaml):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: quarantine-controller
  namespace: e2e-analytics
spec:
  schedule: "0 6 * * *"  # daily at 06:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: controller
              image: quay.io/your-org/quarantine-controller:latest
              env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus-server.e2e-analytics.svc:80"
                - name: JIRA_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: jira-credentials
                      key: token
                - name: JIRA_SERVER
                  value: "https://redhat.atlassian.net"
                - name: JIRA_PROJECT
                  value: "PROJECT"
                - name: GIT_REPO
                  value: "repository"
                - name: GITHUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: github-credentials
                      key: token
          restartPolicy: OnFailure
```

The controller:

  • Queries Prometheus for flake rates via PromQL
  • Excludes regressions (tests whose runs all failed in the last four days)
  • Outputs a quarantine JSON config
  • Creates Jira tickets for newly quarantined tests
  • Commits the config to Git (pull request or direct push)

The Go listing covers the first three steps. Jira ticket creation and the Git commit are not shown.

Step 6: Wire the test runner

Your E2E test runner loads the quarantine config and skips active entries. The quarantine controller exports this JSON:

```json
{
  "version": 1,
  "updated": "2026-06-09T06:00:00Z",
  "tests": {
    "TestOperator/components/group_1/dashboard/validate_config": {
      "name": "TestOperator/components/group_1/dashboard/validate_config",
      "reason": "Flake rate 35% over 30d (7/20 failed)",
      "flake_rate": 0.35,
      "total_runs": 20,
      "failed_runs": 7,
      "jira": "JIRA-60123",
      "quarantined_at": "2026-05-15T06:00:00Z",
      "re_enable_after": "2026-06-14T06:00:00Z"
    }
  }
}
```

At test startup, load the config and build a skip regex (Go):

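```go
// buildSkipRegex needs the "regexp" and "strings" packages.
func buildSkipRegex(cfg *QuarantineConfig) string {
	var patterns []string
	for name := range cfg.Tests {
		// go test matches each slash-separated segment separately,
		// so anchor and escape every segment on its own.
		segments := strings.Split(name, "/")
		escaped := make([]string, len(segments))
		for i, seg := range segments {
			escaped[i] = "^" + regexp.QuoteMeta(seg) + "$"
		}
		patterns = append(patterns, strings.Join(escaped, "/"))
	}
	return strings.Join(patterns, "|")
}
```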

Pass the result to `go test -skip` (the `-skip` flag requires Go 1.20 or later):

```bash
SKIP_REGEX=$(quarantine-tool build-skip-regex --config tests/e2e/quarantine.json)
go test ./tests/e2e/... \
  -v -timeout 60m \
  -skip "$SKIP_REGEX"
```
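To see what the generated pattern looks like, here is a small self-contained sketch of the same per-segment anchoring applied to a plain list of names. The function name `skipPattern` and both test names are invented for the example, and the names are sorted only to make the output deterministic.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// skipPattern anchors and escapes each slash-separated segment of every
// test name, then joins the per-test patterns with "|".
func skipPattern(names []string) string {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted) // deterministic output, unlike map iteration
	patterns := make([]string, 0, len(sorted))
	for _, name := range sorted {
		segments := strings.Split(name, "/")
		for i, seg := range segments {
			segments[i] = "^" + regexp.QuoteMeta(seg) + "$"
		}
		patterns = append(patterns, strings.Join(segments, "/"))
	}
	return strings.Join(patterns, "|")
}

func main() {
	fmt.Println(skipPattern([]string{
		"TestOperator/dashboard/validate_config",
		"TestOperator/api/list_v1.2",
	}))
	// prints "^TestOperator$/^api$/^list_v1\.2$|^TestOperator$/^dashboard$/^validate_config$"
}
```

Escaping matters for names such as `list_v1.2`: without `regexp.QuoteMeta`, the dot would match any character and could skip a neighbouring subtest.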

Step 7: Close the feedback loop

The system is self-healing by design:

  • Quarantined tests expire. After `quarantine_duration_days`, the entry is removed and the test runs again in CI.
  • If the test is still flaky, the next analysis cycle re-quarantines it (with a fresh Jira ticket reference).

  • If someone fixes the test, it passes consistently and is never re-quarantined.
  • Jira resolution check: The controller queries Jira for resolved tickets and proactively un-quarantines those tests early.

The controller's cleanup logic (runs every cycle):

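```go
func cleanupExpired(cfg *QuarantineConfig) {
	now := time.Now().UTC()
	for name, entry := range cfg.Tests {
		// A malformed timestamp parses to the zero time, so such an
		// entry is treated as already expired and removed.
		expiry, _ := time.Parse(time.RFC3339, entry.ReEnableAfter)
		if now.After(expiry) {
			delete(cfg.Tests, name)
		}
	}
}
```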
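The Jira resolution check from the list above could follow the same shape as the expiry cleanup. The sketch below assumes the set of resolved ticket keys has already been fetched; how that set is built (a Jira search, a webhook, and so on) is deliberately left out. The type `Entry`, the function `releaseResolved`, and the ticket keys are hypothetical.

```go
package main

import "fmt"

// Entry holds the one field this sketch needs from a quarantine entry.
type Entry struct {
	Jira string
}

// releaseResolved removes entries whose Jira key is in resolved and
// returns the names of the released tests. Populating resolved is out
// of scope here and is an assumption of this sketch.
func releaseResolved(tests map[string]Entry, resolved map[string]bool) []string {
	var released []string
	for name, e := range tests {
		if e.Jira != "" && resolved[e.Jira] {
			delete(tests, name) // deleting during range is safe in Go
			released = append(released, name)
		}
	}
	return released
}

func main() {
	tests := map[string]Entry{
		"TestA": {Jira: "PROJ-1"},
		"TestB": {Jira: "PROJ-2"},
	}
	released := releaseResolved(tests, map[string]bool{"PROJ-1": true})
	fmt.Println(released, len(tests)) // prints "[TestA] 1"
}
```

Entries without a ticket key are left alone, so they still leave quarantine only through the expiry path.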

Step 8: Add PR visibility

Add a CI check that posts a comment on every pull request (PR) showing the current quarantine status (yaml):

```yaml
name: Quarantine Status
on:
  pull_request:
    branches: [main]

permissions:
  contents: read
  pull-requests: write  # needed to post the comment

jobs:
  quarantine-status:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Post quarantine table
        env:
          GH_TOKEN: ${{ github.token }}  # the gh CLI reads its token from GH_TOKEN
        run: |
          CONFIG="tests/e2e/quarantine.json"
          COUNT=$(jq '.tests | length' "$CONFIG")
          if [ "$COUNT" -eq 0 ]; then exit 0; fi
          echo "## Quarantined E2E Tests" > /tmp/comment.md
          echo "${COUNT} tests are quarantined and will be skipped." >> /tmp/comment.md
          echo "" >> /tmp/comment.md
          echo "| Test | Jira | Flake Rate | Expires |" >> /tmp/comment.md
          echo "|------|------|-----------|---------|" >> /tmp/comment.md
          jq -r '.tests[] | "| \(.name) | \(.jira // "-") | \(.flake_rate * 100 | floor)% | \(.re_enable_after // "-") |"' \
            "$CONFIG" >> /tmp/comment.md
          gh pr comment "${{ github.event.number }}" --body-file /tmp/comment.md
```

Step 9: Connect to your CI pipeline

The system above is useful only when real test results flow into it. This section shows how to wire it to the three CI platforms most relevant to OpenShift projects.

Option A: OpenShift CI (Prow)

OpenShift CI stores all job artifacts in a public GCS bucket (`test-platform-results`). After every E2E run, Prow uploads `$ARTIFACT_DIR` contents to:

```
gs://test-platform-results/
  logs/{periodic-job-name}/{build_id}/              # periodic jobs
  pr-logs/pull/{org}_{repo}/{pr}/{job}/{build_id}/  # presubmit jobs
```

Your test runner must produce JUnit XML inside `$ARTIFACT_DIR`. Most Go test harnesses support this via `gotestsum --junitfile` or a wrapper that converts `go test -json` output. If you use a Makefile, a common pattern is:

```makefile
ifdef ARTIFACT_DIR
export JUNIT_OUTPUT_PATH = ${ARTIFACT_DIR}/junit_report.xml
endif
```

The ingester CronJob (Step 3) scrapes this bucket using the GCS JSON API (no auth needed for public buckets):

```bash
# List recent builds for a job
curl -s "https://storage.googleapis.com/storage/v1/b/test-platform-results/o?\
prefix=logs/periodic-ci-my-org-my-operator-main-e2e/&delimiter=/"

# Download JUnit from a specific build
curl -s "https://storage.googleapis.com/test-platform-results/\
logs/{job}/{build_id}/artifacts/{workflow}/e2e/artifacts/junit_report.xml"

# Get run metadata (timestamp, commit SHA)
curl -s "https://storage.googleapis.com/test-platform-results/\
logs/{job}/{build_id}/started.json"
# {"timestamp":1765889560, "repo-commit":"abc123f", ...}
```

The ingester maps GCS metadata to Prometheus labels:

| Prometheus label | GCS source |
|---|---|
| `test` | `<testcase name="...">` in `junit_report.xml` |
| `suite` | `<testsuite name="...">` in `junit_report.xml` |
| `job` | Path segment (the Prow job name) |
| `build_id` | Path segment (numeric build ID) |
| `commit_sha` | `started.json` `repo-commit` |
| `branch` | `main` for periodics, PR number for presubmits |

Important: For accurate flake detection, scrape periodic jobs (which run on `main` without code changes), not presubmit jobs (which mix test flakes with actual regressions introduced by PRs).

Option B: Konflux and Tekton pipelines

Konflux uses Tekton pipelines. The integration approach is a post-task in your E2E pipeline that pushes results directly, so no GCS scraping is needed.

Add a step to your Tekton PipelineRun that runs after E2E tests:

```yaml
# Tekton task that pushes JUnit results to Prometheus after E2E tests
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: push-test-metrics
  namespace: e2e-analytics
spec:
  params:
    - name: junit-path
      description: Path to JUnit XML file
    - name: job-name
      description: Pipeline/job identifier
    - name: build-id
      description: PipelineRun UID or build number
    - name: commit-sha
      description: Git commit SHA
  steps:
    - name: push-metrics
      image: quay.io/your-org/junit-ingester:latest
      env:
        - name: REMOTE_WRITE_ENDPOINT
          value: "http://prometheus-server.e2e-analytics.svc:80/api/v1/write"
      command:
        - /junit-ingester
        - --file=$(params.junit-path)
        - --job=$(params.job-name)
        - --build-id=$(params.build-id)
        - --commit-sha=$(params.commit-sha)
        - --branch=main
```

Wire it into your E2E pipeline as a `finally` task (runs whether tests pass or fail):

```yaml
apiVersion: tekton.dev/v1
kind: Pipeline
spec:
  tasks:
    - name: run-e2e
      taskRef:
        name: e2e-tests
      # ... test config ...
  finally:
    - name: push-metrics
      taskRef:
        name: push-test-metrics
      params:
        - name: junit-path
          value: "$(tasks.run-e2e.results.junit-path)"
        - name: job-name
          value: "konflux-my-operator-e2e"
        - name: build-id
          value: "$(context.pipelineRun.uid)"
        - name: commit-sha
          value: "$(params.git-revision)"
```

The advantage over GCS scraping: results arrive in Prometheus within seconds of the test run completing, not on a four-hour CronJob schedule.

Option C: Local or ad-hoc runs

For testing the system or running one-off analyses, you can push results from a local `make e2e-test` run:

```bash
# Run E2E tests with JUnit output
ARTIFACT_DIR=/tmp/e2e-results make e2e-test

# Push results to Prometheus (via port-forward or in-cluster)
kubectl -n e2e-analytics port-forward svc/prometheus-server 9090:80 &
/path/to/junit-ingester \
  --file /tmp/e2e-results/junit_report.xml \
  --job "local-e2e" \
  --build-id "$(date +%s)" \
  --commit-sha "$(git rev-parse HEAD)" \
  --branch "$(git rev-parse --abbrev-ref HEAD)" \
  --remote-write-endpoint http://localhost:9090/api/v1/write
```

This is useful for validating the pipeline end-to-end before deploying the CronJob or Tekton task.

Why exclude regressions from quarantine?

A regression means the code broke. Quarantining the test hides the bug. The system detects regressions by looking for a step-function pattern: mostly passing before a specific commit, then consistently failing after. These are flagged in Grafana but never auto-quarantined.
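The step-function pattern can be expressed as a simple check over a test's chronological results: a run of failures at the end, preceded by a history that mostly passed. The following is one possible sketch of that idea, not the controller's actual logic. The function name `looksLikeRegression` and both thresholds are assumptions.

```go
package main

import "fmt"

// looksLikeRegression reports whether results (oldest first, true = pass)
// end in at least minTrailing consecutive failures preceded by a history
// that passed at least minPriorPass of the time. Both thresholds are
// assumptions of this sketch.
func looksLikeRegression(results []bool, minTrailing int, minPriorPass float64) bool {
	// Count the run of failures at the end of the history.
	trailing := 0
	for i := len(results) - 1; i >= 0 && !results[i]; i-- {
		trailing++
	}
	prior := results[:len(results)-trailing]
	if trailing < minTrailing || len(prior) == 0 {
		return false
	}
	passes := 0
	for _, ok := range prior {
		if ok {
			passes++
		}
	}
	return float64(passes)/float64(len(prior)) >= minPriorPass
}

func main() {
	step := []bool{true, true, true, true, true, false, false, false}
	flaky := []bool{true, false, true, true, false, true, false, true}
	fmt.Println(looksLikeRegression(step, 3, 0.9))  // prints "true"
	fmt.Println(looksLikeRegression(flaky, 3, 0.9)) // prints "false"
}
```

A test that has never passed has no prior history to compare against, so this sketch does not call it a regression and leaves that case to human triage.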

Why automatic expiry?

Without expiry, quarantined tests become permanent exclusions. The `re_enable_after` field forces accountability: either fix the test within the window, or it returns to CI and gets re-evaluated. This prevents the quarantine list from growing unbounded.

Grafana dashboard layout

Organize your dashboard into four rows:

  • Row 1: Overview
      • Stat panel: total tests, quarantined count, overall suite pass rate
      • Pie chart: healthy / flaky / regression breakdown
      • PromQL: `count(count by (test) (e2e_test_result{branch="main"}))` for total tests
  • Row 2: Flake leaderboard
      • Table: top 20 flakiest tests with rates, run counts
      • Time series: flake rate trend for selected test (variable dropdown)
      • PromQL: see Panel 1 and Panel 2 above
  • Row 3: Regressions
      • Table: tests where all recent runs failed (0% pass rate in last four days)
      • PromQL: `(sum by (test) (sum_over_time(e2e_test_result{branch="main"}[4d])) / sum by (test) (count_over_time(e2e_test_result{branch="main"}[4d]))) == 0`
  • Row 4: Quarantine management
      • Table: loaded from quarantine JSON (or an `e2e_quarantine_active` metric the controller pushes)
      • Stat panel: tests expiring in next seven days
      • Log panel: quarantine/un-quarantine events timeline

Operational runbook

Follow these standard procedures to triage skipped tests, investigate failure causes, and manage the lifecycle of your quarantined suite.

A test was quarantined. What do I do?

  • Check the Jira ticket linked in the quarantine entry.
  • Open the Grafana dashboard, select the test from the dropdown, look at the time-series panel.
  • Identify the pattern: intermittent flake (random), or did it start at a specific commit?
  • Fix the test, verify it passes in three or more consecutive runs, close the Jira ticket.
  • The controller will un-quarantine it on the next cycle.

How do I swap the data store?

Because everything conforms to the Prometheus protocol, swapping is a config change:

  • Ingester: Point `REMOTE_WRITE_ENDPOINT` at the new endpoint (such as Thanos receiver or Mimir).
  • Grafana: Update the data source URL.
  • Quarantine controller: Update `PROMETHEUS_URL`.

All PromQL queries, dashboards, and alert rules work unchanged. That's the point of conforming to the standard.

Moving from reactive debugging to data-driven pipelines

Automating your test quarantine system moves your development team away from reactive troubleshooting and toward a data-driven pipeline. Backing your test infrastructure with Prometheus metrics provides clear historical trends to help differentiate between intermittent flakiness and true code regressions before a broken pull request blocks your main branch. This self-healing loop isolates broken tests early, forcing accountability through explicit expiry dates while directly reducing manual developer toil and increasing team velocity.