Version: 2.3.0

Open Observability

Litmus 2.0 builds on principle of Open observability through observability hooks such as probes to validate steady state hypothesis along with chaos injection. Chaos exporter also provides prometheus metrics which can be used to generate alerts based on events and also to view chaos impact and application performance in terms of probe success percentage and experiment verdict. It provides for several integration points with Prometheus's alert manager and Grafana, also for the in-house application and infrastructure monitoring capabilities with chaos events, metadata and results.

Prerequisites#

The following should be required before knowing about Open observability hooks in litmus 2.0:

Probes#

Litmus probes are pluggable checks that can be defined within the ChaosEngine for any chaos experiment. The experiment pods execute these checks based on the mode they are defined in & factor their success as necessary conditions in determining the verdict of the experiment (along with the standard in-built checks).

Litmus currently supports four types of probes:

httpProbe: To query health/downstream URIs
cmdProbe: To execute any user-desired health-check function implemented as a shell command
k8sProbe: To perform CRUD operations against native & custom Kubernetes resources
promProbe: To execute promql queries and match prometheus metrics for specific criteria

These probes can be used in isolation or in several combinations to achieve the desired checks.

More about Probes can be found here

Chaos exporter#

Chaos exporter is a custom Prometheus and CloudWatch exporter to expose Litmus Chaos metrics. Typically deployed along with the chaos-operator deployment, which, in-turn is associated with all chaosresults in the cluster.

Two types of metrics are exposed:

AggregateMetrics:#

These metrics are derived from all the chaosresults present inside WATCH_NAMESPACE. If WATCH_NAMESPACE is not defined then it derives metrics from all namespaces. It exposes total_passed_experiment, total_failed_experiment, total_awaited_experiment, experiment_run_count, experiment_installed_count metrics.

ExperimentScoped:#

Individual experiment run status. It exposes passed_experiment, failed_experiment, awaited_experiment, probe_success_percentage, startTime, endTime, totalDuration, chaosInjectTime metrics

All metrics exported from chaos exporter can be found here

Integrations#

Summary#

Litmus supports several kinds of probes and also has a chaos-exporter on it's execution plane on the target agent's cluster which is essential for interleaved monitoring, integrated alerts and to hook into existing observability infrastructure. Chaos experimentation is a lot about hypothesizing around the application and/or infrastructure behavior, controlling blast radius & measuring SLOs. SREs love to visualize the impact of chaos - either actively (live) or recorded (as with automated chaos tests)

Resources#

Observability Considerations in Chaos: The Metrics Story

Monitoring Litmus Chaos Experiments