Building Resilience with kube-probesim
A Tool to Simulate Kubernetes Probes
As the creator of kube-probesim, I wanted to solve a specific problem: simulating how Kubernetes applications handle liveness and readiness probe failures. This tool was born out of a need to quickly replicate real-world failure conditions, like random probe failures, latency spikes, and failing external dependencies, in a controlled manner.
For many Kubernetes developers and operators, building resilient systems requires more than just passing happy-path tests. You need to understand how your application behaves under stress or failure. Kubernetes’ liveness and readiness probes are crucial in this respect, but testing them without good tools can be tricky. That’s where kube-probesim comes in. Let me explain how you can leverage it and how easy it is to deploy, thanks to its availability in the GitHub Container Registry.
Why kube-probesim?
When I first started using Kubernetes in production, one of the challenges was ensuring applications were resilient enough to withstand failures—especially during scale-ups, unexpected traffic spikes, or external service downtimes. Probes (liveness and readiness) play a crucial role in ensuring the stability of the system, but testing these probes in realistic scenarios is difficult. kube-probesim simplifies this.
Here’s what it does:
- Probe Failures: Simulate random or time-triggered liveness/readiness probe failures.
- Latency Simulation: Introduce configurable network latencies.
- External Dependency Failure: Simulate downstream service failures.
- Network Partitioning: Test behavior under simulated network issues.
The goal is to empower developers and operators to test probe handling under controlled, configurable failure conditions without introducing a lot of complexity.
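To make the failure-rate setting concrete, here is a minimal sketch of the per-request decision a FAILURE_RATE of 20 implies. This is an illustration of the concept only, not kube-probesim’s actual implementation:

```shell
# Illustration only: FAILURE_RATE=20 means roughly 1 in 5 probe
# requests gets answered with a failure.
FAILURE_RATE=20
roll=$(awk 'BEGIN { srand(); print int(rand() * 100) }')  # uniform 0..99
if [ "$roll" -lt "$FAILURE_RATE" ]; then
  echo "probe result: 500"   # simulated failure
else
  echo "probe result: 200"   # healthy response
fi
```

Because the decision is independent per request, failures arrive in unpredictable bursts rather than on a fixed schedule, which is much closer to what real outages look like.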
Deploying kube-probesim from GitHub Container Registry
Since I’m hosting kube-probesim in the GitHub Container Registry, deploying it into any Kubernetes cluster is a breeze. No need to build the image yourself—just pull the container directly from the registry.
Here’s how you can do it in your own environment:
1. Add the kube-probesim Deployment

Create a Kubernetes deployment YAML with the GitHub-hosted container image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-probesim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-probesim
  template:
    metadata:
      labels:
        app: kube-probesim
    spec:
      containers:
        - name: kube-probesim
          image: ghcr.io/jhoelzel/kube-probesim:latest
          ports:
            - containerPort: 8080
          env:
            - name: FAILURE_RATE
              value: "20"
            - name: LATENCY
              value: "50"
            - name: LIVENESS_FAIL_AFTER_TIME
              value: "60"
            - name: READINESS_FAIL_AFTER_TIME
              value: "120"
```

This YAML uses the official kube-probesim container from GitHub’s Container Registry, configuring it with a 20% failure rate and 50 ms of added latency.
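To actually exercise the simulator, point the pod’s own probes at it. The snippet below shows the general shape inside the container spec; note that the /healthz and /readyz paths are assumptions made here for illustration, so check the kube-probesim README for the endpoints it really serves:

```yaml
# Probe wiring sketch for the kube-probesim container.
# NOTE: /healthz and /readyz are assumed paths, not confirmed endpoints.
livenessProbe:
  httpGet:
    path: /healthz        # assumed liveness endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz         # assumed readiness endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

With a 20% failure rate and failureThreshold: 3, the kubelet only restarts the container after three consecutive liveness failures—exactly the interaction you want to observe under simulation.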
2. Deploy it

Apply the YAML to your Kubernetes cluster:

```shell
kubectl apply -f kube-probesim-deployment.yaml
```

Kubernetes will pull the image directly from GitHub and start the kube-probesim pod with the parameters you’ve set.
Real-World Scenarios
Here’s how I’ve used kube-probesim in the wild:
1. Random Failures
To simulate a 10% failure rate across both probes, with a 100ms response delay:
```shell
FAILURE_RATE=10 LATENCY=100 ./kube-probesim
```
This allows me to test how Kubernetes reschedules pods when failures start happening unpredictably, helping ensure the system can handle such events without downtime.
2. Failing External Dependencies
Using the /dependency endpoint with FAIL_DEPENDENCY=true helped me test scenarios where downstream services were unavailable and tune timeouts and retries for critical service dependencies:
```shell
FAIL_DEPENDENCY=true ./kube-probesim
```
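The timeout-and-retry tuning this enables can be sketched as a simple retry loop. Everything below is illustrative: call_dependency is a hypothetical stand-in that fakes a downstream call recovering on its third attempt, where a real client would hit the simulator’s /dependency endpoint:

```shell
attempts=0
call_dependency() {
  # Hypothetical stand-in for an HTTP call to a flaky downstream service;
  # here it simply starts succeeding on the third attempt.
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

max_tries=5
tries=0
until call_dependency; do
  tries=$((tries + 1))
  if [ "$tries" -ge "$max_tries" ]; then
    echo "giving up after $tries retries"
    exit 1
  fi
  sleep 1   # fixed backoff; real clients often use exponential backoff
done
echo "dependency answered after $attempts attempts"
```

Against a live kube-probesim pod, you would replace call_dependency with a real request and watch how your retry budget interacts with the simulated outage.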
Observability and Monitoring
If you’re running kube-probesim in a production-like environment, don’t forget to integrate your monitoring and logging stack (e.g., Prometheus or Grafana). Observing probe failures over time will help you understand how Kubernetes reacts and will give you insights into system recovery times and automatic pod rescheduling.
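Even without metrics from the simulator itself, probe failures surface in standard cluster telemetry: failed liveness probes trigger container restarts, which kube-state-metrics (assuming it is installed in your cluster) exports for every container. A PromQL query along these lines tracks them:

```promql
# Restart rate of the simulator container over the last 5 minutes
rate(kube_pod_container_status_restarts_total{container="kube-probesim"}[5m])
```

Pairing this with `kubectl get events` on the pod gives you the kubelet’s own record of each probe failure and the resulting restarts.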
TLDR: A tool to simulate probe failure
The beauty of kube-probesim lies in its flexibility and how easy it is to deploy, thanks to its GitHub Container Registry image. Whether you need to test random failures, network issues, or external service dependencies, kube-probesim lets you simulate these failures in a safe, controlled environment. It’s a simple but powerful way to make your applications more resilient before they go live in production.
Next Steps:
- Get the image from the GitHub Container Registry.
- Deploy kube-probesim in your staging environment, and start testing!
Remember, production is unforgiving. kube-probesim gives you the edge by helping you identify and fix failure points before they cause issues for real users.