A production-grade platform that automatically detects and recovers from failures in Kubernetes, with no human intervention required.
Every production system eventually faces the same 3am incident:
| Scenario | Traditional Response | This Platform |
|---|---|---|
| Pod crashes (OOM, bug, bad config) | Alert fires → engineer wakes up → restarts manually | Liveness probe detects failure → container restarts automatically |
| Bad deployment breaks users | Engineer manually runs rollback | Post-deploy health check triggers helm rollback |
| Pod stuck in CrashLoopBackOff | Engineer deletes pod | Custom controller detects restart count → deletes and reschedules |
| Traffic spike overwhelms service | Engineer manually scales up | HPA scales to 5 replicas at 70% CPU, no ticket needed |
| Node drained for maintenance | Pod evicted, service drops | PodDisruptionBudget blocks eviction until replacement is ready |
| New version has hidden bug | All users hit the bug | Canary routes 10% traffic first → rollback before full exposure |
The answer isn't more on-call rotations. It's automating the response.
```mermaid
flowchart TB
    subgraph ci["GitHub Actions CI/CD"]
        push[git push to main] --> build[Build Docker Image]
        build --> registry[Push to Docker Hub]
        registry --> deploy[helm upgrade]
        deploy --> check{Health Check}
        check -->|CrashLoopBackOff| rollback[helm rollback]
        check -->|All pods healthy| live[Live]
    end
    subgraph k8s["Kubernetes Cluster"]
        subgraph traffic["Traffic Layer"]
            ingress[Nginx Ingress] --> svc["Service: app-a"]
            svc -->|90%| stable["Stable Pods ×9\ntrack=stable"]
            svc -->|10%| canary["Canary Pods ×1\ntrack=canary"]
        end
        subgraph selfheal["Self-Healing Layer"]
            liveness["Liveness Probe\n/health every 5s"] -->|fail 3×| restart[Restart Container]
            readiness["Readiness Probe\n/ready every 3s"] -->|fail 2×| remove[Remove from Service]
            ctrl["Custom Go Controller\nwatches all pods"] -->|restarts > 3| delete[Delete + Reschedule]
            hpa["HPA"] -->|CPU > 70%| scale["Scale 2 → 5 replicas"]
            pdb["PodDisruptionBudget"] -->|node drain| block[Block until safe]
        end
        subgraph obs["Observability (monitoring ns)"]
            prom["Prometheus :9090\nmetrics scraping"]
            grafana["Grafana :3000\ndashboards"]
            jaeger["Jaeger :16686\ndistributed traces"]
            alerts["AlertManager\nalert routing"]
            prom --> grafana
            prom --> alerts
        end
    end
    ci -->|deploys to| k8s
    stable -->|/metrics| prom
    stable -->|OTel spans| jaeger
    canary -->|/metrics| prom
```
The platform has 7 layers of automated recovery, each handling a different failure class:
| Layer | Mechanism | Trigger | Recovery Action |
|---|---|---|---|
| 1 | Liveness Probe | /health fails 3× in 15s | Restart the container |
| 2 | Readiness Probe | /ready fails 2× in 6s | Stop sending traffic to the pod |
| 3 | Custom Go Controller | Any container restarts > 3× | Delete pod → K8s reschedules fresh |
| 4 | Helm Auto-Rollback | CrashLoopBackOff after deploy | helm rollback to last known-good revision |
| 5 | HPA | CPU > 70% sustained | Scale from 2 up to 5 replicas |
| 6 | PodDisruptionBudget | Node drain / voluntary eviction | Block until replacement pod is ready |
| 7 | Canary Rollback | Bad canary validation | Remove canary → 100% traffic back to stable |
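Layer 3's decision rule is small enough to show inline. A minimal sketch of the restart-threshold check the controller applies to each pod (hypothetical helper names; the real reconciler in `controller/main.go` works against the Kubernetes API via controller-runtime):

```go
package main

import "fmt"

// containerStatus mirrors the fields of a pod's container status that
// the reconciler actually inspects.
type containerStatus struct {
	Name         string
	RestartCount int32
}

// shouldDelete reports whether a pod has crossed the restart threshold:
// any single container restarting more than maxRestarts times marks the
// whole pod for deletion, after which Kubernetes schedules a fresh one.
func shouldDelete(statuses []containerStatus, maxRestarts int32) bool {
	for _, s := range statuses {
		if s.RestartCount > maxRestarts {
			return true
		}
	}
	return false
}

func main() {
	pod := []containerStatus{
		{Name: "app", RestartCount: 4},
		{Name: "sidecar", RestartCount: 0},
	}
	fmt.Println(shouldDelete(pod, 3)) // true: "app" restarted more than 3 times
}
```

Note the strict `>` comparison: a pod at exactly 3 restarts is left alone, which is the threshold-boundary case the reconciler tests cover.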
```
self-healing-k8s-platform/
│
├── app/                          # Go microservice
│   ├── main.go                   # HTTP server: /, /health, /ready, /metrics, /crash
│   ├── main_test.go              # Unit tests (handlers, middleware, responseWriter)
│   ├── Dockerfile                # Multi-stage build → minimal alpine image
│   └── go.mod
│
├── controller/                   # Custom Kubernetes controller
│   ├── main.go                   # Watches pods, deletes those with restarts > 3
│   ├── reconciler_test.go        # Unit tests using fake k8s client
│   ├── deploy.yaml               # Deployment + ClusterRole RBAC
│   └── go.mod
│
├── helm-chart/                   # Kubernetes packaging
│   ├── Chart.yaml
│   ├── values.yaml               # All tuneable defaults
│   └── templates/
│       ├── deployment.yaml       # Pods with app + track labels for traffic splitting
│       ├── service.yaml          # Routes to stable AND canary pods (no track selector)
│       ├── hpa.yaml              # CPU-based autoscaler (disabled for canary)
│       └── pdb.yaml              # PodDisruptionBudget (disabled for canary)
│
├── canary/                       # Canary deployment workflows
│   ├── values-canary.yaml        # Helm overrides: 1 replica, no service, track=canary
│   ├── deploy-canary.sh          # Strategy A: replica-ratio traffic split
│   ├── promote-canary.sh         # Upgrade stable image, remove canary release
│   ├── rollback-canary.sh        # Remove canary → 100% stable
│   ├── nginx-deploy-canary.sh    # Strategy B: exact-weight Nginx ingress split
│   ├── nginx-rollback-canary.sh  # Remove Nginx canary ingress + service + release
│   ├── nginx-ingress-stable.yaml # Stable ingress manifest
│   ├── nginx-ingress-canary.yaml # Canary ingress (canary-weight annotation)
│   └── nginx-service-canary.yaml # Service selecting only track=canary pods
│
├── observability/
│   ├── prometheus-values.yaml    # kube-prometheus-stack Helm values
│   ├── jaeger-values.yaml        # Jaeger all-in-one Helm values
│   └── install.sh                # One-command: installs Prometheus + Grafana + Jaeger
│
├── self-healing/
│   ├── auto-rollback.sh          # Post-deploy health check + helm rollback
│   └── alertmanager-rules.yaml   # Alerts: CrashLoop, high error rate, slow p95
│
├── chaos-testing/
│   └── kill-pods.sh              # Randomly kills pods on interval to test resilience
│
└── .github/workflows/
    └── ci-cd.yaml                # Build → Push → Deploy → Health check → Rollback
```
```bash
# macOS
brew install docker kubectl helm minikube go jq

# Verify
docker info
minikube version
helm version
kubectl version --client
go version
```

```bash
minikube start --driver=docker --memory=4096 --cpus=2
kubectl get nodes
# NAME       STATUS   ROLES           AGE   VERSION
# minikube   Ready    control-plane   10s   v1.29.x
```

```bash
# Point Docker CLI at Minikube's internal registry (no push to Docker Hub needed)
eval $(minikube docker-env)
docker build -t self-healing-app:latest ./app
docker images | grep self-healing-app
```

```bash
helm install app-a ./helm-chart \
  --set image.repository=self-healing-app \
  --set image.tag=latest \
  --set image.pullPolicy=Never

# Watch pods come up (~20 seconds)
kubectl get pods -w
# NAME                    READY   STATUS    RESTARTS   AGE
# app-a-xxxxxxxxx-xxxxx   1/1     Running   0          15s
# app-a-xxxxxxxxx-xxxxx   1/1     Running   0          15s
```

```bash
kubectl port-forward svc/app-a 8080:80
curl http://localhost:8080/          # {"service":"app-a","status":"ok"}
curl http://localhost:8080/health    # {"status":"healthy"}
curl http://localhost:8080/ready     # {"status":"ready"}
curl http://localhost:8080/metrics   # Prometheus metrics
```

```bash
./observability/install.sh
# Installs: kube-prometheus-stack (Prometheus + Grafana + AlertManager) + Jaeger
# Takes ~3 minutes
kubectl get pods -n monitoring
# Wait until all pods show STATUS=Running
```

```bash
# Open 3 terminals:
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring   # Prometheus
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring      # Grafana (admin/admin123)
kubectl port-forward svc/jaeger-query 16686:16686 -n monitoring        # Jaeger
```

```bash
eval $(minikube docker-env)
docker build -t self-healing-controller:latest ./controller
kubectl apply -f controller/deploy.yaml
kubectl logs -f deployment/self-healing-controller
```

Add these secrets to your GitHub repo (Settings → Secrets → Actions):
| Secret | Value |
|---|---|
| `DOCKERHUB_USERNAME` | Your Docker Hub username |
| `DOCKERHUB_TOKEN` | Docker Hub access token |
| `KUBECONFIG` | Output of `cat ~/.kube/config \| base64` |

Every push to main will then: build image → push → deploy → health check → auto-rollback if unhealthy.
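The health-check gate at the end of that pipeline boils down to one question: after the deploy, did any pod land in a crash or image-pull failure state before the timeout? A sketch of that decision as a pure function (hypothetical helper; the real check lives in `self-healing/auto-rollback.sh` and reads pod states via kubectl):

```go
package main

import (
	"fmt"
	"strings"
)

// needsRollback scans the pod states observed after a deploy and reports
// whether the release should be rolled back. CrashLoopBackOff or any
// image-pull error counts as a failed deploy.
func needsRollback(podStates []string) bool {
	for _, state := range podStates {
		if state == "CrashLoopBackOff" ||
			state == "ImagePullBackOff" ||
			strings.HasPrefix(state, "ErrImage") {
			return true
		}
	}
	return false
}

func main() {
	after := []string{"Running", "CrashLoopBackOff"}
	fmt.Println(needsRollback(after)) // true: trigger `helm rollback`
}
```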
```bash
# Terminal 1: watch pods
kubectl get pods -w

# Terminal 2: trigger crash
kubectl port-forward svc/app-a 8080:80
curl http://localhost:8080/crash
```

Watch terminal 1:

```
app-a-xxx   1/1   Running   0   →   CrashLoopBackOff   1   →   Running   2
```
K8s detects the /health probe failure and restarts the container automatically.
```bash
helm upgrade app-a ./helm-chart \
  --set image.repository=self-healing-app \
  --set image.tag=does-not-exist \
  --set image.pullPolicy=Never

kubectl get pods -w   # pods go into ErrImageNeverPull

./self-healing/auto-rollback.sh app-a 30

helm history app-a
# REVISION   STATUS       DESCRIPTION
# 1          superseded   Install complete
# 2          failed       Upgrade failed
# 3          deployed     Rollback to 1
```

```bash
# Terminal 1: keep sending requests
watch -n 0.5 'curl -s http://localhost:8080/ | jq .service'

# Terminal 2: kill pods randomly every 10 seconds
./chaos-testing/kill-pods.sh 10 app-a
```

The watch window should never stop responding: K8s reschedules replacement pods before the service notices.
```bash
# Tag a "v2" image
eval $(minikube docker-env)
docker tag self-healing-app:latest self-healing-app:v2

# Deploy to 10% of traffic
./canary/deploy-canary.sh v2

# Verify traffic split
kubectl get pods -l app=app-a -o wide
for i in $(seq 1 20); do curl -s http://localhost:8080/ | jq -r .service; done | sort | uniq -c
#   18 app-a          ← stable
#    2 app-a-canary   ← canary (~10%)

# Promote or roll back
./canary/promote-canary.sh v2    # v2 becomes stable
./canary/rollback-canary.sh      # remove canary, 100% back to v1
```

```bash
# App unit tests (no cluster needed)
cd app && go test ./... -v

# Controller unit tests (uses fake k8s client, no cluster needed)
cd controller && go test ./... -v
```

Tests cover: all HTTP handlers, Prometheus instrumentation middleware, responseWriter status capture, and all reconciler scenarios (skip namespaces, threshold boundary, multi-container pods, pod not found).
| Technology | Role |
|---|---|
| Go 1.22 | Microservice + custom Kubernetes controller |
| Kubernetes 1.29 | Container orchestration, probes, HPA, PDB |
| Helm 3 | Deployment packaging, rollback management |
| Prometheus | Metrics collection (http_requests_total, http_request_duration_seconds) |
| Grafana | Real-time dashboards for cluster and app health |
| Jaeger + OpenTelemetry | Distributed tracing for every HTTP request |
| AlertManager | Alert routing for CrashLoop, error rate, latency |
| GitHub Actions | CI/CD pipeline with automated health checks |
| Nginx Ingress | Weight-based canary traffic splitting |
| controller-runtime | Custom operator framework |