Self-Healing Kubernetes Platform

CI/CD Go Version Kubernetes Helm License: MIT

A production-grade platform that automatically detects and recovers from failures in Kubernetes, with no human intervention required.


The Problem

Every production system eventually faces the same 3am incident:

| Scenario | Traditional Response | This Platform |
|---|---|---|
| Pod crashes (OOM, bug, bad config) | Alert fires → engineer wakes up → restarts manually | Liveness probe detects failure → container restarts automatically |
| Bad deployment breaks users | Engineer manually runs rollback | Post-deploy health check triggers helm rollback |
| Pod stuck in CrashLoopBackOff | Engineer deletes pod | Custom controller detects restart count → deletes and reschedules |
| Traffic spike overwhelms service | Engineer manually scales up | HPA scales to 5 replicas at 70% CPU, no ticket needed |
| Node drained for maintenance | Pod evicted, service drops | PodDisruptionBudget blocks eviction until replacement is ready |
| New version has hidden bug | All users hit the bug | Canary routes 10% traffic first: rollback before full exposure |

The answer isn't more on-call rotations. It's automating the response.


Architecture

flowchart TB
    subgraph ci["⚙️ GitHub Actions CI/CD"]
        push[git push to main] --> build[Build Docker Image]
        build --> registry[Push to Docker Hub]
        registry --> deploy[helm upgrade]
        deploy --> check{Health Check}
        check -->|CrashLoopBackOff| rollback[helm rollback ↩]
        check -->|All pods healthy| live[✓ Live]
    end

    subgraph k8s["☸️ Kubernetes Cluster"]
        subgraph traffic["Traffic Layer"]
            ingress[Nginx Ingress] --> svc[Service: app-a]
            svc -->|90%| stable["Stable Pods\n● ● ● ● ● ● ● ● ●\ntrack=stable"]
            svc -->|10%| canary["Canary Pods\n●\ntrack=canary"]
        end

        subgraph selfheal["🔧 Self-Healing Layer"]
            liveness["Liveness Probe\n/health every 5s"] -->|fail 3×| restart[Restart Container]
            readiness["Readiness Probe\n/ready every 3s"] -->|fail 2×| remove[Remove from Service]
            ctrl["Custom Go Controller\nwatches all pods"] -->|restarts > 3| delete[Delete + Reschedule]
            hpa["HPA"] -->|CPU > 70%| scale[Scale 2 → 5 replicas]
            pdb["PodDisruptionBudget"] -->|node drain| block[Block until safe]
        end

        subgraph obs["📊 Observability (monitoring ns)"]
            prom["Prometheus :9090\nmetrics scraping"]
            grafana["Grafana :3000\ndashboards"]
            jaeger["Jaeger :16686\ndistributed traces"]
            alerts["AlertManager\nalert routing"]
            prom --> grafana
            prom --> alerts
        end
    end

    ci -->|deploys to| k8s
    stable -->|/metrics| prom
    stable -->|OTel spans| jaeger
    canary -->|/metrics| prom

How It Works

The platform has 7 layers of automated recovery, each handling a different failure class:

| Layer | Mechanism | Trigger | Recovery Action |
|---|---|---|---|
| 1 | Liveness Probe | /health fails 3× in 15s | Restart the container |
| 2 | Readiness Probe | /ready fails 2× in 6s | Stop sending traffic to the pod |
| 3 | Custom Go Controller | Any container restarts > 3× | Delete pod → K8s reschedules a fresh one |
| 4 | Helm Auto-Rollback | CrashLoopBackOff after deploy | helm rollback to last known-good revision |
| 5 | HPA | CPU > 70% sustained | Scale from 2 up to 5 replicas |
| 6 | PodDisruptionBudget | Node drain / voluntary eviction | Block until replacement pod is ready |
| 7 | Canary Rollback | Canary fails validation | Remove canary → 100% traffic back to stable |
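Layers 1 and 2 map directly onto the pod spec's probe fields. A minimal sketch using the timings from the table above (the chart's actual templates and values may differ, and the container port here is an assumption):

```yaml
# Probe settings implementing layers 1-2 (timings from the table above).
livenessProbe:
  httpGet:
    path: /health
    port: 8080          # assumed container port
  periodSeconds: 5      # probe every 5s
  failureThreshold: 3   # 3 failures (15s) -> restart container
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 3      # probe every 3s
  failureThreshold: 2   # 2 failures (6s) -> remove pod from Service endpoints
```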

Project Structure

self-healing-k8s-platform/
│
├── app/                          # Go microservice
│   ├── main.go                   # HTTP server: /, /health, /ready, /metrics, /crash
│   ├── main_test.go              # Unit tests (handlers, middleware, responseWriter)
│   ├── Dockerfile                # Multi-stage build → minimal alpine image
│   └── go.mod
│
├── controller/                   # Custom Kubernetes controller
│   ├── main.go                   # Watches pods, deletes those with restarts > 3
│   ├── reconciler_test.go        # Unit tests using fake k8s client
│   ├── deploy.yaml               # Deployment + ClusterRole RBAC
│   └── go.mod
│
├── helm-chart/                   # Kubernetes packaging
│   ├── Chart.yaml
│   ├── values.yaml               # All tuneable defaults
│   └── templates/
│       ├── deployment.yaml       # Pods with app + track labels for traffic splitting
│       ├── service.yaml          # Routes to stable AND canary pods (no track selector)
│       ├── hpa.yaml              # CPU-based autoscaler (disabled for canary)
│       └── pdb.yaml              # PodDisruptionBudget (disabled for canary)
│
├── canary/                       # Canary deployment workflows
│   ├── values-canary.yaml        # Helm overrides: 1 replica, no service, track=canary
│   ├── deploy-canary.sh          # Strategy A: replica-ratio traffic split
│   ├── promote-canary.sh         # Upgrade stable image, remove canary release
│   ├── rollback-canary.sh        # Remove canary → 100% stable
│   ├── nginx-deploy-canary.sh    # Strategy B: exact-weight Nginx ingress split
│   ├── nginx-rollback-canary.sh  # Remove Nginx canary ingress + service + release
│   ├── nginx-ingress-stable.yaml # Stable ingress manifest
│   ├── nginx-ingress-canary.yaml # Canary ingress (canary-weight annotation)
│   └── nginx-service-canary.yaml # Service selecting only track=canary pods
│
├── observability/
│   ├── prometheus-values.yaml    # kube-prometheus-stack Helm values
│   ├── jaeger-values.yaml        # Jaeger all-in-one Helm values
│   └── install.sh                # One-command: installs Prometheus + Grafana + Jaeger
│
├── self-healing/
│   ├── auto-rollback.sh          # Post-deploy health check + helm rollback
│   └── alertmanager-rules.yaml   # Alerts: CrashLoop, high error rate, slow p95
│
├── chaos-testing/
│   └── kill-pods.sh              # Randomly kills pods on interval to test resilience
│
└── .github/workflows/
    └── ci-cd.yaml                # Build → Push → Deploy → Health check → Rollback

Deployment Guide

Prerequisites

# macOS
brew install docker kubectl helm minikube go jq

# Verify
docker info
minikube version
helm version
kubectl version --client
go version

Step 1 β€” Start the cluster

minikube start --driver=docker --memory=4096 --cpus=2

kubectl get nodes
# NAME       STATUS   ROLES           AGE   VERSION
# minikube   Ready    control-plane   10s   v1.29.x

Step 2 β€” Build the app image

# Point the Docker CLI at Minikube's Docker daemon (no push to Docker Hub needed)
eval $(minikube docker-env)

docker build -t self-healing-app:latest ./app
docker images | grep self-healing-app

Step 3 β€” Deploy with Helm

helm install app-a ./helm-chart \
  --set image.repository=self-healing-app \
  --set image.tag=latest \
  --set image.pullPolicy=Never

# Watch pods come up (~20 seconds)
kubectl get pods -w
# NAME                    READY   STATUS    RESTARTS   AGE
# app-a-xxxxxxxxx-xxxxx   1/1     Running   0          15s
# app-a-xxxxxxxxx-xxxxx   1/1     Running   0          15s

Step 4 β€” Verify the app

kubectl port-forward svc/app-a 8080:80

curl http://localhost:8080/         # {"service":"app-a","status":"ok"}
curl http://localhost:8080/health   # {"status":"healthy"}
curl http://localhost:8080/ready    # {"status":"ready"}
curl http://localhost:8080/metrics  # Prometheus metrics

Step 5 β€” Install the observability stack

./observability/install.sh
# Installs: kube-prometheus-stack (Prometheus + Grafana + AlertManager) + Jaeger
# Takes ~3 minutes

kubectl get pods -n monitoring
# Wait until all pods show STATUS=Running
# Open 3 terminals:
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring   # Prometheus
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring      # Grafana (admin/admin123)
kubectl port-forward svc/jaeger-query 16686:16686 -n monitoring        # Jaeger

Step 6 β€” Deploy the custom controller

eval $(minikube docker-env)
docker build -t self-healing-controller:latest ./controller

kubectl apply -f controller/deploy.yaml
kubectl logs -f deployment/self-healing-controller
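The controller's core rule (delete any pod whose containers have restarted more than 3 times, while leaving system namespaces alone) can be sketched with simplified stand-in types. The real reconciler uses the Kubernetes API types and a controller-runtime watch, so treat this as an illustration of the decision logic only:

```go
package main

// ContainerStatus and Pod are simplified stand-ins for the
// Kubernetes API types the real controller receives.
type ContainerStatus struct {
	Name         string
	RestartCount int
}

type Pod struct {
	Namespace string
	Statuses  []ContainerStatus
}

// systemNamespaces the controller skips entirely.
var systemNamespaces = map[string]bool{
	"kube-system": true,
	"monitoring":  true,
}

// ShouldDelete reports whether any container has restarted more than
// threshold times. Deleting the pod lets Kubernetes reschedule a
// fresh replacement.
func ShouldDelete(p Pod, threshold int) bool {
	if systemNamespaces[p.Namespace] {
		return false
	}
	for _, cs := range p.Statuses {
		if cs.RestartCount > threshold {
			return true
		}
	}
	return false
}
```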

Step 7 β€” Enable CI/CD (optional)

Add these secrets to your GitHub repo (Settings → Secrets → Actions):

| Secret | Value |
|---|---|
| DOCKERHUB_USERNAME | your Docker Hub username |
| DOCKERHUB_TOKEN | Docker Hub access token |
| KUBECONFIG | output of `cat ~/.kube/config \| base64` |

Every push to main will then: build image → push → deploy → health check → auto-rollback if unhealthy.


Demo Scenarios

Scenario 1 β€” Pod crash and self-healing

# Terminal 1: watch pods
kubectl get pods -w

# Terminal 2: trigger crash
kubectl port-forward svc/app-a 8080:80
curl http://localhost:8080/crash

Watch terminal 1:

app-a-xxx   1/1   Running   0    → CrashLoopBackOff 1 → Running 2

K8s detects the /health probe failure and restarts the container automatically.


Scenario 2 β€” Bad deploy + auto-rollback

helm upgrade app-a ./helm-chart \
  --set image.repository=self-healing-app \
  --set image.tag=does-not-exist \
  --set image.pullPolicy=Never

kubectl get pods -w   # pods go into ErrImageNeverPull

./self-healing/auto-rollback.sh app-a 30

helm history app-a
# REVISION  STATUS      DESCRIPTION
# 1         superseded  Install complete
# 2         failed      Upgrade failed
# 3         deployed    Rollback to 1

Scenario 3 β€” Chaos testing (zero-downtime validation)

# Terminal 1: keep sending requests
watch -n 0.5 'curl -s http://localhost:8080/ | jq .service'

# Terminal 2: kill pods randomly every 10 seconds
./chaos-testing/kill-pods.sh 10 app-a

The watch window should never stop responding: K8s reschedules replacement pods before the service notices.


Scenario 4 β€” Canary deployment

# Tag a "v2" image
eval $(minikube docker-env)
docker tag self-healing-app:latest self-healing-app:v2

# Deploy to 10% of traffic
./canary/deploy-canary.sh v2

# Verify traffic split
kubectl get pods -l app=app-a -o wide
for i in $(seq 1 20); do curl -s http://localhost:8080/ | jq -r .service; done | sort | uniq -c
#   18 app-a         ← stable
#    2 app-a-canary  ← canary (~10%)

# Promote or roll back
./canary/promote-canary.sh v2      # v2 becomes stable
./canary/rollback-canary.sh        # remove canary, 100% back to v1

Running Tests

# App unit tests (no cluster needed)
cd app && go test ./... -v

# Controller unit tests (uses fake k8s client, no cluster needed)
cd controller && go test ./... -v

Tests cover: all HTTP handlers, Prometheus instrumentation middleware, responseWriter status capture, and all reconciler scenarios (skip namespaces, threshold boundary, multi-container pods, pod not found).


Key Technologies

| Technology | Role |
|---|---|
| Go 1.22 | Microservice + custom Kubernetes controller |
| Kubernetes 1.29 | Container orchestration, probes, HPA, PDB |
| Helm 3 | Deployment packaging, rollback management |
| Prometheus | Metrics collection (http_requests_total, http_request_duration_seconds) |
| Grafana | Real-time dashboards for cluster and app health |
| Jaeger + OpenTelemetry | Distributed tracing for every HTTP request |
| AlertManager | Alert routing for CrashLoop, error rate, latency |
| GitHub Actions | CI/CD pipeline with automated health checks |
| Nginx Ingress | Weight-based canary traffic splitting |
| controller-runtime | Custom operator framework |
