Scaling SaaS 🚀: Managing 300+ K8s Environments
🧭 Setting the Scene
When I joined my team, one of the first things that struck me was the scale. We weren’t just managing a few clusters — we were supporting over 300 Kubernetes environments spread across 60+ enterprise customers. Each customer had their own isolated setup, and every environment needed to be reliable, secure, observable, and easily maintainable. Manual operations just wouldn’t cut it. The real challenge wasn’t Kubernetes itself — it was the coordination, consistency, and control at scale. This blog shares the journey of how we built an internal platform that automated the lifecycle of these environments — from provisioning and configuration to monitoring and scaling.
🛑 The Problem with Manual Ops
Initially, environment provisioning involved a lot of manual steps:
- Helm chart configurations
- Secrets and ConfigMap setup
- Ingress and networking configuration
- Handling environment-specific custom images on top of the core product base images

Even with some automation scripts in place, operational drift crept in, and tracking “what went wrong where” was tough.
A single customer onboarding could take anywhere from a few hours to a few days, depending on workload and human availability. Multiply that by hundreds of environments? 🔥 You get the picture.
🧠 The Platform We Built
We built an internal orchestration layer using Vert.x — a lightweight, reactive Java toolkit that fit our need for high concurrency and async processing. Its event-driven concurrency model and horizontal scalability were the two main reasons we chose it.
🧩 The Vert.x Microservice
This orchestration service acts as the control plane for all our environments. It:
- Listens to event triggers (customer onboard, upgrade, patch)
- Generates templated manifests and Helm overrides
- Applies them to target clusters using kubectl or ArgoCD APIs
- Validates the state post-deployment using health and readiness checks
- Pushes metrics to Prometheus-compatible endpoints for tracking
We designed the service to be idempotent, self-healing, and retry-safe. Every interaction a customer has with their environment goes through this microservice, via the APIs it exposes.
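To make the idempotent, retry-safe design concrete, here is a minimal sketch of the pattern in plain Java. The class and method names (`OrchestrationTask`, `applyWithRetry`) are illustrative, not our actual API, and a real implementation would persist the idempotency ledger and add exponential backoff:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of the idempotent, retry-safe execution pattern used by the
// orchestration service. Names are hypothetical, for illustration only.
class OrchestrationTask {
    // Idempotency ledger: (environment + operation) key -> completed result.
    // In production this would live in durable storage, not memory.
    private final Map<String, String> completed = new ConcurrentHashMap<>();

    String applyWithRetry(String idempotencyKey,
                          Supplier<String> operation,
                          int maxAttempts) {
        // Idempotency: if this exact operation already succeeded,
        // return the recorded result instead of re-applying it.
        String prior = completed.get(idempotencyKey);
        if (prior != null) return prior;

        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String result = operation.get();
                completed.put(idempotencyKey, result);
                return result;
            } catch (RuntimeException e) {
                last = e; // transient failure: retry up to maxAttempts
                // exponential backoff between attempts omitted in this sketch
            }
        }
        throw last;
    }
}
```

Because replays return the recorded result, event triggers can safely be delivered more than once — which is what makes the control plane retry-safe end to end.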
🚀 GitOps + ArgoCD = Consistency
To keep things declarative and traceable, we adopted GitOps using ArgoCD. For every customer, we maintain:
- A Git repo with Helm values and manifests
- A naming convention for environments and namespaces
Whenever a change is committed, ArgoCD syncs the target cluster and reconciles it back to the declared state.
📌 Tip: We used ApplicationSets in ArgoCD to dynamically generate apps per customer using a central config.
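For readers unfamiliar with ApplicationSets, a minimal sketch of the pattern looks like this — the repo URL and directory layout below are placeholders, not our actual config:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: customer-envs
spec:
  generators:
    # Git generator: one Application per customer directory
    - git:
        repoURL: https://git.example.com/platform/customer-config.git
        revision: main
        directories:
          - path: customers/*
  template:
    metadata:
      name: '{{path.basename}}-env'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/customer-config.git
        targetRevision: main
        path: '{{path}}'
        helm:
          valueFiles:
            - values.yaml
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding a customer then becomes adding a directory to the config repo — ArgoCD generates and syncs the new Application automatically.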
🧪 Observability Is a First-Class Citizen
Every microservice deployed is instrumented with Micrometer and exposes /actuator/prometheus metrics. Prometheus scrapes these, and we plug them into Grafana dashboards grouped per customer. We defined key RED metrics:
- Request rate
- Error rate
- Duration (latency percentiles)
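In production these numbers come from Micrometer counters and timers scraped via `/actuator/prometheus`; the stripped-down plain-Java sketch below (class name `RedMetrics` is made up) just shows what the three RED numbers are for one service:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stripped-down sketch of the three RED metrics tracked per service.
// Real instrumentation uses Micrometer; this only illustrates the math.
class RedMetrics {
    private long requests = 0;
    private long errors = 0;
    private final List<Long> latenciesMs = new ArrayList<>();

    void record(long latencyMs, boolean isError) {
        requests++;                 // Rate: requests over the window
        if (isError) errors++;      // Errors: failed requests
        latenciesMs.add(latencyMs); // Duration: raw latency samples
    }

    double errorRate() {
        return requests == 0 ? 0.0 : (double) errors / requests;
    }

    // p-th latency percentile (0 < p <= 100), nearest-rank method.
    long latencyPercentile(double p) {
        List<Long> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }
}
```

Grouping these per customer in Grafana is what lets us answer “which environment is unhealthy?” at a glance.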
🌀 Moving to Ambient Mesh
Initially, our workloads used Istio with sidecars for service mesh — enabling mTLS, telemetry, and traffic control. But sidecar injection at this scale:
- Consumed more CPU/memory
- Increased pod startup time
- Required a pod restart for upgrades, for adding a pod to the mesh, and similar changes
So, we migrated to Istio Ambient Mode — where a per-node ztunnel proxy handles Layer 4, and sidecars are gone entirely.
🔍 Result: ~10% infra savings on CPU/mem and simpler pod lifecycle
🧠 AI-Assisted Scaling
As a beta feature, we built a system that:
- Collected runtime metrics from Prometheus
- Trained a small neural network using Python (Keras + TensorFlow)
- Predicted high-load windows for a service based on metrics
- Pre-triggered HPA or VPA before the latency spike hit
It was a fun way to apply AI in real-time operations — and it actually worked in early trials.
🔁 Final Thoughts
Looking back, a few key principles helped us build this at scale:
- Automate every repeatable task
- Design for failure — retries, rollbacks, monitoring
- Keep everything declarative and Git-driven
- Treat your internal platform as a product
There’s still more to do — multi-region failover and cost-aware provisioning, for a start. But what we’ve built already makes it easy for anyone on our team to onboard a new customer — often with a single API call.
🙋‍♂️ Over to You
If you're working in DevOps/Platform or managing SaaS infra at scale, I’d love to hear how you're solving similar problems. Ping me on LinkedIn or mail me and let’s chat!