Scaling SaaS 🚀: Managing 300+ K8s Environments
🧭 Setting the Scene
When I joined my team, one of the first things that struck me was the scale. We weren’t just managing a few clusters — we were supporting over 300 Kubernetes environments spread across 60+ enterprise customers. Each customer had their own isolated setup, and every environment needed to be reliable, secure, observable, and easily maintainable. Manual operations just wouldn’t cut it. The real challenge wasn’t Kubernetes itself — it was the coordination, consistency, and control at scale. This blog shares the journey of how we built an internal platform that automated the lifecycle of these environments — from provisioning and configuration to monitoring and scaling.
🛑 The Problem with Manual Ops
Initially, environment provisioning involved a lot of manual steps:
- Helm chart configurations
- Secrets and ConfigMap setup
- Ingress and networking configuration
- Handling environment-specific custom images on top of the core product base images

Even with some automation scripts in place, operational drift crept in, and tracking “what went wrong where” was tough.
A single customer onboarding could take anywhere from a few hours to a few days, depending on workload and human availability. Multiply that by hundreds of environments? 🔥 You get the picture.
🧠 The Platform We Built
We built an internal orchestration layer using Vert.x — a lightweight, reactive Java toolkit that fit our need for high concurrency and async processing. Its event-driven concurrency model and horizontal scalability were the two main reasons we chose it.
🧩 The Vert.x Microservice
This orchestration service acts as the control plane for all our environments. It:
- Listens to event triggers (customer onboard, upgrade, patch)
- Generates templated manifests and Helm overrides
- Applies them to target clusters using kubectl or ArgoCD APIs
- Validates the state post-deployment using health and readiness checks
- Pushes metrics to Prometheus-compatible endpoints for tracking
We designed the service to be idempotent, self-healing, and retry-safe. Every interaction a customer has with their environment goes through this microservice, via the APIs it exposes.
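To make the idempotent, retry-safe design concrete, here is a minimal sketch of the pattern in plain Java. The class and method names (`OrchestrationTask`, `applyWithRetry`) are illustrative, not our actual API, and a real implementation would persist the idempotency ledger and add exponential backoff:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of the idempotent, retry-safe execution pattern used by the
// orchestration service. Names are hypothetical, for illustration only.
class OrchestrationTask {
    // Idempotency ledger: (environment + operation) key -> completed result.
    // In production this would live in durable storage, not memory.
    private final Map<String, String> completed = new ConcurrentHashMap<>();

    String applyWithRetry(String idempotencyKey,
                          Supplier<String> operation,
                          int maxAttempts) {
        // Idempotency: if this exact operation already succeeded,
        // return the recorded result instead of re-applying it.
        String prior = completed.get(idempotencyKey);
        if (prior != null) return prior;

        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String result = operation.get();
                completed.put(idempotencyKey, result);
                return result;
            } catch (RuntimeException e) {
                last = e; // transient failure: retry up to maxAttempts
                // exponential backoff between attempts omitted in this sketch
            }
        }
        throw last;
    }
}
```

Because replays return the recorded result, event triggers can safely be delivered more than once — which is what makes the control plane retry-safe end to end.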
🚀 GitOps + ArgoCD = Consistency
To keep things declarative and traceable, we adopted GitOps using ArgoCD. For every customer, we maintain:
- A Git repo with Helm values and manifests
- A naming convention for environments and namespaces
Whenever a change is committed, ArgoCD syncs the target cluster and reconciles it back to the declared state.
📌 Tip: We used ApplicationSets in ArgoCD to dynamically generate apps per customer using a central config.
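For readers unfamiliar with ApplicationSets, a minimal sketch of the pattern looks like this — the repo URL and directory layout below are placeholders, not our actual config:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: customer-envs
spec:
  generators:
    # Git generator: one Application per customer directory
    - git:
        repoURL: https://git.example.com/platform/customer-config.git
        revision: main
        directories:
          - path: customers/*
  template:
    metadata:
      name: '{{path.basename}}-env'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/customer-config.git
        targetRevision: main
        path: '{{path}}'
        helm:
          valueFiles:
            - values.yaml
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding a customer then becomes adding a directory to the config repo — ArgoCD generates and syncs the new Application automatically.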
🧪 Observability Is a First-Class Citizen
Every microservice deployed is instrumented with Micrometer and exposes /actuator/prometheus metrics. Prometheus scrapes these, and we plug them into Grafana dashboards grouped per customer. We defined key RED metrics:
- Request rate
- Error rate
- Duration (latency percentiles)
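In production these numbers come from Micrometer counters and timers scraped via `/actuator/prometheus`; the stripped-down plain-Java sketch below (class name `RedMetrics` is made up) just shows what the three RED numbers are for one service:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stripped-down sketch of the three RED metrics tracked per service.
// Real instrumentation uses Micrometer; this only illustrates the math.
class RedMetrics {
    private long requests = 0;
    private long errors = 0;
    private final List<Long> latenciesMs = new ArrayList<>();

    void record(long latencyMs, boolean isError) {
        requests++;                 // Rate: requests over the window
        if (isError) errors++;      // Errors: failed requests
        latenciesMs.add(latencyMs); // Duration: raw latency samples
    }

    double errorRate() {
        return requests == 0 ? 0.0 : (double) errors / requests;
    }

    // p-th latency percentile (0 < p <= 100), nearest-rank method.
    long latencyPercentile(double p) {
        List<Long> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }
}
```

Grouping these per customer in Grafana is what lets us answer “which environment is unhealthy?” at a glance.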
🌀 Moving to Ambient Mesh
Initially, our workloads used Istio with sidecars for service mesh — enabling mTLS, telemetry, and traffic control. But sidecar injection at this scale:
- Consumed more CPU/memory
- Increased pod startup time
- Required a pod restart for upgrades, for adding a pod to the mesh, and similar changes
So, we migrated to Istio Ambient Mode — where a per-node ztunnel proxy handles Layer 4, and sidecars are gone entirely.
🔍 Result: ~10% infra savings on CPU/mem and simpler pod lifecycle
🧠 AI-Assisted Scaling
As a beta feature, we built a system that:
- Collected runtime metrics from Prometheus
- Trained a small neural network using Python (Keras + TensorFlow)
- Predicted high-load windows for a service based on metrics
- Pre-triggered HPA or VPA before the latency spike hit
It was a fun way to apply AI in real-time operations — and it actually worked in early trials.
🔁 Final Thoughts
Looking back, a few key principles helped us build this at scale:
- Automate every repeatable task
- Design for failure — retries, rollbacks, monitoring
- Keep everything declarative and Git-driven
- Treat your internal platform as a product
There’s still more to do — multi-region failover and cost-aware provisioning, for a start. But what we’ve built already makes it easy for anyone on our team to onboard a new customer — often with a single API call.
🙋‍♂️ Over to You
If you're working in DevOps/Platform or managing SaaS infra at scale, I’d love to hear how you're solving similar problems. Ping me on LinkedIn or mail me and let’s chat!