Kubernetes in Production: Best Practices for 2026
A senior engineer's guide to running Kubernetes in production — cluster architecture, GitOps, security hardening, observability, and the failure modes teams encounter most.
TL;DR — Key Points
- Most Kubernetes production problems are configuration problems, not framework bugs — they are fixable with the right audit approach.
- Security hardening (Pod Security Standards, RBAC scoping, network policies) should be applied before the cluster hits production, not during an audit finding.
- GitOps via Argo CD with progressive delivery eliminates the majority of deployment-related production incidents.
- SLO-based observability with runbooks cuts mean time to resolution faster than any other single investment in platform reliability.
Kubernetes is the dominant container orchestration platform for a reason — it handles the operational complexity of running many services at scale better than any alternative. But the learning curve is steep, and the failure modes in production are different from what developers encounter in development environments.
Most Kubernetes problems that engineering teams hit in year two are not framework bugs. They are configuration choices made in year one that seemed reasonable at the time: request/limit ratios that cause eviction cascades, RBAC policies that are too permissive, node configurations that leave insufficient system overhead, and observability gaps that make incident diagnosis a guessing game.
Why Kubernetes misbehaves in ways that development environments never predict
The most common Kubernetes production problems Wolk Inc encounters are not caused by incorrect understanding of the API. They come from design decisions that work well at small scale but create fragility at production load: resource requests and limits set too close together (which causes cascading eviction under memory pressure), cluster autoscaler configurations that take too long to provision nodes during traffic spikes, and GitOps sync policies that apply configuration changes immediately without a canary period.
Security configurations are the second major gap. Default Kubernetes installations are permissive by design — network policies are not applied by default, pod security standards are not enforced, and service accounts have more permissions than necessary. Teams that have not explicitly hardened these settings are operating with a significantly larger attack surface than they realize, which becomes visible during SOC 2 or ISO 27001 assessments.
Observability is the third gap. Most teams install Prometheus and Grafana and consider observability "done." But default cluster metrics without application-level SLOs, structured logging without correlation IDs tying logs to traces, and alerting rules that fire too early or too late all contribute to incident response that takes three times longer than necessary. The difference between a team that resolves Kubernetes incidents in 20 minutes and a team that takes two hours is almost always observability quality, not engineering skill.
The final category is upgrade and change management. Kubernetes releases a new minor version roughly every four months, and AWS EKS, Azure AKS, and Google GKE each have their own support windows and deprecation timelines. Teams without a disciplined upgrade cadence find themselves running unsupported versions, blocked from using new features, and facing painful catch-up upgrades through multiple skipped minor versions.
A Production Kubernetes Readiness Framework
The following five areas cover the most common gaps Wolk Inc finds when auditing production Kubernetes clusters. Each area has specific, actionable checks that can be completed in a day of engineering effort.
Cluster Architecture: Multi-AZ, Node Groups, and Karpenter
Production clusters must span multiple availability zones with nodes distributed evenly across AZs. Node groups should be separated by workload class — system workloads on dedicated node groups, application workloads on separate groups, with spot instances only for stateless, interruption-tolerant workloads. Karpenter (AWS) or Cluster Autoscaler with overprovisioning is recommended over basic autoscaling for faster response to traffic spikes. Separate node groups per team or criticality tier improves blast radius containment when a workload behaves unexpectedly.
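A minimal sketch of the AZ-spread and workload-separation guidance above, expressed as a Deployment spec. The workload name, image, and the `node-group: application` label are assumptions — substitute whatever labels your node groups actually carry:

```yaml
# Sketch: spread an application's pods evenly across availability zones
# and pin it to the application node group, away from system workloads.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                                    # hypothetical workload
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                           # at most one pod of imbalance between AZs
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule     # hard requirement, not best-effort
          labelSelector:
            matchLabels:
              app: api
      nodeSelector:
        node-group: application                # assumes node groups are labeled this way
      containers:
        - name: api
          image: registry.example.com/api:1.0.0
```

With `whenUnsatisfiable: DoNotSchedule`, a pod stays Pending rather than collapsing onto one AZ; use `ScheduleAnyway` if availability during node shortages matters more than even spread.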
Resource Requests, Limits, and HPA Configuration
Set resource requests based on actual observed usage from Prometheus (not guesses). Set CPU limits 2–3× above requests (throttling is preferable to eviction) and memory limits at 1.1–1.2× requests (exceeding a memory limit causes an OOMKill). Horizontal Pod Autoscaler utilization targets are measured against requests, so set them at 60–70%, not 80–90%, to leave headroom before scaling completes. KEDA is recommended over plain HPA for event-driven workloads. VPA (Vertical Pod Autoscaler) in recommendation mode is valuable for right-sizing request values over time.
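These ratios can be sketched as a container resources block paired with an `autoscaling/v2` HPA. The concrete numbers are illustrative placeholders — derive your own requests from observed Prometheus usage:

```yaml
# Fragment of a Deployment's container spec: requests from observed usage,
# CPU limit 2x request, memory limit ~1.2x request.
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: "1"          # 2x request: CPU overrun throttles instead of killing
    memory: 1200Mi    # ~1.2x request: small headroom before OOMKill
---
# HPA targeting 65% of *requested* CPU, leaving headroom before scaling lands.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                  # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65   # percent of requested CPU, not of the limit
```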
Security Hardening: Pod Security, RBAC, and Network Policies
Apply Pod Security Standards at the restricted level for all application workloads. Define network policies for every namespace — deny-all by default, allow only explicitly required ingress and egress. Scope service account permissions to the minimum required by each workload using dedicated service accounts per deployment. Use OPA/Gatekeeper or Kyverno admission controllers to enforce organizational policies (required labels, image registry restrictions, prohibited capabilities) at admission time rather than discovering violations in audits.
GitOps with Argo CD: Sync Policies and Progressive Delivery
Implement GitOps via Argo CD with automated sync enabled only for non-production environments. Production changes should require a manual sync gate or automated progressive delivery via Argo Rollouts with a defined canary analysis step. Use Application Sets for managing many applications consistently. Enable automated pruning and self-healing with diff alerts — so you know when cluster state drifts from Git state before it causes an incident. Keep Argo CD itself as a separately managed application with its own GitOps repository.
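A sketch of what the non-production side of this policy looks like as an Argo CD Application. The repository URL, path, and names are placeholders, not a prescription:

```yaml
# Non-production Application: automated sync with pruning and self-healing.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-staging                            # hypothetical application
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config.git   # placeholder repo
    targetRevision: main
    path: apps/api/overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: api-staging
  syncPolicy:
    automated:
      prune: true        # delete cluster resources removed from Git
      selfHeal: true     # revert manual drift back to Git state
    syncOptions:
      - CreateNamespace=true
```

For the production counterpart, drop the `automated` block so syncs require a manual gate, or hand the rollout to Argo Rollouts with a canary analysis step as described above.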
Observability: SLOs, Dashboards, and Runbooks
Define SLOs for each production service (availability, latency p50/p95/p99) and implement Prometheus recording rules and Grafana dashboards that show SLI-to-SLO burn rate. Use structured JSON logging with a correlation ID field propagated through all service calls. Implement distributed tracing with OpenTelemetry and Tempo or Jaeger. Write runbooks for the five most common failure modes (pod crash loops, OOMKill, node not-ready, autoscaler lag, and network policy blocking) so on-call engineers can diagnose and resolve without reconstructing context under pressure.
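The SLI-to-SLO burn rate mentioned above can be sketched as Prometheus recording rules. This assumes a 99.9% availability SLO and a `http_requests_total` counter with `service` and `code` labels — adjust both to your actual metrics:

```yaml
# Recording rules: 5m error ratio SLI, then burn rate against a 99.9% SLO.
# Rules in a group evaluate in order, so the second rule can use the first.
groups:
  - name: api-slo
    rules:
      - record: sli:request_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{service="api", code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api"}[5m]))
      - record: slo:error_budget_burn_rate:rate5m
        # Burn rate 1 = budget exhausted exactly at the end of the SLO window;
        # alert on sustained values well above 1 (e.g. 14.4 for fast burn).
        expr: |
          sli:request_error_ratio:rate5m / (1 - 0.999)
```

Pairing a fast-burn alert (short window, high threshold) with a slow-burn alert (long window, low threshold) is what keeps rules from firing too early or too late.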
These five areas cover the most impactful Kubernetes production readiness gaps. Teams that address all five consistently find that production incidents decrease in frequency, are diagnosed faster when they do occur, and are resolved without the cascading failures that happen when one gap compounds another.
Production Kubernetes Transformation: A SaaS Platform Example
A mid-market SaaS company running EKS came to Wolk Inc after a series of production incidents: pod evictions during the monthly reporting run, a security audit that found overpermissive RBAC, and incident response times that averaged 90 minutes for P1 issues.
Wolk Inc conducted a two-week audit, implemented the five-area framework, and migrated their deployment pipelines from manual kubectl applies to Argo CD GitOps. Results after 60 days: zero pod eviction incidents in the next quarterly reporting cycle, the security audit finding was cleared, and mean time to resolution dropped from 90 minutes to 18 minutes.
The most impactful single change was implementing structured logging with correlation IDs — which made it possible to trace a user-facing error to its root cause in seconds rather than minutes of log searching across multiple services.
Actionable takeaways
- Most Kubernetes production problems are configuration problems, not framework bugs — they are fixable with the right audit approach.
- Security hardening (Pod Security Standards, RBAC scoping, network policies) should be applied before the cluster hits production, not during an audit finding.
- GitOps via Argo CD with progressive delivery eliminates the majority of deployment-related production incidents.
- SLO-based observability with runbooks cuts mean time to resolution faster than any other single investment in platform reliability.
- Kubernetes upgrade discipline — staying within one minor version of the latest supported release — prevents the painful catch-up scenarios that force risky multi-version jumps.
Marcus Reid
Lead DevOps Engineer · Wolk Inc
Eight years building platform reliability programs, CI/CD pipelines, and cloud infrastructure for North American enterprises.
Running Kubernetes in production and want a cluster audit?
Wolk Inc conducts production Kubernetes audits covering the five areas in this guide — cluster architecture, resource configuration, security hardening, GitOps, and observability. Most audits surface 3–5 high-priority findings. Book a 30-minute call to discuss what a cluster audit engagement looks like.
Wolk Inc is a 2021-founded senior-engineer-only DevOps, Cloud, AI and Cybersecurity consulting firm serving US and Canadian enterprises.