# ADR-001: Adopt Azure Kubernetes Service for Container Orchestration

## Status
Accepted

## Context
Our platform team needs a container orchestration platform to host 20+ microservices. Current deployment uses Azure App Service with manual CI/CD, causing:
- 30min average deployment time
- No traffic splitting or canary deployments
- Per-service resource limits hard to manage
- No native Kubernetes ecosystem (Helm, Istio, Prometheus)

Options evaluated: AKS, App Service (stay), Service Fabric, Nomad.

### Evaluation Criteria
- Managed control plane (reduce ops burden)
- Azure-native integration (AAD, Key Vault, Policy)
- Ecosystem maturity (Helm, Operators, CNCF landscape)
- Cost predictability for 20-node workload (~£4k/month)

## Decision
**Adopt Azure Kubernetes Service** as the container orchestration platform.

### Architecture
- **Cluster topology:** Single regional cluster (West Europe), 3-system nodepool + 20-user nodepool
- **Node SKU:** Standard_D4s_v5 (4 vCPU, 16 GB) for user workloads
- **Networking:** Azure CNI with Cilium, bring-your-own VNet
- **Authentication:** AAD-integrated with RBAC, managed identity per workload
- **Add-ons:** Azure Policy (Gatekeeper), Prometheus + Grafana via Container Insights
- **GitOps:** Flux v2 for cluster state management

### Migration Strategy
1. Phase 1: Lift-and-shift stateless microservices (weeks 1-4)
2. Phase 2: Stateful workloads with CSI drivers + Azure Disk (weeks 5-8)
3. Phase 3: Service mesh (Istio) for mTLS and observability (weeks 9-12)

## Consequences

### Positive
- Native Kubernetes ecosystem — Helm charts, Operators, CNCF tooling
- Azure-managed control plane (no etcd or API server management)
- Horizontal pod autoscaling + cluster autoscaler reduce manual work
- Integration with Azure Policy for compliance enforcement
- Cost savings of ~40% vs App Service Premium for equivalent compute

### Negative
- Increased team learning curve (Kubernetes expertise required)
- More complex networking (CNI, network policies, service mesh)
- Control plane upgrades require careful planning (no automatic upgrades)
- Monitoring stack needs reconfiguration (migrate from App Insights to Prometheus)

## Compliance
- Azure Policy (built-in + custom) enforces pod security, allowed registries, and resource limits
- Cluster is scored against AKS Well-Architected Review checklist quarterly
- Flux v2 enforces desired state — drift is detected and remediated automatically

## Notes
- Supersedes ADR-000 (which recommended App Service)
- References: [AKS Well-Architected Framework](https://learn.microsoft.com/en-us/azure/well-architected/service-guides/azure-kubernetes-service)
- Related: ADR-004 (Container Registry strategy), ADR-007 (Ingress controller selection)

---  
**Date:** 2025-11-15  
**Reviewed by:** Platform Architecture Team
