
DevOps & Cloud Interview Questions: 55+ Questions with Answers for 2026

Interview Whisper Team
December 3, 2025

You've set up CI/CD pipelines, deployed applications to Kubernetes, and managed infrastructure with Terraform.

But when the interviewer asks "How would you design a zero-downtime deployment strategy?" or "Explain the CAP theorem and its implications for distributed systems" β€” you need more than just hands-on experience.

This guide gives you 55+ real DevOps and cloud interview questions asked at Amazon, Google, Microsoft, and top tech companies β€” with expert answers that demonstrate both technical depth and operational wisdom.

DevOps engineer preparing for technical interview

What DevOps Interviews Actually Test

DevOps and cloud interviews evaluate multiple dimensions:

  • Technical depth: Understanding of infrastructure, networking, security
  • System design: Architecting reliable, scalable systems
  • Troubleshooting: Debugging production issues under pressure
  • Automation mindset: Everything as code, repeatability
  • Operational excellence: Monitoring, incident response, reliability

Companies want engineers who can build AND operate systems at scale.


CI/CD & Automation Questions

1. Explain CI/CD and its benefits

Answer:

Continuous Integration (CI):

  • Developers merge code frequently (daily or more)
  • Automated builds and tests run on every merge
  • Catches integration issues early

Continuous Delivery (CD):

  • Code is always in a deployable state
  • Deployment to staging is automated
  • Production deployment requires manual approval

Continuous Deployment:

  • Every change that passes tests automatically deploys to production
  • No manual intervention

Benefits:

  • Faster feedback: Know within minutes if code breaks something
  • Reduced risk: Small, frequent changes are easier to debug
  • Higher quality: Automated tests catch regressions
  • Developer productivity: Less time on manual deployments

Key metrics:

  • Deployment frequency
  • Lead time for changes
  • Change failure rate
  • Mean time to recovery (MTTR)

2. What's the difference between blue-green and canary deployments?

Answer:

Blue-Green Deployment:

                    Load Balancer
                    /           \
                Blue (v1)    Green (v2)
                [active]      [idle]
  • Two identical production environments
  • Deploy new version to idle environment
  • Switch traffic instantly via load balancer
  • Easy rollback (switch back)

Pros: Instant rollback, full testing in production before cutover
Cons: Double infrastructure cost, complex database migrations


Canary Deployment:

                    Load Balancer
                    /           \
                v1 (95%)      v2 (5%)
                                 ↓
                    gradually shift traffic to v2
  • Deploy to small subset of servers/users first
  • Gradually increase traffic to new version
  • Monitor for errors, rollback if issues

Pros: Lower risk, validation with real users
Cons: Slower rollout, more complex routing


When to use:

  • Blue-green: When you need instant rollback capability
  • Canary: When you want to validate with real traffic gradually
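
A minimal canary sketch in Kubernetes, assuming the ingress-nginx controller and an existing primary Ingress for the stable version; the hostname and the my-app-canary Service (pointing at the v2 pods) are illustrative:

# Canary Ingress (ingress-nginx): route ~10% of traffic to v2
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # start small, raise gradually
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-canary       # Service selecting the v2 pods
            port:
              number: 80

Raising canary-weight step by step (10 β†’ 25 β†’ 50 β†’ 100) while watching error rates gives the gradual rollout described above.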

3. How would you design a CI/CD pipeline for a microservices architecture?

Answer:

Key considerations:

  • Each service has its own pipeline
  • Shared infrastructure components
  • Service dependencies during testing

Pipeline stages:

Code Push β†’ Build β†’ Unit Test β†’ Security Scan β†’
Build Image β†’ Push to Registry β†’ Deploy to Dev β†’
Integration Tests β†’ Deploy to Staging β†’ E2E Tests β†’
Deploy to Production (Canary) β†’ Full Rollout

Per-service pipeline:

# Example: GitHub Actions
name: service-a-pipeline
on:
  push:
    paths:
      - 'services/service-a/**'

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run tests
        run: npm test

      - name: Build Docker image
        run: docker build -t registry/service-a:${{ github.sha }} services/service-a

      - name: Push to registry
        run: docker push registry/service-a:${{ github.sha }}

      - name: Deploy to K8s
        run: kubectl set image deployment/service-a ...

Cross-cutting concerns:

  • Shared libraries: Separate pipeline, version properly
  • Database migrations: Run before deployment
  • Contract testing: Verify service compatibility
  • Environment parity: Dev/staging mirror production

Container orchestration with Kubernetes


Container & Kubernetes Questions

4. Explain the difference between Docker containers and VMs

Answer:

Aspect      | Virtual Machine             | Container
Isolation   | Hardware-level (hypervisor) | OS-level (shared kernel)
Size        | GBs (includes full OS)      | MBs (shares host kernel)
Startup     | Minutes                     | Seconds
Overhead    | High (full OS per VM)       | Low (shared kernel)
Portability | Limited                     | High (no more "works on my machine")
Security    | Stronger isolation          | Weaker (shared kernel)

Architecture:

VM:                          Container:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   App  β”‚  App   β”‚         β”‚   App  β”‚  App   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€         β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚Guest OSβ”‚Guest OSβ”‚         β”‚Container Runtimeβ”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€         β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   Hypervisor    β”‚         β”‚    Host OS      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€         β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   Hardware      β”‚         β”‚   Hardware      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

When to use VMs:

  • Strong security isolation needed
  • Different OS requirements
  • Legacy applications
  • Multi-tenant environments

When to use containers:

  • Microservices
  • Fast scaling requirements
  • Development/testing environments
  • Cloud-native applications
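
As a small illustration of why containers suit development and testing, here is a hedged Docker Compose sketch (service names, image tags, and ports are illustrative); the whole stack starts in seconds on any host with a container runtime:

# docker-compose.yml
services:
  app:
    image: my-app:v1              # hypothetical application image
    ports:
      - "8080:8080"
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example  # demo value only

Run docker compose up and the same definition behaves identically on a laptop or a CI runner, which is the portability benefit in the table above.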

5. Explain Kubernetes architecture

Answer:

Control Plane (Master):

  • API Server: Entry point for all REST commands
  • etcd: Distributed key-value store for cluster state
  • Scheduler: Assigns pods to nodes based on resources
  • Controller Manager: Runs control loops (ReplicaSet, Deployment, etc.)

Worker Nodes:

  • kubelet: Agent ensuring containers run in pods
  • kube-proxy: Network proxy implementing service abstraction
  • Container runtime: Docker, containerd, CRI-O

Key objects:

# Pod: Smallest deployable unit
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:v1

# Deployment: Manages ReplicaSets, enables rollouts
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    # Pod template

# Service: Stable network endpoint for pods
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080

6. How does Kubernetes handle service discovery and load balancing?

Answer:

Service types:

  1. ClusterIP (default):

    • Internal IP accessible within cluster
    • kube-proxy routes traffic to pods
  2. NodePort:

    • Exposes service on each node's IP
    • Port range: 30000-32767
  3. LoadBalancer:

    • Provisions cloud load balancer
    • External traffic routes to NodePort
  4. ExternalName:

    • Maps to external DNS name
    • No proxying, just DNS CNAME

Service discovery:

# DNS-based (preferred)
# Format: <service>.<namespace>.svc.cluster.local
curl http://my-service.default.svc.cluster.local

# Environment variables (legacy)
MY_SERVICE_HOST=10.0.0.1
MY_SERVICE_PORT=80

Load balancing:

  • kube-proxy uses iptables rules or IPVS
  • Round-robin by default
  • SessionAffinity for sticky sessions
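
If you need the sticky sessions mentioned above, a Service can pin each client to one pod. A minimal sketch (names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-app-sticky
spec:
  selector:
    app: my-app
  sessionAffinity: ClientIP        # route a given client IP to the same pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800        # stickiness window (3 hours)
  ports:
  - port: 80
    targetPort: 8080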

7. What are Kubernetes probes and when do you use each?

Answer:

Liveness Probe:

  • Purpose: Is the container alive?
  • Action on failure: Restart container
  • Use case: Detect deadlocks, hung processes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10

Readiness Probe:

  • Purpose: Is the container ready to receive traffic?
  • Action on failure: Remove from service endpoints
  • Use case: Warm-up time, dependency checks
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Startup Probe:

  • Purpose: Has the container started successfully?
  • Action on failure: Restart container
  • Use case: Slow-starting applications
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

Best practices:

  • Liveness: Check core functionality, not dependencies
  • Readiness: Check dependencies, database connections
  • Set appropriate timeouts and thresholds
  • Don't make probes too expensive

8. Explain Kubernetes resource management (requests and limits)

Answer:

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"      # 0.25 CPU cores
  limits:
    memory: "512Mi"
    cpu: "500m"

Requests:

  • Minimum resources guaranteed
  • Used by scheduler to place pods
  • Node must have this available

Limits:

  • Maximum resources allowed
  • CPU: Throttled if exceeded
  • Memory: OOMKilled if exceeded

QoS Classes:

Class      | Condition                             | Priority
Guaranteed | requests = limits for all containers  | Highest
Burstable  | requests < limits                     | Medium
BestEffort | No requests or limits                 | Lowest (evicted first)

Best practices:

  • Always set requests (for scheduling)
  • Set limits to prevent noisy neighbors
  • Monitor actual usage, adjust accordingly
  • Use LimitRanges for defaults
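
A LimitRange sketch for the namespace defaults mentioned above, so containers that omit requests or limits still get sensible values (the numbers and namespace are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-namespace
spec:
  limits:
  - type: Container
    defaultRequest:          # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                 # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi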

Cloud & Infrastructure Questions

9. Explain the shared responsibility model in cloud

Answer:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    CUSTOMER RESPONSIBILITY                   β”‚
β”‚  - Data encryption & integrity                              β”‚
β”‚  - IAM (Identity and Access Management)                     β”‚
β”‚  - Application security                                      β”‚
β”‚  - Network configuration (security groups, NACLs)           β”‚
β”‚  - OS patching (EC2), runtime patching (Lambda: partial)    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                      SHARED                                  β”‚
β”‚  - Patch management (varies by service)                     β”‚
β”‚  - Configuration management                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                    CLOUD PROVIDER                            β”‚
β”‚  - Physical security                                         β”‚
β”‚  - Hardware maintenance                                      β”‚
β”‚  - Network infrastructure                                    β”‚
β”‚  - Hypervisor security                                       β”‚
β”‚  - Managed service security (RDS engine, Lambda runtime)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Varies by service type:

  • IaaS (EC2): Customer manages OS and up
  • PaaS (Elastic Beanstalk): Customer manages application
  • SaaS (S3): Customer manages data and access

Cloud infrastructure and networking


10. How would you design a highly available architecture on AWS?

Answer:

Multi-AZ design:

                          Route 53 (DNS)
                              β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   CloudFront      β”‚
                    β”‚   (CDN/WAF)       β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚        Application            β”‚
              β”‚        Load Balancer          β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚                   β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚    AZ-1 (a)       β”‚   β”‚    AZ-2 (b)       β”‚
        β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚   β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
        β”‚ β”‚ Auto Scaling  β”‚ β”‚   β”‚ β”‚ Auto Scaling  β”‚ β”‚
        β”‚ β”‚ Group (EC2)   β”‚ β”‚   β”‚ β”‚ Group (EC2)   β”‚ β”‚
        β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚   β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
        β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚   β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
        β”‚ β”‚ RDS Primary   │◄┼───┼─│ RDS Standby   β”‚ β”‚
        β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚   β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
        β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚   β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
        β”‚ β”‚ ElastiCache   │◄┼───┼─│ ElastiCache   β”‚ β”‚
        β”‚ β”‚ (Redis)       β”‚ β”‚   β”‚ β”‚ Replica       β”‚ β”‚
        β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚   β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key components:

  1. DNS: Route 53 with health checks and failover
  2. CDN: CloudFront for static content, edge caching
  3. Load Balancing: ALB distributing across AZs
  4. Compute: Auto Scaling Groups spanning AZs
  5. Database: RDS Multi-AZ for automatic failover
  6. Caching: ElastiCache with replication
  7. Storage: S3 (11 9s durability, automatic replication)

Recovery objectives:

  • RTO (Recovery Time Objective): How fast to recover
  • RPO (Recovery Point Objective): How much data loss acceptable
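
As a hedged CloudFormation sketch of the database tier above, MultiAZ: true is what provides automatic failover to a standby in another AZ (the instance size, credentials reference, and security group are illustrative):

Resources:
  AppDatabase:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: postgres
      DBInstanceClass: db.t3.medium
      AllocatedStorage: 100
      MultiAZ: true                    # synchronous standby in a second AZ
      MasterUsername: appadmin
      MasterUserPassword: '{{resolve:secretsmanager:app/db:SecretString:password}}'
      VPCSecurityGroups:
        - !Ref DatabaseSecurityGroup   # assumed to be defined elsewhere in the template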

11. Explain VPC, subnets, security groups, and NACLs

Answer:

VPC (Virtual Private Cloud):

  • Isolated network in the cloud
  • Define IP range (CIDR block)
  • Contains subnets, route tables, gateways

Subnets:

  • Subdivisions of VPC
  • Public subnet: Route to Internet Gateway
  • Private subnet: No direct internet access
VPC: 10.0.0.0/16
β”œβ”€β”€ Public Subnet: 10.0.1.0/24 (AZ-a)
β”‚   └── Route: 0.0.0.0/0 β†’ Internet Gateway
β”œβ”€β”€ Public Subnet: 10.0.2.0/24 (AZ-b)
β”œβ”€β”€ Private Subnet: 10.0.10.0/24 (AZ-a)
β”‚   └── Route: 0.0.0.0/0 β†’ NAT Gateway
└── Private Subnet: 10.0.20.0/24 (AZ-b)

Security Groups vs NACLs:

Aspect     | Security Group   | NACL
Level      | Instance/ENI     | Subnet
State      | Stateful         | Stateless
Rules      | Allow only       | Allow & Deny
Evaluation | All rules        | Ordered by rule number
Default    | Deny all inbound | Allow all

Security Group example:

Inbound:
- Port 443 from 0.0.0.0/0 (HTTPS)
- Port 22 from 10.0.0.0/8 (SSH from VPC)

Outbound:
- All traffic (stateful, responses automatic)
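
The same security group as a hedged CloudFormation sketch (the VPC reference is assumed to exist elsewhere in the template):

WebSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow HTTPS from anywhere and SSH from the VPC
    VpcId: !Ref AppVpc                 # assumed VPC resource
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 443
        ToPort: 443
        CidrIp: 0.0.0.0/0
      - IpProtocol: tcp
        FromPort: 22
        ToPort: 22
        CidrIp: 10.0.0.0/8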

12. Compare IaC tools: Terraform, CloudFormation, Pulumi

Answer:

Aspect          | Terraform              | CloudFormation          | Pulumi
Provider        | Multi-cloud            | AWS only                | Multi-cloud
Language        | HCL (declarative)      | YAML/JSON               | Python, TS, Go
State           | Self-managed or remote | AWS managed             | Self-managed or cloud
Modularity      | Modules, registry      | Nested stacks           | Standard packages
Learning curve  | Medium                 | Low (AWS users)         | Low (if you know the language)
Drift detection | terraform plan         | Drift detection feature | pulumi preview

Terraform example:

resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.micro"

  tags = {
    Name = "web-server"
  }
}

resource "aws_security_group" "web" {
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

When to use:

  • Terraform: Multi-cloud, mature ecosystem, large community
  • CloudFormation: AWS-native, integrated with AWS services
  • Pulumi: When team prefers real programming languages

Monitoring & Observability Questions

13. Explain the three pillars of observability

Answer:

1. Metrics:

  • Numerical data over time
  • Examples: CPU usage, request count, latency percentiles
  • Tools: Prometheus, CloudWatch, Datadog
# Prometheus metrics
http_requests_total{method="GET", status="200"} 1234
http_request_duration_seconds{quantile="0.99"} 0.5

2. Logs:

  • Discrete events with context
  • Examples: Error messages, audit trails, debug info
  • Tools: ELK Stack, Splunk, CloudWatch Logs
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "user_id": "user_456"
}

3. Traces:

  • Request flow across services
  • Shows latency breakdown, dependencies
  • Tools: Jaeger, Zipkin, AWS X-Ray
Request β†’ API Gateway (5ms) β†’ Auth Service (10ms) β†’
User Service (15ms) β†’ Database (50ms)
Total: 80ms

Why all three:

  • Metrics: "Something is wrong" (high error rate)
  • Logs: "What went wrong" (error details)
  • Traces: "Where it went wrong" (which service)

14. How would you set up alerting for a production system?

Answer:

Alert categories:

  1. Symptoms (user impact): Prefer these

    • Error rate > 1%
    • P99 latency > 500ms
    • Availability < 99.9%
  2. Causes (system health): Use sparingly

    • CPU > 80%
    • Memory > 90%
    • Disk > 85%

Alert design principles:

# Good alert
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01
  for: 5m  # Avoid flapping
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1%"
    runbook: "https://wiki/runbooks/high-error-rate"

Best practices:

  • Alert on symptoms, not causes
  • Use for duration to avoid flapping
  • Include runbook links
  • Set appropriate severity levels
  • Route to right team (PagerDuty routing)

Anti-patterns:

  • Alerting on every metric
  • No runbooks
  • Alert fatigue (too many non-actionable alerts)
  • Missing severity levels

15. Explain SLIs, SLOs, and SLAs

Answer:

SLI (Service Level Indicator):

  • Quantitative measure of service
  • Examples: availability, latency, error rate
Availability SLI = (successful_requests / total_requests) * 100
Latency SLI = p99_latency_ms

SLO (Service Level Objective):

  • Target value for SLI
  • Internal goal, drives engineering decisions
SLO: 99.9% availability (allows 8.76 hours downtime/year)
SLO: P99 latency < 200ms

SLA (Service Level Agreement):

  • Contract with customers
  • Includes consequences (refunds) for missing targets
  • Usually less aggressive than SLO (buffer)
SLA: 99.5% availability
     If missed: 10% service credit

Error budget:

Error budget = 100% - SLO
             = 100% - 99.9%
             = 0.1% of requests can fail

Monthly error budget = 0.1% * 43,200 min = 43.2 minutes downtime

When error budget is exhausted:

  • Freeze feature releases
  • Focus on reliability
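
A hedged Prometheus sketch of the availability SLI as a recording rule, assuming requests are counted in http_requests_total with a status label:

groups:
- name: slo-rules
  rules:
  - record: sli:availability:ratio_5m
    expr: |
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
  - alert: AvailabilitySLOBreach
    expr: sli:availability:ratio_5m < 0.999   # below the 99.9% SLO target
    for: 10m
    labels:
      severity: critical

In practice teams usually alert on error-budget burn rate over multiple windows rather than a raw 5-minute ratio, but the recorded SLI is the building block either way.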

Site reliability engineering and incident management


Security & Networking Questions

16. How do you secure a Kubernetes cluster?

Answer:

Control plane security:

  • Enable RBAC (Role-Based Access Control)
  • Use network policies to restrict pod communication
  • Audit logging enabled
  • etcd encryption at rest

Pod security:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

Network security:

# NetworkPolicy: Only allow traffic from frontend to backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend

Secrets management:

  • Don't store secrets in ConfigMaps
  • Use external secrets (AWS Secrets Manager, Vault)
  • Enable encryption at rest for secrets

Image security:

  • Scan images for vulnerabilities (Trivy, Snyk)
  • Use minimal base images (distroless, Alpine)
  • Sign and verify images (cosign)

Runtime security:

  • Pod Security Standards (restricted, baseline)
  • Runtime protection (Falco)
  • Regular security audits

17. Explain how TLS/SSL works

Answer:

TLS Handshake (simplified):

Client                                Server
  β”‚                                     β”‚
  │────── 1. ClientHello ──────────────►│
  β”‚       (supported ciphers, random)   β”‚
  β”‚                                     β”‚
  │◄───── 2. ServerHello ──────────────│
  β”‚       (chosen cipher, random)       β”‚
  β”‚       Certificate                   β”‚
  β”‚       ServerKeyExchange             β”‚
  β”‚                                     β”‚
  │────── 3. ClientKeyExchange ────────►│
  β”‚       (pre-master secret)           β”‚
  β”‚       ChangeCipherSpec              β”‚
  β”‚       Finished                      β”‚
  β”‚                                     β”‚
  │◄───── 4. ChangeCipherSpec ─────────│
  β”‚       Finished                      β”‚
  β”‚                                     β”‚
  │◄══════ 5. Encrypted data ══════════►│

Key concepts:

  1. Certificate verification:

    • Server presents certificate
    • Client verifies certificate chain to trusted CA
    • Ensures server identity
  2. Key exchange:

    • Agree on shared secret without transmitting it
    • Methods: RSA, Diffie-Hellman, ECDHE
  3. Symmetric encryption:

    • Use shared secret to encrypt actual data
    • Much faster than asymmetric

Common issues:

  • Expired certificates
  • Certificate hostname mismatch
  • Weak cipher suites
  • Missing intermediate certificates
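
Expired certificates are the most common failure in that list; in Kubernetes, cert-manager can automate issuance and renewal. A minimal sketch, assuming cert-manager is installed and a ClusterIssuer named letsencrypt-prod exists:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
  namespace: default
spec:
  secretName: app-tls-secret        # where the issued cert and key are stored
  dnsNames:
    - app.example.com               # illustrative hostname
  issuerRef:
    name: letsencrypt-prod          # assumed ClusterIssuer
    kind: ClusterIssuer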

18. How does a load balancer work and what are the types?

Answer:

Types:

Layer 4 (Transport):

  • Routes based on IP and port
  • No content inspection
  • Lower latency, higher throughput
  • Examples: AWS NLB, HAProxy (TCP mode)

Layer 7 (Application):

  • Routes based on content (URL, headers, cookies)
  • Can terminate TLS
  • Can modify requests
  • Examples: AWS ALB, Nginx, HAProxy (HTTP mode)

Load balancing algorithms:

Algorithm           | Description                             | Use case
Round Robin         | Sequential distribution                 | Equal capacity servers
Weighted RR         | Proportional to weight                  | Different server capacities
Least Connections   | Server with fewest active connections   | Variable request lengths
IP Hash             | Consistent server per client IP         | Session affinity
Least Response Time | Fastest responding server               | Performance optimization

Health checks:

# ALB health check
Protocol: HTTP
Path: /health
Interval: 30s
Timeout: 5s
Unhealthy threshold: 2
Healthy threshold: 3

Troubleshooting Questions

19. A production service is slow. How do you diagnose it?

Answer:

Step 1: Triage (1-2 minutes)

  • Scope: All users or subset? All endpoints or specific?
  • When did it start? Any recent deployments?
  • Check dashboards: error rates, latency, traffic

Step 2: Follow the request path

User β†’ CDN β†’ Load Balancer β†’ Application β†’ Database
         ↓         ↓              ↓           ↓
    Check each hop for latency contribution

Step 3: Investigate each layer

  1. Network:

    • DNS resolution time
    • TLS handshake time
    • Network latency between services
  2. Application:

    • CPU/memory usage
    • Thread/connection pool exhaustion
    • Slow endpoints (trace data)
  3. Database:

    • Slow queries (query logs)
    • Lock contention
    • Connection pool exhaustion
    • Replication lag
  4. External dependencies:

    • Third-party API latency
    • Downstream service issues

Step 4: Common causes checklist

  • Recent deployment?
  • Traffic spike?
  • Database slow queries?
  • Resource exhaustion (CPU, memory, connections)?
  • External dependency issues?
  • DNS issues?
  • Certificate expiry?

20. Explain how you would handle a DDoS attack

Answer:

Immediate response:

  1. Detection:

    • Unusual traffic patterns
    • Geographic anomalies
    • Request rate spikes
  2. Mitigation layers:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    AWS Shield / CloudFlare         β”‚  ← DDoS protection service
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    WAF (Web Application Firewall)  β”‚  ← Rate limiting, rules
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    CDN (CloudFront)                β”‚  ← Absorb traffic at edge
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    Load Balancer                   β”‚  ← Connection limits
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    Application                     β”‚  ← Rate limiting, captcha
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Short-term actions:

  • Enable DDoS protection service (Shield Advanced, CloudFlare)
  • Add WAF rules to block attack patterns
  • Scale infrastructure (auto-scaling)
  • Block malicious IPs/ranges

Long-term prevention:

  • Always-on DDoS protection
  • Rate limiting at multiple layers
  • Geographic restrictions if applicable
  • Regular capacity planning
  • Incident response runbooks
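
One concrete layer from the diagram, sketched with hedging as an AWS WAFv2 web ACL in CloudFormation that blocks any client IP exceeding a request-rate threshold (names and the limit are illustrative):

RateLimitAcl:
  Type: AWS::WAFv2::WebACL
  Properties:
    Name: rate-limit-acl
    Scope: REGIONAL                  # attach to an ALB; use CLOUDFRONT for CDN scope
    DefaultAction:
      Allow: {}
    VisibilityConfig:
      SampledRequestsEnabled: true
      CloudWatchMetricsEnabled: true
      MetricName: rate-limit-acl
    Rules:
      - Name: throttle-per-ip
        Priority: 0
        Action:
          Block: {}
        Statement:
          RateBasedStatement:
            Limit: 2000              # requests per 5 minutes per source IP
            AggregateKeyType: IP
        VisibilityConfig:
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: throttle-per-ip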

Scenario-Based Questions

21. Design a disaster recovery plan for a critical application

Answer:

Define objectives:

  • RTO: 1 hour (recovery time)
  • RPO: 5 minutes (data loss tolerance)

Strategy: Multi-region active-passive

Primary Region (us-east-1)          DR Region (us-west-2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Application      β”‚             β”‚  Scaled-down App    β”‚
β”‚    (active)         β”‚             β”‚  (warm standby)     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€             β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    RDS Multi-AZ     │────────────►│  RDS Read Replica   β”‚
β”‚    (primary)        β”‚  async      β”‚  (promotable)       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€             β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    S3 Bucket        │────────────►│  S3 Bucket          β”‚
β”‚                     β”‚  CRR        β”‚  (cross-region)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components:

  1. Data replication:

    • Database: Async replication to DR region
    • Object storage: Cross-region replication
    • Configuration: Stored in Git, replicated
  2. Infrastructure:

    • Terraform/CloudFormation templates in Git
    • Warm standby in DR region
    • Auto-scaling pre-configured
  3. Failover process:

    • Promote read replica to primary
    • Scale up DR application
    • Update DNS (Route 53 health checks)
    • Notify stakeholders
  4. Testing:

    • Regular DR drills (quarterly)
    • Chaos engineering
    • Document lessons learned
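
The DNS failover step can be sketched in CloudFormation as a pair of Route 53 failover records backed by a health check; hostnames, IPs, and the health check reference are illustrative assumptions:

PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: app.example.com.
    Type: A
    TTL: '60'
    SetIdentifier: primary
    Failover: PRIMARY
    HealthCheckId: !Ref PrimaryHealthCheck   # assumed AWS::Route53::HealthCheck resource
    ResourceRecords:
      - 203.0.113.10                          # primary region endpoint

SecondaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: app.example.com.
    Type: A
    TTL: '60'
    SetIdentifier: secondary
    Failover: SECONDARY
    ResourceRecords:
      - 198.51.100.20                         # DR region endpoint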

22. How would you migrate a monolith to microservices?

Answer:

Phase 1: Preparation

  • Map existing functionality and dependencies
  • Identify bounded contexts (DDD)
  • Set up CI/CD, monitoring, container infrastructure

Phase 2: Strangler Fig Pattern

                    API Gateway
                    /         \
           β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”   β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚            β”‚   β”‚            β”‚
           β”‚  Monolith  β”‚   β”‚ New        β”‚
           β”‚  (legacy)  β”‚   β”‚ Microserviceβ”‚
           β”‚            β”‚   β”‚            β”‚
           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
        Gradually move functionality
                ↓
           Eventually retire monolith

Step-by-step:

  1. Build facade: API Gateway in front of monolith
  2. Extract service: Move one bounded context out
  3. Route traffic: New requests to microservice
  4. Repeat: Extract next service
  5. Retire: Eventually decommission monolith

Key considerations:

  • Start with least coupled components
  • Handle data sharing (events, API calls)
  • Maintain backward compatibility during transition
  • Invest in observability early
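
A hedged sketch of the strangler routing in Kubernetes terms: one Ingress sends an extracted path to the new microservice and everything else to the monolith (service names, host, and paths are illustrative):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: strangler-routes
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /orders               # extracted bounded context
        pathType: Prefix
        backend:
          service:
            name: orders-service
            port:
              number: 80
      - path: /                     # everything else still hits the monolith
        pathType: Prefix
        backend:
          service:
            name: monolith
            port:
              number: 80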

Quick-Fire Questions

23. What's the difference between TCP and UDP?

  • TCP: Connection-oriented, guaranteed delivery, ordered, slower
  • UDP: Connectionless, no guarantee, unordered, faster (used for: streaming, gaming, DNS)

24. What is a reverse proxy?

Server that sits between clients and origin servers. Handles SSL termination, load balancing, caching, security. Examples: Nginx, HAProxy.

25. What's the CAP theorem?

Distributed systems can only guarantee two of: Consistency, Availability, Partition tolerance. During network partition, must choose between C and A.

26. Explain DNS resolution.

Browser cache β†’ OS cache β†’ Resolver β†’ Root server β†’ TLD server β†’ Authoritative server β†’ IP address returned.

27. What is a container registry?

Repository for storing and distributing container images. Examples: Docker Hub, ECR, GCR, Harbor.

28. What's immutable infrastructure?

Servers are never modified after deployment. Changes = deploy new servers, destroy old ones. Benefits: consistency, reproducibility.

29. Explain GitOps.

Using Git as single source of truth for infrastructure and application config. Changes via pull requests, automated sync to cluster.
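
A hedged GitOps sketch using Argo CD (the repository URL, path, and namespaces are illustrative): the Application resource points the cluster at a Git directory and keeps the two in sync.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config.git   # illustrative repo
    path: apps/my-app
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true        # delete cluster resources removed from Git
      selfHeal: true     # revert manual drift back to the Git state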

30. What is service mesh?

Infrastructure layer for service-to-service communication. Handles: traffic management, security, observability. Examples: Istio, Linkerd.


Practice DevOps Interviews with AI

Reading questions is one thing. Explaining complex systems clearly under pressure is what gets you hired.

Interview Whisper lets you:

  • Practice explaining architecture decisions
  • Answer scenario-based questions
  • Get feedback on technical clarity
  • Build confidence with system design

The best DevOps engineers communicate complexity simply.

Start Practicing DevOps Interview Questions with AI


DevOps Interview Preparation Checklist

Fundamentals:

  • Linux administration basics
  • Networking (TCP/IP, DNS, HTTP)
  • Git version control
  • Scripting (Bash, Python)

Containers & Orchestration:

  • Docker fundamentals
  • Kubernetes architecture
  • Pod lifecycle, deployments
  • Service discovery, networking

Cloud Platforms:

  • At least one cloud deeply (AWS, GCP, Azure)
  • Compute, storage, networking services
  • IAM and security best practices
  • Cost optimization

CI/CD:

  • Pipeline design
  • Testing strategies
  • Deployment patterns (blue-green, canary)
  • GitOps workflows

Infrastructure as Code:

  • Terraform or CloudFormation
  • State management
  • Module design

Observability:

  • Metrics, logs, traces
  • Alerting best practices
  • SLI/SLO/SLA concepts


DevOps interviews test both technical depth and communication skills. Practice explaining these concepts clearly.

Practice DevOps Interview Questions with AI Feedback

Tags: devops interview, cloud interview, AWS interview, kubernetes, CI/CD, SRE interview, infrastructure, docker


Ready to Ace Your Next Interview?

Get real-time AI coaching during your interviews with Interview Whisper

Download Free