You've set up CI/CD pipelines, deployed applications to Kubernetes, and managed infrastructure with Terraform.
But when the interviewer asks "How would you design a zero-downtime deployment strategy?" or "Explain the CAP theorem and its implications for distributed systems," you need more than hands-on experience.
This guide gives you 55+ real DevOps and cloud interview questions asked at Amazon, Google, Microsoft, and other top tech companies, with expert answers that demonstrate both technical depth and operational wisdom.
What DevOps Interviews Actually Test
DevOps and cloud interviews evaluate multiple dimensions:
- Technical depth: Understanding of infrastructure, networking, security
- System design: Architecting reliable, scalable systems
- Troubleshooting: Debugging production issues under pressure
- Automation mindset: Everything as code, repeatability
- Operational excellence: Monitoring, incident response, reliability
Companies want engineers who can build AND operate systems at scale.
CI/CD & Automation Questions
1. Explain CI/CD and its benefits
Answer:
Continuous Integration (CI):
- Developers merge code frequently (daily or more)
- Automated builds and tests run on every merge
- Catches integration issues early
Continuous Delivery (CD):
- Code is always in a deployable state
- Deployment to staging is automated
- Production deployment requires manual approval
Continuous Deployment:
- Every change that passes tests automatically deploys to production
- No manual intervention
Benefits:
- Faster feedback: Know within minutes if code breaks something
- Reduced risk: Small, frequent changes are easier to debug
- Higher quality: Automated tests catch regressions
- Developer productivity: Less time on manual deployments
Key metrics:
- Deployment frequency
- Lead time for changes
- Change failure rate
- Mean time to recovery (MTTR)
2. What's the difference between blue-green and canary deployments?
Answer:
Blue-Green Deployment:
```
        Load Balancer
        /           \
  Blue (v1)     Green (v2)
  [active]        [idle]
```
- Two identical production environments
- Deploy new version to idle environment
- Switch traffic instantly via load balancer
- Easy rollback (switch back)
Pros: Instant rollback, full testing in production
Cons: Double infrastructure cost, complex database migrations
Canary Deployment:
```
         Load Balancer
        /      |      \
  v1 (90%)  v1 (5%)  v2 (5%)
                        |
             gradually increase v2
```
- Deploy to small subset of servers/users first
- Gradually increase traffic to new version
- Monitor for errors, rollback if issues
Pros: Lower risk, real user validation
Cons: Slower rollout, more complex routing
When to use:
- Blue-green: When you need instant rollback capability
- Canary: When you want to validate with real traffic gradually
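As a concrete sketch of the canary approach: with the NGINX Ingress Controller, a second Ingress marked as a canary shifts a small percentage of traffic to the new version (service, host, and resource names here are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-canary
  annotations:
    # ingress-nginx canary annotations: send 5% of traffic to the
    # canary backend; raise the weight as confidence grows
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-v2
                port:
                  number: 80
```

Rolling back is deleting the canary Ingress; full rollout is pointing the primary Ingress at v2.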
3. How would you design a CI/CD pipeline for a microservices architecture?
Answer:
Key considerations:
- Each service has its own pipeline
- Shared infrastructure components
- Service dependencies during testing
Pipeline stages:
```
Code Push → Build → Unit Test → Security Scan
  → Build Image → Push to Registry → Deploy to Dev
  → Integration Tests → Deploy to Staging → E2E Tests
  → Deploy to Production (Canary) → Full Rollout
```
Per-service pipeline:
```yaml
# Example: GitHub Actions
name: service-a-pipeline
on:
  push:
    paths:
      - 'services/service-a/**'
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: npm test
      - name: Build Docker image
        run: docker build -t registry/service-a:${{ github.sha }} services/service-a
      - name: Push to registry
        run: docker push registry/service-a:${{ github.sha }}
      - name: Deploy to K8s
        run: kubectl set image deployment/service-a ...
```
Cross-cutting concerns:
- Shared libraries: Separate pipeline, version properly
- Database migrations: Run before deployment
- Contract testing: Verify service compatibility
- Environment parity: Dev/staging mirror production
Container & Kubernetes Questions
4. Explain the difference between Docker containers and VMs
Answer:
| Aspect | Virtual Machine | Container |
|---|---|---|
| Isolation | Hardware-level (hypervisor) | OS-level (kernel) |
| Size | GBs (includes OS) | MBs (shares host kernel) |
| Startup | Minutes | Seconds |
| Overhead | High (full OS per VM) | Low (shared kernel) |
| Portability | Limited | High (solves "works on my machine") |
| Security | Stronger isolation | Weaker (shared kernel) |
Architecture:
```
VM:                        Container:
+--------+--------+        +--------+--------+
|  App   |  App   |        |  App   |  App   |
+--------+--------+        +--------+--------+
|Guest OS|Guest OS|        |Container Runtime|
+--------+--------+        +-----------------+
|   Hypervisor    |        |     Host OS     |
+-----------------+        +-----------------+
|    Hardware     |        |    Hardware     |
+-----------------+        +-----------------+
```
When to use VMs:
- Strong security isolation needed
- Different OS requirements
- Legacy applications
- Multi-tenant environments
When to use containers:
- Microservices
- Fast scaling requirements
- Development/testing environments
- Cloud-native applications
5. Explain Kubernetes architecture
Answer:
Control Plane (Master):
- API Server: Entry point for all REST commands
- etcd: Distributed key-value store for cluster state
- Scheduler: Assigns pods to nodes based on resources
- Controller Manager: Runs control loops (ReplicaSet, Deployment, etc.)
Worker Nodes:
- kubelet: Agent ensuring containers run in pods
- kube-proxy: Network proxy implementing service abstraction
- Container runtime: Docker, containerd, CRI-O
Key objects:
```yaml
# Pod: Smallest deployable unit
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:v1
```

```yaml
# Deployment: Manages ReplicaSets, enables rollouts
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    # Pod template goes here
```

```yaml
# Service: Stable network endpoint for pods
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```
6. How does Kubernetes handle service discovery and load balancing?
Answer:
Service types:
1. ClusterIP (default):
   - Internal IP accessible only within the cluster
   - kube-proxy routes traffic to pods
2. NodePort:
   - Exposes the service on each node's IP
   - Port range: 30000-32767
3. LoadBalancer:
   - Provisions a cloud load balancer
   - External traffic routes through NodePort
4. ExternalName:
   - Maps to an external DNS name
   - No proxying, just a DNS CNAME
Service discovery:
```bash
# DNS-based (preferred)
# Format: <service>.<namespace>.svc.cluster.local
curl http://my-service.default.svc.cluster.local

# Environment variables (legacy)
MY_SERVICE_HOST=10.0.0.1
MY_SERVICE_PORT=80
```
Load balancing:
- kube-proxy uses iptables rules or IPVS
- Round-robin by default
- SessionAffinity for sticky sessions
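For the sticky-session case, a Service can pin each client to one pod via `sessionAffinity` (a minimal sketch; names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  # Route each client IP to the same pod for up to 3 hours
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  ports:
    - port: 80
      targetPort: 8080
```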
7. What are Kubernetes probes and when do you use each?
Answer:
Liveness Probe:
- Purpose: Is the container alive?
- Action on failure: Restart container
- Use case: Detect deadlocks, hung processes
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
```
Readiness Probe:
- Purpose: Is the container ready to receive traffic?
- Action on failure: Remove from service endpoints
- Use case: Warm-up time, dependency checks
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
Startup Probe:
- Purpose: Has the container started successfully?
- Action on failure: Restart container
- Use case: Slow-starting applications
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```
Best practices:
- Liveness: Check core functionality, not dependencies
- Readiness: Check dependencies, database connections
- Set appropriate timeouts and thresholds
- Don't make probes too expensive
8. Explain Kubernetes resource management (requests and limits)
Answer:
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"    # 0.25 CPU cores
  limits:
    memory: "512Mi"
    cpu: "500m"
```
Requests:
- Minimum resources guaranteed
- Used by scheduler to place pods
- Node must have this available
Limits:
- Maximum resources allowed
- CPU: Throttled if exceeded
- Memory: OOMKilled if exceeded
QoS Classes:
| Class | Condition | Priority |
|---|---|---|
| Guaranteed | requests = limits for all containers | Highest |
| Burstable | requests < limits | Medium |
| BestEffort | No requests or limits | Lowest (evicted first) |
Best practices:
- Always set requests (for scheduling)
- Set limits to prevent noisy neighbors
- Monitor actual usage, adjust accordingly
- Use LimitRanges for defaults
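A LimitRange like the following gives containers in a namespace default requests and limits when they declare none of their own (a sketch; values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
spec:
  limits:
    - type: Container
      defaultRequest:    # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:           # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
```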
Cloud & Infrastructure Questions
9. Explain the shared responsibility model in cloud
Answer:
```
+---------------------------------------------------------------+
|                   CUSTOMER RESPONSIBILITY                     |
|  - Data encryption & integrity                                |
|  - IAM (Identity and Access Management)                       |
|  - Application security                                       |
|  - Network configuration (security groups, NACLs)             |
|  - OS patching (EC2), runtime patching (Lambda: partial)      |
+---------------------------------------------------------------+
|                           SHARED                              |
|  - Patch management (varies by service)                       |
|  - Configuration management                                   |
+---------------------------------------------------------------+
|                       CLOUD PROVIDER                          |
|  - Physical security                                          |
|  - Hardware maintenance                                       |
|  - Network infrastructure                                     |
|  - Hypervisor security                                        |
|  - Managed service security (RDS engine, Lambda runtime)      |
+---------------------------------------------------------------+
```
Varies by service type:
- IaaS (EC2): Customer manages OS and up
- PaaS (Elastic Beanstalk): Customer manages application
- SaaS (S3): Customer manages data and access
10. How would you design a highly available architecture on AWS?
Answer:
Multi-AZ design:
```
                  Route 53 (DNS)
                        |
                   CloudFront
                   (CDN/WAF)
                        |
           Application Load Balancer
              /                  \
+---------------------+  +---------------------+
|      AZ-1 (a)       |  |      AZ-2 (b)       |
|  +---------------+  |  |  +---------------+  |
|  | Auto Scaling  |  |  |  | Auto Scaling  |  |
|  | Group (EC2)   |  |  |  | Group (EC2)   |  |
|  +---------------+  |  |  +---------------+  |
|  +---------------+  |  |  +---------------+  |
|  | RDS Primary   |<-+--+->| RDS Standby   |  |
|  +---------------+  |  |  +---------------+  |
|  +---------------+  |  |  +---------------+  |
|  | ElastiCache   |<-+--+->| ElastiCache   |  |
|  | (Redis)       |  |  |  | Replica       |  |
|  +---------------+  |  |  +---------------+  |
+---------------------+  +---------------------+
```
Key components:
- DNS: Route 53 with health checks and failover
- CDN: CloudFront for static content, edge caching
- Load Balancing: ALB distributing across AZs
- Compute: Auto Scaling Groups spanning AZs
- Database: RDS Multi-AZ for automatic failover
- Caching: ElastiCache with replication
- Storage: S3 (11 9s durability, automatic replication)
Recovery objectives:
- RTO (Recovery Time Objective): How fast to recover
- RPO (Recovery Point Objective): How much data loss acceptable
11. Explain VPC, subnets, security groups, and NACLs
Answer:
VPC (Virtual Private Cloud):
- Isolated network in the cloud
- Define IP range (CIDR block)
- Contains subnets, route tables, gateways
Subnets:
- Subdivisions of VPC
- Public subnet: Route to Internet Gateway
- Private subnet: No direct internet access
```
VPC: 10.0.0.0/16
├── Public Subnet: 10.0.1.0/24 (AZ-a)
│     └── Route: 0.0.0.0/0 → Internet Gateway
├── Public Subnet: 10.0.2.0/24 (AZ-b)
├── Private Subnet: 10.0.10.0/24 (AZ-a)
│     └── Route: 0.0.0.0/0 → NAT Gateway
└── Private Subnet: 10.0.20.0/24 (AZ-b)
```
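In Terraform, one public subnet from that layout might look like this (a sketch; the `aws_vpc.main` reference and AZ name are illustrative):

```hcl
resource "aws_subnet" "public_a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true # what makes it "public", alongside an IGW route
}
```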
Security Groups vs NACLs:
| Aspect | Security Group | NACL |
|---|---|---|
| Level | Instance/ENI | Subnet |
| State | Stateful | Stateless |
| Rules | Allow only | Allow & Deny |
| Evaluation | All rules | Ordered by number |
| Default | Deny all inbound | Allow all |
Security Group example:
```
Inbound:
  - Port 443 from 0.0.0.0/0 (HTTPS)
  - Port 22 from 10.0.0.0/8 (SSH from VPC)
Outbound:
  - All traffic (stateful, responses allowed automatically)
```
12. Compare IaC tools: Terraform, CloudFormation, Pulumi
Answer:
| Aspect | Terraform | CloudFormation | Pulumi |
|---|---|---|---|
| Provider | Multi-cloud | AWS only | Multi-cloud |
| Language | HCL (declarative) | YAML/JSON | Python, TS, Go |
| State | Self-managed or remote | AWS managed | Self-managed or cloud |
| Modularity | Modules, registry | Nested stacks | Standard packages |
| Learning curve | Medium | Low (AWS users) | Low (if you know language) |
| Drift detection | terraform plan | Drift detection feature | pulumi preview |
Terraform example:
```hcl
resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.micro"

  tags = {
    Name = "web-server"
  }
}

resource "aws_security_group" "web" {
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```
When to use:
- Terraform: Multi-cloud, mature ecosystem, large community
- CloudFormation: AWS-native, integrated with AWS services
- Pulumi: When team prefers real programming languages
Monitoring & Observability Questions
13. Explain the three pillars of observability
Answer:
1. Metrics:
- Numerical data over time
- Examples: CPU usage, request count, latency percentiles
- Tools: Prometheus, CloudWatch, Datadog
```
# Prometheus metrics
http_requests_total{method="GET", status="200"} 1234
http_request_duration_seconds{quantile="0.99"} 0.5
```
2. Logs:
- Discrete events with context
- Examples: Error messages, audit trails, debug info
- Tools: ELK Stack, Splunk, CloudWatch Logs
```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "user_id": "user_456"
}
```
3. Traces:
- Request flow across services
- Shows latency breakdown, dependencies
- Tools: Jaeger, Zipkin, AWS X-Ray
```
Request → API Gateway (5ms) → Auth Service (10ms)
        → User Service (15ms) → Database (50ms)
Total: 80ms
```
Why all three:
- Metrics: "Something is wrong" (high error rate)
- Logs: "What went wrong" (error details)
- Traces: "Where it went wrong" (which service)
14. How would you set up alerting for a production system?
Answer:
Alert categories:
1. Symptoms (user impact): prefer these
   - Error rate > 1%
   - P99 latency > 500ms
   - Availability < 99.9%
2. Causes (system health): use sparingly
   - CPU > 80%
   - Memory > 90%
   - Disk > 85%
Alert design principles:
```yaml
# Good alert (Prometheus alerting rule)
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01
  for: 5m  # Avoid flapping
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1%"
    runbook: "https://wiki/runbooks/high-error-rate"
```
Best practices:
- Alert on symptoms, not causes
- Use a `for` duration to avoid flapping
- Include runbook links
- Set appropriate severity levels
- Route to right team (PagerDuty routing)
Anti-patterns:
- Alerting on every metric
- No runbooks
- Alert fatigue (too many non-actionable alerts)
- Missing severity levels
15. Explain SLIs, SLOs, and SLAs
Answer:
SLI (Service Level Indicator):
- Quantitative measure of service
- Examples: availability, latency, error rate
```
Availability SLI = (successful_requests / total_requests) * 100
Latency SLI     = p99_latency_ms
```
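A quick sketch of computing both SLIs from raw data (synthetic numbers; in practice these come from your metrics backend):

```python
import math

def availability_sli(successful: int, total: int) -> float:
    """Availability as a percentage of successful requests."""
    return successful / total * 100

def p99_latency(samples_ms: list) -> float:
    """Nearest-rank 99th-percentile of latency samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

print(availability_sli(997, 1000))                             # 99.7
print(p99_latency([12, 15, 14, 200, 16, 13, 18, 17, 15, 14]))  # 200
```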
SLO (Service Level Objective):
- Target value for SLI
- Internal goal, drives engineering decisions
SLO: 99.9% availability (allows 8.76 hours downtime/year)
SLO: P99 latency < 200ms
SLA (Service Level Agreement):
- Contract with customers
- Includes consequences (refunds) for missing targets
- Usually less aggressive than SLO (buffer)
SLA: 99.5% availability
If missed: 10% service credit
Error budget:
```
Error budget = 100% - SLO
             = 100% - 99.9%
             = 0.1% of requests can fail
Monthly error budget = 0.1% × 43,200 min = 43.2 minutes of downtime
```
When error budget is exhausted:
- Freeze feature releases
- Focus on reliability
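The error-budget arithmetic above, as a small script (window sizes assume a 30-day month):

```python
def error_budget_minutes(slo_percent: float, window_minutes: int) -> float:
    """Allowed 'bad minutes' for a given SLO over a time window."""
    return (100.0 - slo_percent) / 100.0 * window_minutes

# 30-day month = 43,200 minutes; a 99.9% SLO leaves ~43.2 bad minutes
print(error_budget_minutes(99.9, 43_200))
# A year = 525,600 minutes; 99.9% leaves ~8.76 hours
print(error_budget_minutes(99.9, 525_600) / 60)
```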
Security & Networking Questions
16. How do you secure a Kubernetes cluster?
Answer:
Control plane security:
- Enable RBAC (Role-Based Access Control)
- Use network policies to restrict pod communication
- Audit logging enabled
- etcd encryption at rest
Pod security:
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
```
Network security:
```yaml
# NetworkPolicy: Only allow traffic from frontend to backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
```
Secrets management:
- Don't store secrets in ConfigMaps
- Use external secrets (AWS Secrets Manager, Vault)
- Enable encryption at rest for secrets
Image security:
- Scan images for vulnerabilities (Trivy, Snyk)
- Use minimal base images (distroless, Alpine)
- Sign and verify images (cosign)
Runtime security:
- Pod Security Standards (restricted, baseline)
- Runtime protection (Falco)
- Regular security audits
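To make the RBAC point concrete, here is a minimal read-only Role and its binding (namespace and user names are illustrative):

```yaml
# Grant read-only access to pods in one namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods
subjects:
  - kind: User
    name: jane
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```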
17. Explain how TLS/SSL works
Answer:
TLS Handshake (simplified):
```
Client                                        Server
  |                                              |
  |----- 1. ClientHello ----------------------->|
  |      (supported ciphers, random)            |
  |                                              |
  |<---- 2. ServerHello -------------------------|
  |      (chosen cipher, random)                |
  |      Certificate                            |
  |      ServerKeyExchange                      |
  |                                              |
  |----- 3. ClientKeyExchange ----------------->|
  |      (pre-master secret)                    |
  |      ChangeCipherSpec                       |
  |      Finished                               |
  |                                              |
  |<---- 4. ChangeCipherSpec --------------------|
  |      Finished                               |
  |                                              |
  |<==== 5. Encrypted data ====================>|
```
Key concepts:
1. Certificate verification:
   - Server presents its certificate
   - Client verifies the certificate chain up to a trusted CA
   - Ensures server identity
2. Key exchange:
   - Agree on a shared secret without transmitting it
   - Methods: RSA, Diffie-Hellman, ECDHE
3. Symmetric encryption:
   - Use the shared secret to encrypt the actual data
   - Much faster than asymmetric encryption
Common issues:
- Expired certificates
- Certificate hostname mismatch
- Weak cipher suites
- Missing intermediate certificates
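Expired certificates are the most common of these, and `openssl x509 -checkend` makes expiry easy to check in a script. The sketch below generates a short-lived self-signed certificate just to demonstrate the check; against a real endpoint you would feed it the served certificate instead (paths and the CN are illustrative):

```shell
# Create a throwaway 7-day self-signed cert to run the check against
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 7 -subj "/CN=demo.local" 2>/dev/null

# -checkend takes seconds: 2592000s = 30 days.
# Exit status 0 means the cert will still be valid at that point.
if openssl x509 -in /tmp/demo.crt -noout -checkend 2592000 >/dev/null; then
  echo "certificate valid for at least 30 more days"
else
  echo "certificate expires within 30 days"
fi
```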
18. How does a load balancer work and what are the types?
Answer:
Types:
Layer 4 (Transport):
- Routes based on IP and port
- No content inspection
- Lower latency, higher throughput
- Examples: AWS NLB, HAProxy (TCP mode)
Layer 7 (Application):
- Routes based on content (URL, headers, cookies)
- Can terminate TLS
- Can modify requests
- Examples: AWS ALB, Nginx, HAProxy (HTTP mode)
Load balancing algorithms:
| Algorithm | Description | Use case |
|---|---|---|
| Round Robin | Sequential distribution | Equal capacity servers |
| Weighted RR | Proportional to weight | Different server capacities |
| Least Connections | Send to server with fewest connections | Variable request lengths |
| IP Hash | Consistent server per client IP | Session affinity |
| Least Response Time | Fastest server | Performance optimization |
Health checks:
```
# ALB health check
Protocol: HTTP
Path: /health
Interval: 30s
Timeout: 5s
Unhealthy threshold: 2
Healthy threshold: 3
```
Troubleshooting Questions
19. A production service is slow. How do you diagnose it?
Answer:
Step 1: Triage (1-2 minutes)
- Scope: All users or subset? All endpoints or specific?
- When did it start? Any recent deployments?
- Check dashboards: error rates, latency, traffic
Step 2: Follow the request path
```
User → CDN → Load Balancer → Application → Database
```
Check each hop for its latency contribution.
Step 3: Investigate each layer
1. Network:
   - DNS resolution time
   - TLS handshake time
   - Network latency between services
2. Application:
   - CPU/memory usage
   - Thread/connection pool exhaustion
   - Slow endpoints (trace data)
3. Database:
   - Slow queries (query logs)
   - Lock contention
   - Connection pool exhaustion
   - Replication lag
4. External dependencies:
   - Third-party API latency
   - Downstream service issues
Step 4: Common causes checklist
- Recent deployment?
- Traffic spike?
- Database slow queries?
- Resource exhaustion (CPU, memory, connections)?
- External dependency issues?
- DNS issues?
- Certificate expiry?
20. Explain how you would handle a DDoS attack
Answer:
Immediate response:
1. Detection:
   - Unusual traffic patterns
   - Geographic anomalies
   - Request rate spikes
2. Mitigation layers:
```
+--------------------------------------+
| AWS Shield / Cloudflare              |  <- DDoS protection service
+--------------------------------------+
| WAF (Web Application Firewall)       |  <- Rate limiting, rules
+--------------------------------------+
| CDN (CloudFront)                     |  <- Absorb traffic at edge
+--------------------------------------+
| Load Balancer                        |  <- Connection limits
+--------------------------------------+
| Application                          |  <- Rate limiting, captcha
+--------------------------------------+
```
Short-term actions:
- Enable a DDoS protection service (Shield Advanced, Cloudflare)
- Add WAF rules to block attack patterns
- Scale infrastructure (auto-scaling)
- Block malicious IPs/ranges
Long-term prevention:
- Always-on DDoS protection
- Rate limiting at multiple layers
- Geographic restrictions if applicable
- Regular capacity planning
- Incident response runbooks
Scenario-Based Questions
21. Design a disaster recovery plan for a critical application
Answer:
Define objectives:
- RTO: 1 hour (recovery time)
- RPO: 5 minutes (data loss tolerance)
Strategy: Multi-region active-passive
```
Primary Region (us-east-1)         DR Region (us-west-2)
+---------------------+            +---------------------+
| Application         |            | Scaled-down App     |
| (active)            |            | (warm standby)      |
+---------------------+   async    +---------------------+
| RDS Multi-AZ        |----------->| RDS Read Replica    |
| (primary)           |            | (promotable)        |
+---------------------+    CRR     +---------------------+
| S3 Bucket           |----------->| S3 Bucket           |
|                     |            | (cross-region)      |
+---------------------+            +---------------------+
```
Components:
1. Data replication:
   - Database: Async replication to the DR region
   - Object storage: Cross-region replication
   - Configuration: Stored in Git, replicated
2. Infrastructure:
   - Terraform/CloudFormation templates in Git
   - Warm standby in the DR region
   - Auto-scaling pre-configured
3. Failover process:
   - Promote the read replica to primary
   - Scale up the DR application
   - Update DNS (Route 53 health checks)
   - Notify stakeholders
4. Testing:
   - Regular DR drills (quarterly)
   - Chaos engineering
   - Document lessons learned
- Document lessons learned
22. How would you migrate a monolith to microservices?
Answer:
Phase 1: Preparation
- Map existing functionality and dependencies
- Identify bounded contexts (DDD)
- Set up CI/CD, monitoring, container infrastructure
Phase 2: Strangler Fig Pattern
```
             API Gateway
             /         \
   +--------------+  +--------------+
   |   Monolith   |  |     New      |
   |   (legacy)   |  | Microservice |
   +--------------+  +--------------+
          |
   Gradually move functionality
          |
   Eventually retire monolith
```
Step-by-step:
1. Build facade: Put an API Gateway in front of the monolith
2. Extract service: Move one bounded context out
3. Route traffic: Send new requests to the microservice
4. Repeat: Extract the next service
5. Retire: Eventually decommission the monolith
Key considerations:
- Start with least coupled components
- Handle data sharing (events, API calls)
- Maintain backward compatibility during transition
- Invest in observability early
Quick-Fire Questions
23. What's the difference between TCP and UDP?
- TCP: Connection-oriented, guaranteed delivery, ordered, slower
- UDP: Connectionless, no guarantee, unordered, faster (used for: streaming, gaming, DNS)
24. What is a reverse proxy?
Server that sits between clients and origin servers. Handles SSL termination, load balancing, caching, security. Examples: Nginx, HAProxy.
25. What's the CAP theorem?
Distributed systems can only guarantee two of: Consistency, Availability, Partition tolerance. During network partition, must choose between C and A.
26. Explain DNS resolution.
Browser cache → OS cache → Resolver → Root server → TLD server → Authoritative server → IP address returned.
27. What is a container registry?
Repository for storing and distributing container images. Examples: Docker Hub, ECR, GCR, Harbor.
28. What's immutable infrastructure?
Servers are never modified after deployment. Changes = deploy new servers, destroy old ones. Benefits: consistency, reproducibility.
29. Explain GitOps.
Using Git as single source of truth for infrastructure and application config. Changes via pull requests, automated sync to cluster.
30. What is service mesh?
Infrastructure layer for service-to-service communication. Handles: traffic management, security, observability. Examples: Istio, Linkerd.
Practice DevOps Interviews with AI
Reading questions is one thing. Explaining complex systems clearly under pressure is what gets you hired.
Interview Whisper lets you:
- Practice explaining architecture decisions
- Answer scenario-based questions
- Get feedback on technical clarity
- Build confidence with system design
The best DevOps engineers communicate complexity simply.
Start Practicing DevOps Interview Questions with AI
DevOps Interview Preparation Checklist
Fundamentals:
- Linux administration basics
- Networking (TCP/IP, DNS, HTTP)
- Git version control
- Scripting (Bash, Python)
Containers & Orchestration:
- Docker fundamentals
- Kubernetes architecture
- Pod lifecycle, deployments
- Service discovery, networking
Cloud Platforms:
- At least one cloud deeply (AWS, GCP, Azure)
- Compute, storage, networking services
- IAM and security best practices
- Cost optimization
CI/CD:
- Pipeline design
- Testing strategies
- Deployment patterns (blue-green, canary)
- GitOps workflows
Infrastructure as Code:
- Terraform or CloudFormation
- State management
- Module design
Observability:
- Metrics, logs, traces
- Alerting best practices
- SLI/SLO/SLA concepts
Related Articles
- System Design Interview Questions: Complete 2026 Guide
- FAANG Interview Preparation: Complete 2026 Guide
- 7 Common Coding Interview Mistakes and How to Avoid Them
- Top 10 Google Interview Questions
- Amazon Leadership Principles Interview Questions
- STAR Method Interview: Complete Guide with 20+ Examples
- AI Interview Practice Platforms: Complete 2025 Guide
DevOps interviews test both technical depth and communication skills. Practice explaining these concepts clearly.