👋 Hey there, I’m Dheeraj Choudhary, an AI/ML educator, cloud enthusiast, and content creator on a mission to simplify tech for the world.
After years of building on YouTube and LinkedIn, I’ve finally launched TechInsight Neuron, a no-fluff, insight-packed newsletter where I break down the latest in AI, Machine Learning, DevOps, and Cloud.
What to expect: actionable tutorials, tool breakdowns, industry trends, and career insights, all crafted for engineers, builders, and the curious.
If you're someone who learns by doing and wants to stay ahead in the tech game, you're in the right place.
Introduction
When you see "Up" in docker ps, it does not mean your application is working. It means the container's main process is running. The web server inside might be returning 500 errors, the database might be deadlocked, the connection pool might be exhausted, or the application might be stuck in a loop, consuming memory while serving no requests. Docker thinks the container is fine. Your users think the service is down.
Health checks and restart policies fix this problem. Health checks let Docker determine whether the application inside the container is actually working, not just whether the process is alive. Restart policies tell Docker what to do when a container exits unexpectedly. Used together, they mean you do not have to watch containers constantly: a container can detect its own problems and recover automatically.
This guide covers both mechanisms in depth. You will learn all five health check parameters, including the lesser-known start_interval, the difference between CMD and CMD-SHELL, health check examples for common services, how the four restart policy modes work, the difference between always and unless-stopped, and how to use depends_on for reliable startup ordering.
Running vs Actually Working: The Core Problem
Docker watches the process running as PID 1 inside the container. If that process is alive, Docker reports the container as running. That is all Docker can determine on its own.
The problem is that a live process is a very low bar. Consider these failure modes that Docker cannot detect without a health check:
A Node.js web server whose event loop is blocked by CPU-heavy work, so it accepts connections but never sends a response
A PostgreSQL container that is running but has not finished initializing and is not accepting connections yet
An API service that has exhausted its database connection pool and returns database errors on every request
A Redis instance that is running but has hit its memory limit and rejects all writes
A Java application that has exhausted its heap: the JVM process is still alive, but every request fails with an out-of-memory error
In every case, docker ps shows the container as Up. Users see errors. Without health checks, you either watch dashboards constantly or wait for users to report the outage.

The Three Container Health States
Once you set up a health check for a container, Docker tracks it as being in one of three states:
starting: The container is within its start period, the grace window right after launch. Failed health checks during this time do not count toward the retry limit. The container stays in this state until either a check passes (it becomes healthy) or the start period elapses and failures begin to count.
healthy: The last health check exited with code 0, meaning the container is working correctly.
unhealthy: The health check failed the configured number of consecutive times after the start period ended. Docker does not automatically restart an unhealthy container on its own, but orchestrators like Docker Swarm and Kubernetes use this state to restart or replace it.
You can see these states in the STATUS column of docker ps: Up 5 minutes (healthy), Up 5 minutes (unhealthy), or Up 10 seconds (health: starting).
HEALTHCHECK in the Dockerfile
The HEALTHCHECK instruction bakes a health check directly into the image. Every container started from that image automatically has the check configured, without needing any additional flags or compose file entries:
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
CMD ["node", "server.js"]

The health check runs every 30 seconds, times out after 5 seconds, gives the application a 15-second grace period on startup, and marks the container unhealthy after 3 consecutive failures.
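One wrinkle with this example: alpine-based images often don't ship curl. Since the image already contains Node, an alternative is to reuse the Node runtime itself for the check. A sketch, assuming Node 18+ (for the built-in fetch); the /health path matches the example above:

```dockerfile
# Health check without curl: reuse the container's own Node runtime.
# Assumes Node 18+ (built-in fetch).
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
  CMD node -e "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
```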
One important detail: only the last HEALTHCHECK instruction in a Dockerfile takes effect. If you have multiple HEALTHCHECK lines, Docker uses the final one. This also means a base image's HEALTHCHECK can be overridden by adding a new HEALTHCHECK in your derived image.
The HEALTHCHECK NONE form disables any health check inherited from a base image:
FROM my-base-image
HEALTHCHECK NONE # disables the base image's health check

Health Check Parameters: All Five Explained
The HEALTHCHECK instruction accepts five timing and behavior options:
HEALTHCHECK \
--interval=30s \
--timeout=10s \
--start-period=40s \
--start-interval=5s \
--retries=3 \
CMD curl -f http://localhost:3000/health || exit 1

--interval (default: 30s): How often Docker runs the health check after the start period ends. The interval is measured from the end of one check to the start of the next. Set it too low and you generate extra load and false alerts from transient blips; set it too high and real failures go undetected for longer. The 30-second default works well for most web services.

--timeout (default: 30s): How long Docker waits for the check command to finish. A check that exceeds the timeout counts as a failure. Keep the timeout shorter than the interval; with both at their 30-second defaults, a slow check can run right into the next one. A good rule of thumb is the 99th-percentile response time of your health endpoint plus a safety buffer.

--start-period (default: 0s): A grace window after container start during which failed checks do not count toward the retry limit. Checks still run during this window, and a passing check marks the container healthy immediately. Set this to cover your application's initialization time; JVM applications, for example, often need 30 to 60 seconds to initialize.

--start-interval (added in Docker 25.0 / Compose 2.20.2): How often to run checks during the start_period. This lets you check more frequently during startup (when you want to know quickly that the service is ready) without incurring that same frequency cost during normal operation. For example, --start-period=60s --start-interval=5s --interval=30s checks every 5 seconds while starting up and every 30 seconds once healthy. This is particularly useful for depends_on: condition: service_healthy scenarios where you want dependent services to start as quickly as possible after the dependency is ready.

--retries (default: 3): How many consecutive failures are required before the container is marked unhealthy. Requiring several failures in a row prevents transient problems, such as GC pauses or brief network blips, from flipping a healthy container to unhealthy.
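One consequence worth internalizing: interval and retries together set how long a real failure can go undetected. A tiny sketch of that arithmetic (plain JavaScript, not a Docker API; it ignores the extra time each failing check may spend before timing out):

```javascript
// Minimum time from the first failed check to "unhealthy":
// `retries` consecutive failures spaced `intervalSeconds` apart.
function timeToUnhealthySeconds(intervalSeconds, retries) {
  return intervalSeconds * retries;
}

console.log(timeToUnhealthySeconds(30, 3)); // Dockerfile defaults: 90
console.log(timeToUnhealthySeconds(10, 5)); // typical database config: 50
```

If 90 seconds of undetected failure is too long for a service, tighten interval or retries, accepting the extra check traffic.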
CMD vs CMD-SHELL: When to Use Each
The health check command can be specified in two forms that behave differently:
CMD form: The command is executed directly by the container runtime without involving a shell. Arguments are passed as a JSON array:
HEALTHCHECK CMD ["curl", "-f", "http://localhost:3000/health"]

# Docker Compose equivalent
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]

Use CMD when your command is a single executable with arguments and you don't need shell features. It's safer because there's no shell parsing, no risk of word splitting, and no shell injection issues.
CMD-SHELL form: The command is run via /bin/sh -c, giving access to shell features like pipes (|), logical operators (||, &&), and variable substitution. In a Dockerfile, writing the command as a plain string after CMD (rather than a JSON array) already runs it through the shell; the explicit CMD-SHELL keyword belongs to Compose and the Docker API:

HEALTHCHECK CMD pg_isready -U postgres || exit 1

# Docker Compose equivalent
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres || exit 1"]
# or as a plain string (equivalent to CMD-SHELL)
test: "pg_isready -U postgres || exit 1"

Use CMD-SHELL when you need shell operators or pipes. The pg_isready example above uses || exit 1 to ensure a non-zero exit code is returned on failure, which is required for Docker to recognize the check as failed. Without it, pg_isready would return a descriptive message but might not return a non-zero exit code in all failure cases.
A practical note: slim images based on Alpine Linux often don't include curl. Either install it in your Dockerfile (RUN apk add --no-cache curl) or use wget as an alternative:
# wget alternative for Alpine images
HEALTHCHECK CMD wget --quiet --tries=1 --spider http://localhost:3000/health || exit 1

Health Checks in Docker Compose
Compose lets you define or override health checks per service in the healthcheck: block. This takes precedence over any HEALTHCHECK instruction in the image:
services:
web:
image: nginx
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
start_interval: 5s # Docker Compose 2.20.2+
db:
image: postgres:16-alpine
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres -d myapp"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
start_interval: 5s
cache:
image: redis:7-alpine
healthcheck:
test: ["CMD-SHELL", "redis-cli ping | grep PONG || exit 1"]
interval: 10s
timeout: 5s
retries: 3

The Compose healthcheck block mirrors the Dockerfile parameters: test, interval, timeout, retries, start_period, and start_interval. Durations accept Go duration syntax: 30s, 1m30s, 2m.
Overriding and Disabling Health Checks
Sometimes you want to override a health check baked into an official image, or disable it entirely.
Override in Compose:
services:
postgres:
image: postgres:16
healthcheck:
# Override the image's default health check with a custom one
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s

Disable entirely:
services:
some-service:
image: some-image-with-built-in-healthcheck
healthcheck:
disable: true

Or in a Dockerfile:

HEALTHCHECK NONE

Disabling health checks is reasonable for development environments or one-off tasks where health state tracking adds no value. In production, don't disable health checks on services that other services depend on.
Practical Health Check Patterns by Service Type
Web / API services
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 15s

Always check a dedicated /health endpoint rather than the root path. A good health endpoint verifies the application's critical dependencies: it can reach the database, the cache responds, required environment variables are set. It should return 200 OK when healthy and a 4xx or 5xx when something is wrong. Keep it fast: it runs every interval seconds.
PostgreSQL
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres -d mydb"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s

pg_isready is the right tool here because it uses the actual PostgreSQL connection protocol rather than just checking if the port is open. It's included in all official PostgreSQL images.
MySQL / MariaDB
healthcheck:
test: ["CMD-SHELL", "mysqladmin ping -h localhost --silent"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30sFor MariaDB, the official image includes a healthcheck.sh script:
healthcheck:
test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
interval: 30s
timeout: 10s
retries: 3

Redis
healthcheck:
test: ["CMD-SHELL", "redis-cli ping | grep PONG || exit 1"]
interval: 10s
timeout: 5s
retries: 3

The grep PONG step is important. redis-cli ping returns the string "PONG" on success. Without checking the output, a failed connection might still exit with code 0. Piping to grep PONG ensures a non-zero exit if the expected response isn't received.
Workers and background processes
For background workers with no HTTP endpoint, check whether the process itself is running correctly:
healthcheck:
test: ["CMD-SHELL", "pgrep -f 'python worker.py' > /dev/null || exit 1"]
interval: 60s
timeout: 10s
retries: 3
start_period: 30s

For workers that write a heartbeat file, check its modification time:
healthcheck:
test: ["CMD-SHELL", "test $(find /tmp/heartbeat -mmin -2) || exit 1"]
interval: 120s
timeout: 5s
retries: 2
Inspecting Health Check Status and History
# See health status in docker ps (STATUS column)
docker ps
# Full health check details for a single container
docker inspect --format='{{json .State.Health}}' my-container
# Pretty-printed (requires jq)
docker inspect --format='{{json .State.Health}}' my-container | jq
# Quick health status only
docker inspect --format='{{.State.Health.Status}}' my-container
# In Compose
docker compose ps # shows health column

The docker inspect health output includes the last five check results with timestamps, exit codes, and the stdout/stderr output of the check command. This is where to look when debugging why a container is stuck in the unhealthy state.
The output looks like:
{
"Status": "unhealthy",
"FailingStreak": 3,
"Log": [
{
"Start": "2024-01-15T10:00:00Z",
"End": "2024-01-15T10:00:05Z",
"ExitCode": 1,
"Output": "curl: (7) Failed to connect to localhost port 3000: Connection refused\n"
}
]
}

The Output field in the log tells you exactly what the check command returned, which is usually enough to diagnose the problem. A common issue to look for: the check command itself is missing from the container image (like curl not being installed in an Alpine image), which shows up as a command-not-found error in the output.
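When scripting around this output, you can pull the interesting fields out of the inspect JSON. A small Node.js sketch, assuming the .State.Health shape shown above:

```javascript
// Summarize a `docker inspect` .State.Health object:
// status, failing streak, and the most recent check's result.
function summarizeHealth(health) {
  const last = health.Log[health.Log.length - 1];
  return {
    status: health.Status,
    failingStreak: health.FailingStreak,
    lastExitCode: last ? last.ExitCode : null,
    lastOutput: last ? last.Output.trim() : "",
  };
}

// Sample payload matching the example output above.
const sample = {
  Status: "unhealthy",
  FailingStreak: 3,
  Log: [{
    Start: "2024-01-15T10:00:00Z",
    End: "2024-01-15T10:00:05Z",
    ExitCode: 1,
    Output: "curl: (7) Failed to connect to localhost port 3000: Connection refused\n",
  }],
};

console.log(summarizeHealth(sample).lastExitCode); // 1
```

In practice you would feed it the output of docker inspect --format='{{json .State.Health}}' parsed with JSON.parse.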
Docker Restart Policies: All Four Modes
Restart policies tell Docker what to do when a container's main process exits. Set them with --restart on docker run or with restart: in Compose.
no (default)
docker run --restart=no nginx

Docker never restarts the container after it exits, regardless of exit code. This is the default. Fine for development, one-off tasks, and containers you manage manually.
on-failure[:max-retries]
docker run --restart=on-failure:5 my-app

Restarts only when the container exits with a non-zero exit code (indicating an error). A clean exit with code 0 does not trigger a restart. The optional :max-retries suffix caps the number of restart attempts. After the limit is reached, the container stays stopped.
Use this for containers that should complete successfully and stop (like batch jobs), or in development where you want automatic recovery from crashes but not infinite restart loops.
always
docker run --restart=always nginx

Always restarts the container when it exits, regardless of exit code. Also starts the container automatically when the Docker daemon starts (after a host reboot, for example). Even if you manually stop the container with docker stop, Docker restarts it when the daemon restarts.

Use always for services that must run after every host reboot and should never be intentionally stopped outside of deliberate teardown operations.
unless-stopped
docker run --restart=unless-stopped nginx

Behaves identically to always during normal operation. The one difference: if you explicitly stop the container with docker stop, it stays stopped even after the Docker daemon restarts. If the container exits on its own (crash, error, OOM kill), it restarts automatically.

This is the most practical policy for production services that you want to survive host reboots and recover from crashes, but that you occasionally need to stop intentionally during maintenance.
Restart Policies in Docker Compose
services:
web:
image: nginx
restart: always
api:
build: ./api
restart: unless-stopped
worker:
build: ./worker
restart: on-failure
db:
image: postgres:16
restart: unless-stopped

The restart: key in Compose directly corresponds to Docker's restart policies. For Swarm deployments, the deploy.restart_policy block provides more granular control:
services:
api:
image: my-app:1.0
deploy:
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
window: 120sThe window field specifies how long Docker waits to decide if a restart was successful. If the container fails again within the window after restarting, that counts as another failure toward max_attempts. This prevents a container that fails two seconds after every restart from rapidly consuming retries.
Note that deploy.restart_policy only applies in Swarm mode (docker stack deploy). For standalone Compose, use the top-level restart: key.
always vs unless-stopped: The Difference That Matters
This is the detail that catches people in production. The difference is small but significant:
| Scenario | always | unless-stopped |
|---|---|---|
| Container crashes | Restarts | Restarts |
| Docker daemon restarts (host reboot) | Starts | Starts |
| Explicitly stopped with docker stop | Starts | Stays stopped |
With always, there is no way to stop a container and have it stay stopped across a daemon restart. The only ways to permanently stop an always container are to remove it or change its restart policy.
With unless-stopped, docker stop sets a "stopped" flag. The container won't start again until you explicitly run docker start or docker compose up. This is what you want for most production services: automatic recovery from crashes and host reboots, with the ability to intentionally stop services during maintenance without them immediately restarting.
What Restart Policies Don't Fix
Restart policies handle process exits. They don't address the unhealthy-but-running scenario described at the start of this guide. If a container's process is running but the application is broken, the restart policy does nothing because the container hasn't exited.
This is why restart policies and health checks are complementary rather than alternatives. Health checks detect the broken-but-running state and expose it via the health status. Orchestrators like Docker Swarm and Kubernetes use that health status to take action (restart, reschedule, remove from the load balancer). On a standalone Docker host, you need external tooling or a watchdog process to act on unhealthy status, since Docker itself won't auto-restart based on health check failures alone.

The practical approach for standalone Docker Compose deployments: combine a restart policy with a health check. The restart policy handles crashes. The health check surfaces degraded states that your monitoring system can alert on.
Combining Health Checks and Restart Policies
Using both together:
services:
db:
image: postgres:16-alpine
restart: unless-stopped
environment:
POSTGRES_USER: appuser
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
POSTGRES_DB: myapp
healthcheck:
test: ["CMD-SHELL", "pg_isready -U appuser -d myapp"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
start_interval: 5s
api:
build: ./api
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 20s
start_interval: 5s
depends_on:
db:
condition: service_healthy # waits for pg_isready to pass

The interaction here: restart: unless-stopped handles crashes. The healthcheck on db prevents api from starting before the database is genuinely ready (not just started). If api crashes, unless-stopped restarts it. If db becomes unhealthy, the health status is visible in docker ps and available for monitoring systems to alert on.

Tuning Parameters for Different Applications
There is no one-size-fits-all configuration. Here's how to think about tuning each parameter:
For fast-response, latency-sensitive services (payment APIs, authentication services):
HEALTHCHECK --interval=10s --timeout=2s --start-period=15s --retries=2 \
CMD curl -f http://localhost:8080/health || exit 1

Shorter interval and fewer retries mean faster failure detection. The tradeoff is more check traffic and higher sensitivity to transient hiccups.
For slow-starting services (JVM applications, Spring Boot, large frameworks):
HEALTHCHECK --interval=30s --timeout=10s --start-period=90s --start-interval=5s --retries=3 \
CMD curl -f http://localhost:8080/actuator/health || exit 1

A long start_period covers the JVM initialization time. start_interval=5s during startup means the container is marked healthy quickly once it's ready, even if the start_period hasn't elapsed. After startup, interval=30s prevents unnecessary check overhead.
For databases and heavy background workers:
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres || exit 1"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
start_interval: 5s

More retries for databases, because a single failed check during a brief write spike shouldn't mark the database unhealthy and block dependent services. Five retries at 10-second intervals means 50 seconds of consecutive failures before the database is considered unhealthy.
For batch workers and long-running jobs:
healthcheck:
test: ["CMD-SHELL", "test -f /tmp/worker.alive && test $(find /tmp/worker.alive -mmin -5) || exit 1"]
interval: 5m
timeout: 10s
retries: 2
start_period: 60s

The worker touches /tmp/worker.alive every few minutes to indicate it's still processing. The health check verifies the file exists and was modified recently. A long interval is appropriate since second-level failure detection isn't needed for background jobs.
Health Checks in Orchestration: Swarm and Kubernetes
Docker health checks become especially powerful in orchestrated environments.
In Docker Swarm, an unhealthy container is automatically replaced. The Swarm scheduler removes the unhealthy task, starts a new one, and only routes traffic to the replacement once it's healthy. This enables zero-downtime recovery from application-level failures that wouldn't have triggered a container restart on a standalone host.
In Kubernetes, the HEALTHCHECK Dockerfile instruction is not directly used. Kubernetes has its own probe system: liveness probes (equivalent to Docker health checks: restart if failing), readiness probes (remove from the load balancer but don't restart), and startup probes (equivalent to start_period: give slow-starting containers time to initialize). Understanding Docker health checks gives you a solid foundation for Kubernetes probes, since the concepts map directly.

Health checks also gate rolling deployments in both systems. A new version is only promoted to "ready" after its health check passes. If the health check on the new version fails, the rollout stops and the old version continues serving traffic. This makes health checks a critical part of safe deployment pipelines, not just runtime monitoring.
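For comparison, the Docker parameters map onto Kubernetes probe fields roughly like this (a hypothetical probe spec for the API service above; the field names are Kubernetes' own):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 30      # roughly --interval
  timeoutSeconds: 5      # roughly --timeout
  failureThreshold: 3    # roughly --retries
startupProbe:            # roughly --start-period / --start-interval
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 5
  failureThreshold: 12   # 12 checks x 5s = 60s startup grace
```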
A Complete Production Example
A complete three-service stack with health checks, restart policies, and properly ordered startup:
services:
db:
image: postgres:16-alpine
restart: unless-stopped
environment:
POSTGRES_USER: appuser
POSTGRES_DB: myapp
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
secrets:
- db_password
volumes:
- postgres-data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U appuser -d myapp"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
start_interval: 5s
networks:
- backend
cache:
image: redis:7-alpine
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "redis-cli ping | grep PONG || exit 1"]
interval: 10s
timeout: 5s
retries: 3
start_period: 10s
start_interval: 5s
networks:
- backend
api:
build:
context: ./api
dockerfile: Dockerfile
restart: unless-stopped
secrets:
- db_password
environment:
DB_HOST: db
DB_USER: appuser
DB_NAME: myapp
DB_PASSWORD_FILE: /run/secrets/db_password
REDIS_URL: redis://cache:6379
NODE_ENV: production
ports:
- "3000:3000"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 20s
start_interval: 5s
depends_on:
db:
condition: service_healthy
cache:
condition: service_healthy
networks:
- backend
networks:
backend:
volumes:
postgres-data:
secrets:
db_password:
file: ./secrets/db_password.txt

The design decisions in this stack: all three services use unless-stopped so crashes trigger automatic restarts but intentional stops during maintenance persist across daemon restarts. Both db and cache have health checks that api waits on via condition: service_healthy, so the API never tries to connect to unready dependencies. start_interval: 5s on all services means dependent services start as quickly as possible after their dependencies become healthy. The api health check hits /health, which the application should implement to verify its database and cache connections are working, not just that the HTTP server is listening.
Conclusion
Health checks and restart policies address two distinct failure scenarios that containers face in production. Restart policies handle the case where the container process exits, whether cleanly or with an error. Health checks handle the case where the process is running but the application is not functioning correctly.
Neither is sufficient alone. A container without a restart policy stops on the first crash and waits for manual intervention. A container without a health check appears healthy while silently failing. Together, they create the foundation for self-healing infrastructure: containers that recover from crashes automatically and surface application-level health clearly enough for orchestrators and monitoring systems to take appropriate action.
The patterns here scale from a simple single-container application with restart: unless-stopped, to multi-container stacks with carefully ordered startup, to Swarm and Kubernetes deployments where health status drives automated traffic routing and container replacement. Start with the basics and layer in the more advanced configurations as your reliability requirements grow.
🔗Let’s Stay Connected
📱 Join Our WhatsApp Community
Get early access to AI/ML, Cloud, and DevOps resources and behind-the-scenes updates, and connect with like-minded learners.
➡️ Join the WhatsApp Group
✅ Follow Me for Daily Tech Insights
➡️ LinkedIn
➡️ YouTube
➡️ X (Twitter)
➡️ Website

