Health Monitoring Done Right: HTTP Checks That Actually Tell You Something

Every monitoring tutorial starts the same way: create a /health endpoint that returns 200 OK. Congratulations, you now have a health check that tells you absolutely nothing. Your database could be unreachable, your Redis cache could be full, your disk could be at 99% — and that endpoint will happily return 200 until the entire service collapses.
False positives are annoying. False negatives are dangerous. A health check that says “everything is fine” while your application is silently failing is worse than having no health check at all, because it gives you confidence you haven’t earned. Let’s talk about how to build HTTP checks that actually catch problems.
Status Code Mapping: Not Every Non-200 Is a Problem
The first mistake teams make is treating any non-200 response as a failure. This leads to alert storms that train your team to ignore notifications — the exact opposite of what you want. Here’s a more nuanced mapping:
200-299: Healthy. The service is responding and functioning. This is your baseline.
404 Not Found: Usually healthy. If you’re monitoring the root path of an API that doesn’t serve anything at /, a 404 means the server is running and responding to requests — it just doesn’t have a handler for that path. CPI-Control treats 404 as healthy by default because this pattern is extremely common with API servers, Kubernetes ingress controllers, and reverse proxies.
401/403 Unauthorized/Forbidden: The server is running and enforcing authentication. Unless your health check endpoint is supposed to be public, these codes confirm the service is alive and security is working. Don’t alert on these unless you explicitly expect a 200 from that endpoint.
429 Too Many Requests: The service is alive but rate-limiting. This is degradation, not downtime. Track it as a warning, not a critical alert.
500-503: This is where you pay attention. A 500 means unhandled errors, 502 means the upstream is unreachable, and 503 means the service is explicitly telling you it’s unavailable. These are critical.
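The mapping above is simple enough to sketch in a few lines. This is an illustrative example, not CPI-Control’s actual API — the category names are assumptions chosen for clarity:

```python
# Sketch of a status-code-to-health mapping following the rules above.
# The "healthy" / "warning" / "critical" labels are illustrative.

def classify_status(code: int) -> str:
    """Map an HTTP status code to a health category."""
    if 200 <= code <= 299:
        return "healthy"   # responding and functioning: the baseline
    if code == 404:
        return "healthy"   # server up, just no handler at that path
    if code in (401, 403):
        return "healthy"   # alive and enforcing authentication
    if code == 429:
        return "warning"   # alive but rate-limiting: degradation, not downtime
    if 500 <= code <= 503:
        return "critical"  # unhandled errors or explicit unavailability
    return "warning"       # anything else: investigate before paging
```

The point of the final fallback is the same as the rest of the mapping: an unexpected code is worth a look, but it shouldn’t wake anyone up by itself.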
Response Time as a Degradation Signal
Slow is the new down. A service that responds in 15 seconds is technically “up” but completely unusable for your customers. Response time monitoring catches the degradation phase that precedes most outages — the period where the database connection pool is filling up, memory is leaking, or a downstream dependency is timing out.
Set response time thresholds based on your actual baseline, not arbitrary numbers. If your API normally responds in 120ms, a 500ms response is a yellow flag. A 2-second response is a red flag. If your health check endpoint queries the database, cache, and any critical dependencies, its response time becomes a composite indicator of overall system health.
CPI-Control captures response time on every check and stores it as metadata. You can see trends over time and spot degradation before it turns into downtime. A service that goes from 100ms to 800ms over 30 minutes is telling you something — listen to it.
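Baseline-relative thresholds can be sketched with a rolling window. This is a hypothetical illustration, not CPI-Control’s implementation — the 4x and 10x multipliers are loose stand-ins for the 120ms-to-500ms and 120ms-to-2s examples above, and you’d tune them to your own traffic:

```python
from collections import deque
from statistics import median

# Illustrative sketch: flag degradation relative to a rolling baseline
# rather than an arbitrary fixed number. Window size and multipliers
# are assumptions you'd tune per service.

class ResponseTimeTracker:
    def __init__(self, window: int = 50):
        self.samples = deque(maxlen=window)  # recent response times (ms)

    def record(self, ms: float) -> str:
        """Record one sample and classify it against the rolling median."""
        baseline = median(self.samples) if self.samples else ms
        self.samples.append(ms)
        if ms > baseline * 10:
            return "red"     # order-of-magnitude slowdown
        if ms > baseline * 4:
            return "yellow"  # noticeable degradation, watch closely
        return "ok"
```

Using the median rather than the mean keeps one slow outlier from dragging the baseline up and masking a real trend.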
Consecutive Failure Thresholds: Don’t Page on One Timeout
Networks are unreliable. DNS resolvers hiccup. Load balancers rotate. A single failed health check means almost nothing. Two consecutive failures are concerning. Three are a pattern. Five are an incident.
Configure your monitoring to require multiple consecutive failures before escalating. The exact number depends on your check interval and your tolerance for detection latency. If you check every 30 seconds and require 3 consecutive failures, you’ll detect a real outage within 90 seconds while ignoring momentary blips.
This is one of the highest-impact configurations you can make. Teams that alert on every single failure end up with hundreds of notifications per week, most of which are transient network issues. Teams that use consecutive thresholds get maybe 2-3 alerts per week — and every single one matters.
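The escalation rule above fits in a tiny state holder. A minimal sketch, assuming a threshold of 3 to match the 30-second-interval example (the class and method names are illustrative):

```python
# Minimal sketch of consecutive-failure escalation: one failed check is
# noise; only a streak of `threshold` failures escalates, and it fires
# exactly once per streak.

class FailureCounter:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive = 0

    def observe(self, check_passed: bool) -> bool:
        """Feed one check result; return True when it's time to escalate."""
        if check_passed:
            self.consecutive = 0  # any success breaks the streak
            return False
        self.consecutive += 1
        return self.consecutive == self.threshold  # fire once, not every check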
Auto-Recovery: Know When to Close Incidents
A good monitoring system doesn’t just detect failures — it detects recovery. When your service comes back online and passes consecutive health checks, the incident should close automatically. This prevents stale incidents from cluttering your dashboard and gives you accurate uptime calculations.
Auto-recovery should mirror your failure detection. If you require 3 consecutive failures to open an incident, require 2-3 consecutive successes to close it. This prevents flapping — the scenario where a struggling service passes one check, closes the incident, fails the next check, opens a new incident, and repeats in an endless cycle.
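Putting the open and close sides together gives a small state machine. This is a sketch under the assumptions above (3 failures to open, 2 successes to close); the names are illustrative, not CPI-Control’s internals:

```python
# Sketch of symmetric open/close logic to prevent flapping: an incident
# opens after `fail_n` consecutive failures and closes only after
# `ok_n` consecutive successes.

class IncidentTracker:
    def __init__(self, fail_n: int = 3, ok_n: int = 2):
        self.fail_n, self.ok_n = fail_n, ok_n
        self.fails = self.oks = 0
        self.open = False

    def observe(self, passed: bool) -> bool:
        """Feed one check result; return whether an incident is open."""
        if passed:
            self.fails, self.oks = 0, self.oks + 1
            if self.open and self.oks >= self.ok_n:
                self.open = False  # sustained recovery: auto-close
        else:
            self.oks, self.fails = 0, self.fails + 1
            if not self.open and self.fails >= self.fail_n:
                self.open = True   # sustained failure: open incident
        return self.open
```

A struggling service that alternates pass/fail never reaches either streak, so it neither opens nor closes incidents in a loop — which is exactly the flapping behavior you’re trying to avoid.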
Response Body Capture: What Happened When It Went Down
When a service goes down at 3 AM and recovers by 3:15 AM, the first question in the morning standup is: “What happened?” If all you have is “it returned 503 for 15 minutes,” you’re going to spend hours digging through logs trying to reconstruct the timeline.
Capture the response body on failure. Many applications return error details in their 500 responses — stack traces, error codes, database connection errors, memory warnings. This metadata turns a vague “it was down” into a specific “PostgreSQL connection pool was exhausted at 03:02 UTC.”
CPI-Control stores response metadata including status codes, response times, headers, and body snippets for every failed check. When you review an incident, you see exactly what the service returned at the moment it failed — not what it returns now that it’s recovered.
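A sketch of what failure-time capture can look like, assuming Python’s standard-library `urllib`. The field names and the 2 KB snippet cap are illustrative choices, not CPI-Control’s actual schema:

```python
import urllib.error
import urllib.request

SNIPPET_BYTES = 2048  # illustrative cap so one huge error page can't bloat storage

def failure_metadata(status: int, headers: dict, body: bytes) -> dict:
    """Build the forensic record stored alongside a failed check."""
    return {
        "status": status,
        "headers": headers,
        "body_snippet": body[:SNIPPET_BYTES].decode("utf-8", errors="replace"),
    }

def run_check(url: str, timeout: float = 10.0) -> dict:
    """Perform one HTTP check, capturing the body only when it errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"status": resp.status, "failed": False}
    except urllib.error.HTTPError as e:
        meta = failure_metadata(e.code, dict(e.headers), e.read())
        meta["failed"] = e.code >= 500  # per the status mapping earlier
        return meta
    except OSError as e:  # timeout, DNS failure, connection refused
        return {"status": None, "failed": True, "error": str(e)}
```

The key detail is that the body is read inside the error path: by the time someone investigates, the service has usually recovered and the live response tells you nothing about the failure.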
Muting During Maintenance Windows
Deploying a new version? Migrating a database? Rotating certificates? These planned operations will trigger health check failures, and those failures will generate notifications that your team will ignore. Worse, they’ll condition your team to ignore all notifications — including the real ones.
Mute specific services before planned maintenance. A muted service is still checked (you want to know when it comes back), but failures don’t generate notifications or incidents. When maintenance is complete, unmute the service. If you forgot to unmute, set an auto-unmute timer.
This is one of those features that seems trivial until you don’t have it. One Kubernetes rolling update across 20 services without muting can generate 60+ notifications in 5 minutes. That’s not monitoring — that’s spam.
CPI-Control’s Approach
CPI-Control implements all of these patterns out of the box. Health checks are configurable per service with custom intervals, timeout thresholds, and expected status codes. 404 responses are treated as healthy by default. Incidents are created automatically after consecutive failures and closed automatically on recovery. Response metadata — status codes, response times, and body snippets — is captured and stored locally in SQLite, so you always have the forensic data you need.
Muting is built into the notification system: mute a service, deploy your update, and unmute when you’re done. No alert storms, no notification fatigue, no false sense of security. Just health checks that actually tell you something.