Apr 19, 2025 · 22 min read

How I Debug Production Incidents — My Runbook From Three Real War Stories

The exact process I follow when things break at 2am — from triage to root cause analysis, with three real incidents from my homelab and Kumari.ai that taught me how to debug under pressure.

DevOps · Linux · Debugging · Incident Response · Homelab · SRE

How I Debug Production Incidents

There is a specific kind of silence that happens when something breaks in production. Not the peaceful kind. The kind where your phone buzzes at 2:17am and you know, before you even look, that the next few hours of your life belong to whatever just caught fire.

I have been running a homelab for years now — a Dell PowerEdge R720, three Dell OptiPlex 7050s in a Proxmox cluster, a ZFS NAS, and a bare metal Arch Linux workstation that I use for development. On top of that, I build and operate Kumari.ai, an AI agent platform with a FastAPI backend, Next.js frontend, Redis, PostgreSQL, and a bunch of moving pieces. Things break. They break in ways I never anticipated, at times I never wanted, and the only thing that separates a 20-minute fix from a 6-hour nightmare is having a process.

This post is that process. My runbook. Three real war stories. And the monitoring setup that means I rarely get surprised anymore.

My incident debugging flowchart

The Methodology

Before I get into the stories, here is the framework I follow every single time. I have it burned into muscle memory at this point.

1. Triage (First 2 minutes)

The goal is not to fix anything. The goal is to understand what is on fire and how badly.

  • What is the user-facing impact? Is the site down? Is it slow? Is data being corrupted?
  • When did it start? Check your monitoring dashboards. Correlate with recent deploys or changes.
  • What changed recently? `git log --oneline -10`, check deployment history, look at recent cron jobs.

2. Quick Checks (Next 5 minutes)

I run the same set of commands every time. They cover 80% of incidents:

```bash
# System resources
htop             # CPU, memory, load at a glance
free -h          # Memory specifically
df -h            # Disk space
iostat -x 1 3    # Disk I/O saturation

# Processes
ps aux --sort=-%mem | head -20   # Top memory consumers
ps aux --sort=-%cpu | head -20   # Top CPU consumers

# Networking
ss -tlnp   # What is listening where
ss -s      # Connection state summary

# Logs
journalctl -p err --since "1 hour ago" --no-pager | tail -50
dmesg -T | tail -30
```
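One habit that pays off later: capture the quick-check output before you start fixing anything, so you can diff "during incident" against "after fix". A minimal sketch of such a snapshot script (not part of the original runbook; the `TRIAGE_DIR` override is a hypothetical knob for testing):

```shell
#!/bin/sh
# Sketch: snapshot the usual quick checks into a timestamped directory.
# TRIAGE_DIR is a made-up override; defaults to /tmp.
out="${TRIAGE_DIR:-/tmp}/triage-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$out"
df -h                          > "$out/df.txt"     2>&1
free -h                        > "$out/free.txt"   2>&1 || true
ps aux --sort=-%mem | head -20 > "$out/ps-mem.txt" 2>&1 || true
ss -s                          > "$out/ss.txt"     2>&1 || true
echo "snapshot written to $out"
```

Run it first thing, then again after the fix; `diff -r` between two snapshots often tells the story on its own.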

3. Deep Dive

Once I have a hypothesis, I pick the right tool:

| Symptom | Tool | Command |
|---------|------|---------|
| High CPU, unknown cause | `perf` | `perf top -p <pid>` |
| Process hanging | `strace` | `strace -fp <pid> -e trace=network` |
| Disk I/O issues | `iostat`, `iotop` | `iotop -aoP` |
| Network weirdness | `tcpdump` | `tcpdump -i eth0 -nn port 5432` |
| File descriptor leaks | `lsof` | `lsof -p <pid> \| wc -l` |
| Memory leaks | `/proc/<pid>/smaps` | `cat /proc/<pid>/smaps_rollup` |

4. Correlate

The root cause is almost never the first thing you find. You find symptoms. The real cause usually lives one or two layers deeper. Correlate timestamps across `journalctl`, `dmesg`, application logs, and your monitoring stack.
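One cheap trick for this step: if two logs use sortable timestamps, `sort -m` merges the already-sorted files into a single timeline. A toy illustration with made-up sample data (the file names and contents are not from any real incident):

```shell
# Two log files with sortable timestamps (fabricated sample data)
printf '2025-04-12T02:14 kernel: java segfault\n2025-04-12T02:41 kernel: oom-killer invoked\n' > /tmp/kernel.log
printf '2025-04-12T02:27 jenkins: memory usage climbing\n' > /tmp/app.log

# Merge the two sorted files into one time-ordered stream
sort -m /tmp/kernel.log /tmp/app.log
```

The merged output interleaves events in time order, which is exactly the view you want when hunting for "what happened right before the crash".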

5. Fix and Verify

Apply the minimal fix. Verify it works. Do not get clever at 3am.

6. Postmortem

Every incident gets a postmortem. Even if I am the only person reading it. I will share my template later in this post.

> [!TIP]
> Write the postmortem within 24 hours. Your memory of the timeline degrades fast, and the details matter.

War Story #1: The OOM Killer Mystery

The Alert

It was a Saturday morning. I was making tea when my Grafana alert fired: pve2 node unreachable. My phone lit up with three alerts in rapid succession — the node itself, then two VMs that were running on it.

I SSH'd into pve2 from my workstation. The connection was slow but it worked. That told me the node was alive but struggling.

The Investigation

First thing, always check `dmesg`:

```bash
resham@pve2:~$ dmesg -T | grep -i oom
[Sat Apr 12 02:41:17 2025] node2 invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[Sat Apr 12 02:41:17 2025] oom-killer: constraint=CONSTRAINT_NONE
[Sat Apr 12 02:41:17 2025] Out of memory: Killed process 28413 (kvm) total-vm:8421504kB, anon-rss:4182272kB, file-rss:0kB, shmem-rss:0kB, UID=0 pgtables:16832kB oom_score_adj=0
[Sat Apr 12 02:41:19 2025] Out of memory: Killed process 31087 (kvm) total-vm:4210688kB, anon-rss:2091136kB, file-rss:0kB, shmem-rss:0kB, UID=0 pgtables:8416kB oom_score_adj=0
[Sat Apr 12 02:41:22 2025] oom_reaper: reaped process 28413 (kvm), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

The OOM killer had taken out two KVM processes. Those were my VMs. But KVM processes do not just suddenly eat all memory on a host unless something else pushed the system to the edge first. The KVM processes were just the victims — they had the highest `oom_score` because they were the largest processes.
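You do not have to take the kernel's word for it after the fact: every process exposes its current score in `/proc`, so you can list the kill candidates at any time. A quick sketch of that check (one of many ways to do it):

```shell
# List the top OOM-kill candidates right now, highest score first.
# Higher oom_score = killed first when the kernel runs out of memory.
for p in /proc/[0-9]*; do
    printf '%s %s\n' "$(cat "$p/oom_score" 2>/dev/null)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -5
```

On a hypervisor, the top entries are almost always your VMs — which is exactly why they get shot first even when they are not the problem.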

So who was the actual offender?

```bash
resham@pve2:~$ journalctl --since "2025-04-12 02:00" --until "2025-04-12 02:45" -p warning --no-pager | head -40
Apr 12 02:14:33 pve2 kernel: java[19842]: segfault at 0000000000000000 ip 00007f3a1c2e4a10 sp 00007f39e4bfe920 error 4
Apr 12 02:27:11 pve2 systemd[1]: jenkins.service: memory usage 14.2G, limit set to infinity
Apr 12 02:33:45 pve2 kernel: kswapd0: page allocation failure: order:0, mode:0x100cca(GFP_HIGHUSER_MOVABLE)
Apr 12 02:38:02 pve2 kernel: Mem-Info:
Apr 12 02:38:02 pve2 kernel: active_anon:3921847 inactive_anon:189432 isolated_anon:0
Apr 12 02:38:02 pve2 kernel: active_file:1204 inactive_file:892 isolated_file:0
Apr 12 02:38:02 pve2 kernel: unevictable:0 dirty:0 writeback:0 unstable:0
Apr 12 02:38:02 pve2 kernel: slab_reclaimable:12841 slab_unreclaimable:48923
Apr 12 02:41:17 pve2 kernel: node2 invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE)
```

There it was. Jenkins. `memory usage 14.2G, limit set to infinity`. This OptiPlex 7050 has 32GB of RAM. Jenkins was running inside an LXC container on pve2 and had no memory limits set. A pipeline with a particularly large Java build had been running since 2am, and the JVM just kept allocating.

Let me confirm:

```bash
resham@pve2:~$ ps aux --sort=-%mem | head -10
USER       PID %CPU %MEM      VSZ      RSS TTY STAT START  TIME COMMAND
root     19842 84.3 44.7 16843776 14612480 ?   Sl   01:58 32:14 /usr/bin/java -Djenkins.install.runSetup... -jar /usr/share/java/jenkins.war
root     28413  2.1 12.8  8421504  4182272 ?   Sl   Apr11  8:42 /usr/bin/kvm -id 104 -name gitea-server...
root     31087  1.4  6.4  4210688  2091136 ?   Sl   Apr11  5:11 /usr/bin/kvm -id 107 -name monitoring...
root      1842  0.3  1.2  2341888   401408 ?   Sl   Apr09  4:22 /usr/bin/kvm -id 101 -name reverse-proxy...
```

44.7% of 32GB. Jenkins was eating 14.6GB and climbing. No `-Xmx` flag anywhere in the startup command.
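This class of problem is easy to audit for: grep the process table for JVMs started without a heap cap. A minimal sketch (the helper name and sample input are made up for illustration; in practice you would pipe in `ps -eo args`):

```shell
# Filter a ps-style command listing for java processes lacking an explicit -Xmx.
find_unbounded_jvms() {
    awk '/java/ && !/-Xmx/ { print "no heap limit:", $0 }'
}

# Fabricated sample input for illustration
printf '/usr/bin/java -jar jenkins.war\n/usr/bin/java -Xmx4g -jar other.jar\n' | find_unbounded_jvms
```

Wired into a cron job or a CI lint step, this would have flagged the Jenkins launch command months before the OOM.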

The Fix

Immediate fix — kill the runaway build and restart Jenkins with memory limits:

```bash
# Kill the runaway process
resham@pve2:~$ kill -9 19842

# Set JVM heap limits in Jenkins defaults
resham@pve2:~$ cat /etc/default/jenkins
JAVA_ARGS="-Xms512m -Xmx4g -XX:+UseG1GC -XX:MaxMetaspaceSize=512m"

# Restart Jenkins
resham@pve2:~$ systemctl restart jenkins
```

But that only fixes the application layer. What I really needed was a hard ceiling at the OS level so that no single container or service could ever OOM the entire host again:

```bash
# Set memory limits in the systemd service override
resham@pve2:~$ systemctl edit jenkins
# Added:
# [Service]
# MemoryMax=6G
# MemoryHigh=5G

resham@pve2:~$ systemctl daemon-reload
resham@pve2:~$ systemctl restart jenkins
```

I also went through every LXC container on the cluster and made sure Proxmox memory limits were properly set. Some of them had been created with "unlimited" memory because I was lazy when I first set them up.

```bash
# In Proxmox, for the Jenkins CT (ID 112):
resham@pve2:~$ pct set 112 -memory 8192 -swap 2048
```
> [!WARNING]
> Never run Java services without explicit `-Xmx` limits. The JVM will happily consume all available memory if you let it. On a hypervisor host, this is catastrophic because the OOM killer does not understand which process is the "important" one — it kills whatever has the highest score, which is usually your largest VM.

The Lesson

The root cause was not Jenkins. It was me, three months earlier, when I created that container without memory limits because "I'll set them later." I never did. The build that triggered the OOM was just the straw.

War Story #2: The Disk Full Silent Failure

The Alert

This one was subtle. I noticed that Kumari.ai's chat responses were working fine for existing conversations, but creating new conversations would hang and eventually time out. No errors in the FastAPI logs. No CPU or memory issues. The application just... stopped being able to write.

The Investigation

I SSH'd into the database server and started with the basics:

```bash
resham@db-server:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   50G     0 100% /
tmpfs           7.8G  1.2M  7.8G   1% /dev/shm
/dev/sdb1       200G   43G  148G  23% /mnt/data
```

100% on the root filesystem. There it is. But what was eating it?

```bash
resham@db-server:~$ du -sh /var/lib/postgresql/
47G     /var/lib/postgresql/

resham@db-server:~$ du -sh /var/lib/postgresql/16/main/*
1.2G    /var/lib/postgresql/16/main/base
38G     /var/lib/postgresql/16/main/pg_wal
4.8G    /var/lib/postgresql/16/main/pg_xlog_archive
1.1G    /var/lib/postgresql/16/main/pg_stat_tmp
```

38GB of WAL (Write-Ahead Log) files. On a 50GB root partition. That is a problem.

Let me look at what was happening:

```bash
resham@db-server:~$ ls -la /var/lib/postgresql/16/main/pg_wal/ | head -20
total 39321600
drwx------  3 postgres postgres     4096 Apr 15 02:41 .
drwx------ 19 postgres postgres     4096 Apr 10 14:22 ..
-rw-------  1 postgres postgres 16777216 Apr 10 18:33 000000010000002A00000001
-rw-------  1 postgres postgres 16777216 Apr 10 18:34 000000010000002A00000002
-rw-------  1 postgres postgres 16777216 Apr 10 18:35 000000010000002A00000003
... (2300+ files)
-rw-------  1 postgres postgres 16777216 Apr 15 02:41 000000010000002A000008FF
drwx------  2 postgres postgres     4096 Apr 15 02:41 archive_status

resham@db-server:~$ ls /var/lib/postgresql/16/main/pg_wal/ | wc -l
2347
```

Over 2300 WAL segment files, each 16MB. The reason they were accumulating was a replication slot that was no longer being consumed:

```bash
resham@db-server:~$ sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;"
    slot_name     | active | restart_lsn | confirmed_flush_lsn
------------------+--------+-------------+---------------------
 backup_replica   | f      | 2A/00000001 | 2A/00000001
(1 row)
```

There it was. `backup_replica` — a replication slot I had created weeks ago when I was testing streaming replication to a backup server. The backup server had been shut down, but the replication slot was still there. PostgreSQL retains all WAL segments after the slot's `restart_lsn` because it assumes the replica will come back and need them.

The slot was inactive (`active: f`), but PostgreSQL does not care. It will hold WAL segments indefinitely for an inactive slot. This is by design — it is a safety feature so replicas can catch up after a network partition. But if you forget about the slot, it becomes a disk bomb.
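The scale of the retention is easy to sanity-check, because WAL segments default to 16 MiB each, so the segment count maps directly to disk usage. For the 2,347 segments from the `ls | wc -l` count (assuming the default segment size):

```shell
# 2347 WAL segments x 16 MiB each, expressed in GiB
awk 'BEGIN { printf "%.1f GiB\n", 2347 * 16 / 1024 }'
```

That works out to roughly 36.7 GiB — consistent with the 38G that `du` reported for `pg_wal`, so the segment count alone is a usable early-warning metric.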

The Fix

```bash
# Drop the orphaned replication slot
resham@db-server:~$ sudo -u postgres psql -c "SELECT pg_drop_replication_slot('backup_replica');"
 pg_drop_replication_slot
--------------------------

(1 row)

# PostgreSQL immediately starts cleaning up old WAL segments
resham@db-server:~$ sleep 5 && du -sh /var/lib/postgresql/16/main/pg_wal/
1.1G    /var/lib/postgresql/16/main/pg_wal/

resham@db-server:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   13G   35G  27% /
```

37GB freed instantly. PostgreSQL went right back to normal. But I needed to make sure this could never happen again.

```bash
resham@db-server:~$ sudo -u postgres psql -c "ALTER SYSTEM SET max_wal_size = '4GB';"
resham@db-server:~$ sudo -u postgres psql -c "ALTER SYSTEM SET wal_keep_size = '2GB';"
resham@db-server:~$ sudo -u postgres psql -c "SELECT pg_reload_conf();"
```

Then I added a Prometheus alert so I would know before the disk fills up:

```yaml
# /etc/prometheus/rules/postgres_alerts.yml
groups:
  - name: postgresql
    rules:
      - alert: PostgresWALAccumulation
        expr: pg_wal_segments_count > 200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL WAL segment count is high on {{ $labels.instance }}"
          description: "{{ $value }} WAL segments accumulated. Check for orphaned replication slots."

      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Root filesystem below 10% on {{ $labels.instance }}"
```
> [!TIP]
> Always monitor for orphaned replication slots. Run `SELECT * FROM pg_replication_slots WHERE active = false;` as a periodic check. Better yet, set `max_slot_wal_keep_size` (PostgreSQL 13+) to cap how much WAL a single slot can retain.

The Lesson

PostgreSQL does not log a warning when a replication slot is causing WAL accumulation. There is no "hey, you have 38GB of WAL files" message in the logs. The database just silently fills the disk and then refuses writes with a cryptic `PANIC: could not write to file "pg_wal/..."` message — if you are lucky. If you are unlucky, it just hangs.

Silent failures are the worst kind. They do not page you. They do not log errors. They just slowly degrade until something falls over. This is why monitoring disk space with aggressive thresholds is non-negotiable.
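Even without a full Prometheus stack, a cron-able one-liner closes most of this gap. A minimal sketch (the `check_disk` helper and the 85% threshold are my illustrative choices, not from the original setup):

```shell
# Read `df -P` output on stdin; alert on any filesystem above the threshold.
check_disk() {
    awk -v limit="$1" 'NR > 1 { use = $5; sub(/%/, "", use)
        if (use + 0 > limit) print "ALERT:", $6, "at", use "%" }'
}

# Typical use, wired to cron plus mail or a push notifier:
df -P / | check_disk 85
```

It is crude, but "crude and running" beats "elegant and unwritten" for silent-failure classes like disk exhaustion.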

War Story #3: The Network Timeout That Wasn't

The Alert

This was the most frustrating one. Users of Kumari.ai were reporting intermittent 504 Gateway Timeout errors. Not consistent — maybe 5% of requests. Some retries would work fine. The pattern seemed random.

The Investigation

First, verify the backend is actually healthy:

```bash
resham@gateway:~$ curl -w "\n%{http_code} %{time_total}s\n" https://api.kumari.ai/health
{"status":"healthy","version":"2.4.1"}
200 0.043s
```

Fine. Direct health check works. Let me check Nginx:

```bash
resham@gateway:~$ tail -100 /var/log/nginx/kumari-api-error.log | grep 504
2025/04/18 14:23:11 [error] 1847#1847: *284719 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.1.1, server: api.kumari.ai, request: "POST /api/routing/generate HTTP/2.0", upstream: "http://10.0.50.12:8000/api/routing/generate"
2025/04/18 14:23:44 [error] 1847#1847: *284723 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.1.1, server: api.kumari.ai, request: "POST /api/v1/conversations HTTP/2.0", upstream: "http://10.0.50.12:8000/api/v1/conversations"
2025/04/18 14:24:02 [error] 1847#1847: *284731 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.1.1, server: api.kumari.ai, request: "GET /api/v1/models HTTP/2.0", upstream: "http://10.0.50.12:8000/api/v1/models"
```

Upstream timeouts. But the backend is healthy. Let me check if the backend is actually receiving these requests:

```bash
resham@api-server:~$ ss -tlnp | grep 8000
LISTEN 0 128 0.0.0.0:8000 0.0.0.0:* users:(("uvicorn",pid=4521,fd=7))
```

Uvicorn is listening. Let me look at the connection states:

```bash
resham@gateway:~$ ss -s
Total: 847
TCP:   623 (estab 412, closed 89, orphaned 12, timewait 67)

Transport Total     IP        IPv6
RAW       0         0         0
UDP       8         4         4
TCP       534       489       45
INET      542       493       49
FRAG      0         0         0
```

89 closed connections and 12 orphaned. That is higher than normal. Let me look at the connections to the backend specifically:

```bash
resham@gateway:~$ ss -tn state time-wait dst 10.0.50.12
Recv-Q Send-Q Local Address:Port   Peer Address:Port
0      0      10.0.50.10:42318     10.0.50.12:8000
0      0      10.0.50.10:42244     10.0.50.12:8000
0      0      10.0.50.10:41987     10.0.50.12:8000
... (34 more)
```

A lot of TIME_WAIT connections. Now let me break out tcpdump:

```bash
resham@gateway:~$ sudo tcpdump -i eth0 -nn host 10.0.50.12 and port 8000 -c 50
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:31:02.441123 IP 10.0.50.10.43012 > 10.0.50.12.8000: Flags [S], seq 2847291034, win 64240, options [mss 1460,sackOK,TS val 1892441023 ecr 0,nop,wscale 7], length 0
14:31:02.441589 IP 10.0.50.12.8000 > 10.0.50.10.43012: Flags [S.], seq 1293847562, ack 2847291035, win 65160, options [mss 1460,sackOK,TS val 3847291034 ecr 1892441023,nop,wscale 7], length 0
14:31:02.441612 IP 10.0.50.10.43012 > 10.0.50.12.8000: Flags [.], ack 1, win 502, length 0
14:31:02.441834 IP 10.0.50.10.43012 > 10.0.50.12.8000: Flags [P.], seq 1:412, ack 1, win 502, length 411: HTTP: POST /api/routing/generate HTTP/1.1
14:31:02.712341 IP 10.0.50.12.8000 > 10.0.50.10.43012: Flags [.], ack 412, win 506, length 0
14:31:02.891002 IP 10.0.50.12.8000 > 10.0.50.10.43012: Flags [P.], seq 1:284, ack 412, win 506, length 283: HTTP: HTTP/1.1 200 OK

14:31:17.223411 IP 10.0.50.10.43012 > 10.0.50.12.8000: Flags [P.], seq 412:847, ack 284, win 502, length 435: HTTP: GET /api/v1/models HTTP/1.1
14:31:17.223899 IP 10.0.50.12.8000 > 10.0.50.10.43012: Flags [R.], seq 284, ack 847, win 506, length 0
```

There it is. Look at the last two lines. Nginx sends a request on an existing connection (reusing a keepalive connection), and the backend responds with a RST (reset). The connection was already dead on the backend side, but Nginx did not know.

The timing was the clue: 14:31:02 to 14:31:17 — a 15-second gap. I checked my Nginx upstream configuration:

```nginx
upstream kumari_backend {
    server 10.0.50.12:8000;
    keepalive 64;
    keepalive_timeout 65s;
}
```

And then I checked pfSense's firewall state table settings:

```
Firewall > Advanced > Firewall Maximum States: 200000
Firewall > Advanced > TCP Idle Timeout: 10 seconds (!)
```

The pfSense TCP idle timeout was set to 10 seconds. Some previous "security hardening" exercise had lowered it aggressively. So here is what was happening:

  1. Nginx opens a connection to the backend and makes a request. Works fine.
  2. The connection goes into the keepalive pool (Nginx thinks it is good for 65 seconds).
  3. After 10 seconds of idle, pfSense drops the state table entry for this connection.
  4. 15 seconds later, Nginx reuses the connection and sends a new request.
  5. pfSense sees a packet for a connection it has no state for. Depending on configuration, it either drops the packet silently or sends a RST.
  6. The backend either never sees the request (silent drop = 504 timeout) or sees a RST and closes its side.

The 5% failure rate matched perfectly — it was only requests that happened to land on a keepalive connection that had been idle for more than 10 seconds.
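The invariant that was violated here is simple enough to encode as a check. A sketch (the function and the values fed to it are illustrative, not part of my config management):

```shell
# Nginx's upstream keepalive must expire BEFORE the stateful firewall
# forgets the connection, or reused connections hit a black hole.
check_keepalive() {
    # args: nginx keepalive_timeout (seconds), firewall TCP idle timeout (seconds)
    if [ "$1" -lt "$2" ]; then
        echo "ok: nginx gives up the connection first"
    else
        echo "UNSAFE: nginx may reuse a connection the firewall already dropped"
    fi
}

check_keepalive 65 10   # the broken combination
check_keepalive 8 10    # the fixed combination
```

The same invariant applies to any stateful middlebox between the proxy and the backend — load balancers and NAT gateways included.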

The Fix

Two changes:

```nginx
# /etc/nginx/conf.d/kumari-api.conf
upstream kumari_backend {
    server 10.0.50.12:8000;
    keepalive 32;
    keepalive_timeout 8s;       # Lower than pfSense's 10s idle timeout
    keepalive_requests 1000;
}

server {
    # ... existing config ...

    location /api/ {
        proxy_pass http://kumari_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # Required for upstream keepalive
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 120s;          # Long for streaming responses
    }
}
```
```bash
resham@gateway:~$ nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
resham@gateway:~$ systemctl reload nginx
```

I also raised pfSense's TCP idle timeout back to a sane value:

```
Firewall > Advanced > TCP Idle Timeout: 3600 seconds
```

After the fix, zero 504s. I monitored for 48 hours and did not see a single upstream timeout.

> [!WARNING]
> When using Nginx upstream keepalive behind a stateful firewall, your Nginx `keepalive_timeout` must be shorter than the firewall's state table idle timeout. If Nginx thinks a connection is alive but the firewall has already forgotten about it, you get intermittent failures that are incredibly hard to diagnose.

The Lesson

This was the hardest one to debug because the failure mode was intermittent and the root cause spanned three different systems (Nginx, pfSense, and the backend). The tcpdump output was the breakthrough — seeing the RST response to a reused connection immediately pointed to a stale connection problem.

Network issues are almost never about the network itself. They are about state — who thinks a connection is alive, who thinks it is dead, and what happens when they disagree.

My Postmortem Template

Every incident gets one of these. Even if it is just me reading it. The act of writing it forces clarity.

```markdown
# Incident Postmortem: [Title]

## Summary
- **Date:** YYYY-MM-DD
- **Duration:** X hours Y minutes
- **Severity:** P1/P2/P3/P4
- **Impact:** [Who was affected and how]

## Timeline (UTC)
- HH:MM — First alert / symptom observed
- HH:MM — Investigation started
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Verified resolution
- HH:MM — All clear

## Root Cause
[One paragraph. Be specific. "The Jenkins JVM had no -Xmx limit
and consumed 14.6GB of RAM on a 32GB host, triggering the OOM
killer which terminated two KVM processes."]

## Detection
- How was the incident detected? (Alert, user report, manual check)
- Could we have detected it sooner? How?

## Resolution
- What was the immediate fix?
- What was the permanent fix?

## Action Items
| Action | Owner | Priority | Status |
|--------|-------|----------|--------|
| [Specific action] | [Name] | P1/P2 | Open |

## Lessons Learned
- What went well?
- What went poorly?
- Where did we get lucky?

## Prevention
- What monitoring/alerting changes prevent recurrence?
- What process changes prevent recurrence?
```
> [!TIP]
> The "Where did we get lucky?" question is the most important one. It surfaces near-misses that could have been much worse. In the OOM story, I got lucky that the OOM killer did not take out the Proxmox host process itself — that would have required a physical reboot.

The Monitoring Setup That Catches Things Now

After these incidents (and a few others I will spare you), I built out a monitoring stack that actually works. Here is what runs on my cluster:

Prometheus + Alertmanager

```yaml
# /etc/prometheus/prometheus.yml (relevant scrape configs)
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'pve1:9100'
          - 'pve2:9100'
          - 'pve3:9100'
          - 'db-server:9100'
          - 'gateway:9100'
    scrape_interval: 15s

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['db-server:9187']
    scrape_interval: 30s

  - job_name: 'nginx-exporter'
    static_configs:
      - targets: ['gateway:9113']
    scrape_interval: 15s

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - 'https://kumari.ai'
          - 'https://api.kumari.ai/health'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'
```

Critical Alert Rules

These are the alerts that would have caught all three incidents before they became incidents:

```yaml
# /etc/prometheus/rules/critical_alerts.yml
groups:
  - name: system-critical
    rules:
      # Would have caught War Story #1
      - alert: HostMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low available memory on {{ $labels.instance }}"
          description: "Only {{ $value | printf \"%.1f\" }}% memory available."

      - alert: HostOOMKillerDetected
        expr: increase(node_vmstat_oom_kill[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "OOM killer invoked on {{ $labels.instance }}"

      # Would have caught War Story #2
      - alert: HostDiskWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} predicted to fill within 24h"

      - alert: HostDiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes) * 100 < 15
        for: 5m
        labels:
          severity: warning

      # Would have caught War Story #3
      - alert: NginxHighUpstreamErrors
        expr: rate(nginx_upstream_responses_total{status_code="502"}[5m]) + rate(nginx_upstream_responses_total{status_code="504"}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High upstream error rate on {{ $labels.instance }}"

      - alert: BlackboxProbeFailure
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
```

Grafana Dashboards

I run Grafana with dashboards for:

  • Node overview — CPU, memory, disk, network for every host in the cluster
  • PostgreSQL — connections, query duration, WAL size, replication lag, dead tuples
  • Nginx — request rate, error rate, upstream response times, connection states
  • Kumari.ai application — API latency (p50/p95/p99), active users, streaming response times, error rates by endpoint

The `predict_linear` alert for disk space is probably my single favorite Prometheus feature. It does not just tell you the disk is full — it tells you 24 hours before it will be full, based on the current growth rate. That would have caught the WAL accumulation days before it became a problem.
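The arithmetic behind it is plain linear extrapolation. A toy version of the computation, with made-up numbers:

```shell
# 20 GiB free, shrinking at 1 GiB/hour: predicted free space after 24 hours.
# A negative result is the condition that trips a fill-within-24h alert.
awk 'BEGIN { free = 20; rate_per_hour = -1; print free + rate_per_hour * 24 }'
```

Prometheus fits the rate from the observed window (6 hours in the rule above) instead of taking it as a constant, but the prediction itself is the same straight-line projection.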

The Toolbox

Here is my personal ranking of debugging tools by how often I reach for them:

Tier 1: Every single incident

  • `journalctl` — systemd journal is the first place I look. `journalctl -u <service> --since "1 hour ago" -p err` covers most things.
  • `dmesg -T` — kernel messages with human-readable timestamps. OOM kills, hardware errors, and filesystem issues all show up here.
  • `htop` — visual CPU and memory overview. Better than `top` in every way.
  • `df -h` / `du -sh` — disk space. Boring, but catches a shocking number of issues.
  • `ss -tlnp` — what is listening on which port. Faster and more informative than `netstat`.

Tier 2: Specific investigations

  • `tcpdump` — when you suspect network issues. The learning curve is worth it.
  • `strace` — when a process is hanging and you need to know which system call it is stuck on.
  • `lsof` — file descriptor leaks, finding which process has a file open.
  • `iostat -x` — disk I/O saturation. Look at the `%util` and `await` columns.

Tier 3: Deep performance analysis

  • `perf` — CPU profiling. `perf top` for live analysis, `perf record` + `perf report` for detailed flame graphs.
  • `bpftrace` — custom eBPF tracing. The nuclear option for when nothing else works.
  • `/proc/<pid>/smaps_rollup` — detailed memory breakdown for a specific process.

Final Thoughts

Debugging production incidents is a skill that you can only build by doing it. Reading about it helps — I read a lot of postmortems from Google, Cloudflare, and GitHub — but the muscle memory comes from sitting in front of a terminal at 2am with a broken system and no one to ask for help.

If you run a homelab, you will have incidents. That is the entire point. You are building the experience of managing real infrastructure in an environment where the blast radius is limited to your own projects. Every outage, every misconfiguration, every "I wonder why that stopped working" is a lesson that makes you better at this job.

Three things I wish someone had told me when I started:

  1. Write everything down. Not for anyone else. For future you, who will face a similar problem in 8 months and will not remember the fix.
  2. Monitoring is not optional. You can either set up Prometheus now or debug blind at 3am later. Pick one.
  3. The root cause is almost never what it looks like. The OOM kill was not a memory problem — it was a missing limit. The disk full was not a storage problem — it was an orphaned replication slot. The 504 was not a network problem — it was a state table mismatch. Keep digging.

Stay curious. Break things. Fix them. Write it down.

If you want to see the rest of my homelab infrastructure, check out the other posts in this series — from the Proxmox cluster to the ZFS NAS to the Ansible automation that ties it all together.