I deleted my staging environment. Not on purpose. Not gracefully. I ran `kubectl delete namespace kumari-staging` when I meant `kumari-staging-old`. I sat there staring at my terminal, knowing I had no reliable way to reconstruct the exact state of that namespace. Sure, I had YAML files scattered across three directories on my workstation. Some were current. Some were from two months ago. Some had been hand-edited with `kubectl edit` and never exported back to disk.

That was a Tuesday in March. By Friday, I had ArgoCD running on a fresh k3s cluster, every manifest tracked in Git, and a single `git revert` between me and any bad change.

This is how I set it up, what broke along the way, and why I'm never going back to imperative kubectl management.
## Why GitOps
Let me be specific about what was wrong with my old workflow, because "GitOps is better" is the kind of vague statement that doesn't help anyone.
My old workflow looked like this:
- Write a Kubernetes manifest on my workstation
- `kubectl apply -f deployment.yaml`
- Tweak something with `kubectl edit deployment kumari-backend`
- Forget to save the tweak back to the YAML file
- Three weeks later, wonder why the YAML on disk doesn't match what's running
- Give up on tracking state, just occasionally run `kubectl get deployment -o yaml > backup.yaml`
- Accumulate 14 backup files with names like `deployment-fixed-v3-FINAL-actually-final.yaml`
Sound familiar? The fundamental problem is configuration drift. The moment you allow changes to be made directly to the cluster without going through a tracked, versioned source of truth, the state of your cluster becomes unknowable. You might think you know what's deployed, but you don't. Not really. Not unless you've been inhumanly disciplined about exporting every single change back to version control.
GitOps solves this by inverting the flow. Instead of pushing changes to the cluster, you push changes to Git, and a controller running inside the cluster (ArgoCD, in my case) pulls those changes and applies them. The Git repo becomes the single source of truth. If the cluster state drifts from what's in Git, ArgoCD detects it and either alerts you or auto-corrects it.
The key principles:
- Declarative: The entire system is described in Git
- Versioned: Every change is a Git commit with a message, author, and timestamp
- Automated: Changes are applied by a controller, not a human running kubectl
- Self-healing: If someone manually changes something in the cluster, ArgoCD reverts it
That last point is what sold me. Even if 2 AM Resham runs `kubectl edit` against a live Deployment, ArgoCD notices the drift and puts the cluster back to whatever Git says.

## k3s Cluster Setup on KVM
I already had KVM/QEMU running on my bare metal Arch workstation (I wrote about it in my bare metal workstation post). So spinning up VMs for a Kubernetes cluster was straightforward — but there were decisions to make.
Why k3s and not full k8s? Resource constraints. My workstation has 64GB of RAM, but it's also running Docker Compose stacks, a development environment, and occasionally a browser with too many tabs. I can't dedicate 24GB+ to a proper kubeadm cluster. k3s runs the entire control plane in a single binary and uses SQLite by default (or embedded etcd when started with `--cluster-init`), which keeps the footprint small enough to share the machine.

### Creating the VMs
I use `virt-install` to create a base image that I then clone for each node:

```bash
# Create the base image first — Ubuntu 22.04 Server
resham@devbox:~$ virt-install \
    --name k3s-base \
    --ram 4096 \
    --vcpus 2 \
    --disk path=/var/lib/libvirt/images/k3s-base.qcow2,size=40 \
    --os-variant ubuntu22.04 \
    --network bridge=br0 \
    --graphics none \
    --console pty,target_type=serial \
    --cdrom /home/resham/isos/ubuntu-22.04.4-live-server-amd64.iso \
    --extra-args 'console=ttyS0'
```
After installing Ubuntu and configuring SSH keys, I clone the base image for each node:
```bash
# Clone for 3 nodes
for i in 1 2 3; do
  sudo virt-clone \
    --original k3s-base \
    --name k3s-node-${i} \
    --file /var/lib/libvirt/images/k3s-node-${i}.qcow2

  # Start the clone, change hostname
  sudo virsh start k3s-node-${i}
done
```
After booting each clone, I SSH in to set the hostname and static IP:
```bash
# On each node
sudo hostnamectl set-hostname k3s-node-1   # 2, 3 respectively
```
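The static IP half of that step lives in a netplan file. Here's a sketch of the kind of file I drop on each node — the interface name (`enp1s0`) and the gateway address are assumptions for illustration, not copied from my VMs; check `ip link` for yours:

```yaml
# /etc/netplan/01-static.yaml — illustrative sketch for k3s-node-1
network:
  version: 2
  ethernets:
    enp1s0:                    # interface name is an assumption; verify with `ip link`
      dhcp4: false
      addresses:
        - 10.0.50.41/24        # .42 / .43 on the other two nodes
      routes:
        - to: default
          via: 10.0.50.1       # assumed gateway on the br0 network
      nameservers:
        addresses: [10.0.50.1]
```

Then `sudo netplan apply` on each node makes the addresses stick across reboots.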
My network layout:
| VM | Hostname | IP | Role |
|---|---|---|---|
| k3s-node-1 | k3s-node-1 | 10.0.50.41 | Server (init) |
| k3s-node-2 | k3s-node-2 | 10.0.50.42 | Server |
| k3s-node-3 | k3s-node-3 | 10.0.50.43 | Server |
All three are servers (control plane + worker). In a homelab, there's no reason to run dedicated agent-only nodes unless you're simulating a production topology. I want HA etcd, so all three participate in the control plane.
### Installing k3s
The first node initializes the cluster with embedded etcd:
```bash
# Node 1 — initialize the cluster
resham@k3s-node-1:~$ curl -sfL https://get.k3s.io | sh -s - server \
    --cluster-init \
    --disable traefik \
    --disable servicelb \
    --write-kubeconfig-mode 644 \
    --tls-san 10.0.50.41 \
    --tls-san k3s.homelab.local
```
Wait for it to come up, then grab the token:
```bash
resham@k3s-node-1:~$ sudo cat /var/lib/rancher/k3s/server/node-token
K10c4b2f3a8e9d::server:abc123xyz789...
```
Join the other two nodes:
```bash
# Node 2
resham@k3s-node-2:~$ curl -sfL https://get.k3s.io | sh -s - server \
    --server https://10.0.50.41:6443 \
    --token K10c4b2f3a8e9d::server:abc123xyz789... \
    --disable traefik \
    --disable servicelb \
    --write-kubeconfig-mode 644 \
    --tls-san 10.0.50.42 \
    --tls-san k3s.homelab.local

# Node 3 — same but with 10.0.50.43
resham@k3s-node-3:~$ curl -sfL https://get.k3s.io | sh -s - server \
    --server https://10.0.50.41:6443 \
    --token K10c4b2f3a8e9d::server:abc123xyz789... \
    --disable traefik \
    --disable servicelb \
    --write-kubeconfig-mode 644 \
    --tls-san 10.0.50.43 \
    --tls-san k3s.homelab.local
```
After a minute, all three should show as Ready:
```bash
resham@k3s-node-1:~$ kubectl get nodes
NAME         STATUS   ROLES                       AGE   VERSION
k3s-node-1   Ready    control-plane,etcd,master   4m    v1.29.6+k3s1
k3s-node-2   Ready    control-plane,etcd,master   2m    v1.29.6+k3s1
k3s-node-3   Ready    control-plane,etcd,master   90s   v1.29.6+k3s1
```
Copy the kubeconfig to my workstation:
```bash
resham@devbox:~$ scp resham@10.0.50.41:/etc/rancher/k3s/k3s.yaml ~/.kube/config-k3s
resham@devbox:~$ sed -i 's/127.0.0.1/10.0.50.41/' ~/.kube/config-k3s
resham@devbox:~$ export KUBECONFIG=~/.kube/config-k3s
resham@devbox:~$ kubectl get nodes
NAME         STATUS   ROLES                       AGE   VERSION
k3s-node-1   Ready    control-plane,etcd,master   6m    v1.29.6+k3s1
k3s-node-2   Ready    control-plane,etcd,master   4m    v1.29.6+k3s1
k3s-node-3   Ready    control-plane,etcd,master   3m    v1.29.6+k3s1
```
Cluster is up. Now let's make it impossible to accidentally destroy.
## ArgoCD Installation
I install ArgoCD via Helm because I want to manage its own configuration declaratively later (ArgoCD managing itself is a beautiful thing — more on that).
```bash
resham@devbox:~$ helm repo add argo https://argoproj.github.io/argo-helm
resham@devbox:~$ helm repo update

resham@devbox:~$ kubectl create namespace argocd

resham@devbox:~$ helm install argocd argo/argo-cd \
    --namespace argocd \
    --version 6.9.3 \
    --set server.service.type=NodePort \
    --set server.service.nodePortHttps=30443 \
    --set configs.params."server\.insecure"=true \
    --set server.extraArgs[0]="--insecure"
```
Wait for pods to come up:
```bash
resham@devbox:~$ kubectl -n argocd get pods
NAME                                               READY   STATUS    RESTARTS   AGE
argocd-application-controller-0                    1/1     Running   0          2m
argocd-applicationset-controller-6b7b8d5d4-x7q2n   1/1     Running   0          2m
argocd-dex-server-7c94bc5f8d-m4hzp                 1/1     Running   0          2m
argocd-notifications-controller-5b8dbb7c9f-l9r4v   1/1     Running   0          2m
argocd-redis-6f8d4bdff5-2kn7x                      1/1     Running   0          2m
argocd-repo-server-7d9b6c8f7-p5t8n                 1/1     Running   0          2m
argocd-server-5f8b7c4d6-k3m9x                      1/1     Running   0          2m
```
Grab the initial admin password:
```bash
resham@devbox:~$ kubectl -n argocd get secret argocd-initial-admin-secret \
    -o jsonpath="{.data.password}" | base64 -d
# outputs something like: aB3cD4eF5gH6
```
I can now hit
`https://10.0.50.41:30443` and log in as `admin` with that password. First order of business: change it and delete the bootstrap secret:

```bash
resham@devbox:~$ argocd login 10.0.50.41:30443 --insecure
resham@devbox:~$ argocd account update-password
resham@devbox:~$ kubectl -n argocd delete secret argocd-initial-admin-secret
```
## Connecting to GitHub
ArgoCD needs read access to my Git repository. I create a GitHub fine-grained personal access token with read-only access to the repo, then register it:
```bash
resham@devbox:~$ argocd repo add https://github.com/iamresham/kumari-k8s-manifests.git \
    --username resham \
    --password ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
Or declaratively (which is what I actually committed):
```yaml
# argocd/repo-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: kumari-k8s-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: https://github.com/iamresham/kumari-k8s-manifests.git
  username: resham
  password: <sealed-secret-ref>  # More on this later
```
## Repository Structure
This is the part I spent the most time designing. A bad repo layout makes GitOps miserable. Here's what I landed on after two false starts:
```bash
resham@devbox:~/kumari-k8s-manifests$ tree -L 3
.
├── README.md
├── apps/                        # ArgoCD Application CRs
│   ├── kumari-staging.yaml
│   ├── kumari-prod.yaml
│   ├── monitoring.yaml
│   ├── traefik.yaml
│   ├── sealed-secrets.yaml
│   └── longhorn.yaml
├── base/                        # Base manifests (shared)
│   ├── kumari-backend/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── hpa.yaml
│   │   └── kustomization.yaml
│   ├── kumari-frontend/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── kustomization.yaml
│   ├── postgres/
│   │   ├── statefulset.yaml
│   │   ├── service.yaml
│   │   ├── pvc.yaml
│   │   └── kustomization.yaml
│   └── redis/
│       ├── deployment.yaml
│       ├── service.yaml
│       └── kustomization.yaml
├── overlays/                    # Environment-specific patches
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── ingress.yaml
│   │   ├── patches/
│   │   │   ├── backend-resources.yaml
│   │   │   ├── backend-env.yaml
│   │   │   ├── frontend-env.yaml
│   │   │   ├── postgres-storage.yaml
│   │   │   └── replica-count.yaml
│   │   └── secrets/
│   │       └── sealed-secrets.yaml
│   └── production/
│       ├── kustomization.yaml
│       ├── namespace.yaml
│       ├── ingress.yaml
│       ├── patches/
│       │   ├── backend-resources.yaml
│       │   ├── backend-env.yaml
│       │   ├── frontend-env.yaml
│       │   ├── postgres-storage.yaml
│       │   └── replica-count.yaml
│       └── secrets/
│           └── sealed-secrets.yaml
└── helm-values/                 # Values files for Helm charts
    ├── traefik-values.yaml
    ├── monitoring-values.yaml
    ├── longhorn-values.yaml
    └── sealed-secrets-values.yaml
```
The key insight is the base + overlays pattern from Kustomize. The `base/` directory holds environment-agnostic manifests, and each directory under `overlays/` layers environment-specific patches on top of them. Here's the staging Kustomization:
```yaml
# overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: kumari-staging

resources:
  - namespace.yaml
  - ingress.yaml
  - secrets/sealed-secrets.yaml
  - ../../base/kumari-backend
  - ../../base/kumari-frontend
  - ../../base/postgres
  - ../../base/redis

patches:
  - path: patches/backend-resources.yaml
  - path: patches/backend-env.yaml
  - path: patches/frontend-env.yaml
  - path: patches/postgres-storage.yaml
  - path: patches/replica-count.yaml

images:
  - name: kumari-backend
    newName: ghcr.io/iamresham/kumari-backend
    newTag: staging-a3f8c2d
  - name: kumari-frontend
    newName: ghcr.io/iamresham/kumari-frontend
    newTag: staging-a3f8c2d
```
And a resource patch that scales things down for staging:
```yaml
# overlays/staging/patches/replica-count.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kumari-backend
spec:
  replicas: 1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kumari-frontend
spec:
  replicas: 1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
```
Production uses the same base but with higher replica counts, more resources, and different image tags. One base, many overlays. Change the base, and all environments pick it up (unless an overlay patches it away).
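For comparison, here's a sketch of what the production overlay's Kustomization might look like. The tag, patch list, and namespace name are illustrative assumptions, not copied from my repo — the point is that it reuses the exact same `base/` directories:

```yaml
# overlays/production/kustomization.yaml — illustrative sketch
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: kumari-prod            # assumed production namespace

resources:
  - namespace.yaml
  - ingress.yaml
  - secrets/sealed-secrets.yaml
  - ../../base/kumari-backend
  - ../../base/kumari-frontend
  - ../../base/postgres
  - ../../base/redis

patches:
  - path: patches/backend-resources.yaml   # higher requests/limits than staging
  - path: patches/replica-count.yaml       # e.g. more backend replicas

images:
  - name: kumari-backend
    newName: ghcr.io/iamresham/kumari-backend
    newTag: prod-e5c9a1b          # hypothetical production tag
  - name: kumari-frontend
    newName: ghcr.io/iamresham/kumari-frontend
    newTag: prod-e5c9a1b
```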
## Deploying Kumari.ai Staging
Now the fun part. Let me walk through the actual manifests for Kumari.ai's staging deployment.
### FastAPI Backend
```yaml
# base/kumari-backend/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kumari-backend
  labels:
    app: kumari-backend
    part-of: kumari-ai
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kumari-backend
  template:
    metadata:
      labels:
        app: kumari-backend
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: backend
          image: kumari-backend   # Overridden by Kustomize images
          ports:
            - containerPort: 8000
              name: http
          envFrom:
            - configMapRef:
                name: kumari-backend-config
            - secretRef:
                name: kumari-backend-secrets
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            failureThreshold: 30
            periodSeconds: 2
      imagePullSecrets:
        - name: ghcr-pull-secret
```
The staging resource patch dials things down:
```yaml
# overlays/staging/patches/backend-resources.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kumari-backend
spec:
  template:
    spec:
      containers:
        - name: backend
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
```
### PostgreSQL StatefulSet
This one matters because data persistence on k3s can be tricky:
```yaml
# base/postgres/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  labels:
    app: postgres
    part-of: kumari-ai
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
              name: postgres
          env:
            - name: POSTGRES_DB
              value: kumari
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - kumari
            initialDelaySeconds: 5
            periodSeconds: 5
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn
        resources:
          requests:
            storage: 10Gi
```
### Redis
Redis is simpler — no persistent storage for the staging cache:
```yaml
# base/redis/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
    part-of: kumari-ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          command: ["redis-server", "--maxmemory", "128mb", "--maxmemory-policy", "allkeys-lru"]
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 192Mi
```
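The repo tree also lists a `base/redis/service.yaml` next to the Deployment. I haven't reproduced my actual file here, but a minimal Service matching the container port above would look something like:

```yaml
# base/redis/service.yaml — minimal sketch, assuming the labels above
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  selector:
    app: redis            # matches the Deployment's pod labels
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
```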
### Ingress with Traefik
Traefik is deployed as a Helm chart managed by ArgoCD. The actual Ingress for Kumari's staging:
```yaml
# overlays/staging/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kumari-staging-ingress
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    cert-manager.io/cluster-issuer: letsencrypt-staging
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - staging.kumari.homelab.local
      secretName: kumari-staging-tls
  rules:
    - host: staging.kumari.homelab.local
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: kumari-backend
                port:
                  number: 8000
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kumari-frontend
                port:
                  number: 3000
```
### The ArgoCD Application CR
This is where everything comes together. This single YAML file tells ArgoCD to watch the Git repo and deploy the staging overlay:
```yaml
# apps/kumari-staging.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kumari-staging
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/iamresham/kumari-k8s-manifests.git
    targetRevision: main
    path: overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: kumari-staging
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
```
Key decisions here:
- `automated.prune: true` — If I remove a manifest from Git, ArgoCD deletes it from the cluster. No orphaned resources.
- `automated.selfHeal: true` — If someone runs `kubectl edit` to change something, ArgoCD reverts it within 3 minutes. This is the anti-2-AM-Resham feature.
- `PruneLast: true` — When syncing, ArgoCD creates/updates resources before deleting old ones. This prevents downtime during transitions.
- `retry` — Transient failures (API server overloaded, etcd leader election) get automatic retries with exponential backoff.
Apply it:
```bash
resham@devbox:~$ kubectl apply -f apps/kumari-staging.yaml
application.argoproj.io/kumari-staging created
```
Within seconds, ArgoCD clones the repo, runs Kustomize, and starts creating resources. In the ArgoCD UI, I watch the tree of resources turn from yellow (progressing) to green (healthy) one by one. It takes about 90 seconds for everything to come up, mostly waiting for PostgreSQL's readiness probe.
```bash
resham@devbox:~$ kubectl -n kumari-staging get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/kumari-backend-7d8f9c6b5-x4k2n    1/1     Running   0          2m
pod/kumari-frontend-5c4d8f7a9-m3p7q   1/1     Running   0          2m
pod/postgres-0                        1/1     Running   0          2m
pod/redis-6f8d4bdff5-r2t9x            1/1     Running   0          2m

NAME                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)
service/kumari-backend    ClusterIP   10.43.128.45   <none>        8000/TCP
service/kumari-frontend   ClusterIP   10.43.201.12   <none>        3000/TCP
service/postgres          ClusterIP   10.43.89.201   <none>        5432/TCP
service/redis             ClusterIP   10.43.156.78   <none>        6379/TCP
```
Every single one of these resources is tracked in Git. If I delete the namespace now, I just wait 3 minutes and ArgoCD recreates everything. No panic. No scrambling through backup files.
## Monitoring with Prometheus + Grafana
The monitoring stack is deployed the same way — an ArgoCD Application pointing at a Helm chart with my custom values:
```yaml
# apps/monitoring.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 58.2.2
    helm:
      valuesObject:
        prometheus:
          prometheusSpec:
            retention: 15d
            storageSpec:
              volumeClaimTemplate:
                spec:
                  storageClassName: longhorn
                  accessModes: ["ReadWriteOnce"]
                  resources:
                    requests:
                      storage: 20Gi
            resources:
              requests:
                cpu: 200m
                memory: 512Mi
              limits:
                cpu: "1"
                memory: 2Gi
            serviceMonitorSelector: {}
            serviceMonitorNamespaceSelector: {}
        grafana:
          adminPassword: <from-sealed-secret>
          persistence:
            enabled: true
            storageClassName: longhorn
            size: 5Gi
          dashboardProviders:
            dashboardproviders.yaml:
              apiVersion: 1
              providers:
                - name: 'default'
                  folder: ''
                  type: file
                  options:
                    path: /var/lib/grafana/dashboards/default
          dashboards:
            default:
              kubernetes-cluster:
                gnetId: 7249
                revision: 1
                datasource: Prometheus
              argocd:
                gnetId: 14584
                revision: 1
                datasource: Prometheus
        alertmanager:
          alertmanagerSpec:
            resources:
              requests:
                cpu: 50m
                memory: 64Mi
              limits:
                cpu: 200m
                memory: 256Mi
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
```
The `serviceMonitorSelector: {}` and `serviceMonitorNamespaceSelector: {}` settings matter: with empty selectors, Prometheus picks up ServiceMonitors from every namespace, not just the ones labeled by this Helm release.

I added a Grafana dashboard (ID 14584) specifically for ArgoCD metrics — it shows sync duration, sync failures, app health status, and repo server latency. Being able to see when ArgoCD is struggling to sync is critical for debugging.
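Those wide-open selectors are what make drop-in ServiceMonitors work. As an illustration only — this is a hypothetical manifest, not one from my repo — a ServiceMonitor that would get the backend's `/metrics` endpoint scraped could look roughly like this, assuming the backend Service carries the `app: kumari-backend` label and a port named `http`:

```yaml
# Hypothetical ServiceMonitor for the staging backend
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kumari-backend
  namespace: kumari-staging
spec:
  selector:
    matchLabels:
      app: kumari-backend   # assumed Service label
  endpoints:
    - port: http            # named port on the Service (assumption)
      path: /metrics
      interval: 30s
```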
## The Sync Loop
Here's what the actual workflow looks like now:
```
Developer pushes commit
          │
          ▼
GitHub repository updated
          │
          ▼
ArgoCD polls repo (every 3 minutes by default)
          │
          ▼
ArgoCD detects diff between desired state (Git) and live state (cluster)
          │
          ▼
ArgoCD runs Kustomize build / Helm template
          │
          ▼
ArgoCD applies manifests to cluster
          │
          ▼
Kubernetes reconciles (creates/updates/deletes resources)
          │
          ▼
ArgoCD verifies health checks pass
          │
          ▼
Application status: Synced + Healthy ✓
```
In practice, a typical change — say, updating the backend image tag — looks like this:
```bash
# On my workstation, update the image tag in the staging overlay
resham@devbox:~/kumari-k8s-manifests$ vim overlays/staging/kustomization.yaml
# Change: newTag: staging-a3f8c2d → newTag: staging-b7e1d4f

resham@devbox:~/kumari-k8s-manifests$ git add overlays/staging/kustomization.yaml
resham@devbox:~/kumari-k8s-manifests$ git commit -m "feat: bump backend to staging-b7e1d4f (add websocket auth)"
resham@devbox:~/kumari-k8s-manifests$ git push
```
Within 3 minutes (or immediately if I click "Refresh" in the ArgoCD UI), the new image rolls out. ArgoCD creates a new ReplicaSet, waits for the new pods to pass readiness probes, then scales down the old ReplicaSet. Standard Kubernetes rolling update, but triggered by a Git push instead of a manual `kubectl set image`.

If the new image is broken — say the readiness probe fails — the rollout stalls. The old pods keep serving traffic. I see the failure in ArgoCD's UI, revert the Git commit, push, and ArgoCD rolls back. The entire incident response is:
```bash
resham@devbox:~/kumari-k8s-manifests$ git revert HEAD
resham@devbox:~/kumari-k8s-manifests$ git push
```
Two commands. No `kubectl rollout undo`, no hunting for the last known-good YAML.

## Secrets Management
This is the part of GitOps that everyone warns you about, and they're right to. You can't just commit Kubernetes Secrets to Git in plaintext. But you also can't have ArgoCD manage your deployments if the Secrets live outside of Git. The whole point is Git as the single source of truth.
I use Sealed Secrets from Bitnami. The architecture is straightforward:
- A controller runs in the cluster and holds a private key
- I encrypt Secrets on my workstation using the controller's public key
- The encrypted SealedSecret resource is committed to Git
- ArgoCD applies the SealedSecret to the cluster
- The controller decrypts it into a regular Kubernetes Secret
```bash
# Install kubeseal CLI
resham@devbox:~$ wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.26.2/kubeseal-0.26.2-linux-amd64.tar.gz
resham@devbox:~$ tar xzf kubeseal-0.26.2-linux-amd64.tar.gz
resham@devbox:~$ sudo mv kubeseal /usr/local/bin/

# Fetch the public key from the cluster
resham@devbox:~$ kubeseal --fetch-cert > sealed-secrets-pub.pem
```
Creating a sealed secret:
```bash
# Start with a regular Secret (never commit this)
resham@devbox:~$ kubectl create secret generic kumari-backend-secrets \
    --namespace kumari-staging \
    --from-literal=DATABASE_URL='postgresql://kumari:s3cretP4ss@postgres:5432/kumari' \
    --from-literal=REDIS_URL='redis://redis:6379/0' \
    --from-literal=OPENAI_API_KEY='sk-proj-...' \
    --from-literal=JWT_SECRET='ultra-secret-jwt-key-here' \
    --dry-run=client -o yaml | \
  kubeseal --cert sealed-secrets-pub.pem -o yaml \
    > overlays/staging/secrets/sealed-secrets.yaml
```
The resulting SealedSecret looks like this (encrypted values are truncated):
```yaml
# overlays/staging/secrets/sealed-secrets.yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: kumari-backend-secrets
  namespace: kumari-staging
spec:
  encryptedData:
    DATABASE_URL: AgBx7c2K9f...    # ~200 chars of encrypted base64
    REDIS_URL: AgDp3m8R4a...
    OPENAI_API_KEY: AgFw5n1T7e...
    JWT_SECRET: AgHy2p4V0i...
  template:
    metadata:
      name: kumari-backend-secrets
      namespace: kumari-staging
    type: Opaque
```
This is safe to commit to Git. Only the Sealed Secrets controller in the cluster can decrypt it. If someone clones my repo, they see encrypted blobs.
The source of truth for the actual secret values is an Ansible Vault file on my workstation. When I need to rotate a secret, I update the Vault, regenerate the SealedSecret, commit, and push. ArgoCD does the rest.
```bash
# My secret rotation workflow
resham@devbox:~$ ansible-vault edit secrets/kumari-staging.vault.yml
# Edit the value

resham@devbox:~$ ./scripts/generate-sealed-secrets.sh staging
# Script reads from vault, creates sealed secret YAML

resham@devbox:~$ cd ~/kumari-k8s-manifests
resham@devbox:~/kumari-k8s-manifests$ git add overlays/staging/secrets/
resham@devbox:~/kumari-k8s-manifests$ git commit -m "chore: rotate staging DB credentials"
resham@devbox:~/kumari-k8s-manifests$ git push
```
## What Broke
Every homelab blog post that doesn't include a "what went wrong" section is lying. Here's what bit me.
### 1. ArgoCD Sync Stuck on Immutable Fields
The first time I tried to change the
storageClassNameCODE4 lines1ComparisonError: failed to sync: The StatefulSet "postgres" is invalid: 2spec: Forbidden: updates to statefulset spec for fields other than 3'replicas', 'ordinals', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' 4and 'minReadySeconds' are forbidden
Kubernetes StatefulSet `volumeClaimTemplates` are immutable after creation. The only way to change them is to delete and recreate the StatefulSet:

- Back up the data (I used `pg_dump` into a temporary pod)
- Delete the StatefulSet and its PVCs in Git (commit + push, let ArgoCD prune)
- Update the volumeClaimTemplate with the new storage class
- Commit and push (ArgoCD recreates everything fresh)
- Restore the data from backup
### 2. CrashLoopBackOff from a Bad Image Tag
I pushed a commit with a typo in the image tag:
`staging-b7e1d4` instead of `staging-b7e1d4f`. The new pods cycled through `ErrImagePull` and `ImagePullBackOff`, the rollout stalled, and the old ReplicaSet kept serving traffic.

The fix was simple (`git revert`), but I wanted this class of mistake caught before merge, so I added a GitHub Actions check that verifies each referenced image tag actually exists in the registry:

```yaml
# .github/workflows/validate-image-tags.yml
name: Validate Image Tags
on:
  pull_request:
    paths:
      - 'overlays/**/kustomization.yaml'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Extract image tags
        run: |
          # -h suppresses filename prefixes so the awk field positions hold
          grep -h "newTag:" overlays/*/kustomization.yaml | while read -r line; do
            TAG=$(echo "$line" | awk '{print $2}')
            # newName: sits on the line above newTag: in each kustomization
            IMAGE=$(grep -h -B1 "newTag: $TAG" overlays/*/kustomization.yaml | grep "newName:" | awk '{print $2}')
            echo "Checking $IMAGE:$TAG"
            docker manifest inspect "$IMAGE:$TAG" > /dev/null 2>&1 || \
              { echo "ERROR: $IMAGE:$TAG does not exist"; exit 1; }
          done
```
### 3. Longhorn PVC Binding Issues
I chose Longhorn as the storage backend because it's designed for Kubernetes on commodity hardware — exactly my situation with KVM VMs. But the initial setup had a frustrating issue: PVCs were stuck in
`Pending` indefinitely:

```bash
resham@devbox:~$ kubectl get pvc -n kumari-staging
NAME                       STATUS    VOLUME   CAPACITY   ACCESS MODES
postgres-data-postgres-0   Pending
```
The
kubectl describe pvcCODE4 lines1Events: 2 Type Reason Age From Message 3 ---- ------ ---- ---- ------- 4 Warning ProvisioningFailed 30s longhorn-provisioner waiting for a volume to be created
Digging into the Longhorn manager logs, the real issue was that my VM disks were using
the `virtio-scsi` bus, which Longhorn wasn't recognizing as usable block devices. Switching the disk bus to `virtio-blk` fixed it:

```bash
resham@devbox:~$ sudo virsh edit k3s-node-1
# Change: <target dev='sda' bus='scsi'/>
# To:     <target dev='vda' bus='virtio'/>

resham@devbox:~$ sudo virsh destroy k3s-node-1 && sudo virsh start k3s-node-1
```
After the nodes came back, Longhorn detected the disks and PVCs bound within seconds.
### 4. ArgoCD App-of-Apps Bootstrap Loop
I wanted ArgoCD to manage all Applications, including the Application CRs themselves. This is the "app of apps" pattern. I created a root Application that points to the
apps/YAML19 lines1# apps/root.yaml 2apiVersion: argoproj.io/v1alpha1 3kind: Application 4metadata: 5 name: root 6 namespace: argocd 7spec: 8 project: default 9 source: 10 repoURL: https://github.com/iamresham/kumari-k8s-manifests.git 11 targetRevision: main 12 path: apps 13 destination: 14 server: https://kubernetes.default.svc 15 namespace: argocd 16 syncPolicy: 17 automated: 18 prune: true 19 selfHeal: true
The problem: `root.yaml` lived inside `apps/`, so the `root` Application managed itself. With `prune: true`, one bad change to `apps/` could delete the root Application out from under its own sync, taking the whole app-of-apps tree with it.

The fix: move `root.yaml` out of `apps/` into a separate `bootstrap/` directory that no Application watches. I apply it exactly once by hand with `kubectl apply -f bootstrap/root.yaml`; from then on, ArgoCD manages everything under `apps/`, and `apps/` never tries to manage the root itself.

## Results
After three weeks of running this setup, here's the before/after:
| Metric | Before (kubectl) | After (ArgoCD + GitOps) |
|---|---|---|
| Time to deploy a new image | 2-5 min (find YAML, edit, apply, verify) | 30 sec (edit tag, commit, push) |
| Time to rollback | 5-15 min (find old YAML, hope it's current, apply) | 10 sec (`git revert HEAD && git push`) |
| Configuration drift | Constant, undetectable | Zero (self-heal reverts manual changes) |
| Disaster recovery | Hours (reconstruct from memory and backups) | 3 min (ArgoCD recreates from Git) |
| Audit trail | None | Full Git history with diffs, authors, timestamps |
| Sleep quality | Poor | Significantly improved |
That last row isn't a joke. Knowing that the cluster state is in Git and can be reconstructed from scratch changed my relationship with the homelab. I'm no longer afraid to experiment because I can always go back.
## What's Next
The setup is solid for staging, but there are things I want to improve:
- **GitHub webhook instead of polling** — 3-minute sync interval is fine, but I want instant deploys. ArgoCD supports webhooks; I just need to expose it through Traefik with proper auth.
- **ArgoCD Image Updater** — Instead of manually updating image tags in the Kustomization file, the Image Updater can watch GHCR for new tags matching a pattern and automatically commit the update. Full hands-off CI/CD.
- **Production environment** — Right now only staging runs on k3s. Production is still Docker Compose on the R720. Moving it to the cluster is the next project — but it means solving persistent storage for real, not just for staging data I can afford to lose.
- **ArgoCD notifications** — I want Slack (well, Discord) alerts when a sync fails or an app goes unhealthy. The ArgoCD Notifications controller supports this out of the box.
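For the notifications item, the configuration lives in the `argocd-notifications-cm` ConfigMap as triggers, templates, and services. Here's a rough, untested sketch using a generic webhook service pointed at a Discord webhook — the URL and message body are placeholders, not a working config:

```yaml
# argocd-notifications-cm sketch — webhook URL and template body are placeholders
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.webhook.discord: |
    url: https://discord.com/api/webhooks/<id>/<token>
    headers:
      - name: Content-Type
        value: application/json
  template.app-sync-failed: |
    webhook:
      discord:
        method: POST
        body: |
          {"content": "Sync failed for {{.app.metadata.name}}: {{.app.status.operationState.message}}"}
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
```

Individual Applications then opt in with an annotation along the lines of `notifications.argoproj.io/subscribe.on-sync-failed.discord: ""`.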
The broader point is this: GitOps isn't just for companies with SRE teams and production Kubernetes clusters. It's arguably even more valuable in a homelab, where you're the only operator, changes happen at odd hours, and there's no one to bail you out when you delete the wrong namespace at 2 AM.
When that next 2 AM mistake happens — and it will — the way out is a `git revert` away.