Jul 13, 2025 · 25 min read

GitOps on My Homelab: How ArgoCD and k3s Replaced My kubectl Habits

Setting up a proper GitOps pipeline with ArgoCD on a k3s cluster running on KVM, deploying Kumari.ai's staging environment, and why I'll never kubectl apply -f by hand again.

Kubernetes · GitOps · ArgoCD · k3s · Homelab · DevOps · Helm

I deleted my staging environment. Not on purpose. Not gracefully. I ran `kubectl delete namespace kumari-staging` at 2 AM when I meant to target `kumari-staging-old`. There was no confirmation prompt. There was no undo button. One second the namespace existed with its Deployments, Services, ConfigMaps, Secrets, and PVCs — the next second, gone. Kubernetes doesn't have a recycle bin.

I sat there staring at my terminal, knowing I had no reliable way to reconstruct the exact state of that namespace. Sure, I had YAML files scattered across three directories on my workstation. Some were current. Some were from two months ago. Some had been hand-edited with `kubectl edit` and never saved back to a file. The manifests on disk didn't match what was running in the cluster, and I had no way to know what the delta was because the running state was now deleted.

That was a Tuesday in March. By Friday, I had ArgoCD running on a fresh k3s cluster, every manifest tracked in Git, and I was a single `git revert` away from restoring anything I might accidentally destroy.

This is how I set it up, what broke along the way, and why I'm never going back to imperative kubectl management.

[Figure: GitOps pipeline architecture]

Why GitOps

Let me be specific about what was wrong with my old workflow, because "GitOps is better" is the kind of vague statement that doesn't help anyone.

My old workflow looked like this:

1. Write a Kubernetes manifest on my workstation
2. `kubectl apply -f deployment.yaml`
3. Tweak something — `kubectl edit deployment kumari-backend`
4. Forget to save the tweak back to the YAML file
5. Three weeks later, wonder why the YAML on disk doesn't match what's running
6. Give up tracking state, just `kubectl get deployment -o yaml > backup.yaml` occasionally
7. Accumulate 14 backup files with names like `deployment-fixed-v3-FINAL-actually-final.yaml`

Sound familiar? The fundamental problem is configuration drift. The moment you allow changes to be made directly to the cluster without going through a tracked, versioned source of truth, the state of your cluster becomes unknowable. You might think you know what's deployed, but you don't. Not really. Not unless you've been inhumanly disciplined about exporting every single change back to version control.

GitOps solves this by inverting the flow. Instead of pushing changes to the cluster, you push changes to Git, and a controller running inside the cluster (ArgoCD, in my case) pulls those changes and applies them. The Git repo becomes the single source of truth. If the cluster state drifts from what's in Git, ArgoCD detects it and either alerts you or auto-corrects it.

The key principles:

  • Declarative: The entire system is described in Git
  • Versioned: Every change is a Git commit with a message, author, and timestamp
  • Automated: Changes are applied by a controller, not a human running kubectl
  • Self-healing: If someone manually changes something in the cluster, ArgoCD reverts it

That last point is what sold me. Even if 2 AM Resham runs `kubectl edit` on something, ArgoCD will notice the drift and sync it back to what Git says it should be. 2 AM Resham can't be trusted. Git can.

k3s Cluster Setup on KVM

I already had KVM/QEMU running on my bare metal Arch workstation (I wrote about it in my bare metal workstation post). So spinning up VMs for a Kubernetes cluster was straightforward — but there were decisions to make.

Why k3s and not full k8s? Resource constraints. My workstation has 64GB of RAM, but it's also running Docker Compose stacks, a development environment, and occasionally a browser with too many tabs. I can't dedicate 24GB+ to a proper kubeadm cluster. k3s runs the entire control plane in a single binary, uses SQLite (or etcd with `--cluster-init`), and a single node idles at ~500MB of RAM. Three nodes for an HA cluster cost me about 6GB total, including workloads. That's manageable.

Creating the VMs

I use `virt-install` for VM creation because I like having the exact command documented and reproducible:

```bash
# Create the base image first — Ubuntu 22.04 Server
# (--location rather than --cdrom, since --extra-args only works with --location)
resham@devbox:~$ virt-install \
    --name k3s-base \
    --ram 4096 \
    --vcpus 2 \
    --disk path=/var/lib/libvirt/images/k3s-base.qcow2,size=40 \
    --os-variant ubuntu22.04 \
    --network bridge=br0 \
    --graphics none \
    --console pty,target_type=serial \
    --location /home/resham/isos/ubuntu-22.04.4-live-server-amd64.iso \
    --extra-args 'console=ttyS0'
```

After installing Ubuntu and configuring SSH keys, I clone the base image for each node:

```bash
# Clone for 3 nodes
for i in 1 2 3; do
    sudo virt-clone \
        --original k3s-base \
        --name k3s-node-${i} \
        --file /var/lib/libvirt/images/k3s-node-${i}.qcow2

    # Start the clone, change hostname
    sudo virsh start k3s-node-${i}
done
```

After booting each clone, I SSH in to set the hostname and static IP:

```bash
# On each node
sudo hostnamectl set-hostname k3s-node-1   # 2, 3 respectively
```
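The static IPs come from netplan on each VM. A sketch of the config — the interface name and gateway here are illustrative, not copied from my setup; check `ip link` for the real NIC name:

```yaml
# /etc/netplan/00-installer-config.yaml — apply with: sudo netplan apply
network:
  version: 2
  ethernets:
    enp1s0:                          # illustrative interface name
      dhcp4: false
      addresses: [10.0.50.41/24]     # .42 / .43 on the other nodes
      routes:
        - to: default
          via: 10.0.50.1             # assumed gateway for this subnet
      nameservers:
        addresses: [10.0.50.1]
```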

My network layout:

| VM | Hostname | IP | Role |
|----|----------|----|------|
| k3s-node-1 | k3s-node-1 | 10.0.50.41 | Server (init) |
| k3s-node-2 | k3s-node-2 | 10.0.50.42 | Server |
| k3s-node-3 | k3s-node-3 | 10.0.50.43 | Server |

All three are servers (control plane + worker). In a homelab, there's no reason to run dedicated agent-only nodes unless you're simulating a production topology. I want HA etcd, so all three participate in the control plane.

Installing k3s

The first node initializes the cluster with embedded etcd:

```bash
# Node 1 — initialize the cluster
resham@k3s-node-1:~$ curl -sfL https://get.k3s.io | sh -s - server \
    --cluster-init \
    --disable traefik \
    --disable servicelb \
    --write-kubeconfig-mode 644 \
    --tls-san 10.0.50.41 \
    --tls-san k3s.homelab.local
```
> [!TIP]
> I disable the bundled Traefik and ServiceLB because I install my own Traefik instance via Helm later — managed by ArgoCD, naturally. Bundled components that you can't manage declaratively defeat the purpose of GitOps.

Wait for it to come up, then grab the token:

```bash
resham@k3s-node-1:~$ sudo cat /var/lib/rancher/k3s/server/node-token
K10c4b2f3a8e9d::server:abc123xyz789...
```

Join the other two nodes:

```bash
# Node 2
resham@k3s-node-2:~$ curl -sfL https://get.k3s.io | sh -s - server \
    --server https://10.0.50.41:6443 \
    --token K10c4b2f3a8e9d::server:abc123xyz789... \
    --disable traefik \
    --disable servicelb \
    --write-kubeconfig-mode 644 \
    --tls-san 10.0.50.42 \
    --tls-san k3s.homelab.local

# Node 3 — same but with 10.0.50.43
resham@k3s-node-3:~$ curl -sfL https://get.k3s.io | sh -s - server \
    --server https://10.0.50.41:6443 \
    --token K10c4b2f3a8e9d::server:abc123xyz789... \
    --disable traefik \
    --disable servicelb \
    --write-kubeconfig-mode 644 \
    --tls-san 10.0.50.43 \
    --tls-san k3s.homelab.local
```

After a minute, all three should show as Ready:

```bash
resham@k3s-node-1:~$ kubectl get nodes
NAME         STATUS   ROLES                       AGE   VERSION
k3s-node-1   Ready    control-plane,etcd,master   4m    v1.29.6+k3s1
k3s-node-2   Ready    control-plane,etcd,master   2m    v1.29.6+k3s1
k3s-node-3   Ready    control-plane,etcd,master   90s   v1.29.6+k3s1
```

Copy the kubeconfig to my workstation:

```bash
resham@devbox:~$ scp resham@10.0.50.41:/etc/rancher/k3s/k3s.yaml ~/.kube/config-k3s
resham@devbox:~$ sed -i 's/127.0.0.1/10.0.50.41/' ~/.kube/config-k3s
resham@devbox:~$ export KUBECONFIG=~/.kube/config-k3s
resham@devbox:~$ kubectl get nodes
NAME         STATUS   ROLES                       AGE   VERSION
k3s-node-1   Ready    control-plane,etcd,master   6m    v1.29.6+k3s1
k3s-node-2   Ready    control-plane,etcd,master   4m    v1.29.6+k3s1
k3s-node-3   Ready    control-plane,etcd,master   3m    v1.29.6+k3s1
```

Cluster is up. Now let's make it impossible to accidentally destroy.

ArgoCD Installation

I install ArgoCD via Helm because I want to manage its own configuration declaratively later (ArgoCD managing itself is a beautiful thing — more on that).

```bash
resham@devbox:~$ helm repo add argo https://argoproj.github.io/argo-helm
resham@devbox:~$ helm repo update

resham@devbox:~$ kubectl create namespace argocd

resham@devbox:~$ helm install argocd argo/argo-cd \
    --namespace argocd \
    --version 6.9.3 \
    --set server.service.type=NodePort \
    --set server.service.nodePortHttps=30443 \
    --set configs.params."server\.insecure"=true \
    --set server.extraArgs[0]="--insecure"
```
> [!WARNING]
> The `--insecure` flag disables TLS on the ArgoCD server itself. This is fine in a homelab where Traefik handles TLS termination in front of it. Do NOT do this in production without a reverse proxy handling TLS.

Wait for pods to come up:

```bash
resham@devbox:~$ kubectl -n argocd get pods
NAME                                               READY   STATUS    RESTARTS   AGE
argocd-application-controller-0                    1/1     Running   0          2m
argocd-applicationset-controller-6b7b8d5d4-x7q2n   1/1     Running   0          2m
argocd-dex-server-7c94bc5f8d-m4hzp                 1/1     Running   0          2m
argocd-notifications-controller-5b8dbb7c9f-l9r4v   1/1     Running   0          2m
argocd-redis-6f8d4bdff5-2kn7x                      1/1     Running   0          2m
argocd-repo-server-7d9b6c8f7-p5t8n                 1/1     Running   0          2m
argocd-server-5f8b7c4d6-k3m9x                      1/1     Running   0          2m
```

Grab the initial admin password:

```bash
resham@devbox:~$ kubectl -n argocd get secret argocd-initial-admin-secret \
    -o jsonpath="{.data.password}" | base64 -d
# outputs something like: aB3cD4eF5gH6
```

I can now hit `https://10.0.50.41:30443` in my browser and log in with `admin` / that password. First thing I do is change the password and delete the initial secret:

```bash
resham@devbox:~$ argocd login 10.0.50.41:30443 --insecure
resham@devbox:~$ argocd account update-password
resham@devbox:~$ kubectl -n argocd delete secret argocd-initial-admin-secret
```

Connecting to GitHub

ArgoCD needs read access to my Git repository. I create a GitHub fine-grained personal access token with read-only access to the repo, then register it:

```bash
resham@devbox:~$ argocd repo add https://github.com/iamresham/kumari-k8s-manifests.git \
    --username resham \
    --password ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

Or declaratively (which is what I actually committed):

```yaml
# argocd/repo-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: kumari-k8s-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: https://github.com/iamresham/kumari-k8s-manifests.git
  username: resham
  password: <sealed-secret-ref>  # More on this later
```

Repository Structure

This is the part I spent the most time designing. A bad repo layout makes GitOps miserable. Here's what I landed on after two false starts:

```bash
resham@devbox:~/kumari-k8s-manifests$ tree -L 3
.
├── README.md
├── apps/                      # ArgoCD Application CRs
│   ├── kumari-staging.yaml
│   ├── kumari-prod.yaml
│   ├── monitoring.yaml
│   ├── traefik.yaml
│   ├── sealed-secrets.yaml
│   └── longhorn.yaml
├── base/                      # Base manifests (shared)
│   ├── kumari-backend/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── hpa.yaml
│   │   └── kustomization.yaml
│   ├── kumari-frontend/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── kustomization.yaml
│   ├── postgres/
│   │   ├── statefulset.yaml
│   │   ├── service.yaml
│   │   ├── pvc.yaml
│   │   └── kustomization.yaml
│   └── redis/
│       ├── deployment.yaml
│       ├── service.yaml
│       └── kustomization.yaml
├── overlays/                  # Environment-specific patches
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── ingress.yaml
│   │   ├── patches/
│   │   │   ├── backend-resources.yaml
│   │   │   ├── backend-env.yaml
│   │   │   ├── frontend-env.yaml
│   │   │   ├── postgres-storage.yaml
│   │   │   └── replica-count.yaml
│   │   └── secrets/
│   │       └── sealed-secrets.yaml
│   └── production/
│       ├── kustomization.yaml
│       ├── namespace.yaml
│       ├── ingress.yaml
│       ├── patches/
│       │   ├── backend-resources.yaml
│       │   ├── backend-env.yaml
│       │   ├── frontend-env.yaml
│       │   ├── postgres-storage.yaml
│       │   └── replica-count.yaml
│       └── secrets/
│           └── sealed-secrets.yaml
└── helm-values/               # Values files for Helm charts
    ├── traefik-values.yaml
    ├── monitoring-values.yaml
    ├── longhorn-values.yaml
    └── sealed-secrets-values.yaml
```

The key insight is the base + overlays pattern from Kustomize. The `base/` directory contains the canonical manifests — the Deployment, Service, etc. — without any environment-specific details. The `overlays/` directory contains per-environment patches that modify resource limits, replica counts, image tags, environment variables, and ingress hostnames.
Here's the staging Kustomization:

```yaml
# overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: kumari-staging

resources:
  - namespace.yaml
  - ingress.yaml
  - secrets/sealed-secrets.yaml
  - ../../base/kumari-backend
  - ../../base/kumari-frontend
  - ../../base/postgres
  - ../../base/redis

patches:
  - path: patches/backend-resources.yaml
  - path: patches/backend-env.yaml
  - path: patches/frontend-env.yaml
  - path: patches/postgres-storage.yaml
  - path: patches/replica-count.yaml

images:
  - name: kumari-backend
    newName: ghcr.io/iamresham/kumari-backend
    newTag: staging-a3f8c2d
  - name: kumari-frontend
    newName: ghcr.io/iamresham/kumari-frontend
    newTag: staging-a3f8c2d
```

And a resource patch that scales things down for staging:

```yaml
# overlays/staging/patches/replica-count.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kumari-backend
spec:
  replicas: 1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kumari-frontend
spec:
  replicas: 1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
```

Production uses the same base but with higher replica counts, more resources, and different image tags. One base, many overlays. Change the base, and all environments pick it up (unless an overlay patches it away).
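One habit that pairs well with this layout: render an overlay locally before pushing, so you see exactly what ArgoCD is going to apply. `kubectl kustomize` and `kubectl diff -k` ship with kubectl, so there's nothing extra to install:

```bash
# Render the staging overlay exactly as ArgoCD will see it
kubectl kustomize overlays/staging | less

# Or diff the rendered manifests against what's live in the cluster
kubectl diff -k overlays/staging
```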

Deploying Kumari.ai Staging

Now the fun part. Let me walk through the actual manifests for Kumari.ai's staging deployment.

FastAPI Backend

```yaml
# base/kumari-backend/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kumari-backend
  labels:
    app: kumari-backend
    part-of: kumari-ai
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kumari-backend
  template:
    metadata:
      labels:
        app: kumari-backend
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: backend
          image: kumari-backend  # Overridden by Kustomize images
          ports:
            - containerPort: 8000
              name: http
          envFrom:
            - configMapRef:
                name: kumari-backend-config
            - secretRef:
                name: kumari-backend-secrets
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            failureThreshold: 30
            periodSeconds: 2
      imagePullSecrets:
        - name: ghcr-pull-secret
```
> [!TIP]
> The startupProbe is critical for FastAPI apps that run Alembic migrations on startup. Without it, the liveness probe can kill the pod before migrations finish, producing a CrashLoopBackOff. I set `failureThreshold: 30` with `periodSeconds: 2`, giving the app 60 seconds to start before Kubernetes gives up.

The staging resource patch dials things down:

```yaml
# overlays/staging/patches/backend-resources.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kumari-backend
spec:
  template:
    spec:
      containers:
        - name: backend
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
```

PostgreSQL StatefulSet

This one matters because data persistence on k3s can be tricky:

```yaml
# base/postgres/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  labels:
    app: postgres
    part-of: kumari-ai
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
              name: postgres
          env:
            - name: POSTGRES_DB
              value: kumari
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - kumari
            initialDelaySeconds: 5
            periodSeconds: 5
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn
        resources:
          requests:
            storage: 10Gi
```
> [!WARNING]
> Notice PGDATA is set to a subdirectory (`/pgdata`) inside the mount. The PostgreSQL Docker image requires PGDATA to be a subdirectory of the volume mount, not the mount root itself. If you set PGDATA to `/var/lib/postgresql/data` directly, the container will fail because the mount's `lost+found` directory confuses PostgreSQL's init process. This cost me two hours.

Redis

Redis is simpler — no persistent storage for the staging cache:

```yaml
# base/redis/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
    part-of: kumari-ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          command: ["redis-server", "--maxmemory", "128mb", "--maxmemory-policy", "allkeys-lru"]
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 192Mi
```

Ingress with Traefik

Traefik is deployed as a Helm chart managed by ArgoCD. The actual Ingress for Kumari's staging:

```yaml
# overlays/staging/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kumari-staging-ingress
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    cert-manager.io/cluster-issuer: letsencrypt-staging
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - staging.kumari.homelab.local
      secretName: kumari-staging-tls
  rules:
    - host: staging.kumari.homelab.local
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: kumari-backend
                port:
                  number: 8000
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kumari-frontend
                port:
                  number: 3000
```

The ArgoCD Application CR

This is where everything comes together. This single YAML file tells ArgoCD to watch the Git repo and deploy the staging overlay:

```yaml
# apps/kumari-staging.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kumari-staging
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/iamresham/kumari-k8s-manifests.git
    targetRevision: main
    path: overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: kumari-staging
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
```

Key decisions here:

- `automated.prune: true` — If I remove a manifest from Git, ArgoCD deletes it from the cluster. No orphaned resources.
- `automated.selfHeal: true` — If someone runs `kubectl edit` to change something, ArgoCD reverts it within 3 minutes. This is the anti-2-AM-Resham feature.
- `PruneLast=true` — When syncing, ArgoCD creates/updates resources before deleting old ones. This prevents downtime during transitions.
- `retry` — Transient failures (API server overloaded, etcd leader election) get automatic retries with exponential backoff.

Apply it:

```bash
resham@devbox:~$ kubectl apply -f apps/kumari-staging.yaml
application.argoproj.io/kumari-staging created
```

Within seconds, ArgoCD clones the repo, runs Kustomize, and starts creating resources. In the ArgoCD UI, I watch the tree of resources turn from yellow (progressing) to green (healthy) one by one. It takes about 90 seconds for everything to come up, mostly waiting for PostgreSQL's readiness probe.

```bash
resham@devbox:~$ kubectl -n kumari-staging get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/kumari-backend-7d8f9c6b5-x4k2n    1/1     Running   0          2m
pod/kumari-frontend-5c4d8f7a9-m3p7q   1/1     Running   0          2m
pod/postgres-0                        1/1     Running   0          2m
pod/redis-6f8d4bdff5-r2t9x            1/1     Running   0          2m

NAME                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)
service/kumari-backend    ClusterIP   10.43.128.45   <none>        8000/TCP
service/kumari-frontend   ClusterIP   10.43.201.12   <none>        3000/TCP
service/postgres          ClusterIP   10.43.89.201   <none>        5432/TCP
service/redis             ClusterIP   10.43.156.78   <none>        6379/TCP
```
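The same view is available from the `argocd` CLI when I don't feel like opening the UI:

```bash
# Sync status, health, and the resource tree for the app
argocd app get kumari-staging

# Kick off a sync immediately instead of waiting for the next poll
argocd app sync kumari-staging
```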

Every single one of these resources is tracked in Git. If I delete the namespace now, I just wait 3 minutes and ArgoCD recreates everything. No panic. No scrambling through backup files.
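Self-heal is easy to demo, too. Introduce some drift by hand and watch ArgoCD put it back (the replica count here is what the staging overlay declares):

```bash
# Manually scale the backend — deliberate drift
kubectl -n kumari-staging scale deployment kumari-backend --replicas=5

# Watch ArgoCD notice the drift and scale it back to 1 replica
kubectl -n kumari-staging get deployment kumari-backend -w
```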

Monitoring with Prometheus + Grafana

The monitoring stack is deployed the same way — an ArgoCD Application pointing at a Helm chart with my custom values:

```yaml
# apps/monitoring.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 58.2.2
    helm:
      valuesObject:
        prometheus:
          prometheusSpec:
            retention: 15d
            storageSpec:
              volumeClaimTemplate:
                spec:
                  storageClassName: longhorn
                  accessModes: ["ReadWriteOnce"]
                  resources:
                    requests:
                      storage: 20Gi
            resources:
              requests:
                cpu: 200m
                memory: 512Mi
              limits:
                cpu: "1"
                memory: 2Gi
            serviceMonitorSelector: {}
            serviceMonitorNamespaceSelector: {}
        grafana:
          adminPassword: <from-sealed-secret>
          persistence:
            enabled: true
            storageClassName: longhorn
            size: 5Gi
          dashboardProviders:
            dashboardproviders.yaml:
              apiVersion: 1
              providers:
                - name: 'default'
                  folder: ''
                  type: file
                  options:
                    path: /var/lib/grafana/dashboards/default
          dashboards:
            default:
              kubernetes-cluster:
                gnetId: 7249
                revision: 1
                datasource: Prometheus
              argocd:
                gnetId: 14584
                revision: 1
                datasource: Prometheus
        alertmanager:
          alertmanagerSpec:
            resources:
              requests:
                cpu: 50m
                memory: 64Mi
              limits:
                cpu: 200m
                memory: 256Mi
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
```
> [!TIP]
> `ServerSideApply=true` is essential for the kube-prometheus-stack chart. Without it, you'll hit annotation size limits because the CRDs (PrometheusRule, ServiceMonitor) generate massive `kubectl.kubernetes.io/last-applied-configuration` annotations that exceed the 262144-byte limit. Server-side apply doesn't have this problem.

The `serviceMonitorSelector: {}` and `serviceMonitorNamespaceSelector: {}` settings tell Prometheus to pick up ServiceMonitors from every namespace, not just the chart's own. Worth knowing: the Prometheus Operator discovers scrape targets through ServiceMonitor resources, not through the `prometheus.io/*` annotations on my Deployment — so the FastAPI backend gets a small ServiceMonitor of its own.
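Here's a minimal sketch of that ServiceMonitor — it assumes the backend Service carries the `app: kumari-backend` label and names its port `http`, matching the Deployment:

```yaml
# A minimal ServiceMonitor for the backend (sketch)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kumari-backend
  namespace: kumari-staging
spec:
  selector:
    matchLabels:
      app: kumari-backend   # assumed Service label
  endpoints:
    - port: http            # assumed Service port name
      path: /metrics
      interval: 30s
```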

I added a Grafana dashboard (ID 14584) specifically for ArgoCD metrics — it shows sync duration, sync failures, app health status, and repo server latency. Being able to see when ArgoCD is struggling to sync is critical for debugging.

The Sync Loop

Here's what the actual workflow looks like now:

```
Developer pushes commit
          ↓
GitHub repository updated
          ↓
ArgoCD polls repo (every 3 minutes by default)
          ↓
ArgoCD detects diff between desired state (Git) and live state (cluster)
          ↓
ArgoCD runs Kustomize build / Helm template
          ↓
ArgoCD applies manifests to cluster
          ↓
Kubernetes reconciles (creates/updates/deletes resources)
          ↓
ArgoCD verifies health checks pass
          ↓
Application status: Synced + Healthy ✓
```

In practice, a typical change — say, updating the backend image tag — looks like this:

```bash
# On my workstation, update the image tag in the staging overlay
resham@devbox:~/kumari-k8s-manifests$ vim overlays/staging/kustomization.yaml
# Change: newTag: staging-a3f8c2d → newTag: staging-b7e1d4f

resham@devbox:~/kumari-k8s-manifests$ git add overlays/staging/kustomization.yaml
resham@devbox:~/kumari-k8s-manifests$ git commit -m "feat: bump backend to staging-b7e1d4f (add websocket auth)"
resham@devbox:~/kumari-k8s-manifests$ git push
```

Within 3 minutes (or immediately if I click "Refresh" in the ArgoCD UI), the new image rolls out. ArgoCD applies the updated Deployment; Kubernetes creates a new ReplicaSet, waits for the new pods to pass readiness probes, then scales down the old one. A standard rolling update, but triggered by a Git push instead of a manual `kubectl set image`.

If the new image is broken — say the readiness probe fails — the rollout stalls. The old pods keep serving traffic. I see the failure in ArgoCD's UI, revert the Git commit, push, and ArgoCD rolls back. The entire incident response is:

```bash
resham@devbox:~/kumari-k8s-manifests$ git revert HEAD
resham@devbox:~/kumari-k8s-manifests$ git push
```

Two commands. No `kubectl rollout undo` (which only works if you haven't made other changes since). No "what was the previous image tag again?" Just revert the commit.

Secrets Management

This is the part of GitOps that everyone warns you about, and they're right to. You can't just commit Kubernetes Secrets to Git in plaintext. But you also can't have ArgoCD manage your deployments if the Secrets live outside of Git. The whole point is Git as the single source of truth.

I use Sealed Secrets from Bitnami. The architecture is straightforward:

  1. A controller runs in the cluster and holds a private key
  2. I encrypt Secrets on my workstation using the controller's public key
  3. The encrypted SealedSecret resource is committed to Git
  4. ArgoCD applies the SealedSecret to the cluster
  5. The controller decrypts it into a regular Kubernetes Secret
```bash
# Install kubeseal CLI
resham@devbox:~$ wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.26.2/kubeseal-0.26.2-linux-amd64.tar.gz
resham@devbox:~$ tar xzf kubeseal-0.26.2-linux-amd64.tar.gz
resham@devbox:~$ sudo mv kubeseal /usr/local/bin/

# Fetch the public key from the cluster
resham@devbox:~$ kubeseal --fetch-cert > sealed-secrets-pub.pem
```

Creating a sealed secret:

```bash
# Start with a regular Secret (never commit this)
resham@devbox:~$ kubectl create secret generic kumari-backend-secrets \
    --namespace kumari-staging \
    --from-literal=DATABASE_URL='postgresql://kumari:s3cretP4ss@postgres:5432/kumari' \
    --from-literal=REDIS_URL='redis://redis:6379/0' \
    --from-literal=OPENAI_API_KEY='sk-proj-...' \
    --from-literal=JWT_SECRET='ultra-secret-jwt-key-here' \
    --dry-run=client -o yaml | \
    kubeseal --cert sealed-secrets-pub.pem -o yaml \
    > overlays/staging/secrets/sealed-secrets.yaml
```

The resulting SealedSecret looks like this (encrypted values are truncated):

```yaml
# overlays/staging/secrets/sealed-secrets.yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: kumari-backend-secrets
  namespace: kumari-staging
spec:
  encryptedData:
    DATABASE_URL: AgBx7c2K9f...    # ~200 chars of encrypted base64
    REDIS_URL: AgDp3m8R4a...
    OPENAI_API_KEY: AgFw5n1T7e...
    JWT_SECRET: AgHy2p4V0i...
  template:
    metadata:
      name: kumari-backend-secrets
      namespace: kumari-staging
    type: Opaque
```

This is safe to commit to Git. Only the Sealed Secrets controller in the cluster can decrypt it. If someone clones my repo, they see encrypted blobs.

> [!WARNING]
> Back up the Sealed Secrets controller's private key. If you lose it (node failure, cluster rebuild), you can't decrypt any existing SealedSecrets. I back up the key to my ZFS NAS with `kubectl -n kube-system get secret -l sealedsecrets.bitnami.com/sealed-secrets-key -o yaml > sealed-secrets-master.key`. This file is stored encrypted with age on the NAS, not in Git.
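Restoring the key into a rebuilt cluster is the reverse operation — a sketch, assuming a standard Helm install of the controller (the label selector may differ depending on how you deployed it):

```bash
# Re-apply the backed-up key secret into the new cluster
kubectl apply -f sealed-secrets-master.key

# Restart the controller so it loads the restored key
kubectl -n kube-system delete pod -l app.kubernetes.io/name=sealed-secrets
```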

The source of truth for the actual secret values is an Ansible Vault file on my workstation. When I need to rotate a secret, I update the Vault, regenerate the SealedSecret, commit, and push. ArgoCD does the rest.

```bash
# My secret rotation workflow
resham@devbox:~$ ansible-vault edit secrets/kumari-staging.vault.yml
# Edit the value

resham@devbox:~$ ./scripts/generate-sealed-secrets.sh staging
# Script reads from vault, creates sealed secret YAML

resham@devbox:~$ cd ~/kumari-k8s-manifests
resham@devbox:~/kumari-k8s-manifests$ git add overlays/staging/secrets/
resham@devbox:~/kumari-k8s-manifests$ git commit -m "chore: rotate staging DB credentials"
resham@devbox:~/kumari-k8s-manifests$ git push
```
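The script itself is nothing exotic. Conceptually it's one pipeline — this sketch assumes a flat `KEY: value` vault layout with no whitespace in the values (the real script has more guard rails):

```bash
#!/usr/bin/env bash
# scripts/generate-sealed-secrets.sh — conceptual sketch
set -euo pipefail
ENV="$1"  # staging or production

# Decrypt the vault to stdout, turn each KEY: value pair into a
# --from-literal flag, build a Secret, and seal it
ansible-vault view "secrets/kumari-${ENV}.vault.yml" \
  | awk -F': ' '{printf "--from-literal=%s=%s\n", $1, $2}' \
  | xargs kubectl create secret generic kumari-backend-secrets \
      --namespace "kumari-${ENV}" --dry-run=client -o yaml \
  | kubeseal --cert sealed-secrets-pub.pem -o yaml \
  > "overlays/${ENV}/secrets/sealed-secrets.yaml"
```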

What Broke

Every homelab blog post that doesn't include a "what went wrong" section is lying. Here's what bit me.

1. ArgoCD Sync Stuck on Immutable Fields

The first time I tried to change the `storageClassName` on my PostgreSQL PVC, ArgoCD showed the app as "OutOfSync" but refused to sync. The error:

```
ComparisonError: failed to sync: The StatefulSet "postgres" is invalid:
spec: Forbidden: updates to statefulset spec for fields other than
'replicas', 'ordinals', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy'
and 'minReadySeconds' are forbidden
```

Kubernetes StatefulSet `volumeClaimTemplates` are immutable after creation. You can't change the storage class, size, or access modes on an existing StatefulSet's PVC template. The fix is ugly but necessary:

1. Back up the data (I used `pg_dump` into a temporary pod — sketched below)
2. Delete the StatefulSet and its PVCs in Git (commit + push, let ArgoCD prune)
3. Update the volumeClaimTemplate with the new storage class
4. Commit and push (ArgoCD recreates everything fresh)
5. Restore the data from backup
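The backup step is worth spelling out. I ran it from a throwaway pod during the actual migration, but the simpler variant is to exec straight into the Postgres pod — a sketch, with the dump file path being whatever you like:

```bash
# Dump before deleting the StatefulSet
kubectl -n kumari-staging exec postgres-0 -- \
    pg_dump -U kumari -d kumari --format=custom > kumari-staging.dump

# After ArgoCD recreates everything on the new storage class, restore
kubectl -n kumari-staging exec -i postgres-0 -- \
    pg_restore -U kumari -d kumari --clean --if-exists < kumari-staging.dump
```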
> [!TIP]
> When you know you'll need to change storage parameters later, use a separate PVC resource instead of volumeClaimTemplates. Standalone PVCs can be resized (if the storage class supports it) without deleting the StatefulSet. You lose automatic PVC-per-replica naming, but in a homelab with a single Postgres replica, that doesn't matter.

2. ImagePullBackOff from a Bad Image Tag

I pushed a commit with a typo in the image tag: `staging-b7e1d4` instead of `staging-b7e1d4f` (missing the last character). ArgoCD synced happily — it doesn't validate that an image exists before applying the manifest. Kubernetes tried to pull the image, got `ErrImagePull`, then `ImagePullBackOff`, and the new pods never started.

The fix was simple (`git revert`), but the lesson was clear: ArgoCD validates YAML syntax, not image availability. I now have a CI step in GitHub Actions that checks whether the image tag exists in GHCR before allowing a merge to main:

```yaml
# .github/workflows/validate-image-tags.yml
name: Validate Image Tags
on:
  pull_request:
    paths:
      - 'overlays/**/kustomization.yaml'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Needed for private images; GITHUB_TOKEN works for packages linked to this repo
      - name: Log in to GHCR
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
      - name: Validate image tags
        run: |
          # grep -h drops filename prefixes so awk's $2 is the actual value
          grep -h "newTag:" overlays/*/kustomization.yaml | while read -r line; do
            TAG=$(echo "$line" | awk '{print $2}')
            IMAGE=$(grep -h -B1 "newTag: $TAG" overlays/*/kustomization.yaml | grep "newName:" | awk '{print $2}')
            echo "Checking $IMAGE:$TAG"
            docker manifest inspect "$IMAGE:$TAG" > /dev/null 2>&1 || \
              { echo "ERROR: $IMAGE:$TAG does not exist"; exit 1; }
          done
```

3. Longhorn PVC Binding Issues

I chose Longhorn as the storage backend because it's designed for Kubernetes on commodity hardware — exactly my situation with KVM VMs. But the initial setup had a frustrating issue: PVCs were stuck in

CODE
Pending
state.

```bash
resham@devbox:~$ kubectl get pvc -n kumari-staging
NAME                       STATUS    VOLUME   CAPACITY   ACCESS MODES
postgres-data-postgres-0   Pending
```

The output of `kubectl describe pvc` showed:

```
Events:
  Type     Reason              Age   From                  Message
  ----     ------              ----  ----                  -------
  Warning  ProvisioningFailed  30s   longhorn-provisioner  waiting for a volume to be created
```

Digging into the Longhorn manager logs, the real issue was that my VM disks were using `virtio-scsi` and Longhorn couldn't detect the available disk space. Switching the VM storage to `virtio-blk` (editing the libvirt XML) and restarting the nodes fixed it:

```bash
resham@devbox:~$ sudo virsh edit k3s-node-1
# Change: <target dev='sda' bus='scsi'/>
# To:     <target dev='vda' bus='virtio'/>

resham@devbox:~$ sudo virsh destroy k3s-node-1 && sudo virsh start k3s-node-1
```

After the nodes came back, Longhorn detected the disks and PVCs bound within seconds.

4. ArgoCD App-of-Apps Bootstrap Loop

I wanted ArgoCD to manage all Applications, including the Application CRs themselves. This is the "app of apps" pattern. I created a root Application that points at the `apps/` directory:

```yaml
# apps/root.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/iamresham/kumari-k8s-manifests.git
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The problem: the `root` Application is in the `apps/` directory, so it tries to manage itself. If it detects drift on its own definition, it syncs, which triggers a new sync detection, which triggers another sync... It never loops infinitely (ArgoCD is smart enough to detect no-op syncs), but it did generate a lot of noise in the logs and the UI.

The fix: move `root.yaml` to a separate `bootstrap/` directory outside of `apps/`, and apply it manually once with `kubectl apply -f bootstrap/root.yaml`. The root app points at `apps/` but isn't inside it
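With the bootstrap pattern settled, this is also where "ArgoCD managing itself" lands: an Application in `apps/` that points at the same Helm chart and values I installed by hand earlier. A sketch — adopting an existing Helm release like this works as long as the release name and namespace match the manual install:

```yaml
# apps/argocd.yaml — sketch of ArgoCD adopting its own Helm release
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://argoproj.github.io/argo-helm
    chart: argo-cd
    targetRevision: 6.9.3
    helm:
      valuesObject:
        server:
          service:
            type: NodePort
            nodePortHttps: 30443
        configs:
          params:
            server.insecure: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true   # no prune here — I don't want ArgoCD deleting itself
```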

Results

After three weeks of running this setup, here's the before/after:

| Metric | Before (kubectl) | After (ArgoCD + GitOps) |
|--------|------------------|-------------------------|
| Time to deploy a new image | 2-5 min (find YAML, edit, apply, verify) | 30 sec (edit tag, commit, push) |
| Time to rollback | 5-15 min (find old YAML, hope it's current, apply) | 10 sec (`git revert HEAD && git push`) |
| Configuration drift | Constant, undetectable | Zero (self-heal reverts manual changes) |
| Disaster recovery | Hours (reconstruct from memory and backups) | 3 min (ArgoCD recreates from Git) |
| Audit trail | None | Full Git history with diffs, authors, timestamps |
| Sleep quality | Poor | Significantly improved |

That last row isn't a joke. Knowing that the cluster state is in Git and can be reconstructed from scratch changed my relationship with the homelab. I'm no longer afraid to experiment because I can always go back.

What's Next

The setup is solid for staging, but there are things I want to improve:

  1. GitHub webhook instead of polling — 3-minute sync interval is fine, but I want instant deploys. ArgoCD supports webhooks; I just need to expose it through Traefik with proper auth.

  2. ArgoCD Image Updater — Instead of manually updating image tags in the Kustomization file, the Image Updater can watch GHCR for new tags matching a pattern and automatically commit the update. Full hands-off CI/CD.

  3. Production environment — Right now only staging runs on k3s. Production is still Docker Compose on the R720. Moving it to the cluster is the next project — but it means solving persistent storage for real, not just for staging data I can afford to lose.

  4. ArgoCD notifications — I want Slack (well, Discord) alerts when a sync fails or an app goes unhealthy. The ArgoCD Notifications controller supports this out of the box.

The broader point is this: GitOps isn't just for companies with SRE teams and production Kubernetes clusters. It's arguably even more valuable in a homelab, where you're the only operator, changes happen at odd hours, and there's no one to bail you out when you delete the wrong namespace at 2 AM.

`git revert` is the best undo button I've ever had.