◈ Linux 100 · #01
My Linux Troubleshooting Playbook — 16 Steps from Any Production Incident
Production systems fail. The gap between a 5-minute fix and a 3-hour war room is a structured approach. Here's the exact 16-step workflow I've used across 7 years of cloud ops — from load averages to tracing deleted files holding disk space hostage.
eknatha@prod-node-01 ~
$ uptime
10:25:01 up 5 days, load average: 4.20, 3.80, 2.10 ← spike!
$ journalctl -u app -f --since "10 min ago"
OOMKilled: process app-server (pid 4821) killed
$ lsof | grep deleted
java 1847 /var/log/app.log (deleted) 27GB
✓ root cause found: step 12 of the playbook
#linux #troubleshooting #sre #incident-response
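
Whether a load average of 4.20 counts as a spike depends on how many cores the node has. The helper below is a minimal sketch of that comparison; the name `load_check` and the 1.0 threshold are illustrative assumptions, not part of the playbook itself.

```shell
#!/bin/sh
# Hypothetical helper: classify a load average against the core count.
# Usage: load_check LOAD CORES  (CORES must be greater than zero)
load_check() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        # Above 1.0 there are more runnable or I/O-blocked tasks than cores
        # (illustrative threshold, tune it for your workload).
        printf "%.2f %s\n", ratio, (ratio > 1.0 ? "overloaded" : "ok")
    }'
}

# On a live Linux host the inputs could come from /proc/loadavg and nproc:
#   load_check "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Passing the values in explicitly keeps the function usable on saved incident data as well as on a live host.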
Prerequisites

You should be comfortable running these before reading.

◈ Linux 100 · #02
Linux Log Analysis for DevOps: grep, awk, journalctl and What to Actually Look For
Logs are the single best source of truth in any incident — but only if you know where to look. Covers the real workflow: systemd journals, filtering failed auth attempts, extracting signal from noise with grep and awk, and tail patterns that actually tell you something.
eknatha@prod-node-01 ~
$ grep "Failed password" /var/log/auth.log | wc -l
4821 ← brute force in progress
$ awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -5
2043 200
831 404
12 500
$ journalctl -u nginx --since "1 hour ago" -p err
✓ filtered to errors only — 3 lines, root cause visible
#linux #logs #grep #awk #journalctl
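
A raw count of failed logins says an attack is happening, but not where it comes from. As a rough sketch, the function below groups failed SSH password attempts by source address. It assumes the usual OpenSSH line ending (`... from <ip> port <n> ssh2`); the function name `failed_ssh_by_ip` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count "Failed password" lines per source IP.
# Assumes sshd lines end with "from <ip> port <n> ssh2", so the IP is
# the fourth field from the end.
failed_ssh_by_ip() {
    awk '/Failed password/ { hits[$(NF - 3)]++ }
         END { for (ip in hits) print hits[ip], ip }' "$1" | sort -rn
}

# Usage on a Debian-style host (path is an assumption):
#   failed_ssh_by_ip /var/log/auth.log | head -5
```

The awk array holds one counter per distinct address, so memory grows with the number of attackers rather than the size of the log.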

◈ Linux 100 · #03
System Monitoring Deep Dive: What top Doesn't Tell You (and What Does)
Everyone knows top. But when a server crawls and top shows nothing obvious, where do you look? This covers the full stack: CPU steal time, memory pressure vs swap usage, I/O wait, and the vmstat + iostat combo that actually tells you what's blocking your system.
eknatha@prod-node-02 ~
$ vmstat 2 5
r b swpd free buff cache wa st
4 8 0 12000 2000 400000 68 0 ← wa=68% I/O wait!
$ iostat -x 2 | grep -v "^$"
sda r/s: 0.0 w/s: 842.0 await: 180ms ← saturated
✓ disk write saturation: not CPU, not memory
#linux #monitoring #performance #iostat #vmstat
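
Reading the `wa` column by eye works during an incident; the same check can be scripted. The sketch below reads `vmstat` output on stdin and reports samples whose I/O wait exceeds a limit. It finds the column by its header name because the column order differs between procps versions. The function name and the default limit of 20 are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: flag vmstat samples with high I/O wait.
# Usage: vmstat 2 5 | iowait_alerts [LIMIT_PERCENT]
iowait_alerts() {
    awk -v limit="${1:-20}" '
        # The header row names the columns; remember where "wa" sits.
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
        # Data rows start with a number; report the ones above the limit.
        col && $1 ~ /^[0-9]+$/ && $col + 0 > limit { print "high iowait: " $col "%" }
    '
}
```

Keep in mind that the first sample `vmstat` prints is an average since boot, so a sustained problem is better judged from the later rows.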

◈ Linux 100 · #04
Linux Networking Commands I Actually Use in Production (Not Just ping)
ping tells you a host is alive. But which process is holding port 8080? Why is DNS resolution slow only for this one service? Why is the connection established but no traffic flowing? Real production toolkit: ss, lsof -i, dig, tcpdump — with actual use cases for each.
eknatha@prod-node-01 ~
$ ss -tulnp | grep 8080
tcp LISTEN 0 128 *:8080 users:(("java",pid=3241,fd=42))
$ dig +short api.internal @10.0.0.2
;; connection timed out ← internal DNS down
$ tcpdump -i eth0 -n port 8080 -c 20
✓ SYN_SENT but no SYN-ACK — firewall rule missing
#linux #networking #ss #tcpdump #dns

◈ Linux 100 · #05
Linux Process Management: Signals, Zombies, and When kill -9 Is the Wrong Answer
kill -9 is the nuclear option — it works, but skips cleanup handlers and can corrupt state. How Linux signals actually work, how to identify zombie processes, how to use strace to see what a hung process is doing, and the safe termination sequence every SRE should know.
eknatha@prod-node-03 ~
$ ps aux | awk '$8 ~ /^Z/ {print $2, $11}'
8 zombie processes — all children of pid 1024
$ strace -p 1024 -e trace=wait4 2>&1 | head
strace: Process 1024 attached — blocked in wait()
$ kill -15 1024; sleep 3; kill -0 1024 2>/dev/null && kill -9 1024
✓ SIGTERM first, SIGKILL only if no response in 3s
#linux #processes #signals #strace #debugging
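
The safe termination sequence (SIGTERM, a grace period, then SIGKILL only if the process is still alive) can be wrapped in a small function. This is a sketch under assumptions: the name `graceful_kill` and the 3-second default are illustrative, and `kill -0` only checks that the PID exists, so a recycled PID would fool it.

```shell
#!/bin/sh
# Hypothetical helper: SIGTERM first, SIGKILL only after the grace period.
# Usage: graceful_kill PID [GRACE_SECONDS]
graceful_kill() {
    gk_pid="$1"
    gk_grace="${2:-3}"
    kill -TERM "$gk_pid" 2>/dev/null || return 0    # already gone
    gk_waited=0
    while [ "$gk_waited" -lt "$gk_grace" ]; do
        kill -0 "$gk_pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        gk_waited=$((gk_waited + 1))
    done
    # Last resort: the process survived the whole grace period.
    kill -0 "$gk_pid" 2>/dev/null && kill -KILL "$gk_pid" 2>/dev/null
    return 0
}
```

For example, `graceful_kill 1024 10` would give PID 1024 ten seconds to run its cleanup handlers before it is force-killed.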

⚡ Today I Fixed
TIF #01 — df Shows 100% But du Disagrees
Deleted log files held open by a running process. df counted the space as used, du didn't see it. lsof found the culprit in 30 seconds.
eknatha@prod-node-03 ~
$ df -h /
Filesystem Size Used Avail Use%
/dev/sda1 50G 50G 0G 100% ← alarm fires
$ du -sh /*
... total: 23G ← only 23G visible?
$ lsof +L1 | grep deleted
java 1847 /var/log/app.log (deleted) 27GB
✓ kill -HUP 1847 → 27GB freed instantly
#linux #storage #lsof
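
When several processes hold deleted files, it helps to total the hidden space per PID and compare it with the gap between `df` and `du`. The sketch below reads `lsof +L1` output on stdin; the function name is hypothetical, and it assumes no column before SIZE/OFF is blank, so field positions line up with the header row.

```shell
#!/bin/sh
# Hypothetical helper: total the bytes held by deleted-but-open files, per PID.
# Usage: lsof +L1 | deleted_bytes_by_pid
deleted_bytes_by_pid() {
    awk '
        # Locate the PID and SIZE/OFF columns from the header row.
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "PID") pid = i
                if ($i == "SIZE/OFF") size = i
            }
            next
        }
        # Only lines that lsof marks as deleted contribute to the total.
        pid && size && /\(deleted\)$/ { held[$pid] += $size }
        END {
            for (p in held)
                printf "%s %.1f GiB\n", p, held[p] / (1024 * 1024 * 1024)
        }
    '
}
```

Run it as root where possible, since an unprivileged `lsof` only sees the caller's own processes.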

⚙ My Dotfiles Explained
My .bashrc, .vimrc & Aliases — Every Line Explained
13+ years of ops muscle memory, written down. Prompt tuning, kubectl shortcuts, SSH multiplexing, and the aliases I reach for every single day.
~/.bashrc (excerpt)
# Smart prompt: user@host:dir (git branch)
PS1='\[\e[32m\]\u@\h\[\e[0m\]:\[\e[34m\]\w\[\e[33m\]$(__git_ps1)\[\e[0m\]\$ '
# kubectl shortcuts — used 50x a day
alias k='kubectl'
alias kgp='kubectl get pods -o wide'
alias kns='kubectl config set-context --current --namespace'
# SSH jump via bastion without typing -J each time
alias ssh-prod='ssh -J bastion.internal eknatha@prod'
#dotfiles #bash #productivity

◈ Linux 100 · #12
Find All Files Modified in Last 24 Hours Across a Fleet
Using find -mtime, -newer, and how to pipe results to a remote collector without spawning 40 subshells.
eknatha@prod-node-01 ~
# Find files changed in last 24h, exclude /proc /sys
$ find / -mtime -1 -type f \
-not \( -path "/proc/*" -o -path "/sys/*" \) \
2>/dev/null | wc -l
1,847 files
$ find /etc /var/log -newer /tmp/ref -ls 2>/dev/null
12345 4 -rw-r--r-- root /etc/nginx/nginx.conf Apr 15 03:22
12350 8 -rw-r----- root /var/log/auth.log Apr 15 03:19
#linux #find #fleet
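
Filtering with `-not -path` hides `/proc` and `/sys` from the results, but `find` still walks them. A variant using `-prune` skips those trees entirely. The following is a sketch with a hypothetical function name and two explicit skip paths; note that `-path` arguments are treated as glob patterns.

```shell
#!/bin/sh
# Hypothetical helper: regular files under ROOT modified in the last 24 hours,
# without descending into the two directories given as SKIP1 and SKIP2.
# Usage: recent_files ROOT SKIP1 SKIP2
recent_files() {
    find "$1" \( -path "$2" -o -path "$3" \) -prune \
        -o -type f -mtime -1 -print 2>/dev/null
}

# Example: recent_files / /proc /sys | wc -l
```

Pruning matters most on large virtual filesystems, where the walk itself is the slow part.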

⚙ My Dotfiles Explained
My tmux Config — Split Panes, Session Persistence & DevOps Shortcuts
Named sessions per cluster, pane layouts for logs + shell + metrics, and the plugin that saved me when SSH dropped mid-deploy.
~/.tmux.conf (excerpt)
# Prefix: Ctrl+a (like screen, more natural)
set -g prefix C-a
# Split: | for vertical, - for horizontal
bind | split-window -h -c "#{pane_current_path}"
bind - split-window -v -c "#{pane_current_path}"
# One-key incident layout: logs | shell | top
bind I source ~/.tmux/incident-layout.conf
# Session per cluster
$ tmux new -s prod-k8s
Session prod-k8s created ← persists if SSH drops
#tmux #dotfiles #terminal

☸ Kubernetes War Stories · #02
Pod Stuck in Terminating — 4 Root Causes and How to Force It
Finalizers, PVCs, network policies, and webhook timeouts. Four reasons a pod won't die — and the safe way to handle each one without corrupting state.
eknatha@prod-cluster ~
$ kubectl get pod nginx-77d9 -o json | jq '.metadata.finalizers'
["kubernetes.io/pvc-protection"] ← finalizer holding it
$ kubectl patch pod nginx-77d9 -p \
'{"metadata":{"finalizers":null}}'
pod/nginx-77d9 patched → Terminated immediately
# Root cause 2: webhook timeout (no --force needed)
$ kubectl describe pod nginx-77d9 | grep -A5 Events
Warning FailedKillPod hook "istio-proxy" timeout (30s)
#kubernetes #debugging #finalizers

◈ Linux 100 · #08
Parse a 100K-Line Log File Without Loading It Into Memory
awk, sed, and grep patterns that run in constant memory. When you can't afford to cat a 2GB log on a live prod node.
eknatha@prod-node-01 ~
# Count ERRORs without loading 2GB into RAM
$ awk '/ERROR/{count++} END{print count+0 " errors"}' app.log
4,821 errors
# Extract IPs of 5xx errors (grep and awk stream; sort buffers only the matching IPs)
$ grep " 5[0-9][0-9] " access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -5
892 10.0.4.23 ← bot hammering /api/v1
341 10.0.4.78
87 10.0.4.12
#linux #awk #logs
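
Matching ` 5xx ` anywhere on the line can also hit a response size such as `500`. A field-based variant avoids that and counts inside awk, so `sort` only receives one line per address. This is a sketch that assumes the combined log format, where the client IP is field 1 and the status code is field 9; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: top five client IPs by number of 5xx responses.
# Assumes combined log format: $1 = client IP, $9 = HTTP status.
top_5xx_ips() {
    awk '$9 ~ /^5[0-9][0-9]$/ { hits[$1]++ }
         END { for (ip in hits) print hits[ip], ip }' "$1" | sort -rn | head -5
}

# Usage: top_5xx_ips access.log
```

Memory use scales with the number of distinct client addresses, not with the size of the log file.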

◎ Quick Tips
rsync vs scp — Which One and When
rsync wins on resumable transfers and delta sync. scp wins on clean one-liners. Here are the exact scenarios where each belongs.
comparison
# scp: one file, trust the network, move on
$ scp dump.sql user@prod:/tmp/
dump.sql 100% 847MB 62.3MB/s 00:13
# rsync: large dir, resumable, delta only
$ rsync -avz --progress /data/ user@prod:/data/
sent 12.3GB received 2.1KB → only 340MB transferred
↑ delta sync: skipped 11.9GB of unchanged files
# rsync: safe deploy with --dry-run first
$ rsync -avzn /app/ user@prod:/app/ # -n = dry run
#rsync #scp #linux

⬡ Platform Eng in Progress · #02
CKA Prep After 7+ Years of Cloud Ops — What Surprised Me
The exam isn't hard if you know Kubernetes. Except I thought I knew Kubernetes. Week 1 notes from someone with real cluster experience who still got humbled by imperative kubectl.
CKA speed tricks — imperative over YAML
# Create pod fast — no YAML editing under time pressure
$ kubectl run nginx --image=nginx --port=80 \
--dry-run=client -o yaml > pod.yaml
# Create service immediately
$ kubectl expose pod nginx --port=80 --type=NodePort
service/nginx exposed
# What caught me: ETCD backup (memorise this path)
$ ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
--endpoints=https://127.0.0.1:2379 --cacert --cert --key
Snapshot saved at /tmp/etcd-backup.db
#cka #kubernetes #certification