sysctl Tuning

Persistent kernel tuning with guardrails: where sysctls live, how networking and VM knobs behave, and how to measure before changing production defaults.

Tuning heuristics
  • Change one bottleneck at a time and measure it. If you cannot name the symptom, you are not tuning, you are guessing.
  • Persist settings in /etc/sysctl.d/*.conf, not by hand in /proc/sys. Runtime-only fixes vanish at the next reboot and are forgotten fastest.
  • Know the scope: some limits are global kernel ceilings, some are per-process limits, and some are service-manager settings. Raising the wrong one changes nothing.
  • For writeback tuning, use either dirty_bytes or dirty_ratio, not both. The kernel honors only one of the pair: writing one resets the other to 0, so whichever was written last is the active threshold.
  • Every production sysctl change should have a rollback path, a before/after measurement, and a reason written beside it.

Where sysctls live

Linux exposes tunables in /proc/sys/. sysctl is just a friendly interface over those files. Persistence comes from config files that systemd-sysctl or the init system loads during boot.

Location | Who writes there | Notes
/usr/lib/sysctl.d/*.conf | RPM/DEB packages | Vendor defaults
/run/sysctl.d/*.conf | Runtime-generated | Ephemeral overrides
/etc/sysctl.d/*.conf | Local admin / config management | Preferred place for your settings
/etc/sysctl.conf | Legacy admin config | Still works, but harder to organize than drop-ins

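Files are applied in lexical order by file name, regardless of directory, so 60-web.conf loads after 50-default.conf. Later wins when the same key appears multiple times. A file in /etc/sysctl.d also masks any file with the same name in /run/sysctl.d or /usr/lib/sysctl.d.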

sysctl net.core.somaxconn
cat /proc/sys/net/core/somaxconn

# See what systemd loaded at boot
systemctl status systemd-sysctl
journalctl -u systemd-sysctl -b --no-pager
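
The ordering rule can be checked offline. The sketch below is a simplified model, not how systemd-sysctl is implemented: a hypothetical helper, winning_sysctl, takes a key and a list of directories (lowest priority first), lets a file mask same-named files from earlier directories, walks the survivors in lexical order, and prints the last assignment it finds. It assumes plain key = value lines and ignores /etc/sysctl.conf.

```shell
# winning_sysctl KEY DIR...
# Hypothetical helper: simplified model of sysctl.d ordering.
# List DIRs lowest priority first. For each basename only the path from the
# last directory is kept (masking); files are then read in lexical order and
# the last assignment of KEY is printed as "path: value".
winning_sysctl() (
  key=$1
  shift
  for dir in "$@"; do
    for file in "$dir"/*.conf; do
      if [ -e "$file" ]; then
        printf '%s\t%s\n' "${file##*/}" "$file"
      fi
    done
  done |
    awk -F '\t' '{ path[$1] = $2 } END { for (b in path) print b "\t" path[b] }' |
    LC_ALL=C sort |
    cut -f2 |
    while IFS= read -r file; do
      awk -v k="$key" '
        { line = $0; sub(/^[ \t]+/, "", line) }
        line == "" || line ~ /^[#;]/ { next }
        {
          i = index(line, "=")
          if (i == 0) next
          name = substr(line, 1, i - 1); gsub(/[ \t]/, "", name)
          val = substr(line, i + 1); sub(/^[ \t]+/, "", val); sub(/[ \t]+$/, "", val)
          if (name == k) print FILENAME ": " val
        }
      ' "$file"
    done |
    tail -n 1
)
```

Typical use would be: winning_sysctl net.core.somaxconn /usr/lib/sysctl.d /run/sysctl.d /etc/sysctl.d. An empty result means no file in those directories sets the key.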

Runtime vs persistent changes

A runtime change is immediate but disappears on reboot. A persistent change comes from a file and survives reboot. Use runtime changes for testing only, then move the final value into a drop-in.

# Runtime only
sysctl -w net.core.somaxconn=4096

# Persistent file
cat > /etc/sysctl.d/60-web.conf <<'EOF'
net.core.somaxconn = 4096
EOF

# Load one file now
sysctl --load=/etc/sysctl.d/60-web.conf

# Or reload the full stack
sysctl --system

Test the smallest unit first. Loading one file with sysctl --load=/etc/sysctl.d/60-web.conf is safer than blasting every file with sysctl --system while you are still iterating.
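
To confirm that a drop-in and the running kernel agree, a drift check along these lines may help. check_dropin is a hypothetical helper, not part of procps or systemd; it assumes plain key = value lines and that a key maps to a path under /proc/sys by turning dots into slashes, which breaks for interface names that contain a dot. The optional second argument exists only so the function can be pointed at a test directory.

```shell
# check_dropin FILE [ROOT]
# Hypothetical helper: compare each "key = value" line in FILE with the live
# value under ROOT (default /proc/sys). Exits 1 if any key differs or is missing.
check_dropin() (
  file=$1
  root=${2:-/proc/sys}
  tab=$(printf '\t')
  awk '
    { line = $0; sub(/^[ \t]+/, "", line) }
    line == "" || line ~ /^[#;]/ { next }
    {
      i = index(line, "=")
      if (i == 0) next
      name = substr(line, 1, i - 1); gsub(/[ \t]/, "", name)
      val = substr(line, i + 1); gsub(/[ \t]+/, " ", val)
      sub(/^ /, "", val); sub(/ $/, "", val)
      print name "\t" val
    }
  ' "$file" | {
    status=0
    while IFS=$tab read -r key want; do
      path=$root/$(printf '%s' "$key" | tr . /)
      if [ ! -r "$path" ]; then
        echo "MISSING $key"
        status=1
        continue
      fi
      # Rebuild the record so tabs in multi-value keys become single spaces.
      have=$(awk '{ $1 = $1; print; exit }' "$path")
      if [ "$have" = "$want" ]; then
        echo "OK $key = $have"
      else
        echo "DRIFT $key: file=$want live=$have"
        status=1
      fi
    done
    exit "$status"
  }
)
```

Run it as check_dropin /etc/sysctl.d/60-web.conf after loading the file; a DRIFT line means something else changed the value or the file was never applied.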

Networking knobs that matter

Ignore giant internet lists of "performance tunings". Start from symptoms: listen queue overflow, SYN backlog overflow, ephemeral port exhaustion, asymmetric routing, or conntrack saturation.

Key | Why it exists | Common use
net.core.somaxconn | Kernel ceiling for listen backlog | High-connection-rate web tiers
net.ipv4.tcp_max_syn_backlog | Pending SYN queue size | Busy TCP listeners under bursty load
net.ipv4.ip_local_port_range | Ephemeral source port range | Proxies, NAT gateways, busy clients
net.ipv4.ip_forward | Enable packet forwarding | Routers, NAT, proxy nodes
net.ipv4.conf.all.rp_filter | Reverse-path validation | Use 2 (loose) for policy routing or multihomed hosts
net.netfilter.nf_conntrack_max | Connection tracking table size | Stateful firewalls, NAT, Kubernetes nodes

# /etc/sysctl.d/60-network-ingress.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 20000 65000
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 2
# This key exists only while the nf_conntrack module is loaded
net.netfilter.nf_conntrack_max = 262144

# Observe the symptoms before changing anything
ss -lnt
ss -s
nstat -az | grep -E 'ListenOverflows|ListenDrops|TcpExtTCPAbortOnMemory'
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

Backlog values are a chain. The kernel clamps whatever backlog the application passes to listen() at somaxconn, so raising somaxconn helps only if the application also requests a larger backlog. Nginx, HAProxy, JVMs, and Go services each have their own listener settings too.
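
One way to see which link of the chain is the limit: for sockets in LISTEN state, ss reports the current accept queue length in Recv-Q and the effective backlog in Send-Q. The sketch below reads ss -lnt output on stdin and labels each listener. listen_queue_report is a hypothetical helper, and the "limit" label is a heuristic: a backlog equal to somaxconn may also mean the application asked for exactly that value.

```shell
# listen_queue_report SOMAXCONN < output of `ss -lnt`
# Hypothetical helper. For LISTEN rows, column 2 (Recv-Q) is the current
# accept queue and column 3 (Send-Q) is the effective backlog.
listen_queue_report() {
  awk -v cap="$1" '
    $1 == "LISTEN" {
      limit = ($3 + 0 < cap + 0) ? "application" : "somaxconn"
      printf "%s queued=%d backlog=%d limit=%s\n", $4, $2, $3, limit
    }
  '
}
```

For example: ss -lnt | listen_queue_report "$(sysctl -n net.core.somaxconn)". Listeners marked limit=application will not benefit from a higher somaxconn until the service itself asks for more.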

VM and writeback knobs

Memory tuning is where overconfidence hurts. Small, well-understood changes are usually enough: reduce swap eagerness, cap dirty data so writeback storms stay bounded, and set workload-specific values like vm.max_map_count only when the application needs them.

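# /etc/sysctl.d/60-vm.conf
vm.swappiness = 10
# 128 MiB: background writeback starts here
vm.dirty_background_bytes = 134217728
# 512 MiB: processes that dirty pages start writing back themselves
vm.dirty_bytes = 536870912
vm.overcommit_memory = 0
vm.max_map_count = 262144

grep -E 'Dirty|Writeback|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
vmstat 1 10
sar -B 1 10

Do not mix ratios and bytes. Writing dirty_bytes makes the kernel reset dirty_ratio to 0, and vice versa, so only the one written last is in effect; the same holds for the dirty_background pair. Pick one model so the next admin can reason about the box.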
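
The byte values are easier to review when they are generated from sizes a human picked. A minimal sketch, assuming thresholds are chosen in MiB and that the background threshold should sit below the foreground one; dirty_dropin is a hypothetical helper name.

```shell
# dirty_dropin BACKGROUND_MIB FOREGROUND_MIB
# Hypothetical helper: print the two byte-based writeback lines for a drop-in.
# Refuses a background threshold that is not below the foreground threshold.
dirty_dropin() {
  bg_mib=$1
  fg_mib=$2
  if [ "$bg_mib" -ge "$fg_mib" ]; then
    echo "background threshold must be lower than dirty threshold" >&2
    return 1
  fi
  printf 'vm.dirty_background_bytes = %s\n' "$((bg_mib * 1024 * 1024))"
  printf 'vm.dirty_bytes = %s\n' "$((fg_mib * 1024 * 1024))"
}
```

dirty_dropin 128 512 reproduces the two byte values used in 60-vm.conf.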

Inotify limits

Inotify errors show up when tools that watch many files run out of watches or instances: IDEs, sync daemons, kubelets, config reloaders, and log agents. Running out of watches usually surfaces as No space left on device (ENOSPC) from an app that still has plenty of disk; running out of instances surfaces as Too many open files (EMFILE).

# /etc/sysctl.d/60-inotify.conf
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches = 524288
fs.inotify.max_queued_events = 32768

sysctl fs.inotify.max_user_instances
sysctl fs.inotify.max_user_watches
sysctl fs.inotify.max_queued_events

# Roughly identify which processes hold the most watches (fdinfo has one "inotify wd:" line per watch)
grep -r '^inotify' /proc/[0-9]*/fdinfo 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -nr | head

If you are tuning this for developer workstations, remember that multiple Electron apps plus an IDE plus a file sync tool can consume watches surprisingly fast.
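
When the PID alone is not enough, the same fdinfo scan can be wrapped so it also prints the command name. This is a sketch under the same assumption as the one-liner (one "inotify wd:" line per watch in fdinfo); inotify_watch_counts is a hypothetical helper, and it needs root to see other users' processes. The optional argument exists only to point it at a test tree.

```shell
# inotify_watch_counts [PROC_ROOT]
# Hypothetical helper: print "watches pid command" per process, busiest first.
inotify_watch_counts() (
  root=${1:-/proc}
  for dir in "$root"/[0-9]*/fdinfo; do
    [ -d "$dir" ] || continue
    pid=${dir%/fdinfo}
    pid=${pid##*/}
    # grep -c prints 0 and exits 1 when nothing matches, hence "|| true".
    watches=$(cat "$dir"/* 2>/dev/null | grep -c '^inotify wd:' || true)
    if [ "$watches" -gt 0 ]; then
      name=$(cat "$root/$pid/comm" 2>/dev/null || echo '?')
      printf '%s %s %s\n' "$watches" "$pid" "$name"
    fi
  done | sort -rn
)
```

inotify_watch_counts | head gives a quick ranking to compare against fs.inotify.max_user_watches.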

File descriptor limits interplay

"Too many open files" is rarely fixed by one knob. There are four layers that matter:

Layer | What it controls | How to inspect it
fs.file-max | System-wide file handle pool | sysctl fs.file-max, cat /proc/sys/fs/file-nr
fs.nr_open | Kernel ceiling for per-process hard limits | sysctl fs.nr_open
RLIMIT_NOFILE | Per-process soft and hard open-file limits | ulimit -Sn, ulimit -Hn, /proc/<pid>/limits
LimitNOFILE= | systemd unit-level limit | systemctl show <unit> -p LimitNOFILE

# /etc/sysctl.d/60-fd.conf
fs.nr_open = 1048576
fs.file-max = 2097152

# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=262144

# After editing the unit drop-in: systemctl daemon-reload, then restart the service
sysctl fs.nr_open
sysctl fs.file-max
cat /proc/sys/fs/file-nr

ulimit -Sn
ulimit -Hn
systemctl show nginx -p LimitNOFILE
grep 'open files' /proc/"$(pidof nginx | awk '{print $1}')"/limits

Global vs per-service. Raising fs.file-max does not help if the service is capped at LimitNOFILE=1024. Diagnose from the process outward, not the kernel inward.
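
Starting from the process means comparing what it has open with what it is allowed to open. A minimal sketch, assuming the usual layout of /proc/<pid>/limits (a "Max open files" row with soft and hard columns) and read access to /proc/<pid>/fd, which normally requires root for other users' processes. fd_headroom is a hypothetical helper; its second argument exists only for testing.

```shell
# fd_headroom PID [PROC_ROOT]
# Hypothetical helper: print open descriptor count next to the soft and hard
# RLIMIT_NOFILE values of one process.
fd_headroom() (
  pid=$1
  root=${2:-/proc}
  # Fields 4 and 5 of the "Max open files" row are the soft and hard limits.
  set -- $(awk '/^Max open files/ { print $4, $5 }' "$root/$pid/limits")
  used=$(( $(ls "$root/$pid/fd" | wc -l) ))
  printf 'pid=%s used=%d soft=%s hard=%s\n' "$pid" "$used" "${1:-?}" "${2:-?}"
)
```

If used is close to soft, the fix belongs in LimitNOFILE= or the application's own limit handling, not in fs.file-max.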

Measurement before and after

Take a snapshot before you change anything. Then compare not just the sysctl values, but the workload symptoms that motivated the change.

# Snapshot current values
sysctl -a > /var/tmp/sysctl.before

# Networking symptoms
ss -s
nstat -az | grep -E 'ListenOverflows|ListenDrops|TCPSynRetrans|TcpExtTCPAbortOnMemory'

# Memory/writeback symptoms
vmstat 1 10
grep -E 'Dirty|Writeback|MemAvailable|SwapFree' /proc/meminfo
slabtop -o

# After the change
sysctl -a > /var/tmp/sysctl.after
# Expect some noise from live counters such as fs.file-nr and fs.inode-nr
diff -u /var/tmp/sysctl.before /var/tmp/sysctl.after
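
For a change log or ticket, a per-key summary can be easier to read than a unified diff. The sketch below assumes both snapshots are in the key = value form that sysctl -a prints; sysctl_changes is a hypothetical helper name.

```shell
# sysctl_changes BEFORE AFTER
# Hypothetical helper: print "key: old -> new" for every key whose value
# differs between two `sysctl -a` snapshots, or that is new in AFTER.
sysctl_changes() {
  awk '
    {
      i = index($0, " = ")
      if (i == 0) next
      key = substr($0, 1, i - 1)
      val = substr($0, i + 3)
    }
    NR == FNR { before[key] = val; next }
    !(key in before) { print key ": (unset) -> " val; next }
    before[key] != val { print key ": " before[key] " -> " val }
  ' "$1" "$2"
}
```

sysctl_changes /var/tmp/sysctl.before /var/tmp/sysctl.after lists only the keys that moved, which makes it obvious when a change touched more than intended.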

If you need deeper proof than these counters give you, move to perf & bpftrace rather than turning more knobs on instinct.

Safe rollout pattern

Keep changes small, named by purpose, and easy to remove.

cat > /etc/sysctl.d/60-web-stack.conf <<'EOF'
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
EOF

# Apply only this file first
sysctl --load=/etc/sysctl.d/60-web-stack.conf

# Verify the values are live
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog

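# Restart the service so its listen sockets are recreated with the new backlog ceiling
systemctl restart nginx

# If the host behaves, keep the file and let config management own it, for example with Ansible:
tasks: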
  - name: Install sysctl drop-in
    ansible.builtin.copy:
      dest: /etc/sysctl.d/60-web-stack.conf
      mode: '0644'
      content: |
        net.core.somaxconn = 4096
        net.ipv4.tcp_max_syn_backlog = 8192
    notify: reload sysctl

handlers:
  - name: reload sysctl
    ansible.builtin.command: sysctl --system

Rollback is just as important: remove or revert the file, reload that file or the whole stack, and compare the same counters you used during rollout.
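
A rollback file can be captured before the new values go live. The sketch below reads the keys named in a drop-in and prints their current live values in sysctl.d format, so the output can later be fed back through sysctl --load. snapshot_for_rollback is a hypothetical helper; it assumes plain key = value lines and the dots-to-slashes mapping into /proc/sys, and the /var/tmp path in the usage note is only an example.

```shell
# snapshot_for_rollback FILE [ROOT]
# Hypothetical helper: for each key set in FILE, print "key = live value"
# as read from ROOT (default /proc/sys). Keys with no readable file are skipped.
snapshot_for_rollback() (
  file=$1
  root=${2:-/proc/sys}
  awk '
    { line = $0; sub(/^[ \t]+/, "", line) }
    line == "" || line ~ /^[#;]/ { next }
    {
      i = index(line, "=")
      if (i == 0) next
      name = substr(line, 1, i - 1); gsub(/[ \t]/, "", name)
      print name
    }
  ' "$file" |
    while IFS= read -r key; do
      path=$root/$(printf '%s' "$key" | tr . /)
      [ -r "$path" ] || continue
      printf '%s = %s\n' "$key" "$(awk '{ $1 = $1; print; exit }' "$path")"
    done
)
```

Before applying: snapshot_for_rollback /etc/sysctl.d/60-web-stack.conf > /var/tmp/60-web-stack.rollback. To revert, remove the drop-in and run sysctl --load=/var/tmp/60-web-stack.rollback.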

Troubleshooting over-tuning

Symptom | Likely cause | What to check
App still reports Too many open files | Per-process or systemd limit still low | Check LimitNOFILE, ulimit -n, and /proc/<pid>/limits, not just fs.file-max
Multi-homed host loses replies or health checks after reboot | rp_filter=1 rejecting asymmetric return traffic | Use rp_filter=2 for loose mode if policy routing or multiple uplinks are intentional; see Linux Networking
Write latency spikes after increasing dirty thresholds | Too much dirty page cache building before flush starts | Lower dirty_bytes/dirty_background_bytes and re-check vmstat plus /proc/meminfo
Value is correct after sysctl -w but wrong after reboot | Another file or a tuning profile overwrites it later | Search /usr/lib/sysctl.d, /etc/sysctl.d, and check for tuned, cloud-init, or platform scripts
sysctl: permission denied in a container | Unprivileged namespace or forbidden kernel key | Many sysctls are host-only; set them on the node, not in an ordinary container
Raising nf_conntrack_max did not stop connection drops | The table is still filling, or memory pressure is elsewhere | Compare nf_conntrack_count to nf_conntrack_max and inspect the firewall/NAT workload itself
Inotify errors keep happening after raising watch limits | The process leaks watches, or another resource is the real limit | Identify which processes hold watches and inspect them with lsof & strace

# Good generic triage loop
sysctl --system
journalctl -u systemd-sysctl -b --no-pager
sysctl net.core.somaxconn vm.swappiness fs.file-max
ss -s
vmstat 1 5

Cross-reference