sysctl Tuning

Persistent kernel tuning with guardrails: where sysctls live, how networking and VM knobs behave, and how to measure before changing production defaults.

Tuning heuristics

Change one bottleneck at a time and measure it. If you cannot name the symptom, you are not tuning, you are guessing.
Persist settings in /etc/sysctl.d/*.conf rather than writing to /proc/sys by hand. Runtime-only fixes vanish at the next reboot, usually after everyone has forgotten they were made.
Know the scope: some limits are global kernel ceilings, some are per-process limits, and some are service-manager settings. Raising the wrong one changes nothing.
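For writeback tuning, use either the byte-based keys (dirty_bytes, dirty_background_bytes) or the ratio-based keys (dirty_ratio, dirty_background_ratio), not both. Writing one form zeroes its counterpart, so the last write silently decides which threshold is active.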
Every production sysctl change should have a rollback path, a before/after measurement, and a reason written beside it.

On this page

Where sysctls live
Runtime vs persistent changes
Networking knobs that matter
VM and writeback knobs
Inotify limits
File descriptor limits interplay
Measurement before and after
Safe rollout pattern
Troubleshooting over-tuning
Cross-reference

Where sysctls live

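Linux exposes tunables as files under /proc/sys/. The sysctl command is a thin interface over those files: the key net.core.somaxconn is the file /proc/sys/net/core/somaxconn. Persistence comes from config files that systemd-sysctl (or the distribution's init system) loads during boot.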

Location	Who writes there	Notes
`/usr/lib/sysctl.d/*.conf`	RPM/DEB packages	Vendor defaults
`/run/sysctl.d/*.conf`	Runtime-generated	Ephemeral overrides
`/etc/sysctl.d/*.conf`	Local admin / config management	Preferred place for your settings
`/etc/sysctl.conf`	Legacy admin config	Still works, but harder to organize than drop-ins

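Files from all of these directories are merged and applied in lexical order of their file names, so 60-web.conf loads after 50-default.conf wherever each one lives. When the same key appears more than once, the later assignment wins. A file in /etc/sysctl.d with the same name as one in /usr/lib/sysctl.d replaces the vendor file entirely.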
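To see the "later wins" rule for one key, a small helper can list every assignment in roughly the order the files load. This is a minimal sketch, not a tool shipped with procps or systemd: the function name sysctl_setters is hypothetical, it assumes file names without tabs, and it ignores /etc/sysctl.conf. It prints same-named files from several directories one after another, highest priority last, rather than dropping the overridden one.

```shell
# Hypothetical helper: print every drop-in line that assigns KEY, in
# approximate load order. The last line printed is the one that should win.
# Usage: sysctl_setters KEY [DIR...]
sysctl_setters() {
    key=$1
    shift
    # Default search path, lowest priority first.
    [ "$#" -gt 0 ] || set -- /usr/lib/sysctl.d /run/sysctl.d /etc/sysctl.d
    for dir in "$@"; do
        for f in "$dir"/*.conf; do
            # Emit "basename<TAB>path" so files can be ordered by name alone.
            if [ -f "$f" ]; then printf '%s\t%s\n' "${f##*/}" "$f"; fi
        done
    done |
        LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 |
        cut -f2 |
        while IFS= read -r f; do
            # Match the key exactly: strip whitespace and the optional
            # leading "-" (ignore-errors prefix) before comparing.
            awk -F= -v key="$key" -v file="$f" '
                NF > 1 {
                    k = $1
                    gsub(/[[:space:]]/, "", k)
                    sub(/^-/, "", k)
                    if (k == key) print file ": " $0
                }' "$f"
        done
}
```

Run as sysctl_setters net.core.somaxconn and read the output bottom-up: the last line names the file to edit, and any line above it is a value that is being overridden.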

sysctl net.core.somaxconn
cat /proc/sys/net/core/somaxconn

# See what systemd loaded at boot
systemctl status systemd-sysctl
journalctl -u systemd-sysctl -b --no-pager

Runtime vs persistent changes

A runtime change is immediate but disappears on reboot. A persistent change comes from a file and survives reboot. Use runtime changes for testing only, then move the final value into a drop-in.

# Runtime only
sysctl -w net.core.somaxconn=4096

# Persistent file
cat > /etc/sysctl.d/60-web.conf <<'EOF'
net.core.somaxconn = 4096
EOF

# Load one file now
sysctl --load=/etc/sysctl.d/60-web.conf

# Or reload the full stack
sysctl --system

Test the smallest unit first. Loading one file with sysctl --load=/etc/sysctl.d/60-web.conf is safer than blasting every file with sysctl --system while you are still iterating.
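While iterating it is easy to lose track of whether the file and the running kernel agree. The sketch below compares each key in one drop-in with its live value. The name sysctl_drift and the optional second argument (an alternate root, useful for dry runs) are assumptions made for illustration. It also assumes that dots in a key map directly to slashes under /proc/sys, which does not hold for keys that embed dotted interface names.

```shell
# Hypothetical helper: compare each "key = value" line in a drop-in with the
# live value and print the ones that differ. Returns non-zero on any drift.
# Usage: sysctl_drift FILE [PROC_SYS_ROOT]
sysctl_drift() {
    file=$1
    root=${2:-/proc/sys}
    rc=0
    while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        case $key in
            '' | '#'* | ';'*) continue ;;   # blank lines and comments
        esac
        key=${key#-}                         # drop the ignore-errors prefix
        # Assumption: dots map to slashes (not true for dotted interface names).
        path=$root/$(printf '%s' "$key" | tr . /)
        if [ ! -r "$path" ]; then
            printf '%s: not readable\n' "$key"
            rc=1
            continue
        fi
        # Collapse whitespace so "20000 65000" matches the tab-separated
        # form the kernel prints for multi-value keys.
        want=$(printf '%s' "$want" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//')
        live=$(tr -s '[:space:]' ' ' < "$path" | sed 's/^ //; s/ $//')
        if [ "$want" != "$live" ]; then
            printf '%s: file=%s live=%s\n' "$key" "$want" "$live"
            rc=1
        fi
    done < "$file"
    return "$rc"
}
```

Running sysctl_drift /etc/sysctl.d/60-web.conf before and after loading the file shows whether the load did what you expected. No output and a zero exit status mean the file and the kernel agree.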

Networking knobs that matter

Ignore giant internet lists of "performance tunings". Start from symptoms: listen queue overflow, SYN backlog overflow, ephemeral port exhaustion, asymmetric routing, or conntrack saturation.

Key	Why it exists	Common use
`net.core.somaxconn`	Kernel ceiling for listen backlog	High-connection-rate web tiers
`net.ipv4.tcp_max_syn_backlog`	Pending SYN queue size	Busy TCP listeners under bursty load
`net.ipv4.ip_local_port_range`	Ephemeral source port range	Proxies, NAT gateways, busy clients
`net.ipv4.ip_forward`	Enable packet forwarding	Routers, NAT, proxy nodes
`net.ipv4.conf.all.rp_filter`	Reverse-path validation	Use `2` (loose) for policy routing or multihomed hosts
`net.netfilter.nf_conntrack_max`	Connection tracking table size	Stateful firewalls, NAT, Kubernetes nodes

# /etc/sysctl.d/60-network-ingress.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 20000 65000
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 2
net.netfilter.nf_conntrack_max = 262144

# Observe the symptoms before changing anything
ss -lnt
ss -s
nstat -az | grep -E 'ListenOverflows|ListenDrops|TcpExtTCPAbortOnMemory'
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

Backlog values are a chain. Raising somaxconn helps only if the application also requests a larger backlog. Nginx, HAProxy, JVMs, and Go services each have their own listener settings too.
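The chain can be checked from ss output. For sockets in LISTEN state, the Send-Q column is the backlog the kernel is actually using, which is the smaller of what the application asked for and somaxconn. The sketch below lists listeners whose backlog sits below a given ceiling, meaning the application itself requested less. The function name backlog_below is hypothetical, and the column positions assume the default ss -lnt layout.

```shell
# Hypothetical helper: read `ss -lnt` output on stdin and list listeners whose
# effective backlog (Send-Q for LISTEN sockets) is below CEILING.
# Usage: ss -lnt | backlog_below CEILING
backlog_below() {
    awk -v max="$1" '
        # Columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        $1 == "LISTEN" && ($3 + 0) < (max + 0) {
            printf "%s backlog=%d ceiling=%d\n", $4, $3, max
        }'
}
```

A typical call is ss -lnt | backlog_below "$(sysctl -n net.core.somaxconn)". Any listener it prints will not benefit from a higher somaxconn until the application's own backlog setting is raised and the service is restarted.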

VM and writeback knobs

Memory tuning is where overconfidence hurts. Small, well-understood changes are usually enough: reduce swap eagerness, cap dirty data so writeback storms stay bounded, and set workload-specific values like vm.max_map_count only when the application needs them.

# /etc/sysctl.d/60-vm.conf
vm.swappiness = 10
vm.dirty_background_bytes = 134217728
vm.dirty_bytes = 536870912
vm.overcommit_memory = 0
vm.max_map_count = 262144

grep -E 'Dirty|Writeback|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
vmstat 1 10
sar -B 1 10

vm.swappiness: lower values make the kernel less eager to swap anonymous memory.
vm.dirty_background_bytes and vm.dirty_bytes: start background flushing early and cap total dirty data so one writer cannot build a giant burst.
vm.overcommit_memory: leave at 0 unless the workload has a reason for strict or permissive overcommit semantics.
vm.max_map_count: often raised for Elasticsearch and some JVM-heavy software.

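Do not mix ratios and bytes. Writing dirty_bytes makes dirty_ratio read back as 0, and writing dirty_ratio does the same to dirty_bytes; the background pair behaves the same way. Pick one model and make sure no later drop-in sets the other form, so the next admin can reason about the box.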
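Because the inactive form of each pair reads as 0, the live model can be read straight from /proc/sys/vm. A minimal sketch, with the hypothetical name dirty_model and an optional directory argument for dry runs:

```shell
# Hypothetical helper: report which writeback threshold form is live for the
# background and foreground limits.
# Usage: dirty_model [VM_SYSCTL_DIR]
dirty_model() {
    dir=${1:-/proc/sys/vm}
    for pair in dirty_background_bytes:dirty_background_ratio dirty_bytes:dirty_ratio; do
        bytes_key=${pair%%:*}
        ratio_key=${pair##*:}
        bytes=$(cat "$dir/$bytes_key")
        ratio=$(cat "$dir/$ratio_key")
        # A non-zero bytes value means the bytes form is the active one.
        if [ "$bytes" -ne 0 ]; then
            printf '%s=%s model=bytes\n' "$bytes_key" "$bytes"
        else
            printf '%s=%s model=ratio\n' "$ratio_key" "$ratio"
        fi
    done
}
```

If the two output lines disagree about the model, some file set the background threshold in one form and the foreground threshold in the other. That is the mixed state worth cleaning up.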

Inotify limits

Inotify errors show up when tools that watch many files run out of watches or instances: IDEs, sync daemons, kubelets, config reloaders, and log agents. The usual symptom of an exhausted watch limit is No space left on device (ENOSPC) from an app that still has plenty of disk. An exhausted instance limit shows up as Too many open files (EMFILE) instead.

# /etc/sysctl.d/60-inotify.conf
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches = 524288
fs.inotify.max_queued_events = 32768

sysctl fs.inotify.max_user_instances
sysctl fs.inotify.max_user_watches
sysctl fs.inotify.max_queued_events

# Roughly identify which processes are using many watches
grep -R inotify /proc/*/fdinfo 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -nr | head

If you are tuning this for developer workstations, remember that multiple Electron apps plus an IDE plus a file sync tool can consume watches surprisingly fast.
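The one-liner above gives PIDs only. A slightly longer sketch adds the process name and counts watch entries exactly. The function name inotify_watches_by_pid is hypothetical, and the optional argument is an alternate proc root for dry runs. Reading other users' fdinfo needs root, and processes that exit mid-scan are skipped silently.

```shell
# Hypothetical helper: count inotify watches per process, busiest first.
# Output columns: watches<TAB>pid<TAB>command name
# Usage: inotify_watches_by_pid [PROC_ROOT]
inotify_watches_by_pid() {
    root=${1:-/proc}
    for d in "$root"/[0-9]*; do
        [ -d "$d/fdinfo" ] || continue
        # Each watch appears as one "inotify wd:..." line in the fdinfo
        # of the inotify file descriptor that owns it.
        n=$(cat "$d"/fdinfo/* 2>/dev/null | grep -c '^inotify wd:' || true)
        [ "$n" -gt 0 ] || continue
        printf '%s\t%s\t%s\n' "$n" "${d##*/}" "$(cat "$d/comm" 2>/dev/null)"
    done | sort -rn
}
```

Piping the result through head shows the few processes worth investigating before fs.inotify.max_user_watches is raised again.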

File descriptor limits interplay

"Too many open files" is rarely fixed by one knob. There are four layers that matter:

Layer	What it controls	How to inspect it
`fs.file-max`	System-wide file handle pool	`sysctl fs.file-max`, `cat /proc/sys/fs/file-nr`
`fs.nr_open`	Kernel ceiling for per-process hard limits	`sysctl fs.nr_open`
RLIMIT_NOFILE	Per-process soft and hard open-file limits	`ulimit -Sn`, `ulimit -Hn`, `/proc/<pid>/limits`
`LimitNOFILE=`	systemd unit-level limit	`systemctl show <unit> -p LimitNOFILE`

# /etc/sysctl.d/60-fd.conf
fs.nr_open = 1048576
fs.file-max = 2097152

# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=262144

sysctl fs.nr_open
sysctl fs.file-max
cat /proc/sys/fs/file-nr

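# A unit drop-in only takes effect after a reload and a service restart
systemctl daemon-reload
systemctl restart nginx

# Limits of the current shell (not the service)
ulimit -Sn
ulimit -Hn

# Limits of the service: configured, then as seen by the running process
systemctl show nginx -p LimitNOFILE
grep 'open files' /proc/"$(pidof -s nginx)"/limits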

Global vs per-service. Raising fs.file-max does not help if the service is capped at LimitNOFILE=1024. Diagnose from the process outward, not the kernel inward.
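Diagnosing "from the process outward" starts with how many descriptors the process holds compared with its own limits. A minimal sketch, assuming the hypothetical name fd_usage and an optional alternate proc root. Listing another user's /proc/<pid>/fd requires root.

```shell
# Hypothetical helper: show open descriptors against the per-process limits.
# Usage: fd_usage PID [PROC_ROOT]
fd_usage() {
    pid=$1
    root=${2:-/proc}
    # One entry per open descriptor.
    open=$(ls "$root/$pid/fd" 2>/dev/null | wc -l | tr -d '[:space:]')
    # "Max open files" line: soft limit is field 4, hard limit is field 5.
    limits=$(awk '/^Max open files/ { print $4, $5 }' "$root/$pid/limits")
    soft=${limits% *}
    hard=${limits#* }
    printf 'pid=%s open=%s soft=%s hard=%s\n' "$pid" "$open" "$soft" "$hard"
}
```

If open is close to soft, the fix belongs in LimitNOFILE= or the application's own settings. Only when the system-wide numbers in /proc/sys/fs/file-nr approach fs.file-max does the kernel-wide knob matter.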

Measurement before and after

Take a snapshot before you change anything. Then compare not just the sysctl values, but the workload symptoms that motivated the change.

# Snapshot current values
sysctl -a > /var/tmp/sysctl.before

# Networking symptoms
ss -s
nstat -az | grep -E 'ListenOverflows|ListenDrops|TCPSynRetrans|TcpExtTCPAbortOnMemory'

# Memory/writeback symptoms
vmstat 1 10
grep -E 'Dirty|Writeback|MemAvailable|SwapFree' /proc/meminfo
slabtop -o

# After the change
sysctl -a > /var/tmp/sysctl.after
diff -u /var/tmp/sysctl.before /var/tmp/sysctl.after

If you need deeper proof than these counters give you, move to perf & bpftrace rather than turning more knobs on instinct.
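A raw diff of two sysctl -a snapshots also picks up entries that change on their own, such as fs.file-nr or kernel.random.uuid. The sketch below reduces two snapshots to one "key: before -> after" line per change, which is easier to paste next to the reason for the change. The function name sysctl_changed is hypothetical. It assumes the "key = value" layout that sysctl -a prints and treats multi-line values naively.

```shell
# Hypothetical helper: print keys whose value differs between two snapshots
# taken with `sysctl -a`.
# Usage: sysctl_changed BEFORE_FILE AFTER_FILE
sysctl_changed() {
    awk -F' = ' '
        # First file: remember every value.
        NR == FNR { before[$1] = $2; next }
        # Second file: report new keys and changed values.
        !($1 in before)  { printf "%s: (absent) -> %s\n", $1, $2; next }
        before[$1] != $2 { printf "%s: %s -> %s\n", $1, before[$1], $2 }
    ' "$1" "$2"
}
```

With the snapshots from above, sysctl_changed /var/tmp/sysctl.before /var/tmp/sysctl.after should list the keys you meant to change plus a handful of self-updating counters. Anything else deserves a second look.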

Safe rollout pattern

Keep changes small, named by purpose, and easy to remove.

cat > /etc/sysctl.d/60-web-stack.conf <<'EOF'
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
EOF

# Apply only this file first
sysctl --load=/etc/sysctl.d/60-web-stack.conf

# Verify the values are live
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog

# Listeners keep the backlog they were given at listen() time, so restart the service
systemctl restart nginx

# If the host behaves, keep the file and let config management own it

tasks:
  - name: Install sysctl drop-in
    ansible.builtin.copy:
      dest: /etc/sysctl.d/60-web-stack.conf
      mode: '0644'
      content: |
        net.core.somaxconn = 4096
        net.ipv4.tcp_max_syn_backlog = 8192
    notify: reload sysctl

handlers:
  - name: reload sysctl
    ansible.builtin.command: sysctl --system

Rollback is just as important: remove or revert the file, reload that file or the whole stack, and compare the same counters you used during rollout.
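One way to make the rollback path concrete is to capture the current live values of exactly the keys a drop-in is about to change, in a form sysctl can load back. This is a sketch under assumptions: the function name sysctl_rollback_for is hypothetical, the optional second argument is an alternate root for dry runs, and dots in keys are assumed to map to slashes under /proc/sys.

```shell
# Hypothetical helper: for every key in a drop-in, print "key = <live value>"
# so the output can be saved and loaded later to undo the change.
# Usage: sysctl_rollback_for FILE [PROC_SYS_ROOT]
sysctl_rollback_for() {
    file=$1
    root=${2:-/proc/sys}
    while IFS='=' read -r key rest; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        case $key in
            '' | '#'* | ';'*) continue ;;   # blank lines and comments
        esac
        key=${key#-}                         # drop the ignore-errors prefix
        path=$root/$(printf '%s' "$key" | tr . /)
        [ -r "$path" ] || continue           # key absent on this kernel
        # Collapse tabs/newlines so multi-value keys stay on one line.
        printf '%s = %s\n' "$key" "$(tr -s '[:space:]' ' ' < "$path" | sed 's/ $//')"
    done < "$file"
}
```

Run it before the first sysctl --load and redirect the output to a file outside /etc/sysctl.d, for example under /var/tmp. Loading that file with sysctl --load restores the previous runtime values. Removing the drop-in is still needed so the change does not come back at the next boot.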

Troubleshooting over-tuning

Symptom	Likely cause	What to check
App still reports `Too many open files`	Per-process or systemd limit still low	Check `LimitNOFILE`, `ulimit -n`, and `/proc/<pid>/limits`, not just `fs.file-max`
Multi-homed host loses replies or health checks after reboot	`rp_filter=1` rejecting asymmetric return traffic	Use `rp_filter=2` for loose mode if policy routing or multiple uplinks are intentional; see Linux Networking
Write latency spikes after increasing dirty thresholds	Too much dirty page cache building before flush starts	Lower `dirty_bytes`/`dirty_background_bytes` and re-check `vmstat` plus `/proc/meminfo`
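Value is correct after `sysctl -w` but wrong after reboot	The value was never written to a drop-in, a later file or tuning profile overwrites it, or the owning kernel module was not yet loaded when the files were applied	Search `/usr/lib/sysctl.d`, `/run/sysctl.d`, `/etc/sysctl.d`, and `/etc/sysctl.conf`; check for `tuned`, cloud-init, or platform scripts; confirm modules such as `nf_conntrack` load before `systemd-sysctl` runs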
`sysctl: permission denied` in a container	Unprivileged namespace or forbidden kernel key	Many sysctls are host-only; set them on the node, not in an ordinary container
Raising `nf_conntrack_max` did not stop connection drops	The table is still filling, or memory pressure is elsewhere	Compare `nf_conntrack_count` to `nf_conntrack_max` and inspect the firewall/NAT workload itself
Inotify errors keep happening after raising watch limits	The process leaks watches, or another resource is the real limit	Identify which processes hold watches and inspect them with lsof & strace
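For the conntrack row above, comparing nf_conntrack_count to nf_conntrack_max is easier as a single percentage. A minimal sketch with the hypothetical name conntrack_fill and an optional directory argument for dry runs. Both files exist only while the nf_conntrack module is loaded.

```shell
# Hypothetical helper: show how full the connection tracking table is.
# Usage: conntrack_fill [NETFILTER_SYSCTL_DIR]
conntrack_fill() {
    dir=${1:-/proc/sys/net/netfilter}
    count=$(cat "$dir/nf_conntrack_count")
    max=$(cat "$dir/nf_conntrack_max")
    # Integer percentage, rounded down.
    printf 'conntrack %s/%s (%s%%)\n' "$count" "$max" "$((count * 100 / max))"
}
```

Sampled during peak traffic, a figure that stays well below 100% suggests the drops come from somewhere other than table size.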

# Good generic triage loop
sysctl --system
journalctl -u systemd-sysctl -b --no-pager
sysctl net.core.somaxconn vm.swappiness fs.file-max
ss -s
vmstat 1 5

Cross-reference

Linux Networking for packet flow, sockets, and route debugging that justify network-side sysctl changes.
Bonding & Bridges for bridge and forwarding-related sysctls.
firewalld Rich Rules for forwarding/NAT workflows that depend on net.ipv4.ip_forward and conntrack sizing.
systemd Unit Authoring for LimitNOFILE= and other service-level limits.
perf & bpftrace when counters are not enough and you need kernel-level evidence.