sysctl Tuning
- Change one bottleneck at a time and measure it. If you cannot name the symptom, you are not tuning, you are guessing.
- Persist settings in /etc/sysctl.d/*.conf, not by hand in /proc/sys. Runtime-only fixes vanish at the next reboot and are forgotten fastest.
- Know the scope: some limits are global kernel ceilings, some are per-process limits, and some are service-manager settings. Raising the wrong one changes nothing.
- For writeback tuning, use either dirty_bytes or dirty_ratio, not both. Mixing them causes confusion about which threshold is actually active.
- Every production sysctl change should have a rollback path, a before/after measurement, and a reason written beside it.
Where sysctls live
Linux exposes tunables in /proc/sys/. sysctl is just a friendly interface over those files. Persistence comes from config files that systemd-sysctl or the init system loads during boot.
| Location | Who writes there | Notes |
|---|---|---|
| /usr/lib/sysctl.d/*.conf | RPM/DEB packages | Vendor defaults |
| /run/sysctl.d/*.conf | Runtime-generated | Ephemeral overrides |
| /etc/sysctl.d/*.conf | Local admin / config management | Preferred place for your settings |
| /etc/sysctl.conf | Legacy admin config | Still works, but harder to organize than drop-ins |
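Files from all of these directories are sorted together by file name and applied in lexical order, so 60-web.conf loads after 50-default.conf regardless of which directory each one lives in. When the same key appears more than once, the last assignment wins. A file in /etc/sysctl.d replaces a file of the same name in /run/sysctl.d or /usr/lib/sysctl.d.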
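To see which file has the last word for a key, the ordering rule can be sketched as a small shell function. This is a simplified illustration, not how systemd-sysctl is implemented: last_definition is a hypothetical helper name, it sorts the candidate files by base name and prints the final assignment it finds, and it does not model a file in /etc/sysctl.d masking a same-named vendor file.

```shell
# last_definition KEY FILE...
# Hypothetical helper: print "path: line" for the last assignment of KEY
# when the given files are processed in lexical order of their base names.
last_definition() {
    key=$1
    shift
    for f in "$@"; do
        if [ -f "$f" ]; then
            # Prefix each path with its base name so sort orders by file name.
            printf '%s\t%s\n' "${f##*/}" "$f"
        fi
    done | LC_ALL=C sort | cut -f2 | while IFS= read -r path; do
        # Compare the text left of "=" (whitespace removed) with KEY exactly.
        awk -F= -v k="$key" '
            { name = $1; gsub(/[ \t]/, "", name) }
            name == k { print FILENAME ": " $0 }
        ' "$path"
    done | tail -n 1
}

# Example: which drop-in sets net.core.somaxconn last?
last_definition net.core.somaxconn \
    /usr/lib/sysctl.d/*.conf /run/sysctl.d/*.conf /etc/sysctl.d/*.conf
```

If the function prints nothing, no file sets the key and the live value is the kernel default or a runtime change.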
sysctl net.core.somaxconn
cat /proc/sys/net/core/somaxconn
# See what systemd loaded at boot
systemctl status systemd-sysctl
journalctl -u systemd-sysctl -b --no-pager
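The mapping between a sysctl key and its file is mechanical: dots become slashes under /proc/sys. A minimal sketch, assuming plain keys with no dot inside a path component (key_to_path is a hypothetical helper, not part of procps):

```shell
# key_to_path KEY
# Hypothetical helper: translate a dotted sysctl key into its /proc/sys path.
# Assumes no path component contains a literal dot; a VLAN interface such as
# eth0.100 would need the slash form of the key instead.
key_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

key_to_path net.core.somaxconn
key_to_path vm.swappiness
```

The two calls print /proc/sys/net/core/somaxconn and /proc/sys/vm/swappiness, which are the files the sysctl and cat commands above read.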
Runtime vs persistent changes
A runtime change is immediate but disappears on reboot. A persistent change comes from a file and survives reboot. Use runtime changes for testing only, then move the final value into a drop-in.
# Runtime only
sysctl -w net.core.somaxconn=4096
# Persistent file
cat > /etc/sysctl.d/60-web.conf <<'EOF'
net.core.somaxconn = 4096
EOF
# Load one file now
sysctl --load=/etc/sysctl.d/60-web.conf
# Or reload the full stack
sysctl --system
sysctl --load=/etc/sysctl.d/60-web.conf is safer than blasting every file with sysctl --system while you are still iterating.
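Once the final value is in a drop-in, it helps to confirm that the running kernel agrees with the file. The sketch below compares every key in a drop-in with the live value. check_dropin and its optional second argument (an alternate root instead of /proc/sys, useful for dry runs) are hypothetical names, and the parser only handles simple key = value lines.

```shell
# check_dropin FILE [ROOT]
# Hypothetical helper: report OK, DRIFT, or MISSING for each key in FILE.
# Returns non-zero if any key differs from the live value or is absent.
check_dropin() {
    file=$1
    root=${2:-/proc/sys}
    status=0
    while IFS='=' read -r key value || [ -n "$key" ]; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        case "$key" in
            ''|'#'*|';'*) continue ;;   # skip blank lines and comments
        esac
        # Normalise whitespace so "20000 65000" matches a tab-separated value.
        want=$(printf '%s' "$value" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//')
        path="$root/$(printf '%s' "$key" | tr . /)"
        if [ ! -r "$path" ]; then
            echo "MISSING $key"
            status=1
            continue
        fi
        live=$(tr -s '[:space:]' ' ' < "$path" | sed 's/^ //; s/ $//')
        if [ "$live" = "$want" ]; then
            echo "OK $key = $live"
        else
            echo "DRIFT $key: file=$want live=$live"
            status=1
        fi
    done < "$file"
    return "$status"
}

# Example (reads /proc/sys, needs no root):
# check_dropin /etc/sysctl.d/60-web.conf
```

A DRIFT line after a reboot usually means a later file or another tool is overriding the value.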
Networking knobs that matter
Ignore giant internet lists of "performance tunings". Start from symptoms: listen queue overflow, SYN backlog overflow, ephemeral port exhaustion, asymmetric routing, or conntrack saturation.
| Key | Why it exists | Common use |
|---|---|---|
| net.core.somaxconn | Kernel ceiling for listen backlog | High-connection-rate web tiers |
| net.ipv4.tcp_max_syn_backlog | Pending SYN queue size | Busy TCP listeners under bursty load |
| net.ipv4.ip_local_port_range | Ephemeral source port range | Proxies, NAT gateways, busy clients |
| net.ipv4.ip_forward | Enable packet forwarding | Routers, NAT, proxy nodes |
| net.ipv4.conf.all.rp_filter | Reverse-path validation | Use 2 (loose) for policy routing or multihomed hosts |
| net.netfilter.nf_conntrack_max | Connection tracking table size | Stateful firewalls, NAT, Kubernetes nodes |
# /etc/sysctl.d/60-network-ingress.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 20000 65000
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 2
net.netfilter.nf_conntrack_max = 262144
# Observe the symptoms before changing anything
ss -lnt
ss -s
nstat -az | grep -E 'ListenOverflows|ListenDrops|TcpExtTCPAbortOnMemory'
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
somaxconn helps only if the application also requests a larger backlog. Nginx, HAProxy, JVMs, and Go services each have their own listener settings too.
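The interaction is a simple minimum: the kernel silently caps whatever backlog the application passes to listen() at net.core.somaxconn. A sketch of that arithmetic, using a hypothetical helper effective_backlog and request sizes chosen only for illustration:

```shell
# effective_backlog REQUESTED CEILING
# The accept queue length a listener ends up with is min(requested, somaxconn).
effective_backlog() {
    if [ "$1" -gt "$2" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

# An application that asks for 511 gains nothing from a ceiling of 4096...
effective_backlog 511 4096
# ...and one that asks for 65535 is capped at the ceiling.
effective_backlog 65535 4096
```

The calls print 511 and 4096. On a live host, ss -lnt shows the result directly: for sockets in LISTEN state the Send-Q column is the effective backlog and Recv-Q is the current accept queue depth. The cap is applied when listen() is called, so a running service keeps its old backlog until it reopens its listeners.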
VM and writeback knobs
Memory tuning is where overconfidence hurts. Small, well-understood changes are usually enough: reduce swap eagerness, cap dirty data so writeback storms stay bounded, and set workload-specific values like vm.max_map_count only when the application needs them.
# /etc/sysctl.d/60-vm.conf
vm.swappiness = 10
vm.dirty_background_bytes = 134217728
vm.dirty_bytes = 536870912
vm.overcommit_memory = 0
vm.max_map_count = 262144
grep -E 'Dirty|Writeback|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
vmstat 1 10
sar -B 1 10
- vm.swappiness: lower values make the kernel less eager to swap anonymous memory.
- vm.dirty_background_bytes and vm.dirty_bytes: start background flushing early and cap total dirty data so one writer cannot build a giant burst.
- vm.overcommit_memory: leave at 0 unless the workload has a reason for strict or permissive overcommit semantics.
- vm.max_map_count: often raised for Elasticsearch and some JVM-heavy software.
If you use dirty_bytes, clear or ignore dirty_ratio. Pick one model so the next admin can reason about the box.
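The byte values in the drop-in above are plain MiB multiples (128 MiB and 512 MiB). The sketch below shows the conversion and a way to report which writeback model is active. It relies on the kernel behaviour that writing one of dirty_bytes or dirty_ratio makes the other read back as 0. mib_to_bytes, dirty_model, and the optional root argument are hypothetical names used for illustration.

```shell
# mib_to_bytes MIB
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# dirty_model [ROOT]
# Report whether the byte-based or ratio-based dirty threshold is in effect.
dirty_model() {
    root=${1:-/proc/sys}
    bytes=$(cat "$root/vm/dirty_bytes" 2>/dev/null || echo 0)
    if [ "$bytes" -gt 0 ]; then
        echo "bytes (dirty_bytes=$bytes)"
    else
        echo "ratio (dirty_ratio=$(cat "$root/vm/dirty_ratio" 2>/dev/null || echo unknown)%)"
    fi
}

mib_to_bytes 128   # value used for vm.dirty_background_bytes
mib_to_bytes 512   # value used for vm.dirty_bytes
dirty_model
```

The same check applies to the background pair, dirty_background_bytes and dirty_background_ratio.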
Inotify limits
Inotify errors show up when tools that watch many files run out of watches or instances: IDEs, sync daemons, kubelets, config reloaders, and log agents. The usual symptom is No space left on device from an app that still has plenty of disk.
# /etc/sysctl.d/60-inotify.conf
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches = 524288
fs.inotify.max_queued_events = 32768
sysctl fs.inotify.max_user_instances
sysctl fs.inotify.max_user_watches
sysctl fs.inotify.max_queued_events
# Roughly identify which processes are using many watches
grep -R inotify /proc/*/fdinfo 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -nr | head
If you are tuning this for developer workstations, remember that multiple Electron apps plus an IDE plus a file sync tool can consume watches surprisingly fast.
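The one-liner above counts matches per PID but does not say what each PID is. A slightly longer sketch that adds the command name, assuming the usual fdinfo layout where each watch is one line starting with inotify (count_watches is a hypothetical helper name):

```shell
# count_watches FDINFO_DIR
# Each inotify watch is one line beginning with "inotify" in the fdinfo file
# of the inotify descriptor that owns it.
count_watches() {
    cat "$1"/* 2>/dev/null | grep -c '^inotify' || true
}

# Print "watches pid command" for the heaviest users visible to this user.
for dir in /proc/[0-9]*/fdinfo; do
    [ -d "$dir" ] || continue
    n=$(count_watches "$dir")
    if [ "$n" -gt 0 ]; then
        pid=${dir#/proc/}
        pid=${pid%/fdinfo}
        printf '%s %s %s\n' "$n" "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
    fi
done | sort -rn | head
```

Run it as root to see every process. As an ordinary user it only counts processes whose fdinfo is readable, so the totals can look misleadingly small.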
File descriptor limits interplay
"Too many open files" is rarely fixed by one knob. There are four layers that matter:
| Layer | What it controls | How to inspect it |
|---|---|---|
| fs.file-max | System-wide file handle pool | sysctl fs.file-max, cat /proc/sys/fs/file-nr |
| fs.nr_open | Kernel ceiling for per-process hard limits | sysctl fs.nr_open |
| RLIMIT_NOFILE | Per-process soft and hard open-file limits | ulimit -Sn, ulimit -Hn, /proc/<pid>/limits |
| LimitNOFILE= | systemd unit-level limit | systemctl show <unit> -p LimitNOFILE |
# /etc/sysctl.d/60-fd.conf
fs.nr_open = 1048576
fs.file-max = 2097152
# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=262144
sysctl fs.nr_open
sysctl fs.file-max
cat /proc/sys/fs/file-nr
ulimit -Sn
ulimit -Hn
systemctl show nginx -p LimitNOFILE
cat /proc/$(pidof nginx | awk '{print $1}')/limits | grep 'open files'
fs.file-max does not help if the service is capped at LimitNOFILE=1024. Diagnose from the process outward, not the kernel inward.
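Diagnosing from the process outward can be sketched as follows: read the limit the process actually runs with, count the descriptors it currently holds, and only then look at the kernel-wide values. nofile_limits and fd_report are hypothetical helper names.

```shell
# nofile_limits LIMITS_FILE
# Print "soft hard" from the "Max open files" row of a /proc/<pid>/limits file.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }' "$1" 2>/dev/null
}

# fd_report PID
# Show how close one process is to its own soft limit.
fd_report() {
    pid=$1
    limits=$(nofile_limits "/proc/$pid/limits")
    soft=${limits% *}
    hard=${limits#* }
    # Counting entries needs permission to read the target's fd directory.
    used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    echo "pid=$pid open=$used soft=$soft hard=$hard"
}

# Example: report on the current shell.
fd_report "$$"
```

If open is close to soft, raise LimitNOFILE= on the unit first. fs.file-max only matters when the third field of /proc/sys/fs/file-nr is being approached system-wide.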
Measurement before and after
Take a snapshot before you change anything. Then compare not just the sysctl values, but the workload symptoms that motivated the change.
# Snapshot current values
sysctl -a > /var/tmp/sysctl.before
# Networking symptoms
ss -s
nstat -az | grep -E 'ListenOverflows|ListenDrops|TCPSynRetrans|TcpExtTCPAbortOnMemory'
# Memory/writeback symptoms
vmstat 1 10
grep -E 'Dirty|Writeback|MemAvailable|SwapFree' /proc/meminfo
slabtop -o
# After the change
sysctl -a > /var/tmp/sysctl.after
diff -u /var/tmp/sysctl.before /var/tmp/sysctl.after
If you need deeper proof than these counters give you, move to perf & bpftrace rather than turning more knobs on instinct.
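Comparing counters by eye across two terminal scrollbacks is error-prone. A minimal sketch of a before/after comparison, assuming two snapshots saved with nstat -az (absolute values, zeros included) and a hypothetical helper named counter_delta. The snapshot paths in the comments are illustrative.

```shell
# counter_delta BEFORE AFTER
# Print every counter whose value differs between two saved "nstat -az"
# snapshots, together with the signed change.
counter_delta() {
    awk '
        /^#/ { next }                        # skip the "#kernel" header line
        NR == FNR { before[$1] = $2; next }  # first file: remember values
        ($1 in before) && $2 != before[$1] {
            printf "%s %+.0f\n", $1, $2 - before[$1]
        }
    ' "$1" "$2"
}

# Typical use:
#   nstat -az > /var/tmp/nstat.before
#   ...apply the change and let the workload run...
#   nstat -az > /var/tmp/nstat.after
#   counter_delta /var/tmp/nstat.before /var/tmp/nstat.after
```

Counters that did not move are omitted, so an empty result for ListenOverflows and ListenDrops over a busy period is the evidence that the change worked.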
Safe rollout pattern
Keep changes small, named by purpose, and easy to remove.
cat > /etc/sysctl.d/60-web-stack.conf <<'EOF'
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
EOF
# Apply only this file first
sysctl --load=/etc/sysctl.d/60-web-stack.conf
# Verify the values are live
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog
# Restart the service so its listeners pick up the new backlog ceiling
systemctl restart nginx
# If the host behaves, keep the file and let config management own it
tasks:
  - name: Install sysctl drop-in
    ansible.builtin.copy:
      dest: /etc/sysctl.d/60-web-stack.conf
      mode: '0644'
      content: |
        net.core.somaxconn = 4096
        net.ipv4.tcp_max_syn_backlog = 8192
    notify: reload sysctl
handlers:
  - name: reload sysctl
    ansible.builtin.command: sysctl --system
Rollback is just as important: remove or revert the file, reload that file or the whole stack, and compare the same counters you used during rollout.
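A rollback path is easiest to trust when it is captured before the change. The sketch below reads the keys from a new drop-in and writes out their current live values in the same key = value form, so the result can be loaded back with sysctl --load. rollback_for, its optional root argument, and the output path in the comments are hypothetical.

```shell
# rollback_for FILE [ROOT]
# For every key assigned in FILE, print "key = <current live value>".
rollback_for() {
    file=$1
    root=${2:-/proc/sys}
    while IFS='=' read -r key rest || [ -n "$key" ]; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        case "$key" in
            ''|'#'*|';'*) continue ;;   # skip blank lines and comments
        esac
        path="$root/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            printf '%s = %s\n' "$key" \
                "$(tr -s '[:space:]' ' ' < "$path" | sed 's/ $//')"
        fi
    done < "$file"
}

# Run BEFORE applying the new drop-in:
#   rollback_for /etc/sysctl.d/60-web-stack.conf > /var/tmp/60-web-stack.rollback
# To roll back later:
#   sysctl --load=/var/tmp/60-web-stack.rollback
```

Keep the rollback file outside /etc/sysctl.d so it is never applied at boot by accident.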
Troubleshooting over-tuning
| Symptom | Likely cause | What to check |
|---|---|---|
| App still reports Too many open files | Per-process or systemd limit still low | Check LimitNOFILE, ulimit -n, and /proc/<pid>/limits, not just fs.file-max |
| Multi-homed host loses replies or health checks after reboot | rp_filter=1 rejecting asymmetric return traffic | Use rp_filter=2 for loose mode if policy routing or multiple uplinks are intentional; see Linux Networking |
| Write latency spikes after increasing dirty thresholds | Too much dirty page cache building before flush starts | Lower dirty_bytes/dirty_background_bytes and re-check vmstat plus /proc/meminfo |
| Value is correct after sysctl -w but wrong after reboot | Another file or a tuning profile overwrites it later | Search /usr/lib/sysctl.d, /etc/sysctl.d, and check for tuned, cloud-init, or platform scripts |
| sysctl: permission denied in a container | Unprivileged namespace or forbidden kernel key | Many sysctls are host-only; set them on the node, not in an ordinary container |
| Raising nf_conntrack_max did not stop connection drops | The table is still filling, or memory pressure is elsewhere | Compare nf_conntrack_count to nf_conntrack_max and inspect the firewall/NAT workload itself |
| Inotify errors keep happening after raising watch limits | The process leaks watches, or another resource is the real limit | Identify which processes hold watches and inspect them with lsof & strace |
# Good generic triage loop
sysctl --system
journalctl -u systemd-sysctl -b --no-pager
sysctl net.core.somaxconn vm.swappiness fs.file-max
ss -s
vmstat 1 5
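For the conntrack row in the table above, the comparison is easier to read as a percentage. A minimal sketch, assuming the nf_conntrack module is loaded so that both files exist. conntrack_usage and its optional directory argument are hypothetical.

```shell
# conntrack_usage [DIR]
# Print the connection tracking table fill level as count/max and a percentage.
conntrack_usage() {
    dir=${1:-/proc/sys/net/netfilter}
    if [ -r "$dir/nf_conntrack_count" ] && [ -r "$dir/nf_conntrack_max" ]; then
        count=$(cat "$dir/nf_conntrack_count")
        max=$(cat "$dir/nf_conntrack_max")
        if [ "$max" -gt 0 ]; then
            echo "conntrack: $count/$max ($(( count * 100 / max ))%)"
        else
            echo "conntrack: $count/$max"
        fi
    else
        echo "conntrack: not loaded or not visible from here"
    fi
}

conntrack_usage
```

A table that sits near 100% after the limit was raised points at the workload (short-lived flows, long timeouts, a scan) rather than at the limit itself.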
Cross-reference
- Linux Networking for packet flow, sockets, and route debugging that justify network-side sysctl changes.
- Bonding & Bridges for bridge and forwarding-related sysctls.
- firewalld Rich Rules for forwarding/NAT workflows that depend on net.ipv4.ip_forward and conntrack sizing.
- systemd Unit Authoring for LimitNOFILE= and other service-level limits.
- perf & bpftrace when counters are not enough and you need kernel-level evidence.