Service-Specific Troubleshooting
General first steps — always
Before diving into service-specific steps, do these first for any broken service:
# 1. Is the service running?
systemctl status SERVICE_NAME
# 2. What do the logs say?
journalctl -u SERVICE_NAME -n 50
journalctl -u SERVICE_NAME --since "10 minutes ago"
# 3. Is it listening on the right port?
ss -tlnp | grep SERVICE_NAME
# 4. Are there recent errors at the system level?
journalctl -p err -b --no-pager | head -30
# 5. Did something change recently?
git log --oneline -10 # in the ansible repo
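The checks above can be bundled into one helper so they always run in the same order. This is a minimal sketch rather than part of the original runbook: the names `triage`, `run`, and `DRY_RUN` are invented for illustration, and the git step is left out because the repo location is site-specific.

```shell
# run CMD...: echo the command, then execute it unless DRY_RUN=1.
# A failing check must not abort the remaining checks, hence "|| true".
run() {
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@" || true
}

# triage SERVICE: status, recent logs, listening sockets, system-level errors.
triage() {
    svc=$1
    run systemctl --no-pager status "$svc"
    run journalctl -u "$svc" -n 50 --no-pager
    run sh -c "ss -tlnp | grep -- '$svc'"
    run sh -c "journalctl -p err -b --no-pager | head -30"
}
```

Call it as `triage nginx`, or as `DRY_RUN=1 triage nginx` to print the commands without running them.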
nginx not responding
# Step 1: Is it running?
systemctl status nginx
journalctl -u nginx -n 30
# Step 2: Config syntax error?
nginx -t
# If syntax error, it will point to the file and line
# Step 3: Is it listening?
ss -tlnp | grep nginx
# If not listening: service probably failed to start — read the logs
# Step 4: Is the port open in the firewall?
firewall-cmd --list-all | grep -E "ports|services"
# Step 5: Is SELinux blocking it?
grep "type=AVC" /var/log/audit/audit.log | grep nginx | tail -10
# Step 6: Test from outside
curl -v http://HOSTNAME/
curl -vk https://HOSTNAME/ # -k skips cert check
# Step 7: Check if upstream backend is running (for reverse proxy)
curl -v http://127.0.0.1:8080/ # test backend directly
Common nginx failures:
- 502 Bad Gateway — nginx is up, backend is down or not listening on the expected port
- 504 Gateway Timeout — backend is too slow or hanging; check proxy_read_timeout
- 413 Request Entity Too Large — increase client_max_body_size
- SSL_ERROR_RX_RECORD_TOO_LONG — HTTPS client connecting to HTTP port; check listen and port config
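The status codes in the list above can be turned into a quick lookup. The sketch below is illustrative only: `nginx_hint` is a made-up helper name, and the `000` case relies on curl printing `000` for `%{http_code}` when it gets no HTTP response at all.

```shell
# nginx_hint CODE: print the likely cause for an HTTP status code,
# using the causes listed in this section.
nginx_hint() {
    case $1 in
        502) echo "backend down or not listening on the expected port" ;;
        504) echo "backend slow or hanging; check proxy_read_timeout" ;;
        413) echo "request body too large; increase client_max_body_size" ;;
        000) echo "no HTTP response; check that nginx is running and the port is open" ;;
        *)   echo "no hint for status $1" ;;
    esac
}
```

A possible way to use it: `nginx_hint "$(curl -s -o /dev/null -w '%{http_code}' http://HOSTNAME/)"`.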
Postfix mail not sending
# Step 1: Is postfix running?
systemctl status postfix
journalctl -u postfix -n 30
# Step 2: What is in the queue?
postqueue -p
# Look at the "stuck" messages — the reason is shown
# Step 3: Try to flush the queue and watch what happens
postqueue -f
journalctl -u postfix -f # watch in another terminal
# Step 4: Test sending manually
echo "test" | mail -s "test" you@example.com
journalctl -u postfix -n 20 # check what happened
Reading queue errors in postqueue -p:
# Connection refused to relayhost
connect to smtp.example.com[10.0.0.2]:25: Connection refused
→ relayhost is down or wrong port in main.cf
# Authentication failure
SASL authentication failed
→ Wrong credentials in sasl_passwd, or sasl_passwd.db not updated (run postmap)
# TLS required but not offered
server requires encryption
→ set smtp_tls_security_level = encrypt (or = may for opportunistic TLS)
# DNS lookup failed
Host or domain name not found. Name service error
→ relayhost hostname does not resolve; check DNS, and write it as [smtp.example.com]:587 in main.cf to skip the MX lookup
# Check DNS resolution of relayhost
dig +short smtp.example.com
# Check TCP connectivity
nc -zv smtp.example.com 587
# Check SASL credentials file
postconf smtp_sasl_password_maps
ls -la /etc/postfix/sasl_passwd.db # must exist and be newer than sasl_passwd
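When the queue is large, it helps to count how many messages hit each of the errors above instead of reading them one by one. The following is a rough sketch: `queue_hints` is a hypothetical helper, and it matches only the error strings quoted in this section, so the exact wording in your Postfix version may differ.

```shell
# queue_hints: read `postqueue -p` output on stdin and print
# "COUNT<TAB>HINT" for each known error, most frequent first.
queue_hints() {
    awk '
        /Connection refused/            { n["relayhost down or wrong port in main.cf"]++ }
        /SASL authentication failed/    { n["wrong credentials or stale sasl_passwd.db (run postmap)"]++ }
        /requires encryption/           { n["set smtp_tls_security_level = encrypt"]++ }
        /Host or domain name not found/ { n["relayhost does not resolve; check DNS"]++ }
        END { for (h in n) printf "%d\t%s\n", n[h], h }
    ' | sort -rn
}
```

Typical use would be `postqueue -p | queue_hints`.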
SSH connection failing
# Step 1: Test with verbose output
ssh -vvv user@host 2>&1 | head -50
# Look for:
# - "Connecting to host port 22" — network connectivity
# - "Authentications that can continue" — what the server accepts
# - "No more authentication methods to try" — key not accepted
# Step 2: Is sshd running on the target?
systemctl status sshd
# Step 3: Is port 22 open?
ss -tlnp | grep sshd
firewall-cmd --list-all | grep ssh
# Step 4: Key issues
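# Check the key is in authorized_keys on the target
# (id_ed25519.pub is the client's public key; copy it over or paste its second field)
grep -F "$(cut -d' ' -f2 ~/.ssh/id_ed25519.pub)" ~/.ssh/authorized_keys
# Check permissions (sshd's StrictModes rejects group- or world-writable paths; 700/600 is the safe setting)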
ls -la ~/.ssh/ # dir: 700
ls -la ~/.ssh/authorized_keys # file: 600
# Step 5: Check SELinux
restorecon -Rv ~/.ssh/ # fix any context issues
# Step 6: Check sshd logs on target
journalctl -u sshd -n 30
Common SSH error messages:
- Connection refused — sshd not running, or wrong port, or firewall blocking
- Connection timed out — network unreachable or firewall silently dropping
- Permission denied (publickey) — key not in authorized_keys, wrong permissions, or key type not accepted
- Host key verification failed — host key changed (or known_hosts is stale); remove the old entry with ssh-keygen -R hostname
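The permission check from Step 4 is easy to get wrong by eye, so it can be scripted. This is a small sketch with invented helper names (`check_mode`, `check_ssh_perms`); it assumes GNU `stat` (the `-c %a` form) and compares against the 700 and 600 modes given above.

```shell
# check_mode PATH EXPECTED: succeed if PATH has exactly the EXPECTED octal mode.
check_mode() {
    mode=$(stat -c %a "$1" 2>/dev/null)
    [ "$mode" = "$2" ] && return 0
    echo "$1: mode ${mode:-missing}, expected $2"
    return 1
}

# check_ssh_perms [DIR]: check DIR (default ~/.ssh) and its authorized_keys.
check_ssh_perms() {
    dir=${1:-$HOME/.ssh}
    rc=0
    check_mode "$dir" 700 || rc=1
    check_mode "$dir/authorized_keys" 600 || rc=1
    return "$rc"
}
```

Run `check_ssh_perms` as the affected user on the target; it prints one line per mismatch and returns non-zero if anything is off.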
Time sync problems
# Step 1: Is chrony running?
systemctl status chronyd
# Step 2: Is it synced?
chronyc tracking
# Look for "System time" — should be small (milliseconds)
# Look for "Leap status: Normal" — not "Not synchronised"
# Step 3: What sources is it using?
chronyc sources -v
# '*' = currently synced source
# '+' = acceptable source
# '?' = unreachable source
# Step 4: Can it reach the NTP servers?
chronyc sourcestats
ping ntp1.example.com
# Step 5: Force a sync (if clock is far off)
chronyc makestep
# (chronyc -a makestep does the same; current chrony ignores -a and keeps it only for compatibility)
# Step 6: Check the config
grep -E '^(server|pool)' /etc/chrony.conf
If all NTP sources show ? (unreachable):
# DNS check
dig +short ntp1.example.com
# Connectivity check (NTP uses UDP port 123; UDP is connectionless, so nc only detects a failure when an ICMP error comes back)
nc -zuv ntp1.example.com 123
# Firewall check
firewall-cmd --list-all | grep -E "ntp|123"
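The two things to look for in `chronyc tracking` (a small "System time" offset and "Leap status: Normal") can be checked in one go. The sketch below is an illustration: `chrony_ok` is a made-up name and the 0.1 second default limit is an arbitrary assumption, so pick a threshold that suits your environment.

```shell
# chrony_ok [MAX_SECONDS]: read `chronyc tracking` output on stdin.
# Prints the offset and leap status; succeeds only if the offset is
# below MAX_SECONDS (default 0.1) and the leap status is Normal.
chrony_ok() {
    awk -v max="${1:-0.1}" '
        /^System time/ { off = $4 }
        /^Leap status/ { sub(/^[^:]*:[ \t]*/, ""); leap = $0 }
        END {
            printf "offset=%ss leap=%s\n", off, leap
            exit !(leap == "Normal" && off < max)
        }'
}
```

For example: `chronyc tracking | chrony_ok 0.05 || echo "clock needs attention"`.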
Login / authentication failing
# Step 1: Is SSSD running?
systemctl status sssd
# Step 2: Can SSSD resolve the user?
id username@example.com
# Step 3: Test HBAC rules
ipa hbactest --user=username --host=$(hostname -f) --service=sshd # matched rules are listed by default
# Step 4: Check Kerberos
kinit username@EXAMPLE.COM
klist # see if a ticket was issued
# Step 5: Check time sync (Kerberos fails with clock skew > 5 min)
chronyc tracking | grep "System time"
date # compare with date on the IPA server
# Step 6: SSSD logs
tail -f /var/log/sssd/sssd_example.com.log
journalctl -u sssd -n 50
# Step 7: PAM auth logs
journalctl -u sshd -n 20 # sshd pam logs
tail -f /var/log/secure
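Step 5 compares clocks by eye; the same comparison can be done numerically against the 5 minute Kerberos limit. This is a minimal sketch: `skew_ok` is a hypothetical helper, and `ipa.example.com` in the usage line is a placeholder for your IPA server.

```shell
# skew_ok LOCAL_EPOCH REMOTE_EPOCH [MAX_SECONDS]
# Prints the absolute clock difference; succeeds if it is at most
# MAX_SECONDS (default 300, the usual Kerberos tolerance).
skew_ok() {
    diff=$(( $1 - $2 ))
    [ "$diff" -lt 0 ] && diff=$(( -diff ))
    echo "skew=${diff}s"
    [ "$diff" -le "${3:-300}" ]
}
```

One way to call it: `skew_ok "$(date +%s)" "$(ssh ipa.example.com date +%s)"`. The SSH round trip adds a second or so of error, which does not matter at this tolerance.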
DNS resolution failing
# Step 1: Basic test
dig example.com
nslookup example.com
# Step 2: Which resolver is being used?
cat /etc/resolv.conf
resolvectl status # on systemd-resolved systems
# Step 3: Test with a specific resolver
dig example.com @8.8.8.8
dig example.com @10.0.0.10 # your internal DNS
# Step 4: Is the resolver reachable?
nc -zuv 10.0.0.10 53 # UDP
nc -zv 10.0.0.10 53 # TCP (used for large responses)
# Step 5: Check /etc/hosts for overrides
grep example.com /etc/hosts
# Step 6: Check nsswitch.conf
grep hosts /etc/nsswitch.conf # should be: files dns
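The Step 6 check (the hosts line should list files before dns) can be automated for use across many hosts. A small sketch, with `hosts_lookup_ok` as an invented name; it only looks at the order of the `files` and `dns` sources and ignores everything else on the line.

```shell
# hosts_lookup_ok: read nsswitch.conf on stdin; succeed if the hosts:
# line contains both "files" and "dns", with "files" listed first.
hosts_lookup_ok() {
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++) {
                if ($i == "files" && !f) f = i
                if ($i == "dns" && !d)   d = i
            }
        }
        END { exit !(f && d && f < d) }'
}
```

Usage: `hosts_lookup_ok < /etc/nsswitch.conf || echo "check the hosts: line"`.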
Blocked by SELinux
# Step 1: Is SELinux in enforcing mode?
getenforce
# Step 2: Check for recent denials
ausearch -m avc -ts recent | tail -20
grep "type=AVC" /var/log/audit/audit.log | tail -10
# Step 3: Explain the denial
ausearch -m avc -ts recent | audit2why
# Step 4: Quick test — switch to permissive temporarily
setenforce 0
# retry the operation
# if it works in permissive, SELinux is the cause
setenforce 1
# Step 5: Fix it properly
# Check for a boolean that covers this use case:
getsebool -a | grep relevant_keyword
setsebool -P boolean_name on
# Or fix a file context:
semanage fcontext -a -t correct_type_t "/path/to/files(/.*)?"
restorecon -Rv /path/to/files/
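When the audit log holds many denials, a per-process summary shows quickly which service is affected before you reach for audit2why. The following is an illustrative sketch: `avc_summary` is a made-up name, and it assumes the usual `comm="..."` and `tclass=...` fields in AVC records (command names containing spaces are not handled).

```shell
# avc_summary: read AVC records on stdin and print
# "COUNT COMMAND TARGET_CLASS", most frequent first.
avc_summary() {
    awk '/avc:.*denied/ {
        comm = "?"; tclass = "?"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^comm=/)   { comm = $i; gsub(/^comm=|"/, "", comm) }
            if ($i ~ /^tclass=/) { tclass = substr($i, 8) }
        }
        n[comm " " tclass]++
    }
    END { for (k in n) print n[k], k }' | sort -rn
}
```

Usage: `ausearch -m avc -ts recent | avc_summary`, or `grep "type=AVC" /var/log/audit/audit.log | avc_summary`.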
Blocked by firewall
# Step 1: Check what is allowed
firewall-cmd --list-all
# Step 2: Test from the client side
nc -zv targethost port
curl -v http://targethost:port
# Step 3: Verify traffic is reaching the server at all
tcpdump -i eth0 port PORT # run on the server; check if packets arrive
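# If packets arrive but the client still cannot connect:
# → the local firewall is rejecting them, or the service is down or listening on the wrong interface
# (tcpdump sees inbound packets before firewalld filters them)
# If no packets arrive:
# → something upstream is blocking or misrouting the traffic (not this host's firewall)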
# Step 4: Add the rule
firewall-cmd --permanent --add-port=PORT/tcp
firewall-cmd --reload
Disk full / inodes exhausted
# Step 1: Check disk space AND inodes
df -h # blocks
df -i # inodes (a full inode table looks like free space but still fails)
# Step 2: Find what is using space — stay on the same filesystem with -x
du -xhd1 /var | sort -rh | head # top dirs in /var only, don't cross mounts
du -xhd1 / | sort -rh | head # top dirs at root level
du -sh /var/log/* | sort -rh | head
du -sh /var/spool/* | sort -rh | head
# Large individual files
find /var/log -xdev -size +100M -printf '%s\t%p\n' | sort -rn | head
# Inodes exhausted but df -h shows space free? Find dirs with many tiny files
find / -xdev -type d -printf '%p\n' 2>/dev/null | while IFS= read -r d; do echo "$(ls -A "$d" 2>/dev/null | wc -l) $d"; done | sort -rn | head
# Step 3: Check mail queue size
postqueue -p | wc -l
ls /var/spool/postfix/deferred/ | wc -l
# Step 4: Rotate or truncate logs safely
journalctl --vacuum-size=2G
logrotate -f /etc/logrotate.conf
# Step 5: Truncate a large log that a service still has open
> /var/log/some.log # truncates in place — preserves inode and permissions
# DO NOT rm a log file that a running service has open — the blocks stay allocated
# until the service closes its file descriptor. Use lsof +L1 to find such cases:
lsof +L1 # files that are deleted but still open — space still in use
# Step 6: If the filesystem is on LVM, extend it online instead of cleaning up
sudo lvextend -r -L +5G /dev/vg_data/lv_var # -r resizes the filesystem too
A full disk kills services silently. nginx cannot write access logs, postfix cannot write to the queue, and SSSD cannot cache. Always check df -h and df -i early: inode exhaustion looks like "plenty of space" to most metrics but still returns ENOSPC. For the LVM extend recipe and how to tell if a volume can grow online, see LVM — Logical Volume Manager.
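Because both blocks and inodes need checking, a single filter that works on either `df` report is handy. This is a sketch under stated assumptions: `fs_over` is an invented name, it expects the POSIX layout from `df -P` (or `df -Pi` on GNU df), and it does not handle mount points that contain spaces.

```shell
# fs_over PCT: read `df -P` or `df -Pi` output on stdin and print
# "MOUNTPOINT USE%" for every filesystem at or above PCT percent.
fs_over() {
    awk -v t="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= t + 0) print $6, $5
    }'
}
```

Usage: `df -P | fs_over 90` for blocks and `df -Pi | fs_over 90` for inodes.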
Postfix queue management
# Show the queue (deferred, active, hold)
postqueue -p
mailq # same, shorthand
# Count queued messages
postqueue -p | grep -c "^[0-9A-F]"
# Flush: attempt to deliver all deferred messages now
postqueue -f
# Delete a specific message by queue ID
postsuper -d QUEUEID
# Delete all deferred messages (use with care)
postsuper -d ALL deferred
# Inspect a specific message including headers
postcat -q QUEUEID
# Delete all messages in queue (emergency only)
postsuper -d ALL
postsuper -d ALL deferred only deletes messages stuck in the deferred queue — messages being actively delivered are unaffected. Use this when a large backlog of undeliverable messages is consuming disk space.
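Between deleting one message and deleting everything, there is often a need to remove only the mail from one sender. The sketch below adapts the pipeline shown in the postsuper(1) manual page into a helper that only lists queue IDs; `queue_ids_from` is a made-up name, and the sender address in the usage line is a placeholder.

```shell
# queue_ids_from SENDER: read `postqueue -p` output on stdin and print
# the queue IDs of messages whose envelope sender is SENDER.
queue_ids_from() {
    tail -n +2 |                 # drop the header line
        grep -v '^ *(' |         # drop the "(reason)" lines
        awk -v s="$1" 'BEGIN { RS = "" } $7 == s { print $1 }' |
        tr -d '*!'               # strip the active/hold markers from the ID
}
```

Review the list first with `postqueue -p | queue_ids_from alerts@example.com`, then pipe it to `postsuper -d -` to delete those messages.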
Advanced systemd troubleshooting
# See all failed units
systemctl list-units --failed
# Find which service is slow at boot
systemd-analyze blame
# See the full critical chain for boot time
systemd-analyze critical-chain
# Show full unit properties (all settings, including computed defaults)
systemctl show nginx
# Show the dependency tree of a unit
systemctl list-dependencies nginx
systemctl list-dependencies nginx --reverse # who depends ON nginx
systemctl show nginx outputs every key=value pair for the unit — useful when a setting from a drop-in is not being picked up, or you want to confirm the actual Restart= or ExecStart= value that systemd is using (not just what the file says).
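To confirm that a drop-in took effect, the effective values can be compared against what you expect. This is a minimal sketch: `check_props` is a hypothetical helper, and the `Restart=on-failure` value in the usage line is only an example setting.

```shell
# check_props KEY=VALUE...: read `systemctl show UNIT` output on stdin
# and report each expected setting whose effective value differs.
check_props() {
    props=$(cat)
    rc=0
    for want in "$@"; do
        key=${want%%=*}
        got=$(printf '%s\n' "$props" |
            awk -v k="$key=" 'index($0, k) == 1 && !seen++')
        if [ "$got" != "$want" ]; then
            echo "mismatch: want $want, got ${got:-<unset>}"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: `systemctl show nginx | check_props Restart=on-failure`. For a single value, `systemctl show -p Restart --value nginx` prints it directly.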