Service-Specific Troubleshooting

Concrete diagnosis steps for nginx, postfix, SSH, NTP, and auth failures.

On this page

General first steps — always
nginx not responding
Postfix mail not sending
SSH connection failing
Time sync problems
Login / authentication failing
DNS resolution failing
Blocked by SELinux
Blocked by firewall
Disk full / inodes exhausted
Postfix queue management
Advanced systemd troubleshooting

General first steps — always

Before diving into service-specific steps, do these first for any broken service:

# 1. Is the service running?
systemctl status SERVICE_NAME

# 2. What do the logs say?
journalctl -u SERVICE_NAME -n 50
journalctl -u SERVICE_NAME --since "10 minutes ago"

# 3. Is it listening on the right port?
ss -tlnp | grep SERVICE_NAME

# 4. Are there recent errors at the system level?
journalctl -p err -b -n 30 --no-pager   # newest 30 error-level entries since boot

# 5. Did something change recently?
git log --oneline -10   # in the ansible repo

nginx not responding

# Step 1: Is it running?
systemctl status nginx
journalctl -u nginx -n 30

# Step 2: Config syntax error?
nginx -t
# If syntax error, it will point to the file and line

# Step 3: Is it listening?
ss -tlnp | grep nginx
# If not listening: service probably failed to start — read the logs

# Step 4: Is the port open in the firewall?
firewall-cmd --list-all | grep -E "ports|services"

# Step 5: Is SELinux blocking it?
grep "type=AVC" /var/log/audit/audit.log | grep nginx | tail -10

# Step 6: Test from outside
curl -v http://HOSTNAME/
curl -vk https://HOSTNAME/   # -k skips cert check

# Step 7: Check if upstream backend is running (for reverse proxy)
curl -v http://127.0.0.1:8080/    # test backend directly

Common nginx failures:

502 Bad Gateway — nginx is up, backend is down or not listening on the expected port
504 Gateway Timeout — backend is too slow or hanging; check proxy_read_timeout
413 Request Entity Too Large — increase client_max_body_size
SSL_ERROR_RX_RECORD_TOO_LONG — HTTPS client connecting to HTTP port; check listen and port config
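
The list above can be turned into a quick lookup. This is a minimal sketch, not part of nginx or curl: explain_status is a hypothetical helper name, and it only knows the status codes listed on this page plus 000, which is what curl prints for %{http_code} when it got no HTTP response at all.

```shell
#!/usr/bin/env bash
# explain_status: hypothetical helper, maps an HTTP status code to a hint.
explain_status() {
    case "$1" in
        502) echo "backend is down or not listening on the expected port" ;;
        504) echo "backend is too slow or hanging; check proxy_read_timeout" ;;
        413) echo "request body too large; increase client_max_body_size" ;;
        000) echo "no HTTP response; check that nginx is running and the port is reachable" ;;
        *)   echo "no hint for status $1; read the nginx error log" ;;
    esac
}
```

Possible usage, with HOSTNAME as a placeholder: explain_status "$(curl -s -o /dev/null -w '%{http_code}' http://HOSTNAME/)"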

Postfix mail not sending

# Step 1: Is postfix running?
systemctl status postfix
journalctl -u postfix -n 30

# Step 2: What is in the queue?
postqueue -p
# Look at the "stuck" messages — the reason is shown

# Step 3: Try to flush the queue and watch what happens
postqueue -f
journalctl -u postfix -f    # watch in another terminal

# Step 4: Test sending manually
echo "test" | mail -s "test" you@example.com
journalctl -u postfix -n 20   # check what happened

Reading queue errors in postqueue -p:

# Connection refused to relayhost
connect to smtp.example.com[10.0.0.2]:25: Connection refused
→ relayhost is down or wrong port in main.cf

# Authentication failure
SASL authentication failed
→ Wrong credentials in sasl_passwd, or sasl_passwd.db not updated (run postmap)

# TLS required but not offered
server requires encryption
→ set smtp_tls_security_level = encrypt (or = may for opportunistic TLS)

# DNS lookup failed
Host or domain name not found. Name service error
→ relayhost hostname does not resolve; check DNS first. Writing it as [smtp.example.com] in main.cf skips the MX lookup, but the name itself must still resolve

# Check DNS resolution of relayhost
dig +short smtp.example.com

# Check TCP connectivity
nc -zv smtp.example.com 587

# Check SASL credentials file
postconf smtp_sasl_password_maps
ls -la /etc/postfix/sasl_passwd.db   # must exist and be newer than sasl_passwd
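
With a large queue it can help to summarise the reasons instead of reading them one by one. The sketch below assumes the error strings shown above appear verbatim in the queue listing; classify_queue_error is a hypothetical helper name, and lines it does not recognise are silently skipped.

```shell
#!/usr/bin/env bash
# classify_queue_error: hypothetical helper. Reads `postqueue -p` output
# (or log lines) on stdin and prints one hint per recognised error line.
classify_queue_error() {
    local line
    while IFS= read -r line; do
        case "$line" in
            *"Connection refused"*)
                echo "relayhost down or wrong port" ;;
            *"SASL authentication failed"*)
                echo "bad SASL credentials or stale sasl_passwd.db" ;;
            *"server requires encryption"*)
                echo "relay requires TLS; check smtp_tls_security_level" ;;
            *"Host or domain name not found"*)
                echo "relayhost does not resolve" ;;
        esac
    done
}
```

Possible usage: postqueue -p | classify_queue_error | sort | uniq -c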

SSH connection failing

# Step 1: Test with verbose output
ssh -vvv user@host 2>&1 | head -50
# Look for:
# - "Connecting to host port 22" — network connectivity
# - "Authentications that can continue" — what the server accepts
# - "No more authentication methods to try" — key not accepted

# Step 2: Is sshd running on the target?
systemctl status sshd

# Step 3: Is port 22 open?
ss -tlnp | grep sshd
firewall-cmd --list-all | grep ssh

# Step 4: Key issues
# Check the key is in authorized_keys (authorized_keys lives on the target; copy the .pub file there or paste the key string)
grep -F "$(cut -d' ' -f2 ~/.ssh/id_ed25519.pub)" ~/.ssh/authorized_keys

# Check permissions (sshd's StrictModes rejects group- or world-writable files)
ls -la ~/.ssh/                 # dir: 700
ls -la ~/.ssh/authorized_keys  # file: 600

# Step 5: Check SELinux
restorecon -Rv ~/.ssh/         # fix any context issues

# Step 6: Check sshd logs on target
journalctl -u sshd -n 30

Common SSH error messages:

Connection refused — sshd not running, or wrong port, or firewall blocking
Connection timed out — network unreachable or firewall silently dropping
Permission denied (publickey) — key not in authorized_keys, wrong permissions, or key type not accepted
Host key verification failed — host key changed (or known_hosts is stale); remove the old entry with ssh-keygen -R hostname
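
Since wrong permissions are a frequent cause of Permission denied (publickey), the check from Step 4 can be scripted. A minimal sketch, assuming GNU stat (the -c option) and the 700/600 modes given above; check_ssh_perms is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_ssh_perms: hypothetical helper. Pass the path to a .ssh directory.
# Returns 0 if modes match, 1 if any mismatch was printed, 2 if stat failed.
check_ssh_perms() {
    local dir=$1 rc=0 mode
    mode=$(stat -c '%a' "$dir") || return 2
    if [ "$mode" != "700" ]; then
        echo "$dir has mode $mode, expected 700"
        rc=1
    fi
    if [ -e "$dir/authorized_keys" ]; then
        mode=$(stat -c '%a' "$dir/authorized_keys")
        if [ "$mode" != "600" ]; then
            echo "$dir/authorized_keys has mode $mode, expected 600"
            rc=1
        fi
    fi
    return "$rc"
}
```

Possible usage on the target host: check_ssh_perms ~/.ssh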

Time sync problems

# Step 1: Is chrony running?
systemctl status chronyd

# Step 2: Is it synced?
chronyc tracking
# Look for "System time" — should be small (milliseconds)
# Look for "Leap status: Normal" — not "Not synchronised"

# Step 3: What sources is it using?
chronyc sources -v
# '*' = currently synced source
# '+' = acceptable source
# '?' = unreachable source

# Step 4: Can it reach the NTP servers?
chronyc sourcestats
ping ntp1.example.com

# Step 5: Force a sync (if clock is far off)
chronyc makestep
# (the -a flag seen in older guides is ignored by current chronyc)

# Step 6: Check the config
grep -E '^(server|pool)' /etc/chrony.conf

If all NTP sources show ? (unreachable):

# DNS check
dig +short ntp1.example.com

# Connectivity check (NTP uses UDP port 123)
# UDP is connectionless: "succeeded" only means no ICMP error came back, not that NTP answered
nc -zuv ntp1.example.com 123

# Firewall check
firewall-cmd --list-all | grep -E "ntp|123"
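
For monitoring or a pre-flight check, the "is it synced?" test from Step 2 can be reduced to an exit code. This is a sketch that assumes the usual "Key : value" layout of chronyc tracking output; chrony_synced is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# chrony_synced: hypothetical helper. Reads `chronyc tracking` output on
# stdin, prints the leap status, and succeeds only if it is "Normal".
chrony_synced() {
    awk -F' *: *' '
        $1 == "Leap status" { status = $2 }
        END {
            print "leap status: " (status == "" ? "unknown" : status)
            exit (status == "Normal" ? 0 : 1)
        }'
}
```

Possible usage: chronyc tracking | chrony_synced || echo "clock is not synchronised"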

Login / authentication failing

# Step 1: Is SSSD running?
systemctl status sssd

# Step 2: Can SSSD resolve the user?
id username@example.com

# Step 3: Test HBAC rules
ipa hbactest --user=username --host="$(hostname -f)" --service=sshd   # detailed rule output is the default

# Step 4: Check Kerberos
kinit username@EXAMPLE.COM
klist    # see if a ticket was issued

# Step 5: Check time sync (Kerberos fails with clock skew > 5 min)
chronyc tracking | grep "System time"
date    # compare with date on the IPA server

# Step 6: SSSD logs
tail -f /var/log/sssd/sssd_example.com.log
journalctl -u sssd -n 50

# Step 7: PAM auth logs
journalctl -u sshd -n 20    # sshd pam logs
tail -f /var/log/secure
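
Step 5 compares two clocks by eye; the same comparison can be done numerically. A minimal sketch, assuming both times are available as Unix timestamps and using the 5 minute (300 s) limit mentioned above as the default; clock_skew_ok is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# clock_skew_ok: hypothetical helper.
# Arguments: local timestamp, remote timestamp, optional limit in seconds.
# Prints the absolute skew and succeeds if it is within the limit.
clock_skew_ok() {
    local local_ts=$1 remote_ts=$2 limit=${3:-300}
    local skew=$(( local_ts - remote_ts ))
    if [ "$skew" -lt 0 ]; then
        skew=$(( -skew ))
    fi
    echo "skew: ${skew}s (limit ${limit}s)"
    [ "$skew" -le "$limit" ]
}
```

Possible usage, where ipa.example.com stands in for your IPA server: clock_skew_ok "$(date +%s)" "$(ssh ipa.example.com date +%s)"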

DNS resolution failing

# Step 1: Basic test
dig example.com
nslookup example.com

# Step 2: Which resolver is being used?
cat /etc/resolv.conf
resolvectl status   # on systemd-resolved systems

# Step 3: Test with a specific resolver
dig example.com @8.8.8.8
dig example.com @10.0.0.10   # your internal DNS

# Step 4: Is the resolver reachable?
nc -zuv 10.0.0.10 53   # UDP
nc -zv  10.0.0.10 53   # TCP (used for large responses)

# Step 5: Check /etc/hosts for overrides
grep example.com /etc/hosts

# Step 6: Check nsswitch.conf
grep hosts /etc/nsswitch.conf   # should be: files dns
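
Steps 2 and 3 can be combined by querying every configured resolver in turn. The sketch below only extracts the resolver addresses; list_nameservers is a hypothetical helper name, and it assumes the plain "nameserver ADDRESS" syntax of resolv.conf.

```shell
#!/usr/bin/env bash
# list_nameservers: hypothetical helper. Prints the resolver addresses from
# a resolv.conf-style file (default: /etc/resolv.conf), one per line.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Possible usage: for ns in $(list_nameservers); do echo "== $ns"; dig +short +time=2 +tries=1 example.com "@$ns"; done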

Blocked by SELinux

# Step 1: Is SELinux in enforcing mode?
getenforce

# Step 2: Check for recent denials
ausearch -m avc -ts recent | tail -20
grep "type=AVC" /var/log/audit/audit.log | tail -10

# Step 3: Explain the denial
ausearch -m avc -ts recent | audit2why

# Step 4: Quick test — switch to permissive temporarily
setenforce 0
# retry the operation
# if it works in permissive, SELinux is the cause
setenforce 1

# Step 5: Fix it properly
# Check for a boolean that covers this use case:
getsebool -a | grep relevant_keyword
setsebool -P boolean_name on

# Or fix a file context:
semanage fcontext -a -t correct_type_t "/path/to/files(/.*)?"
restorecon -Rv /path/to/files/
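
When the audit log is noisy, a per-command count shows which service SELinux is actually denying. This is a sketch that assumes raw audit records containing type=AVC and a comm="..." field, as in /var/log/audit/audit.log; avc_by_command is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# avc_by_command: hypothetical helper. Reads audit records on stdin and
# prints "COUNT COMMAND" for AVC denials, most frequent first.
avc_by_command() {
    awk '/type=AVC/ && match($0, /comm="[^"]*"/) {
             # skip the 6 characters of comm=" and drop the closing quote
             count[substr($0, RSTART + 6, RLENGTH - 7)]++
         }
         END { for (c in count) print count[c], c }' | sort -rn
}
```

Possible usage: ausearch -m avc -ts recent | avc_by_command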

Blocked by firewall

# Step 1: Check what is allowed
firewall-cmd --list-all

# Step 2: Test from the client side
nc -zv targethost port
curl -v http://targethost:port

# Step 3: Verify traffic is reaching the server at all
tcpdump -i eth0 port PORT   # run on the server; check if packets arrive

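# If packets arrive but the client is still refused or times out:
# → tcpdump captures before the local firewall, so check both firewalld on this host
#   (ICMP "admin prohibited" replies) and the service itself (down or listening on
#   the wrong interface; the reply is a TCP RST)

# If no packets arrive:
# → something upstream is blocking or misrouting (network firewall, security group, routing)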

# Step 4: Add the rule
firewall-cmd --permanent --add-port=PORT/tcp
firewall-cmd --reload
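
Before adding a rule in Step 4, it is worth confirming the port is not already open. A minimal sketch, assuming the space-separated PORT/PROTO format printed by firewall-cmd --list-ports; port_listed is a hypothetical helper name. It does not detect ports opened through a named service (such as http) or through a port range.

```shell
#!/usr/bin/env bash
# port_listed: hypothetical helper. Reads `firewall-cmd --list-ports`
# output on stdin and succeeds if PORT/PROTO is listed exactly.
# Arguments: port number, optional protocol (default tcp).
port_listed() {
    local want="$1/${2:-tcp}"
    tr -s ' ' '\n' | grep -qxF "$want"
}
```

Possible usage: firewall-cmd --list-ports | port_listed 8080 || firewall-cmd --permanent --add-port=8080/tcp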

Disk full / inodes exhausted

# Step 1: Check disk space AND inodes
df -h             # blocks
df -i             # inodes (a full inode table looks like free space but still fails)

# Step 2: Find what is using space — stay on the same filesystem with -x
du -xhd1 /var    | sort -rh | head      # top dirs in /var only, don't cross mounts
du -xhd1 /       | sort -rh | head      # top dirs at root level
du -sh /var/log/* | sort -rh | head
du -sh /var/spool/* | sort -rh | head

# Large individual files
find /var/log -xdev -size +100M -printf '%s\t%p\n' | sort -rn | head

# Inodes exhausted but df -h shows space free? Find dirs with many tiny files
find / -xdev -type d -printf '%p\n' 2>/dev/null | while IFS= read -r d; do echo "$(ls -A "$d" 2>/dev/null | wc -l) $d"; done | sort -rn | head

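# Step 3: Check mail queue size
postqueue -p | tail -n 1                            # summary line: "-- N Kbytes in M Requests."
find /var/spool/postfix/deferred -type f | wc -l    # deferred/ is split into hashed subdirectories, so a plain ls undercounts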

# Step 4: Rotate or truncate logs safely
journalctl --vacuum-size=2G
logrotate -f /etc/logrotate.conf

# Step 5: Truncate a large log that a service still has open
> /var/log/some.log    # truncates in place — preserves inode and permissions
# DO NOT rm a log file that a running service has open — the blocks stay allocated
# until the service closes its file descriptor. Use lsof +L1 to find such cases:
lsof +L1               # files that are deleted but still open — space still in use

# Step 6: If the filesystem is on LVM, extend it online instead of cleaning up
lvextend -r -L +5G /dev/vg_data/lv_var       # -r resizes the filesystem too

Disk full kills services silently. nginx cannot write access logs, postfix cannot write to the queue, and SSSD cannot cache. Always check df -h and df -i early — inode exhaustion looks like "plenty of space" to most metrics but still returns ENOSPC. For the LVM extend recipe and how to tell if a volume can grow online, see LVM — Logical Volume Manager.
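
Because both block and inode usage matter, one filter can be applied to both reports. This is a sketch that assumes the POSIX output format of df -P and df -Pi, where the fifth column is the percentage and the sixth is the mount point; fs_over_threshold is a hypothetical helper name, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# fs_over_threshold: hypothetical helper. Reads `df -P` or `df -Pi` output
# on stdin and prints "MOUNTPOINT PCT%" for every filesystem at or above
# the threshold (first argument, default 90).
fs_over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%$/, "", pct)
            # skip entries such as "-" that carry no percentage
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, pct "%"
        }'
}
```

Possible usage: df -P | fs_over_threshold 90; df -Pi | fs_over_threshold 90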

Postfix queue management

# Show the queue (deferred, active, hold)
postqueue -p
mailq               # same, shorthand

# Count queued messages
postqueue -p | grep -c "^[0-9A-F]"

# Flush: attempt to deliver all deferred messages now
postqueue -f

# Delete a specific message by queue ID
postsuper -d QUEUEID

# Delete all deferred messages (use with care)
postsuper -d ALL deferred

# Inspect a specific message including headers
postcat -q QUEUEID

# Delete all messages in queue (emergency only)
postsuper -d ALL

postsuper -d ALL deferred only deletes messages stuck in the deferred queue — messages being actively delivered are unaffected. Use this when a large backlog of undeliverable messages is consuming disk space.
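
Before deleting anything, it helps to see how the queue splits by state. In postqueue -p output a queue ID followed by * is in the active queue and one followed by ! is on hold; IDs without a marker are waiting (incoming or deferred). The sketch below counts each group; queue_by_state is a hypothetical helper name, and it assumes the default short hexadecimal queue IDs, like the grep pattern above.

```shell
#!/usr/bin/env bash
# queue_by_state: hypothetical helper. Reads `postqueue -p` output on stdin
# and prints message counts for active (*), hold (!), and unmarked entries.
queue_by_state() {
    awk '
        $1 ~ /^[0-9A-F]+\*$/ { active++; next }
        $1 ~ /^[0-9A-F]+!$/  { hold++;   next }
        $1 ~ /^[0-9A-F]+$/   { other++ }
        END { printf "active=%d hold=%d other=%d\n", active, hold, other }'
}
```

Possible usage: postqueue -p | queue_by_state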

Advanced systemd troubleshooting

# See all failed units
systemctl list-units --failed

# Find which service is slow at boot
systemd-analyze blame

# See the full critical chain for boot time
systemd-analyze critical-chain

# Show full unit properties (all settings, including computed defaults)
systemctl show nginx

# Show the dependency tree of a unit
systemctl list-dependencies nginx
systemctl list-dependencies nginx --reverse   # who depends ON nginx

systemctl show nginx outputs every key=value pair for the unit — useful when a setting from a drop-in is not being picked up, or you want to confirm the actual Restart= or ExecStart= value that systemd is using (not just what the file says).
