Generic
This page turns a broad infra troubleshooting memo into a single reference you can reuse across tickets. Start by narrowing the symptom, then validate dependencies, test safely, confirm the fix, and document the rollback path before you call it done.
- confirm the symptom and scope before changing anything
- check service state, logs, config, ports, and recent changes
- solve plain connectivity before diving into TLS or app logic
- test one dependency at a time: DNS, trust, identity, database, proxy
- validate the fix with a real transaction, not just a green service state
- leave behind root cause, evidence, and rollback notes
- Universal troubleshooting sequence
- Core commands
- Generic Ansible troubleshooting
- Generic systemd service failure workflow
- Generic TLS and certificate troubleshooting
- Generic LDAP, LDAPS, and identity troubleshooting
- Generic rsyslog troubleshooting
- Generic Postfix troubleshooting
- Generic Dovecot troubleshooting
- Generic Keycloak and federation troubleshooting
- Generic Openfire and Java app upgrade troubleshooting
- Windows and FreeIPA reality check
- Review, rollout, and rollback checks
- Sprint and ticket framing language
- Quick symptom-to-action map
- Best-practice reminders
- One-screen compact version
Universal troubleshooting sequence
Use this sequence first, even when you think you already know the answer. It keeps you from skipping basic checks and blaming the wrong layer.
- Confirm the symptom. Be specific: startup, auth, network, TLS, config drift, or data flow.
- Confirm the scope. One host or many? One user or all users? Dev only, test only, or prod?
- Check the service state. Review the unit status, logs, config validation output, and whether the expected port is listening.
- Check recent changes. Look for role or template updates, package upgrades, cert renewals, DNS changes, firewall edits, or runtime changes.
- Validate dependencies. DNS, ports, trust, identity, database, reverse proxy, and upstream services.
- Reproduce safely. Use a manual test command, stage it if possible, and pilot risky changes on one host first.
- Validate the fix. Healthy service state, port listening, expected logs, and a working end-to-end test.
- Document. Capture root cause, exact change, evidence, and rollback path.
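A minimal sketch of the dependency step in this sequence, assuming getent and nc are installed. The check and triage_connectivity names are illustrative and not part of any tool. Each layer reports on its own, and the run stops at the first failure so a later layer is not blamed by mistake.

```shell
# Run one named check; print a result line and pass the command's status through.
check() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Hypothetical helper: name resolution first, then plain TCP reachability.
triage_connectivity() {
  host="$1"; port="$2"
  check "dns $host" getent hosts "$host" &&
  check "tcp $host:$port" nc -z -w 3 "$host" "$port"
}
```

Example call: triage_connectivity ldap.example.com 636. Only move on to TLS or application checks once both lines report ok.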
Core commands
These are the checks you reach for most often when triaging a Linux or service issue.
Service checks
systemctl status SERVICE -l --no-pager
journalctl -u SERVICE -b --no-pager
journalctl -xeu SERVICE
systemctl restart SERVICE
systemctl reload SERVICE
systemctl is-active SERVICE
systemctl is-enabled SERVICE
Port and listener checks
ss -lntup
ss -lntp | grep PORT
netstat -plnt
lsof -i :PORT
Firewall and connectivity
firewall-cmd --list-all
firewall-cmd --list-ports
ping HOST
nc -zv HOST PORT
telnet HOST PORT
tcpdump -nn host HOST and port PORT
DNS and name resolution
hostname -f
getent hosts HOST
dig HOST
nslookup HOST
resolvectl status
TLS and certificate checks
openssl s_client -connect HOST:PORT -servername HOST
openssl x509 -in CERTFILE -text -noout
keytool -list -keystore TRUSTSTORE
Ansible validation
ansible-playbook PLAYBOOK.yml --syntax-check
ansible-playbook PLAYBOOK.yml --check
ansible-playbook PLAYBOOK.yml --diff
ansible-inventory --graph
ansible-inventory --list
Config validation
rsyslogd -N1
postfix check
dovecot -n
nginx -t
apachectl configtest
python -m py_compile FILE.py
Generic Ansible troubleshooting
If automation broke something, work in this order so you can separate bad input data from bad role logic.
- Inventory: verify the expected host or group was targeted.
- group_vars and host_vars: verify environment values and overrides.
- Tasks: inspect what changed and whether conditionals and idempotency still make sense.
- Templates: confirm Jinja rendered valid config and whitespace did not break syntax.
- Handlers: make sure the right restart or reload happened, and that validation runs before the bounce when appropriate.
- Repeated runs: confirm the role stays clean on rerun instead of drifting or restarting every time.
- Validation before restart: never bounce a service blindly when a config test exists.
Good Ansible pattern
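# tasks
- name: Deploy config
  template:
    src: service.conf.j2
    dest: /etc/service/service.conf
    owner: root
    group: root
    mode: "0644"
  # notify must match the handler names below exactly
  notify:
    - Validate service config
    - Restart service

# handlers: these run in the order they are defined, so validation runs before the restart
- name: Validate service config
  command: service-binary validation-flag
  changed_when: false

- name: Restart service
  service:
    name: service
    state: restarted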
What to look for in roles
- hardcoded values instead of variables
- service restart without validation
- tasks that are not idempotent
- templates missing guard clauses
- variables split across too many files
- one role doing too many unrelated things
- missing or miswired handlers
Treat --check --diff as an early warning system, not proof that the deployment is safe. Modules like command and shell are generally skipped in check mode, so the preview can miss what a real run would do.
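One way to make the preview a habit is a small wrapper, sketched here under the assumption that ansible-playbook is on the PATH. The ansible_preflight name and the ANSIBLE_PLAYBOOK override variable are illustrative. The wrapper runs the syntax check first and only then the check-mode diff, passing any extra arguments to both.

```shell
# ANSIBLE_PLAYBOOK can be overridden, for example with a stub when testing.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# Usage: ansible_preflight PLAYBOOK [EXTRA ARGS...]
ansible_preflight() {
  playbook="$1"; shift
  # Stop early if the playbook does not even parse.
  "$ANSIBLE_PLAYBOOK" "$playbook" --syntax-check "$@" || return 1
  # Preview changes without applying them.
  "$ANSIBLE_PLAYBOOK" "$playbook" --check --diff "$@"
}
```

Example call: ansible_preflight PLAYBOOK.yml --limit PILOT_HOST. A clean result here is still only a preview.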
Generic systemd service failure workflow
Use this when a service will not start, exits immediately, or flaps on restart.
- Check systemctl status SERVICE.
- Read journalctl -u SERVICE -b.
- Inspect the unit file with systemctl cat SERVICE.
- Verify ExecStart, WorkingDirectory, User, Group, EnvironmentFile, permissions, and dependent services.
- Run the service command manually as the service user if possible.
- Check for a port conflict.
- Check dependent config, cert, database, auth, or upstream endpoints.
Common causes
- wrong binary path
- wrong virtualenv or Java path
- invalid config file
- port already in use
- missing environment variable
- file permission issue
- service account cannot read cert or key
- dependency service is down
Fix pattern
- correct the unit or config
- validate the config
- restart the service
- confirm active state
- test the real function end to end
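The validate, restart, and confirm steps of this fix pattern can be sketched as one guarded function. The safe_restart name and the SYSTEMCTL override variable are assumptions for illustration. The validation command is whatever config test the service provides.

```shell
# SYSTEMCTL can be overridden, for example with a stub when testing.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Usage: safe_restart SERVICE VALIDATION_COMMAND [ARGS...]
safe_restart() {
  service="$1"; shift
  # Refuse to bounce the service when its config test fails.
  if ! "$@"; then
    echo "config validation failed; $service was not restarted" >&2
    return 1
  fi
  "$SYSTEMCTL" restart "$service" || return 1
  # Confirm the unit is active after the restart.
  "$SYSTEMCTL" is-active --quiet "$service"
}
```

Example call: safe_restart rsyslog rsyslogd -N1. An active unit is not the end of the check; the end-to-end test still follows.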
Generic TLS and certificate troubleshooting
Use this for LDAPS, SMTP TLS, IMAPS, rsyslog TLS, or web admin TLS problems.
Checklist
- Is the service presenting the expected certificate?
- Is the certificate expired?
- Does the CN or SAN match the hostname in use?
- Does the client trust the issuing CA?
- Is the full chain present?
- Are you using the right port?
- Are STARTTLS and implicit TLS being mixed up?
Quick commands
openssl s_client -connect host:636 -servername host
openssl s_client -connect host:6514 -servername host
openssl s_client -starttls smtp -connect host:25
openssl s_client -starttls imap -connect host:143
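To avoid mixing up STARTTLS and implicit TLS, the port conventions used in the commands above can be written down once. This is a sketch: s_client_args is a hypothetical helper, the port list covers only common defaults, and any port not listed is assumed to be implicit TLS, which may be wrong for your service.

```shell
# Print openssl s_client arguments suited to the usual role of a port.
# Usage: s_client_args HOST PORT
s_client_args() {
  host="$1"; port="$2"
  case "$port" in
    25|587) proto=smtp ;;   # SMTP and submission upgrade with STARTTLS
    143)    proto=imap ;;
    110)    proto=pop3 ;;
    389)    proto=ldap ;;
    *)      proto= ;;       # 465, 636, 993, 995, 6514 and others: implicit TLS assumed
  esac
  if [ -n "$proto" ]; then
    echo "-starttls $proto -connect $host:$port -servername $host"
  else
    echo "-connect $host:$port -servername $host"
  fi
}
```

Example call, with the expansion deliberately unquoted so it splits into separate arguments: openssl s_client $(s_client_args HOST 143) </dev/null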
Common causes
- hostname mismatch
- CA missing from truststore
- incomplete chain
- key and cert mismatch
- wrong listener or port
- STARTTLS enabled when the client expects LDAPS, or the reverse
Fix pattern
- confirm the hostname clients use
- deploy the proper cert, key, and chain
- import the CA into the truststore where needed
- retest with openssl
- then retest the app or service
If openssl s_client cannot complete the handshake or shows the wrong certificate, stay in the TLS layer until that is fixed.
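Two of the causes listed above, an expired certificate and a key and cert mismatch, can be checked on disk before any restart. A minimal sketch, assuming PEM files and the openssl CLI; the function names are illustrative.

```shell
# Succeeds when the certificate and the private key carry the same public key.
# Usage: cert_matches_key CERTFILE KEYFILE
cert_matches_key() {
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 2
  key_pub=$(openssl pkey -in "$2" -pubout) || return 2
  [ "$cert_pub" = "$key_pub" ]
}

# Succeeds when the certificate is still valid DAYS days from now (default 30).
# Usage: cert_valid_for_days CERTFILE [DAYS]
cert_valid_for_days() {
  openssl x509 -in "$1" -noout -checkend "$(( ${2:-30} * 86400 ))" >/dev/null
}
```

Example call: cert_matches_key CERTFILE KEYFILE && cert_valid_for_days CERTFILE 14. These checks say nothing about chain or hostname, so the openssl s_client retest is still required.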
Generic LDAP, LDAPS, and identity troubleshooting
Use this for FreeIPA, Keycloak federation, bind failures, group mapping, and sync or login issues.
Troubleshooting order
- network reachability
- DNS resolution
- certificate trust, if using LDAPS
- bind DN works
- search base works
- user search works
- group search works
- role or group mapper works
- real app login works
Common causes
- wrong bind DN
- wrong user or group search base
- CA not trusted
- port 636 used incorrectly
- STARTTLS and LDAPS confusion
- group mapping not aligned with app expectations
- cache not refreshed after changes
Fix pattern
- test LDAP or LDAPS independently first
- only then update the application
- validate user sync
- validate group sync
- validate one real login
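Testing LDAP independently usually means running the same bind and search the application would run. The sketch below assumes the OpenLDAP ldapsearch client; ldap_lookup and the LDAPSEARCH override variable are illustrative names, and every DN and filter in the comments is a placeholder.

```shell
# LDAPSEARCH can be overridden, for example with a stub when testing.
LDAPSEARCH="${LDAPSEARCH:-ldapsearch}"

# Usage: ldap_lookup URI BIND_DN BASE_DN FILTER
# -x simple bind, -W prompt for the bind password, -LLL terse LDIF output.
# Only the dn attribute is requested, to keep the output short.
ldap_lookup() {
  "$LDAPSEARCH" -x -LLL -H "$1" -D "$2" -W -b "$3" "$4" dn
}

# Placeholder examples, one dependency at a time:
#   ldap_lookup ldaps://HOST "BIND_DN" "USER_BASE"  "(uid=USER)"
#   ldap_lookup ldaps://HOST "BIND_DN" "GROUP_BASE" "(cn=GROUP)"
```

A failure before any result is printed typically points at reachability, trust, or the bind DN. A clean run with no entries typically points at the search base or filter.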
Generic rsyslog troubleshooting
Use this when logs do not arrive, TCP forwarding fails, or TLS forwarding breaks.
Checks
rsyslogd -N1
systemctl status rsyslog
journalctl -u rsyslog -n 100 --no-pager
ss -lntup | grep 514
ss -lntup | grep 6514
grep -R "imtcp\|omfwd\|gtls\|@@\|@" /etc/rsyslog* -n
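A plain substring match on a port number is loose: grep 514 also matches 6514. A small filter that compares the exact port is sketched below. The port_in_listeners name is illustrative, and the field positions assume the usual column layout of ss -lnt or ss -lntup output.

```shell
# Reads ss -lnt or ss -lntup output on stdin.
# Succeeds only if a local address ends in exactly :PORT.
# Usage: ss -lntup | port_in_listeners PORT
port_in_listeners() {
  awk -v p="$1" '
    # The local address is field 4 without the Netid column and field 5 with it.
    { for (i = 4; i <= 5; i++) if ($i ~ (":" p "$")) found = 1 }
    END { exit !found }'
}
```

Example call: ss -lntup | port_in_listeners 514 && echo "listening on 514"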
Common causes
- server listening on the wrong port
- sender using UDP while the receiver expects TCP
- TLS configured on only one side
- firewall blocking 514 or 6514
- invalid template syntax
- duplicate input blocks after rerun
Fix pattern
- decide the transport first: UDP, TCP, or TCP plus TLS
- align both ends
- validate the config
- restart rsyslog
- send a test log with logger
- confirm arrival at the receiver
Test
logger -n SERVER -P 514 -T "tcp syslog test"
Generic Postfix troubleshooting
Use this when mail is not relaying correctly, TLS is failing, auth is broken, or relay policy looks wrong.
Checks
postconf -n
postconf mynetworks mydestination relay_domains smtpd_recipient_restrictions
postfix check
journalctl -u postfix -n 100 --no-pager
tail -f /var/log/maillog
Common causes
- mynetworks too broad
- relay_domains and mydestination confusion
- weak or wrong recipient restrictions
- TLS cert missing or invalid
- submission settings not aligned with auth policy
Fix pattern
- lock down relay policy
- validate the config
- reload Postfix
- test the expected relay path
- test the denied relay path
- review the logs
Useful tests
swaks --server MAILHOST --from test@external.example --to user@external.example
swaks --server MAILHOST --port 587 --tls --auth \
--auth-user USER --auth-password 'PASSWORD' --to user@example.com
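For the "mynetworks too broad" cause, a rough first pass is to flag wide IPv4 ranges in the configured value. This is a sketch with stated limits: broad_mynetworks is a hypothetical helper, the /16 default threshold is an arbitrary assumption, loopback is skipped, and IPv6 entries and lookup tables are ignored.

```shell
# Reads a mynetworks value on stdin and prints IPv4 CIDR entries whose prefix
# is shorter than MIN_PREFIX (default 16). Succeeds if any entry was printed.
# Usage: postconf -h mynetworks | broad_mynetworks [MIN_PREFIX]
broad_mynetworks() {
  min="${1:-16}"
  # Put one entry per line; entries may be separated by commas or whitespace.
  tr ',\t ' '\n\n\n' | awk -F/ -v min="$min" '
    NF == 2 && $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $2 ~ /^[0-9]+$/ &&
    $1 !~ /^127\./ && ($2 + 0) < (min + 0) { print; found = 1 }
    END { exit !found }'
}
```

Anything this prints deserves a second look, but an empty result does not prove the relay policy is tight. The swaks allowed and denied tests remain the real check.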
Generic Dovecot troubleshooting
Use this for IMAP or POP login failures, TLS errors, client trust issues, or mailbox sync trouble.
Checks
dovecot -n
doveconf -n | grep -E 'ssl|auth|imap|pop3|lmtp'
journalctl -u dovecot -n 100 --no-pager
openssl s_client -connect HOST:993
openssl s_client -starttls imap -connect HOST:143
Common causes
- wrong cert path
- key unreadable by the service
- disable_plaintext_auth does not match client expectations
- client using the wrong protocol or port
Fix pattern
- confirm TLS and auth settings
- validate cert and key readability
- restart Dovecot
- test with an IMAP TLS client
- validate mailbox access in a real client
Generic Keycloak and federation troubleshooting
Use this when Keycloak cannot connect to LDAP or LDAPS, sync works but logins fail, or role mapping differs by environment.
Checks
- verify the LDAP or LDAPS endpoint is reachable
- confirm the CA is trusted by Java and Keycloak
- verify bind DN, search base, and group base
- test the connection in Keycloak
- test user sync
- test group mapping
- test application login end to end
Common causes
- truststore missing the CA
- wrong URL or port
- STARTTLS and LDAPS confusion
- mapper mismatch
- application expecting roles that are never produced
Fix pattern
- fix trust first
- then fix federation settings
- then retest sync
- then retest app auth
Generic Openfire and Java app upgrade troubleshooting
SysRef does not have a dedicated Openfire page; this section is the in-site coverage. Use it for Openfire or similar Java services that fail after an upgrade.
Checks
java -version
systemctl status openfire
systemctl cat openfire
journalctl -u openfire -b --no-pager
tail -n 200 /opt/openfire/logs/error.log
tail -n 200 /opt/openfire/logs/info.log
Common causes
- incompatible Java version
- plugin incompatibility
- database migration issue
- old config carried into the new version badly
- systemd environment still pointing at the old install
Fix pattern
- take a snapshot or backup first
- test in staging
- confirm Java prerequisites
- inventory installed plugins
- upgrade the application
- watch the first startup closely
- validate admin UI and client connectivity
- keep the rollback path ready
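Confirming Java prerequisites is easier with the major version as a plain number. The sketch below parses the first version line printed by java -version. The java_major name is illustrative, and the parsing assumes the common quoted version formats such as "1.8.0_392" and "17.0.9".

```shell
# Reads java -version output on stdin and prints the major version.
# Legacy 1.x strings map to x, so "1.8.0_392" prints 8.
# Usage: java -version 2>&1 | java_major
java_major() {
  awk -F'"' '/version "/ {
    split($2, v, /[._-]/)
    print (v[1] == 1 ? v[2] : v[1])
    exit
  }'
}
```

java -version writes to stderr, hence the 2>&1 in the usage line. Compare the result with what the target release documents before upgrading the application.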
Windows and FreeIPA reality check
Do not assume FreeIPA gives you native AD-style Windows policy, GPO, or admin behavior. Clarify the actual requirement before you recommend a design.
Ask these questions
- Are we solving authentication only?
- Or policy management too?
- Or local admin rights too?
- Is there an AD trust model or not?
- Is this a pilot workaround or a supported long-term design?
Safe conclusion pattern
If native AD or GPO-like behavior is required, validate whether Active Directory or an AD trust is required. If this is only a pilot or local-admin workaround, use scripts, endpoint tooling, or automation instead of pretending FreeIPA is a full Windows policy plane.
Review, rollout, and rollback checks
Before-merge peer review checklist
- syntax is valid
- template renders correctly
- variables are environment-safe
- no hardcoded secrets
- role is idempotent
- repeated run causes no drift
- service config is validated before restart
- handler behavior is correct
- firewall and ports are aligned
- rollback path is documented
- functional validation is complete
Rollout order for risky changes
- local syntax and config validation
- one-host staging test
- functional test
- peer review
- merge
- controlled deploy
- service validation
- log review
- user or app validation
- monitor for regression
Rollback checklist
Define rollback before risky changes, not during the incident.
- config backup taken
- service unit backup taken
- database backup or snapshot taken, if applicable
- old package or install path retained
- previous cert or truststore version retained
- previous Ansible vars or template recoverable
- tested restore or revert procedure documented
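For the config and unit file items, a timestamped copy taken before the change is often enough. A minimal sketch; backup_file and restore_file are illustrative names, and two backups of the same file within one second would collide.

```shell
# Copy FILE to FILE.bak.TIMESTAMP, preserving mode and timestamps,
# and print the backup path so it can go into the ticket.
backup_file() {
  src="$1"
  dest="$src.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$src" "$dest" && echo "$dest"
}

# Put a named backup back in place.
# Usage: restore_file BACKUP TARGET
restore_file() {
  cp -p "$1" "$2"
}
```

Example call: backup_file /etc/service/service.conf. Record the printed path with the rollback notes, and run the restore once on a test host so the procedure is known to work.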
Sprint and ticket framing language
Status language
- Resolved: issue reproduced, fix applied, validation complete, ready to close
- Patched: code or config updated, validation complete or pending final deployment check
- In progress: change implemented, awaiting review, testing, or dependency action
- Planned: objective defined, approach known, dependencies identified
- Blocked: cannot progress until dependency, approval, or access is provided
Useful next-steps language
- peer review and merge
- stage deploy and validate
- monitor logs for regression
- confirm maintenance window
- validate pilot host behavior
- confirm certificate and trust path
- document rollback and runbook
Quick symptom-to-action map
SERVICE WON'T START
-> systemctl status
-> journalctl
-> config validation
-> unit file
-> port check
AUTH FAILS
-> DNS
-> bind and search base
-> group mapping
-> trust and cert if LDAPS
-> app mapper or role mapping
TLS FAILS
-> openssl s_client
-> hostname match
-> cert chain
-> truststore
-> right protocol and port
NO LOGS ARRIVE
-> listener port
-> sender config
-> protocol alignment
-> firewall
-> rsyslog validation
MAIL RELAY WRONG
-> postconf -n
-> recipient restrictions
-> mynetworks
-> relay domains
-> swaks tests
UPGRADE FAILED
-> runtime version
-> logs
-> config drift
-> plugin compatibility
-> database migration
-> rollback readiness
Best-practice reminders
- validate config before restart
- solve plain connectivity before solving TLS
- solve TLS before blaming the application
- solve LDAP bind and search before blaming role mapping
- pilot risky changes on one host first
- do not assume Windows behaves like Linux with FreeIPA
- keep service-specific roles small and clear
- keep baseline roles reusable
- separate provisioning roles from service roles
- always know your rollback path
One-screen compact version
INFRA TROUBLESHOOTING QUICK FLOW
1. Confirm symptom
2. Confirm scope
3. systemctl status / journalctl
4. Validate config
5. Check listener and port
6. Check DNS and connectivity
7. Check cert and trust if TLS
8. Check auth and group mapping if identity
9. Check recent Ansible or template change
10. Test one real transaction
11. Fix
12. Re-test
13. Document root cause and rollback
CORE COMMANDS
- systemctl status SERVICE -l --no-pager
- journalctl -u SERVICE -b --no-pager
- ss -lntup
- openssl s_client -connect host:port -servername host
- rsyslogd -N1
- postfix check
- dovecot -n
- ansible-playbook PLAYBOOK.yml --check --diff
GOLDEN RULES
- validate before restart
- align both ends of a connection
- pilot first
- trust and certs before app blame
- rollback always ready