Generic

Reusable troubleshooting and resolution cheat sheet for infrastructure tickets, service failures, identity problems, TLS issues, and rollout risk checks.

This page turns a broad infra troubleshooting memo into a single reference you can reuse across tickets. Start by narrowing the symptom, then validate dependencies, test safely, confirm the fix, and document the rollback path before you call it done.

Use this flow when something breaks
  • confirm the symptom and scope before changing anything
  • check service state, logs, config, ports, and recent changes
  • solve plain connectivity before diving into TLS or app logic
  • test one dependency at a time: DNS, trust, identity, database, proxy
  • validate the fix with a real transaction, not just a green service state
  • leave behind root cause, evidence, and rollback notes

Universal troubleshooting sequence

Use this sequence first, even when you think you already know the answer. It keeps you from skipping basic checks and blaming the wrong layer.

  1. Confirm the symptom. Be specific: startup, auth, network, TLS, config drift, or data flow.
  2. Confirm the scope. One host or many? One user or all users? Dev only, test only, or prod?
  3. Check the service state. Review the unit status, logs, config validation output, and whether the expected port is listening.
  4. Check recent changes. Look for role or template updates, package upgrades, cert renewals, DNS changes, firewall edits, or runtime changes.
  5. Validate dependencies. DNS, ports, trust, identity, database, reverse proxy, and upstream services.
  6. Reproduce safely. Use a manual test command, stage it if possible, and pilot risky changes on one host first.
  7. Validate the fix. Healthy service state, port listening, expected logs, and a working end-to-end test.
  8. Document. Capture root cause, exact change, evidence, and rollback path.

Most bad troubleshooting starts with a fuzzy symptom. Write the broken behavior down in one sentence before you touch the config. That one sentence becomes your test case when you validate the fix.

Core commands

These are the checks you reach for most often when triaging a Linux or service issue.

Service checks

systemctl status SERVICE -l --no-pager
journalctl -u SERVICE -b --no-pager
journalctl -xeu SERVICE
systemctl restart SERVICE
systemctl reload SERVICE
systemctl is-active SERVICE
systemctl is-enabled SERVICE

Port and listener checks

ss -lntup
ss -lntp | grep PORT
netstat -plnt
lsof -i :PORT
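
A plain grep PORT also matches longer numbers (80 matches 8080). The sketch below is one way to test for an exact port. It assumes the default column layout of ss -lnt, where the local address is the fourth field; the function name port_is_listening is hypothetical, not an existing tool.

```shell
#!/usr/bin/env bash
# Read `ss -lnt` output on stdin; return 0 if a socket listens on exactly port $1.
# Assumption: field 4 is "Local Address:Port" (true for `ss -lnt`, not for `ss -lntu`,
# which adds a Netid column in front).
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")      # the port follows the last colon
            if (parts[n] == port) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}
```

Usage: ss -lnt | port_is_listening 443 && echo "443 is listening"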

Firewall and connectivity

firewall-cmd --list-all
firewall-cmd --list-ports
ping HOST
nc -zv HOST PORT
telnet HOST PORT
tcpdump -nn host HOST and port PORT

DNS and name resolution

hostname -f
getent hosts HOST
dig HOST
nslookup HOST
resolvectl status

TLS and certificate checks

openssl s_client -connect HOST:PORT -servername HOST
openssl x509 -in CERTFILE -text -noout
keytool -list -keystore TRUSTSTORE

Ansible validation

ansible-playbook PLAYBOOK.yml --syntax-check
ansible-playbook PLAYBOOK.yml --check
ansible-playbook PLAYBOOK.yml --diff
ansible-inventory --graph
ansible-inventory --list

Config validation

rsyslogd -N1
postfix check
dovecot -n
nginx -t
apachectl configtest
python -m py_compile FILE.py

Generic Ansible troubleshooting

If automation broke something, work in this order so you can separate bad input data from bad role logic.

  1. Inventory: verify the expected host or group was targeted.
  2. group_vars and host_vars: verify environment values and overrides.
  3. Tasks: inspect what changed and whether conditionals and idempotency still make sense.
  4. Templates: confirm Jinja rendered valid config and whitespace did not break syntax.
  5. Handlers: make sure the right restart or reload happened, and that validation runs before the bounce when appropriate.
  6. Repeated runs: confirm the role stays clean on rerun instead of drifting or restarting every time.
  7. Validation before restart: never bounce a service blindly when a config test exists.

Good Ansible pattern

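# tasks/main.yml
- name: Deploy config
  template:
    src: service.conf.j2
    dest: /etc/service/service.conf
    owner: root
    group: root
    mode: "0644"
  # notify names must match handler names exactly
  notify:
    - Validate service config
    - Restart service

# handlers/main.yml
# Handlers run in the order they are defined here, so validation runs first.
# If validation fails, the host fails and the restart handler does not run.
- name: Validate service config
  command: service-binary validation-flag
  changed_when: false

- name: Restart service
  service:
    name: service
    state: restarted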

What to look for in roles

Check mode is not a full simulation. Treat --check --diff as an early warning system, not proof that the deployment is safe. Tasks that use command and shell are normally skipped in check mode, so later tasks that depend on their results can behave differently in a real run.
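
A minimal preflight sketch: run the syntax check, then a check-mode diff limited to one pilot host. The function name preflight_playbook and the ANSIBLE_PLAYBOOK override variable are hypothetical; the flags are standard ansible-playbook options.

```shell
#!/usr/bin/env bash
# ANSIBLE_PLAYBOOK can be overridden (for example with "echo") to dry-run the wrapper.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# Usage: preflight_playbook PLAYBOOK.yml PILOT_HOST
preflight_playbook() {
    local playbook="$1" pilot_host="$2"
    # Stop early if the playbook does not even parse.
    "$ANSIBLE_PLAYBOOK" "$playbook" --syntax-check || return 1
    # Early warning only: show what would change on a single host.
    "$ANSIBLE_PLAYBOOK" "$playbook" --check --diff --limit "$pilot_host"
}
```

A clean preflight does not replace the one-host staging test. It only catches the cheap failures first.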

Generic systemd service failure workflow

Use this when a service will not start, exits immediately, or flaps on restart.

  1. Check systemctl status SERVICE.
  2. Read journalctl -u SERVICE -b.
  3. Inspect the unit file with systemctl cat SERVICE.
  4. Verify ExecStart, WorkingDirectory, User, Group, EnvironmentFile, permissions, and dependent services.
  5. Run the service command manually as the service user if possible.
  6. Check for a port conflict.
  7. Check dependent config, cert, database, auth, or upstream endpoints.

Common causes

Fix pattern

  1. correct the unit or config
  2. validate the config
  3. restart the service
  4. confirm active state
  5. test the real function end to end
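
The fix pattern above can be sketched as a small wrapper that refuses to restart when validation fails. The function name validate_then_restart and the SYSTEMCTL override variable are hypothetical; pass whatever config test the service provides.

```shell
#!/usr/bin/env bash
# SYSTEMCTL can be overridden (for example with "echo") to dry-run the wrapper.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Usage: validate_then_restart SERVICE VALIDATION_COMMAND [ARGS...]
validate_then_restart() {
    local service="$1"; shift
    # Run the validation command exactly as given.
    if ! "$@"; then
        echo "config validation failed; $service was NOT restarted" >&2
        return 1
    fi
    "$SYSTEMCTL" restart "$service" || return 1
    # Exit status reflects whether the unit is active after the restart.
    "$SYSTEMCTL" is-active --quiet "$service"
}
```

Usage: validate_then_restart nginx nginx -t. An active unit is still only step 4; finish with a real end-to-end test.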

Generic TLS and certificate troubleshooting

Use this for LDAPS, SMTP TLS, IMAPS, rsyslog TLS, or web admin TLS problems.

Checklist

  1. Is the service presenting the expected certificate?
  2. Is the certificate expired?
  3. Does the CN or SAN match the hostname in use?
  4. Does the client trust the issuing CA?
  5. Is the full chain present?
  6. Are you using the right port?
  7. Are STARTTLS and implicit TLS being mixed up?

Quick commands

openssl s_client -connect host:636 -servername host
openssl s_client -connect host:6514 -servername host
openssl s_client -starttls smtp -connect host:25
openssl s_client -starttls imap -connect host:143

Common causes

Fix pattern

  1. confirm the hostname clients use
  2. deploy the proper cert, key, and chain
  3. import the CA into the truststore where needed
  4. retest with openssl
  5. then retest the app or service

Do not blame the application before proving the wire is healthy. If openssl s_client cannot complete the handshake or shows the wrong certificate, stay in the TLS layer until that is fixed.
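
For the expiry item in the checklist, a hedged sketch using openssl x509 -checkend, which exits non-zero when the certificate expires within the given number of seconds. The function names fetch_server_cert and cert_valid_for_days are hypothetical, and the 14-day default is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Save the leaf certificate a server presents as PEM in file $3.
fetch_server_cert() {
    local host="$1" port="$2" out_file="$3"
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null \
        | openssl x509 -outform PEM > "$out_file"
}

# Return 0 if the PEM certificate in $1 stays valid for at least $2 more days.
cert_valid_for_days() {
    local cert_file="$1" days="${2:-14}"
    openssl x509 -in "$cert_file" -noout -checkend "$(( days * 86400 ))" >/dev/null
}
```

Usage: fetch_server_cert HOST 636 /tmp/leaf.pem && cert_valid_for_days /tmp/leaf.pem 30 || echo "renew soon". This covers expiry only; hostname match and chain still need the s_client output.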

Generic LDAP, LDAPS, and identity troubleshooting

Use this for FreeIPA, Keycloak federation, bind failures, group mapping, and sync or login issues.

Troubleshooting order

  1. network reachability
  2. DNS resolution
  3. certificate trust, if using LDAPS
  4. bind DN works
  5. search base works
  6. user search works
  7. group search works
  8. role or group mapper works
  9. real app login works

Common causes

Fix pattern

  1. test LDAP or LDAPS independently first
  2. only then update the application
  3. validate user sync
  4. validate group sync
  5. validate one real login
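
To test LDAP or LDAPS independently of the application, a sketch with OpenLDAP's ldapsearch. The function name ldap_probe and the LDAPSEARCH override variable are hypothetical, and the uid and memberOf attribute names are assumptions that fit FreeIPA-style directories; adjust them for your schema.

```shell
#!/usr/bin/env bash
# LDAPSEARCH can be overridden with a stub to dry-run the probe.
LDAPSEARCH="${LDAPSEARCH:-ldapsearch}"

# Usage: ldap_probe URI BIND_DN PASSWORD_FILE SEARCH_BASE USERNAME
# Note: -y uses the complete file contents as the password, so the file
# must not end with a trailing newline.
ldap_probe() {
    local uri="$1" bind_dn="$2" pass_file="$3" base="$4" user="$5"
    # Bind, then read the search base entry itself (proves bind DN and base).
    "$LDAPSEARCH" -x -LLL -H "$uri" -D "$bind_dn" -y "$pass_file" \
        -b "$base" -s base '(objectClass=*)' dn || return 1
    # User search, returning group membership if the directory exposes it.
    "$LDAPSEARCH" -x -LLL -H "$uri" -D "$bind_dn" -y "$pass_file" \
        -b "$base" "(uid=$user)" dn memberOf
}
```

If the first query fails over ldaps:// but works over ldap://, go back to certificate trust before touching the application.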

Generic rsyslog troubleshooting

Use this when logs do not arrive, TCP forwarding fails, or TLS forwarding breaks.

Checks

rsyslogd -N1
systemctl status rsyslog
journalctl -u rsyslog -n 100 --no-pager
ss -lntup | grep 514
ss -lntup | grep 6514
grep -R "imtcp\|omfwd\|gtls\|@@\|@" /etc/rsyslog* -n

Common causes

Fix pattern

  1. decide the transport first: UDP, TCP, or TCP plus TLS
  2. align both ends
  3. validate the config
  4. restart rsyslog
  5. send a test log with logger
  6. confirm arrival at the receiver
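
To see which transport a sender is actually configured for, a sketch that classifies legacy forwarding lines: @@host means TCP and a single @host means UDP. The function name forward_transports is hypothetical, and the sketch ignores action(type="omfwd" ...) blocks, which need a separate look at their protocol parameter.

```shell
#!/usr/bin/env bash
# Read rsyslog config lines on stdin; print "tcp TARGET" or "udp TARGET"
# for each legacy-syntax forwarding rule.
forward_transports() {
    awk '
        /^[[:space:]]*#/ { next }                 # skip comment lines
        {
            target = $NF                          # the action is the last field
            if (target ~ /^@@/)     { sub(/^@@/, "", target); print "tcp " target }
            else if (target ~ /^@/) { sub(/^@/,  "", target); print "udp " target }
        }
    '
}
```

Usage: cat /etc/rsyslog.conf /etc/rsyslog.d/*.conf | forward_transports. Compare the result with the listener on the receiver before changing either end.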

Test

logger -n SERVER -P 514 -T "tcp syslog test"

Generic Postfix troubleshooting

Use this when mail is not relaying correctly, TLS is failing, auth is broken, or relay policy looks wrong.

Checks

postconf -n
postconf mynetworks mydestination relay_domains smtpd_recipient_restrictions
postfix check
journalctl -u postfix -n 100 --no-pager
tail -f /var/log/maillog

Common causes

Fix pattern

  1. lock down relay policy
  2. validate the config
  3. reload Postfix
  4. test the expected relay path
  5. test the denied relay path
  6. review the logs
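
One cheap relay-policy check is whether mynetworks trusts everyone. The sketch below flags a catch-all network in the value printed by postconf -h mynetworks. The function name mynetworks_is_open is hypothetical, and it covers only this one setting, not the full restriction chain.

```shell
#!/usr/bin/env bash
# Return 0 if the mynetworks value in $1 contains a catch-all network.
mynetworks_is_open() {
    local entry
    local -a entries
    # Entries may be separated by commas and/or whitespace.
    read -r -a entries <<< "${1//,/ }"
    for entry in "${entries[@]}"; do
        case "$entry" in
            0.0.0.0/0|::/0|'[::]/0') return 0 ;;
        esac
    done
    return 1
}
```

Usage: mynetworks_is_open "$(postconf -h mynetworks)" && echo "WARNING: mynetworks trusts every client". Still run the denied-path relay test afterwards; a tight mynetworks does not prove the restrictions are right.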

Useful tests

swaks --server MAILHOST --from test@external.example --to user@external.example
swaks --server MAILHOST --port 587 --tls --auth \
  --auth-user USER --auth-password 'PASSWORD' --to user@example.com

Generic Dovecot troubleshooting

Use this for IMAP or POP login failures, TLS errors, client trust issues, or mailbox sync trouble.

Checks

dovecot -n
doveconf -n | grep -E 'ssl|auth|imap|pop3|lmtp'
journalctl -u dovecot -n 100 --no-pager
openssl s_client -connect HOST:993
openssl s_client -starttls imap -connect HOST:143

Common causes

Fix pattern

  1. confirm TLS and auth settings
  2. validate cert and key readability
  3. restart Dovecot
  4. test with an IMAP TLS client
  5. validate mailbox access in a real client

Generic Keycloak and federation troubleshooting

Use this when Keycloak cannot connect to LDAP or LDAPS, sync works but logins fail, or role mapping differs by environment.

Checks

  1. verify the LDAP or LDAPS endpoint is reachable
  2. confirm the CA is trusted by Java and Keycloak
  3. verify bind DN, search base, and group base
  4. test the connection in Keycloak
  5. test user sync
  6. test group mapping
  7. test application login end to end

Common causes

Fix pattern

  1. fix trust first
  2. then fix federation settings
  3. then retest sync
  4. then retest app auth

Generic Openfire and Java app upgrade troubleshooting

SysRef does not have a dedicated Openfire page; this section is the in-site coverage. Use it for Openfire or similar Java services that fail after an upgrade.

Checks

java -version
systemctl status openfire
systemctl cat openfire
journalctl -u openfire -b --no-pager
tail -n 200 /opt/openfire/logs/error.log
tail -n 200 /opt/openfire/logs/info.log

Common causes

Fix pattern

  1. take a snapshot or backup first
  2. test in staging
  3. confirm Java prerequisites
  4. inventory installed plugins
  5. upgrade the application
  6. watch the first startup closely
  7. validate admin UI and client connectivity
  8. keep the rollback path ready
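
For step 1, a minimal file-level backup sketch using tar. The function name backup_dir and the /var/backups destination in the usage line are assumptions; /opt/openfire is taken from the log paths above. It does not cover an external database, which needs its own dump.

```shell
#!/usr/bin/env bash
# Usage: backup_dir SOURCE_DIR DEST_DIR
# Creates a timestamped .tar.gz of SOURCE_DIR in DEST_DIR and prints its path.
backup_dir() {
    local src="$1" dest_dir="$2" stamp archive
    [ -d "$src" ] || { echo "backup_dir: $src is not a directory" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest_dir/$(basename "$src")-$stamp.tar.gz"
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Confirm the archive can be read back before relying on it.
    tar -tzf "$archive" >/dev/null || return 1
    printf '%s\n' "$archive"
}
```

Usage: systemctl stop openfire && backup_dir /opt/openfire /var/backups. Stopping the service first avoids archiving files that are changing mid-backup.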

Windows and FreeIPA reality check

Do not assume FreeIPA gives you native AD-style Windows policy, GPO, or admin behavior. Clarify the actual requirement before you recommend a design.

Ask these questions

Safe conclusion pattern

If native AD or GPO-like behavior is required, validate whether Active Directory or an AD trust is required. If this is only a pilot or local-admin workaround, use scripts, endpoint tooling, or automation instead of pretending FreeIPA is a full Windows policy plane.

Architectural mismatch is not a troubleshooting bug. Sometimes the right answer is that the chosen identity platform does not provide the Windows behavior the stakeholders expect.

Review, rollout, and rollback checks

Before-merge peer review checklist

Rollout order for risky changes

  1. local syntax and config validation
  2. one-host staging test
  3. functional test
  4. peer review
  5. merge
  6. controlled deploy
  7. service validation
  8. log review
  9. user or app validation
  10. monitor for regression

Rollback checklist

Define rollback before risky changes, not during the incident.

Sprint and ticket framing language

Status language

Useful next-steps language

Quick symptom-to-action map

SERVICE WON'T START
-> systemctl status
-> journalctl
-> config validation
-> unit file
-> port check

AUTH FAILS
-> DNS
-> trust and cert if LDAPS
-> bind and search base
-> group mapping
-> app mapper or role mapping

TLS FAILS
-> openssl s_client
-> hostname match
-> cert chain
-> truststore
-> right protocol and port

NO LOGS ARRIVE
-> listener port
-> sender config
-> protocol alignment
-> firewall
-> rsyslog validation

MAIL RELAY WRONG
-> postconf -n
-> recipient restrictions
-> mynetworks
-> relay domains
-> swaks tests

UPGRADE FAILED
-> runtime version
-> logs
-> config drift
-> plugin compatibility
-> database migration
-> rollback readiness

Best-practice reminders

One-screen compact version

INFRA TROUBLESHOOTING QUICK FLOW
1. Confirm symptom
2. Confirm scope
3. systemctl status / journalctl
4. Validate config
5. Check listener and port
6. Check DNS and connectivity
7. Check cert and trust if TLS
8. Check auth and group mapping if identity
9. Check recent Ansible or template change
10. Test one real transaction
11. Fix
12. Re-test
13. Document root cause and rollback

CORE COMMANDS
- systemctl status SERVICE -l --no-pager
- journalctl -u SERVICE -b --no-pager
- ss -lntup
- openssl s_client -connect host:port -servername host
- rsyslogd -N1
- postfix check
- dovecot -n
- ansible-playbook PLAYBOOK.yml --check --diff

GOLDEN RULES
- validate before restart
- align both ends of a connection
- pilot first
- trust and certs before app blame
- rollback always ready

This page is meant to be reused. Copy the relevant section into tickets, change plans, or handover notes, then trim it to the exact fault domain you are working.