Generic

Reusable troubleshooting and resolution cheat sheet for infrastructure tickets, service failures, identity problems, TLS issues, and rollout risk checks.

This page turns a broad infra troubleshooting memo into a single reference you can reuse across tickets. Start by narrowing the symptom, then validate dependencies, test safely, confirm the fix, and document the rollback path before you call it done.

Use this flow when something breaks

confirm the symptom and scope before changing anything
check service state, logs, config, ports, and recent changes
solve plain connectivity before diving into TLS or app logic
test one dependency at a time: DNS, trust, identity, database, proxy
validate the fix with a real transaction, not just a green service state
leave behind root cause, evidence, and rollback notes

On this page

Universal troubleshooting sequence
Core commands
Generic Ansible troubleshooting
Generic systemd service failure workflow
Generic TLS and certificate troubleshooting
Generic LDAP, LDAPS, and identity troubleshooting
Generic rsyslog troubleshooting
Generic Postfix troubleshooting
Generic Dovecot troubleshooting
Generic Keycloak and federation troubleshooting
Generic Openfire and Java app upgrade troubleshooting
Windows and FreeIPA reality check
Review, rollout, and rollback checks
Sprint and ticket framing language
Quick symptom-to-action map
Best-practice reminders
One-screen compact version

Universal troubleshooting sequence

Use this sequence first, even when you think you already know the answer. It keeps you from skipping basic checks and blaming the wrong layer.

Confirm the symptom. Be specific: startup, auth, network, TLS, config drift, or data flow.
Confirm the scope. One host or many? One user or all users? Dev only, test only, or prod?
Check the service state. Review the unit status, logs, config validation output, and whether the expected port is listening.
Check recent changes. Look for role or template updates, package upgrades, cert renewals, DNS changes, firewall edits, or runtime changes.
Validate dependencies. DNS, ports, trust, identity, database, reverse proxy, and upstream services.
Reproduce safely. Use a manual test command, stage it if possible, and pilot risky changes on one host first.
Validate the fix. Healthy service state, port listening, expected logs, and a working end-to-end test.
Document. Capture root cause, exact change, evidence, and rollback path.

Most bad troubleshooting starts with a fuzzy symptom. Write the broken behavior down in one sentence before you touch the config. That one sentence becomes your test case when you validate the fix.

Core commands

These are the checks you reach for most often when triaging a Linux or service issue.

Service checks

systemctl status SERVICE -l --no-pager
journalctl -u SERVICE -b --no-pager
journalctl -xeu SERVICE
systemctl restart SERVICE
systemctl reload SERVICE
systemctl is-active SERVICE
systemctl is-enabled SERVICE

Port and listener checks

ss -lntup
ss -lntp | grep PORT
netstat -plnt
lsof -i :PORT

Firewall and connectivity

firewall-cmd --list-all
firewall-cmd --list-ports
ping HOST
nc -zv HOST PORT
telnet HOST PORT
tcpdump -nn host HOST and port PORT

DNS and name resolution

hostname -f
getent hosts HOST
dig HOST
nslookup HOST
resolvectl status

TLS and certificate checks

openssl s_client -connect HOST:PORT -servername HOST
openssl x509 -in CERTFILE -text -noout
keytool -list -keystore TRUSTSTORE

Ansible validation

ansible-playbook PLAYBOOK.yml --syntax-check
ansible-playbook PLAYBOOK.yml --check
ansible-playbook PLAYBOOK.yml --diff
ansible-inventory --graph
ansible-inventory --list

Config validation

rsyslogd -N1
postfix check
dovecot -n
nginx -t
apachectl configtest
python3 -m py_compile FILE.py

Generic Ansible troubleshooting

If automation broke something, work in this order so you can separate bad input data from bad role logic.

Inventory: verify the expected host or group was targeted.
group_vars and host_vars: verify environment values and overrides.
Tasks: inspect what changed and whether conditionals and idempotency still make sense.
Templates: confirm Jinja rendered valid config and whitespace did not break syntax.
Handlers: make sure the right restart or reload happened, and that validation runs before the bounce when appropriate.
Repeated runs: confirm the role stays clean on rerun instead of drifting or restarting every time.
Validation before restart: never bounce a service blindly when a config test exists.
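The repeated-runs check above can be partly automated. This is a minimal sketch, assuming the playbook output ends with the usual PLAY RECAP lines (host : ok=N changed=N unreachable=N failed=N). The function name recap_is_clean is hypothetical and not part of Ansible.

```shell
# recap_is_clean: read ansible-playbook output on stdin.
# Succeeds only if a recap line was seen and no host reports a
# changed, failed, or unreachable count above zero.
recap_is_clean() {
    awk '
        /changed=[0-9]+/ { seen = 1 }
        /(changed|failed|unreachable)=[1-9]/ { dirty = 1 }
        END { exit (seen && !dirty) ? 0 : 1 }
    '
}
```

Run the playbook twice against a pilot host and pipe the second run through it, for example: ansible-playbook PLAYBOOK.yml --limit PILOT_HOST | recap_is_clean. A non-zero exit on the second run suggests a task that is not idempotent.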

Good Ansible pattern

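# tasks/main.yml
- name: Deploy config
  template:
    src: service.conf.j2
    dest: /etc/service/service.conf
    owner: root
    group: root
    mode: "0644"
  notify:
    - Validate service config
    - Restart service

# handlers/main.yml
# Handlers run in the order they are defined here, so validation runs
# first. If it fails, the restart handler does not run for that host.
- name: Validate service config
  command: service-binary validation-flag
  changed_when: false

- name: Restart service
  service:
    name: service
    state: restarted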

What to look for in roles

hardcoded values instead of variables
service restart without validation
tasks that are not idempotent
templates missing guard clauses
variables split across too many files
one role doing too many unrelated things
missing or miswired handlers

Check mode is not a full simulation. Treat --check --diff as an early warning system, not proof that the deployment is safe. Tasks that use command or shell are normally skipped in check mode, so anything that depends on their results is not exercised.
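One way to use check mode as that early warning is to chain it behind the syntax check and scope it to a single pilot host. This is a sketch only: the function name preflight and the pilot-host argument are assumptions, while --syntax-check, --check, --diff, and --limit are standard ansible-playbook options.

```shell
# preflight PLAYBOOK PILOT_HOST
# Stop at the first failure: syntax first, then a dry run with diffs
# limited to one pilot host. Nothing is changed on the target.
preflight() {
    local playbook="$1" pilot="$2"
    ansible-playbook "$playbook" --syntax-check || return 1
    ansible-playbook "$playbook" --check --diff --limit "$pilot"
}
```

A clean preflight still leaves command and shell tasks unexercised, so follow it with a real run on the pilot host before a wider deploy.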

Generic systemd service failure workflow

Use this when a service will not start, exits immediately, or flaps on restart.

Check systemctl status SERVICE.
Read journalctl -u SERVICE -b.
Inspect the unit file with systemctl cat SERVICE.
Verify ExecStart, WorkingDirectory, User, Group, EnvironmentFile, permissions, and dependent services.
Run the service command manually as the service user if possible.
Check for a port conflict.
Check dependent config, cert, database, auth, or upstream endpoints.

Common causes

wrong binary path
wrong virtualenv or Java path
invalid config file
port already in use
missing environment variable
file permission issue
service account cannot read cert or key
dependency service is down

Fix pattern

correct the unit or config
validate the config
restart the service
confirm active state
test the real function end to end
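The validate, restart, and confirm steps of this fix pattern can be wrapped so a restart never happens on a failed config test. This is a minimal sketch: the function names are hypothetical, and the service names in the case statement are assumptions that should match the unit names on your hosts (for example apache2 rather than httpd on Debian-family systems).

```shell
# validate_config SERVICE: run the config test listed under "Config validation".
# Returns the validator's exit status, or 2 when no validator is known.
validate_config() {
    case "$1" in
        rsyslog) rsyslogd -N1 ;;
        postfix) postfix check ;;
        dovecot) dovecot -n >/dev/null ;;
        nginx)   nginx -t ;;
        httpd)   apachectl configtest ;;
        *)       echo "no validator known for $1" >&2; return 2 ;;
    esac
}

# safe_restart SERVICE: validate, restart, then confirm the active state.
# Deliberately refuses to restart when validation fails or is unavailable.
safe_restart() {
    local svc="$1"
    validate_config "$svc" || {
        echo "config check failed; not restarting $svc" >&2
        return 1
    }
    systemctl restart "$svc" || return 1
    systemctl is-active --quiet "$svc"
}
```

A zero exit only proves the unit is active. The last step of the fix pattern, an end-to-end functional test, still has to be done separately.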

Generic TLS and certificate troubleshooting

Use this for LDAPS, SMTP TLS, IMAPS, rsyslog TLS, or web admin TLS problems.

Checklist

Is the service presenting the expected certificate?
Is the certificate expired?
Does the CN or SAN match the hostname in use?
Does the client trust the issuing CA?
Is the full chain present?
Are you using the right port?
Are STARTTLS and implicit TLS being mixed up?

Quick commands

openssl s_client -connect host:636 -servername host
openssl s_client -connect host:6514 -servername host
openssl s_client -starttls smtp -connect host:25
openssl s_client -starttls imap -connect host:143
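Because the STARTTLS versus implicit TLS choice follows the port, the quick commands above can be generated from a small lookup. This is a sketch under assumptions: the port-to-protocol mapping reflects common defaults, not your site's actual listeners, the function name tls_probe_args is hypothetical, and -starttls ldap needs a reasonably recent OpenSSL.

```shell
# tls_probe_args HOST PORT: print openssl s_client arguments for the port.
# Ports that normally upgrade a plaintext session get -starttls;
# everything else is treated as implicit TLS (636, 993, 6514, and so on).
tls_probe_args() {
    local host="$1" port="$2" proto=""
    case "$port" in
        25|587) proto=smtp ;;
        143)    proto=imap ;;
        110)    proto=pop3 ;;
        389)    proto=ldap ;;
    esac
    if [ -n "$proto" ]; then
        echo "-starttls $proto -connect $host:$port -servername $host"
    else
        echo "-connect $host:$port -servername $host"
    fi
}
```

Usage, with the expansion left unquoted on purpose so the arguments split: openssl s_client $(tls_probe_args HOST 587) </dev/null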

Common causes

hostname mismatch
CA missing from truststore
incomplete chain
key and cert mismatch
wrong listener or port
STARTTLS enabled when the client expects implicit TLS such as LDAPS, or the reverse

Fix pattern

confirm the hostname clients use
deploy the proper cert, key, and chain
import the CA into the truststore where needed
retest with openssl
then retest the app or service
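Before deploying a cert and key, two of the common causes above (key and cert mismatch, expiry) can be checked offline. This is a minimal sketch assuming PEM files and an unencrypted private key; the function names are hypothetical.

```shell
# cert_matches_key CERT KEY: succeed when both files carry the same public key.
# Returns 2 when either file cannot be read or parsed.
cert_matches_key() {
    local cert_pub key_pub
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 2
    key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 2
    [ "$cert_pub" = "$key_pub" ]
}

# cert_valid_for_days CERT DAYS: succeed when the certificate will not
# expire within the next DAYS days (openssl -checkend takes seconds).
cert_valid_for_days() {
    openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null 2>&1
}
```

These checks say nothing about the chain or client trust, so the openssl s_client retest in the fix pattern is still required.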

Do not blame the application before proving the wire is healthy. If openssl s_client cannot complete the handshake or shows the wrong certificate, stay in the TLS layer until that is fixed.

Generic LDAP, LDAPS, and identity troubleshooting

Use this for FreeIPA, Keycloak federation, bind failures, group mapping, and sync or login issues.

Troubleshooting order

network reachability
DNS resolution
certificate trust, if using LDAPS
bind DN works
search base works
user search works
group search works
role or group mapper works
real app login works

Common causes

wrong bind DN
wrong user or group search base
CA not trusted
port 636 paired with ldap:// or STARTTLS instead of ldaps://
STARTTLS and LDAPS confusion
group mapping not aligned with app expectations
cache not refreshed after changes

Fix pattern

test LDAP or LDAPS independently first
only then update the application
validate user sync
validate group sync
validate one real login
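Testing LDAP or LDAPS independently usually means a bind followed by a search, outside the application. This is a sketch assuming the OpenLDAP client tools (ldapwhoami, ldapsearch) are installed and that the bind password sits in a file with no trailing newline. The function names and every argument value are placeholders.

```shell
# ldap_uri HOST PORT: pick the URL scheme that matches the port.
ldap_uri() {
    case "$2" in
        636) echo "ldaps://$1:636" ;;
        *)   echo "ldap://$1:$2" ;;
    esac
}

# ldap_check URI BIND_DN PWFILE BASE FILTER
# Step 1 proves the bind DN and password; step 2 proves the search base
# and filter return at least one entry. Reports the first step that fails.
ldap_check() {
    local uri="$1" bind_dn="$2" pwfile="$3" base="$4" filter="$5"
    ldapwhoami -H "$uri" -x -D "$bind_dn" -y "$pwfile" >/dev/null 2>&1 \
        || { echo "bind failed"; return 1; }
    ldapsearch -LLL -H "$uri" -x -D "$bind_dn" -y "$pwfile" \
        -b "$base" "$filter" dn 2>/dev/null \
        | grep -q '^dn:' || { echo "search returned no entries"; return 1; }
    echo "bind and search ok"
}
```

A failed bind over ldaps:// can also be a trust problem rather than a credential problem, so rule out the certificate layer first, as the troubleshooting order says.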

Generic rsyslog troubleshooting

Use this when logs do not arrive, TCP forwarding fails, or TLS forwarding breaks.

Checks

rsyslogd -N1
systemctl status rsyslog
journalctl -u rsyslog -n 100 --no-pager
ss -lntup | grep -w 514
ss -lntup | grep -w 6514
grep -R "imtcp\|omfwd\|gtls\|@@\|@" /etc/rsyslog* -n

Common causes

server listening on the wrong port
sender using UDP while the receiver expects TCP
TLS configured on only one side
firewall blocking 514 or 6514
invalid template syntax
duplicate input blocks after rerun

Fix pattern

decide the transport first: UDP, TCP, or TCP plus TLS
align both ends
validate the config
restart rsyslog
send a test log with logger
confirm arrival at the receiver

Test

logger -n SERVER -P 514 -T "tcp syslog test"
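A test message is easier to find on the receiver when it carries a unique marker. This is a sketch assuming the util-linux logger, where -T selects TCP and -d selects UDP. The function name send_test_log and the tag synthetic-check are hypothetical.

```shell
# send_test_log SERVER PORT TRANSPORT
# Send one message with a unique marker and print the marker so it can
# be searched for on the receiver. TRANSPORT is tcp or udp; logger does
# not speak TLS, so TLS forwarding needs a different test.
send_test_log() {
    local server="$1" port="$2" transport="$3" flag marker
    case "$transport" in
        tcp) flag=-T ;;
        udp) flag=-d ;;
        *)   echo "transport must be tcp or udp" >&2; return 2 ;;
    esac
    marker="syslog-test-$(date +%s)-$$"
    logger -n "$server" -P "$port" "$flag" -t synthetic-check "$marker" || return 1
    echo "$marker"
}
```

On the receiver, search the destination log for the printed marker. With UDP a zero exit from logger does not prove delivery, so the receiver-side check is the real test.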

Generic Postfix troubleshooting

Use this when mail is not relaying correctly, TLS is failing, auth is broken, or relay policy looks wrong.

Checks

postconf -n
postconf mynetworks mydestination relay_domains smtpd_recipient_restrictions
postfix check
journalctl -u postfix -n 100 --no-pager
tail -f /var/log/maillog    # /var/log/mail.log on Debian and Ubuntu

Common causes

mynetworks too broad
relay_domains and mydestination confusion
weak or wrong recipient restrictions
TLS cert missing or invalid
submission settings not aligned with auth policy

Fix pattern

lock down relay policy
validate the config
reload Postfix
test the expected relay path
test the denied relay path
review the logs

Useful tests

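# denied path: unauthenticated relay between two external domains should be rejected
swaks --server MAILHOST --from test@external.example --to user@external.example
# expected path: authenticated submission on 587 should be accepted
swaks --server MAILHOST --port 587 --tls --auth \
  --auth-user USER --auth-password 'PASSWORD' --to user@example.com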
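The "mynetworks too broad" cause can be screened for before running relay tests. This is a minimal sketch that only flags entries covering every address. The function name broad_networks is hypothetical, and what counts as too broad for your site is a policy decision this does not make.

```shell
# broad_networks VALUE: print mynetworks entries that match all addresses.
# VALUE is the raw setting, entries separated by commas or whitespace.
# Exit status is 0 when at least one catch-all entry is found.
broad_networks() {
    printf '%s\n' "$1" | tr ', ' '\n\n' \
        | grep -Fx -e '0.0.0.0/0' -e '::/0' -e '[::]/0'
}
```

Usage: broad_networks "$(postconf -h mynetworks)" && echo "open relay risk". The -h option makes postconf print the value without the parameter name.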

Generic Dovecot troubleshooting

Use this for IMAP or POP login failures, TLS errors, client trust issues, or mailbox sync trouble.

Checks

dovecot -n
doveconf -n | grep -E 'ssl|auth|imap|pop3|lmtp'
journalctl -u dovecot -n 100 --no-pager
openssl s_client -connect HOST:993
openssl s_client -starttls imap -connect HOST:143

Common causes

wrong cert path
key unreadable by the service
disable_plaintext_auth does not match client expectations
client using the wrong protocol or port

Fix pattern

confirm TLS and auth settings
validate cert and key readability
restart Dovecot
test with an IMAP TLS client
validate mailbox access in a real client

Generic Keycloak and federation troubleshooting

Use this when Keycloak cannot connect to LDAP or LDAPS, sync works but logins fail, or role mapping differs by environment.

Checks

verify the LDAP or LDAPS endpoint is reachable
confirm the CA is trusted by Java and Keycloak
verify bind DN, search base, and group base
test the connection in Keycloak
test user sync
test group mapping
test application login end to end

Common causes

truststore missing the CA
wrong URL or port
STARTTLS and LDAPS confusion
mapper mismatch
application expecting roles that are never produced

Fix pattern

fix trust first
then fix federation settings
then retest sync
then retest app auth

Generic Openfire and Java app upgrade troubleshooting

SysRef does not have a dedicated Openfire page; this section is the in-site coverage. Use it for Openfire or similar Java services that fail after an upgrade.

Checks

java -version
systemctl status openfire
systemctl cat openfire
journalctl -u openfire -b --no-pager
tail -n 200 /opt/openfire/logs/error.log
tail -n 200 /opt/openfire/logs/info.log

Common causes

incompatible Java version
plugin incompatibility
database migration issue
old config carried into the new version badly
systemd environment still pointing at the old install

Fix pattern

take a snapshot or backup first
test in staging
confirm Java prerequisites
inventory installed plugins
upgrade the application
watch the first startup closely
validate admin UI and client connectivity
keep the rollback path ready
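The "confirm Java prerequisites" step can be scripted so it runs before the upgrade rather than after the first failed start. This is a sketch: the function names are hypothetical, and the minimum version is an argument you supply from the release notes of the version you are installing, not a value asserted here.

```shell
# java_version: print the quoted version string from `java -version`,
# which writes to stderr (for example 17.0.9 or 1.8.0_392).
java_version() {
    java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1
}

# java_major VERSION: reduce a version string to its major number.
# Legacy 1.x strings map to x, so 1.8.0_392 becomes 8.
java_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;
    esac
    echo "${v%%[._+-]*}"
}

# java_at_least MIN: succeed when the installed major version is >= MIN.
java_at_least() {
    local major
    major=$(java_major "$(java_version)")
    [ -n "$major" ] && [ "$major" -ge "$1" ]
}
```

Run it as the service user and with the service's environment where possible, since a systemd unit may point at a different Java than your login shell does.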

Windows and FreeIPA reality check

Do not assume FreeIPA gives you native AD-style Windows policy, GPO, or admin behavior. Clarify the actual requirement before you recommend a design.

Ask these questions

Are we solving authentication only?
Or policy management too?
Or local admin rights too?
Is there an AD trust model or not?
Is this a pilot workaround or a supported long-term design?

Safe conclusion pattern

If native AD or GPO-like behavior is required, validate whether Active Directory or an AD trust is required. If this is only a pilot or local-admin workaround, use scripts, endpoint tooling, or automation instead of pretending FreeIPA is a full Windows policy plane.

Architectural mismatch is not a troubleshooting bug. Sometimes the right answer is that the chosen identity platform does not provide the Windows behavior the stakeholders expect.

Review, rollout, and rollback checks

Before-merge peer review checklist

syntax is valid
template renders correctly
variables are environment-safe
no hardcoded secrets
role is idempotent
repeated run causes no drift
service config is validated before restart
handler behavior is correct
firewall and ports are aligned
rollback path is documented
functional validation is complete

Rollout order for risky changes

local syntax and config validation
one-host staging test
functional test
peer review
merge
controlled deploy
service validation
log review
user or app validation
monitor for regression

Rollback checklist

Define rollback before risky changes, not during the incident.

config backup taken
service unit backup taken
database backup or snapshot taken, if applicable
old package or install path retained
previous cert or truststore version retained
previous Ansible vars or template recoverable
tested restore or revert procedure documented
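For the config and unit file items, a timestamped copy taken immediately before the change is often enough. This is a minimal sketch with hypothetical function names. It covers single files only, not databases, packages, or truststores.

```shell
# backup_config FILE: copy FILE to FILE.bak.TIMESTAMP, preserving mode
# and timestamps, and print the backup path for the ticket.
backup_config() {
    local src="$1" stamp dest
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$src.bak.$stamp"
    cp -p -- "$src" "$dest" || return 1
    echo "$dest"
}

# restore_config BACKUP FILE: put a saved copy back in place.
restore_config() {
    cp -p -- "$1" "$2"
}
```

Record the printed backup path in the change notes, and validate the config after any restore before restarting the service.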

Sprint and ticket framing language

Status language

Resolved: issue reproduced, fix applied, validation complete, ready to close
Patched: code or config updated, validation complete or pending final deployment check
In progress: change implemented, awaiting review, testing, or dependency action
Planned: objective defined, approach known, dependencies identified
Blocked: cannot progress until dependency, approval, or access is provided

Useful next-steps language

peer review and merge
stage deploy and validate
monitor logs for regression
confirm maintenance window
validate pilot host behavior
confirm certificate and trust path
document rollback and runbook

Quick symptom-to-action map

SERVICE WON'T START
-> systemctl status
-> journalctl
-> config validation
-> unit file
-> port check

AUTH FAILS
-> DNS
-> trust and cert if LDAPS
-> bind and search base
-> group mapping
-> app mapper or role mapping

TLS FAILS
-> openssl s_client
-> hostname match
-> cert chain
-> truststore
-> right protocol and port

NO LOGS ARRIVE
-> listener port
-> sender config
-> protocol alignment
-> firewall
-> rsyslog validation

MAIL RELAY WRONG
-> postconf -n
-> recipient restrictions
-> mynetworks
-> relay domains
-> swaks tests

UPGRADE FAILED
-> runtime version
-> logs
-> config drift
-> plugin compatibility
-> database migration
-> rollback readiness

Best-practice reminders

validate config before restart
solve plain connectivity before solving TLS
solve TLS before blaming the application
solve LDAP bind and search before blaming role mapping
pilot risky changes on one host first
do not assume Windows behaves like Linux with FreeIPA
keep service-specific roles small and clear
keep baseline roles reusable
separate provisioning roles from service roles
always know your rollback path
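The layering reminders (connectivity before TLS, TLS before the application) can be sketched as a probe that names the first layer that fails. It assumes an implicit-TLS port, bash available for the /dev/tcp connect, and coreutils timeout. The function name first_failing_layer is hypothetical.

```shell
# first_failing_layer HOST PORT
# Prints dns, tcp, tls, or ok. Each layer is tested only after the one
# below it has passed.
first_failing_layer() {
    local host="$1" port="$2"
    # Layer 1: name resolution through the system resolver.
    getent hosts "$host" >/dev/null 2>&1 || { echo dns; return 1; }
    # Layer 2: plain TCP connect, capped at 5 seconds.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null \
        || { echo tcp; return 1; }
    # Layer 3: TLS handshake. This proves the handshake completes; it
    # does not by itself prove the client trusts the certificate.
    openssl s_client -connect "$host:$port" -servername "$host" \
        </dev/null >/dev/null 2>&1 || { echo tls; return 1; }
    echo ok
}
```

An ok result means the wire is healthy enough to start looking at the application. Anything else tells you which layer to stay in.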

One-screen compact version

INFRA TROUBLESHOOTING QUICK FLOW
1. Confirm symptom
2. Confirm scope
3. systemctl status / journalctl
4. Validate config
5. Check listener and port
6. Check DNS and connectivity
7. Check cert and trust if TLS
8. Check auth and group mapping if identity
9. Check recent Ansible or template change
10. Reproduce with one real transaction
11. Fix
12. Re-test
13. Document root cause and rollback

CORE COMMANDS
- systemctl status SERVICE -l --no-pager
- journalctl -u SERVICE -b --no-pager
- ss -lntup
- openssl s_client -connect host:port -servername host
- rsyslogd -N1
- postfix check
- dovecot -n
- ansible-playbook PLAYBOOK.yml --check --diff

GOLDEN RULES
- validate before restart
- align both ends of a connection
- pilot first
- trust and certs before app blame
- rollback always ready

This page is meant to be reused. Copy the relevant section into tickets, change plans, or handover notes, then trim it to the exact fault domain you are working.