Environment: Authentik, Docker Compose, PostgreSQL, Nginx Proxy Manager, Oracle Cloud VM
Why This Matters
Identity systems can fail in a way that looks deceptively simple from the browser.
In this case, Authentik did not show a dramatic application error. It showed the familiar startup message:
Server is starting up. Refreshing in a few seconds...
That page is easy to misread as a short boot delay. The important detail was that it never cleared. The service was alive enough to answer HTTP requests, but not ready enough to process authentication flows.
This is a useful reminder: for identity infrastructure, “container is running” and “authentication is working” are not the same thing.
The Problem
Users attempting to authenticate through authentik.tekonline.com.au were redirected to Authentik, but the login flow stayed on the startup screen.
Typical symptoms:
- Public Authentik endpoint returned 503
- Browser showed “server is starting up”
- Docker showed the Authentik server and worker containers running
- Authentik’s live health endpoint responded successfully
- Authentik’s ready health endpoint failed
That last point was the key.
Live vs Ready Health Checks
Authentik exposes more than one kind of health signal:
- Live means the process is up.
- Ready means the backend is ready to serve real application traffic.
The live endpoint was healthy, but the readiness endpoint was not. That told us the process had started, but something required by the application stack was missing or unavailable.
This distinction matters for monitoring. A live-only check can miss a real outage.
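To make the distinction concrete, here is a minimal Python sketch that combines the two signals into one verdict. It assumes the endpoints are served at /-/health/live/ and /-/health/ready/ under the Authentik base URL, and it treats any 2xx response as success. The function names and verdict strings are illustrative, not part of Authentik.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses such as 503 are raised as HTTPError; keep the code.
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def _is_success(status: Optional[int]) -> bool:
    # Assumption: any 2xx counts as a passing health check.
    return status is not None and 200 <= status < 300


def classify_health(live_status: Optional[int], ready_status: Optional[int]) -> str:
    """Combine the liveness and readiness results into one verdict."""
    if live_status is None and ready_status is None:
        return "unreachable"
    if _is_success(ready_status):
        return "ready"
    if _is_success(live_status):
        return "alive-but-not-ready"
    return "down"


# Example use (hypothetical base URL):
#   base = "https://authentik.example.com"
#   verdict = classify_health(
#       fetch_status(base + "/-/health/live/"),
#       fetch_status(base + "/-/health/ready/"),
#   )
```

In this incident the live check passed while the ready check returned 503, which this sketch reports as "alive-but-not-ready".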
What We Found
The Authentik server and worker containers were present, but the PostgreSQL container was missing from the Docker Compose project.
The Authentik containers were configured to connect to PostgreSQL using the Docker service name:
postgresql
But because the database container was not running, Docker DNS could not resolve that service name inside the Authentik network.
The worker logs showed the pattern clearly:
OperationalError('[Errno -3] Temporary failure in name resolution')
That means the application was not failing because of an OAuth configuration issue, a bad redirect URI, or a broken reverse proxy rule. It was failing because its database dependency was absent.
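The same failure can be reproduced outside the application with a plain name lookup. The sketch below is an illustration, not Authentik's own code: it asks the system resolver for the service name and reports the error in the same shape as the worker log. The port 5432 assumes PostgreSQL's default, and the check is only meaningful when run from inside a container attached to the Authentik network, where Docker DNS is in effect.

```python
import socket
from typing import Callable, Optional


def resolution_error(
    hostname: str,
    resolver: Callable = socket.getaddrinfo,
    port: int = 5432,  # assumption: PostgreSQL on its default port
) -> Optional[str]:
    """Return None if hostname resolves, otherwise a short error description."""
    try:
        resolver(hostname, port)
    except socket.gaierror as exc:
        # gaierror carries the resolver's error number and message.
        return f"[Errno {exc.errno}] {exc.strerror}"
    return None


# Example use from inside the Authentik container:
#   print(resolution_error("postgresql"))
```

A result of None means the name resolved. A description such as the one in the worker log means the service name is not known to Docker DNS, which points at a missing or detached container rather than bad credentials.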
Root Cause
The immediate root cause was:
Authentik could not become ready because its PostgreSQL service was missing from the Docker Compose runtime.
The database volume itself was still present and contained the existing PostgreSQL data. The missing piece was the running container that mounted and served that data.
There was also a contributing operational signal: Docker had recently logged disk pressure while writing container logs. Around that same period, there was evidence of Docker cleanup activity. The exact command or actor that removed the stopped database container was not recoverable from the available logs, so we are careful not to overstate it.
The safe conclusion is:
- PostgreSQL had been interrupted earlier.
- The Authentik database volume survived.
- The PostgreSQL container was no longer running.
- Authentik kept running, but stayed unready because it could not resolve or connect to its database service.
Why Docker “Healthy” Was Not Enough
Docker health checks can be useful, but they are only as good as what they test.
In this incident, the Authentik containers looked healthy from a high-level container view, but the public application was not ready. The more accurate signal was Authentik’s own readiness endpoint.
A better monitoring target is:
/-/health/ready/
For public monitoring, the check should expect HTTP 200. If it returns 503, Authentik may be alive but not usable.
The Fix
The fix was intentionally conservative:
- Confirm the existing PostgreSQL data volume was present.
- Back up that volume before starting anything.
- Start only the missing PostgreSQL service from the existing Compose file.
- Wait for PostgreSQL to become healthy.
- Restart the Authentik server and worker.
- Verify readiness and the full authentication flow.
The database started successfully and performed automatic recovery from the earlier unclean stop. Authentik then connected to PostgreSQL, finished startup, and began serving the login flow again.
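The steps above can be sketched as an ordered command plan. This is a hedged outline rather than the exact commands used during the incident: the volume name and backup directory are placeholders, the backup uses a throwaway alpine container with tar, and the --wait flag assumes a recent Docker Compose v2 and a health check defined on the postgresql service. The commands are meant to run from the directory that holds the existing Compose file.

```python
import shlex
import subprocess
from typing import List


def recovery_commands(volume: str, backup_dir: str) -> List[List[str]]:
    """Build the conservative recovery plan as a list of argv lists.

    volume and backup_dir are placeholders; backup_dir should be an
    absolute host path so Docker treats it as a bind mount.
    """
    return [
        # 1. Confirm the existing data volume is present (fails if it is not).
        ["docker", "volume", "inspect", volume],
        # 2. Archive the volume, mounted read-only, before starting anything.
        ["docker", "run", "--rm",
         "-v", f"{volume}:/source:ro",
         "-v", f"{backup_dir}:/backup",
         "alpine", "tar", "czf", "/backup/postgres-volume.tar.gz",
         "-C", "/source", "."],
        # 3. Start only the database service and wait for it to be healthy.
        ["docker", "compose", "up", "-d", "--wait", "postgresql"],
        # 4. Restart the Authentik server and worker so they reconnect.
        ["docker", "compose", "restart", "server", "worker"],
    ]


def run_plan(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Return the rendered commands; execute them only when dry_run is False."""
    rendered = [shlex.join(command) for command in commands]
    if not dry_run:
        for command in commands:
            # check=True stops the plan at the first failing step.
            subprocess.run(command, check=True)
    return rendered


# Example: print the plan without touching Docker (names are hypothetical).
#   for line in run_plan(recovery_commands("authentik_database", "/srv/backups")):
#       print(line)
```

Keeping the plan in dry-run mode by default matches the conservative approach: the commands can be reviewed before anything touches the stateful service.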
Verification
After restoring the database container:
- PostgreSQL reported healthy.
- Authentik server reported healthy.
- Authentik worker reported healthy.
- Authentik readiness returned HTTP 200.
- The affected OAuth login URL no longer returned the startup page.
- A real browser login flow completed successfully.
That confirmed this was a dependency/runtime issue, not an application-provider configuration issue.
Preventing a Repeat
1. Monitor readiness, not just liveness
For Authentik, monitor:
https://authentik.example.com/-/health/ready/
This catches backend dependency failures that a basic process check may miss.
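Because a short boot delay and a stuck startup look identical on a single probe, one option is to alert only after several consecutive readiness failures. The small tracker below is a sketch under that assumption. The threshold of three is arbitrary, and the status codes would come from whatever probe the monitoring system already runs.

```python
from typing import Optional


class ReadinessTracker:
    """Track consecutive failed readiness probes and decide when to alert."""

    def __init__(self, failures_before_alert: int = 3):
        # Hypothetical threshold; tune it to the probe interval in use.
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def record(self, status: Optional[int]) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if status == 200:
            self.consecutive_failures = 0
            return False
        # Any other outcome (503, timeout reported as None) counts as a failure.
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_before_alert
```

With a one-minute probe interval, a threshold of three would tolerate a normal restart but flag a startup page that never clears.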
2. Alert if required containers are absent
For a Compose-based Authentik deployment, the expected core services are:
- server
- worker
- postgresql
If PostgreSQL is missing or unhealthy, the identity service is at risk even if the web container still responds.
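A presence check can be as simple as comparing the running services against the expected set. The sketch below assumes the input is the text printed by docker compose ps --services --status running on a recent Compose v2 release, one service name per line. The service names match the ones listed above, and the function name is illustrative.

```python
from typing import Iterable, Set

# Core services named in this deployment's Compose file.
REQUIRED_SERVICES = frozenset({"server", "worker", "postgresql"})


def missing_services(
    running_output: str,
    required: Iterable[str] = REQUIRED_SERVICES,
) -> Set[str]:
    """Return the required services that do not appear in the running list."""
    running = {line.strip() for line in running_output.splitlines() if line.strip()}
    return set(required) - running


# Example use (run from the Compose project directory):
#   import subprocess
#   output = subprocess.run(
#       ["docker", "compose", "ps", "--services", "--status", "running"],
#       capture_output=True, text=True, check=True,
#   ).stdout
#   if missing_services(output):
#       ...  # raise an alert
```

During this incident the check would have returned {"postgresql"} even though the server and worker were still listed as running.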
3. Be careful with broad Docker cleanup
Docker prune commands such as docker container prune and docker system prune are useful, but they remove all stopped containers. If a stateful service is stopped when cleanup runs, and is not protected by a running Compose deployment, the prune can remove its container while leaving the volume behind.
The data may still be safe, but the application will not recover until the missing service is recreated.
4. Keep disk pressure visible
Disk pressure can create secondary failures:
- logs fail to write
- containers fail to start
- cleanup commands get run under pressure
- old stopped containers may be removed without enough review
Disk alerts should fire well before the host reaches a critical state.
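As a rough illustration of an early-warning check, the sketch below reads filesystem usage with Python's standard library and maps it to an alert level. The 80% and 90% thresholds are assumptions chosen for the example, not values from this incident, and the path should point at the filesystem that holds Docker's data.

```python
import shutil


def used_fraction(path: str = "/") -> float:
    """Return the used share of the filesystem containing path (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert_level(
    fraction: float,
    warn_at: float = 0.80,      # hypothetical warning threshold
    critical_at: float = 0.90,  # hypothetical critical threshold
) -> str:
    """Map a used fraction to "ok", "warning", or "critical"."""
    if fraction >= critical_at:
        return "critical"
    if fraction >= warn_at:
        return "warning"
    return "ok"


# Example use:
#   level = disk_alert_level(used_fraction("/var/lib/docker"))
```

The warning level is the one that matters here: it leaves time to clean up deliberately instead of running prune commands under pressure.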
5. Back up before repairing stateful services
Before restarting or recreating a database container, back up the volume. Even if the fix looks obvious, stateful services deserve a checkpoint before changes are made.
Key Takeaways
- The browser message “server is starting up” can mean a dependency is missing, not just that the app is booting.
- Authentik liveness and readiness are different signals.
- A running Authentik container does not guarantee a working authentication flow.
- Missing Docker service discovery for postgresql caused the backend to stay unready.
- The database volume survived, so recovery was straightforward once the PostgreSQL container was restored.
- Public monitoring should use Authentik’s readiness endpoint, not only Docker container health.
Final Thought
This was a good example of a small infrastructure gap creating a visible identity outage. The repair was simple once the signal was clear: Authentik was alive, but not ready, because the database service it depended on was missing.
The durable fix is better monitoring around readiness, dependencies, and disk pressure.