Chapter 15: Troubleshooting Runbook¶
Audience: DevOps, platform engineers
Login loop¶
Symptoms: User hits the login page, authenticates, gets redirected back to the login page, loop repeats indefinitely.
Root cause pattern: Any database failure causes
_get_or_create_oauth2_user() to fail silently. /auth/check returns
{authenticated: false}. The frontend redirects to /oauth2/sign_in. Loop.
Common triggers: - Cloud SQL Proxy container died or was restarted without restarting the API - Network partition between VM and Cloud SQL instance - Cloud SQL instance maintenance or restart
Detection:
- Browser: redirect counter triggers after 2 redirects in 30 seconds, showing
an error/retry UI instead of looping
- API: /auth/check returns HTTP 503 (not 200) when the database is
unreachable, with auth_error in the response
- Logs: look for connection refused or timeout errors in catscan-api logs
Fix:
1. Check Cloud SQL Proxy: sudo docker ps | grep cloudsql
2. If down: sudo docker compose -f docker-compose.gcp.yml restart cloudsql-proxy
3. Wait 10 seconds, then restart the API:
sudo docker compose -f docker-compose.gcp.yml restart api
4. Verify: curl -sS http://localhost:8000/health
Prevention: The three-layer fix (applied Feb 2026):
1. Backend propagates DB errors via request.state.auth_error
2. /auth/check returns 503 when DB is unreachable
3. Frontend has redirect counter (max 2 in 30s) + error/retry UI
Data-freshness timeout¶
Symptoms: /uploads/data-freshness returns 500, times out, or the
runtime health gate shows BLOCKED on data health.
Root cause pattern: The data-freshness query scans large tables
(rtb_daily at 84M rows, rtb_bidstream at 21M rows). If the query plan
degrades to a sequential scan instead of using indexes, it can take 160+
seconds.
Detection: 1. Hit the endpoint directly from the VM:
curl -sS --max-time 60 -H 'X-Email: cat-scan@rtb.cat' \
'http://localhost:8000/uploads/data-freshness?days=14&buyer_id=<ID>'
sudo docker exec catscan-api python -c "
import os, psycopg
conn = psycopg.connect(os.environ['POSTGRES_DSN'])
for r in conn.execute('EXPLAIN (ANALYZE, BUFFERS) <query>').fetchall():
print(list(r.values())[0])
"
Parallel Seq Scan on large tables. This is the problem.
Fix pattern:
- Rewrite GROUP BY queries as generate_series + EXISTS to force index
lookups. See Database Operations for the pattern.
- Ensure SET LOCAL statement_timeout is used (not SET + RESET).
- Check that indexes (buyer_account_id, metric_date DESC) exist on all
target tables.
Gmail import failure¶
Symptoms: Data freshness grid shows "missing" cells for recent dates. Import history has no recent entries.
Detection:
curl -sS -H 'X-Email: cat-scan@rtb.cat' \
http://localhost:8000/gmail/status
Check: last_reason, unread count, latest_metric_date.
Common causes:
- Gmail OAuth token expired: re-authorize at /settings/accounts > Gmail tab
- Cloud SQL Proxy down: Gmail import writes to Postgres, so DB must be
reachable
- Large unread count (30+): import may be stuck processing or the mailbox
has a backlog
Fix:
1. If last_reason shows an error: restart the import job from the UI or API
2. If the token expired: re-authorize Gmail integration
3. If Cloud SQL is down: fix the database connection first (see login loop)
Container restart ordering¶
Symptom: API logs show "connection refused" to port 5432 on startup.
Cause: The API container started before Cloud SQL Proxy was ready.
Fix: Restart with correct ordering:
sudo docker compose -f docker-compose.gcp.yml up -d cloudsql-proxy
sleep 10
sudo docker compose -f docker-compose.gcp.yml up -d api
Or restart everything (compose handles dependencies):
sudo docker compose -f docker-compose.gcp.yml up -d --force-recreate
SET statement_timeout syntax error¶
Symptom: Endpoint returns 500 with error:
syntax error at or near "$1" LINE 1: SET statement_timeout = $1
Cause: psycopg3 converts %s to $1 for server-side parameter binding,
but PostgreSQL's SET command does not support parameter placeholders.
Fix: Use f-string with validated integer:
# Wrong:
conn.execute("SET statement_timeout = %s", (timeout_ms,))
# Right:
timeout_ms = max(int(statement_timeout_ms), 1) # validated int
conn.execute(f"SET LOCAL statement_timeout = {timeout_ms}")
Runtime health gate failure¶
Symptom: v1-runtime-health-strict.yml workflow fails.
Triage:
1. Check the workflow logs: gh run view <id> --log-failed
2. Look for FAIL vs. BLOCKED:
- FAIL = something broke, investigate
- BLOCKED = dependency missing (no data, no endpoint), may be
pre-existing
3. Common pre-existing BLOCKED reasons:
- "rtb_quality_freshness state is unavailable": no quality data for
this buyer/period
- "proposal has no billing_id": data setup issue
- "QPS page API rollup missing required paths": analytics endpoint
not populated yet
4. Compare against previous runs to identify regressions vs. pre-existing
issues.
Related¶
- Health Monitoring: monitoring tools
- Database Operations: query and index details
- Deployment: deploying fixes