Observability
This page is for Picora platform administrators who manage production deployments. It covers logging, alerting, on-call procedures, and the available manual reconciliation tools.
Logging
Picora emits structured JSON logs for every business event. There are no plain-text logs in production code.
Each log entry includes:
```json
{
  "level": "info | warn | error",
  "event": "image.upload.success",
  "userId": "xK9mR2pQ7vB...",
  "imageId": "...",
  "sizeBytes": 102400,
  "requestId": "req_abc123",
  "ts": "2026-04-27T08:00:00.000Z",
  "platform": "cloudflare | china",
  ...event-specific fields...
}
```
The requestId is the key for correlating logs across services for one user request. It’s set in the x-request-id response header so users can include it in support tickets.
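As an illustration of the entry shape above, here is a minimal sketch of a log builder. The helper `buildLogEntry` and the `LogContext` type are hypothetical names, not taken from Picora's codebase; only the field names come from the sample entry.

```typescript
// Hypothetical helper: field names follow the sample entry, everything
// else (function name, parameter shapes) is an assumption.
type LogLevel = "info" | "warn" | "error";
type LogPlatform = "cloudflare" | "china";

interface LogContext {
  requestId: string; // also returned in the x-request-id response header
  platform: LogPlatform;
  userId?: string;
}

// Build one structured entry as a single JSON line.
function buildLogEntry(
  level: LogLevel,
  event: string,
  ctx: LogContext,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields, // event-specific fields go first so they cannot override core keys
    level,
    event,
    userId: ctx.userId,
    requestId: ctx.requestId,
    ts: now.toISOString(),
    platform: ctx.platform,
  });
}
```

Spreading the event-specific fields before the core keys means a stray `level` or `requestId` in the payload cannot overwrite the values used for correlation.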
Where logs go
| Platform | Destination | Retention | Access |
|---|---|---|---|
| Overseas (CF) | Cloudflare Logpush → R2 bucket | 30 days hot, 1 year cold | Admin via dashboard or API |
| Mainland (Aliyun) | Aliyun Log Service (SLS) | 90 days indexed, 1 year cold | Admin via SLS console |
We do not run a heavyweight APM (Datadog / New Relic): Workers are sensitive to bundle size, so the trade-off is not worthwhile.
Common queries
```sql
-- Find all errors for a specific user in last 24h (Logpush SQL)
SELECT * FROM picora_logs
WHERE userId = 'xK9mR2pQ7vB...'
  AND level = 'error'
  AND ts > NOW() - INTERVAL '24 hours'
ORDER BY ts DESC;

-- Trace a request across services
SELECT * FROM picora_logs
WHERE requestId = 'req_abc123'
ORDER BY ts ASC;

-- Users approaching their bandwidth limit (over 80% of plan_limit this month).
-- plan_limit is aggregated so the HAVING clause is valid under GROUP BY.
SELECT userId, MAX(used) AS monthly_bw
FROM bandwidth_log
WHERE ts > date_trunc('month', NOW())
GROUP BY userId
HAVING MAX(used) > 0.8 * MAX(plan_limit);
```
Key event reference
The complete event taxonomy is in CLAUDE.md §18.2. Highlights of high-volume events:
| Event | When | Action |
|---|---|---|
| `image.upload.success` | Image uploaded | Standard log; no action |
| `image.upload.quota_exceeded` | User hit quota | Send quota warning email if 100% threshold |
| `auth.login.fail` | Failed login | Increment IP / user fail counter; trigger lockout at 5/5min |
| `video.bandwidth.degraded` | User auto-switched to 360p | Send degraded email; admin observes ratio |
| `video.bandwidth.suspended` | User video suspended | Send suspended email; admin investigates |
| `payment.webhook.received` | Payment received | Activate plan; send receipt |
| `mcp.tool.upload_doc` | MCP doc upload | Sampled at 10%; admin observes adoption |
| `cdn_whitelist.refresh_failed` | CDN allowlist couldn’t reload | P2 alert: fix DB connectivity |
| `unhandled.error` | Uncaught exception | P1 alert if rate spikes |
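
The `auth.login.fail` row describes a lockout at 5 failures within 5 minutes. Below is a minimal sketch of such a sliding-window counter. It assumes in-memory state, which would not be shared across Worker isolates, so a real deployment would presumably need shared storage. `recordLoginFailure` and the constants are illustrative names.

```typescript
// Lockout rule from the event table: 5 failures within 5 minutes.
const MAX_FAILURES = 5;
const WINDOW_MS = 5 * 60 * 1000;

// Failure timestamps (epoch ms) per key, where a key is an IP or a user ID.
// In-memory only: an assumption made to keep the sketch self-contained.
const failuresByKey = new Map<string, number[]>();

// Record one failed login and report whether the key is now locked out.
function recordLoginFailure(key: string, nowMs: number): boolean {
  // Keep only failures that are still inside the sliding window.
  const recent = (failuresByKey.get(key) ?? []).filter(
    (t) => nowMs - t < WINDOW_MS,
  );
  recent.push(nowMs);
  failuresByKey.set(key, recent);
  return recent.length >= MAX_FAILURES;
}
```

Calling it once per `auth.login.fail` event for both the IP and the user ID would cover the "IP / user fail counter" wording in the table.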
Alerts
Alert routing
| Severity | Channel | Response time |
|---|---|---|
| P1 (production down / data loss risk) | PagerDuty → on-call phone | 15 minutes |
| P2 (degraded but not down) | Slack #picora-alerts channel | 1 business hour |
| P3 (informational, e.g. cost spike) | Email digest, daily | Next business day |
Alert thresholds
| Metric | Threshold | Severity |
|---|---|---|
| 5xx rate | > 1% / minute (sliding window) | P1 |
| Image upload fail rate | > 5% / 5 minutes | P2 |
| Email send rate | > 3× daily mean (see §9 anti-abuse) | P1 |
| Video degraded user count | > expected ratio | P3 cost review |
| API P95 response time | > 1000ms | P2 |
| `unhandled.error` rate | > 0.1% / minute | P1 |
| `cdn_whitelist.refresh_failed` | any occurrence | P2 |
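
To make the routing and threshold tables concrete, the following sketch evaluates observed metrics against the numeric rows and maps each firing rule to its channel. The metric keys and the `evaluateAlerts` function are hypothetical. The "Video degraded user count" row is left out because its threshold ("expected ratio") has no number in this page.

```typescript
type Severity = "P1" | "P2" | "P3";

interface ThresholdRule {
  metric: string; // hypothetical metric key, not a Picora identifier
  limit: number; // the alert fires when the observed value exceeds this
  severity: Severity;
}

// Channels from the alert routing table.
const CHANNELS: Record<Severity, string> = {
  P1: "PagerDuty",
  P2: "Slack #picora-alerts",
  P3: "Email digest",
};

// Numeric rows from the thresholds table. Rates are fractions (1% = 0.01).
const RULES: ThresholdRule[] = [
  { metric: "http_5xx_rate_per_min", limit: 0.01, severity: "P1" },
  { metric: "image_upload_fail_rate_5min", limit: 0.05, severity: "P2" },
  { metric: "email_send_rate_vs_daily_mean", limit: 3, severity: "P1" },
  { metric: "api_p95_ms", limit: 1000, severity: "P2" },
  { metric: "unhandled_error_rate_per_min", limit: 0.001, severity: "P1" },
  // "any occurrence" is modelled as count > 0
  { metric: "cdn_whitelist_refresh_failed_count", limit: 0, severity: "P2" },
];

interface FiredAlert {
  metric: string;
  severity: Severity;
  channel: string;
}

// Return one alert per rule whose observed value exceeds its limit.
// Metrics missing from `observed` are skipped rather than treated as zero.
function evaluateAlerts(observed: Record<string, number>): FiredAlert[] {
  const fired: FiredAlert[] = [];
  for (const rule of RULES) {
    const value = observed[rule.metric];
    if (value !== undefined && value > rule.limit) {
      fired.push({
        metric: rule.metric,
        severity: rule.severity,
        channel: CHANNELS[rule.severity],
      });
    }
  }
  return fired;
}
```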
Reconciliation jobs
Several scheduled jobs reconcile state between subsystems and external providers.
Bandwidth attribution (hourly)
Pulls Bunny.net’s total bandwidth report and attributes per-user share by storage ratio. A bug here means user bandwidth quotas drift out of line with actual usage.
- Schedule: every hour at HH:05
- Logs: `scheduled.bandwidth.update` (success) / `scheduled.bandwidth.failed` (error, P1)
- Manual run: Admin → Bandwidth → Trigger reconciliation (logs `admin.bandwidth.manual_trigger`)
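Attribution by storage ratio can be sketched as a proportional split: each user's share of the provider-wide total equals their share of stored bytes. The function name and input shapes below are assumptions for illustration, not Picora's actual job code.

```typescript
// Split a provider-wide bandwidth total across users in proportion to
// each user's stored bytes. Hypothetical helper; input shapes are assumed.
function attributeBandwidth(
  totalBandwidthBytes: number,
  storageBytesByUser: Record<string, number>,
): Record<string, number> {
  const userIds = Object.keys(storageBytesByUser);
  const totalStorage = userIds.reduce(
    (sum, id) => sum + (storageBytesByUser[id] ?? 0),
    0,
  );

  const shares: Record<string, number> = {};
  for (const id of userIds) {
    // With no storage at all there is nothing to base a ratio on.
    shares[id] =
      totalStorage === 0
        ? 0
        : (totalBandwidthBytes * (storageBytesByUser[id] ?? 0)) / totalStorage;
  }
  return shares;
}
```

For example, with a total of 1000 bytes and users storing 300 and 100 bytes, the split is 750 and 250 bytes.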
Storage orphan cleanup (nightly)
Finds object storage entries with no matching DB row and deletes them (for example, objects left behind by an upload aborted mid-flight). Currently this runs only for images; video / audio orphan cleanup is on the v0.20+ roadmap.
- Schedule: 03:00 UTC daily
- Logs: `scheduled.orphan_cleanup.start` / `.complete`
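The core of orphan detection is a set difference between storage keys and database keys. The sketch below also skips very recent objects so that an upload still in progress is not deleted. That grace period is an assumption added here, not something this page specifies, and `findOrphans` and `StoredObject` are hypothetical names.

```typescript
interface StoredObject {
  key: string;
  uploadedAtMs: number; // epoch ms; assumed to be available from the object listing
}

// Return keys of stored objects that have no matching DB row and are at
// least `minAgeMs` old. The default one-hour grace period is an assumption.
function findOrphans(
  objects: StoredObject[],
  dbKeys: string[],
  nowMs: number,
  minAgeMs: number = 60 * 60 * 1000,
): string[] {
  const known = new Set(dbKeys);
  return objects
    .filter((o) => !known.has(o.key) && nowMs - o.uploadedAtMs >= minAgeMs)
    .map((o) => o.key);
}
```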
Failed transcoding cleanup
Videos stuck in status: processing for >24 hours are checked against Bunny.net; if Bunny doesn’t have them, mark as failed and notify the user.
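The selection rule above can be sketched as a filter: a video counts as lost when it has been in `processing` for more than 24 hours and the provider no longer lists it. The `VideoRow` shape, the `processingSinceMs` column, and the function name are hypothetical.

```typescript
interface VideoRow {
  id: string;
  status: string; // "processing" comes from the text; other values are illustrative
  processingSinceMs: number; // hypothetical column: when processing started (epoch ms)
}

const STUCK_AFTER_MS = 24 * 60 * 60 * 1000;

// Return IDs of videos to mark as failed: stuck in processing for more
// than 24 hours and absent from the provider's list of video IDs.
function findLostTranscodes(
  videos: VideoRow[],
  providerVideoIds: string[],
  nowMs: number,
): string[] {
  const onProvider = new Set(providerVideoIds);
  return videos
    .filter(
      (v) =>
        v.status === "processing" &&
        nowMs - v.processingSinceMs > STUCK_AFTER_MS &&
        !onProvider.has(v.id),
    )
    .map((v) => v.id);
}
```

The returned IDs would then be marked as failed and their owners notified, as described above.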
Subscription state sync
Reconciles Lemon Squeezy / Polar / WeChat / Alipay subscription state with Picora’s database. Catches webhook delivery failures.
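One way to picture this job is as a diff between provider-side and local subscription state. The sketch below assumes the payment provider is the source of truth, which this page implies but does not state. The `SubscriptionState` shape and the function name are hypothetical.

```typescript
interface SubscriptionState {
  userId: string;
  plan: string;
  active: boolean;
}

// Return the provider-side states that the local database is missing or
// disagrees with. Assumes the payment provider is the source of truth.
function findSubscriptionDrift(
  providerStates: SubscriptionState[],
  localStates: SubscriptionState[],
): SubscriptionState[] {
  const localByUser = new Map<string, SubscriptionState>();
  for (const s of localStates) {
    localByUser.set(s.userId, s);
  }

  return providerStates.filter((remote) => {
    const local = localByUser.get(remote.userId);
    return (
      local === undefined ||
      local.plan !== remote.plan ||
      local.active !== remote.active
    );
  });
}
```

Each returned entry corresponds to a change that a webhook should have delivered but apparently did not.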
On-call playbooks
“5xx rate spike P1 alert”
- Check Cloudflare dashboard for incident on Workers
- Check `unhandled.error` event rate by URL pattern to narrow down which endpoint is affected
- Recent deploys? Roll back if release-induced
- Check downstream dependencies (R2, D1, KV) status pages
- If user-impacting, post status update at status.picora.me
“User says ‘it’s broken’”
Without a requestId, you’re guessing. Always ask for:
- requestId (in error toast or network tab response header)
- Approximate time of attempt
- User email / account ID
- Browser / OS
Then query logs by userId and ts range, following the example queries under Common queries.
“Bandwidth attribution looks off”
- Check `scheduled.bandwidth.failed` events
- Verify Bunny.net API credentials are not expired
- Manually trigger reconciliation via admin panel
- If sustained issue, fall back to “use storage ratio with last good attribution” mode
Health checks
Each service exposes:
- `GET /health`: basic alive check
- `GET /health/deep`: DB + R2 + cache connectivity

The status page polls `/health/deep` every 60 seconds across regions.
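A deep health check has to fold several dependency probes into one answer. The sketch below shows only that aggregation step. Returning 503 when any dependency fails is an assumption, as are the response body shape and the names `summarizeDeepHealth` and `Dependency`.

```typescript
// Dependencies named for /health/deep on this page.
type Dependency = "db" | "r2" | "cache";

interface DeepHealth {
  status: 200 | 503; // 503 on any failure is an assumption, not documented behavior
  body: { ok: boolean; failed: Dependency[] };
}

// Fold per-dependency probe results (true = reachable) into one response.
function summarizeDeepHealth(results: Record<Dependency, boolean>): DeepHealth {
  const failed = (Object.keys(results) as Dependency[]).filter(
    (dep) => !results[dep],
  );
  const ok = failed.length === 0;
  return { status: ok ? 200 : 503, body: { ok, failed } };
}
```

Listing the failed dependencies in the body lets whoever is on call see which one to check first.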
Configuration knobs
Sensitive operational config is held in environment / secrets:
```bash
# Logging
LOG_LEVEL=info                   # debug | info | warn | error
LOG_SAMPLE_RATE_DEBUG=0.01       # debug events sampling

# Alerting
PAGERDUTY_INTEGRATION_KEY=<from secrets>
SLACK_WEBHOOK_URL=<from secrets>

# Reconciliation
BANDWIDTH_RECON_ENABLED=true
ORPHAN_CLEANUP_ENABLED=true

# Reflexive controls
RATE_LIMIT_ENABLED=true          # set false in extreme load to prevent cascade
MAINTENANCE_MODE=false           # toggle to read-only / 503 site-wide
```
Incident retrospective process
After any P1 incident:
- Within 24h: written timeline (who did what when, in `/admin/postmortems/`)
- Within 1 week: root cause analysis with engineering
- Within 2 weeks: action items (with owners, deadlines) — added to backlog
- Public summary (if user-impacting): published to status.picora.me and emailed to affected users
Common issues
“Logs missing for a specific time window” — Logpush has a 60-90 second ingestion lag. Logs from “now-2 minutes” may not yet be queryable.
“Can’t find requestId in logs” — verify the user’s request actually reached our infrastructure (could be network / DNS issue on their side).
“Reconciliation job failed but no alert” — alert routing may be misconfigured. Verify PAGERDUTY_INTEGRATION_KEY is current.
Related
- CDN Allowlist — admin operations
- Content Moderation — moderation queue and audit
- Project rules — §18 observability — full event taxonomy and CF Workers constraints
- Status page