Observability

This page is for Picora platform administrators who manage production deployments. It covers logging, alerting, reconciliation jobs and their manual triggers, on-call playbooks, health checks, and the incident retrospective process.

Logging

Picora emits structured JSON logs for every business event. There are no plain-text logs in production code.

Each log entry includes:

{
  "level": "info | warn | error",
  "event": "image.upload.success",
  "userId": "xK9mR2pQ7vB...",
  "imageId": "...",
  "sizeBytes": 102400,
  "requestId": "req_abc123",
  "ts": "2026-04-27T08:00:00.000Z",
  "platform": "cloudflare | china",
  ...event-specific fields...
}

The requestId is the key for correlating logs across services for one user request. It’s set in the x-request-id response header so users can include it in support tickets.
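
As an illustration of the entry shape and the requestId convention above, here is a minimal TypeScript sketch. The helper names (`buildLogLine`, `resolveRequestId`), the `LogContext` type, and the policy of reusing an inbound `x-request-id` are assumptions made for the example, not Picora's actual logger; only the field names come from the schema above.

```typescript
// Hypothetical helper types; field names follow the log schema above.
type Level = "info" | "warn" | "error";
type Platform = "cloudflare" | "china";

interface LogContext {
  requestId: string;
  platform: Platform;
  userId?: string;
}

// Build one JSON log line. Event-specific fields are spread first so the
// common fields (level, event, requestId, ts, platform) cannot be overwritten.
function buildLogLine(
  ctx: LogContext,
  level: Level,
  event: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields,
    level,
    event,
    userId: ctx.userId, // omitted from the output when undefined
    requestId: ctx.requestId,
    ts: now.toISOString(),
    platform: ctx.platform,
  });
}

// Assumption: reuse an inbound x-request-id if one is present, otherwise
// mint a new ID. The "req_" prefix mirrors the example entry above; the
// random suffix format is made up for this sketch.
function resolveRequestId(inbound: string | null | undefined): string {
  if (inbound && inbound.trim() !== "") return inbound.trim();
  return "req_" + Math.random().toString(36).slice(2, 10);
}

// Usage: one context per request, passed to every log call for that request.
const ctx: LogContext = {
  requestId: resolveRequestId("req_abc123"),
  platform: "cloudflare",
  userId: "u1",
};
const line = buildLogLine(
  ctx,
  "info",
  "image.upload.success",
  { sizeBytes: 102400 },
  new Date("2026-04-27T08:00:00.000Z"),
);
console.log(line);
```

Whatever value `resolveRequestId` returns is the one that would be echoed back in the `x-request-id` response header, so the ID a user quotes in a ticket matches the ID in every log line for that request.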

Where logs go

Platform	Destination	Retention	Access
Overseas (CF)	Cloudflare Logpush → R2 bucket	30 days hot, 1 year cold	Admin via dashboard or API
Mainland (Aliyun)	Aliyun Log Service (SLS)	90 days indexed, 1 year cold	Admin via SLS console

We do not run a heavyweight APM (Datadog / New Relic): Workers are sensitive to bundle size, so the trade-off is not worthwhile.

Common queries

-- Find all errors for a specific user in last 24h (Logpush SQL)
SELECT * FROM picora_logs
WHERE userId = 'xK9mR2pQ7vB...' AND level = 'error'
  AND ts > NOW() - INTERVAL '24 hours'
ORDER BY ts DESC;

-- Trace a request across services
SELECT * FROM picora_logs
WHERE requestId = 'req_abc123'
ORDER BY ts ASC;

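-- Users above 80% of their bandwidth limit this month
-- (plan_limit must be grouped, since it is not aggregated)
SELECT userId, MAX(used) AS monthly_bw, plan_limit
FROM bandwidth_log
WHERE ts > date_trunc('month', NOW())
GROUP BY userId, plan_limit
HAVING MAX(used) > 0.8 * plan_limit;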

Key event reference

The complete event taxonomy is in CLAUDE.md §18.2. Highlights of high-volume events:

Event	When	Action
`image.upload.success`	Image uploaded	Standard log; no action
`image.upload.quota_exceeded`	User hit quota	Send quota warning email if the 100% threshold is reached
`auth.login.fail`	Failed login	Increment IP / user fail counter; trigger lockout at 5 failures in 5 minutes
`video.bandwidth.degraded`	User auto-switched to 360p	Send degraded email; admin observes ratio
`video.bandwidth.suspended`	User video suspended	Send suspended email; admin investigates
`payment.webhook.received`	Payment received	Activate plan; send receipt
`mcp.tool.upload_doc`	MCP doc upload	Sampled at 10%; admin observes adoption
`cdn_whitelist.refresh_failed`	CDN allowlist couldn’t reload	P2 alert — fix DB connectivity
`unhandled.error`	Uncaught exception	P1 alert if rate spikes
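
The lockout rule on `auth.login.fail` (5 failures in 5 minutes) can be sketched as a sliding-window counter. This is an illustrative sketch only: the `FailCounter` class, the in-memory `Map`, and the choice to lock on the fifth failure are assumptions, and a real deployment would need storage shared across Worker instances.

```typescript
// Sliding-window failure counter (illustrative; all names are hypothetical).
const WINDOW_MS = 5 * 60 * 1000; // 5 minutes
const MAX_FAILS = 5;

class FailCounter {
  // key (IP address or user ID) -> timestamps of recent failures, in ms
  private fails = new Map<string, number[]>();

  // Record one failure and report whether the key has now reached the
  // lockout threshold within the window.
  recordFailure(key: string, nowMs: number): boolean {
    const recent = (this.fails.get(key) ?? []).filter(
      (t) => nowMs - t < WINDOW_MS, // drop failures older than the window
    );
    recent.push(nowMs);
    this.fails.set(key, recent);
    return recent.length >= MAX_FAILS;
  }
}
```

Calling `recordFailure` once per `auth.login.fail` event, separately for the IP key and the user key, gives the two counters the table refers to.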

Alerts

Alert routing

Severity	Channel	Response time
P1 (production down / data loss risk)	PagerDuty → on-call phone	15 minutes
P2 (degraded but not down)	Slack `#picora-alerts` channel	1 business hour
P3 (informational, e.g. cost spike)	Email digest, daily	Next business day

Alert thresholds

Metric	Threshold	Severity
5xx rate	> 1% / minute (sliding window)	P1
Image upload fail rate	> 5% / 5 minutes	P2
Email send rate	> 3× daily mean (see §9 anti-abuse)	P1
Video degraded user count	> expected ratio	P3 cost review
API P95 response time	> 1000ms	P2
`unhandled.error` rate	> 0.1% / minute	P1
`cdn_whitelist.refresh_failed`	any occurrence	P2
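
To show how the threshold and routing tables fit together, here is a small TypeScript sketch that evaluates a metrics snapshot and maps each breach to a channel. The metric keys (`http_5xx_rate` and so on), the `evaluate` function, and the channel strings are hypothetical names chosen for the example; only the numeric thresholds and severities are taken from the tables above, and the windowing (per minute, per 5 minutes) is assumed to happen upstream.

```typescript
type Severity = "P1" | "P2" | "P3";

interface Rule {
  metric: string;
  threshold: number; // alert when the value is strictly greater
  severity: Severity;
}

// Subset of the threshold table, with rates as fractions and latency in ms.
const RULES: Rule[] = [
  { metric: "http_5xx_rate", threshold: 0.01, severity: "P1" },
  { metric: "image_upload_fail_rate", threshold: 0.05, severity: "P2" },
  { metric: "api_p95_ms", threshold: 1000, severity: "P2" },
  { metric: "unhandled_error_rate", threshold: 0.001, severity: "P1" },
];

// Severity -> channel, following the routing table (labels are illustrative).
const CHANNEL: Record<Severity, string> = {
  P1: "pagerduty",
  P2: "slack:#picora-alerts",
  P3: "email-digest",
};

interface Alert {
  metric: string;
  severity: Severity;
  channel: string;
}

// Return one alert per rule whose metric is present and over its threshold.
function evaluate(metrics: Record<string, number>): Alert[] {
  return RULES.filter((r) => {
    const value = metrics[r.metric];
    return value !== undefined && value > r.threshold;
  }).map((r) => ({
    metric: r.metric,
    severity: r.severity,
    channel: CHANNEL[r.severity],
  }));
}
```

Keeping rules as data rather than branching code makes it easier to compare the running configuration against the table when a threshold changes.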

Reconciliation jobs

Several scheduled jobs reconcile state between subsystems and external providers.

Bandwidth attribution (hourly)

Pulls Bunny.net’s total bandwidth report and attributes a share to each user in proportion to their storage. A bug here means per-user bandwidth quotas drift out of line with actual usage.

Schedule: every hour at HH:05
Logs: scheduled.bandwidth.update (success) / scheduled.bandwidth.failed (error, P1)
Manual run: Admin → Bandwidth → Trigger reconciliation (logs admin.bandwidth.manual_trigger)
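
The storage-ratio attribution described above can be sketched as follows. This is a simplified illustration, not the production job: the `attributeBandwidth` function and its types are hypothetical, shares are left unrounded, and attributing nothing when total storage is zero is an assumption.

```typescript
interface UserStorage {
  userId: string;
  storageBytes: number;
}

interface UserShare {
  userId: string;
  bandwidthBytes: number;
}

// Split the provider-reported total across users in proportion to each
// user's stored bytes: share = total * (userStorage / totalStorage).
function attributeBandwidth(
  totalBandwidthBytes: number,
  users: UserStorage[],
): UserShare[] {
  const totalStorage = users.reduce((sum, u) => sum + u.storageBytes, 0);
  if (totalStorage === 0) {
    // Assumption: with no stored bytes there is no basis for a split.
    return users.map((u) => ({ userId: u.userId, bandwidthBytes: 0 }));
  }
  return users.map((u) => ({
    userId: u.userId,
    bandwidthBytes: (totalBandwidthBytes * u.storageBytes) / totalStorage,
  }));
}
```

For example, with a reported total of 1000 bytes and two users storing 300 and 700 bytes, the attributed shares are 300 and 700 bytes. The sketch also makes the limitation visible: a user with large storage but little traffic is still charged a large share.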

Storage orphan cleanup (nightly)

Finds object storage entries with no matching DB row and deletes them (for example, objects left behind by uploads aborted mid-flight). Currently covers images only; video / audio orphan cleanup is on the v0.20+ roadmap.

Schedule: 03:00 UTC daily
Logs: scheduled.orphan_cleanup.start / .complete
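
At its core, orphan detection is a set difference between storage keys and DB keys. The sketch below is a hedged illustration: `findOrphans`, `StoredObject`, and especially the `minAgeMs` grace period are assumptions added here (the grace period is a precaution against deleting objects whose DB row has not been written yet, and is not something this page states the job does).

```typescript
interface StoredObject {
  key: string;
  uploadedAtMs: number;
}

// Return keys that exist in object storage, have no matching DB row, and
// are at least minAgeMs old (the age check is an assumed safety margin).
function findOrphans(
  objects: StoredObject[],
  dbKeys: string[],
  nowMs: number,
  minAgeMs: number,
): string[] {
  const known = new Set(dbKeys);
  return objects
    .filter((o) => !known.has(o.key) && nowMs - o.uploadedAtMs >= minAgeMs)
    .map((o) => o.key);
}
```

Returning the list instead of deleting inside the function lets the job log the candidates before any delete call is made.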

Failed transcoding cleanup

Videos stuck in status: processing for >24 hours are checked against Bunny.net; if Bunny doesn’t have them, mark as failed and notify the user.
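
The selection rule above can be sketched as a two-stage filter. The `Video` type, the `ready` status value, and the `providerHasVideo` callback (standing in for a Bunny.net lookup) are hypothetical; only the `processing` and `failed` states and the 24-hour cutoff come from the description.

```typescript
interface Video {
  id: string;
  status: "processing" | "ready" | "failed"; // "ready" is an assumed value
  startedAtMs: number;
}

const STUCK_AFTER_MS = 24 * 60 * 60 * 1000; // 24 hours

// Return IDs of videos that have been processing for more than 24 hours
// and that the provider does not know about. The caller would then mark
// them as failed and notify the user.
function findFailedTranscodes(
  videos: Video[],
  providerHasVideo: (id: string) => boolean,
  nowMs: number,
): string[] {
  return videos
    .filter((v) => v.status === "processing" && nowMs - v.startedAtMs > STUCK_AFTER_MS)
    .filter((v) => !providerHasVideo(v.id)) // only query the provider for stuck videos
    .map((v) => v.id);
}
```

Applying the age filter first keeps the number of provider lookups proportional to the stuck set rather than to all videos.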

Subscription state sync

Reconciles Lemon Squeezy / Polar / WeChat / Alipay subscription state with Picora’s database. Catches webhook delivery failures.

On-call playbooks

“5xx rate spike P1 alert”

Check Cloudflare dashboard for incident on Workers
Check unhandled.error event rate by URL pattern — narrow which endpoint
Recent deploys? Roll back if release-induced
Check downstream dependencies (R2, D1, KV) status pages
If user-impacting, post status update at status.picora.me

“User says ‘it’s broken’”

Without a requestId, you’re guessing. Always ask for:

requestId (in error toast or network tab response header)
Approximate time of attempt
User email / account ID
Browser / OS

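Then query logs by requestId if the user provided one; otherwise fall back to userId plus a ts range around the reported time (see the example queries under Common queries).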

“Bandwidth attribution looks off”

Check scheduled.bandwidth.failed events
Verify Bunny.net API credentials are not expired
Manually trigger reconciliation via admin panel
If sustained issue, fall back to “use storage ratio with last good attribution” mode

Health checks

Each service exposes:

GET /health          — basic alive check
GET /health/deep     — DB + R2 + cache connectivity

Status page polls /health/deep every 60 seconds across regions.
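
One way to turn individual dependency checks into a `/health/deep` result is sketched below. The response shape, the `summarizeHealth` function, and the choice of 503 on any failure are assumptions for illustration; this page does not specify what the endpoint returns.

```typescript
interface CheckResult {
  name: string; // e.g. "db", "r2", "cache"
  ok: boolean;
  latencyMs: number;
}

interface HealthSummary {
  httpStatus: number;
  body: { state: "ok" | "degraded"; failed: string[] };
}

// Combine per-dependency results into one response: 200 only if every
// check passed, otherwise 503 with the names of the failing checks.
function summarizeHealth(checks: CheckResult[]): HealthSummary {
  const failed = checks.filter((c) => !c.ok).map((c) => c.name);
  const healthy = failed.length === 0;
  return {
    httpStatus: healthy ? 200 : 503,
    body: { state: healthy ? "ok" : "degraded", failed },
  };
}
```

Listing the failed dependencies by name gives the on-call engineer a starting point without a separate log query.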

Configuration knobs

Sensitive operational config is held in environment / secrets:

# Logging
LOG_LEVEL=info                # debug | info | warn | error
LOG_SAMPLE_RATE_DEBUG=0.01    # debug events sampling

# Alerting
PAGERDUTY_INTEGRATION_KEY=<from secrets>
SLACK_WEBHOOK_URL=<from secrets>

# Reconciliation
BANDWIDTH_RECON_ENABLED=true
ORPHAN_CLEANUP_ENABLED=true

# Reflexive controls
RATE_LIMIT_ENABLED=true       # set false in extreme load to prevent cascade
MAINTENANCE_MODE=false        # toggle to read-only / 503 site-wide
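
As a sketch of how the logging knobs might be consumed, the TypeScript below parses `LOG_LEVEL` and `LOG_SAMPLE_RATE_DEBUG` and decides whether an event is emitted. The function names, the fallback to defaults on invalid values, and the rule that sampling applies only to debug events once debug is enabled are assumptions, not documented behavior.

```typescript
const LEVELS = ["debug", "info", "warn", "error"] as const;
type LogLevel = (typeof LEVELS)[number];

interface LogConfig {
  level: LogLevel;
  debugSampleRate: number; // fraction in [0, 1]
}

// Read the two logging knobs, falling back to the defaults shown above
// when a value is missing or invalid (the fallback is an assumption).
function parseLogConfig(env: Record<string, string | undefined>): LogConfig {
  const rawLevel = env["LOG_LEVEL"] ?? "info";
  const level: LogLevel =
    (LEVELS as readonly string[]).indexOf(rawLevel) >= 0 ? (rawLevel as LogLevel) : "info";
  const rate = Number(env["LOG_SAMPLE_RATE_DEBUG"] ?? "0.01");
  const debugSampleRate = Number.isFinite(rate) ? Math.min(1, Math.max(0, rate)) : 0.01;
  return { level, debugSampleRate };
}

// Emit an event if its level is at or above the configured level;
// debug events are additionally sampled at debugSampleRate.
function shouldEmit(
  cfg: LogConfig,
  level: LogLevel,
  random: () => number = Math.random,
): boolean {
  if (LEVELS.indexOf(level) < LEVELS.indexOf(cfg.level)) return false;
  if (level === "debug") return random() < cfg.debugSampleRate;
  return true;
}
```

Passing `random` as a parameter keeps the sampling decision testable with a fixed value.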

Incident retrospective process

After any P1 incident:

Within 24h: written timeline (who did what when, in /admin/postmortems/)
Within 1 week: root cause analysis with engineering
Within 2 weeks: action items (with owners, deadlines) — added to backlog
Public summary (if user-impacting): published to status.picora.me and emailed to affected users

Common issues

“Logs missing for a specific time window” — Logpush has a 60-90 second ingestion lag. Logs from “now-2 minutes” may not yet be queryable.

“Can’t find requestId in logs” — verify the user’s request actually reached our infrastructure (could be network / DNS issue on their side).

“Reconciliation job failed but no alert” — alert routing may be misconfigured. Verify PAGERDUTY_INTEGRATION_KEY is current.

Related pages

CDN Allowlist — admin operations
Content Moderation — moderation queue and audit
Project rules — §18 observability — full event taxonomy and CF Workers constraints
Status page