
Observability

This page is for Picora platform administrators who manage production deployments. It covers logging, alerting, on-call procedures, and the available manual reconciliation tools.

Logging

Picora emits structured JSON logs for every business event. There are no plain-text logs in production code.

Each log entry includes:

{
  "level": "info | warn | error",
  "event": "image.upload.success",
  "userId": "xK9mR2pQ7vB...",
  "imageId": "...",
  "sizeBytes": 102400,
  "requestId": "req_abc123",
  "ts": "2026-04-27T08:00:00.000Z",
  "platform": "cloudflare | china",
  ...event-specific fields...
}

The requestId is the key for correlating logs across services for one user request. It’s set in the x-request-id response header so users can include it in support tickets.
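As a minimal sketch of how a handler might build one of these entries and echo the requestId back to the caller, consider the following. The helper names `logEvent` and `withRequestId` are hypothetical and not taken from the Picora codebase; only the field names and the `x-request-id` header come from this page.

```typescript
// Hypothetical helpers; field names follow the log entry shape shown above.
type LogLevel = "info" | "warn" | "error";
type Platform = "cloudflare" | "china";

interface LogEntry {
  level: LogLevel;
  event: string;
  requestId: string;
  ts: string;
  platform: Platform;
  [field: string]: unknown; // event-specific fields
}

// Build one structured entry. Event-specific fields are spread first so
// they cannot overwrite the required fields.
function logEvent(
  level: LogLevel,
  event: string,
  requestId: string,
  platform: Platform,
  fields: Record<string, unknown>,
  now: Date,
): LogEntry {
  return { ...fields, level, event, requestId, ts: now.toISOString(), platform };
}

// Copy the response headers and add x-request-id so users can quote it
// in support tickets.
function withRequestId(
  headers: Record<string, string>,
  requestId: string,
): Record<string, string> {
  return { ...headers, "x-request-id": requestId };
}

const sampleEntry = logEvent(
  "info",
  "image.upload.success",
  "req_abc123",
  "cloudflare",
  { sizeBytes: 102400 },
  new Date("2026-04-27T08:00:00.000Z"),
);
const sampleLine = JSON.stringify(sampleEntry); // one JSON object per log line
```

Passing the same requestId to every downstream call is what makes the "trace a request across services" query possible.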

Where logs go

| Platform | Destination | Retention | Access |
| --- | --- | --- | --- |
| Overseas (CF) | Cloudflare Logpush → R2 bucket | 30 days hot, 1 year cold | Admin via dashboard or API |
| Mainland (Aliyun) | Aliyun Log Service (SLS) | 90 days indexed, 1 year cold | Admin via SLS console |

We do not run a heavyweight APM (Datadog / New Relic). Workers are sensitive to bundle size, so the trade-off is not worthwhile.

Common queries

-- Find all errors for a specific user in the last 24 hours (Logpush SQL)
SELECT * FROM picora_logs
WHERE userId = 'xK9mR2pQ7vB...' AND level = 'error'
  AND ts > NOW() - INTERVAL '24 hours'
ORDER BY ts DESC;

-- Trace a request across services
SELECT * FROM picora_logs
WHERE requestId = 'req_abc123'
ORDER BY ts ASC;

-- Users above 80% of their monthly bandwidth limit
-- (plan_limit is aggregated so the query is valid under GROUP BY;
-- comparing against 0.8 * limit avoids integer division)
SELECT userId, MAX(used) AS monthly_bw
FROM bandwidth_log
WHERE ts > date_trunc('month', NOW())
GROUP BY userId
HAVING MAX(used) > 0.8 * MAX(plan_limit);

Key event reference

The complete event taxonomy is in CLAUDE.md §18.2. Highlights of high-volume events:

| Event | When | Action |
| --- | --- | --- |
| image.upload.success | Image uploaded | Standard log; no action |
| image.upload.quota_exceeded | User hit quota | Send quota warning email if 100% threshold |
| auth.login.fail | Failed login | Increment IP / user fail counter; trigger lockout at 5/5min |
| video.bandwidth.degraded | User auto-switched to 360p | Send degraded email; admin observes ratio |
| video.bandwidth.suspended | User video suspended | Send suspended email; admin investigates |
| payment.webhook.received | Payment received | Activate plan; send receipt |
| mcp.tool.upload_doc | MCP doc upload | Sampled at 10%; admin observes adoption |
| cdn_whitelist.refresh_failed | CDN allowlist couldn’t reload | P2 alert; fix DB connectivity |
| unhandled.error | Uncaught exception | P1 alert if rate spikes |
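The lockout rule for auth.login.fail can be sketched as a sliding-window counter. This sketch reads "5/5min" as five failures within five minutes, which is an assumption; the `FailCounter` class is hypothetical, and the in-memory map stands in for whatever shared store a real deployment would use.

```typescript
// Assumed interpretation of "5/5min": 5 failures inside a 5-minute window.
const LOCKOUT_MAX_FAILS = 5;
const LOCKOUT_WINDOW_MS = 5 * 60 * 1000;

// Hypothetical in-memory counter keyed by IP address or user ID.
class FailCounter {
  private fails = new Map<string, number[]>();

  // Record a failed login and report whether the key has now reached
  // the lockout threshold within the window.
  recordFail(key: string, nowMs: number): boolean {
    const cutoff = nowMs - LOCKOUT_WINDOW_MS;
    // Keep only failures that are still inside the window.
    const recent = (this.fails.get(key) || []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.fails.set(key, recent);
    return recent.length >= LOCKOUT_MAX_FAILS;
  }
}
```

Because old timestamps are dropped on every call, a user who fails four times and then waits out the window starts again from one.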

Alerts

Alert routing

| Severity | Channel | Response time |
| --- | --- | --- |
| P1 (production down / data loss risk) | PagerDuty → on-call phone | 15 minutes |
| P2 (degraded but not down) | Slack #picora-alerts channel | 1 business hour |
| P3 (informational, e.g. cost spike) | Email digest, daily | Next business day |

Alert thresholds

| Metric | Threshold | Severity |
| --- | --- | --- |
| 5xx rate | > 1% / minute (sliding window) | P1 |
| Image upload fail rate | > 5% / 5 minutes | P2 |
| Email send rate | > 3× daily mean (see §9 anti-abuse) | P1 |
| Video degraded user count | > expected ratio | P3 (cost review) |
| API P95 response time | > 1000ms | P2 |
| unhandled.error rate | > 0.1% / minute | P1 |
| cdn_whitelist.refresh_failed | any occurrence | P2 |
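To show how the thresholds and the routing table fit together, here is a minimal sketch that maps an observed metric value to an alert channel. The metric keys, the channel strings, and the `routeAlert` function are hypothetical; only the numeric limits and severities are taken from the tables above, and only a subset of rows is encoded.

```typescript
type AlertSeverity = "P1" | "P2" | "P3";

interface ThresholdRule {
  metric: string;
  limit: number; // alert when the observed value is strictly above this
  severity: AlertSeverity;
}

// Subset of the thresholds table; rates are fractions, latency is in ms.
const THRESHOLD_RULES: ThresholdRule[] = [
  { metric: "5xx_rate", limit: 0.01, severity: "P1" },
  { metric: "image_upload_fail_rate", limit: 0.05, severity: "P2" },
  { metric: "api_p95_ms", limit: 1000, severity: "P2" },
  { metric: "unhandled_error_rate", limit: 0.001, severity: "P1" },
];

// Channel per severity, following the alert routing table.
const ALERT_CHANNELS: Record<AlertSeverity, string> = {
  P1: "pagerduty",
  P2: "slack:#picora-alerts",
  P3: "email-digest",
};

// Return the channel to notify, or null when the value is within limits
// or the metric has no rule.
function routeAlert(metric: string, value: number): string | null {
  for (const rule of THRESHOLD_RULES) {
    if (rule.metric === metric && value > rule.limit) {
      return ALERT_CHANNELS[rule.severity];
    }
  }
  return null;
}
```

For example, `routeAlert("5xx_rate", 0.02)` returns `"pagerduty"`, while a value of exactly `0.01` returns `null` because the table specifies "greater than".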

Reconciliation jobs

Several scheduled jobs reconcile state between subsystems and external providers.

Bandwidth attribution (hourly)

Pulls Bunny.net’s total bandwidth report and attributes a share to each user in proportion to their storage. A bug here means user bandwidth quotas drift out of line with actual usage.
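The arithmetic behind storage-ratio attribution can be sketched as follows. The function name and the use of byte counts are assumptions for illustration; the actual job may round or weight differently.

```typescript
// Split a provider-reported bandwidth total across users in proportion
// to each user's share of total storage.
function attributeBandwidth(
  totalBandwidthBytes: number,
  storageBytesByUser: Map<string, number>,
): Map<string, number> {
  let totalStorage = 0;
  storageBytesByUser.forEach((bytes) => {
    totalStorage += bytes;
  });

  const attributed = new Map<string, number>();
  storageBytesByUser.forEach((bytes, userId) => {
    // Guard against division by zero when nothing is stored.
    const share = totalStorage > 0 ? bytes / totalStorage : 0;
    attributed.set(userId, Math.round(totalBandwidthBytes * share));
  });
  return attributed;
}

// Worked example: 1000 bytes of traffic, users storing 300 and 100 bytes.
const exampleShares = attributeBandwidth(
  1000,
  new Map<string, number>([["userA", 300], ["userB", 100]]),
);
// userA gets 1000 * 300/400 = 750, userB gets 1000 * 100/400 = 250
```

Storage ratio is only a proxy for traffic, so a user with little storage but heavy viewing is under-attributed by this method.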

Storage orphan cleanup (nightly)

Finds object storage entries with no matching DB row (for example, left behind by an upload aborted mid-flight) and deletes them. Currently only for images; video / audio orphan cleanup is on the v0.20+ roadmap.

  • Schedule: 03:00 UTC daily
  • Logs: scheduled.orphan_cleanup.start / .complete
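Orphan detection is essentially a set difference between storage keys and DB keys. The sketch below assumes a hypothetical `StoredObject` shape and adds a minimum-age guard so that uploads still in flight are not deleted; this page does not say whether the real job has such a guard.

```typescript
// Hypothetical listing entry for an object in storage.
interface StoredObject {
  key: string;
  uploadedAtMs: number;
}

// Assumed grace period; not a documented Picora setting.
const ORPHAN_MIN_AGE_MS = 60 * 60 * 1000;

// Return keys that exist in object storage but have no DB row and are
// older than the grace period.
function findOrphanKeys(
  objects: StoredObject[],
  dbKeys: string[],
  nowMs: number,
): string[] {
  const known = new Set(dbKeys);
  return objects
    .filter((o) => !known.has(o.key))
    .filter((o) => nowMs - o.uploadedAtMs >= ORPHAN_MIN_AGE_MS)
    .map((o) => o.key);
}
```

Returning the list instead of deleting inline makes it easy to log the candidates between scheduled.orphan_cleanup.start and .complete before anything is removed.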

Failed transcoding cleanup

Videos stuck in status: processing for >24 hours are checked against Bunny.net; if Bunny doesn’t have them, mark as failed and notify the user.
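The selection rule above can be sketched as a pure function. The `VideoRow` shape, the `updatedAtMs` field, and the idea of passing in the set of IDs known to Bunny.net are assumptions for illustration; no Bunny.net API call is shown.

```typescript
// Hypothetical DB row shape for a video.
interface VideoRow {
  id: string;
  status: string;
  updatedAtMs: number; // when the row entered its current status
}

const STUCK_AFTER_MS = 24 * 60 * 60 * 1000;

// Return IDs of videos to mark as failed: in "processing" for more than
// 24 hours and not present at the provider.
function findFailedTranscodes(
  videos: VideoRow[],
  providerVideoIds: Set<string>,
  nowMs: number,
): string[] {
  return videos
    .filter((v) => v.status === "processing")
    .filter((v) => nowMs - v.updatedAtMs > STUCK_AFTER_MS)
    .filter((v) => !providerVideoIds.has(v.id))
    .map((v) => v.id);
}
```

Videos that are stuck but still known to the provider are deliberately left alone here, since the text only marks a video as failed when Bunny does not have it.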

Subscription state sync

Reconciles Lemon Squeezy / Polar / WeChat / Alipay subscription state with Picora’s database. Catches webhook delivery failures.
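A minimal sketch of the comparison step, assuming the payment provider is treated as the source of truth and that both sides can be reduced to a hypothetical `SubscriptionState` record. The real job's data model is not described on this page.

```typescript
// Hypothetical normalized subscription record.
interface SubscriptionState {
  userId: string;
  plan: string;
  active: boolean;
}

// Return provider records that are missing locally or differ from the
// local copy; these are the rows a missed webhook would leave stale.
function findSubscriptionDrift(
  provider: SubscriptionState[],
  local: SubscriptionState[],
): SubscriptionState[] {
  const localByUser = new Map<string, SubscriptionState>();
  local.forEach((s) => {
    localByUser.set(s.userId, s);
  });
  return provider.filter((p) => {
    const l = localByUser.get(p.userId);
    return l === undefined || l.plan !== p.plan || l.active !== p.active;
  });
}
```

Each returned record can then be applied to the database the same way the corresponding webhook would have been.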

On-call playbooks

“5xx rate spike P1 alert”

  1. Check Cloudflare dashboard for incident on Workers
  2. Check the unhandled.error event rate by URL pattern to narrow down which endpoint is affected
  3. Recent deploys? Roll back if release-induced
  4. Check downstream dependencies (R2, D1, KV) status pages
  5. If user-impacting, post status update at status.picora.me

“User says ‘it’s broken’”

Without a requestId, you’re guessing. Always ask for:

  • requestId (in error toast or network tab response header)
  • Approximate time of attempt
  • User email / account ID
  • Browser / OS

Then query the logs by requestId if you have one, otherwise by userId and ts range (see the example queries under Common queries).

“Bandwidth attribution looks off”

  1. Check scheduled.bandwidth.failed events
  2. Verify Bunny.net API credentials are not expired
  3. Manually trigger reconciliation via admin panel
  4. If the issue persists, fall back to “use storage ratio with last good attribution” mode

Health checks

Each service exposes:

GET /health — basic alive check
GET /health/deep — DB + R2 + cache connectivity

The status page polls /health/deep every 60 seconds from multiple regions.
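One way to think about /health/deep is as an aggregation of per-dependency checks. The sketch below assumes the three dependencies named above and a 503 status on any failure; the response shape and the status code are assumptions, not documented behavior.

```typescript
type DependencyName = "db" | "r2" | "cache";

// Hypothetical response body for /health/deep.
interface DeepHealth {
  ok: boolean;
  httpStatus: number;
  failed: DependencyName[];
}

// Combine individual connectivity results into one health verdict.
function summarizeDeepHealth(
  checks: Record<DependencyName, boolean>,
): DeepHealth {
  const names: DependencyName[] = ["db", "r2", "cache"];
  const failed = names.filter((n) => !checks[n]);
  const ok = failed.length === 0;
  // Assumed convention: 200 when healthy, 503 when any dependency is down.
  return { ok, httpStatus: ok ? 200 : 503, failed };
}
```

Listing the failed dependencies in the body lets the on-call engineer see which status page to check first.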

Configuration knobs

Sensitive operational config is held in environment / secrets:

# Logging
LOG_LEVEL=info # debug | info | warn | error
LOG_SAMPLE_RATE_DEBUG=0.01 # debug events sampling
# Alerting
PAGERDUTY_INTEGRATION_KEY=<from secrets>
SLACK_WEBHOOK_URL=<from secrets>
# Reconciliation
BANDWIDTH_RECON_ENABLED=true
ORPHAN_CLEANUP_ENABLED=true
# Reflexive controls
RATE_LIMIT_ENABLED=true # set false in extreme load to prevent cascade
MAINTENANCE_MODE=false # toggle to read-only / 503 site-wide
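As a sketch of how the logging knobs might be applied, the function below filters by LOG_LEVEL and samples debug entries at LOG_SAMPLE_RATE_DEBUG. How the two settings interact in Picora is not stated here, so the logic, the defaults, and the `shouldEmit` name are assumptions.

```typescript
type ConfigLevel = "debug" | "info" | "warn" | "error";
const LEVEL_ORDER: ConfigLevel[] = ["debug", "info", "warn", "error"];

// Decide whether a log entry should be written. `random` is injected so
// the sampling decision can be tested deterministically.
function shouldEmit(
  entryLevel: ConfigLevel,
  env: Record<string, string | undefined>,
  random: () => number,
): boolean {
  // Fall back to "info" when LOG_LEVEL is unset or unrecognized.
  const configured = LEVEL_ORDER.indexOf((env["LOG_LEVEL"] || "info") as ConfigLevel);
  const minIndex = configured >= 0 ? configured : LEVEL_ORDER.indexOf("info");
  if (LEVEL_ORDER.indexOf(entryLevel) < minIndex) return false;

  if (entryLevel === "debug") {
    // Only a fraction of debug entries is kept.
    const rate = Number(env["LOG_SAMPLE_RATE_DEBUG"] || "0.01");
    return random() < rate;
  }
  return true;
}
```

In production code the third argument would simply be `Math.random`.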

Incident retrospective process

After any P1 incident:

  1. Within 24h: written timeline (who did what when, in /admin/postmortems/)
  2. Within 1 week: root cause analysis with engineering
  3. Within 2 weeks: action items (with owners, deadlines) — added to backlog
  4. Public summary (if user-impacting): published to status.picora.me and emailed to affected users

Common issues

“Logs missing for a specific time window” — Logpush has a 60-90 second ingestion lag. Logs from “now-2 minutes” may not yet be queryable.
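When scripting log queries, one way to avoid the ingestion gap is to cap the upper bound of the time range. This is a small sketch; the two-minute margin and the `queryableUpperBound` name are assumptions chosen to cover the 60-90 second lag.

```typescript
// Assumed safety margin covering the 60-90 second ingestion lag.
const INGESTION_LAG_MS = 2 * 60 * 1000;

// Latest timestamp that should already be queryable, as an ISO string
// suitable for a `ts <= ...` filter.
function queryableUpperBound(now: Date): string {
  return new Date(now.getTime() - INGESTION_LAG_MS).toISOString();
}
```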

“Can’t find requestId in logs” — verify the user’s request actually reached our infrastructure (could be network / DNS issue on their side).

“Reconciliation job failed but no alert” — alert routing may be misconfigured. Verify PAGERDUTY_INTEGRATION_KEY is current.