Akamai Cloud Queues + Notifications — architecture

Audience: Akamai cloud architects + Fermyon SMEs. For customer-facing materials see the customer brief. For one-pager wiring + sequence flows, see docs/architecture.md. For full ADRs see DECISIONS.md.

1. TL;DR

Akamai Cloud Queues + Notifications is a globally-distributed queue + topic service on the same 27-region NATS JetStream substrate that backs Akamai Cloud KV (codename nats-kv). Customer-facing HTTP API matches AWS SQS+SNS shape, plus:

Status: designed-for, not a managed offering. The substrate is proven; the customer-facing surface is the unknown still to validate.

2. One-page wiring

┌──────────────────────────────────────────────────────────────────────────────┐
│                           CUSTOMER / BROWSER / SDK                           │
└────┬─────────────────────────────────────────────────────────────────────────┘
     │ HTTPS  mq.connected-cloud.io  (single hostname, per-path origin per ADR-006)
     ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ AKAMAI PROPERTY (per-path origin routing, Option C)                          │
│ /v1/*                       → mq-adapter LB  (CORS *: cross-origin OK)       │
│ /v1/subscriptions/*/stream  → mq-adapter LB + SSE pass-through XML           │
│                               (buffer-response-v2 off,                       │
│                                chunk-to-client on)                           │
│ /api/control/*              → mq-control LB  (admin / invite / claim)        │
│ /                           → FWF Spin (user UI)  (same-origin SSE for UI)   │
└────┬───────────────┬─────────────────────────────┬───────────────────────────┘
     │ REST + SSE    │ control plane               │ user-facing UI
     ▼               ▼                             ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ LKE-Enterprise pods (per region; mq-control LZ-only)                         │
│  ┌──────────────────────────────┐  ┌─────────────────────────────────┐       │
│  │ mq-adapter :8080             │  │ mq-control :8090                │       │
│  │ HTTP REST + SSE in           │  │ Tenant CRUD, key issue/revoke   │       │
│  │ NATS protocol out            │  │ Invite mint + claim             │       │
│  │ KeyCache (bucket-watched)    │  │ Audit log (SQLite on PVC)       │       │
│  │ PushSubManager (reconciler)  │  │ KV→MQ bridge embedded (ADR-018) │       │
│  │ CircuitBreaker (in-memory)   │  │ Single replica (RWO PVC)        │       │
│  │ /metrics + per-tenant labels │  │                                 │       │
│  └────────────┬─────────────────┘  └────────────┬────────────────────┘       │
└───────────────┼─────────────────────────────────┼────────────────────────────┘
                │ NATS / JetStream                │
                ▼                                 ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ Shared 27-region NATS JetStream cluster (ADR-002)                            │
│  ┌────────────────────────────┐    ┌────────────────────────────────────┐    │
│  │ KV ACCOUNT (nats-kv)       │    │ MQ ACCOUNT (this repo)             │    │
│  │ Tenant buckets + $KV.>     │    │ Q_<tenant>__<queue>, q.<...>.>     │    │
│  │ kv-admin-change-streams-v1 │────┼─► IMPORTED here (read-only)        │    │
│  │ EXPORTS $KV.>, mappings    │────┼─► bridge consumes them             │    │
│  │                            │    │ T_<tenant>__<topic>, t.<...>.>     │    │
│  │                            │    │ <stream>_dlq, _dlq.<stream>.>      │    │
│  │                            │    │ Per-sub durable consumers          │    │
│  │                            │    │ 4 admin buckets (R5/NA)            │    │
│  └────────────────────────────┘    └────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────────────┘

3. Components

| Component | What | Where |
|---|---|---|
| mq-adapter | Stateless data plane: REST in / JetStream out; KeyCache (bucket-watched bearer validation); PushSubManager (per-sub delivery loop); CircuitBreaker; SSE handler; /metrics | per region (DaemonSet on leaves, Deployment on LZ) |
| mq-control | Admin plane: tenant CRUD, key issue/revoke, invite mint, claim flow, audit log. Embeds KV→MQ change-stream bridge runtime per ADR-018. | LZ only, single replica, SQLite on PVC |
| FWF Spin apps (user + admin) | Browser-facing UIs. User app serves /, /play, /verify, /loadtest, /topology, /api-explorer, /guide (open), /docs (open). Admin app lives behind its own gate on a separate hostname. Both proxy /api/* inward. | Akamai Functions runtime (every Akamai POP) |
| NATS cluster | 27-region JetStream cluster; shared with nats-kv via separate account. | LKE-Enterprise clusters (1 LZ + 26 leaves) |
| Akamai property | Single hostname; per-path origin override routes streams to LKE direct (bypassing FWF wall-clock). CPS cert from shared enrollment 293468. | edge |

4. Auth model

Three-token gate (inherited from ADR-024 in nats-kv):

| Token | Where it lives | Validated by |
|---|---|---|
| UI gate token | Browser cookie (set via ?access=<t>) | FWF Spin app (page access only) |
| Admin gate token | Admin Spin app cookie | Admin Spin app (separate from this app) |
| Tenant bearer | Browser localStorage / SDK env | mq-adapter via KeyCache (sha256(bearer) looked up in mq-admin-keys-v1 bucket) |
| Admin bearer | Vault path api/nats-mq/admin rendered to /vault/secrets/admin-token | both adapter (cross-tenant bypass) and control (admin endpoints) |

The adapter watches mq-admin-keys-v1 via a JetStream KV watch, so a revocation propagates to every region in under a second. There is no HTTP polling: nats-kv fell back to polling after an R1 cross-cluster-leader edge case, but this prototype runs single-cluster, so the watch is reliable.
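
A minimal sketch of the watch-driven cache described above. It assumes the bucket key is the hex SHA-256 of the bearer and that watch events arrive as { op, key, value } objects; the event shape and the method names are hypothetical, not the adapter's actual code.

```javascript
const { createHash } = require('node:crypto');

// Hex SHA-256 of the bearer; the raw bearer is never stored in the cache.
const hashBearer = (bearer) => createHash('sha256').update(bearer).digest('hex');

class KeyCache {
  constructor() {
    this.keys = new Map(); // sha256(bearer) hex -> key record
  }

  // Apply one watch event (hypothetical shape: { op, key, value }).
  apply(event) {
    if (event.op === 'PUT') {
      this.keys.set(event.key, event.value);
    } else {
      // Any other op (delete, purge) is treated as a revocation.
      this.keys.delete(event.key);
    }
  }

  // Returns the key record for a known bearer, or null.
  validate(bearer) {
    return this.keys.get(hashBearer(bearer)) ?? null;
  }
}

const cache = new KeyCache();
cache.apply({ op: 'PUT', key: hashBearer('demo-bearer'), value: { tenant: 'acme' } });
console.log(cache.validate('demo-bearer')); // { tenant: 'acme' }
cache.apply({ op: 'DEL', key: hashBearer('demo-bearer') });
console.log(cache.validate('demo-bearer')); // null
```

Because validation is a local map lookup, the request path never waits on NATS; only the watch loop does.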

5. HTTP push delivery contract

For HTTP-push subscriptions, the adapter holds the JetStream pull consumer and POSTs each message to the customer URL. Per-message headers:

| Header | Value |
|---|---|
| X-MQ-Topic | source topic (un-scoped) |
| X-MQ-Subject | full NATS subject; useful with subject-pattern subscribes |
| X-MQ-Id | JetStream sequence; your idempotency key |
| X-MQ-Attempt | 1..max_deliver |
| X-MQ-Timestamp | unix-millis of publish |
| X-MQ-Signature | sha256=hex(HMAC-SHA256(secret, ts + "." + body)) (when secret set) |
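
Since X-MQ-Id is the idempotency key and a message can be delivered up to max_deliver times, a subscriber may want to drop redeliveries it has already processed. The sketch below is one way to do that, under stated assumptions: the key combines X-MQ-Topic with X-MQ-Id (on the assumption that sequence numbers are per stream), headers use lowercase names as Node's HTTP server provides them, and the Deduper class is hypothetical. It is in-memory only, so it does not survive a restart.

```javascript
// Remembers recently seen (topic, id) pairs and evicts the oldest beyond maxEntries.
class Deduper {
  constructor(maxEntries = 10000) {
    this.maxEntries = maxEntries;
    this.seen = new Set(); // a Set iterates in insertion order
  }

  // Returns true the first time a (topic, id) pair is seen, false on a redelivery.
  firstSeen(headers) {
    const key = `${headers['x-mq-topic']}:${headers['x-mq-id']}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    if (this.seen.size > this.maxEntries) {
      this.seen.delete(this.seen.values().next().value); // drop the oldest entry
    }
    return true;
  }
}

const dedupe = new Deduper(2);
console.log(dedupe.firstSeen({ 'x-mq-topic': 'orders', 'x-mq-id': '41' })); // true
console.log(dedupe.firstSeen({ 'x-mq-topic': 'orders', 'x-mq-id': '41' })); // false
```

A handler would still return 2xx for a duplicate so the adapter acks it rather than retrying.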

Subscriber response semantics (ADR-017):

| Status | Adapter action |
|---|---|
| 2xx | Ack. Breaker counter resets. |
| 408 / 429 | Nak + backoff (1s, 2s, 4s, 8s, 16s, cap 60s). Counts toward breaker. |
| Other 4xx | Term → DLQ on FIRST delivery. Does not trip the breaker (assumed misconfig). |
| 5xx / network / timeout | Same as 408/429. |
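
The table can be read as a small decision function plus a backoff schedule. The sketch below is an illustration, not the adapter's code: the function names are hypothetical, a null status stands for a network error or timeout, statuses the table does not mention (1xx, 3xx) are treated as transient by assumption, and the schedule assumes doubling continues past 16s until it reaches the 60s cap.

```javascript
// Map a subscriber response to the adapter action from the table.
// Pass status = null for a network error or timeout.
function classify(status) {
  if (status !== null && status >= 200 && status < 300) return 'ack';
  if (status !== null && status >= 400 && status < 500 && status !== 408 && status !== 429) {
    return 'term'; // straight to the DLQ, no retry, no breaker count
  }
  return 'nak'; // 408, 429, 5xx, network error, timeout
}

// Delay in seconds before redelivery after the given attempt (1-based).
function backoffSeconds(attempt) {
  return Math.min(2 ** (attempt - 1), 60);
}

console.log(classify(204), classify(429), classify(401), classify(null)); // ack nak term nak
console.log([1, 2, 3, 4, 5, 6, 7].map(backoffSeconds)); // [ 1, 2, 4, 8, 16, 32, 60 ]
```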

Per-adapter, per-subscription circuit breaker (in-memory): 10 consecutive transient failures trip; 30s initial cooldown, exponential to 5min cap; half-open probe on next pull.
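
A rough sketch of a breaker with these parameters, assuming "exponential" means the cooldown doubles after each failed half-open probe. The class and method names are hypothetical, and time is passed in explicitly (milliseconds) to keep the example deterministic.

```javascript
class CircuitBreaker {
  constructor({ threshold = 10, initialCooldownMs = 30_000, maxCooldownMs = 300_000 } = {}) {
    this.threshold = threshold;
    this.initialCooldownMs = initialCooldownMs;
    this.maxCooldownMs = maxCooldownMs;
    this.failures = 0; // consecutive transient failures
    this.cooldownMs = initialCooldownMs;
    this.openUntil = 0; // 0 means the breaker has not tripped
  }

  // May a delivery be attempted now? Once the cooldown has elapsed,
  // the next attempt acts as the half-open probe.
  allow(now) {
    return now >= this.openUntil;
  }

  // A 2xx closes the breaker and resets the cooldown.
  onSuccess() {
    this.failures = 0;
    this.cooldownMs = this.initialCooldownMs;
    this.openUntil = 0;
  }

  // 408, 429, 5xx, network error or timeout.
  onTransientFailure(now) {
    this.failures += 1;
    const probeFailed = this.openUntil !== 0; // already tripped, so this was a probe
    if (probeFailed) {
      this.cooldownMs = Math.min(this.cooldownMs * 2, this.maxCooldownMs);
    }
    if (probeFailed || this.failures >= this.threshold) {
      this.openUntil = now + this.cooldownMs;
    }
  }
}

const breaker = new CircuitBreaker();
for (let i = 0; i < 10; i++) breaker.onTransientFailure(0);
console.log(breaker.allow(29_999), breaker.allow(30_000)); // false true
```

Because the state is in-memory and per adapter, each adapter pod trips and recovers independently for the same subscription.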

The "Other 4xx" rule matters. A subscriber that returns 400 because its HMAC verification is misconfigured sends every message to the DLQ on its first delivery, with no retry and no breaker trip. The operator signal is DLQ growth with attempt_count=1.

6. SSE / WebSocket — the FWF bypass

Long-lived connections cannot terminate in a Spin function (FWF wall-clock 30s, soon 35s). The Akamai property has a per-path origin rule that routes /v1/subscriptions/*/stream and /v1/ws/* straight from the CDN to the LKE adapter (NO_STORE, 1h read timeout).

The path also carries Akamai SSE pass-through metadata so the edge doesn't buffer the stream (recipe inherited from sse-cdn): http.buffer-response-v2 off, modify-outgoing-response.chunk-to-client on, http.forward-chunk-boundary-alignment on. Without this, the edge holds frames until the response completes — fatal for live streams.

Customer SDKs see one hostname — the path-based routing is invisible to them:

const es = new EventSource(
  'https://mq.connected-cloud.io/v1/subscriptions/' + subId + '/stream?bearer=' + bearer
);
es.addEventListener('message', e => {
  const frame = JSON.parse(e.data);
  console.log(frame.subject, atob(frame.body_b64));
});

Auth on the SSE path: EventSource can't set the Authorization header (browser limitation), so the adapter accepts the tenant bearer in a ?bearer= query parameter as a fallback. Same KeyCache validation as the header path — no separate token type. The adapter sets Access-Control-Allow-Origin: * on every response so third-party sites can subscribe to a tenant's stream directly without a proxy. FWF features (auth pre-check, rate limit) don't apply on this path since the CDN bypasses FWF entirely.
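
The header-then-query fallback can be sketched as below. The function name is hypothetical and the headers object is assumed to use lowercase names, as Node's HTTP server provides them; whatever it returns would then go through the same KeyCache check.

```javascript
// Extract the tenant bearer: Authorization header first, ?bearer= query as a fallback.
function extractBearer(headers, requestUrl) {
  const match = /^Bearer\s+(.+)$/i.exec(headers['authorization'] ?? '');
  if (match) return match[1].trim();
  // The base URL is only there so a relative request path can be parsed.
  const url = new URL(requestUrl, 'http://localhost');
  return url.searchParams.get('bearer'); // null when absent
}

console.log(extractBearer({ authorization: 'Bearer abc' }, '/v1/queues')); // abc
console.log(extractBearer({}, '/v1/subscriptions/s1/stream?bearer=xyz')); // xyz
console.log(extractBearer({}, '/v1/subscriptions/s1/stream')); // null
```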

A future revision should move SSE auth to one-shot signed tokens (short TTL, control-plane minted) so the durable tenant bearer doesn't leak into URLs / proxy logs. Acceptable for the POC.
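
One possible shape for such tokens, sketched under heavy assumptions since none of this exists in the POC: the token format, the claim names, the 30s default TTL and both function names are hypothetical. The control plane signs a payload bound to one subscription; the adapter checks signature, expiry, subscription and single use.

```javascript
const { createHmac, randomBytes, timingSafeEqual } = require('node:crypto');

const signPayload = (key, payload) =>
  createHmac('sha256', key).update(payload).digest('base64url');

// Control-plane side: mint a token bound to one subscription with a short expiry.
function mintStreamToken(key, subId, nowMs, ttlMs = 30_000) {
  const claims = { sub: subId, exp: nowMs + ttlMs, nonce: randomBytes(8).toString('hex') };
  const payload = Buffer.from(JSON.stringify(claims)).toString('base64url');
  return `${payload}.${signPayload(key, payload)}`;
}

// Adapter side: verify signature first, then expiry, subscription and single use.
function verifyStreamToken(key, token, subId, nowMs, usedNonces) {
  const [payload, sig] = token.split('.');
  if (!payload || !sig) return false;
  const expected = Buffer.from(signPayload(key, payload));
  const received = Buffer.from(sig);
  if (expected.length !== received.length || !timingSafeEqual(expected, received)) return false;
  const claims = JSON.parse(Buffer.from(payload, 'base64url').toString());
  if (claims.sub !== subId || claims.exp <= nowMs || usedNonces.has(claims.nonce)) return false;
  usedNonces.add(claims.nonce); // one-shot: a second presentation is rejected
  return true;
}

const used = new Set();
const token = mintStreamToken('signing-key', 'sub-1', 1_000);
console.log(verifyStreamToken('signing-key', token, 'sub-1', 2_000, used)); // true
console.log(verifyStreamToken('signing-key', token, 'sub-1', 2_000, used)); // false
```

With several adapters, the used-nonce set would need to be shared (or the TTL kept very short) for the one-shot property to hold across pods.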

7. Cross-product (KV → MQ)

Per ADR-007: POST /v1/buckets in nats-kv accepts an optional change_stream_topic. From the next KV write onward, the bridge in mq-control (ADR-018) applies the mapping (bucket → MQ topic) and republishes each write into the MQ account.

The bridge sits on the MQ side per ADR-018. KV exports $KV.> read-only; MQ imports it. No cross-account write credential needed. The bridge is dormant when no mappings exist.

POST https://nats-kv.connected-cloud.io/v1/buckets
Authorization: Bearer <kv-tenant>
Content-Type: application/json

{ "name": "cart", "replicas": 3, "geo": "auto",
  "change_stream_topic": "cart-events" }

# subsequent: each KV write to bucket "cart" appears on
# t.<tenant>__cart-events.<key> in the MQ account.
# Customer's HTTP push subscriber sees it as a normal MQ topic message.
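
The subject rewrite the bridge performs can be sketched as follows. This assumes KV writes arrive on subjects of the form $KV.<bucket>.<key> (the usual NATS KV layout) and that each mapping record carries a tenant and a topic; the function name and the record shape are hypothetical.

```javascript
// mappings: Map of KV bucket name -> { tenant, topic } (hypothetical shape).
function mapKvSubject(kvSubject, mappings) {
  const match = /^\$KV\.([^.]+)\.(.+)$/.exec(kvSubject);
  if (!match) return null; // not a KV write subject
  const [, bucket, key] = match;
  const mapping = mappings.get(bucket);
  if (!mapping) return null; // no change_stream_topic configured: nothing to republish
  return `t.${mapping.tenant}__${mapping.topic}.${key}`;
}

const mappings = new Map([['cart', { tenant: 'acme', topic: 'cart-events' }]]);
console.log(mapKvSubject('$KV.cart.user.42', mappings)); // t.acme__cart-events.user.42
console.log(mapKvSubject('$KV.sessions.abc', mappings)); // null
```

An empty mappings table makes every lookup return null, which matches the bridge being dormant when no mappings exist.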

This single flag unlocks: CQRS (source-of-truth bucket + projection buckets), real-time UI (browser EventSource on a bucket's change stream), audit-by-default (every write also goes to an audit topic for a compliance archiver function), saga / workflow orchestration. What costs days on AWS (DynamoDB Streams → Lambda → EventBridge → SNS) is one config line here.

8. vs SQS+SNS

| Capability | SQS / SNS | nats-mq |
|---|---|---|
| Global by default | ✗ regional | ✓ 27-region single namespace |
| Subject-pattern subscribe (orders.*.created) | ✗ msg-attribute filter | ✓ NATS native |
| Replay from sequence | ✗ | ✓ JetStream stream |
| In-browser direct subscribe | ✗ (API GW + Lambda + WS bridge) | ✓ SSE / WS |
| Cross-product with KV | ~ DynamoDB Streams + Lambda + EventBridge | ✓ one flag |
| Per-message billing | $0.40-0.50/M | $0 (flat infra) |
| Geo placement control | ✗ account-level | ✓ per-queue R1/R3/R5 + geo |
| HMAC webhook signing standard | ~ optional | ✓ default |
| Native SDKs in 30+ languages | ✓ | ✗ HTTP API only (NATS SDK for native) |
| Built-in mobile push / email / SMS | ✓ SNS | ✗ function-as-deliverer recipes |
| Production SLA | ✓ | ✗ demo-grade |

9. Spin function consumer pattern

The function is a normal HTTP-triggered Spin handler — no special trigger needed (no wasi:messaging today; see §10). The adapter holds the NATS subscription and POSTs to the function URL.

use constant_time_eq::constant_time_eq; // assumed: the constant_time_eq crate
use hmac::{Hmac, Mac};
use sha2::Sha256;
use spin_sdk::{http::*, http_component, variables};

// Read a header as &str; "" if it is absent or not valid UTF-8.
fn header<'a>(req: &'a Request, name: &str) -> &'a str {
    req.header(name).and_then(|v| v.as_str()).unwrap_or_default()
}

#[http_component]
async fn handle(req: Request) -> anyhow::Result<impl IntoResponse> {
    // 1. Verify the HMAC. Return 400 on mismatch: any 4xx other than 408/429 is
    //    Term-to-DLQ and does not count toward the breaker.
    let secret = variables::get("mq_secret")?;
    let ts = header(&req, "x-mq-timestamp");
    let sig = header(&req, "x-mq-signature");
    let mut mac = Hmac::<Sha256>::new_from_slice(secret.as_bytes())?;
    mac.update(format!("{}.", ts).as_bytes());
    mac.update(req.body());
    let expected = format!("sha256={}", hex::encode(mac.finalize().into_bytes()));
    if !constant_time_eq(sig.as_bytes(), expected.as_bytes()) {
        return Ok(Response::builder().status(400).body("bad signature").build());
    }

    // 2. Your business logic: return 2xx to ack, 408/429/5xx to retry, any other 4xx to DLQ.
    //    `process` stands in for your own async fn returning anyhow::Result<()>.
    let subject = header(&req, "x-mq-subject");
    process(subject, req.body()).await?;
    Ok(Response::builder().status(200).body("ok").build())
}

Three full recipes (Slack, Telegram fallback, APNS) live at docs/recipes/.

10. wasi:messaging?

Not present in FWF as of Spin 3.6.3 (verified). HTTP push is the v1 function-as-subscriber shape because it works today with zero platform changes. When Fermyon ships the WIT, our subscription registry's delivery_type field is forward-compatible (currently http_push | sse; would gain wasi_messaging as another value).

The HTTP-push pattern is also closer to how SNS-style customers think — subscribers are HTTP endpoints, period. wasi:messaging would be an option, not a replacement.

11. Operational gotchas (worth knowing before you build on this)

| Gotcha | Mitigation |
|---|---|
| ack_wait default is 25s (not 30s) | 5s buffer under FWF's 30s wall-clock. Bumps to 30s when FWF moves to 35s. |
| 4xx from a webhook (other than 408/429) is treated as misconfiguration | Goes to DLQ immediately. Doesn't trip the breaker. Operator signal: DLQ growth with attempt_count=1. |
| R3 is the default replica count, not R1 | With R1, a single-node loss causes data loss; we learned this on nats-kv's geocode-r1 incident. R1 is explicit opt-in. |
| Cross-product loop (function subscribes to bucket change-stream and writes back to same bucket) | Documentation-only gate. "Source-of-truth bucket + projection bucket; never write back to the source you're consuming." |
| Per-tenant MLT metric cardinality capped to curated set | Prevents Prometheus blow-up across N tenants × M metrics. Logs + traces stay per-tenant. |
| NATS errors → HTTP status codes propagate explicitly | No silent message-eating. nats: maximum messages exceeded → 507, no responders → 502, etc. |
| Push delivery uses queue-group across adapters (ADR-014) | Free load-balancing + auto-failover via JetStream pull-consumer semantics. Customer sees variable egress IPs across Linode AS63949; CIDR-friendly stable egress is a v2 path. |
| SQLite-on-PVC for invites + audit is single-writer, no replication | PVC snapshot + sqlite3 .backup to object storage is required before going beyond demo. Tenants + keys are in NATS admin buckets (R5/NA) so the data plane survives PVC loss. |
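
The explicit error propagation in the table could look roughly like the lookup below. Only the two mappings named in the table are included; the 500 default for unrecognized errors and the function name are assumptions made for the sketch.

```javascript
// Known NATS error text -> HTTP status. Only these two pairs come from the table.
const NATS_ERROR_STATUS = [
  [/maximum messages exceeded/i, 507],
  [/no responders/i, 502],
];

// Returns the HTTP status for a NATS error message; 500 is an assumed default.
function httpStatusForNatsError(message) {
  for (const [pattern, status] of NATS_ERROR_STATUS) {
    if (pattern.test(message)) return status;
  }
  return 500;
}

console.log(httpStatusForNatsError('nats: maximum messages exceeded')); // 507
console.log(httpStatusForNatsError('nats: no responders available for request')); // 502
```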

Questions or PRs welcome — owner: brian@apley.net. Demo-grade research POC; not production.