Authentication & Authorization - Caricash Nova Platform

A complete, regulator-grade implementation blueprint

0) Executive summary

This document specifies an end-to-end, enterprise-grade AuthN/Z stack for a regulated, multi-tenant payments system. It emphasizes defense-in-depth, tenant isolation, provable policy correctness, auditor-ready evidence, and latency budgets compatible with high-throughput ledgers and payment rails.

Core pillars

Identity & Sessions: WebAuthn/AAL2+, short-lived audience-scoped JWTs, DPoP/mTLS binding, opaque rotating refresh tokens, continuous session evaluation.
Authorization: Externalized policy-as-code (OPA) + Zanzibar-style ReBAC, plus database Row-Level Security (RLS) as a tenant-isolation backstop that application bugs cannot bypass. Envoy ext_authz at the edge; partial evaluation to hit p99 < 5ms.
Enterprise Integrations: OIDC/SAML federation per tenant, SCIM user provisioning, per-tenant JWKS, step-up MFA with bound challenges, purpose-based access control (PBAC).
Operational Integrity: Hash-chained audit logs to WORM storage, policy supply-chain signing, blue/green policy rollouts, mutation & property-based tests, formal invariants for crown-jewel actions.
Compliance & Privacy: PCI DSS, PSD2/SCA, GDPR, Kenya NDPA mapping; data minimization, field-level encryption, selective disclosure; documented control evidence.

1) Design principles

Fail-closed everywhere; never allow on error or cache miss.
Separation of concerns: app enforces, OPA decides, DB constrains.
Least privilege by default, time-boxed elevation with explicit approvals.
Deterministic request identity: stable headers/claims across all layers.
Provability over intuition: mutation tests, property-based tests, formal specs.
Operator ergonomics: clear dashboards, kill switches, and reversible rollouts.
Privacy by design: minimize, mask, encrypt, and justify access via PBAC.

2) Identity model

2.1 Principal types

User: human (employee, tenant admin, merchant operator, auditor).
Service: machine/workload identities (SPIFFE/SPIRE SVIDs).
API Client: external integrators (mTLS + PrivateKeyJWT or HMAC legacy).

2.2 Tenancy

Tenant (e.g., Merchant of Record, Institution, Region).
Optional sub-tenants for regions/brands.
Data residency (e.g., KE/NG/BB) attached to tenant; influences routing, KMS keys, and policy.

2.3 Claims & headers (canonical contract)

Headers: x-req-id, x-tenant-id, x-principal-id, x-session-id, x-client-ip, x-device-id.
JWT claims:
- sub, tid, jti, iat/exp, aud, iss,
- aal (Authenticator Assurance Level), amr[] (methods),
- scope[], org_roles[], tenant_roles[{tenant, roles[]}],
- cnf (confirmation / key binding for DPoP or mTLS).
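
The contract above can be pinned as shared types plus one consistency check. The following is a minimal sketch, not the platform's actual code: the names TokenClaims, IdentityHeaders and assertRequestIdentity are hypothetical, and the cnf members shown (jkt for DPoP, x5t#S256 for mTLS) are the standard confirmation members assumed to be in use here.

```typescript
// Hypothetical types mirroring the claims and headers listed above.
type TenantRoles = { tenant: string; roles: string[] };

interface TokenClaims {
  sub: string; tid: string; jti: string;
  iat: number; exp: number; aud: string; iss: string;
  aal: number; amr: string[];
  scope: string[]; org_roles: string[]; tenant_roles: TenantRoles[];
  cnf?: { jkt?: string; "x5t#S256"?: string }; // DPoP or mTLS key binding
}

type IdentityHeaders = Record<string, string | undefined>;

// Throws unless headers and claims describe the same tenant and principal,
// the audience matches, and the token is inside its validity window.
function assertRequestIdentity(
  h: IdentityHeaders, c: TokenClaims, nowSec: number, expectedAud: string
): void {
  if (!h["x-req-id"]) throw new Error("missing x-req-id");
  if (h["x-tenant-id"] !== c.tid) throw new Error("tenant mismatch");
  if (h["x-principal-id"] !== c.sub) throw new Error("principal mismatch");
  if (c.aud !== expectedAud) throw new Error("wrong audience");
  if (nowSec < c.iat || nowSec >= c.exp) throw new Error("token not valid now");
}
```

Rejecting any header/claim disagreement keeps request identity deterministic across layers: a downstream service never has to decide which of two tenant IDs to trust.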

3) Authentication (AuthN)

3.1 Factors and assurance

Primary: WebAuthn (passkeys) → AAL2/3; TOTP fallback; SMS discouraged (risk/lawful intercept).
Step-up: Bound challenge tokens for high-risk actions; enforce aal>=2.

3.2 Sessions & tokens

Access tokens: JWT, 5–10 min, audience-scoped (per service).
Refresh tokens: opaque, rotating, stored server-side; each token is single-use, so replay of a stolen token is detected at the next rotation.
Binding:
- Browser: DPoP (proof of possession) + device context.
- Service→Service: mTLS with SPIFFE/SPIRE; tokens issued via token exchange to bind to workload identity.

Bound step-up (prevents replay across endpoints)

// On protected action with insufficient AAL (Express-style handler assumed):
const origHash = base64url(sha256(`${method}|${path}|${normalizedBody}`));
return res.status(403).json({
  error: "MFA_REQUIRED",
  challenge: signJWT({ origHash, tid, sub, exp: now + 300 }) // short-lived (5 min)
});

// POST /auth/mfa/complete validates factor + challenge:
const challenge = verifyJWT(challengeJWT); // must reject a bad signature or an expired challenge
if (challenge.origHash !== recompute()) throw forbidden("challenge_mismatch");
return accessToken({ aal: 2, cnf: { orig: challenge.origHash }, aud, exp: now + 10 * 60 });

3.3 Federation & provisioning

OIDC/SAML per tenant:
- Store: saml_idp_metadata_url, oidc_issuer, client_id, redirect_uris.
- Validate signatures, cache JWKS with TTL.
SCIM 2.0: /Users, /Groups → map to internal principals and ReBAC tuples.
Attribute mapping: IdP groups → roles; time-boxed caveats for contractors.

3.4 Continuous Access Evaluation (CAE)

Reevaluate mid-session on impossible travel, device posture change, IP drift.
If risk escalates → downgrade AAL, force step-up, or revoke.
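
As an illustration of the re-evaluation step, the signals can be reduced to one of the outcomes listed above. The mapping from signal to outcome below is an assumption made for the sketch (the text does not fix one), and the names CaeSignals and evaluateSession are hypothetical.

```typescript
type CaeSignals = { impossibleTravel: boolean; devicePostureChanged: boolean; ipDrift: boolean };
type CaeAction = "continue" | "step_up" | "downgrade_aal" | "revoke";

// Picks the most severe response that any active signal calls for.
// Assumed severity order: revoke > downgrade_aal > step_up > continue.
function evaluateSession(s: CaeSignals): CaeAction {
  if (s.impossibleTravel) return "revoke";
  if (s.devicePostureChanged) return "downgrade_aal";
  if (s.ipDrift) return "step_up";
  return "continue";
}
```

Keeping the function pure makes it easy to replay recorded sessions through it when tuning thresholds.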

4) Authorization (AuthZ)

4.1 Strategy

OPA policy server (centralized behind Envoy ext_authz) for all HTTP/gRPC requests.
ReBAC graph (Zanzibar-style tuples) as the source of truth for relationships.
Database RLS as a hard backstop.

4.2 Data model — relationship tuples

CREATE TABLE rel_tuples (
  subject_ns text,  -- user | role | service | tenant
  subject_id text,
  relation   text,  -- owner | admin | member | editor | viewer | parent | ...
  object_ns  text,  -- tenant | merchant | account | payout | ...
  object_id  text,
  caveat     jsonb, -- { "expires_at": "...", "hours":[9,17], "ip_ranges":["203.0.113.0/24"] }
  PRIMARY KEY (subject_ns, subject_id, relation, object_ns, object_id)
);
CREATE INDEX ON rel_tuples (object_ns, object_id, relation);
CREATE INDEX ON rel_tuples USING gin (caveat);
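
The caveat column carries conditions that must hold at decision time. A minimal sketch of evaluating the three example keys follows; it assumes hours is a half-open [start, end) window in UTC and that ip_ranges holds IPv4 CIDR blocks only. The helper names are hypothetical.

```typescript
type Caveat = { expires_at?: string; hours?: [number, number]; ip_ranges?: string[] };

// Converts dotted IPv4 to a non-negative integer (NaN on malformed input).
function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when ip falls inside the IPv4 CIDR block; malformed input never matches.
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split("/");
  const size = Math.pow(2, 32 - Number(bitsStr)); // addresses in the block
  return Math.floor(ipv4ToInt(ip) / size) === Math.floor(ipv4ToInt(base) / size);
}

// A tuple without a caveat always holds; otherwise every present key must pass.
function caveatHolds(c: Caveat | null, now: Date, ip: string): boolean {
  if (!c) return true;
  if (c.expires_at && now.getTime() > Date.parse(c.expires_at)) return false;
  if (c.hours) {
    const h = now.getUTCHours();
    if (h < c.hours[0] || h >= c.hours[1]) return false;
  }
  if (c.ip_ranges && !c.ip_ranges.some((r) => inCidr(ip, r))) return false;
  return true;
}
```

Because NaN never compares equal to itself, a malformed address or prefix length fails the range check, which keeps the evaluation fail-closed.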

4.3 OPA input contract (pin this in a shared package)

{
  "tenant": { "id": "acme", "security": {"mfa_required": true} },
  "subject": { "id": "user_123", "type": "user", "roles": ["ops"], "aal": 1 },
  "resource": { "type": "merchant", "id": "m_789", "tenant_id": "acme", "attrs": {"region":"KE"} },
  "action": "settlement.update",
  "purpose": "reconciliation",
  "context": { "ip":"203.0.113.5", "ua":"...", "risk":"high", "time":"2025-09-22T09:10:00Z" }
}

4.4 Rego (ABAC + ReBAC + PBAC + step-up)

package payments.authz
default allow := false
default step_up_required := false

same_tenant { input.resource.tenant_id == input.tenant.id }

# Purpose-based access control (PBAC)
purpose_allowed {
  some p
  data.purposes[p].name == input.purpose
  data.purposes[p].resources[_] == input.resource.type
  data.purposes[p].actions[_] == input.action
}

# ReBAC: subject related to object via a direct tuple (parent traversal is not implemented in this rule)
related(subject, object, rel) {
  some i
  data.relationships[i].subject_ns == subject.type
  data.relationships[i].subject_id == subject.id
  data.relationships[i].relation   == rel
  data.relationships[i].object_ns  == object.type
  data.relationships[i].object_id  == object.id
  not expired(data.relationships[i])
}

expired(t) { t.caveat.expires_at != "" ; time.now_ns() > time.parse_rfc3339_ns(t.caveat.expires_at) }

# High-risk signals
step_up_required { input.action == "settlement.update" } else { input.context.risk == "high" }

# Allow if:
allow {
  same_tenant
  purpose_allowed
  related({"id": input.subject.id, "type": "user"},
          {"id": input.resource.id, "type": input.resource.type},
          "editor")
  not step_up_required
}

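# Allow with AAL2 if step-up needed (the relationship check still applies):
allow {
  same_tenant
  purpose_allowed
  related({"id": input.subject.id, "type": "user"},
          {"id": input.resource.id, "type": input.resource.type},
          "editor")
  step_up_required
  input.subject.aal >= 2
}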
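
For differential or property-based testing it can help to have a tiny reference implementation of the intended decision outside Rego. The sketch below is an assumption-laden oracle, not the policy itself: it takes purposeAllowed and isEditor as precomputed booleans, and it assumes the editor relation is required whether or not step-up applies, in line with the invariant in 12.2.

```typescript
type DecisionInput = {
  tenant: { id: string };
  subject: { aal: number };
  resource: { tenant_id: string };
  action: string;
  context: { risk: string };
};

// Reference decision: tenant match, purpose and relation are always required;
// when step-up applies, AAL2 or higher is required on top.
function decide(input: DecisionInput, purposeAllowed: boolean, isEditor: boolean) {
  const sameTenant = input.resource.tenant_id === input.tenant.id;
  const stepUpRequired =
    input.action === "settlement.update" || input.context.risk === "high";
  const base = sameTenant && purposeAllowed && isEditor;
  const allow = base && (!stepUpRequired || input.subject.aal >= 2);
  return { allow, stepUpRequired };
}
```

Applied to the sample input in 4.3 (aal 1, action settlement.update), this yields a deny with step-up required; the same request at aal 2 is allowed only if the editor relation also holds.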

4.5 Performance

Partial evaluation & compiled bundles for hot paths.
In-cluster OPA, memory-resident data (relationships, purpose registry).
p99 decision < 5ms target.

4.6 Policy lifecycle

Git monorepo for policies; code review required; cosign signed bundles.
Blue/green policy clusters; Envoy routes canary % → green; auto-promote/rollback based on error budgets.
Decision logs shipped to secure sink with redaction.

5) Database-level isolation (RLS & crypto)

5.1 RLS

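-- Set per request, inside the request's transaction. SET LOCAL resets at commit/rollback,
-- so a pooled connection never carries one tenant's setting into the next request:
-- SET LOCAL app.current_tenant = '<tid>';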
ALTER TABLE merchants ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON merchants
  USING (tenant_id = current_setting('app.current_tenant', true));

-- Optionally: fine-grained role mirror (if needed)
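
On the application side, the tenant setting has to be applied on the same connection and in the same transaction as the queries it protects. The following is a minimal, driver-agnostic sketch: Queryable and withTenant are hypothetical names, and db is assumed to be a single checked-out connection rather than a pool.

```typescript
// Minimal query interface so the sketch stays driver-agnostic (hypothetical).
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// Runs fn inside a transaction with app.current_tenant set transaction-locally.
// The third argument of set_config (true) scopes the value to this transaction.
async function withTenant<T>(
  db: Queryable, tenantId: string, fn: (db: Queryable) => Promise<T>
): Promise<T> {
  if (!tenantId) throw new Error("tenant id required"); // fail closed
  await db.query("BEGIN");
  try {
    await db.query("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    const result = await fn(db);
    await db.query("COMMIT");
    return result;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```

If the setting is missing, current_setting('app.current_tenant', true) returns NULL, the policy comparison is never true, and no rows are visible, which matches the fail-closed principle.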

5.2 Security definer & views

Use SECURITY DEFINER functions with strict parameter validation for complex reads.
Expose views that already filter by tenant to reduce foot-guns.

5.3 Field-level encryption

Per-tenant DEKs (enveloped by region KMS keys).
Decrypt only the columns required for the action; propagate “crypto shields” so logs never contain plaintext PII/PAN.
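
Envelope encryption with per-tenant DEKs can be sketched with Node's built-in crypto module. Assumptions in this sketch: AES-256-GCM is the cipher (the text does not name one), the region KEK is a local random key standing in for a KMS-held key, and the seal/open helpers and AAD strings are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

type Sealed = { iv: Buffer; tag: Buffer; ct: Buffer };

// Encrypts with AES-256-GCM; aad binds the ciphertext to its context
// (tenant, column) so it cannot be moved to another row or tenant.
function seal(key: Buffer, plaintext: Buffer, aad: string): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  cipher.setAAD(Buffer.from(aad));
  const ct = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), ct };
}

// Decrypts and authenticates; throws if key, aad, or ciphertext do not match.
function open(key: Buffer, s: Sealed, aad: string): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, s.iv);
  decipher.setAAD(Buffer.from(aad));
  decipher.setAuthTag(s.tag);
  return Buffer.concat([decipher.update(s.ct), decipher.final()]);
}

// Envelope: the per-tenant DEK encrypts the field; the region KEK wraps the DEK.
const regionKek = randomBytes(32); // stand-in for a KMS-held key
const tenantDek = randomBytes(32);
const wrappedDek = seal(regionKek, tenantDek, "tenant:acme");
const field = seal(tenantDek, Buffer.from("4111111111111111"), "acme|merchants.pan");
```

Only wrappedDek and field would be persisted; the plaintext DEK lives in memory for the duration of the action that needs the column.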

6) API keys & machine auth

6.1 Key lifecycle & storage

Keys have public ID (ak_live_xxx) and secret (shown once).
Store argon2id hash of secret; rotatable; per-key metadata: scopes, budget, rate plan, allowed IPs, expiry.

6.2 HMAC signed requests (legacy/compat)

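import { createHash, createHmac } from "node:crypto";

// Canonical string: METHOD, path, SHA-256 of body, date, nonce, newline-separated.
// Hex encoding of the body hash is an assumption; client and server must agree on it.
function hmacSignature({method, path, body, date, nonce}: any, secret: Buffer): string {
  const bodyHash = createHash("sha256").update(body).digest("hex");
  const payload = [method.toUpperCase(), path, bodyHash, date, nonce].join("\n");
  return createHmac("sha256", secret).update(payload).digest("base64url");
}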

Replay cache (nonce TTL 5–10 min).
Enforce budgets (amount/day, txn/min) and purpose. Deny with BUDGET_EXCEEDED.
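
The replay cache can be sketched as a TTL map keyed by key ID and nonce, paired with a freshness check on the Date header. This is an in-memory stand-in (a shared store such as Redis would be needed across instances); NonceCache and dateIsFresh are hypothetical names.

```typescript
// In-memory stand-in for a shared replay cache with per-entry expiry.
class NonceCache {
  private seen = new Map<string, number>(); // "keyId:nonce" -> expiry (ms)
  constructor(private ttlMs: number = 5 * 60 * 1000) {}

  // Returns true the first time a (keyId, nonce) pair is seen within the TTL,
  // false when the pair is replayed before its entry expires.
  checkAndStore(keyId: string, nonce: string, nowMs: number): boolean {
    const k = `${keyId}:${nonce}`;
    const exp = this.seen.get(k);
    if (exp !== undefined && exp > nowMs) return false; // replay
    this.seen.set(k, nowMs + this.ttlMs);
    return true;
  }
}

// Accepts only requests whose Date header is within windowMs of server time.
function dateIsFresh(dateHeader: string, nowMs: number, windowMs: number): boolean {
  const t = Date.parse(dateHeader);
  return !isNaN(t) && Math.abs(nowMs - t) <= windowMs;
}
```

The Date acceptance window should not exceed the nonce TTL; otherwise a request could be replayed after its nonce entry has expired but while its timestamp is still accepted.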

6.3 Preferred for new clients

OAuth 2.1 confidential clients: mTLS + PrivateKeyJWT; GNAP-ready facade for future migration.

7) Risk engine & adaptive controls

Signals: IP reputation, device posture, geo/residency, impossible travel, velocity, amount, time-of-day.
UEBA profiles per role & tenant; deviations → step-up or JIT requirement.
Risk decision included in OPA input; explainable components logged.

8) Auditing & evidence

8.1 Tamper-evident log (hash chain)

CREATE TABLE audit_log (
  id BIGSERIAL PRIMARY KEY,
  ts timestamptz NOT NULL DEFAULT now(),
  tenant_id text NOT NULL,
  actor jsonb NOT NULL,       -- {id,type,aal,session_id}
  action text NOT NULL,       -- "settlement.update"
  target jsonb NOT NULL,      -- {type,id}
  decision jsonb NOT NULL,    -- {allow, policy_version, trace_id}
  attrs jsonb,                -- IP, UA, risk, purpose, mfa_used
  prev_hash bytea,
  row_hash bytea
);

Daily export to WORM storage (S3 with Object Lock retention); KMS-signed manifests.
Coverage SLO: 100% of mutating calls have audit entries; alert when coverage drops below 100%.
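
The prev_hash/row_hash pair forms a hash chain. A minimal sketch of appending and verifying follows; it assumes SHA-256, hex-encoded hashes, and a fixed field order for canonicalization, none of which the schema above specifies. The AuditEntry shape is simplified to strings for illustration.

```typescript
import { createHash } from "node:crypto";

type AuditEntry = { tenant_id: string; actor: string; action: string; target: string; ts: string };
type ChainedEntry = AuditEntry & { prev_hash: string; row_hash: string };

// row_hash = SHA-256(prev_hash || canonical entry); the field order is fixed
// so the same entry always serializes to the same bytes.
function rowHash(prevHash: string, e: AuditEntry): string {
  const canonical = JSON.stringify([e.tenant_id, e.actor, e.action, e.target, e.ts]);
  return createHash("sha256").update(prevHash).update(canonical).digest("hex");
}

// Returns a new chain with e linked to the current tail (empty prev for the first row).
function append(chain: ChainedEntry[], e: AuditEntry): ChainedEntry[] {
  const prev = chain.length ? chain[chain.length - 1].row_hash : "";
  return [...chain, { ...e, prev_hash: prev, row_hash: rowHash(prev, e) }];
}

// Returns the index of the first broken link, or -1 when the chain verifies.
function firstBrokenLink(chain: ChainedEntry[]): number {
  let prev = "";
  for (let i = 0; i < chain.length; i++) {
    const c = chain[i];
    if (c.prev_hash !== prev || c.row_hash !== rowHash(prev, c)) return i;
    prev = c.row_hash;
  }
  return -1;
}
```

Recording the final row_hash of each daily export in the signed manifest would let an auditor detect truncation as well as in-place edits.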

8.2 Redaction & minimization

Centralized redaction rules (mask PANs, tokens).
No decrypted fields in logs; store token handles not secrets.
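
A centralized redaction rule for PANs might look like the sketch below. It is one possible rule, not the platform's rule set: it treats any standalone run of 13 to 19 digits that passes the Luhn check as a PAN and keeps only the last four digits. The function names are hypothetical.

```typescript
// Luhn checksum: doubles every second digit from the right and sums the digits.
function luhnValid(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Masks Luhn-valid 13-19 digit runs, keeping the last four digits.
// Digit runs that fail the check (order numbers, timestamps) are left untouched.
function redactPans(text: string): string {
  return text.replace(/\b\d{13,19}\b/g, (m) =>
    luhnValid(m) ? "*".repeat(m.length - 4) + m.slice(-4) : m
  );
}
```

PANs written with spaces or dashes would need a normalization pass before this rule; that case is left out of the sketch.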

9) Compliance mapping (excerpt)

Domain        Control             Mechanism
PCI DSS 7.x   Access control      ReBAC+OPA, RLS, least privilege, step-up
PCI DSS 10.x  Logging             Hash-chain audit, WORM export
PSD2/SCA      Strong auth         WebAuthn/TOTP step-up; AAL2 thresholds
GDPR/NDPA     Purpose limitation  PBAC + purpose registry; minimization; field-level crypto
ISO 27001     Policy management   Signed bundles, reviews, change control
SOC 2         Evidence            Automated reports (MFA rates, revocation SLAs, decision latency)

10) Edge & service mesh

Envoy ext_authz calls centralized OPA for allow/deny before routing.
mTLS everywhere, workload identity via SPIFFE/SPIRE.
Per-tenant JWKS (rotation states: active|grace|retired, max 24h overlap).
Secrets via Vault/Secrets Manager with rotation jobs.

Envoy excerpt

http_filters:
- name: envoy.filters.http.ext_authz
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
    grpc_service: { envoy_grpc: { cluster_name: authz } }
    with_request_body: { max_request_bytes: 8192, allow_partial_message: true }
    include_peer_certificate: true

11) Observability & SLOs

11.1 Metrics (dashboards)

AuthN: login success/fail by reason, AAL distribution, step-up prompts & acceptance, session churn.
AuthZ: allow/deny by rule, top deny reasons, decision latency p50/p95/p99, policy version adoption.
ReBAC: tuple counts, delta lag to OPA, cache hit rate.
API keys: usage by scope/geo, HMAC failures, nonce replays.
Audit: hash verification success, coverage %, WORM exports.

11.2 Latency budgets

JWT verify + Redis session check: p99 < 2ms
OPA decision: p99 < 5ms
AuthN full login with step-up: p95 < 1.5s

12) Testing & assurance

12.1 Automated tests

Unit & integration for AuthN flows (token rotation, DPoP binding, step-up).
Rego tests with examples mirroring prod traffic.
Property-based: fuzz tenant graphs, roles, caveats.
Mutation testing for Rego: flip or remove rules and confirm the test suite fails; report the share of mutants caught as a Policy Mutation Score.

12.2 Formal invariants (crown-jewels)

Model settlement approval invariants in TLA+/Alloy: no external principal of tenant T can transition settlement state for T without AAL2 + requisite relation + purpose. Check small state spaces per CI.

12.3 Game days & chaos

Kill OPA cluster, corrupt tuple stream, spike step-up; verify fail-closed and graceful degradation paths.
Drill break-glass account procedures (YubiKey+passkey; dual approvals; time-boxed; extra audit stream).

13) Policy & data delivery

13.1 Bundles

Build (optimized via partial evaluation): opa build -O 1 -e payments/authz/allow -o bundle.tar.gz policy/ data/
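Sign with cosign; store in OCI registry or S3. OPA's built-in check covers its own bundle signature format (.signatures.json, produced by opa sign), so either add that signature as well or verify the cosign signature in the delivery step before OPA loads the bundle.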
OPAL for near-real-time tuple deltas with read-your-writes gating for admin UI.

13.2 Versioning & simulation

POST /authz/simulate?policy_ref=refs/heads/next accepts input doc, returns allow/deny + explain trace.
Policy drift detection: shadow-eval requests against “next” and compare decisions before rollout.
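
Shadow evaluation reduces to running the same inputs through two policy versions and collecting disagreements. A minimal sketch follows; Decision, Evaluator and driftReport are hypothetical names, and in practice the evaluators would call the current and "next" OPA instances rather than local functions.

```typescript
type Decision = { allow: boolean; stepUp: boolean };
type Evaluator<I> = (input: I) => Decision;

// Evaluates each input against both policy versions and collects the inputs
// on which the two versions disagree (on allow or on step-up).
function driftReport<I>(inputs: I[], current: Evaluator<I>, next: Evaluator<I>) {
  const diffs = inputs.filter((i) => {
    const a = current(i);
    const b = next(i);
    return a.allow !== b.allow || a.stepUp !== b.stepUp;
  });
  const rate = inputs.length ? diffs.length / inputs.length : 0;
  return { total: inputs.length, drifted: diffs.length, rate, diffs };
}
```

A rollout gate could then compare rate against an agreed threshold and attach diffs to the change request so reviewers see exactly which requests would change outcome.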

14) Administrative UX & governance

Security settings per tenant (security_settings JSONB): mfa_required_for_all_users, session_timeout_minutes, allowed_ip_ranges, resident_regions, require_dpop, require_mtls_for_m2m.
JIT access workflows: request → approval → short-lived credential → auto-expire; link every action back to ticket.
Kill switches for disbursement/settlement by tenant/region; flipped via change-controlled UI; propagate within seconds.
Evidence exports: click-to-download PDF/CSV summaries for auditors with cryptographic attestations.

15) Reference schemas & endpoints

15.1 Tables (core)

sessions (Redis JSON)

{
  "jti":"sess_abc", "sub":"user_123", "tid":"acme",
  "aal":1, "device":"browser:chrome", "ip":"203.0.113.5",
  "created_at":"2025-09-22T09:00:00Z", "last_seen":"2025-09-22T09:05:12Z",
  "revoked_at":null
}

api_keys (SQL)

CREATE TABLE api_keys (
  id text PRIMARY KEY,             -- ak_live_xxx
  tenant_id text NOT NULL,
  secret_hash text NOT NULL,       -- argon2id
  scopes text[] NOT NULL,
  budget jsonb,                    -- {amount_daily: 1000000, currency:"KES"}
  ip_allowlist cidr[],
  expires_at timestamptz,
  created_by text, created_at timestamptz default now(), revoked_at timestamptz
);

tenants.security_settings (JSONB)

{
  "mfa_required_for_all_users": true,
  "session_timeout_minutes": 30,
  "allowed_ip_ranges": ["196.201.0.0/16"],
  "resident_regions": ["KE","UG"],
  "require_dpop": true,
  "require_mtls_for_m2m": true
}

15.2 Key endpoints (sketch)

POST /auth/login → WebAuthn begin/finish or OIDC callback → access & refresh tokens.
POST /auth/token → refresh (rotate), DPoP proof validation.
POST /auth/mfa/challenge → produce bound challenge on 403.
POST /auth/mfa/complete → exchange for AAL2 token (bound).
GET /sessions?userId=... / DELETE /sessions/{id} → admin visibility/revocation.
POST /api-keys / DELETE /api-keys/{id} → create/rotate/revoke; show secret once.
POST /authz/decision (internal) → Envoy ext_authz integration.
POST /authz/simulate?policy_ref=... → dry-run with explain.
POST /scim/v2/Users, PATCH /scim/v2/Users/{id}, DELETE /scim/v2/Users/{id}.

16) Security operations runbook (high level)

Key compromise suspected:
- Revoke sessions (sess:*), rotate signing keys (per-tenant JWKS), invalidate refresh tokens, notify tenants.
Deny storm:
- Inspect policy rollout; compare canary vs stable; auto-rollback if drift detected; check tuple stream lag.
MFA provider outage:
- Allow only already-AAL2 sessions; pause new step-ups; enable break-glass for ops with dual approvals.
Region isolation:
- Enforce residency via PBAC; kill switch high-risk actions in region; fail service-to-service to local region only.

17) Delivery plan (90 days)

Weeks 0–2

Canonical input schema (TypeScript types + Zod).
Envoy ext_authz + centralized OPA; decision logging to secure sink.
WebAuthn + step-up bound challenges; Redis sessions & revocation.

Weeks 3–6

ReBAC tuples + OPAL deltas; DB RLS; partial-eval bundles for hot paths.
Per-tenant federation setup & SCIM MVP; per-tenant JWKS rotation flow.
HMAC legacy support with nonce replay cache; OAuth 2.1 mTLS for new clients.

Weeks 7–9

Hash-chained audit + WORM export; PBAC registry; purpose on all requests.
Mutation testing for Rego; property-based authz tests; policy coverage dashboard.

Weeks 10–12

Blue/green policy deploys with signed bundles; drift detection + canaries.
Formal invariants (TLA+) for settlement approval; game day #1; compliance evidence exports.

18) Minimal code snippets to bootstrap

TypeScript: request identity extractor

export type AuthzInput = {
  tenant: { id: string; security: Record<string, unknown> };
  subject: { id: string; type: "user"|"service"; roles: string[]; aal: number };
  resource: { type: string; id: string; tenant_id: string; attrs?: Record<string, unknown> };
  action: string;
  purpose: string;
  context: { ip: string; ua: string; risk: string; time: string };
};

export function toAuthzInput(req: any): AuthzInput {
  return {
    tenant: { id: req.headers["x-tenant-id"], security: req.tenantSecurity }, // x-tenant-id must be set by the trusted edge and match the token's tid claim
    subject: { id: req.user.sub, type: req.user.typ, roles: req.user.roles ?? [], aal: req.user.aal ?? 1 },
    resource: req.resourceDescriptor,         // set by router/resource middleware
    action: req.action,                       // e.g., "settlement.update"
    purpose: req.headers["x-purpose"] ?? "operational",
    context: { ip: req.ip, ua: req.headers["user-agent"], risk: req.risk, time: new Date().toISOString() }
  };
}

Node: API key verification with budgets

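async function verifyApiKey(req) {
  const id = req.get("x-api-key-id");
  const sig = req.get("x-signature");
  const nonce = req.get("x-nonce");
  const date = req.get("date");
  // Fail closed when any signing header is absent.
  if (!id || !sig || !nonce || !date) throw forbidden("missing_auth_headers");
  const rec = await db.api_keys.findByPk(id);
  if (!rec || rec.revoked_at) throw forbidden("invalid_key");
  assertWithin(rec.ip_allowlist, req.ip);
  assertNotExpired(rec.expires_at);
  // Bearer-secret mode: compare the presented secret against the stored argon2id hash.
  const ok = await argon2Verify(rec.secret_hash, req.get("x-api-key-secret") ?? ""); // or use HMAC only
  if (!ok) throw forbidden("bad_secret");
  // HMAC mode needs the raw signing key; a MAC cannot be recomputed from an argon2id hash.
  // rec.key_bytes is assumed to come from separate encrypted storage (it is not a column of api_keys in 15.1).
  // Date freshness and the nonce replay check from 6.2 are assumed to be enforced inside verifyHmac.
  verifyHmac(sig, {method:req.method,path:req.path,body:req.rawBody,date,nonce}, rec.key_bytes);
  await assertBudget(rec, req); // amount/day, txn/min
}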

PostgreSQL: per-request tenant scoping

-- At request start:
-- SELECT set_config('app.current_tenant', $1, true);
-- SELECT set_config('app.principal_roles', $2, true); -- optional JSON of role list

19) Risk register (selected)

Policy drift (fallback logic diverges) → Block fallbacks; enforce ext_authz mandatory; test drift in canaries.
Tuple staleness → OPAL deltas + admin UI waits for ingestion ACK; display “effective policy time”.
JWKS misconfiguration per tenant → rotation states + alarms on stale keys; tests on federation setup.
Audit PII leakage → central redaction, e2e tests asserting absence of sensitive fields.
Deny storms → dashboards, rate-of-change alerts, kill-switches for specific actions/tenants.

20) Final checklist (go-live)

TL;DR

This blueprint yields regulator-ready, breach-resilient AuthN/Z: passkeys and bound step-ups, DPoP/mTLS-bound short-lived tokens, centralized OPA with ReBAC and PBAC, plus DB-level RLS as a backstop. Everything is signed, measured, tested, and reversible, with clean audit trails and fast failure modes—exactly what a payments platform needs to scale cross-tenant and cross-region without sacrificing safety or speed.