Get Started

Overview & Onboarding

Project context, scope, and how the design documents fit together.

Meta Data Customer Onboarding

Execution Strategy & Rollout Plan

Multi-Tenant Platform Migration

Executive Summary

This document outlines the strategy to migrate our multi-tenant platform from MongoDB to PostgreSQL. The migration is structured in two phases: a controlled rollout for the first five tenants, followed by a bucketed delivery model for the remaining customer base.

The approach is designed to minimize risk, preserve every customer's production configuration exactly as it exists today, and validate the new platform end-to-end before any customer is moved live. Mocking and isolated environments ensure we never disturb external partners or production systems during validation.

Guiding Principle

No customer goes live on PostgreSQL until their tenant-specific configuration, data, workflows, and integrations have been validated against production behaviour in a clean, isolated environment.

Migration at a Glance

Phase	Scope	Approach	Outcome
Phase 1	First 5 tenants	Clean-state INTQA environments; each team mirrors production config as-is	Validated reference implementation
Phase 2	Remaining tenants	Customer bucketing based on configuration similarity	Bucket-by-bucket go-live

1. Configuration & Operational Flow Extraction

Before any data moves, we need a complete inventory of how each tenant operates today. This is the foundation everything else depends on.

What we extract

Tenant-level configurations — feature flags, environment-specific settings, branding, locale, and routing rules
Workflows — business process definitions, state machines, approval chains, and trigger conditions
Business rules — validation logic, eligibility criteria, calculation rules, and decision tables
Integration settings — endpoints, credentials (referenced, not extracted), webhook subscriptions, and partner identifiers
User and role mappings — RBAC definitions, team structures, and permission grants

Execution approach

We build a dedicated extraction utility that connects to each production tenant's MongoDB instance in read-only mode and produces a structured artifact per tenant — a single source of truth for that tenant's operational state. Each artifact is versioned, signed, and stored in a dedicated configuration repository. Sensitive values are tokenized; credentials are never copied, only referenced.

Deliverable

A configuration manifest per tenant (five for Phase 1) that downstream teams use as the authoritative input for setting up the PostgreSQL environment. Any drift between this manifest and what is actually configured in INTQA is flagged immediately.
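As an illustration of how drift between a manifest and an INTQA environment could be flagged, the sketch below compares two nested configuration dictionaries key by key. The function names (`flatten`, `detect_drift`), the dotted-path convention, and the shape of a flagged item are assumptions made for this example, not part of the extraction utility's design.

```python
from typing import Any


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys so values can be compared one by one."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def detect_drift(manifest: dict[str, Any], intqa: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one flagged item per key that differs or exists on only one side."""
    expected, actual = flatten(manifest), flatten(intqa)
    missing = object()  # sentinel: distinguishes "absent" from an explicit None
    items = []
    for path in sorted(expected.keys() | actual.keys()):
        want = expected.get(path, missing)
        got = actual.get(path, missing)
        if want is missing:
            kind = "unexpected_in_intqa"
        elif got is missing:
            kind = "missing_in_intqa"
        elif want != got:
            kind = "value_mismatch"
        else:
            continue  # identical on both sides, no drift
        items.append({"path": path, "kind": kind})
    return items
```

An empty result means the environment matches the manifest for every key the manifest defines, and that INTQA carries no extra keys.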

2. Schema Mapping: MongoDB to PostgreSQL

The data models are fundamentally different. MongoDB's document model accommodates nested structures and flexible schemas; PostgreSQL is relational with strict typing. The mapping is not mechanical — it requires deliberate decisions about normalization, indexing, and how to handle existing flexibility.

Mapping principles

Normalize where data is queried relationally — break embedded documents into related tables when joins are common
Preserve as JSONB where data is genuinely schema-flexible or rarely queried by inner fields — PostgreSQL JSONB handles this well
Establish primary key strategy — typically migrate Mongo ObjectIds to UUID columns, with the original ID preserved for traceability
Define index parity: every query pattern that performs well in MongoDB must have a corresponding index in PostgreSQL
Handle nulls, defaults, and enums explicitly — MongoDB's permissiveness must become PostgreSQL's strictness without breaking existing records

Deliverable

A schema mapping specification, maintained as code, that defines for every MongoDB collection: the target PostgreSQL table(s), field-by-field type translation, normalization decisions, constraint rules, and index definitions. This document is reviewed and signed off by engineering before any transformation code is written.

MongoDB Element	Typical PostgreSQL Mapping
Collection	Table (or set of related tables when normalized)
Document _id (ObjectId)	UUID primary key + legacy_id column for traceability
Embedded document	JSONB column OR separate related table (decision per case)
Array of references	Junction table with foreign keys
Array of values	JSONB array OR normalized child table
Dynamic / sparse fields	JSONB column with documented expected keys
Mongo indexes	B-tree, GIN (for JSONB), or partial indexes as appropriate
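To make the first rows of the table concrete, here is a minimal sketch of mapping one document to one row. It assumes the ObjectId is available as its 24-character hex string, derives the UUID deterministically with `uuid.uuid5` so that re-running the migration yields the same primary key, and keeps every field that was not promoted to a column in a single JSONB-bound `attributes` value. The namespace constant, the column names, and the `to_row` helper are hypothetical; the real decisions belong in the mapping specification.

```python
import json
import uuid

# Hypothetical namespace: any fixed UUID works, it only has to stay constant
# so the same ObjectId always maps to the same primary key.
LEGACY_ID_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def to_row(doc: dict, relational_fields: list[str]) -> dict:
    """Map one MongoDB document onto a flat row for a PostgreSQL table."""
    legacy_id = str(doc["_id"])  # hex form of the ObjectId, kept for traceability
    row = {
        "id": str(uuid.uuid5(LEGACY_ID_NAMESPACE, legacy_id)),
        "legacy_id": legacy_id,
    }
    for field in relational_fields:
        row[field] = doc.get(field)  # absent fields become explicit NULLs
    # Everything not promoted to a column is serialized for a JSONB column.
    extras = {
        key: value
        for key, value in doc.items()
        if key != "_id" and key not in relational_fields
    }
    row["attributes"] = json.dumps(extras, sort_keys=True)
    return row
```

A deterministic UUID is one option among several. A random UUID plus a lookup on `legacy_id` would also satisfy the traceability requirement, at the cost of an extra join when resolving references.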

3. Prod-to-Lower-Environment Sync Framework

Lower environments must reflect production configuration to make validation meaningful. Without this, teams test against a version of reality that does not match what customers actually use.

What gets synced

Configurations, workflows, rules, and other operational settings — the same artifacts produced by the extraction step
Reference data — lookup tables, code lists, and master data
Selectively masked transactional data — only where required for realistic scenario coverage, with PII redacted

Execution approach

A sync framework is built as a one-way pipeline from production to lower environments. It runs on demand and on a schedule, with three modes:

Full sync — used when standing up a fresh environment or after major configuration changes
Delta sync — applies only changes since the last sync, used routinely to keep environments current
Selective sync — operator chooses specific tenants or specific artifact types to refresh

All sync operations are audited, reversible (via snapshot before sync), and gated by approval for non-routine modes. PII handling rules are enforced at the framework level, not left to individual scripts.
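The three modes differ only in which records they select for transfer. The sketch below shows that selection step in isolation, under the assumption that each record carries `tenant`, `artifact_type`, and `updated_at` fields. Those field names and the `select_records` helper are illustrative only; approval, snapshotting, and masking are deliberately left out.

```python
from datetime import datetime
from typing import Iterable, Optional


def select_records(
    records: Iterable[dict],
    mode: str,
    last_sync: Optional[datetime] = None,
    tenants: Optional[set] = None,
    artifact_types: Optional[set] = None,
) -> list[dict]:
    """Pick the records a sync run would transfer for the given mode."""
    records = list(records)
    if mode == "full":
        return records
    if mode == "delta":
        if last_sync is None:
            raise ValueError("delta sync needs the timestamp of the last sync")
        # Only changes made after the previous sync are applied.
        return [r for r in records if r["updated_at"] > last_sync]
    if mode == "selective":
        # A filter left as None means "do not restrict on this dimension".
        return [
            r for r in records
            if (tenants is None or r["tenant"] in tenants)
            and (artifact_types is None or r["artifact_type"] in artifact_types)
        ]
    raise ValueError(f"unknown sync mode: {mode}")
```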

Deliverable

A sync service with a simple operator interface, scheduled jobs for routine delta sync, and a full audit log of every sync operation. Lower environments converge to production within a defined SLA.

4. Data Validation Framework

Migrated data must be provably equivalent to source data. This is non-negotiable — financial, regulatory, and operational integrity all depend on it.

Validation layers

Layer	What It Checks	How
Volumetric	Row counts match per tenant per entity	Automated count reconciliation
Field-level	Every field migrated correctly with right type	Sampled deep comparison, plus 100% comparison on critical fields
Referential	All relationships intact post-normalization	Foreign key integrity checks
Business-rule	Aggregates, calculations, and derived values match	Targeted queries on both sides, results compared
Temporal	Created/updated timestamps preserved	Direct comparison

Execution approach

Validation runs as part of the migration pipeline, not as a separate afterthought. A migration is not considered complete for a tenant until all five validation layers pass. Discrepancies are logged with full context, classified by severity, and routed to the responsible team.
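A minimal sketch of the volumetric layer and the five-layer completion gate follows. It assumes per-entity counts have already been collected on both sides as plain dictionaries. The layer identifiers and function names are chosen for this example and are not taken from the framework.

```python
# Identifiers for the five layers in the table above (names are illustrative).
LAYERS = ("volumetric", "field_level", "referential", "business_rule", "temporal")


def reconcile_counts(source_counts: dict[str, int], target_counts: dict[str, int]) -> list[dict]:
    """Compare per-entity counts for one tenant and list every mismatch."""
    discrepancies = []
    for entity in sorted(source_counts.keys() | target_counts.keys()):
        source = source_counts.get(entity, 0)
        target = target_counts.get(entity, 0)
        if source != target:
            discrepancies.append(
                {"entity": entity, "source": source, "target": target, "delta": target - source}
            )
    return discrepancies


def migration_complete(layer_results: dict[str, bool]) -> bool:
    """A tenant's migration is complete only when all five layers pass."""
    # A layer that did not report at all counts as a failure.
    return all(layer_results.get(layer, False) for layer in LAYERS)
```

Counts are compared per logical entity rather than per physical table, because normalization can turn one collection into several tables.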

Deliverable

A validation report per tenant, produced automatically at the end of each migration run. The report is the formal artifact that signs off data correctness before a tenant proceeds to use-case validation.

5. End-to-End Use Case Validation Framework

Data correctness is necessary but not sufficient. The system must behave correctly under real-world operational flows. This framework validates that customer journeys work end-to-end on PostgreSQL.

What gets validated

Critical customer journeys — every workflow that a tenant uses in production
Cross-system flows — operations that touch multiple services or external partners
Edge cases and known production scenarios — including historical incidents and complex configurations
Performance baselines — response times under representative load

Execution approach

Each tenant's team builds a use-case catalogue from their production operational flows. These become automated end-to-end test suites that run in INTQA against the migrated environment. The same suites run pre-migration against MongoDB and post-migration against PostgreSQL — pass/fail parity is the gate.

Test suites are tenant-aware: each suite runs against its specific tenant configuration, ensuring that customer-specific behaviour is preserved, not just the platform default.
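The parity gate can be sketched as a comparison of two result maps, one from the MongoDB run and one from the PostgreSQL run of the same suite. The result shape (test name mapped to pass/fail) and the `parity_gate` function are assumptions for illustration.

```python
def parity_gate(mongo_results: dict[str, bool], postgres_results: dict[str, bool]) -> dict:
    """Evaluate the gate: the PostgreSQL suite is green and matches the MongoDB outcomes."""
    all_tests = mongo_results.keys() | postgres_results.keys()
    # A test present on only one side counts as a difference.
    differing = sorted(
        name for name in all_tests
        if mongo_results.get(name) != postgres_results.get(name)
    )
    failing = sorted(name for name, passed in postgres_results.items() if not passed)
    return {
        "green": not failing,
        "parity": not differing,
        "cutover_allowed": not failing and not differing,
        "differing": differing,
        "failing": failing,
    }
```

Checking both conditions matters: a test that fails on both databases has parity but is not green, and a test that newly passes on PostgreSQL is green but still a behavioural difference worth explaining.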

Deliverable

A use-case catalogue and automated test suite per tenant, with a pass/fail dashboard that leadership can review. A tenant does not move to production cutover until their suite is green and matches MongoDB behaviour.

6. Mocking Framework for External Integrations

During migration validation, we cannot call production third-party systems — doing so would create real-world side effects (payments, notifications, partner submissions). At the same time, we must validate that our integration code works correctly against the new database.

Approach

A dedicated mocking service replicates the behaviour of every external integration our platform depends on
Mocks are configured per scenario — happy path, error responses, timeout, partial success, etc.
All outbound calls from INTQA are routed to the mocking layer; no production partner is ever called from a non-production environment
Mock behaviour is captured from real production interactions (anonymized) so it reflects actual partner contracts, not idealized assumptions

Deliverable

A mocking service with a catalogue of scenarios for every integration. Teams can switch scenarios per test run to validate both happy-path and failure-mode behaviour without external impact.

7. Integration Payload Validation with Mocking

Beyond mocking responses, we validate that our system sends the correct payloads and handles responses correctly — both sides of the wire.

What gets validated

Request payloads — schema, required fields, data types, business field correctness
Response handling — successful parsing, correct mapping to internal state, error handling
Contract conformance — payloads match the published contract with each partner
Behavior under partner failure modes — timeouts, 5xx errors, malformed responses

Execution approach

The mocking framework records every request/response pair during validation runs. These are diffed against a baseline captured from MongoDB production behavior. Any deviation in payload structure or values is flagged. This catches subtle issues where the migration changes a field's representation in a way that breaks downstream contracts.
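The diffing step could look like the following sketch: a recursive comparison of a baseline payload against a recorded one that reports structural deviations, value changes, and type changes (for example a number that became a string). The path notation and message format are assumptions; payloads are assumed to be already parsed from JSON.

```python
def diff_payload(baseline, candidate, path: str = "$") -> list[str]:
    """List every deviation between a baseline payload and a recorded one."""
    # A change of type (e.g. 42 -> "42") is reported even if the text looks the same.
    if type(baseline) is not type(candidate):
        return [f"{path}: type {type(baseline).__name__} -> {type(candidate).__name__}"]
    if isinstance(baseline, dict):
        deviations = []
        for key in sorted(baseline.keys() | candidate.keys()):
            if key not in candidate:
                deviations.append(f"{path}.{key}: missing in candidate")
            elif key not in baseline:
                deviations.append(f"{path}.{key}: unexpected in candidate")
            else:
                deviations.extend(diff_payload(baseline[key], candidate[key], f"{path}.{key}"))
        return deviations
    if isinstance(baseline, list):
        if len(baseline) != len(candidate):
            return [f"{path}: length {len(baseline)} -> {len(candidate)}"]
        deviations = []
        for index, (left, right) in enumerate(zip(baseline, candidate)):
            deviations.extend(diff_payload(left, right, f"{path}[{index}]"))
        return deviations
    return [] if baseline == candidate else [f"{path}: {baseline!r} -> {candidate!r}"]
```

In practice, fields that legitimately vary between runs (timestamps, request IDs) would need to be excluded or normalized before diffing. That filtering is not shown here.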

Deliverable

A payload validation report per integration per tenant, with diffs against the production baseline. Zero unexplained deviations is the bar for cutover.

8. Phase 1 — Clean-State INTQA for First 5 Tenants

The first five tenants are migrated through dedicated, isolated INTQA environments — one per tenant — set up from a clean slate. Each tenant's team owns its environment and is responsible for configuring it to mirror production exactly as-is.

Setup model

Element	Description
Environment isolation	Each tenant gets its own INTQA environment — separate database, separate service instances, no cross-tenant contamination
Configuration source	Tenant team uses the configuration manifest from Section 1 as the authoritative source
Mirror-as-is principle	No 'cleanup' or 'optimization' during setup — production is replicated faithfully, including known quirks
Team ownership	Each tenant's team owns setup, validation, and sign-off for their environment
Shared infrastructure	Mocking layer and validation framework are shared across all five environments

Why clean-state matters

Starting from a clean state — rather than copying an existing test environment — eliminates inherited drift. Every configuration in INTQA exists because it was deliberately put there based on the production manifest. This makes drift detection meaningful and gives us a true representation of what each customer experiences.

Phase 1 success criteria

All five tenant environments configured and signed off by their respective teams
Data validation framework reports green for all five tenants
End-to-end use case suites pass at parity with MongoDB
Integration payload validation shows zero unexplained deviations
Tenant team sign-off — the team that owns the customer relationship confirms readiness

Phase 1 Outcome

Five tenants live on PostgreSQL, with a proven playbook covering extraction, mapping, sync, validation, mocking, and tenant-team ownership. This becomes the template for everything that follows.

9. Phase 2 — Customer Bucketing & Bucketed Go-Live

After the first five tenants are live, we shift from per-tenant treatment to bucketed delivery. Migrating the remaining customers one at a time would be too slow; treating them all the same would be too risky. Bucketing balances both.

How buckets are defined

Customers are grouped into buckets based on the configurations, workflows, rules, and other settings extracted in Section 1. Customers with similar operational profiles share a bucket, because they share migration risk and validation needs.

Bucketing Dimension	What It Captures
Workflow complexity	Number, depth, and customization level of business workflows
Integration footprint	Count and type of external integrations active
Custom rule density	Volume of tenant-specific business rules and logic
Data volume	Scale of records — affects migration runtime and validation cost
Compliance / regulatory profile	Special handling requirements that affect migration approach
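One simple way to turn these dimensions into buckets is to reduce each dimension to a coarse level and group tenants that share the same combination of levels. The sketch below does that. Every threshold, field name, and level label in it is a placeholder chosen for illustration, not a proposed value.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantProfile:
    ref_num: str
    workflow_count: int
    integration_count: int
    custom_rule_count: int
    record_count: int
    regulated: bool


def level(value: int, low: int, high: int) -> str:
    """Reduce a raw measure to low / medium / high using two cut-offs."""
    if value < low:
        return "low"
    return "high" if value >= high else "medium"


def bucket_key(profile: TenantProfile) -> tuple:
    # All cut-offs below are hypothetical and would be tuned from Section 1 data.
    return (
        level(profile.workflow_count, 10, 50),
        level(profile.integration_count, 3, 10),
        level(profile.custom_rule_count, 20, 100),
        level(profile.record_count, 1_000_000, 50_000_000),
        "regulated" if profile.regulated else "standard",
    )


def assign_buckets(profiles: list[TenantProfile]) -> dict[tuple, list[str]]:
    """Group tenants whose profiles fall into the same combination of levels."""
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for profile in profiles:
        buckets[bucket_key(profile)].append(profile.ref_num)
    return dict(buckets)
```

Exact-match grouping can produce many small buckets. Merging neighbouring combinations, or clustering on the raw measures, are alternatives if the bucket count becomes unmanageable.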

Bucketed delivery

Each bucket gets a tailored migration playbook derived from Phase 1 learnings but adapted to the bucket's profile
Within a bucket, customers are migrated in parallel where capacity allows, sequentially where risk requires it
Validation thresholds are bucket-specific — high-complexity buckets get more rigorous gates
Each bucket is fully delivered to live before the next bucket begins, unless explicitly parallelized with leadership approval

Bucket delivery cadence

After each bucket goes live, we hold a structured retrospective. Findings feed into the next bucket's playbook, so each bucket should migrate faster and at lower risk than the one before it.

Deliverable

A bucket assignment for every remaining customer, a tailored playbook per bucket, and a sequenced delivery plan with milestones leadership can track. Each bucket completion is a clean checkpoint with a count of customers moved, a risk summary, and lessons applied to the next.

Summary of Deliverables

Component	Deliverable
Configuration extraction	Per-tenant configuration manifest
Schema mapping	Mapping specification maintained as code
Prod-to-lower sync	Sync service with audit log and operator interface
Data validation	Per-tenant validation report from automated pipeline
Use case validation	Per-tenant test suite and pass/fail dashboard
External integration mocking	Mocking service with scenario catalogue
Payload validation	Per-integration payload diff report
Phase 1 — first 5 tenants	Five clean-state INTQA environments, signed off by tenant teams, live on PostgreSQL
Phase 2 — bucketed delivery	Bucket assignments, per-bucket playbooks, sequenced go-live plan

Bottom Line

The migration is structured to make risk visible and manageable at every step. We validate before we move. We move five customers through a deliberate, hands-on process to prove the approach. Then we scale through bucketing, with each bucket making the next one safer and faster.

High-Level Designs

Prod-to-Lower Sync

Two-hop chained sync: Mongo Prod → Mongo Lower → Postgres Lower.

HLD — Prod-to-Lower Environment Sync Framework

High-Level Design · Workstream 3 of 9 · MongoDB → Aurora PostgreSQL Migration
Companion document: LLD · Source: metadata-documents/MongoDB_to_PostgreSQL_v2_WS3_Sync_Framework.docx

Version	Date	Author	Notes
v0.1	2026-05-18	Migration Architecture	Initial draft from WS3 execution plan
v0.2	2026-05-18	Migration Architecture	Chained sync (Hop 1 + Hop 2) woven through executive summary, scope, diagrams, components, operations, and interactions

1. ⭐ Executive Summary

In one sentence: A two-hop, refNum-scoped, one-way pipeline that keeps lower environments looking like production — Hop 1 continuously refreshes Mongo Lower from Mongo Prod (automatic), and Hop 2 transforms Mongo Lower into Postgres Lower on operator demand by invoking the existing migration ingest pipeline (manual, gated).

Owner: DevOps / Platform Team | Phase: Phase 1 (first 5 customers) and Phase 2 (all remaining) | Workflow gates supported: Steps 1, 7, 8 of the nine-step workflow

What we get when it's done

Hop 1 (continuous, automatic): Lower-env Mongo (INTQA, Staging, Dev) converges to production Mongo on a scheduled cadence, per refNum.
Hop 2 (deliberate, gated): Operator-triggered, Full-only re-ingest from Mongo Lower into Postgres Lower (tenant_{refNum} schemas) using the same migration ingest pipeline that runs at cutover — strongest possible dogfooding.
Pre-flight freshness check blocks Hop 2 against a stale Mongo Lower. Override requires senior approval and is audited.
Snapshot before write on both hops; one-operation rollback.
Masking is non-bypassable and driven by MDS field classifications — PII never reaches a lower environment.
Auto-validation runs the Data Validation Framework immediately after Hop 2 completes.
End-to-end auditability: every chain run is queryable by a single correlation ID linking Hop 1 → Hop 2 → ingest → validation.
Refresh discipline post-cutover: once a customer cuts over, Hop 1 transitions to Aurora Prod → Aurora Lower; Hop 2 becomes unnecessary because source and target are both Aurora.

2. Scope

In scope	Out of scope
Hop 1 — Mongo Prod → Mongo Lower per refNum, in three modes: Full, Delta (CDC), Selective	Production credentials, secrets, API keys, KMS key material
Hop 2 — Mongo Lower → Postgres Lower (`tenant_{refNum}`), Full only, operator-triggered	Application code, infrastructure-as-code, schema DDL — owned elsewhere
Operational settings, lookup/reference data, selected transactional data (with PII masked)	Any reverse path from lower environment to production
Pre-flight freshness check + senior-approver override path for Hop 2	Cross-tenant data movement (impossible by design)
Snapshot-and-restore mechanism per sync run (both hops)	Backup / DR (handled by the DevOps Platform backup process)
Correlation ID issuance and end-to-end chain audit view	Synthetic test data generation
Auto-trigger of Data Validation Framework after Hop 2	The migration ingest pipeline itself — Hop 2 invokes it, doesn't contain it

3. ⭐ System Context Diagram (C4 L1)

flowchart LR
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef actor fill:#fff3e0,stroke:#ef6c00,color:#e65100
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c

    Sched([Delta Scheduler]):::actor
    Op([Operator]):::actor
    Snr([Senior Approver]):::actor
    Sec([Security / Compliance]):::actor

    subgraph ProdZone[" Production zone (read-only egress) "]
        MProd[(MongoDB Prod<br/>read replica)]:::prod
        AProd[(Aurora Prod<br/>post-cutover only)]:::prod
        MDS[MDS — field classifications]:::external
    end

    Sync["Prod-to-Lower<br/>Sync Framework<br/>(Hop 1 + Hop 2)"]:::internal
    Preflight[Hop 2 Pre-flight<br/>Freshness Check]:::gate

    subgraph LowerZone[" Lower environments (per refNum) "]
        MLower[(Mongo Lower<br/>INTQA · Staging · Dev)]:::lower
        PLower[(Postgres Lower<br/>tenant_refNum)]:::lower
    end

    MigPipe[Migration Ingest Pipeline<br/>JOLT → DAAL]:::external
    Val[Data Validation Framework]:::external
    Audit[(Audit Sink)]:::internal

    Sched -- "Hop 1 Delta (auto)" --> Sync
    Op -- "Hop 1 Full / Selective (with approval)" --> Sync
    Op -- "Hop 2 trigger (Full, with approval)" --> Sync
    Snr -- "preflight override (if stale)" --> Sync

    Sync -- "reads (replica only)" --> MProd
    Sync -- "reads (replica only, post-cutover)" --> AProd
    Sync -- "reads classifications" --> MDS
    Sync -- "Hop 1: writes one-way" --> MLower

    Sync --> Preflight
    Preflight -- "pass" --> MigPipe
    Preflight -. "fail → 409" .-> Op
    MLower -- "Hop 2 source" --> MigPipe
    MigPipe -- "writes" --> PLower
    Sync -- "Hop 2 auto-trigger" --> Val
    Val -. "reads" .-> PLower

    Sync -- "records every operation" --> Audit
    Sec -- "reviews audit + approvals + overrides" --> Audit

    style ProdZone stroke-dasharray: 5 5,stroke:#00695c
    style LowerZone stroke-dasharray: 5 5,stroke:#1565c0

The Sync Framework owns both hops. Hop 1 writes directly to Mongo Lower. Hop 2 invokes the migration ingest pipeline — the same pipeline that runs at cutover — to populate Postgres Lower from Mongo Lower. Production is never a writable target from any lower environment.

4. High-Level Architecture (C4 L2 — Container)

flowchart TB
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef actor fill:#fff3e0,stroke:#ef6c00,color:#e65100
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f

    Op([Operator UI / CLI]):::actor
    Sched([Delta Scheduler]):::actor

    subgraph Sync[" Sync Framework "]
        direction TB
        API[Sync API / Orchestrator]:::internal
        Approval[Approval Workflow<br/>Hop 1 Full/Selective + Hop 2]:::internal
        Tagger[Correlation ID Tagger]:::internal
        Preflight[Hop 2 Pre-flight<br/>Freshness Validator]:::gate

        subgraph Hop1[" Hop 1 Pipeline "]
            direction TB
            Router[Mode Router<br/>Full · Delta · Selective]:::internal
            Reader[Production Reader<br/>replica-only, throttled]:::internal
            Mask[Masking Engine<br/>MDS-driven]:::internal
            Snap1["Snapshot Manager<br/>(Mongo Lower)"]:::internal
            Writer[Target Writer<br/>refNum-scoped]:::internal
        end

        subgraph Hop2[" Hop 2 Surface "]
            direction TB
            Snap2["Snapshot Manager<br/>(Postgres Lower)"]:::internal
            Adapter[Migration Pipeline<br/>Adapter]:::internal
            ValHook[Validation Webhook]:::internal
        end

        CDC[(CDC Resume<br/>Token Store)]:::store
        SnapStore[(Snapshot Store)]:::store
        AuditLog[(Audit Log)]:::store
    end

    ProdDB[(Production<br/>read replica)]:::prod
    MDS[MDS]:::internal
    MLower[(Mongo Lower<br/>tenant_refNum)]:::lower
    PLower[(Postgres Lower<br/>tenant_refNum)]:::lower
    MigPipe[Migration Ingest<br/>Pipeline]:::external
    Val[Data Validation<br/>Framework]:::external

    Op --> API
    Sched --> API
    API --> Tagger
    Tagger --> Approval
    Approval --> Router
    Approval --> Preflight

    Router --> Reader --> Mask --> Snap1 --> Writer --> MLower
    Reader --> ProdDB
    Mask --> MDS
    Reader -. resume tokens .-> CDC
    Snap1 -. snapshots .-> SnapStore

    Preflight -- "pass" --> Snap2 --> Adapter
    Snap2 -. snapshots .-> SnapStore
    Adapter -- "invoke (source=Mongo Lower, target=tenant_refNum, full)" --> MigPipe
    MLower --> MigPipe --> PLower
    Adapter --> ValHook --> Val

    Writer --> AuditLog
    Adapter --> AuditLog
    Preflight --> AuditLog

Hop 1 and Hop 2 share one control plane (API, Tagger, Approval, Snapshot Store, Audit Log) but have distinct execution pipelines, each with its own Snapshot Manager. Hop 2 does not contain its own ingest logic; it invokes the existing migration ingest pipeline.

5. Key Components

Component	Hop	Responsibility	Owner
Sync API / Orchestrator	both	Front door for operator + scheduler; exposes Hop 1 and Hop 2 endpoints; coordinates downstream steps.	Platform Eng
Correlation ID Tagger	both	Issues a `chain-{refNum}-{date}-{shortid}` per Hop 2 trigger; stamps both `SYNC_RUN` rows and downstream ingest + validation records.	Platform Eng
Approval Workflow	both	Gates Hop 1 Full/Selective and every Hop 2 behind recorded second-party approval.	Platform Eng
Delta Scheduler	Hop 1	Triggers Hop 1 Delta runs on cadence per refNum/category.	Platform Eng
Mode Router	Hop 1	Dispatches to Full, Delta, or Selective handler with correct scope.	Platform Eng
Production Reader	Hop 1	Reads from production replica only; throttled to avoid replication lag.	Platform Eng + DBA
Masking Engine	Hop 1	Applies field-level masking driven by MDS classifications (PII / sensitive / public).	Platform Eng + Security
Snapshot Manager (Mongo Lower)	Hop 1	Captures Mongo Lower state before every Hop 1 sync; supports one-call restore.	Platform Eng
Target Writer (Mongo Lower)	Hop 1	Applies changes per `refNum` to Mongo Lower; refNum-scoped statements only.	Platform Eng
Hop 2 Pre-flight Freshness Validator	Hop 2	Blocks Hop 2 unless last successful Hop 1 for the same `(refNum, target_env)` falls within the staleness window.	Platform Eng + DevOps
Snapshot Manager (Postgres Lower)	Hop 2	Captures Postgres Lower `tenant_{refNum}` schema before the destructive truncate+ingest.	Platform Eng + DBA
Migration Pipeline Adapter	Hop 2	Invokes the existing migration ingest pipeline with `source=mongo-lower-{env}`, mode=Full, propagating the correlation ID.	Platform Eng
Validation Webhook	Hop 2	Auto-triggers the Data Validation Framework once Hop 2 ingest completes; carries the correlation ID.	Platform Eng + QA Eng
CDC Resume Token Store	Hop 1	Durable store of MongoDB change-stream resume tokens, per refNum.	Platform Eng
Snapshot Store	both	Durable store of pre-sync snapshots for Mongo Lower and Postgres Lower; configurable retention.	Platform Eng + DBA
Audit Log	both	Append-only record of every sync operation, including chain runs joined by correlation ID.	Platform Eng + Security
Migration Ingest Pipeline (external collaborator)	Hop 2	The platform's JOLT → DAAL pipeline. Hop 2 reuses this; it is not part of the Sync Framework.	Mapping + Platform Eng
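The behaviour of the Hop 2 Pre-flight Freshness Validator can be sketched as a pure function over the timestamp of the last successful Hop 1 run. The `senior-approver` role name comes from this document. The `PreflightResult` type, the function signature, and the reason strings are assumptions, and mapping a blocked result to the 409 response shown in the context diagram is left to the caller.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PreflightResult:
    allowed: bool
    reason: str
    override_used: bool = False  # True means the run must be audited as an override


def preflight_check(
    last_hop1_success: Optional[datetime],
    now: datetime,
    staleness_window: timedelta,
    override_role: Optional[str] = None,
) -> PreflightResult:
    """Allow Hop 2 only if Hop 1 succeeded recently for the same (refNum, target_env)."""
    fresh = (
        last_hop1_success is not None
        and now - last_hop1_success <= staleness_window
    )
    if fresh:
        return PreflightResult(True, "mongo lower is fresh")
    if override_role == "senior-approver":
        return PreflightResult(True, "stale, overridden by senior approver", override_used=True)
    return PreflightResult(False, "mongo lower is stale; run Hop 1 or request an override")
```

Treating "no successful Hop 1 on record" as stale keeps the check fail-closed for a freshly provisioned environment.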

6. Core Principles

One-way only. Data flows production → lower. Never the reverse. Enforced at the infrastructure layer, not by policy.
refNum-scoped. Every operation in either hop touches one customer's data at a time. Cross-customer contamination is impossible by design.
Hop 1 is continuous; Hop 2 is deliberate. Scheduled Delta keeps Mongo Lower fresh without operator attention. Hop 2 is always explicit, approved, gated, and snapshotted.
Reuse the migration pipeline for Hop 2. The pipeline that runs at cutover is the pipeline that refreshes Postgres Lower — no parallel implementation, strongest dogfooding signal.
Reversible. Every sync (both hops) takes a snapshot first. The snapshot is the restore point.
PII handling at the framework level. Masking rules come from MDS classifications, not from individual scripts. Hop 2 inherits the already-masked data from Mongo Lower; it never re-touches production.
Validation-aware. Every Hop 2 emits a correlation ID and auto-triggers the Data Validation Framework. The chain is auditable end-to-end by that ID.
Audited. Every sync — both hops, including pre-flight overrides — is logged with actor, refNum, scope, snapshot id, and outcome.
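To illustrate framework-level masking driven by classifications, the sketch below masks one record using a field-to-classification map with the three labels named in this document (PII, sensitive, public). What each label does to a value is an assumption for this example: PII is replaced by a salted hash token, sensitive values are redacted, public values pass through. A field without a usable classification causes the record to be rejected instead of being copied unmasked.

```python
import hashlib


def mask_value(value: object, salt: str) -> str:
    """Replace a value with a stable token so equal inputs stay joinable."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return f"tok_{digest[:16]}"


def mask_record(record: dict, classifications: dict[str, str], salt: str) -> dict:
    """Apply per-field masking; refuse records that cannot satisfy the policy."""
    masked = {}
    for field, value in record.items():
        label = classifications.get(field)
        if label == "public":
            masked[field] = value
        elif label == "sensitive":
            masked[field] = "[REDACTED]"
        elif label == "pii":
            masked[field] = mask_value(value, salt)
        else:
            # Unknown or missing classification: fail closed.
            raise ValueError(f"no usable classification for field {field!r}")
    return masked
```

Because masking happens before the write to Mongo Lower, anything Hop 2 reads is already masked, which is why Hop 2 needs no masking step of its own.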

7. Interactions With Other Workstreams

Consumes from	What
WS1 — Configuration Extraction	Customer manifests; defines what "operational settings" exist per refNum.
MDS	Field classifications (PII / sensitive / public) drive Hop 1 masking. Also drives Hop 2 indirectly (the ingest pipeline that Hop 2 invokes references MDS).
Migration Ingest Pipeline (WS2 + platform)	Hop 2 invokes this pipeline. It owns JOLT execution, DAAL writes, and DLQ.
WS8 — Clean-State INTQA	Receives Hop 1 syncs (Mongo Lower) and Hop 2 ingests (Postgres Lower) into per-customer INTQA environments.

Publishes to	What
WS4 — Data Validation	Hop 2 auto-triggers validation after each successful run. Lookup data synced in Hop 1 is the precondition for the FK-integrity check on root entities.
WS5 — End-to-End Validation	E2E suites run against the Postgres Lower state produced by Hop 2.
Security & Compliance	Audit log (including pre-flight overrides) used as evidence in the customer's sign-off package.
Leadership Dashboard	Chain run view (Hop 1 → Hop 2 → ingest → validation) per refNum.

8. Non-Functional Requirements

Category	Requirement
Security	No path from lower to prod. All credentials environment-scoped. Masking non-bypassable. Approval recorded for Hop 1 Full/Selective and every Hop 2. Pre-flight override requires `senior-approver` role; override events are page-worthy.
Compliance	PII never reaches lower envs. MDS classifications are the single source for masking rules. Audit log retained per policy and queryable by correlation ID end-to-end.
Performance	Hop 1 reads from production replica only, throttled to avoid replication lag impact. Hop 2 throughput governed by migration ingest pipeline; not a Sync Framework concern.
Reliability	Snapshot taken before every sync on both hops. CDC tokens persist across runs. Loss of token triggers Hop 1 Full Sync fallback, recorded.
Observability	Each sync (both hops) produces a structured run record; chain runs queryable as one row via `CHAIN_RUN_VIEW`.
Operability	Runbooks published for Hop 1 Full / Selective / Restore, Hop 2 Trigger / Pre-flight Override / Restore. Designated operators trained.
Cost	Replica reads only. Snapshot storage tiered to lower-cost storage after configurable retention.
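The reliability rule for CDC tokens can be expressed as a small decision function: use Delta when a resume token exists for the refNum, otherwise fall back to Full and record the fallback. The in-memory `token_store` and `audit_log` stand in for the durable stores, and the event shape is hypothetical.

```python
def choose_hop1_mode(token_store: dict, ref_num: str, audit_log: list) -> str:
    """Pick Delta when a resume token is available, else fall back to Full and record it."""
    token = token_store.get(ref_num)
    if token:
        return "delta"
    # Token lost or never written: a change stream cannot be resumed safely.
    audit_log.append(
        {"refNum": ref_num, "event": "cdc_token_missing", "fallback": "full"}
    )
    return "full"
```

With a live MongoDB source, the stored token would be handed back to the change stream when it is reopened, so that the Delta run resumes where the previous one stopped.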

9. Sync Operations

The framework exposes four operations, all governed by the same control plane.

Operation	Hop	When to use	Trigger	Approval	Snapshot	Destructive on target?
Full	1	Standing up a fresh Mongo Lower environment; after major production changes.	Operator request	Required, recorded	Yes	Yes (Mongo Lower)
Delta	1	Routine, scheduled refresh of Mongo Lower.	Automatic (scheduler)	Not required	Yes	No (incremental)
Selective	1	Targeted refresh of one refNum or artifact type; tactical fixes in Mongo Lower.	Operator request	Required, recorded	Yes	Partial (scoped)
Hop 2 (Full)	2	Refresh Postgres Lower from Mongo Lower for validation, rehearsal, or pipeline dogfooding.	Operator only	Required, recorded	Yes (Postgres Lower)	Yes (truncate + re-ingest)
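The table above can be encoded as data so that the service, not the operator, enforces trigger and approval rules. The following sketch does that. The `OperationPolicy` type, the operation keys, and the `authorize` function are illustrative names, not the framework's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperationPolicy:
    hop: int
    trigger: str             # who may start the operation
    approval_required: bool
    snapshot: bool
    destructive: str         # "yes", "no", or "partial"


# One entry per row of the operations table.
POLICIES = {
    "full": OperationPolicy(1, "operator", True, True, "yes"),
    "delta": OperationPolicy(1, "scheduler", False, True, "no"),
    "selective": OperationPolicy(1, "operator", True, True, "partial"),
    "hop2_full": OperationPolicy(2, "operator", True, True, "yes"),
}


def authorize(operation: str, triggered_by: str, approval_id: Optional[str] = None) -> OperationPolicy:
    """Reject a run whose trigger or approval does not match the policy."""
    policy = POLICIES[operation]
    if triggered_by != policy.trigger:
        raise PermissionError(f"{operation} cannot be triggered by {triggered_by}")
    if policy.approval_required and not approval_id:
        raise PermissionError(f"{operation} requires a recorded approval")
    return policy
```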

10. ⭐ Chained Sync — Lifecycle & Promises

The two hops cooperate as one strategy. This section captures how the strategy changes through the migration lifecycle and the user-facing guarantees it offers.

10.1 How the chain changes over the migration lifecycle

Phase	Hop 1	Hop 2	Notes
Pre-cutover (Phase 1 + Phase 2 buckets)	Mongo Prod → Mongo Lower	Mongo Lower → Postgres Lower	The active chain. Used for INTQA validation, rehearsal, and pipeline dogfooding.
During cutover for a refNum	Continues for other refNums; paused for this one	Paused for this refNum	Cutover runs the migration ingest directly Mongo Prod → Aurora Prod for the specific refNum. The lower-env chain is irrelevant during the cutover window.
Post-cutover for a refNum	Aurora Prod → Aurora Lower	Not needed (source and target are both Aurora)	The chain collapses to a single hop; the same Sync Framework handles it without code changes. Hop 2 surface idles for this refNum.

10.2 What the chain promises

refNum-scoped end to end. A chain run only affects one customer's data in Mongo Lower and tenant_{refNum} in Postgres Lower.
No PII propagation. Hop 1 masks at the framework level; Hop 2 inherits the masked data. Masking is never re-applied or relaxed downstream.
No destructive surprises. Hop 2 cannot truncate Postgres Lower without first snapshotting it and recording operator approval.
No stale ingests. Pre-flight check blocks Hop 2 against a stale Mongo Lower unless senior approval is recorded.
Auto-validated. Every Hop 2 success triggers the Data Validation Framework; validation outcomes are pinned to the chain.
Auditable end-to-end. One correlation ID joins Hop 1 → Hop 2 → ingest → validation. Available via GET /chains/{correlation_id}.

Operating-model details and component-level mechanics are in the LLD §3.6, §4, §5.4–5.5.
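
One way to picture the end-to-end view behind GET /chains/{correlation_id} is a join of the chain's artifacts on the correlation ID. The sketch below is illustrative only: the record shapes, the `hop` field, and the function name are assumptions, not the service's actual data model.

```python
def build_chain_view(correlation_id, sync_runs, ingest_runs, validation_runs):
    """Join Hop 1, Hop 2, ingest, and validation records on one correlation ID."""

    def first(rows, **match):
        # Return the first row for this chain that also matches the extra fields.
        for row in rows:
            if row["correlation_id"] != correlation_id:
                continue
            if all(row.get(k) == v for k, v in match.items()):
                return row
        return None

    return {
        "correlation_id": correlation_id,
        "hop1": first(sync_runs, hop=1),
        "hop2": first(sync_runs, hop=2),
        "ingest": first(ingest_runs),
        "validation": first(validation_runs),
    }
```

A missing artifact shows up as `None`, which makes an incomplete chain (for example, a Hop 2 run whose validation has not finished) visible rather than silently omitted.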

11. ⭐ Risks & Mitigations

Risk	Mitigation
PII leakage to lower environments.	Masking enforced by the framework using MDS classifications. Sync rejects any operation that cannot satisfy masking policy.
Accidental write-back to production.	Lower environments have no credentials to write to production. Enforced at the infrastructure layer.
A sync breaks a working lower environment.	Snapshot taken before every sync on both hops. Restore is one operator action away.
Production load impact during sync.	Reads come from replica only. Throttled to keep replication lag within bounds.
CDC resume tokens lost or corrupted.	Tokens persist in durable storage. Loss triggers fallback to Full Sync and is recorded in the audit log.
Cross-customer contamination.	Sync operates per refNum. Cross-customer movement is impossible by design.
Operators bypass approval.	Approval is enforced by the service, not by policy. No path to Full / Selective / Hop 2 without recorded approval.
Lower environment drifts again after sync.	Scheduled Hop 1 Delta runs keep convergence ongoing. An out-of-cycle change in a lower env is reverted by the next Delta run.
Operator triggers Hop 2 against a stale Mongo Lower.	Pre-flight freshness check blocks the run unless Hop 1 has succeeded for this refNum within the configured window. Override requires recorded senior approval.
Hop 2 destroys a working Postgres Lower mid-rehearsal.	Hop 2 always snapshots the target before truncate+ingest. Restore is one operator action. Approval is required to trigger Hop 2 in the first place.
Migration pipeline behavior diverges between Hop 2 and real cutover.	Hop 2 invokes the same pipeline binary as cutover with `source=mongo-lower-{env}`. No parallel implementation.
Validation results not traceable back to a specific chain run.	Correlation ID is propagated from Hop 2 trigger through ingest and into validation reports. `CHAIN_RUN_VIEW` joins all four artifacts.
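
The pre-flight freshness check in the table above reduces to a small decision function. This is a hedged sketch: the function name, the reason strings, and the 202 status for an accepted run are assumptions; only the 409 response for a stale source and the senior-approval override come from the design.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def preflight_freshness(
    last_hop1_success: Optional[datetime],
    window: timedelta,
    now: datetime,
    senior_approval_id: Optional[str] = None,
) -> tuple[int, str]:
    """Return (http_status, reason) for a Hop 2 trigger on one refNum."""
    fresh = last_hop1_success is not None and now - last_hop1_success <= window
    if fresh:
        return 202, "fresh"
    if senior_approval_id:
        # Override path: allowed only with a recorded senior approval.
        return 202, f"stale-override:{senior_approval_id}"
    # Stale or never-synced Mongo Lower: block the run.
    return 409, "stale-mongo-lower"
```

A refNum that has never completed Hop 1 is treated the same as a stale one, so a missing timestamp fails closed.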

12. ⭐ Success Criteria

Sync policy document signed off by DevOps, Security, and Compliance.
Sync service built, deployed, and accessible to authorized operators in INTQA, Staging, and Development.
Masking driven by MDS field classifications — verified no PII reaches lower environments.
Hop 1: scheduled Delta sync running successfully on cadence; CDC resume tokens persist between runs.
Hop 1: snapshot capture and restore proven on a real environment.
Hop 2: Migration Pipeline Adapter invokes the cutover pipeline against source=mongo-lower-{env} for every Phase 1 customer.
Hop 2: pre-flight freshness check blocks stale-source runs; override path enforces senior approval and pages security.
Hop 2: snapshot capture and restore of Postgres Lower proven on a real environment.
Auto-trigger of Data Validation Framework after Hop 2 produces a validation report linked by correlation ID.
Audit log captures every sync operation (both hops) with full traceability (who, what, when, refNum, scope, snapshot id, correlation id).
Operators trained; runbooks published for Hop 1 Full / Selective / Restore and Hop 2 Trigger / Override / Restore.
Lower environments measurably converge to production state within the defined freshness expectation.
GET /chains/{correlation_id} returns the end-to-end view for every Phase 1 customer's chain runs.
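
The masking criterion above (classification-driven, with sync rejecting anything it cannot mask) can be illustrated with a fail-closed masker. The classification labels, masking functions, and exception name below are hypothetical; the real MDS vocabulary and masking policy may differ.

```python
import hashlib

# Hypothetical classification labels mapped to masking functions.
MASKERS = {
    "public": lambda value: value,
    "pii": lambda value: hashlib.sha256(str(value).encode()).hexdigest()[:12],
    "secret": lambda value: None,
}


class MaskingPolicyError(Exception):
    """Raised when a field has no classification the policy can satisfy."""


def mask_document(doc: dict, classifications: dict) -> dict:
    """Mask every field of a document according to its classification."""
    masked = {}
    for field_name, value in doc.items():
        label = classifications.get(field_name)
        if label not in MASKERS:
            # Fail closed: an unclassified field rejects the whole operation.
            raise MaskingPolicyError(f"no masking rule for field {field_name!r}")
        masked[field_name] = MASKERS[label](value)
    return masked
```

Rejecting the document, instead of passing unknown fields through, is what keeps a newly added production field from leaking before it has been classified.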

13. Glossary Reference

See _shared/glossary.md. Key terms used in this document: refNum, MDS, CDC, Sync mode, Snapshot, Resume token, Hop 1, Hop 2, Correlation ID, Pre-flight freshness check.

High-Level Designs

Data Validation

Reconciliation rules, assertions, severity, gates, and signed reports.

HLD — Data Validation Framework

High-Level Design · Workstream 4 of 9 · MongoDB → Aurora PostgreSQL Migration · Companion document: LLD · Source: metadata-documents/MongoDB_to_PostgreSQL_v2_WS4_Data_Validation.docx

Version	Date	Author	Notes
v0.1	2026-05-18	Migration Architecture	Initial draft from WS4 execution plan

1. ⭐ Executive Summary

In one sentence: Run the playbook's reconciliation rules — row count parity, FK integrity, required field coverage, sampled deep-diff — automatically, on every entity, for every customer, as a gate the migration cannot pass without.

Owner: QA / Data Validation Team | Phase: Phase 1 (proves the framework) and Phase 2 (runs unchanged) | Workflow gates supported: Steps 3, 7, 8, 9 of the nine-step workflow

What we get when it's done

Every customer ingestion produces a signed, per-entity validation report.
The four playbook reconciliation rules run automatically — no entity is hand-checked.
Per-entity business assertions catch domain-specific errors that structural checks miss.
Discrepancies are classified by severity; Critical issues block workflow advancement.
Dead Letter Queue inventory is a first-class output — a clean DLQ is a precondition for sign-off.
The framework is verified against seeded errors before any customer relies on it.
Auto-triggered after Sync Framework Hop 2 — every chained sync run automatically produces a validation report, tagged with the chain's correlation_id for end-to-end traceability.

2. Scope

In scope	Out of scope
Four default reconciliation rules: row count parity, FK integrity, required-field coverage ≥99.9%, sampled deep-diff	Manual spot-checks; ad-hoc SQL by individuals
Per-entity business assertions authored by QA during entity declaration review	Application-level business validation (covered by E2E framework, WS5)
Severity classification + routing to responsible teams	Defect ticket management (handled by existing tooling)
DLQ inventory and disposition tracking	Source-side data cleanup (owned by Domain Owners)
Per-customer, per-entity report generation and storage	Cutover go/no-go (decision belongs to release authority; framework provides evidence)
Pipeline integration so every ingest produces a validation result	Backup, retention, or DR for source data
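
To make the four default rules concrete, here is a minimal in-memory sketch. It is not the Rule Engine: the function signature and row shapes are assumptions, and it compares source and target rows directly, which presumes both sides have already been normalized to the same shape. The thresholds (exact row parity, zero orphans, at least 99.9% coverage, zero sampled diffs) follow the playbook defaults named in this document.

```python
def reconcile(source_rows, target_rows, key, fk, parent_keys, required, sample_keys):
    """Run the four default rules over in-memory rows; return failed rule names."""
    src = {row[key]: row for row in source_rows}
    tgt = {row[key]: row for row in target_rows}
    failures = []

    # 1. Row count parity: 0% tolerance.
    if len(src) != len(tgt):
        failures.append("row_count_parity")

    # 2. FK integrity: 0 orphans in the target.
    if any(row[fk] not in parent_keys for row in tgt.values()):
        failures.append("fk_integrity")

    # 3. Required-field coverage: at least 99.9% of required cells populated.
    cells = [row.get(name) for row in tgt.values() for name in required]
    filled = sum(1 for cell in cells if cell is not None)
    if cells and filled / len(cells) < 0.999:
        failures.append("required_field_coverage")

    # 4. Sampled deep-diff: 0 differences across the sampled keys.
    if any(src.get(k) != tgt.get(k) for k in sample_keys):
        failures.append("sampled_deep_diff")

    return failures
```

An empty result means all four rules passed for the entity; in the real framework each failure would become a structured finding with a severity.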

3. ⭐ System Context Diagram (C4 L1)

flowchart LR
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef actor fill:#fff3e0,stroke:#ef6c00,color:#e65100
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f

    QA([QA Team]):::actor
    DomainOwner([Domain Owners]):::actor
    Leadership([Leadership]):::actor

    MongoSrc[(MongoDB Source)]:::prod
    Ingest[Ingest Pipeline<br/>JOLT → DAAL]:::external
    Aurora[(Aurora Target<br/>tenant_refNum)]:::lower

    VF["Data Validation<br/>Framework"]:::internal

    MDS[MDS — entity decls<br/>+ assertions]:::external
    History[(metadata.<br/>ingest_run_history)]:::internal
    DLQ[(Dead Letter Queue)]:::internal
    Reports[(Per-customer<br/>per-entity reports)]:::internal
    Dashboard[Leadership Dashboard]:::internal

    MongoSrc --> Ingest --> Aurora
    Ingest -.->|run record| History
    Ingest -.->|failed records + context| DLQ

    VF -- "reads counts/keys/samples" --> MongoSrc
    VF -- "reads counts/keys/samples" --> Aurora
    VF -- "reads rules + assertions" --> MDS
    VF -- "reads run records" --> History
    VF -- "reads DLQ inventory" --> DLQ
    VF -- "writes results" --> History
    VF -- "writes report" --> Reports

    Reports --> Dashboard --> Leadership
    Reports --> QA
    DomainOwner --> MDS

The Data Validation Framework attaches to the existing ingest pipeline. It is not a separate, opt-in step — every ingest produces a validation result alongside the run record.

4. High-Level Architecture (C4 L2 — Container)

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c

    Ingest[Ingest Pipeline]:::external
    MDS[MDS]:::external

    subgraph VF[" Data Validation Framework "]
        direction TB
        Engine[Rule Engine<br/>4 defaults + assertion executor]:::internal
        DLQI[DLQ Inspector]:::internal
        Classifier[Severity Classifier<br/>+ Router]:::internal
        Reporter[Reporting Service]:::internal
        Gate[Gate Decision API]:::gate

        Rules[(Rule + Assertion<br/>Library)]:::store
        Results[(Validation Results<br/>in metadata.ingest_run_history)]:::store
        ReportStore[(Report Store)]:::store
    end

    Dashboard[Leadership Dashboard]:::internal

    Ingest -- "ingest run event" --> Engine
    MDS -- "entity decls + assertions" --> Rules
    Rules --> Engine
    Engine --> Classifier
    DLQI -- "DLQ inventory" --> Engine
    Classifier --> Reporter
    Reporter --> Results
    Reporter --> ReportStore
    Classifier --> Gate
    Gate -- "block / allow" --> Ingest
    ReportStore --> Dashboard

The Gate Decision API is the integration point that prevents workflow advancement when Critical findings exist. Override is possible only with explicit, recorded approval.
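
The gate behaviour described above (block on Critical findings, allow only with a recorded override) can be sketched as a pure function. The finding shape, the response fields, and the function name are assumptions for illustration, not the Gate Decision API's contract.

```python
from typing import Optional


def gate_decision(findings: list, override_approval_id: Optional[str] = None) -> dict:
    """Decide whether a workflow step may advance given classified findings."""
    critical = [f for f in findings if f["severity"] == "Critical"]
    if not critical:
        return {"decision": "allow", "reason": "no critical findings"}
    if override_approval_id:
        # The override is recorded in the decision itself so it stays auditable.
        return {
            "decision": "allow",
            "reason": "override",
            "approval_id": override_approval_id,
            "critical_count": len(critical),
        }
    return {"decision": "block", "reason": "critical findings", "critical_count": len(critical)}
```

Returning the approval ID and the count of overridden Critical findings keeps an override distinguishable from a clean pass in later reports.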

5. Key Components

Component	Responsibility	Owner
Rule Engine	Executes the four default reconciliation rules and per-entity assertions; produces structured findings.	QA Eng + Platform Eng
Rule + Assertion Library	Version-controlled definitions of default rules and per-entity assertions, sourced from MDS.	QA Team + Domain Owners
DLQ Inspector	Counts and classifies DLQ records by reason; feeds inventory into findings.	QA Eng
Severity Classifier + Router	Maps each finding to Critical / High / Medium / Informational and routes to the responsible team.	QA Eng
Reporting Service	Generates per-customer, per-entity report; stores in report store and surfaces in dashboard.	QA Eng
Gate Decision API	Called by ingest pipeline / workflow tooling to know whether a step can advance.	QA Eng + Platform Eng
Report Store	Durable, queryable store of every report produced; supports historical comparison.	Platform Eng
`metadata.ingest_run_history` integration	Validation results land alongside the ingest run that produced them; one row, complete context.	Platform Eng
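
As an illustration of the DLQ Inspector's responsibility (count and classify DLQ records by reason), a minimal sketch follows. The `reason` field and the output shape are assumptions; real DLQ records carry more context.

```python
from collections import Counter


def dlq_inventory(records: list) -> dict:
    """Summarize DLQ records by failure reason for inclusion in findings."""
    by_reason = Counter(record.get("reason", "unknown") for record in records)
    return {
        "total": len(records),
        "by_reason": dict(by_reason),
        # A clean DLQ is the precondition for sign-off.
        "clean": len(records) == 0,
    }
```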

6. Core Principles

Automated, not manual. Every check runs in code. Human review interprets reports; it does not perform validation.
Built into the migration pipeline. Validation runs as part of every ingest, not as a separate step that can be skipped.
Per-customer, per-entity reporting. Aggregate results hide entity-specific problems. The per-entity report is the artifact that signs off.
Severity-classified discrepancies. Not every mismatch is a blocker; classification keeps focus on what matters.
Reproducible. Re-running validation on the same dataset produces the same result, which makes the outcome auditable and trustworthy.
The framework itself is verified. Seeded-error tests prove the engine catches what it should before any real customer relies on it.
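
Reproducibility is hardest for the sampled deep-diff, because a random sample changes between runs. One common technique, shown here as an assumption and not as the framework's confirmed method, is to select the sample by hashing each record key together with the rule version.

```python
import hashlib


def in_sample(key: str, rule_version: str, rate: float) -> bool:
    """Deterministically decide whether a record key belongs to the sample."""
    digest = hashlib.sha256(f"{rule_version}:{key}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def sample_keys(keys, rule_version: str, rate: float) -> list:
    """Return the sampled keys in a stable order, independent of input order."""
    return sorted(k for k in keys if in_sample(k, rule_version, rate))
```

Because membership depends only on the key and the rule version, re-running validation on the same dataset diffs exactly the same records, and bumping the rule version deliberately draws a new sample.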

7. Interactions With Other Workstreams

Consumes from	What
WS1 — Configuration Extraction	Customer manifests; defines entities and lookups in scope per refNum.
WS2 — Schema Mapping	Entity declarations + JOLT specs; the contract validation rules apply against.
WS3 — Prod-to-Lower Sync	Lookup data must be present in the lower env (precondition for the FK-integrity check on the root entity). Sync Framework Hop 2 auto-triggers a validation run on completion, passing the chain's `correlation_id`; the resulting report is queryable via `GET /chains/{correlation_id}`.
MDS	Entity declarations, field classifications, and per-entity assertions.
Ingest Pipeline	Run events (including those produced by Hop 2) trigger validation. Validation also runs on cutover ingest and any other invocation of the migration pipeline.

Publishes to	What
Nine-step workflow gates	Gate Decision API outcomes block Steps 7, 8, and 9.
WS3 — Prod-to-Lower Sync	Validation outcome for a Hop 2 run is reflected back into the chain status (`Succeeded` vs `SucceededWithBlockingFindings`).
WS5 — End-to-End Validation	E2E suites read validation status to know whether the dataset they test against is reconciled.
WS8 — Clean-State INTQA	Reports are part of the customer's evidence package.
Security & Compliance	Signed reports go into the audit/evidence trail.

8. Non-Functional Requirements

Category	Requirement
Correctness	Framework itself is verified against seeded errors (Step 5 of the execution plan). False-negative rate measured.
Performance	Default rules tuned for speed. Sampled deep-diff covers the long tail; the target completion time per entity is set by the playbook SLA.
Scale	Designed to handle every customer, every entity, every ingest — including bulk dry-run on Step 8.
Reproducibility	Same dataset + same rule version produces identical findings. Rule version recorded with each result.
Observability	Each rule execution is traceable: which rule version, which dataset version, which run record.
Auditability	Reports are signed; reproducible on demand; traceable to data and rule versions in `metadata.ingest_run_history`.
Cost	Runs proportional to ingest volume; no separate full-scan jobs unless Selective revalidation is explicitly requested.

9. Where Validation Fits in the Nine-Step Workflow

Workflow step	What validation does
Step 3 — Load Lookup Tables	Validates each lookup is 100% reconciled before the root entity ingests.
Step 7 — Test Ingest (Small Batch)	Verifies test docs land correctly; FK links resolve; lookup resolution works; no errors.
Step 8 — Bulk Dry-Run on Staging	Full reconciliation run: row parity, checksums, ref integrity, DLQ inventory, perf within SLA.
Step 9 — Cutover and Hypercare	Daily reconciliation throughout the 2-week hypercare window. Final sign-off requires green reconciliation.

10. ⭐ Risks & Mitigations

Risk	Mitigation
Framework misses real errors.	Step 5 of the execution plan verifies the framework itself with seeded errors before any customer relies on it.
Validation runs too slowly to be practical.	Default rules tuned for speed. Sampled deep-diff covers the long tail. Performance budget per entity defined upfront.
Discrepancies overwhelm reviewers.	Severity classification routes only Critical and High to active review. Medium and Informational logged but not in the active queue.
Entity-specific assertions are not authored.	Assertion authoring is part of entity declaration review. No entity passes declaration sign-off without its assertions.
DLQ records accumulate without resolution.	DLQ inventory is a Step 8 gate. Clean DLQ is a precondition for bulk dry-run sign-off.
Cutover proceeds despite failed validation.	Pipeline enforces gates via the Gate Decision API. Override requires explicit, recorded approval.
Validation results are not trusted by stakeholders.	Reports are signed, reproducible on demand, and traceable to exact rule and data versions.
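
Seeded-error verification, the first mitigation in the table above, can be pictured as a small harness: inject one known error at a time and confirm the matching rule fires. Everything named here (`check`, `SEEDERS`, `PARENT_KEYS`, the row shape) is a hypothetical stand-in for the real engine and its fixtures.

```python
import copy

PARENT_KEYS = {10, 20}  # hypothetical set of valid parent IDs


def check(source, target):
    """Tiny stand-in for the rule engine: returns the set of failed rule names."""
    failed = set()
    if len(source) != len(target):
        failed.add("row_count_parity")
    if any(row["org_id"] not in PARENT_KEYS for row in target):
        failed.add("fk_integrity")
    if any(row["name"] is None for row in target):
        failed.add("required_field_coverage")
    return failed


# Each seeder injects one known error; its key names the rule that must catch it.
SEEDERS = {
    "row_count_parity": lambda rows: rows.pop(),
    "fk_integrity": lambda rows: rows[0].update(org_id=999),
    "required_field_coverage": lambda rows: rows[0].update(name=None),
}


def verify_engine(clean_rows):
    """Return the seeded errors the engine missed (an empty list means verified)."""
    missed = []
    for expected_rule, seed in SEEDERS.items():
        target = copy.deepcopy(clean_rows)
        seed(target)
        if expected_rule not in check(clean_rows, target):
            missed.append(expected_rule)
    return missed
```

The harness also gives a natural place to measure the false-negative rate: it is the fraction of seeders that end up in the `missed` list.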

11. ⭐ Success Criteria

Default reconciliation rules configured per playbook defaults (0% row count tolerance, 0 orphans, ≥99.9% coverage, 0 diffs in sample).
Entity-specific assertions authored for every Phase 1 entity.
Validation framework integrated into the ingest pipeline; results land in metadata.ingest_run_history.
Severity classification and routing rules configured and verified.
Framework verified against seeded errors — every seeded error caught and correctly classified.
Per-customer, per-entity validation reports produced automatically and used as workflow gates.
DLQ records resolved for every Phase 1 entity before workflow advancement.
All four default rules and the entity-specific assertions pass for every Phase 1 customer entity.

12. Glossary Reference

See _shared/glossary.md. Key terms used in this document: MDS, DAAL, JOLT, _lookup, auto-injected FK, refNum, DLQ, Reconciliation rule, Assertion, Severity class.

High-Level Designs

End-to-End Validation

Unified use-case, mocking, and payload-diff framework (WS5+6+7).

HLD — End-to-End Validation Framework

High-Level Design · Workstreams 5 + 6 + 7 of 9 · MongoDB → Aurora PostgreSQL Migration · Companion document: LLD · Sources: MongoDB_to_PostgreSQL_v2_WS5_Use_Case_Validation.docx, WS6_Mocking_Framework.docx, WS7_Payload_Validation.docx

Version	Date	Author	Notes
v0.1	2026-05-18	Migration Architecture	Initial draft unifying WS5 Use Case Validation + WS6 Mocking + WS7 Payload Validation

1. ⭐ Executive Summary

In one sentence: Prove that everything a customer does today still works after migration — workflows, screens, integrations, and the bytes we send to partners — by running automated, customer-specific suites against the migrated system, with mocks for every external partner and byte-level diffs against production baselines.

Owner: QA / Data Validation Team (Use Case + Payload) and Platform Eng (Mocking) | Phase: Phase 1 (proves the framework) and Phase 2 (runs unchanged) | Workflow gates supported: Steps 7, 8, 9 of the nine-step workflow

What we get when it's done

Every Phase 1 customer has a signed, automated use-case suite that proves their business workflows produce the same outcomes on Aurora as on the existing system.
Every external partner has a faithful mock; no INTQA call can ever reach a real partner.
Every outbound integration payload is compared byte-by-byte against a production baseline; blocking differences cannot pass through to cutover.
Performance regressions are caught against measured baselines, not assumed thresholds.
A leadership-visible pass/fail dashboard shows readiness per customer.
Daily smoke tests run throughout the 2-week hypercare window; sign-off requires zero P1/P2 for 7 consecutive days.

2. Why three workstreams in one framework

The three source workstreams answer three different questions, but they only deliver value together. Data Validation (WS4) confirms the data is correct at rest; this framework confirms the system behaves correctly when used.

Question	Answered by	Without the others
"Do customer workflows produce the same end-to-end outcome?"	WS5 — Use Case Validation	Can't run safely against real partners; brittle without payload-level evidence.
"How do we exercise integrations without touching real partners?"	WS6 — Mocking Framework	Has no purpose unless something runs against it.
"Do the bytes we send to partners still match what production sent?"	WS7 — Payload Validation	Can't capture meaningful traffic unless workflows actually execute.

We model them as one framework with three internal sub-systems, sharing one orchestrator, one dashboard, and one evidence package.

3. Scope

In scope	Out of scope
Customer-specific use-case catalogs covering critical journeys, business-process smoke tests, cross-system flows, edge cases, and performance baselines	Application functional unit tests (owned by application teams)
Mock service that stands in for every external partner during INTQA runs	Real-partner contract negotiation; partner-side changes
Network controls that prevent INTQA from reaching real partners	Production network policy (owned by DevOps Platform)
Outbound payload capture from production traffic, anonymized via MDS classifications	Inbound webhook capture from production (covered separately if scoped)
Diff engine and classification rule book per integration	Defect ticketing (uses existing tooling)
Performance baselines measured on the existing system	Capacity planning for Aurora (owned by DBA/Platform)
Daily smoke tests during the 2-week hypercare period	Customer feature roadmap during hypercare
Pass/fail dashboard for leadership and engineering	Long-term operational dashboards post-cutover (owned by DevOps)

4. ⭐ System Context Diagram (C4 L1)

flowchart LR
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef actor fill:#fff3e0,stroke:#ef6c00,color:#e65100
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f

    CustomerTeam([Customer / Tenant Team]):::actor
    QA([QA Team]):::actor
    IntOwner([Integration Owner]):::actor
    Leadership([Leadership]):::actor

    subgraph ProdZone[" Production "]
        ProdSys[Existing MongoDB Platform]:::prod
        ProdPartners[Real External Partners]:::external
    end

    subgraph INTQAZone[" INTQA (migrated) "]
        Migrated[Aurora-backed Platform]:::lower
    end

    E2E["End-to-End<br/>Validation Framework<br/>(Use Cases · Mocks · Payload Diff)"]:::internal

    Baseline[(Parity + Perf<br/>Baselines)]:::internal
    Diff[(Payload Diff<br/>Reports)]:::internal
    Dashboard[Pass/Fail Dashboard]:::internal

    CustomerTeam -- "authors use cases" --> E2E
    IntOwner -- "owns mocks + classification" --> E2E
    QA -- "operates suites" --> E2E

    ProdSys -- "captured baselines" --> Baseline
    ProdSys -. "real traffic for capture only" .-> ProdPartners

    E2E -- "runs scenarios" --> Migrated
    Migrated -. "outbound integration calls" .-> E2E
    E2E -- "reads/writes" --> Baseline
    E2E -- "produces" --> Diff

    Diff --> Dashboard --> Leadership

    style ProdZone stroke-dasharray: 5 5,stroke:#00695c
    style INTQAZone stroke-dasharray: 5 5,stroke:#1565c0

No outbound traffic from INTQA ever reaches a real partner. Capture happens in production for read-only baselining; replay happens in INTQA against the mock layer inside this framework.

5. High-Level Architecture (C4 L2 — Container)

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40

    Migrated["Aurora-backed Platform<br/>(INTQA)"]:::lower
    ProdCap["Production traffic<br/>capture (read-only)"]:::prod

    subgraph E2E[" End-to-End Validation Framework "]
        direction TB
        Orchestrator[Use Case Orchestrator]:::internal
        Catalog[(Use Case Catalog<br/>per refNum)]:::store
        ParityBase[(Parity Baseline<br/>Store)]:::store
        PerfBase[(Performance Baseline<br/>Store)]:::store

        Mock[Mock Service<br/>+ Scenario Manager]:::internal
        ScenStore[(Mock Scenario Catalog)]:::store

        Capture[Outbound Payload<br/>Capture]:::internal
        Diff[Diff Engine<br/>+ Classification Rules]:::internal
        BaselineLib[(Production Payload<br/>Baseline Library)]:::store
        ClassRules[(Classification<br/>Rule Book)]:::store

        Reporter[Reporting + Dashboard API]:::internal
        Dashboard[Pass/Fail Dashboard]:::internal

        Catalog --> Orchestrator
        ParityBase --> Orchestrator
        PerfBase --> Orchestrator

        Orchestrator -- "run scenarios" --> Migrated
        Migrated -. "outbound call" .-> Capture
        Capture --> Mock
        Mock --> Migrated
        ScenStore --> Mock

        Capture --> Diff
        BaselineLib --> Diff
        ClassRules --> Diff

        Orchestrator --> Reporter
        Diff --> Reporter
        Reporter --> Dashboard
    end

    ProdCap --> BaselineLib

Outbound calls from the migrated platform are intercepted by the Capture component, routed to the Mock for a response, and simultaneously diffed against the production baseline. One pipeline produces both functional and payload evidence.
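
The capture, mock, and diff path described above can be sketched as a single handler. The request shape, the per-integration lookup tables, and the function name are assumptions made for illustration; the real components are separate services.

```python
def handle_outbound(request: dict, mock_responses: dict, baselines: dict, captured: list) -> dict:
    """Capture an outbound call, answer it from the mock, and diff it against the baseline."""
    # 1. Capture: record the request exactly as the migrated platform sent it.
    captured.append(request)

    # 2. Mock: the response comes from the mock layer; no real partner is called.
    response = mock_responses[request["integration"]]

    # 3. Diff: compare the body bytes against the production baseline.
    baseline = baselines.get(request["integration"])
    matches = baseline is not None and baseline == request["body"]

    return {"response": response, "payload_matches_baseline": matches}
```

A missing baseline counts as a mismatch here, so an integration that was never baselined cannot pass by default.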

6. Key Components

Component	Sub-system	Responsibility	Owner
Use Case Orchestrator	WS5	Loads catalog, runs scenarios against migrated platform, compares to parity baseline, applies performance gate.	QA Eng
Use Case Catalog	WS5	Customer-specific scenarios — critical journeys, business smoke tests, cross-system flows, edge cases.	Customer Team
Parity Baseline Store	WS5	Captured outputs from existing MongoDB system; reference for "did the migrated system do the same thing?"	QA Eng
Performance Baseline Store	WS5	Measured response times and throughput on the existing system per scenario.	QA Eng + DevOps
Mock Service + Scenario Manager	WS6	Stands in for every external partner; switches scenarios per test run (happy path / errors / timeouts / etc.).	Platform Eng
Mock Scenario Catalog	WS6	Versioned scenarios per integration; happy path, validation error, auth failure, timeout, rate limit, malformed, partial, 5xx.	Platform Eng + Integration Owners
Outbound Payload Capture	WS7	Intercepts outbound integration calls during INTQA runs; logs request and response.	Platform Eng
Diff Engine + Classification Rules	WS7	Compares captured payloads to production baselines byte-by-byte; classifies differences.	Platform Eng + QA Eng
Production Payload Baseline Library	WS7	Anonymized request/response pairs from production interactions per integration per refNum.	Platform Eng + Security
Classification Rule Book	WS7	Version-controlled rules for Expected / Tolerable / Notable / Blocking.	Integration Owners + Security
Reporting + Dashboard API	shared	Aggregates results from all three sub-systems into one customer-readiness view.	QA Eng
Pass/Fail Dashboard	shared	Leadership-visible readiness view per customer; drilldown by scenario and integration.	QA Eng + Platform Eng
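
Scenario switching in the Mock Service can be illustrated with a small class. The scenario names follow the Mock Scenario Catalog row above; the class name, status codes, and response bodies are illustrative assumptions, not the mock's actual behaviour.

```python
class MockPartner:
    """Stand-in for one external partner with switchable scenarios."""

    # Illustrative canned responses: (HTTP status, body).
    SCENARIOS = {
        "happy_path": (200, {"status": "ok"}),
        "validation_error": (422, {"error": "invalid payload"}),
        "auth_failure": (401, {"error": "unauthorized"}),
        "rate_limit": (429, {"error": "too many requests"}),
        "server_error": (503, {"error": "unavailable"}),
    }

    def __init__(self):
        self.active = "happy_path"
        self.log = []

    def use(self, scenario: str) -> None:
        """Switch the scenario for subsequent requests in this test run."""
        if scenario not in self.SCENARIOS:
            raise ValueError(f"unknown scenario: {scenario}")
        self.active = scenario

    def handle(self, request: dict):
        """Answer a request from the active scenario and log the exchange."""
        status, body = self.SCENARIOS[self.active]
        self.log.append({"request": request, "scenario": self.active, "status": status})
        return status, body
```

Logging every exchange inside the mock supports the observability requirement that each request received and each response returned is recorded.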

7. Core Principles

Customer-aware (per refNum). Each customer's test suite runs against their specific configuration; the same workflow can produce different correct results for different customers.
Automated and repeatable. Every scenario runs in code. Re-running produces identical setup and predictable results.
Pass/fail parity, not "looks fine". Each scenario runs on both the existing and migrated systems; outcomes must match. Parity is the gate.
Owned by the tenant teams. The team that owns the customer relationship owns the catalog and is accountable for completeness.
Visible to leadership. Test results are not buried in engineering tools; a clear dashboard surfaces status to anyone who needs it.
Zero external impact during validation. No INTQA call reaches a real partner; this is enforced by network-level controls, not an honor system.
Byte-level comparison, judgment-level classification. The diff engine looks at every byte but does not panic at every difference; classification rules separate signal from noise.
Production behavior is the baseline. Not documentation; not spec — what production actually does.
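
The principle of byte-level comparison with judgment-level classification might look like the sketch below. It assumes JSON payloads with a top-level object and a rule book keyed by field name; both are simplifications, and unlisted fields default to Blocking as a conservative choice made for this example.

```python
import json

SEVERITY_ORDER = ["Expected", "Tolerable", "Notable", "Blocking"]


def classify_diff(baseline: bytes, captured: bytes, rule_book: dict) -> dict:
    """Compare payloads byte-for-byte, then classify each differing field."""
    if baseline == captured:
        return {"identical": True, "worst": None, "fields": {}}

    base, cap = json.loads(baseline), json.loads(captured)
    fields = {}
    for name in sorted(set(base) | set(cap)):
        if base.get(name) != cap.get(name):
            # Fields missing from the rule book are treated as Blocking.
            fields[name] = rule_book.get(name, "Blocking")

    worst = max(fields.values(), key=SEVERITY_ORDER.index, default=None)
    return {"identical": False, "worst": worst, "fields": fields}
```

Payloads that differ only in whitespace or key order come back as not identical but with no classified fields, which leaves the decision about formatting-only drift to the rule book owners.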

8. Interactions With Other Workstreams

Consumes from	What
WS1 — Configuration Extraction	Customer manifests; defines integrations and scenarios per refNum.
WS2 — Schema Mapping	Entity definitions; needed for parity comparison logic.
WS3 — Prod-to-Lower Sync	The Postgres Lower environment that E2E suites run against is populated by Sync Framework Hop 2 (Mongo Lower → Postgres Lower via the migration ingest pipeline). Hop 1 keeps the upstream Mongo Lower production-shaped. E2E suite runs may optionally pin to a specific chain `correlation_id` for reproducible test conditions.
WS4 — Data Validation	E2E suites only run against datasets that have passed data validation. When a suite runs against a chain-refreshed environment, the matching validation report is reachable via the chain's `correlation_id`.
MDS	Field classifications drive anonymization of payload captures.

Publishes to	What
Nine-step workflow gates	Step 8 sign-off requires green E2E + green payload diff. Step 9 requires that hypercare smoke tests stay green for 7 consecutive days.
WS8 — Clean-State INTQA	Suite results and signed payload reports go into customer evidence package.
Integration Owners	Signed payload validation report per integration per customer.
Customer Team	Signed parity report per customer; basis for formal sign-off.

9. Non-Functional Requirements

Category	Requirement
Security	All payload captures anonymized using MDS classifications. Baseline data carries same controls as production.
Network isolation	INTQA outbound to real partners blocked at the network layer; framework verifies and alerts on unexpected destinations.
Reliability	Mocks deployed as a real service with health checks, scenario switching, and observability.
Performance	Suites complete inside the validation window allocated by the migration timeline. Mocks respond with realistic latency profiles.
Reproducibility	Same suite + same baseline + same scenarios produce the same pass/fail outcome. Versions recorded per run.
Observability	Every request the mock receives and every response it returns is logged. Diff outcomes recorded with full context.
Auditability	Signed reports per customer per integration; reproducible on demand.
Scale	Designed to support every Phase 1 customer in parallel; scales to Phase 2 buckets without rework.
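
The network-isolation requirement says the framework verifies and alerts on unexpected destinations. A minimal application-level check is sketched below; it complements the network-layer block and does not replace it. The host names and function names are hypothetical.

```python
from urllib.parse import urlsplit


def check_destination(url: str, allowed_hosts: set) -> bool:
    """Return True only when the URL's host is an approved mock endpoint."""
    host = urlsplit(url).hostname
    return host is not None and host in allowed_hosts


def audit_outbound(urls, allowed_hosts: set) -> list:
    """Return the observed URLs that should raise an alert."""
    return [url for url in urls if not check_destination(url, allowed_hosts)]
```

An allowlist is used instead of a blocklist so that a partner endpoint nobody catalogued still triggers an alert.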

10. ⭐ Risks & Mitigations

Risk	Mitigation
Use-case catalog misses critical scenarios.	Customer team owns the catalog and is accountable for completeness; reviews are structured, not perfunctory.
Tests pass but the system still misbehaves post-cutover.	Hypercare smoke tests run daily; 7-consecutive-day clean window required for sign-off.
Tests are too brittle, failing on cosmetic differences.	Comparisons focus on outcomes and key state, not byte-equal output. Realistic tolerances defined per scenario type.
Performance baselines are unrealistic.	Baselines measured on the actual existing system, not assumed. Variance thresholds aligned to playbook SLA.
An integration is missing from the catalog.	WS6 Step 1 cross-checks application input against customer manifests (WS1). Discrepancies investigated, not assumed away.
A path from INTQA to a real partner is missed.	Network-level verification at WS6 Step 5. Outbound traffic monitored — unexpected destinations trigger alerts.
Mock behavior drifts from real partner behavior.	Mocks derived from real production captures; refreshed when partner contracts change; drift monitored.
Baseline payloads go stale as production evolves.	Baselines refreshed on a defined cadence. Partner contract changes trigger immediate refresh.
Classification rules let real issues through as "expected".	Rules reviewed by integration owners; periodic spot checks confirm classification is conservative.
Failures pile up faster than fixes.	Severity classification ensures Critical/Blocking are prioritized; lower-priority issues queued, not lost.
Sign-off becomes a rubber stamp.	Customer team records the specific checks performed; the artifact is auditable.
Hypercare ends prematurely.	Exit criteria are explicit: zero P1/P2 for 7 consecutive days. Cannot be waived without senior approval.
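
For the performance-baseline risk above, a gate that compares measured latency with a measured baseline could be sketched as follows. Using the median and a single relative threshold is an assumption for this example; the actual variance thresholds are aligned to the playbook SLA.

```python
from statistics import median


def performance_gate(baseline_samples_ms, measured_samples_ms, max_regression: float) -> dict:
    """Compare median latency on the migrated system with the measured baseline."""
    baseline = median(baseline_samples_ms)
    measured = median(measured_samples_ms)
    # Relative slowdown; negative values mean the migrated system is faster.
    regression = (measured - baseline) / baseline
    return {
        "baseline_ms": baseline,
        "measured_ms": measured,
        "regression": regression,
        "passed": regression <= max_regression,
    }
```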

11. ⭐ Success Criteria

Use-case catalog complete and signed off by each Phase 1 customer team.
Business-process smoke tests defined per entity (playbook §4.4).
Parity baseline captured from the existing system for every scenario in scope.
Automated test suites built, customer-aware, integrated into the validation pipeline.
Performance baselines defined with agreed variance thresholds.
Complete integration catalog signed off by all application teams.
Mocking service deployed; mock implementation complete for every integration; scenario coverage verified.
INTQA network routing verified — no path from INTQA to real partner systems exists.
Production payload baseline captured per integration per customer.
Classification rule book signed off by integration owners.
Diff engine verified against known-different payload pairs.
Zero unresolved Blocking payload differences for any customer before cutover.
Pass/fail dashboard live and used by leadership and engineering.
Hypercare completed for each customer with zero P1/P2 for 7 consecutive days.
Customer sign-off recorded for each Phase 1 customer.
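
The hypercare exit criterion (zero P1/P2 for 7 consecutive days) is simple enough to express directly. In this sketch, `incident_days` is an assumed input: the set of calendar dates on which at least one P1 or P2 incident was open for the customer.

```python
from datetime import date, timedelta


def hypercare_exit_met(incident_days: set, today: date, clean_days: int = 7) -> bool:
    """True when none of the last `clean_days` days, ending today, had a P1/P2."""
    window = {today - timedelta(days=offset) for offset in range(clean_days)}
    return not (window & incident_days)
```

The sketch checks only the clean window; a real implementation would also confirm that the 2-week hypercare period itself has elapsed.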

12. Glossary Reference

See _shared/glossary.md. Key terms used in this document: refNum, DAAL, MDS, Use case, Parity baseline, Performance baseline, Scenario, Capture, Diff classification, Hypercare.

Low-Level Designs

Prod-to-Lower Sync

Sync service components, data model, workflows, APIs, deployment.

LLD — Prod-to-Lower Environment Sync Framework

Low-Level Design · Workstream 3 of 9 · MongoDB → Aurora PostgreSQL Migration · Companion document: HLD · Source: metadata-documents/MongoDB_to_PostgreSQL_v2_WS3_Sync_Framework.docx

Version	Date	Author	Notes
v0.1	2026-05-18	Migration Architecture	Initial draft from WS3 execution plan
v0.2	2026-05-18	Migration Architecture	Chained sync (Hop 1 + Hop 2) woven through all sections — components, data model, workflows, deployment

1. ⭐ Executive Summary

In one sentence: A control-plane service that owns two hops — Hop 1 reads production replicas, applies MDS-driven masking, and writes per-refNum to Mongo Lower in one of three modes (Full / Delta / Selective); Hop 2 is an operator-triggered, snapshotted, pre-flight-gated invocation of the existing migration ingest pipeline (source=mongo-lower-{env}) that re-populates tenant_{refNum} in Postgres Lower and auto-triggers the Data Validation Framework.

If you only read one section, read this.

What we build:
- Hop 1: Sync API + Mode Router + Production Reader (replica-only, throttled) + Masking Engine (MDS-driven) + Snapshot Manager + Target Writer.
- Hop 2 surface: Pre-flight Freshness Validator + Snapshot Manager (Postgres Lower) + Migration Pipeline Adapter + Validation Webhook.
- Shared control plane: Sync API/Orchestrator + Correlation ID Tagger + Approval Workflow + Audit Log.
Key design decisions:
- Hop 2 reuses the cutover migration pipeline — no parallel ingestion code path.
- Hop 2 is manual, Full-only, snapshotted, and pre-flight-gated against Mongo Lower freshness.
- Pre-flight failure returns HTTP 409; override requires senior-approver and is page-worthy for security.
- Correlation ID is the join key across Hop 1 → Hop 2 → ingest → validation. Persisted on SYNC_RUN.correlation_id (Sync-owned), INGEST_RUN.correlation_id (migration pipeline / DV-owned, physical table metadata.ingest_run_history), and VALIDATION_RUN.correlation_id (DV-owned).
- Masking is enforced inside the framework on Hop 1; Hop 2 inherits the already-masked data and never touches production.
- Snapshot before every write is non-negotiable on both hops.
Compliance posture: No PII in lower environments. Per-customer refNum scoping makes cross-tenant contamination structurally impossible.
Open items at v0.2: per-env default for pre-flight staleness window; replica-lag thresholds per environment; snapshot retention policy per environment; SIEM mirror destination.

2. Reference Architecture (recap from HLD)

flowchart LR
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c

    Prod[(Production<br/>read replica)]:::prod --> Hop1[Sync Framework<br/>Hop 1]:::internal --> MLower[(Mongo Lower)]:::lower
    MDS[MDS]:::internal --> Hop1
    Op([Operator]) --> Hop2[Sync Framework<br/>Hop 2 Surface]:::internal
    Hop2 --> Pre[Pre-flight Check]:::gate --> MigPipe[Migration Ingest<br/>Pipeline]:::external
    MLower --> MigPipe --> PLower[(Postgres Lower)]:::lower
    Hop2 -- "auto-trigger" --> Val[Data Validation]:::external
    Hop1 --> Audit[(Audit Log)]:::internal
    Hop2 --> Audit

Full diagrams are in HLD §3 and §4. This LLD elaborates each container.

3. Component Designs (C4 L3)

3.1 Sync API / Orchestrator (shared control plane)

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92

    REST[REST API<br/>POST /syncs · GET /chains/...]:::internal --> Validator[Request Validator]:::internal
    Validator --> Authz[AuthZ<br/>scope = refNum + mode + chain_hop]:::internal
    Authz --> Tagger[Correlation ID Tagger<br/>Hop 2 only]:::internal
    Tagger --> Approval[Approval Gate<br/>Hop 1 Full/Selective + every Hop 2]:::internal
    Approval --> Enqueue[Run Enqueuer]:::internal
    Enqueue --> Queue[(Run Queue)]:::store
    Queue --> Dispatcher[Dispatcher<br/>per-refNum, per-target mutex]:::internal
    Dispatcher --> RouteHop{chain_hop?}
    RouteHop -- "1" --> Router1[Mode Router]:::internal
    RouteHop -- "2" --> Preflight[Pre-flight Validator]:::internal

Module	Responsibility
REST API	`POST /syncs` (create run, Hop 1 or Hop 2), `GET /syncs/{id}`, `POST /syncs/{id}/approve`, `POST /syncs/{id}/restore`, `GET /chains/{correlation_id}`, `POST /chains/{id}/override-preflight`. Operator UI + scheduler use the same API.
Request Validator	Schema-validates inputs; rejects malformed requests early. Validates `chain_hop ∈ {1, 2}` and required fields per hop.
AuthZ	Verifies the caller may sync this `refNum` in this mode / hop. Roles: `sync-operator`, `sync-approver`, `senior-approver` (for pre-flight overrides), `auditor`.
Correlation ID Tagger	On Hop 2 trigger only: issues `chain-{refNum}-{yyyymmdd}-{shortid}` before approval. Stamps `SYNC_RUN.correlation_id` and propagates to the migration pipeline.
Approval Gate	Requires a second-party approver for Hop 1 Full/Selective and every Hop 2 run. Approver must be a distinct identity from the requester (enforced).
Run Enqueuer / Queue / Dispatcher	Decouples API from execution; runs are durable. Dispatcher enforces a per-`(refNum, target_env, chain_hop)` mutex — no concurrent runs on the same target.

3.2 Mode Router (Hop 1)

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121

    In[Hop 1 run from Queue]:::internal --> Decide{Mode?}
    Decide -- Full --> Full[Full-Sync Handler]:::internal
    Decide -- Delta --> Delta[Delta-Sync Handler]:::internal
    Decide -- Selective --> Sel[Selective-Sync Handler]:::internal

    Full --> Plan[Scope Planner<br/>categories + entities]:::internal
    Delta --> Plan
    Sel --> Plan
    Plan --> Reader[Production Reader]:::internal

Handler	Behavior
Full-Sync Handler	Replaces target operational settings entirely with production state for the requested refNum. Reads from a consistent snapshot point.
Delta-Sync Handler	Consumes CDC change stream from the last resume token; applies events in order. If token is missing, fails closed and queues a Full Sync (recorded in audit).
Selective-Sync Handler	Operator specifies categories (operational / lookup / transactional) and optionally specific artifact ids. Same execution path as Full, narrower scope.
Scope Planner	Translates the mode + parameters into a concrete read plan (collections, filters, time bounds).
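
The Delta-Sync Handler's fail-closed rule can be sketched as a small decision function. This is a minimal illustration, not the framework's actual handler: `DeltaDecision` and `plan_delta_run` are hypothetical names, and the token store is reduced to an optional `bytes` value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeltaDecision:
    action: str       # "apply_delta" or "queue_full_sync"
    audit_note: str   # text to record in the audit log


def plan_delta_run(ref_num: str, resume_token: Optional[bytes]) -> DeltaDecision:
    """Decide how a Delta run proceeds for one refNum (hypothetical helper)."""
    if not resume_token:
        # Fail closed: never guess a starting point in the change stream.
        return DeltaDecision(
            action="queue_full_sync",
            audit_note=f"resume token missing for {ref_num}; Delta failed closed",
        )
    return DeltaDecision(
        action="apply_delta",
        audit_note=f"resuming change stream for {ref_num} from stored token",
    )
```

Because Full runs pass through the Approval Gate, a queued fallback Full Sync would presumably wait in AwaitingApproval rather than start on its own.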

3.3 Production Reader (Hop 1)

flowchart LR
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121

    Plan[Read Plan]:::internal --> Reader[Reader Pool]:::internal
    Reader --> Throttle[Replica Lag Monitor<br/>+ Throttle]:::internal
    Throttle --> Replica[(Prod Read Replica)]:::prod
    Replica --> Throttle --> Buffer[Bounded Buffer]:::internal
    Buffer --> Next[→ Masking]:::internal

Reads only from a designated read replica; no direct primary access. Enforced by connection string + IAM.
The replica-lag monitor compares the replica heartbeat to the primary; if lag exceeds the configured threshold (read.replica.lag.threshold_ms, default 2000 ms), the reader pauses and resumes when the replica catches up.
A bounded buffer downstream prevents memory blowup if Masking or Writer is slower than Reader.
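
The throttle and the bounded buffer described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `throttled_read`, `pump`, the `get_lag_ms` callback, and the polling interval are hypothetical, and a standard-library `queue.Queue` stands in for the real buffer.

```python
import queue
import time
from typing import Callable, Iterable, Iterator


def throttled_read(
    records: Iterable[dict],
    get_lag_ms: Callable[[], int],
    threshold_ms: int = 2000,              # mirrors read.replica.lag.threshold_ms
    sleep: Callable[[float], None] = time.sleep,
    poll_seconds: float = 1.0,             # assumed polling interval
) -> Iterator[dict]:
    """Yield records, pausing while replica lag is above the threshold."""
    for record in records:
        while get_lag_ms() > threshold_ms:
            sleep(poll_seconds)            # reader waits for the replica to catch up
        yield record


def pump(source: Iterable[dict], buffer: "queue.Queue") -> None:
    """Push records into a bounded buffer; put() blocks while it is full."""
    for record in source:
        buffer.put(record)
```

Creating the buffer with `queue.Queue(maxsize=N)` is what provides backpressure: a slow Masking or Writer stage leaves the queue full, which blocks `put()` and in turn stops the reader from pulling more from the replica.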

3.4 Masking Engine (Hop 1)

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c

    In[Record from Reader]:::internal --> Lookup[MDS Classification Lookup]:::internal
    MDS[MDS]:::external --> Lookup
    Lookup --> Apply[Field-by-Field Masker]:::internal
    Apply --> Check[Policy Check<br/>fail-closed if classification missing]:::gate
    Check --> Out[→ Snapshot Manager]:::internal

Classifications are cached locally with a short TTL (mask.cache.ttl_seconds, default 300 s); a cache miss forces an MDS round trip.
Fail-closed: if a field's classification cannot be resolved, the record is rejected and the run fails. No "default to public".
Masking strategies per classification: PII → deterministic pseudonymization (so referential integrity holds across rows), sensitive → null or fixed token, public → pass-through.
Strategy is centrally defined per classification; not per call site.
Hop 2 does not re-mask. Mongo Lower is already masked; Hop 2 just transforms the masked data into the Postgres shape.
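
A minimal sketch of the per-classification strategies and the fail-closed rule follows. The document does not name the pseudonymization algorithm; HMAC-SHA256 with a per-environment key is an assumption made here for illustration, as are the names `mask_record`, `pseudonymize`, and `MissingClassification` and the `pii_` prefix.

```python
import hashlib
import hmac
from typing import Any, Dict


class MissingClassification(Exception):
    """Raised when a field has no usable classification (fail-closed)."""


def pseudonymize(value: str, env_key: bytes) -> str:
    # Deterministic: the same value and key always give the same token,
    # so references between rows still line up after masking.
    digest = hmac.new(env_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "pii_" + digest[:16]


def mask_record(
    record: Dict[str, Any], classifications: Dict[str, str], env_key: bytes
) -> Dict[str, Any]:
    masked: Dict[str, Any] = {}
    for field, value in record.items():
        cls = classifications.get(field)
        if cls == "PII":
            masked[field] = pseudonymize(str(value), env_key)
        elif cls == "sensitive":
            masked[field] = None           # null strategy; a fixed token also fits
        elif cls == "public":
            masked[field] = value          # pass-through
        else:
            # Missing or unknown classification: reject, never default to public.
            raise MissingClassification(field)
    return masked
```

Keeping the strategy in one function keyed by classification reflects the "centrally defined, not per call site" rule above.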

3.5 Snapshot Manager + Target Writer (Hop 1)

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1

    In[Masked record stream]:::internal --> Snap[Snapshot Manager]:::internal
    Snap --> SnapStore[(Snapshot Store<br/>per refNum, retained N days)]:::store
    Snap --> TxBatch[Transactional Batcher]:::internal
    TxBatch --> Writer[Target Writer<br/>tenant_refNum scope]:::internal
    Writer --> Lower[(Mongo Lower<br/>tenant_refNum)]:::lower
    Writer --> Verify[Post-Write Verifier<br/>row counts + spot diff]:::internal
    Verify --> Audit[(Audit Log)]:::store

Snapshot is taken before any write to the target schema. Snapshot ID is recorded against the run.
Target Writer scopes every statement to tenant_{refNum} — never writes outside the requested schema; enforced by a session-level role.
Post-Write Verifier runs lightweight reconciliation (counts, spot-check) and records the result in the audit entry. Heavier validation is delegated to the Data Validation Framework.

3.6 Hop 2 Surface

The Hop 2 surface is what makes the chain real. It is small — four components — because Hop 2 reuses the migration ingest pipeline rather than re-implementing it.

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c

    In[Hop 2 run with correlation_id<br/>from Tagger]:::internal --> Preflight[Pre-flight Freshness<br/>Validator]:::gate
    Preflight -- "fail (409)" --> Override[Override Path<br/>senior-approver]:::gate
    Preflight -- "pass" --> Apv[Approval Gate]:::internal
    Override --> Apv
    Apv --> Snap["Snapshot Manager<br/>(Postgres Lower)"]:::internal
    Snap --> SnapStore[(Snapshot Store)]:::store
    Snap --> Adapter[Migration Pipeline Adapter]:::internal
    Adapter -- "source=mongo-lower-{env}<br/>target=tenant_refNum (Postgres)<br/>mode=Full<br/>correlation_id=..." --> MigPipe[Migration Ingest Pipeline<br/>JOLT → DAAL]:::external
    MLower[(Mongo Lower)]:::lower --> MigPipe
    MigPipe --> PLower[(Postgres Lower)]:::lower
    MigPipe -- "ingest_run_id" --> Adapter
    Adapter --> Hook[Validation Webhook]:::internal
    Hook -- "trigger with correlation_id" --> Val[Data Validation<br/>Framework]:::external
    Adapter --> Audit[(Audit Log)]:::store
    Preflight --> Audit

Module	Responsibility
Pre-flight Freshness Validator	Verifies `last_successful_hop1_run.completed_at >= now() - chain.preflight.max_staleness_minutes` for `(refNum, target_env)`. Pass → continue to Approval Gate. Fail → HTTP 409 with `last_hop1_run_id`, `last_hop1_completed_at`, and the threshold. No state mutation.
Override Path	`POST /chains/{id}/override-preflight` accepts a free-text rationale. Requires `senior-approver` role. Override event is recorded in the audit log and emits an alert page for security review. Override does not skip the Approval Gate.
Snapshot Manager (Postgres Lower)	Takes a schema-level snapshot of `tenant_{refNum}` in Postgres Lower before the migration pipeline truncates and re-ingests. Snapshot ID is recorded on `SYNC_RUN`. Restore is a single operator action.
Migration Pipeline Adapter	Invokes the existing migration ingest pipeline binary with parameters `source=mongo-lower-{env}`, `target=tenant_{refNum}`, `mode=Full`, and the correlation ID. Returns `ingest_run_id`, DLQ counts, and durations. Does not re-implement JOLT or DAAL.
Validation Webhook	On successful ingest completion, invokes the Data Validation Framework's `POST /validation/run` with `correlation_id`, `refNum`, `ingest_run_id`, `trigger_source=hop2`, and a `callback_url` pointing at `POST /chains/{correlation_id}/validation-complete`. DV returns `202` with a `validation_run_id` (persisted as `linked_validation_run_id`); the validation runs asynchronously. When DV finishes it calls the callback URL with `outcome ∈ {pass, fail_blocking, fail_nonblocking}`, at which point Sync updates the chain status and `CHAIN_RUN_VIEW` reflects the validation results. Both the trigger and the callback handler are idempotent on `(correlation_id, validation_run_id)`.

Pre-flight check semantics (table form):

Aspect	Detail
Definition	`last_successful_hop1_run.completed_at >= now() - chain.preflight.max_staleness_minutes` for this `(refNum, target_env)`.
Default window	`1440` minutes (24 h), configurable per environment.
Pass response	Proceed to Approval Gate.
Fail response	HTTP 409 with `last_hop1_run_id`, `last_hop1_completed_at`, configured threshold. No mutation.
Override	`senior-approver` calls `POST /chains/{id}/override-preflight` with a documented rationale. Override is page-worthy for security. Does not skip the Approval Gate.
Implementation note	Backed by an indexed query against `SYNC_RUN` (`chain_hop=1`, `status=Succeeded`, latest per `(refNum, target_env)`). No new datastore required.
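
The freshness rule in the table can be expressed as a pure function. This is a sketch, assuming the caller has already fetched the latest successful Hop 1 run for the `(refNum, target_env)` pair; `preflight_check` and the response field `max_staleness_minutes` are hypothetical names, and treating "no successful Hop 1 yet" as a failure is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def preflight_check(
    last_hop1_run_id: Optional[str],
    last_hop1_completed_at: Optional[datetime],
    now: datetime,
    max_staleness_minutes: int = 1440,   # chain.preflight.max_staleness_minutes
) -> dict:
    """Return the pre-flight outcome; performs no state mutation."""
    threshold = now - timedelta(minutes=max_staleness_minutes)
    if last_hop1_completed_at is not None and last_hop1_completed_at >= threshold:
        return {"preflight_result": "pass"}
    return {
        "preflight_result": "fail",
        "http_status": 409,
        "last_hop1_run_id": last_hop1_run_id,
        "last_hop1_completed_at": (
            last_hop1_completed_at.isoformat() if last_hop1_completed_at else None
        ),
        "max_staleness_minutes": max_staleness_minutes,
    }
```

Passing `now` in explicitly keeps the check deterministic in tests and makes the comparison point auditable.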

Correlation ID model (table form):

Aspect	Detail
Format	`chain-{refNum}-{yyyymmdd}-{shortid}` (e.g., `chain-ABC123-20260518-7f3a`).
Generation	Issued by the Correlation ID Tagger at Hop 2 trigger time, before approval.
Propagation	Persisted on `SYNC_RUN.correlation_id` (Sync) and stamped onto the migration pipeline's `INGEST_RUN` record (physical table `metadata.ingest_run_history`). `VALIDATION_RUN.correlation_id` is set by DV when `POST /validation/run` is called with `trigger_source=hop2`.
Linkage to Hop 1	The most recent successful Hop 1 at trigger time is stamped as `linked_hop1_run_id` on the Hop 2 row.
Query surface	`GET /chains/{correlation_id}` — assembles Hop 1, Hop 2, ingest, and validation summaries via `CHAIN_RUN_VIEW`.
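
Generating and parsing an ID in the `chain-{refNum}-{yyyymmdd}-{shortid}` format might look like the sketch below. The length and alphabet of `shortid` (four lowercase hex characters, matching the `7f3a` example) and the alphanumeric refNum pattern are assumptions; `new_correlation_id` and `CHAIN_ID` are hypothetical names.

```python
import re
import secrets
from datetime import date

# Assumed shape: alphanumeric refNum, 8-digit date, 4 hex characters.
CHAIN_ID = re.compile(
    r"^chain-(?P<ref_num>[A-Za-z0-9]+)-(?P<day>\d{8})-(?P<short_id>[0-9a-f]{4})$"
)


def new_correlation_id(ref_num: str, today: date) -> str:
    """Issue a chain correlation ID at Hop 2 trigger time."""
    short_id = secrets.token_hex(2)      # 2 random bytes -> 4 hex characters
    return f"chain-{ref_num}-{today:%Y%m%d}-{short_id}"
```

A four-character suffix gives only 65,536 values per refNum per day, so a real implementation would likely rely on a uniqueness constraint on `SYNC_RUN.correlation_id` and retry on collision.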

Concurrency model:

Per (refNum, target_env, chain_hop): at most one non-terminal run. Existing Hop 1 per-refNum mutex extended to include (target_env, chain_hop).
Across refNums: parallel Hop 2 runs allowed up to worker-pool size. Writes isolated to each customer's tenant_{refNum} schema, so concurrency is safe.
Hop 1 ↔ Hop 2 interlock: Hop 2 will not dispatch while a Hop 1 is Running for the same (refNum, target_env). Pre-flight surfaces this as a clear error.
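
The mutex and interlock rules can be sketched with an in-memory registry. This is illustrative only: `RunRegistry` is a hypothetical name, and a real Dispatcher would presumably enforce the same rules against the durable `SYNC_RUN` table rather than process memory.

```python
import threading
from typing import Dict, Tuple

Key = Tuple[str, str, int]   # (refNum, target_env, chain_hop)


class RunRegistry:
    def __init__(self) -> None:
        self._active: Dict[Key, str] = {}
        self._lock = threading.Lock()

    def try_dispatch(self, run_id: str, ref_num: str, target_env: str, chain_hop: int) -> bool:
        """Return True if the run may start, False if it must wait."""
        with self._lock:
            key = (ref_num, target_env, chain_hop)
            if key in self._active:
                return False             # one non-terminal run per key
            if chain_hop == 2 and (ref_num, target_env, 1) in self._active:
                return False             # interlock: Hop 1 still running
            self._active[key] = run_id
            return True

    def finish(self, ref_num: str, target_env: str, chain_hop: int) -> None:
        """Release the key once the run reaches a terminal state."""
        with self._lock:
            self._active.pop((ref_num, target_env, chain_hop), None)
```

Different refNums never share a key, which is why parallel Hop 2 runs across customers need no extra coordination here.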

4. Data Model

The diagram below shows entities owned by the Sync Framework, plus the external entities it joins against to assemble the chain view. The external entities are defined authoritatively in the Data Validation LLD (§4) and the migration pipeline; Sync only depends on the columns shown here.

erDiagram
    SYNC_RUN ||--o{ SYNC_RUN_ITEM : contains
    SYNC_RUN ||--|| SNAPSHOT : "captures before run"
    SYNC_RUN ||--o{ AUDIT_ENTRY : "emits"
    SYNC_RUN }o--o| CDC_RESUME_TOKEN : "Delta mode reads"
    SYNC_RUN }o--o| APPROVAL : "Full/Selective + Hop 2 require"
    SYNC_RUN }o--o| SYNC_RUN : "Hop 2 links to Hop 1 (linked_hop1_run_id)"
    SYNC_RUN }o--o| INGEST_RUN : "Hop 2 links to ingest (linked_ingest_run_id)"
    INGEST_RUN ||--o{ VALIDATION_RUN : "validated by (DV-owned)"
    VALIDATION_RUN ||--|| REPORT : "produces (DV-owned)"
    MDS_CLASSIFICATION_CACHE ||--o{ SYNC_RUN_ITEM : "masking applied"

    SYNC_RUN {
        uuid id PK
        text refNum
        text target_env
        smallint chain_hop "1 or 2"
        text mode "Full|Delta|Selective"
        text status "Queued|AwaitingPreflight|AwaitingApproval|PreflightFailed|Running|Succeeded|SucceededWithBlockingFindings|Failed|RolledBack|Rejected"
        text correlation_id "Hop 2 only"
        uuid linked_hop1_run_id FK "Hop 2 only"
        uuid linked_ingest_run_id FK "Hop 2 only"
        uuid linked_validation_run_id FK "Hop 2 only; populated by webhook"
        text preflight_result "Hop 2 only: pass|fail|overridden"
        text preflight_override_by "Hop 2 only"
        text preflight_override_rationale "Hop 2 only; redacted at retention"
        text requested_by
        text approved_by
        uuid snapshot_id FK
        uuid resume_token_id FK "Hop 1 Delta only"
        timestamp started_at
        timestamp completed_at
        jsonb scope "categories + filters"
    }
    SYNC_RUN_ITEM {
        uuid id PK
        uuid run_id FK
        text category "operational|lookup|transactional"
        text artifact_type
        int records_in
        int records_written
        int records_skipped
        jsonb errors
    }
    SNAPSHOT {
        uuid id PK
        text refNum
        text target_env
        text target_tech "mongo|postgres"
        timestamp taken_at
        text storage_uri
        text retention_class
    }
    CDC_RESUME_TOKEN {
        uuid id PK
        text refNum
        text source_collection
        bytea token
        timestamp updated_at
    }
    APPROVAL {
        uuid id PK
        uuid run_id FK
        text approver
        text justification
        timestamp approved_at
    }
    AUDIT_ENTRY {
        uuid id PK
        uuid run_id FK
        text correlation_id "Hop 2 only"
        timestamp at
        text actor
        text action
        jsonb details
    }
    INGEST_RUN {
        uuid id PK "external: defined in DV LLD §4"
        text refNum
        text correlation_id "stamped by Hop 2"
        timestamp started_at
        timestamp completed_at
    }
    VALIDATION_RUN {
        uuid id PK "external: defined in DV LLD §4"
        uuid ingest_run_id FK
        text correlation_id
        text trigger_source "ingest|hop2|manual"
        text status "Queued|Running|Succeeded|Failed"
    }
    REPORT {
        uuid id PK "external: defined in DV LLD §4"
        uuid validation_run_id FK
        text refNum
        text entity
        text content_hash
    }
    MDS_CLASSIFICATION_CACHE {
        text field_path PK
        text classification "PII|sensitive|public"
        timestamp fetched_at
    }

Notes on chain columns added in v0.2:

linked_validation_run_id is populated by the Validation Webhook callback (see §5.4). Until the callback arrives, the column is NULL and chain status is Succeeded (Hop 2 ingest complete, validation pending) or Running (validation in flight).
preflight_override_rationale is captured as a structured column so auditors can query overrides without scanning AUDIT_ENTRY.details. The same rationale is also written to AUDIT_ENTRY for the override event.

Chain end-to-end view (CHAIN_RUN_VIEW):

The view joins one Hop 2 row to its Hop 1 predecessor (via linked_hop1_run_id), the migration pipeline's INGEST_RUN (via linked_ingest_run_id), and the Data Validation Framework's VALIDATION_RUN and REPORT (via the chain correlation_id and validation_run_id). Table and column names below match the authoritative definitions in DV LLD §4.

-- One row per Hop 2 run, with Hop 1, ingest, and validation joined for the chain view.
SELECT
  h2.correlation_id,
  h2.refNum,
  h2.target_env,
  h2.id                       AS hop2_run_id,
  h2.status                   AS hop2_status,
  h2.started_at               AS hop2_started_at,
  h2.completed_at             AS hop2_completed_at,
  h2.preflight_result,
  h2.preflight_override_by,
  h1.id                       AS hop1_run_id,
  h1.completed_at             AS hop1_completed_at,
  ir.id                       AS ingest_run_id,
  ir.started_at               AS ingest_started_at,
  ir.completed_at             AS ingest_completed_at,
  vr.id                       AS validation_run_id,
  vr.status                   AS validation_status,
  vr.trigger_source           AS validation_trigger_source,
  rep.id                      AS validation_report_id,
  rep.content_hash            AS validation_report_hash
FROM SYNC_RUN h2
LEFT JOIN SYNC_RUN       h1  ON h1.id  = h2.linked_hop1_run_id
LEFT JOIN INGEST_RUN     ir  ON ir.id  = h2.linked_ingest_run_id
LEFT JOIN VALIDATION_RUN vr  ON vr.correlation_id = h2.correlation_id
LEFT JOIN REPORT         rep ON rep.validation_run_id = vr.id
WHERE h2.chain_hop = 2;

DLQ counts are deliberately not in this view — they live on DLQ_RECORD (owned by DV) and are queried separately via DV's GET /dlq/inventory. Joining DLQ counts here would force a per-row aggregation; the chain view is for orientation, not deep DLQ analysis.

Key indexes:

Table	Index	Reason	Owner
`SYNC_RUN`	`(refNum, target_env, chain_hop, status, started_at desc)`	Latest run per customer per env per hop.	Sync
`SYNC_RUN`	`(correlation_id)` partial WHERE `chain_hop=2`	Chain view lookups.	Sync
`SYNC_RUN`	`(refNum, chain_hop, status) WHERE chain_hop=1 AND status='Succeeded'`	Pre-flight freshness query.	Sync
`SYNC_RUN_ITEM`	`(run_id)`	Listing items for a run.	Sync
`CDC_RESUME_TOKEN`	`(refNum, source_collection)` unique	One token per source per customer.	Sync
`AUDIT_ENTRY`	`(run_id, at)` and `(correlation_id, at)`	Per-run and per-chain audit replay.	Sync
`INGEST_RUN`	`(correlation_id)`	Joining ingest into the chain view.	DV / migration pipeline (see DV LLD §4)
`VALIDATION_RUN`	`(correlation_id)`	Joining validation into the chain view.	DV (see DV LLD §4)

Retention:

Table	Default retention
`SYNC_RUN` / `SYNC_RUN_ITEM`	1 year hot, archive thereafter
`SNAPSHOT` storage	30 days hot, 90 days archive (configurable per env and per `target_tech`)
`CDC_RESUME_TOKEN`	Indefinite while sync is active
`AUDIT_ENTRY`	Per security policy (typically 7 years)

5. Key Workflows

5.1 Hop 1 — Scheduled Delta Sync (happy path)

sequenceDiagram
    autonumber
    participant Sched as Delta Scheduler
    participant API as Sync API
    participant Disp as Dispatcher
    participant Reader as Prod Reader
    participant Mask as Masking
    participant Snap as Snapshot
    participant Writer as Target Writer
    participant Aud as Audit Log

    Sched->>API: POST /syncs {chain_hop:1, mode:Delta, refNum}
    API->>API: validate, authz (no approval needed)
    API-->>Sched: 202 run_id
    API->>Disp: enqueue run
    Disp->>Reader: read since CDC resume token
    Reader->>Reader: throttle on replica lag
    Reader->>Mask: stream records
    Mask->>Mask: classify + apply policy
    Mask->>Snap: emit
    Snap->>Snap: capture pre-image (one-time for run)
    Snap->>Writer: write to tenant_refNum (Mongo Lower)
    Writer-->>Aud: per-item result
    Writer-->>Disp: run complete
    Disp->>Aud: persist resume token
    Disp-->>API: status = Succeeded

5.2 Hop 1 — Operator-Initiated Full Sync with Approval, Snapshot, and Rollback (error path)

sequenceDiagram
    autonumber
    participant Op as Operator
    participant API as Sync API
    participant Apv as Approver
    participant Snap as Snapshot Mgr
    participant Run as Run Executor
    participant Aud as Audit Log

    Op->>API: POST /syncs {chain_hop:1, mode:Full, refNum} + justification
    API->>API: validate, authz
    API-->>Op: 202 run_id (status: AwaitingApproval)
    API->>Apv: notify
    Apv->>API: POST /syncs/{id}/approve
    API->>Run: dispatch
    Run->>Snap: take snapshot of Mongo Lower tenant_refNum
    Snap-->>Run: snapshot_id stored
    Run->>Run: read + mask + write (errors mid-run)
    Run-->>Aud: failures recorded
    Run-->>API: status = Failed
    API-->>Op: alert with run_id
    Op->>API: POST /syncs/{id}/restore
    API->>Snap: restore from snapshot_id
    Snap-->>API: target restored
    API-->>Aud: restore event
    API-->>Op: status = RolledBack

5.3 Sync Run State Lifecycle (both hops)

stateDiagram-v2
    [*] --> Queued
    Queued --> AwaitingPreflight: Hop 2
    Queued --> AwaitingApproval: Hop 1 Full/Selective
    Queued --> Running: Hop 1 Delta
    AwaitingPreflight --> AwaitingApproval: preflight pass
    AwaitingPreflight --> PreflightFailed: stale source
    PreflightFailed --> AwaitingApproval: senior override
    PreflightFailed --> [*]: operator cancels
    AwaitingApproval --> Running: approved
    AwaitingApproval --> Rejected: declined
    Running --> Succeeded
    Running --> Failed
    Succeeded --> SucceededWithBlockingFindings: validation callback fail_blocking (Hop 2)
    Failed --> RolledBack: operator restore
    Succeeded --> [*]
    SucceededWithBlockingFindings --> [*]
    RolledBack --> [*]
    Rejected --> [*]
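
The lifecycle above can also be read as a transition table, which is a convenient way to reject illegal status updates. A minimal sketch, assuming status values are the strings stored in `SYNC_RUN.status`; `TRANSITIONS` and `can_transition` are hypothetical names.

```python
from typing import Dict, Set

# Allowed next states per current state, transcribed from the state diagram.
# The "operator cancels" exit from PreflightFailed has no named target state
# in the diagram, so it is not modelled here.
TRANSITIONS: Dict[str, Set[str]] = {
    "Queued": {"AwaitingPreflight", "AwaitingApproval", "Running"},
    "AwaitingPreflight": {"AwaitingApproval", "PreflightFailed"},
    "PreflightFailed": {"AwaitingApproval"},
    "AwaitingApproval": {"Running", "Rejected"},
    "Running": {"Succeeded", "Failed"},
    "Succeeded": {"SucceededWithBlockingFindings"},
    "Failed": {"RolledBack"},
    "SucceededWithBlockingFindings": set(),
    "RolledBack": set(),
    "Rejected": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle permits moving from current to target."""
    return target in TRANSITIONS.get(current, set())
```

The table is hop-agnostic; which edge out of Queued applies (pre-flight, approval, or straight to Running) depends on `chain_hop` and `mode`, as labelled in the diagram.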

5.4 Hop 2 — Happy path

Validation is invoked asynchronously. The Validation Webhook component on the Sync side calls DV's POST /validation/run with a callback_url; DV responds 202 with a validation_run_id and runs validation in the background. When validation finishes (pass or fail), DV calls back into Sync at POST /chains/{correlation_id}/validation-complete. Sync then updates linked_validation_run_id and, if the outcome is fail_blocking, flips status from Succeeded to SucceededWithBlockingFindings.

sequenceDiagram
    autonumber
    participant Op as Operator
    participant API as Sync API
    participant Pre as Pre-flight Validator
    participant Apv as Approver
    participant Snap as Snapshot Mgr
    participant Adp as Migration Pipeline Adapter
    participant Pipe as Migration Ingest Pipeline
    participant PLow as Postgres Lower
    participant Hook as Validation Webhook
    participant Val as Data Validation Framework
    participant Aud as Audit Log

    Op->>API: POST /syncs {chain_hop:2, mode:Full, refNum, target_env}
    API->>API: validate + authz + tag correlation_id
    API->>Pre: check freshness for (refNum, target_env)
    Pre-->>API: pass (last_hop1_completed_at within window)
    API-->>Op: 202 run_id, correlation_id (status: AwaitingApproval)
    API->>Apv: notify
    Apv->>API: POST /syncs/{id}/approve
    API->>Snap: snapshot Postgres Lower tenant_refNum
    Snap-->>API: snapshot_id
    API->>Adp: invoke ingest (source=mongo-lower-{env}, target=tenant_refNum, full, correlation_id)
    Adp->>Pipe: run with correlation_id propagated
    Pipe->>PLow: truncate + JOLT → DAAL writes
    Pipe-->>Adp: ingest_run_id, durations
    Adp-->>API: ingest complete, linked_ingest_run_id
    API->>API: persist linked_ingest_run_id, set status to Succeeded
    API-->>Aud: ingest-complete event
    API-->>Op: 200 status Succeeded, validation pending

    rect rgb(225,245,254)
    Note over Hook,Val: Asynchronous validation phase
    API->>Hook: enqueue validation trigger
    Hook->>Val: POST /validation/run with refNum, ingest_run_id, correlation_id, trigger_source hop2, callback_url
    Val-->>Hook: 202 validation_run_id
    Hook->>API: persist linked_validation_run_id
    Val->>Val: run rules and assertions (async)
    Val->>API: POST /chains/correlation_id/validation-complete with outcome pass or fail_blocking or fail_nonblocking
    API->>API: if fail_blocking then status SucceededWithBlockingFindings
    API-->>Aud: validation-complete event with outcome
    end

5.5 Hop 2 — Pre-flight failure with senior override

sequenceDiagram
    autonumber
    participant Op as Operator
    participant API as Sync API
    participant Pre as Pre-flight Validator
    participant Snr as Senior Approver
    participant Aud as Audit Log

    Op->>API: POST /syncs {chain_hop:2, refNum, target_env}
    API->>Pre: check freshness
    Pre-->>API: fail (last_hop1_completed_at older than window)
    API-->>Op: 409 with last_hop1_run_id, last_hop1_completed_at, threshold
    Note over Op: Option A: run Hop 1 first, then retry
    Note over Op: Option B: request override (justified)
    Op->>Snr: request override with rationale
    Snr->>API: POST /chains/{id}/override-preflight {rationale}
    API-->>Aud: override event recorded (security page)
    Note over API: Sequence merges into 5.4 from "API-->>Op: 202..."

6. APIs & Contracts

Method	Path	Purpose	Auth
`POST`	`/syncs`	Create a sync run. `chain_hop` accepts `1` (default) or `2`.	`sync-operator`
`GET`	`/syncs/{id}`	Fetch run + items + status.	`sync-operator`, `viewer`
`POST`	`/syncs/{id}/approve`	Approve a Hop 1 Full/Selective run, or any Hop 2 run.	`sync-approver`
`POST`	`/syncs/{id}/restore`	Restore the target from this run's pre-snapshot.	`sync-operator`
`GET`	`/syncs?refNum={x}`	List recent runs for a customer.	`sync-operator`, `viewer`
`GET`	`/snapshots?refNum={x}`	List snapshots for a customer (with retention info).	`sync-operator`
`GET`	`/audit?run_id={x}`	Retrieve audit trail for a run.	`auditor`, `security`
`GET`	`/chains/{correlation_id}`	End-to-end view: Hop 1 → Hop 2 → ingest → validation. Reads `CHAIN_RUN_VIEW`.	`sync-operator`, `viewer`, `qa-eng`
`POST`	`/chains/{id}/override-preflight`	Override a Hop 2 pre-flight freshness check with a documented rationale.	`senior-approver`
`POST`	`/chains/{correlation_id}/validation-complete`	Webhook called by Data Validation Framework when async validation finishes. Body includes `validation_run_id`, `report_id`, `outcome ∈ {pass, fail_blocking, fail_nonblocking}`. Causes the Hop 2 run to update `linked_validation_run_id` and (on `fail_blocking`) flip status to `SucceededWithBlockingFindings`. Idempotent on `(correlation_id, validation_run_id)`.	`dv-service`
`GET`	`/chains?refNum={x}`	List recent chain runs for a customer.	`sync-operator`, `viewer`
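
The idempotency requirement on `POST /chains/{correlation_id}/validation-complete` can be sketched as below. This is a simplified illustration: `ChainStore` is a hypothetical in-memory stand-in for `SYNC_RUN`, and a real handler would presumably enforce the `(correlation_id, validation_run_id)` key with a database constraint.

```python
from typing import Dict, Set, Tuple

OUTCOMES = {"pass", "fail_blocking", "fail_nonblocking"}


class ChainStore:
    """In-memory stand-in for the Hop 2 rows in SYNC_RUN (hypothetical)."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        self.linked_validation_run_id: Dict[str, str] = {}
        self.seen: Set[Tuple[str, str]] = set()


def handle_validation_complete(
    store: ChainStore, correlation_id: str, validation_run_id: str, outcome: str
) -> bool:
    """Apply a validation callback once; return False for a repeat delivery."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    key = (correlation_id, validation_run_id)
    if key in store.seen:
        return False                     # duplicate callback: no state change
    store.seen.add(key)
    store.linked_validation_run_id[correlation_id] = validation_run_id
    if outcome == "fail_blocking":
        store.status[correlation_id] = "SucceededWithBlockingFindings"
    return True
```

Returning a boolean rather than raising on duplicates lets the endpoint answer a retried callback with success, so DV does not keep retrying.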

Hop 1 request shape (excerpt) — POST /syncs:

{
  "chain_hop": 1,
  "mode": "Full | Delta | Selective",
  "refNum": "ABC123",
  "target_env": "INTQA | Staging | Dev",
  "scope": {
    "categories": ["operational", "lookup"],
    "artifact_types": ["entity_decl", "workflow", "ruleset"]
  },
  "justification": "fresh INTQA stand-up for customer ABC123"
}

Hop 2 request shape (excerpt) — POST /syncs:

{
  "chain_hop": 2,
  "mode": "Full",
  "refNum": "ABC123",
  "source_env": "mongo-lower-INTQA",
  "target_env": "postgres-lower-INTQA",
  "auto_validate": true,
  "justification": "INTQA baseline refresh for customer ABC123 Step 8 rehearsal"
}

Response includes run_id, correlation_id (Hop 2 only), current status, the approval URL (when applicable), and a chain_view_url pointing at GET /chains/{correlation_id}.
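
The Request Validator's checks on these two request shapes could be sketched as follows. The document states that `chain_hop` must be 1 or 2 and that Hop 2 is Full-only; treating `refNum` and `target_env` as the required fields for both hops is an assumption, and `validate_sync_request` is a hypothetical name.

```python
from typing import List

MODES = ("Full", "Delta", "Selective")


def validate_sync_request(body: dict) -> List[str]:
    """Return a list of validation errors; an empty list means the body is acceptable."""
    errors: List[str] = []
    hop = body.get("chain_hop", 1)       # chain_hop defaults to 1
    if hop not in (1, 2):
        errors.append("chain_hop must be 1 or 2")
    mode = body.get("mode")
    if mode not in MODES:
        errors.append("mode must be Full, Delta, or Selective")
    elif hop == 2 and mode != "Full":
        errors.append("Hop 2 runs are Full-only")
    for field in ("refNum", "target_env"):   # assumed required for both hops
        if not body.get(field):
            errors.append(f"{field} is required")
    return errors
```

Collecting all errors instead of stopping at the first lets the API reject a malformed request with one complete response.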

7. Configuration & Feature Flags

Setting	Default	Notes
`delta.schedule.cron`	`*/30 * * * *`	Hop 1; configurable per refNum or category.
`read.replica.lag.threshold_ms`	`2000`	Hop 1; above this, Reader pauses.
`read.throughput.max_rps`	`500`	Hop 1; per refNum; protects replica.
`mask.cache.ttl_seconds`	`300`	Hop 1; MDS classification cache.
`mask.failclosed_on_missing_classification`	`true`	Cannot be flipped without a `security-admin` change ticket.
`snapshot.retention.hot_days`	`30`	Per target environment, per target tech (Mongo / Postgres).
`snapshot.retention.archive_days`	`90`	Per target environment, per target tech.
`approval.required_for`	`["Full", "Selective", "Hop2"]`	Hop 1 Delta excluded by design.
`audit.sink`	`internal`	Can also stream to SIEM.
`chain.preflight.max_staleness_minutes`	`1440`	Hop 2 pre-flight freshness window. Configurable per target environment.
`chain.autovalidate.enabled`	`true`	Trigger Data Validation Framework after every successful Hop 2.
`chain.override.requires_role`	`senior-approver`	Role required to override the pre-flight freshness check.

8. Security & Compliance

AuthN: Service accounts (for scheduler) and operator identities (federated SSO). Tokens short-lived.
AuthZ: Roles — sync-operator, sync-approver, senior-approver (for Hop 2 pre-flight overrides), auditor, dv-service (service account used by the Data Validation Framework to call POST /chains/{correlation_id}/validation-complete). Approver must be a different identity than the requester (enforced for both Hop 1 and Hop 2).
Network: Sync Service runs in a dedicated subnet. Egress to lower envs allowed; egress to production primary is denied — only read-replica endpoints are reachable. No ingress from lower envs. The Migration Pipeline Adapter calls the migration ingest pipeline over an internal service mesh; the pipeline itself runs in its own VPC.
Encryption: All data in transit uses TLS 1.3 or later. The snapshot store is encrypted at rest with environment-scoped KMS keys.
Masking (Hop 1): MDS classifications drive masking; missing classification is a hard error. PII uses deterministic pseudonymization keyed per-environment so referential integrity holds but identities cannot be reversed.
Masking (Hop 2): Hop 2 inherits already-masked data from Mongo Lower. It does not re-mask, but it also cannot un-mask — the migration pipeline never reaches production.
Audit: Every run (both hops) records actor, mode, refNum, approver, scope, snapshot id, resume token id (Hop 1), correlation id (Hop 2), pre-flight outcome (Hop 2), start/end time, and per-item outcomes. Audit entries are append-only.
Compliance evidence: The audit log and chain view are referenced by Security & Compliance in the customer evidence package. Reports are exportable on demand and queryable by correlation ID.
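
To illustrate why deterministic, per-environment keyed pseudonymization preserves referential integrity, here is a minimal sketch. It assumes HMAC-SHA256 as the keyed function and uses placeholder keys; the design does not name the algorithm or the key source, so treat both as hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, env_key: bytes) -> str:
    """Keyed one-way token: the same value and key always give the same token.

    Assumption: HMAC-SHA256 stands in for whatever keyed function the
    masking engine actually uses.
    """
    return hmac.new(env_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder keys for illustration; real keys would be environment-scoped secrets.
intqa_key = b"hypothetical-intqa-key"
staging_key = b"hypothetical-staging-key"

token_a = pseudonymize("jane.doe@example.com", intqa_key)
token_b = pseudonymize("jane.doe@example.com", intqa_key)
token_c = pseudonymize("jane.doe@example.com", staging_key)
```

Within one environment the two tokens match, so joins on the masked column still line up; across environments they differ, so a token from one environment is not directly linkable to the same identity in another.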

9. Observability

Signal	Source	Used by
`sync.runs.total{mode, status, refNum, chain_hop}` (counter)	Sync API	Dashboards, alerting
`sync.run.duration_seconds{mode, refNum, chain_hop}` (histogram)	Run Executor	Capacity planning
`sync.records.read`, `sync.records.masked`, `sync.records.written` (counters, each labeled `{refNum}`)
`sync.replica.lag_ms` (gauge)	Hop 1 Replica Lag Monitor	Throttle, paging
`sync.mask.classification_misses` (counter)	Hop 1 Masking Engine	Alerts (>0 is a bug)
`sync.cdc.token_lost_total{refNum}` (counter)	Hop 1 Delta Handler	Alerts; triggers Full Sync fallback
`chain.preflight.failures_total{refNum, reason}` (counter)	Hop 2 Pre-flight Validator	Dashboards; trend of stale-source attempts
`chain.preflight.overrides_total{approver, refNum}` (counter)	Override endpoint	Security review (page on any non-zero)
`chain.runs.duration_seconds{refNum}` (histogram)	Chain view	End-to-end cycle time per chain run
`chain.runs.blocking_findings_total{refNum}` (counter)	Validation Webhook	Alerts on `SucceededWithBlockingFindings`
Structured run logs	All components	Run drill-down in dashboard
Audit log entries	Audit Sink	Compliance evidence

Alerts:

Alert	Condition	Page?
Sync failure	Any run in `Failed` state	Yes (on-call)
Replica lag sustained	`sync.replica.lag_ms > read.replica.lag.threshold_ms` for 10m	Yes
Classification miss	Any non-zero increment	Yes (security)
CDC token loss	Any non-zero increment	Yes (on-call)
Snapshot store at capacity	>85% utilization	Yes (DBA)
Hop 2 pre-flight override	`chain.preflight.overrides_total` increments	Yes (security)
Hop 2 chain run failed with blocking findings	Hop 2 status = `SucceededWithBlockingFindings`	Yes (QA + on-call)
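
The "Replica lag sustained" condition differs from a single threshold crossing: the lag has to stay above the threshold for the whole window. A rough sketch of that evaluation, assuming timestamped gauge samples in ascending order (the function name and sample shape are hypothetical):

```python
from typing import Optional, Sequence, Tuple


def lag_sustained(samples: Sequence[Tuple[float, float]],
                  threshold_ms: float = 2000,
                  window_seconds: float = 600) -> bool:
    """samples are (unix_seconds, lag_ms) pairs in ascending time order.

    Returns True once an unbroken run of above-threshold samples spans
    at least window_seconds; any sample at or below threshold resets the run.
    """
    breach_started: Optional[float] = None
    for ts, lag_ms in samples:
        if lag_ms > threshold_ms:
            if breach_started is None:
                breach_started = ts
            if ts - breach_started >= window_seconds:
                return True
        else:
            breach_started = None
    return False


# One sample per minute for ten minutes.
steady = [(t * 60, 2500) for t in range(11)]
with_dip = [(t * 60, 100 if t == 5 else 2500) for t in range(11)]
```

In practice the metrics backend would normally evaluate this rule; the sketch only makes the "sustained" semantics explicit.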

10. Deployment

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f

    subgraph ProdVPC[" Production VPC "]
        ProdReplica[(MongoDB / Aurora<br/>Read Replica)]:::prod
    end

    subgraph SyncVPC[" Sync Service VPC "]
        SyncSvc[Sync Service<br/>API + Workers]:::internal
        QueueDb[(Run Queue)]:::internal
        SnapDb[(Snapshot Store<br/>S3 / equivalent)]:::internal
        AuditDb[(Audit Log)]:::internal
    end

    subgraph MigVPC[" Migration Pipeline VPC "]
        MigPipe[Migration Ingest Pipeline<br/>JOLT → DAAL]:::external
    end

    subgraph ValVPC[" Validation VPC "]
        Val[Data Validation Framework]:::external
    end

    subgraph LowerVPCs[" Lower-env VPCs (per env, per refNum) "]
        MLower[(Mongo Lower<br/>tenant_refNum)]:::lower
        PLower[(Postgres Lower<br/>tenant_refNum)]:::lower
    end

    SyncSvc -- "TLS, replica-only<br/>(Hop 1 reads)" --> ProdReplica
    SyncSvc -- "Hop 1 writes" --> MLower
    SyncSvc --> QueueDb
    SyncSvc --> SnapDb
    SyncSvc --> AuditDb

    SyncSvc -- "Hop 2: invoke<br/>(source=Mongo Lower)" --> MigPipe
    MLower --> MigPipe
    MigPipe --> PLower
    SyncSvc -- "Hop 2: trigger" --> Val
    Val -. reads .-> PLower

    style ProdVPC stroke-dasharray: 5 5,stroke:#00695c
    style LowerVPCs stroke-dasharray: 5 5,stroke:#1565c0
    style MigVPC stroke-dasharray: 5 5,stroke:#ad1457
    style ValVPC stroke-dasharray: 5 5,stroke:#ad1457

Concern	Decision
Where Sync Service runs	A dedicated VPC. Allowed egress to production replicas, lower-env Mongo / Postgres endpoints, the Migration Pipeline VPC, and the Validation VPC.
Migration Pipeline Adapter transport	Internal service-mesh call to the existing migration ingest pipeline. The pipeline binary is unchanged; the adapter just passes parameters and the correlation ID.
High availability	Multi-AZ deployment; workers stateless; queue durable.
Snapshot store	Object storage (e.g., S3) with versioning + lifecycle policy for hot→archive transition. Mongo Lower and Postgres Lower snapshots are stored in distinct prefixes with their own retention classes.
Audit log	Append-only, with optional mirror to centralized SIEM.
Promotion	Same container image promoted through Dev → Staging → INTQA; per-env config via secret manager.

11. Operational Runbook Hooks

Runbook	When to use
`runbook-sync-full-stand-up.md`	Hop 1: fresh Mongo Lower environment for a new customer.
`runbook-sync-selective.md`	Hop 1: targeted refresh of one refNum or artifact type.
`runbook-sync-restore-from-snapshot.md`	Hop 1: rolling back Mongo Lower when a sync misbehaves.
`runbook-sync-cdc-token-loss.md`	Hop 1: CDC token missing/corrupted, Delta cannot resume.
`runbook-sync-on-call-paging.md`	What to do when paged for any of the alerts in §9.
`runbook-hop2-trigger.md`	Triggering a Hop 2 refresh of Postgres Lower for a refNum.
`runbook-hop2-preflight-override.md`	Senior-approver procedure to override a stale-source pre-flight failure.
`runbook-hop2-restore.md`	Restoring Postgres Lower from a Hop 2 pre-snapshot.
`runbook-chain-view.md`	Using `GET /chains/{correlation_id}` to investigate an end-to-end chain run.

12. Failure Modes & Recovery

Failure	Detection	Recovery	Owner
Replica lag exceeds threshold (Hop 1)	`sync.replica.lag_ms` alert	Reader auto-pauses; on-call decides whether to wait or skip the window.	On-call
MDS classification missing for a field (Hop 1)	`sync.mask.classification_misses` increments	Run fails closed; Domain Owner adds classification; operator retriggers.	Security + Domain Owner
CDC resume token lost (Hop 1)	`sync.cdc.token_lost_total` increments	Automatic fallback to Full Sync; recorded in audit. Operator reviews.	On-call + Operator
Approval timeout (either hop)	Run sits in `AwaitingApproval` past SLA	Auto-expire; requester re-submits.	Operator
Target write failure mid-Hop-1	Run fails with partial writes	Operator restores Mongo Lower from pre-snapshot, then retries.	Operator
Snapshot storage full	Snapshot creation fails	Run rejected before any writes; capacity alert pages DBA.	DBA
Network egress denied (misconfig)	Run fails with auth/network error	Operator validates IAM + routing; rolls forward fix.	DevOps
Cross-tenant write attempt	Pre-flight scope check fails	Run rejected; security event raised.	Security
Hop 2 pre-flight fails (Mongo Lower stale)	API returns 409 with last_hop1 details	Run Hop 1 first or request senior override with rationale. No state mutation.	Operator
Migration pipeline DLQ exceeds threshold mid-Hop-2	DLQ count crosses configured limit	Hop 2 marked `Failed`; Postgres Lower left under pre-Hop-2 snapshot; operator restores or fixes JOLT/data and reruns.	Operator + Mapping Team
Hop 2 auto-validation reports Critical findings	Validation report outcome `FailedBlocking`	Hop 2 marked `SucceededWithBlockingFindings`; workflow gates remain blocked until findings resolved or overridden via Data Validation override path.	QA Team
Snapshot capacity full on Postgres Lower (Hop 2 pre-flight)	Snapshot creation fails	Hop 2 rejected pre-flight; no truncate occurs; capacity alert pages DBA.	DBA
Migration Pipeline Adapter cannot reach the migration pipeline	Adapter call fails health probe / times out	Hop 2 marked `Failed` before any target mutation; on-call investigates pipeline VPC.	Platform Eng + DevOps

13. Open Questions / Decisions Log

#	Question	Status	Notes
1	Final snapshot retention policy per environment per target_tech	Open	DBA + Security to confirm hot/archive split. Mongo Lower and Postgres Lower may need different classes.
2	Exact replica-lag threshold per environment	Open	Needs measurement during Phase 1 first runs.
3	Pseudonymization key rotation cadence	Open	Pending security policy.
4	Whether Hop 1 Selective sync supports time-bounded windows	Decided — yes	Operator can supply `since` / `until` in scope.
5	Where audit log mirrors to (SIEM choice)	Open	Pending Security tooling decision.
6	Default `chain.preflight.max_staleness_minutes` per env (currently 1440 everywhere)	Open	Likely tighter for INTQA-rehearsal envs, looser for Dev.
7	Whether parallel Hop 2 runs per refNum are ever allowed (currently no)	Decided — no	Per-(refNum, target_env, chain_hop) mutex enforced. Reopen only if a concrete need arises.
8	Whether Hop 2 should support a "dry-run" mode that snapshots but does not invoke the migration pipeline	Open	Could help operators confirm freshness before consuming pipeline capacity.
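
Decision 7 enforces a mutex per (refNum, target_env, chain_hop). The sketch below shows the intended semantics in a single process. It is an assumption-laden illustration: the class name is hypothetical, and because workers are stateless and multi-AZ, a real implementation would need a shared store rather than in-memory state.

```python
import threading
from typing import Set, Tuple

RunKey = Tuple[str, str, str]  # (refNum, target_env, chain_hop)


class RunMutex:
    """In-memory stand-in for the per-key run lock (illustrative only)."""

    def __init__(self) -> None:
        self._held: Set[RunKey] = set()
        self._guard = threading.Lock()

    def try_acquire(self, key: RunKey) -> bool:
        # Non-blocking: a second run for the same key is rejected, not queued.
        with self._guard:
            if key in self._held:
                return False
            self._held.add(key)
            return True

    def release(self, key: RunKey) -> None:
        with self._guard:
            self._held.discard(key)


mutex = RunMutex()
hop2_key = ("ABC123", "intqa", "hop2")
```

A Hop 1 run for the same refNum and environment uses a different key, so it is not blocked by an in-flight Hop 2 run under this keying.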

14. Glossary Reference

See _shared/glossary.md.

Low-Level Designs

Data Validation

Rule engine, DLQ inspector, gate decisions, async chain callback.

LLD — Data Validation Framework

Low-Level Design · Workstream 4 of 9 · MongoDB → Aurora PostgreSQL Migration
Companion document: HLD · Source: metadata-documents/MongoDB_to_PostgreSQL_v2_WS4_Data_Validation.docx

Version	Date	Author	Notes
v0.1	2026-05-18	Migration Architecture	Initial draft from WS4 execution plan

1. ⭐ Executive Summary

In one sentence: A pipeline-integrated engine that runs the four playbook reconciliation rules and per-entity business assertions on every ingest, classifies findings by severity, blocks workflow advancement on Critical findings, and produces a signed per-customer-per-entity report.

If you only read one section, read this.

What we build: A Rule Engine with four checkers (row count, FK integrity, required-field coverage, sampled deep-diff), an Assertion Executor, a DLQ Inspector, a Severity Classifier, a Reporting Service, and a Gate Decision API used by the ingest pipeline and workflow tooling.
Key design decisions: Validation is not an optional step — it's wired into the ingest pipeline so results land in metadata.ingest_run_history alongside the run record. Reproducibility is non-negotiable: rule version and data version are recorded on every finding. Validation runs accept an optional correlation_id (set by Sync Framework Hop 2) so the resulting report participates in the chain view.
Trust model: The framework itself is verified against seeded errors (Step 5) before any customer relies on it. False-negative rate is measured.
Phasing: Built once, applies to every customer. Phase 2 inherits without modification.
Triggers: ingest pipeline at cutover; Sync Framework Hop 2 (auto-trigger); operator manual re-run.
Open items at v0.1: sampled deep-diff sample size formula per entity volume; assertion-authoring linting rules; DLQ disposition SLA per severity.

2. Reference Architecture (recap from HLD)

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c

    Ingest[Ingest Pipeline]:::external --> Engine[Rule Engine]:::internal --> Class[Severity Classifier]:::internal --> Gate[Gate API]:::gate
    Class --> Reporter[Reporting Service]:::internal

Full diagrams in the HLD. The LLD elaborates each container.

3. Component Designs (C4 L3)

3.1 Rule Engine

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92

    In[Ingest Run Event<br/>refNum + entity + run_id]:::internal --> Plan[Execution Planner]:::internal
    Plan --> Row[Row Count Checker]:::internal
    Plan --> FK[FK Integrity Walker]:::internal
    Plan --> Cov[Required-Field Coverage Analyzer]:::internal
    Plan --> Diff[Sampled Deep-Diff Comparator]:::internal
    Plan --> Asrt[Assertion Executor]:::internal

    Row --> Findings[(Findings)]:::store
    FK --> Findings
    Cov --> Findings
    Diff --> Findings
    Asrt --> Findings

    Rules[(Rule + Assertion Library)]:::store --> Plan

Module	Responsibility	Notes
Execution Planner	Resolves which rules + assertions apply to this entity at this rule-library version.	Records `rule_version` and `entity_version` on every finding for reproducibility.
Row Count Checker	Document count in Mongo (filtered by refNum) vs `COUNT(*)` in `tenant_{refNum}` per entity.	Tolerance 0%.
FK Integrity Walker	For each FK column, verifies every value resolves to a parent row; resolves both auto-injected FKs and `_lookup`-resolved UUIDs.	Tolerance 0 orphans.
Required-Field Coverage Analyzer	Computes % of rows where required field is non-null in the target table.	Threshold ≥ 99.9%.
Sampled Deep-Diff Comparator	Random N rows compared field-by-field, source vs target, post-transformation.	Tolerance 0 diffs in sample. Sample size is rule-configurable per entity volume.
Assertion Executor	Runs per-entity business assertions authored by QA.	Each assertion is a versioned, named, declarative check.
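
As a rough illustration of two of the checkers above, the following sketch evaluates row-count parity (tolerance 0) and required-field coverage (threshold 99.9%) over in-memory data. The function names, the `required_field_coverage` rule id, and the treatment of an empty table as fully covered are assumptions; the real checkers would query Mongo and Postgres.

```python
from typing import Any, Dict, List, Optional


def row_count_finding(source_count: int, target_count: int) -> Optional[Dict[str, Any]]:
    """Zero tolerance: any difference between source and target is a finding."""
    if source_count == target_count:
        return None
    return {"rule_id": "row_count_parity",
            "observed": target_count,
            "expected": source_count}


def coverage_pct(rows: List[Dict[str, Any]], field: str) -> float:
    """Percentage of rows where the field is present and non-null."""
    if not rows:
        return 100.0  # assumption: an empty table has nothing uncovered
    present = sum(1 for row in rows if row.get(field) is not None)
    return 100.0 * present / len(rows)


def coverage_finding(rows: List[Dict[str, Any]], field: str,
                     min_pct: float = 99.9) -> Optional[Dict[str, Any]]:
    observed = coverage_pct(rows, field)
    if observed >= min_pct:
        return None
    # "required_field_coverage" is a hypothetical rule id used for illustration.
    return {"rule_id": "required_field_coverage", "field": field,
            "observed": observed, "expected": min_pct}


good_rows = [{"id": i, "status": "open"} for i in range(1000)]
bad_rows = good_rows[:998] + [{"id": 998, "status": None}, {"id": 999}]
```

Here `bad_rows` has two of 1000 rows without a `status`, giving 99.8% coverage, which falls below the 99.9% threshold.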

3.2 DLQ Inspector

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92

    DLQ[(Dead Letter Queue)]:::store --> Inspector[DLQ Inspector]:::internal
    Inspector --> Group[Reason Grouper]:::internal
    Group --> Aging[Aging Tracker]:::internal
    Aging --> Findings[(Findings)]:::store

Groups DLQ records by reason (required_field_missing, lookup_miss, type_conversion_failure, fk_integrity_failure, assertion_failure).
Aging tracker flags records older than the disposition SLA per severity.
Output joins into the same Findings store as the rule engine — one report covers both rule failures and DLQ inventory.
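
The grouping and aging steps can be sketched as follows. This is a simplified illustration: it applies a single SLA (defaulting to 24 hours) rather than one SLA per severity, and the record shape and function name are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def dlq_inventory(records: List[Dict[str, Any]], now: datetime,
                  sla_hours: int = 24) -> Dict[str, Any]:
    """Group DLQ records by reason and list the ids older than the SLA."""
    by_reason = Counter(record["reason"] for record in records)
    cutoff = now - timedelta(hours=sla_hours)
    aged_ids = [record["id"] for record in records if record["arrived_at"] < cutoff]
    return {"by_reason": dict(by_reason), "aged_ids": aged_ids}


now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
records = [
    {"id": "r1", "reason": "lookup_miss", "arrived_at": now - timedelta(hours=30)},
    {"id": "r2", "reason": "lookup_miss", "arrived_at": now - timedelta(hours=2)},
    {"id": "r3", "reason": "assertion_failure", "arrived_at": now - timedelta(hours=1)},
]
inventory = dlq_inventory(records, now)
```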

3.3 Severity Classifier + Router

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c

    Findings[(Findings)]:::internal --> Classifier[Classifier]:::internal
    Classifier --> Routes{Severity?}
    Routes -- Critical --> Block[Block Gate]:::gate
    Routes -- High --> Review[Active Review Queue]:::internal
    Routes -- Medium --> Logged[Logged]:::internal
    Routes -- Informational --> Logged
    Block --> GateAPI[Gate Decision API]:::gate

Severity matrix (from the source playbook):

Severity	Examples	Action
Critical	Row count mismatch · required field coverage below threshold · FK integrity failure · unresolved DLQ records	Workflow advancement is blocked. The issue must be resolved before retry.
High	Sampled deep-diff finds mismatches · business-rule assertion failure on important entities	Reviewed before workflow advancement. Resolution required or explicitly accepted with documented rationale.
Medium	Cosmetic field differences · timestamp variations within acceptable tolerance	Logged. Reviewed in aggregate. Does not block.
Informational	Expected differences from data cleanup · known schema-conversion outcomes	Logged for transparency. No action required.
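
The routing in the matrix above reduces to a lookup plus one blocking rule. A minimal sketch, with route names invented for illustration:

```python
from typing import Dict, Iterable

# Route labels are hypothetical; the severities come from the matrix above.
ROUTES: Dict[str, str] = {
    "Critical": "block_gate",
    "High": "active_review_queue",
    "Medium": "logged",
    "Informational": "logged",
}


def route(severity: str) -> str:
    """Map a severity to its destination; unknown severities raise KeyError."""
    return ROUTES[severity]


def gate_blocked(findings: Iterable[Dict[str, str]]) -> bool:
    """Workflow advancement is blocked if any finding is Critical."""
    return any(finding["severity"] == "Critical" for finding in findings)


sample = [{"rule_id": "row_count_parity", "severity": "Critical"},
          {"rule_id": "sampled_deep_diff", "severity": "High"}]
```

Raising on an unknown severity, rather than defaulting to "logged", keeps an unclassified finding from slipping past the gate unnoticed; that choice is a suggestion, not something the design states.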

3.4 Reporting Service

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92

    Findings[(Findings)]:::store --> Composer[Report Composer]:::internal
    Composer --> Sign[Signer<br/>rule_version + data_version + hash]:::internal
    Sign --> ReportStore[(Report Store)]:::store
    Sign --> History[(metadata.ingest_run_history)]:::store
    ReportStore --> Dash[Leadership Dashboard]:::internal

One report per customer per entity per ingest run.
Reports are cryptographically signed with rule + data version metadata; re-running the same inputs produces an identical report hash.
The Leadership Dashboard reads from the Report Store; engineering can drill into specific findings; security can verify signatures.
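
The reproducibility claim (same inputs, identical hash) depends on hashing a canonical serialization of the report. The sketch below shows one way to get that property. It assumes canonical JSON with sorted keys and uses HMAC-SHA256 as a stand-in signature; the design does not specify the serialization or the signature scheme, so both are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def content_hash(report: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order and whitespace cannot change the result."""
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sign(report_hash: str, signing_key: bytes) -> str:
    # HMAC is a stand-in; the actual signature scheme is not named in the design.
    return hmac.new(signing_key, report_hash.encode("ascii"), hashlib.sha256).hexdigest()


def verify(report: Dict[str, Any], signature: str, signing_key: bytes) -> bool:
    expected = sign(content_hash(report), signing_key)
    return hmac.compare_digest(expected, signature)


report = {"refNum": "ABC123", "entity": "claim",
          "rule_library_version": "2026.05.18-3", "findings": []}
key = b"hypothetical-signing-key"
signature = sign(content_hash(report), key)
```

Any change to the report body, including the recorded rule or data version, changes the hash and therefore fails verification.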

3.5 Gate Decision API

Endpoint	Purpose
`GET /gate/{refNum}/{entity}/{step}`	Workflow tooling asks: can this step advance for this customer + entity? Response: `pass` / `fail` with reasons + overriding-approval URL.
`POST /gate/override`	Records an explicit override (severity `Critical` can only advance with senior approval and a documented rationale).

The pipeline calls the Gate API at Steps 3, 7, 8, and 9. Workflow tooling can disable advancement entirely until the gate returns `pass`.

4. Data Model

erDiagram
    INGEST_RUN ||--o{ VALIDATION_RUN : "produces"
    VALIDATION_RUN ||--o{ FINDING : "contains"
    VALIDATION_RUN ||--|| REPORT : "produces"
    FINDING }o--|| RULE_DEFINITION : "based on"
    FINDING }o--o| ASSERTION : "based on"
    FINDING }o--|| SEVERITY_RULE : "classified by"
    DLQ_RECORD ||--o{ FINDING : "inventoried into"
    REPORT ||--o| GATE_DECISION : "drives"

    INGEST_RUN {
        uuid id PK
        text refNum
        text entity
        text step "3|7|8|9"
        timestamp started_at
        timestamp ended_at
    }
    VALIDATION_RUN {
        uuid id PK
        uuid ingest_run_id FK
        text correlation_id "stamped by Sync Hop 2; null otherwise"
        text rule_library_version
        text entity_decl_version
        text trigger_source "ingest|hop2|manual"
        text status "Queued|Running|Succeeded|Failed"
    }
    RULE_DEFINITION {
        text id PK "e.g. row_count_parity"
        text kind "default|entity_specific"
        text version
        jsonb spec
    }
    ASSERTION {
        text id PK
        text entity
        text version
        text expression
        text severity_override
    }
    SEVERITY_RULE {
        text finding_kind PK
        text default_severity
        text override_policy
    }
    FINDING {
        uuid id PK
        uuid validation_run_id FK
        text rule_id FK
        text assertion_id FK
        text severity "Critical|High|Medium|Informational"
        int observed_value
        int expected_value
        jsonb evidence
    }
    DLQ_RECORD {
        uuid id PK
        text refNum
        text entity
        text reason
        timestamp arrived_at
        jsonb source_document
        jsonb jolt_output
    }
    REPORT {
        uuid id PK
        uuid validation_run_id FK
        text refNum
        text entity
        text rule_library_version
        text content_hash
        text signature
        timestamp created_at
        text storage_uri
    }
    GATE_DECISION {
        uuid id PK
        uuid report_id FK
        text outcome "pass|fail|overridden"
        text approver
        timestamp decided_at
    }

Key indexes:

Table	Index	Reason
`VALIDATION_RUN`	`(ingest_run_id)`	One-to-many lookup.
`FINDING`	`(validation_run_id, severity)`	Severity-grouped report rendering.
`DLQ_RECORD`	`(refNum, entity, reason)`	Reason-grouped inventory + age queries.
`REPORT`	`(refNum, entity, created_at desc)`	"Latest report per customer per entity."

Retention:

Table	Default retention
`VALIDATION_RUN`, `FINDING`, `REPORT`	7 years (compliance)
`DLQ_RECORD`	Until disposition + 1 year

5. Key Workflows

5.1 Per-Ingest Validation (happy path → gate pass)

sequenceDiagram
    autonumber
    participant Ingest as Ingest Pipeline
    participant Engine as Rule Engine
    participant DLQI as DLQ Inspector
    participant Class as Classifier
    participant Rep as Reporting
    participant Gate as Gate API
    participant Wf as Workflow Tool

    Ingest->>Engine: ingest_run_completed(refNum, entity, run_id)
    Engine->>Engine: plan = rules ∪ assertions for entity
    par
        Engine->>Engine: row count
        Engine->>Engine: FK integrity
        Engine->>Engine: required-field coverage
        Engine->>Engine: sampled deep-diff
        Engine->>Engine: assertions
    end
    Engine->>DLQI: inventory DLQ for (refNum, entity)
    DLQI-->>Engine: DLQ findings
    Engine->>Class: findings
    Class->>Class: classify each finding
    Class->>Rep: classified findings
    Rep->>Rep: compose + sign report
    Rep-->>Gate: report ready
    Wf->>Gate: GET /gate/{refNum}/{entity}/{step}
    Gate-->>Wf: pass

5.2 Step 8 Bulk Dry-Run with DLQ Resolution (error path)

sequenceDiagram
    autonumber
    participant QA as QA Team
    participant Ingest as Bulk Ingest
    participant Engine as Rule Engine
    participant DLQI as DLQ Inspector
    participant DomOwner as Domain Owner
    participant Mapping as Mapping Team
    participant Gate as Gate API

    QA->>Ingest: kick off bulk dry-run
    Ingest->>Engine: run validation per entity
    Engine-->>QA: findings (Critical: 3 DLQ assertion_failure)
    DLQI-->>QA: DLQ inventory grouped by reason
    QA->>DomOwner: triage assertion_failure records
    DomOwner->>Mapping: correct JOLT spec or assertion
    Mapping-->>Ingest: re-register spec
    QA->>Ingest: re-run for affected entity
    Engine-->>QA: findings (Critical: 0)
    Engine-->>Gate: report ready, pass
    QA->>Gate: GET /gate/{refNum}/{entity}/8 → pass
    Note over QA,Gate: Step 8 sign-off can proceed

5.3 Validation Run State Lifecycle

stateDiagram-v2
    [*] --> Queued
    Queued --> Running
    Running --> SucceededClean: 0 Critical/High/Medium
    Running --> SucceededWithFindings: High/Medium only
    Running --> FailedBlocking: ≥1 Critical
    FailedBlocking --> Overridden: senior approval
    SucceededClean --> [*]
    SucceededWithFindings --> [*]
    Overridden --> [*]
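
The lifecycle above can be enforced with a small transition table. This sketch mirrors the diagram's states and edges exactly; the `advance` helper is hypothetical.

```python
from typing import Dict, Set

# Allowed transitions, copied from the state diagram above.
TRANSITIONS: Dict[str, Set[str]] = {
    "Queued": {"Running"},
    "Running": {"SucceededClean", "SucceededWithFindings", "FailedBlocking"},
    "FailedBlocking": {"Overridden"},
    "SucceededClean": set(),
    "SucceededWithFindings": set(),
    "Overridden": set(),
}


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


state = advance("Queued", "Running")
state = advance(state, "FailedBlocking")
state = advance(state, "Overridden")
```

Note that `FailedBlocking` has only one outgoing edge: the only way out of a blocking result is an override, or a fresh run with its own lifecycle.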

6. APIs & Contracts

Method	Path	Purpose	Auth
`POST`	`/validation/run`	Trigger a validation run. Called by the ingest pipeline at cutover, by Sync Framework Hop 2 (auto-trigger with `correlation_id`), or manually for re-runs.	`pipeline`, `sync-service`, `qa-eng`
`GET`	`/validation/runs/{id}`	Fetch run status + summary.	`qa-eng`, `viewer`
`GET`	`/validation/runs?correlation_id={x}`	Find validation runs that are part of a Sync Framework chain.	`qa-eng`, `viewer`, `sync-service`
`GET`	`/validation/findings?run_id={x}&severity={s}`	List findings for a run.	`qa-eng`, `viewer`
`GET`	`/reports?refNum={x}&entity={y}&latest=true`	Get latest signed report.	any authorized role
`GET`	`/gate/{refNum}/{entity}/{step}`	Gate decision query.	`pipeline`, `workflow`
`POST`	`/gate/override`	Record an explicit override of a blocking gate.	`senior-approver`
`GET`	`/dlq/inventory?refNum={x}&entity={y}`	DLQ inventory by reason and age.	`qa-eng`, `viewer`

POST /validation/run request shape (excerpt):

{
  "refNum": "ABC123",
  "ingest_run_id": "uuid",
  "trigger_source": "hop2",
  "correlation_id": "chain-ABC123-20260518-7f3a",
  "callback_url": "https://sync-service.internal/chains/chain-ABC123-20260518-7f3a/validation-complete"
}

POST /validation/run returns immediately with 202 { "validation_run_id": "uuid" }. The framework runs validation in the background; on completion (success or failure), it POSTs to callback_url with:

{
  "validation_run_id": "uuid",
  "report_id": "uuid",
  "outcome": "pass | fail_blocking | fail_nonblocking",
  "summary": {
    "critical": 0,
    "high": 0,
    "medium": 0,
    "informational": 0
  }
}

The callback is retried with exponential backoff up to chain.callback.max_retries (default 5); the receiver (Sync Framework) is expected to be idempotent on (correlation_id, validation_run_id).
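
A sketch of both sides of this contract: a sender that retries with exponential backoff, and a receiver that is idempotent on (correlation_id, validation_run_id). The one-second base delay, the doubling factor, and the reading of `max_retries` as retries after the first attempt are assumptions; the names are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


def deliver_with_backoff(send: Callable[[], bool],
                         max_retries: int = 5,
                         base_delay_s: float = 1.0,
                         sleep: Callable[[float], Any] = time.sleep) -> bool:
    """Try once, then retry up to max_retries times, doubling the delay each time."""
    for attempt in range(max_retries + 1):
        if send():
            return True
        if attempt < max_retries:
            sleep(base_delay_s * (2 ** attempt))
    return False


class ChainReceiver:
    """Receiver side: a repeated callback for the same key is ignored."""

    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def on_validation_complete(self, correlation_id: str,
                               payload: Dict[str, Any]) -> str:
        key = (correlation_id, payload["validation_run_id"])
        if key in self._seen:
            return "duplicate_ignored"
        self._seen[key] = payload
        return "recorded"
```

The `sleep` parameter is injectable so the backoff schedule can be checked without waiting. Idempotency on the receiver matters because a retry can arrive after an earlier attempt succeeded but its response was lost.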

When correlation_id is present, the resulting VALIDATION_RUN row carries it through; the Sync Framework's GET /chains/{correlation_id} view joins this report into the end-to-end chain. trigger_source is one of ingest, hop2, manual. callback_url is only honored when trigger_source = hop2; for ingest and manual, callers poll GET /validation/runs/{id} instead.
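
The rule that `callback_url` is honored only for `hop2` can be expressed as a small request-handling step. The helper name is hypothetical, and rejecting unknown trigger sources with an error is an assumption about how the API would behave.

```python
from typing import Any, Dict, Optional

TRIGGER_SOURCES = {"ingest", "hop2", "manual"}


def effective_callback_url(request: Dict[str, Any]) -> Optional[str]:
    """Return the callback URL to use, or None when the caller is expected to poll."""
    source = request.get("trigger_source")
    if source not in TRIGGER_SOURCES:
        raise ValueError(f"unknown trigger_source: {source!r}")
    if source == "hop2":
        return request.get("callback_url")
    # For ingest and manual runs any supplied callback_url is ignored.
    return None


hop2_request = {
    "refNum": "ABC123",
    "trigger_source": "hop2",
    "correlation_id": "chain-ABC123-20260518-7f3a",
    "callback_url": "https://sync-service.internal/chains/chain-ABC123-20260518-7f3a/validation-complete",
}
manual_request = {"refNum": "ABC123", "trigger_source": "manual",
                  "callback_url": "https://example.invalid/ignored"}
```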

Findings response (excerpt):

{
  "validation_run_id": "uuid",
  "refNum": "ABC123",
  "entity": "claim",
  "rule_library_version": "2026.05.18-3",
  "correlation_id": "chain-ABC123-20260518-7f3a",
  "findings": [
    {
      "rule_id": "row_count_parity",
      "severity": "Critical",
      "observed": 99987,
      "expected": 100000,
      "evidence": { "source_query": "...", "target_query": "..." }
    }
  ]
}

7. Configuration & Feature Flags

Setting	Default	Notes
`defaults.row_count.tolerance_pct`	`0`	Locked by playbook.
`defaults.fk_integrity.orphans_allowed`	`0`	Locked by playbook.
`defaults.coverage.min_pct`	`99.9`	Locked by playbook.
`defaults.deep_diff.sample_size_strategy`	`sqrt(N) clamped 100..10000`	Configurable per entity.
`severity.routing.critical_blocks`	`true`	Cannot be flipped without `senior-approver` approval.
`override.expiry_hours`	`24`	Overrides auto-expire.
`dlq.disposition_sla.critical_hours`	`24`	SLA before aging escalates.
`chain.callback.max_retries`	`5`	Max retries for the outbound `POST {callback_url}` to the Sync Framework when `trigger_source = hop2`. Exponential backoff.
`chain.callback.timeout_seconds`	`10`	Per-attempt timeout for the callback POST.
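
The default `sqrt(N) clamped 100..10000` strategy works out as below. Two details are assumptions not stated in the table: the square root is floored, and the sample never exceeds the number of rows available.

```python
import math


def deep_diff_sample_size(row_count: int, lo: int = 100, hi: int = 10_000) -> int:
    """sqrt(N) clamped to [lo, hi], capped at the number of rows that exist."""
    if row_count <= 0:
        return 0
    clamped = max(lo, min(hi, math.isqrt(row_count)))
    return min(clamped, row_count)


small = deep_diff_sample_size(50)
medium = deep_diff_sample_size(100_000)
large = deep_diff_sample_size(1_000_000_000)
```

For 100,000 rows the square root is about 316, so 316 rows are sampled. Entities below 10,000 rows hit the lower clamp of 100, and entities above 100 million rows hit the upper clamp of 10,000.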

8. Security & Compliance

AuthN/AuthZ: Roles — pipeline (service account for ingest), sync-service (service account used by Sync Framework Hop 2 to call POST /validation/run and GET /validation/runs?correlation_id=…), qa-eng, viewer, senior-approver, auditor. The framework also makes outbound calls back into the Sync Framework on validation completion; that outbound call uses a dv-service identity issued to this framework.
Signed reports: Each report carries rule version, entity declaration version, content hash, and signature. Any change in inputs produces a different hash; tampering is detectable.
Audit: Every gate decision, override, and report retrieval is logged. Override events are page-worthy for security awareness.
Data sensitivity: Reports contain aggregate counts, not raw rows. Deep-diff evidence stores field-level diffs but redacts values classified as PII via MDS.
Retention: 7 years for reports and findings (compliance baseline; configurable to satisfy specific regulators).

9. Observability

Signal	Source	Used by
`validation.runs.total{status, severity}`	Engine	Dashboards
`validation.findings.total{rule_id, severity}`	Engine	Drill-down
`validation.run.duration_seconds{rule_id}`	Engine	Perf budget tracking
`validation.gate.blocked_total{refNum, entity, step}`	Gate API	Alerts when blocked > threshold
`validation.gate.overrides_total{approver}`	Gate API	Security review
`validation.dlq.size{refNum, entity, reason}`	DLQ Inspector	Disposition tracking
`validation.report.signature_failures`	Reporting	Alerts (>0 is integrity bug)

Alerts:

Alert	Condition	Page?
Gate blocked > SLA	Same `(refNum, entity)` blocked for >24h	Yes (QA lead)
Override recorded	Any non-zero increment	Yes (security)
Report signature failure	Any non-zero increment	Yes (security)
DLQ aging past disposition SLA	Any Critical DLQ record older than SLA	Yes (QA + domain owner)

10. Deployment

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f

    subgraph IngestVPC[" Ingest VPC "]
        Ingest[Ingest Pipeline]:::external
        DLQ[(DLQ)]:::store
        History[(metadata.<br/>ingest_run_history)]:::store
    end

    subgraph ValVPC[" Validation Service VPC "]
        Eng[Rule Engine workers]:::internal
        Cls[Classifier]:::internal
        Rep[Reporting]:::internal
        Gate[Gate API]:::internal
        RuleStore[(Rule Library)]:::store
        ReportStore[(Report Store)]:::store
    end

    Ingest -->|run event| Eng
    Ingest --> DLQ --> Eng
    Eng --> Cls --> Rep --> ReportStore
    Rep --> History
    Cls --> Gate
    RuleStore --> Eng

    Dash[Leadership Dashboard]:::internal
    ReportStore --> Dash

Concern	Decision
Where it runs	Stateless workers, horizontally scalable; pipeline triggers via run-event topic.
Storage	Rule Library + Report Store: durable; backed up daily.
Promotion	Same image promoted Dev → Staging → INTQA → Production-of-validation-service.
Cross-customer	Workers are pooled across customers; rule executions are stateless per refNum + entity.

11. Operational Runbook Hooks

Runbook	When to use
`runbook-validation-rerun.md`	Manually re-run validation on a specific run id.
`runbook-validation-gate-override.md`	Procedure (and required approvals) to override a Critical gate.
`runbook-dlq-triage.md`	Working through DLQ inventory by reason.
`runbook-validation-seeded-error-test.md`	Periodic re-verification of the framework against seeded errors.
`runbook-validation-rule-library-upgrade.md`	Rolling out a new rule library version.

12. Failure Modes & Recovery

Failure	Detection	Recovery	Owner
Validation worker crashes mid-run	Run in `Running` past SLA	Run marked Failed; pipeline retries idempotently.	QA Eng
Rule library out of sync with entity decls	Planner cannot resolve rule for entity	Run rejected with clear error; library or decl updated; re-run.	QA Eng + Domain Owner
False negative — framework misses a real error	Detected by Step 5 seeded-error re-run	Engine fix; rule library bump; all customers re-validated for that rule.	QA Eng
Sampled deep-diff misses signal due to small sample	Detected by spot-check or production incident	Adjust sample size strategy; re-run.	QA Eng
Gate API unreachable	Workflow tool gets error	Workflow tool fails closed (cannot advance) until restored.	Platform Eng
Report signing key rotated	Old signatures still verifiable; new reports signed with new key	Standard key-rotation procedure.	Security
Override misuse	Override count spikes	Security review of approver activity.	Security
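
The fail-closed behaviour for an unreachable Gate API can be sketched on the workflow-tool side as follows. The helper is hypothetical; `fetch_gate_outcome` stands in for whatever client call performs `GET /gate/{refNum}/{entity}/{step}`.

```python
from typing import Callable


def can_advance(fetch_gate_outcome: Callable[[], str]) -> bool:
    """Advance only on an explicit "pass"; any error or other answer blocks."""
    try:
        return fetch_gate_outcome() == "pass"
    except Exception:
        # Fail closed: an unreachable gate is treated the same as a failing gate.
        return False


def unreachable() -> str:
    raise ConnectionError("gate API unreachable")
```

Comparing against the literal `pass` rather than checking for "not fail" means an unexpected response body also blocks advancement.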

13. Open Questions / Decisions Log

#	Question	Status	Notes
1	Final sample size strategy for sampled deep-diff per entity volume	Open	Measure during Phase 1.
2	DLQ disposition SLA per severity	Open	Aligning with playbook hypercare clock.
3	Whether assertion authoring lives in MDS or a dedicated assertion registry	Decided — MDS	Assertions ride alongside entity declarations.
4	Report storage location (alongside MDS vs separate)	Open	Pending Platform decision.
5	Whether overrides require two senior approvers (vs one)	Open	Security review.

14. Glossary Reference

See _shared/glossary.md.

Low-Level Designs

End-to-End Validation

Use-case orchestrator, mock service, capture + diff engine, dashboard.

LLD — End-to-End Validation Framework

Low-Level Design · Workstreams 5 + 6 + 7 of 9 · MongoDB → Aurora PostgreSQL Migration
Companion document: HLD · Sources: WS5_Use_Case_Validation.docx, WS6_Mocking_Framework.docx, WS7_Payload_Validation.docx

Version	Date	Author	Notes
v0.1	2026-05-18	Migration Architecture	Initial draft unifying WS5 + WS6 + WS7

1. ⭐ Executive Summary

In one sentence: A unified framework with three cooperating sub-systems — Use Case Orchestrator (workflows + parity + performance), Mock Service (every external partner, scenario-switchable), and Payload Diff Engine (byte-level diff against production baselines) — wired together so a single test run produces functional, performance, and payload evidence.

If you only read one section, read this.

What we build: Three sub-systems backed by one orchestrator, one dashboard, and one evidence pipeline. INTQA outbound traffic is intercepted, mocked, and diffed in the same flow.
Key design decisions: Mock interception sits at the platform's outbound HTTP layer (single integration point per app, not per-call). Network-level controls also block direct egress as a safety net. Baselines are anonymized using MDS classifications. Diff classification is rule-driven; only Blocking differences gate workflow advancement.
Operating model: Customer teams own use-case catalogs. Integration owners own classification rules and per-integration sign-off. Platform Eng owns the mock service and diff engine.
Phasing: Built once. Phase 2 customers reuse infrastructure with their own catalogs and baselines; bucketing makes catalogs largely shared.
Open items at v0.1: scenario versioning strategy across customers; performance baseline staleness threshold; hypercare clock automation policy.

2. Reference Architecture (recap from HLD)

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40

    Orch[Use Case Orchestrator]:::internal --> Plat[Migrated Platform - INTQA]:::lower
    Plat -. outbound .-> Mock[Mock Service]:::internal
    Plat -. outbound .-> Cap[Outbound Capture]:::internal
    Cap --> Diff[Diff Engine]:::internal
    Base[(Production Baselines)]:::prod --> Diff

Full diagrams in the HLD. The LLD elaborates each sub-system.

3. Component Designs (C4 L3)

3.1 Use Case Orchestrator (WS5)

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92

    Trigger[Trigger<br/>scheduler · operator · CI]:::internal --> Loader[Catalog Loader]:::internal
    Loader --> Picker[Scenario Picker<br/>per refNum]:::internal
    Picker --> Runner[Scenario Runner]:::internal
    Runner --> ParityCmp[Parity Comparator]:::internal
    Runner --> PerfGate[Performance Gate]:::internal
    ParityCmp --> Resulter[Result Aggregator]:::internal
    PerfGate --> Resulter
    Resulter --> ReporterAPI[Reporting API]:::internal

    Catalog[(Use Case Catalog)]:::store --> Loader
    ParityBase[(Parity Baseline)]:::store --> ParityCmp
    PerfBase[(Perf Baseline)]:::store --> PerfGate

Module	Responsibility
Catalog Loader	Resolves the customer's catalog version pinned to the migration wave.
Scenario Picker	Selects scenarios that apply to this run (full suite, smoke-only, perf-only).
Scenario Runner	Executes the scenario against the migrated platform using the DAAL SDK; collects outputs, side-effects, and timings. The platform's Postgres Lower state is refreshed by Sync Framework Hop 2; suite runs may optionally pin to a specific `correlation_id` from the chain to guarantee reproducible test conditions.
Parity Comparator	Compares outputs/key state to parity baseline; supports tolerance config (key-set match, semantic-equal vs byte-equal).
Performance Gate	Compares observed latency/throughput against perf baseline; flags variance beyond agreed threshold.
Result Aggregator	One pass/fail per scenario; rolls up to per-customer readiness.
Reporting API	Surfaces results to the shared dashboard and the customer evidence package.
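
The interplay between the Performance Gate and the Result Aggregator can be sketched as follows. This is a minimal illustration, not the orchestrator's actual code: the names `ScenarioOutcome`, `perf_within_tolerance`, `scenario_passes`, and `customer_ready` are hypothetical, and treating any p95 latency above the tolerance as a failure is an assumption. The 15% default mirrors `perf.tolerance_pct`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioOutcome:
    scenario: str
    parity_match: bool      # verdict from the Parity Comparator
    observed_p95_ms: int    # timing collected by the Scenario Runner
    baseline_p95_ms: int    # value from the perf baseline


def perf_within_tolerance(observed_ms: int, baseline_ms: int, tolerance_pct: int = 15) -> bool:
    """True when observed latency is at most tolerance_pct above the baseline.

    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    return observed_ms * 100 <= baseline_ms * (100 + tolerance_pct)


def scenario_passes(outcome: ScenarioOutcome, tolerance_pct: int = 15) -> bool:
    """One pass/fail per scenario: parity must match and perf must be in tolerance."""
    return outcome.parity_match and perf_within_tolerance(
        outcome.observed_p95_ms, outcome.baseline_p95_ms, tolerance_pct
    )


def customer_ready(outcomes: list, tolerance_pct: int = 15) -> bool:
    """Roll-up: ready only if at least one scenario ran and every scenario passed."""
    return bool(outcomes) and all(scenario_passes(o, tolerance_pct) for o in outcomes)
```

An empty result set is treated as not ready, so a run that silently executes nothing cannot be mistaken for a green suite.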

3.2 Mock Service + Scenario Manager (WS6)

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f

    AppCall[App outbound call<br/>via Mock Adapter]:::external --> Router[Request Router<br/>per integration]:::internal
    Router --> Selector[Scenario Selector<br/>active scenario per test run]:::internal
    Selector --> Matcher[Request Matcher<br/>method · path · headers · body]:::internal
    Matcher --> Builder[Response Builder<br/>shape · headers · status · latency profile]:::internal
    Builder --> Out[Response to app]:::external
    Webhook[Webhook Trigger<br/>incoming events]:::internal --> Plat[App webhook endpoint]:::external

    ScenStore[(Scenario Catalog)]:::store --> Selector
    BehRef[(Behavior Reference<br/>captured from prod)]:::store --> Builder
    Log[(Mock Interaction Log)]:::store
    Router --> Log
    Builder --> Log

Module	Responsibility
Request Router	One entry point per integration; mapped by hostname + path.
Scenario Selector	The active scenario is set per test run (default `happy_path`); switched via API for failure-mode tests.
Request Matcher	Matches incoming requests by configurable criteria (method, path pattern, header values, body shape).
Response Builder	Produces a response that mirrors real partner behavior — correct status, headers, body shape, and a configurable latency profile so timing is realistic.
Webhook Trigger	Drives inbound webhook scenarios — the framework can call the app's webhook endpoint to simulate partner-initiated events.
Mock Interaction Log	Every request and response captured for review and replay.

Scenarios per integration (standard catalog): happy_path, validation_error, auth_failure, timeout, rate_limit, malformed_response, partial_success, 5xx_server_error.
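
To make the selector and matcher roles concrete, here is a small sketch of scenario switching for a single integration. The `MockRule` and `MockIntegration` types, the `payments` integration, and its `/charges` path are all hypothetical; only the scenario names come from the standard catalog above. Latency simulation and body matching are left out for brevity.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MockRule:
    method: str
    path_pattern: str   # regular expression matched against the whole path
    status: int
    body: dict
    latency_ms: int = 0


class MockIntegration:
    def __init__(self, name, scenarios):
        self.name = name
        self.scenarios = scenarios      # scenario name -> list of MockRule
        self.active = "happy_path"      # mirrors mock.default_scenario

    def activate(self, scenario):
        """Scenario Selector: switch the active scenario for this integration."""
        if scenario not in self.scenarios:
            raise KeyError(f"unknown scenario {scenario!r} for {self.name}")
        self.active = scenario

    def respond(self, method, path):
        """Request Matcher: return the first rule of the active scenario that matches."""
        for rule in self.scenarios[self.active]:
            if rule.method == method.upper() and re.fullmatch(rule.path_pattern, path):
                return rule
        return None


# Hypothetical integration with two scenarios from the standard catalog.
payments = MockIntegration("payments", {
    "happy_path": [MockRule("POST", r"/charges", 201, {"state": "accepted"}, latency_ms=120)],
    "rate_limit": [MockRule("POST", r"/charges", 429, {"error": "rate_limited"})],
})
```

Calling `payments.activate("rate_limit")` changes what the next `respond("POST", "/charges")` returns without touching the app under test, which is the property failure-mode tests rely on.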

3.3 Outbound Payload Capture (WS7)

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f

    App[Migrated App]:::external --> Adapter["Mock Adapter<br/>(also captures)"]:::internal
    Adapter --> Capture[Capture Recorder]:::internal
    Capture --> Anon[Anonymizer<br/>MDS classifications]:::internal
    Anon --> Store[(Captured Payload Store)]:::internal
    Adapter --> MockSvc[Mock Service]:::internal

The same adapter that routes calls to the Mock also captures the payload. One integration point, two outputs.
Anonymization runs before persistence — captures never store raw PII regardless of source.
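
The "anonymize before persistence" ordering can be illustrated with a short sketch. The `CLASSIFICATIONS` map stands in for MDS classifications, whose real format is not described here; the field paths, the `"PII"` label, and the `[REDACTED]` placeholder are assumptions made for the example.

```python
# Hypothetical stand-in for MDS classifications: dotted field path -> label.
CLASSIFICATIONS = {
    "customer.email": "PII",
    "customer.name": "PII",
    "order.total": "PUBLIC",
}

REDACTED = "[REDACTED]"


def anonymize(payload: dict, classifications: dict, prefix: str = "") -> dict:
    """Return a copy of payload with every PII-classified field replaced."""
    out = {}
    for key, value in payload.items():
        path = f"{prefix}{key}"
        if classifications.get(path) == "PII":
            out[key] = REDACTED
        elif isinstance(value, dict):
            out[key] = anonymize(value, classifications, prefix=f"{path}.")
        elif isinstance(value, list):
            # List items share the parent path; non-dict items pass through unchanged.
            out[key] = [
                anonymize(item, classifications, prefix=f"{path}.") if isinstance(item, dict) else item
                for item in value
            ]
        else:
            out[key] = value
    return out


def capture(store: list, payload: dict) -> None:
    """Capture Recorder: the raw payload never reaches the store."""
    store.append(anonymize(payload, CLASSIFICATIONS))
```

Because `capture` is the only write path in this sketch, there is no code path that persists an un-anonymized payload; the input dictionary itself is left unmodified.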

3.4 Diff Engine + Classification (WS7)

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef gate fill:#ffebee,stroke:#c62828,color:#b71c1c

    Cap[(Captured Payload)]:::store --> Normalize[Normalizer<br/>canonical form]:::internal
    Base[(Production Baseline)]:::store --> Normalize2[Normalizer]:::internal
    Normalize --> Comparator[Byte / Tree Comparator]:::internal
    Normalize2 --> Comparator
    Comparator --> Classifier[Classification Engine]:::internal
    Rules[(Classification Rule Book)]:::store --> Classifier
    Classifier --> Output{Diff class}
    Output -- Blocking --> Gate[Blocking Gate]:::gate
    Output -- Notable --> Review[Active Review]:::internal
    Output -- Tolerable --> Logged[Logged in aggregate]:::internal
    Output -- Expected --> Ignored[Ignored]:::internal
    Classifier --> Report[Diff Report Writer]:::internal

Module	Responsibility
Normalizer	Canonicalizes payloads (sort JSON keys, normalize whitespace, neutralize known-volatile fields like timestamps and IDs) before comparison.
Byte / Tree Comparator	Two modes: byte-level for strict comparison, tree-level for structured payloads to produce field-path-aware diffs.
Classification Engine	Applies the rule book to each diff: `Expected` / `Tolerable` / `Notable` / `Blocking`.
Blocking Gate	A `Blocking` classification gates the customer's payload-validation sign-off.
Diff Report Writer	Per-integration, per-customer report; aggregated across a test run.
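
A rough sketch of the Normalizer and the two comparator modes from the table above follows. The volatile field names and the diff tuple shape are hypothetical; lists are compared as whole values to keep the example short.

```python
import json

# Hypothetical volatile field names; the real list comes from diff.normalizer.ignore_fields.
VOLATILE_FIELDS = frozenset({"timestamp", "request_id", "nonce"})


def normalize(payload, volatile=VOLATILE_FIELDS):
    """Replace known-volatile values with a fixed marker, recursively."""
    if isinstance(payload, dict):
        return {k: "<volatile>" if k in volatile else normalize(v, volatile)
                for k, v in payload.items()}
    if isinstance(payload, list):
        return [normalize(v, volatile) for v in payload]
    return payload


def canonical_bytes(payload) -> bytes:
    """Byte mode: sorted keys and fixed separators give a stable encoding."""
    return json.dumps(normalize(payload), sort_keys=True, separators=(",", ":")).encode("utf-8")


def tree_diff(baseline, captured, path=""):
    """Tree mode: return (field_path, kind, baseline_value, captured_value) tuples."""
    diffs = []
    if isinstance(baseline, dict) and isinstance(captured, dict):
        for key in sorted(set(baseline) | set(captured)):
            child = f"{path}.{key}" if path else key
            if key not in captured:
                diffs.append((child, "removed", baseline[key], None))
            elif key not in baseline:
                diffs.append((child, "added", None, captured[key]))
            else:
                diffs.extend(tree_diff(baseline[key], captured[key], child))
    elif type(baseline) is not type(captured):
        diffs.append((path, "type_changed", baseline, captured))
    elif baseline != captured:
        diffs.append((path, "value_changed", baseline, captured))
    return diffs
```

Byte mode answers "are these identical after canonicalization?" cheaply; tree mode is the one that yields field paths for the classification step.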

Classification matrix (from the source playbook):

Class	Examples	Action
Expected	Timestamps · generated IDs · per-call nonces	Ignored automatically. Not surfaced.
Tolerable	Field-reordering where partner accepts any order · cosmetic whitespace	Logged. Reviewed in aggregate. Does not block.
Notable	New optional fields · removed deprecated fields · format changes where partner accepts both	Reviewed before cutover. Confirmed acceptable to partner.
Blocking	Type changes · missing required fields · changed business values · broken response parsing	Workflow cannot advance until resolved.
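
One way a rule-driven classifier over the matrix above might look is sketched here. The rule format (path glob, diff kind, class), the example rules, and the choice to default unmatched diffs to `Blocking` are all assumptions; the actual rule book schema lives in `rule_spec` and is not defined in this document.

```python
from fnmatch import fnmatchcase

# Ordered from least to most severe, matching the classification matrix.
SEVERITY = ["Expected", "Tolerable", "Notable", "Blocking"]

# Hypothetical rule book: (field-path glob, diff-kind glob, classification). First match wins.
RULES = [
    ("*.trace_id", "*", "Expected"),
    ("metadata.*", "added", "Notable"),
    ("*", "type_changed", "Blocking"),
]


def classify(path, kind, rules=RULES, default="Blocking"):
    """Return the class of one diff; unmatched diffs fall back to the safest class."""
    for pattern, kind_pattern, classification in rules:
        if fnmatchcase(path, pattern) and fnmatchcase(kind, kind_pattern):
            return classification
    return default


def worst(classifications):
    """Most severe class in a collection; an empty collection counts as Expected."""
    return max(classifications, key=SEVERITY.index, default="Expected")


def gate_blocked(diffs):
    """diffs: iterable of (field_path, kind) pairs for one integration."""
    return worst(classify(path, kind) for path, kind in diffs) == "Blocking"
```

Defaulting to `Blocking` means a new, unanticipated difference stops the workflow until an integration owner writes a rule for it, rather than slipping through unreviewed.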

3.5 Reporting + Dashboard

flowchart LR
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92

    OrchRes[Orchestrator Results]:::internal --> Agg[Aggregator]:::internal
    DiffRes[Diff Reports]:::internal --> Agg
    MockLog[Mock Interaction Logs]:::internal --> Agg
    Agg --> Pkg[Evidence Packager]:::internal
    Pkg --> Reports[(Per-customer Evidence)]:::store
    Pkg --> Dash[Pass/Fail Dashboard]:::internal

The dashboard shows per-customer readiness, with drill-down to per-scenario and per-integration detail. The evidence package is what the customer team signs at Step 9.

4. Data Model

erDiagram
    USE_CASE ||--o{ SCENARIO_RUN : "executed as"
    USE_CASE }o--|| CUSTOMER_CATALOG : "belongs to"
    SCENARIO_RUN ||--|| PARITY_RESULT : "produces"
    SCENARIO_RUN ||--|| PERF_RESULT : "produces"
    SCENARIO_RUN ||--o{ CAPTURED_PAYLOAD : "may capture"
    CAPTURED_PAYLOAD ||--|| DIFF_RECORD : "compared into"
    DIFF_RECORD }o--|| PRODUCTION_BASELINE : "vs"
    DIFF_RECORD }o--|| CLASSIFICATION_RULE : "classified by"
    MOCK_SCENARIO ||--o{ MOCK_INTERACTION : "produced"
    INTEGRATION ||--o{ MOCK_SCENARIO : "has"
    INTEGRATION ||--o{ PRODUCTION_BASELINE : "has"
    CUSTOMER_CATALOG }o--|| CUSTOMER : "belongs to"
    PERF_BASELINE }o--|| USE_CASE : "for"

    USE_CASE {
        uuid id PK
        text refNum
        text name
        text category "journey|smoke|cross_system|edge|perf"
        text version
    }
    CUSTOMER_CATALOG {
        uuid id PK
        text refNum
        text version
        timestamp signed_off_at
        text signed_off_by
    }
    SCENARIO_RUN {
        uuid id PK
        uuid use_case_id FK
        text refNum
        text status "Pass|Fail|Error"
        text pinned_correlation_id "optional: Sync Framework Hop 2 chain run"
        timestamp started_at
        timestamp ended_at
        jsonb outputs
    }
    PARITY_RESULT {
        uuid id PK
        uuid scenario_run_id FK
        text outcome "Match|Mismatch"
        jsonb diff
    }
    PERF_RESULT {
        uuid id PK
        uuid scenario_run_id FK
        int observed_p95_ms
        int baseline_p95_ms
        text outcome "Within|Variance|Regressed"
    }
    PERF_BASELINE {
        uuid id PK
        uuid use_case_id FK
        int baseline_p50_ms
        int baseline_p95_ms
        int baseline_p99_ms
        int tolerance_pct
        timestamp measured_at
    }
    INTEGRATION {
        uuid id PK
        text name
        text owner_team
        text contract_version
    }
    MOCK_SCENARIO {
        uuid id PK
        uuid integration_id FK
        text name "happy_path|validation_error|..."
        jsonb response_spec
        int latency_profile_ms
    }
    MOCK_INTERACTION {
        uuid id PK
        uuid scenario_run_id FK
        uuid mock_scenario_id FK
        jsonb request
        jsonb response
        timestamp at
    }
    CAPTURED_PAYLOAD {
        uuid id PK
        uuid scenario_run_id FK
        uuid integration_id FK
        text refNum
        jsonb request_payload
        jsonb response_payload
        text hash
    }
    PRODUCTION_BASELINE {
        uuid id PK
        uuid integration_id FK
        text refNum
        jsonb request_payload
        jsonb response_payload
        text hash
        timestamp captured_at
    }
    DIFF_RECORD {
        uuid id PK
        uuid captured_payload_id FK
        uuid baseline_id FK
        uuid classification_rule_id FK "rule version applied"
        text classification "Expected|Tolerable|Notable|Blocking"
        jsonb diff_tree
    }
    CLASSIFICATION_RULE {
        uuid id PK
        uuid integration_id FK
        text version
        jsonb rule_spec
        text signed_off_by
    }
    CUSTOMER {
        text refNum PK
        text name
    }

Key indexes:

Table	Index	Reason
`SCENARIO_RUN`	`(refNum, started_at desc)`	Customer-readiness timeline.
`MOCK_INTERACTION`	`(scenario_run_id, at)`	Replay/audit of one run.
`CAPTURED_PAYLOAD`	`(scenario_run_id, integration_id)`	Per-integration diff lookups.
`PRODUCTION_BASELINE`	`(integration_id, refNum, captured_at desc)`	Latest baseline per integration per customer.
`DIFF_RECORD`	`(captured_payload_id, classification)`	Blocking-first filtering.

Retention:

Table	Default retention
`SCENARIO_RUN`, `PARITY_RESULT`, `PERF_RESULT`	2 years
`CAPTURED_PAYLOAD`	Until customer sign-off + 1 year
`PRODUCTION_BASELINE`	Refreshed on cadence; old versions retained for 90 days
`DIFF_RECORD`	Co-retained with captured payloads
`MOCK_INTERACTION`	90 days (heavier traffic; archive older runs)

5. Key Workflows

5.1 INTQA Test Run with Mocked Integrations + Payload Diff (happy path)

sequenceDiagram
    autonumber
    participant Trig as Trigger (CI / Operator)
    participant Orch as Use Case Orchestrator
    participant App as Migrated App (INTQA)
    participant Ada as Mock Adapter (in app)
    participant Mock as Mock Service
    participant Cap as Capture
    participant Diff as Diff Engine
    participant Dash as Dashboard

    Trig->>Orch: run customer suite (refNum)
    Orch->>App: execute scenario via DAAL
    App->>Ada: outbound HTTP call
    par interception
        Ada->>Mock: route to active scenario
        Mock-->>Ada: response (shape, status, latency)
    and capture
        Ada->>Cap: record request + response (anonymized)
    end
    Ada-->>App: response
    App-->>Orch: scenario output + timings
    Cap->>Diff: compare to production baseline
    Diff-->>Orch: classification per integration
    Orch->>Orch: parity check + perf check
    Orch->>Dash: aggregated pass/fail

5.2 INTQA Test Run with Forced Failure Scenario (timeout)

sequenceDiagram
    autonumber
    participant Op as Operator
    participant Mock as Mock Service
    participant Orch as Orchestrator
    participant App as Migrated App
    participant Ada as Mock Adapter

    Op->>Mock: switch integration X to "timeout"
    Op->>Orch: re-run scenario Y
    Orch->>App: execute
    App->>Ada: outbound call
    Ada->>Mock: route
    Mock-->>Ada: (no response - simulated timeout)
    Ada-->>App: timeout error after configured latency
    Note over Ada,App: App enters retry/backoff logic
    App-->>Orch: scenario result (handled gracefully?)
    Orch->>Orch: parity check: did app handle timeout per baseline?

5.3 Hypercare Daily Smoke

sequenceDiagram
    autonumber
    participant Cron as Hypercare Scheduler
    participant Orch as Orchestrator
    participant Dash as Dashboard
    participant Triage as Triage Hook (PagerDuty/Slack)

    Cron->>Orch: kick smoke suite for refNum
    Orch->>Orch: run scenarios + parity + perf
    Orch->>Dash: post results
    alt new P1/P2
        Orch->>Triage: alert + reset 7-day clock
    else clean
        Orch->>Dash: increment clean streak
    end

5.4 Hypercare Clock Lifecycle

stateDiagram-v2
    [*] --> Day1: cutover
    Day1 --> Day2: clean
    Day2 --> Day3: clean
    Day3 --> Day4: clean
    Day4 --> Day5: clean
    Day5 --> Day6: clean
    Day6 --> Day7: clean
    Day7 --> SignOffEligible: 7 clean days
    Day2 --> Day1: P1/P2 raised (reset)
    Day3 --> Day1: P1/P2 raised (reset)
    Day4 --> Day1: P1/P2 raised (reset)
    Day5 --> Day1: P1/P2 raised (reset)
    Day6 --> Day1: P1/P2 raised (reset)
    Day7 --> Day1: P1/P2 raised (reset)
    SignOffEligible --> [*]: customer sign-off
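
The clock lifecycle above reduces to counting the current run of consecutive clean days. A minimal sketch, assuming each day is summarized as a single boolean (clean, or a P1/P2 was raised); the function names are illustrative:

```python
def clean_streak(daily_results):
    """Count consecutive clean days at the end of the sequence.

    daily_results is ordered oldest to newest; True means no new P1/P2 that day.
    Any P1/P2 day resets the streak to zero, i.e. back to Day 1.
    """
    streak = 0
    for clean in daily_results:
        streak = streak + 1 if clean else 0
    return streak


def sign_off_eligible(daily_results, required_clean_days=7):
    """True once the current streak reaches hypercare.required_clean_days."""
    return clean_streak(daily_results) >= required_clean_days
```

For example, six clean days followed by a P1 and then six more clean days is not eligible; one further clean day makes it so.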

6. APIs & Contracts

Orchestrator

Method	Path	Purpose	Auth
`POST`	`/runs`	Trigger a suite or scenario subset.	`qa-eng`, `ci`, `operator`
`GET`	`/runs/{id}`	Fetch run status + aggregated result.	`qa-eng`, `viewer`
`GET`	`/results?refNum={x}`	Latest results per customer.	`viewer`, `leadership`

Mock Service

Method	Path	Purpose	Auth
`POST`	`/mocks/{integration}/scenarios/{name}/activate`	Set active scenario.	`qa-eng`
`POST`	`/mocks/webhooks/{integration}/{event}`	Trigger an inbound webhook event into the app.	`qa-eng`
`GET`	`/mocks/{integration}/interactions?run_id={x}`	List captured interactions for a run.	`qa-eng`, `viewer`
`*`	`/integrations/{integration}/*`	The mock endpoint itself; receives outbound traffic from the app.	service-to-service

Payload Diff

Method	Path	Purpose	Auth
`GET`	`/diffs?run_id={x}`	Diff results for a run.	`qa-eng`, `integration-owner`
`GET`	`/diffs/{id}`	Specific diff record.	`qa-eng`, `integration-owner`
`POST`	`/baselines/{integration}/refresh`	Refresh production baselines from latest capture window.	`integration-owner`
`POST`	`/classification/rules`	Update classification rule book (versioned).	`integration-owner`
`POST`	`/diffs/{id}/accept`	Accept a Notable diff with rationale. Blocking diffs cannot be accepted through this endpoint.	`integration-owner`

Gate

Method	Path	Purpose	Auth
`GET`	`/gate/{refNum}/{step}`	Workflow gate decision (rolls up parity, perf, blocking diffs).	`pipeline`, `workflow`
`POST`	`/gate/override`	Senior override of a blocking gate.	`senior-approver`
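
The roll-up behind `GET /gate/{refNum}/{step}` could look roughly like the sketch below. The `GateInputs` shape, the response fields, and the rule that an override needs both an approver identity and a non-empty rationale are assumptions drawn from the surrounding sections rather than a defined contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    parity_mismatches: int
    perf_regressions: int
    blocking_diffs: int
    override_by: str = ""          # senior-approver identity; empty when no override
    override_rationale: str = ""   # free-text rationale required for an override


def gate_decision(inputs: GateInputs) -> dict:
    """Roll parity, perf, and blocking-diff counts into one open/closed decision."""
    reasons = []
    if inputs.parity_mismatches > 0:
        reasons.append("parity")
    if inputs.perf_regressions > 0:
        reasons.append("perf")
    if inputs.blocking_diffs > 0:
        reasons.append("blocking_diff")
    # An override only counts when there is something to override and it is fully documented.
    overridden = (
        bool(reasons)
        and bool(inputs.override_by)
        and bool(inputs.override_rationale.strip())
    )
    return {"open": not reasons or overridden, "reasons": reasons, "overridden": overridden}
```

Keeping `reasons` in the response even when the gate is overridden preserves the audit trail that the weekly override review depends on.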

7. Configuration & Feature Flags

Setting	Default	Notes
`mock.default_scenario`	`happy_path`	Reset between runs.
`mock.latency_profile.default_ms`	match production baseline	Per integration.
`perf.tolerance_pct`	`15`	Configurable per use case.
`diff.normalizer.ignore_fields`	`[timestamps, uuids, nonces]`	Per integration override.
`hypercare.required_clean_days`	`7`	Cannot be lowered without senior approval.
`baseline.refresh.cadence_days`	`7`	Partner contract change triggers immediate refresh.
`network.allow_real_partners`	`false` (INTQA), `true` (prod)	Enforced at infra; this flag is for in-app guardrails.
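
Several settings above allow a narrower override (per integration or per use case). A minimal sketch of that layering, assuming overrides are stored as a nested dictionary keyed by scope name; the `resolve` and `set_required_clean_days` helpers and the `partner_x` scope are hypothetical:

```python
# Defaults copied from the settings table; only a subset is shown.
DEFAULTS = {
    "mock.default_scenario": "happy_path",
    "perf.tolerance_pct": 15,
    "diff.normalizer.ignore_fields": ["timestamps", "uuids", "nonces"],
    "hypercare.required_clean_days": 7,
    "baseline.refresh.cadence_days": 7,
}


def resolve(key, overrides=None, scope=None):
    """Return the override for this scope if one exists, else the global default."""
    scoped = (overrides or {}).get(scope, {})
    return scoped.get(key, DEFAULTS[key])


def set_required_clean_days(current, new_value, senior_approved=False):
    """Guard for hypercare.required_clean_days: lowering it needs senior approval."""
    if new_value < current and not senior_approved:
        raise PermissionError("lowering required clean days needs senior approval")
    return new_value


# Hypothetical per-integration override of the normalizer ignore list.
OVERRIDES = {"partner_x": {"diff.normalizer.ignore_fields": ["timestamps", "uuids", "nonces", "session_token"]}}
```

With these values, `resolve("perf.tolerance_pct", OVERRIDES, "partner_x")` falls back to the default of 15, while the ignore list for `partner_x` picks up the extra field.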

8. Security & Compliance

Network isolation: INTQA outbound to real-partner endpoints is denied at the firewall. The Mock Service is the only allowed destination for integration calls. Unexpected egress triggers an alert.
Anonymization: All production captures pass through the Anonymizer, which uses MDS classifications to redact PII before storage. The same anonymization runs at INTQA capture time as a defense-in-depth measure.
Baseline access: Production baselines are read-only to the diff engine; refresh is authenticated and audited.
Classification rule changes: All rule changes are versioned, reviewed, and signed off by the integration owner. The rule version is recorded on each diff record.
Sign-off artifacts: Per-integration, per-customer sign-off reports are cryptographically signed and can be reproduced on demand.
Mock interactions retention: 90 days, then archived. Hot retention sized for replay use cases.
Override discipline: A blocking-gate override requires a senior-approver identity and a free-text rationale; security reviews overrides weekly.
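
To illustrate what "signed and reproducible" can mean for a sign-off report, the sketch below signs a canonical JSON encoding so that the same report content always yields the same signature. Using HMAC-SHA256 with a shared key is an assumption chosen to keep the example within the standard library; the document does not specify the signature scheme, and a real deployment may well use asymmetric signatures with managed keys.

```python
import hashlib
import hmac
import json


def canonical(report: dict) -> bytes:
    """Stable encoding: key order and whitespace cannot change the bytes."""
    return json.dumps(report, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_report(report: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the canonical report bytes."""
    return hmac.new(key, canonical(report), hashlib.sha256).hexdigest()


def verify_report(report: dict, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the recomputed signature with the stored one."""
    return hmac.compare_digest(sign_report(report, key), signature)
```

Regenerating a report from stored diff records and verifying it against the original signature is one way to demonstrate reproducibility: any change to a classification or field value makes verification fail.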

9. Observability

Signal	Source	Used by
`e2e.scenario_runs.total{customer, outcome}`	Orchestrator	Dashboards
`e2e.parity_mismatches.total{customer, scenario}`	Parity Comparator	Trend / spike alerts
`e2e.perf_variance.observed_pct{customer, scenario}`	Perf Gate	Perf dashboard
`mock.requests.total{integration, scenario, status}`	Mock Service	Mock health
`mock.unexpected_destinations.total`	Egress monitor	Alerts (>0 is a security incident)
`diff.records.total{integration, classification}`	Diff Engine	Drilldown
`diff.blocking.total{integration, customer}`	Diff Engine	Alerts
`hypercare.clean_streak_days{customer}`	Hypercare Scheduler	Dashboard / sign-off readiness

Alerts:

Alert	Condition	Page?
Unexpected outbound destination from INTQA	Any non-zero increment	Yes (security + network)
Blocking diff present at scheduled cutover window	`diff.blocking.total > 0`	Yes (QA + integration owner)
Hypercare clock reset	New P1/P2 during hypercare	Yes (migration lead + customer team)
Mock service down	Health check fails	Yes (platform)
Baseline drift detected	New diff classifications spike vs prior runs	Yes (integration owner)

10. Deployment

flowchart TB
    classDef internal fill:#f5f5f5,stroke:#616161,color:#212121
    classDef lower fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef store fill:#ede7f6,stroke:#5e35b1,color:#311b92
    classDef prod fill:#e0f2f1,stroke:#00695c,color:#004d40
    classDef external fill:#fce4ec,stroke:#ad1457,color:#880e4f

    subgraph INTQAEnv[" INTQA VPC (per customer / shared) "]
        App[Migrated App<br/>with Mock Adapter]:::lower
        Mock[Mock Service]:::internal
        Cap[Capture Workers]:::internal
        Diff[Diff Engine workers]:::internal
        OrchSvc[Orchestrator]:::internal
        ScenStore[(Scenario Catalog)]:::store
        Captured[(Captured Payload Store)]:::store
        DiffStore[(Diff Records)]:::store
    end

    subgraph ProdEnv[" Production VPC "]
        ProdApp[Existing Platform]:::prod
        CapWindow["Production Capture Window<br/>(read-only)"]:::prod
    end

    subgraph BaseEnv[" Baseline Storage "]
        BaseStore[(Anonymized Production Baselines)]:::store
    end

    subgraph Net[" Firewall "]
        FW[INTQA Egress Policy<br/>real partners DENIED]:::internal
    end

    OrchSvc --> App
    App --> Mock
    App --> Cap
    Cap --> Captured
    Cap --> Diff
    Diff --> DiffStore
    BaseStore --> Diff

    ProdApp --> CapWindow --> BaseStore

    App -. blocked .- FW
    FW -.->|alerts on attempts| Sec[Security]:::external

    Dash[Pass/Fail Dashboard]:::internal
    DiffStore --> Dash
    OrchSvc --> Dash

    style INTQAEnv stroke-dasharray: 5 5,stroke:#1565c0
    style ProdEnv stroke-dasharray: 5 5,stroke:#00695c

Concern	Decision
Mock Adapter location	Library inside the migrated app — same library used in INTQA (routes to mock) and prod (passes through to real partner). One integration surface for the app team.
Network policy	INTQA egress to real-partner endpoints blocked at firewall + verified by egress monitor.
Capture pipeline	Workers stateless, horizontally scaled; capture store is durable.
Production capture	Read-only window from existing platform's outbound traffic; anonymized before persistence.
Baseline storage	Centralized, versioned; access audited.

11. Operational Runbook Hooks

Runbook	When to use
`runbook-e2e-suite-run.md`	Operator-triggered customer suite.
`runbook-e2e-scenario-switch.md`	Activating a non-default mock scenario for failure-mode testing.
`runbook-e2e-baseline-refresh.md`	Refreshing production baselines for an integration.
`runbook-e2e-blocking-diff-resolution.md`	Triage and resolution of a Blocking diff.
`runbook-hypercare-daily.md`	Daily smoke + reconciliation during the 2-week window.
`runbook-hypercare-clock-reset.md`	Procedure when a P1/P2 occurs during hypercare.
`runbook-egress-incident.md`	Procedure when INTQA attempts to reach a real partner.

12. Failure Modes & Recovery

Failure	Detection	Recovery	Owner
Mock service unavailable	Health check + run errors	Orchestrator pauses; on-call restores; runs replayed.	Platform Eng
INTQA attempts to reach real partner	Egress monitor alert	Network policy enforces denial; security investigates and fixes config.	Security + DevOps
Mock scenario drift from real partner	Diff blocking spike on integration	Refresh production baseline; update mock; re-run.	Integration owner
Parity baseline missing for a new scenario	Orchestrator skips comparison with warning	Capture baseline from existing system; re-run.	Customer team
Perf variance is a real regression	Perf gate fails	Engineering tunes and re-runs; if not fixable, cutover is blocked.	Eng + DevOps
Capture pipeline lag	Backlog metric	Scale workers; backfill diffs.	Platform Eng
Classification rule lets a real issue through	Production incident or spot-check	Tighten rule, re-version, re-run all affected customers' diffs.	Integration owner + Security
Hypercare clock reset due to P1/P2	Hypercare alert	Triage + fix; clock restarts from Day 1; the reset cannot be waived without senior approval.	Migration lead + customer team
Customer signs off without all scenarios green	Pre-sign-off validator catches the gap	Sign-off blocked until green or each gap explicitly accepted with rationale.	Migration lead

13. Open Questions / Decisions Log

#	Question	Status	Notes
1	Scenario versioning strategy across customers (shared vs forked)	Open	Phase 2 bucketing depends on this.
2	Performance baseline staleness threshold	Open	Likely 30 days; pending DevOps measurement.
3	Hypercare clock automation policy — auto-reset vs manual	Decided — auto-reset on P1/P2 trigger	Reset event recorded.
4	Whether webhook simulation is in scope at Phase 1	Decided — yes	Critical for inbound event handling.
5	Where the Production Baseline Library lives (alongside MDS vs separate)	Open	Pending Platform decision.
6	Diff engine multi-modal (byte + tree) — both required?	Decided — both	Byte mode for strict; tree for human-readable diff paths.
7	Mock Adapter rollback if it misbehaves in prod	Decided — feature flag pass-through default in prod	Adapter is a no-op in production unless explicitly toggled.

14. Glossary Reference

See _shared/glossary.md.