Files
apophis-fastify/APOPHIS_REMEDIATION_PLAN.md
T
John Dvorak d0523fcc2d fix: harden engine, enrich failure diagnostics, close adoption gaps
- P0: CLI verify now honors  test budget with seeded multi-sample
- P0: Observe sampling enforced via Math.random() gate in hook-validator
- P1: Remove misleading undici-mock-agent isolation option
- P1: Qualify reuses shared discoverRouteDetails() with warnings
- P1: Chaos/scenario config exposed via preset schema
- P1: README/docs limitations updated to current state
- P2: Nested response annotations prefer 2xx deterministically
- P2: --changed documented as heuristic in verify.md

- Add observe sink tests (sampling 0/1, sink failure non-interference)
- Add verify runs regression tests (scale, determinism, variants)
- Add configured-scenario qualify test (independent of OAuth fixture)
- Add coverageBreakdown to qualify artifacts (per-gate route coverage)
- Add production-style observe example with real sink in docs/observe.md
- Add nightly/staging vs PR gating guidance to docs/qualify.md

- Enrich VerifyFailure with formula-aware diagnostics:
  status:201 => 'HTTP 200', body field checks => actual values
- Remove stale observe CLI activation message
- Document outbound mocks as process-global in getting-started.md
- Refresh APOPHIS_ADOPTION_AUDIT.md with current state

903 tests pass, build clean, typecheck clean.
2026-05-21 20:39:36 -07:00

21 KiB

APOPHIS Remediation And Refactor Plan

Date: 2026-05-21

Source assessment: APOPHIS_ADOPTION_AUDIT.md

Goal: resolve the adoption-blocking gaps identified in the audit and move APOPHIS from "credible pilot" to "safe team-standard tool" for Fastify v5 ESM services.

Summary

The audit findings are fixable without rewriting APOPHIS. The necessary refactor is not a large architecture replacement; it is mostly consolidation of duplicate execution paths, making discovery state explicit, and tightening product claims around observe/qualify.

Recommended sequencing:

  1. Fix route discovery observability and messages first.
  2. Make createFastify() fail loudly when requested runtime registration fails.
  3. Fix route filter wildcard matching.
  4. Unify CLI verify with the programmatic PETIT execution path.
  5. Decide whether observe is a real runtime feature now or a config-validation command for this release.
  6. Expand or relabel qualify chaos coverage.
  7. Document outbound mock isolation and introduce a scoped alternative.
  8. Clean up docs, config extension points, and TypeScript loading semantics.

Target End State

APOPHIS should provide these guarantees to adopters:

  • If contracts exist on routes, doctor and verify can tell whether they were discovered with full schema metadata or only as schema-less fallback paths.
  • apophis verify and fastify.apophis.contract() exercise routes through the same request generation and contract evaluation engine.
  • createFastify({ apophis: { runtime: 'error' } }) either enables runtime validation or fails loudly.
  • Route filters are literal-safe with only documented wildcard semantics.
  • observe docs match actual behavior, or observe emits real non-blocking events to configured sinks.
  • qualify reports actual coverage and does not imply broad chaos coverage when only one route is sampled.
  • Outbound mocks are clearly documented as process-global unless a scoped runtime is used.
  • Config remains strict for APOPHIS-owned fields while allowing documented user metadata.

Track 1: Route Discovery Metadata Visibility

Priority: P0

Risk: medium

Primary files:

  • src/domain/discovery.ts
  • src/infrastructure/discovery-hook.ts
  • src/cli/core/app-loader.ts
  • src/plugin/builders.ts
  • src/cli/commands/doctor/checks/routes.ts
  • src/test/integration.test.ts
  • src/test/cli/doctor-consistency.test.ts
  • docs/getting-started.md

Problem

APOPHIS can discover route paths through printRoutes() after routes are registered, but it cannot recover route schemas or behavioral annotations from that fallback. Users can see "routes found" but "no contracts found" without understanding that discovery degraded.

Refactor Path

  1. Replace discoverRoutes() internal return flow with metadata-aware discovery.

    Add an internal result type:

    export interface DiscoveryResult {
      routes: RouteContract[]
      source: 'captured' | 'legacy-routes-array' | 'print-routes' | 'none'
      hasSchemaMetadata: boolean
      warnings: string[]
    }
    
  2. Keep discoverRoutes() for compatibility, but implement it as discoverRouteDetails(instance).routes.

  3. In discoverRoutesFallback(), set source: 'print-routes' and hasSchemaMetadata: false.

  4. Update buildContract() empty-suite error to distinguish:

    • no routes discovered
    • routes discovered without schema metadata
    • routes discovered with schemas but no x-ensures/x-requires
  5. Add a doctor route check that reports:

    • pass: captured routes with schemas
    • warn: schema-less fallback route discovery
    • fail: no routes discovered
  6. Update CLI verify artifacts/warnings to include discovery source when no contracts are found.

  7. Update docs to say printRoutes() fallback can identify paths but cannot recover contract annotations.

Compatibility Notes

Do not remove discoverRoutes(). It is used widely. Add discoverRouteDetails() and migrate only the call sites that need diagnostics.

Acceptance Criteria

  • discoverRouteDetails() returns captured + hasSchemaMetadata: true for routes captured by the plugin/discovery hook.
  • discoverRouteDetails() returns print-routes + hasSchemaMetadata: false for fallback routes.
  • apophis doctor warns when only schema-less fallback discovery is available.
  • apophis verify no-contract output says whether contracts are absent or unavailable due to schema-less discovery.
  • Existing route discovery tests still pass.

Track 2: createFastify() Runtime Registration Semantics

Priority: P1

Risk: low

Primary files:

  • src/fastify-factory.ts
  • src/test/integration.test.ts
  • docs/getting-started.md
  • README.md

Problem

createFastify() silently catches all plugin registration failures when runtime validation is requested. This can leave users believing runtime validation is active when it is not.

Refactor Path

  1. Change CreateFastifyOptions:

    apophis?: {
      runtime?: 'off' | 'warn' | 'error'
      discoveryOnly?: boolean
    }
    
  2. Always install route discovery.

  3. If discoveryOnly === true, skip plugin registration.

  4. If runtime is set and not off, register the plugin and let failures throw.

  5. Add one test that simulates plugin registration failure and asserts a thrown error when runtime is requested.

  6. Add one test that discoveryOnly: true does not attempt runtime plugin registration.

Acceptance Criteria

  • Runtime registration failures are visible.
  • Discovery-only behavior remains explicitly available.
  • Docs show discoveryOnly only for advanced/diagnostic use.

Track 3: Safe Route Filter Matching

Priority: P1

Risk: low

Primary files:

  • src/cli/commands/verify/runner.ts
  • src/infrastructure/wildcard-match.ts
  • src/test/cli/verify-ux.test.ts

Problem

matchRoutePattern() builds regexes without escaping non-wildcard regex metacharacters.

Refactor Path

  1. Add a route-specific helper instead of reusing URL target semantics directly:

    export function matchesWildcardPattern(value: string, pattern: string): boolean
    
  2. Escape regex metacharacters first.

  3. Replace escaped \* with .* and escaped \? with . only after escaping.

  4. Anchor the regex.

  5. Use this helper in verify/runner.ts.

  6. Add tests for literal ., +, (, ), [, ], $, ^, and \ in route filters.

Acceptance Criteria

  • Literal route filters match literal paths.
  • * and ? retain documented wildcard behavior.
  • Invalid user patterns cannot throw regex syntax errors.

Track 4: CLI Verify And PETIT Engine Unification

Priority: P1

Risk: high

Primary files:

  • src/cli/commands/verify/runner.ts
  • src/quality/petit-runner.ts
  • src/quality/petit-command-step.ts
  • src/quality/route-filter.ts
  • src/domain/request-builder.ts
  • src/domain/schema-to-arbitrary.ts
  • src/cli/commands/verify/index.ts
  • src/test/cli/verify-ux.test.ts
  • src/test/cross-operation-support.test.ts

Problem

The CLI verify path has a simpler request builder and execution loop than the programmatic contract runner. This creates inconsistent behavior and makes the main user workflow less capable than the library API.

Refactor Path

  1. Extract a shared verification core from runPetitTests().

    Candidate API:

    export interface ContractExecutionOptions extends TestConfig {
      routeFilters?: string[]
      profileRoutes?: string[]
      failOnNoRoutes?: boolean
      failOnNoContracts?: boolean
    }
    
    export async function runContractVerification(
      fastify: FastifyInjectInstance,
      options: ContractExecutionOptions,
      deps?: ContractExecutionDeps,
    ): Promise<ContractVerificationResult>
    
  2. Move route filtering into shared helpers.

    Current filtering exists in both CLI verify and PETIT route filtering. Collapse these into one module that supports:

    • route patterns
    • scope filters
    • profile routes
    • HEAD exclusion
    • skipped-route reporting
  3. Make CLI runVerify() call the shared core.

  4. Preserve CLI artifact shape by mapping ContractVerificationResult into current VerifyRunResult.

  5. Keep the existing simple example-body execution only if explicitly needed as a --sample example mode. Do not make it the default if the product claim is property-based verification.

  6. Add config support for runs in the CLI verify path. Today verifyCommand() passes timeout but does not fully pass the resolved preset/profile run configuration into the execution core.

  7. Add tests for CLI verify that cover:

    • generated body values from schema constraints
    • querystring generation
    • path parameter generation
    • variants
    • non-POST bodies where Fastify route semantics allow them
    • route-level timeouts
    • extension build-request hooks, if verify should support them

Suggested Intermediate Step

Before full unification, replace CLI buildRouteRequest() with buildRequest() plus convertSchema() sampling. This reduces behavior drift while the larger shared core is extracted.

Acceptance Criteria

  • A route tested by fastify.apophis.contract({ seed, runs }) and apophis verify --seed uses the same request generation semantics.
  • CLI verify respects configured runs.
  • CLI verify can exercise query, path, body, headers/variants, and cross-operation contracts.
  • Golden CLI output remains stable except for intentional added diagnostics.

Track 5: Observe Mode Decision And Implementation

Priority: P0

Risk: medium if docs-only, high if implementing runtime observe

Primary files:

  • src/cli/commands/observe/index.ts
  • src/cli/commands/observe/validator.ts
  • src/plugin/index.ts
  • src/infrastructure/hook-validator.ts
  • src/types/core.ts
  • docs/observe.md
  • README.md

Problem

Docs imply production runtime visibility and drift detection. Current code validates observe configuration and prints what would be activated.

Path A: Honest Docs For Current Release

This is the safer short-term path.

  1. Rename docs language from "observe activates runtime visibility" to "observe validates runtime-observe readiness".

  2. Make observe command output explicitly say "no runtime observer is started by this command".

  3. Keep observe --check-config as the canonical current behavior.

  4. Add README "Current Limitations" entry for observe.

Acceptance criteria:

  • No docs claim request telemetry is emitted unless a tested implementation exists.
  • Platform teams can tell observe is a readiness/config command.

Path B: Implement Real Observe

This is the product-complete path.

  1. Define sink interface:

    export interface ObserveSink {
      emit(event: ObserveEvent): void | Promise<void>
    }
    
  2. Define event schema:

    interface ObserveEvent {
      type: 'contract.pass' | 'contract.violation' | 'contract.error'
      route: string
      method: string
      statusCode: number
      durationMs: number
      formula?: string
      error?: string
      sampled: boolean
      timestamp: string
    }
    
  3. Extend plugin options with observe config:

    observe?: {
      enabled?: boolean
      sampling?: number
      blocking?: false
      sinks?: ObserveSink[]
    }
    
  4. Reuse runtime validation hook evaluation, but in observe mode never throw and never delay responses waiting for sinks.

  5. Implement bounded async sink dispatch:

    • fire-and-forget by default
    • max queue length
    • dropped-event counter
    • sink failure isolation
  6. Add integration tests:

    • passing contract emits pass event when sampled
    • failing contract emits violation event and response remains successful
    • sink throw does not affect response
    • sampling 0 emits nothing
    • sampling 1 emits deterministically in tests

Recommended decision: do Path A immediately, then Path B as a dedicated feature milestone.

Track 6: Qualify Chaos Coverage And Cleanup Outcomes

Priority: P1

Risk: medium

Primary files:

  • src/cli/commands/qualify/runner.ts
  • src/cli/commands/qualify/chaos-handler.ts
  • src/cli/commands/qualify/index.ts
  • src/cli/core/config-loader.ts
  • src/types/formula.ts
  • src/infrastructure/cleanup-manager.ts
  • src/test/cli/qualify-signal.test.ts

Problem

Qualify chaos currently picks one route deterministically. Cleanup failures are represented but not wired to real cleanup outcomes.

Refactor Path

  1. Extend ChaosConfig:

    strategy?: 'one' | 'all' | 'sample' | 'routes'
    sampleSize?: number
    routes?: string[]
    
  2. Implement selectChaosRoutes(routes, chaosConfig, seed) as a pure helper.

  3. Run chaos for all selected routes and return chaosResults: ChaosRunResult[] instead of a single optional result.

  4. Preserve a compatibility field temporarily if needed:

    chaosResult?: ChaosRunResult
    chaosResults: ChaosRunResult[]
    
  5. Update artifacts and renderers to report:

    • planned chaos route count
    • executed chaos route count
    • applied chaos count
    • skipped chaos routes and reasons
  6. Wire CleanupManager into qualify runner instead of simulated empty cleanup failures.

  7. Update docs to describe default strategy. Recommended default: sample with sampleSize: 1, explicitly labeled as sampled.

Acceptance Criteria

  • Qualify artifacts can prove exactly which routes were chaos-tested.
  • Users can request all-route chaos explicitly.
  • Cleanup failures reflect actual cleanup manager failures.
  • Existing deterministic seed tests remain stable.

Track 7: Outbound Mock Isolation

Priority: P1

Risk: medium

Primary files:

  • src/infrastructure/outbound-mock-runtime.ts
  • src/infrastructure/production-safety.ts
  • src/quality/petit-runner.ts
  • src/quality/stateful-runner.ts
  • docs/quality.md
  • docs/getting-started.md

Problem

Outbound mocks patch globalThis.fetch. This is simple but process-global and unsafe for parallel suites.

Refactor Path

  1. Document current global behavior immediately.

  2. Add a runtime mode:

    isolation?: 'global-fetch' | 'undici-mock-agent'
    
  3. Implement undici-mock-agent for consumers that use undici/fetch-compatible clients.

  4. Add a global install guard with owner metadata:

    activeRuntimeId: string
    installedAt: Error stack or timestamp
    

    Error messages should say which runtime currently owns the global patch.

  5. Add tests for overlapping installs and restore order.

  6. Add docs with recommended test-runner settings:

    • run outbound mock tests serially
    • isolate by process
    • prefer scoped MockAgent where possible

Acceptance Criteria

  • Users are warned when they choose global mocks.
  • Overlapping global installs fail with actionable diagnostics.
  • Scoped mock path exists for teams that can use it.

Track 8: Documentation Maturity Alignment

Priority: P2

Risk: low

Primary files:

  • README.md
  • docs/getting-started.md
  • docs/observe.md
  • docs/qualify.md
  • docs/verify.md
  • docs/troubleshooting.md

Problem

Docs sometimes describe intended platform behavior rather than current implementation behavior.

Refactor Path

  1. Add Current Limitations to README:

    • route discovery fallback loses schemas
    • observe is config readiness unless real observe is implemented
    • qualify chaos default is sampled unless configured otherwise
    • outbound mocks are process-global unless scoped mode is used
  2. Add Recommended Integration:

    • use createFastify() for new apps
    • register APOPHIS/discovery before routes for existing apps
    • run doctor and confirm schema-backed discovery
  3. Update getting-started.md to avoid saying nested response annotations are selected by actual status code unless implemented.

  4. Add a migration note for teams with already-registered routes.

  5. Keep docs smoke tests updated.

Acceptance Criteria

  • No doc makes a runtime/coverage claim that lacks implementation and tests.
  • Quickstart remains short but links to limitations.
  • Docs smoke tests pass.

Track 9: TypeScript Entrypoint Loading Semantics

Priority: P2

Risk: medium

Primary files:

  • src/cli/core/app-loader.ts
  • src/cli/core/config-loader.ts
  • src/test/cli/init.test.ts
  • docs/cli.md
  • docs/troubleshooting.md

Problem

The loader says TypeScript entrypoints require tsx, but it uses dynamic import directly. This is environment-sensitive.

Refactor Path

Choose one policy.

Policy A: JS-only installed CLI.

  • Reject .ts app/config entrypoints unless process.execArgv includes a TS loader.
  • Error tells users to export a JS app entrypoint or run through tsx.

Policy B: Built-in TS loading.

  • Detect .ts entrypoints.
  • Use tsx programmatically or spawn a subprocess through tsx.
  • Keep dependency/devDependency implications clear.

Recommended policy: A for now. It is simpler, honest, and avoids magic loader behavior in installed CLIs.

Acceptance Criteria

  • Installed CLI behavior is deterministic.
  • .ts entrypoint errors are clear and tested.
  • Docs match actual supported path.

Track 10: Config Extensibility Namespace

Priority: P2

Risk: low

Primary files:

  • src/cli/core/config-loader.ts
  • src/test/cli/config-validation.test.ts
  • docs/cli.md

Problem

Strict unknown-key rejection is useful, but teams need a place for internal metadata.

Refactor Path

  1. Add top-level metadata?: object to config schema.

  2. Allow x-* keys at top-level and inside profiles/presets/environments without validation beyond JSON object compatibility.

  3. Document that APOPHIS will never interpret metadata or x-* fields unless promoted in a future major version.

  4. Keep all APOPHIS-owned fields strict.

Acceptance Criteria

  • Unknown typo like rouets still fails.
  • metadata.owner = 'platform' passes.
  • x-team-policy passes.
  • Tests cover top-level and nested metadata.

Cross-Cutting Test Plan

Run after each track:

npm run typecheck
npm run build

Run before merging a track:

npm run test:src
npm run test:cli
npm run test:docs

Add targeted tests for each track before refactoring where feasible. For high-risk tracks, add characterization tests first.

Suggested Milestones

Milestone 1: Safety And Honesty

Tracks:

  • Track 1: discovery visibility
  • Track 2: createFastify() loud failures
  • Track 3: route filter escaping
  • Track 8 Path A docs updates for current limitations

Outcome: fewer silent failures; docs align with current behavior.

Milestone 2: Verify Quality

Tracks:

  • Track 4: CLI verify/PETIT unification
  • Track 9: TypeScript loading policy
  • Track 10: config metadata namespace

Outcome: the main CLI workflow reflects the real engine and is easier to adopt in teams.

Milestone 3: Platform Features

Tracks:

  • Track 5 Path B: real observe implementation
  • Track 6: qualify chaos coverage and cleanup outcomes
  • Track 7: outbound mock isolation

Outcome: production/platform-facing claims become technically defensible.

Work Breakdown Estimate

Track Size Risk Suggested Owner
Discovery metadata visibility M Medium Core/Fastify integration
createFastify() semantics S Low Core/Fastify integration
Route filter escaping S Low CLI
CLI verify/PETIT unification L High Quality engine + CLI
Observe decision/docs S Low Docs + CLI
Real observe implementation L High Runtime/platform
Qualify chaos coverage M Medium CLI quality
Outbound mock isolation M Medium Runtime/testing infra
Docs maturity alignment S Low Docs
TypeScript loading policy M Medium CLI
Config metadata namespace S Low CLI config

Dependency Graph

  • Track 1 should happen before Track 4, because verify unification needs clear discovery diagnostics.
  • Track 3 can happen anytime.
  • Track 2 can happen anytime.
  • Track 5 Path A should happen immediately if Path B will not ship in the same release.
  • Track 6 can happen independently of Track 4 but should reuse shared route filtering after Track 4 if possible.
  • Track 7 can happen independently but affects PETIT/stateful/qualify if scoped mocks are introduced.
  • Track 8 depends on decisions from Tracks 5, 6, and 7.
  • Track 9 should happen before making stronger CLI adoption claims.
  • Track 10 can happen anytime.

Release Recommendation

For the next release, do not attempt every deep refactor at once. Ship a release focused on safety, honesty, and mainline CLI correctness:

  1. Discovery visibility and doctor warning.
  2. createFastify() loud failure semantics.
  3. Safe route filter matching.
  4. Docs current-limitations update.
  5. Initial CLI verify request-generation convergence, even if full PETIT unification takes another release.

Then schedule real observe, full verify unification, qualify coverage, and outbound mock isolation as dedicated milestones with their own acceptance tests.