Files
apophis-fastify/APOPHIS_REMEDIATION_PLAN.md
T
John Dvorak d0523fcc2d fix: harden engine, enrich failure diagnostics, close adoption gaps
- P0: CLI verify now honors  test budget with seeded multi-sample
- P0: Observe sampling enforced via Math.random() gate in hook-validator
- P1: Remove misleading undici-mock-agent isolation option
- P1: Qualify reuses shared discoverRouteDetails() with warnings
- P1: Chaos/scenario config exposed via preset schema
- P1: README/docs limitations updated to current state
- P2: Nested response annotations prefer 2xx deterministically
- P2: --changed documented as heuristic in verify.md

- Add observe sink tests (sampling 0/1, sink failure non-interference)
- Add verify runs regression tests (scale, determinism, variants)
- Add configured-scenario qualify test (independent of OAuth fixture)
- Add coverageBreakdown to qualify artifacts (per-gate route coverage)
- Add production-style observe example with real sink in docs/observe.md
- Add nightly/staging vs PR gating guidance to docs/qualify.md

- Enrich VerifyFailure with formula-aware diagnostics:
  status:201 => 'HTTP 200', body field checks => actual values
- Remove stale observe CLI activation message
- Document outbound mocks as process-global in getting-started.md
- Refresh APOPHIS_ADOPTION_AUDIT.md with current state

903 tests pass, build clean, typecheck clean.
2026-05-21 20:39:36 -07:00

690 lines
21 KiB
Markdown

# APOPHIS Remediation And Refactor Plan
Date: 2026-05-21
Source assessment: `APOPHIS_ADOPTION_AUDIT.md`
Goal: resolve the adoption-blocking gaps identified in the audit and move APOPHIS from "credible pilot" to "safe team-standard tool" for Fastify v5 ESM services.
## Summary
The audit findings are fixable without rewriting APOPHIS. The necessary refactor is not a large architecture replacement; it is mostly consolidation of duplicate execution paths, making discovery state explicit, and tightening product claims around observe/qualify.
Recommended sequencing:
1. Fix route discovery observability and messages first.
2. Make `createFastify()` fail loudly when requested runtime registration fails.
3. Fix route filter wildcard matching.
4. Unify CLI `verify` with the programmatic PETIT execution path.
5. Decide whether `observe` is a real runtime feature now or a config-validation command for this release.
6. Expand or relabel `qualify` chaos coverage.
7. Document outbound mock isolation and introduce a scoped alternative.
8. Clean up docs, config extension points, and TypeScript loading semantics.
## Target End State
APOPHIS should provide these guarantees to adopters:
- If contracts exist on routes, `doctor` and `verify` can tell whether they were discovered with full schema metadata or only as schema-less fallback paths.
- `apophis verify` and `fastify.apophis.contract()` exercise routes through the same request generation and contract evaluation engine.
- `createFastify({ apophis: { runtime: 'error' } })` either enables runtime validation or fails loudly.
- Route filters are literal-safe with only documented wildcard semantics.
- `observe` docs match actual behavior, or observe emits real non-blocking events to configured sinks.
- `qualify` reports actual coverage and does not imply broad chaos coverage when only one route is sampled.
- Outbound mocks are clearly documented as process-global unless a scoped runtime is used.
- Config remains strict for APOPHIS-owned fields while allowing documented user metadata.
## Track 1: Route Discovery Metadata Visibility
Priority: P0
Risk: medium
Primary files:
- `src/domain/discovery.ts`
- `src/infrastructure/discovery-hook.ts`
- `src/cli/core/app-loader.ts`
- `src/plugin/builders.ts`
- `src/cli/commands/doctor/checks/routes.ts`
- `src/test/integration.test.ts`
- `src/test/cli/doctor-consistency.test.ts`
- `docs/getting-started.md`
### Problem
APOPHIS can discover route paths through `printRoutes()` after routes are registered, but it cannot recover route schemas or behavioral annotations from that fallback. Users can see "routes found" but "no contracts found" without understanding that discovery degraded.
### Refactor Path
1. Replace `discoverRoutes()` internal return flow with metadata-aware discovery.
Add an internal result type:
```ts
export interface DiscoveryResult {
routes: RouteContract[]
source: 'captured' | 'legacy-routes-array' | 'print-routes' | 'none'
hasSchemaMetadata: boolean
warnings: string[]
}
```
2. Keep `discoverRoutes()` for compatibility, but implement it as `discoverRouteDetails(instance).routes`.
3. In `discoverRoutesFallback()`, set `source: 'print-routes'` and `hasSchemaMetadata: false`.
4. Update `buildContract()` empty-suite error to distinguish:
- no routes discovered
- routes discovered without schema metadata
- routes discovered with schemas but no `x-ensures`/`x-requires`
5. Add a doctor route check that reports:
- pass: captured routes with schemas
- warn: schema-less fallback route discovery
- fail: no routes discovered
6. Update CLI `verify` artifacts/warnings to include discovery source when no contracts are found.
7. Update docs to say `printRoutes()` fallback can identify paths but cannot recover contract annotations.
### Compatibility Notes
Do not remove `discoverRoutes()`. It is used widely. Add `discoverRouteDetails()` and migrate only the call sites that need diagnostics.
### Acceptance Criteria
- `discoverRouteDetails()` returns `captured + hasSchemaMetadata: true` for routes captured by the plugin/discovery hook.
- `discoverRouteDetails()` returns `print-routes + hasSchemaMetadata: false` for fallback routes.
- `apophis doctor` warns when only schema-less fallback discovery is available.
- `apophis verify` no-contract output says whether contracts are absent or unavailable due to schema-less discovery.
- Existing route discovery tests still pass.
## Track 2: `createFastify()` Runtime Registration Semantics
Priority: P1
Risk: low
Primary files:
- `src/fastify-factory.ts`
- `src/test/integration.test.ts`
- `docs/getting-started.md`
- `README.md`
### Problem
`createFastify()` silently catches all plugin registration failures when runtime validation is requested. This can leave users believing runtime validation is active when it is not.
### Refactor Path
1. Change `CreateFastifyOptions`:
```ts
apophis?: {
runtime?: 'off' | 'warn' | 'error'
discoveryOnly?: boolean
}
```
2. Always install route discovery.
3. If `discoveryOnly === true`, skip plugin registration.
4. If `runtime` is set and not `off`, register the plugin and let failures throw.
5. Add one test that simulates plugin registration failure and asserts a thrown error when runtime is requested.
6. Add one test that `discoveryOnly: true` does not attempt runtime plugin registration.
### Acceptance Criteria
- Runtime registration failures are visible.
- Discovery-only behavior remains explicitly available.
- Docs show `discoveryOnly` only for advanced/diagnostic use.
## Track 3: Safe Route Filter Matching
Priority: P1
Risk: low
Primary files:
- `src/cli/commands/verify/runner.ts`
- `src/infrastructure/wildcard-match.ts`
- `src/test/cli/verify-ux.test.ts`
### Problem
`matchRoutePattern()` builds regexes without escaping non-wildcard regex metacharacters.
### Refactor Path
1. Add a route-specific helper instead of reusing URL target semantics directly:
```ts
export function matchesWildcardPattern(value: string, pattern: string): boolean
```
2. Escape regex metacharacters first.
3. Replace escaped `\*` with `.*` and escaped `\?` with `.` only after escaping.
4. Anchor the regex.
5. Use this helper in `verify/runner.ts`.
6. Add tests for literal `.`, `+`, `(`, `)`, `[`, `]`, `$`, `^`, and `\` in route filters.
### Acceptance Criteria
- Literal route filters match literal paths.
- `*` and `?` retain documented wildcard behavior.
- Invalid user patterns cannot throw regex syntax errors.
## Track 4: CLI Verify And PETIT Engine Unification
Priority: P1
Risk: high
Primary files:
- `src/cli/commands/verify/runner.ts`
- `src/quality/petit-runner.ts`
- `src/quality/petit-command-step.ts`
- `src/quality/route-filter.ts`
- `src/domain/request-builder.ts`
- `src/domain/schema-to-arbitrary.ts`
- `src/cli/commands/verify/index.ts`
- `src/test/cli/verify-ux.test.ts`
- `src/test/cross-operation-support.test.ts`
### Problem
The CLI `verify` path has a simpler request builder and execution loop than the programmatic contract runner. This creates inconsistent behavior and makes the main user workflow less capable than the library API.
### Refactor Path
1. Extract a shared verification core from `runPetitTests()`.
Candidate API:
```ts
export interface ContractExecutionOptions extends TestConfig {
routeFilters?: string[]
profileRoutes?: string[]
failOnNoRoutes?: boolean
failOnNoContracts?: boolean
}
export async function runContractVerification(
fastify: FastifyInjectInstance,
options: ContractExecutionOptions,
deps?: ContractExecutionDeps,
): Promise<ContractVerificationResult>
```
2. Move route filtering into shared helpers.
Current filtering exists in both CLI verify and PETIT route filtering. Collapse these into one module that supports:
- route patterns
- scope filters
- profile routes
- HEAD exclusion
- skipped-route reporting
3. Make CLI `runVerify()` call the shared core.
4. Preserve CLI artifact shape by mapping `ContractVerificationResult` into current `VerifyRunResult`.
5. Keep the existing simple example-body execution only if explicitly needed as a `--sample example` mode. Do not make it the default if the product claim is property-based verification.
6. Add config support for `runs` in the CLI verify path. Today `verifyCommand()` passes timeout but does not fully pass the resolved preset/profile run configuration into the execution core.
7. Add tests for CLI verify that cover:
- generated body values from schema constraints
- querystring generation
- path parameter generation
- variants
- non-POST bodies where Fastify route semantics allow them
- route-level timeouts
- extension build-request hooks, if verify should support them
### Suggested Intermediate Step
Before full unification, replace CLI `buildRouteRequest()` with `buildRequest()` plus `convertSchema()` sampling. This reduces behavior drift while the larger shared core is extracted.
### Acceptance Criteria
- A route tested by `fastify.apophis.contract({ seed, runs })` and `apophis verify --seed` uses the same request generation semantics.
- CLI verify respects configured `runs`.
- CLI verify can exercise query, path, body, headers/variants, and cross-operation contracts.
- Golden CLI output remains stable except for intentional added diagnostics.
## Track 5: Observe Mode Decision And Implementation
Priority: P0
Risk: medium if docs-only, high if implementing runtime observe
Primary files:
- `src/cli/commands/observe/index.ts`
- `src/cli/commands/observe/validator.ts`
- `src/plugin/index.ts`
- `src/infrastructure/hook-validator.ts`
- `src/types/core.ts`
- `docs/observe.md`
- `README.md`
### Problem
Docs imply production runtime visibility and drift detection. Current code validates observe configuration and prints what would be activated.
### Path A: Honest Docs For Current Release
This is the safer short-term path.
1. Rename docs language from "observe activates runtime visibility" to "observe validates runtime-observe readiness".
2. Make `observe` command output explicitly say "no runtime observer is started by this command".
3. Keep `observe --check-config` as the canonical current behavior.
4. Add README "Current Limitations" entry for observe.
Acceptance criteria:
- No docs claim request telemetry is emitted unless a tested implementation exists.
- Platform teams can tell observe is a readiness/config command.
### Path B: Implement Real Observe
This is the product-complete path.
1. Define sink interface:
```ts
export interface ObserveSink {
emit(event: ObserveEvent): void | Promise<void>
}
```
2. Define event schema:
```ts
interface ObserveEvent {
type: 'contract.pass' | 'contract.violation' | 'contract.error'
route: string
method: string
statusCode: number
durationMs: number
formula?: string
error?: string
sampled: boolean
timestamp: string
}
```
3. Extend plugin options with observe config:
```ts
observe?: {
enabled?: boolean
sampling?: number
blocking?: false
sinks?: ObserveSink[]
}
```
4. Reuse runtime validation hook evaluation, but in observe mode never throw and never delay responses waiting for sinks.
5. Implement bounded async sink dispatch:
- fire-and-forget by default
- max queue length
- dropped-event counter
- sink failure isolation
6. Add integration tests:
- passing contract emits pass event when sampled
- failing contract emits violation event and response remains successful
- sink throw does not affect response
- sampling 0 emits nothing
- sampling 1 emits deterministically in tests
Recommended decision: do Path A immediately, then Path B as a dedicated feature milestone.
## Track 6: Qualify Chaos Coverage And Cleanup Outcomes
Priority: P1
Risk: medium
Primary files:
- `src/cli/commands/qualify/runner.ts`
- `src/cli/commands/qualify/chaos-handler.ts`
- `src/cli/commands/qualify/index.ts`
- `src/cli/core/config-loader.ts`
- `src/types/formula.ts`
- `src/infrastructure/cleanup-manager.ts`
- `src/test/cli/qualify-signal.test.ts`
### Problem
Qualify chaos currently picks one route deterministically. Cleanup failures are represented but not wired to real cleanup outcomes.
### Refactor Path
1. Extend `ChaosConfig`:
```ts
strategy?: 'one' | 'all' | 'sample' | 'routes'
sampleSize?: number
routes?: string[]
```
2. Implement `selectChaosRoutes(routes, chaosConfig, seed)` as a pure helper.
3. Run chaos for all selected routes and return `chaosResults: ChaosRunResult[]` instead of a single optional result.
4. Preserve a compatibility field temporarily if needed:
```ts
chaosResult?: ChaosRunResult
chaosResults: ChaosRunResult[]
```
5. Update artifacts and renderers to report:
- planned chaos route count
- executed chaos route count
- applied chaos count
- skipped chaos routes and reasons
6. Wire `CleanupManager` into qualify runner instead of simulated empty cleanup failures.
7. Update docs to describe default strategy. Recommended default: `sample` with `sampleSize: 1`, explicitly labeled as sampled.
### Acceptance Criteria
- Qualify artifacts can prove exactly which routes were chaos-tested.
- Users can request all-route chaos explicitly.
- Cleanup failures reflect actual cleanup manager failures.
- Existing deterministic seed tests remain stable.
## Track 7: Outbound Mock Isolation
Priority: P1
Risk: medium
Primary files:
- `src/infrastructure/outbound-mock-runtime.ts`
- `src/infrastructure/production-safety.ts`
- `src/quality/petit-runner.ts`
- `src/quality/stateful-runner.ts`
- `docs/quality.md`
- `docs/getting-started.md`
### Problem
Outbound mocks patch `globalThis.fetch`. This is simple but process-global and unsafe for parallel suites.
### Refactor Path
1. Document current global behavior immediately.
2. Add a runtime mode:
```ts
isolation?: 'global-fetch' | 'undici-mock-agent'
```
3. Implement `undici-mock-agent` for consumers that use undici/fetch-compatible clients.
4. Add a global install guard with owner metadata:
```ts
activeRuntimeId: string
installedAt: Error stack or timestamp
```
Error messages should say which runtime currently owns the global patch.
5. Add tests for overlapping installs and restore order.
6. Add docs with recommended test-runner settings:
- run outbound mock tests serially
- isolate by process
- prefer scoped MockAgent where possible
### Acceptance Criteria
- Users are warned when they choose global mocks.
- Overlapping global installs fail with actionable diagnostics.
- Scoped mock path exists for teams that can use it.
## Track 8: Documentation Maturity Alignment
Priority: P2
Risk: low
Primary files:
- `README.md`
- `docs/getting-started.md`
- `docs/observe.md`
- `docs/qualify.md`
- `docs/verify.md`
- `docs/troubleshooting.md`
### Problem
Docs sometimes describe intended platform behavior rather than current implementation behavior.
### Refactor Path
1. Add `Current Limitations` to README:
- route discovery fallback loses schemas
- observe is config readiness unless real observe is implemented
- qualify chaos default is sampled unless configured otherwise
- outbound mocks are process-global unless scoped mode is used
2. Add `Recommended Integration`:
- use `createFastify()` for new apps
- register APOPHIS/discovery before routes for existing apps
- run `doctor` and confirm schema-backed discovery
3. Update `getting-started.md` to avoid saying nested response annotations are selected by actual status code unless implemented.
4. Add a migration note for teams with already-registered routes.
5. Keep docs smoke tests updated.
### Acceptance Criteria
- No doc makes a runtime/coverage claim that lacks implementation and tests.
- Quickstart remains short but links to limitations.
- Docs smoke tests pass.
## Track 9: TypeScript Entrypoint Loading Semantics
Priority: P2
Risk: medium
Primary files:
- `src/cli/core/app-loader.ts`
- `src/cli/core/config-loader.ts`
- `src/test/cli/init.test.ts`
- `docs/cli.md`
- `docs/troubleshooting.md`
### Problem
The loader says TypeScript entrypoints require `tsx`, but it uses dynamic import directly. This is environment-sensitive.
### Refactor Path
Choose one policy.
Policy A: JS-only installed CLI.
- Reject `.ts` app/config entrypoints unless `process.execArgv` includes a TS loader.
- Error tells users to export a JS app entrypoint or run through `tsx`.
Policy B: Built-in TS loading.
- Detect `.ts` entrypoints.
- Use `tsx` programmatically or spawn a subprocess through `tsx`.
- Keep dependency/devDependency implications clear.
Recommended policy: A for now. It is simpler, honest, and avoids magic loader behavior in installed CLIs.
### Acceptance Criteria
- Installed CLI behavior is deterministic.
- `.ts` entrypoint errors are clear and tested.
- Docs match actual supported path.
## Track 10: Config Extensibility Namespace
Priority: P2
Risk: low
Primary files:
- `src/cli/core/config-loader.ts`
- `src/test/cli/config-validation.test.ts`
- `docs/cli.md`
### Problem
Strict unknown-key rejection is useful, but teams need a place for internal metadata.
### Refactor Path
1. Add top-level `metadata?: object` to config schema.
2. Allow `x-*` keys at top-level and inside profiles/presets/environments without validation beyond JSON object compatibility.
3. Document that APOPHIS will never interpret `metadata` or `x-*` fields unless promoted in a future major version.
4. Keep all APOPHIS-owned fields strict.
### Acceptance Criteria
- Unknown typo like `rouets` still fails.
- `metadata.owner = 'platform'` passes.
- `x-team-policy` passes.
- Tests cover top-level and nested metadata.
## Cross-Cutting Test Plan
Run after each track:
```bash
npm run typecheck
npm run build
```
Run before merging a track:
```bash
npm run test:src
npm run test:cli
npm run test:docs
```
Add targeted tests for each track before refactoring where feasible. For high-risk tracks, add characterization tests first.
## Suggested Milestones
### Milestone 1: Safety And Honesty
Tracks:
- Track 1: discovery visibility
- Track 2: `createFastify()` loud failures
- Track 3: route filter escaping
- Track 8 Path A docs updates for current limitations
Outcome: fewer silent failures; docs align with current behavior.
### Milestone 2: Verify Quality
Tracks:
- Track 4: CLI verify/PETIT unification
- Track 9: TypeScript loading policy
- Track 10: config metadata namespace
Outcome: the main CLI workflow reflects the real engine and is easier to adopt in teams.
### Milestone 3: Platform Features
Tracks:
- Track 5 Path B: real observe implementation
- Track 6: qualify chaos coverage and cleanup outcomes
- Track 7: outbound mock isolation
Outcome: production/platform-facing claims become technically defensible.
## Work Breakdown Estimate
| Track | Size | Risk | Suggested Owner |
|---|---:|---|---|
| Discovery metadata visibility | M | Medium | Core/Fastify integration |
| `createFastify()` semantics | S | Low | Core/Fastify integration |
| Route filter escaping | S | Low | CLI |
| CLI verify/PETIT unification | L | High | Quality engine + CLI |
| Observe decision/docs | S | Low | Docs + CLI |
| Real observe implementation | L | High | Runtime/platform |
| Qualify chaos coverage | M | Medium | CLI quality |
| Outbound mock isolation | M | Medium | Runtime/testing infra |
| Docs maturity alignment | S | Low | Docs |
| TypeScript loading policy | M | Medium | CLI |
| Config metadata namespace | S | Low | CLI config |
## Dependency Graph
- Track 1 should happen before Track 4, because verify unification needs clear discovery diagnostics.
- Track 3 can happen anytime.
- Track 2 can happen anytime.
- Track 5 Path A should happen immediately if Path B will not ship in the same release.
- Track 6 can happen independently of Track 4 but should reuse shared route filtering after Track 4 if possible.
- Track 7 can happen independently but affects PETIT/stateful/qualify if scoped mocks are introduced.
- Track 8 depends on decisions from Tracks 5, 6, and 7.
- Track 9 should happen before making stronger CLI adoption claims.
- Track 10 can happen anytime.
## Release Recommendation
For the next release, do not attempt every deep refactor at once. Ship a release focused on safety, honesty, and mainline CLI correctness:
1. Discovery visibility and doctor warning.
2. `createFastify()` loud failure semantics.
3. Safe route filter matching.
4. Docs current-limitations update.
5. Initial CLI verify request-generation convergence, even if full PETIT unification takes another release.
Then schedule real observe, full verify unification, qualify coverage, and outbound mock isolation as dedicated milestones with their own acceptance tests.