Specshift methodology v1.0

Public mirror of the scoring rules. Published at pragma.wentzel.ai/specshift/methodology and updated on every version bump.

What Specshift measures

How well a documentation site serves an LLM trying to use the platform the docs describe. Not "are the docs pretty" — "would an agent reading these docs end up at a correct answer / a working call / a passing build?"

Suites (v1)

Suite      Question it answers                                                          Default weight
retrieval  Can a vector index over these docs answer realistic developer questions?     1.0
agent      Can a tool-use agent complete a representative task using only these docs?   1.0
structure  Are the docs organised so a model can navigate without context overflow?     1.0
drift      Do the docs describe the platform's actual current behaviour?                1.0

Future suite: oscal (compliance crosswalk for FedRAMP-targeting platforms). Deferred to v0.3 — ships after first 3PAO partnership lands (PRAG/09).

Scoring

Every Test produces a score in [0, 1]. NaN means the test could not run (e.g. timeout, infrastructure failure) and is excluded from the suite roll-up. Suite scores are weighted averages of their tests; the overall score is the weighted average of suite scores.
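As a sketch, the suite roll-up can be expressed as below (the helper name and data shape are illustrative, not the engine's actual API):

```python
import math

def suite_score(tests):
    """Weighted average of test scores for one suite.

    `tests` is a list of (score, weight) pairs; a score of NaN means the
    test could not run and is excluded from the roll-up entirely.
    Returns NaN when no test in the suite produced a score.
    """
    ran = [(score, weight) for score, weight in tests if not math.isnan(score)]
    if not ran:
        return math.nan
    total_weight = sum(weight for _, weight in ran)
    return sum(score * weight for score, weight in ran) / total_weight

# A timed-out test (NaN) is excluded, not treated as zero.
print(suite_score([(1.0, 1.0), (0.5, 1.0), (math.nan, 1.0)]))  # 0.75
```

The overall score applies the same weighted average one level up, across suite scores.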

Reports are reproducible: methodologyVersion + engineVersion are pinned in every Scorecard. Re-running the same suite on the same target with the same engine + methodology MUST produce a deterministic score (or document the source of randomness).

Disputes

Customers may dispute a score per the public correction-and-dispute policy at PRAG/06. Each dispute writes a tamper-evident audit-chain record (PRAG-019) — the chain hash is published alongside the public ruling.

Versioning

  • Bumping the methodology version is a load-bearing event. The methodology-pin CI gate (PRAG-016) blocks merges that change the rules without:

  1. A new entry in VERSIONS (registry)
  2. A CHANGELOG.md entry describing what changed and why
  3. A limitations.md update describing what's still NOT measured

See limitations.md for what v1.0 does NOT measure.

v1 patches

v1.1 (2026-05-04) — non-breaking scoring patch

This patch adapts the suite implementations to modern SPA-style developer documentation without changing the public scoring contract (Test still emits [0,1], Suite still rolls up via weighted average, the methodology version pin remains v1.0). Changes:

  • Retrieval + Agent corpora: each expects / successCriteria entry is now a synonym group — ANY synonym in the group counts as a hit. The previous single-phrase form is still accepted. This stops Specshift penalising sites for using "auth token" instead of "api key" when both refer to the same concept.

  • Retrieval + Agent fetch: the suites now visit a small bounded set of in-domain links from the landing (≤6 sub-pages, doc-shaped paths only — quickstart, auth, api, reference, sdk, etc.) and, when the landing is a marketing splash with a discoverable /docs entrypoint, pivot to that root before crawling. SPA shells whose prose only renders client-side are no longer scored as if their HTML were the entire docs surface.

  • Retrieval + Agent partial credit: once at least one expected concept is matched, the test scores no lower than 0.5 — a 1-of-3 match shouldn't drop the test to 33%; that would conflate retrieval recall with a structural defect. A zero hit still scores 0.

  • Structure: heuristics return partial credit instead of a binary pass/fail. The sitemap check is now NaN (informational, not a hard zero) when no sitemap is found; anchored-headings is skipped on SPA shells with zero anchored ids in the initial HTML; and a new link-density test rewards landings that surface enough internal links to be navigable. Like retrieval, structure also pivots to /docs when the landing is a marketing splash.

  • Drift: added https-canonical and landing-reachable checks so the suite produces a real score on sites that don't surface the 401 claim language on the landing. Tightened the robots.txt check so Disallow: /api/ no longer trips a site-wide-block false positive.

  • Engine: when every test in a suite is NaN, the suite score is still NaN (so dashboards can show "incomplete") but contributes a neutral 0.5 to the overall roll-up rather than vanishing entirely — failed suites must pull the overall down, not silently inflate it.
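The synonym-group form from the corpora change above can be sketched like this (a minimal illustration — real corpus entries carry more structure than a bare string or list):

```python
def hits(expects, text):
    """Count matched `expects` entries against fetched page text.

    Each entry is either a single phrase (the pre-v1.1 form) or a
    synonym group — ANY phrase in the group counts as a hit for that
    entry, so "auth token" vs "api key" is no longer a miss.
    """
    haystack = text.lower()
    count = 0
    for entry in expects:
        group = [entry] if isinstance(entry, str) else entry
        if any(phrase.lower() in haystack for phrase in group):
            count += 1
    return count

page = "Pass your auth token in the Authorization header."
print(hits([["api key", "auth token"], "rate limit"], page))  # 1
```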
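The bounded fetch can be sketched as a link filter over the landing page's hrefs (the doc-shaped path list, the substring matching, and the helper name are all illustrative, not the shipped heuristic):

```python
from urllib.parse import urljoin, urlparse

# Illustrative path fragments that mark a link as "doc-shaped".
DOC_SHAPED = ("docs", "quickstart", "auth", "api", "reference", "sdk", "guide")

def pick_subpages(base_url, hrefs, limit=6):
    """Keep at most `limit` unique in-domain, doc-shaped links."""
    base_host = urlparse(base_url).netloc
    picked = []
    for href in hrefs:
        url = urljoin(base_url, href)
        parts = urlparse(url)
        if parts.netloc != base_host:
            continue  # out-of-domain link
        if not any(seg in parts.path.lower() for seg in DOC_SHAPED):
            continue  # marketing / non-doc path
        if url not in picked:
            picked.append(url)
        if len(picked) == limit:
            break
    return picked

print(pick_subpages("https://example.com/",
                    ["/docs/quickstart", "/pricing", "https://x.com/docs", "/api"]))
# ['https://example.com/docs/quickstart', 'https://example.com/api']
```

The /docs pivot is the same idea one step earlier: when the landing yields no doc-shaped links but exposes a /docs entrypoint, re-run the filter from that root.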
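The partial-credit floor reduces to a one-line rule (hypothetical function name):

```python
def hit_score(hit_count, expected_count):
    """Retrieval/agent test score under the v1.1 partial-credit rule:
    zero hits score 0, any hit floors the score at 0.5, and above the
    floor the score is the plain hit fraction."""
    if hit_count == 0:
        return 0.0
    return max(0.5, hit_count / expected_count)

print(hit_score(0, 3))  # 0.0
print(hit_score(1, 3))  # 0.5
print(hit_score(3, 3))  # 1.0
```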
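The link-density test can be sketched as proportional credit up to a navigability threshold (the threshold here is a made-up illustration, not the shipped value):

```python
def link_density_score(internal_link_count, navigable_at=10):
    """Partial credit for landing navigability: proportional credit
    below `navigable_at` internal links, full credit at or beyond it."""
    return min(1.0, internal_link_count / navigable_at)

print(link_density_score(4))   # 0.4
print(link_density_score(25))  # 1.0
```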
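The tightened robots.txt check comes down to one distinction — a sketch of it is below. Real robots.txt parsing is more involved (the stdlib's `urllib.robotparser` handles the general case); this shows only why `Disallow: /api/` must not read as a site-wide block:

```python
def blocks_whole_site(robots_txt):
    """True only when a wildcard agent disallows the site root exactly.
    Path-scoped rules like `Disallow: /api/` do not count."""
    wildcard = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            wildcard = line.split(":", 1)[1].strip() == "*"
        elif wildcard and line.lower().startswith("disallow:"):
            if line.split(":", 1)[1].strip() == "/":
                return True
    return False

print(blocks_whole_site("User-agent: *\nDisallow: /api/"))  # False
print(blocks_whole_site("User-agent: *\nDisallow: /"))      # True
```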
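The engine change above — an all-NaN suite contributing a neutral 0.5 to the overall — can be sketched as follows (illustrative helper, not the engine's API):

```python
import math

def overall_score(suites):
    """Weighted average of (suite_score, weight) pairs. A suite whose
    every test was NaN stays NaN on the dashboard, but contributes a
    neutral 0.5 here so an incomplete suite can't silently inflate
    (or be dropped from) the overall."""
    total_weight = sum(weight for _, weight in suites)
    acc = sum((0.5 if math.isnan(score) else score) * weight
              for score, weight in suites)
    return acc / total_weight

# retrieval 1.0, agent 0.5, structure incomplete (NaN), drift 0.5
print(overall_score([(1.0, 1.0), (0.5, 1.0), (math.nan, 1.0), (0.5, 1.0)]))  # 0.625
```

Excluding the NaN suite instead would have yielded about 0.667 — the neutral 0.5 is what keeps an incomplete run from scoring above a complete one.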

The corpus on disk is still labelled v1.0 and remains replay-stable for any Scorecard pinned to methodology v1.0. v1.2 will re-evaluate whether the synonym-group form deserves a methodology version bump.

Methodology version is pinned in every Specshift report. Spot a discrepancy between this page and a report you ran? Get in touch via the contact form — discrepancies are tracked publicly under PRAG/05.