Methodology: how the cust.co retention benchmark is built

What we publish

For every covered public B2B SaaS company, we extract and publish 100+ structured fields when disclosed (40 per-period quantitative metrics that flow into the time series, plus ~60 document-wide context fields on each company's profile). Not every company discloses every field; missing values render as "-" rather than estimates.

Per-period quantitative metrics

Retention - Net Revenue Retention (NRR / NDR / DBNRR), Gross Retention Rate (GRR), Logo Retention
Customer cohorts - total customers, customers over $100K / $1M / $10M ARR
Revenue mix - US / International / Enterprise share of revenue
Concentration - top-10 customer share of revenue
Unit economics - average ACV, annualized churn rate, customers-per-CSM ratio
Commercial structure - multi-year contract %, average contract length, total + current RPO, RPO duration breakdown, new customers added per period, subscription vs services revenue mix
Customer experience - NPS, CSAT, active users, products-per-customer (when disclosed)
Cohorted retention - NRR / GRR broken down by segment (Enterprise / Mid-Market / SMB), geography (US / EMEA / APAC), or customer-size cohort (over $1M / over $100K)
Scale + headcount - total ARR, ARR growth YoY, AE headcount, total employees, lost customers per period

Document-wide CS context

CS team size + structure, customers-per-CSM ratio, CSM coverage model (account-named / pooled / hybrid / digital-led)
Support tier structure, customer segmentation labels
Time-to-value (days from contract to first integration), customer education programs, customer advisory board
Renewal cadence (annual / multi-year / monthly / consumption), pricing model (subscription / consumption / hybrid)
Top customer-facing executive (CCO / CRO with retention scope), executive comp tied to retention metrics, reporting line
Named CS initiatives + descriptions, acknowledged challenges, executive quotes about post-sales motion, competitive dynamics

Derived metrics (computed, not extracted)

Expansion contribution (NRR − GRR), GRR drag, peak-to-current decline, concentration trend, multi-year mix evolution
ARR per CSM, ARR per AE, ARR per FTE, AE-to-CSM ratio, bookings per CSM
Quick ratio, customer lifetime months, RPO coverage years
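As a rough illustration of how the simpler derived metrics follow from extracted fields, here is a minimal sketch. The function names and sample figures are hypothetical, not the production code; only the definition "Expansion contribution = NRR − GRR" comes from the list above. Missing inputs yield no derived value, in line with the rule that undisclosed values are not estimated.

```python
from typing import Optional


def expansion_contribution(nrr: Optional[float], grr: Optional[float]) -> Optional[float]:
    """Expansion contribution in percentage points: NRR minus GRR."""
    if nrr is None or grr is None:
        return None  # an input was not disclosed, so nothing is derived
    return nrr - grr


def ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Generic per-head ratio, e.g. ARR per CSM, ARR per AE, AE-to-CSM."""
    if numerator is None or not denominator:
        return None  # missing or zero denominator: leave the cell blank
    return numerator / denominator


# Hypothetical inputs: NRR 118%, GRR 93%, $500M ARR, 125 CSMs.
print(expansion_contribution(118.0, 93.0))  # 25.0
print(ratio(500_000_000, 125))              # 4000000.0
```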

Sources

SEC EDGAR filings - 10-K (annual), 10-Q (quarterly), 8-K (current), DEF 14A (proxy / exec comp), 20-F (foreign annual), 6-K (foreign current), S-1 (IPO prospectus). For 8-K and 6-K filings we read both the cover document and Exhibit 99.1 (the earnings press release).
Earnings call transcripts - public-domain transcripts via several free aggregators. CFO + CEO commentary often surfaces retention details that don't appear in filings.
IR-page hosted documents - investor presentations, supplemental decks, sustainability reports, hand-curated press releases. Fetched directly from each company's investor relations site.
Founder submissions - private B2B SaaS founders submit their numbers via the free calculator. Submissions are work-email gated and anonymized by default. Aggregated medians are published at community-benchmarks/ with a privacy floor (a cell requires ≥5 submissions before it is published).
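The privacy floor on founder submissions can be sketched as follows. This is an illustrative sketch only: the function name is an assumption, and the threshold of 5 is taken from the text above.

```python
from statistics import median
from typing import Optional, Sequence

PRIVACY_FLOOR = 5  # from the text: a cell needs at least 5 submissions


def community_median(values: Sequence[float]) -> Optional[float]:
    """Return the cell median, or None while the cell is below the floor."""
    if len(values) < PRIVACY_FLOOR:
        return None  # too few submissions to publish without exposing anyone
    return median(values)


print(community_median([100, 105, 110, 120]))       # None
print(community_median([100, 105, 110, 120, 130]))  # 110
```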

Extraction pipeline

Discovery - for each tracked ticker, enumerate every recent SEC filing across the form list above. SEC's full-text search also surfaces companies disclosing retention phrases we haven't yet catalogued.
Pre-filter - long-form documents (10-K / 10-Q / 20-F / 6-K / S-1 / proxies / transcripts / decks) ALWAYS go to LLM extraction. Press releases that don't mention any retention term skip extraction (boilerplate).
Regex extraction - per-company hand-tuned extractors handle the headline NRR disclosures (DBNRR, NDR, dollar net retention, "respectively" patterns, multi-period tables).
LLM extraction - a large language model parses the document slice and emits JSON conforming to a strict schema that covers the structured fields listed under What we publish. Slice budgets differ per source type (e.g., a 10-K gets the largest window so that cohorted retention sections, which often live far from headline NRR, actually reach the model).
Cross-source agreement - when the same value appears in both regex and LLM output, OR in two different filings (press release + 10-Q), confidence is boosted.
Period resolution - fiscal year + quarter inferred from in-text labels OR from the SEC filing's reportDate metadata as a fallback.
Validation - every disclosure runs through quality gates (below) before being marked verified.
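The pre-filter step above amounts to a routing decision. A minimal sketch, assuming hypothetical source-type labels and an illustrative (not exhaustive) list of retention terms:

```python
import re

# Assumed labels for long-form sources; the real catalog may name them differently.
LONG_FORM = {"10-K", "10-Q", "20-F", "6-K", "S-1", "DEF 14A", "transcript", "deck"}

# Illustrative retention vocabulary, not the production term list.
RETENTION_TERMS = re.compile(
    r"net (revenue|dollar) retention|dollar-based net retention"
    r"|gross retention|\bNRR\b|\bNDR\b|\bDBNRR\b",
    re.IGNORECASE,
)


def needs_llm_extraction(source_type: str, text: str) -> bool:
    """Long-form documents always go to the LLM; other documents
    (e.g. press releases) go only if they mention a retention term."""
    if source_type in LONG_FORM:
        return True
    return RETENTION_TERMS.search(text) is not None


print(needs_llm_extraction("10-K", ""))                                  # True
print(needs_llm_extraction("press-release", "We opened a new office."))  # False
```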

Verification gates

An auto-extracted disclosure is marked verified only if it passes the full gate set:

Range - each metric is checked against a sane range (e.g. NRR 50%–250%, RPO $1M–$100B, ACV $100–$100M). Out-of-range values are dropped, not coerced into range.
Period determined - fiscal year + fiscal quarter resolved (or fiscal year for full-year disclosures).
Future-date guard - period end-date must be ≤ today + 30 days AND ≤ filing-date + 7 days. Rejects forward-looking guidance the LLM may have surfaced as a current value.
Self-consistency - GRR ≤ NRR (always); cohorted NRR within plausible spread of headline NRR.
Source trust - SEC filings (10-K / 10-Q / 8-K / DEF 14A / 20-F / 6-K) are accepted at single-source. Non-SEC sources (earnings call transcripts, investor presentation PDFs) require either multi-source corroboration OR a continuity check (NRR within ±5pp of a verified prior period for the same ticker).
Severe-anomaly drop - if a new value differs from the same ticker's verified history by >50pp YoY or >25pp QoQ, it's discarded as a likely extraction bug. Moderate swings (>25pp YoY / >12pp QoQ) are flagged but kept if other gates pass. This gate is bypassed when the LLM detects a material acquisition or divestiture completed in the period, since acquired ARR can legitimately spike NRR.
Cohort scope - each disclosure carries an nrrCohort field tagging whether the figure covers all customers (default) or a specific segment (Enterprise, customers >$1M ARR, etc.). Companies that disclose only segment NRR get tagged at extraction time so the published number isn't confused with company-wide NRR.
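To make the Range, Future-date guard, and Self-consistency gates concrete, here is a minimal sketch for a single NRR candidate. The function signature is hypothetical, and treating the range bounds as inclusive is an assumption; the thresholds themselves come from the list above.

```python
from datetime import date, timedelta
from typing import Optional

NRR_RANGE = (50.0, 250.0)  # percent, from the Range gate (inclusive bounds assumed)


def passes_basic_gates(
    nrr: float,
    grr: Optional[float],
    period_end: date,
    filing_date: date,
    today: date,
) -> bool:
    # Range: out-of-range values are dropped.
    if not (NRR_RANGE[0] <= nrr <= NRR_RANGE[1]):
        return False
    # Future-date guard: both conditions must hold.
    if period_end > today + timedelta(days=30):
        return False
    if period_end > filing_date + timedelta(days=7):
        return False
    # Self-consistency: GRR can never exceed NRR.
    if grr is not None and grr > nrr:
        return False
    return True


today = date(2025, 6, 1)
print(passes_basic_gates(118.0, 93.0, date(2025, 3, 31), date(2025, 5, 5), today))  # True
print(passes_basic_gates(110.0, None, date(2026, 1, 31), date(2025, 5, 5), today))  # False
```

The second call fails the future-date guard: a period ending in January 2026 reported in a May 2025 filing is most likely guidance rather than an actual.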

If a candidate fails the gates, it's discarded - not queued for human review. The system runs fully automated: every disclosure on the public benchmark is auto-verified by the rules above, with no manual sign-off in the loop.
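The severe-anomaly gate described above reduces to a three-way classification. A sketch under stated assumptions: the function and its return labels are hypothetical, and it assumes the acquisition bypass skips the whole gate, including the moderate-swing flag.

```python
def classify_swing(delta_pp: float, horizon: str, material_m_and_a: bool = False) -> str:
    """Classify a change versus verified history.

    delta_pp: change in percentage points (sign is ignored).
    horizon:  "yoy" or "qoq".
    Returns "drop", "flag", or "ok".
    """
    if material_m_and_a:
        return "ok"  # assumption: the bypass skips the entire gate
    severe, moderate = {"yoy": (50.0, 25.0), "qoq": (25.0, 12.0)}[horizon]
    size = abs(delta_pp)
    if size > severe:
        return "drop"  # likely extraction bug
    if size > moderate:
        return "flag"  # kept if the other gates pass
    return "ok"


print(classify_swing(60, "yoy"))        # drop
print(classify_swing(-14, "qoq"))       # flag
print(classify_swing(60, "yoy", True))  # ok
```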

Self-healing data hygiene

The system runs validators on every load - not just on extraction. If a stale entry from an earlier run violates a current rule (e.g. a future-dated row from before the future-date guard existed), it's automatically dropped. This means cleanup commits propagate immediately.
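Validate-on-load can be pictured as filtering the stored rows through the current rule set every time they are read. A minimal sketch, with hypothetical row fields and validator names:

```python
from datetime import date, timedelta
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Validator = Callable[[Row], bool]


def load_clean(rows: Iterable[Row], validators: List[Validator]) -> List[Row]:
    """Keep only rows that satisfy every validator in force right now."""
    return [row for row in rows if all(check(row) for check in validators)]


TODAY = date(2025, 6, 1)  # fixed here so the example is reproducible


def not_future_dated(row: Row) -> bool:
    return row["period_end"] <= TODAY + timedelta(days=30)


stored = [
    {"ticker": "AAA", "period_end": date(2025, 3, 31)},
    {"ticker": "BBB", "period_end": date(2026, 12, 31)},  # stale future-dated row
]
print([r["ticker"] for r in load_clean(stored, [not_future_dated])])  # ['AAA']
```

Because the filter runs at read time, adding a validator to the list removes offending rows from published output without rewriting the store.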

Known-discloser monitoring

Companies known to disclose retention metrics (either from prior extraction history or hand-flagged in the catalog) are tagged knownDiscloser: true. When a known-discloser ticker comes back empty for a run, a diagnostic entry is written to data/scrape-missed.json as a prompt-tuning queue. Per-ticker extractor hints (scraper.notes in the catalog) get fed to the LLM as an additional prompt - useful for companies that use non-standard terminology like "Net Subscriber Retention" or report retention in unusual filing sections.

Forward-looking guidance (separate from actuals)

When a filing mentions a projected NRR for a future period ("we expect FY27 NRR of 110%", "our long-term framework assumes 115%"), the LLM tags it with is_guidance: true and routes it to data/nrr-guidance.json rather than the main disclosures store. The public benchmark, leaderboard, and time-series charts read ONLY actuals - guidance is preserved separately for "guidance vs actual" analytics on company pages.
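The split between actuals and guidance can be sketched as a simple partition on the is_guidance tag. Apart from is_guidance, the record fields below are hypothetical, and the sketch returns two lists rather than writing to the JSON files named above.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def route(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Partition extracted records into (actuals, guidance)."""
    actuals: List[Record] = []
    guidance: List[Record] = []
    for record in records:
        # Records without the tag are treated as actuals (assumption).
        target = guidance if record.get("is_guidance") else actuals
        target.append(record)
    return actuals, guidance


extracted = [
    {"period": "FY25", "nrr": 112.0},
    {"period": "FY27", "nrr": 110.0, "is_guidance": True},
]
actuals, guidance = route(extracted)
print(len(actuals), len(guidance))  # 1 1
```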

Ticker auto-discovery

A weekly job (scripts/discover-tickers.mjs) walks SEC EDGAR's full-text search for retention phrases ("net revenue retention", "dollar-based net retention", etc.) and identifies public companies disclosing retention that aren't yet in our catalog. Each newly-discovered ticker is then classified by an LLM to determine B2B SaaS-ness and pre-populate catalog metadata (vertical, ACV band, GTM motion, fiscal year end). High-confidence candidates appear in data/discovered-tickers.json as an operator review queue; the operator confirms + merges into data/public-companies.json. New IPOs (S-1 filings) are picked up the same way when the --include-s1 flag is passed.

Cell publication rules

A benchmark cell goes live only when:

≥2 distinct companies have verified disclosures in that cell
≥2 verified disclosures total
No single company contributes >50% of the data

This avoids "single-company medians" that would mislead viewers.
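The three publication rules can be checked together. A minimal sketch, assuming "contributes >50% of the data" is measured as a share of verified disclosures in the cell; the function name and input shape are hypothetical.

```python
from collections import Counter
from typing import List


def cell_is_live(tickers: List[str]) -> bool:
    """tickers holds one entry per verified disclosure in the cell."""
    counts = Counter(tickers)
    total = sum(counts.values())
    if len(counts) < 2 or total < 2:
        return False  # fewer than 2 companies or 2 disclosures
    # No single company may contribute more than half of the disclosures.
    return max(counts.values()) / total <= 0.5


print(cell_is_live(["A", "B"]))                 # True
print(cell_is_live(["A", "A", "B"]))            # False
print(cell_is_live(["A", "A", "B", "B", "C"]))  # True
```

Under this reading, a two-company cell goes live only when both companies contribute the same number of disclosures.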

Conflict handling

If a previously-verified value is contradicted by a fresh extraction for the same period, the system auto-resolves:

If the new source is higher-trust (SEC vs transcript/PDF), the new value wins and replaces the existing record
If both sources are the same trust tier, the more recent extraction wins (later filings supersede earlier ones)
Every conflict is logged to the conflict-log for analytics, but the public cell page always shows one canonical value
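The resolution rules above can be sketched as a single function. Two points are assumptions, since the text does not cover them: a fresh value from a lower-trust source leaves the existing record in place, and "more recent" is judged by an extraction date stored on each record. The type and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

TRUST = {"sec": 2, "non-sec": 1}  # SEC filings outrank transcripts and PDFs


@dataclass(frozen=True)
class Record:
    value: float
    tier: str          # "sec" or "non-sec"
    extracted_on: date


def resolve(existing: Record, fresh: Record) -> Record:
    """Return the canonical record for a period after a conflict."""
    if TRUST[fresh.tier] > TRUST[existing.tier]:
        return fresh  # higher-trust source wins
    if TRUST[fresh.tier] == TRUST[existing.tier]:
        # Same tier: the more recent extraction wins.
        return fresh if fresh.extracted_on >= existing.extracted_on else existing
    return existing  # assumption: lower-trust values never overwrite


old = Record(112.0, "non-sec", date(2025, 5, 1))
new = Record(110.0, "sec", date(2025, 5, 10))
print(resolve(old, new).value)  # 110.0
```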

Source attribution

Every published disclosure carries:

The exact source URL (SEC filing, transcript, press release)
The source type (10-K / 10-Q / def-14a / earnings-call-transcript / etc.)
The fiscal period and reporting date
The extraction method used (regex / LLM)
The verification status

You can independently verify any number on this site by following its source link. We invite that.

Update cadence

The scraper runs daily at 06:00 UTC. Daily runs process only filings that are new since the last successful run (cached URLs skip extraction to keep cost predictable). New SEC filings typically appear on the benchmark within 24 hours of being posted by the company.

Two types of change propagate at different speeds:

Validator-rule changes (range limits, future-date guards, severity thresholds) apply on every build automatically - stale entries that violate a new rule are dropped immediately. See Self-healing data hygiene above.
Extractor changes (new schema fields, prompt rewrites, additional sources) only flow into existing data when an operator triggers a manual full re-extraction; cached LLM outputs per source URL aren't reprocessed otherwise.

Historical backfill covers SEC filings since 2018.

Current coverage

180 public B2B SaaS companies with verified retention data
105 of those disclose headline NRR (the rest publish only secondary metrics like GRR, customer counts, or cohorted retention)
913 verified NRR disclosures
823 additional verified disclosures for non-NRR retention metrics
25 live benchmark cells (per the publication rules above)

Found an error?

Every disclosure on this site links to its source URL. If the published value differs from the company's filing, please email us with:

The cust.co URL where the bad value appears
The source filing URL the company actually published
The correct value

We typically re-verify within 24 hours.

Citing this data

The dataset is free to cite. Recommended attribution: "Data from cust.co, sourced from SEC filings and earnings call transcripts."

Per-company JSON is available at /api/companies/<name>.json and per-cell aggregations at /api/cells/<vertical>/<stage>/<acv-band>/. Bulk access on request.

Conflicts of interest

Cust is a customer-success product for VPs of CS. We benchmark public companies because their data is publicly disclosable and verifiable, and because doing so trains the same AI we use in our own product. We do not accept payment to include or exclude any company from this benchmark, nor to weight any disclosure favorably. Companies cannot opt out of being indexed (the underlying data is public). Companies CAN flag inaccuracies, which we re-verify against the source.

Author

Maintained by Laimonas Noreika, CEO and Co-founder of Cust. Reach out via LinkedIn for corrections or to flag a missing source.
