Over the past 18 months, SEO has received three pieces of hard evidence that finally let us replace speculation with fact:
- DOJ Antitrust Trial (Sep 2023 – Aug 2024): Google engineers under oath, internal emails entered as court evidence. Pandu Nayak full testimony PDF
- Content Warehouse API Leak (May 2024): 2,500+ internal modules, 14,000+ attributes. Mike King’s initial technical breakdown (iPullRank) / Rand Fishkin’s first disclosure (SparkToro)
- MWC Exploit (Dec 2024): Mark Williams-Cook’s team pulled live data on 2M sites / 90M queries through an API endpoint vulnerability. Search Engine Land coverage
This is my synthesis of all three sources. If you spot errors, please tell me. Learning in public.
What Each Source Actually Gives Us
| Source | Nature | Unique contribution |
|---|---|---|
| DOJ Trial | Sworn testimony + internal emails | Google’s own names for the mechanisms, and their relative importance |
| API Leak | Internal technical docs | Specific variable names, data structures, module organization |
| MWC Exploit | Live API data (now patched) | Actual numerical distributions across 2M sites / 90M queries |
Key framing: DOJ tells you what Google admits to using. API Leak tells you what those things are called internally. MWC tells you the real numbers. Any single source is arguable; all three together are hard evidence.
AJ Kohn’s SEO-focused breakdown of Pandu Nayak’s testimony is still the deepest first-hand analysis I’ve seen.
Google’s Ranking Architecture (all three sources agree)
Search processing runs through six stages:
1. Crawling
2. Indexing (tiered: Base / Zeppelins / Landfills)
3. Query Processing
4. Core Ranking (Ascorer / Mustang) — T* × Q* × P*
5. Post-Ranking Re-ranking (Twiddler framework)
→ NavBoost / Freshness / QualityBoost etc. running in parallel
6. SERP Generation (including Gemini-generated AI Overviews)
Three things to internalize:
- Core Ranking decides whether you’re in the candidate pool
- Twiddlers decide where you rank inside it
- AIO, Featured Snippets, PAA are all Twiddler-layer outputs — which explains why these SERP features move fast and flicker in/out
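To make the two-layer split concrete, here is a minimal Python sketch. Everything in it (function names, the pool size, the multiplier values) is my illustration, not Google's actual interface:

```python
# Minimal sketch of the two-layer split: core ranking decides WHO is in the
# candidate pool, twiddlers decide WHERE each candidate ranks inside it.
# All names and numbers here are hypothetical illustrations.
from typing import Callable

Candidate = dict  # e.g. {"url": ..., "score": ..., "good_clicks": ...}

def core_ranking(docs: list[Candidate], pool_size: int = 1000) -> list[Candidate]:
    """Stage 4: build the candidate pool."""
    return sorted(docs, key=lambda d: d["score"], reverse=True)[:pool_size]

def apply_twiddlers(pool: list[Candidate],
                    twiddlers: list[Callable[[Candidate], float]]) -> list[Candidate]:
    """Stage 5: re-rank inside the pool.
    Each twiddler returns a multiplier (boost > 1, demotion < 1)."""
    for doc in pool:
        for twiddler in twiddlers:
            doc["score"] *= twiddler(doc)
    return sorted(pool, key=lambda d: d["score"], reverse=True)

# Example twiddlers running in parallel (illustrative only)
navboost  = lambda d: 1.2 if d.get("good_clicks", 0) > d.get("bad_clicks", 0) else 0.8
freshness = lambda d: 1.1 if d.get("qdf_query") and d.get("is_fresh") else 1.0
```

Because twiddlers recompute per query, SERP features built at that layer can appear and disappear without the underlying candidate pool changing at all.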
For a fuller architectural integration, Shaun Anderson’s 2025 synthesis is the cleanest writeup.
T* × Q* × P* — The Three-Factor Formula
Shaun Anderson’s distilled formula based on DOJ testimony:
Ranking = T* × Q* × P*
T* (Topicality)
Google engineer HJ Kim testified that T* is built from three signals — the ABC signals:
- A = Anchors — anchor text from links pointing at the page
- B = Body — content match to the query
- C = Clicks — user clicks on this result for this query
Q* (Site Quality)
Maps directly to MWC’s site quality score. Cross-validated across all three sources:
- DOJ: confirmed Q* exists
- API Leak: `siteAuthority` is a real, stored variable (see Shaun Anderson’s Q* deep-dive)
- MWC: actual values are on a 0–1 scale, and 0.4 is the hard eligibility threshold for SERP features (Featured Snippets, PAA)
Q*’s three calculation inputs (from MWC’s SearchNorwich presentation):
- Brand Visibility — queries directly containing the brand name, or brand + modifier
- SERP Selection Rate — how often users select your result, especially when you’re not in position 1
- Anchor Text Brand Prevalence — how often anchor text on the wider web contains your brand / domain name
Selection Rate is the most counter-intuitive signal and probably the most important:
- Ranking #5 but users consistently skip 1–4 and click you → Selection Rate high → Q* climbs
- Ranking #1 but users scroll past you → Selection Rate low → Q* drops
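A hedged sketch of what a position-aware selection rate could look like. The baseline CTR numbers are placeholders I made up, not values from the exploit:

```python
# Hypothetical position-aware selection rate: compare actual click share at a
# position against a baseline CTR for that position. The baseline numbers are
# illustrative placeholders, not leaked values.
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def selection_rate_signal(position: int, clicks: int, impressions: int) -> float:
    """> 1.0 means users pick you MORE than your position predicts (Q* climbs);
    < 1.0 means they skip you despite your position (Q* drops)."""
    actual = clicks / impressions
    return actual / EXPECTED_CTR[position]

print(selection_rate_signal(5, 120, 1000))  # 0.12 vs 0.05 expected -> 2.4, strong
print(selection_rate_signal(1, 150, 1000))  # 0.15 vs 0.30 expected -> 0.5, weak
```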
P* (Popularity)
Primarily driven by NavBoost (13-month rolling click data) + the link graph.
- DOJ: Nayak under oath called NavBoost “one of the important signals” Google has
- Internal email (2019) from VP Alexander Grushetsky: NavBoost alone may be more impactful than the rest of ranking combined (“stealing wins”)
- API Leak: NavBoost is referenced 84 times across Content Warehouse modules
This formula replaces every speculative ranking-factor list floating around SEO. It has three-source evidence behind it and is directly usable.
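To see why the multiplicative structure matters (a near-zero factor kills the product no matter how strong the others are), here is a toy sketch. The ABC composition follows the testimony; the weights and numbers are mine:

```python
# Toy illustration of Ranking = T* x Q* x P*. The point is the multiplicative
# structure: if any factor is ~0, the product is ~0 regardless of the rest.
# Sub-signal names follow the testimony (ABC); the averaging is made up.
def t_star(anchors: float, body: float, clicks: float) -> float:
    return (anchors + body + clicks) / 3          # ABC signals

def ranking_score(t: float, q: float, p: float) -> float:
    return t * q * p

# A page with perfect content (B) but zero anchors (A) and zero clicks (C)
# on a low-Q* site barely registers:
t = t_star(anchors=0.0, body=1.0, clicks=0.0)     # ~0.33
print(ranking_score(t, q=0.2, p=0.1))             # ~0.007
```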
⚡ Status April 2026: Fully holds up. Shaun Anderson, Mike King, and others continue to enrich the model.
NavBoost — The Full Picture (three-source reconstruction)
Baseline (DOJ confirmed)
- 13-month rolling window of click data
- Not raw counts — classified click quality
- Nayak testified in Oct 2023 that it’s one of the strongest ranking signals
Variable layer (from the Leak)
Content Warehouse modules related to NavBoost include:
- goodClicks — user clicked and stayed (no pogo-stick)
- badClicks — user clicked and immediately returned to SERP
- lastLongestClicks — the longest dwell in a session (the “ultimate satisfaction” signal)
- country + language — ratings are stored separately by country and language
Full technical breakdown: Shaun Anderson’s NavBoost deep-dive.
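A minimal sketch of how those buckets could be derived from a session. The bucket names and the country/language segmentation are from the leak; the dwell-time cutoff is my assumption:

```python
# Sketch of click classification using the leaked bucket names. The dwell
# threshold is an assumption for illustration; the leak exposes the variables
# (goodClicks, badClicks, lastLongestClicks), not the exact cutoffs.
from dataclasses import dataclass

@dataclass
class Click:
    url: str
    dwell_seconds: float
    returned_to_serp: bool
    country: str   # ratings stored separately per country (per the leak)
    language: str  # and per language

def classify(click: Click) -> str:
    if click.returned_to_serp and click.dwell_seconds < 10:
        return "badClicks"        # pogo-stick back to SERP
    return "goodClicks"           # clicked and stayed

def last_longest(session: list[Click]) -> Click:
    """The longest dwell in the session: the 'ultimate satisfaction' signal."""
    return max(session, key=lambda c: c.dwell_seconds)
```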
Layered architecture (Leak + MWC)
- NavBoost is the data collection layer
- A separate system called CRAPS (Click and Results Prediction System) converts click data into demotion scores
- Applied to SERPs via the Twiddler framework
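Sketching the three layers in a few lines. Only the layer separation comes from the sources; the conversion math is invented:

```python
# NavBoost collects, CRAPS converts click data into a demotion score, the
# twiddler applies it. This conversion formula is invented for illustration.
def craps_demotion(good_clicks: int, bad_clicks: int) -> float:
    """Returns a multiplier in (0, 1]; more bad clicks -> stronger demotion."""
    total = good_clicks + bad_clicks
    if total == 0:
        return 1.0                      # no data, no demotion
    return max(0.1, good_clicks / total)

def navboost_twiddler(score: float, good: int, bad: int) -> float:
    return score * craps_demotion(good, bad)
```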
Practical implications
- 13-month window → short-term CTR manipulation doesn’t work
- Country-segmented → US click data and German click data are scored separately for the same page
- Device-segmented → mobile and desktop are separate scores
- Compounding advantage → sustained quality clicks become a moat competitors can’t replicate short-term
⚡ Status April 2026: Fully holds up.
Site Quality Score — Full Mechanism
Baseline (MWC Exploit, first public disclosure)
- Subdomain-level scoring (not domain) — `www.example.com` and `help.example.com` get different scores
- 0–1 scale, 0.4 is the SERP-features eligibility threshold
- Inputs already covered in section 3 (Brand Visibility + Selection Rate + Anchor Text Brand Prevalence)
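What subdomain-level scoring plus the 0.4 gate implies, as a sketch with made-up scores:

```python
# Sketch: Q* is keyed by subdomain, and 0.4 gates SERP-feature eligibility.
# The scores below are illustrative; in the exploit they came from the API.
from urllib.parse import urlparse

SITE_QUALITY = {                 # hypothetical stored scores, 0-1 scale
    "www.example.com": 0.55,
    "help.example.com": 0.31,    # same domain, different score
}

def serp_feature_eligible(url: str, threshold: float = 0.4) -> bool:
    host = urlparse(url).hostname
    return SITE_QUALITY.get(host, 0.0) >= threshold

print(serp_feature_eligible("https://www.example.com/guide"))   # True
print(serp_feature_eligible("https://help.example.com/faq"))    # False
```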
MWC’s top-0.1% case
From the August 2025 Advanced Web Ranking interview: some of the highest-scoring Q* sites he saw in the exploit data were FAQ sections on university library subdomains. The reason — those pages could only be found via search, and search traffic was a dominant share of total traffic.
His core takeaway:
“If, for whatever reason, you lose visibility, and Google sees that nobody is actively searching for you — if you don’t appear in search, you just don’t exist — then your site is dead in the water.”
New-site prediction scoring (the “Predicting Site Quality” patent)
- Separate patent: “Predicting Site Quality”
- Google vectorizes existing indexed content
- When a new site publishes, it’s compared against the vector space of known sites
- It inherits a starting score from its nearest mathematical neighbors
This explains the “fly then crash” cycle of AI content farms:
- AI content is trained on the best of the web, so its vectors resemble high-quality sites
- Google assigns an initial score of 0.8–0.9, rankings fly
- 6–12 months in, real user signals don’t match the prediction
- Score gets revised down → rankings collapse
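A toy version of the patent's mechanism. The vectors and scores are fabricated; only the nearest-neighbor inheritance idea comes from the patent:

```python
# Toy sketch of "Predicting Site Quality": a new site's starting score is
# inherited from its nearest neighbors in content-vector space.
import numpy as np

known_vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])  # indexed sites
known_quality = np.array([0.85, 0.80, 0.20])                     # their Q* scores

def predicted_quality(new_site_vector: np.ndarray, k: int = 2) -> float:
    dists = np.linalg.norm(known_vectors - new_site_vector, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(known_quality[nearest].mean())

# AI content "looks like" high-quality sites in vector space, so it starts high:
print(predicted_quality(np.array([0.85, 0.15])))  # ~0.825, until real signals arrive
```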
Variables exposed by the Leak
- `siteAuthority` — site-level authority score (Google denied “domain authority” for years; turns out the concept exists under a different name)
- `siteFocusScore` — topic concentration
- `siteRadius` — how far a page deviates from the site’s core topic
- `hostAge` — site age (covered below)
⚡ Status April 2026: The 0.4 threshold is now the industry consensus benchmark.
Google’s 8 Query Classes (Refined Query Semantic Classes)
The MWC exploit revealed that Google classifies nearly all queries into 8 categories, and ranking weights differ across categories. Best English writeup of the 8 classes: Harry Clarkson-Bennett’s Leadership in SEO piece.
| Class | Meaning | Example | SEO implication |
|---|---|---|---|
| Short Facts | Direct factual answer | “who is UK PM” | Heaviest AIO cannibalization |
| Comparison | Entity comparison | “iPhone vs Samsung” | Core B2B decision queries |
| Consequence | Outcome of an action | “what happens if you drink too much coffee” | YMYL risk |
| Reason | Why something occurs | “why is the sky blue” | High AIO hit rate |
| Definition | Meaning of a concept | “what is blockchain” | Heavy AIO cannibalization |
| Instruction | Step-by-step how-to | “how to bake a cake” | HowTo long-tail |
| Boolean | Yes/no question | “is it raining today” | Heavy AIO cannibalization |
| Other | Everything else, incl. local | “coffee shops near me” | Catch-all |
MWC trained an open-source classifier based on these 8 classes — you can plug in your keywords and get Google’s own classification.
Practical value
After keyword research, run your queries through the classifier first, then write the content to match the user expectation for that query class. More precise than reverse-engineering SERP skeletons, because this is Google’s internal taxonomy.
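I don't know the internals of MWC's classifier, so here is a crude rule-based stand-in just to show the workflow. The 8 class names are from the exploit; the regex heuristics are mine and far weaker than a trained model:

```python
# Crude rule-based stand-in for the 8-class taxonomy, to show the workflow.
# Class names are from the exploit; these heuristics are my own.
import re

RULES = [
    ("Instruction", r"^how to\b"),
    ("Reason",      r"^why\b"),
    ("Definition",  r"^what is\b|^what are\b"),
    ("Comparison",  r"\bvs\.?\b|\bversus\b"),
    ("Consequence", r"^what happens if\b"),
    ("Boolean",     r"^(is|are|can|does|do|should)\b"),
    ("Short Facts", r"^(who|when|where)\b"),
]

def classify_query(query: str) -> str:
    q = query.lower().strip()
    for cls, pattern in RULES:
        if re.search(pattern, q):
            return cls
    return "Other"

print(classify_query("iPhone vs Samsung"))                           # Comparison
print(classify_query("what happens if you drink too much coffee"))   # Consequence
```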
⚡ Status April 2026: Holds up.
HCU + The Disconnected Entity Hypothesis
Surface vs. reality
Google’s public framing: HCU evaluates whether content is “written for people.”
What the three sources actually show:
- HCU is a site-wide signal, not page-level (API Leak confirms)
- Merged into core ranking in March 2024
- Mechanism is demotion-first (it only demotes, doesn’t promote)
- The real trigger isn’t “content quality” — it’s “undefined entity”
Disconnected Entity Hypothesis (Shaun Anderson, 2025)
Original article. The causal chain:
Entity not defined
↓
Google can't evaluate "why you exist"
↓
Site classified as "Unhelpful"
↓
HCU site-wide demotion triggered
↓
Every page on the site gets demoted (including genuinely good pages)
Recovery path:
- Not content optimization
- Not technical SEO
- It’s entity definition: About page, author info, schema, sameAs, real-world business evidence
- Core reference is Section 2.5.2 (page 16) of the Search Quality Rater Guidelines — “Finding Who is Responsible for the Website and Who Created the Content on the Page”
Shaun’s April 2026 update on HCU’s current state further reinforces this framework.
Tom Capper’s Synthetic Gap addendum
Tom Capper’s original Moz research: “The Helpful Content Update Was Not What You Think” revealed the key data pattern:
HCU losers share a common profile: Domain Authority significantly higher than Brand Authority (DA:BA ≥ 2:1). Google flags this profile as “synthetic authority” and demotes.
Translation: you’ve built links fast but nobody searches for your brand — that’s the risk profile.
Capper’s data (1.9M keyword sample): HCU losers average Brand Authority 37, winners and neutrals average 50–52. That BA gap is the mathematical signature of synthetic authority. PPC Land’s shorter breakdown if you want the TL;DR.
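The ratio check is trivial to run against your own metrics (Moz's DA/BA, or any authority/brand pair as a proxy):

```python
# The Synthetic Gap check from Capper's data: Domain Authority running far
# ahead of Brand Authority is the risk profile. Metric names follow Moz;
# the 2:1 threshold is from the research cited above.
def synthetic_gap_risk(domain_authority: float, brand_authority: float) -> bool:
    """True if the DA:BA ratio crosses Capper's 2:1 risk threshold."""
    if brand_authority == 0:
        return True   # links but literally no brand demand: worst case
    return domain_authority / brand_authority >= 2.0

print(synthetic_gap_risk(70, 30))  # True  -> HCU-loser profile
print(synthetic_gap_risk(55, 50))  # False -> winner/neutral profile
```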
⚡ Status April 2026: Holds up, and the March 2026 Core Update sharpened this direction further.
The Demotion Variable Set (from the Leak)
Shaun Anderson’s evidence-based mapping of updates to leaked signals covers the full set. The ones that matter most:
| Demotion | Variable | Trigger |
|---|---|---|
| Anchor mismatch | anchorMismatchDemotion | Anchor text doesn’t match target page topic |
| Exact-match domain | exactMatchDomainDemotion | Domain exactly matches target keyword → partial demotion |
| SERP dissatisfaction | serpDemotion | Users pogo-stick back to SERP from your page |
| Navigation | navDemotion | Aggregate NavBoost negative signal |
| Product reviews | productReviewsDemotion | Low-quality product review content |
| Clutter | clutterScore | Too many ads / popups |
| Mobile interstitials | violatesMobileInterstitialPolicy | Full-screen ads on mobile |
⚠️ exactMatchDomainDemotion is the hidden tax on a lot of “keyword domain” strategies people still recommend.
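What "explicit algorithmic flags" means in practice, as a sketch. The variable names are from the leak; the multiplier values are invented:

```python
# Sketch of demotions as explicit flags, not vibes. Names are from the leak;
# the multiplier values are invented for illustration.
DEMOTION_MULTIPLIERS = {
    "anchorMismatchDemotion": 0.85,
    "exactMatchDomainDemotion": 0.80,
    "serpDemotion": 0.70,
    "navDemotion": 0.75,
    "productReviewsDemotion": 0.60,
}

def apply_demotions(base_score: float, active_flags: set[str]) -> float:
    for flag in active_flags:
        base_score *= DEMOTION_MULTIPLIERS.get(flag, 1.0)
    return base_score

# An exact-match domain with mismatched anchors takes a compounding hit:
print(apply_demotions(1.0, {"exactMatchDomainDemotion", "anchorMismatchDemotion"}))  # ~0.68
```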
hostAge and the Truth About Sandbox
Widely misunderstood variable. Needs to be re-read carefully.
Raw API Leak description
hostAge (type: integer):
- Earliest first-seen date of all pages in this host/domain
- Used by twiddler to sandbox fresh spam at serving time
- 16-bit, day count starting from 2005-12-31
- If URL's host_age == domain_age, domain_age is omitted
Key phrase: “fresh spam”. This variable’s purpose is identifying newly-appeared spam content, not “punishing all new sites.”
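The encoding detail is easy to make concrete: decoding the day count takes two lines (the epoch is from the leak; the decode itself is mine):

```python
# Decoding hostAge per the leaked description: a 16-bit day count with epoch
# 2005-12-31.
from datetime import date, timedelta

HOST_AGE_EPOCH = date(2005, 12, 31)

def host_age_to_date(host_age_days: int) -> date:
    return HOST_AGE_EPOCH + timedelta(days=host_age_days)

print(host_age_to_date(7000))   # 2025-03-01: a host first seen ~19 years in
```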
What Sandbox actually is
Shaun Anderson’s March 2026 hostAge deep-dive is definitive:
“If you don’t look like spam, you don’t get sandboxed.” “Google wasn’t lying. It caught you.”
Sandbox isn’t a new-site penalty. It’s a demotion mechanism targeting “untrusted + suddenly active” entities.
What actually triggers Sandbox
- New domain + sudden high content volume (content farm pattern)
- Old domain + sudden topic change + bulk content (expired domain abuse)
- New subdomain on a clean old domain + sudden activity (a form of site reputation abuse)
- Any “newly-appearing” entity + absent user / link signals
Why clean new sites still feel sandboxed
Google isn’t actively sandboxing them. It’s that:
- No PageRank signal → low crawl priority
- No user signal → no NavBoost data
- No authority links → the Anchors component of T* is zero
- No brand searches → Q* sits low
Result looks identical to sandbox — no rankings. But the mechanism is different: it’s “not yet vetted,” not “punished.”
Shaun’s core insight on hostAge
“PageRank is the VIP pass that skips sandbox.”
In the leaked architecture, a high pagerank_nsr tells the hostAge twiddler: “this entity has been vetted by the wider web — skip the spam classification check.”
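The "VIP pass" reduces to a simple gate. `pagerank_nsr` is the leaked name; the threshold and the boolean logic are my illustration:

```python
# The "VIP pass" gate in miniature: strong PageRank skips the fresh-spam
# check entirely. pagerank_nsr is the leaked name; the threshold is invented.
def hostage_twiddler(is_new_host: bool, pagerank_nsr: float,
                     looks_like_spam: bool) -> bool:
    """Returns True if the host gets sandboxed at serving time."""
    if pagerank_nsr > 0.5:          # vetted by the wider web: skip the check
        return False
    return is_new_host and looks_like_spam   # only fresh spam gets caught
```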
Actual impact range
| Scenario | hostAge impact |
|---|---|
| New domain + clean content + compliant SEO | Virtually none |
| New domain + sudden bulk content + spam signals | Sandbox active, visibility limited |
| Old domain + topic pivot + bulk content | Sandbox triggered |
| Old domain + sustained operation | Age alone isn’t a boost — accumulated authority is |
Conclusion: don’t buy old domains for SEO age. Old domain + topic change = worse than starting fresh.
⚡ Status April 2026: Holds up.
Freshness — The Most Misread Factor
Ahrefs 2025 hard data
Ahrefs / Patrick Stox’s May 2025 top-10 age study is counter-intuitive:
| Metric | 2017 | 2025 | Direction |
|---|---|---|---|
| Top 10 pages 3+ years old | 59% | 72.9% | ⬆️ More old content |
| Top 10 pages under 1 year old | 22% | 13.7% | ⬇️ Fewer new pages |
| Average age of #1 page | 2 years | 5 years | ⬆️ Doubled |
| New pages reaching Top 10 within 1 year | 5.7% | 1.74% | ⬇️ Dropped sharply |
Conclusion: 2026 SERPs are more dominated by old content than ever before.
QDF (Query Deserves Freshness) only activates for specific queries
Search Engine Land’s QDF definition makes this explicit. QDF only fires for 3 query types:
- Breaking news / hot topics
- Recurring events (Olympics, elections, annual conferences)
- Frequently-changing topics (product launches, pricing, policy)
Activation conditions (Amit Singhal’s original 2007 NYT definition):
- News sites are actively covering
- Blogs are publishing frequently
- Search volume is spiking
All three present → QDF activates → new content surfaces.
Any one missing → QDF dormant → old content dominates.
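The activation logic is a strict AND, which a two-line check makes obvious:

```python
# QDF as a strict AND over Singhal's three conditions: drop any one and the
# query falls back to old-content dominance.
def qdf_active(news_coverage: bool, blog_activity: bool, volume_spike: bool) -> bool:
    return news_coverage and blog_activity and volume_spike

print(qdf_active(True, True, True))    # True  -> fresh content surfaces
print(qdf_active(True, True, False))   # False -> old content dominates
```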
Freshness in 2026 reality
Most B2B / B2C / tutorial queries don’t trigger QDF, so old-content domination is structural, not anomalous.
Actual value of freshness:
- QDF queries: direct ranking boost
- Non-QDF queries: indirect effect — continuous publishing signals an “active” site, which increases crawl frequency for older pages
SE Ranking’s 16-month AI content experiment showed: after publishing new content, old page traffic jumped 17–19×. The real value of new content is activating site-level crawl, not the new pages themselves ranking.
Common misconception, corrected
The “update old articles regularly” advice is widely recommended. But:
- First check whether the keyword is a QDF query
- For Definition / Comparison / Consequence / Reason queries, old content has a structural advantage
- Sloppy updates can trigger lastmod trust issues (see next section)
⚡ Status April 2026: Freshness isn’t a general ranking factor. It’s QDF-specific.
The lastmod Binary Trust Rule
Gary Illyes confirmed this directly on LinkedIn (June 2024), in response to a question from MWC. Full exchange: Search Engine Journal’s writeup / Barry Schwartz’s Search Engine Roundtable record.
MWC: “If I’m specifying lastmod and Google’s signals consistently find I haven’t made significant changes, do you have any kind of reputation system to decide how much to trust what a site tells you?”
Illyes: “It’s binary — we either trust it or we don’t.”
MWC’s Leak finding
- Google stores each URL’s “last significant update” timestamp (epoch format)
- A boolean governs whether to trust your lastmod at all
- “Once you’re a liar, permanent distrust”
Specific rules
| Edit type | Google response |
|---|---|
| Significant edit + lastmod updated | Positive signal |
| Significant edit + lastmod not updated | Neutral |
| Minor edit (few words) + lastmod updated | Negative; repeated → lastmod signal disabled |
| No edit + lastmod updated | Most negative — direct “liar” classification |
Supporting Leak detail
Google stores up to 20 historical versions of every page. Implications:
- Google knows what you looked like historically
- “Multiple small edits” accumulate into comparison basis — but lastmod trust is binary; once blacklisted, it doesn’t come back
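The trust rule behaves like a one-way latch: it can flip to distrust but never back. A sketch, with the "significant change" detector hand-waved as a diff ratio:

```python
# The lastmod rule as a one-way latch: trust flips to False and stays there.
# How Google detects a "significant" change is hand-waved here via a simple
# diff ratio; the binary/permanent behavior is what Illyes confirmed.
class LastmodTrust:
    def __init__(self) -> None:
        self.trusted = True               # binary: trust it or don't

    def observe(self, lastmod_changed: bool, content_change_ratio: float) -> None:
        significant = content_change_ratio > 0.2   # placeholder threshold
        if lastmod_changed and not significant:
            self.trusted = False          # "liar": no way back
```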
⚡ Status April 2026: Fully holds, no changes.
Parasite SEO / LinkedIn Pulse Bleaching — Dead
The “publish on LinkedIn Pulse / Medium / Forbes Advisor to borrow authority” playbook is over in 2026.
Timeline
| Date | Event |
|---|---|
| March 2024 | Google introduces Site Reputation Abuse policy |
| November 2024 | Manual actions on Forbes, WSJ, Time, CNN |
| January 2025 | Written into Search Quality Rater Guidelines |
| August 2025 | Spam Update begins algorithmic enforcement (previously manual-only) |
| November 2025 | EU DMA investigation launched (Google accused of suppressing news publishers) |
| March 2026 | Core Update further sharpens enforcement |
Full timeline and technical detail: Digital Hitmen’s March 2026 Site Reputation Abuse complete guide.
Current state (April 2026)
- Crude parasite SEO is dead
- LinkedIn Pulse still ranks reasonably — because LinkedIn has “editorial friction” (connection requirements), Google treats it as a “quality filter”
- But posting unrelated topics is high-risk (gambling, loans, CBD, etc.)
- What works in 2026: publishing on LinkedIn in topics genuinely aligned with your professional identity
Schema and Entity Building
Integrating the two camps
- Camp 1: schema has near-zero direct impact on LLM citations
- Camp 2: schema is core to entity building
Integrated truth: Schema isn’t a direct ranking factor. It’s an entity-building accelerator.
Schema
↓
Entity disambiguation accelerated (Google confirms "who you are" faster)
↓
Entity authority established faster
↓
Knowledge Graph recognition
↓
Increased LLM citation probability (indirect, not direct)
The 3 core conditions for entity building in 2026
- Notability — at least 20–30 independent authoritative mentions
- Entity Home — one URL as the “source of truth,” typically the About page
- Corroboration — information fully consistent across all platforms
Practical entity verification (more realistic than pursuing a Knowledge Panel)
A full Knowledge Panel isn’t realistic for most sites — Google deleted 3 billion low-quality entities in June 2025. Knowledge Panels are for high-confidence entities, not every site owner.
Tiered entity verification:
| Tier | Indicator | Difficulty |
|---|---|---|
| Tier 1 (baseline) | Brand search → your site ranks #1 | Easy |
| Tier 2 (decent) | Brand search shows brand card or sitelinks | Moderate |
| Tier 3 (good) | Knowledge Graph API returns your entity with a kg:/m/ ID | Hard |
| Tier 4 (strong) | Full Knowledge Panel on SERP | Very hard |
| Tier 5 (top-tier) | AI systems (ChatGPT / Gemini / Perplexity) cite you unprompted | Hardest |
Tier 2 is sufficient for most sites. Realistic target: brand search → site #1, and Knowledge Graph API finds your entity ID — not waiting for a full Knowledge Panel.
Why “brand search → site #1” is the core entity-health indicator
It directly maps to two of Q*’s three inputs:
- Brand Search (people search for you) → Q* Brand Visibility input
- Selection Rate (they pick your site) → Q* Selection Rate input
If people search your brand and can’t find or don’t pick your official site:
- Brand Visibility data exists but doesn’t resolve to you
- Selection Rate is low
- Q* can’t clear 0.4
- You’re not even eligible for Featured Snippets / PAA, let alone higher rankings
Entity verification tool
Use Google Knowledge Graph API directly.
If it returns @id: “kg:/m/…” for your entity, Google recognizes you as an entity — more accurate than checking for a SERP Knowledge Panel.
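A minimal check against the real endpoint (bring your own API key; error handling omitted for brevity):

```python
# Minimal entity check against the Knowledge Graph Search API.
import requests

def kg_entity_id(brand: str, api_key: str) -> str | None:
    resp = requests.get(
        "https://kgsearch.googleapis.com/v1/entities:search",
        params={"query": brand, "key": api_key, "limit": 1},
    )
    items = resp.json().get("itemListElement", [])
    if not items:
        return None
    return items[0]["result"].get("@id")   # e.g. "kg:/m/..." for a recognized entity

# kg_entity_id("YourBrand", "YOUR_API_KEY") returning "kg:/m/..." means Tier 3 reached
```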
Deployment
Full deployment guide: Hobo-Web’s Entity SEO guide.
- Person schema: name, jobTitle, knowsAbout, alumniOf, sameAs
- Organization schema: name, legalName, url, logo, foundingDate, sameAs
- The core is `sameAs` — connecting external authoritative identities
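A sketch of the deployment target: Organization JSON-LD with sameAs as the connective tissue. The property names are standard schema.org; every URL below is a placeholder:

```python
# Sketch of the JSON-LD deployment target. Property names are standard
# schema.org; every URL is a placeholder.
import json

organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "legalName": "Example Co Ltd",
    "url": "https://www.example.com",
    "logo": "https://www.example.com/logo.png",
    "foundingDate": "2020-01-01",
    "sameAs": [                              # the entity-connecting core
        "https://www.wikidata.org/wiki/QXXXXXXX",
        "https://www.crunchbase.com/organization/example-co",
        "https://www.linkedin.com/company/example-co",
    ],
}

print(f'<script type="application/ld+json">{json.dumps(organization)}</script>')
```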
External platform weighting (by ROI)
- Wikidata (highest ROI — direct input to Knowledge Graph)
- Google Business Profile
- Crunchbase
- Industry-specific authority platforms
- Official brand social accounts
Timeline expectations
- Schema + sameAs deployed → Google processes the connections: 4–8 weeks
- Knowledge Panel trigger: 3–6 months
- Full recognition: 6–12 months
⚡ Status April 2026: Holds up, and entity building has shifted from “nice to have” to core defense against HCU / Spam Update collateral damage.
Integrated SEO Priority Order (April 2026)
Ranking diagnosis sequence, after three-source cross-validation:
Layer 1 — Entity health (foundation)
- Is the entity clearly defined? (Disconnected Entity Hypothesis)
- About page + schema + sameAs complete?
- Quality Rater Guidelines Section 2.5.2 compliant?
Layer 2 — Site-level quality (Q*)
- Brand search → site #1 (most direct Q* health signal)
- Branded search volume
- SERP Selection Rate (especially when not in #1)
- Brand prevalence in anchor text
Layer 3 — Site-level authority (links + content depth)
- Link graph quality (not just DR number)
- Topic focus (`siteFocusScore`)
- Content breadth and depth
Layer 4 — User signals (P*)
- NavBoost data accumulation (13-month rolling window)
- goodClicks / badClicks / lastLongestClicks trends
- Country / device performance
Layer 5 — Single-page content (T*)
- ABC signals (Anchors / Body / Clicks)
- Query class match (Short Fact / Comparison / Definition etc.)
- Schema implementation details
Most SEOs work this list in reverse — from Layer 5 up to Layer 1 — which is why results are slow or fragile.
What’s Now “Hard Fact” (April 2026)
Confirmed across three sources:
- Site authority (Q*) is real, subdomain-level, 0–1 scale, 0.4 is the SERP-features eligibility threshold
- Q*’s inputs are Brand Visibility + Selection Rate + Anchor Text Brand Prevalence
- NavBoost is one of the strongest ranking signals, 13-month window, country + device segmented
- HCU is site-wide, root cause is undefined entity
- Demotion mechanisms are explicit algorithmic flags, not vague “Google just knows”
- Sandbox is not a new-site penalty — it’s a demotion on “untrusted + suddenly-active” entities
- `hostAge` sandboxes fresh spam only, not clean new sites
- lastmod trust is binary — fake updates → permanent distrust
- Freshness is not a general ranking factor — QDF-specific
- Parasite SEO pathway is closed
- Schema accelerates entity disambiguation, indirectly affects LLM citation
- The 8 query classes determine differential algorithm weights
- Full Knowledge Panel is unrealistic for most sites — the practical bar is “brand search → site #1”
Things the English SEO world discusses but the Chinese SEO world barely touches yet:
- Disconnected Entity Hypothesis
- T* × Q* × P* framework
- Site Quality Score 0.4 threshold
- Q*’s three precise inputs (Brand + Selection Rate + Anchor)
- NavBoost 13-month window and CRAPS
- Synthetic Gap (Capper’s DA:BA 2:1 risk threshold)
- Freshness and QDF limits
- End of the Parasite SEO era
- lastmod binary rule
- The real mechanism of Sandbox (targeting fresh spam, not clean new sites)
These are the highest-value topics for Chinese SEO content right now, which is why I’m building out ylsseo.com around them.
Full Reference List
Primary:
- Pandu Nayak DOJ testimony full PDF
- Mike King (iPullRank) API Leak initial analysis
- Rand Fishkin (SparkToro) first disclosure
- MWC Exploit coverage (Search Engine Land)
- Gary Illyes lastmod binary — Search Engine Journal
- Google Site Reputation Abuse policy
- Google Knowledge Graph API
Secondary core:
- Shaun Anderson — Disconnected Entity Hypothesis
- Shaun Anderson — Q* deep-dive
- Shaun Anderson — NavBoost mechanism
- Shaun Anderson — DOJ antitrust trial synthesis
- Shaun Anderson — hostAge and Sandbox truth
- Shaun Anderson — How Google Works 2025–2026
- Tom Capper (Moz) — Helpful Content Update Was Not What You Think
- AJ Kohn — What Pandu Nayak Taught Me About SEO
- Harry Clarkson-Bennett — 8 query classes breakdown
- Ahrefs Patrick Stox — top-10 page age study
- MWC’s open-source 8-class query classifier
Independent Google SEO expert and founder of ylsseo.com, interpreting ranking mechanisms through Google patents and the API Leak; a pioneer of Chinese-language SEO education.