Google Ranking 2026: DOJ + API Leak + MWC Exploit Synthesis

Over the past 18 months, SEO has received three pieces of hard evidence that finally let us replace speculation with fact: the DOJ antitrust trial testimony, the Content Warehouse API leak, and the MWC exploit data.

This is my synthesis of all three sources. If you spot errors, please tell me. Learning in public.

What Each Source Actually Gives Us

| Source | Nature | Unique contribution |
| --- | --- | --- |
| DOJ Trial | Sworn testimony + internal emails | Google's own names for the mechanisms, and their relative importance |
| API Leak | Internal technical docs | Specific variable names, data structures, module organization |
| MWC Exploit | Live API data (now patched) | Actual numerical distributions across 200K+ sites |

Key framing: DOJ tells you what Google admits to using. API Leak tells you what those things are called internally. MWC tells you the real numbers. Any single source is arguable. All three together is hard evidence.

AJ Kohn’s SEO-focused breakdown of Pandu Nayak’s testimony is still the deepest first-hand analysis I’ve seen.

Google’s Ranking Architecture (all three sources agree)

Search processing runs through six stages:

1. Crawling
2. Indexing (tiered: Base / Zeppelins / Landfills)
3. Query Processing
4. Core Ranking (Ascorer / Mustang) — T* × Q* × P*
5. Post-Ranking Re-ranking (Twiddler framework)
   → NavBoost / Freshness / QualityBoost etc. running in parallel
6. SERP Generation (including Gemini-generated AI Overviews)

Three things to internalize:

  • Core Ranking decides whether you’re in the candidate pool
  • Twiddlers decide where you rank inside it
  • AIO, Featured Snippets, PAA are all Twiddler-layer outputs — which explains why these SERP features move fast and flicker in/out
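
A toy sketch of that two-stage shape, with invented names and numbers (this is an illustration of the described architecture, not Google's implementation): core ranking selects the candidate pool, then twiddler-style functions only reorder documents already inside it.

```python
# Illustrative two-stage ranking sketch: core score gates the candidate pool,
# then twiddler-style boosts only reorder documents already inside the pool.
from dataclasses import dataclass, field

@dataclass
class Doc:
    url: str
    core_score: float                            # hypothetical T* x Q* x P* output
    boosts: dict = field(default_factory=dict)   # e.g. {"navboost": 1.2, "freshness": 0.9}

def rank(docs: list[Doc], pool_size: int = 100) -> list[Doc]:
    # Stage 1: core ranking decides who is in the candidate pool at all.
    pool = sorted(docs, key=lambda d: d.core_score, reverse=True)[:pool_size]
    # Stage 2: twiddlers re-rank inside the pool; a doc outside the pool
    # cannot be rescued no matter how strong its boosts are.
    def final_score(d: Doc) -> float:
        score = d.core_score
        for multiplier in d.boosts.values():
            score *= multiplier
        return score
    return sorted(pool, key=final_score, reverse=True)

docs = [
    Doc("a.example/page", 0.9, {"navboost": 0.8}),
    Doc("b.example/page", 0.7, {"navboost": 1.5, "freshness": 1.1}),
]
print([d.url for d in rank(docs)])  # b.example overtakes a.example inside the pool
```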

For a fuller architectural integration, Shaun Anderson’s 2025 synthesis is the cleanest writeup.

T* × Q* × P* — The Three-Factor Formula

Shaun Anderson’s distilled formula based on DOJ testimony:

Ranking = T* × Q* × P*
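
Because the model is multiplicative rather than additive, a weak factor cannot be papered over by two strong ones. A toy illustration with made-up values:

```python
# Toy illustration of a multiplicative model: one near-zero factor
# drags down the whole product, however strong the other two are.
def ranking_score(t: float, q: float, p: float) -> float:
    return t * q * p

print(ranking_score(t=0.9, q=0.9, p=0.9))    # 0.729   -> strong across the board
print(ranking_score(t=0.95, q=0.05, p=0.9))  # 0.04275 -> great content, weak site quality
```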

T* (Topicality)

Google engineer HJ Kim testified that T* is built from three signals — the ABC signals:

  • A = Anchors — anchor text from links pointing at the page
  • B = Body — content match to the query
  • C = Clicks — user clicks on this result for this query

Q* (Site Quality)

Maps directly to MWC’s site quality score. Cross-validated across all three sources:

  • DOJ: confirmed Q* exists
  • API Leak: siteAuthority is a real, stored variable (see Shaun Anderson’s Q* deep-dive)
  • MWC: actual values are on a 0–1 scale, and 0.4 is the hard eligibility threshold for SERP features (Featured Snippets, PAA)

Q*’s three calculation inputs (from MWC’s SearchNorwich presentation):

  1. Brand Visibility — queries directly containing the brand name, or brand + modifier
  2. SERP Selection Rate — how often users select your result, especially when you’re not in position 1
  3. Anchor Text Brand Prevalence — how often anchor text on the wider web contains your brand / domain name

Selection Rate is the most counter-intuitive signal and probably the most important:

  • Ranking #5 but users consistently skip 1–4 and click you → Selection Rate high → Q* climbs
  • Ranking #1 but users scroll past you → Selection Rate low → Q* drops
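
The exact math is unknown; one plausible way to think about it is click share adjusted for the CTR a position would normally earn. The baseline figures below are assumptions purely for illustration:

```python
# Hypothetical position-adjusted selection rate: clicks earned relative to
# what a result in that position would normally earn. Baselines are assumptions.
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def selection_signal(position: int, impressions: int, clicks: int) -> float:
    actual_ctr = clicks / impressions
    return actual_ctr / EXPECTED_CTR[position]  # >1.0 means users prefer you over your position

print(selection_signal(position=5, impressions=1000, clicks=120))  # 2.4: users skip 1-4 for you
print(selection_signal(position=1, impressions=1000, clicks=150))  # 0.5: users scroll past you
```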

P* (Popularity)

Primarily driven by NavBoost (13-month rolling click data) + the link graph.

  • DOJ: Nayak under oath called NavBoost “one of the important signals” Google has
  • Internal email (2019) from VP Alexander Grushetsky: NavBoost alone may be more impactful than the rest of ranking combined (“stealing wins”)
  • API Leak: NavBoost is referenced 84 times across Content Warehouse modules

This formula replaces every speculative ranking-factor list floating around SEO. It has three-source evidence behind it and is directly usable.

Status April 2026: Fully holds up. Shaun Anderson, Mike King, and others continue to enrich the model.

NavBoost — The Full Picture (three-source reconstruction)

Baseline (DOJ confirmed)

  • 13-month rolling window of click data
  • Not raw counts — classified click quality
  • Nayak testified in Oct 2023 that it’s one of the strongest ranking signals

Variable layer (from the Leak)

Content Warehouse modules related to NavBoost include:

  • goodClicks — user clicked and stayed (no pogo-stick)
  • badClicks — user clicked and immediately returned to SERP
  • lastLongestClicks — the longest dwell in a session (the “ultimate satisfaction” signal)
  • country + language — ratings are stored separately by country and language

Full technical breakdown: Shaun Anderson’s NavBoost deep-dive.
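
A minimal sketch of how those variables could be derived from session logs; the dwell threshold and field names here are assumptions for illustration, not the leaked definitions:

```python
# Illustrative session -> click-quality classification. Thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Click:
    url: str
    dwell_seconds: float
    returned_to_serp: bool

def classify_session(clicks: list[Click], short_dwell: float = 10.0) -> dict:
    good, bad = [], []
    for c in clicks:
        if c.returned_to_serp and c.dwell_seconds < short_dwell:
            bad.append(c.url)        # pogo-stick back to the SERP
        else:
            good.append(c.url)       # clicked and stayed
    last_longest = max(clicks, key=lambda c: c.dwell_seconds).url if clicks else None
    return {"goodClicks": good, "badClicks": bad, "lastLongestClick": last_longest}

session = [
    Click("a.example", 6, True),
    Click("b.example", 210, False),
]
print(classify_session(session))
```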

Layered architecture (Leak + MWC)

  • NavBoost is the data collection layer
  • A separate system called CRAPS (Click and Results Prediction System) converts click data into demotion scores
  • Applied to SERPs via the Twiddler framework

Practical implications

  • 13-month window → short-term CTR manipulation doesn’t work
  • Country-segmented → US click data and German click data are scored separately for the same page
  • Device-segmented → mobile and desktop are separate scores
  • Compounding advantage → sustained quality clicks become a moat competitors can’t replicate short-term

Status April 2026: Fully holds up.

Site Quality Score — Full Mechanism

Baseline (MWC Exploit, first public disclosure)

  • Subdomain-level scoring (not domain) — www.example.com and help.example.com get different scores
  • 0–1 scale, 0.4 is the SERP-features eligibility threshold
  • Inputs already covered in section 3 (Brand Visibility + Selection Rate + Anchor Text Brand Prevalence)

MWC’s top-0.1% case

From the August 2025 Advanced Web Ranking interview: some of the highest-scoring Q* sites he saw in the exploit data were FAQ sections on university library subdomains. The reason — those pages could only be found via search, and search traffic was a dominant share of total traffic.

His core takeaway:

“If, for whatever reason, you lose visibility, and Google sees that nobody is actively searching for you — if you don’t appear in search, you just don’t exist — then your site is dead in the water.”

New-site prediction scoring (the “Predicting Site Quality” patent)

  • Separate patent: predicting site quality
  • Google vectorizes existing indexed content
  • When a new site publishes, it’s compared against the vector space of known sites
  • It inherits a starting score from its nearest mathematical neighbors

This explains the “fly then crash” cycle of AI content farms:

  1. AI content is trained on the best of the web, so its vectors resemble high-quality sites
  2. Google assigns an initial score of 0.8–0.9, rankings fly
  3. 6–12 months in, real user signals don’t match the prediction
  4. Score gets revised down → rankings collapse
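
A minimal sketch of what neighbor-based prediction could look like, assuming content is embedded as vectors and a new site inherits the mean score of its nearest neighbors. The embeddings, scores, and k below are invented for illustration; the patent's actual features are not public.

```python
# Illustrative nearest-neighbor quality prediction over content vectors.
import numpy as np

def predict_quality(new_vec: np.ndarray, known_vecs: np.ndarray,
                    known_scores: np.ndarray, k: int = 2) -> float:
    # Cosine similarity between the new site's vector and every known site.
    sims = known_vecs @ new_vec / (np.linalg.norm(known_vecs, axis=1) * np.linalg.norm(new_vec))
    nearest = np.argsort(sims)[-k:]
    return float(known_scores[nearest].mean())   # inherit the neighbors' average score

known_vecs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
known_scores = np.array([0.85, 0.80, 0.30])
ai_farm_vec = np.array([0.85, 0.15])   # text that *looks like* high-quality sites
print(predict_quality(ai_farm_vec, known_vecs, known_scores))  # starts high; user signals later correct it
```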

Variables exposed by the Leak

  • siteAuthority — site-level authority score (Google denied “domain authority” for years; turns out the concept exists under a different name)
  • siteFocusScore — topic concentration
  • siteRadius — how far a page deviates from the site’s core topic
  • hostAge — site age (covered below)

Status April 2026: The 0.4 threshold is now the industry consensus benchmark.

Google’s 8 Query Classes (Refined Query Semantic Classes)

The MWC exploit revealed that Google classifies nearly all queries into 8 categories, and ranking weights differ across categories. Best English writeup of the 8 classes: Harry Clarkson-Bennett’s Leadership in SEO piece.

| Class | Meaning | Example | SEO implication |
| --- | --- | --- | --- |
| Short Facts | Direct factual answer | "who is UK PM" | Heaviest AIO cannibalization |
| Comparison | Entity comparison | "iPhone vs Samsung" | Core B2B decision queries |
| Consequence | Outcome of an action | "what happens if you drink too much coffee" | YMYL risk |
| Reason | Why something occurs | "why is the sky blue" | High AIO hit rate |
| Definition | Meaning of a concept | "what is blockchain" | Heavy AIO cannibalization |
| Instruction | Step-by-step how-to | "how to bake a cake" | HowTo long-tail |
| Boolean | Yes/no question | "is it raining today" | Heavy AIO cannibalization |
| Other | Everything else, incl. local | "coffee shops near me" | Catch-all |

MWC trained an open-source classifier based on these 8 classes — you can plug in your keywords and get Google’s own classification.

Practical value

After keyword research, run your queries through the classifier first, then write the content to match the user expectation for that query class. More precise than reverse-engineering SERP skeletons, because this is Google’s internal taxonomy.
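
To show the workflow shape only, here is a rough pattern-based stand-in; this is not MWC's trained classifier, just a heuristic sketch you could use to bucket a keyword list before deciding where to invest:

```python
# Rough pattern-based stand-in for the 8 query classes (NOT MWC's trained model;
# use it only to illustrate the "classify first, then write" workflow).
import re

RULES = [
    ("Instruction", r"^how to\b"),
    ("Reason",      r"^why\b"),
    ("Definition",  r"^what is\b|^what are\b"),
    ("Consequence", r"^what happens if\b"),
    ("Comparison",  r"\bvs\.?\b|\bversus\b"),
    ("Boolean",     r"^(is|are|can|does|do|should|will)\b"),
    ("Short Facts", r"^(who|when|where|how (many|much|old|tall))\b"),
]

def classify(query: str) -> str:
    q = query.lower().strip()
    for label, pattern in RULES:
        if re.search(pattern, q):
            return label
    return "Other"

for q in ["how to bake a cake", "iphone vs samsung", "is it raining today", "coffee shops near me"]:
    print(q, "->", classify(q))
```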

Status April 2026: Holds up.

HCU + The Disconnected Entity Hypothesis

Surface vs. reality

Google’s public framing: HCU evaluates whether content is “written for people.”

What the three sources actually show:

  • HCU is a site-wide signal, not page-level (API Leak confirms)
  • Merged into core ranking in March 2024
  • Mechanism is demotion-first (it only demotes, doesn’t promote)
  • The real trigger isn’t “content quality” — it’s “undefined entity”

Disconnected Entity Hypothesis (Shaun Anderson, 2025)

Original article. The causal chain:

Entity not defined
    ↓
Google can't evaluate "why you exist"
    ↓
Site classified as "Unhelpful"
    ↓
HCU site-wide demotion triggered
    ↓
Every page on the site gets demoted (including genuinely good pages)

Recovery path:

  • Not content optimization
  • Not technical SEO
  • It’s entity definition: About page, author info, schema, sameAs, real-world business evidence
  • Core reference is Section 2.5.2 (page 16) of the Search Quality Rater Guidelines — “Finding Who is Responsible for the Website and Who Created the Content on the Page”

Shaun’s April 2026 update on HCU’s current state further reinforces this framework.

Tom Capper’s Synthetic Gap addendum

Tom Capper’s original Moz research: “The Helpful Content Update Was Not What You Think” revealed the key data pattern:

HCU losers share a common profile: Domain Authority significantly higher than Brand Authority (DA:BA ≥ 2:1). Google flags this profile as “synthetic authority” and demotes.

Translation: you’ve built links fast but nobody searches for your brand — that’s the risk profile.

Capper’s data (1.9M keyword sample): HCU losers average Brand Authority 37, winners and neutrals average 50–52. That BA gap is the mathematical signature of synthetic authority. PPC Land’s shorter breakdown if you want the TL;DR.
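
A trivial way to apply Capper's threshold to your own portfolio, taking DA and BA as whatever authority and brand metrics you already track; the 2:1 cut-off comes from his analysis, the metric source is up to you:

```python
# Flag the "synthetic authority" risk profile: link authority far ahead of brand demand.
def synthetic_gap(domain_authority: float, brand_authority: float, threshold: float = 2.0) -> bool:
    if brand_authority <= 0:
        return True                     # links but effectively no brand searches
    return domain_authority / brand_authority >= threshold

print(synthetic_gap(domain_authority=62, brand_authority=28))  # True: HCU-loser profile
print(synthetic_gap(domain_authority=55, brand_authority=50))  # False: balanced profile
```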

Status April 2026: Holds up, and the March 2026 Core Update sharpened this direction further.

The Demotion Variable Set (from the Leak)

Shaun Anderson’s evidence-based mapping of updates to leaked signals covers the full set. The ones that matter most:

| Demotion | Variable | Trigger |
| --- | --- | --- |
| Anchor mismatch | anchorMismatchDemotion | Anchor text doesn't match target page topic |
| Exact-match domain | exactMatchDomainDemotion | Domain exactly matches target keyword → partial demotion |
| SERP dissatisfaction | serpDemotion | Users pogo-stick back to SERP from your page |
| Navigation | navDemotion | Aggregate NavBoost negative signal |
| Product reviews | productReviewsDemotion | Low-quality product review content |
| Clutter | clutterScore | Too many ads / popups |
| Mobile interstitials | violatesMobileInterstitialPolicy | Full-screen ads on mobile |

⚠️ exactMatchDomainDemotion is the hidden tax on a lot of “keyword domain” strategies people still recommend.

hostAge and the Truth About Sandbox

Widely misunderstood variable. Needs to be re-read carefully.

Raw API Leak description

hostAge (type: integer):
- Earliest first-seen date of all pages in this host/domain
- Used by twiddler to sandbox fresh spam at serving time
- 16-bit, day count starting from 2005-12-31
- If URL's host_age == domain_age, domain_age is omitted

Key phrase: “fresh spam”. This variable’s purpose is identifying newly-appeared spam content, not “punishing all new sites.”
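
Taking the leaked description at face value (a day count anchored at 2005-12-31), converting the stored integer to a calendar date is simple arithmetic on the documented encoding; this is not Google code, just a convenience sketch:

```python
# Convert a hostAge-style value (days since 2005-12-31, per the leaked description)
# to a calendar date, and back. Purely arithmetic on the documented encoding.
from datetime import date, timedelta

EPOCH = date(2005, 12, 31)

def host_age_to_date(days: int) -> date:
    return EPOCH + timedelta(days=days)

def date_to_host_age(d: date) -> int:
    return (d - EPOCH).days            # a 16-bit field covers roughly 179 years of days

print(host_age_to_date(6700))              # first-seen date for a host stored as 6700
print(date_to_host_age(date(2024, 5, 1)))
```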

What Sandbox actually is

Shaun Anderson’s March 2026 hostAge deep-dive is definitive:

“If you don’t look like spam, you don’t get sandboxed.” “Google wasn’t lying. It caught you.”

Sandbox isn’t a new-site penalty. It’s a demotion mechanism targeting “untrusted + suddenly active” entities.

What actually triggers Sandbox

  1. New domain + sudden high content volume (content farm pattern)
  2. Old domain + sudden topic change + bulk content (expired domain abuse)
  3. New subdomain on a clean old domain + sudden activity (a form of site reputation abuse)
  4. Any “newly-appearing” entity + absent user / link signals

Why clean new sites still feel sandboxed

Google isn’t actively sandboxing them. It’s that:

  • No PageRank signal → low crawl priority
  • No user signal → no NavBoost data
  • No authority links → the Anchors component of T* is zero
  • No brand searches → Q* sits low

Result looks identical to sandbox — no rankings. But the mechanism is different: it’s “not yet vetted,” not “punished.”

Shaun’s core insight on hostAge

“PageRank is the VIP pass that skips sandbox.”

In the leaked architecture, a high pagerank_nsr tells the hostAge twiddler: “this entity has been vetted by the wider web — skip the spam classification check.”
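
In pseudocode terms, the described behavior is just a gate inside the twiddler. The threshold, field names, and multipliers below are invented to show the shape of the logic, not leaked code:

```python
# Invented illustration of the described gating behavior: a host the link
# graph has already vetted skips the fresh-spam sandbox check entirely.
def host_age_twiddler(host_age_days: int, pagerank_nsr: float, spam_likelihood: float,
                      vetted_threshold: float = 0.6, young_days: int = 180) -> float:
    """Return a score multiplier (1.0 = no change, <1.0 = sandboxed)."""
    if pagerank_nsr >= vetted_threshold:
        return 1.0                      # "VIP pass": already vetted by the wider web
    if host_age_days < young_days and spam_likelihood > 0.5:
        return 0.2                      # fresh + spammy-looking -> demote at serving time
    return 1.0                          # fresh but clean: no active penalty, just no boost

print(host_age_twiddler(host_age_days=40, pagerank_nsr=0.1, spam_likelihood=0.8))  # 0.2
print(host_age_twiddler(host_age_days=40, pagerank_nsr=0.1, spam_likelihood=0.1))  # 1.0
```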

Actual impact range

| Scenario | hostAge impact |
| --- | --- |
| New domain + clean content + compliant SEO | Virtually none |
| New domain + sudden bulk content + spam signals | Sandbox active, visibility limited |
| Old domain + topic pivot + bulk content | Sandbox triggered |
| Old domain + sustained operation | Age alone isn't a boost; accumulated authority is |

Conclusion: don’t buy old domains for SEO age. Old domain + topic change = worse than starting fresh.

Status April 2026: Holds up.

Freshness — The Most Misread Factor

Ahrefs 2025 hard data

Ahrefs / Patrick Stox’s May 2025 top-10 age study is counter-intuitive:

| Metric | 2017 | 2025 | Direction |
| --- | --- | --- | --- |
| Top 10 pages 3+ years old | 59% | 72.9% | ⬆️ More old content |
| Top 10 pages under 1 year old | 22% | 13.7% | ⬇️ Fewer new pages |
| Average age of #1 page | 2 years | 5 years | ⬆️ Doubled |
| New pages reaching Top 10 within 1 year | 5.7% | 1.74% | ⬇️ Dropped sharply |

Conclusion: 2026 SERPs are more dominated by old content than ever before.

QDF (Query Deserves Freshness) only activates for specific queries

Search Engine Land’s QDF definition makes this explicit. QDF only fires for 3 query types:

  1. Breaking news / hot topics
  2. Recurring events (Olympics, elections, annual conferences)
  3. Frequently-changing topics (product launches, pricing, policy)

Activation conditions (Amit Singhal’s original 2007 NYT definition):

  • News sites are actively covering
  • Blogs are publishing frequently
  • Search volume is spiking

All three present → QDF activates → new content surfaces.
Any one missing → QDF dormant → old content dominates.

Freshness in 2026 reality

Most B2B / B2C / tutorial queries don’t trigger QDF, so old-content domination is structural, not anomalous.

Actual value of freshness:

  • QDF queries: direct ranking boost
  • Non-QDF queries: indirect effect — continuous publishing signals an “active” site, which increases crawl frequency for older pages

SE Ranking’s 16-month AI content experiment showed: after publishing new content, old page traffic jumped 17–19×. The real value of new content is activating site-level crawl, not the new pages themselves ranking.

Common misconception, corrected

The “update old articles regularly” advice is widely recommended. But:

  • First check whether the keyword is a QDF query
  • For Definition / Comparison / Consequence / Reason queries, old content has a structural advantage
  • Sloppy updates can trigger lastmod trust issues (see next section)

Status April 2026: Freshness isn’t a general ranking factor. It’s QDF-specific.

The lastmod Binary Trust Rule

Gary Illyes confirmed this directly on LinkedIn (June 2024), in reply to a question from MWC. Full exchange: Search Engine Journal’s writeup / Barry Schwartz’s Search Engine Roundtable record.

MWC: “If I’m specifying lastmod and Google’s signals consistently find I haven’t made significant changes, do you have any kind of reputation system to decide how much to trust what a site tells you?”

Illyes: “It’s binary — we either trust it or we don’t.”

MWC’s Leak finding

  • Google stores each URL’s “last significant update” timestamp (epoch format)
  • A boolean governs whether to trust your lastmod at all
  • “Once you’re a liar, permanent distrust”

Specific rules

| Edit type | Google response |
| --- | --- |
| Significant edit + lastmod updated | Positive signal |
| Significant edit + lastmod not updated | Neutral |
| Minor edit (few words) + lastmod updated | Negative; repeated → lastmod signal disabled |
| No edit + lastmod updated | Most negative: direct "liar" classification |
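
A practical way to stay on the right side of that binary flag is to bump lastmod only when the substantive content actually changes. A minimal sketch follows; note that a content hash only catches the "no real edit" case, and judging whether an edit is significant enough needs a richer heuristic (diff size, section count, etc.):

```python
# Only bump lastmod when the main content actually changed, so the sitemap
# never claims an update that Google's stored page versions can contradict.
import hashlib
from datetime import date

def content_fingerprint(main_content: str) -> str:
    # Hash only the main content, not boilerplate, timestamps, or ad markup.
    return hashlib.sha256(main_content.strip().encode("utf-8")).hexdigest()

def next_lastmod(old_fp: str, new_content: str, previous_lastmod: str) -> tuple[str, str]:
    new_fp = content_fingerprint(new_content)
    if new_fp == old_fp:
        return old_fp, previous_lastmod          # nothing changed: leave lastmod alone
    return new_fp, date.today().isoformat()      # real edit: safe to update lastmod

fp, lastmod = next_lastmod(old_fp="", new_content="Rewritten section with new data.",
                           previous_lastmod="2025-11-02")
print(lastmod)
```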

Supporting Leak detail

Google stores up to 20 historical versions of every page. Implications:

  • Google knows what you looked like historically
  • “Multiple small edits” accumulate into comparison basis — but lastmod trust is binary; once blacklisted, it doesn’t come back

Status April 2026: Fully holds, no changes.

Parasite SEO / LinkedIn Pulse Bleaching — Dead

The “publish on LinkedIn Pulse / Medium / Forbes Advisor to borrow authority” playbook is over in 2026.

Timeline

| Date | Event |
| --- | --- |
| March 2024 | Google introduces Site Reputation Abuse policy |
| November 2024 | Manual actions on Forbes, WSJ, Time, CNN |
| January 2025 | Written into Search Quality Rater Guidelines |
| August 2025 | Spam Update begins algorithmic enforcement (previously manual-only) |
| November 2025 | EU DMA investigation launched (Google accused of suppressing news publishers) |
| March 2026 | Core Update further sharpens enforcement |

Full timeline and technical detail: Digital Hitmen’s March 2026 Site Reputation Abuse complete guide.

Current state (April 2026)

  • Crude parasite SEO is dead
  • LinkedIn Pulse still ranks reasonably — because LinkedIn has “editorial friction” (connection requirements), Google treats it as a “quality filter”
  • But posting unrelated topics is high-risk (gambling, loans, CBD, etc.)
  • What works in 2026: publishing on LinkedIn in topics genuinely aligned with your professional identity

Schema and Entity Building

Integrating the two camps

  • Camp 1: schema has near-zero direct impact on LLM citations
  • Camp 2: schema is core to entity building

Integrated truth: Schema isn’t a direct ranking factor. It’s an entity-building accelerator.

Schema
    ↓
Entity disambiguation accelerated (Google confirms "who you are" faster)
    ↓
Entity authority established faster
    ↓
Knowledge Graph recognition
    ↓
Increased LLM citation probability (indirect, not direct)

The 3 core conditions for entity building in 2026

  1. Notability — at least 20–30 independent authoritative mentions
  2. Entity Home — one URL as the “source of truth,” typically the About page
  3. Corroboration — information fully consistent across all platforms

Practical entity verification (more realistic than pursuing a Knowledge Panel)

A full Knowledge Panel isn’t realistic for most sites — Google deleted 3 billion low-quality entities in June 2025. Knowledge Panels are for high-confidence entities, not every site owner.

Tiered entity verification:

| Tier | Indicator | Difficulty |
| --- | --- | --- |
| Tier 1 (baseline) | Brand search → your site ranks #1 | Easy |
| Tier 2 (decent) | Brand search shows brand card or sitelinks | Moderate |
| Tier 3 (good) | Knowledge Graph API returns your entity with a kg:/m/ ID | Hard |
| Tier 4 (strong) | Full Knowledge Panel on SERP | Very hard |
| Tier 5 (top-tier) | AI systems (ChatGPT / Gemini / Perplexity) cite you unprompted | Hardest |

Tier 2 is sufficient for most sites. Realistic target: brand search → site #1, and Knowledge Graph API finds your entity ID — not waiting for a full Knowledge Panel.

Why “brand search → site #1” is the core entity-health indicator

It directly maps to two of Q*’s three inputs:

  1. Brand Search (people search for you) → Q* Brand Visibility input
  2. Selection Rate (they pick your site) → Q* Selection Rate input

If people search your brand and can’t find or don’t pick your official site:

  • Brand Visibility data exists but doesn’t resolve to you
  • Selection Rate is low
  • Q* can’t clear 0.4
  • You’re not even eligible for Featured Snippets / PAA, let alone higher rankings

Entity verification tool

Use the Google Knowledge Graph API directly.

If it returns @id: “kg:/m/…” for your entity, Google recognizes you as an entity — more accurate than checking for a SERP Knowledge Panel.
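
A minimal lookup against the public Knowledge Graph Search API (you supply your own API key; endpoint and response fields are from Google's public documentation for that API):

```python
# Query the public Knowledge Graph Search API and print any entity IDs it returns.
# Requires an API key with the Knowledge Graph Search API enabled.
import requests

def lookup_entity(name: str, api_key: str, limit: int = 3) -> list[dict]:
    resp = requests.get(
        "https://kgsearch.googleapis.com/v1/entities:search",
        params={"query": name, "key": api_key, "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    results = []
    for item in resp.json().get("itemListElement", []):
        result = item.get("result", {})
        results.append({"id": result.get("@id"),          # e.g. "kg:/m/..."
                        "name": result.get("name"),
                        "score": item.get("resultScore")})
    return results

print(lookup_entity("your brand name", api_key="YOUR_API_KEY"))
```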

Deployment

Full deployment guide: Hobo-Web’s Entity SEO guide.

  • Person schema: name, jobTitle, knowsAbout, alumniOf, sameAs
  • Organization schema: name, legalName, url, logo, foundingDate, sameAs
  • The core is sameAs — connecting external authoritative identities
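
A minimal Person + Organization markup sketch with those fields, generated here as JSON-LD; every value is a placeholder, and the exact property set should follow schema.org for your own case:

```python
# Minimal Person / Organization JSON-LD with sameAs links (placeholder values).
import json

person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Doe",
    "jobTitle": "SEO Consultant",
    "knowsAbout": ["Search engine optimization", "Entity SEO"],
    "alumniOf": "Example University",
    "sameAs": [
        "https://www.linkedin.com/in/janedoe",
        "https://www.wikidata.org/wiki/Q00000000",
    ],
}

organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "legalName": "Example Co Ltd",
    "url": "https://www.example.com",
    "logo": "https://www.example.com/logo.png",
    "foundingDate": "2018-03-01",
    "sameAs": ["https://www.crunchbase.com/organization/example-co"],
}

# Emit as <script type="application/ld+json"> blocks in the site template.
print(json.dumps(person, indent=2))
print(json.dumps(organization, indent=2))
```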

External platform weighting (by ROI)

  1. Wikidata (highest ROI — direct input to Knowledge Graph)
  2. Google Business Profile
  3. LinkedIn
  4. Crunchbase
  5. Industry-specific authority platforms
  6. Official brand social accounts

Timeline expectations

  • Schema + sameAs deployed → Google processes the connections: 4–8 weeks
  • Knowledge Panel trigger: 3–6 months
  • Full recognition: 6–12 months

Status April 2026: Holds up, and entity building has shifted from “nice to have” to core defense against HCU / Spam Update collateral damage.

Integrated SEO Priority Order (April 2026)

Ranking diagnosis sequence, after three-source cross-validation:

Layer 1 — Entity health (foundation)

  • Is the entity clearly defined? (Disconnected Entity Hypothesis)
  • About page + schema + sameAs complete?
  • Quality Rater Guidelines Section 2.5.2 compliant?

Layer 2 — Site-level quality (Q*)

  • Brand search → site #1 (most direct Q* health signal)
  • Branded search volume
  • SERP Selection Rate (especially when not in #1)
  • Brand prevalence in anchor text

Layer 3 — Site-level authority (links + content depth)

  • Link graph quality (not just DR number)
  • Topic focus (siteFocusScore)
  • Content breadth and depth

Layer 4 — User signals (P*)

  • NavBoost data accumulation (13-month rolling window)
  • goodClicks / badClicks / lastLongestClicks trends
  • Country / device performance

Layer 5 — Single-page content (T*)

  • ABC signals (Anchors / Body / Clicks)
  • Query class match (Short Fact / Comparison / Definition etc.)
  • Schema implementation details

Most SEOs work this list in reverse — from Layer 5 up to Layer 1 — which is why results are slow or fragile.

What’s Now “Hard Fact” (April 2026)

Confirmed across three sources:

  1. Site authority (Q*) is real, subdomain-level, 0–1 scale, 0.4 is the SERP-features eligibility threshold
  2. Q*’s inputs are Brand Visibility + Selection Rate + Anchor Text Brand Prevalence
  3. NavBoost is one of the strongest ranking signals, 13-month window, country + device segmented
  4. HCU is site-wide, root cause is undefined entity
  5. Demotion mechanisms are explicit algorithmic flags, not vague “Google just knows”
  6. Sandbox is not a new-site penalty — it’s a demotion on “untrusted + suddenly-active” entities
  7. hostAge sandboxes fresh spam only, not clean new sites
  8. lastmod trust is binary — fake updates → permanent distrust
  9. Freshness is not a general ranking factor — QDF-specific
  10. Parasite SEO pathway is closed
  11. Schema accelerates entity disambiguation, indirectly affects LLM citation
  12. The 8 query classes determine differential algorithm weights
  13. Full Knowledge Panel is unrealistic for most sites — the practical bar is “brand search → site #1”

Things the English SEO world discusses but the Chinese SEO world barely touches yet:

  • Disconnected Entity Hypothesis
  • T* × Q* × P* framework
  • Site Quality Score 0.4 threshold
  • Q*’s three precise inputs (Brand + Selection Rate + Anchor)
  • NavBoost 13-month window and CRAPS
  • Synthetic Gap (Capper’s DA:BA 2:1 risk threshold)
  • Freshness and QDF limits
  • End of the Parasite SEO era
  • lastmod binary rule
  • The real mechanism of Sandbox (targeting fresh spam, not clean new sites)

These are the highest-value topics for Chinese SEO content right now, which is why I’m building out ylsseo.com around them.

