Over the past 18 months, SEO has received three pieces of hard evidence that finally let us replace speculation with fact:
- DOJ Antitrust Trial (Sep 2023 – Aug 2024): Google engineers under oath, internal emails entered as court evidence. Pandu Nayak full testimony PDF
- Content Warehouse API Leak (May 2024): 2,500+ internal modules, 14,000+ attributes. Mike King’s initial technical breakdown (iPullRank) / Rand Fishkin’s first disclosure (SparkToro)
- MWC Exploit (Dec 2024): Mark Williams-Cook’s team pulled live data on 2M sites / 90M queries through an API endpoint vulnerability. Search Engine Land coverage
This is my synthesis of all three sources. If you spot errors, please tell me. Learning in public.
What Each Source Actually Gives Us
| Source | Nature | Unique contribution |
|---|---|---|
| DOJ Trial | Sworn testimony + internal emails | Google’s own names for the mechanisms, and their relative importance |
| API Leak | Internal technical docs | Specific variable names, data structures, module organization |
| MWC Exploit | Live API data (now patched) | Actual numerical distributions across 2M sites / 90M queries |
Key framing: DOJ tells you what Google admits to using. API Leak tells you what those things are called internally. MWC tells you the real numbers. Any single source is arguable; all three together are hard evidence.
AJ Kohn’s SEO-focused breakdown of Pandu Nayak’s testimony is still the deepest first-hand analysis I’ve seen.
Google’s Ranking Architecture (all three sources agree)
Search processing runs through six stages:
1. Crawling
2. Indexing (tiered: Base / Zeppelins / Landfills)
3. Query Processing
4. Core Ranking (Ascorer / Mustang) — T* × Q* × P*
5. Post-Ranking Re-ranking (Twiddler framework)
→ NavBoost / Freshness / QualityBoost etc. running in parallel
6. SERP Generation (including Gemini-generated AI Overviews)
Three things to internalize:
- Core Ranking decides whether you’re in the candidate pool
- Twiddlers decide where you rank inside it
- AIO, Featured Snippets, PAA are all Twiddler-layer outputs — which explains why these SERP features move fast and flicker in/out
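To make the two-layer split concrete, here is a minimal Python sketch. Everything in it (function names, the pool size, the multiplier values) is my illustration, not Google's actual interface:

```python
# Minimal sketch of the two-layer split: core ranking decides WHO is in the
# candidate pool, twiddlers decide WHERE each candidate ranks inside it.
# All names and numbers here are hypothetical illustrations.
from typing import Callable

Candidate = dict  # e.g. {"url": ..., "score": ..., "good_clicks": ...}

def core_ranking(docs: list[Candidate], pool_size: int = 1000) -> list[Candidate]:
    """Stage 4: build the candidate pool."""
    return sorted(docs, key=lambda d: d["score"], reverse=True)[:pool_size]

def apply_twiddlers(pool: list[Candidate],
                    twiddlers: list[Callable[[Candidate], float]]) -> list[Candidate]:
    """Stage 5: re-rank inside the pool.
    Each twiddler returns a multiplier (boost > 1, demotion < 1)."""
    for doc in pool:
        for twiddler in twiddlers:
            doc["score"] *= twiddler(doc)
    return sorted(pool, key=lambda d: d["score"], reverse=True)

# Example twiddlers running in parallel (illustrative only)
navboost  = lambda d: 1.2 if d.get("good_clicks", 0) > d.get("bad_clicks", 0) else 0.8
freshness = lambda d: 1.1 if d.get("qdf_query") and d.get("is_fresh") else 1.0
```

Because twiddlers recompute per query, SERP features built at that layer can appear and disappear without the underlying candidate pool changing at all.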
For a fuller architectural integration, Shaun Anderson’s 2025 synthesis is the cleanest writeup.
T* × Q* × P* — The Three-Factor Formula
Shaun Anderson’s distilled formula based on DOJ testimony:
Ranking = T* × Q* × P*
T* (Topicality)
Google engineer HJ Kim testified that T* is built from three signals — the ABC signals:
- A = Anchors — anchor text from links pointing at the page
- B = Body — content match to the query
- C = Clicks — user clicks on this result for this query
Q* (Site Quality)
Maps directly to MWC’s site quality score. Cross-validated across all three sources:
- DOJ: confirmed Q* exists
- API Leak: `siteAuthority` is a real, stored variable (see Shaun Anderson’s Q* deep-dive)
- MWC: actual values are on a 0–1 scale, and 0.4 is the hard eligibility threshold for SERP features (Featured Snippets, PAA)
Q*’s three calculation inputs (from MWC’s SearchNorwich presentation):
- Brand Visibility — queries directly containing the brand name, or brand + modifier
- SERP Selection Rate — how often users select your result, especially when you’re not in position 1
- Anchor Text Brand Prevalence — how often anchor text on the wider web contains your brand / domain name
Selection Rate is the most counter-intuitive signal and probably the most important:
- Ranking #5 but users consistently skip 1–4 and click you → Selection Rate high → Q* climbs
- Ranking #1 but users scroll past you → Selection Rate low → Q* drops
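A hedged sketch of what a position-aware selection rate could look like. The baseline CTR numbers are placeholders I made up, not values from the exploit:

```python
# Hypothetical position-aware selection rate: compare actual click share at a
# position against a baseline CTR for that position. The baseline numbers are
# illustrative placeholders, not leaked values.
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def selection_rate_signal(position: int, clicks: int, impressions: int) -> float:
    """> 1.0 means users pick you MORE than your position predicts (Q* climbs);
    < 1.0 means they skip you despite your position (Q* drops)."""
    actual = clicks / impressions
    return actual / EXPECTED_CTR[position]

print(selection_rate_signal(5, 120, 1000))  # 0.12 vs 0.05 expected -> 2.4, strong
print(selection_rate_signal(1, 150, 1000))  # 0.15 vs 0.30 expected -> 0.5, weak
```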
P* (Popularity)
Primarily driven by NavBoost (13-month rolling click data) + the link graph.
- DOJ: Nayak under oath called NavBoost “one of the important signals” Google has
- Internal email (2019) from VP Alexander Grushetsky: NavBoost alone may be more impactful than the rest of ranking combined (“stealing wins”)
- API Leak: NavBoost is referenced 84 times across Content Warehouse modules
This formula replaces every speculative ranking-factor list floating around SEO. It has three-source evidence behind it and is directly usable.
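To see why the multiplicative structure matters (a near-zero factor kills the product no matter how strong the others are), here is a toy sketch. The ABC composition follows the testimony; the weights and numbers are mine:

```python
# Toy illustration of Ranking = T* x Q* x P*. The point is the multiplicative
# structure: if any factor is ~0, the product is ~0 regardless of the rest.
# Sub-signal names follow the testimony (ABC); the averaging is made up.
def t_star(anchors: float, body: float, clicks: float) -> float:
    return (anchors + body + clicks) / 3          # ABC signals

def ranking_score(t: float, q: float, p: float) -> float:
    return t * q * p

# A page with perfect content (B) but zero anchors (A) and zero clicks (C)
# on a low-Q* site barely registers:
t = t_star(anchors=0.0, body=1.0, clicks=0.0)     # ~0.33
print(ranking_score(t, q=0.2, p=0.1))             # ~0.007
```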
⚡ Status April 2026: Fully holds up. Shaun Anderson, Mike King, and others continue to enrich the model.
NavBoost — The Full Picture (three-source reconstruction)
Baseline (DOJ confirmed)
- 13-month rolling window of click data
- Not raw counts — classified click quality
- Nayak testified in Oct 2023 that it’s one of the strongest ranking signals
Variable layer (from the Leak)
Content Warehouse modules related to NavBoost include:
- goodClicks — user clicked and stayed (no pogo-stick)
- badClicks — user clicked and immediately returned to SERP
- lastLongestClicks — the longest dwell in a session (the “ultimate satisfaction” signal)
- country + language — ratings are stored separately by country and language
Full technical breakdown: Shaun Anderson’s NavBoost deep-dive.
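A minimal sketch of how those buckets could be derived from a session. The bucket names and the country/language segmentation are from the leak; the dwell-time cutoff is my assumption:

```python
# Sketch of click classification using the leaked bucket names. The dwell
# threshold is an assumption for illustration; the leak exposes the variables
# (goodClicks, badClicks, lastLongestClicks), not the exact cutoffs.
from dataclasses import dataclass

@dataclass
class Click:
    url: str
    dwell_seconds: float
    returned_to_serp: bool
    country: str   # ratings stored separately per country (per the leak)
    language: str  # and per language

def classify(click: Click) -> str:
    if click.returned_to_serp and click.dwell_seconds < 10:
        return "badClicks"        # pogo-stick back to SERP
    return "goodClicks"           # clicked and stayed

def last_longest(session: list[Click]) -> Click:
    """The longest dwell in the session: the 'ultimate satisfaction' signal."""
    return max(session, key=lambda c: c.dwell_seconds)
```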
Layered architecture (Leak + MWC)
- NavBoost is the data collection layer
- A separate system called CRAPS (Click and Results Prediction System) converts click data into demotion scores
- Applied to SERPs via the Twiddler framework
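Sketching the three layers in a few lines. Only the layer separation comes from the sources; the conversion math is invented:

```python
# NavBoost collects, CRAPS converts click data into a demotion score, the
# twiddler applies it. This conversion formula is invented for illustration.
def craps_demotion(good_clicks: int, bad_clicks: int) -> float:
    """Returns a multiplier in (0, 1]; more bad clicks -> stronger demotion."""
    total = good_clicks + bad_clicks
    if total == 0:
        return 1.0                      # no data, no demotion
    return max(0.1, good_clicks / total)

def navboost_twiddler(score: float, good: int, bad: int) -> float:
    return score * craps_demotion(good, bad)
```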
Practical implications
- 13-month window → short-term CTR manipulation doesn’t work
- Country-segmented → US click data and German click data are scored separately for the same page
- Device-segmented → mobile and desktop are separate scores
- Compounding advantage → sustained quality clicks become a moat competitors can’t replicate short-term
⚡ Status April 2026: Fully holds up.
Site Quality Score — Full Mechanism
Baseline (MWC Exploit, first public disclosure)
- Subdomain-level scoring (not domain) — `www.example.com` and `help.example.com` get different scores
- 0–1 scale, 0.4 is the SERP-features eligibility threshold
- Inputs already covered in section 3 (Brand Visibility + Selection Rate + Anchor Text Brand Prevalence)
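What subdomain-level scoring plus the 0.4 gate implies, as a sketch with made-up scores:

```python
# Sketch: Q* is keyed by subdomain, and 0.4 gates SERP-feature eligibility.
# The scores below are illustrative; in the exploit they came from the API.
from urllib.parse import urlparse

SITE_QUALITY = {                 # hypothetical stored scores, 0-1 scale
    "www.example.com": 0.55,
    "help.example.com": 0.31,    # same domain, different score
}

def serp_feature_eligible(url: str, threshold: float = 0.4) -> bool:
    host = urlparse(url).hostname
    return SITE_QUALITY.get(host, 0.0) >= threshold

print(serp_feature_eligible("https://www.example.com/guide"))   # True
print(serp_feature_eligible("https://help.example.com/faq"))    # False
```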
MWC’s top-0.1% case
From the August 2025 Advanced Web Ranking interview: some of the highest-scoring Q* sites he saw in the exploit data were FAQ sections on university library subdomains. The reason — those pages could only be found via search, and search traffic was a dominant share of total traffic.
His core takeaway:
“If, for whatever reason, you lose visibility, and Google sees that nobody is actively searching for you — if you don’t appear in search, you just don’t exist — then your site is dead in the water.”
New-site prediction scoring (the “Predicting Site Quality” patent)
- Separate patent: “Predicting Site Quality”
- Google vectorizes existing indexed content
- When a new site publishes, it’s compared against the vector space of known sites
- It inherits a starting score from its nearest mathematical neighbors
This explains the “fly then crash” cycle of AI content farms:
- AI content is trained on the best of the web, so its vectors resemble high-quality sites
- Google assigns an initial score of 0.8–0.9, rankings fly
- 6–12 months in, real user signals don’t match the prediction
- Score gets revised down → rankings collapse
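A toy version of the patent's mechanism. The vectors and scores are fabricated; only the nearest-neighbor inheritance idea comes from the patent:

```python
# Toy sketch of "Predicting Site Quality": a new site's starting score is
# inherited from its nearest neighbors in content-vector space.
import numpy as np

known_vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])  # indexed sites
known_quality = np.array([0.85, 0.80, 0.20])                     # their Q* scores

def predicted_quality(new_site_vector: np.ndarray, k: int = 2) -> float:
    dists = np.linalg.norm(known_vectors - new_site_vector, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(known_quality[nearest].mean())

# AI content "looks like" high-quality sites in vector space, so it starts high:
print(predicted_quality(np.array([0.85, 0.15])))  # ~0.825, until real signals arrive
```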
Variables exposed by the Leak
- `siteAuthority` — site-level authority score (Google denied “domain authority” for years; turns out the concept exists under a different name)
- `siteFocusScore` — topic concentration
- `siteRadius` — how far a page deviates from the site’s core topic
- `hostAge` — site age (covered below)
⚡ Status April 2026: The 0.4 threshold is now the industry consensus benchmark.
Google’s 8 Query Classes (Refined Query Semantic Classes)
The MWC exploit revealed that Google classifies nearly all queries into 8 categories, and ranking weights differ across categories. Best English writeup of the 8 classes: Harry Clarkson-Bennett’s Leadership in SEO piece.
| Class | Meaning | Example | SEO implication |
|---|---|---|---|
| Short Facts | Direct factual answer | “who is UK PM” | Heaviest AIO cannibalization |
| Comparison | Entity comparison | “iPhone vs Samsung” | Core B2B decision queries |
| Consequence | Outcome of an action | “what happens if you drink too much coffee” | YMYL risk |
| Reason | Why something occurs | “why is the sky blue” | High AIO hit rate |
| Definition | Meaning of a concept | “what is blockchain” | Heavy AIO cannibalization |
| Instruction | Step-by-step how-to | “how to bake a cake” | HowTo long-tail |
| Boolean | Yes/no question | “is it raining today” | Heavy AIO cannibalization |
| Other | Everything else, incl. local | “coffee shops near me” | Catch-all |
MWC trained an open-source classifier based on these 8 classes — you can plug in your keywords and get Google’s own classification.
Practical value
After keyword research, run your queries through the classifier first, then write the content to match the user expectation for that query class. More precise than reverse-engineering SERP skeletons, because this is Google’s internal taxonomy.
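I don't know the internals of MWC's classifier, so here is a crude rule-based stand-in just to show the workflow. The 8 class names are from the exploit; the regex heuristics are mine and far weaker than a trained model:

```python
# Crude rule-based stand-in for the 8-class taxonomy, to show the workflow.
# Class names are from the exploit; these heuristics are my own.
import re

RULES = [
    ("Instruction", r"^how to\b"),
    ("Reason",      r"^why\b"),
    ("Definition",  r"^what is\b|^what are\b"),
    ("Comparison",  r"\bvs\.?\b|\bversus\b"),
    ("Consequence", r"^what happens if\b"),
    ("Boolean",     r"^(is|are|can|does|do|should)\b"),
    ("Short Facts", r"^(who|when|where)\b"),
]

def classify_query(query: str) -> str:
    q = query.lower().strip()
    for cls, pattern in RULES:
        if re.search(pattern, q):
            return cls
    return "Other"

print(classify_query("iPhone vs Samsung"))                           # Comparison
print(classify_query("what happens if you drink too much coffee"))   # Consequence
```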
⚡ Status April 2026: Holds up.
HCU + The Disconnected Entity Hypothesis
Surface vs. reality
Google’s public framing: HCU evaluates whether content is “written for people.”
What the three sources actually show:
- HCU is a site-wide signal, not page-level (API Leak confirms)
- Merged into core ranking in March 2024
- Mechanism is demotion-first (it only demotes, doesn’t promote)
- The real trigger isn’t “content quality” — it’s “undefined entity”
Disconnected Entity Hypothesis (Shaun Anderson, 2025)
Original article. The causal chain:
Entity not defined
↓
Google can't evaluate "why you exist"
↓
Site classified as "Unhelpful"
↓
HCU site-wide demotion triggered
↓
Every page on the site gets demoted (including genuinely good pages)
Recovery path:
- Not content optimization
- Not technical SEO
- It’s entity definition: About page, author info, schema, sameAs, real-world business evidence
- Core reference is Section 2.5.2 (page 16) of the Search Quality Rater Guidelines — “Finding Who is Responsible for the Website and Who Created the Content on the Page”
Shaun’s April 2026 update on HCU’s current state further reinforces this framework.
Tom Capper’s Synthetic Gap addendum
Tom Capper’s original Moz research: “The Helpful Content Update Was Not What You Think” revealed the key data pattern:
HCU losers share a common profile: Domain Authority significantly higher than Brand Authority (DA:BA ≥ 2:1). Google flags this profile as “synthetic authority” and demotes.
Translation: you’ve built links fast but nobody searches for your brand — that’s the risk profile.
Capper’s data (1.9M keyword sample): HCU losers average Brand Authority 37, winners and neutrals average 50–52. That BA gap is the mathematical signature of synthetic authority. PPC Land’s shorter breakdown if you want the TL;DR.
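The ratio check is trivial to run against your own metrics (Moz's DA/BA, or any authority/brand pair as a proxy):

```python
# The Synthetic Gap check from Capper's data: Domain Authority running far
# ahead of Brand Authority is the risk profile. Metric names follow Moz;
# the 2:1 threshold is from the research cited above.
def synthetic_gap_risk(domain_authority: float, brand_authority: float) -> bool:
    """True if the DA:BA ratio crosses Capper's 2:1 risk threshold."""
    if brand_authority == 0:
        return True   # links but literally no brand demand: worst case
    return domain_authority / brand_authority >= 2.0

print(synthetic_gap_risk(70, 30))  # True  -> HCU-loser profile
print(synthetic_gap_risk(55, 50))  # False -> winner/neutral profile
```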
⚡ Status April 2026: Holds up, and the March 2026 Core Update sharpened this direction further.
The Demotion Variable Set (from the Leak)
Shaun Anderson’s evidence-based mapping of updates to leaked signals covers the full set. The ones that matter most:
| Demotion | Variable | Trigger |
|---|---|---|
| Anchor mismatch | anchorMismatchDemotion | Anchor text doesn’t match target page topic |
| Exact-match domain | exactMatchDomainDemotion | Domain exactly matches target keyword → partial demotion |
| SERP dissatisfaction | serpDemotion | Users pogo-stick back to SERP from your page |
| Navigation | navDemotion | Aggregate NavBoost negative signal |
| Product reviews | productReviewsDemotion | Low-quality product review content |
| Clutter | clutterScore | Too many ads / popups |
| Mobile interstitials | violatesMobileInterstitialPolicy | Full-screen ads on mobile |
⚠️ exactMatchDomainDemotion is the hidden tax on a lot of “keyword domain” strategies people still recommend.
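What "explicit algorithmic flags" means in practice, as a sketch. The variable names are from the leak; the multiplier values are invented:

```python
# Sketch of demotions as explicit flags, not vibes. Names are from the leak;
# the multiplier values are invented for illustration.
DEMOTION_MULTIPLIERS = {
    "anchorMismatchDemotion": 0.85,
    "exactMatchDomainDemotion": 0.80,
    "serpDemotion": 0.70,
    "navDemotion": 0.75,
    "productReviewsDemotion": 0.60,
}

def apply_demotions(base_score: float, active_flags: set[str]) -> float:
    for flag in active_flags:
        base_score *= DEMOTION_MULTIPLIERS.get(flag, 1.0)
    return base_score

# An exact-match domain with mismatched anchors takes a compounding hit:
print(apply_demotions(1.0, {"exactMatchDomainDemotion", "anchorMismatchDemotion"}))  # ~0.68
```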
hostAge and the Truth About Sandbox
Widely misunderstood variable. Needs to be re-read carefully.
Raw API Leak description
hostAge (type: integer):
- Earliest first-seen date of all pages in this host/domain
- Used by twiddler to sandbox fresh spam at serving time
- 16-bit, day count starting from 2005-12-31
- If URL's host_age == domain_age, domain_age is omitted
Key phrase: “fresh spam”. This variable’s purpose is identifying newly-appeared spam content, not “punishing all new sites.”
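The encoding detail is easy to make concrete: decoding the day count takes two lines (the epoch is from the leak; the decode itself is mine):

```python
# Decoding hostAge per the leaked description: a 16-bit day count with epoch
# 2005-12-31.
from datetime import date, timedelta

HOST_AGE_EPOCH = date(2005, 12, 31)

def host_age_to_date(host_age_days: int) -> date:
    return HOST_AGE_EPOCH + timedelta(days=host_age_days)

print(host_age_to_date(7000))   # 2025-03-01: a host first seen ~19 years in
```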
What Sandbox actually is
Shaun Anderson’s March 2026 hostAge deep-dive is definitive:
“If you don’t look like spam, you don’t get sandboxed.” “Google wasn’t lying. It caught you.”
Sandbox isn’t a new-site penalty. It’s a demotion mechanism targeting “untrusted + suddenly active” entities.
What actually triggers Sandbox
- New domain + sudden high content volume (content farm pattern)
- Old domain + sudden topic change + bulk content (expired domain abuse)
- New subdomain on a clean old domain + sudden activity (a form of site reputation abuse)
- Any “newly-appearing” entity + absent user / link signals
Why clean new sites still feel sandboxed
Google isn’t actively sandboxing them. It’s that:
- No PageRank signal → low crawl priority
- No user signal → no NavBoost data
- No authority links → the Anchors component of T* is zero
- No brand searches → Q* sits low
Result looks identical to sandbox — no rankings. But the mechanism is different: it’s “not yet vetted,” not “punished.”
Shaun’s core insight on hostAge
“PageRank is the VIP pass that skips sandbox.”
In the leaked architecture, a high pagerank_nsr tells the hostAge twiddler: “this entity has been vetted by the wider web — skip the spam classification check.”
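The "VIP pass" reduces to a simple gate. `pagerank_nsr` is the leaked name; the threshold and the boolean logic are my illustration:

```python
# The "VIP pass" gate in miniature: strong PageRank skips the fresh-spam
# check entirely. pagerank_nsr is the leaked name; the threshold is invented.
def hostage_twiddler(is_new_host: bool, pagerank_nsr: float,
                     looks_like_spam: bool) -> bool:
    """Returns True if the host gets sandboxed at serving time."""
    if pagerank_nsr > 0.5:          # vetted by the wider web: skip the check
        return False
    return is_new_host and looks_like_spam   # only fresh spam gets caught
```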
Actual impact range
| Scenario | hostAge impact |
|---|---|
| New domain + clean content + compliant SEO | Virtually none |
| New domain + sudden bulk content + spam signals | Sandbox active, visibility limited |
| Old domain + topic pivot + bulk content | Sandbox triggered |
| Old domain + sustained operation | Age alone isn’t a boost — accumulated authority is |
Conclusion: don’t buy old domains for SEO age. Old domain + topic change = worse than starting fresh.
⚡ Status April 2026: Holds up.
Freshness — The Most Misread Factor
Ahrefs 2025 hard data
Ahrefs / Patrick Stox’s May 2025 top-10 age study is counter-intuitive:
| Metric | 2017 | 2025 | Direction |
|---|---|---|---|
| Top 10 pages 3+ years old | 59% | 72.9% | ⬆️ More old content |
| Top 10 pages under 1 year old | 22% | 13.7% | ⬇️ Fewer new pages |
| Average age of #1 page | 2 years | 5 years | ⬆️ Doubled |
| New pages reaching Top 10 within 1 year | 5.7% | 1.74% | ⬇️ Dropped sharply |
Conclusion: 2026 SERPs are more dominated by old content than ever before.
QDF (Query Deserves Freshness) only activates for specific queries
Search Engine Land’s QDF definition makes this explicit. QDF only fires for 3 query types:
- Breaking news / hot topics
- Recurring events (Olympics, elections, annual conferences)
- Frequently-changing topics (product launches, pricing, policy)
Activation conditions (Amit Singhal’s original 2007 NYT definition):
- News sites are actively covering
- Blogs are publishing frequently
- Search volume is spiking
All three present → QDF activates → new content surfaces.
Any one missing → QDF dormant → old content dominates.
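The activation logic is a strict AND, which a two-line check makes obvious:

```python
# QDF as a strict AND over Singhal's three conditions: drop any one and the
# query falls back to old-content dominance.
def qdf_active(news_coverage: bool, blog_activity: bool, volume_spike: bool) -> bool:
    return news_coverage and blog_activity and volume_spike

print(qdf_active(True, True, True))    # True  -> fresh content surfaces
print(qdf_active(True, True, False))   # False -> old content dominates
```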
Freshness in 2026 reality
Most B2B / B2C / tutorial queries don’t trigger QDF, so old-content domination is structural, not anomalous.
Actual value of freshness:
- QDF queries: direct ranking boost
- Non-QDF queries: indirect effect — continuous publishing signals an “active” site, which increases crawl frequency for older pages
SE Ranking’s 16-month AI content experiment showed: after publishing new content, old page traffic jumped 17–19×. The real value of new content is activating site-level crawl, not the new pages themselves ranking.
Common misconception, corrected
The “update old articles regularly” advice is widely recommended. But:
- First check whether the keyword is a QDF query
- For Definition / Comparison / Consequence / Reason queries, old content has a structural advantage
- Sloppy updates can trigger lastmod trust issues (see next section)
⚡ Status April 2026: Freshness isn’t a general ranking factor. It’s QDF-specific.
The lastmod Binary Trust Rule
Gary Illyes confirmed this directly on LinkedIn (June 2024), in response to a question from MWC. Full exchange: Search Engine Journal’s writeup / Barry Schwartz’s Search Engine Roundtable record.
MWC: “If I’m specifying lastmod and Google’s signals consistently find I haven’t made significant changes, do you have any kind of reputation system to decide how much to trust what a site tells you?”
Illyes: “It’s binary — we either trust it or we don’t.”
MWC’s Leak finding
- Google stores each URL’s “last significant update” timestamp (epoch format)
- A boolean governs whether to trust your lastmod at all
- “Once you’re a liar, permanent distrust”
Specific rules
| Edit type | Google response |
|---|---|
| Significant edit + lastmod updated | Positive signal |
| Significant edit + lastmod not updated | Neutral |
| Minor edit (few words) + lastmod updated | Negative; repeated → lastmod signal disabled |
| No edit + lastmod updated | Most negative — direct “liar” classification |
Supporting Leak detail
Google stores up to 20 historical versions of every page. Implications:
- Google knows what you looked like historically
- “Multiple small edits” accumulate into comparison basis — but lastmod trust is binary; once blacklisted, it doesn’t come back
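The trust rule behaves like a one-way latch: it can flip to distrust but never back. A sketch, with the "significant change" detector hand-waved as a diff ratio:

```python
# The lastmod rule as a one-way latch: trust flips to False and stays there.
# How Google detects a "significant" change is hand-waved here via a simple
# diff ratio; the binary/permanent behavior is what Illyes confirmed.
class LastmodTrust:
    def __init__(self) -> None:
        self.trusted = True               # binary: trust it or don't

    def observe(self, lastmod_changed: bool, content_change_ratio: float) -> None:
        significant = content_change_ratio > 0.2   # placeholder threshold
        if lastmod_changed and not significant:
            self.trusted = False          # "liar": no way back
```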
⚡ Status April 2026: Fully holds, no changes.
Parasite SEO / LinkedIn Pulse Bleaching — Dead
The “publish on LinkedIn Pulse / Medium / Forbes Advisor to borrow authority” playbook is over in 2026.
Timeline
| Date | Event |
|---|---|
| March 2024 | Google introduces Site Reputation Abuse policy |
| November 2024 | Manual actions on Forbes, WSJ, Time, CNN |
| January 2025 | Written into Search Quality Rater Guidelines |
| August 2025 | Spam Update begins algorithmic enforcement (previously manual-only) |
| November 2025 | EU DMA investigation launched (Google accused of suppressing news publishers) |
| March 2026 | Core Update further sharpens enforcement |
Full timeline and technical detail: Digital Hitmen’s March 2026 Site Reputation Abuse complete guide.
Current state (April 2026)
- Crude parasite SEO is dead
- LinkedIn Pulse still ranks reasonably — because LinkedIn has “editorial friction” (connection requirements), Google treats it as a “quality filter”
- But posting unrelated topics is high-risk (gambling, loans, CBD, etc.)
- What works in 2026: publishing on LinkedIn in topics genuinely aligned with your professional identity
Schema and Entity Building
Integrating the two camps
- Camp 1: schema has near-zero direct impact on LLM citations
- Camp 2: schema is core to entity building
Integrated truth: Schema isn’t a direct ranking factor. It’s an entity-building accelerator.
Schema
↓
Entity disambiguation accelerated (Google confirms "who you are" faster)
↓
Entity authority established faster
↓
Knowledge Graph recognition
↓
Increased LLM citation probability (indirect, not direct)
The 3 core conditions for entity building in 2026
- Notability — at least 20–30 independent authoritative mentions
- Entity Home — one URL as the “source of truth,” typically the About page
- Corroboration — information fully consistent across all platforms
Practical entity verification (more realistic than pursuing a Knowledge Panel)
A full Knowledge Panel isn’t realistic for most sites — Google deleted 3 billion low-quality entities in June 2025. Knowledge Panels are for high-confidence entities, not every site owner.
Tiered entity verification:
| Tier | Indicator | Difficulty |
|---|---|---|
| Tier 1 (baseline) | Brand search → your site ranks #1 | Easy |
| Tier 2 (decent) | Brand search shows brand card or sitelinks | Moderate |
| Tier 3 (good) | Knowledge Graph API returns your entity with a kg:/m/ ID | Hard |
| Tier 4 (strong) | Full Knowledge Panel on SERP | Very hard |
| Tier 5 (top-tier) | AI systems (ChatGPT / Gemini / Perplexity) cite you unprompted | Hardest |
Tier 2 is sufficient for most sites. Realistic target: brand search → site #1, and Knowledge Graph API finds your entity ID — not waiting for a full Knowledge Panel.
Why “brand search → site #1” is the core entity-health indicator
It directly maps to two of Q*’s three inputs:
- Brand Search (people search for you) → Q* Brand Visibility input
- Selection Rate (they pick your site) → Q* Selection Rate input
If people search your brand and can’t find or don’t pick your official site:
- Brand Visibility data exists but doesn’t resolve to you
- Selection Rate is low
- Q* can’t clear 0.4
- You’re not even eligible for Featured Snippets / PAA, let alone higher rankings
Entity verification tool
Use Google Knowledge Graph API directly.
If it returns @id: “kg:/m/…” for your entity, Google recognizes you as an entity — more accurate than checking for a SERP Knowledge Panel.
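A minimal check against the real endpoint (bring your own API key; error handling omitted for brevity):

```python
# Minimal entity check against the Knowledge Graph Search API.
import requests

def kg_entity_id(brand: str, api_key: str) -> str | None:
    resp = requests.get(
        "https://kgsearch.googleapis.com/v1/entities:search",
        params={"query": brand, "key": api_key, "limit": 1},
    )
    items = resp.json().get("itemListElement", [])
    if not items:
        return None
    return items[0]["result"].get("@id")   # e.g. "kg:/m/..." for a recognized entity

# kg_entity_id("YourBrand", "YOUR_API_KEY") returning "kg:/m/..." means Tier 3 reached
```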
Deployment
Full deployment guide: Hobo-Web’s Entity SEO guide.
- Person schema: name, jobTitle, knowsAbout, alumniOf, sameAs
- Organization schema: name, legalName, url, logo, foundingDate, sameAs
- The core is `sameAs` — connecting external authoritative identities
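A sketch of the deployment target: Organization JSON-LD with sameAs as the connective tissue. The property names are standard schema.org; every URL below is a placeholder:

```python
# Sketch of the JSON-LD deployment target. Property names are standard
# schema.org; every URL is a placeholder.
import json

organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "legalName": "Example Co Ltd",
    "url": "https://www.example.com",
    "logo": "https://www.example.com/logo.png",
    "foundingDate": "2020-01-01",
    "sameAs": [                              # the entity-connecting core
        "https://www.wikidata.org/wiki/QXXXXXXX",
        "https://www.crunchbase.com/organization/example-co",
        "https://www.linkedin.com/company/example-co",
    ],
}

print(f'<script type="application/ld+json">{json.dumps(organization)}</script>')
```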
External platform weighting (by ROI)
- Wikidata (highest ROI — direct input to Knowledge Graph)
- Google Business Profile
- Crunchbase
- Industry-specific authority platforms
- Official brand social accounts
Timeline expectations
- Schema + sameAs deployed → Google processes the connections: 4–8 weeks
- Knowledge Panel trigger: 3–6 months
- Full recognition: 6–12 months
⚡ Status April 2026: Holds up, and entity building has shifted from “nice to have” to core defense against HCU / Spam Update collateral damage.
Integrated SEO Priority Order (April 2026)
Ranking diagnosis sequence, after three-source cross-validation:
Layer 1 — Entity health (foundation)
- Is the entity clearly defined? (Disconnected Entity Hypothesis)
- About page + schema + sameAs complete?
- Quality Rater Guidelines Section 2.5.2 compliant?
Layer 2 — Site-level quality (Q*)
- Brand search → site #1 (most direct Q* health signal)
- Branded search volume
- SERP Selection Rate (especially when not in #1)
- Brand prevalence in anchor text
Layer 3 — Site-level authority (links + content depth)
- Link graph quality (not just DR number)
- Topic focus (`siteFocusScore`)
- Content breadth and depth
Layer 4 — User signals (P*)
- NavBoost data accumulation (13-month rolling window)
- goodClicks / badClicks / lastLongestClicks trends
- Country / device performance
Layer 5 — Single-page content (T*)
- ABC signals (Anchors / Body / Clicks)
- Query class match (Short Fact / Comparison / Definition etc.)
- Schema implementation details
Most SEOs work this list in reverse — from Layer 5 up to Layer 1 — which is why results are slow or fragile.
What’s Now “Hard Fact” (April 2026)
Confirmed across three sources:
- Site authority (Q*) is real, subdomain-level, 0–1 scale, 0.4 is the SERP-features eligibility threshold
- Q*’s inputs are Brand Visibility + Selection Rate + Anchor Text Brand Prevalence
- NavBoost is one of the strongest ranking signals, 13-month window, country + device segmented
- HCU is site-wide, root cause is undefined entity
- Demotion mechanisms are explicit algorithmic flags, not vague “Google just knows”
- Sandbox is not a new-site penalty — it’s a demotion on “untrusted + suddenly-active” entities
- `hostAge` sandboxes fresh spam only, not clean new sites
- lastmod trust is binary — fake updates → permanent distrust
- Freshness is not a general ranking factor — QDF-specific
- Parasite SEO pathway is closed
- Schema accelerates entity disambiguation, indirectly affects LLM citation
- The 8 query classes determine differential algorithm weights
- Full Knowledge Panel is unrealistic for most sites — the practical bar is “brand search → site #1”
Things the English SEO world discusses but the Chinese SEO world barely touches yet:
- Disconnected Entity Hypothesis
- T* × Q* × P* framework
- Site Quality Score 0.4 threshold
- Q*’s three precise inputs (Brand + Selection Rate + Anchor)
- NavBoost 13-month window and CRAPS
- Synthetic Gap (Capper’s DA:BA 2:1 risk threshold)
- Freshness and QDF limits
- End of the Parasite SEO era
- lastmod binary rule
- The real mechanism of Sandbox (targeting fresh spam, not clean new sites)
These are the highest-value topics for Chinese SEO content right now, which is why I’m building out ylsseo.com around them.
Full Reference List
Primary:
- Pandu Nayak DOJ testimony full PDF
- Mike King (iPullRank) API Leak initial analysis
- Rand Fishkin (SparkToro) first disclosure
- MWC Exploit coverage (Search Engine Land)
- Gary Illyes lastmod binary — Search Engine Journal
- Google Site Reputation Abuse policy
- Google Knowledge Graph API
Secondary core:
- Shaun Anderson — Disconnected Entity Hypothesis
- Shaun Anderson — Q* deep-dive
- Shaun Anderson — NavBoost mechanism
- Shaun Anderson — DOJ antitrust trial synthesis
- Shaun Anderson — hostAge and Sandbox truth
- Shaun Anderson — How Google Works 2025–2026
- Tom Capper (Moz) — Helpful Content Update Was Not What You Think
- AJ Kohn — What Pandu Nayak Taught Me About SEO
- Harry Clarkson-Bennett — 8 query classes breakdown
- Ahrefs Patrick Stox — top-10 page age study
- MWC’s open-source 8-class query classifier
Independent Google SEO expert and founder of ylsseo.com, interpreting ranking mechanisms through Google patents and the API Leak; a pioneer of Chinese-language SEO education.