The era of indiscriminate web crawling is coming to an end. As infrastructure costs mount and legal frameworks become more stringent, both search engines and AI companies are shifting from “crawl everything” to “crawl selectively.” This transformation, catalyzed by initiatives like Cloudflare’s Pay Per Crawl and growing resource constraints, demands that content creators optimize for crawl-worthiness or risk digital invisibility.
TL;DR:
The economics of web crawling have changed. Search engines and AI platforms are under increasing resource pressure, forcing them to prioritize what they crawl, index, and present. This article introduces a framework for Selective Content Ingestion—helping brands rethink content structure, value, and accessibility from the search engine’s perspective.
Key takeaways:
- Crawling is no longer guaranteed—every request must justify its cost.
- Value, structure, and change signals now determine crawl-worthiness.
- XML sitemaps, canonical signals, and crawl hints are strategic levers, not checkboxes.
- Brands must optimize not just for users, but for machine efficiency and AI retrievability.
If your content isn’t deemed worth crawling, it won’t exist in the new search economy.
I. The Paradigm Shift: From Open Access to Selective Ingestion
The Old Model: “Crawl Everything”
- Universal access assumption: Web content was considered freely crawlable
- Minimal cost considerations: Server capacity was the primary limiting factor
- Honor system: robots.txt compliance was voluntary and often ignored
- Unlimited appetite: More content was generally considered better
The New Reality: Economic Accountability
- Pay-to-crawl models: Cloudflare’s Pay Per Crawl marketplace allows publishers to charge AI companies for each crawl request
- Enforcement at scale: Over one million Cloudflare customers have opted to block AI crawlers by default
- Permission-based access: New domains on Cloudflare default to blocking AI crawlers, requiring explicit permission
- Resource optimization: Search engines consider financial cost calculations when determining crawl frequency
Has Google Already Implemented Selective Indexing?
The theoretical has become a reality. Recent analysis by Adam Gent in his amazing Indexing Insight Newsletter reveals that Google has already begun aggressive, selective indexing based on engagement metrics and content value.
The May 2025 “Indexing Purge”:
In his research, Adam refers to the May update as the “Indexing Purge” and documents some alarming trends:
- Since May 26th, 2025, over 25% of monitored URLs have changed from ‘submitted and indexed’ to ‘crawled – previously indexed’
- Google broke its traditional 130-day indexing rule and began actively deindexing pages it had recrawled in just 90-130 days
- Across websites of all sizes, 15%-75% of indexed pages were actively removed
The selection criterion is clear: pages are being actively removed for lack of user engagement. Google purged “poor-performing” pages with zero or low SEO performance (clicks and impressions) from its index.
No Technical Issues – This is Intentional: Google’s John Mueller confirmed there were no “technical issues” and emphasized the link between capacity and demand in indexing, stating, “Our systems make adjustments in what’s crawled & indexed regularly.”
This isn’t a prediction for the future—it’s happening now. Google has moved from passive crawl budget management to active content curation based on engagement and value metrics.
II. The Crawl Queue Lottery: Being Found ≠ Being Indexed
The Fundamental Misunderstanding
Most content creators operate under a dangerous misconception: that being crawled equals being indexed. This is like believing that buying a lottery ticket guarantees a win—you need the ticket to play, but having the ticket doesn’t mean you will win.
The Two-Stage Process
Stage 1: Getting Your Ticket (Crawl Queue Entry)
- URLs must first be discovered through sitemaps, links, or direct submission
- Being added to the crawl queue is just the entry ticket, not the prize
- Like lottery tickets, millions of URLs enter the queue, but only a fraction “win” indexing
Stage 2: Having the Winning Numbers (Meeting Index Criteria)
- Once crawled, content must pass Google’s quality and engagement thresholds
- The May 2025 purge revealed that these thresholds are now strictly enforced
- Pages with zero engagement are actively removed, even after being crawled multiple times
John Mueller’s Queue Reality Check
When Google’s John Mueller stated that “Our systems make adjustments in what’s crawled & indexed regularly,” he was describing this two-stage lottery system. Being crawled is just buying the ticket—indexing requires having the winning combination of:
- User engagement signals
- Content quality markers
- Technical optimization
- Topical authority
- Commercial viability
The Harsh Truth: In the May 2025 purge, websites had 15-75% of their “winning tickets” (indexed pages) revoked when Google recalculated the winning numbers. These pages were crawled, evaluated, and found lacking in engagement value.
Why the Lottery Analogy Matters
This system explains why:
- Crawl budget optimization is critical: More tickets = more chances to win
- Content quality is non-negotiable: You need the correct numbers to win big and small prizes
- Engagement drives everything: The “winning numbers” are user interaction metrics
- Being crawled means nothing: Without meeting index criteria, crawling is just an expensive audit
1. Infrastructure Costs and Sustainability
Rising Computational Demands
- Crawler traffic rose 18% from May 2024 to May 2025, with GPTBot growing 305% and Googlebot 96% – Source: Cloudflare
- Google has a long-term strategy to improve sustainability and reduce carbon emissions, making resource optimization critical – Source: Google
- More tickets (URLs) in the system require more processing power to evaluate winners.
Economic Reality Check
- OpenAI’s crawler scraped websites 1,700 times for every one referral, while Anthropic scraped 73,000 times per referral. Source: Cloudflare June Network Data
- The traditional value exchange (crawling for referrals) no longer justifies the resource expenditure
- The cost of processing millions of tickets must be balanced against the value of the prizes (indexed content)
2. Legal and Ethical Pressures
Copyright Protection
- Publishers are asserting greater control over their intellectual property
- Legal challenges to unauthorized scraping are increasing
Revenue Disruption
- Several large publishers, including Conde Nast, TIME, The Associated Press, The Atlantic, ADWEEK, and Fortune, have signed on with Cloudflare to block AI crawlers by default
- AI-generated responses reduce click-through rates to original sources
3. Technical Sophistication and Proven Results
AI-Powered Selection Already Working
- Google’s May 2025 indexing purge proved that engagement-based selection works: websites that lost 75% of indexed pages saw zero negative impact on clicks and impressions
- AI-powered crawling algorithms use machine learning to dynamically allocate crawl budget by determining the value of crawling each page
- Natural Language Processing (NLP) and semantic analysis allow AI-powered crawlers to understand content’s meaning
Performance-Based Deindexing
- Google now actively removes pages with zero or low engagement within 90 days instead of waiting 130 days
- Quality thresholds have been updated to prioritize content that drives meaningful engagement (clicks, queries, impressions)
III. The Decision Framework: Nine Factors for Crawl-Worthiness
I have tried to build out a weighting framework that approximates how AI companies and search engines likely determine content worthiness:
Primary Filters (Must-Pass)
- Content Availability & Access Rights
- Do we already have equivalent content?
- Is it legally accessible or behind paywalls/permission barriers?
- Does the site comply with crawling protocols?
- Technical Crawlability
- Is the site technically accessible and stable?
- Fast websites load quickly, giving users a nice experience and may be crawled more often
- Are there excessive redirects or errors consuming crawl budget?
Value Assessment Matrix
3. Authority and Trust Signals
  - Backlinks from popular sites indicate to Google that your site is trustworthy; trustworthy sites may get crawled sooner
  - Domain age and historical reliability
  - Editorial standards and fact-checking protocols
4. Content Freshness and Update Frequency
  - Sites frequently updated with new content are prioritized; frequent updates signal relevance and dynamism
  - Time-sensitive information value
  - Historical update patterns
5. User Demand and Engagement
  - Proven critical: Google’s May 2025 purge removed pages with zero or low engagement (clicks, impressions, queries)
  - Search volume and query patterns
  - User engagement metrics like time on site, bounce rate, and click-through rate are now decisive factors
  - Commercial query potential
Strategic Considerations
6. Competitive Intelligence
  - Unique information not available elsewhere
  - Strategic importance for training data
  - Coverage gaps in existing knowledge base
7. Commercial Viability
  - Revenue potential from indexed content
  - Advertiser interest in related queries
  - E-commerce and transactional value
8. Semantic and Topical Value
  - Knowledge graph optimization and entity recognition are crucial for content categorization
  - Topic authority and expertise demonstration
  - Diversity of perspectives on important subjects
9. Technical Implementation Quality
  - Schema markup and structured data (see the sketch after this framework)
  - Mobile optimization and accessibility
  - Page speed and user experience metrics
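To make factor 9 concrete, here is a minimal, hypothetical sketch of generating schema markup in Python. The property names come from the public schema.org Article vocabulary; the page values are placeholders, not a prescription for any particular CMS.

```python
import json

def article_jsonld(headline: str, url: str, author: str, date_published: str) -> str:
    """Build a minimal schema.org Article block as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
    }
    # Embed the returned string in a <script type="application/ld+json"> tag.
    return json.dumps(data, indent=2)

# Placeholder values for illustration only.
print(article_jsonld(
    headline="Selective Content Ingestion",
    url="https://example.com/selective-ingestion",
    author="Example Author",
    date_published="2025-07-01",
))
```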
IV. Evidence from Patents and Technical Documentation
Google’s Crawling Priorities
Google’s foundational PageRank patent describes calculating document importance based on the number of citing documents and a random jump probability, establishing the principle that not all content deserves equal attention.
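For reference, the commonly cited form of that calculation (from Page and Brin’s original write-up, which the patent formalizes) is:

$$PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$$

where T_1 through T_n are the pages citing A, C(T_i) is the number of outbound links on T_i, and d is the damping factor whose complement (1 - d) is the random jump probability.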
Google considers key factors, including website freshness, mobile optimization, backlink quality, site speed, and user engagement, when deciding crawl frequency.
AI Training Data Selection
Various Google and AI patents reveal sophisticated algorithms for:
- Quality Assessment: N-grams are used to evaluate copy quality by comparing against known quality benchmarks (a toy sketch of the general idea follows this list)
- Content Clustering: Content from expert clusters is given priority, while content outside clusters may be skipped entirely
- Behavioral Analysis: Click-through rates and user behavior metrics influence content prioritization
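As a purely illustrative toy (not Google’s actual system), the sketch below shows one crude way n-gram comparison against a reference text can work: the share of a candidate page’s trigrams that also appear in a trusted benchmark text serves as a rough fluency or duplication signal. All texts and thresholds here are invented.

```python
from collections import Counter

def ngrams(text: str, n: int = 3) -> Counter:
    """Count word n-grams in a lowercased text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def overlap_ratio(candidate: str, benchmark: str, n: int = 3) -> float:
    """Share of the candidate's n-grams that also appear in the benchmark text."""
    cand, bench = ngrams(candidate, n), ngrams(benchmark, n)
    if not cand:
        return 0.0
    shared = sum(count for gram, count in cand.items() if gram in bench)
    return shared / sum(cand.values())

# Invented example: a candidate page compared against a benchmark passage.
score = overlap_ratio(
    "crawl budget is a finite resource that every site competes for",
    "every site competes for a finite crawl budget and must earn each visit",
)
print(f"trigram overlap: {score:.2f}")
```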
V. The Signals AI Companies and Search Engines Monitor
Technical Signals
- HTTP Response Codes: Pages serving 4xx HTTP status codes don’t waste crawl budget
- Load Time: per Google, a server that responds to requests quickly may allow more pages to be crawled (a spot-check sketch follows this list)
- Mobile Responsiveness: Websites optimized for mobile devices are crawled more frequently by Google
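A minimal sketch, using only Python’s standard library, of how you might spot-check the first two signals yourself. The URLs are placeholders; substitute pages from your own sitemap.

```python
import time
import urllib.error
import urllib.request

def check_url(url: str, timeout: float = 10.0) -> tuple[int, float]:
    """Return (HTTP status code, response time in seconds) for a URL."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code          # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, TimeoutError):
        status = 0                 # unreachable or timed out
    return status, time.monotonic() - start

# Placeholder URLs for illustration.
for url in ["https://example.com/", "https://example.com/old-page"]:
    status, elapsed = check_url(url)
    print(f"{url}  status={status}  time={elapsed:.2f}s")
```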
Content Quality Indicators
- Update Frequency: Recent changes signal relevance
- Link Profile: Both internal and external link patterns
- User Engagement: Dwell time, bounce rate, return visits
- Social Signals: Shares, mentions, and discussion volume
Economic Indicators
- Commercial Value: Revenue potential and advertiser interest
- Competitive Intelligence: Unique insights or data
- Legal Status: Clear rights and permissions
VI. The Critical Reality: Every Barrier Removed = Potential for New Pages
The Massive Crawl Waste Problem
Many SEOs believe crawl budget optimization is “Google’s job,” but this perspective fundamentally misunderstands the economics of selective crawling. Every barrier you remove creates the potential for a new page to be discovered and indexed. I wrote a rant a few years ago, “Why SEOs Don’t Respect Crawling and Indexing,” which goes into detail about the barriers sites create for getting indexed.
The scale of waste is staggering and has been building for years:
- Cloudflare’s analysis: 53% of good bot traffic is wasted on pages that haven’t changed since the last visit (Cloudflare 2022)
- Historical waste patterns: Analysis shows 98% of crawl effort can be wasted when sites have poor organization
- Google’s own data: 30% of discovered URLs are flagged as spam or junk
- Sitemap failures: Most sitemaps include 404s, noindex pages, canonicalized duplicates, or forbidden URLs
Real-World Evidence of Catastrophic Waste
Case Study from Bill Hunt’s Analysis: Source: Bill Hunt
- 158,000 URLs discovered by crawlers
- 148,000 excluded from indexing
- Only 7,450 actually indexed
- 98% of crawl effort completely wasted
“If a website wastes a crawler’s time 98% of the time, why should that crawler come back?”
Enterprise-Level Dysfunction: An audit of a multinational company whose agency had reported “no crawling or indexing issues” found:
- Only 6.5% of pages found in a crawl were present in sitemaps
- 76% of sitemap pages were not crawlable
- 34% of business-critical pages were missing from sitemaps entirely
- Simple case-sensitivity fixes resulted in a 2,000% increase in visits
Why Removing Barriers Matters More Than Ever
The Math is Simple (a back-of-the-envelope sketch follows this list):
- If 53% of the crawl budget is wasted on unchanged pages
- And 30% of discovered URLs are spam/junk
- And 98% of crawl effort can be wasted on poorly organized sites
- Then, removing inefficiencies isn’t just optimization—it’s survival
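A rough illustration of how that waste compounds. The figures come from different studies and sites, so treating them as independent factors is an assumption made purely for illustration.

```python
# Back-of-the-envelope only: these ratios come from separate studies,
# so multiplying them overstates precision.
crawl_requests = 100_000                 # hypothetical monthly crawl allocation
unchanged_share = 0.53                   # Cloudflare: crawls that find no change
junk_share = 0.30                        # Google: discovered URLs flagged spam/junk

useful = crawl_requests * (1 - unchanged_share) * (1 - junk_share)
print(f"Potentially useful crawls: {useful:,.0f} of {crawl_requests:,}")
# -> roughly 32,900, i.e. about a third of the allocation doing productive work
```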
The Infrastructure Reality: Google’s data centers consume 15.5 terawatt-hours of electricity annually, about the same as all of San Francisco, so Google has a strong incentive to reduce those costs. Basic logic dictates that every wasted crawl request is a resource that could have been used more effectively elsewhere.
Common Crawl Budget Killers:
- Duplicate content drains crawl budget as Google doesn’t want to waste resources indexing multiple pages with the same content
- Dynamic URLs with parameters create hundreds of duplicate pages that waste crawl budget on identical content
- 404 errors and broken links force crawlers to spend time on redundant, dead pages
- Faceted navigation and session identifiers typically provide duplicate content that wastes crawling resources
Cloudflare’s Crawler Hints: The Solution Framework
Recognizing this waste, Cloudflare created Crawler Hints and Bing launched IndexNow to tell search engines when content has actually changed, rather than having them repeatedly crawl unchanged pages. This innovation points to the future: intelligent, signal-driven crawling that eliminates waste.
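An IndexNow notification is just an HTTP request. Below is a minimal sketch using Python’s standard library; the host, key, and URL list are placeholders, and the IndexNow protocol requires you to host the matching key file at the stated key location before pinging.

```python
import json
import urllib.request

def ping_indexnow(host: str, key: str, urls: list[str]) -> int:
    """Notify IndexNow-enabled search engines that these URLs have changed."""
    payload = {
        "host": host,
        "key": key,                                   # key you generated
        "keyLocation": f"https://{host}/{key}.txt",   # key file hosted on your site
        "urlList": urls,
    }
    req = urllib.request.Request(
        "https://api.indexnow.org/indexnow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status            # 200 or 202 means the ping was accepted

# Placeholder values; replace with your own domain, key, and changed URLs.
print(ping_indexnow("www.example.com", "your-indexnow-key",
                    ["https://www.example.com/updated-article"]))
```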
VII. Strategic Recommendations: Making the Cut
For Publishers and Content Creators
1. Eliminate Crawl Waste First (Highest Impact)
The numbers don’t lie: 98% of crawl effort can be wasted on poorly organized sites. This isn’t theoretical—it’s documented reality from enterprise audits.
Priority Actions:
- Audit for duplicate content: Ensure every page on your site consists of unique, high-quality content
- Fix broken links immediately: Every 404 error is wasted crawl budget that could have gone to a new page
- Block parameter URLs: Use robots.txt to block internal search results, facets, and query parameters that create duplicate content
- Implement canonical tags: Prevent crawlers from wasting budget on duplicate content variations
- Sitemap hygiene: Only include indexable URLs that return a 200 response (a long-standing request to CMS providers!); a quick audit sketch follows this list
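As promised above, a minimal sitemap-hygiene sketch using only the standard library. It flags sitemap URLs that do not answer 200; the sitemap URL is a placeholder, and checking for noindex tags or non-canonical URLs would additionally require parsing each page’s HTML.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url: str) -> list[str]:
    """Extract <loc> entries from an XML sitemap."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

def non_200_urls(urls: list[str]) -> list[tuple[str, int]]:
    """Return (url, status) for every URL that does not answer 200."""
    bad = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                status = resp.status
        except urllib.error.HTTPError as err:
            status = err.code
        except urllib.error.URLError:
            status = 0
        if status != 200:
            bad.append((url, status))
    return bad

# Placeholder sitemap; swap in your own.
print(non_200_urls(sitemap_urls("https://www.example.com/sitemap.xml")))
```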
The Reality Check: If your average Search Console report shows 3 to 5 bad URLs for every good one, you don’t have a crawl issue—you have a quality and alignment issue. Search engines provide the data, hoping you’ll take action, but most marketers view it as a weather report instead of a system health dashboard.
2. Optimize for Authority
- Build high-quality backlink profiles
- Establish topical expertise in specific domains
- Maintain consistent publishing schedules
3. Technical Excellence
- Improve site speed—a faster loading website means Google can crawl more URLs in the same amount of time
- Implement proper schema markup
- Optimize mobile experience
- Prevent large but unimportant resources from being loaded by Googlebot using robots.txt
4. Content Strategy
- Focus on unique, valuable information
- Update content regularly to signal freshness
- Create comprehensive topic clusters
- Address search intent directly
5. Legal and Economic Positioning
- Consider participation in paid crawling marketplaces
- Establish clear licensing terms
- Document content ownership and rights
For SEO Professionals
1. Crawl Budget Optimization is YOUR Job, Not Google’s
The biggest misconception in SEO is that crawl budget optimization is Google’s responsibility. It’s not. Every inefficiency on your site directly reduces the opportunity for new content discovery.
The Historical Evidence: Years of audits reveal the same patterns seen in the enterprise example above: only 6.5% of pages identified in crawls were present in sitemaps, 76% of sitemap pages were not crawlable, and 34% of business-critical pages were missing from sitemaps entirely.
Priority Actions:
- Remove crawl waste before adding new content: If 53% of crawls find no changes and 98% of the effort can be wasted, you’re essentially throwing away your crawl opportunities
- Monitor crawl stats in Google Search Console to understand crawling patterns and identify waste (a raw-log version of the same idea is sketched after this list)
- Implement Cloudflare Crawler Hints or similar technologies to signal when content actually changes.
- Infrastructure awareness: Know your app servers, their loads, thread limits, and failover behavior—backend failures can kill crawling for days
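As referenced above, one way to quantify waste outside Search Console is to parse your raw access logs. This is a minimal sketch under simple assumptions: a combined-format log at a hypothetical path, user-agent matching by substring, and “waste” approximated as non-200 responses plus parameterized URLs.

```python
import re
from collections import Counter

# Combined log format: ip - - [date] "METHOD /path HTTP/1.1" status size "referrer" "user-agent"
LOG_LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

def crawl_waste_report(log_path: str) -> Counter:
    """Tally Googlebot hits by rough waste category from an access log."""
    tally = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LOG_LINE.search(line)
            if not match or "Googlebot" not in match["ua"]:
                continue
            tally["googlebot_hits"] += 1
            if match["status"] != "200":
                tally["non_200"] += 1           # errors, redirects, gone pages
            if "?" in match["path"]:
                tally["parameter_urls"] += 1    # likely faceted/duplicate URLs
    return tally

# Hypothetical log path; point it at your own server's access log.
print(crawl_waste_report("/var/log/nginx/access.log"))
```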
The Truth About Scale: “Unless you have a million pages, you don’t have to worry about crawl budget” is flat wrong. Every site has a crawl ceiling based on Google’s decision of whether crawling your site is worth the electricity and infrastructure cost.
2. Priority Page Identification
- Map high-value pages for preferential treatment
- Ensure important pages are easily discoverable through internal linking
- Use XML sitemaps strategically—include only canonical URLs and keep them updated
3. Quality Over Quantity Philosophy
- Google only wants to crawl high-quality content, so making low-quality pages faster won’t encourage more crawling
- Focus resources on pages that drive business value
- Consider removing or consolidating low-value content that wastes crawl budget
- Every barrier removed = potential for a new page to be indexed
The Bottom Line from Years of Evidence: Search engines owe you nothing. They are not obligated to crawl your 4 million thin variations or to guess which URL you want indexed. If you submit garbage, they’ll skip you. If your systems are broken, they’ll quietly stop visiting.
The Future is Resource-Aware SEO: With Google, Bing, and AI engines shifting to intent-driven and resource-aware systems, the traditional “spray-and-pray” indexing model is obsolete. Tools like Cloudflare’s Crawl Hints and Bing’s IndexNow aren’t optional—they’re previews of where search is going. You want to be indexed? Prove it’s worth the cost.
VIII. Future Implications and Predictions
The Emerging Ecosystem
- Tiered Web Access: Premium content behind paywalls or permission gates
- Quality Premiums: High-authority sources commanding higher crawl rates
- Engagement-Based Survival: The May 2025 purge proved that content without user engagement will be actively removed
- Efficiency Focus: Resource management becomes critical as computational costs drive selective crawling
Competitive Advantages
Sites that optimize for selective crawling will:
- Gain preferential indexing and faster updates
- Maintain visibility as quality thresholds tighten
- Capture more AI training opportunities
- Generate revenue from content licensing
- Survive the ongoing engagement-based culling
The Risk of Digital Extinction
Sites that fail to optimize face:
- Active deindexing: Google is now removing underperforming content within 90 days
- Reduced crawl frequency and delayed indexing
- Exclusion from AI training datasets
- Lost revenue opportunities
- Complete removal from search results despite having content
IX. Conclusion: The New Digital Darwinism is Here
The shift from indiscriminate to selective crawling isn’t a future prediction; it’s the current reality. As Indexing Insight reported, Google’s May 2025 indexing purge, which removed 15-75% of pages from websites while maintaining their search performance, suggests that the majority of web content provides little to no value to users. If your site has any of the issues highlighted here, set up active monitoring of your indexing frequency and levels.
As infrastructure providers like Cloudflare introduce paid or permissioned crawling models, and search engines actively cull underperforming content, we’ve entered an era where survival requires demonstrable user value. The organizations that understand this shift—optimizing for engagement, removing crawl waste, and proving content worth—will thrive in an increasingly selective digital ecosystem.
The question is no longer whether your content can be found, but whether it deserves to be found. In this new AI-driven content economy, making the cut isn’t just about SEO; it’s about digital survival in a world where algorithms actively remove content that fails to engage users.
The future isn’t coming—it’s here. And it’s already deciding which content lives or dies.