A client called me last month in a mild panic. Their hosting bill had tripled. No traffic spike in Google Analytics, no viral post, no product launch. Just a slow, silent surge of server requests from bots they'd never heard of: Bytespider, OAI-SearchBot, ClaudeBot, PerplexityBot. When we pulled the server logs, AI crawlers were eating about 38% of their total bandwidth. And their robots.txt file? Untouched since 2019. It still had a Disallow: /wp-admin/ rule and not much else.

This situation is not unusual anymore. According to Cloudflare Radar data from Q1 2026, AI and LLM crawlers grew from 2.6% to 10.1% of all web traffic in under a year. GPTBot alone surged 305% year-over-year. We're past the point where this is a "watch and see" issue. If you haven't deliberately configured your robots.txt for the AI crawler era, you're flying blind and, in some cases, paying for the privilege.

  • 305%: GPTBot traffic growth, year-over-year (Q1 2026)
  • 52%: Share of all global web traffic now coming from bots
  • 40%: Max bandwidth consumed by a single deep AI crawl cycle

Your robots.txt Now Has Two Jobs, Not One

For 20-odd years, robots.txt had one main job: tell Googlebot what not to crawl. Keep the staging environments out of the index, block the internal search result pages, prevent the cart URLs from eating up crawl budget. Simple stuff.

Now it has a second, more complicated job: decide which AI systems get to read your content, for what purpose, and under what conditions. This is genuinely new territory. The old SEO playbook doesn't really cover it, and most "robots.txt guides" are still recycling 2020-era advice.

The core tension here is real. You want Googlebot to crawl everything relevant so you rank. You might want some AI search bots (like OAI-SearchBot, which OpenAI uses to surface sites in ChatGPT's search results) to access your content so you appear in AI search results. But you probably don't want training crawlers scraping your entire site to feed LLM training datasets, especially if you're a publisher, a SaaS company with docs, or anyone who creates original content.

💡 The Distinction That Matters
There are two fundamentally different types of AI crawlers: training crawlers (they scrape your content to train AI models) and retrieval/search crawlers (they index your content so AI search tools can cite it). Blocking both equally is usually a mistake.

The AI Crawlers You Actually Need to Know

Let me give you a practical rundown of the crawlers that actually matter right now. I'm skipping the long tail of sketchy bots; focus on these first.

The ones you generally want to allow:

  • OAI-SearchBot: OpenAI's search crawler. It indexes pages so ChatGPT's search feature can surface and cite them (page fetches triggered directly by a user's request come from a separate agent, ChatGPT-User). Blocking it means you won't appear in ChatGPT web search results.
  • PerplexityBot: Powers Perplexity.ai's search engine. Very active, respects robots.txt. Worth allowing if you want AI search citations.
  • Google-Extended: Not a separate crawler but a robots.txt token that tells Google whether content it has crawled may be used for Gemini training. Blocking it opts you out of that use. This one's nuanced: it doesn't affect your regular Google search rankings, so allowing or blocking it is purely a training decision.
  • Applebot-Extended: Similar deal for Apple Intelligence. Still early days, but the traffic is growing.

The ones you should think hard about:

  • GPTBot: OpenAI's training crawler (separate from OAI-SearchBot). It crawls your content for model training. You can block GPTBot while still allowing OAI-SearchBot if you want ChatGPT search citations without contributing to training data.
  • ClaudeBot: Anthropic's training crawler. Same deal as GPTBot. Aggressive, high-bandwidth. Many publishers block it entirely.
  • Bytespider: ByteDance's crawler (TikTok's parent company). No clear AI product benefit to allowing this one for most sites. One site reported saving $900/month just by blocking it.
  • CCBot: Common Crawl's bot. Its data is widely used for training models. Unless you actively want to contribute to open training datasets, there's no upside to allowing it.

⚠️ Important Caveat
Most of the legitimate crawlers listed here do respect robots.txt; log analyses from 2025-2026 support this. Sketchy scrapers don't. But blocking the legitimate ones via robots.txt is still worth doing because: (1) it signals your preference, (2) it reduces load from compliant bots, and (3) it creates a legal record of your access restrictions.
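The two lists above boil down to a lookup table. Here is a minimal Python sketch of that idea, assuming the purpose labels used in this article; `CRAWLER_PURPOSE` and `classify_user_agent` are hypothetical names, and the substring match is a simplification, since a User-Agent header can be spoofed.

```python
# Purpose labels follow this article's lists; adjust them to your own policy.
# Google-Extended and Applebot-Extended are left out on purpose: they are
# robots.txt control tokens and do not show up as User-Agent headers.
CRAWLER_PURPOSE = {
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Bytespider": "nuisance",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the purpose label for a raw User-Agent header, or 'unknown'."""
    ua = user_agent.lower()
    for token, purpose in CRAWLER_PURPOSE.items():
        if token.lower() in ua:
            return purpose
    return "unknown"


# Illustrative header only; real strings vary by bot version.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"))
```

Anything that comes back as "unknown" is a candidate for the "unknown bandwidth consumer" bucket discussed later.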

What a 2026-Ready robots.txt Actually Looks Like

Here's a practical template. I'm going to explain each decision rather than just dump a config file at you, because the right choices depend on your situation.

# Standard Googlebot: full access (except private areas)
User-agent: Googlebot
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?*sort=

# Bing/Microsoft: allow for search
User-agent: Bingbot
Allow: /

# AI Search crawlers: allow for citation/visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: YouBot
Allow: /

# AI Training crawlers: block (no search visibility benefit)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Google AI training signal (doesn't affect search rankings)
User-agent: Google-Extended
Disallow: /

# High-bandwidth nuisance crawlers
User-agent: Bytespider
Disallow: /

User-agent: DiffBot
Disallow: /

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml

This configuration threads the needle: you keep AI search visibility (ChatGPT search, Perplexity) while protecting your content from training scrapers and cutting unnecessary bandwidth.
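Before deploying a file like this, it is worth checking that it does what you think. The sketch below uses Python's standard `urllib.robotparser` on an abridged copy of the template; the `access_report` helper and the `/blog/some-post/` path are made up for illustration. One assumption to keep in mind: this parser does plain prefix matching, so the `/*?*sort=` wildcard rule is left out, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the template above (wildcard path rule omitted because
# urllib.robotparser only understands plain path prefixes).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin/
Disallow: /checkout/

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""


def access_report(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent token to whether it may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = access_report(
    ROBOTS_TXT,
    ["Googlebot", "OAI-SearchBot", "PerplexityBot", "GPTBot", "ClaudeBot", "Bytespider"],
    "https://yoursite.com/blog/some-post/",
)
for agent, allowed in report.items():
    print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

The search bots and Googlebot should come back as allowed for the blog URL, and the three blocked crawlers as blocked.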

Generate a 2026-Ready robots.txt in Seconds

Don't write it by hand and risk syntax errors. RankSorcery's Robots.txt Generator builds a clean, properly formatted file with AI crawler rules, crawl-delay settings, and sitemap declaration, ready to drop straight into your server root.


The 5 robots.txt Mistakes I See Most Often

After auditing a fair number of sites this year, these are the patterns that keep showing up.

1. Using a wildcard block for all bots

The User-agent: * + Disallow: / pattern that some developers use "to block everything except Googlebot" is the nuclear option. Done wrong, it blocks everything including OAI-SearchBot, PerplexityBot, and other AI search crawlers you actually want. Always define specific rules for crawlers you want, then use the wildcard block for everything else.
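To see the difference concretely, here is a small sketch using Python's standard `urllib.robotparser`. The two rule sets and the `SomeUnknownBot` name are invented for the comparison; the parser's behavior is a reasonable stand-in for a compliant crawler, not a guarantee of how every bot resolves groups.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The "nuclear option": only Googlebot gets its own group.
NAIVE = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Same wildcard block, but the AI search bots you want are named explicitly.
EXPLICIT = """\
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /
"""

for label, rules in (("naive", NAIVE), ("explicit", EXPLICIT)):
    for agent in ("Googlebot", "OAI-SearchBot", "SomeUnknownBot"):
        print(label, agent, can_fetch(rules, agent))
```

In the naive file OAI-SearchBot falls through to the wildcard group and is blocked; in the explicit file it matches its own group, while unknown bots still hit the wildcard.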

2. Blocking AI search bots because they "sound like" AI training bots

GPTBot = training. OAI-SearchBot = retrieval for ChatGPT search. These are different user agents with different purposes. I've seen site owners block OAI-SearchBot and then wonder why their site never appears in ChatGPT web search answers. Check the user agent string carefully.

3. No Crawl-delay directive for heavy crawlers you're allowing

Even bots you want to allow can be rude about how hard they hit your server. Adding Crawl-delay: 10 under PerplexityBot or OAI-SearchBot asks them to wait 10 seconds between requests. Support varies: Crawl-delay is a nonstandard extension and Googlebot ignores it entirely, so check each bot's documentation. Where it is honored, it can meaningfully reduce server load during deep crawl cycles.
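A quick way to confirm the directive is attached to the group you intended is to parse the file and read the value back. This sketch uses `crawl_delay()` from Python's standard `urllib.robotparser`; the two-group file is a made-up example, and a successful parse here says nothing about whether a given bot actually honors the delay.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: PerplexityBot
Allow: /
Crawl-delay: 10

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds, or None when the group sets none.
perplexity_delay = parser.crawl_delay("PerplexityBot")
searchbot_delay = parser.crawl_delay("OAI-SearchBot")
print(perplexity_delay, searchbot_delay)
```

A `None` for a bot you meant to throttle usually means the Crawl-delay line ended up under the wrong User-agent group.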

4. Missing or wrong Sitemap declaration

Your robots.txt should include a Sitemap: line pointing to your XML sitemap. It's not just for Googlebot: AI search crawlers use it too. If they can't find your sitemap from robots.txt, they rely on internal link discovery, which is slower and less complete.
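A missing or relative Sitemap line is easy to catch mechanically. The sketch below relies on `site_maps()` from Python's standard `urllib.robotparser` (Python 3.8+); the helper names are hypothetical, and the check that the URL is absolute reflects the sitemap protocol's requirement for a full URL.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str) -> list[str]:
    """Return every Sitemap URL declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when none are declared


def sitemap_problems(robots_txt: str) -> list[str]:
    """List human-readable problems with the Sitemap declaration, if any."""
    sitemaps = declared_sitemaps(robots_txt)
    if not sitemaps:
        return ["no Sitemap line found"]
    return [
        f"not an absolute URL: {url}"
        for url in sitemaps
        if not url.startswith(("http://", "https://"))
    ]


good = "User-agent: *\nAllow: /\n\nSitemap: https://yoursite.com/sitemap.xml\n"
bad = "User-agent: *\nAllow: /\n\nSitemap: /sitemap.xml\n"
print(sitemap_problems(good), sitemap_problems(bad))
```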

5. Blocking dynamic URL patterns that you actually need indexed

Old habit from the 2010s: block all URLs with query parameters to avoid duplicate content. But in 2026, many paginated product listings, filtered search results, and content URLs use query strings legitimately. Blanket Disallow: /*?* rules knock out pages that should be crawled. Review these patterns carefully.
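To review what a wildcard rule really catches, you can translate it into a regular expression and run your important URLs through it. This is a sketch of the commonly documented matching semantics (`*` matches any run of characters, a trailing `$` anchors the end, everything else is a prefix match); the function names and sample paths are invented, and individual crawlers may differ in edge cases.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern: str, url_path: str) -> bool:
    """True if a Disallow rule with this pattern would match the path."""
    return robots_pattern_to_regex(pattern).match(url_path) is not None


# Hypothetical paths: the blanket rule catches pagination you may want crawled,
# while the narrower sort rule from the template leaves it alone.
for path in ("/products", "/products?page=2", "/shoes?color=red&sort=price"):
    print(path, is_blocked("/*?*", path), is_blocked("/*?*sort=", path))
```

Running your top landing pages through a check like this before deploying is a cheap way to catch an over-broad rule.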

Should You Actually Block AI Training Crawlers?

This is where I'll give you an actual opinion rather than the usual "it depends on your goals" non-answer.

If you're a publisher, blogger, or content creator: yes, block training crawlers. The value exchange is too lopsided. You create content, they scrape it to train models that then answer questions without sending traffic back to you. Allowing training access and hoping for AI search visibility is a false tradeoff: you can get AI search citations through OAI-SearchBot and PerplexityBot without feeding GPTBot.

If you're a SaaS or tool company: this is more nuanced. Having your product docs and help content included in LLM training can increase brand recall when AI tools make recommendations. But weigh that against the bandwidth cost and the fact that AI might recommend you based on outdated documentation.

If you run an e-commerce site: block training crawlers without hesitation. Your product pages, pricing, and category structure being in training data doesn't help you. Your bandwidth being consumed does hurt you.

"The question isn't 'should I block AI crawlers?' It's 'which AI crawlers serve my business goals, and which ones just consume my server resources?'"

How to Check What's Actually Hitting Your Site Right Now

Before you change anything in your robots.txt, know what you're dealing with. Here's a quick way to audit your actual crawler traffic:

  • Pull your server access logs for the past 30 days (not GA; you need raw server logs)
  • Filter for known AI crawler user agent strings: GPTBot, ClaudeBot, OAI-SearchBot, PerplexityBot, Bytespider, CCBot, Amazonbot (Google-Extended is a robots.txt token only, so it won't appear as a user agent in your logs)
  • Count total requests and bytes transferred per bot
  • Identify which bots are consuming bandwidth vs. which are crawling useful pages
  • Check if any bots are ignoring your existing Disallow rules (signs: hitting /admin/, /checkout/, etc.)
  • Note which bots are crawling your highest-value pages (blog posts, product pages, docs)
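The counting steps above can be scripted in a few lines. This sketch assumes your server writes the common "combined" access log format; the `audit` function, the blocked path prefixes, and the three sample lines are all made up for illustration, so adapt the regex if your log layout differs.

```python
import re
from collections import defaultdict

AI_BOTS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot",
           "Bytespider", "CCBot", "Amazonbot"]

# Combined log format: ... "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def audit(lines, blocked_prefixes=("/admin/", "/checkout/")):
    """Count requests, bytes, and hits on disallowed paths per AI bot."""
    stats = defaultdict(lambda: {"requests": 0, "bytes": 0, "blocked_hits": 0})
    for line in lines:
        m = LOG_LINE.search(line)
        if not m:
            continue  # skip lines that are not in the expected format
        agent = m["agent"].lower()
        bot = next((b for b in AI_BOTS if b.lower() in agent), None)
        if bot is None:
            continue  # not one of the crawlers we are auditing
        entry = stats[bot]
        entry["requests"] += 1
        entry["bytes"] += 0 if m["bytes"] == "-" else int(m["bytes"])
        if m["path"].startswith(blocked_prefixes):
            entry["blocked_hits"] += 1  # bot requested a path you disallow
    return dict(stats)


# Fabricated sample lines, for demonstration only.
SAMPLE = [
    '203.0.113.7 - - [01/Mar/2026:10:00:00 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [01/Mar/2026:10:00:05 +0000] "GET /admin/login HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '198.51.100.4 - - [01/Mar/2026:10:01:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
result = audit(SAMPLE)
for bot, numbers in result.items():
    print(bot, numbers)
```

To run it on a real file, pass an open file handle instead of `SAMPLE`, for example `audit(open("access.log"))`, where the filename is whatever your server uses.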

On most shared and managed WordPress hosts, you won't have direct log access. Check with your hosting provider; many now offer basic bot traffic breakdowns in their dashboards. Cloudflare users have a significant advantage here: the dashboard's bot analytics cover your own site, and Cloudflare Radar is excellent for the wider trends.

📊 Quick Wins from the Data
In a recent case study, a mid-size SaaS company blocked Bytespider, CCBot, and GPTBot while keeping OAI-SearchBot and PerplexityBot allowed. Result: AI crawler bandwidth dropped 75%, from 45 GB/month to about 11 GB/month, saving $450/month in overage charges. Their ChatGPT and Perplexity citation rate didn't change.

What About llms.txt? Is That the New robots.txt?

You might have heard about the llms.txt proposal: a file placed at your site root that gives AI systems structured information about your content. Think of it as a sitemap specifically for LLMs.

As of mid-2026, this is still more of an emerging convention than a hard standard. Perplexity has said they look at it. Claude apparently uses it in some contexts. But it's not replacing robots.txt; it's supplementary. My recommendation: implement llms.txt if you have the time, but it's not a substitute for getting your robots.txt right. Robots.txt controls access. llms.txt guides understanding. Both matter, in that order.
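If you do decide to publish one, the file is simple enough to generate. This is a rough sketch, assuming the layout described in the public llms.txt proposal (a Markdown H1 title, a blockquote summary, then H2 sections listing links); `render_llms_txt`, the site name, and the URL are placeholders, and since the convention is still settling, check the current proposal before relying on this shape.

```python
def render_llms_txt(title: str, summary: str,
                    sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render llms.txt text: H1 title, blockquote summary, H2 sections of links."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content for illustration only.
text = render_llms_txt(
    "Example Site",
    "Placeholder one-line summary of what the site covers.",
    {"Docs": [("Quick start", "https://yoursite.com/docs/quick-start", "Setup in five minutes")]},
)
print(text)
```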

The Quick-Start Action Plan

If you've been putting this off, here's your 20-minute fix:

1. Check your current robots.txt

Visit yourdomain.com/robots.txt and look at what's actually there. Note what AI crawlers are currently allowed or blocked (probably nothing is specified; that's the common case).

2. Decide your AI crawler policy

Using the framework above: which crawlers help your visibility in AI search (allow), which ones are training scrapers with no upside for you (block), and which are unknown bandwidth consumers (block).

3. Generate and validate your new file

Write your rules, validate the syntax, and make sure you haven't accidentally blocked Googlebot or your preferred AI search bots. Use RankSorcery's Robots.txt Generator to build it cleanly without syntax headaches.

4. Deploy and monitor

Upload the new robots.txt to your server root. Open the robots.txt report in Google Search Console (it replaced the older robots.txt tester) to confirm Google has fetched the new file without errors. Monitor server logs over the next week to see if the bandwidth from blocked crawlers drops as expected.
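For the monitoring step, a before/after comparison per bot is usually enough. The sketch below is one way to compute it; `bandwidth_change` is a hypothetical helper and the GB figures are invented placeholders, so substitute the per-bot totals from your own logs.

```python
def bandwidth_change(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Percent change in transfer per bot between two periods (negative means a drop)."""
    changes = {}
    for bot, old in before.items():
        if old == 0:
            continue  # no baseline, so a percent change is undefined
        new = after.get(bot, 0)
        changes[bot] = round((new - old) / old * 100, 1)
    return changes


# Invented numbers (GB per week) for illustration only.
before = {"GPTBot": 20.0, "Bytespider": 15.0, "OAI-SearchBot": 4.0}
after = {"GPTBot": 1.0, "Bytespider": 0.0, "OAI-SearchBot": 4.0}
changes = bandwidth_change(before, after)
for bot, pct in changes.items():
    print(f"{bot:15} {pct:+.1f}%")
```

Blocked bots should trend toward -100%, while the search bots you kept should stay roughly flat. A blocked bot that does not drop is ignoring robots.txt and needs a server-level or firewall rule instead.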

🤖 One More Thing: AI Search Visibility
Getting cited in ChatGPT, Perplexity, and Gemini search results is increasingly valuable as traditional organic click-through rates drop. Allowing OAI-SearchBot and PerplexityBot is just table stakes. For deeper AI search visibility work (tracking where and how often AI tools mention your site), check out RankSorcery's AI Search Visibility tool.

The robots.txt file used to be a footnote in a technical SEO audit. In 2026, it's a meaningful business decision. Take 20 minutes and sort it out. Your server bill (and your AI search visibility) will thank you.

James Reyes, RankSorcery

James has been doing SEO for longer than he'd like to admit. He runs RankSorcery and writes about the parts of search that don't make it into the standard playbooks. He's been wrong about a few predictions. He's been embarrassingly right about others.