A client called me last month in a mild panic. Their hosting bill had tripled. No traffic spike in Google Analytics, no viral post, no product launch. Just a slow, silent surge of server requests from bots they'd never heard of: Bytespider, OAI-SearchBot, ClaudeBot, PerplexityBot. When we pulled the server logs, AI crawlers were eating about 38% of their total bandwidth. And their robots.txt file? Untouched since 2019. It still had a Disallow: /wp-admin/ rule and not much else.
This situation is not unusual anymore. According to Cloudflare Radar data from Q1 2026, AI and LLM crawlers grew from 2.6% to 10.1% of all web traffic in under a year. GPTBot alone surged 305% year-over-year. We're past the point where this is a "watch and see" issue. If you haven't deliberately configured your robots.txt for the AI crawler era, you're flying blind and, in some cases, paying for the privilege.
Your robots.txt Now Has Two Jobs, Not One
For 20-odd years, robots.txt had one main job: tell Googlebot what not to crawl. Keep the staging environments out of the index, block the internal search result pages, prevent the cart URLs from eating up crawl budget. Simple stuff.
Now it has a second, more complicated job: decide which AI systems get to read your content, for what purpose, and under what conditions. This is genuinely new territory. The old SEO playbook doesn't really cover it, and most "robots.txt guides" are still recycling 2020-era advice.
The core tension here is real. You want Googlebot to crawl everything relevant so you rank. You might want some AI search bots (like OAI-SearchBot, which feeds ChatGPT's search results) to access your content so you appear in AI search results. But you probably don't want training crawlers scraping your entire site to feed LLM training datasets, especially if you're a publisher, a SaaS company with docs, or anyone who creates original content.
The AI Crawlers You Actually Need to Know
Let me give you a practical rundown of the crawlers that actually matter right now. I'm skipping the long tail of sketchy bots; focus on these first.
The ones you generally want to allow:
- OAI-SearchBot: OpenAI's search crawler. It indexes pages so ChatGPT search can surface and cite them. Blocking it means you won't appear in ChatGPT web search results.
- PerplexityBot: Powers Perplexity.ai's search engine. Very active, respects robots.txt. Worth allowing if you want AI search citations.
- Google-Extended: Google's control token for AI training, separate from regular search. It is not a crawler with its own user agent; Googlebot does the fetching, and Google-Extended only tells Google whether that content may be used for Gemini. Blocking it keeps your content out of Gemini's training and doesn't affect your regular Google search rankings, so this one is a judgment call rather than an automatic allow.
- Applebot-Extended: Similar deal for Apple Intelligence. Like Google-Extended, it is an opt-out token rather than a separate crawler; Applebot does the crawling. Still early days.
The ones you should think hard about:
- GPTBot: OpenAI's training crawler (separate from OAI-SearchBot). Crawling your content for training. You can block GPTBot while still allowing OAI-SearchBot if you want ChatGPT search citations without contributing to training data.
- ClaudeBot: Anthropic's training crawler. Same deal as GPTBot. Aggressive, high-bandwidth. Many publishers block it entirely.
- Bytespider: ByteDance's crawler. TikTok's parent company. No clear AI product benefit to allowing this one for most sites. One site reported saving $900/month just by blocking it.
- CCBot: Common Crawl bot. Data is widely used for training models. Unless you actively want to contribute to open training datasets, there's no upside to allowing it.
What a 2026-Ready robots.txt Actually Looks Like
Here's a practical template. I'm going to explain each decision rather than just dump a config file at you, because the right choices depend on your situation.
# Standard Googlebot: full access (except private areas)
User-agent: Googlebot
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/
Disallow: /*?*sort=

# Bing/Microsoft: allow for search
User-agent: Bingbot
Allow: /

# AI Search crawlers: allow for citation/visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: YouBot
Allow: /

# AI Training crawlers: block (no search visibility benefit)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Google AI training signal (doesn't affect search rankings)
User-agent: Google-Extended
Disallow: /

# High-bandwidth nuisance crawlers
User-agent: Bytespider
Disallow: /

User-agent: DiffBot
Disallow: /

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml
This configuration threads the needle: you keep AI search visibility (ChatGPT search, Perplexity) while protecting your content from training scrapers and cutting unnecessary bandwidth.
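One way to sanity-check a file like this before deploying is Python's standard-library urllib.robotparser. The sketch below parses a trimmed copy of the template and asks what each crawler may fetch. The example.com URLs are placeholders, and the wildcard rule (/*?*sort=) is left out on purpose: the standard-library parser does plain prefix matching and does not interpret wildcards.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the template (wildcard rule omitted, see the note above).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin/
Disallow: /checkout/

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def access_report(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["Googlebot", "OAI-SearchBot", "PerplexityBot", "GPTBot", "ClaudeBot", "Bytespider"]
report = access_report(ROBOTS_TXT, AGENTS, "https://example.com/blog/some-post")

for agent, allowed in report.items():
    print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

If a crawler you meant to keep shows up as blocked here, fix the file before it goes live rather than after your citations disappear.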
Generate a 2026-Ready robots.txt in Seconds
Don't write it by hand and risk syntax errors. RankSorcery's Robots.txt Generator builds a clean, properly formatted file with AI crawler rules, crawl-delay settings, and sitemap declaration, ready to drop straight into your server root.
The 5 robots.txt Mistakes I See Most Often
After auditing a fair number of sites this year, these are the patterns that keep showing up.
Using a wildcard block for all bots
The User-agent: * + Disallow: / pattern that some developers use "to block everything except Googlebot" is the nuclear option. Done wrong, it blocks everything including OAI-SearchBot, PerplexityBot, and other AI search crawlers you actually want. Always define specific rules for crawlers you want, then use the wildcard block for everything else.
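As a sketch of that ordering, the snippet below pairs explicit groups for the crawlers you want with a catch-all block, then checks the result with Python's urllib.robotparser. A compliant crawler follows the group that names it and only falls back to the * group when none does. SomeRandomBot and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOWLIST_STYLE = """\
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Everything not named above
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ALLOWLIST_STYLE.splitlines())


def allowed(agent, url="https://example.com/pricing"):
    """True if the named agent may fetch url under ALLOWLIST_STYLE."""
    return parser.can_fetch(agent, url)


for agent in ("Googlebot", "OAI-SearchBot", "SomeRandomBot"):
    print(agent, allowed(agent))
```

Drop the three named groups and every one of those checks comes back False, which is exactly the mistake described above.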
Blocking AI search bots because they "sound like" AI training bots
GPTBot = training. OAI-SearchBot = retrieval for ChatGPT search. These are different user agents with different purposes. I've seen site owners block OAI-SearchBot and then wonder why their site never appears in ChatGPT web search answers. Check the user agent string carefully.
No Crawl-delay directive for heavy crawlers you're allowing
Even bots you want to allow can be rude about how hard they hit your server. Adding Crawl-delay: 10 under PerplexityBot or OAI-SearchBot asks them to wait 10 seconds between requests. Crawl-delay is not part of the robots.txt standard and support varies: Bingbot honors it, Googlebot ignores it, and AI crawlers are inconsistent. Treat it as a request, then confirm in your logs whether it actually reduced server load during deep crawl cycles.
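A minimal sketch of what that looks like, with the delay read back through Python's urllib.robotparser. The parser only reports the value in the file; whether a given bot honors it is up to the bot.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: PerplexityBot
Allow: /
Crawl-delay: 10

User-agent: OAI-SearchBot
Allow: /
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

perplexity_delay = parser.crawl_delay("PerplexityBot")  # seconds, as an int
googlebot_delay = parser.crawl_delay("Googlebot")       # None: no group covers it

print(perplexity_delay, googlebot_delay)
```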
Missing or wrong Sitemap declaration
Your robots.txt should include a Sitemap: line pointing to your XML sitemap. Not just for Googlebot: AI search crawlers use it too. If they can't find your sitemap from robots.txt, they rely on internal link discovery, which is slower and less complete.
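A quick way to confirm the declaration is readable is to parse the file and ask for its sitemaps. This sketch uses urllib.robotparser (site_maps() needs Python 3.8 or later); the example.com URL is a placeholder. The Sitemap: value should be a full absolute URL, not a relative path.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
""".splitlines())

# A list of declared sitemap URLs, or None when the file has no Sitemap: line.
sitemaps = parser.site_maps()
print(sitemaps)
```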
Blocking dynamic URL patterns that you actually need indexed
Old habit from the 2010s: block all URLs with query parameters to avoid duplicate content. But in 2026, many paginated product listings, filtered search results, and content URLs use query strings legitimately. Blanket Disallow: /*?* rules knock out pages that should be crawled. Review these patterns carefully.
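To see how much a blanket rule catches, here is a small sketch of wildcard matching as the major search engines implement it (* matches any run of characters, a trailing $ anchors the end of the URL). The function name and sample paths are illustrative, and this is a simplified matcher, not a full robots.txt parser.

```python
import re


def pattern_matches(pattern, path):
    """True if a robots.txt path pattern matches the given URL path + query."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces, then let each * match any sequence of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    regex = "^" + body + ("$" if anchored else "")
    return re.match(regex, path) is not None


paths = ["/shoes", "/shoes?page=2", "/shoes?color=red&sort=price"]

for path in paths:
    blanket = pattern_matches("/*?*", path)        # blocks every query string
    narrow = pattern_matches("/*?*sort=", path)    # blocks only sort variants
    print(f"{path:32} blanket={blanket} narrow={narrow}")
```

The blanket rule blocks the paginated URL along with the sort variant; the narrower rule leaves pagination crawlable.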
Should You Actually Block AI Training Crawlers?
This is where I'll give you an actual opinion rather than the usual "it depends on your goals" non-answer.
If you're a publisher, blogger, or content creator: yes, block training crawlers. The value exchange is too lopsided. You create content, they scrape it to train models that then answer questions without sending traffic back to you. Allowing training access and hoping for AI search visibility is a false tradeoff: you can get AI search citations through OAI-SearchBot and PerplexityBot without feeding GPTBot.
If you're a SaaS or tool company: this is more nuanced. Having your product docs and help content included in LLM training can increase brand recall when AI tools make recommendations. But weigh that against the bandwidth cost and the fact that AI might recommend you based on outdated documentation.
If you run an e-commerce site: block training crawlers without hesitation. Your product pages, pricing, and category structure being in training data doesn't help you. Your bandwidth being consumed does hurt you.
How to Check What's Actually Hitting Your Site Right Now
Before you change anything in your robots.txt, know what you're dealing with. Here's a quick way to audit your actual crawler traffic:
- Pull your server access logs for the past 30 days (not GA; raw server logs)
- Filter for known AI crawler user agent strings: GPTBot, ClaudeBot, OAI-SearchBot, PerplexityBot, Bytespider, CCBot, Amazonbot (Google-Extended is a robots.txt token only, so it never shows up as a user agent in logs)
- Count total requests and bytes transferred per bot
- Identify which bots are consuming bandwidth vs. which are crawling useful pages
- Check if any bots are ignoring your existing Disallow rules (signs: hitting /admin/, /checkout/, etc.)
- Note which bots are crawling your highest-value pages (blog posts, product pages, docs)
On most shared and managed WordPress hosts, you won't have direct log access. Check with your hosting provider; many now offer basic bot traffic breakdowns in their dashboards. Cloudflare users have a significant advantage here; the Cloudflare Radar data is excellent.
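If you do have raw logs, the audit steps above can be sketched in a few lines of Python. This assumes the common "combined" access log format used by default in Apache and Nginx. The sample lines, IP addresses, and user agent strings are invented for illustration, and the private path prefixes are placeholders for whatever your own Disallow rules cover.

```python
import re
from collections import defaultdict

# Substrings to look for in the User-Agent header.
AI_BOTS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot",
           "Bytespider", "CCBot", "Amazonbot"]

# Paths a well-behaved bot should not be requesting (placeholders).
PRIVATE_PREFIXES = ("/admin/", "/checkout/")

# Combined format: ip ident user [time] "METHOD path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def audit(lines):
    """Count requests, bytes, and private-path hits per known AI crawler."""
    stats = defaultdict(lambda: {"requests": 0, "bytes": 0, "private_hits": 0})
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in some other format
        agent = match["agent"].lower()
        bot = next((name for name in AI_BOTS if name.lower() in agent), None)
        if bot is None:
            continue
        entry = stats[bot]
        entry["requests"] += 1
        entry["bytes"] += 0 if match["bytes"] == "-" else int(match["bytes"])
        if match["path"].startswith(PRIVATE_PREFIXES):
            entry["private_hits"] += 1
    return dict(stats)


SAMPLE = [
    '203.0.113.7 - - [12/May/2026:10:01:02 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [12/May/2026:10:01:05 +0000] "GET /admin/login HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '198.51.100.4 - - [12/May/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

result = audit(SAMPLE)
for bot, row in sorted(result.items(), key=lambda item: -item[1]["bytes"]):
    print(bot, row)
```

On a real site you would pass an open log file instead of SAMPLE, for example audit(open("access.log", encoding="utf-8", errors="replace")). Matching on the user agent string alone trusts whatever the client claims to be, so treat the totals as a first pass.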
What About llms.txt? Is That the New robots.txt?
You might have heard about the llms.txt proposal: a file placed at your site root that gives AI systems structured information about your content. Think of it as a sitemap specifically for LLMs.
As of mid-2026, this is still more of an emerging convention than a hard standard. Perplexity has said they look at it. Claude apparently uses it in some contexts. But it's not replacing robots.txt; it's supplementary. My recommendation: implement llms.txt if you have the time, but it's not a substitute for getting your robots.txt right. Robots.txt controls access. llms.txt guides understanding. Both matter, in that order.
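For a sense of what goes in the file, here is a rough sketch that builds one in the layout the llms.txt proposal describes: an H1 title, a blockquote summary, then H2 sections of annotated Markdown links. The site name, URLs, and notes are all invented, and since the convention is still moving, check the current proposal before relying on this shape.

```python
def build_llms_txt(title, summary, sections):
    """Build llms.txt text. sections maps a heading to (name, url, note) tuples."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


llms_txt = build_llms_txt(
    title="Example Co",
    summary="Invoicing software for freelancers. Docs and pricing are linked below.",
    sections={
        "Docs": [
            ("Quickstart", "https://example.com/docs/quickstart.md", "Install and first run"),
            ("API reference", "https://example.com/docs/api.md", "Endpoints and auth"),
        ],
        "Optional": [
            ("Changelog", "https://example.com/changelog.md", "Release history"),
        ],
    },
)
print(llms_txt)
```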
The Quick-Start Action Plan
If you've been putting this off, here's your 20-minute fix:
Check your current robots.txt
Visit yourdomain.com/robots.txt and look at what's actually there. Note what AI crawlers are currently allowed or blocked (probably nothing is specified; that's the common case).
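If you'd rather script that check, this sketch lists which AI crawler tokens your current file names in a User-agent line and which it never mentions. The token list is my own shortlist from this article, and the function names are illustrative. The fetch helper is defined but not called here, so the example runs offline against a 2019-style file.

```python
import urllib.request

AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
             "Google-Extended", "Applebot-Extended", "Bytespider", "CCBot"]


def fetch_robots_txt(domain):
    """Download https://<domain>/robots.txt and return it as text."""
    with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def mentioned_agents(robots_txt, tokens=AI_TOKENS):
    """Split tokens into (named in a User-agent line, not named anywhere)."""
    named = set()
    for line in robots_txt.splitlines():
        field, _, value = line.split("#", 1)[0].partition(":")
        if field.strip().lower() == "user-agent":
            named.add(value.strip().lower())
    found = [token for token in tokens if token.lower() in named]
    missing = [token for token in tokens if token.lower() not in named]
    return found, missing


# Offline example: the kind of file that hasn't been touched in years.
found, missing = mentioned_agents("User-agent: *\nDisallow: /wp-admin/\n")
print("named:", found)
print("not named:", missing)

# Against a live site: mentioned_agents(fetch_robots_txt("yourdomain.com"))
```

Anything in the "not named" list falls under your User-agent: * group, or is allowed everywhere if you have no such group.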
Decide your AI crawler policy
Using the framework above: which crawlers help your visibility in AI search (allow), which ones are training scrapers with no upside for you (block), and which are unknown bandwidth consumers (block).
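Once you've made those calls, it helps to write the policy down as data and generate the file from it, so the decision and the config can't drift apart. A minimal sketch, assuming a simple allow-or-block choice per crawler; the POLICY contents mirror the recommendations in this article and the sitemap URL is a placeholder.

```python
POLICY = {
    "Googlebot": "allow",
    "OAI-SearchBot": "allow",
    "PerplexityBot": "allow",
    "GPTBot": "block",
    "ClaudeBot": "block",
    "CCBot": "block",
    "Bytespider": "block",
}


def build_robots_txt(policy, sitemap_url):
    """Turn {agent: 'allow' | 'block'} into robots.txt text."""
    lines = []
    for agent, decision in policy.items():
        if decision not in ("allow", "block"):
            raise ValueError(f"unknown decision for {agent}: {decision!r}")
        lines.append(f"User-agent: {agent}")
        lines.append("Allow: /" if decision == "allow" else "Disallow: /")
        lines.append("")  # blank line between groups
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(POLICY, "https://example.com/sitemap.xml")
print(robots_txt)
```

Real files usually need path-level rules for the crawlers you allow (admin, checkout, account pages), so treat the output as a starting point.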
Generate and validate your new file
Write your rules, validate the syntax, and make sure you haven't accidentally blocked Googlebot or your preferred AI search bots. Use RankSorcery's Robots.txt Generator to build it cleanly without syntax headaches.
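For the syntax half of that step, a rough lint catches the usual slips: misspelled directives, a missing colon, or rules that appear before any User-agent line. This is a sketch with a deliberately small set of known fields, not a full validator for the robots.txt standard.

```python
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "crawl-delay", "sitemap"}
RULE_FIELDS = {"allow", "disallow", "crawl-delay"}


def lint_robots_txt(text):
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    in_group = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if not sep:
            problems.append(f"line {number}: missing ':' separator")
        elif field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            in_group = True
        elif field in RULE_FIELDS and not in_group:
            problems.append(f"line {number}: '{field}' before any User-agent line")
        elif field in ("allow", "disallow") and value and not value.startswith(("/", "*")):
            problems.append(f"line {number}: path should start with '/'")
    return problems


good = lint_robots_txt("User-agent: GPTBot\nDisallow: /\n\nSitemap: https://example.com/sitemap.xml\n")
bad = lint_robots_txt("Disallow: /admin/\nUser-agent GPTBot\nDisalow: /\n")

print(good)
for problem in bad:
    print(problem)
```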
Deploy and monitor
Upload the new robots.txt to your server root. Check the robots.txt report in Google Search Console (it replaced the old robots.txt tester) to confirm Google fetched the new file without errors. Monitor server logs over the next week to see if the bandwidth from blocked crawlers drops as expected.
The robots.txt file used to be a footnote in a technical SEO audit. In 2026, it's a meaningful business decision. Take 20 minutes and sort it out; your server bill (and your AI search visibility) will thank you.