GPTBot, ClaudeBot, PerplexityBot: Who's Crawling You and Why

Orova

Open your server logs for any week of 2026 and you will find visitors that did not exist three years ago. GPTBot working through your blog archive at two in the morning. ClaudeBot fetching the same pillar page four times in a month. PerplexityBot arriving seconds after you publish. Something called OAI-SearchBot that looks like OpenAI but is not GPTBot, and something called ChatGPT-User that looks like both and behaves like neither. For most site owners, this traffic is a fog — clearly machine, clearly AI-related, purpose unknown.

The fog matters because these bots are not interchangeable, and treating them as one undifferentiated swarm leads to bad decisions in both directions. Some of these crawlers are building training corpora for future models, a use you may or may not want to feed. Some are building search indexes that determine whether AI assistants can recommend you — traffic you almost certainly do want. Some are not crawlers at all but live fetches made on behalf of a human who asked about you ten seconds ago. Block the wrong one and you disappear from an answer engine; welcome the wrong one and you may be donating your content to a product that never sends anything back.

This is a field guide. We will sort the AI bots into the three categories that actually matter, walk through the roster company by company — OpenAI, Anthropic, Perplexity, Google, and the rest — explain what each visit means for your business, show you how to verify that a bot is who it claims to be, and map each user-agent to the control lever that governs it. By the end, your logs should read like a guest list instead of a fog.

AI bots fall into three types: training crawlers (GPTBot, ClaudeBot) that gather data for model training, search-index crawlers (OAI-SearchBot, PerplexityBot) that decide whether AI search can find you, and user-triggered fetchers (ChatGPT-User, Perplexity-User) that retrieve pages live when someone asks about you. Each type carries different value and a different robots.txt decision.

The three categories that make everything legible

Forget the individual bot names for a moment. Every AI-related visitor to your site is doing one of three jobs, and the job — not the company logo — determines what the visit is worth to you.

Category one: training crawlers

Training crawlers harvest web content to build the datasets that future AI models learn from. Their work is bulk, patient, and disconnected from any user: nobody asked a question that sent the bot to you; it is simply accumulating text. The value exchange here is the most contested question in publishing, and the honest framing is this: content crawled for training may influence what future models know and say about your domain — there is evidence that models reflect what they trained on — but it produces no link, no referral, and no attribution mechanism you can audit. You are feeding a model's general knowledge, not earning a citation. Whether that trade is good for you is a genuine business question, which we examine in depth in our piece on blocking AI crawlers as a business decision.

Category two: search-index crawlers

Search-index crawlers build the retrieval indexes behind AI search products — the live lookup layer that ChatGPT Search, Perplexity, and their peers query when answering questions about current, specific things. These bots are functionally the Googlebots of the answer-engine era. Their visits determine whether you are findable when an assistant goes looking, and the products they feed do cite sources and do send referral traffic — modest in volume, often strikingly high in intent, as our analysis of where Perplexity's referral traffic actually goes found. Blocking this category is functionally de-listing yourself from AI search.

Category three: user-triggered fetchers

The third category is not crawling at all, though it lives in the same logs. When a person asks an assistant something that requires reading your site right now ("summarize this article," "compare these two products," "what does this company's pricing page say"), the assistant fetches the page on the spot, on that user's behalf. These fetchers announce themselves with distinct user-agents, and the major vendors explicitly treat them as different from crawlers: they fetch a specific page because a human effectively requested it. A visit from this category is the closest thing the AI era has to a human reading your page through a machine intermediary. It is, in a real sense, the most valuable AI-related hit in your logs: a live prospect is consuming your content through an assistant at this exact moment.

Hold those three categories and the roster below stops being alphabet soup.

[Figure: Taxonomy diagram sorting AI bots into three categories: training crawlers like GPTBot and ClaudeBot, search-index crawlers like OAI-SearchBot and PerplexityBot, and user-triggered fetchers like ChatGPT-User and Perplexity-User, each with a different value to the site owner.]

OpenAI: three bots, three different deals

OpenAI operates the clearest example of the three-category split, with one bot per category — and the distinction it draws between them is the single most practically important fact in this guide, because people block the wrong one constantly.

GPTBot is the training crawler. It gathers web content that may be used to train and improve OpenAI's foundation models. It honors robots.txt, identifies itself plainly in its user-agent string, and OpenAI publishes the IP ranges it operates from so you can verify visits. Disallowing GPTBot tells OpenAI not to use your content for model training. It does not remove you from ChatGPT Search — that is a different bot.

OAI-SearchBot is the search-index crawler. It builds and maintains the index behind ChatGPT's search features — the linked, cited results that appear when ChatGPT searches the web. OpenAI states explicitly that OAI-SearchBot is not used for training. The practical consequence cuts both ways and is worth stating twice. You can block GPTBot and still appear, linked and cited, in ChatGPT Search. And if you block OAI-SearchBot — which some sites did unintentionally with blanket "block everything OpenAI" rules written before the bots were separated — you vanish from a search surface used by hundreds of millions of people, while gaining no training protection you did not already have.

ChatGPT-User is the user-triggered fetcher. It appears when a ChatGPT user's request requires fetching a specific page live — and increasingly when agentic features browse on a user's behalf. It is not a crawler in any meaningful sense; it goes where conversations send it. Blocking it does not protect your content from training or remove you from an index. It mostly means that when a real person asks ChatGPT about your page, the assistant fails to read it — and answers from whatever secondhand sources it can reach instead. For most businesses, that is strictly worse than the assistant reading the primary source.
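As a rough illustration of the three-bot split, the sketch below parses a sample robots.txt with Python's standard urllib.robotparser module and reports which OpenAI agents may fetch a page. The rules and the example.com URL are assumptions made for illustration, and the standard-library parser only approximates how any vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy (an assumption, not a recommendation):
# opt out of training, stay visible to search and live fetches.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def allowed_agents(robots_txt, url, agents):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/blog/post",  # hypothetical URL
    ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
)
print(result)
# {'GPTBot': False, 'OAI-SearchBot': True, 'ChatGPT-User': True}
```

Running a check like this against your real robots.txt is a quick way to catch a blanket rule that blocks the search or user-triggered agents by accident.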

Anthropic: ClaudeBot and its siblings

Anthropic's lineup mirrors the same structure. ClaudeBot is the training-and-improvement crawler — the bulk collector, analogous to GPTBot. It respects robots.txt and identifies itself in the user-agent string. Site owners who watched their logs through 2024 and 2025 will remember ClaudeBot's reputation for enthusiasm: numerous operators reported heavy crawl volumes, and a few high-profile complaints about aggressive crawling made the rounds. Anthropic has since published guidance on its crawlers and the controls it honors, but the episode is a useful reminder that crawl pressure — plain server load — is its own reason to manage bots, separate from any intellectual-property question.

Anthropic also operates user-level fetching under names including Claude-User — the user-triggered fetcher that retrieves pages when a Claude user's request requires it — and Claude-SearchBot for search-related indexing. Older tokens such as anthropic-ai and claude-web appear in legacy block lists around the web; if your robots.txt was written by copying one of those lists in 2023, it is worth reconciling it against Anthropic's current documentation, because rules aimed at retired names govern nothing.

The category logic applies unchanged: disallowing ClaudeBot is a training opt-out; blocking the user-triggered fetcher mostly breaks the experience of real people asking Claude about your pages.

Perplexity: the answer engine that arrives fast

Perplexity is an answer engine first: its product is a cited answer assembled from live retrieval, so its bots map naturally onto the second and third categories. PerplexityBot is the index crawler that keeps Perplexity's retrieval layer fresh; site owners routinely observe it arriving within hours, sometimes sooner, when new content is published on sites it tracks. Perplexity-User is the user-triggered fetcher for live page reads during a conversation.

Perplexity deserves a candid note, because it is the company around which the crawler-trust conversation got loudest. Through 2024 and 2025, multiple independent investigations — including a widely discussed technical report from a major CDN provider — alleged that Perplexity retrieved content from sites that had blocked its declared bots, using undeclared crawlers that did not identify themselves. Perplexity disputed aspects of the characterization, and the dispute itself became part of the industry's education: robots.txt is a convention backed by reputation, not a lock backed by enforcement. Two practical lessons follow. First, if your blocking intent is serious, implement it at the network level — CDN bot management, WAF rules — rather than relying on robots.txt alone. Second, judge each vendor's bots on observed behavior in your own logs, not on the label in the user-agent string, because the string is self-reported.

On the value side, Perplexity is the strongest citation engine in the category: answers carry prominent source links, and for many B2B niches it sends small but unusually high-intent referral traffic. For most content businesses, PerplexityBot is a bot you want present.

Google: one crawler, one token, and a trap

Google's situation is structurally different from everyone else's, and misunderstanding it is the most consequential mistake in this entire topic.

Google does not operate a separate AI training crawler for the open web. Googlebot — the same crawler that has indexed your site for decades — feeds Google Search, and Google Search now includes AI Overviews and AI Mode. Google-Extended, the name you see in block lists, is not a crawler at all. It is a robots.txt token — a control flag. Disallowing Google-Extended tells Google not to use your content for training and grounding Gemini models. Googlebot's crawling, your search indexing, and your rankings are unaffected by it.

Here is the trap: Google-Extended does not control AI Overviews. AI Overviews are a Search feature, built from the Search index, governed by the same crawler and largely the same controls as classic results. There is no robots.txt line that removes you from AI Overviews while keeping you in the ten blue links; the closest available levers are snippet controls like nosnippet and max-snippet, which constrain what Search can display from your pages — at a cost to your regular snippets too. The decision Google's architecture forces is all-or-nothing in a way OpenAI's is not, and any blocking strategy that ignores this distinction is built on a false map. If AI Overviews are your concern, the productive path is not blocking but optimizing — which is exactly what our complete guide to ranking in AI Overviews covers.

The rest of the roster

Beyond the big four, several names recur in 2026 logs and deserve quick identification. CCBot is Common Crawl, a nonprofit that builds an open web corpus used by many AI labs as training data; blocking it is the broadest single training opt-out available, because it keeps your pages out of future snapshots that many models train on downstream (it does not retract pages already captured in earlier snapshots). Applebot-Extended is Apple's equivalent of Google-Extended: a token controlling whether content Applebot already crawled may train Apple's models, not a separate crawler. Meta-ExternalAgent serves Meta's AI data collection. Bytespider, associated with ByteDance, earned a reputation across the industry for heavy crawl volume and inconsistent robots.txt compliance, making it the standard example of a bot best handled at the network level rather than the honor-system level. Amazonbot and a tail of smaller agents round out the list, and the tail grows monthly.

You do not need to memorize the tail. You need the habit: when a new agent appears in your logs, identify the operator, place it in one of the three categories, check whether it documents itself and honors robots.txt, and file it under the policy you have already decided for that category.

[Figure: Reference table mapping major AI user-agents (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider) to their operator, purpose, and what blocking each one actually changes.]

Reading your own logs: finding and verifying them

Theory aside, here is the practical workflow for turning your logs into that guest list.

First, surface the visitors. Filter your access logs — or your CDN's bot analytics, which do this with less friction — for the user-agent substrings above: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, PerplexityBot, Perplexity-User, CCBot, Bytespider, Meta-ExternalAgent, Amazonbot. Tally hits per bot per week, note which sections of the site each one favors, and watch for the patterns that matter: training crawlers sweeping archives in bulk, index crawlers hitting new URLs quickly after publication, user-triggered fetchers arriving at specific deep pages in ones and twos during business hours.
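The filtering and tallying step above could be sketched roughly as follows. This assumes access logs in Combined Log Format with English month abbreviations; the sample lines, IP addresses, and simplified user-agent strings are synthetic placeholders, not real vendor strings, and the function name is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot", "Bytespider",
    "Meta-ExternalAgent", "Amazonbot",
]

# Captures the timestamp and the last quoted field (the user-agent)
# of a Combined Log Format line.
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def tally_ai_hits(lines):
    """Count hits per (ISO week, bot) for known AI user-agent substrings."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a Combined Log Format line
        user_agent = match.group("ua").lower()
        bot = next((a for a in AI_AGENTS if a.lower() in user_agent), None)
        if bot is None:
            continue  # not one of the AI agents we track
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        year, week, _ = stamp.isocalendar()
        counts[(f"{year}-W{week:02d}", bot)] += 1
    return counts


# Synthetic sample lines for illustration only.
SAMPLE_LINES = [
    '203.0.113.7 - - [10/Mar/2026:02:14:07 +0000] "GET /blog/2021/old-post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [10/Mar/2026:02:14:09 +0000] "GET /blog/2020/older-post HTTP/1.1" 200 4890 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.23 - - [11/Mar/2026:14:31:55 +0000] "GET /pricing HTTP/1.1" 200 2210 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '192.0.2.80 - - [11/Mar/2026:14:40:02 +0000] "GET / HTTP/1.1" 200 1800 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for (week, bot), hits in sorted(tally_ai_hits(SAMPLE_LINES).items()):
    print(week, bot, hits)
```

Substring matching on the user-agent is only a first pass: it counts anything that claims to be a given bot, so the totals still need the verification step before you trust them.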

Second, verify before you trust. A user-agent string is a costume: anyone can wear it, and scrapers impersonate reputable bots precisely because reputable bots get whitelisted. The major operators publish the IP ranges their bots crawl from; OpenAI, Anthropic, Google, and Perplexity all maintain published ranges or DNS verification methods, though the method and its coverage vary by vendor, so check each operator's current documentation. Cross-check a sample of each bot's traffic against its published ranges. Traffic claiming to be GPTBot from an unpublished residential IP is not GPTBot, and whatever it actually is belongs in your block rules regardless of your AI policy. Modern CDN bot-management products automate this verification and label traffic as verified or impersonated, which is one of the better reasons to route your site through one.
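A minimal sketch of the range cross-check, using Python's standard ipaddress module. The CIDR blocks below are placeholders taken from the reserved documentation ranges, not any operator's real ranges; in practice you would load the lists each operator publishes. The dictionary and function names are hypothetical.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), standing in for
# the lists each operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_verified(bot, ip, ranges=PUBLISHED_RANGES):
    """True if ip falls inside any range published for the claimed bot."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in ranges.get(bot, [])
    )


print(is_verified("GPTBot", "192.0.2.44"))    # True: inside the placeholder range
print(is_verified("GPTBot", "203.0.113.9"))   # False: claims GPTBot, wrong network
print(is_verified("UnknownBot", "192.0.2.44"))  # False: no published ranges on file
```

A bot with no ranges on file is treated as unverified here, which is the conservative default; whether to block or merely flag such traffic is a policy choice.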

Third, watch the cost column. AI crawl traffic is mostly cheap for text sites, but training crawlers at full enthusiasm can account for a meaningful share of total requests, and on sites with expensive dynamically rendered pages or large media libraries, crawl pressure becomes a real bill. Rate limiting at the CDN and caching policies are legitimate tools here, as are Crawl-delay directives for the bots that honor them (Crawl-delay is a non-standard robots.txt extension, and Googlebot, for one, ignores it). All of these are policy-neutral: you can welcome a bot and still refuse to let it request your pages forty times a second.
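To make the rate-limiting idea concrete, here is a small token-bucket sketch. It is one common way to cap request rates, not a description of how any particular CDN implements it; the class name, the one-request-per-second refill, and the burst of five are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float            # tokens added per second
    capacity: float        # maximum burst size
    tokens: float = 0.0    # tokens currently available
    updated: float = 0.0   # timestamp of the last refill, in seconds

    def allow(self, now):
        """Refill for the time elapsed, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Simulate a bot sending 40 requests within one second (one every 25 ms).
bucket = TokenBucket(rate=1.0, capacity=5.0, tokens=5.0)
allowed = sum(bucket.allow(i * 0.025) for i in range(40))
print(allowed)  # 5: the initial burst passes, the rest are refused
```

The bot is still welcome under this scheme; it simply gets refused responses once it exceeds the budget, and it regains capacity as time passes.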

Matching bots to levers: a decision in three lines

With categories, roster, and verification in hand, the actual governance decision compresses neatly.

For training crawlers — GPTBot, ClaudeBot, CCBot, Google-Extended and Applebot-Extended as tokens, Meta-ExternalAgent — the question is philosophical and commercial: do you want your content shaping future models' knowledge of your domain, given that nothing measurable flows back? For most marketing-driven businesses the honest answer is that being known by models is worth more than the unenforceable principle of withholding; for businesses whose content is the product, the answer often inverts. Decide once, as policy.

For search-index crawlers — OAI-SearchBot, PerplexityBot, and Googlebot in its dual role — the answer for almost everyone is allow, for the same reason almost everyone allows Googlebot: these indexes are where your future customers ask questions, and answer engines cite and link what they retrieve. The mechanics of earning those citations are their own discipline — our GEO playbook covers the full system — but the precondition is presence.

For user-triggered fetchers — ChatGPT-User, Claude-User, Perplexity-User — the answer is allow unless you have a specific abuse problem, because each hit is a live human consuming your content through an assistant. Blocking this category is turning away readers at the door because they arrived in a machine-shaped coat.

Write the three decisions into robots.txt for the compliant bots, enforce the hard lines at the CDN for the rest, document why, and revisit quarterly — because this roster changes faster than any other part of technical SEO right now.
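One way to keep the three decisions and the robots.txt file in sync is to generate the file from the category policy. The sketch below assumes the category assignments described in this guide and a sample policy (block training, allow the rest) chosen only for illustration. Googlebot is deliberately left out because it is governed by your classic search policy, and the function name is hypothetical.

```python
# Category assignments as described in this guide; revisit as the roster changes.
CATEGORIES = {
    "training": [
        "GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
        "Applebot-Extended", "Meta-ExternalAgent",
    ],
    "search_index": ["OAI-SearchBot", "PerplexityBot"],
    "user_triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def build_robots_txt(policy):
    """Render one robots.txt group per agent from {category: allow?}."""
    groups = []
    for category, agents in CATEGORIES.items():
        rule = "Allow: /" if policy[category] else "Disallow: /"
        for agent in agents:
            groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


# Sample policy (an assumption): opt out of training, allow everything else.
policy = {"training": False, "search_index": True, "user_triggered": True}
print(build_robots_txt(policy))
```

This only covers bots that honor robots.txt. The hard lines for non-compliant agents still belong at the CDN or WAF.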

What a healthy week of AI bot traffic looks like

To make the abstractions concrete, here is the shape of a typical week on a mid-sized B2B content site once you have the labels in place — the pattern we see repeatedly when teams run this exercise for the first time.

Training crawlers dominate raw volume. GPTBot and ClaudeBot together often account for the majority of AI-labeled hits, sweeping methodically through archive pages, old posts, and category indexes — content a human rarely touches anymore. Their pattern is broad and shallow: many URLs, few repeats, no correlation with your publishing schedule. CCBot appears in slower, periodic waves matching its corpus-building cycles. This volume looks alarming on a chart and means little individually; it is bulk ingestion, and the only number worth watching is total request load.

Index crawlers are smaller and sharper. PerplexityBot and OAI-SearchBot concentrate on your newest and most-linked URLs, often arriving within hours of publication and returning to refresh pages that get cited in answers. A spike of index-crawler attention on a specific post is an early signal that the post is being retrieved for live answers — worth cross-checking against your referral data.

User-triggered fetchers are the smallest line and the most interesting one. A handful of ChatGPT-User and Perplexity-User hits per day, landing on pricing pages, comparison posts, and documentation — each one is a person, mid-conversation, asking an assistant about you. When that line trends upward week over week, your brand is entering more AI conversations. That is the single most underrated growth signal in a 2026 server log, and almost nobody is watching it.
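Tracking that week-over-week trend takes only a few lines once you have weekly totals. The numbers below are made up for illustration and are not measurements from any real site; the function name is hypothetical.

```python
from typing import List, Optional


def week_over_week(counts: List[int]) -> List[Optional[float]]:
    """Percent change between consecutive weeks; None when the prior week is 0."""
    changes = []
    for previous, current in zip(counts, counts[1:]):
        if previous == 0:
            changes.append(None)  # growth from zero has no meaningful percentage
        else:
            changes.append((current - previous) * 100 / previous)
    return changes


# Made-up weekly totals of user-triggered fetcher hits.
weekly_fetcher_hits = [20, 25, 30, 45]
print(week_over_week(weekly_fetcher_hits))  # [25.0, 20.0, 50.0]
```

With volumes this small, a single week can swing wildly, so a run of several rising weeks is a more reliable signal than any one jump.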

From guest list to guest behavior

Knowing who crawls you is the prerequisite, not the payoff. The payoff comes from connecting the guest list to outcomes: which engines cite you, which assistants describe your product accurately, which AI surfaces send the visitors that sign up. That connection — crawl access on one side, citations and referrals on the other — is the feedback loop that should drive your AI-search strategy, and it only exists if someone is watching both sides continuously. That monitoring is part of what Orova automates: tracking your visibility across classic search and AI engines alike, so that when GPTBot's successor shows up in your logs under a new name, you are deciding from data about what these bots have actually done for you — not from fog.
