AI / LLM User-Agents: Blocking Guide
Why Block AI/LLM User Agents?
As AI and Large Language Models (LLMs) become more prevalent, website owners may want to control access to their content for several reasons:
- Protect original content from being used to train AI models without consent
- Maintain website performance by limiting automated access
- Ensure human-centric engagement with content
- Comply with data protection regulations and intellectual property rights
However, it's important to note that some LLMs, such as those trained on Common Crawl data (e.g., GPT-3), access content through public datasets rather than by crawling your site directly, so blocking their respective user agents may not prevent all AI access to your content. Keep in mind as well that robots.txt is purely advisory: compliant crawlers honor it, but nothing technically prevents a crawler from ignoring it.
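For crawlers that ignore robots.txt, blocking has to be enforced at the server level, typically by inspecting the User-Agent header. Below is a minimal Python sketch of that idea using only the standard library; the bot list is an illustrative subset, and in practice this filtering is usually done in the web server or CDN configuration rather than in application code.

# Minimal sketch of server-side user-agent blocking. robots.txt is only
# advisory; this returns 403 to clients whose User-Agent matches a blocklist.
from wsgiref.simple_server import make_server

# Substrings matched case-insensitively; illustrative subset, not complete.
BLOCKED_AGENTS = ("gptbot", "claudebot", "ccbot", "bytespider")

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "").lower()
    if any(bot in ua for bot in BLOCKED_AGENTS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human visitor!"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()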
Known AI User-Agents
GPTBot (OpenAI)
GPTBot is OpenAI's web crawler used for training language models like ChatGPT. It scans and collects data from websites to improve the AI's knowledge and capabilities. By allowing GPTBot access, website owners indirectly contribute to the development of OpenAI's language models.
To block in robots.txt:
User-Agent: GPTBot
Disallow: /
Note that ChatGPT-User, the user agent OpenAI uses for live browsing on behalf of ChatGPT users, is a separate user agent and is not automatically covered by blocking GPTBot; add it explicitly if you want to block both.
OpenAI's models also power Microsoft's Bing Chat, so blocking OpenAI's crawlers may also impact the visibility of your content in Bing Chat results.
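Because any client can claim to be GPTBot in its User-Agent string, requests can additionally be verified against the IP ranges OpenAI publishes for GPTBot. A minimal sketch of that check, using placeholder ranges; the real ranges are listed in OpenAI's GPTBot documentation and change over time:

# Sketch: verify that a request claiming to be GPTBot comes from OpenAI's
# published IP ranges (User-Agent strings are trivially spoofed).
import ipaddress

# Placeholder networks (TEST-NET blocks); substitute OpenAI's published list.
GPTBOT_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_genuine_gptbot(client_ip: str) -> bool:
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in GPTBOT_RANGES)

print(is_genuine_gptbot("192.0.2.10"))   # True with the placeholder ranges
print(is_genuine_gptbot("203.0.113.5"))  # False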
ClaudeBot / Claude-Web (Anthropic)
ClaudeBot is Anthropic's web crawler used for training their AI model, Claude. It gathers information from the internet to enhance Claude's knowledge and capabilities. Claude-Web is a variant of Anthropic's web crawler, used for specific tasks or newer versions of their AI. It may have different crawling patterns or data collection methods compared to ClaudeBot. Allowing ClaudeBot access means your website's content may be used to improve Anthropic's AI systems.
To block in robots.txt:
User-Agent: ClaudeBot
User-Agent: Claude-Web
Disallow: /
CCBot (Common Crawl)
CCBot is the web crawler used by Common Crawl, a non-profit organization that creates and maintains an open repository of web crawl data. This data is used by various researchers, companies, and AI/LLM models for training and analysis. By allowing CCBot, your website's content becomes part of a widely used public dataset.
To block in robots.txt:
User-Agent: CCBot
Disallow: /
To opt out of the training of many public and non-public AI/LLM models, block CCBot, as Common Crawl data is widely used as training material.
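Keep in mind that blocking CCBot only stops future crawls; pages crawled earlier may already be part of published Common Crawl datasets. A quick sketch for checking whether a URL appears in a past snapshot via Common Crawl's public CDX index API (the crawl label below is an example; current labels are listed at index.commoncrawl.org):

# Sketch: query Common Crawl's CDX index for captures of a URL pattern.
import json
import urllib.error
import urllib.parse
import urllib.request

CRAWL = "CC-MAIN-2024-33"  # example snapshot label; replace with a current one
query = urllib.parse.urlencode({"url": "example.com/*", "output": "json"})
endpoint = f"https://index.commoncrawl.org/{CRAWL}-index?{query}"

try:
    with urllib.request.urlopen(endpoint) as resp:
        for line in resp:  # the index answers one JSON record per line
            record = json.loads(line)
            print(record.get("url"), record.get("timestamp"))
except urllib.error.HTTPError as err:
    # The index returns 404 when no captures match the query.
    print("No captures found." if err.code == 404 else err)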
Google-Extended
Google-Extended is Google's standalone product token that provides additional control over content usage in AI systems. It allows web publishers to opt out of their content being used to train Google's foundation models for generative AI features. Like Applebot-Extended, Google-Extended does not crawl webpages itself but determines how data already crawled by the standard Googlebot may be used.
To block in robots.txt:
User-Agent: Google-Extended
Disallow: /
Blocking Google-Extended only applies to the training of Google's generative AI models and does not affect search result inclusion.
Blocking Google-Extended also does not prevent your website's content from appearing in Google's AI Overviews in the search results. For that, use <meta name="robots" content="nosnippet" />.
Applebot-Extended
Applebot-Extended is Apple's secondary user agent, providing web publishers with additional control over how their content is used in Apple's AI systems. It allows opting out of content being used to train Apple's foundation models for generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. Like Google-Extended, Applebot-Extended does not crawl webpages but determines how to use data already crawled by the standard Applebot.
To block in robots.txt:
User-Agent: Applebot-Extended
Disallow: /
If robots.txt doesn't mention Applebot but does mention Googlebot, Applebot will follow Googlebot's instructions.
FacebookBot
FacebookBot is Meta's web crawler, used for various purposes including link previews and AI training. It gathers data that may be used in Meta's AI models and language processing systems. Allowing FacebookBot access could mean your content is used in Meta's AI development and social media features.
To block in robots.txt:
User-Agent: FacebookBot
Disallow: /
Meta-ExternalAgent
Meta-ExternalAgent is another Meta web crawler, used for various purposes including training AI models and improving products by indexing content directly. Allowing Meta-ExternalAgent access could mean your content is used in Meta's AI development and social media features.
To block in robots.txt:
User-Agent: Meta-ExternalAgent
Disallow: /
Diffbot
Diffbot is an AI-powered web scraping and data extraction service. It uses machine learning to understand and extract structured data from web pages. By allowing Diffbot, your website's content may be processed and included in various datasets used by Diffbot's clients.
To block in robots.txt:
User-Agent: diffbot
Disallow: /
PerplexityBot
PerplexityBot is the web crawler used by Perplexity AI, a company developing advanced search and AI technologies. It collects data to improve their AI models and search capabilities. Allowing PerplexityBot may contribute to the development of new AI-powered search and information retrieval systems.
To block in robots.txt:
User-Agent: PerplexityBot
Disallow: /
Omgili / Omgilibot
Omgili (Oh My God I Love It) is a web monitoring and analysis service. Their bot, Omgilibot, crawls websites to gather data for trend analysis and market intelligence. Allowing Omgilibot means your content may be included in Omgili's data analysis and reports.
To block in robots.txt:
User-Agent: Omgili
User-Agent: Omgilibot
Disallow: /
ImagesiftBot
ImagesiftBot is a specialized crawler focused on collecting and analyzing image data from websites. It may be used for image recognition AI training or building image databases. Allowing ImagesiftBot could mean your website's images are included in AI training datasets or image search engines.
To block in robots.txt:
User-Agent: ImagesiftBot
Disallow: /
Bytespider (Bytedance)
Bytespider is a web crawler operated by ByteDance, the company behind TikTok and other AI-powered platforms. It likely collects data for various AI applications, including content recommendation systems. Allowing this bot may contribute to the development of ByteDance's AI technologies.
To block in robots.txt:
User-Agent: Bytespider
Disallow: /
Amazonbot
Amazonbot is Amazon's web crawler, used for various purposes including product information gathering and AI training. It may collect data to improve Amazon's search, recommendation systems, and AI services. Allowing Amazonbot could mean your content contributes to Amazon's AI and e-commerce technologies.
To block in robots.txt:
User-Agent: Amazonbot
Disallow: /
Youbot
Youbot is the web crawler of SuSea, Inc., the company behind the You.com AI search engine. It collects web data for search indexing, AI training, or other data analysis purposes. Allowing Youbot access could mean your content is used in various AI or search-related applications.
To block in robots.txt:
User-Agent: Youbot
Disallow: /
Blocking All Known AI User-Agents
To block all the AI user agents listed above, you can use the following robots.txt configuration:
User-Agent: GPTBot
User-Agent: ClaudeBot
User-Agent: Claude-Web
User-Agent: CCBot
User-Agent: Google-Extended
User-Agent: Applebot-Extended
User-Agent: FacebookBot
User-Agent: Meta-ExternalAgent
User-Agent: diffbot
User-Agent: PerplexityBot
User-Agent: Omgili
User-Agent: Omgilibot
User-Agent: ImagesiftBot
User-Agent: Bytespider
User-Agent: Amazonbot
User-Agent: Youbot
Disallow: /
Remember that some of these user agents serve purposes beyond AI training (for example, link previews or search features), so blocking them may have side effects such as reduced visibility in AI-assisted search results. Consider your specific needs and the trade-offs involved when implementing such restrictions.
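To confirm that a deployed robots.txt actually disallows these agents, you can test it with Python's standard-library robots.txt parser; a minimal sketch (substitute your own domain):

# Sketch: report which AI user agents a live robots.txt blocks.
from urllib.robotparser import RobotFileParser

AI_AGENTS = [
    "GPTBot", "ClaudeBot", "Claude-Web", "CCBot", "Google-Extended",
    "Applebot-Extended", "FacebookBot", "Meta-ExternalAgent", "diffbot",
    "PerplexityBot", "Omgili", "Omgilibot", "ImagesiftBot", "Bytespider",
    "Amazonbot", "Youbot",
]

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the file

for agent in AI_AGENTS:
    allowed = parser.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

Note that for tokens like Google-Extended and Applebot-Extended this only checks the directive itself, since those tokens govern data usage rather than crawling.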
Disclaimer
The robots.txt tools and information shared on this website are under constant development, and improvements will be added over time. Any data uploaded to this website, or its APIs, is deleted within a few days after its purpose has ended, including personal information such as email addresses. For example, email addresses and all related identifiable data are deleted shortly after unsubscribing from the notifications. Email addresses are not distributed or sold to third parties or used for any purpose other than the one stated (notifying of results). Only on the rare occasion that a bug appears, some uploaded data may be stored slightly longer for debugging purposes, after which it is again completely discarded and deleted.
Bugs will happen. Despite best efforts to maintain the code base and the quality of the information shared on this website, no guarantees can or will be given. Data, information, and results may be incomplete and/or contain errors. This is a personal website and a for-fun project. Use at your own risk.