AI / LLM User-Agents: Blocking Guide

Why Block AI/LLM User Agents?

As AI and Large Language Models (LLMs) become more prevalent, website owners may want to control access to their content for several reasons:

  • Protect original content from being used to train AI models without consent
  • Maintain website performance by limiting automated access
  • Ensure human-centric engagement with content
  • Comply with data protection regulations and intellectual property rights

However, it's important to note that some LLMs, like those using CommonCrawl data (e.g., GPT-3), access content through public data sources. Blocking only their respective user agents may not prevent all AI access to your content.
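Before deploying any of the rules below, they can be sanity-checked locally. As a sketch, Python's standard-library robots.txt parser shows how a compliant crawler would interpret a Disallow rule; the rules string and URLs here are illustrative examples only:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules: block OpenAI's GPTBot site-wide.
rules = """\
User-Agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant GPTBot is disallowed everywhere; unlisted agents stay allowed.
print(parser.can_fetch("GPTBot", "https://example.com/articles/"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/"))  # True
```

Both the field name (User-Agent) and the product token are matched case-insensitively, which is why the capitalization variants shown throughout this guide all work.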

Known AI User-Agents

GPTBot (OpenAI)

GPTBot is OpenAI's web crawler used for training language models like ChatGPT. It scans and collects data from websites to improve the AI's knowledge and capabilities. By allowing GPTBot access, website owners indirectly contribute to the development of OpenAI's language models.

To block in robots.txt:

    User-Agent: GPTBot
    Disallow: /

ClaudeBot / Claude-Web (Anthropic)

ClaudeBot is Anthropic's web crawler used for training their AI model, Claude. It gathers information from the internet to enhance Claude's knowledge and capabilities. Claude-Web is a variant of Anthropic's web crawler, used for specific tasks or newer versions of their AI. It may have different crawling patterns or data collection methods compared to ClaudeBot. Allowing ClaudeBot access means your website's content may be used to improve Anthropic's AI systems.

To block in robots.txt:

    User-Agent: ClaudeBot
    User-Agent: Claude-Web
    Disallow: /

CCBot (Common Crawl)

CCBot is the web crawler used by Common Crawl, a non-profit organization that creates and maintains an open repository of web crawl data. This data is used by various researchers, companies, and AI/LLM models for training and analysis. By allowing CCBot, your website's content becomes part of a widely used public dataset.

To block in robots.txt:

    User-Agent: CCBot
    Disallow: /

Google-Extended

Google-Extended is Google's standalone product token that gives web publishers additional control over content usage in AI systems. It allows publishers to opt out of their content being used to train Google's foundation models for generative AI features. It is sometimes mis-cited as "Googlebot-Extended"; the correct robots.txt token is Google-Extended. Like Applebot-Extended, Google-Extended is not a separate crawler: it does not fetch webpages itself but determines how content already crawled by the standard Googlebot may be used.

To block in robots.txt:

    User-Agent: Google-Extended
    Disallow: /

Applebot-Extended

Applebot-Extended is Apple's secondary user agent, providing web publishers with additional control over how their content is used in Apple's AI systems. It allows opting out of content being used to train Apple's foundation models for generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. Like Googlebot-Extended, Applebot-Extended does not crawl webpages but determines how to use data already crawled by the standard Applebot.

To block in robots.txt:

    User-Agent: Applebot-Extended
    Disallow: /

Facebookbot

Facebookbot is Meta's web crawler, used for various purposes including link previews and AI training. It gathers data that may be used in Meta's AI models and language processing systems. Allowing Facebookbot access could mean your content is used in Meta's AI development and social media features.

To block in robots.txt:

    User-Agent: Facebookbot
    Disallow: /

Meta-ExternalAgent

Meta-ExternalAgent is a Meta web crawler, used for various purposes including training AI models or improving products by indexing content directly. Allowing Meta-ExternalAgent access could mean your content is used in Meta's AI development and social media features.

To block in robots.txt:

    User-Agent: Meta-ExternalAgent
    Disallow: /

Diffbot

Diffbot is an AI-powered web scraping and data extraction service. It uses machine learning to understand and extract structured data from web pages. By allowing Diffbot, your website's content may be processed and included in various datasets used by Diffbot's clients.

To block in robots.txt:

    User-Agent: diffbot
    Disallow: /

PerplexityBot

PerplexityBot is the web crawler used by Perplexity AI, a company developing advanced search and AI technologies. It collects data to improve their AI models and search capabilities. Allowing PerplexityBot may contribute to the development of new AI-powered search and information retrieval systems.

To block in robots.txt:

    User-Agent: PerplexityBot
    Disallow: /

Omgili / Omgilibot

Omgili (Oh My God I Love It) is a web monitoring and analysis service. Their bot, Omgilibot, crawls websites to gather data for trend analysis and market intelligence. Allowing Omgilibot means your content may be included in Omgili's data analysis and reports.

To block in robots.txt:

    User-Agent: Omgili
    User-Agent: Omgilibot
    Disallow: /

ImagesiftBot

ImagesiftBot is a specialized crawler focused on collecting and analyzing image data from websites. It may be used for image recognition AI training or building image databases. Allowing ImagesiftBot could mean your website's images are included in AI training datasets or image search engines.

To block in robots.txt:

    User-Agent: ImagesiftBot
    Disallow: /

Bytespider (Bytedance)

Bytespider is a web crawler operated by ByteDance, the company behind TikTok and other AI-powered platforms. It likely collects data for various AI applications, including content recommendation systems. Allowing this bot may contribute to the development of ByteDance's AI technologies.

To block in robots.txt:

    User-Agent: Bytespider
    Disallow: /

Amazonbot

Amazonbot is Amazon's web crawler, used for various purposes including product information gathering and AI training. It may collect data to improve Amazon's search, recommendation systems, and AI services. Allowing Amazonbot could mean your content contributes to Amazon's AI and e-commerce technologies.

To block in robots.txt:

    User-Agent: Amazonbot
    Disallow: /

Youbot

Youbot is the web crawler of SuSea, Inc., a search engine and AI company. It collects web data for search indexing, AI training, or other data analysis purposes. Allowing Youbot access could mean your content is used in various AI or search-related applications.

To block in robots.txt:

    User-Agent: Youbot
    Disallow: /

Blocking All Known AI User-Agents

To block all the AI user agents listed above, you can use the following robots.txt configuration:

    User-Agent: GPTBot
    User-Agent: ClaudeBot
    User-Agent: Claude-Web
    User-Agent: CCBot
    User-Agent: Google-Extended
    User-Agent: Applebot-Extended
    User-Agent: Facebookbot
    User-Agent: Meta-ExternalAgent
    User-Agent: diffbot
    User-Agent: PerplexityBot
    User-Agent: Omgili
    User-Agent: Omgilibot
    User-Agent: ImagesiftBot
    User-Agent: Bytespider
    User-Agent: Amazonbot
    User-Agent: Youbot
    Disallow: /
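Keep in mind that robots.txt is purely advisory: compliant crawlers honor it, but nothing forces a crawler to comply. For actual enforcement, requests can be rejected at the web server based on the User-Agent request header. The sketch below is a hypothetical Python WSGI middleware (the function and variable names are illustrative, not from any library). It uses case-insensitive substring matching, and it deliberately omits Google-Extended and Applebot-Extended, which are robots.txt-only tokens that never appear in request headers:

```python
# AI crawler tokens from the list above, lowercased for case-insensitive
# matching. Google-Extended and Applebot-Extended are omitted because they
# are robots.txt-only controls, not user agents that send requests.
BLOCKED_TOKENS = [
    "gptbot", "claudebot", "claude-web", "ccbot", "facebookbot",
    "meta-externalagent", "diffbot", "perplexitybot", "omgili",
    "imagesiftbot", "bytespider", "amazonbot", "youbot",
]

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent string contains any blocked token.

    Note: "omgili" also matches "Omgilibot", so one token covers both.
    """
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)

def block_ai_middleware(app):
    """Wrap a WSGI application and answer 403 to listed AI user agents."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

The same matching logic maps directly onto server configuration (for example, a case-insensitive User-Agent match in nginx or Apache); the middleware form simply keeps the example self-contained.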

Disclaimer

The robots.txt tools and information shared on this website are under constant development, and improvements will be added over time. Any data uploaded to this website or its APIs is deleted within a few days after its purpose has ended, including personal information such as email addresses. For example, email addresses and all related identifiable data are deleted shortly after unsubscribing from notifications. Email addresses are not distributed or sold to third parties or used for any purpose other than the one stated (notifying of results). Only on the rare occasion that a bug appears, some uploaded data may be stored slightly longer for debugging purposes, after which it is completely discarded and deleted.

Bugs will happen. Despite best efforts to maintain the code base and data quality of the information shared on this website, no guarantees can or will be given. Data, information and results may be incomplete and/or errors may occur. This is a personal website and for-fun project. Use at your own risk.


Made by SEO Expert Fili © 2023 - 2024