AI / LLM User-Agents: Blocking Guide

Why Block AI/LLM User Agents?

As AI systems, Large Language Models (LLMs), and AI agents become more prevalent, website owners may want to control access to their content for several reasons:

  • Protect original content from being used to train AI models without consent
  • Maintain website performance by limiting automated access
  • Ensure human-centric engagement with content
  • Comply with data protection regulations and intellectual property rights

However, it's important to note that some LLMs, such as those trained on Common Crawl data (e.g., GPT-3), access content through public data sources rather than by crawling your site directly. Blocking their respective user agents alone may not prevent all AI systems from accessing your content.
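
Keep in mind that robots.txt is advisory: well-behaved crawlers honor it, but nothing forces a bot to comply. If you want to enforce a block, you can also reject requests at the web server or application layer based on the User-Agent header. The snippet below is a minimal, hypothetical sketch using Python's standard-library WSGI tools; the substrings in the blocklist are illustrative examples and should be kept in sync with the user agents listed later in this guide. It only catches bots that identify themselves honestly, since user agents can be spoofed.

    # Hypothetical sketch: refuse requests whose User-Agent matches a known
    # AI crawler. The substrings below are examples only; keep them in sync
    # with the robots.txt rules in this guide.
    from wsgiref.simple_server import make_server

    BLOCKED_AGENT_SUBSTRINGS = ("gptbot", "claudebot", "ccbot", "bytespider")

    def block_ai_agents(app):
        def middleware(environ, start_response):
            user_agent = environ.get("HTTP_USER_AGENT", "").lower()
            if any(token in user_agent for token in BLOCKED_AGENT_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return middleware

    def site(environ, start_response):
        # Placeholder for the real application being protected.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello, human visitors!"]

    if __name__ == "__main__":
        make_server("", 8000, block_ai_agents(site)).serve_forever()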

Blocking AI User-Agents

To block all the AI user agents listed below, you can use the following robots.txt configuration:

    # Block all known AI crawlers and assistants
    # from using content for training AI models.
    # Source: https://robotstxt.com/ai
    User-Agent: GPTBot
    User-Agent: ClaudeBot
    User-Agent: Claude-Web
    User-Agent: CCBot
    User-Agent: Google-Extended
    User-Agent: Applebot-Extended
    User-Agent: FacebookBot
    User-Agent: Meta-ExternalAgent
    User-Agent: Meta-ExternalFetcher
    User-Agent: diffbot
    User-Agent: PerplexityBot
    User-Agent: Omgili
    User-Agent: Omgilibot
    User-Agent: webzio-extended
    User-Agent: ImagesiftBot
    User-Agent: Bytespider
    User-Agent: Amazonbot
    User-Agent: Youbot
    User-Agent: SemrushBot-OCOB
    User-Agent: PetalBot
    User-Agent: VelenPublicWebCrawler
    User-Agent: TurnitinBot
    User-Agent: Timpibot
    User-Agent: OAI-SearchBot
    User-Agent: ICC-Crawler
    User-Agent: AI2Bot
    User-Agent: AI2Bot-Dolma
    User-Agent: DataForSeoBot
    User-Agent: AwarioBot
    User-Agent: AwarioSmartBot
    User-Agent: AwarioRssBot
    User-Agent: Google-CloudVertexBot
    User-Agent: PanguBot
    User-Agent: Kangaroo Bot
    User-Agent: Sentibot
    User-Agent: img2dataset
    User-Agent: Meltwater
    User-Agent: Seekr
    User-Agent: peer39_crawler
    User-Agent: cohere-ai
    User-Agent: cohere-training-data-crawler
    User-Agent: DuckAssistBot
    User-Agent: Scrapy
    Disallow: /
    DisallowAITraining: /

    # Block any non-specified AI crawlers (e.g., new
    # or unknown bots) from using content for training
    # AI models. This directive is still experimental
    # and may not be supported by all AI crawlers.
    User-Agent: *
    DisallowAITraining: /

DisallowAITraining

Microsoft has proposed an experimental extension to the Robots Exclusion Protocol, introducing a new directive, DisallowAITraining. This directive is designed to prevent AI crawlers from using public web content for training purposes. It simplifies the process by allowing publishers to block all AI training crawlers with a single rule. As part of a broader effort to enhance control over content usage, this directive aligns with ongoing industry discussions and aims to standardize protections against unauthorized AI data collection.

To use in robots.txt:

        User-Agent: *
DisallowAITraining: /
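
Because this directive is experimental and parser support varies, it can be useful to verify which crawlers are excluded by the standard Disallow rules. Below is a minimal sketch using Python's built-in urllib.robotparser, assuming your robots.txt is published at the hypothetical URL https://example.com/robots.txt. Note that this parser only evaluates standard rules such as Disallow and ignores unknown directives like DisallowAITraining.

    # Minimal sketch: check whether specific AI user agents are disallowed
    # by a live robots.txt file (the URL below is a placeholder).
    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")  # hypothetical site
    parser.read()

    for agent in ("GPTBot", "ClaudeBot", "CCBot", "Bytespider"):
        allowed = parser.can_fetch(agent, "https://example.com/any-page")
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")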
    

Known AI User-Agents

GPTBot (OpenAI)

GPTBot is OpenAI's web crawler used for training language models like ChatGPT. It scans and collects data from websites to improve the AI's knowledge and capabilities. By allowing GPTBot access, website owners indirectly contribute to the development of OpenAI's language models.

To block in robots.txt:

        User-Agent: GPTBot
Disallow: /
    

ClaudeBot / Claude-Web (Anthropic)

ClaudeBot is Anthropic's web crawler used for training their AI model, Claude. It gathers information from the internet to enhance Claude's knowledge and capabilities. Claude-Web is a variant of Anthropic's web crawler, used for specific tasks or newer versions of their AI. It may have different crawling patterns or data collection methods compared to ClaudeBot. Allowing ClaudeBot access means your website's content may be used to improve Anthropic's AI systems.

To block in robots.txt:

        User-Agent: ClaudeBot
User-Agent: Claude-Web
Disallow: /
    

CCBot (Common Crawl)

CCBot is the web crawler used by Common Crawl, a non-profit organization that creates and maintains an open repository of web crawl data. This data is used by researchers, companies, and AI/LLM developers for model training and analysis. By allowing CCBot, your website's content becomes part of a widely used public dataset.

To block in robots.txt:

        User-Agent: CCBot
Disallow: /
    

Google-Extended

Google-Extended is a standalone Google product token that provides additional control over content usage in AI systems. It allows web publishers to opt out of their content being used to train Google's foundation models for generative AI features. Like Applebot-Extended, Google-Extended does not crawl webpages but determines how Google may use data already crawled by the standard Googlebot.

To block in robots.txt:

        User-Agent: Google-Extended
Disallow: /
    

Applebot-Extended

Applebot-Extended is Apple's secondary user agent, providing web publishers with additional control over how their content is used in Apple's AI systems. It allows opting out of content being used to train Apple's foundation models for generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. Like Googlebot-Extended, Applebot-Extended does not crawl webpages but determines how to use data already crawled by the standard Applebot.

To block in robots.txt:

        User-Agent: Applebot-Extended
Disallow: /
    

FacebookBot

FacebookBot is Meta's web crawler, used for various purposes including link previews and AI training. It gathers data that may be used in Meta's AI models and language processing systems. Allowing FacebookBot access could mean your content is used in Meta's AI development and social media features.

To block in robots.txt:

        User-Agent: FacebookBot
Disallow: /
    

Meta-ExternalAgent / Meta-ExternalFetcher (Meta)

Meta-ExternalAgent is a Meta web crawler used for purposes such as training AI models and improving Meta's products by indexing content directly. Allowing Meta-ExternalAgent access could mean your content is used in Meta's AI development and social media features.

Meta-ExternalFetcher is a crawler used by Meta (formerly Facebook) to fetch external content, such as link previews and metadata, when users share links on their platforms. While the primary purpose is to enhance user experience, the collected data may also inform AI models that manage content display and engagement.

To block in robots.txt:

        User-Agent: Meta-ExternalAgent
User-Agent: Meta-ExternalFetcher
Disallow: /
    

Diffbot

Diffbot is an AI-powered web scraping and data extraction service. It uses machine learning to understand and extract structured data from web pages. By allowing Diffbot, your website's content may be processed and included in various datasets used by Diffbot's clients.

To block in robots.txt:

        User-Agent: diffbot
Disallow: /
    

PerplexityBot

PerplexityBot is the web crawler used by Perplexity AI, a company developing advanced search and AI technologies. It collects data to improve their AI models and search capabilities. Allowing PerplexityBot may contribute to the development of new AI-powered search and information retrieval systems.

To block in robots.txt:

        User-Agent: PerplexityBot
Disallow: /
    

Omgili / Omgilibot

Omgili (Oh My God I Love It) is a web monitoring and analysis service. Their bot, Omgilibot, crawls websites to gather data for trend analysis and market intelligence. Allowing Omgilibot means your content may be included in Omgili's data analysis and reports.

More recently, Omgili has begun replacing Omgilibot with a new user agent dedicated to AI training, which identifies itself as webzio-extended. To make sure both the old and new Omgili bots are blocked, add all three user agents to your robots.txt.

To block in robots.txt:

        User-Agent: Omgili
User-Agent: Omgilibot
User-Agent: webzio-extended
Disallow: /
    

ImagesiftBot

ImagesiftBot is a specialized crawler focused on collecting and analyzing image data from websites. It may be used for image recognition AI training or building image databases. Allowing ImagesiftBot could mean your website's images are included in AI training datasets or image search engines.

To block in robots.txt:

        User-Agent: ImagesiftBot
Disallow: /
    

Bytespider (Bytedance)

Bytespider is a web crawler operated by ByteDance, the company behind TikTok and other AI-powered platforms. It likely collects data for various AI applications, including content recommendation systems. Allowing this bot may contribute to the development of ByteDance's AI technologies.

To block in robots.txt:

        User-Agent: Bytespider
Disallow: /
    

Amazonbot

Amazonbot is Amazon's web crawler, used for various purposes including product information gathering and AI training. It may collect data to improve Amazon's search, recommendation systems, and AI services. Allowing Amazonbot could mean your content contributes to Amazon's AI and e-commerce technologies.

To block in robots.txt:

        User-Agent: Amazonbot
Disallow: /
    

Youbot

Youbot is the web crawler of SuSea, Inc., the search engine and AI company behind You.com. It collects web data for search indexing, AI training, and other data analysis purposes. Allowing Youbot access could mean your content is used in various AI or search-related applications.

To block in robots.txt:

        User-Agent: Youbot
Disallow: /
    

SemrushBot-OCOB

SemrushBot-OCOB is a specialized web crawler deployed by Semrush's ContentShake AI tool. It is designed to analyze site content and generate actionable insights for content creation and optimization. By allowing SemrushBot-OCOB, your website's data may be utilized to enhance AI-driven recommendations and strategies offered through the ContentShake tool, aiding users in creating and refining content.

To block in robots.txt:

        User-Agent: SemrushBot-OCOB
Disallow: /
    

PetalBot

PetalBot is Huawei's web crawler for the Petal Search engine, supporting Huawei Assistant and AI Search services. It crawls websites to build an index database, enabling users to search for content on the Petal Search platform and powering AI-driven recommendations. Allowing PetalBot access ensures your website's visibility and inclusion in Huawei's search and AI services.

To block in robots.txt:

        User-Agent: PetalBot
Disallow: /
    

VelenPublicWebCrawler

VelenPublicWebCrawler is a web crawler developed by Velen for Hunter. It is designed to analyze publicly accessible internet pages to build business datasets and train machine learning models. Allowing VelenPublicWebCrawler access enables your content to contribute to datasets and machine learning insights for web understanding.

To block in robots.txt:

        User-Agent: VelenPublicWebCrawler
Disallow: /
    

TurnitinBot

TurnitinBot is Turnitin's web crawler, used to collect publicly accessible content for plagiarism prevention and academic integrity services. It ensures that student papers and other submissions are compared against a comprehensive database of indexed content. Allowing TurnitinBot ensures your website’s content is included in plagiarism checks, benefiting educational institutions globally.

To block in robots.txt:

        User-Agent: TurnitinBot
Disallow: /
    

Timpibot

Timpibot is the web crawler developed by Timpi to power its decentralized search engine and data index. It collects publicly accessible web content to support unbiased search results, train large language models (LLMs), and offer insights like sentiment analysis and cybersecurity ratings. Allowing Timpibot access ensures your content is included in Timpi’s innovative decentralized index, providing visibility in its search and analytical tools.

To block in robots.txt:

        User-Agent: Timpibot
Disallow: /
    

OAI-SearchBot

OAI-SearchBot is OpenAI's web crawler used to discover and index publicly available web content so that sites can be surfaced and linked in ChatGPT's search features. According to OpenAI, it supports search rather than model training, which is handled by GPTBot. Allowing OAI-SearchBot access means your site can appear in OpenAI's search results; blocking it removes your site from those results.

To block in robots.txt:

        User-Agent: OAI-SearchBot
Disallow: /
    

ICC-Crawler

ICC-Crawler is operated by the Universal Communication Research Institute of Japan's National Institute of Information and Communications Technology (NICT) to collect web pages for research and development of advanced information processing technologies, including multilingual translation and AI. Collected data is used for research purposes and may be shared with third parties for collaborative projects. Allowing ICC-Crawler contributes to advancements in language processing and AI research.

To block in robots.txt:

        User-Agent: ICC-Crawler
Disallow: /
    

AI2Bot

AI2Bot is the Allen Institute for AI's (AI2) web crawler, used for indexing publicly accessible web content to advance AI research and power tools like Semantic Scholar. The bot collects data to support academic initiatives and the development of AI-driven systems. Allowing AI2Bot ensures your content contributes to these research efforts, fostering innovation in AI and machine learning.

AI2Bot-Dolma is a specialized crawler by the Allen Institute for AI designed to collect diverse web data for the Dolma dataset, a pretraining dataset for AI models like OLMo. It focuses on gathering high-quality, openly licensed content for improving the performance of language models. Allowing AI2Bot-Dolma access supports advancements in AI research and the creation of open datasets.

To block in robots.txt:

        User-Agent: AI2Bot
User-Agent: AI2Bot-Dolma
Disallow: /
    

DataForSeoBot

DataForSeoBot is the web crawler operated by DataForSEO to collect publicly available online data for powering SEO tools, keyword research, and market analysis. The data collected supports advanced tools and AI models used to deliver insights on search engine optimization, online visibility, and marketing trends. By allowing DataForSeoBot access, your website's data may contribute to enhancing AI-powered features and analytical capabilities provided by DataForSEO.

To block in robots.txt:

        User-Agent: DataForSeoBot
Disallow: /
    

AwarioBot

AwarioBot is one of Awario's primary web crawlers, designed to collect data from publicly available online sources to support social listening and brand monitoring. This data is used in AI-powered analytics tools to track brand mentions, audience sentiment, and market trends. By allowing AwarioBot access, your content may contribute to machine learning models that enhance real-time analytics and competitive insights for businesses.

AwarioSmartBot is a focused crawler developed by Awario for deeper analysis and collection of online content. It leverages AI to provide businesses with actionable insights about brand reputation and audience behavior. This bot supports the development of smarter AI models and tools that improve predictive analysis and decision-making.

AwarioRssBot is used by Awario to fetch and monitor updates from RSS feeds, enabling users to stay informed on the latest mentions and relevant topics. While its primary function is not AI-driven, the data collected by AwarioRssBot feeds into AI tools that analyze trends and generate insights for marketers.

To block in robots.txt:

        User-Agent: AwarioBot
User-Agent: AwarioSmartBot
User-Agent: AwarioRssBot
Disallow: /
    

Google-CloudVertexBot

Google-CloudVertexBot is a crawler associated with Google's Cloud Vertex AI platform. It is designed to access publicly available web content to support AI model training and data analysis within Google Cloud services.

To block in robots.txt:

        User-Agent: Google-CloudVertexBot
Disallow: /
    

PanguBot

PanguBot is Huawei's web crawler, specifically designed to collect web content for training its multimodal large language model (LLM) known as PanGu. The bot downloads publicly available data to enhance AI capabilities, such as natural language understanding and multimodal applications, across Huawei's AI-driven platforms. By allowing PanguBot access, your website's content may contribute to advancing AI technologies developed by Huawei. However, concerns about attribution and usage of content in AI models may lead some website owners to block this bot.

To block in robots.txt:

        User-Agent: PanguBot
Disallow: /
    

Kangaroo Bot

Kangaroo Bot is a web crawler developed by Kangaroo LLM to gather publicly available data for training and improving its large language models (LLMs). These models are used for various applications, including natural language processing, content generation, and AI-driven tools tailored to industries such as education and customer support. By permitting Kangaroo Bot, your website's data may contribute to the refinement of AI technologies offered by Kangaroo LLM.

To block in robots.txt:

        User-Agent: Kangaroo Bot
Disallow: /
    

Sentibot

Sentibot is a web crawler developed by SentiOne, designed to gather publicly accessible data for social listening, sentiment analysis, and AI-driven insights. The bot collects online discussions and mentions to train and enhance SentiOne's conversational AI models and natural language understanding (NLU) engines. This data contributes to the development of advanced AI tools, such as chatbots, voicebots, and sentiment tracking systems, empowering businesses with actionable insights for brand management and customer engagement.

To block in robots.txt:

        User-Agent: Sentibot
Disallow: /
    

img2dataset

img2dataset is an open-source tool that downloads large sets of image URLs to create image datasets, facilitating machine learning tasks. It is particularly useful for training AI models that require extensive image data. By automating the collection and preprocessing of images, img2dataset streamlines the development of computer vision applications.

To block in robots.txt:

        User-Agent: img2dataset
Disallow: /
    

Meltwater

Meltwater utilizes web crawlers to collect data from news sources, blogs, and social media platforms. This data is analyzed using AI-driven tools to provide media intelligence, social listening, and consumer insights. By leveraging AI, Meltwater helps businesses monitor brand perception, track market trends, and make informed decisions.

To block in robots.txt:

        User-Agent: Meltwater
Disallow: /
    

Seekr

Seekr employs AI technologies to analyze and evaluate online content, providing tools for brand safety, content alignment, and AI model development. Their platform, SeekrFlow, allows businesses to build, validate, and run trusted AI models. While specific details about their web crawling activities are limited, Seekr's services involve collecting and analyzing web data to enhance AI applications.

To block in robots.txt:

        User-Agent: Seekr
Disallow: /
    

peer39_crawler

peer39_crawler is used by Peer39 to analyze website content for contextual advertising purposes. It helps advertisers understand the context and suitability of webpages, enabling them to align ads with content effectively. This process involves AI-driven contextual analysis to ensure brand safety and relevance in ad placements.

To block in robots.txt:

        User-Agent: peer39_crawler
Disallow: /
    

cohere-ai / cohere-training-data-crawler (Cohere)

cohere-ai utilizes web crawlers to gather publicly available text data for training and improving its large language models (LLMs). These models are foundational for natural language processing (NLP) applications, including text generation, sentiment analysis, search enhancement, and semantic understanding. The collected data supports Cohere's efforts to develop cutting-edge AI tools, such as the Cohere Command and Embed models, which are widely used in industries like finance, marketing, and customer support. By allowing Cohere-AI's crawlers, your website's content may contribute to advancements in language model research and enterprise-specific AI solutions.

cohere-training-data-crawler is operated by Cohere, a company specializing in natural language processing (NLP) models. This crawler collects textual data from the web to train and improve Cohere's language models, which are used in various AI applications, including text generation, sentiment analysis, and language understanding.

To block in robots.txt:

        User-Agent: cohere-ai
User-Agent: cohere-training-data-crawler
Disallow: /
    

DuckAssistBot

DuckAssistBot is associated with DuckDuckGo's DuckAssist feature, which provides AI-generated answers to user queries. The bot collects data to enhance the AI's ability to generate accurate and relevant responses, improving the search experience for users.

To block in robots.txt:

        User-Agent: DuckAssistBot
Disallow: /
    

Scrapy

Scrapy is an open-source web crawling framework written in Python. It is used by developers to build custom web scrapers for data extraction. While Scrapy itself is a tool, the data collected using Scrapy can be employed in various applications, including AI model training, data analysis, and research.

To block in robots.txt:

        User-Agent: Scrapy
Disallow: /
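
Note that this rule only deters Scrapy-based crawlers that keep the framework's default user agent string and are configured to respect robots.txt. As a rough illustration (not specific to any particular crawler), these behaviours are typically controlled in a Scrapy project's settings.py; the values below are examples.

    # Hypothetical excerpt from a Scrapy project's settings.py.
    # Projects generated by "scrapy startproject" enable ROBOTSTXT_OBEY by
    # default, so they honor Disallow rules; a crawler that overrides
    # USER_AGENT with a string not containing "Scrapy" will not match the
    # robots.txt group above.
    USER_AGENT = "Scrapy/2.11 (+https://scrapy.org)"  # default-style identifier
    ROBOTSTXT_OBEY = True  # obey robots.txt rules such as Disallow: /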
    

Disclaimer

The robots.txt tools and information shared on this website are under constant development, and improvements will be added over time. Any data uploaded to this website, or its APIs, is deleted within a few days after its purpose has ended, including personal information such as email addresses. For example, email addresses and all related identifiable data are deleted shortly after unsubscribing from the notifications. Email addresses are not distributed or sold to third parties or used for any purpose other than the one stated (notifying of results). Only on the rare occasion that a bug appears, some uploaded data may be stored slightly longer for debugging purposes, after which the uploaded data is still completely discarded and deleted.

Bugs will happen. Despite best efforts to maintain the code base and data quality of the information shared on this website, no guarantees can or will be given. Data, information and results may be incomplete and/or errors may occur. This is a personal website and for-fun project. Use at your own risk.


Made with by SEO Expert Fili © 2023 - 2025