What Is a Robots.txt Tester?
A robots.txt file is a plain-text file of directives that tell web crawlers which parts of your site they may or may not access. Well-behaved crawlers, including Googlebot, Bingbot, and increasingly AI training crawlers like GPTBot and ClaudeBot, fetch and parse this file before crawling any other URL. A single misplaced rule can accidentally block your entire site from being crawled, or unintentionally open sensitive pages to every bot on the internet.
Google operated an official robots.txt tester inside Google Search Console until it was shut down in 2023, leaving SEOs without a canonical tool for verifying rules. This free robots.txt tester fills that gap, implementing the RFC 9309 matching algorithm — the same specification that Google, Bing, and AI crawlers use to evaluate directives.
For a deeper understanding of how crawl control fits into your overall site architecture, see the guide to technical SEO.
Why Robots.txt Errors Are Costly
Robots.txt mistakes fall into two categories: over-blocking (accidentally preventing crawlers from reaching content you want indexed) and under-blocking (failing to restrict crawlers from pages that should remain private, like staging environments, admin panels, or duplicate content directories).
Over-blocking is more common and more damaging. A directive like Disallow: / under User-agent: * blocks every crawler from every URL — a configuration sometimes set on staging environments and accidentally promoted to production. Without a testing tool, this error can go undetected for days or weeks, causing an indexing collapse that takes months to recover from.
Under-blocking carries different risks. If you fail to block GPTBot or ClaudeBot, your original content may be used to train large language models without your consent. Many publishers made deliberate choices in 2023–2025 to allow or deny AI crawlers, and robots.txt remains the standard mechanism for communicating those preferences.
RFC 9309: The Matching Algorithm That Actually Matters
Not all robots.txt parsers behave identically. Google's implementation, now codified in RFC 9309, defines three critical rules that many simplified testers get wrong:
Longest match wins. When multiple rules match a URL path, the most specific one takes precedence — not the first one, and not the most permissive one. Given:
Disallow: /products/
Allow: /products/sale/
A request to /products/sale/item-1 matches both rules. The Allow: /products/sale/ rule wins because /products/sale/ (15 characters) is longer than /products/ (10 characters).
Equal-length ties go to Allow. If a Disallow and an Allow rule are exactly the same length and both match a path, the Allow directive wins. This is a tiebreaker that exists to protect against accidental blocking.
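The two precedence rules above can be sketched in a few lines of Python. This is a minimal illustration, not this tester's actual code: the `Rule` alias and the `is_allowed` function are hypothetical names, and the sketch assumes plain path prefixes with no wildcard handling.

```python
from typing import List, Tuple

# (directive, path prefix), e.g. ("disallow", "/products/"). Hypothetical shape.
Rule = Tuple[str, str]


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Return True if `path` may be crawled under `rules`."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # skip empty and non-matching rules
        is_allow = directive.lower() == "allow"
        # A longer match wins; on an equal-length tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/products/"), ("allow", "/products/sale/")]
print(is_allowed("/products/sale/item-1", rules))  # True
print(is_allowed("/products/shoes", rules))        # False
```

Because the comparison uses rule length rather than file position, reordering the two rules does not change the outcome.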
User-agent precedence. If a user-agent block exactly matches the crawler's name (e.g., User-agent: Googlebot), that block takes full precedence and the User-agent: * wildcard block is ignored for that crawler entirely.
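Group selection can be sketched the same way. The `select_group` function and the dictionary layout below are assumptions made for illustration; the sketch compares names case-insensitively and falls back to the `*` group only when no named group matches.

```python
from typing import Dict, List


def select_group(crawler: str, groups: Dict[str, List[str]]) -> List[str]:
    """Pick the rule lines that apply to `crawler`.

    `groups` maps a user-agent token (or "*") to that block's rule lines.
    """
    wanted = crawler.lower()
    for token, rules in groups.items():
        if token != "*" and token.lower() == wanted:
            return rules  # a named block replaces the wildcard block entirely
    return groups.get("*", [])  # no named block: use the wildcard, if any


groups = {
    "*": ["Disallow: /"],
    "Googlebot": ["Allow: /"],
}
print(select_group("googlebot", groups))  # ['Allow: /']
print(select_group("Bingbot", groups))    # ['Disallow: /']
```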
AI Crawlers: A 2026 Priority
The robots.txt landscape changed significantly when major AI companies began deploying training crawlers at scale. As of 2026, the most commonly blocked AI bots are:
| User-Agent | Operator | Common Use |
|---|---|---|
| GPTBot | OpenAI | LLM training data collection |
| ClaudeBot | Anthropic | LLM training data collection |
| PerplexityBot | Perplexity AI | Real-time search index |
| CCBot | Common Crawl | Open dataset used by many LLMs |
| Google-Extended | Google | Gemini training (separate from Googlebot) |
Blocking these crawlers selectively requires explicit User-agent blocks in your robots.txt. A blanket Disallow: / under User-agent: * will block them, but it will also block Googlebot, which is rarely the intent. This tester lets you simulate each user-agent individually to confirm your rules behave correctly for each crawler class.
The relationship between AI crawlers and organic search is explored further in the AI Overviews guide, which covers how Google's own AI-generated summaries interact with your content strategy.
How to Use the Robots.txt Tester
- Paste your robots.txt content — copy the raw text from https://yourdomain.com/robots.txt and paste it into the input field.
- Select a user-agent — choose from the dropdown: Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Google-Extended, or enter a custom user-agent string.
- Enter URL paths to test — add one or more paths (e.g., /products/, /admin/login, /blog/) in the path list. You can test multiple paths in a single run.
- Run the test — the tester applies RFC 9309 matching and returns a result for each path.
- Read the results — each path shows: Allowed or Blocked, the specific rule that matched, and the line number in your robots.txt where that rule appears.
- Iterate — edit your robots.txt in the input field and re-run to test proposed changes before deploying them.
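The same paste, select, and test loop can be approximated offline with Python's standard-library `urllib.robotparser`. The sample file, the `check_paths` helper, and the chosen agents and paths are illustrative assumptions. The standard-library parser evaluates rules in file order rather than by longest match, so its answers can differ from an RFC 9309 evaluation when Allow and Disallow rules overlap; the sample below avoids that case.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; replace with your own file's text.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def check_paths(robots_txt, agents, paths):
    """Return {(agent, path): allowed} for every agent/path combination."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


results = check_paths(ROBOTS_TXT, ["Googlebot", "GPTBot"], ["/blog/", "/admin/login"])
for (agent, path), ok in results.items():
    print(f"{agent:10} {path:15} {'Allowed' if ok else 'Blocked'}")
```

Here Googlebot falls through to the wildcard block, so only /admin/login is blocked for it, while GPTBot is blocked from both paths.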
Common Robots.txt Mistakes
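Forgetting the trailing slash on directories. Rules are prefix matches, so Disallow: /admin blocks /admin and /admin/settings, but also unrelated paths such as /administrator and /admin-guide.html. Use Disallow: /admin/ to limit the rule to the directory's contents.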
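Under plain prefix matching, the difference between the two forms can be seen with a short check. The `blocked_by` helper and the sample paths are hypothetical, and the sketch ignores wildcards.

```python
def blocked_by(prefix, paths):
    """Return the paths that a `Disallow: <prefix>` rule would match."""
    return [p for p in paths if p.startswith(prefix)]


paths = ["/admin", "/admin/settings", "/administrator", "/admin-guide.html"]
print(blocked_by("/admin", paths))   # all four paths
print(blocked_by("/admin/", paths))  # ['/admin/settings']
```

Note that the trailing-slash form no longer matches the bare /admin URL, so add a separate rule if that URL also needs to be blocked.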
Blocking crawlers from your XML sitemap. Your sitemap should always be accessible. If you have a broad Disallow: / block for certain agents, make sure your sitemap URL is not caught by it. The XML Sitemap Generator produces sitemap files and explains the correct Sitemap: directive syntax for referencing them inside robots.txt.
Using wildcards incorrectly. Disallow: /*.pdf$ relies on the * and $ special characters. RFC 9309 requires crawlers to support both, but the original 1994 robots.txt convention did not define them, so older or simplified parsers may treat the characters literally or ignore the rule entirely.
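One way a parser might interpret `*` and `$` is to translate the pattern into a regular expression. The sketch below is an assumption about a reasonable implementation, not a description of any particular crawler; `pattern_to_regex` and `matches` are hypothetical names.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")  # a trailing $ pins the match to the end
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start, which mirrors prefix matching.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/files/report.pdf"))             # True
print(matches("/*.pdf$", "/files/report.pdf?download=1"))  # False
print(matches("/*.pdf", "/files/report.pdf?download=1"))   # True
```

The second and third calls show why the `$` matters: without it, the pattern also matches URLs that merely contain `.pdf`.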
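Multiple User-agent blocks for the same agent. RFC 9309 says that when more than one group matches a crawler, the rules from those groups are combined into a single group, and Google documents the same behavior. Simpler parsers may read only the first matching block, so keep each agent's rules in one block to avoid ambiguity.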
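A small lint step can flag agents that are named more than once before the file is deployed. This is a rough sketch: `duplicate_agents` is a hypothetical helper, and it simply counts User-agent lines per token without modelling group boundaries.

```python
from collections import Counter


def duplicate_agents(robots_txt):
    """Return user-agent tokens that appear on more than one User-agent line."""
    counts = Counter()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            counts[value.strip().lower()] += 1
    return sorted(agent for agent, n in counts.items() if n > 1)


sample = """\
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /tmp/

User-agent: googlebot
Disallow: /drafts/
"""
print(duplicate_agents(sample))  # ['googlebot']
```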
FAQ
Does Google always respect robots.txt?
For crawling purposes, yes — Google will not fetch a URL blocked by robots.txt. However, Google may still index a blocked URL if other sites link to it; it will appear in search results with a "No information is available for this page" snippet. To prevent indexing entirely, you need a noindex directive delivered via HTTP response header or meta tag — which requires the page to be crawlable.
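To confirm that a page actually carries a noindex signal, you can inspect both delivery mechanisms. The sketch below is a simplified check under stated assumptions: `has_noindex` is a hypothetical helper that takes already-fetched headers and HTML, and it ignores crawler-specific forms such as an `X-Robots-Tag` value prefixed with a bot name.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if the header or a robots meta tag contains noindex."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives


print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))        # True
print(has_noindex({}, '<meta name="robots" content="noindex">'))     # True
print(has_noindex({}, "<p>No directive here</p>"))                   # False
```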
Can I block AI crawlers without affecting Googlebot?
Yes. Use separate User-agent blocks:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Googlebot
Allow: /
This configuration blocks AI training crawlers while leaving Googlebot unrestricted.
How often do crawlers re-read robots.txt?
Googlebot typically caches robots.txt for up to 24 hours. Changes you deploy may not take effect for existing crawl queues until the cache expires. For urgent blocks (e.g., a staging environment accidentally exposed), use the robots.txt report in Google Search Console to request a recrawl of the file.