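Robots.txt Tester

A free robots.txt tester that supports Googlebot, GPTBot, ClaudeBot, PerplexityBot, and all other major crawlers. Paste your robots.txt, choose a user-agent, and test multiple URL paths at once. Each result shows whether the path is allowed or blocked and which rule triggered the match, reproducing the RFC 9309 matching algorithm used by search engines and AI crawlers.

Tags: setup, robots.txt, crawl control, technical SEO, Googlebot, AI crawlers, RFC 9309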

Primary use

This tester recreates the RFC 9309 robots.txt matching algorithm that search engines and major AI crawlers follow, so you can confirm exactly which URLs a given user-agent may fetch. Paste your robots.txt contents, choose a crawler such as Googlebot, GPTBot, ClaudeBot, PerplexityBot, or Google-Extended, then list the URL paths you want to check. The tool reports allowed or blocked for each path and names the specific directive responsible, including how longest-match and allow-over-disallow precedence rules resolve conflicts. It fills the gap left when Google retired its official robots.txt tester in 2023, and it is especially useful now that publishers want to control AI training crawlers as deliberately as they manage traditional search bots.

Where it fits in the workflow

This tester sits in the technical SEO setup stage, verifying crawl directives before they ship and after every robots.txt edit.

Key features

RFC 9309-compliant matching that mirrors real crawler behavior
Support for Googlebot plus AI crawlers like GPTBot and ClaudeBot
Per-path allowed or blocked verdict for batch URL testing
Identification of the exact directive controlling each result
Correct handling of longest-match and allow-over-disallow precedence

Who it is for

Technical SEOs validating crawl rules before deploying robots.txt
Publishers deciding which AI training crawlers to allow or block
Developers debugging why a page is or is not being crawled

Tips for beginners

Robots.txt controls crawling, not indexing; a blocked URL can still appear in results, so use noindex for true removal.
Test the exact user-agent string a crawler sends, because rules for Googlebot do not automatically apply to AI bots like GPTBot.
Remember that the most specific matching rule wins, so a broad Disallow can be overridden by a narrower Allow.

Robots.txt Tester

Validate robots.txt directives per RFC 9309

Inputs: robots.txt content, User-Agent, and test paths (one per line). Example results:

Path	Result	Matched Rule
/	✓ Allowed	Allow: / [line 7]
/admin/dashboard	✓ Allowed	Allow: / [line 7]
/admin/public/page	✓ Allowed	Allow: / [line 7]
/docs/report.pdf	✗ Blocked	Disallow: /*.pdf$ [line 6]
/docs/report.pdfx	✓ Allowed	Allow: / [line 7]
/about	✓ Allowed	Allow: / [line 7]

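What is a robots.txt tester?

Robots.txt is a plain-text file of directives that tells web crawlers which pages they may and may not access. Major search engines such as Google and Bing, as well as AI training crawlers such as GPTBot and ClaudeBot, read and parse this file before fetching any other URL. A single mistyped rule can accidentally shut crawlers out of an entire site or, conversely, leave sensitive pages open to every crawler.

Google used to offer an official robots.txt testing tool inside Google Search Console, but it was shut down in 2023, leaving a gap. This free robots.txt tester fills that gap with a full implementation of the RFC 9309 matching algorithm, the rule-parsing specification that Google, Bing, and AI crawlers actually use.

For where robots.txt fits into overall site architecture, see the Technical SEO guide.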

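Why are robots.txt mistakes so costly?

Common robots.txt mistakes fall into two categories: over-blocking (accidentally preventing crawlers from reaching content you want indexed) and under-blocking (failing to keep crawlers away from pages that should stay private, such as staging environments, admin panels, or duplicate-content directories).

Over-blocking is more common and more damaging. A Disallow: / directive under User-agent: * blocks every crawler from every URL, unless a crawler has a group of its own. This configuration is sometimes set on a staging environment and then accidentally shipped to production. Without a testing tool, the error can go unnoticed for days or even weeks, pages drop out of the index, and recovery can take months.

Under-blocking carries a different risk. If you do not block GPTBot or ClaudeBot, your original content may be used to train large language models without your consent. Between 2023 and 2025, many publishers began using robots.txt to decide deliberately whether to allow AI crawlers.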

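RFC 9309: the matching algorithm that actually matters

Not all robots.txt parsers behave the same way. Google's implementation is now standardized in RFC 9309, which defines three key rules that many simplified testers get wrong:

Longest match wins. When several rules match a URL path, the most specific rule takes precedence, not the first one and not the most permissive one. For example: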

Disallow: /products/
Allow: /products/sale/

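For the path /products/sale/item-1, both rules match. Allow: /products/sale/ wins because /products/sale/ (15 characters) is longer than /products/ (10 characters).

Allow wins ties. If a Disallow rule and an Allow rule are exactly the same length and both match a path, the Allow directive wins. This is a safeguard against accidental blocking.

An exact user-agent match takes precedence. If a user-agent group names the crawler exactly (for example, User-agent: Googlebot), that group takes full precedence and the User-agent: * wildcard group has no effect on that crawler.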
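The precedence rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tester's actual implementation: `groups`, `select_group`, and `resolve` are hypothetical names, rules are assumed to be already parsed into (directive, path) pairs, and only plain prefix matching is handled (no wildcards).

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("disallow", "/products/")


def select_group(groups: Dict[str, List[Rule]], agent: str) -> List[Rule]:
    """Use the group named after the crawler; fall back to '*' only if none exists."""
    by_name = {name.lower(): rules for name, rules in groups.items()}
    if agent.lower() in by_name:
        return by_name[agent.lower()]
    return by_name.get("*", [])


def resolve(rules: List[Rule], path: str) -> Tuple[bool, Optional[Rule]]:
    """Return (allowed, winning_rule) using longest match, with Allow winning ties."""
    best: Optional[Rule] = None
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules never win
        longer = best is None or len(prefix) > len(best[1])
        tie_for_allow = (
            best is not None and len(prefix) == len(best[1]) and directive == "allow"
        )
        if longer or tie_for_allow:
            best = (directive, prefix)
    if best is None:
        return True, None  # nothing matched, so the path is allowed by default
    return best[0] == "allow", best


# Hypothetical parsed rules, reusing the /products/ example from the text.
groups = {
    "*": [("disallow", "/")],
    "Googlebot": [("disallow", "/products/"), ("allow", "/products/sale/")],
}

googlebot_rules = select_group(groups, "Googlebot")
print(resolve(googlebot_rules, "/products/sale/item-1"))   # the longer Allow wins
print(resolve(googlebot_rules, "/products/shoes"))         # only the Disallow matches
print(resolve(googlebot_rules, "/about"))                  # the * group is not consulted
print(resolve(select_group(groups, "Bingbot"), "/about"))  # falls back to the * group
```

Keeping group selection separate from rule resolution mirrors the two-stage reasoning in the text: first decide which group applies to the crawler, then decide which rule inside that group wins.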

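AI crawlers: a key issue in 2026

As AI companies began deploying training crawlers at scale, the way robots.txt is used changed significantly. In 2026, the most commonly blocked AI crawlers include:

User-Agent	Operator	Primary purpose
GPTBot	OpenAI	LLM training data collection
ClaudeBot	Anthropic	LLM training data collection
PerplexityBot	Perplexity AI	Real-time search index
CCBot	Common Crawl	Open dataset (used by many LLMs)
Google-Extended	Google	Gemini training (separate from Googlebot)

Blocking these crawlers requires an explicit User-agent group for each one in robots.txt. A blanket Disallow under User-agent: * also blocks Googlebot unless Googlebot has its own group, which is usually not the intended effect. Use this tool to simulate each user-agent in turn and confirm that the rules behave as expected for every type of crawler.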
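One way to avoid hand-editing a group per crawler is to generate the groups from a list. The sketch below is only an illustration: `AI_CRAWLERS` and `build_blocklist` are hypothetical names, and the list simply reuses the user-agent tokens from the table above.

```python
from typing import Iterable

# User-agent tokens taken from the table above.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Google-Extended"]


def build_blocklist(agents: Iterable[str]) -> str:
    """Render one 'Disallow: /' group per agent, separated by blank lines."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


print(build_blocklist(AI_CRAWLERS))
```

Because each crawler gets its own named group and no User-agent: * group is emitted, Googlebot and other search crawlers are left untouched by the generated text.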

For how AI crawlers relate to organic search, and how Google AI Overviews interact with content strategy, see the AI Overviews guide.

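How to use the robots.txt tester

Paste your robots.txt content: copy the raw text from https://yourdomain.com/robots.txt and paste it into the input box.
Choose a user-agent: pick Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, or Google-Extended from the dropdown, or enter a custom user-agent string.
Enter the URL paths to test: add one or more paths (such as /products/, /admin/login, /blog/); multiple paths can be tested in a single run.
Run the test: the tester applies the RFC 9309 matching algorithm and returns a result for each path.
Read the results: each path shows whether it is allowed or blocked, the specific rule that matched, and that rule's line number in robots.txt.
Iterate: edit the robots.txt in the input box and retest to confirm the effect of your changes before deploying.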
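The steps above can be approximated end to end in Python. This is a rough sketch and not the tool's own code: `parse`, `check_paths`, and `SAMPLE` are hypothetical, wildcards are left out to keep it short, and the output format only imitates the Path / Result / Matched Rule columns.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str, int]  # (directive, path, line number in the file)


def parse(text: str) -> Dict[str, List[Rule]]:
    """Collect Allow/Disallow rules per user-agent, remembering line numbers."""
    groups: Dict[str, List[Rule]] = {}
    current: List[str] = []   # agents named by the group being read
    in_rules = False          # True once the current group has at least one rule
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:      # a rule line ended the previous group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])  # repeated groups share one list
        elif field in ("allow", "disallow") and current:
            in_rules = True
            for agent in current:
                groups[agent].append((field, value, number))
    return groups


def check_paths(text: str, agent: str, paths: List[str]) -> List[Tuple[str, str, str]]:
    """Return (path, verdict, matched rule) rows for the chosen user-agent."""
    groups = parse(text)
    rules = groups.get(agent.lower(), groups.get("*", []))
    rows = []
    for path in paths:
        matches = [r for r in rules if r[1] and path.startswith(r[1])]
        if not matches:
            rows.append((path, "Allowed", "(no matching rule)"))
            continue
        # Longest path wins; on equal length, Allow ranks above Disallow.
        directive, rule_path, number = max(
            matches, key=lambda r: (len(r[1]), r[0] == "allow")
        )
        verdict = "Allowed" if directive == "allow" else "Blocked"
        rows.append((path, verdict, f"{directive.capitalize()}: {rule_path} [line {number}]"))
    return rows


SAMPLE = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/
"""

for row in check_paths(SAMPLE, "Googlebot", ["/", "/admin/dashboard", "/admin/public/page"]):
    print("\t".join(row))
```

Carrying the line number through parsing is what makes the final report useful: when a verdict looks wrong, the matched rule points straight at the line to edit before retesting.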

Common robots.txt mistakes

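Missing trailing slash on a directory path. Rules are prefix matches, so Disallow: /admin blocks /admin and /admin/settings, but it also blocks unrelated paths such as /administrator. Use Disallow: /admin/ when you mean only the contents of the directory, and note that this form no longer matches /admin itself.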

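Crawlers blocked from the XML sitemap. Your sitemap should always be accessible. If you set a broad Disallow: / group for certain agents, make sure the sitemap URL is not caught by it. The XML Sitemap Generator page covers the correct syntax for referencing sitemap files from robots.txt with the Sitemap: directive.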

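Misused wildcards. Disallow: /*.pdf$ relies on the * and $ special characters. RFC 9309 requires crawlers to support both, but they were not part of the original robots exclusion convention, so older or non-compliant parsers may ignore the rule or treat the characters literally.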
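For parsers that do support the * and $ special characters, the matching can be sketched by translating each pattern into a regular expression. The helper names `compile_pattern` and `matches` are hypothetical, and this is one possible translation rather than the approach any particular crawler is known to use.

```python
import re


def compile_pattern(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")      # a trailing $ pins the end of the path
    body = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then let each * stand for any run of characters.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """True if the pattern matches the path starting from its first character."""
    return compile_pattern(pattern).match(path) is not None


print(matches("/*.pdf$", "/docs/report.pdf"))    # path ends in .pdf
print(matches("/*.pdf$", "/docs/report.pdfx"))   # $ rejects the trailing x
print(matches("/private*/", "/private-area/index.html"))
```

The first two calls correspond to the /docs/report.pdf and /docs/report.pdfx rows in the example results table, where the same rule blocks one path and leaves the other allowed.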

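Multiple User-agent groups for the same agent. If Googlebot appears in two separate groups, RFC 9309 requires the crawler to combine their rules into a single group, and Google documents the same behavior. The merged result is easy to misread, so keep all of an agent's rules in one group.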

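FAQ

Does Google always obey robots.txt?

For crawling, yes: Google does not fetch URLs blocked by robots.txt. But if other sites link to a blocked URL, Google may still index it and show it in search results with a "No information is available for this page" snippet. To keep a page out of the index entirely, you need to send a noindex directive through an HTTP response header or a meta tag, which requires that the page itself remain crawlable.

Can I block AI crawlers without affecting Googlebot?

Yes. Use separate User-agent groups: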

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /

This configuration blocks the AI training crawlers while placing no restrictions on Googlebot.
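A quick way to sanity-check a simple file like this one is Python's standard-library `urllib.robotparser`. It is used here only as a cross-check for this straightforward configuration; its matching is not assumed to follow RFC 9309's longest-match or wildcard rules, and the `/blog/post` path is an arbitrary example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network request

for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, "/blog/post") else "blocked"
    print(f"{agent}: {verdict}")
```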

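How often do crawlers re-read robots.txt?

Googlebot generally caches robots.txt for up to 24 hours. Changes you deploy may not take effect for the existing crawl queue until the cache expires. For an urgent block (for example, an accidentally exposed staging environment), use the robots.txt report in Google Search Console to request a recrawl of the file.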
