A03 · Access & Crawlability

No robots.txt Blocking

TL;DR

Your page may be blocked by a robots.txt rule, which stops crawlers from fetching it and makes rankings and citations unreliable. Follow the steps below to diagnose the cause, apply the fix, and verify the result. Finish by running an Oversearch AI Page Optimizer scan.

Why this matters

Access and crawlability are prerequisites. If crawlers can’t fetch or parse your content, rankings and citations become unreliable, and LLMs may fail to extract answers.

Where this shows up in Oversearch

In Oversearch, open AI Page Optimizer and run a scan for the affected page. Then open Benchmark Breakdown to see evidence, and use the View guide link to jump back here when needed.

How do I check if robots.txt is blocking my page?

Fetch your site’s robots.txt file and look for Disallow rules that match your page’s URL path.

Robots.txt is always at the root of your domain (e.g., https://example.com/robots.txt). Any Disallow rule that matches a URL path prefix will prevent compliant crawlers from fetching that page.

  • Open https://yourdomain.com/robots.txt in a browser.
  • Look for Disallow: rules under the relevant User-agent: section.
  • Use Google Search Console’s robots.txt report and the URL Inspection tool to check specific URLs (the standalone robots.txt Tester has been retired).
  • Remember: path matching is prefix-based (Disallow: /blog blocks /blog/my-post too).
  • Check for wildcard patterns like Disallow: /*? that block query parameters.
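
If you prefer a scripted check, the sketch below uses Python’s standard-library urllib.robotparser; the domain, page URL, and user-agent list are placeholders to swap for your own.

  from urllib.robotparser import RobotFileParser

  # Placeholder URLs: replace with your own domain and page.
  ROBOTS_URL = "https://example.com/robots.txt"
  PAGE_URL = "https://example.com/blog/my-post"

  parser = RobotFileParser()
  parser.set_url(ROBOTS_URL)
  parser.read()  # fetches and parses the live robots.txt

  # Test several user-agents against the same page.
  for agent in ("*", "Googlebot", "GPTBot", "OAI-SearchBot"):
      verdict = "allowed" if parser.can_fetch(agent, PAGE_URL) else "blocked"
      print(f"{agent}: {verdict} for {PAGE_URL}")

Treat this as a quick sanity check: urllib.robotparser follows the standard matching rules, but individual crawlers may interpret edge cases slightly differently.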

If you use Oversearch, open AI Page Optimizer → Benchmark Breakdown to see whether robots.txt blocking was detected.

What does Disallow: / mean and why is it dangerous?

Disallow: / tells every matching crawler not to fetch any page on the entire site. It is the nuclear option: a single line that blocks everything.

This is sometimes added accidentally during development or staging setup and forgotten when deploying to production. A single line can make your entire site invisible to search engines and AI systems.

  • Disallow: / blocks every URL on the domain for the specified user-agent.
  • Under User-agent: *, it applies to every crawler that does not have a more specific user-agent section of its own.
  • It does not remove pages that are already indexed, but it prevents re-crawling and blocks discovery of new pages.
  • Check that staging robots.txt rules are not deployed to production.
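
For reference, this is roughly what the dangerous staging file looks like next to a typical production file; the sitemap URL is a placeholder.

  # Staging file that must never reach production
  User-agent: *
  Disallow: /

  # Typical production file: allow everything and point to the sitemap
  User-agent: *
  Disallow:

  Sitemap: https://example.com/sitemap.xml

An empty Disallow: line means nothing is blocked, which is the usual default for public sites.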

If you use Oversearch, open AI Page Optimizer → Benchmark Breakdown to confirm no blanket disallow is active.

Do AI crawlers follow robots.txt?

Most major AI crawlers respect robots.txt, but each identifies itself with its own user-agent token, so your rules need to match those tokens to allow (or block) them.

OpenAI uses GPTBot and OAI-SearchBot, Anthropic uses ClaudeBot (older documentation also lists anthropic-ai), and Google uses the Google-Extended token to control AI use of crawled content. If your robots.txt has User-agent: * / Disallow: /, all of these are blocked. You need either a blanket allow or specific user-agent sections.

  • Check your robots.txt for user-agent-specific blocks that might affect AI crawlers.
  • Add explicit Allow rules for AI user-agents if you want LLM citations.
  • The main AI crawler user-agents: GPTBot, OAI-SearchBot, ClaudeBot, Google-Extended, PerplexityBot.
  • Allowing AI crawlers while blocking others is a valid strategy.
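
As an illustration, a robots.txt that keeps a restrictive default but explicitly opens the site to these AI crawlers might look like the sketch below; adjust the paths and user-agent list to your own policy.

  # Restrictive default for unlisted bots
  User-agent: *
  Disallow: /private/

  # Explicitly allow the AI crawlers you want citations from
  User-agent: GPTBot
  User-agent: OAI-SearchBot
  User-agent: ClaudeBot
  User-agent: Google-Extended
  User-agent: PerplexityBot
  Allow: /

Grouping several User-agent lines above one rule set is valid, but if you are unsure how a particular bot parses grouped lines, repeat the Allow: / rule under each user-agent separately.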

If you use Oversearch, open AI Page Optimizer → Benchmark Breakdown to see which crawlers can access your page.

How do I allow crawling for a specific folder only?

Combine a broad Disallow with a specific Allow for the folder you want crawled. The longest (most specific) matching rule wins, so a targeted Allow overrides a broader Disallow.

This is useful when you want to expose only certain sections (e.g., /blog/, /docs/) to crawlers while keeping the rest private.

  • Use Disallow: / followed by Allow: /blog/ to open only the blog.
  • An Allow rule must be at least as long (specific) as the Disallow it overrides; crawlers apply the longest matching rule.
  • Test with Search Console’s robots.txt report or a robots.txt validator before deploying.
  • Remember: robots.txt is public — do not use it to hide sensitive URLs.
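
A minimal sketch of the folder-only pattern, assuming you want only /blog/ and /docs/ crawled; the sitemap URL is a placeholder.

  User-agent: *
  Disallow: /
  Allow: /blog/
  Allow: /docs/

  Sitemap: https://example.com/sitemap.xml

Because /blog/ and /docs/ are longer matches than /, compliant crawlers apply the Allow rules inside those folders and the Disallow everywhere else.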

If you use Oversearch, open AI Page Optimizer → Benchmark Breakdown to verify crawl access for the target URL.

Common root causes

  • A staging robots.txt containing Disallow: / deployed to production.
  • A Disallow prefix that is broader than intended (remember /blog also blocks /blog/my-post).
  • Wildcard rules such as Disallow: /*? catching parameterized URLs.
  • AI crawlers blocked by user-agent-specific rules even though Googlebot is allowed.
  • Template-level configuration mismatch or conflicting signals.

How to detect

  • In Oversearch AI Page Optimizer, open the scan for this URL and review the Benchmark Breakdown evidence.
  • Verify the signal outside Oversearch with at least one method: fetch the HTML with curl -L, check response headers, or use a crawler/URL inspection tool (example commands follow this list).
  • Confirm you’re testing the exact canonical URL (final URL after redirects), not a variant.
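
For the curl check above, a few commands cover the basics; example.com and the paths are placeholders.

  # Fetch the live robots.txt and read the rules
  curl -sS -L https://example.com/robots.txt

  # Print the final status code and URL after redirects for the affected page
  curl -sS -L -o /dev/null -w "%{http_code} %{url_effective}\n" https://example.com/your-page

  # Inspect response headers (status line, redirects, X-Robots-Tag)
  curl -sS -L -I https://example.com/your-page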

How to fix

Start by checking your robots.txt rules (see: How do I check if robots.txt is blocking my page?) and whether AI crawlers are affected (see: Do AI crawlers follow robots.txt?). Then follow the steps below.

  1. Remove or narrow the Disallow rule that matches your page’s path, or add a more specific Allow for it, and redeploy robots.txt.
  2. Apply any further fix recommended by your scan and validate with Oversearch.

Verify the fix

  • Run an Oversearch AI Page Optimizer scan for the same URL and confirm the benchmark is now passing.
  • Confirm the page returns 200 OK and the primary content is present in the initial HTML.
  • Validate with an external tool (crawler, URL inspection, Lighthouse) to avoid false positives.

Prevention

  • Add automated checks for robots.txt, noindex, and canonicals on deploy (see the sketch after this list).
  • Keep a single, documented preferred URL policy (host/protocol/trailing slash).
  • After releases, spot-check Oversearch AI Page Optimizer on critical templates.
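
One way to cover the robots.txt part of that deploy check is a small CI script that fails the build when critical URLs become blocked. This is a minimal sketch, assuming example.com is your production domain and that the listed URLs and user-agents reflect what you care about.

  import sys
  from urllib.robotparser import RobotFileParser

  ROBOTS_URL = "https://example.com/robots.txt"  # placeholder domain
  CRITICAL_URLS = [
      "https://example.com/",
      "https://example.com/blog/",
  ]
  AGENTS = ("*", "Googlebot", "GPTBot")

  parser = RobotFileParser()
  parser.set_url(ROBOTS_URL)
  parser.read()

  blocked = [
      (agent, url)
      for agent in AGENTS
      for url in CRITICAL_URLS
      if not parser.can_fetch(agent, url)
  ]

  if blocked:
      for agent, url in blocked:
          print(f"BLOCKED: {agent} cannot fetch {url}")
      sys.exit(1)  # fail the deploy or CI step

  print("robots.txt check passed")

Run it as a post-deploy step (or point it at a staging URL before promotion) so a stray Disallow: / is caught before crawlers see it.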

FAQ

Why is Google saying ‘Blocked by robots.txt’?

Your robots.txt has a Disallow rule matching the page URL. Check the file at /robots.txt and look for rules that match the path. Use Search Console’s URL Inspection tool and robots.txt report to confirm which rule is matching. When in doubt, temporarily remove the Disallow rule and re-test.

Can I block some bots but allow Google and Bing?

Yes. Add separate User-agent sections for Googlebot and Bingbot with Allow rules, while keeping a restrictive default under User-agent: *. When in doubt, test each user-agent section independently.
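
A sketch of that layout; tighten or loosen the default section to match your policy.

  # Default for unlisted bots
  User-agent: *
  Disallow: /

  # Allow Google and Bing
  User-agent: Googlebot
  Allow: /

  User-agent: Bingbot
  Allow: /

A named crawler follows only its own most specific group and ignores the User-agent: * rules, so Googlebot and Bingbot obey the Allow: / lines here.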

How do I test robots.txt rules quickly?

Use Google Search Console’s robots.txt report, an online robots.txt validator, or a local parser such as Python’s urllib.robotparser. Paste your rules and test specific URLs against them. When in doubt, check your highest-value pages first.

Why did my new pages stop getting crawled after a deploy?

A deploy may have overwritten robots.txt with a staging version containing Disallow: /. Check whether your build process generates or copies a robots.txt file. When in doubt, always verify robots.txt immediately after every deploy.

Does blocking AI crawlers in robots.txt also block Google?

No. AI crawlers use separate user-agent tokens (GPTBot, ClaudeBot, etc.). Blocking those does not affect Googlebot. You can selectively allow or block each. When in doubt, add separate User-agent sections for each crawler you want to control.

Will removing a Disallow rule make my pages appear in search instantly?

No. After removing the block, crawlers need to re-discover and re-crawl the pages. This can take days to weeks. Speed it up by submitting URLs in Google Search Console. When in doubt, request indexing for your most important pages after updating robots.txt.