robots.txt · technical seo · crawl budget · search engine crawlers · ai crawlers · robots exclusion protocol · googlebot

Robots.txt Guide — Managing Search Engine Crawlers (2026)

SEOctopus · 14 min read

A website's communication with search engines extends far beyond the pages users see. When search engine crawlers (bots) visit your site, the first file they look for is robots.txt. This small text file acts like a security guard at your site's entrance: it determines which crawlers can access which sections and which ones should stay away. As of 2026, the importance of robots.txt is no longer limited to traditional search engines like Google and Bing — AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot also read this file to learn your site's access rules.

In this guide, we will thoroughly examine everything from the basic structure of robots.txt to advanced strategies, from AI crawler management to common mistakes. Our goal is to ensure that by the time you finish this article, you can write an optimized robots.txt file for your own site.

What Is Robots.txt?

Robots.txt is a plain text file located in the root directory of a website. It is based on the Robots Exclusion Protocol (REP), proposed by Martijn Koster in 1994. The file must always be accessible at https://example.com/robots.txt.

The primary purpose of this file is to tell search engine crawlers which parts of your site they can crawl and which parts they should avoid. An important distinction: robots.txt is advisory in nature, not mandatory. Well-behaved crawlers (Googlebot, Bingbot) respect these rules, but malicious bots may ignore the file entirely. Therefore, robots.txt should not be used as a security mechanism to protect sensitive content — use authentication, encryption, or server-level access controls instead.
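"Well-behaved" is the operative word: the crawler itself is expected to fetch robots.txt and check each URL against it before requesting the page. Here is a minimal sketch of that check using Python's standard urllib.robotparser module (the bot name MyBot and the rules are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt, parsed from a string for illustration
rules = """\
User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler runs this check before every request
print(rp.can_fetch("MyBot", "/admin/users"))  # False (disallowed)
print(rp.can_fetch("MyBot", "/blog/post-1"))  # True (crawlable)
```

A malicious bot simply skips this check, which is exactly why robots.txt cannot serve as access control.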

Robots.txt Syntax and Directives

A robots.txt file consists of several core directives. Each directive serves a specific purpose:

User-agent

Specifies which crawler the rules apply to. The wildcard character (*) covers all crawlers:

```
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /internal/
```

The first block blocks all crawlers from the /admin/ directory. The second block defines a rule specifically for Googlebot. When a crawler finds both specific and general (*) rules, it applies the specific rules meant for it.
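This precedence rule can be verified with Python's standard urllib.robotparser. In the sketch below, Googlebot matches its own block and ignores the general one, while a crawler with no specific block (Bingbot here, as an example) falls back to the * rules:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /internal/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot uses only its own block, so /admin/ is NOT blocked for it
print(rp.can_fetch("Googlebot", "/admin/"))     # True
print(rp.can_fetch("Googlebot", "/internal/"))  # False

# Every other crawler falls back to the * block
print(rp.can_fetch("Bingbot", "/admin/"))       # False
print(rp.can_fetch("Bingbot", "/internal/"))    # True
```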

Disallow

Prevents crawling of the specified path:

```
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
```

An empty Disallow: line means no restrictions — the crawler can access the entire site:

```
User-agent: *
Disallow:
```

Allow

Permits crawling of specific subpaths within a Disallowed parent directory. Google and Bing support this directive:

```
User-agent: *
Disallow: /admin/
Allow: /admin/public-reports/
```

In this example, while the /admin/ directory is blocked, pages under /admin/public-reports/ can still be crawled.

Sitemap

Declares the location of the XML sitemap file. This directive is independent of User-agent blocks and can be placed anywhere in the file:

```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml
```

If you have multiple sitemap files, you can specify each on a separate line. Check out our detailed guide on XML sitemap optimization.
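Since Python 3.8, urllib.robotparser can also read these declarations: the site_maps() method returns every Sitemap URL found in the file, which is handy for quick audits:

```python
from urllib.robotparser import RobotFileParser

rules = """\
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# site_maps() returns None if the file declares no sitemaps
print(rp.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-images.xml']
```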

Crawl-delay

Tells the crawler how many seconds to wait between consecutive requests:

```
User-agent: Bingbot
Crawl-delay: 10
```

Important note: Google does not support the Crawl-delay directive, and the crawl rate limiter was removed from Google Search Console in early 2024; Googlebot now adjusts its crawl rate automatically based on how your server responds. However, Bing, Yandex, and some other crawlers do honor this directive.

Wildcards and Advanced Patterns

Google and Bing support extended pattern matching in robots.txt, including wildcards (*) and the dollar sign ($):

Asterisk (*) — Any Character Sequence

```
User-agent: *
Disallow: /*.pdf$
Disallow: /*/print/
Disallow: /search?*q=
```

  • /*.pdf$: Blocks all URLs ending in .pdf.
  • /*/print/: Blocks print/ subpaths under any directory.
  • /search?*q=: Blocks search result pages.

Dollar Sign ($) — End of URL

The dollar sign indicates that the URL must end exactly at that point:

```
User-agent: *
Disallow: /*.php$
Allow: /index.php$
```

This rule blocks all URLs ending in .php but allows the exact URL /index.php. URLs with query parameters like /index.php?id=5 are not blocked because they don't meet the $ (end) condition.
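Note that Python's urllib.robotparser implements only the original prefix-matching protocol, so it cannot test these extended patterns. The hypothetical helper below sketches Google-style matching by translating a pattern into a regular expression; the string you test against should include the query string, since Google matches path and query together:

```python
import re

def google_style_match(pattern: str, path: str) -> bool:
    """Sketch of Google-style robots.txt pattern matching (* and $)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped \* back into ".*"
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(google_style_match("/*.php$", "/index.php"))       # True
print(google_style_match("/*.php$", "/index.php?id=5"))  # False (does not end in .php)
print(google_style_match("/admin", "/administrator"))    # True (plain prefix match)
```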

Common Robots.txt Patterns

Let's examine the most frequently used robots.txt configurations in practice:

1. Blocking Admin Panels

```
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /dashboard/
Allow: /wp-admin/admin-ajax.php
```

For WordPress sites, allowing admin-ajax.php is important because many front-end features depend on this file.

2. Blocking Search Result Pages

```
User-agent: *
Disallow: /search
Disallow: /?s=
Disallow: /search?*
```

Search result pages can generate low-quality, dynamic content that wastes crawl budget.

3. Blocking Staging/Test Environments

```
User-agent: *
Disallow: /staging/
Disallow: /test/
Disallow: /dev/
```

Alternatively, if your staging environment is on a completely separate subdomain (staging.example.com), it is better to block all access in that subdomain's own robots.txt:

```
User-agent: *
Disallow: /
```

4. Blocking Parameter-Based Filter Pages

```
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
Disallow: /*?color=
```

On e-commerce sites, sorting and filtering parameters can create thousands of duplicate pages. Blocking these pages from crawling is one of the most effective ways to preserve crawl budget.

[Image: Robots.txt directive flow diagram showing how User-agent, Disallow, Allow, and Sitemap directives interact]

Robots.txt vs Meta Robots vs X-Robots-Tag

There are three different mechanisms for controlling crawler behavior. Each has a different scope and use case:

Robots.txt

  • Scope: Controls the crawling stage.
  • Location: A single file in the site's root directory.
  • What it does: Tells the crawler "don't crawl this page."
  • What it doesn't do: Does not prevent page indexing. If other sites link to that page, Google can index it without even crawling it.
  • Ideal use: Crawl budget management, preventing unnecessary pages from being crawled.

Meta Robots Tag

  • Scope: Controls the indexing stage.
  • Location: Inside the <head> section of the HTML page.
  • What it does: Tells the crawler "don't index this page" or "don't follow links on this page."
  • Values: noindex, nofollow, noarchive, nosnippet, max-snippet, max-image-preview, max-video-preview.

```html
<meta name="robots" content="noindex, nofollow">
```

  • Ideal use: Removing specific pages from the index.

X-Robots-Tag HTTP Header

  • Scope: Controls the indexing stage (same as meta robots).
  • Location: In the HTTP response header.
  • Advantage: Can be used for non-HTML files such as PDFs, images, and videos.

```
HTTP/1.1 200 OK
X-Robots-Tag: noindex, nofollow
```

  • Ideal use: Controlling indexing of non-HTML resources.

Comparison Table

| Feature | robots.txt | Meta Robots | X-Robots-Tag |
| --- | --- | --- | --- |
| Blocks crawling | Yes | No | No |
| Blocks indexing | No | Yes | Yes |
| Non-HTML files | No | No | Yes |
| Page-level control | Limited | Yes | Yes |
| Implementation | File | HTML head | HTTP header |

Critical mistake: Blocking a page with robots.txt while simultaneously trying to de-index it with a noindex meta tag does not work. Since the crawler cannot access the page, it never sees the meta tag. If you want to remove a page from the index, don't block it with robots.txt — leave the page crawlable and use the noindex tag. We cover this in detail in our technical SEO guide.
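This conflict is easy to audit with a short script. In the sketch below (the rules, the /old-campaign/ section, and the page paths are hypothetical), any page on the noindex list that robots.txt blocks gets flagged, because Googlebot would never see its noindex tag:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /old-campaign/
"""

# Pages you intend to remove from the index with a noindex meta tag
noindex_pages = ["/old-campaign/landing", "/legacy/page"]

rp = RobotFileParser()
rp.parse(rules.splitlines())

for path in noindex_pages:
    if not rp.can_fetch("Googlebot", path):
        print(f"Conflict: {path} is blocked, so its noindex tag will never be seen")
```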

Testing Your Robots.txt

Testing your robots.txt file before deploying it is critical. An incorrect rule could cause your entire site to drop from the index.

Google Search Console — Robots.txt Tester

The robots.txt testing tool in Google Search Console allows you to see how rules in your file apply to specific URLs:

  1. Log in to Google Search Console.
  2. Navigate to "Settings" > "robots.txt" in the left menu.
  3. View your current robots.txt file.
  4. Test specific URLs to check blocked/allowed status.

Bing Webmaster Tools

Bing Webmaster Tools also offers a similar robots.txt validation tool. It is particularly useful for verifying how Bing interprets the Crawl-delay directive.

Command Line Testing

To check the accessibility and HTTP status code of your robots.txt file:

```bash
curl -I https://example.com/robots.txt
```

The expected response should be 200 OK. A 404 Not Found means search engines will consider the entire site crawlable. A 5xx error can cause crawlers to stop crawling your site entirely — a problem that's hard to detect but has severe consequences.

Python Validation

Python's standard library includes the urllib.robotparser module for parsing robots.txt files:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check if a specific URL can be fetched
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))  # False
print(rp.can_fetch("*", "https://example.com/blog/"))           # True
```

AI Crawler Management (2026)

In 2026, one of the most current and critical use cases for robots.txt is managing AI crawlers. Large language models like ChatGPT, Claude, Perplexity, and Gemini use specialized crawlers to collect content from the web. These crawlers commit to respecting robots.txt rules.

Major AI Crawlers and Their User-agent Names

| Crawler | User-agent | Company | Purpose |
| --- | --- | --- | --- |
| GPTBot | GPTBot | OpenAI | ChatGPT training data and web browsing |
| ChatGPT-User | ChatGPT-User | OpenAI | ChatGPT real-time web search |
| Google-Extended | Google-Extended | Google | Gemini AI training data |
| ClaudeBot | anthropic-ai | Anthropic | Claude training data |
| PerplexityBot | PerplexityBot | Perplexity | AI search engine |
| CCBot | CCBot | Common Crawl | Open-source data collection |
| Bytespider | Bytespider | ByteDance | TikTok/ByteDance AI models |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence |

Blocking All AI Crawlers

If you don't want your content to be used as training data by AI models:

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

Selective AI Crawler Management

You may want to allow some AI crawlers while blocking others. For example, receiving traffic from Perplexity and ChatGPT's web search feature while blocking training data collection:

```
# Real-time AI search — allow (brings traffic)
User-agent: ChatGPT-User
Disallow:

User-agent: PerplexityBot
Disallow:

# Training data collection — block
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /
```

Partial AI Crawler Access

Keep your blog content open to AI crawlers while blocking premium or gated content:

```
User-agent: GPTBot
Disallow: /premium/
Disallow: /members-only/
Disallow: /api/
Allow: /blog/
Allow: /docs/

User-agent: PerplexityBot
Disallow: /premium/
Disallow: /members-only/
Allow: /blog/
```

Strategic Thinking on AI Crawler Management

In 2026, completely blocking AI crawlers can be a costly decision in terms of visibility. Platforms like ChatGPT, Perplexity, and Gemini have become information sources for millions of users. Having your content referenced on these platforms can create a new traffic channel. When determining your strategy, consider these questions:

  • Are you uncomfortable with your content being used as training data by AI models?
  • Is receiving organic traffic from AI search engines valuable for your business model?
  • Should only premium/paid content be protected, or all your content?

A selective approach based on the answers to these questions is generally the most sensible strategy.

Common Robots.txt Mistakes

Errors in robots.txt files can cause issues that are hard to detect but have severe consequences. Let's examine the most common mistakes and their solutions:

1. Blocking CSS and JavaScript Files

```
# WRONG — Don't do this!
User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /assets/
```

Google needs access to CSS and JavaScript files to render (visually interpret) your pages. If you block these files, Google cannot render your page properly, and your ranking performance will drop significantly. You'll receive "Resources blocked" warnings in Google Search Console.

2. Accidentally Blocking the Entire Site

```
# DANGEROUS — A single slash blocks the entire site!
User-agent: *
Disallow: /
```

This rule prevents all crawlers from accessing the entire site. It may be appropriate for staging environments, but it should never be used in production. A single character mistake can completely remove your site from the index.

3. Case Sensitivity

Robots.txt paths are case-sensitive:

```
Disallow: /Admin/  # Only blocks /Admin/
Disallow: /admin/  # Only blocks /admin/ — these are different rules
```

Even if your server operates case-insensitively, paths in robots.txt are interpreted as case-sensitive. Adding both variations is a safe practice.

4. Forgetting the Trailing Slash

```
Disallow: /admin   # Blocks /admin, /admin.html, /administrator — everything starting with /admin
Disallow: /admin/  # Only blocks the /admin/ directory and its contents
```

Without a trailing slash, Disallow: /admin matches all URLs starting with /admin. This can lead to unintended consequences.
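The difference can be demonstrated with urllib.robotparser, which uses the same prefix matching (the blocked() helper is a hypothetical convenience wrapper):

```python
from urllib.robotparser import RobotFileParser

def blocked(rule: str, path: str) -> bool:
    """Return True if a single Disallow rule blocks the given path."""
    rp = RobotFileParser()
    rp.parse(f"User-agent: *\nDisallow: {rule}".splitlines())
    return not rp.can_fetch("*", path)

# Without the trailing slash, every path starting with /admin is blocked
print(blocked("/admin", "/administrator"))   # True
print(blocked("/admin", "/admin.html"))      # True

# With the trailing slash, only the directory itself is blocked
print(blocked("/admin/", "/administrator"))  # False
print(blocked("/admin/", "/admin/users"))    # True
```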

5. Confusing Empty File with No File

  • No file (404): All crawlers can access the entire site.
  • Empty file (200, no content): All crawlers can access the entire site.
  • File with content: Rules are applied.

Having no file and having an empty file are practically the same, but an empty file represents a conscious decision and is more professional. At minimum, maintaining a robots.txt file with the Sitemap directive is best practice.

6. Placing robots.txt in the Wrong Location

Robots.txt only works in the site's root directory:

```
# Correct:   https://example.com/robots.txt
# Wrong:     https://example.com/pages/robots.txt
# Different: https://blog.example.com/robots.txt — only applies to blog.example.com
```

Each subdomain requires its own robots.txt file. www.example.com and example.com may have different robots.txt files.

7. Protocol Error in Sitemap URL

```
# WRONG
Sitemap: /sitemap.xml

# CORRECT
Sitemap: https://example.com/sitemap.xml
```

The Sitemap directive must always use a full URL (protocol + domain + path).

Robots.txt Strategies for Large Sites

For sites with hundreds of thousands or millions of pages, robots.txt is one of the most important tools for crawl budget management.

Faceted Navigation Control

On e-commerce sites, filtering, sorting, and pagination parameters create enormous numbers of URL combinations:

```
User-agent: *

# Filtering parameters
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*&color=
Disallow: /*&size=
Disallow: /*&brand=

# Cross-filtering
Disallow: /?color=&size=
Disallow: /?brand=&color=

# Printer-friendly pages
Disallow: /*/print/
Disallow: /*?print=

# Session and tracking parameters
Disallow: /*?session_id=
Disallow: /*?utm_
Disallow: /*?ref=
```

Crawl Budget Optimization

Direct your crawl budget toward high-value pages by removing low-value pages from crawling:

```
User-agent: *

# Low-value pages
Disallow: /tag/
Disallow: /author/
Disallow: /archive/
Disallow: /page/

# High-value pages — access open
Allow: /products/
Allow: /categories/
Allow: /blog/
```

By performing log file analysis, you can identify which pages are being unnecessarily crawled and optimize your robots.txt rules accordingly.
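As a sketch of such an analysis, the snippet below counts Googlebot requests per top-level site section in combined-format access log lines. The log lines here are fabricated for illustration; in practice you would read them from your server's access log:

```python
import re
from collections import Counter

# Fabricated combined-log lines; real ones come from your server's access log
log_lines = [
    '66.249.66.1 - - [01/Mar/2026:10:00:00 +0000] "GET /products/shoe-42 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [01/Mar/2026:10:00:01 +0000] "GET /tag/sale?sort=price HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [01/Mar/2026:10:00:02 +0000] "GET /tag/new HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]

hits = Counter()
for line in log_lines:
    m = re.search(r'"GET ([^ ]+) HTTP', line)
    if m and "Googlebot" in line:
        # Group by first path segment to spot wasted crawl budget
        section = "/" + m.group(1).lstrip("/").split("/")[0].split("?")[0]
        hits[section] += 1

print(hits.most_common())  # [('/tag', 2), ('/products', 1)]
```

If a low-value section like /tag/ dominates the counts, that is a strong signal to add a Disallow rule for it.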

Robots.txt HTTP Status Codes and Crawler Behavior

When robots.txt returns different HTTP status codes, crawlers behave differently:

| Status Code | Crawler Behavior |
| --- | --- |
| 200 OK | Rules are read and applied |
| 301/302 Redirect | Redirect target is read (up to 5 redirects) |
| 404 Not Found | No restrictions, entire site is crawlable |
| 410 Gone | No restrictions, entire site is crawlable |
| 5xx Server Error | Google temporarily stops crawling (full restriction) |

The 5xx error is particularly dangerous. When robots.txt cannot be accessed due to a server error, Google's "stay on the safe side" principle causes it to stop crawling the entire site. If this continues for hours, it can lead to index loss.
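The table above condenses into a small decision function. This is a sketch of the documented behavior, not Google's actual logic:

```python
def robots_txt_crawl_policy(status: int) -> str:
    """Map the HTTP status of /robots.txt to the typical Googlebot reaction (sketch)."""
    if 200 <= status < 300:
        return "apply rules"
    if status in (404, 410):
        return "crawl everything"
    if 500 <= status < 600:
        return "stop crawling temporarily"
    if status in (301, 302):
        return "follow redirect"
    return "treat as unavailable"

print(robots_txt_crawl_policy(200))  # apply rules
print(robots_txt_crawl_policy(404))  # crawl everything
print(robots_txt_crawl_policy(503))  # stop crawling temporarily
```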

Creating and Managing Your Robots.txt File

Basic Template

A suitable starting template for most websites:

```
# Robots.txt — example.com
# Last updated: 2026-03-01

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /tmp/
Disallow: /*?session_id=
Disallow: /*?utm_

# AI Crawler Management
User-agent: GPTBot
Disallow: /premium/
Allow: /blog/

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /premium/
Allow: /blog/

# Sitemap
Sitemap: https://example.com/sitemap.xml
```

Robots.txt for WordPress

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/feed/
Disallow: /?s=
Disallow: /search/

Sitemap: https://example.com/sitemap_index.xml
```

Robots.txt for Next.js / React Applications

```
User-agent: *
Disallow: /api/
Disallow: /_next/static/
Allow: /_next/image/
Disallow: /admin/
Disallow: /dashboard/

Sitemap: https://example.com/sitemap.xml
```

Monitoring Robots.txt Changes

Monitoring changes to your robots.txt file is critical. An accidental change can cause major problems:

  • Use version control. Track your robots.txt file with Git.
  • Set up change notifications. Notify your team when the file changes.
  • Conduct regular audits. Review your robots.txt file monthly and include it in your SEO audit process.
  • Monitor Google Search Console alerts. Regularly check the "URLs blocked by robots.txt" report.
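A lightweight complement to these practices is fingerprinting the file's content and comparing the fingerprint on a schedule. This sketch hashes two hypothetical versions of the file; the alert itself is just a print placeholder:

```python
import hashlib

def robots_fingerprint(content: str) -> str:
    """Short SHA-256 fingerprint of robots.txt content, for change detection."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]

previous = robots_fingerprint("User-agent: *\nDisallow: /admin/\n")
current = robots_fingerprint("User-agent: *\nDisallow: /\n")

if previous != current:
    print("robots.txt changed: review the diff before the next crawl")
```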

2026 Robots.txt Checklist

A comprehensive checklist to evaluate your site's robots.txt file:

Basic Structure:

  • [ ] File is accessible at https://yoursite.com/robots.txt with 200 status code
  • [ ] UTF-8 encoding is used
  • [ ] File size is under 500 KB (Google's limit)
  • [ ] Syntax is error-free

Access Rules:

  • [ ] CSS, JavaScript, and image files are not blocked
  • [ ] Important pages are not accidentally blocked
  • [ ] Admin/API directories are blocked
  • [ ] Search result pages are blocked
  • [ ] Parameter-based duplicate pages are blocked

Sitemap:

  • [ ] Sitemap directive includes full URL
  • [ ] Sitemap file is accessible and valid

AI Crawlers:

  • [ ] AI crawler strategy determined (allow / partial access / block)
  • [ ] All known AI crawler user-agents addressed
  • [ ] AI access decisions aligned with business strategy

Testing and Monitoring:

  • [ ] Validated with Google Search Console
  • [ ] Critical URLs tested
  • [ ] Under version control
  • [ ] Regular audit scheduled

Review this checklist monthly. When your site's structure changes — new directories added, subdomains launched, or new content types published — remember to update your robots.txt file.

Conclusion

Robots.txt is one of the most fundamental building blocks of technical SEO. A properly configured robots.txt file ensures search engines crawl your site efficiently, protects your crawl budget, and manages your relationship with AI crawlers. In 2026, AI crawler management has become the most dynamic and strategic use case for robots.txt.

Remember: robots.txt is not a security tool but a communication tool. It conveys your message to crawlers about "what to crawl and what not to crawl." Getting this message right forms the foundation of your organic visibility.
