Robots.txt playbook
Master technical SEO with the Robots.txt Generator
The Robots.txt Generator gives you full control over how search engines and AI crawlers interact with your content. Use it to launch clean builds, protect staging environments, and keep critical assets discoverable.
What does robots.txt control?
A robots.txt file sits at the root of your domain and tells crawlers which parts of your site they can request. Search engines read it before fetching any URL, making it the first line of defense for staging folders, duplicate archives, and experimental content.
Each rule is grouped by User-agent. Within a group you can declare Disallow paths to block crawling, Allow directives to override blocks, and optional Crawl-delay values to slow bots down. Finally, Sitemap directives point crawlers to your XML sitemap files for rapid discovery.
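As a point of reference, a minimal file combining those directives might look like the sketch below; the paths and sitemap URL are placeholders, not generator output.

    # One rule group per crawler; * applies to any user agent without its own group
    User-agent: *
    Disallow: /drafts/          # block crawling under this folder
    Allow: /drafts/published/   # a more specific Allow overrides the block
    Crawl-delay: 10             # optional; some crawlers honor it, Google ignores it

    # Sitemap directives sit outside rule groups and use absolute URLs
    Sitemap: https://example.com/sitemap.xml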
How to craft allow/disallow directives that work
Allow list tips
- Match folders with trailing slashes (e.g. /blog/) for clarity.
- Expose required assets like /_next/static/ or /wp-admin/admin-ajax.php (see the sketch after this list).
- List broader rules first if it helps readability; Google applies the most specific (longest) matching rule regardless of order.
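A minimal sketch of those tips in practice; mixing Next.js and WordPress paths in one file is only to mirror the bullets above, so keep whichever rules match your stack:

    User-agent: *
    Disallow: /_next/
    Allow: /_next/static/             # build assets stay fetchable
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php   # AJAX endpoint stays open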
Disallow best practices
- Block query-heavy URLs like /?session= to save crawl budget.
- Protect internal dashboards and previews (/staging/, /drafts/), as in the example below.
- Never disallow assets that power core pages (images, CSS, JavaScript).
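A sketch of those three points together; the session parameter and folder names come from the bullets above, and the * wildcard syntax is supported by Google and Bing:

    User-agent: *
    # Query-heavy URLs that burn crawl budget (* matches any prefix)
    Disallow: /*?session=
    # Internal previews and dashboards
    Disallow: /staging/
    Disallow: /drafts/
    # No rules touch image, CSS, or JavaScript paths, so rendering assets stay crawlable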
When in doubt, keep the file simple. Combine robots rules with on-page canonical tags, structured data, and optimized metadata. Our Meta Tag Generator pairs perfectly with this workflow.
Why CMS presets matter
Content management systems ship with unique folder structures. The generator includes curated presets for Next.js and WordPress so you can deploy safely:
- Next.js preset: keeps critical static assets crawlable while hiding API routes and build output.
- WordPress preset: blocks admin dashboards but keeps admin-ajax.php open for comments, forms, and ecommerce plugins.
Want staging protection? Apply the preset, add our “Disallow staging paths” template, then drop the file into your deployment pipeline.
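The generator produces the exact preset output, but two representative files, one per CMS, might look roughly like the sketches below; the API, build, and staging paths are assumptions to adjust for your project.

    # Sketch 1 -- a Next.js site with the staging template applied
    User-agent: *
    Disallow: /api/          # hide API routes
    Disallow: /staging/      # "Disallow staging paths" template
    Allow: /_next/static/    # keep build assets crawlable

    # Sketch 2 -- a WordPress site
    User-agent: *
    Disallow: /wp-admin/                # block the dashboard
    Allow: /wp-admin/admin-ajax.php     # comments, forms, ecommerce plugins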
Managing AI crawlers alongside search engines
Generative AI services such as GPTBot, CCBot, PerplexityBot, and Claude-Web honor robots.txt directives. Use the “Block AI crawlers” preset to deny their access while keeping Google and Bing online.
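The preset's exact output may differ, but blocking those crawlers while leaving search engines untouched generally looks like this (the user-agent tokens are the ones named above):

    # AI crawlers: one group, no access
    User-agent: GPTBot
    User-agent: CCBot
    User-agent: PerplexityBot
    User-agent: Claude-Web
    Disallow: /

    # Everyone else, including Googlebot and Bingbot, keeps normal access
    User-agent: *
    Disallow: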
Remember: robots.txt does not protect private data. It is a public signal that well-behaved crawlers follow. Still expose only what you want copied, and pair blocks with authentication for sensitive paths.
Validator checklist before you publish
- Ensure every rule group has at least one User-agent line.
- Confirm every Disallow and Allow path starts with a slash; full URLs belong only in Sitemap lines.
- Double-check that sitemap URLs are absolute (https:// plus the domain), as in the example after this checklist.
- Generate a fresh XML sitemap so crawlers discover the sections you're allowing.
- Test critical pages in SERP Preview to confirm they render correctly.
- Audit final URLs with the SEO URL Checker before launch.
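For reference, a file that clears each check might look like this minimal sketch; the domain and paths are placeholders:

    # Every group names at least one user agent; every path starts with a slash
    User-agent: *
    Disallow: /staging/

    User-agent: GPTBot
    Disallow: /

    # Sitemap URLs are absolute: protocol plus domain
    Sitemap: https://example.com/sitemap.xml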
Frequently asked questions
Does robots.txt block indexing?
Not by itself. Blocking a path in robots.txt stops compliant crawlers from fetching the page, but the URL can still appear in search if other pages link to it. Use a noindex meta tag or a removal request to guarantee full de-indexing.
Where should I host my robots.txt file?
Place the file at the root of the domain: https://example.com/robots.txt. Subdomains need their own copy if you serve different content.
How often do search engines fetch robots.txt?
Major crawlers check frequently—Google can refresh the file multiple times per day. Update the file and monitor server logs or Search Console to confirm new rules are respected.