Configuring robots.txt Correctly: The Practical Guide
Last updated: March 2026 · Reading time: 6 minutes
The robots.txt file follows the Robots Exclusion Protocol (standardized in RFC 9309) and tells crawlers which parts of your website they may crawl. The file is purely advisory: reputable bots honor it, malicious bots ignore it.
In 2026, robots.txt has gained a new dimension: alongside Googlebot and Bingbot, AI crawlers such as GPTBot (OpenAI), Google-Extended (Google Gemini), and CCBot (Common Crawl) now request your pages. The question is no longer just "What should Google crawl?" but "Which AI systems may use your content?"
Structure of a robots.txt
The robots.txt consists of user-agent blocks with allow and disallow rules:
User-agent: Names the bot (e.g., Googlebot, Bingbot, GPTBot). The wildcard * applies to all bots.
Disallow: Paths the bot should not crawl. An empty value allows everything.
Allow: Overrides a previous disallow for specific paths. Useful for exceptions within blocked directories.
Sitemap: References the XML sitemap. Google recommends specifying the sitemap URL here.
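Taken together, these directives form a file like the following (a minimal sketch; the domain and paths are placeholders):

```text
# Applies to all crawlers
User-agent: *
Allow: /admin/help       # exception inside the blocked directory
Disallow: /admin/        # keep admin paths out of the crawl

# A bot-specific block overrides the * block for that bot
User-agent: GPTBot
Disallow:                # empty value = everything allowed

Sitemap: https://www.example.com/sitemap.xml
```

Note that a crawler uses the most specific user-agent block that matches it, not the `*` block plus its own block combined.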
Controlling AI Bots in robots.txt
Since 2024, AI bots have been crawling the web systematically. The most important ones:
GPTBot (OpenAI): Crawls pages to gather training data for OpenAI's models; real-time search uses the separate OAI-SearchBot. Allowing GPTBot means your content can appear in ChatGPT answers.
Google-Extended: Not a separate crawler but a control token: it determines whether content fetched by Googlebot may be used for Gemini training and grounding. You can block Google-Extended without affecting your Google Search ranking.
CCBot (Common Crawl): Crawls for the Common Crawl dataset used by many AI models.
arocom recommends: allow GPTBot and Google-Extended (for GEO visibility) and block CCBot (no direct benefit, high crawl overhead). Additionally, publish an ai.txt file with "Preference: allow-with-attribution" (an emerging convention, not yet a formal standard).
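The recommendation above translates into robots.txt rules like these (a sketch; an empty Disallow grants full access, while Disallow: / blocks everything):

```text
# AI crawlers: allow answer/training bots, block Common Crawl
User-agent: GPTBot
Disallow:

User-agent: Google-Extended
Disallow:

User-agent: CCBot
Disallow: /
```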
Configuring robots.txt in Drupal
Drupal ships with a default robots.txt. For production websites, you should customize it:
- Block admin paths (/admin/, /user/login)
- Block internal search pages (/search/)
- Selectively control pagination pages
- Completely block staging environments
- Add sitemap URL
In Drupal, you can maintain robots.txt as a static file or generate it dynamically with the RobotsTxt contrib module. arocom uses the static variant: it is faster and avoids errors caused by module updates.
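Applied to the checklist above, a customized Drupal robots.txt might begin like this (a sketch; paths follow a standard Drupal install, and the sitemap URL is a placeholder):

```text
User-agent: *
# Admin and login paths
Disallow: /admin/
Disallow: /user/login
# Internal search result pages
Disallow: /search/
# Pagination: block pager URLs if they add no crawl value
Disallow: /*?page=

Sitemap: https://www.example.com/sitemap.xml
```

Wildcard patterns such as /*?page= are part of RFC 9309 and honored by Google and Bing, but not every crawler implements them. A staging environment should instead carry a blanket `Disallow: /` (or better, HTTP authentication).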
Have Your robots.txt Reviewed
Check your robots.txt at yourdomain.com/robots.txt. Does it accidentally block important pages? Are rules for AI bots missing? arocom's Future Check reviews this systematically as part of the technical SEO analysis.
Does the robots.txt protect my content from access?
No. The robots.txt is not a security mechanism. Reputable bots follow it, but anyone can access the content through a browser. For real access protection, you need authentication.
Can a wrong robots.txt destroy my ranking?
Yes. A Disallow: / under User-agent: * blocks all crawling; because Google can no longer see the pages' content, they gradually drop out of the index. This frequently happens during relaunches, when a staging robots.txt is accidentally deployed to the production environment.
Should I block or allow AI bots?
That depends on your GEO strategy. Those who want to be cited in AI answers must allow GPTBot and Google-Extended. Those who do not want this can block them. arocom recommends: allow with attribution preference.
Where do I find my robots.txt?
The robots.txt must live at the root of your domain: yourdomain.com/robots.txt. In Google Search Console, under Settings > robots.txt, you can check how Google interprets it.
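Besides Search Console, you can sanity-check rules locally with Python's standard-library robots.txt parser (a sketch; the rules and URLs are illustrative):

```python
from urllib import robotparser

# Rules as they might appear in a production robots.txt
rules = """\
User-agent: *
Allow: /admin/help
Disallow: /admin/
Disallow: /search/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Check how the rules apply to individual URLs
print(rp.can_fetch("*", "https://example.com/admin/settings"))  # blocked -> False
print(rp.can_fetch("*", "https://example.com/admin/help"))      # Allow exception -> True
print(rp.can_fetch("*", "https://example.com/blog/post"))       # no matching rule -> True
```

One caveat: urllib's parser evaluates rules in file order (which is why the Allow exception is listed first here), whereas Google picks the most specific matching rule regardless of order.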
Read more
- XML Sitemap — Why it matters more for GEO than for SEO
- Crawl Budget — How Google prioritizes your website
- GEO Optimization — Visibility in search engines and AI systems
- Future Check (Audit) — Independent analysis of your installation