Configuring robots.txt Correctly: The Practical Guide
The robots.txt file follows the Robots Exclusion Protocol (standardized as RFC 9309) and tells crawlers which parts of your website they may crawl. The file is purely advisory: reputable bots follow it, malicious bots ignore it.
In 2026, the robots.txt has a new dimension: alongside Googlebot and Bingbot, it now also has to address AI-related user agents such as GPTBot (OpenAI), Google-Extended (Gemini), and CCBot (Common Crawl). The question is no longer just "What should Google crawl?" but also "Which AI systems may use your content?"
Structure of a robots.txt
The robots.txt consists of user-agent blocks with allow and disallow rules:
User-agent: Names the bot (e.g., Googlebot, Bingbot, GPTBot). The wildcard * applies to all bots.
Disallow: Paths the bot should not crawl. An empty value allows everything.
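Allow: Makes an exception to a Disallow rule for specific paths. Useful for exceptions within blocked directories. Google applies the most specific (longest) matching rule, so the order of Allow and Disallow lines does not matter there; some simpler parsers apply the first matching rule instead.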
Sitemap: References the XML sitemap with an absolute URL. The line stands outside the user-agent blocks and applies to all bots. Google recommends specifying the sitemap URL here.
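The directives above can be tried out with a small sketch. It uses Python's standard-library urllib.robotparser; the domain www.example.com, the paths, and the helper name parse_robots are assumptions for illustration. This parser applies the first matching rule, so the Allow line is placed before the Disallow line it refines.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; domain and paths are placeholders.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""


def parse_robots(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)
base = "https://www.example.com"
print(parser.can_fetch("Googlebot", base + "/private/report.html"))   # False
print(parser.can_fetch("Googlebot", base + "/private/press/a.html"))  # True
print(parser.can_fetch("Googlebot", base + "/blog/"))                 # True
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```

The result only reflects how this one parser reads the file; individual crawlers may interpret edge cases differently.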
Controlling AI Bots in robots.txt
AI bots now crawl the web systematically. The most important ones:
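GPTBot (OpenAI): Crawls content that may be used to train OpenAI's models. Allow = your content is available to OpenAI for that purpose. ChatGPT's real-time search uses a separate user agent, OAI-SearchBot, which needs its own rule.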
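Google-Extended: A control token rather than a separate crawler. It determines whether content crawled by Google may be used for Gemini training and grounding. It is independent of Googlebot: you can block Google-Extended without affecting your Google ranking.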
CCBot (Common Crawl): Crawls for the Common Crawl dataset used by many AI models.
arocom recommends: allow GPTBot and Google-Extended (for GEO visibility), block CCBot (no direct benefit, high crawl overhead). Additionally, provide an AI.txt with "Preference: allow-with-attribution."
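A minimal sketch of what this recommendation could look like as robots.txt rules, checked with Python's standard-library urllib.robotparser. The helper name ai_bot_access and the example URL are assumptions for illustration; the AI.txt file is not covered here.

```python
from urllib.robotparser import RobotFileParser

# Sketch of the recommendation above; each AI user agent gets its own group.
# An empty Disallow value allows everything for that agent.
AI_RULES = """\
User-agent: GPTBot
Disallow:

User-agent: Google-Extended
Disallow:

User-agent: CCBot
Disallow: /
"""


def ai_bot_access(robots_txt, bots, url):
    """Return {bot: True/False} for whether each bot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


access = ai_bot_access(
    AI_RULES,
    ["GPTBot", "Google-Extended", "CCBot"],
    "https://www.example.com/blog/article",
)
print(access)  # {'GPTBot': True, 'Google-Extended': True, 'CCBot': False}
```

A bot that matches its own group ignores the rules in the * group, so any general restrictions you want to keep for these bots have to be repeated in their groups.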
Configuring robots.txt in Drupal
Drupal ships with a default robots.txt. For production websites, it must be customized:
- Block admin paths (/admin/, /user/login)
- Block internal search pages (/search/)
- Selectively control pagination pages
- Completely block staging environments
- Add sitemap URL
In Drupal, you can maintain the robots.txt as a static file or generate it dynamically via the RobotsTxt module. arocom uses the static variant — it is faster and prevents errors from module updates.
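The checklist above could translate into rules like the following. This is a sketch, not Drupal's shipped default: the sitemap URL and the helper name blocked_paths are assumptions, and pagination rules are left out because they usually need wildcard patterns, which Python's standard-library parser does not evaluate.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical production rules for a Drupal site; sitemap URL is a placeholder.
PRODUCTION_RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /user/login
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml
"""

# Staging: block every crawler completely.
STAGING_RULES = """\
User-agent: *
Disallow: /
"""


def blocked_paths(robots_txt, paths, agent="Googlebot"):
    """Return the paths from `paths` that the given agent may not crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


paths = ["/", "/node/1", "/admin/config", "/user/login", "/search/node"]
print(blocked_paths(PRODUCTION_RULES, paths))
# ['/admin/config', '/user/login', '/search/node']
print(blocked_paths(STAGING_RULES, paths) == paths)  # True
```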
Have Your robots.txt Reviewed
Check your robots.txt: yourdomain.com/robots.txt. Does it accidentally block important pages? Are rules for AI bots missing? The Future Check by arocom reviews this systematically — as part of the technical SEO analysis.
Does the robots.txt protect my content from access?
No. The robots.txt is not a security mechanism. Reputable bots follow it, but anyone can access the content through a browser. For real access protection, you need authentication.
Can a wrong robots.txt destroy my ranking?
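Yes. A Disallow: / under User-agent: * blocks all crawling. Google can then no longer read your pages; over time they lose their rankings and either drop out of the index or remain only as bare URLs without a description. This frequently happens during relaunches when the staging robots.txt is accidentally applied to the production environment.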
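One way to catch this before a relaunch is an automated check in the deployment process. The following sketch uses Python's standard-library urllib.robotparser; the function names blocks_site_root and check_before_deploy are hypothetical, and the check only tests whether the site root is crawlable for Googlebot.

```python
from urllib.robotparser import RobotFileParser


def blocks_site_root(robots_txt, agent="Googlebot"):
    """Return True if the agent is not allowed to crawl the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


def check_before_deploy(robots_txt):
    """Raise ValueError if the file would lock Googlebot out (hypothetical guard)."""
    if blocks_site_root(robots_txt):
        raise ValueError("robots.txt blocks the site root; staging file deployed?")


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"
print(blocks_site_root(staging))     # True
print(blocks_site_root(production))  # False
```

Calling check_before_deploy with the file that is about to go live would stop the deployment with an error instead of silently shipping the staging rules.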
Should I block or allow AI bots?
That depends on your GEO strategy. Those who want to be cited in AI answers must allow GPTBot and Google-Extended. Those who do not want this can block them. arocom recommends: allow with attribution preference.
Where do I find my robots.txt?
The robots.txt is always in the root directory of the host, that is, at yourdomain.com/robots.txt. Each subdomain needs its own file. In Google Search Console under Settings > robots.txt, you can check how Google interprets it.
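Because the file lives at the root of the host, its address can be derived from any page URL. A small sketch with Python's standard library; the function name robots_txt_url and the shop.example.com address are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://shop.example.com/products/item?id=3"))
# https://shop.example.com/robots.txt
```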