AI assistants read websites in two stages: first a page has to be findable and cleanly crawlable, then the model decides which passage to cite. Three levers pay off most: server-rendered, clearly structured content that answers real questions completely; Schema.org markup in the order Organization, Person, Article, FAQPage; and a robots.txt that steers AI bots deliberately rather than by accident. llms.txt is a young proposal with low effort and unclear effect. We use it as a bet, not as a duty. The biggest mistake remains relying on one file while the content itself stays unreadable for machines.

llms.txt, Schema & Co.: Making Your Website Readable for AI Assistants

When a ChatGPT or Perplexity user asks for a solution in your field, an AI crawler often visits your website in the background. It sees no glossy layout and no animation. It sees text, structure and machine-readable markup, and it looks for the one spot that answers the question asked.

This guide answers the four questions decision-makers ask us most often right now: What do AI crawlers actually read? Which schema types offer the most leverage? How should I handle AI bots in robots.txt? And does the new llms.txt file do anything? The answers are craft, not magic. If GEO is new to you as a discipline, you will find the framing in our post on GEO: visibility in AI search; here we look at the technical work underneath it.

What AI crawlers actually read

AI assistants access your content in two ways. Some use the existing Google index (this is how AI Overviews work), others run their own crawlers that fetch your pages directly. In both cases the same basic rule applies: what is in the delivered HTML gets read. What appears only after JavaScript has run often does not.

This is the most important and most overlooked point. Many modern websites render their main content only in the user's browser. A classic search engine crawler now usually executes JavaScript, but many AI crawlers do not, or only to a limited degree. A page whose text is created client-side can be effectively empty for an AI assistant. In our audits this is the most common reason a strong page fails to appear in AI answers.
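A minimal sketch of that check in Python, using only the standard library. It extracts the text a crawler without JavaScript would find in the delivered HTML and reports which key phrases are missing. The two sample pages and the phrase are invented for illustration, and real crawlers may treat markup differently than this simple parser does.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that is present in the raw HTML, ignoring script and style."""

    # Text inside these elements is code or inert markup, not page content.
    SKIPPED_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text found in the HTML source without running any script."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def missing_phrases(html, phrases):
    """Return the phrases that do not occur in the script-free text."""
    text = visible_text(html).lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]


# Illustrative sample pages (invented): same content, delivered two ways.
SERVER_RENDERED = (
    "<html><body><h1>Our services</h1>"
    "<p>We file your annual accounts.</p></body></html>"
)
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("We file your annual accounts.")</script></body></html>'
)

print(missing_phrases(SERVER_RENDERED, ["annual accounts"]))
print(missing_phrases(CLIENT_RENDERED, ["annual accounts"]))
```

For the server-rendered sample the list of missing phrases is empty; for the client-rendered sample the phrase is reported as missing, because it exists only inside a script. To test a live page, you could fetch its HTML with urllib.request.urlopen and pass the decoded source to missing_phrases.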

Concretely, AI crawlers look at these signals, in this order of importance:

  • Server-rendered text. The actual content has to be in the page source without a script having to load it. A quick test: open the page in your browser, right-click, "View page source". If your core content is there as text, you are on the safe side.
  • Heading hierarchy. A clean structure from H1 through H2 to H3 tells the machine how the content is built and which passage belongs to which sub-question.
  • Clear answer blocks. A paragraph that answers a question directly and completely is easier to cite than an argument woven across ten sentences.
  • Structured data. Schema.org markup makes explicit what a text means implicitly. More on that shortly.
  • Internal links and sitemap. They help the crawler find every relevant page. How a sitemap works here is explained in our knowledge article on the XML sitemap.
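
The heading signal from the list above can be checked mechanically. The following sketch reads the headings from a page's HTML and flags skipped levels. Treating a skipped level (for example H1 followed directly by H3) as a problem is our reading of "clean structure", not a rule published by any AI vendor, and the sample markup is invented.

```python
from html.parser import HTMLParser


class HeadingOutlineParser(HTMLParser):
    """Records (level, text) for every h1..h6 element in document order."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._current_level = None
        self._buffer = []
        self.outline = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current_level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._current_level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._current_level is not None:
            # Collapse whitespace so nested inline tags do not leave gaps.
            text = " ".join("".join(self._buffer).split())
            self.outline.append((self._current_level, text))
            self._current_level = None


def heading_outline(html):
    """Return the list of (level, text) pairs for all headings in the HTML."""
    parser = HeadingOutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


def outline_problems(outline):
    """Flag a first heading that is not h1 and any jump of more than one level."""
    problems = []
    previous = None
    for level, text in outline:
        if previous is None:
            if level != 1:
                problems.append(f"first heading is h{level}, not h1: '{text}'")
        elif level > previous + 1:
            problems.append(f"h{level} '{text}' follows h{previous} and skips a level")
        previous = level
    return problems


# Illustrative sample markup (invented).
GOOD = "<h1>Services</h1><h2>Audit</h2><h3>Scope</h3><h2>Pricing</h2>"
BAD = "<h1>Services</h1><h3>Scope</h3>"

print(outline_problems(heading_outline(GOOD)))
print(outline_problems(heading_outline(BAD)))
```

An empty list means the outline descends one level at a time. Each reported problem names the heading where the structure breaks, which is usually enough to find the spot in the template.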

Schema.org, in the right order

Schema.org is a standardised markup that lets you tell machines what a piece of content means: this text is an article, this person is its author, this block is a question with an answer. AI systems use these hints to classify content correctly and to recognise authorship.

You do not have to mark up everything at once. This order has proven itself in our projects by effort and effect:

Priority | Schema type | What it clarifies | Effort
---------|-------------|-------------------|-------
1 | Organization | Who you are: name, address, logo, profiles. Anchor for all other types | low, one-time
2 | Person | Who is responsible for content. Links posts to verifiable expertise | low, one-time
3 | Article / BlogPosting | Which text is by whom and when. Authorship and recency | low, per post
4 | FAQPage | Question-answer pairs, directly extractable as such | medium, per page
5 | Service / Product | What you offer, with region and scope | medium, per service

Organization and Person take a few hours to set up and affect every page of your website. That is why they sit at the top. One rule applies to all types, without exception: the markup may only describe what is visibly on the page. FAQPage markup without visible questions and answers is not a trick but a guideline violation that, in the worst case, damages the credibility of the entire domain.
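
To make the order tangible, here is a sketch that builds JSON-LD for Organization, Person and a BlogPosting and links them through @id references, plus a small FAQPage builder that refuses to mark up text that is not on the page. The property names are standard Schema.org vocabulary. The domain, names, date and the helper functions themselves are placeholders we made up for illustration, and the visibility check is a plain substring comparison, not an official validator.

```python
import json

SITE = "https://www.example.com"  # hypothetical domain

organization = {
    "@type": "Organization",
    "@id": f"{SITE}/#organization",
    "name": "Example GmbH",  # placeholder
    "url": SITE,
    "logo": f"{SITE}/logo.png",
    "sameAs": ["https://www.linkedin.com/company/example"],
}

person = {
    "@type": "Person",
    "@id": f"{SITE}/#jane-doe",
    "name": "Jane Doe",  # placeholder author
    "jobTitle": "Head of Content",
    "worksFor": {"@id": organization["@id"]},
}

article = {
    "@type": "BlogPosting",
    "@id": f"{SITE}/blog/ai-readable-websites#article",
    "headline": "Making Your Website Readable for AI Assistants",
    "datePublished": "2025-01-15",  # placeholder date
    "author": {"@id": person["@id"]},
    "publisher": {"@id": organization["@id"]},
}


def normalise(text):
    """Lowercase and collapse whitespace for a tolerant comparison."""
    return " ".join(text.split()).lower()


def faq_page(pairs, visible_page_text):
    """Build FAQPage markup, but only for pairs that are visible on the page."""
    page = normalise(visible_page_text)
    for question, answer in pairs:
        if normalise(question) not in page or normalise(answer) not in page:
            raise ValueError(f"not visible on the page: {question!r}")
    return {
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def to_json_ld_script(nodes):
    """Wrap the nodes in one @graph and return an embeddable script tag."""
    graph = {"@context": "https://schema.org", "@graph": nodes}
    payload = json.dumps(graph, indent=2, ensure_ascii=False)
    # Escape "</" so a stray closing tag in the data cannot end the script early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


print(to_json_ld_script([organization, person, article]))
```

Organization and Person would typically be emitted on every page, the BlogPosting node only on the post it describes. Linking by @id keeps author and publisher details in one place instead of repeating them per post.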

robots.txt and AI bots: decide deliberately

The robots.txt is the file you use to tell crawlers which areas they may fetch. With the rise of dedicated AI crawlers, a new question appears: do you want these bots to read your content?

This calls for a deliberate decision, not a default setting. If you block AI crawlers wholesale, you are not citable in AI answers, because an assistant can only cite a source it was allowed to read. If you leave everything open, you accept that your content may also flow into the training of future models. Both stances are legitimate, but they should be intended.

A pragmatic distinction helps: there are crawlers that fetch content for the live answering of user questions, and crawlers that collect training data. Anyone who wants to be visible in AI answers but objects to model training can treat these groups separately in robots.txt. The exact bot names change constantly, so robots.txt belongs on a short review list, roughly quarterly.
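
One way to sketch this split: generate the robots.txt from two lists and verify the result with Python's standard urllib.robotparser before deploying it. The bot names below are examples we believe the vendors have published at the time of writing, and their assignment to "answering" or "training" is our assumption. Both must be checked against each vendor's current documentation. The function names and the example URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: crawlers that fetch pages to answer user questions.
ANSWER_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]
# Assumption: crawlers or tokens associated with collecting training data.
TRAINING_BOTS = ["GPTBot", "Google-Extended", "CCBot"]


def build_robots_txt(allowed_bots, blocked_bots, sitemap_url=None):
    """Return robots.txt text that blocks one group and allows the other."""
    lines = []
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    for bot in allowed_bots:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    # Every crawler not named above keeps full access.
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


def can_fetch(robots_txt, bot, url):
    """Ask the standard-library parser whether the bot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


URL = "https://www.example.com/services/"  # hypothetical page
robots = build_robots_txt(ANSWER_BOTS, TRAINING_BOTS)

print(robots)
for bot in ANSWER_BOTS + TRAINING_BOTS:
    print(bot, "may fetch" if can_fetch(robots, bot, URL) else "is blocked")
```

Keeping the two lists in one place makes the quarterly review a matter of editing names rather than rewriting the file by hand. Note that robots.txt is a request, not an access control: it only affects crawlers that choose to honour it.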

llms.txt: a bet, not a duty

llms.txt is still a young proposal. The idea: a single file in your website's root directory in which you offer AI systems a curated overview of your most important content as plain, well-readable text. Instead of working through nested HTML, a model would find a tidy map of your site here.

The honest status: it is a proposal, not a standard. No major AI system has yet committed to evaluating the file. It costs little, it does no harm, and it forces a useful discipline, namely naming your most important content clearly once. But it replaces none of the measures from the previous sections.

Our recommendation matches what we do on our own website: we maintain an llms.txt but treat it as a bet with a small stake, not as a mandatory task. Anyone still left with JavaScript-only content, missing schema or an unclarified robots.txt invests their time there first. An llms.txt before readable content is like a table of contents for a book with blank pages.
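
For readers who want to place the bet, a short sketch of how such a file could be generated. The layout (a title line, a one-sentence summary as a quote, then sections of annotated links) follows the public llms.txt proposal as we understand it. Because the proposal is not a standard, the format may change. The site name, URLs, descriptions and the function name are invented for illustration.

```python
def build_llms_txt(site_name, summary, sections):
    """Render a curated site overview in the plain-text layout of the proposal.

    sections maps a section heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # Trim trailing blank lines and end the file with a single newline.
    return "\n".join(lines).rstrip() + "\n"


# Illustrative content (invented): only pages that really exist belong here.
llms_txt = build_llms_txt(
    "Example GmbH",
    "Consultancy for making websites readable for AI assistants.",
    {
        "Services": [
            ("Future Check", "https://www.example.com/future-check",
             "Audit that ranks the biggest levers for a site"),
        ],
        "Knowledge": [
            ("XML sitemap", "https://www.example.com/knowledge/xml-sitemap",
             "How a sitemap helps crawlers find every page"),
        ],
    },
)

print(llms_txt)
```

The output would be saved as llms.txt in the website's root directory. The useful part of the exercise is the input: deciding which handful of pages deserve a line forces exactly the prioritisation described above.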

In which order to proceed

The four building blocks give a clear ranking. Work through them top to bottom, and your effort always goes where it has the most effect:

  1. Server-render content. Use the page source to check whether your core content is visible without JavaScript. This is the foundation, without which everything else comes to nothing.
  2. Sharpen structure. Clean headings, a short summary at the top, one answer block per core question.
  3. Add schema. In the order Organization, Person, Article, FAQPage. Only for visible content.
  4. Clarify robots.txt. Deliberately decide which AI bots may read, and review the state quarterly.
  5. Add llms.txt. As an optional extra, once the first four points are in place.

This order is also a budget logic: the first two points cost little and work immediately, the fifth is cheap but uncertain. Anyone who starts at the bottom optimises a file that perhaps no one reads, while the actual content stays invisible.

Do we need an llms.txt?

No. llms.txt is a proposal, not a standard, and no major AI system has committed to using the file. It is a cheap bet, not a requirement. Invest first in server-rendered content, clear structure and schema. An llms.txt complements these foundations but does not replace them.

Do AI crawlers read our JavaScript?

Often not, or only to a limited degree. Unlike classic search engine crawlers, many AI crawlers execute little or no JavaScript. Content that is created only in the user's browser can be invisible to an AI assistant. Use the page source to check whether your core content is in the HTML without a script. In our audits this is the most common cause of missing AI visibility.

Which schema should we implement first?

Organization and Person. Both are set up in a few hours and affect every page, because they clarify who you are and who is responsible for your content. After that come Article or BlogPosting per post and FAQPage for pages with real question-answer blocks. Important: only mark up what is visibly on the page.

Should we block AI bots in robots.txt?

That is a deliberate business decision, not a technical default. If you block AI crawlers, you are not citable in AI answers. If you stay open, you accept possible use of your content for model training. A workable middle ground separates crawlers for the live answer from those for training data. Since bot names change constantly, robots.txt belongs on a quarterly review list.

The first step for this week

You need no project to get started, just half an hour. Open your most important service page, call up the page source and look for your core content. If the text is there and readable, the foundation is fine and you move on to structure and schema. If you mostly see script code instead of content, you have found the most important topic before any file in the root directory plays a role.

If you would rather do this check together and derive a ranking for your site right away, we will look at it concretely on your website in the Future Check.

Want to know what these topics mean for your company? The Future Check shows you the biggest levers within 2–4 weeks.
