The case for EntityMap
Why publishers need a structured knowledge layer — and what happens when they don't have one.
In 2005, search engines were crawling billions of pages but struggling to understand which ones mattered and how often they changed. A simple convention — sitemap.xml — solved the discovery problem. Within two years every major CMS generated one automatically.
We are at a similar inflection point. AI agents are retrieving and synthesising web content at scale. But they are doing it without any structured awareness of what that content means, who produced it, or how its concepts relate to each other. The result is a retrieval layer that is powerful, but weak on publisher attribution, disambiguation, and semantic structure.
EntityMap is designed to address this structural gap. But to understand why it matters, it helps to understand what is actually going wrong.
When a RAG pipeline, an AI search engine, or an agentic tool needs information, it fetches pages from the web and extracts text. It does not read a structured index of what you know. It reads HTML, strips formatting, and chunks the result into passages — often arbitrarily, with no awareness of which entity a passage is about or who published it.
This page-level retrieval creates three structural problems that no amount of good writing can solve on its own.
The first problem is disambiguation. Your site uses "AI SOV" in navigation, "AI Share of Voice" in body copy, and "artificial intelligence share of voice" in a technical glossary. To a page-level retriever, these are three separate signals. There is no mechanism that tells it these surface forms all refer to the same entity — the one you have spent years building authority around.
The result: your expertise is diluted across scattered fragments of text instead of concentrated on a single, recognisable concept node.
The second problem is attribution. An AI system answers a question using content from your site. The URL appears as a footnote. Your company name never appears in the answer. A reader who acts on that information has no idea it came from you.
In many current AI retrieval systems, publisher identity is not embedded in page-level content in a way that survives aggregation. You get the traffic risk without the brand benefit.
The third problem is relational reasoning. You know that AI Topical Presence is a component of AI Share of Voice, which in turn depends on Entity Salience. These relationships are implicit in your content — buried in prose, scattered across pages, inferrable by a careful human reader.
An AI system must reconstruct them probabilistically from unstructured text. Sometimes it gets them right. Often it gets them partially wrong, or misses them entirely. The reasoning scaffold your content implies never becomes explicit.
EntityMap is a practical way to address these three problems. A publisher adds a structured index alongside its existing content. The index does not replace your pages — it sits next to them, at a predictable URL, as a machine-readable declaration of what your site knows.
The file at yourdomain.com/entitymap.json tells any consumer: here are the concepts this site covers, here is how we define them, here is the best evidence for each one, and here is how they relate to each other. Publisher attribution is not inferred from URL structure — it is declared on every chunk, in a field designed to survive aggregation and redistribution.
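As an illustration, a minimal entitymap.json might look like the sketch below. The field names and entity ids here are assumptions chosen for this example, not the canonical schema — consult the EntityMap specification for the authoritative structure.

```json
{
  "publisher": {
    "name": "Example Corp",
    "url": "https://yourdomain.com"
  },
  "entities": [
    {
      "id": "e_001",
      "name": "AI Share of Voice",
      "aliases": ["AI SOV", "artificial intelligence share of voice"],
      "definition": "The share of AI-generated answers that cite or draw on a publisher's content.",
      "evidence": "https://yourdomain.com/glossary/ai-share-of-voice"
    }
  ],
  "relations": [
    { "subject": "e_002", "predicate": "COMPONENT_OF", "object": "e_001" }
  ]
}
```

Note how the aliases field collapses the three surface forms from the disambiguation example onto one entity id, and how relations are declared as explicit subject–predicate–object triples rather than left implicit in prose.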
Intellectual honesty matters here. EntityMap is not a fix for all AI visibility problems, and it is worth being clear about what it does and does not address.
| Scenario | EntityMap value |
|---|---|
| RAG pipelines retrieving your site | Strong — direct structural improvement to retrieval quality and attribution |
| Agentic AI crawlers (12–24 months) | Strong — the infrastructure is being built now; EntityMap is ready for it |
| AI search engines today | Emerging — growing as these systems mature their crawl and retrieval layers |
| Pre-training data influence | Not applicable — EntityMap cannot change what a model learned during training |
| Ghost citations from training associations | Partial — fixes retrieval-time attribution, not training-time association |
The strongest case for EntityMap is forward-looking: it builds the right infrastructure for where AI consumption is heading, not where it is today. Agentic systems that browse the web, RAG pipelines that retrieve live content, AI assistants that synthesise from multiple sources — these are the growing category. EntityMap is designed to be ready for them.
You do not need to wait for AI crawler teams to formally support EntityMap. A single hyperlink in your site footer works with every crawler that follows HTML links — which is all of them, right now.
```html
<footer>
  <a href="https://yourdomain.com/entitymap.html">EntityMap</a>
</footer>
```
Link to entitymap.html, not the JSON file. The HTML version is designed for this: it renders entity definitions, relations, and publisher attribution as readable text with embedded JSON-LD, so any system that fetches and parses HTML will extract structured, attributed content. GPTBot, PerplexityBot, ClaudeBot, GoogleOther — all follow standard HTML links from the home page outward.
A sitewide footer is better than a home-page-only link — it means every page on your site carries a route to your EntityMap, which significantly increases the chance a crawler will find it regardless of which page it enters from.
For the JSON file, add a machine-readable signal in your `<head>` instead:

```html
<link rel="entitymap" type="application/json"
      href="https://yourdomain.com/entitymap.json" />
```
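A consumer can discover that signal with nothing more than an HTML parser. This is a minimal sketch using Python's standard library; the `rel="entitymap"` value is the convention described above, and the sample document is illustrative.

```python
from html.parser import HTMLParser

class EntityMapLinkFinder(HTMLParser):
    """Collect href values from <link rel="entitymap"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing tags like <link ... /> here.
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "entitymap" and "href" in attrs:
            self.found.append(attrs["href"])

html_doc = """<html><head>
<link rel="entitymap" type="application/json"
      href="https://yourdomain.com/entitymap.json" />
</head><body></body></html>"""

finder = EntityMapLinkFinder()
finder.feed(html_doc)
print(finder.found)
```

Running this against the sample document prints the declared EntityMap URL, which a crawler would then fetch and parse as JSON.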
What does this actually improve? For live RAG retrieval systems — Perplexity, ChatGPT Search, AI Overviews — that fetch content at inference time, a crawlable EntityMap means your publisher attribution, canonical entity names, and typed relations are available in the content the system retrieves. The attribution improvement is real and relatively immediate. For training data, the effect is indirect: if your EntityMap is crawled during a training run, the structured, publisher-attributed content gets baked into the model. You have no control over timing, but the signal is there. What it cannot fix is wrong associations already baked into existing model weights — that requires retraining.
In 2008, Martin Hepp published GoodRelations — a vocabulary for describing products, prices, and business entities on the web. No standards body commissioned it. Hepp published the spec, demonstrated it on a real site, and brought it to the community.
The path GoodRelations took is a useful precedent for how open vocabularies can spread. EntityMap draws on a similar adoption model: publish the vocabulary openly, demonstrate implementation value on a real site, and improve through community use. Whether it follows the same trajectory depends on adoption — and adoption depends on whether the standard proves useful in practice.
The attribution and disambiguation problems are relatively straightforward to explain. The reasoning problem is harder to see — until you put a complex query in front of both approaches at once.
Consider this query: "Acme Corporation claims their platform is for firms with disconnected silos that cause reconciliation backlogs. How does Acme EDM specifically solve this, and how does data quality monitoring then lead to better regulatory compliance?"
This question requires connecting three distinct concepts across at least two pages of the site. It is not a lookup — it is a reasoning chain. Here is what happens under each approach.
Standard RAG fetches three separate pages, chunks the text, and tries to stitch the answer together. Because no single chunk contains the full logical chain, the model has to infer the connections — and it signals that inference with hedging language: "It is likely that Acme EDM addresses this… their data quality intelligence probably assists in the reporting required…" The answer is not wrong, but it is uncertain. The reasoning is the model's, not the publisher's.
EntityMap-enabled RAG does something different. It reads the relation graph and traverses the declared predicates: Fragmented Data Estate CONFLICTS_WITH Enterprise Data Management, which IMPROVES Regulatory Compliance. The logical chain is not inferred — it is read directly from the publisher's own declarations. The output drops the hedging entirely.
The key insight is the path between Fragmented Data Estate and Regulatory Compliance. In standard RAG the model has to find a sentence that mentions both. In EntityMap the path is explicit: e_010 → CONFLICTS_WITH → e_002 → IMPROVES → e_009. The model reads the publisher's own logic rather than reconstructing it from prose.
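Traversing that declared path is a plain graph search, not inference. The sketch below shows the idea with the entity ids and predicates from the example; the triple format is illustrative, not a statement of the EntityMap wire format.

```python
from collections import deque

# Declared relation triples, as a publisher might ship them.
# e_010 = Fragmented Data Estate, e_002 = Enterprise Data Management,
# e_009 = Regulatory Compliance (ids from the example above).
relations = [
    ("e_010", "CONFLICTS_WITH", "e_002"),
    ("e_002", "IMPROVES", "e_009"),
]

def find_path(start, goal, triples):
    """Breadth-first search over declared triples.
    Returns the chain of (subject, predicate, object) hops, or None."""
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, target in adjacency.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, predicate, target)]))
    return None

path = find_path("e_010", "e_009", relations)
print(path)
```

The returned path is exactly the publisher's declared chain: e_010 conflicts with e_002, which improves e_009. The model quotes that logic instead of hedging its way toward it.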
This is why the relation layer in EntityMap is not a nice-to-have. It is the mechanism by which publishers can place their reasoning — not just their content — into the AI retrieval layer.
EntityMap is an open standard under CC BY 4.0. Read the specification, see a working example, contribute on GitHub, or try the Waikay generator.