The case for EntityMap

AI is reading your content. It doesn't know it's yours.

Why publishers need a structured knowledge layer — and what happens when they don't have one.

In 2005, search engines were crawling billions of pages but struggling to understand which ones mattered and how often they changed. A simple convention — sitemap.xml — solved the discovery problem. Within two years every major CMS generated one automatically.

We are at a similar inflection point. AI agents are retrieving and synthesising web content at scale. But they are doing it without any structured awareness of what that content means, who produced it, or how its concepts relate to each other. The result is a retrieval layer that is powerful, but weak on publisher attribution, disambiguation, and semantic structure.

EntityMap is designed to address this structural gap. But to understand why it matters, it helps to understand what is actually going wrong.


What AI retrieval does today

When a RAG pipeline, an AI search engine, or an agentic tool needs information, it fetches pages from the web and extracts text. It does not read a structured index of what you know. It reads HTML, strips formatting, and chunks the result into passages — often arbitrarily, with no awareness of which entity a passage is about or who published it.

This page-level retrieval creates three structural problems that no amount of good writing can solve on its own.

Problem 1 — Disambiguation loss

Your site uses "AI SOV" in navigation, "AI Share of Voice" in body copy, and "artificial intelligence share of voice" in a technical glossary. To a page-level retriever, these are three separate signals. There is no mechanism that tells it these surface forms all refer to the same entity — the one you have spent years building authority around.

The result: your expertise is diluted across scattered text fragments instead of concentrated on a single, recognisable concept node.

Problem 2 — Attribution loss (the ghost citation problem)

An AI system answers a question using content from your site. The URL appears as a footnote. Your company name never appears in the answer. A reader who acts on that information has no idea it came from you.

In many current AI retrieval systems, publisher identity is not embedded in page-level content in a way that survives aggregation. You get the traffic risk without the brand benefit.
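
To see why, compare attribution that lives only in the source URL with attribution declared on the chunk itself. The field names below are illustrative assumptions, as is the aggregator's behaviour, but the pattern is common: pipelines pool chunk text and discard retrieval metadata such as URLs.

```python
# Two retrieval chunks: one carries publisher identity only via its
# URL, the other declares it on the chunk (illustrative field names).
chunks = [
    {"text": "Definition of AI Share of Voice ...",
     "source_url": "https://yourdomain.com/glossary"},
    {"text": "Definition of AI Share of Voice ...",
     "source_url": "https://yourdomain.com/glossary",
     "publisher": "Your Company"},  # declared on the chunk itself
]

def aggregate(chunks):
    """Simulate an aggregator that pools text and drops source URLs."""
    return [{"text": c["text"], "publisher": c.get("publisher")}
            for c in chunks]

pooled = aggregate(chunks)
print([c["publisher"] for c in pooled])  # [None, 'Your Company']
```

The first chunk becomes a ghost citation the moment the URL is dropped; the second keeps its identity through aggregation because the identity travels with the content.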

Problem 3 — Reasoning loss

You know that AI Topical Presence is a component of AI Share of Voice, which in turn depends on Entity Salience. These relationships are implicit in your content — buried in prose, scattered across pages, inferrable by a careful human reader.

An AI system must reconstruct them probabilistically from unstructured text. Sometimes it gets them right. Often it gets them partially wrong, or misses them entirely. The reasoning scaffold your content implies never becomes explicit.


What changes with EntityMap

EntityMap is a practical way to address these three problems. Publishers expose a structured index alongside their existing content. The index does not replace your pages — it sits next to them, at a predictable URL, as a machine-readable declaration of what your site knows.

[Diagram: page-level retrieval without EntityMap vs. structured retrieval with EntityMap]

The file at yourdomain.com/entitymap.json tells any consumer: here are the concepts this site covers, here is how we define them, here is the best evidence for each one, and here is how they relate to each other. Publisher attribution is not inferred from URL structure — it is declared on every chunk, in a field designed to survive aggregation and redistribution.
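
As an illustration only — the published specification defines the authoritative field names, and everything below (IDs, predicates, structure) is assumed — a minimal entitymap.json carrying those four declarations might look like this, sketched here as a Python dict:

```python
import json

# Illustrative shape of an entitymap.json file. Field names, entity
# IDs, and the COMPONENT_OF predicate are invented for this sketch.
entitymap = {
    "publisher": {"name": "Your Company",
                  "url": "https://yourdomain.com"},
    "entities": [
        {"id": "e_001",
         "name": "AI Share of Voice",
         "definition": "Illustrative one-sentence definition.",
         "evidence": ["https://yourdomain.com/guides/ai-sov"]},
        {"id": "e_002",
         "name": "AI Topical Presence",
         "definition": "Illustrative one-sentence definition.",
         "evidence": ["https://yourdomain.com/guides/topical-presence"]},
    ],
    "relations": [
        {"subject": "e_002", "predicate": "COMPONENT_OF",
         "object": "e_001"},
    ],
}

# What a consumer would fetch from /entitymap.json:
print(json.dumps(entitymap, indent=2))
```

The four declarations map directly onto the four clauses above: concepts (entities), definitions, evidence URLs, and typed relations — with publisher identity at the top level rather than inferred from the domain.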


Where the value is real — and where it isn't

Intellectual honesty matters here. EntityMap is not a fix for all AI visibility problems, and it is worth being clear about what it does and does not address.

Scenario: EntityMap value

- RAG pipelines retrieving your site: Strong — direct structural improvement to retrieval quality and attribution
- Agentic AI crawlers (12–24 months): Strong — the infrastructure is being built now; EntityMap is ready for it
- AI search engines today: Emerging — growing as these systems mature their crawl and retrieval layers
- Pre-training data influence: Not applicable — EntityMap cannot change what a model learned during training
- Ghost citations from training associations: Partial — fixes retrieval-time attribution, not training-time association

The ghost citation problem has two components: a retrieval problem (addressable now with EntityMap) and a training data problem (not fixable post-training). EntityMap addresses the retrieval half. The training half requires either retraining or fine-tuning — outside any publisher's direct control.

The strongest case for EntityMap is forward-looking: it builds the right infrastructure for where AI consumption is heading, not where it is today. Agentic systems that browse the web, RAG pipelines that retrieve live content, AI assistants that synthesise from multiple sources — these are the growing category. EntityMap is designed to be ready for them.


The simplest thing you can do today

You do not need to wait for AI crawler teams to formally support EntityMap. A single hyperlink in your site footer works with every crawler that follows HTML links — which is all of them, right now.

<footer>
  <a href="https://yourdomain.com/entitymap.html">EntityMap</a>
</footer>

Link to entitymap.html, not the JSON file. The HTML version is designed for this: it renders entity definitions, relations, and publisher attribution as readable text with embedded JSON-LD, so any system that fetches and parses HTML will extract structured, attributed content. GPTBot, PerplexityBot, ClaudeBot, GoogleOther — all follow standard HTML links from the home page outward.

A sitewide footer is better than a home-page-only link — it means every page on your site carries a route to your EntityMap, which significantly increases the chance a crawler will find it regardless of which page it enters from.

For the JSON file, add a machine-readable signal in your <head> instead:

<link rel="entitymap" type="application/json"
      href="https://yourdomain.com/entitymap.json" />
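
On the consumer side, a crawler can pick up that hint with nothing beyond the standard library. The parser class below is hypothetical — nothing about it is mandated by EntityMap — but it shows that discovering the JSON file from the <link> tag is a few lines of work:

```python
from html.parser import HTMLParser

class EntityMapLinkFinder(HTMLParser):
    """Collect href values of <link rel="entitymap"> tags in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "entitymap" and "href" in a:
            self.hrefs.append(a["href"])

page = """<head>
  <link rel="entitymap" type="application/json"
        href="https://yourdomain.com/entitymap.json" />
</head>"""

finder = EntityMapLinkFinder()
finder.feed(page)
print(finder.hrefs)  # ['https://yourdomain.com/entitymap.json']
```

Any system that already parses <head> metadata — as every serious crawler does for canonical URLs and feeds — can add this discovery step without new infrastructure.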

What does this actually improve? For live RAG retrieval systems — Perplexity, ChatGPT Search, AI Overviews — that fetch content at inference time, a crawlable EntityMap means your publisher attribution, canonical entity names, and typed relations are available in the content the system retrieves. The attribution improvement is real and relatively immediate. For training data, the effect is indirect: if your EntityMap is crawled during a training run, the structured, publisher-attributed content gets baked into the model. You have no control over timing, but the signal is there. What it cannot fix is wrong associations already baked into existing model weights — that requires retraining.


The precedent: GoodRelations

In 2008, Martin Hepp published GoodRelations — a vocabulary for describing products, prices, and business entities on the web. No standards body commissioned it. Hepp published the spec, demonstrated it on a real site, and brought it to the community.

2008: GoodRelations published. Academic paper, publicly hosted spec, reference implementation. No W3C mandate, no industry consensus — just a working vocabulary that solved a real problem.

2010–11: Early adoption. E-commerce sites began implementing. Enough real-world usage accumulated to make the case to search engines.

2012: Schema.org absorption. Google, Yahoo, and Bing incorporated GoodRelations core concepts into schema.org. It became the foundation of every product listing in structured search — used by millions of sites that never heard of Hepp.

The path GoodRelations took is a useful precedent for how open vocabularies can spread. EntityMap draws on a similar adoption model: publish the vocabulary openly, demonstrate implementation value on a real site, and improve through community use. Whether it follows the same trajectory depends on adoption — and adoption depends on whether the standard proves useful in practice.


The reasoning advantage — a concrete example

The attribution and disambiguation problems are relatively straightforward to explain. The reasoning problem is harder to see — until you put a complex query in front of both approaches at once.

Consider this query: "Acme Corporation claims their platform is for firms with disconnected silos that cause reconciliation backlogs. How does Acme EDM specifically solve this, and how does data quality monitoring then lead to better regulatory compliance?"

This question requires connecting three distinct concepts across at least two pages of the site. It is not a lookup — it is a reasoning chain. Here is what happens under each approach.

Standard RAG fetches three separate pages, chunks the text, and tries to stitch the answer together. Because no single chunk contains the full logical chain, the model has to infer the connections — and it signals that inference with hedging language: "It is likely that Acme EDM addresses this… their data quality intelligence probably assists in the reporting required…" The answer is not wrong, but it is uncertain. The reasoning is the model's, not the publisher's.

EntityMap-enabled RAG does something different. It reads the relation graph and traverses the declared predicates: Fragmented Data Estate CONFLICTS_WITH Enterprise Data Management, which IMPROVES Regulatory Compliance. The logical chain is not inferred — it is read directly from the publisher's own declarations. The output drops the hedging entirely.

[Diagram: Standard RAG (page-level retrieval) pulls separate chunks for "Fragmented data estate", "Acme EDM", and "Regulatory compliance" from /solutions/edm and /products/acme-edm, yielding a probabilistic output: "It is likely that Acme EDM addresses this…". EntityMap-enabled RAG (graph traversal) follows e_010 (Fragmented data estate) CONFLICTS_WITH e_002 (Enterprise Data Management) IMPROVES e_009 (Regulatory compliance), yielding a grounded output: "Acme's EDM IMPROVES regulatory compliance".]

The key insight is the path between Fragmented Data Estate and Regulatory Compliance. In standard RAG the model has to find a sentence that mentions both. In EntityMap the path is explicit: e_010 → CONFLICTS_WITH → e_002 → IMPROVES → e_009. The model reads the publisher's own logic rather than reconstructing it from prose.
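
The traversal itself is ordinary graph search. As a sketch — the triples mirror the example's IDs and predicates, but the representation is assumed, not spec-defined — a breadth-first walk over declared relations recovers the publisher's chain directly:

```python
from collections import deque

# Relation triples as described in the example above; the tuple
# representation is illustrative, not part of any published schema.
relations = [
    ("e_010", "CONFLICTS_WITH", "e_002"),  # Fragmented Data Estate / EDM
    ("e_002", "IMPROVES", "e_009"),        # EDM / Regulatory Compliance
]

def find_path(start, goal, triples):
    """Breadth-first search over declared predicates. Returns the
    chain of (predicate, entity) hops, or None if unconnected."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for subj, pred, obj in triples:
            if subj == node and obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(pred, obj)]))
    return None

print(find_path("e_010", "e_009", relations))
# [('CONFLICTS_WITH', 'e_002'), ('IMPROVES', 'e_009')]
```

No sentence mentioning both endpoints is required: the chain exists as data, so the model's job shrinks from reconstruction to verbalisation.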

This is why the relation layer in EntityMap is not a nice-to-have. It is the mechanism by which publishers can place their reasoning — not just their content — into the AI retrieval layer.


Who should implement EntityMap

EntityMap is an open standard under CC BY 4.0. Read the specification, see a working example, contribute on GitHub, or try the Waikay generator.