The case for EntityMap

AI is reading your content. It doesn't know it's yours.

Why publishers need a structured knowledge layer — and what happens when they don't have one.

In 2005, search engines were crawling billions of pages but struggling to understand which ones mattered and how often they changed. A simple convention — sitemap.xml — solved the discovery problem. Within two years every major CMS generated one automatically.

We are at a similar inflection point. AI agents are retrieving and synthesising web content at scale. But they are doing it without any structured awareness of what that content means, who produced it, or how its concepts relate to each other. The result is a retrieval layer that is powerful, but weak on publisher attribution, disambiguation, and semantic structure.

EntityMap is designed to address this structural gap. But to understand why it matters, it helps to understand what is actually going wrong.


What AI retrieval does today

When a RAG pipeline, an AI search engine, or an agentic tool needs information, it fetches pages from the web and extracts text. It does not read a structured index of what you know. It reads HTML, strips formatting, and chunks the result into passages — often arbitrarily, with no awareness of which entity a passage is about or who published it.

This page-level retrieval creates three structural problems that no amount of good writing can solve on its own.

Problem 1 — Disambiguation loss

Your site uses "AI SOV" in navigation, "AI Share of Voice" in body copy, and "artificial intelligence share of voice" in a technical glossary. To a page-level retriever, these are three separate signals. There is no mechanism that tells it these surface forms all refer to the same entity — the one you have spent years building authority around.

The result: your expertise is diluted across scattered text fragments instead of concentrated on a single, recognisable concept node.

Problem 2 — Attribution loss (the ghost citation problem)

An AI system answers a question using content from your site. The URL appears as a footnote. Your company name never appears in the answer. A reader who acts on that information has no idea it came from you.

In many current AI retrieval systems, publisher identity is not embedded in page-level content in a way that survives aggregation. You get the traffic risk without the brand benefit.
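
To see why, compare attribution that lives only in the source URL with attribution declared on the chunk itself. The field names below are illustrative assumptions, as is the aggregator's behaviour, but the pattern is common: pipelines pool chunk text and discard retrieval metadata such as URLs.

```python
# Two retrieval chunks: one carries publisher identity only via its
# URL, the other declares it on the chunk (illustrative field names).
chunks = [
    {"text": "Definition of AI Share of Voice ...",
     "source_url": "https://yourdomain.com/glossary"},
    {"text": "Definition of AI Share of Voice ...",
     "source_url": "https://yourdomain.com/glossary",
     "publisher": "Your Company"},  # declared on the chunk itself
]

def aggregate(chunks):
    """Simulate an aggregator that pools text and drops source URLs."""
    return [{"text": c["text"], "publisher": c.get("publisher")}
            for c in chunks]

pooled = aggregate(chunks)
print([c["publisher"] for c in pooled])  # [None, 'Your Company']
```

The first chunk becomes a ghost citation the moment the URL is dropped; the second keeps its identity through aggregation because the identity travels with the content.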

Problem 3 — Reasoning loss

You know that AI Topical Presence is a component of AI Share of Voice, which in turn depends on Entity Salience. These relationships are implicit in your content — buried in prose, scattered across pages, inferrable by a careful human reader.

An AI system must reconstruct them probabilistically from unstructured text. Sometimes it gets them right. Often it gets them partially wrong, or misses them entirely. The reasoning scaffold your content implies never becomes explicit.


What changes with EntityMap

EntityMap is a practical way to address these three problems. Publishers expose a structured index alongside their existing content. The index does not replace your pages — it sits next to them, at a predictable URL, as a machine-readable declaration of what your site knows.

[Diagram: page-level retrieval without EntityMap vs. structured retrieval with EntityMap]

The file at yourdomain.com/entitymap.json tells any consumer: here are the concepts this site covers, here is how we define them, here is the best evidence for each one, and here is how they relate to each other. Publisher attribution is not inferred from URL structure — it is declared on every chunk, in a field designed to survive aggregation and redistribution.
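
As an illustration only — the published specification defines the authoritative field names, and everything below (IDs, predicates, structure) is assumed — a minimal entitymap.json carrying those four declarations might look like this, sketched here as a Python dict:

```python
import json

# Illustrative shape of an entitymap.json file. Field names, entity
# IDs, and the COMPONENT_OF predicate are invented for this sketch.
entitymap = {
    "publisher": {"name": "Your Company",
                  "url": "https://yourdomain.com"},
    "entities": [
        {"id": "e_001",
         "name": "AI Share of Voice",
         "definition": "Illustrative one-sentence definition.",
         "evidence": ["https://yourdomain.com/guides/ai-sov"]},
        {"id": "e_002",
         "name": "AI Topical Presence",
         "definition": "Illustrative one-sentence definition.",
         "evidence": ["https://yourdomain.com/guides/topical-presence"]},
    ],
    "relations": [
        {"subject": "e_002", "predicate": "COMPONENT_OF",
         "object": "e_001"},
    ],
}

# What a consumer would fetch from /entitymap.json:
print(json.dumps(entitymap, indent=2))
```

The four declarations map directly onto the four clauses above: concepts (entities), definitions, evidence URLs, and typed relations — with publisher identity at the top level rather than inferred from the domain.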


Where the value is real — and where it isn't

Intellectual honesty matters here. EntityMap is not a fix for all AI visibility problems, and it is worth being clear about what it does and does not address.

Scenario: EntityMap value

- RAG pipelines retrieving your site: Strong — direct structural improvement to retrieval quality and attribution
- Agentic AI crawlers (12–24 months): Strong — the infrastructure is being built now; EntityMap is ready for it
- AI search engines today: Emerging — growing as these systems mature their crawl and retrieval layers
- Pre-training data influence: Not applicable — EntityMap cannot change what a model learned during training
- Ghost citations from training associations: Partial — fixes retrieval-time attribution, not training-time association

The ghost citation problem has two components: a retrieval problem (addressable now with EntityMap) and a training data problem (not fixable post-training). EntityMap addresses the retrieval half. The training half requires either retraining or fine-tuning — outside any publisher's direct control.

The strongest case for EntityMap is forward-looking: it builds the right infrastructure for where AI consumption is heading, not where it is today. Agentic systems that browse the web, RAG pipelines that retrieve live content, AI assistants that synthesise from multiple sources — these are the growing category. EntityMap is designed to be ready for them.


The simplest thing you can do today

You do not need to wait for AI crawler teams to formally support EntityMap. A single hyperlink in your site footer works with every crawler that follows HTML links — which is all of them, right now.

<footer>
  <a href="https://yourdomain.com/entitymap.html">EntityMap</a>
</footer>

Link to entitymap.html, not the JSON file. The HTML version is designed for this: it renders entity definitions, relations, and publisher attribution as readable text with embedded JSON-LD, so any system that fetches and parses HTML will extract structured, attributed content. GPTBot, PerplexityBot, ClaudeBot, GoogleOther — all follow standard HTML links from the home page outward.

A sitewide footer is better than a home-page-only link — it means every page on your site carries a route to your EntityMap, which significantly increases the chance a crawler will find it regardless of which page it enters from.

For the JSON file, add a machine-readable signal in your <head> instead:

<link rel="entitymap" type="application/json"
      href="https://yourdomain.com/entitymap.json" />
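
On the consumer side, a crawler can pick up that hint with nothing beyond the standard library. The parser class below is hypothetical — nothing about it is mandated by EntityMap — but it shows that discovering the JSON file from the <link> tag is a few lines of work:

```python
from html.parser import HTMLParser

class EntityMapLinkFinder(HTMLParser):
    """Collect href values of <link rel="entitymap"> tags in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "entitymap" and "href" in a:
            self.hrefs.append(a["href"])

page = """<head>
  <link rel="entitymap" type="application/json"
        href="https://yourdomain.com/entitymap.json" />
</head>"""

finder = EntityMapLinkFinder()
finder.feed(page)
print(finder.hrefs)  # ['https://yourdomain.com/entitymap.json']
```

Any system that already parses <head> metadata — as every serious crawler does for canonical URLs and feeds — can add this discovery step without new infrastructure.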

What does this actually improve? For live RAG retrieval systems — Perplexity, ChatGPT Search, AI Overviews — that fetch content at inference time, a crawlable EntityMap means your publisher attribution, canonical entity names, and typed relations are available in the content the system retrieves. The attribution improvement is real and relatively immediate. For training data, the effect is indirect: if your EntityMap is crawled during a training run, the structured, publisher-attributed content gets baked into the model. You have no control over timing, but the signal is there. What it cannot fix is wrong associations already baked into existing model weights — that requires retraining.


The precedent: GoodRelations

In 2008, Martin Hepp published GoodRelations — a vocabulary for describing products, prices, and business entities on the web. No standards body commissioned it. Hepp published the spec, demonstrated it on a real site, and brought it to the community.

2008: GoodRelations published. Academic paper, publicly hosted spec, reference implementation. No W3C mandate, no industry consensus — just a working vocabulary that solved a real problem.

2010–11: Early adoption. E-commerce sites began implementing. Enough real-world usage accumulated to make the case to search engines.

2012: Schema.org absorption. Google, Yahoo, and Bing incorporated GoodRelations core concepts into schema.org. It became the foundation of every product listing in structured search — used by millions of sites that never heard of Hepp.

The path GoodRelations took is a useful precedent for how open vocabularies can spread. EntityMap draws on a similar adoption model: publish the vocabulary openly, demonstrate implementation value on a real site, and improve through community use. Whether it follows the same trajectory depends on adoption — and adoption depends on whether the standard proves useful in practice.


The reasoning advantage — a concrete example

The attribution and disambiguation problems are relatively straightforward to explain. The reasoning problem is harder to see — until you put a complex query in front of both approaches at once.

Consider this query: "Acme Corporation claims their platform is for firms with disconnected silos that cause reconciliation backlogs. How does Acme EDM specifically solve this, and how does data quality monitoring then lead to better regulatory compliance?"

This question requires connecting three distinct concepts across at least two pages of the site. It is not a lookup — it is a reasoning chain. Here is what happens under each approach.

Standard RAG fetches three separate pages, chunks the text, and tries to stitch the answer together. Because no single chunk contains the full logical chain, the model has to infer the connections — and it signals that inference with hedging language: "It is likely that Acme EDM addresses this… their data quality intelligence probably assists in the reporting required…" The answer is not wrong, but it is uncertain. The reasoning is the model's, not the publisher's.

EntityMap-enabled RAG does something different. It reads the relation graph and traverses the declared predicates: Fragmented Data Estate CONFLICTS_WITH Enterprise Data Management, which IMPROVES Regulatory Compliance. The logical chain is not inferred — it is read directly from the publisher's own declarations. The output drops the hedging entirely.

[Diagram: Standard RAG (page-level retrieval) pulls separate chunks for "Fragmented data estate", "Acme EDM", and "Regulatory compliance" from /solutions/edm and /products/acme-edm, yielding a probabilistic output: "It is likely that Acme EDM addresses this…". EntityMap-enabled RAG (graph traversal) follows e_010 (Fragmented data estate) CONFLICTS_WITH e_002 (Enterprise Data Management) IMPROVES e_009 (Regulatory compliance), yielding a grounded output: "Acme's EDM IMPROVES regulatory compliance".]

The key insight is the path between Fragmented Data Estate and Regulatory Compliance. In standard RAG the model has to find a sentence that mentions both. In EntityMap the path is explicit: e_010 → CONFLICTS_WITH → e_002 → IMPROVES → e_009. The model reads the publisher's own logic rather than reconstructing it from prose.
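
The traversal itself is ordinary graph search. As a sketch — the triples mirror the example's IDs and predicates, but the representation is assumed, not spec-defined — a breadth-first walk over declared relations recovers the publisher's chain directly:

```python
from collections import deque

# Relation triples as described in the example above; the tuple
# representation is illustrative, not part of any published schema.
relations = [
    ("e_010", "CONFLICTS_WITH", "e_002"),  # Fragmented Data Estate / EDM
    ("e_002", "IMPROVES", "e_009"),        # EDM / Regulatory Compliance
]

def find_path(start, goal, triples):
    """Breadth-first search over declared predicates. Returns the
    chain of (predicate, entity) hops, or None if unconnected."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for subj, pred, obj in triples:
            if subj == node and obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(pred, obj)]))
    return None

print(find_path("e_010", "e_009", relations))
# [('CONFLICTS_WITH', 'e_002'), ('IMPROVES', 'e_009')]
```

No sentence mentioning both endpoints is required: the chain exists as data, so the model's job shrinks from reconstruction to verbalisation.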

This is why the relation layer in EntityMap is not a nice-to-have. It is the mechanism by which publishers can place their reasoning — not just their content — into the AI retrieval layer.


Who should implement EntityMap

EntityMap is an open standard under CC BY 4.0. Read the specification, see a working example, contribute on GitHub, or try the Waikay generator.