Open standard · Technical specification

EntityMap specification

A structured, entity-first index of website content for AI agent and LLM consumption

Version 0.2

Status Draft

Date 2026-03-27

Authors Waikay / InLinks Optimization Ltd

License CC BY 4.0

Files entitymap.json · entitymap.html

Contents

Abstract
Conventions and terminology
Motivation
File conventions
JSON structure
Standard predicate vocabulary
The HTML companion file
Validation
Versioning and evolution
Privacy and security
Relationship to existing standards
Reference implementation
Appendix A - Minimal valid example
Appendix B - Complete predicate reference
Appendix C - Changelog

Abstract

EntityMap is an open standard for publishing a structured, entity-first index of a website's content, designed for consumption by AI agents, large language models, and RAG (Retrieval-Augmented Generation) pipelines.

Where sitemap.xml tells crawlers which pages exist, entitymap.json tells AI agents what knowledge a site holds. It aims to do for AI agents what the Sitemaps protocol did for search crawlers: provide a predictable, machine-readable discovery layer for site knowledge.

EntityMap is designed to reduce disambiguation loss, attribution loss, and reasoning loss in AI retrieval systems - three structural problems that page-level retrieval does not solve.

1. Conventions and terminology

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this document are to be interpreted as described in RFC 2119.

Throughout this document:

Publisher - the organisation or individual operating the website described by the EntityMap.
Entity - a named concept, person, organisation, product, place, event, or other identifiable thing covered by the publisher's content.
Chunk - an extractive or minimally normalised passage from a source page, associated with one or more entities.
Consumer - any AI agent, LLM, RAG pipeline, or crawler that reads and processes an EntityMap.
Conforming file - an entitymap.json or entitymap.html that satisfies all MUST requirements in this specification.

2. Motivation

2.1 The retrieval problem

Large language models build knowledge from training data. At inference time, retrieval-augmented systems supplement that knowledge by fetching relevant content from the web. Current retrieval mechanisms operate at the page level - fetching HTML documents and extracting text without structured awareness of the entities, concepts, or relationships those pages contain.

This creates three problems:

Disambiguation loss. The same concept may appear under many surface forms across a site ("AI SOV", "AI Share of Voice", "artificial intelligence share of voice"). A page-level retriever treats these as separate signals rather than a single entity.

Attribution loss. When retrieved content is incorporated into an LLM response, the publisher that produced it may not be credited. A URL may be used as a source while the publisher's name never appears in the answer - the "ghost citation" problem.

Reasoning loss. Relationships between concepts are buried in prose. A model must reconstruct that "AI Topical Presence explains AI Share of Voice" from unstructured text rather than reading it as an explicit, typed relation.

2.2 The EntityMap solution

EntityMap allows publishers to pre-solve these problems by publishing a structured index alongside their existing content. The index aggregates evidence by entity rather than by page, carries explicit publisher attribution on every evidence chunk, and encodes relationships between entities using a controlled predicate vocabulary.

A well-formed EntityMap enables consumers to:

Retrieve pre-ranked, entity-specific evidence rather than arbitrary page fragments
Recognise publisher identity without inferring it from URL structure
Traverse explicit entity relationships rather than reconstructing them from prose

2.3 Design rationale: application JSON over JSON-LD

EntityMap uses an application-specific JSON structure rather than strict JSON-LD. This choice prioritises implementation simplicity and broad compatibility. The HTML companion file (entitymap.html) exposes equivalent schema.org / JSON-LD representations per entity for broader interoperability with existing structured data consumers. Publishers who require full JSON-LD compliance may embed the EntityMap vocabulary within a JSON-LD context document; this specification does not preclude that approach.

3. File conventions

3.1 Location

An EntityMap consists of two files served at predictable URLs:

File	URL pattern	Purpose
`entitymap.json`	`https://example.com/entitymap.json`	Machine-readable primary file
`entitymap.html`	`https://example.com/entitymap.html`	Crawler and human readable rendered view

Both files MUST be served from the root of the domain or subdomain they describe, without authentication.

3.2 Discovery

Publishers MAY expose EntityMap discovery hints using one or more of the following mechanisms, and consumers MAY support any of them. None of these mechanisms is currently a recognised standard; they are publisher-side conventions proposed by this specification for future adoption.

In robots.txt

# EntityMap
EntityMap: https://example.com/entitymap.json

In the HTML <head> of every page

<link rel="entitymap" type="application/json"
      href="https://example.com/entitymap.json" />

In sitemap.xml

Publishers MAY list entitymap.html as an additional URL entry with high priority and changefreq: weekly to signal freshness.
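A consumer could read the first two hints roughly as follows. This is a minimal sketch, not part of the specification: the function and class names are hypothetical, and it assumes the robots.txt body and the page HTML have already been fetched as strings.

```python
from html.parser import HTMLParser
from typing import Optional


def entitymap_from_robots(robots_txt: str) -> Optional[str]:
    """Return the URL from the first `EntityMap:` line, if any."""
    for line in robots_txt.splitlines():
        # Drop comments, then split on the first colon only (URLs contain colons).
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "entitymap":
            return value.strip() or None
    return None


class _LinkFinder(HTMLParser):
    """Records the href of the first <link rel="entitymap"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "entitymap" in rels:
            self.href = attributes.get("href")


def entitymap_from_html(page_html: str) -> Optional[str]:
    """Return the href of a <link rel="entitymap"> hint, if the page has one."""
    finder = _LinkFinder()
    finder.feed(page_html)
    return finder.href
```

Because none of these hints is a recognised standard, a consumer would probably still fall back to requesting /entitymap.json directly when no hint is found.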

3.3 Indexability

entitymap.html MUST NOT carry a noindex directive, because it is the primary discovery surface for AI crawlers that consume HTML. Publishers MAY use a rel="canonical" link to manage search engine indexing according to their SEO strategy; self-canonicalization is the recommended default.

3.4 Large sites - sharding

For sites with more than 200 entities, the EntityMap SHOULD be sharded. The root entitymap.json becomes a manifest pointing to typed shard files:

/entitymap.json              ← manifest only
/entitymap/concepts.json
/entitymap/people.json
/entitymap/products.json
/entitymap/places.json

The manifest MUST list all shards with their entity counts and lastModified timestamps. A topEntities array of up to 20 entries MAY be included to provide consumers with a fast entry point to the most important entities.
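The specification does not show a manifest layout, so the following is only a sketch of what a generator might emit. The key names "shards", "url", and "entityCount" are assumptions made for illustration; only lastModified and topEntities are named in the text above.

```python
SHARD_THRESHOLD = 200   # sharding is recommended above this entity count
MAX_TOP_ENTITIES = 20   # upper bound on the optional topEntities array


def needs_sharding(entities: list) -> bool:
    """True when the entity count exceeds the recommended threshold."""
    return len(entities) > SHARD_THRESHOLD


def build_manifest(shards: dict, last_modified: str, top_entities=()) -> dict:
    """Build a root manifest from a mapping of shard path -> list of entities.

    `last_modified` is applied to every shard here for brevity; a real
    generator would track one timestamp per shard.
    """
    return {
        "version": "0.2",
        "shards": [  # hypothetical key name
            {
                "url": path,
                "entityCount": len(entities),  # hypothetical key name
                "lastModified": last_modified,
            }
            for path, entities in sorted(shards.items())
        ],
        "topEntities": list(top_entities)[:MAX_TOP_ENTITIES],
    }
```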

4. JSON structure

4.1 Root object

{
  "version": "0.2",
  "schema": "https://entitymap.org/spec/v0.2",
  "publisher": { ... },
  "generated": "2026-03-27T00:00:00Z",
  "vocabulary": { ... },
  "entities": [ ... ]
}

Field	Type	Conformance	Description
`version`	string	MUST	Spec version this file conforms to
`schema`	string	MUST	URI of the EntityMap spec used
`publisher`	object	MUST	Identity of the site publisher. See §4.2
`generated`	string (ISO 8601)	MUST	Timestamp of last full generation
`vocabulary`	object	MAY	Custom predicate declarations. If omitted, standard vocabulary applies
`entities`	array	MUST	List of entity objects. See §4.3

4.2 Publisher object

{
  "name": "Acme Corp",
  "url": "https://acme.com",
  "sameAs": "https://www.wikidata.org/wiki/Q..."
}

Field	Type	Conformance	Description
`name`	string	MUST	Human-readable publisher name. MUST match the `publisher` field on all chunks.
`url`	string	MUST	Canonical URL of the publisher
`sameAs`	string	MAY	Wikidata or schema.org URI anchoring publisher to open knowledge graph

4.3 Entity object

{
  "entityId": "e_001",
  "topicID": 456,
  "@type": "DefinedTerm",
  "name": "AI Share of Voice",
  "alternateName": "AI SOV",
  "canonicalLabel": "share of voice",
  "description": "A metric measuring...",
  "sameAs": "https://www.wikidata.org/wiki/Q...",
  "relations": [ ... ],
  "hasChunks": [ ... ]
}

Field	Type	Conformance	Description
`entityId`	string	MUST	Stable unique identifier within this EntityMap. Used as the reference target in relations.
`topicID`	integer	MAY	Proprietary entity resolution ID. Implementation-specific; omit without affecting conformance.
`@type`	string	MUST	Schema.org type. See §4.5
`name`	string	MUST	Publisher-specific label as used in this site's content (e.g. "AI Share of Voice")
`alternateName`	string	MAY	Abbreviation or common alternative surface form
`canonicalLabel`	string	MAY	General concept label from entity resolver (e.g. "share of voice"). Distinct from publisher-specific `name`.
`description`	string	MUST	1–3 sentence definition as this publisher uses the concept. SHOULD be extractive or minimally normalised from source content.
`sameAs`	string	SHOULD	Wikidata or schema.org URI anchoring entity to open knowledge graph
`relations`	array	MAY	Typed relationships to other entities. See §4.4
`hasChunks`	array	MUST	Evidence chunks. 1–5 per entity. See §4.6

4.4 Relation object

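Relations are directional: the subject is the entity containing the relations array, and the object is the target. Three target patterns are supported: internal (identified by targetId), external (identified by targetUri), and both (targetId and targetUri together, for an internal entity that also has an external URI). The first two patterns are illustrated here: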

Internal target (entity within this EntityMap)

{
  "predicate": "ENABLES",
  "targetId": "e_012",
  "targetName": "AI Topical Presence"
}

External target (entity outside this EntityMap)

{
  "predicate": "INSTANCE_OF",
  "targetUri": "https://www.wikidata.org/wiki/Q1163385",
  "targetName": "Herfindahl-Hirschman Index"
}

Field	Type	Conformance	Description
`predicate`	string	MUST	From standard vocabulary (§5) or declared custom vocabulary (§5.3)
`targetId`	string	SHOULD	The `entityId` of the target entity. Required for internal relations.
`targetName`	string	MUST	Human-readable name of the target entity. Required in all cases for readability.
`targetUri`	string	MAY	URI for external entities (Wikidata, schema.org, etc.)

For internal relations, targetId MUST match a valid entityId within the same EntityMap or shard set. targetName MUST be present in all cases - it is the human-readable anchor that survives aggregation and redistribution.
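The resolution rule can be checked mechanically. The sketch below uses hypothetical helper names and assumes all shards have already been merged into one list of entity objects. Treating a relation with neither targetId nor targetUri as "name_only" is a choice made for this example, not something the specification defines.

```python
def classify_target(relation: dict) -> str:
    """Return 'internal', 'external', 'both', or 'name_only' for a relation."""
    has_id = bool(relation.get("targetId"))
    has_uri = bool(relation.get("targetUri"))
    if has_id and has_uri:
        return "both"
    if has_id:
        return "internal"
    if has_uri:
        return "external"
    return "name_only"


def unresolved_targets(entities: list) -> list:
    """List (entityId, targetId) pairs whose targetId matches no entityId."""
    known = {entity["entityId"] for entity in entities}
    return [
        (entity["entityId"], relation["targetId"])
        for entity in entities
        for relation in entity.get("relations", [])
        if relation.get("targetId") and relation["targetId"] not in known
    ]
```

An empty list from unresolved_targets means every internal relation points at an entity that is present.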

4.5 Entity types

Permitted @type values are drawn from schema.org:

Type	Use for
`DefinedTerm`	Concepts, metrics, methodologies, standards
`Person`	Named individuals
`Organization`	Companies, institutions, bodies
`Product`	Named products or services
`Place`	Locations, regions, venues
`Event`	Named events, conferences, occurrences
`ScholarlyArticle`	Research, studies, reports
`CreativeWork`	Books, guides, courses
`Regulation`	Laws, policies, standards bodies

Publishers MAY use additional schema.org types. Non-schema.org types MUST be prefixed with a declared namespace (e.g. waikay:MetricComponent).
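A type check along these lines is one possibility. It is a sketch under two assumptions: namespaces are declared as a set of prefixes (the specification does not say where they are declared), and an unlisted, unprefixed type is reported rather than rejected, since it may still be a valid schema.org type.

```python
LISTED_TYPES = {
    "DefinedTerm", "Person", "Organization", "Product", "Place",
    "Event", "ScholarlyArticle", "CreativeWork", "Regulation",
}


def type_status(type_name: str, declared_prefixes=frozenset()) -> str:
    """Classify an @type value against the table above."""
    prefix, sep, local = type_name.partition(":")
    if sep:
        # Prefixed types need a declared namespace and a non-empty local name.
        if prefix in declared_prefixes and local:
            return "namespaced"
        return "undeclared-namespace"
    return "listed" if type_name in LISTED_TYPES else "unlisted"
```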

4.6 Chunk object

{
  "chunkId": "c_001",
  "text": "AI Share of Voice measures...",
  "sourceUrl": "https://acme.com/ai-share-of-voice",
  "pageTitle": "What is AI Share of Voice?",
  "publisher": "Acme Corp",
  "retrieved": "2026-03-27T09:14:00Z",
  "relevanceScore": 0.97
}

Field	Type	Conformance	Description
`chunkId`	string	MUST	Unique identifier within this EntityMap
`text`	string	MUST	The evidence passage. 1–5 sentences. Max 500 characters. SHOULD be extractive. MUST preserve the original meaning of the source.
`sourceUrl`	string	MUST	Canonical URL of the source page
`pageTitle`	string	MUST	Title of the source page at time of retrieval
`publisher`	string	MUST	Publisher name. MUST match `publisher.name` in root object. Primary brand attribution mechanism.
`retrieved`	string (ISO 8601)	SHOULD	Timestamp when the source page was fetched to produce this chunk. Signals freshness to consumers.
`relevanceScore`	float 0.0–1.0	MAY	Publisher-assigned relevance of this chunk to its entity. Scoring method is implementation-specific.

An entity MUST NOT carry more than 5 chunks. Implementations SHOULD select the highest-relevance chunks and discard the remainder. The publisher field MUST be present on every chunk: it is the primary mechanism for brand attribution in downstream AI consumption and MUST survive aggregation by third-party systems.
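Chunk selection might look like the following sketch. The function name is hypothetical, and ranking a chunk with no relevanceScore as 0.0 is an assumption, since the scoring method is implementation-specific.

```python
MAX_CHUNKS = 5
MAX_TEXT_CHARS = 500


def select_chunks(candidates: list, publisher_name: str) -> list:
    """Keep the highest-scoring candidate chunks that fit the length limit."""
    eligible = [
        chunk for chunk in candidates
        if 0 < len(chunk.get("text", "")) <= MAX_TEXT_CHARS
    ]
    # sorted() is stable, so chunks with equal scores keep their input order.
    ranked = sorted(
        eligible,
        key=lambda chunk: chunk.get("relevanceScore", 0.0),
        reverse=True,
    )
    # Copy each chunk and stamp the publisher name so attribution is never missing.
    return [{**chunk, "publisher": publisher_name} for chunk in ranked[:MAX_CHUNKS]]
```

Over-long passages are dropped here instead of truncated, because cutting a passage mid-sentence could change its meaning.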

5. Standard predicate vocabulary

All predicates are uppercase. The vocabulary is tiered: Core predicates are the minimum set for interoperability; Extended predicates are recommended for richer graphs; Custom predicates require explicit declaration.

5.1 Core predicates

Implementations SHOULD support all core predicates. A consumer that claims EntityMap compatibility MUST be able to process core predicates without error.

Structural

RELATES_TO, INCLUDES, PART_OF, DEPENDS_ON, CONFLICTS_WITH, ENABLES, REQUIRES

Causation

IMPROVES, DEGRADES, PRODUCES, PREVENTS, LEADS_TO

Information

DESCRIBES, MEASURES, REFERENCES

Metadata

AUTHORED_BY, AFFILIATED_WITH, INSTANCE_OF

5.2 Extended predicates

Extended predicates MAY be used by publishers. Consumers MUST NOT reject a conforming file that contains extended predicates, but MAY ignore them if not supported.

Structural

EXCLUDES, SUITED_FOR

State

MAINTAINS, PRECEDES, LACKS

Causation

TRANSFORMS, RESTRICTS, REMOVES, RESTORES, CONVERTS, ALLOWS

Information

RECOMMENDS, PROVIDES, PUBLISHES

Analytical

IDENTIFIES, DIAGNOSES, COMPARES, MONITORS, BENCHMARKS

Sequential / spatial

PASSES_THROUGH, NAVIGATES_TO

Agency

REGULATES, PROTECTS, CREATES, TARGETS, ACHIEVES

5.3 Declaring custom predicates

"vocabulary": {
  "predicates": ["POLLINATES", "ZONES_AS", "SEASONALLY_OPERATES"],
  "namespace": "https://acme.com/entitymap/vocab/v1"
}

Custom predicates MUST be uppercase. They MUST NOT conflict with standard predicate names. They MUST be documented at the declared namespace URI. Consumers MUST NOT reject a conforming file that contains declared custom predicates they do not recognise, but MAY ignore them.
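A publisher-side check of a vocabulary declaration could be sketched as below. The helper names are hypothetical, and treating a missing namespace as a problem is an assumption drawn from the documentation requirement. Whether the namespace URI actually documents the predicates cannot be verified offline and is not attempted.

```python
CORE = set(
    "RELATES_TO INCLUDES PART_OF DEPENDS_ON CONFLICTS_WITH ENABLES REQUIRES "
    "IMPROVES DEGRADES PRODUCES PREVENTS LEADS_TO "
    "DESCRIBES MEASURES REFERENCES "
    "AUTHORED_BY AFFILIATED_WITH INSTANCE_OF".split()
)
EXTENDED = set(
    "EXCLUDES SUITED_FOR MAINTAINS PRECEDES LACKS "
    "TRANSFORMS RESTRICTS REMOVES RESTORES CONVERTS ALLOWS "
    "RECOMMENDS PROVIDES PUBLISHES "
    "IDENTIFIES DIAGNOSES COMPARES MONITORS BENCHMARKS "
    "PASSES_THROUGH NAVIGATES_TO "
    "REGULATES PROTECTS CREATES TARGETS ACHIEVES".split()
)
STANDARD = CORE | EXTENDED


def vocabulary_problems(vocabulary: dict) -> list:
    """Return human-readable problems found in a `vocabulary` object."""
    problems = []
    if not vocabulary.get("namespace"):
        problems.append("missing namespace URI")
    for predicate in vocabulary.get("predicates", []):
        if predicate != predicate.upper():
            problems.append(f"{predicate}: not uppercase")
        if predicate.upper() in STANDARD:
            problems.append(f"{predicate}: conflicts with a standard predicate")
    return problems


def allowed_predicates(vocabulary=None) -> set:
    """Standard predicates plus any declared custom ones."""
    return STANDARD | set((vocabulary or {}).get("predicates", []))
```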

6. The HTML companion file

entitymap.html is a rendered, crawlable view of the same data as entitymap.json. It is generated from the JSON and MUST NOT be maintained independently.

A conforming entitymap.html MUST:

Reference entitymap.json via <link rel="alternate" type="application/json" href="/entitymap.json" />
Embed per-entity JSON-LD using schema.org types in <script type="application/ld+json"> blocks
Render entity relations as <a href="#entity-slug"> internal hyperlinks where targets exist in the same file
Include publisher attribution on every evidence blockquote via a data-publisher attribute
Not carry a noindex directive

A conforming entitymap.html SHOULD:

Carry a rel="canonical" for SEO management. Self-canonicalization is the recommended default.
Include a <link rel="entitymap"> pointing back to entitymap.json
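As an illustration of the per-entity requirements, a generator might render each entity roughly like this. The slug scheme, the element layout, and the choice of which fields go into the JSON-LD block are all assumptions made for this sketch; the specification fixes only the requirements listed above.

```python
import html
import json
import re


def slugify(name: str) -> str:
    """Hypothetical slug scheme: lowercase, non-alphanumeric runs become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def render_entity(entity: dict, names_by_id: dict) -> str:
    """Render one entity as an HTML section with embedded JSON-LD."""
    json_ld = {
        "@context": "https://schema.org",
        "@type": entity["@type"],
        "name": entity["name"],
        "description": entity["description"],
    }
    if "sameAs" in entity:
        json_ld["sameAs"] = entity["sameAs"]
    # Escape "<" so description text cannot close the script element early.
    ld_text = json.dumps(json_ld, ensure_ascii=False).replace("<", "\\u003c")

    parts = [
        f'<section id="{slugify(entity["name"])}">',
        f"<h2>{html.escape(entity['name'])}</h2>",
        f'<script type="application/ld+json">{ld_text}</script>',
    ]
    for relation in entity.get("relations", []):
        label = html.escape(relation["targetName"])
        target_name = names_by_id.get(relation.get("targetId", ""))
        if target_name is not None:
            # Internal targets become in-page links to the target's section.
            label = f'<a href="#{slugify(target_name)}">{label}</a>'
        parts.append(f"<p>{html.escape(relation['predicate'])} {label}</p>")
    for chunk in entity["hasChunks"]:
        publisher = html.escape(chunk["publisher"], quote=True)
        source = html.escape(chunk["sourceUrl"], quote=True)
        parts.append(
            f'<blockquote data-publisher="{publisher}" cite="{source}">'
            f"{html.escape(chunk['text'])}</blockquote>"
        )
    parts.append("</section>")
    return "\n".join(parts)
```

Here names_by_id maps each entityId in the file to its name, so that link targets and section ids are derived from the same slug.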

7. Validation

A conforming entitymap.json MUST:

Be valid JSON parseable without error
Include all MUST fields at root, publisher, entity, and chunk level
Use only predicates from the standard vocabulary or a declared custom vocabulary
Include a publisher field on every chunk matching publisher.name
Not exceed 5 chunks per entity
Use only permitted @type values or namespaced type extensions
Be accessible without authentication at its declared URL
Have all internal targetId values resolve to a valid entityId within the EntityMap or its shard set
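Several of these checks can be automated. The following is a partial sketch, with hypothetical names, covering JSON parsing, required fields, publisher matching, and the chunk limit. It does not cover predicate, @type, targetId, or accessibility checks, and it assumes the publisher value is an object when present.

```python
import json

REQUIRED = {
    "root": ("version", "schema", "publisher", "generated", "entities"),
    "publisher": ("name", "url"),
    "entity": ("entityId", "@type", "name", "description", "hasChunks"),
    "chunk": ("chunkId", "text", "sourceUrl", "pageTitle", "publisher"),
}
MAX_CHUNKS = 5


def _missing(obj: dict, level: str, where: str) -> list:
    """Report each MUST field for `level` that is absent from `obj`."""
    return [f"{where}: missing {field}" for field in REQUIRED[level] if field not in obj]


def validate(raw: str) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["root is not a JSON object"]

    problems = _missing(doc, "root", "root")
    publisher = doc.get("publisher", {})
    problems += _missing(publisher, "publisher", "publisher")

    for entity in doc.get("entities", []):
        where = f"entity {entity.get('entityId', '?')}"
        problems += _missing(entity, "entity", where)
        chunks = entity.get("hasChunks", [])
        if len(chunks) > MAX_CHUNKS:
            problems.append(f"{where}: {len(chunks)} chunks exceeds {MAX_CHUNKS}")
        for chunk in chunks:
            chunk_where = f"chunk {chunk.get('chunkId', '?')}"
            problems += _missing(chunk, "chunk", chunk_where)
            if "publisher" in chunk and chunk["publisher"] != publisher.get("name"):
                problems.append(f"{chunk_where}: publisher does not match publisher.name")
    return problems
```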

8. Versioning and evolution

The spec version is declared in the root version field and MUST match the version of the schema URI used.

Minor versions (0.x) MAY add optional fields without breaking conformance of existing files. Major versions (x.0) MAY introduce breaking changes and MUST be announced with a minimum 6-month deprecation window for the previous version.

Publishers MUST update generated on every rebuild. Consumers SHOULD treat files with a generated timestamp older than 30 days as potentially stale.
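A consumer-side freshness and version check might be sketched as follows. The function names are hypothetical. Two assumptions are made: a timestamp without a timezone is treated as UTC, and the schema URI ends in "/v" followed by the version, as it does in the examples in this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(days=30)


def is_stale(generated: str, now: Optional[datetime] = None) -> bool:
    """True when `generated` is more than 30 days before `now` (default: current UTC time)."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    if generated.endswith("Z"):
        generated = generated[:-1] + "+00:00"
    stamp = datetime.fromisoformat(generated)
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    if now is None:
        now = datetime.now(timezone.utc)
    return now - stamp > STALE_AFTER


def version_matches_schema(doc: dict) -> bool:
    """Check that the schema URI ends in '/v' + the declared version."""
    return doc["schema"].rstrip("/").endswith("/v" + doc["version"])
```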

9. Privacy and security

EntityMap files are public by definition. Publishers MUST NOT include:

Personal data beyond named public figures or explicitly consenting individuals
Authentication tokens, API keys, or internal system identifiers
Content not already publicly accessible on the described website

The topicID field is optional and implementation-specific. Publishers using proprietary entity resolution systems MAY omit it without affecting conformance. If included, it MUST NOT expose information that would compromise the publisher's systems or data.

10. Relationship to existing standards

sitemap.xml

Sitemaps describe pages. EntityMap describes knowledge. Both SHOULD be present and are complementary, not competing.

schema.org

EntityMap uses schema.org @type values and is designed for compatibility. The HTML companion embeds valid JSON-LD per entity.

robots.txt

EntityMap discovery MAY be declared via an EntityMap: directive. This is a proposed convention, not yet a recognised robots.txt standard.

JSON-LD

EntityMap uses application-specific JSON for implementation simplicity. JSON-LD representations are exposed in the HTML companion for broader interoperability.

Wikidata

sameAs fields SHOULD use Wikidata URIs as canonical entity anchors, linking site knowledge to the open knowledge graph.

RSS / Atom

Conceptual analogy: EntityMap is to AI agents as RSS is to feed readers - a structured, subscribable content layer with predictable discovery.

11. Reference implementation

The initial reference implementation was developed by Waikay and consists of:

An extraction pipeline using NLP entity analysis and LLM-assisted chunk selection and relation extraction
A generator producing conforming entitymap.json and entitymap.html files
A client-side viewer rendering entitymap.html dynamically from entitymap.json

The reference implementation is available at waikay.io/entitymap. Third-party implementations are welcomed. To register an implementation, open an issue at the specification repository.

A. Appendix A - Minimal valid example

{
  "version": "0.2",
  "schema": "https://entitymap.org/spec/v0.2",
  "publisher": {
    "name": "Acme Gardens",
    "url": "https://acmegardens.com"
  },
  "generated": "2026-03-27T00:00:00Z",
  "entities": [
    {
      "entityId": "e_001",
      "@type": "DefinedTerm",
      "name": "Companion Planting",
      "description": "The practice of growing different plants in proximity for mutual benefit, including pest control, pollination support, and improved yield.",
      "sameAs": "https://www.wikidata.org/wiki/Q905413",
      "relations": [
        {
          "predicate": "IMPROVES",
          "targetId": "e_002",
          "targetName": "Crop Yield"
        },
        {
          "predicate": "PREVENTS",
          "targetId": "e_003",
          "targetName": "Pest Damage"
        }
      ],
      "hasChunks": [
        {
          "chunkId": "c_001",
          "text": "Companion planting pairs plants that benefit each other - for example, growing basil near tomatoes to repel aphids and improve fruit flavour.",
          "sourceUrl": "https://acmegardens.com/companion-planting-guide",
          "pageTitle": "The Complete Companion Planting Guide",
          "publisher": "Acme Gardens",
          "retrieved": "2026-03-27T09:14:00Z",
          "relevanceScore": 0.95
        }
      ]
    }
  ]
}

Note: the relation targets e_002 and e_003 are not defined in this example. For the file to pass the targetId check in §7, the entities array would also need entries for them, or the optional relations array would need to be omitted.

B. Appendix B - Complete predicate reference

CORE STRUCTURAL:   RELATES_TO, INCLUDES, PART_OF, DEPENDS_ON,
                   CONFLICTS_WITH, ENABLES, REQUIRES

CORE CAUSATION:    IMPROVES, DEGRADES, PRODUCES, PREVENTS, LEADS_TO

CORE INFORMATION:  DESCRIBES, MEASURES, REFERENCES

CORE METADATA:     AUTHORED_BY, AFFILIATED_WITH, INSTANCE_OF

EXT STRUCTURAL:    EXCLUDES, SUITED_FOR

EXT STATE:         MAINTAINS, PRECEDES, LACKS

EXT CAUSATION:     TRANSFORMS, RESTRICTS, REMOVES, RESTORES,
                   CONVERTS, ALLOWS

EXT INFORMATION:   RECOMMENDS, PROVIDES, PUBLISHES

EXT ANALYTICAL:    IDENTIFIES, DIAGNOSES, COMPARES,
                   MONITORS, BENCHMARKS

EXT SEQUENTIAL:    PASSES_THROUGH, NAVIGATES_TO

EXT AGENCY:        REGULATES, PROTECTS, CREATES, TARGETS, ACHIEVES

Core: 18 predicates · Extended: 26 predicates · Total: 44 standard predicates

C. Appendix C - Changelog

Version	Date	Notes
0.2	2026-03-27	RFC 2119 normative language throughout. Relation model updated: targetId + targetName + targetUri replacing name-only target. retrieved field added to chunk. Predicate vocabulary tiered into Core and Extended. Discovery language hedged as publisher-side conventions. Opening claims moderated. Chunk text extractive requirement added. Reference implementation framing softened.
0.1	2026-03-27	Initial draft

EntityMap is an open standard published by Waikay / InLinks Optimization Ltd under CC BY 4.0. Contributions and implementations are welcomed. Feedback via waikay.io/entitymap.