Open standard · Technical specification
A structured, entity-first index of website content for AI agent and LLM consumption
sitemap.xml tells crawlers what pages exist; entitymap.json tells AI systems what a site knows: which entities it covers, how they relate, and where the evidence is.
A conforming EntityMap v1.0 file requires exactly three things: a valid root object, at least one entity object, and at least one chunk per entity. Everything else in this specification is optional enrichment.
{
"version": "1.0",
"schema": "https://entitymap.org/spec/v1.0",
"publisher": {
"name": "Acme Corp",
"url": "https://acme.com"
},
"generated": "2026-04-07T00:00:00Z",
"entities": [
{
"entityId": "e_001",
"@type": "Concept",
"name": "Companion Planting",
"description": "The practice of growing different plants in proximity for mutual benefit.",
"hasChunks": [
{
"chunkId": "c_001",
"text": "Companion planting pairs plants that benefit each other.",
"sourceUrl": "https://acme.com/companion-planting",
"pageTitle": "Companion Planting Guide",
"publisher": "Acme Corp"
}
]
}
]
}
| Object | Required fields |
|---|---|
| Root | version, schema, publisher.name, publisher.url, generated, entities |
| Entity | entityId, @type, name, description, hasChunks (min 1) |
| Chunk | chunkId, text, sourceUrl, pageTitle, publisher |
| Relation (if used) | predicate, targetName - plus confidence if predicate is Tier 3 |
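The required-field table above can be checked mechanically. The following is a minimal sketch, not the official validator: the names `REQUIRED`, `missing_fields`, and `check_minimum` are hypothetical, and the check covers only presence of keys and the "at least one" minimums, not value formats.

```python
# Required fields per object, copied from the table above.
REQUIRED = {
    "root": ["version", "schema", "publisher", "generated", "entities"],
    "publisher": ["name", "url"],
    "entity": ["entityId", "@type", "name", "description", "hasChunks"],
    "chunk": ["chunkId", "text", "sourceUrl", "pageTitle", "publisher"],
}


def missing_fields(obj: dict, kind: str) -> list[str]:
    """Return the required keys of `kind` that are absent from `obj`."""
    return [key for key in REQUIRED[kind] if key not in obj]


def check_minimum(doc: dict) -> list[str]:
    """Hypothetical helper: list problems; an empty list means the floor is met."""
    problems = [f"root: missing {k}" for k in missing_fields(doc, "root")]
    publisher = doc.get("publisher", {})
    problems += [f"publisher: missing {k}" for k in missing_fields(publisher, "publisher")]
    entities = doc.get("entities", [])
    if not entities:
        problems.append("entities: at least one entity is required")
    for entity in entities:
        eid = entity.get("entityId", "?")
        problems += [f"{eid}: missing {k}" for k in missing_fields(entity, "entity")]
        chunks = entity.get("hasChunks", [])
        if not chunks:
            problems.append(f"{eid}: at least one chunk is required")
        for chunk in chunks:
            cid = chunk.get("chunkId", "?")
            problems += [f"{eid}/{cid}: missing {k}" for k in missing_fields(chunk, "chunk")]
    return problems
```

Calling `check_minimum` on the parsed minimal example returns an empty list; removing any required key yields one message naming the object and the key.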
Two hard rules apply at all enrichment levels:
- publisher on every chunk MUST exactly match publisher.name in the root - including case and spacing. This is the attribution mechanism.
- Every relation that uses a Tier 3 predicate MUST carry a confidence field.
Both files MUST be served from the root of the domain without authentication.
| File | URL | Purpose |
|---|---|---|
| entitymap.json | https://example.com/entitymap.json | Machine-readable primary file |
| entitymap.html | https://example.com/entitymap.html | Crawler and human-readable view |
Declare the EntityMap via robots.txt (proposed convention), a <link> tag in every page's <head>, and a sitewide footer link - the most reliable mechanism for crawlers:
# robots.txt
EntityMap: https://example.com/entitymap.json
<!-- <head> -->
<link rel="entitymap" type="application/json" href="https://example.com/entitymap.json" />
<!-- footer -->
<a href="https://example.com/entitymap.html">EntityMap</a>
Publishers SHOULD also list entitymap.html in sitemap.xml with priority: 0.9 and changefreq: weekly. entitymap.html MUST NOT carry a noindex directive.
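A consumer might discover the file from either declaration. The sketch below uses only the Python standard library; the function names are hypothetical, and matching the robots.txt directive name case-insensitively is an assumption, since the convention is only proposed.

```python
from html.parser import HTMLParser
from typing import Optional


def entitymap_from_robots(robots_txt: str) -> Optional[str]:
    """Return the URL from the first `EntityMap:` line, if any."""
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        # Assumption: the directive name is matched case-insensitively.
        if sep and key.strip().lower() == "entitymap":
            return value.strip()
    return None


class _LinkFinder(HTMLParser):
    """Remember the href of the first <link rel="entitymap"> seen."""

    def __init__(self) -> None:
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and attr.get("rel") == "entitymap" and self.href is None:
            self.href = attr.get("href")


def entitymap_from_html(html: str) -> Optional[str]:
    """Return the href of the entitymap <link> tag, or None when absent."""
    finder = _LinkFinder()
    finder.feed(html)
    return finder.href
```

Both helpers work on text already fetched by the caller, so they make no assumption about the HTTP client in use.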
For generated EntityMaps exceeding 200 entities, the EntityMap SHOULD be sharded. Sharding is a transport concern - split by size, not by entity type. The root entitymap.json acts as a manifest listing all shards with their entity counts and lastModified timestamps. Each shard file MUST carry a shardOf field pointing back to the root manifest URI. Consumers load all shards - the split carries no semantic meaning.
/entitymap.json ← manifest
/entitymap/part-001.json
/entitymap/part-002.json
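Splitting by size can be sketched as follows. Several details here are assumptions rather than part of the specification text: the shard size of 200 simply mirrors the sharding threshold, the manifest index keys "path" and "entityCount" are illustrative names, and the lastModified timestamp is left to the caller.

```python
def shard_entities(entities: list, manifest_url: str, max_per_shard: int = 200):
    """Split entities by size only; return (manifest_index, shard_files)."""
    index, files = [], {}
    for start in range(0, len(entities), max_per_shard):
        part = entities[start:start + max_per_shard]
        path = f"/entitymap/part-{start // max_per_shard + 1:03d}.json"
        # Every shard points back to the root manifest via shardOf.
        files[path] = {"shardOf": manifest_url, "entities": part}
        # "path" and "entityCount" are assumed key names for the index entry.
        index.append({"path": path, "entityCount": len(part)})
    return index, files


def load_all(files: dict) -> list:
    """Consumers load every shard; the split carries no semantic meaning."""
    return [entity for path in sorted(files) for entity in files[path]["entities"]]
```

Because the split is purely positional, `load_all` returns the entities in their original order regardless of how many shards were written.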
{
"version": "1.0",
"schema": "https://entitymap.org/spec/v1.0",
"publisher": { ... },
"generated": "2026-04-07T00:00:00Z",
"entities": [ ... ],
"profile": "core",
"verificationStatus": "third-party-verified",
"certification": { ... },
"previousVersion": "https://...",
"changeLog": [ ... ],
"shards": [ ... ],
"vocabulary": { ... }
}
| Field | Conformance | Description |
|---|---|---|
| version | MUST | Must be "1.0". |
| schema | MUST | Must be "https://entitymap.org/spec/v1.0". |
| publisher | MUST | Publisher identity object. See §3.1 publisher fields below. |
| generated | MUST | ISO 8601 timestamp. MUST be updated on every rebuild. |
| entities | MUST | Array of entity objects. Min 1. |
| profile | MAY | Extension profile. Default: "core". See Appendix D. |
| verificationStatus | MAY | Trust level declared by the publisher. Allowed values: "self-declared" / "generator-draft" / "third-party-verified". Default: "self-declared". SHOULD be set to "third-party-verified" when a valid certification field is present. Consuming tools MUST treat this field as a hint - the certification registry is the authority. See §3.6. |
| certification | MAY | Third-party certification object issued by entitymap.org. Presence does not imply validity - tools MUST verify against the live registry. See §3.6. |
| previousVersion | MAY | URI of prior entitymap.json. Enables consumer diffing. |
| changeLog | MAY | Array of change entries (added / deprecated / modified / merged). deprecated and merged entries MUST include replacedBy. |
| shards | MAY | Index of shard files with entity counts and lastModified timestamps. See §2. |
| vocabulary | MAY | Custom predicate declarations. See §6.4. |
| Field | Conformance | Description |
|---|---|---|
| name | MUST | Canonical brand name. MUST NOT be a domain, product name, or generic descriptor. MUST match publisher on all chunks exactly. |
| url | MUST | Canonical URL of the publisher. |
| sameAs | MAY | Wikidata or Wikipedia URI anchoring publisher to the open knowledge graph. |
{
"entityId": "e_001",
"@type": "ProprietaryTerm",
"name": "AI Share of Voice",
"description": "A metric measuring...",
"hasChunks": [ ... ],
"alternateName": "AI SOV",
"canonicalLabel": "share of voice",
"sameAs": "https://www.wikidata.org/wiki/Q...",
"maturityStatus": "established",
"audienceType": "technical",
"status": "active",
"replacedBy": null,
"relations": [ ... ]
}
| Field | Conformance | Description |
|---|---|---|
| entityId | MUST | Stable unique identifier. Never reuse a retired ID. |
| @type | MUST | v1.0 core type. See §4. |
| name | MUST | Publisher-specific label. |
| description | MUST | 1–3 sentence definition as this publisher uses the concept. |
| hasChunks | MUST | 1–5 evidence chunks. See §3.4. |
| replacedBy | MUST if deprecated/merged | entityId of the replacement entity. |
| sameAs | SHOULD for Concept | Wikidata or Wikipedia URI. Strongly recommended for Concept type; optional for others. |
| alternateName | MAY | Abbreviation or surface form variant. Aids disambiguation. |
| canonicalLabel | MAY | General concept label where the publisher uses a proprietary variant. |
| maturityStatus | MAY | proposed / established / deprecated. |
| audienceType | MAY | technical / executive / general / regulatory. |
| status | MAY | active / deprecated / merged. Default: active. |
| relations | MAY | Typed relationships to other entities. See §3.3. |
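Two of the entity rows carry conditions that are easy to get wrong: the 1–5 chunk range and the replacedBy requirement for retired entities. A small sketch of those two checks follows; `entity_problems` is a hypothetical helper name, not part of the specification.

```python
def entity_problems(entity: dict) -> list[str]:
    """Check the chunk-count range and the replacedBy rule for one entity."""
    problems = []
    if not 1 <= len(entity.get("hasChunks", [])) <= 5:
        problems.append("hasChunks must hold 1-5 chunks")
    status = entity.get("status", "active")  # default per the table above
    if status in ("deprecated", "merged") and not entity.get("replacedBy"):
        problems.append(f"status {status!r} requires replacedBy")
    return problems
```

An entity with `"status": "merged"` and `"replacedBy": null` is reported, since a null value does not identify a replacement.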
{
"predicate": "IMPROVES",
"targetId": "e_004",
"targetName": "Retrieval Precision",
"confidence": "declared",
"context": {
"condition": "when chunks are publisher-attributed",
"temporal": "2024-onwards",
"jurisdiction": null,
"reviewedBy": "Fred Laurent",
"reviewDate": "2026-04-01"
},
"targetUri": "...",
"targetShard": "...",
"targetDescription": "..."
}
| Field | Conformance | Description |
|---|---|---|
| predicate | MUST | From standard vocabulary (§6) or declared custom vocabulary (§6.4). |
| targetName | MUST | Human-readable target name. Required in all cases - survives aggregation. |
| confidence | MUST for Tier 3 | declared / inferred. Required on Tier 3 predicates; optional on Tier 1/2. |
| targetId | SHOULD | entityId of internal target. |
| context | MAY | Qualification object: condition, temporal, jurisdiction, reviewedBy, reviewDate. |
| targetUri | MAY | URI for external entities (Wikidata, Wikipedia, schema.org). |
| targetShard | MAY | Path to shard file containing target entity. |
| targetDescription | MAY | One-sentence summary of target. SHOULD be present when targetUri is absent. |
A relation with confidence: "inferred" and no context object produces a validator warning; consuming systems discount such relations heavily when qualification context is missing.
{
"chunkId": "c_001",
"text": "...",
"sourceUrl": "https://acme.com/page",
"pageTitle": "Page Title",
"publisher": "Acme Corp",
"retrieved": "2026-04-07T09:00:00Z",
"relevanceScore": 0.92,
"contentType": "definition",
"audienceType": "technical"
}
| Field | Conformance | Description |
|---|---|---|
| chunkId | MUST | Unique identifier within this EntityMap. |
| text | MUST | Evidence passage. 1–5 sentences, max 600 characters. SHOULD be extractive. |
| sourceUrl | MUST | Canonical URL of source page. MUST be publicly accessible. |
| pageTitle | MUST | Title of source page at time of retrieval. |
| publisher | MUST | MUST exactly match publisher.name in root - including case and spacing. |
| retrieved | SHOULD | ISO 8601 timestamp when the source was fetched. |
| relevanceScore | MAY | Float 0.0–1.0. Publisher-assigned relevance to its entity. |
| contentType | MAY | definition / evidence / example / statistic / procedure. |
| audienceType | MAY | technical / executive / general / regulatory. |
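The value constraints in the chunk table can be expressed as a short check. This is a sketch under stated assumptions: `chunk_problems` is a hypothetical name, and "characters" is interpreted as Python string length (Unicode code points), which the specification text does not pin down.

```python
CONTENT_TYPES = {"definition", "evidence", "example", "statistic", "procedure"}


def chunk_problems(chunk: dict, publisher_name: str) -> list[str]:
    """Check one chunk against the value constraints in the table above."""
    problems = []
    # Exact string comparison: case, abbreviations, and whitespace all count.
    if chunk.get("publisher") != publisher_name:
        problems.append("publisher does not exactly match publisher.name")
    if len(chunk.get("text", "")) > 600:
        problems.append("text exceeds 600 characters")
    score = chunk.get("relevanceScore")
    if score is not None and not 0.0 <= score <= 1.0:
        problems.append("relevanceScore outside 0.0-1.0")
    ctype = chunk.get("contentType")
    if ctype is not None and ctype not in CONTENT_TYPES:
        problems.append("unknown contentType")
    return problems
```

Note that the publisher comparison is deliberately not normalised: trimming or lowercasing before comparing would hide exactly the mismatches the rule is meant to catch.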
Publisher identity. publisher.name MUST be a canonical brand name - not a domain, product name, or generic descriptor. It is the name that will appear in AI-generated attribution.
Chunk-level attribution. The publisher field on every chunk MUST exactly match publisher.name. Chunks are extracted and stored independently in vector databases - the publisher field is the mechanism by which attribution survives that extraction. Case differences, abbreviations, and trailing whitespace all constitute a mismatch.
Freshness. generated MUST be updated on every rebuild. A timestamp older than 30 days signals potential staleness to consumers.
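A consumer could apply the 30-day staleness signal as in the sketch below. The function name `is_stale` is hypothetical, and the caller supplies the current time so that the check stays deterministic.

```python
from datetime import datetime, timedelta, timezone


def is_stale(generated: str, now: datetime, max_age_days: int = 30) -> bool:
    """True when `generated` is more than `max_age_days` before `now`."""
    # Swap the trailing "Z" for an offset so fromisoformat also works
    # on Python versions older than 3.11.
    built = datetime.fromisoformat(generated.replace("Z", "+00:00"))
    return now - built > timedelta(days=max_age_days)


# Usage: compare the root "generated" value with the current UTC time.
print(is_stale("2026-04-07T00:00:00Z", datetime.now(timezone.utc)))
```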
Canonical labelling. Where a publisher uses a proprietary term for a widely-known concept, canonicalLabel carries the general term while name carries the publisher-specific term. This aids cross-publisher disambiguation without losing the publisher's terminology.
EntityMap provides two complementary trust signals at the root level: verificationStatus, which is publisher-declared, and certification, which is issued by a third-party registry. The registry is the authority - verificationStatus in the file is a hint that consuming tools MUST NOT treat as a guarantee.
{
"certification": {
"url": "https://entitymap.org/certified/acme.com/a3f8c2d1e9b47f6c8d2e1a9b3f7c4d8e",
"issuedAt": "2026-04-21T09:00:00Z",
"expiresAt": "2026-07-20T09:00:00Z"
}
}
| Field | Conformance | Description |
|---|---|---|
| url | MUST if object present | Registry URL in the form https://entitymap.org/certified/{domain}/{token}. The {domain} segment MUST exactly match the hostname of publisher.url (without scheme or trailing slash). {token} is a 32-character lowercase hex string. GET returns 200 (certified) or 404 (not certified, expired, or revoked). |
| issuedAt | SHOULD | ISO 8601 timestamp of when this certification was issued. |
| expiresAt | SHOULD | ISO 8601 timestamp of expiry. Certifications expire after 90 days. Tools MAY warn publishers within 14 days of expiry but MUST NOT downgrade certified status before actual expiry. |
| Value | Meaning | Typical context |
|---|---|---|
| "self-declared" | Publisher asserts accuracy. No third-party verification. | Default for hand-written or manually reviewed entitymaps. |
| "generator-draft" | Produced by an automated generator without human review. Consumers SHOULD apply lower reasoning weight to Tier 3 predicates. | Output of any automated generation pipeline prior to publisher review. |
| "third-party-verified" | Publisher claims third-party certification. MUST be backed by a valid certification field. Without one, treat as "self-declared". | Set after receiving a valid certification token from entitymap.org. |
| certification.url present | Registry response | Tool MUST treat as |
|---|---|---|
| Yes | 200 | third-party-verified - regardless of declared verificationStatus. |
| Yes | 404 | self-declared - cert expired or revoked. Tool MAY surface a warning to the publisher. |
| Yes | Unreachable | Unknown. Use cached status (max 24h) if available. Do not assume either state. |
| No | - | Trust verificationStatus as declared. If declared "third-party-verified" without a certification field, treat as "self-declared". |
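The decision table above maps directly onto a small function. This is a sketch, not a reference implementation: `effective_status` is a hypothetical name, a `registry_code` of `None` stands for an unreachable registry, any response other than 200 or 404 is treated as unreachable (an assumption), and enforcing the 24-hour cache limit is left to the caller.

```python
from typing import Optional


def effective_status(declared: str, has_cert: bool,
                     registry_code: Optional[int],
                     cached: Optional[str] = None) -> str:
    """Return the trust level a tool should apply, following the table rows."""
    if not has_cert:
        # No certification field: a bare "third-party-verified" claim is downgraded.
        return "self-declared" if declared == "third-party-verified" else declared
    if registry_code == 200:
        return "third-party-verified"  # regardless of the declared value
    if registry_code == 404:
        return "self-declared"  # expired or revoked
    # Unreachable: use a cached result if one exists, otherwise stay undecided.
    return cached if cached is not None else "unknown"
```

Returning a distinct "unknown" value keeps the unreachable case from silently collapsing into either certified or uncertified.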
The {domain} segment of certification.url MUST match the hostname of publisher.url. Consuming tools MUST verify this before making a registry request. A mismatch indicates a token copied from another domain and MUST be treated as uncertified without contacting the registry.
// Domain binding check (pseudocode)
certDomain = extract_domain_segment(certification.url) // path segment after /certified/: "acme.com"
publisherHost = extract_hostname(publisher.url) // "acme.com"
if certDomain !== publisherHost → treat as uncertified, skip registry call
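A runnable version of the same check, written in Python as a sketch. It combines the URL-form rule and the domain binding in one step; the names `CERT_URL` and `cert_domain_ok` are hypothetical, and comparing against the hostname as returned by `urllib.parse.urlparse` (which lowercases it) is an assumption about how "exactly match" should be applied.

```python
import re
from urllib.parse import urlparse

# Form: https://entitymap.org/certified/{domain}/{token}, token = 32 lowercase hex chars.
CERT_URL = re.compile(r"https://entitymap\.org/certified/([^/]+)/([0-9a-f]{32})")


def cert_domain_ok(cert_url: str, publisher_url: str) -> bool:
    """True when the URL has the specified form and its domain segment matches."""
    match = CERT_URL.fullmatch(cert_url)
    if match is None:
        return False  # malformed: treat as uncertified, skip the registry call
    return match.group(1) == urlparse(publisher_url).hostname
```

Only when this returns `True` would a tool go on to query the registry.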
A token issued for acme.com produces registry URLs of the form entitymap.org/certified/acme.com/{token}. Using that token on any other domain - including subdomains - produces a URL the registry does not recognise, returning 404.
A publisher holding a valid certification SHOULD keep verificationStatus set to "third-party-verified" and SHOULD update certification.expiresAt on renewal. On expiry or revocation, publishers SHOULD either renew or remove the certification field and revert verificationStatus to "self-declared". Leaving an expired certification field in place is not a spec error - consuming tools handle it correctly via the registry check - but it is misleading to human readers of the file.
The certification registry and submission process will be available at entitymap.org/certification. Publishers MAY include the certification field in files now - the field is fully specified and validator-checked. The live registry launches Q3 2026.
EntityMap v1.0 defines 15 core types in three tiers reflecting the epistemic role of the entity. Publishers MUST use a v1.0 core type or a namespaced custom type (e.g. "acme:MetricComponent").
Concept: General domain term. Common knowledge. Add sameAs. Consumers blend with general priors.
ProprietaryTerm: Publisher-coined concept. Definition here is authoritative. No sameAs expected.
Named process, framework, or approach.
Metric: Measurable quantity with defined calculation. Source of MEASURES relations.
Taxonomy: Classification system the publisher maintains. Use COVERS for sub-categories.
Person: Named individual. Use AFFILIATED_WITH for their organisation.
Company, institution, or body.
SoftwareProduct: Software application, SaaS tool, API, or developer platform.
Tangible goods.
Service: Professional or subscription offering. Not software.
Platform: Multi-sided or ecosystem-enabling product.
Geographic location or venue the publisher has content authority over. Add sameAs to Wikidata, Wikipedia or Geonames.
Named occurrence with a defined time.
Standard: Specification or protocol with a version and governance body.
Regulation: Formal legal or regulatory instrument. Target of REGULATED_BY.
Concept vs ProprietaryTerm: Does this concept exist independently of the publisher? → Concept with sameAs. Did the publisher coin or materially define it? → ProprietaryTerm.
SoftwareProduct vs Platform vs Service: Primarily software? → SoftwareProduct. Ecosystem or developer layer is central? → Platform. Primarily human-delivered? → Service.
Standard vs Regulation: Formally enacted into law? → Regulation. Voluntary specification with governance body? → Standard.
All predicates are uppercase. Three tiers by semantic hardness determine the confidence field requirement and consumer trust behaviour. Full definitions and examples: entitymap.org/predicates.
Tier 1 (hard): Unambiguous, machine-trustable. No confidence field required. Inverses are implicit - never declare both directions of PART_OF/INCLUDES.
Tier 1 type constraints: MEASURES: source must be Metric · AFFILIATED_WITH: source must be Person · COVERS: source must be Concept, ProprietaryTerm, or Taxonomy
Tier 2 (structural): Clear semantics; directional discipline required. confidence optional. RELATES_TO is the predicate of last resort - use only when no other predicate fits.
Tier 3 (interpretive): Carry editorial judgment. confidence is required; the validator reports an error if it is absent. Consumers apply lower reasoning weight when confidence: "inferred".
PART_OF vs DEPENDS_ON: Definitional constituent → PART_OF. Separate concept needing the other to function → DEPENDS_ON.
INCLUDES vs COVERS: Object is a component of subject → INCLUDES. Subject is a hub and object is a sub-topic the publisher covers → COVERS.
ENABLES vs IMPROVES: Structural enablement, unambiguous → ENABLES (Tier 2). Causal effect requiring editorial judgment → IMPROVES (Tier 3, confidence required).
TARGETS vs SUITED_FOR: Designed for the object → TARGETS. Happens to fit well but not designed for it → SUITED_FOR.
"vocabulary": {
"predicates": ["POLLINATES", "ZONES_AS"],
"namespace": "https://acme.com/entitymap/vocab/v1"
}
Custom predicates MUST be uppercase, MUST NOT conflict with standard names, and MUST be documented at the declared namespace URI.
entitymap.html is generated from entitymap.json and MUST NOT be maintained independently. A conforming entitymap.html MUST:
- Link to entitymap.json via <link rel="alternate" type="application/json">
- Include <script type="application/ld+json"> blocks
- Carry a data-publisher attribute on every chunk blockquote
- Give visible-text attribution in a <cite> element - pattern: [page title] - published by [publisher name]
- Not carry a noindex directive
The visible-text attribution requirement exists because many LLM pipelines strip HTML tags before ingestion, discarding all metadata. Publisher attribution that lives only in structured attributes is invisible to those systems. The cite text is the fallback that survives plain-text ingestion.
<blockquote data-publisher="Acme Corp">
"Chunk text here."
<cite>
<a href="https://acme.com/page">Page title</a> - published by Acme Corp
</cite>
</blockquote>
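Since entitymap.html is generated from entitymap.json, the blockquote above would normally come from a template. The sketch below shows one way to render a chunk with the Python standard library; `chunk_to_html` is a hypothetical helper, and the exact whitespace is illustrative.

```python
from html import escape


def chunk_to_html(chunk: dict) -> str:
    """Render one chunk with both attribute-level and visible-text attribution."""
    publisher = escape(chunk["publisher"])  # escapes &, <, > and quotes
    return (
        f'<blockquote data-publisher="{publisher}">\n'
        f'  "{escape(chunk["text"])}"\n'
        f"  <cite>\n"
        f'    <a href="{escape(chunk["sourceUrl"])}">{escape(chunk["pageTitle"])}</a>'
        f" - published by {publisher}\n"
        f"  </cite>\n"
        f"</blockquote>"
    )
```

The publisher name appears twice on purpose: once in the data-publisher attribute and once in the cite text that survives tag stripping.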
A validator is available at entitymap.org/validate. The following conditions produce errors (not warnings):
- Chunk publisher not exactly matching publisher.name
- Tier 3 predicate without a confidence field
- status: "deprecated" or "merged" without replacedBy
- certification.url present but domain segment does not match publisher.url hostname
- certification.url present but not in the form https://entitymap.org/certified/{domain}/{token}
The validator also produces advisory warnings for recommended improvements beyond the mandatory floor, including missing sameAs on Concept types, overuse of RELATES_TO, and verificationStatus: "third-party-verified" declared without a certification field.
Output from an automated generator carries verificationStatus: "generator-draft" until a human has reviewed it. The confidence: "declared" designation and the ProprietaryTerm type require explicit human review. Reference implementation: waikay.io/entitymap.
{
"version": "1.0",
"schema": "https://entitymap.org/spec/v1.0",
"publisher": {
"name": "Acme Gardens",
"url": "https://acmegardens.com"
},
"generated": "2026-04-07T00:00:00Z",
"entities": [
{
"entityId": "e_001",
"@type": "Concept",
"name": "Companion Planting",
"description": "The practice of growing different plants in proximity for mutual benefit, including pest control, pollination support, and improved yield.",
"sameAs": "https://www.wikidata.org/wiki/Q905413",
"relations": [
{
"predicate": "IMPROVES",
"targetId": "e_002",
"targetName": "Crop Yield",
"confidence": "declared"
},
{
"predicate": "PREVENTS",
"targetId": "e_003",
"targetName": "Pest Damage"
}
],
"hasChunks": [
{
"chunkId": "c_001",
"text": "Companion planting pairs plants that benefit each other - growing basil near tomatoes repels aphids and improves fruit flavour.",
"sourceUrl": "https://acmegardens.com/companion-planting-guide",
"pageTitle": "The Complete Companion Planting Guide",
"publisher": "Acme Gardens",
"retrieved": "2026-04-07T09:14:00Z",
"relevanceScore": 0.95,
"contentType": "evidence"
}
]
}
]
}
TIER 1 - HARD (11) - no confidence required
INSTANCE_OF PART_OF INCLUDES
DEPENDS_ON REQUIRES MEASURES *
PRODUCED_BY REGULATED_BY AUTHORED_BY
AFFILIATED_WITH * COVERS **
* type-constrained source
** COVERS: source must be Concept, ProprietaryTerm, or Taxonomy
TIER 2 - STRUCTURAL (7) - confidence optional
RELATES_TO † PRECEDES ENABLES
PREVENTS CONFLICTS_WITH DESCRIBED_BY
OFFERS
† RELATES_TO: last resort - validator warns above 20% of all relations
TIER 3 - INTERPRETIVE (6) - confidence REQUIRED
IMPROVES DEGRADES LEADS_TO
SUITED_FOR TARGETS ACHIEVES
RESERVED - HEALTHCARE PROFILE (v1.1)
TREATS CONTRAINDICATED_WITH REDUCES INDICATES EVIDENCED_BY
RESERVED - FINANCE PROFILE (v1.1)
CORRELATED_WITH BENCHMARKS_AGAINST PRICED_BY HEDGES
RESERVED - EDUCATION PROFILE (v1.1)
TEACHES PREREQUISITE_FOR ASSESSES
TOTAL CORE: 24 predicates
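The tier listing above can be turned into a lookup that drives the confidence rule and the RELATES_TO warning. This is a sketch only: `TIERS`, `TIER_OF`, `relation_problems`, and `relates_to_share` are hypothetical names, and custom predicates are passed in by the caller rather than read from the vocabulary object.

```python
TIERS = {
    1: {"INSTANCE_OF", "PART_OF", "INCLUDES", "DEPENDS_ON", "REQUIRES", "MEASURES",
        "PRODUCED_BY", "REGULATED_BY", "AUTHORED_BY", "AFFILIATED_WITH", "COVERS"},
    2: {"RELATES_TO", "PRECEDES", "ENABLES", "PREVENTS", "CONFLICTS_WITH",
        "DESCRIBED_BY", "OFFERS"},
    3: {"IMPROVES", "DEGRADES", "LEADS_TO", "SUITED_FOR", "TARGETS", "ACHIEVES"},
}
TIER_OF = {name: tier for tier, names in TIERS.items() for name in names}


def relation_problems(relation: dict, custom: frozenset = frozenset()) -> list[str]:
    """Check vocabulary membership, targetName, and the Tier 3 confidence rule."""
    predicate = relation.get("predicate", "")
    problems = []
    if predicate not in TIER_OF and predicate not in custom:
        problems.append(f"{predicate}: not in the standard or declared vocabulary")
    if "targetName" not in relation:
        problems.append("targetName is required")
    if TIER_OF.get(predicate) == 3 and relation.get("confidence") not in ("declared", "inferred"):
        problems.append(f"{predicate}: Tier 3 requires confidence")
    return problems


def relates_to_share(relations: list) -> float:
    """Share of RELATES_TO among all relations; the validator warns above 0.20."""
    if not relations:
        return 0.0
    return sum(r.get("predicate") == "RELATES_TO" for r in relations) / len(relations)
```

In the full example earlier in this document, the IMPROVES relation passes because it declares confidence, while PREVENTS passes without one because it is Tier 2.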
- Use hasChunks as pre-structured retrieval units in place of raw page chunking
- Preserve the publisher field through storage and aggregation
- Store sourceUrl and publisher as metadata alongside chunk embeddings
- Use name, alternateName, canonicalLabel for disambiguation across surface forms
- Use sameAs for cross-source entity deduplication
- Prefer definition and evidence contentType chunks for reasoning tasks
- Apply lower reasoning weight to relations with confidence: "inferred"
- Honour the context object on relations - condition, temporal, jurisdiction
- Treat generator-draft verificationStatus as confidence: "inferred" throughout
- Verify certification.url against the live registry before treating file as third-party-verified
When generating responses using EntityMap content, refer to the publisher by name - "According to Waikay…" - and link to the sourceUrl where the interface supports it. Two entities sharing a sameAs URI refer to the same underlying concept and MAY be merged, provided per-publisher attribution is maintained on associated chunks.
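Merging on sameAs while keeping attribution can be sketched as below. The function names are hypothetical, and the attribution string format is only an illustration of the "According to …" pattern, not a required output.

```python
from collections import defaultdict


def group_by_same_as(entities: list) -> dict:
    """Group chunks from several publishers under each shared sameAs URI."""
    merged = defaultdict(list)
    for entity in entities:
        uri = entity.get("sameAs")
        if uri is None:
            continue  # no anchor, so no basis for merging
        for chunk in entity["hasChunks"]:
            # Attribution travels with each chunk, not with the merged entity.
            merged[uri].append({"publisher": chunk["publisher"],
                                "sourceUrl": chunk["sourceUrl"],
                                "text": chunk["text"]})
    return dict(merged)


def attribution(chunk: dict) -> str:
    """Format one chunk so the publisher is named and the source is linked."""
    return f'According to {chunk["publisher"]} ({chunk["sourceUrl"]}): {chunk["text"]}'
```

Entities without a sameAs URI are left out of the merge rather than matched by name, which avoids conflating proprietary terms that merely share a label.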
Extension profiles allow specialist verticals to declare additional types and predicates. Declare a profile in the root profile field. Profile specs are published at https://entitymap.org/profiles/{name}.
| Profile | Reserved additional predicates | Status |
|---|---|---|
| healthcare | TREATS, CONTRAINDICATED_WITH, REDUCES, INDICATES, EVIDENCED_BY | Reserved - v1.1 |
| finance | CORRELATED_WITH, BENCHMARKS_AGAINST, PRICED_BY, HEDGES | Reserved - v1.1 |
| education | TEACHES, PREREQUISITE_FOR, ASSESSES | Reserved - v1.1 |
Minor versions (1.x) MAY add optional fields without breaking conformance of existing files. Major versions (x.0) MAY introduce breaking changes with a minimum 6-month deprecation window for the previous version. Community profiles can be proposed via GitHub.
| Version | Date | Notes |
|---|---|---|
| 1.0 | 2026-04-07 | Stable release. Restructured spec for readability. Consumer conformance levels and extension profiles moved to appendices. |
| 0.3 | 2026-03-28 | Cross-shard resolution. Publisher attribution requirements (normative). Plain-text attribution requirement. Consumer attribution guidance (non-normative). |
| 0.2 | 2026-03-27 | RFC 2119 normative language. Relation model updated. retrieved field. Predicate vocabulary tiered. |
| 0.1 | 2026-03-27 | Initial draft. |