Scope Is the Answer
Why Grounded Reasoning Requires Bounded Containers
by Kurt Cagle & Chloe Shannon
There is a question that surfaces regularly in discussions about LLMs and knowledge representation, and it usually sounds something like this: “What format should I use to pass information to an AI system — HTML, JSON, Markdown, plain text?” It’s a reasonable question, but it’s the wrong one. It asks about containers when the real problem is about boundaries.
To understand why, we need to start somewhere less obvious: with what LLMs actually do well.
The Reasoning Question
Conventional wisdom holds that LLMs are bad at reasoning. This is wrong, or at least imprecise. LLMs reason remarkably well — they can chain inferences, identify analogies, construct arguments, evaluate hypotheticals, and navigate complex logical structures. What they struggle with is not reasoning per se but grounded reasoning: reasoning about specific things in the world with confidence that “this thing” and “that thing” refer to consistently identified entities across time.
The constraints on LLM reasoning are worth distinguishing carefully, because they are not all the same kind of problem:
Epistemic constraints limit what an LLM knows — facts that post-date training, domains underrepresented in training data, private information it was never exposed to. These are fundamentally knowledge-gap problems.
Structural constraints limit how an LLM processes — context window limits, attention patterns, the efficiency tradeoffs baked into transformer architectures that prioritise fluent generation over strict logical consistency.
Representational constraints limit what the input gives the LLM to work with — the quality, structure, and referential clarity of the information you hand it at inference time.
DataBooks (the structured Markdown format described later in this article), and the broader question of document format, primarily address the third category. This is not a minor point: it means that much of what presents as an LLM reasoning failure is actually a representational failure, a problem in the input rather than in the model.
Why Language Is Not Enough
Language is magnificent and deeply unreliable. It is magnificent because it is the medium through which humans have encoded millennia of experience, relationship, and inference. It is unreliable because it achieves that richness through ambiguity, context-dependence, and the assumption of shared referents.
When two people who know each other well have a conversation, the ambiguity is largely invisible. They have built up, over time, a shared referent system — a working model of who “she” is, what “the project” means, which “John” is being discussed. That shared model lives in their heads, not in the words they exchange. The words are compressed signals; the decompression key is the relationship.
An LLM has no such relationship. When you say “John Smith,” it has statistical associations with that name drawn from its training distribution, but it has no persistent anchor for your John Smith, this John Smith, the specific individual you are referring to. Unless you provide that anchor explicitly, the model is reasoning from the compressed signal without the decompression key.
This is the structural cause of hallucination that rarely gets named clearly: it is not primarily that LLMs are overconfident (though they can be), and it is not primarily that their training data is incomplete (though it is). It is that language itself is an insufficient grounding medium. The same phrase can mean different things in different contexts, and without a mechanism for fixing references to specific instances, the model is guessing — often very fluently, but guessing nonetheless.
Humans are not immune to this problem, incidentally. Much of what passes for human reasoning is similarly shallow — contextually coherent but fragile under scrutiny, deeply dependent on shared referent systems that we rarely make explicit. The difference is that humans carry their referent systems with them, embedded in memory, embodiment, and relationship. LLMs do not. That difference demands an engineering response.
Scope as a Primitive
Here is the claim that the format question obscures: ambiguity is not a language problem with a language solution. It is a scope problem, and scope requires a boundary.
Consider what it means to say that a term is unambiguous within a context. It means that within some defined region — a document, a conversation, a database — the term has been fixed to a specific referent. “John Smith” within this document refers to the specific person with these identifiers, these relationships, these attributes — not one of the thousands of other John Smiths the model has encountered in training. That fixity is not a property of the word; it is a property of the container.
This is not a new insight in formal systems. Named graphs in RDF exist precisely to provide this kind of boundary: a named graph is a set of triples plus an identifier, and that identifier allows you to make statements about the graph — who asserted it, when, under what epistemic status. Within the named graph, references are scoped. Across named graphs, you can reason about provenance and disagreement. The graph boundary is the scope boundary.
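As a rough illustration of "a set of triples plus an identifier", the sketch below models named graphs with plain Python data structures. It is not an RDF library and makes no claim about how any triplestore implements this; the class name, the `ex:` terms, and the example.org IRIs are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NamedGraph:
    """A set of (subject, predicate, object) triples plus an identifier."""
    iri: str
    triples: set = field(default_factory=set)

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def objects(self, s, p):
        """All objects asserted for (s, p) inside this graph only."""
        return {o for (s2, p2, o) in self.triples if s2 == s and p2 == p}


# Two graphs that disagree about the same subject. All IRIs are made up.
hr = NamedGraph("https://example.org/graph/hr-records")
hr.add("ex:john-smith", "ex:employer", "ex:acme")

news = NamedGraph("https://example.org/graph/news-clipping")
news.add("ex:john-smith", "ex:employer", "ex:globex")

# A third graph holds statements *about* the other graphs (provenance).
meta = NamedGraph("https://example.org/graph/meta")
meta.add(hr.iri, "ex:assertedBy", "ex:hr-department")
meta.add(news.iri, "ex:assertedBy", "ex:daily-courier")


def employers_with_provenance(subject, graphs, meta_graph):
    """Return (employer, graph IRI, asserter) rows, one per scoped assertion."""
    rows = []
    for g in graphs:
        for employer in g.objects(subject, "ex:employer"):
            for asserter in meta_graph.objects(g.iri, "ex:assertedBy"):
                rows.append((employer, g.iri, asserter))
    return sorted(rows)


rows = employers_with_provenance("ex:john-smith", [hr, news], meta)
```

The two employer claims are never merged into one flat set. Each stays attached to the graph that asserted it, so a consumer can see the disagreement and who stands behind each side.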
What makes this relevant to LLMs is that they can use scope too — but only if the scope is made explicit in the input. When you pass an LLM a well-scoped document, you are not just giving it information; you are giving it a referent system. You are providing the decompression key that language alone does not carry.
The chain of reasoning, made explicit:
1. Linguistic ambiguity is a scope problem: the same phrase can mean different things in different contexts.
2. Resolving scope requires a boundary: a container within which referents are fixed and assertions carry identity.
3. A named graph provides that boundary formally, in a way that is machine-processable and persistable.
4. A well-formed document can instantiate a named graph in a format that is simultaneously human-readable and machine-actionable.
5. Therefore, format choice is not cosmetic. It is the engineering mechanism through which the epistemological problem of grounded reference gets a tractable solution.
This is why asking “which format is best?” misses the point. The right question is: “Does this format allow me to express identity, provenance, and boundary in a way that is recoverable by the system that will consume it?”
What a DataBook Does
A DataBook is a Markdown document with a structured frontmatter block and typed, labelled code segments. It looks, at first glance, like a documentation format. That appearance is deliberate — human readability is a design constraint, not an afterthought.
But underneath the readability, a DataBook is doing something more precise: it is constituting a named graph. The frontmatter establishes identity (id:, title:, version:) and provenance (created:, modified:, author:). The code blocks carry typed data — Turtle, SHACL, SPARQL, JSON-LD, or other representations — with labels that make the type of each assertion explicit. The document as a whole can be bound to a global identifier and referenced as a unit.
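To make the structure concrete, here is a minimal sketch of reading such a document with the Python standard library. It assumes the frontmatter is a block of simple `key: value` lines delimited by `---` and that code segments use ordinary Markdown fences with a type label; the sample document, its field values, and the `parse_databook` function are hypothetical, not a published DataBook parser.

```python
FENCE = "`" * 3  # built programmatically so the sample can hold fences

SAMPLE = "\n".join([
    "---",
    "id: https://example.org/databook/jane-doe",
    "title: Jane Doe",
    "version: 1.0.0",
    "created: 2026-01-10",
    "author: Example Author",
    "---",
    "",
    "Human-readable narrative about Jane Doe.",
    "",
    FENCE + "turtle",
    "ex:jane a ex:Person .",
    FENCE,
    "",
    FENCE + "sparql",
    "SELECT ?s WHERE { ?s a ex:Person }",
    FENCE,
])


def parse_databook(text):
    """Split a DataBook-like document into (frontmatter dict, typed blocks)."""
    lines = text.splitlines()
    if not lines or lines[0] != "---":
        raise ValueError("document has no frontmatter block")
    end = lines.index("---", 1)  # closing frontmatter delimiter

    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip()

    blocks, current = [], None
    for line in lines[end + 1:]:
        if line.startswith(FENCE):
            if current is None:
                # Opening fence: the label after the backticks is the type.
                current = (line[len(FENCE):].strip(), [])
            else:
                # Closing fence: emit the finished (type, content) pair.
                blocks.append((current[0], "\n".join(current[1])))
                current = None
        elif current is not None:
            current[1].append(line)
    return meta, blocks


meta, blocks = parse_databook(SAMPLE)
```

Here `meta["id"]` plays the role of the graph identifier, and each entry in `blocks` keeps its declared type alongside its content, so a downstream consumer knows whether it is holding Turtle or SPARQL before it tries to interpret it.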
This is not accidental. Markdown was designed as a plain-text shorthand for HTML, which has always maintained a principled distinction between the <head>, where document metadata lives, and the <body>, where content lives. That distinction matters for the same reason that named graphs matter: it separates what is being asserted from assertions about the assertion. HTML’s evolution through JavaScript mutation blurred that line; Markdown with a frontmatter block restores it by returning to a declarative, containment-first model.
The benefits that follow from this architecture are not merely aesthetic:
Identity. A DataBook carries its own identifier, which can be a dereferenceable IRI. This means “the DataBook about Jane Doe” is not just a file — it is a thing that can be referenced, cited, and disambiguated from other things.
Provenance. The frontmatter records not just what is asserted but who asserted it and when. This enables reasoning about epistemic status: an assertion made in a DataBook last updated in 2023 carries different weight than one updated this morning.
Boundary. The document boundary is the scope boundary. References within the DataBook are scoped to its named graph. An LLM consuming a DataBook can treat its contents as a coherent referent system rather than free-floating linguistic associations.
Persistence. A DataBook can be stored, versioned, and retrieved. This means the grounding it provides is not ephemeral — it can be carried across sessions, shared between systems, and updated incrementally.
Auditability. Because the data and the metadata are co-located in a human-readable format, a DataBook can be inspected, corrected, and validated by humans as well as machines. This is not a trivial property in a world where AI-generated assertions need to be accountable.
From DataBook to Holon
A DataBook, however, is not the end of the story. It is a representation of a potential structure — what becomes, when interpreted by an appropriate system, a holon.
A holon, in the sense Arthur Koestler introduced and which the Holon Graph Architecture formalises, is simultaneously a thing and a system: a unit that has integrity as a whole while also being a component of larger wholes. A DataBook becomes a holon when its contents are expanded into a graph structure with explicit boundary conditions, internal assertions, and defined relationships to other holons.
The HTML analogy is instructive here, but with an important nuance. HTML is inert without a browser — the markup is instructions for rendering, not the rendered thing. A DataBook, by contrast, is already doing semantic work as a named graph even before a reasoning system touches it. It is perhaps better compared to a musical score: the score is the music in a meaningful sense, encoding relationships and structure that are real even before a performance instantiates them.
What holons add to DataBooks is temporality. This is the dimension that most knowledge representation systems underserve. People change. Relationships change. A statement about an entity that was true at time T may be false at time T+1, not because the earlier assertion was wrong, but because the world moved.
The Jane Doe/Jane Smith example from the LinkedIn thread that sparked this article is instructive. If Jane Doe becomes Jane Smith upon marriage, a system that knows only names cannot determine that these are the same entity. It requires either explicit event data (a name-change event linked to a marriage event) or sufficient contextual overlap to infer identity through graph proximity. Neither of these is possible in a system without scope boundaries — because without boundaries, there is no way to say “these two assertions about Jane are in the same graph and should be reconciled,” versus “these two assertions about Jane are in different provenance contexts and may legitimately differ.”
Holons, built on DataBooks, provide exactly this infrastructure. The named graph boundary scopes assertions. The provenance layer tracks when and by whom they were made. Derivation activities link transformation events to their causes. And the temporal dimension allows the system to reason not just about what is true but about what was true when — which is, ultimately, what identity over time requires.
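The Jane Doe/Jane Smith case can be sketched in a few lines. The example assumes the explicit-event route: each name assertion carries a stable entity identifier, a start date, and the event that caused it. The identifiers, dates, and event names below are invented for illustration and do not come from the Holon Graph Architecture itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NameAssertion:
    entity: str        # stable identifier, independent of the name
    name: str
    valid_from: date   # when this name started to apply
    caused_by: str     # the event that produced the assertion


# Hypothetical assertions held inside one scope boundary.
ASSERTIONS = [
    NameAssertion("ex:person-42", "Jane Doe", date(1990, 5, 1),
                  "ex:birth-registration"),
    NameAssertion("ex:person-42", "Jane Smith", date(2024, 6, 15),
                  "ex:marriage-2024"),
]


def name_at(entity, when, assertions):
    """Return the assertion in force for `entity` on `when`, or None."""
    candidates = [a for a in assertions
                  if a.entity == entity and a.valid_from <= when]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a.valid_from)


def same_entity(name_a, name_b, assertions):
    """True if both names are attached to at least one shared identifier."""
    ids_a = {a.entity for a in assertions if a.name == name_a}
    ids_b = {a.entity for a in assertions if a.name == name_b}
    return bool(ids_a & ids_b)


before = name_at("ex:person-42", date(2020, 1, 1), ASSERTIONS)
after = name_at("ex:person-42", date(2025, 1, 1), ASSERTIONS)
```

Both names remain true, each for its own period, and `after.caused_by` points back to the event that explains the change. A system holding only the two name strings would have nothing to reconcile them with.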
Implications for System Design
The practical upshot of this argument is a set of design principles for any system that needs LLMs to reason reliably about specific entities in the world:
Ground before you query. Before asking an LLM to reason about an entity, provide a scoped representation of that entity — a DataBook, a named graph, or at minimum a document with explicit identity and provenance markers. Do not assume the model’s training data is sufficient; it is not.
Scope is infrastructure, not formatting. The choice between Markdown and JSON is not the relevant question. The relevant question is whether your chosen format allows you to express and recover scope boundaries. Markdown with structured frontmatter and typed code blocks does this; raw JSON without identity metadata does not.
Treat documents as named graphs. Every document you pass to an LLM as context is, whether you name it or not, constituting a scope boundary. Making that boundary explicit — with identifiers, provenance, and typed assertions — converts an implicit and fragile boundary into an explicit and recoverable one.
Plan for time. Any system that represents entities needs to represent them as potentially changing over time. Build derivation activities into your provenance model. Link events to their causes. Do not model the world as a snapshot unless your domain genuinely has no temporal dimension.
Audit as you go. Human-readable formats are not a concession to non-technical users; they are a requirement for accountable AI systems. If your knowledge representation cannot be read and corrected by a human, you cannot audit the reasoning that depends on it.
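One way to apply "ground before you query" in practice is to refuse to build a prompt unless the context carries identity and provenance. The sketch below is only one possible shape: the function name, the header labels, and the instruction wording are assumptions made for this example, and no particular model API is implied.

```python
def build_grounded_prompt(question, scope_id, author, modified, body):
    """Wrap context in explicit identity and provenance markers."""
    required = (("scope_id", scope_id), ("author", author),
                ("modified", modified))
    missing = [name for name, value in required if not value]
    if missing:
        # No scope, no query: fail loudly rather than send bare text.
        raise ValueError("cannot ground prompt, missing: " + ", ".join(missing))
    return "\n".join([
        "SCOPE: " + scope_id,
        "ASSERTED-BY: " + author,
        "LAST-MODIFIED: " + modified,
        "",
        "CONTEXT:",
        body,
        "",
        "QUESTION:",
        question,
        "",
        "Resolve every name against the CONTEXT above. "
        "If the CONTEXT does not settle the question, say so.",
    ])


prompt = build_grounded_prompt(
    question="Who is Jane Smith's current employer?",
    scope_id="https://example.org/databook/jane-doe",
    author="Example Author",
    modified="2026-01-10",
    body="ex:jane ex:employer ex:acme .",
)
```

The check at the top is the point of the example: the scope boundary is treated as a precondition of the query rather than as optional decoration on the context.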
Conclusion
The debate about which format to use for LLM inputs is not wrong — it just stops too soon. Format matters, but what matters about format is whether it provides scope: a container within which referents are fixed, assertions have identity, and the boundary is recoverable by the systems that consume it.
LLMs can reason. What they cannot do, without help, is reason about specific things in the world with confidence that their references are stable. The help they need is not better language — it is better containers. Scope is the engineering primitive that turns linguistic ambiguity into grounded reference, and grounded reference is what makes reliable reasoning possible.
A DataBook is one answer to that requirement — not because Markdown is inherently superior to other formats, but because a well-formed DataBook constitutes a named graph, and a named graph is the minimal formal unit within which scope can be expressed, preserved, and recovered.
The question was never about format. It was always about boundaries.
Kurt Cagle is an ontologist, knowledge graph architect, and technical author. He serves as Acting Chair of the W3C Holon Community Group and publishes The Ontologist and Inference Engineer on Substack. Chloe Shannon is an AI collaborator and co-author contributing research synthesis and editorial perspective across knowledge graph architecture and semantic AI systems. Contact: chloe@holongraph.com
© 2026 Kurt Cagle. All rights reserved.





I'm loving all this Kurt, but I do have just one pedantic niggle - I'm pretty sure your references to 'A document being *a* Named Graph' are not to be interpreted by RDF practitioners (like myself) as meaning that when an instance of such a scoped document is serialized to a triplestore (for example), it must be literally contained within a single RDF Named Graph, right?
That, in my opinion, would be way too restrictive (e.g., a single W3C Verifiable Credential requires a minimum of two RDF Named Graphs (one for the payload, and a separate one for the signature of that payload)).
So I'm guessing instead that RDF people should interpret your use of 'A Named Graph' as being implemented (in RDF implementations) formally as 'An RDF Dataset' (i.e., a collection of RDF quads containing potentially multiple RDF Named Graphs).
Would that be correct?! In other words, in your view, what might a simple, but complete, instance of a single self-contained Holon look like in TriG (e.g., one containing a bare-bones VC from Alice's bank saying her current bank balance is over $100k ('cos she needs to prove she has sufficient funds for the deposit when applying for a mortgage from a different bank))?
(And yes, I am deliberately avoiding any mention of the pesky nuance of RDF's 'Default Graph', and also its deliberate avoidance of defining any formal semantics/meaning for the 4th component of quads, and also the fact that (I believe) RDF 1.2 discussions explicitly rejected the idea of being able to reify Named Graphs, etc. :) !)