Creation of the Document Tree
As explained elsewhere on this site (see also the Mexico City address), TC creates two trees for each text: one representing the text as it appears in the document, as an ordered hierarchy of document, pages and lines; the other representing the text as a communicative act, as an ordered hierarchy of entities.
The document tree is inferred by TC from the sequence document → pages → columns → linebreaks. In TC, each document is a complete TEI2 element. Within the document, the fundamental units of pages, columns and linebreaks are represented by empty milestone elements: <pb/>, <cb/> and <lb/>. Each <pb/> element must carry an "n" attribute, which TC will use to construct the page sequence of the document: <pb n="1r"/>. Within the document, the "n" attribute for each <pb/> should be unique. A "facs" attribute may be used to point at an image of the page, where multiple images are loaded in zip files or from IIIF manifests. If used, <cb/> elements should also carry an "n" attribute. TC will calculate the line number for each <lb/> within a column or page.
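The milestone rules above can be illustrated with a short Python sketch. It is not TC's own code: the function name line_positions, the sample fragment and the decision to raise an error on a duplicate page number are all assumptions made for illustration. It shows only how a page, column and line position could be derived for each <lb/> from the order of the milestones.

```python
import xml.etree.ElementTree as ET

def line_positions(xml_text):
    """Return a (page, column, line) triple for every <lb/>, in document order."""
    page, column, line = None, None, 0
    seen_pages = set()
    positions = []
    for el in ET.fromstring(xml_text).iter():
        tag = el.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if any
        if tag == "pb":
            page = el.get("n")
            if page is None:
                raise ValueError("<pb/> must carry an n attribute")
            if page in seen_pages:
                # Assumption: treat a repeated page number as an error.
                raise ValueError(f"duplicate page number: {page}")
            seen_pages.add(page)
            column, line = None, 0  # a new page restarts columns and lines
        elif tag == "cb":
            column, line = el.get("n"), 0  # a new column restarts lines
        elif tag == "lb":
            line += 1
            positions.append((page, column, line))
    return positions

# Hypothetical fragment, not taken from any TC document.
demo_pages = (
    '<body><pb n="1r"/><lb/>first<lb/>second'
    '<pb n="1v"/><cb n="a"/><lb/>third<cb n="b"/><lb/>fourth<lb/>fifth</body>'
)

for position in line_positions(demo_pages):
    print(position)
```

In this sketch the line counter restarts at every <pb/> and <cb/>, which is one reading of "within a column or page"; how TC itself numbers lines in mixed layouts is not specified here.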
This discrimination of the document tree by milestone elements is adequate for many documents: for almost all manuscripts and printed books. It will not work for complex hand-written documents, such as authorial draft manuscripts. The TEI proposes encoding of these using content elements (sourceDoc, surface, zone). While the data model of TC could deal with this encoding, it is not yet supported by TC.
Creation of the Entity Tree
TC terms each act of communication an "entity". A single document may contain many distinct acts of communication, each divisible into parts, or only one, again divisible into parts. The standard TEI content elements <div><div0><p><lg><l><ab> etc. may be used to indicate individual entities. For TC, any content element with an "n" attribute is an act of communication, hence an entity. TC will use these elements to construct the entity tree for the text in each document. Note that a single document may contain multiple texts of an act of communication. TC will use the distinct place of each text within the document tree to discriminate between them.
The "type" attribute may be used to name each entity within TC. Thus, <div type="Group" n="1"> is interpreted by TC as specifying the entity "Group=1" (note that not all content elements in TEI may hold a "type" attribute).
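As a rough illustration of the naming rule, the Python sketch below collects an entity name for every content element that carries an "n" attribute. Only the "Group=1" form comes from the description above. The function name entity_paths, the fallback to the element name when "type" is absent, and the ":" used to join nested names are assumptions for this example and may not match TC's behaviour.

```python
import xml.etree.ElementTree as ET

MILESTONES = {"pb", "cb", "lb"}  # these belong to the document tree

def entity_paths(xml_text):
    """List one name per content element with an n attribute, parents first."""
    paths = []

    def walk(el, prefix):
        tag = el.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if any
        n = el.get("n")
        if n is not None and tag not in MILESTONES:
            # Assumption: fall back to the element name when type is absent.
            label = el.get("type", tag)
            prefix = prefix + [f"{label}={n}"]
            # Assumption: nested entities are joined with ":".
            paths.append(":".join(prefix))
        for child in el:
            walk(child, prefix)

    walk(ET.fromstring(xml_text), [])
    return paths

# Hypothetical fragment, not taken from any TC document.
demo_entities = (
    '<text><div type="Group" n="1"><pb n="2r"/>'
    '<l n="1">first verse</l><l n="2">second verse</l></div></text>'
)

for name in entity_paths(demo_entities):
    print(name)
```

The <pb n="2r"/> in the fragment is skipped even though it has an "n" attribute, because milestones describe the document tree rather than the entity tree.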
Relationship between the two trees
TC sees text as a collection of "leaves", with each leaf holding a unit of text, which might be as small as a single character, or might extend over several sentences. Each leaf is present in both the document tree and the entity tree. The document tree defines how these units of text appear on the page. The entity tree defines how the text is to be understood as an act of communication.
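One way to picture a leaf is as a record that carries two addresses at once: its place in the document tree and its place in the entity tree. The Python sketch below is a simplified model, not TC's implementation. It assumes that leaves are the runs of text between element boundaries and milestones, that whitespace-only runs are ignored, and that entity names are built as "type=n" (or "element=n") joined with ":". The function name leaves and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

MILESTONES = {"pb", "cb", "lb"}

def leaves(xml_text):
    """Split a fragment into leaves, each tagged with both tree positions."""
    out = []
    pos = {"page": None, "column": None, "line": 0}

    def emit(text, entity):
        if text and text.strip():  # assumption: skip whitespace-only runs
            out.append({"text": text.strip(), "page": pos["page"],
                        "column": pos["column"], "line": pos["line"],
                        "entity": ":".join(entity)})

    def walk(el, entity):
        tag = el.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if any
        if tag == "pb":
            pos.update(page=el.get("n"), column=None, line=0)
        elif tag == "cb":
            pos.update(column=el.get("n"), line=0)
        elif tag == "lb":
            pos["line"] += 1
        elif el.get("n") is not None:
            entity = entity + [f"{el.get('type', tag)}={el.get('n')}"]
        emit(el.text, entity)
        for child in el:
            walk(child, entity)
            emit(child.tail, entity)  # text after a child belongs to the parent

    walk(ET.fromstring(xml_text), [])
    return out

# Hypothetical fragment, not taken from any TC document.
demo_leaves = (
    '<div type="entity" n="Demo"><pb n="1r"/><lb/>'
    '<ab n="1">one<lb/>two</ab><ab n="2"> three<lb/>four</ab></div>'
)

for leaf in leaves(demo_leaves):
    print(leaf["page"], leaf["line"], leaf["entity"], leaf["text"])
```

In this fragment the second line of page 1r holds two leaves, "two" and "three", which sit on the same branch of the document tree but on different branches of the entity tree.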
A Simple TEI Document in TC
Consider this document:
<div n="Sample" type="entity">
<lb/><ab n="1">This line
<lb/>runs across several
<lb/>lines</ab><ab n="2"> While this
TC will see the text of this document as seven leaves, distributed across two pages and six lines as below:
Note that one line, the third line of page 1r, has two leaves: one for "lines" (the end of <ab n="1">), the second for "While this" (the beginning of <ab n="2">). Hence, TC constructs this document tree:
Note that the leaves "lines" and "While this" appear on the same branch in the document tree, although they belong to different <ab> elements.
TC sees the entity tree for this text as composed of a single <div>, divided into two <ab> elements:
Thus, <ab n="1"> contains three leaves; <ab n="2"> contains four leaves.
At present, TC only supports document structure indicated by the "milestone" elements <pb><cb><lb>. There are many circumstances – marginal annotations, on-going authorial revision – which this does not cater for.