The OCHRE Database
The OCHRE database software was created by software engineer Sandra Schloen in consultation with David Schloen, who is a professor of archaeology at the University of Chicago and director of the University’s program in Digital Studies of Language, Culture, and History. They have published earlier descriptions of OCHRE and the principles underlying its design in OCHRE: An Online Cultural and Historical Research Environment (Winona Lake, Indiana: Eisenbrauns, 2012); and “Beyond Gutenberg: Transcending the Document Paradigm in Digital Humanities,” Digital Humanities Quarterly 8/4 (2014).
The staff of the OCHRE Data Service, led by Sandra Schloen and Miller Prosser, provide technical support and training for OCHRE and other CRESCAT tools. See the OCHRE Wiki and the OCHRE manual for more information about how OCHRE is used in practice. The Wiki contains the latest information about building, viewing, analyzing, and publishing a project’s data.
XML Database Architecture
OCHRE is a transactional multi-user system with password-protected user accounts and mechanisms to ensure data security and disaster recovery. It makes use of sophisticated record-locking to meet the “ACID” requirements of atomicity, consistency, isolation, and durability. It is a non-relational (NoSQL) database system that is implemented in Tamino XML Server using large collections of small XML documents of various types. These documents serve as keyed and indexed database items; they are highly atomized and contain universally unique IDs that function as database keys.
Note that XML documents need not correspond to real-world documents, and in the OCHRE data warehouse they rarely do. The term “document” here refers to the fact that XML is a tagged-text format that works on any modern operating system because it encodes information as Unicode characters (e.g., using one of the UTF encoding schemes) rather than in some other binary format. XML was invented to provide a text-based and thus cross-platform notation for all kinds of data—structured, semistructured, and unstructured—and not just for documents as normally understood.
XML Schema is used to specify each XML document type. There are a number of different document types in OCHRE, each corresponding to abstract ontological classes specified in the CHOIR ontology using the Web Ontology Language (see the Ontology page of this website). CHOIR expresses the semantics of RDF triples exported or exposed from OCHRE. In this way, the atomized entities and relationships represented as XML documents in the OCHRE database platform can be published on the Web in a standardized, implementation-independent fashion for use in other software, without losing any information.
XQuery is used to create, read, update, and delete the XML documents in the OCHRE data warehouse (the “CRUD” operations), and also to join them together into larger configurations based on their keys. An XML document in an XML database is analogous to a tuple (table row) in a relational database and XQuery is analogous to SQL, although XQuery is a Turing-complete language, unlike the core of SQL (without its later recursive and procedural extensions), and is in principle more powerful. An XML Schema specification of the structure and sequence of an XML document’s elements and attributes is analogous to a relation schema for a relational database table, which defines the names, types, and sequence of the attributes (column headings) associated with the attribute values in a tuple (row) of the table.
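The idea of small keyed XML documents that are joined by their IDs can be illustrated with a minimal sketch. It is written in Python rather than XQuery and uses an in-memory dictionary in place of a Tamino collection. The element and attribute names (`spatialUnit`, `resource`, `link`, `ref`, `uuid`) and the helper functions are hypothetical and are not taken from OCHRE's actual XML Schema.

```python
import uuid
import xml.etree.ElementTree as ET


def make_item(category, name, links=()):
    """Build one small XML document keyed by a universally unique ID."""
    # Hypothetical element and attribute names, chosen only for illustration.
    item = ET.Element(category, {"uuid": str(uuid.uuid4())})
    ET.SubElement(item, "name").text = name
    for target in links:
        # A link stores only the key of the target document, never a copy of it.
        ET.SubElement(item, "link", {"ref": target.get("uuid")})
    return item


def store(collection, item):
    """Index the document by its key, as a database collection would."""
    collection[item.get("uuid")] = item


def join_links(collection, item):
    """Key-based join: resolve each link to the name of the linked document."""
    return [
        collection[link.get("ref")].findtext("name")
        for link in item.findall("link")
        if link.get("ref") in collection
    ]


collection = {}
photo = make_item("resource", "Photo of bowl")
bowl = make_item("spatialUnit", "Bowl 17", links=[photo])
store(collection, photo)
store(collection, bowl)

joined = join_links(collection, bowl)
print(ET.tostring(bowl, encoding="unicode"))
print(joined)
```

In a real XML database the lookup in `join_links` would be expressed as a join in XQuery and served by the database's indexes; the dictionary here only mimics that behavior.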
The OCHRE Item Categories (XML Document Types)
The OCHRE data warehouse contains large numbers of small XML documents organized into collections, with one collection for each type of document. OCHRE’s XML document types correspond to the abstract ontological classes in CHOIR, which are called categories in OCHRE. An instance of one of these classes is called an item in OCHRE and is represented in the data warehouse by a uniquely keyed and indexed XML document of the appropriate type. There are XML document types for the following categories of OCHRE items:
- Projects (which are linked to and “own” all the other items that belong to a given project)
- Hierarchies, sets, and models (which are used to organize other items)
- Queries (which contain the criteria used to select other items)
- Taxonomies and their taxonomic variables and values (which are used to describe other items)
- Spatial locations and objects
- Temporal periods
- Persons and organizations (i.e., social agents of any kind: individual or collective, living or dead, real or fictitious)
- Writing systems and their component signs
- Texts and their epigraphic and discursive components
- Dictionaries and their lexicographic components
- Reference lists and their bibliographic components
- Resources (i.e., digital images, documents, etc.)
- Other concepts (project-defined concepts)
- Published data
Descriptions of the OCHRE Item Categories
- Project items represent research projects that have data in the OCHRE data warehouse. The data of a given project is represented by items belonging to various categories that are all linked to the item that represents the project. There is one and only one project item for each project.
- Taxonomy items represent the taxonomies of projects. There is one and only one taxonomy item for each project. A taxonomy is represented as a hierarchy of taxonomic variable items (see below) alternating with taxonomic value items in each level of the hierarchy. This allows a project to specify the allowable values for each variable by making them “children” of that variable. Moreover, the recursive nesting of a variable as a child of a value that is itself a child of the same variable expresses the taxonomic relationship between more general and more specific values of the variable (e.g., the variable “material type” can have a child value of “metal,” which can have as its child the same variable “material type,” which in turn can have the child values “copper,” “iron,” “gold,” “silver,” etc.).
- Taxonomic variable items represent the descriptive variables (attributes) defined by each project. A taxonomic variable can be linked to any other item in order to describe it; and when a qualitative taxonomic value item or a quantitative value (see below) is then linked to a taxonomic variable item that has previously been linked to a third item, the result is a property “statement” analogous to a subject-predicate-object triple in RDF; i.e., an OCHRE item-variable-value triple corresponds to an RDF subject-predicate-object triple. (The term “property” is used in OCHRE to refer to a variable-value pair that has been linked to an item in order to describe that item.) Taxonomic variables belong to one of the following types: nominal, ordinal, integer, decimal, calendrical, or relational. Relational variables are used for project-defined relations between two items (note that a number of inter-item relations are not project-defined but are built into OCHRE). In other words, an item may be linked to a relational variable item that is then linked to a third item which functions as the value of the relational variable and is thereby put into a relationship with the first item (e.g., a spatial item may use the relational variable item “is above” with a value that is the ID of another spatial item to express the fact that one thing is above another).
- Taxonomic value items represent the qualitative values (expressed as character strings) that are defined by each project. A qualitative value item can be linked to a nominal or ordinal taxonomic variable that has been linked to a third item in order to describe the latter item by means of a variable-value property. (Note that quantitative values do not need to be defined in OCHRE because the ontological class of numbers is, in effect, predefined for everyone and is digitally represented by standard numeric data types that are interpreted in the same way everywhere on the Web.)
- Spatial items represent spatially situated units of observation or analysis of any size and any kind, as defined by a project (e.g., geographical regions, places, buildings, stratigraphic layers, physical objects, components of objects, etc., no matter how large or how small).
- Temporal items represent chronological periods of any duration and temporal events or occurrences of any kind, as defined by a project.
- Agentive items represent persons, organizations, or other agents, as defined by a project (real or fictitious, individual or collective, living or dead). The researchers, observers, authors, editors, photographers, illustrators, resource creators, and data-entry staff who enter or create data in CRESCAT are all represented as agent items that are linked as needed to other kinds of items.
- Textual items represent texts of any genre written in any writing system(s) and language(s), which are represented in OCHRE as recursive (containment) hierarchies of epigraphic items cross-linked to recursive hierarchies of discourse items (see below), while epigraphic items in turn are linked to sign items. Text items are used to represent texts that are objects of study and analysis in their own right; other kinds of digital texts (e.g., scholarly reports and secondary literature) will normally be represented by resource items (see below).
- Writing-system items represent writing systems of any kind (alphabetic, logosyllabic, etc.), which are represented in OCHRE as hierarchies of sign items (see below).
- Sign items represent the individual signs that make up writing systems and the configurations of such signs in larger compound signs. A sign item contains information about the different reading values of a sign and its allographic variants (e.g., upper-case “A” and lower-case “a” are allographs of the same sign in the alphabetic writing system used in this website, and this sign also has different phonetic values in different contexts).
- Epigraphic items represent physical components of written texts of any size or kind (punctuation/diacritics, characters/signs, lines of characters, columns, pages, etc.). An epigraphic item at the level of an individual sign is linked to a sign item (and usually to a particular allograph of the sign).
- Discourse items represent meaningful units of linguistic discourse of any length (e.g., morphemes, words, phrases, clauses, sentences, or longer units of discourse).
- Dictionary items represent language dictionaries, which are represented in OCHRE as hierarchies of lexicographic items (see below).
- Lexicographic items correspond to dictionary entries (lemmas) or components of a dictionary entry.
- Reference-list items represent reference lists (bibliographies), which are represented in OCHRE as hierarchies of bibliographic items (see below).
- Bibliographic items represent bibliographic references or the components of a bibliographic reference.
- Resource items represent digital resources of various types and binary formats (e.g., 2D images, 3D models, documents, audio and video clips, etc.). Resource items in the data warehouse are linked to external files in servers elsewhere via their URLs.
- Concept items represent concepts needed by a project that are not defined by one of the other item categories. There are some special concepts defined by OCHRE in this category; for example, units of measurement, which can be linked to taxonomic variable items.
- Hierarchy items represent configurations of other items in hierarchies that express containment or logical subordination. A given item may appear in more than one hierarchy, or in more than one branch of the same hierarchy. This allows multiple overlapping configurations of the same information without duplicating the database items that represent that information. There are two types of hierarchy item. One type is for containment hierarchies, which configure items belonging to the same category and are recursive, meaning that a child item is of the same kind as its parent but on a smaller scale and is in some sense contained within it (e.g., a hierarchy of spatial items that expresses the containment of smaller spatial units within larger units, or a hierarchy of discourse items that expresses the containment of smaller linguistic units within larger units). The other type of hierarchy item is for subordination hierarchies, which are non-recursive (e.g., a hierarchy of resources that organizes them in a logical fashion). Note that a simple list of items is represented as a shallow hierarchy in which a parent item has child items that have no children of their own.
- Model items represent configurations of other items in networks (graphs) that are more complex than simple hierarchies. A model can configure items that belong to one or more categories into sequences and sub-sequences with conditional branching and looping. Standard graph visualization techniques are used to display and inspect these models. A few model types are predefined and have special behaviors in OCHRE, including chronological timelines; flowcharts of processes; Harris Matrix models of stratigraphic relationships for use in archaeology; and social network analysis (SNA) models of the relationships among social agents.
- Set items represent collections of other items that are manually defined by a project or are the result of a query (see below) that selects a set of items based on specific criteria. A set may contain items that all belong to the same category or that belong to different categories. In addition to item configurations represented by hierarchy items, model items, and set items, there are many ways in which an individual item can be directly linked to another item via the unique ID of the target item (e.g., when a spatial item representing an artifact is linked to a resource item representing an image that depicts the artifact).
- Query items represent query criteria that have been named and saved by a project for use with its data. The query criteria can be quite complex, involving both the intrinsic properties of items in the data warehouse (represented as pairs of taxonomic variable and value items linked to the item being described) and the extrinsic relations among items (spatial, temporal, etc.). Boolean algebraic operators (AND, OR, NOT) and relational operators (<, >, <=, >=, ==, !=) are supported in query expressions, as well as the nesting of expressions via parentheses. Readable query expressions are constructed by end-users in the OCHRE Java GUI using drop-down pick lists to select the projects and item categories in scope and the variables, values, and operators of the search criteria. When the query is executed, the user-created query criteria are automatically converted to XQuery by the Java GUI and sent to the Tamino XML Server for execution, returning a list of item IDs as the query result set. Separate queries can be chained sequentially to select items based on the intersection or union of their result sets.
- Published-data items represent data that a project has published from the highly atomized data warehouse in the form of “flattened” (denormalized) XML documents for use in Web browser and mobile apps via the Web API. Each published-data item is an XML document that has a persistent URL. It will normally correspond to a real-world entity such as an artifact or text, or some other entity or topic that app developers will in most cases prefer to handle as a modular unit of information. The published-data documents are stored in their own collection in the Tamino XML Server and may also be copied (mirrored) to external publication databases (e.g., simple key-value databases) in multiple locations to improve retrieval performance.
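The alternating variable/value structure of a taxonomy and the item-variable-value property statements described in the list above can be sketched as follows. This is an illustration only: nested Python dictionaries stand in for OCHRE's taxonomic variable and value items, the "material type" example is the one given above, and the item ID `item-001` and the function names are hypothetical.

```python
# Hypothetical sketch: a variable has child values, and a value may have the
# same variable as its child in order to hold more specific values.
taxonomy = {
    "variable": "material type",
    "values": [
        {
            "value": "metal",
            "variables": [
                {
                    "variable": "material type",
                    "values": [
                        {"value": "copper"},
                        {"value": "iron"},
                        {"value": "gold"},
                        {"value": "silver"},
                    ],
                }
            ],
        }
    ],
}


def allowed_values(variable_node, name):
    """Collect every value allowed for `name`, general values before specific ones."""
    found = []
    for value_node in variable_node.get("values", []):
        if variable_node["variable"] == name:
            found.append(value_node["value"])
        # Descend into variables nested under this value (the recursive step).
        for child in value_node.get("variables", []):
            found.extend(allowed_values(child, name))
    return found


def make_property(item_id, variable, value):
    """Return an item-variable-value triple, rejecting values the taxonomy lacks."""
    if value not in allowed_values(taxonomy, variable):
        raise ValueError(f"{value!r} is not an allowed value of {variable!r}")
    return (item_id, variable, value)


allowed = allowed_values(taxonomy, "material type")
statement = make_property("item-001", "material type", "copper")
print(allowed)
print(statement)
```

The returned tuple has the same shape as an RDF subject-predicate-object triple, which is the correspondence the description of taxonomic variable items draws.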
Semantic Linking of OCHRE Items to External Controlled Vocabularies
Several categories of OCHRE items can be linked semantically to external controlled vocabularies of terms and concepts such as Wikidata and the Getty Vocabularies. This kind of linkage can be done for: (1) taxonomic variables, (2) taxonomic values, (3) spatial locations and objects, (4) temporal periods, (5) persons and organizations, (6) texts, (7) resources, and (8) other concepts.
OCHRE users can enter and save SPARQL queries associated with an item in any of these eight categories. The SPARQL queries are used to search a given external vocabulary and find the URLs of published concepts that could be linked to the item. An OCHRE item can be semantically linked to one or more external terms or concepts from any number of published vocabularies. The semantic linkage may be characterized as a “close match” (synonym), “broader term,” “narrower term,” or just a “related term.” If desired, the external term can be displayed within the OCHRE GUI as the name of the item instead of using a project-defined name. This will often be appropriate in the case of a close semantic match, allowing projects to employ standard terms curated by reputable organizations in various domains of research, such as the Getty Research Institute in the domain of cultural heritage.
These semantic linkages clarify the meaning of terms used by OCHRE projects and provide interoperability with other systems. They solve the problem of homographs (i.e., words that have the same written form but different meanings, such as “light” as in weight versus “light” as in color). They allow OCHRE projects to employ any language, not just English, and to translate their terms using standard terminologies. More generally, semantic linkages to external controlled vocabularies enable cross-project querying within the OCHRE environment among projects that use different nomenclatures. If each project links its terms to one or more controlled vocabularies, an OCHRE database query can retrieve similar items that have been described differently by different projects. Alternatively, a project can borrow a taxonomy or part of a taxonomy from another project entirely within the OCHRE database platform itself, as long as the second project has made its taxonomy public for other OCHRE projects to use. This provides another (and often more efficient) way to achieve semantic integration among projects.
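A small sketch may help to show how links to a shared external concept allow a query to find items that different projects have described with different terms. Everything in it is hypothetical: the project names, the terms, the `vocab.example.org` URLs, and the `SemanticLink` class are invented for illustration and do not reflect OCHRE's internal representation or any real vocabulary entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticLink:
    project: str      # project that owns the term
    term: str         # project-defined name
    concept_url: str  # URL of the external concept (placeholder URLs here)
    match: str        # "close match", "broader term", "narrower term", "related term"


links = [
    SemanticLink("Project A", "cooking pot",
                 "https://vocab.example.org/concept/1001", "close match"),
    SemanticLink("Project B", "olla",
                 "https://vocab.example.org/concept/1001", "close match"),
    SemanticLink("Project B", "storage jar",
                 "https://vocab.example.org/concept/2002", "close match"),
]


def equivalent_terms(all_links, project, term):
    """Find terms in other projects that closely match the same external concept."""
    concepts = {
        link.concept_url
        for link in all_links
        if (link.project, link.term) == (project, term)
        and link.match == "close match"
    }
    return sorted(
        (link.project, link.term)
        for link in all_links
        if link.concept_url in concepts
        and link.match == "close match"
        and (link.project, link.term) != (project, term)
    )


matches = equivalent_terms(links, "Project A", "cooking pot")
print(matches)
```

Only close matches are treated as equivalent in this sketch; a fuller treatment would also have to decide how broader, narrower, and related terms should widen or narrow a cross-project query.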
An illustration of the semistructured data model, in which information is organized in hierarchies with cross-hierarchy links. The semistructured data model is a kind of graph model (a labeled, directed graph) consisting of nodes and links, also called vertices and edges.
An illustration of the relational data model, in which information is organized in tables (relations) that are linked together by means of columns (attributes) that contain unique key values (e.g., ID# and ClassID). Each of the tables in a database and its columns are predefined in the database schema.
Why does the OCHRE database use XQuery instead of SQL?
There are good reasons for using XML, XML Schema, and XQuery rather than relational tables and SQL (Structured Query Language). Most of the data in OCHRE is best represented, not in tables with fixed columns, but in open-ended hierarchies (trees) of varying depth. XML and XQuery are particularly well suited to working with hierarchically organized data, in accordance with the semistructured data model, which has emerged as an alternative to the relational data model for many kinds of information. JSON is a somewhat simpler notation than XML but is basically similar and is also used for semistructured data.
For example, relations of spatial containment are easily represented by means of recursive hierarchies, such that a spatially situated unit of observation contains smaller spatial units, and these in turn contain still smaller units, and so on down the hierarchy (e.g., an archaeological site contains soil layers that each contain many artifacts). Likewise, it is easy to see how temporal, linguistic, textual, and taxonomic entities and relations can be readily modeled as recursive hierarchies in which smaller entities are nested within larger entities of the same kind. Other kinds of entities and relationships lend themselves to non-recursive hierarchies (called subordination hierarchies in OCHRE). For example, hierarchies of persons (agents) and hierarchies of digital resources serve to organize these entities without implying that one entity is contained within another.
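The recursive character of a containment hierarchy can be shown with a short sketch based on the site, soil layer, and artifact example. The item IDs, the dictionary layout, and the second hierarchy named `finds_register` are assumptions made for illustration. Items are stored once and the hierarchies refer to them only by ID, so the same item can appear in more than one hierarchy without being duplicated.

```python
# Items are stored once, keyed by (hypothetical) IDs.
items = {
    "site": "Archaeological site",
    "layer1": "Soil layer 1",
    "layer2": "Soil layer 2",
    "bowl": "Bowl",
    "bead": "Bead",
}

# Recursive containment hierarchy: each parent ID maps to its child IDs.
containment = {
    "site": ["layer1", "layer2"],
    "layer1": ["bowl"],
    "layer2": ["bead"],
}

# A second, shallow hierarchy that reuses the same item IDs.
finds_register = {"finds": ["bowl", "bead"]}


def descendants(hierarchy, item_id):
    """Return all item IDs nested below item_id, depth first."""
    result = []
    for child in hierarchy.get(item_id, []):
        result.append(child)
        result.extend(descendants(hierarchy, child))
    return result


def outline(hierarchy, item_id, depth=0):
    """Render the hierarchy below item_id as indented lines of item names."""
    lines = ["  " * depth + items[item_id]]
    for child in hierarchy.get(item_id, []):
        lines.extend(outline(hierarchy, child, depth + 1))
    return lines


below_site = descendants(containment, "site")
print(below_site)
print("\n".join(outline(containment, "site")))
```

Both functions call themselves on each child, which is the same pattern a recursive query uses to walk a hierarchy of unknown depth.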
An enterprise-class XML DBMS such as Tamino XML Server indexes XML documents and implements XQuery in a way that allows efficient querying and manipulation of large amounts of data organized in hierarchies, including recursive hierarchies, for which powerful recursive programming techniques can be used. And XQuery makes it easy to do joins among separate XML documents based on their unique IDs, allowing the software to traverse cross-hierarchy links between database items in different trees.
In theory, SQL can also handle hierarchically organized data, but not very efficiently. XQuery processors, on the other hand, are specifically optimized for such data. And XML and XQuery are just as universal in their scope of application as relational tables and SQL. Relational database theory shows that any kind of data can be represented using interlinked relations, which are usually represented as tables with rows and columns. But computer scientists have also shown that semistructured databases can represent any kind of data, including highly structured data tables, using graphs of nodes and links (edges) that consist of hierarchies of nodes with cross-hierarchy links. In more technical terms, this is “a rooted, directed graph in which the edges carry labels representing schema components, and leaf nodes (i.e., nodes without any outgoing edges) are labeled with data values (integers, reals, strings, etc.)” (Dan Suciu, “Semi-structured Data Model,” in Encyclopedia of Database Systems, ed. Ling Liu and M. Tamer Özsu; Springer Link, 27 January 2017; see also Dan Suciu, “Semi-structured Query Languages,” ibid., 7 December 2018).
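The claim that a rooted, directed graph with labeled edges can hold even highly structured tabular data can be checked with a small sketch. The column names `ID#` and `ClassID` echo the relational illustration; the table name, the row values, and the function names are hypothetical. The encoding shown is one simple possibility, not the one used by OCHRE or by any particular database.

```python
from itertools import count


def table_to_graph(table_name, rows):
    """Encode table rows as a rooted graph with labeled edges and valued leaves."""
    ids = count()
    root = next(ids)
    edges = []   # (source node, edge label, target node)
    leaves = {}  # leaf node -> data value
    table_node = next(ids)
    edges.append((root, table_name, table_node))
    for row in rows:
        row_node = next(ids)
        edges.append((table_node, "row", row_node))
        for column, value in row.items():
            leaf = next(ids)
            # The column name becomes the edge label; the cell becomes a leaf.
            edges.append((row_node, column, leaf))
            leaves[leaf] = value
    return root, edges, leaves


def graph_to_rows(edges, leaves):
    """Recover the rows by grouping leaf-bound edges under their source node."""
    rows = {}
    for source, label, target in edges:
        if target in leaves:
            rows.setdefault(source, {})[label] = leaves[target]
    return list(rows.values())


students = [
    {"ID#": 1, "Name": "Ann", "ClassID": 10},
    {"ID#": 2, "Name": "Ben", "ClassID": 20},
]

root, edges, leaves = table_to_graph("Students", students)
recovered = graph_to_rows(edges, leaves)
print(len(edges), "edges;", len(leaves), "leaves")
print(recovered == students)
```

The round trip loses nothing, which illustrates in miniature why the graph representation is as general as the table for this kind of data.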
The XML notation was invented to express this kind of labeled, directed graph in a serialized character-string format that can easily be transmitted on the Internet and be understood by any operating system that supports the XML and Unicode standards (see Database Systems: The Complete Book, 2d edition, by H. Garcia-Molina, J. D. Ullman, and J. Widom [Upper Saddle River, N.J.: Pearson Prentice Hall, 2009], pp. 483–515). In theory, the OCHRE data warehouse could be implemented in a non-XML and non-hierarchical graph database, or even in a relational database; however, this would be much more cumbersome to program and would be less efficient in practice, because we would have lost the benefits of XML for representing hierarchically organized semistructured data and of XQuery for performing compactly written yet powerful recursive queries.
Why does the OCHRE database have a Java GUI instead of using HTML and JavaScript?
The OCHRE platform includes Web browser applications written in HTML, CSS, and JavaScript for viewing published data. However, the core software application that provides a feature-rich graphical user interface (GUI) for building and analyzing the contents of the OCHRE data warehouse is written in Java. A Java application can run under Microsoft Windows, Apple Mac OS X, or Linux. Its windows, scroll bars, pick lists, and buttons can imitate the look-and-feel of the native operating system, as is done in OCHRE.
Java is often taught in computer programming courses and has long been among the most commonly used programming languages in the world. It remains the dominant language for complex applications that run on servers. But Java is not so commonly used these days for client applications that run on desktop computers, as in OCHRE. Feature-rich desktop applications are more often written in C or C++ (e.g., Adobe Creative Suite applications like Photoshop; geospatial mapping applications like ESRI ArcGIS; drafting and design applications like AutoCAD; etc.). However, the source code for these “native apps” must be customized, compiled, and tested separately for each operating system, which is very labor-intensive. One of Java’s historic strengths is that it supports complex applications and is also “cross-platform” or “platform-agnostic” because Java source code is compiled to bytecode that will run under any operating system for which a Java Runtime Environment is available. This has helped Java to become dominant on the server side, because the same code can run on a Linux server or on a Microsoft Windows server.
Despite Java’s popularity as a language and its dominance on servers, the leading commercial vendors of operating systems and Web browsers (Apple, Microsoft, and Google) have, for their own reasons, made it difficult to use Java on the desktop and impossible to use it on mobile devices. Google’s Android is based on Java but is not compatible with it, and Apple’s iOS does not support it at all. Accordingly, most cross-platform GUI applications today rely on Web browsers (e.g., Google Chrome, Apple Safari, Microsoft Edge, Mozilla Firefox). Like Java, Web browsers provide a “virtual machine” for interacting with the native operating system, but for historical reasons browser applications can be written only in HTML and JavaScript (aided by CSS stylesheets).
Unfortunately, HTML and JavaScript, unlike Java, were not designed for creating complex, high-performance applications. They impose many restrictions on software developers and run more slowly than other languages. JavaScript has no static types and was intended to be a lightweight, easy-to-learn scripting language (it is interpreted rather than compiled) for embedding small pieces of code in HTML to make Web pages more interactive. Despite the name, JavaScript is a quite different language from Java and lacks many of its most powerful features; as the saying goes, Java is to JavaScript as car is to carpet. This is why, despite improvements in the latest versions of HTML and JavaScript and the proliferation of JavaScript code libraries, complex desktop applications are usually not written in JavaScript, even today. As for OCHRE, a number of features in its Java GUI are not supported by JavaScript.
The awkward dependency of Web browser applications on JavaScript has created a serious problem that has, at last, prompted a new solution: the ability to compile source code written in multiple languages to WebAssembly bytecode that runs inside browsers. WebAssembly is a new Web standard sponsored by the World Wide Web Consortium. It is designed to support high-performance applications (see “WebAssembly Will Finally Let You Run High-Performance Applications in Your Browser” by Luke Wagner, IEEE Spectrum, Nov. 21, 2017; and “Bringing the Web Up to Speed with WebAssembly” by Andreas Haas et al., in Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, Barcelona, Spain — June 18–23, 2017; ACM SIGPLAN Notices 52/6 [2017]: 185–200). The first version of WebAssembly supports C and C++, and it is expected that future versions will support Java and other languages. When that happens, the OCHRE Java GUI will run inside a browser, as will other complex desktop applications (Adobe Photoshop, ArcGIS, AutoCAD, games, etc.).
In the meantime, users who wish to enter and manage data in the OCHRE database will download and install a Java client application onto their desktop or laptop computers running Windows, OS X, or Linux, as they would do for a native desktop application like Photoshop (note that the OCHRE Java GUI is not an “applet” and does not run in a browser via a plug-in but is a stand-alone application). Such users will be members of participating projects who have been given password-protected accounts in the database and receive technical support and training from the staff of the OCHRE Data Service. Other people who simply want to view and search OCHRE data that has been made public by participating research projects will use one of the lightweight Web browser apps provided as part of the OCHRE platform.