OCHRE System Design

OCHRE is designed to support all stages of the research data life cycle and to make it easy to move from one stage to the next, from initial data acquisition through integration, analysis, and publication to final archiving. The OCHRE system has four tiers, which are described in turn below.

The Linked Data Tier consists of data obtained from external Web servers via URLs, or from external online databases whose APIs (application programming interfaces) permit retrieval of specific pieces of information as needed. External resources obtained in this way (e.g., 2D images, 3D models, documents, table rows, Web pages, geospatial data) can be linked dynamically into an OCHRE project, allowing the data to be fetched on demand. This dynamic linking provides a third mechanism for data acquisition, alongside manual data entry and automatic import into the data warehouse (the usual route for textual and numeric data contained in legacy text files, such as CSV files, and spreadsheets, such as Excel). A project can provide its own server to store its images and other external resources, or it can pay for server space provided by the OCHRE Data Service. A project can also create live links to curated online databases maintained by other organizations (e.g., the Zotero bibliographic database maintained by George Mason University, or field-specific online databases).
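
To make the idea concrete, here is a minimal Java sketch of dynamic linking from the client's point of view: only the URL is stored with the project item, and the resource bytes are fetched when the item is displayed. The server address and file name are hypothetical placeholders, not real OCHRE endpoints.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Illustrative sketch of dynamic linking: the resource is not copied into
    // the warehouse; only its URL is stored, and the bytes are fetched on
    // demand. The URL below is a hypothetical external image server.
    public class LinkedResourceFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://images.example-project.org/artifacts/A-1234.jpg"))
                    .GET()
                    .build();
            // Fetch the linked resource when needed, e.g., when the user opens the item.
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            System.out.println("Fetched " + response.body().length
                    + " bytes, status " + response.statusCode());
        }
    }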

The Client Tier consists of two kinds of end-user software applications. The first is a stand-alone Java application (not a browser applet) that runs on desktop and laptop computers under Windows, Mac OS X, or Linux. It has a feature-rich graphical user interface (GUI) that enables the members of a project team to build and manage their project’s data via password-protected user accounts, and to display and analyze the data in many different ways. The Java GUI communicates via the Internet with the OCHRE data warehouse running on a Tamino XML Server in the University of Chicago Library. The Java GUI is normally used online but also has an offline mode for use when a fast Internet connection is not available (e.g., on an archaeological dig).

The second kind of software application in the Client Tier comprises Web browser apps (written in HTML, CSS, and JavaScript) and mobile apps (for iOS, Android, etc.) created to view OCHRE data that projects have made public. Project directors may decide at any time to make some or all (or none) of their data public; they can do so via an option in the Java GUI. Published data is exposed to Web browser and mobile apps in the form of self-describing XML documents, usually one per real-world entity (e.g., one document for each artifact in an archaeological project). These published XML documents are intended for data exchange between OCHRE and other applications and are dynamically constructed as needed from the much more highly atomized data in the OCHRE warehouse. They are therefore “flattened” or “denormalized” in comparison to the underlying database items (although, technically speaking, the atomized database items are also XML documents; see the Database page of this website for further details). Published OCHRE data can be delivered to a browser or mobile app as XML, JSON, or HTML, depending on what the app developer prefers, with automatic conversion from XML to JSON or HTML via an XSLT stylesheet that is provided and documented as part of the OCHRE Web API. Apps can be customized for particular projects or particular kinds of data. Generic Web browser apps for various research domains (archaeological, textual, historical, etc.) are under development by the OCHRE team at the University of Chicago and will be made freely available to all OCHRE users, but anyone may create an app to view published data using the OCHRE Web API.
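
As an illustration of the conversion step, the following Java sketch applies an XSLT stylesheet to a published item document using the standard JAXP API. The file names (ochre-to-html.xsl, artifact.xml) are stand-ins for the stylesheet and data-exchange documents actually provided by the OCHRE Web API.

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.StringWriter;

    // Sketch of the XML-to-HTML conversion step, assuming a published item
    // document (artifact.xml) and a hypothetical stylesheet (ochre-to-html.xsl)
    // standing in for the one documented with the OCHRE Web API.
    public class PublishedItemTransform {
        public static void main(String[] args) throws Exception {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("ochre-to-html.xsl"));
            StringWriter html = new StringWriter();
            // Apply the stylesheet to the flattened data-exchange document.
            transformer.transform(new StreamSource("artifact.xml"), new StreamResult(html));
            System.out.println(html);
        }
    }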

The Middle Tier is the layer of software that exposes published data from the data warehouse to browser and mobile apps via a RESTful Web API (or Web service). This API is currently under development and is described in more detail on the Web Apps page of this website. Apps use the API to fetch published data (exposed as “flattened” XML-based data-exchange documents; see above) by means of persistent URLs; or an app may trigger Java routines and R functions, and pass arguments to them, to execute pre-written queries and analytical workflows that a project has named and saved in the data warehouse for others to use. The Middle Tier uses the HTTP Web service capability of Tamino XML Server, which gives browser and mobile apps access to published data from the core data warehouse.
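
A browser or mobile app written in any language would use the same HTTP pattern; the Java sketch below shows both kinds of call. The URL schemes and parameter names are hypothetical placeholders for the persistent-URL and saved-query conventions of the OCHRE Web API, which is still under development.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch of app-side calls to the RESTful Web API. Both URLs are invented
    // placeholders; only the general request pattern is meant to be real.
    public class WebApiCalls {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // 1. Fetch one published item as a flattened XML document via its persistent URL.
            String itemUrl = "https://ochre.example.edu/ochre?uuid=EXAMPLE-UUID";

            // 2. Execute a named, saved query, passing an argument to it.
            String queryUrl = "https://ochre.example.edu/api/query?name=PotteryByPhase&phase=IronII";

            for (String url : new String[] { itemUrl, queryUrl }) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(response.statusCode() + " <- " + url);
            }
        }
    }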

The Core Data and Analysis Tier consists of both the OCHRE data warehouse and a separate server for executing R functions to do statistical analysis and visualization. The data warehouse is highly scalable and extensively indexed, permitting fast queries. It is implemented using a high-performance database management system called Tamino XML Server from Software AG, which is maintained and backed up by professional system administrators in the Digital Library Development Center of the University of Chicago Library. The data warehouse is structured in accordance with an innovative non-relational graph data model that is optimized for semistructured data represented as recursive hierarchies of spatial, temporal, linguistic, and taxonomic items. The data warehouse and its underlying data model are described further on the Database page of this website.
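
The following toy Java record conveys the recursive-hierarchy idea in miniature: an item of any category can contain items of the same kind, to arbitrary depth. The field names and example labels are invented for this sketch; the actual data model is described on the Database page.

    import java.util.List;

    // A toy data structure conveying the recursive-hierarchy idea behind the
    // data model: an item of any category (spatial, temporal, linguistic, or
    // taxonomic) can contain items of the same kind, to any depth. All names
    // here are invented for illustration.
    public record Item(String uuid, String category, String label, List<Item> children) {

        // Example: a spatial hierarchy three levels deep (site > area > locus).
        public static Item example() {
            Item locus = new Item("u3", "spatial", "Locus 12", List.of());
            Item area = new Item("u2", "spatial", "Area B", List.of(locus));
            return new Item("u1", "spatial", "Tell Example", List.of(area));
        }
    }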

Data Provenance and Reproducibility

Data provenance and reproducibility are important considerations in research data management, and OCHRE supports them in several ways. Any item in the data warehouse can be attributed to an author or observer via a link to an agent item. A spatial item (a location or object) can have multiple date-and-time-stamped observations, each comprising a distinct set of variable-value properties and attributed to a different person, without overwriting previous observations. Data can be edited only via password-protected user accounts assigned to a project’s members by the project’s director, who can grant each user different view, edit, and delete privileges based on item category. All changes to a project’s data are logged and time-stamped by user account at the individual item level. Record-locking is done automatically at the item level, and a “try again later” message is displayed if an item is being edited by another user; thus, data cannot be unknowingly overwritten in cases of contention.
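
The shape of this multi-observation scheme can be sketched as a pair of Java records, with invented names, purely for illustration: each observation carries its own time stamp, observer link, and property set, so recording a new observation never overwrites an earlier one.

    import java.time.Instant;
    import java.util.List;
    import java.util.Map;

    // Illustrative shape of multi-observation provenance. All names here are
    // invented for the sketch; the real warehouse stores these as atomized
    // XML items linked to agent items.
    record Observation(Instant observedAt, String observerAgentUuid,
                       Map<String, String> properties) {}

    record SpatialItem(String uuid, String label, List<Observation> observations) {}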

User accounts are linked to agent items, so a registered user can be linked to other items as an author, creator, observer, etc., of that data. Queries and analytical workflow scripts (described on the R Analytics page of this website) can be attributed to the users who created them or to the project as a whole. Queries and scripts can be named and (optionally) annotated, and then saved for repeated use as database items in their own right. Query criteria are entered via a graphical user interface (GUI), which generates XQuery code to be sent to the Tamino DBMS when the query is executed. Saved query criteria and R commands can be inspected at any time. Likewise, workflow scripts that combine queries and R code can be viewed in a data-aware R console within the GUI. Not just the analytical outputs but the queries and scripts themselves are stored in the warehouse as indexed database items that can be published on the Web via the REST API. Finally, the internal structure of the warehouse itself is documented via an OWL ontology specification.
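
For concreteness, the sketch below shows what the round trip from saved query criteria to the Tamino server might look like. The XQuery code, element names, input() collection function, _xquery parameter, and server URL are all illustrative assumptions; the actual generated code depends on a project’s taxonomy and on Tamino’s configuration.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    // Sketch of the query path: criteria chosen in the GUI are compiled to
    // XQuery and sent to the server over HTTP. Every name below is an
    // illustrative assumption, not the actual OCHRE schema or endpoint.
    public class SavedQueryExample {
        public static void main(String[] args) throws Exception {
            String xquery = """
                for $u in input()/spatialUnit
                where $u/properties/property[variable = 'Ware' and value = 'Cooking pot']
                return $u/identification/label
                """;
            String url = "https://tamino.example.edu/tamino/ochre/project?_xquery="
                    + URLEncoder.encode(xquery, StandardCharsets.UTF_8);
            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }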

The ability to save and reproduce a sequence of operations performed on a given set of data is currently being implemented using the “models” category of database items (described in the “Descriptions of Item Categories” section of the Database page). A model item can configure other database items in a way that expresses a graph of the series of events experienced by those items. When this capability has been fully implemented, a researcher will be able to record a series of editing, querying, and analysis operations as they are applied to a particular item or set of items. These operations will be recorded in a model item that can be named and saved in the data warehouse, where it will be linked to the database item(s) that resulted from the operations. This will allow the step-by-step sequence of operations to be reconstructed after the fact, should this ever be necessary.
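
As a rough sketch of the intent (not of the actual implementation, which is still in progress), a model item can be pictured as an ordered, replayable log of operations linked to the items they produced; all names below are invented.

    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    // Toy sketch of the idea behind a "model" item: an ordered record of the
    // operations applied to a set of items, saved as a database item in its
    // own right. Class and field names are invented; the real models category
    // is described on the Database page.
    record Operation(Instant at, String userUuid, String kind, String detail) {}

    class ModelItem {
        final String name;
        final List<String> resultItemUuids = new ArrayList<>();
        final List<Operation> operations = new ArrayList<>();

        ModelItem(String name) { this.name = name; }

        // Append one step (an edit, a query, an analysis run) to the sequence.
        void record(Operation op) { operations.add(op); }
    }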

Interoperability and Comparison with Other Systems

OCHRE is highly compatible with other data management systems, for three reasons: (1) it allows users to link their project data to external data sources with URLs or APIs using the HTTP and FTP protocols; (2) it has its own REST API for publishing dynamic views of a project’s data on the Web in XML or JSON formats; and (3) it allows users to expose their data in a more granular fashion as RDF triples that preserve every distinction in the underlying data warehouse (see the Web Apps and Ontology pages of this website).

OCHRE does not reinvent the wheel. Researchers do not need to import all the data for a project into the OCHRE data warehouse if there are suitable online databases they can use as live data sources to be seamlessly integrated with their OCHRE data. Moreover, for working with data that is stored partly in the OCHRE data warehouse and partly in external repositories, OCHRE provides powerful features for querying, analysis, publication, and standards-based archiving to support the entire research data life cycle. In most fields of research nowadays there are online data repositories in which data from many contributing researchers is curated and made accessible to the research community. OCHRE users can link to these data sources in a highly granular fashion, at the individual item level, although the granularity of the data to which an atomized item is linked will depend on the capabilities of the external source’s API.

Most online data repositories have a single, relatively simple schema that was designed by one organization or group (e.g., a schema consisting of one or more tables with rows and columns). They do not accommodate multiple heterogeneous ontologies, as OCHRE does. But in addition to homogenized single-schema repositories of this kind, some online repositories are designed to contain many separate data sets, each with its own schema. Examples of multi-schema “data lakes” used in archaeology are the Archaeology Data Service in the U.K. and The Digital Archaeological Record (tDAR) at Arizona State University. However, they simply accession and store idiosyncratically organized data, preserving the original table schemas and document schemas of the files contributed by diverse researchers. These multi-schema repositories are oriented toward the manual browsing and downloading of separate data sets, one by one, and do not have APIs that permit record-level retrieval for automated querying and analysis of data items.

In contrast to both of these kinds of data repository (single-schema databases and multi-schema data lakes), OCHRE atomizes heterogeneous data sets from various researchers into their smallest elements and integrates both the data and the metadata via a global data-warehouse schema based on an abstract top-level ontology, while preserving the local ontologies inherent in the original files. The metadata contained in the original files (e.g., table column headings and document tag names) becomes just another kind of data (taxonomic variable items) in the OCHRE data warehouse. This process of atomization and recombination, described in more detail on the Database page of this website, permits much more powerful forms of automated integration, querying, and analysis of data across projects.
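
A toy Java sketch of the atomization step, with invented names: each column heading of an imported table is promoted to a reusable taxonomic variable item, and each cell value becomes an atomized property of the corresponding item, rather than remaining locked inside the original table schema.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Toy illustration of atomization: a spreadsheet row becomes a set of
    // variable-value pairs whose keys (the former column headings) are
    // reusable taxonomic variables. All names are invented for this sketch.
    public class AtomizeRow {
        public static void main(String[] args) {
            List<String> headings = List.of("Registration No.", "Ware", "Period");
            List<String> row = List.of("A-1234", "Cooking pot", "Iron II");
            Map<String, String> properties = new LinkedHashMap<>();
            for (int i = 0; i < headings.size(); i++) {
                // Each heading becomes a taxonomic variable item; the cell
                // value is stored as an atomized property of the item.
                properties.put(headings.get(i), row.get(i));
            }
            System.out.println(properties);
        }
    }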

It is true that, in addition to providing a repository for idiosyncratic data sets, The Digital Archaeological Record (tDAR) was initially intended to provide a mechanism for data integration. But unlike OCHRE, which defines a keyed-and-indexed global warehouse schema into which information is inserted when the original data files are imported and atomized, tDAR relies on “user-driven integration”: end-users are responsible for constructing semantic mappings among the original schemas of the data sets contained in the repository to enable on-the-fly data integration in response to a query. However, the data-warehouse approach, with its pre-integrated data, is faster and more efficient at query time than on-the-fly integration, and it is widely recognized as the better approach for non-transactional data that does not change frequently. Moreover, tDAR places an unrealistic burden on researchers, who must agree in advance on a common ontology for each domain in order to align their schemas. In OCHRE, by contrast, data integration is highly automated, taking place during the import of legacy data into the warehouse and during subsequent data entry via the graphical user interface, while still allowing user-created semantic mappings between pairs of terms as needed. Thus, although there is some functional overlap, OCHRE differs from tDAR quite substantially in its goals and functionality, while remaining interoperable with tDAR and other online repositories to the extent allowed by their APIs.

Another system with which OCHRE could be compared is the widely used Galaxy platform for biomedical research. This platform is primarily designed for orchestrating analytical workflows and does not solve the complex problems of spatio-temporal data integration and querying for which OCHRE is designed. Galaxy works with relatively simple (albeit large) data sets that normally have tabular schemas. It is not designed to handle multiple ontologies and recursive spatial, temporal, and taxonomic hierarchies, for which OCHRE is optimized.