OCHRE: An Online Cultural and Historical Research Environment
A Computational Platform for All Stages of Research Data
The OCHRE platform is a tightly integrated suite of computational tools for working with all kinds of data through all stages of research, from initial acquisition to final archiving of the data. OCHRE provides a seamless environment in which it is easy to move from one stage of research to the next. The data is organized by project and is credited to and controlled by the members of the research project. OCHRE has a powerful graphical user interface (GUI) for entering, viewing, and organizing information from many different sources, and for combining, analyzing, publishing, and preserving the information.
Complementing this common GUI, which participating projects use to build and manage their data, are customized Web browser and mobile apps. These employ a Web API (currently under development) that anyone can use to write apps for retrieving and displaying published data from the OCHRE platform in whatever form may be needed. These apps can be tailored for a particular project or a group of similar projects to view and search their data in familiar ways.
The alternative is to employ an ad hoc collection of unrelated software for data management, data analysis, mapping, Web publication, etc., as most researchers currently do. But this requires cumbersome transfers of data from one piece of software to another using intermediate file formats—a time-consuming and error-prone process in which it is easy to lose track of the many pieces of information that one accumulates in a typical research project. By contrast, OCHRE users have a comprehensive view of their data in all its stages and a coherent user interface in which to:
- acquire relevant data from instruments, external online data sources, legacy data files, or by keying it in manually
- integrate large amounts of heterogeneous data within a common, searchable framework
- analyze and visualize the data using powerful statistical techniques to answer research questions
- publish data and research results on the Web in standard formats (XML, JSON) for online viewing and for other software to use
- archive data in an open, standard format (RDF) to preserve it for the long term and ensure its accessibility and reusability
OCHRE was originally developed for use in ancient studies, as an aid to archaeological research involving excavations and surveys, and as an aid to philological research on ancient inscriptions and languages. But its basic design and powerful features make it suitable for many other kinds of research in the humanities and social sciences, and in some branches of the natural sciences, as explained below.
OCHRE at the University of Chicago
The OCHRE platform is supported by experienced computing personnel at the University of Chicago. Technical support, user training, and legacy data conversion are provided by the staff of the OCHRE Data Service (based at the Oriental Institute of the University of Chicago). The OCHRE servers are professionally hosted and maintained at the University of Chicago Library, and the University of Chicago Research Computing Center provides expertise and support for aspects of OCHRE that require high-performance computing, especially for statistical analysis and visualization.
At the heart of the OCHRE platform is an innovative graph database that serves as a data warehouse to integrate diverse information from many different sources. The OCHRE database is implemented in an enterprise-class database management system that is monitored 24/7/365 and backed up by system administrators in the University of Chicago Library, who ensure the security and accessibility of the data.
OCHRE has been engineered, tested, and refined over the past several years in close consultation with academic researchers in a variety of fields and with funding from a four-year, $1.75-million Scientific Software Integration grant from the U.S. National Science Foundation’s Office of Advanced Cyberinfrastructure (award no. 1450455). Additional funding for research projects to use and test the OCHRE platform has come from the Social Sciences and Humanities Research Council of Canada for an interdisciplinary and international collaboration entitled “Computational Research on the Ancient Near East” (CRANE), headed by Timothy Harrison of the University of Toronto.
OCHRE is now being used by more than 60 multi-person research projects based at 17 universities in the U.S., Canada, and Europe, including several large projects at the University of Chicago. As of early 2019, at the end of the software development and testing phase, OCHRE has approximately 650 user accounts and 8 million indexed database items representing 80 terabytes of data. This scale of usage has enabled rigorous real-world testing of the software for a wide range of academic use cases and has demonstrated the system’s sustainability. With testing now completed, the platform is being advertised more widely and new projects are being added. OCHRE can easily accommodate thousands of users and petabytes of data because it has been professionally engineered to be computationally efficient and scalable.
OCHRE Deals with Divergent Views of Space, Time, and Taxonomy
OCHRE, which stands for “Online Cultural and Historical Research Environment,” was initially developed for use in archaeology and philology. These fields of research have well-established empirical methods but they are characterized by a high degree of ontological heterogeneity, such that similar phenomena are described in different ways by different researchers. (An “ontology,” in the sense intended here, is a specification of the concepts and relations in a given domain of knowledge. A hierarchical taxonomy is a common, and relatively simple, kind of ontology.)
Ontological heterogeneity is not a problem in itself. Indeed, it is inherent in the practice of research, because different ontologies reflect different interpretive frameworks and research agendas—they are not just the result of sloppy thinking or individual quirks. Ontological heterogeneity is not a vice to be eliminated, in a misguided attempt to standardize human ways of knowing, but rather a defining virtue of a research community that is open to multiple perspectives. However, the non-standardized, heterogeneous ontologies embedded in traditional databases (e.g., in the table schemas of relational databases) cause problems for researchers because they inhibit the automated integration and comparison of data among different research projects that use different recording systems. A mechanism for automated querying and comparison across many diverse data sets would be of great benefit to researchers. What is needed is software that does not suppress ontological heterogeneity via forced standardization but instead embraces it and facilitates data integration by making it easy to create semantic mappings from one project to another. This is a basic aim of OCHRE.
Archaeologists study the material traces of past human activity. Philologists study the historical development of languages, literatures, and systems of writing. These two disciplines exhibit not just ontological heterogeneity but also a high proportion of relatively unstructured or semistructured data in the form of qualitative descriptions and natural-language texts, which are best represented digitally as open-ended hierarchies (trees) rather than as rigid tables with rows and columns. Archaeology and philology also entail close attention to geographical and chronological variations in the phenomena being studied. Moreover, when dealing with spatial and temporal relations among entities, researchers need mechanisms for representing not just absolute locations in space and time, in terms of numeric coordinates and dates, but also the relative placement of objects or events with respect to other spatial and temporal phenomena.
Innovative computational methods for dealing with ontological heterogeneity, semistructured data, and spatio-temporal relations are the hallmark of OCHRE. Several large collaborative projects in archaeology and philology have been test beds for OCHRE and provide examples of its use. But it turns out that the software tools developed to deal with the spatial, temporal, linguistic, and taxonomic complexity of archaeological and philological data are applicable to a much wider range of research. This is so because the software is based on powerful conceptual abstractions expressed in an innovative graph database structure characterized by overlapping recursive hierarchies of atomized data elements (described in the Database page of this website). OCHRE’s hierarchical and recursive data model can flexibly represent scholarly knowledge of all kinds without sacrificing the power of modern databases, because it is implemented by means of well-indexed and properly atomized database items that conform to a predictable schema, and so enable efficient queries. Accordingly, OCHRE is now being used, not just in archaeology and philology, but in other areas of the humanities and social sciences; and also in branches of the natural sciences where spatial, temporal, and taxonomic variation are key concerns, such as population genetics (comparing ancient and modern DNA), paleoclimatology and other kinds of paleoenvironmental research, and paleobiology.
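The idea of overlapping recursive hierarchies can be illustrated with a minimal Python sketch. It is not OCHRE's actual schema: the Item and Node classes, the keys, and the sample hierarchies are all hypothetical. The point it shows is only that a single atomized item, identified by a unique key, can occupy positions in more than one hierarchy at once, and that a recursive walk can recover each of its contexts.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One atomized database item with a unique key (hypothetical layout)."""
    key: str
    name: str


@dataclass
class Node:
    """A position in one hierarchy: refers to an item by key, may hold child nodes."""
    item_key: str
    children: list["Node"] = field(default_factory=list)


def paths_to(node, target_key, prefix=()):
    """Recursively collect every path of item keys that ends at target_key."""
    here = prefix + (node.item_key,)
    found = [here] if node.item_key == target_key else []
    for child in node.children:
        found.extend(paths_to(child, target_key, here))
    return found


# Each item is stored once; the hierarchies only refer to it by key.
items = {
    "site": Item("site", "Excavation site"),
    "area-a": Item("area-a", "Area A"),
    "pottery": Item("pottery", "Pottery"),
    "jar-1": Item("jar-1", "Storage jar"),
}

# Spatial hierarchy: where the jar was found.
spatial = Node("site", [Node("area-a", [Node("jar-1")])])
# Typological hierarchy: what kind of thing the jar is.
typological = Node("pottery", [Node("jar-1")])

spatial_paths = paths_to(spatial, "jar-1")
typological_paths = paths_to(typological, "jar-1")

for path in spatial_paths + typological_paths:
    print(" > ".join(items[key].name for key in path))
```

Because the hierarchies hold only keys, the same item can be reorganized or added to further hierarchies without duplicating its data.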
OCHRE Manages All Kinds of Research Data
OCHRE supports a wide range of digital formats and data types: textual, numeric, visual, sonic, geospatial, etc. A project’s textual and numeric data are ingested into the OCHRE data warehouse, where they are atomized and manipulated as database items (see the System Design page for a description of the data warehouse in relation to other components of the OCHRE system). OCHRE can automatically import textual and numeric data stored in Excel spreadsheets, Word documents, other XML documents (e.g., TEI texts), or plain text files (e.g., CSV files).
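As a rough illustration of what atomizing tabular data can mean, the following Python sketch breaks the rows of a small CSV file into individual item-property-value statements. The column names, the sample data, and the atomize function are assumptions made for this example; OCHRE's own import tools are not shown here and may work quite differently.

```python
import csv
import io

# Hypothetical legacy data: one row per find, with one empty cell.
CSV_TEXT = """find_id,material,length_cm
F001,bronze,12.5
F002,ceramic,
"""


def atomize(csv_text, key_column):
    """Turn each non-empty cell into an (item key, property, value) statement."""
    statements = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item_key = row[key_column]
        for column, value in row.items():
            # Skip the key column itself and any empty cells.
            if column == key_column or value in (None, ""):
                continue
            statements.append((item_key, column, value))
    return statements


statements = atomize(CSV_TEXT, "find_id")
for statement in statements:
    print(statement)
```

Empty cells produce no statement at all, which is one reason an atomized representation copes well with sparse or irregular records.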
A project’s 2D images, 3D models, GIS mapping data, PDF files, and audio/video clips are not stored in the data warehouse but are treated as external resources and are fetched as needed from external servers provided by the project. External resources are linked to the data warehouse via their URLs and are displayed seamlessly together with a project’s textual and numeric data. Information in the data warehouse can be linked to specific locations within an image or other resource (e.g., a pixel region in a photograph or a page in a multi-page document).
OCHRE uses ArcGIS Online and the ArcGIS Runtime SDK (embeddable software components) from ESRI to provide a powerful mapping and spatial analysis capability that is tightly integrated with other data. This is especially important for archaeology, but is necessary also for other kinds of research. In addition to spatial entities, chronological systems and temporal relations can also be represented in a way that makes it easy to work with both relative and absolute dates, and lets users visualize temporal sequences via graphical timelines. More generally, relationships of all kinds—temporal, spatial, social, linguistic, etc.—can be modeled and visualized as network graphs and can be used in database queries that incorporate the extrinsic relationships among entities as well as their intrinsic properties.
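The interplay of relative and absolute dates can be sketched in a few lines of Python. This is an illustration only, not OCHRE's implementation: the phase names and dates are invented, and the bracketing step assumes the relative relations form a single linear sequence. Relative relations are stored as a directed graph, ordered with the standard library's TopologicalSorter, and then used to bracket the start of phases that have no absolute date of their own.

```python
from graphlib import TopologicalSorter

# Relative relations: each (hypothetical) phase maps to the phases that precede it.
preceded_by = {
    "Phase 2": {"Phase 1"},
    "Phase 3": {"Phase 2"},
    "Phase 4": {"Phase 3"},
}

# Absolute start dates are known for only some phases (negative years = BCE).
absolute_start = {"Phase 1": -1200, "Phase 4": -900}

# Derive a sequence that is consistent with the relative relations.
sequence = list(TopologicalSorter(preceded_by).static_order())


def bracket(phase):
    """Return (earliest, latest) possible start year for a phase.

    Assumes the relations form one linear sequence, so list position
    reflects temporal order.
    """
    i = sequence.index(phase)
    before = [absolute_start[p] for p in sequence[: i + 1] if p in absolute_start]
    after = [absolute_start[p] for p in sequence[i:] if p in absolute_start]
    return (max(before) if before else None, min(after) if after else None)


print(sequence)
print(bracket("Phase 2"))
```

For a branching set of relations, the bracketing would need to follow the graph's actual ancestors and descendants rather than positions in a single list.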
For philological projects, there are sophisticated capabilities for representing texts written in any language or script, whether alphabetic or logosyllabic—and for representing the writing systems themselves, which is necessary when dealing with complex ancient scripts such as Mesopotamian cuneiform whose signs have many phonetic values and allographic variants. The epigraphic and discursive dimensions of a text are carefully distinguished, which is often not done in software intended for textual studies, but is necessary to achieve a digital representation suitable for rigorous philological research.
OCHRE Embraces Multiple Ontologies
Computational tools for working with a growing body of interconnected scholarly knowledge must cope with the practical reality that such knowledge is recorded by many different people using divergent ontologies. Each ontology reflects the nomenclature and conceptual distinctions relevant to a particular domain of research, and perhaps also reflects the idiosyncrasies of an individual researcher. No single ontology, no matter how complex and ramified, will be suitable for all purposes, because there is an endless array of conceptual possibilities depending on the subject matter and the questions being asked, not to mention the linguistic traditions and historically situated perspectives of the researchers involved.
A computational system for academic research should acknowledge the hermeneutic principle that meaning depends on context and that modern scholarship demands the freedom to describe phenomena of interest in the light of one’s own critical judgments, without being forced to conform to someone else’s ontology due to its being inscribed in the structure of the computer system. OCHRE was designed and built with this principle in mind (see the Ontology page of this website for a discussion of the philosophical issues involved). OCHRE does not impose a single ontology but rather embraces and coordinates multiple ontologies, letting researchers map the semantic relationships between one ontology and another as needed to facilitate cross-project querying and analysis. It does so by means of a highly abstract “upper” ontology that is defined in terms of basic categories such as space, time, agency, and discourse. This is implemented in the logical schema of the OCHRE database, which can thus subsume any local, project-specific ontology within a more general ontological structure. As a result, OCHRE is flexible and customizable. It does not force researchers to conform to a rigid, predetermined recording system but lets them use their own terminologies and conceptual distinctions. And it does so while providing powerful mechanisms for ingesting and integrating data; for querying and analyzing data; and for publishing and archiving data in a standards-compliant fashion.
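A semantic mapping between two project ontologies can be pictured with the following Python sketch. The project names, terms, records, and function names are hypothetical, and the simple equivalence pairs used here stand in for whatever richer mapping mechanism OCHRE provides. The sketch shows how a query phrased in one project's vocabulary can retrieve matching records from another project without either project changing its terminology.

```python
# Hypothetical mappings: each pair states that two project-specific terms are equivalent.
EQUIVALENT = [
    (("project_a", "cooking pot"), ("project_b", "Kochtopf")),
    (("project_a", "storage jar"), ("project_b", "pithos")),
]

# Hypothetical records, each described in its own project's terminology.
RECORDS = [
    {"project": "project_a", "id": "A-17", "type": "cooking pot"},
    {"project": "project_b", "id": "B-03", "type": "Kochtopf"},
    {"project": "project_b", "id": "B-09", "type": "pithos"},
    {"project": "project_a", "id": "A-21", "type": "lamp"},
]


def expand(term):
    """Return the term together with every term directly mapped to it."""
    matches = {term}
    for left, right in EQUIVALENT:
        if left == term:
            matches.add(right)
        elif right == term:
            matches.add(left)
    return matches


def cross_project_query(records, term):
    """Find records in any project whose type matches the term or a mapped equivalent."""
    wanted = expand(term)
    return [r["id"] for r in records if (r["project"], r["type"]) in wanted]


result = cross_project_query(RECORDS, ("project_a", "cooking pot"))
print(result)
```

Terms with no mapping, such as "lamp" above, remain fully usable within their own project; they are simply not reached by queries from other projects.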
OCHRE Conforms to Semantic Web Standards
OCHRE is based on the open, non-proprietary standards published by the World Wide Web Consortium (W3C), the organization responsible for the design of the Web itself. This is especially important for data publication and archiving. The OCHRE Web API publishes data for use by Web browsers and other software using the W3C’s Extensible Markup Language (XML), a self-describing tagged-text format, with stylesheets in the Extensible Stylesheet Language Transformations (XSLT) language to convert the XML data to HTML or JSON for use in Web apps.
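To give a feel for the XML-to-JSON step, here is a minimal Python sketch that uses the standard library in place of an XSLT stylesheet. The XML fragment is invented for this example and does not reflect the actual element names or structure returned by the OCHRE Web API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML for one published item; not OCHRE's real output format.
SAMPLE_XML = """
<item key="example-001">
  <name>Storage jar</name>
  <property label="material">ceramic</property>
  <property label="height_cm">42</property>
</item>
"""


def item_to_dict(xml_text):
    """Convert one <item> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "key": root.get("key"),
        "name": root.findtext("name"),
        "properties": {p.get("label"): p.text for p in root.findall("property")},
    }


record = item_to_dict(SAMPLE_XML)
as_json = json.dumps(record, indent=2)
print(as_json)
```

An XSLT stylesheet would express the same transformation declaratively, so that it can run on the server or in any XSLT-capable client.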
OCHRE can also expose data in a more atomized fashion using the W3C’s Resource Description Framework (RDF) format, which expresses knowledge in the form of subject-predicate-object “triples” that constitute statements about entities. In terms of graph theory, a collection of RDF triple-statements is a labeled, directed multi-graph. RDF triples are well suited for long-term archiving of OCHRE data in an open, standardized format that preserves all of the distinctions and relationships (metadata) stored in the underlying data warehouse. RDF triples can be implemented in a self-describing tagged-text format (e.g., XML) that does not depend on any particular software or operating system. They can be queried using the W3C’s SPARQL querying language and easily imported into any graph database system.
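The triple model is simple enough to imitate in plain Python. The sketch below is an in-memory analogy, not an RDF library and not SPARQL itself: the "ex:" identifiers and predicates are made up, and the match function mimics a single triple pattern in which None plays the role of a query variable. It also shows why the graph is a multi-graph: two nodes may be joined by more than one labeled edge.

```python
# Hypothetical triples; "ex:" is a placeholder prefix, not a real vocabulary.
TRIPLES = [
    ("ex:jar-1", "ex:foundIn", "ex:area-a"),
    ("ex:jar-1", "ex:registeredIn", "ex:area-a"),  # second edge between the same two nodes
    ("ex:jar-1", "ex:madeOf", "ex:ceramic"),
    ("ex:area-a", "ex:partOf", "ex:site"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard (like a variable)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything stated about the jar.
about_jar = match(TRIPLES, s="ex:jar-1")
# All edges from the jar to Area A: two distinct labels, hence a multi-graph.
jar_to_area = match(TRIPLES, s="ex:jar-1", o="ex:area-a")
# Which subjects are part of the site?
parts_of_site = [s for s, _, _ in match(TRIPLES, p="ex:partOf", o="ex:site")]

print(about_jar)
print(jar_to_area)
print(parts_of_site)
```

A real SPARQL query combines several such patterns and joins them on shared variables, but each pattern is matched against the triples in essentially this way.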
Thus, OCHRE is fundamentally compatible with the Semantic Web. At its deepest level, it is based on the open, non-proprietary standards published by the World Wide Web Consortium, which means it is not locked into any commercial software vendor. Its integrative data warehouse currently runs on Tamino XML Server, a native-XML database management system (DBMS) from Software AG that uses the W3C’s XML Schema language to define the database structure and uses the powerful XQuery language to perform database queries. However, any XML DBMS that supports XQuery could be used instead, such as MarkLogic, IBM DB2 pureXML, or Oracle Berkeley DB XML. (To avoid confusion, please note that the XML “documents” in the OCHRE data warehouse are quite small and highly atomized, and function as indexed database objects with unique keys. They do not normally correspond to real-world documents, unlike TEI-XML documents, for example. See the Database page of this website for more details.)