The OCHRE Ontology, RDF Statements, and the “CHOIR” OWL Specification
OCHRE makes use of an innovative graph database that facilitates the integration of ontologically heterogeneous data derived from multiple sources (i.e., data that has been recorded using different conceptual schemes and terminologies for describing the entities and relationships in a given domain of knowledge). The logical structure of the OCHRE database is specified in 23 XML document types suitable for an XQuery-based system. However, a more generalized expression of the OCHRE graph database structure using RDF triples, and thus suitable for other kinds of graph database software, has been provided by means of a Web Ontology Language (OWL) specification called CHOIR, a “Comprehensive Hermeneutic Ontology for Integrative Research.”
OCHRE is based on an abstract “upper” or top-level ontology within which domain-specific and project-specific ontologies can be subsumed as an aid to integration of ontologically heterogeneous data. The OWL CHOIR expression of this top-level ontology is used by the OCHRE software to specify classes, subclasses, and relations that constrain the structure and meaning of RDF triples exported from the OCHRE database or dynamically exposed on the Web for other software to consume. RDF triples are used in the Semantic Web to represent atomized subject-predicate-object statements of knowledge. RDF triples that conform to the OWL CHOIR ontology specification correspond to item-variable-value triples in the OCHRE database and preserve the high degree of atomization within that database. This allows the entities and relationships represented in the OCHRE database to be published on the Web in a way that preserves all their nuances and distinctions while conforming to the Semantic Web standards.
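The correspondence between item-variable-value triples and RDF subject-predicate-object triples can be sketched as follows. This is only an illustration: the CHOIR OWL specification has not yet been released, so the base URI, the path segments, and the item and variable names are placeholders invented for the example, and values are treated as plain literals.

```python
from typing import NamedTuple

# Hypothetical namespace: the CHOIR specification has not been released,
# so this base URI and the names below are placeholders, not real CHOIR terms.
BASE = "http://example.org/choir-sketch/"


class Triple(NamedTuple):
    item: str      # plays the role of the RDF subject
    variable: str  # plays the role of the RDF predicate
    value: str     # plays the role of the RDF object (a plain literal here)


def to_ntriples(t: Triple) -> str:
    """Serialize one item-variable-value triple as a single N-Triples line."""
    # Escape backslashes and double quotes inside the literal.
    escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{BASE}item/{t.item}> <{BASE}variable/{t.variable}> "{escaped}" .'


example = Triple(item="pot-17", variable="material", value="ceramic")
line = to_ntriples(example)
print(line)
```

The sketch assumes that identifiers contain only characters that are legal in a URI; a real exporter would also need to handle values that are themselves items rather than literals.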
The OWL specification of CHOIR is currently under development and will be released soon with accompanying annotations and examples of RDF triples that conform to it. In the meantime, the philosophical principles underlying its design are explained below by way of comparison to other top-level ontologies designed to facilitate data integration. Development of the OWL specification of CHOIR is supported in part by the Neubauer Collegium for Culture and Society of the University of Chicago, as one component of a collaborative project on ontology-based data integration (“An Organon for the Information Age”) led by David Schloen (Professor of Near Eastern Archaeology), Malte Willer (Associate Professor of Philosophy), and Samuel Volchenboum (Associate Professor of Pediatrics and Associate Chief Research Informatics Officer).
Philosophical Reflections on the Ontology of OCHRE and CHOIR
by David Schloen and Malte Willer
Ever since Aristotle’s Organon, scholars have recognized the central importance of ontologies as tools for structuring and systematizing human knowledge. Their role in the digital era is to help us manage the vast amount of data accumulating in every domain of inquiry. Researchers in many fields face the challenge of combining heterogeneous data from multiple sources to answer questions by means of comprehensive automated querying and analysis. Building on the computational standards of the Semantic Web, especially RDF and OWL, a top-level ontology like CHOIR can facilitate ontology-based data integration by subsuming local ontologies within a larger ontological structure, making it easier for software to translate between heterogeneous ontologies.
CHOIR is quite simple in comparison to other top-level ontologies and is more modest in its ambitions. Examples of top-level ontologies are the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), Cyc, and the Basic Formal Ontology (BFO); and, for cultural heritage, the CIDOC Conceptual Reference Model (CIDOC CRM) of the International Committee for Documentation of the International Council of Museums. Unlike these ontologies, however, CHOIR does not try to classify reality itself, but rather classifies the statements researchers make about what they study, with emphasis on statements about spatial, temporal, agentive, textual, and taxonomic entities and relationships.
Social agents constitute a basic class in the CHOIR ontology. In the OCHRE database, whose structure CHOIR expresses, every database item must be linked to the person(s) or organization(s) who are the observers, creators, authors, or editors of that data, usually with a date- and time-stamp to indicate when the data was observed or authored. The same item can be described in different ways by different persons at different times, and it is important to keep track of this as well. In other words, it is assumed that each piece of information in the database consists of a statement uttered by someone about something at a particular time, and that it is essential for modern research to identify who said what, and when they said it, about the phenomena of interest.
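The idea that every piece of data is a statement made by someone at a particular time can be sketched as a simple record type. The field names, the sample items, and the observers below are assumptions made for illustration and are not taken from the OCHRE schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Statement:
    item: str
    variable: str
    value: str
    agent: str           # person or organization who made the statement
    stated_at: datetime  # when the data was observed or authored


def statements_about(statements, item, variable):
    """Return every attributed statement for one item and variable, oldest first."""
    matches = [s for s in statements if s.item == item and s.variable == variable]
    return sorted(matches, key=lambda s: s.stated_at)


# Hypothetical data: two observers date the same wall differently.
log = [
    Statement("wall-3", "period", "Iron Age II", "Observer B", datetime(2019, 7, 2, 9, 30)),
    Statement("wall-3", "period", "Iron Age I", "Observer A", datetime(2018, 6, 14, 11, 0)),
    Statement("wall-3", "material", "mudbrick", "Observer A", datetime(2018, 6, 14, 11, 5)),
]

for s in statements_about(log, "wall-3", "period"):
    print(f"{s.agent} ({s.stated_at:%Y-%m-%d}): {s.variable} = {s.value}")
```

Neither statement overwrites the other: both are retained with their attribution, so a query can report who said what and when.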
Because CHOIR is focused on people making statements about phenomena, it is not committed either to metaphysical realism or to anti-realism. It is not modeling reality directly but is modeling the linguistic practices of spatially and temporally situated agents as they make statements about the world, without regard to whether those statements refer to mind-independent entities. This rather simple approach, which side-steps many of the semantic goals of other top-level ontologies and makes fewer assumptions, nonetheless provides practical benefits in facilitating the semantic integration of ontologically heterogeneous data. CHOIR usefully constrains the very general RDF model of knowledge statements as subject-predicate-object triples by classifying such statements as being about certain basic classes of entities (spatial, temporal, agentive, textual, taxonomic, etc.). Furthermore, CHOIR recursively nests triple-statements, one within another, in a way that implies containment (in the case of spatial statements), or sequencing and sub-sequencing (in the case of temporal and textual statements), or semantic inheritance (in the case of taxonomic statements). CHOIR also links each statement to a social agent (individual or collective) who is the utterer of the statement, and it allows for multiple statements about the same entities uttered by different agents at different times. This ontological structure corresponds to a semistructured graph database schema that can be efficiently implemented in a working computer system using recursive hierarchies of database items and recursive programming techniques (described in the Database page of this website).
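The recursive nesting described above, in which spatial statements imply containment, can be sketched with a small recursive hierarchy. The class name, the item names, and the helper function are hypothetical and do not reproduce the actual OCHRE implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    name: str
    children: List["Item"] = field(default_factory=list)

    def add(self, child: "Item") -> "Item":
        """Nest a child item inside this one and return the child."""
        self.children.append(child)
        return child


def path_to(root: Item, target: str) -> Optional[List[str]]:
    """Recursively find the chain of containers leading to the named item."""
    if root.name == target:
        return [root.name]
    for child in root.children:
        sub = path_to(child, target)
        if sub is not None:
            return [root.name] + sub
    return None


# Hypothetical spatial hierarchy: each level contains the next.
site = Item("Site")
area = site.add(Item("Area A"))
room = area.add(Item("Room 2"))
room.add(Item("Pot 17"))

print(path_to(site, "Pot 17"))
```

The same recursive shape could represent sequencing within a period or a text; only the meaning attached to nesting would differ.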
Most importantly, a basic class of entities about which agents make statements is the class of taxonomic entities (property variables and values), which in turn are used to make item-variable-value (= subject-predicate-object) triple statements about other classes of entities. Among the taxonomic variables used to describe an entity may be “relational variables” whose values denote other entities to which the first entity has been related in some way. The key point here is that the taxonomic classification of phenomena is not defined in CHOIR beyond the basic categories of space, time, agency, discourse, and inscription, plus the basic categories of taxonomy, namely, descriptive variables (attributes) and the values of variables. All other classifications are not ontologically predetermined but are “user-defined”; that is, they are the result of historically contingent statements made by spatially and temporally (and linguistically and culturally) situated agents about phenomena of interest. Thus, many of the classes and relations defined in other top-level ontologies are regarded by CHOIR, not as universal and predetermined, but as local taxonomic statements made by particular agents (the ontology builders, in this case). In this way, CHOIR is even more “upper” than other top-level ontologies, which can be subsumed within it.
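One way to picture a user-defined taxonomy is as ordinary data that a project supplies, rather than as classes fixed in advance. In the minimal sketch below, descriptive variables carry a set of permitted values, while a relational variable takes another item as its value. All identifiers, variable names, and the dictionary layout are assumptions made for the example.

```python
# Items known to the system (hypothetical identifiers).
items = {"tablet-5", "scribe-2"}

# A project-defined taxonomy: each variable is either descriptive, with a
# set of permitted values, or relational, with values that name other items.
taxonomy = {
    "script": {"kind": "descriptive", "values": {"cuneiform", "alphabetic"}},
    "written_by": {"kind": "relational"},
}


def is_valid(item, variable, value):
    """Check one item-variable-value statement against the project taxonomy."""
    if item not in items or variable not in taxonomy:
        return False
    spec = taxonomy[variable]
    if spec["kind"] == "relational":
        # The value of a relational variable must denote another known item.
        return value in items
    return value in spec["values"]


print(is_valid("tablet-5", "script", "cuneiform"))
print(is_valid("tablet-5", "written_by", "scribe-2"))
print(is_valid("tablet-5", "colour", "red"))
```

Because the taxonomy is data, a different project can supply different variables and values without any change to the checking logic.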
Computational Challenges
Data integration is needed to permit automated queries and analyses that span many sources of information which have been recorded in diverse digital formats in different times and places by different people and organizations (Doan, Halevy, and Ives 2012). Three kinds of heterogeneity emerge as obstacles to data integration: (1) syntactic heterogeneity, (2) schematic heterogeneity, and (3) semantic heterogeneity.
1. Syntactic heterogeneity has been overcome by the adoption of the Unicode character-encoding standard and the Extensible Markup Language (XML) tagged-text format. These now provide a standardized mechanism, built into all modern operating systems, for exchanging any kind of data on the Internet, including both relatively unstructured documents and highly structured data of the kind found in database systems.
2. Schematic heterogeneity arises from the use of different data structures (e.g., tables with different columns, or hierarchies with different levels of nesting) to represent similar kinds of information. This kind of heterogeneity can be overcome by using a domain ontology that specifies a standard set of concepts and relationships in a given domain of knowledge. The information contained in the diverse data structures (character strings, tables, hierarchies) that constitute the schemas of computer systems can be mapped onto common concepts and relations in the domain ontology and so made amenable to automated comparison. An example of a domain ontology is SNOMED CT, which specifies medical concepts and is widely used in biomedical informatics. A domain ontology may consist of just the concepts and relations used in the schema of a single computer system or database, or a set of agreed-upon concepts used by many different computer systems, such as the SNOMED ontology. (Note that we are here making a terminological distinction between the particular logical schema of a working computer system and the broader conceptual ontology it reflects.)
An ontology is implemented computationally using a description logic, which is “a subset of first-order logic that is restricted to unary relations (called concepts or classes) and binary relations (called roles or properties). The concepts describe sets of individuals in the domain, and the roles describe binary relationships between individuals” (Doan et al. 2012: 328; see also Baader et al. 2017). Description logics are intended to support both knowledge representation and automated reasoning, although they often run into difficulties with the latter. The Web Ontology Language (OWL) we are using for CHOIR is an adaptation of description logics that is suitable for online use on the Web. It is a product of the Semantic Web initiative of the World Wide Web Consortium. This non-profit consortium is responsible for the core technical standards that underlie the Web (e.g., HTTP, HTML, and XML) and it has published additional technical standards such as OWL to encourage data integration and automated reasoning.
3. In addition to syntactic heterogeneity and schematic heterogeneity, for which solutions are available, we must reckon with the semantic heterogeneity of computer systems: the fact that the concepts and relationships used by a given system reflect a particular situation, set of concerns, and view of the world. Semantic heterogeneity is found in the sciences as well as the humanities and does not necessarily imply relativism or anti-realism but is compatible with the metaphysical realism still commonly assumed by many scientific researchers, insofar as there are different ways to describe the same entities. Attempts have been made to overcome semantic heterogeneity by using top-level ontologies defined at a high level of abstraction and thus able to subsume many different domain ontologies. Examples of top-level ontologies are Basic Formal Ontology (Arp, Smith, and Spear 2015) and Cyc (Lenat and Guha 1990; Lenat 1995), as well as SUMO (Suggested Upper Merged Ontology; Pease et al. 2002) and DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering; Gangemi et al. 2002), which were mentioned above.
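The schematic mapping described under point 2 can be sketched together with the description-logic view of concepts as sets of individuals and roles as sets of pairs. The two sources, the concept and role names, and the sample records below are invented for the example; they are not drawn from SNOMED CT or from any real system.

```python
# Two sources record similar facts with different data structures.
clinic_a_rows = [{"patient_id": "a1", "dx": "asthma"}]        # flat table
clinic_b_tree = {"b7": {"conditions": ["asthma", "eczema"]}}  # nested hierarchy

# Hypothetical domain ontology in description-logic style: each concept is a
# set of individuals, and each role is a set of (subject, object) pairs.
concepts = {"Patient": set(), "Disorder": set()}
roles = {"hasDiagnosis": set()}


def assert_diagnosis(patient, disorder):
    """Record one fact using the shared concepts and role."""
    concepts["Patient"].add(patient)
    concepts["Disorder"].add(disorder)
    roles["hasDiagnosis"].add((patient, disorder))


# Mapping for source A: each row's columns map directly onto the role.
for row in clinic_a_rows:
    assert_diagnosis(row["patient_id"], row["dx"])

# Mapping for source B: walk the nesting to reach the same role.
for patient, record in clinic_b_tree.items():
    for disorder in record["conditions"]:
        assert_diagnosis(patient, disorder)


def patients_with(disorder):
    """Query across both sources through the common ontology."""
    return sorted(p for p, d in roles["hasDiagnosis"] if d == disorder)


print(patients_with("asthma"))
```

Once both sources are expressed in the same concepts and role, one query spans them. This addresses schematic heterogeneity only; it does nothing about the semantic heterogeneity discussed under point 3.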
Breathless claims about the possibility of a purely automated mechanism for overcoming semantic heterogeneity should be treated with skepticism, as we shall see below in our discussion of artificial intelligence. Nonetheless, there is considerable practical benefit in using an appropriate top-level ontology as a means of semi-automated semantic integration, even though it will remain reliant at key points on human intervention by end-users. In the case of CHOIR, a semi-automated semantic mapping can be made to link the concepts and relations in each domain ontology (e.g., a project’s taxonomy) to this top-level ontology. But these semantic mappings will be done by researchers themselves, who thus remain responsible for the semantics of the system, albeit with the assistance of machine-learning methods now widely used in natural language processing, which can propose semantic matches to be confirmed by a human user. The very fact that machine learning can assist human experts but still cannot entirely replace the human activity of constructing ontologies and aligning them semantically across domains sends us back to a conception of ontology, not as a universal view from nowhere, but as an organon or instrument for achieving particular human purposes in particular situations. We must strike a balance between the ineradicable semantic role of the embodied human users of a computer system, whose concern is to extract meaning from the system, and the powerful allure of formalization and automation, which can greatly aid researchers in their context-specific work of integrating and querying large amounts of heterogeneous data to answer questions they care about.
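The division of labor described here, in which software proposes matches and a researcher confirms them, can be sketched as follows. Plain string similarity from Python's standard library stands in for the machine-learning methods mentioned in the text, and the list of top-level categories, the threshold, and the function names are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical top-level categories, loosely following those named in the text.
TOP_LEVEL = ["spatial unit", "time period", "agent", "text", "taxonomic variable"]


def propose_matches(local_terms, candidates=TOP_LEVEL, threshold=0.5):
    """Suggest the best candidate for each local term, or None below the threshold."""
    proposals = {}
    for term in local_terms:
        scored = [(SequenceMatcher(None, term.lower(), c).ratio(), c) for c in candidates]
        score, best = max(scored)
        proposals[term] = best if score >= threshold else None
    return proposals


def confirm(proposals, approve):
    """Keep only the proposals that a human reviewer approves."""
    return {t: c for t, c in proposals.items() if c is not None and approve(t, c)}


proposals = propose_matches(["Agent", "Time Periods"])

# Stand-in for a human decision; in practice a researcher would review each pair.
accepted = confirm(proposals, approve=lambda term, candidate: True)
print(accepted)
```

Nothing enters the final mapping without passing through `approve`, so responsibility for the semantics stays with the researcher even when the proposals are generated automatically.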
Realism, Anti-realism, and Hermeneutic Plural Realism
It is useful to compare CHOIR to Basic Formal Ontology, a philosophically sophisticated top-level ontology that is described in Building Ontologies with Basic Formal Ontology by Robert Arp, Barry Smith, and Andrew D. Spear (Cambridge, Mass.: MIT Press, 2015). These authors argue that an ontology should be “realistic” in that it represents the universals, defined classes, and relations that the relevant field of study is about; and they furthermore suggest that these will ultimately fall into a limited number of basic categories such as object, place, property, and event. But there is a clear bias toward the natural sciences in Basic Formal Ontology. It is difficult to see how it would assist semantic data integration in many areas of research in the humanities and social sciences, and in some natural sciences, in which there is no common agreement about what constitutes an object of study, or whether such an object would enjoy the same metaphysical status as properties, events, places, and other objects of natural science.
Having said that, it is important to stress that Basic Formal Ontology is an impressive achievement on its own terms and reflects the most philosophically sophisticated approach to ontology design today. It builds on the work of Barry Smith, a leading ontologist and professor of philosophy at SUNY Buffalo. Smith subscribes to Aristotelian realism and so refers to generic top-level ontologies as “formal” ontologies, in contrast to domain-specific ontologies, which he calls “material” ontologies. For Smith and his colleagues, a formal ontology is “a representational artifact, comprising a taxonomy as proper part, whose representations are intended to designate some combination of universals, defined classes, and certain relations between them” (Arp, Smith, and Spear 2015). But other ontologists do not believe in the existence of universals, or wish to be agnostic about them, and so do not distinguish universals from other kinds of classes when constructing ontologies. It is worth asking whether a commitment to realism necessarily implies a particular kind of top-level “formal” ontology, as Smith argues, or whether an effective ontology can be devised that accommodates all of the varieties of realism and anti-realism propounded by ontologists and domain specialists across the sciences and the humanities.
CHOIR is intended to be just such a metaphysically agnostic ontology. To understand the approach it takes, it is useful to revisit the philosophical debate concerning realism and anti-realism, and especially the discussion between Hubert Dreyfus, Charles Taylor, and Richard Rorty on the difference between the natural and human sciences (Review of Metaphysics 34 [1980]: 3–55), which has received a fresh impetus from Dreyfus and Taylor’s recent book Retrieving Realism (Cambridge, Mass.: Harvard University Press, 2015). Dreyfus and Taylor argue, from the perspective of the hermeneutic phenomenology pioneered by Heidegger and Merleau-Ponty, that a robust form of realism is still philosophically viable despite the criticisms offered by Rorty and others of the “view from nowhere” (e.g., in Rorty’s 1979 book Philosophy and the Mirror of Nature)—an epistemological critique that has been very influential in the humanities and social sciences. It is now common to hear scholars speak of the concepts used in the natural sciences as themselves socially constructed and culturally contingent, as opposed to being rooted in a mind-independent metaphysical realm. Dreyfus and Taylor, however, are not so ready to dismiss the metaphysical realism assumed by many working scientists. They opt for what they call a “pluralistic robust realism,” in which: “there may be (1) multiple ways of interrogating reality . . . which nevertheless (2) reveal truths independent of us, that is, truths that require us to revise and adjust our thinking to grasp them . . . and where (3) all attempts fail to bring the different ways of interrogating reality into a single mode of questioning that yields a unified picture or theory” (Dreyfus and Taylor 2015: 154).
These discussions show that the philosophical debate between realists and anti-realists is by no means settled. For this reason, in contrast to Smith’s Basic Formal Ontology and most other top-level ontologies, CHOIR avoids all claims to realism in the first place and aims to categorize, not the subject matter of some type of inquiry, but the inquisitive practices themselves. This provides us with an even-more-basic ontology that can be efficiently implemented in a working computer system and serves as an effective means of semi-automated data integration for all disciplines, in the humanities as well as the sciences. CHOIR does not presume to prescribe the content of what may be stated, in terms of universals or other predefined classes, but more humbly prescribes the structure of agents-making-statements, including statements about which classes exist and which individuals are members of a given class, as well as statements about the relations among classes and among individuals. Of course, there is still an ontological starting point in terms of the predetermined categories of space, time, agency, and discourse that CHOIR employs. But here we can argue on phenomenological grounds that the spatially and temporally situated utterances of embodied linguistic beings are the inescapable horizon within which any ontological discussion is able to make sense.
Ontology and Artificial Intelligence
The ontological approach taken in CHOIR has the air of modesty in that it does not pretend to carve reality at the joints. It is then natural to ask further how such an ontology might respond to another famous plea for modesty from the philosophical literature—the one that flows from Hubert Dreyfus’s skepticism about the possibility of modeling all relevant aspects of human knowledge in a purely symbolic system. This skepticism was powerfully articulated in his 1972 book What Computers Can’t Do (Dreyfus 1972; third edition 1992) and stands in tension with the stated ambitions of prominent ontology projects such as Douglas Lenat’s Cyc project, which aims to represent a vast amount of common-sense knowledge to enable automated reasoning (Lenat and Guha 1990; Lenat 1995; cf. Dreyfus 1992: xvi–xxx on the futility of Lenat’s project).
Statements of knowledge and the agents who make them are represented in a digital computer using formal symbols defined according to standardized conventions and combinable according to certain rules. With knowledge of the relevant conventions and rules, one can automate the syntactic and schematic integration of heterogeneous data derived from multiple sources. Beyond this, many people have been enamored of the idea that semantic integration across domains of knowledge could also be achieved by automated reasoning using symbols and rules. This was the focus of early research in artificial intelligence (AI) from the 1960s to the 1980s. This research led to description logics, which are the basis of modern computational implementations of ontologies, as in the Web Ontology Language.
However, despite considerable effort, symbolic AI is now held by many to have failed on a practical level, and it was also sharply criticized on philosophical grounds by Dreyfus and others as being unworkable in principle. The primary philosophical critique of symbolic AI is from the perspective of the hermeneutic (“existential”) phenomenology of Heidegger and Merleau-Ponty and was made by Dreyfus (1992 [1972]), Charles Taylor (1985), and John Haugeland (1985, 1998). Symbolic AI works well in narrowly defined domains but fails in open-ended reasoning tasks that depend in unpredictable ways on background knowledge (“common sense”). As Dreyfus (1991: 117–119) pointed out, it is often impossible to determine algorithmically which background knowledge is significant and relevant. The problem of selecting relevant facts with which to reason cannot be solved simply by representing more and more facts inside the computer system because the required kind of relevance can never be captured in a formal calculus but emerges from our physical embodiment as agents engaged moment-by-moment in a “world” of involvements. Relevance is a function of a mode of being that first of all entails preconceptual skillful coping in the physical, social, and cultural situations in which we are embedded, and about which we care, and only secondarily and derivatively entails rational calculation. A computer is not in a situation; hence, as Haugeland (1998: 47) put it: “The trouble with artificial intelligence is that computers don’t give a damn.”
CHOIR takes into account the limitations of symbolic knowledge representation, description logics, and automated reasoning. It does not try to solve the intractable computational problem of generalized (domain-free) automated reasoning using formal symbols and rules, on which early AI researchers spent so much effort. This effort is what Haugeland called GOFAI, “Good Old-Fashioned AI,” in contrast to non-symbolic AI based on neural networks, which nowadays relies on statistical machine learning to do pattern matching across vast amounts of data. In reaction to GOFAI’s failure in practice, and also (in some circles) in response to Dreyfus’s philosophical critique, AI work since the 1980s has largely abandoned symbolic knowledge representation and reasoning based on predetermined rules. Much of what is today called AI is based on the rather different paradigm of machine learning, which does not rely on rules-based programming but is “a [probabilistic] set of [algorithmic] methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision-making under uncertainty” (Murphy 2012: 1). Some philosophically attuned computer scientists and cognitive scientists have made explicit attempts to apply machine learning to the design of computational systems in what they take to be a Heideggerian fashion (e.g., Winograd and Flores 1986; Winograd 1995; Agre 1997; Wheeler 2005; cf. Brooks 1991).
But even the most recent AI work on “deep learning” using massive neural networks, which has yielded impressive results in the machine translation of natural language and in image recognition, has not overcome the semantic problems encountered by GOFAI in its attempt to simulate generalized intelligence (see B. C. Smith 2019). Dreyfus and Taylor would say that this is because AI researchers have not yet abandoned the old rationalist idea—vigorously attacked by Heidegger and his intellectual heirs—that intelligence must somehow be based on having an inner representation of an outer world, as opposed to resting fundamentally on our unmediated preconceptual contact with reality as embodied agents (Dreyfus and Taylor 2015: 71–101; see also the essays in Schear [ed.] 2013 on the debate between Hubert Dreyfus and John McDowell concerning the possibility of a preconceptual embodied understanding of the world as opposed to the view espoused by McDowell and many others in the analytic tradition that human experience is permeated with conceptual rationality).
We can see, then, that the semantic integration of data presents a problem of a different order than syntactic or schematic integration. It is not simply an engineering problem but is bound up with fundamental philosophical debates concerning the deep dependence (or not) of meaning on particular social, cultural, and even physical (bodily) contexts. And this is not simply a question of opposing the realism endemic in natural science to the relativism prevalent in the humanities. Rejecting the possibility of a view from nowhere, which many scholars in the humanities would do, may well be compatible with some form of realism, as has been argued by Dreyfus and Taylor (2015), who defend pluralistic robust realism in response to Richard Rorty’s deflationary realism.
CHOIR does not try to decide these difficult questions but merely claims that, regardless of one’s philosophical stance, a top-level ontology that is computationally useful for semantic integration can represent knowledge in the form of spatial, temporal, and taxonomic statements made by particular social agents in particular times and places. But this means that space, time, and agency must themselves be fundamental categories in the ontology, which must also embrace multiple conflicting statements about the same things, whether these statements are made in accordance with a single taxonomy or many different taxonomies. This approach conforms to long-standing research practices of individual attribution and citation, avoiding uncritical acceptance of anonymous authority. It also allows a single ontological framework to accommodate competing views of reality or multiple perspectives on what is held to be the same reality, as well as to accommodate what is purported to be a universal view from nowhere. In this way, we can ensure that ontology plays its proper role, as an organon for inquiry and not a replacement for inquiry.
References
Agre, Philip E. 1997. Computation and Human Experience. Cambridge: Cambridge University Press.
Arp, Robert, Barry Smith, and Andrew D. Spear. 2015. Building Ontologies with Basic Formal Ontology. Cambridge, Mass.: MIT Press.
Baader, Franz, Ian Horrocks, Carsten Lutz, and Uli Sattler. 2017. An Introduction to Description Logic. Cambridge: Cambridge University Press.
Brooks, Rodney A. 1991. “Intelligence without Representation.” Artificial Intelligence 47: 139–159.
Doan, AnHai, Alon Halevy, and Zachary Ives. 2012. Principles of Data Integration. Waltham, Mass.: Morgan Kaufmann/Elsevier.
Dreyfus, Hubert L. 1991. Being-in-the-World: A Commentary on Heidegger’s Being and Time, Division I. Cambridge, Mass.: MIT Press.
Dreyfus, Hubert L. 1992. What Computers Still Can’t Do: A Critique of Artificial Reason. 3d ed. [orig. ed. 1972] Cambridge, Mass.: MIT Press.
Dreyfus, Hubert L., and Charles Taylor. 2015. Retrieving Realism. Cambridge, Mass.: Harvard University Press.
Gangemi, Aldo, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, and Luc Schneider. 2002. “Sweetening Ontologies with DOLCE.” Pp. 166–181 in A. Gómez-Pérez and V. R. Benjamins (eds.), Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web. Berlin: Springer.
Haugeland, John. 1985. Artificial Intelligence: The Very Idea. Cambridge, Mass.: MIT Press.
Haugeland, John. 1998. Having Thought: Essays in the Metaphysics of Mind. Cambridge, Mass.: Harvard University Press.
Lenat, Douglas B. 1995. “CYC: A Large-Scale Investment in Knowledge Infrastructure.” Communications of the ACM 38/11: 33–38.
Lenat, Douglas B., and R. V. Guha. 1990. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Boston: Addison-Wesley.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, Mass.: MIT Press.
Pease, Adam, Ian Niles, and John Li. 2002. “The Suggested Upper Merged Ontology: A Large Ontology for the Semantic Web and Its Applications.” AAAI Technical Report WS-02-11.
Schear, Joseph K., ed. 2013. Mind, Reason, and Being-in-the-World: The McDowell-Dreyfus Debate. New York: Routledge.
Smith, Barry. 2000. “Logic and Formal Ontology.” Manuscrito 23: 275–323.
Smith, Barry. 2003. “Ontology.” Pp. 155–166 in L. Floridi (ed.), Blackwell Guide to the Philosophy of Computing and Information. Oxford: Blackwell.
Smith, Brian Cantwell. 2019. The Promise of Artificial Intelligence: Reckoning and Judgment. Cambridge, Mass.: MIT Press.
Taylor, Charles. 1985. “Cognitive Psychology.” Pp. 187–212 in Human Agency and Language: Philosophical Papers, vol. 1. Cambridge: Cambridge University Press.
Wheeler, Michael. 2005. Reconstructing the Cognitive World: The Next Step. Cambridge, Mass.: MIT Press.
Winograd, Terry. 1995. “Heidegger and the Design of Computer Systems.” Pp. 108–127 in A. Feenberg and A. Hannay (eds.), Technology and the Politics of Knowledge. Bloomington: Indiana University Press.
Winograd, Terry, and Fernando Flores. 1986. Understanding Computers and Cognition: A New Foundation for Design. Norwood, N.J.: Ablex.