CEDAR Principles

CEDAR is based on distinctive principles that inform its theoretical design and its practical implementation. CEDAR’s goal is well expressed by Jerome McGann, University Professor of English at the University of Virginia and founder in 1992 of the pioneering Institute for Advanced Technology in the Humanities:

What we call literature is an institutional system of cultural memory—a republic of letters. The general science of its study is philology and its Turing Machine is what scholars call the Critical Edition. Driven by the historical emergence of online communication, this book [and the CEDAR initiative] investigates their character and relation. (Jerome McGann, A New Republic of Letters: Memory and Scholarship in the Age of Digital Reproduction; Cambridge, Mass.: Harvard University Press, 2014, p. ix)

What unites the CEDAR collaborators, despite the diversity of their scholarly interests and their compartmentalization within different academic departments, is a common interest in philology and how it should be understood and practiced in the digital age. They are taking up McGann’s challenge and investigating the character and possibilities of the critical edition—that essential technology which for two centuries has undergirded research in the humanities—when such works of scholarship are implemented, not in printed books, with all the constraints imposed by that medium, but in digital networks of information.

CEDAR project teams are exploring this issue, not in the abstract, as a purely theoretical exercise, but as a matter of practice, by using new tools to craft digital texts and facilitate their critical analysis by scholars. General discussions about what we desire in a digital critical edition both guide, and are guided by, practical efforts to digitize particular texts and transcend the limits of traditional printed editions. By comparing the features of exemplary texts studied in a wide range of literary and cultural fields and by examining reflexively our own scholarly practices, we can abstract the common challenges presented to scholars everywhere and apply lessons learned from one textual corpus to other texts written in quite different scripts, languages, and cultural contexts.

From the point of view of software engineering, we are here following the time-honored practice of identifying the common structural elements and algorithms inherent in a seemingly heterogeneous collection of data and methods. This strategy yields simpler and more powerful code that can be used for many different purposes and can be sustained more easily by virtue of being widely shared.

Philology, Then and Now

McGann notes that philology and the intellectual tradition from which it grew are receiving renewed attention and esteem. Philology has long been disparaged as antiquarian, or at best regarded as necessary but intellectually cramped and tedious labor that someone else should do. It has been treated with suspicion by postmodernists since the time of Nietzsche, who exposed the failure of its eminent 19th-century practitioners to recognize that their philological recovery of the past was not neutral and objective but was “an active agent in the construction of modern ideologies—which is to say, the constitutive illusions of modern cultural life” (James I. Porter, Nietzsche and the Philology of the Future; Stanford, Calif.: Stanford University Press, 2000, p. 6).

McGann acknowledges the postmodern critique but argues that the older philological vision is not so easily dismissed. He cites approvingly an essay by Edward Said calling for a “return to philology” as “a method for critically engaging ‘the constitution of tradition and the usable past’ … [and] ‘for making those connections that allow us to see part and whole…: what to connect with, how, and how not’” (Edward W. Said, “The Return to Philology,” in Humanism and Democratic Criticism by E. W. Said; New York: Columbia University Press, 2004, pp. 57–84; quoted in McGann, ibid., p. 2).

McGann himself calls for a return, not just to Wortphilologie, with its focus on language and the internal workings of texts, but to the Sachphilologie of 19th-century historicism—an “object-oriented” approach (in the sense of material objects) that starts with the physical reality of inscribed documents in their concrete historical and social settings, as in modern book history and critical bibliography. McGann laments the fact that many students of the humanities are no longer familiar with what is entailed in the construction of a critical edition and are thus ignorant of the long history of evidence and argumentation on which their own interpretations are based. However, as he points out, the impending demise of print culture and the transition to digital culture provide an opportunity to look at old philological methods with fresh eyes and rethink them in a new medium.

The Document Paradigm versus the Database Paradigm

It so happens that philology, including Sachphilologie, has had a long and distinguished history at the University of Chicago since its founding by the Semitic philologist William Rainey Harper. It has thrived at Chicago especially among scholars of premodern and non-European languages and texts, for whom the production of critical editions and philological dictionaries remains a fundamental part of their practice of scholarship. It is relevant to the CEDAR initiative that Jerome McGann and most other leading figures in digital humanities are, by contrast, specialists in printed European literatures of recent centuries. This does not compromise their understanding of the philological tradition, whose relation to digital technology McGann explains very well. But it does seem to have affected their computational approach to the representation of literary texts. The approach most commonly taken when digitizing literary texts relies on one- and two-dimensional data structures (strings and tables) that mimic predigital configurations of information in printed lines and pages. These data structures conform to the “document paradigm” of software development, based on long linear character strings, as opposed to the “database paradigm,” in which information is much more atomized and thus able to be interconnected in more complex ways, allowing for a multi-dimensional representation of texts. A highly atomized and indexed database representation of a text can be configured and analyzed in many different ways using powerful software and computational standards rooted in modern database theory (see the essay “Beyond Gutenberg: Transcending the Document Paradigm in Digital Humanities” by David and Sandra Schloen, published in Digital Humanities Quarterly vol. 8, no. 4 [2014]).
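
To make the contrast concrete, the following minimal sketch in Python juxtaposes the two paradigms. The tag names, field names, and sample readings are invented for illustration only; they are not drawn from OCHRE’s actual schema.

```python
# A minimal illustration of the two paradigms (hypothetical names throughout).

# Document paradigm: one linear character string with embedded tags.
marked_up = '<line n="1">In the <damaged>beginning</damaged> was the word</line>'

# Database paradigm: each textual unit is an atomic, individually
# addressable item that carries its own properties and links.
atomized_units = [
    {"id": "u1", "type": "word", "reading": "In",        "line": 1},
    {"id": "u2", "type": "word", "reading": "the",       "line": 1},
    {"id": "u3", "type": "word", "reading": "beginning", "line": 1,
     "condition": "damaged", "witness": "MS-A"},
    {"id": "u4", "type": "word", "reading": "was",       "line": 1},
    {"id": "u5", "type": "word", "reading": "the",       "line": 1},
    {"id": "u6", "type": "word", "reading": "word",      "line": 1},
]

# Because every unit has a stable identity, it can participate in any number
# of relationships (hierarchies, variants, annotations) without being locked
# into a single linear sequence.
damaged = [u for u in atomized_units if u.get("condition") == "damaged"]
print(damaged)
```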

The document paradigm of digital text processing emerged in the business world and is adequate for alphabetic printed texts when there is no need to represent variants of the same text or to capture in digital form the epigraphic complexities of physical documents and their histories of copying, translation, and intertextual cross-referencing. And yet these concerns are at the core of philology and, in many cases, they apply to printed texts as much as to ancient manuscripts written in extinct scripts. Textual scholars need a much richer and more nuanced way of digitizing texts, one that captures all the conceptual distinctions and empirical observations they make in the course of their work.

Indeed, the document paradigm in digital humanities is responsible for the frustrations voiced by McGann and others concerning the deficiencies of the “markup” method of digital text representation using tagged character strings:

Since text is dynamic and mobile and textual structures are essentially indeterminate, how can markup properly deal with the phenomena of structural instability? Neither the expression nor the content of a text are given once and for all. Text is not self-identical. The structure of its content very much depends on some act of interpretation by an interpreter, nor is its expression absolutely stable. Textual variants are not simply the result of faulty textual transmission. Text is unsteady, and both its content and expression keep constantly quivering. (Dino Buzzetti and Jerome McGann, “Critical Editing in a Digital Horizon,” in Electronic Textual Editing, ed. L. Burnard et al.; New York: The Modern Language Association of America, 2006, p. 64; see also Jerome McGann, “Marking Texts of Many Dimensions,” in A Companion to Digital Humanities, ed. S. Schreibman et al.; Malden, Mass.: Blackwell, 2004, pp. 198–217.)

The innovative computational approach taken in the CEDAR initiative is based on the observation that the deficiencies of textual markup (the business-derived method of representing a text as a single character string with embedded tags) result from inherent limitations of the document paradigm. This programming paradigm has dominated digital humanities since the 1980s and is the basis of large-scale efforts such as the Text Encoding Initiative. There has been a corresponding neglect of the more sophisticated and powerful database paradigm, in which the components of a text, both physical (based on epigraphic study of the marks of inscription) and linguistic (in which the text is understood as meaningful discourse), can be distinguished more finely, at a higher level of abstraction, as atomic data items (epigraphic units and discourse units) capable of being organized into more complex data structures.
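
The distinction between epigraphic units and discourse units can be pictured schematically as follows. The class and field names here are hypothetical, chosen only to illustrate the idea of two linked but separately addressable levels of atomic items; they do not reproduce the OCHRE data model.

```python
# A schematic sketch of epigraphic units vs. discourse units
# (hypothetical classes and field names, for illustration only).
from dataclasses import dataclass, field
from typing import List

@dataclass
class EpigraphicUnit:
    """A physical mark of inscription observed on a document."""
    id: str
    grapheme: str            # the sign as read on the object
    document: str            # the manuscript or object it appears on
    condition: str = "clear" # e.g. "clear", "damaged", "restored"

@dataclass
class DiscourseUnit:
    """A unit of meaningful language (e.g. a word) interpreted from signs."""
    id: str
    form: str
    realized_by: List[str] = field(default_factory=list)  # ids of epigraphic units

# One word as discourse, realized by three signs on a particular manuscript.
signs = [
    EpigraphicUnit("e1", "d", "MS-A"),
    EpigraphicUnit("e2", "b", "MS-A", condition="damaged"),
    EpigraphicUnit("e3", "r", "MS-A"),
]
word = DiscourseUnit("d1", "dbr", realized_by=[s.id for s in signs])

# The two levels remain distinct but linked, so epigraphic observations and
# linguistic interpretations can each be queried in their own terms.
print(word)
```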

A database representation enables many different views of a text to be constructed on the fly via database queries. The best data structures for this purpose are not long linear character strings but overlapping recursive hierarchies of atomized data items that conform to a semistructured graph database model rather than a flat-file or even a relational database model (Schloen and Schloen, “Beyond Gutenberg,” ibid.). A distinctive feature of the CEDAR initiative, on a technical level, is its demonstration that a true database approach is necessary to achieve the philological goal of connecting and contextualizing the vast array of human products that make up our cultural heritage, and to do so without ignoring or suppressing the idiosyncrasies revealed by close empirical study of inscribed objects on both the epigraphic and discursive levels.
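
As a rough illustration of what overlapping hierarchies over shared atomic items mean in practice, the toy example below organizes the same word-level items by two hierarchies at once and builds each reading view on the fly. The node names and the simple dictionary representation are simplifications for the sketch, not the semistructured graph model actually used.

```python
# Toy sketch: overlapping hierarchies over shared atomic items (illustrative only).

# The same word-level items ...
items = {"w1": "what", "w2": "we", "w3": "call", "w4": "literature"}

# ... can be organized by more than one hierarchy at once:
# a physical hierarchy (lines on a page) and a discourse hierarchy (phrases).
physical  = {"line1": ["w1", "w2"], "line2": ["w3", "w4"]}
discourse = {"clause1": ["w1", "w2", "w3"], "nounphrase1": ["w4"]}

def view(hierarchy):
    """Build a reading view on the fly by traversing one hierarchy."""
    return {node: " ".join(items[i] for i in children)
            for node, children in hierarchy.items()}

print(view(physical))   # {'line1': 'what we', 'line2': 'call literature'}
print(view(discourse))  # {'clause1': 'what we call', 'nounphrase1': 'literature'}
```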

The Text as a Space of Possibilities

A key innovation in CEDAR is that the underlying database of literary texts is designed in such a way that multiple versions of the same text can be represented without error-prone duplication of information and without arbitrarily privileging any one version as more authoritative. This is a particular need in the textual criticism of the Hebrew Bible, for example, for which there are many different and often conflicting manuscript witnesses, some of which are more complete than others but none of which can be regarded as necessarily more true to the original. The Hebrew Bible is known to us only from manuscripts and translations created hundreds of years after it was composed—not to mention the fact that in their final form, many biblical books are the product of a long process of editing and compiling earlier texts. The textual histories of other literary corpora may be less complex, with fewer versions of a text to consider, but the same textual phenomena are found in all cultures and periods.

To create a digital representation suitable for philology, we must first extract all the textual variants, from as many manuscript witnesses and printed copies as possible, down to the level of individual characters and punctuation. The collection of all variants constitutes what CEDAR calls the “content pool” of the text, a term chosen to emphasize that these variants co-exist on an equal basis as items in the OCHRE database; none is structurally privileged over the others. Editors then construct from the pool of variants their own version of the text, with line-, word-, or even character-level links to explanatory notes, images, and bibliographic citations. An editor’s version of the text is a second-order interpretation whose display is generated on the fly from atomized database items that represent the empirically attested textual variants, using the specific configuration of variants chosen by the editor. These variants remain linked to the manuscripts from which they were derived because the transcription of a particular manuscript is itself represented digitally as just another configuration of the same database items that all float within the shared pool of variants.
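
The logic of a shared content pool and editor-specific configurations might be sketched as follows. The witness sigla, readings, and field names are invented placeholders, and the real system records far more structure than this; the sketch only shows how several attributed editions can be generated from one unprivileged pool of variants that stay linked to their witnesses.

```python
# Simplified sketch of a content pool of variants and editions as
# configurations of it (hypothetical structures, not CEDAR's implementation).

# Each variant reading is an atomic item that records its witness.
content_pool = {
    "v1": {"slot": 1, "reading": "in",        "witness": "MS-A"},
    "v2": {"slot": 1, "reading": "at",        "witness": "MS-B"},
    "v3": {"slot": 2, "reading": "the",       "witness": "MS-A"},
    "v4": {"slot": 3, "reading": "beginning", "witness": "MS-A"},
}

# An edition is a named, attributed selection of one variant per slot;
# the pool itself privileges none of them.
edition_a = {"editor": "Scholar A", "choices": {1: "v1", 2: "v3", 3: "v4"}}
edition_b = {"editor": "Scholar B", "choices": {1: "v2", 2: "v3", 3: "v4"}}

def render(edition):
    """Generate an edition's reading text on the fly from the shared pool."""
    ordered = sorted(edition["choices"].items())
    return " ".join(content_pool[vid]["reading"] for _, vid in ordered)

for ed in (edition_a, edition_b):
    # The chain of evidence stays traceable: each chosen variant names its witness.
    evidence = sorted({content_pool[v]["witness"] for v in ed["choices"].values()})
    print(ed["editor"], "->", render(ed), "| witnesses:", evidence)
```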

Each text edition is attributed in the database to a named scholar or group; it is not anonymous, nor is any one edition structurally privileged as more authoritative than any other. This is not the case in most digital editions today, in which scholarly attribution is often neglected and only one nameless voice is heard that bears the authority of “the computer.” A CEDAR edition is simply one possible configuration of the underlying data; it can easily be reconfigured by other scholars, who may create their own editions from the same pool of variants. A CEDAR text is, in effect, a multi-dimensional space of possibilities emerging from the fluctuations in empirical documents. To continue the quantum-mechanics metaphor: the text as a space of possibilities is collapsed by a particular observer into a scholarly creation attributable to him or her as a historically situated interpreter. Moreover, the variant possibilities remain linked to the underlying documents, making it easy to trace the chain of reasoning from a critical interpretation back to the evidence. In this way, we preserve as fundamental the empirical observation of physical documents, with all their idiosyncrasies, on which philology as a science is based, while making it easy to relate documents one to another and compare divergent readings.

Furthermore, any CEDAR edition can itself be interpreted and translated in different ways by different readers. Just as with the second-order critical editions, these third-order interpretations and translations co-exist in the same database, still linked ultimately to the underlying documents. The database can thus organize, display, and analyze the entire history of transmission and interpretation of cultural objects. This is possible because the OCHRE database system used by CEDAR was designed by scholars to be a multi-project, multi-language, multi-script, and multi-ontology system, unlike commercial systems. It does not encode a single perspective but makes it easy to record and compare many perspectives. Semantic authority does not belong passively to the computer but rests with each scholar who enters information and with each subsequent user who reconfigures and redisplays the information. To achieve this, the underlying computer system itself must be multi-dimensional and multi-vocal.

The OCHRE Platform for Integrative Research

It is beyond the scope of this website to elaborate further on CEDAR’s computational infrastructure, other than to say that it allows scholars to represent in an online system every conceptual distinction, empirical observation, and editorial interpretation they wish to make explicit, and to do so without sacrificing philological rigor and without being forced to conform to computational conventions and inappropriate semantics of the kind found in software originally designed for commercial purposes. The OCHRE database platform used by CEDAR was built with these concerns in mind. OCHRE is currently being used by more than 100 research projects to manage large amounts of textual and archaeological data, while permitting each project to use its own conceptual ontology for classifying and describing the phenomena of interest.

The OCHRE platform has proved to be sustainable over the long term, with a steadily growing user community and institutional support from the University of Chicago. Its servers are monitored and backed up by system administrators in the University of Chicago’s Digital Library Development Center. Day-to-day technical support and training are provided by the OCHRE Data Service in the Forum for Digital Culture of the University of Chicago.