How amazing is archive.org?! And how do I use it?

So, clearly, the Internet Archive is amazing, and I only know a millionth of a snippet of it. Every time I delve into it, I realize I am only scratching the surface. The main way I use it is when I’m working on incorporating new texts into my corpus. I’m not into doing XML markup from scratch or anything (great work gets done on this at Perseus and the OGL project), but …

once I download an XML file from Perseus or OGL, I want to do a few things to make the text play well with the others I already have: 1) move footnote content to a separate section of the document; 2) get the text morphologically parsed, so that links to Logeion will work (and so you and I can search Strabo for future forms or for the lemma δοκέω, or whatever). My first step these days, before I use my old part-of-speech tagger (sorely in need of replacing, but that’s another story), is to tokenize the text and ingest the tokens into my morphology database to find out, with a simple SQLite query, which tokens in the new text are so far unknown:

select * from tokens where file like 'tlg0007%' and type = 'word' and content not in (select token from lexicon)

I need to specify ‘word’ because taggers like punctuation, so our tokens database also contains all punctuation. The tokens table is where every last word has a sequence number within a work, an associated filename, and a unique wordid in the database. The lexicon table holds all the different word forms and their possible parses (so two rows for μικροί, one row for μικράν, and a bunch for δέοντ᾽).
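For anyone who wants to try the query above outside my setup, here is a minimal sketch in Python with the standard sqlite3 module. Only the column names that appear in the query (file, type, content, token) come from my database; the extra columns, the sample rows, the file names, and the helper name unknown_tokens are made up for illustration.

```python
import sqlite3

def unknown_tokens(conn, file_prefix):
    """Return the distinct word tokens in matching files that have no lexicon row."""
    query = """
        select distinct content
        from tokens
        where file like ? and type = 'word'
          and content not in (select token from lexicon)
        order by content
    """
    return [row[0] for row in conn.execute(query, (file_prefix + "%",))]

# Toy in-memory database; the real tables are assumed to be far larger.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tokens (wordid integer primary key, file text,
                         seq integer, type text, content text);
    create table lexicon (token text, parse text);
""")
conn.executemany(
    "insert into tokens (file, seq, type, content) values (?, ?, ?, ?)",
    [
        ("tlg0007.sample.xml", 1, "word", "μικροί"),
        ("tlg0007.sample.xml", 2, "punct", ","),
        ("tlg0007.sample.xml", 3, "word", "ἐγχριπτόμενος"),
        ("tlg0099.sample.xml", 1, "word", "ἤκω"),
    ],
)
# Two lexicon rows for one form, as described above; parse labels are placeholders.
conn.executemany(
    "insert into lexicon (token, parse) values (?, ?)",
    [("μικροί", "parse-1"), ("μικροί", "parse-2")],
)

print(unknown_tokens(conn, "tlg0007"))
```

The select distinct is my addition for the sketch: it collapses repeated occurrences so that each unknown form shows up once in the list to eyeball. Drop it (and select * instead) if you want every occurrence with its sequence number.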

If only Greek literature used the same words over and over again. In a text the size of a Loeb volume, there can be 1,000 or 1,500 (non-)issues: words not, or not yet, present in my database. I eyeball those and decide if I need to change the text (ἢδη, ἤκω, …) or keep the word and run it through Bridget Almas’s Alpheios API to see if the API gives me a nice parse back. After editing the text, I tokenize it again, and after pruning and supplementing the API output (and re-editing, because who notices εὑημεροῦντας on the first run through), I export the expanded lexicon and tag the text. After which, you eventually find it among our Greek texts. Of course, it needs a real word-by-word, line-by-line reader to find all the errors where a word is skipped, repeated, or swapped for another legitimate Greek word. But at least some clean-up will have been done!

But sometimes things are not quite as clear-cut as all that: the form is not nonsensical. Is this word truly printed exactly like that in the reference edition?

This is where archive.org can come to the rescue. What if the text has ἐγχριπτόμενος? Not in my db, but LSJ reports that ἐγχρίπτω is a variant spelling of ἐγχρίμπτω. Or what does this particular text (remember we’re talking public domain, older texts) do with σώζω vs σῴζω? The newer Perseus and OGL texts generally have a nice link in the header like so:

<ref target="https://archive.org/stream/moralia01plut#page/32/mode/2up">The Internet Archive</ref>

This means I don’t even have to search for it in archive.org; I can proofread the OCR by just pasting that link in the browser. I’ve used this for a lot of Lucian, Strabo, the Greek novels, and occasionally even for dictionaries like Slater’s Pindar lexicon because I always forget where I have my copy:-)
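If you would rather not hunt through the header by hand, pulling such links out of a file can be sketched in a few lines of Python. This is only an illustration: the sample document, its nesting, and the function name archive_links are assumptions, and real Perseus or OGL headers may place the ref element elsewhere.

```python
import xml.etree.ElementTree as ET

def archive_links(xml_text):
    """Collect the target of every <ref> element that points at archive.org."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        # Tags look like '{namespace}ref' when a namespace is declared,
        # so compare only the local part of the name.
        local_name = element.tag.rsplit("}", 1)[-1]
        target = element.get("target", "")
        if local_name == "ref" and "archive.org" in target:
            links.append(target)
    return links

# Hypothetical, heavily trimmed header; only the <ref> line is from a real file.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <sourceDesc>
      <bibl>
        <ref target="https://archive.org/stream/moralia01plut#page/32/mode/2up">The Internet Archive</ref>
      </bibl>
    </sourceDesc>
  </teiHeader>
</TEI>"""

for link in archive_links(sample):
    print(link)
```

Matching on the local tag name means the same function works whether or not the file declares a namespace.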

After I check the edition, it’s time for a gut call. I will put together a pull request at Perseus for the items that were misread from the edition, but in some additional cases, I’ll decide not to follow the 1827 or 1880 edition for my local version in Chicago. Times have changed, and we can move on, based on a century or two of new knowledge. So I consulted Radt for Strabo, and more recently Hunter/Russell for a Plutarch text (warning: not for every single word, as will be clear by now!). Editorial purists, I offer my apologies. I went with ἐγχριμπτόμενος because that is in the (reconstructed) Euripides that Plutarch is quoting.

The one downside of archive.org having been at it for so long is that many items were processed years ago, and their recent OCR for Greek is so much better than that of a decade ago. I’m wondering if they might add a button saying, ‘can you pretty-please run this through OCR again, that would be so great!’ to their non-Roman font items. One can dream, right? I know they have a lot more to worry about at the moment than niche interests like mine (and you can donate to them!), and, as I said, I don’t usually do from-scratch encoding. Still: if you are listening, can you please re-process any of these copies of Sophocles’s Lexicon? I would be ever so grateful!
