Article rating: Technology, Intermediate
John Jung
Senior Programmer/Analyst
Digital Library Development Center (DLDC), University of Chicago
Spoken Yucatec Maya
This article describes the use of OCHRE as a website back-end for the Spoken Yucatec Maya website at the University of Chicago. OCHRE is a computational platform — the Online Cultural and Historical Research Environment. With support from the OCHRE Data Service at the Oriental Institute, OCHRE is used to “record, integrate, analyze, publish, and preserve cultural and historical information in all of its digital forms.”
OCHRE provides an interface to manage complex semi-structured data for cultural heritage and historical research. Because of the richness of this kind of data, this software can be helpful as a website back-end, since so much development time for a site like this can go towards data management issues. By providing an API that exposes its data as XML, OCHRE can serve as a back-end for website front-ends written in whatever language and framework you like.
The Spoken Yucatec Maya site is a web-based course to teach field researchers Yucatec Maya. It contains digitized sound clips from field recordings of native speakers that date back to the 1920s. That content is divided among 18 lessons, which take the learner through a series of drills, quizzes, vocabulary lists, conversation prompts and more.
A Short History of this Project
The digitized field recordings themselves are valuable documents. Manuel J. Andrade was an anthropologist and linguist who did innovative audio recordings for linguistic fieldwork on the Mayan languages of Mexico and Guatemala. He recorded many of the earliest clips that are represented in today’s site. Norman A. McQuown joined the faculty of the university in 1946, and he worked to conserve these recordings. When he co-founded the Language Laboratories and Archives in 1954 he incorporated these recordings into that collection. He developed courses on language instruction based on these materials, which led to “historically deep, regionally broad, and ethnographically contextualized collections of recordings, papers, and pedagogical materials.”
Fast-forwarding to 2005 brings us to the initial website for this project, when Professors John Lucy and John Goldsmith led a project to digitize this material and to develop metadata to describe it. In 2007 John Lucy led a project to create the initial web-based language instruction course based on this material, which is now used by multiple universities.
This iteration of the site
In 2019, the OCHRE Data Service loaded the sound clips and their associated metadata into OCHRE. My task was to duplicate the 2007 website but with a more modern web framework. Because OCHRE exposes its data as XML, you have lots of choices for languages and frameworks for a front-end. I chose the Python language and the Flask framework because they are lightweight and quick for development. The Flask documentation includes a tutorial about the framework itself, which will be helpful if you choose to build on the example below. The following instructions were tested on a Mac; if you’re working with a different OS you will have to modify them for your machine.
Getting started with OCHRE data
Next let’s start to work with OCHRE. Data is organized in OCHRE by UUID (universally unique identifier). You’ll need to refer to the OCHRE back-end to get a UUID to start with, but since I already know one, you can use the following command in the terminal to see what data is associated with it:
$ curl "https://ochre.lib.uchicago.edu/ochre?uuid=0ad43a89-09d6-4292-88cb-b6fd6dfe41e5" | xmllint --format - | less
This returns all of the exercises for lesson one of the Spoken Yucatec Maya site—basic sentences, pronunciation, grammar, drills, listening in, conversation, vocabulary, etc.
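The same request can be made from Python. The following is a minimal sketch, assuming the endpoint takes the UUID as the `uuid` query parameter exactly as in the curl command above; the names `OCHRE_ENDPOINT`, `ochre_url` and `fetch_ochre` are made up for illustration and are not part of OCHRE.

```python
import urllib.parse
import urllib.request

# Endpoint copied from the curl example; the helper names are illustrative.
OCHRE_ENDPOINT = 'https://ochre.lib.uchicago.edu/ochre'


def ochre_url(uuid):
    """Build the request URL for a single OCHRE item."""
    return OCHRE_ENDPOINT + '?' + urllib.parse.urlencode({'uuid': uuid})


def fetch_ochre(uuid, timeout=30):
    """Return the raw XML bytes for a UUID (performs a network request)."""
    with urllib.request.urlopen(ochre_url(uuid), timeout=timeout) as response:
        return response.read()


if __name__ == '__main__':
    # Only builds and prints the URL; call fetch_ochre() to download the XML.
    print(ochre_url('0ad43a89-09d6-4292-88cb-b6fd6dfe41e5'))
```

Keeping the URL construction in one place makes it easier to request other lessons later by swapping in a different UUID.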
Using Safari as a quick tool to explore XML data
If you haven’t worked much with XML data you will probably want to start building a suite of tools that are appropriate for different tasks, like Oxygen for making edits, or xmllint for validating and manipulating XML on the command line.
A fast way to start exploring XML data is to use a web browser that can format it with expandable and collapsible sections. On my Mac I can open https://ochre.lib.uchicago.edu/ochre?uuid=0ad43a89-09d6-4292-88cb-b6fd6dfe41e5 in Safari to do this. Control-click anywhere in the page and select “Show Page Source”. The web inspector will appear. Be sure you’re on the Sources tab, and click the root element (Response, in this case). Select the DOM Tree option to get an easy-to-navigate view of this XML document. Click the <xq:result> element to open that part of the DOM tree, and then click <ochre>, <text>, and finally <discourseHierarchy>. There we can see a <section> for each sentence in Basic Sentences. Clicking the first <section> element reveals <transcription> and <translation> elements. If you open up <translation> you can see a <content> element that includes the text “Hi there, brother!” This is the first sentence of this part of lesson one.
A terminal script for OCHRE data
Now we’ll write a short command line script to display the dialog for lesson one’s basic sentences. First, we’ll need to figure out an XPath that can get us to this part of the document. Looking at the hierarchy in Safari, I can see that we followed this trail of elements to get to our list of sections:
<ino:response> <xq:result> <ochre> <text> <discourseHierarchy> <section>
We can represent that in XPath as:
/ino:response/xq:result/ochre/text/discourseHierarchy/section
There are lots of different XML packages in Python—to keep things simple I’ll use ElementTree. It uses a subset of XPath, so I’ll write some expressions a bit differently than I would with other processors. Because ElementTree works from the root element, we’ll modify that XPath so that it’s relative to the root.
./xq:result/ochre/text/discourseHierarchy/section
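To see how the relative path and the namespace prefix work together in ElementTree, here is a small sketch that runs against an inline stand-in document instead of the live service. The sample XML is hypothetical and heavily simplified, and the `ino` namespace URI in it is an assumption; the `xq` URI is the one used by the scripts in this article.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for an OCHRE response. The ino namespace URI is an
# assumption for illustration; real responses contain far more data.
SAMPLE = """\
<ino:response xmlns:ino="http://namespaces.softwareag.com/tamino/response2"
              xmlns:xq="http://namespaces.softwareag.com/tamino/XQuery/result">
  <xq:result>
    <ochre>
      <text>
        <discourseHierarchy>
          <section><translation><content>Hi there, brother!</content></translation></section>
          <section><translation><content>What do you say?</content></translation></section>
        </discourseHierarchy>
      </text>
    </ochre>
  </xq:result>
</ino:response>
"""

# ElementTree resolves the "xq:" prefix in the path through this mapping.
NAMESPACES = {'xq': 'http://namespaces.softwareag.com/tamino/XQuery/result'}

root = ET.fromstring(SAMPLE)  # root is the <ino:response> element
sections = root.findall(
    './xq:result/ochre/text/discourseHierarchy/section', NAMESPACES
)
print(len(sections))
```

Only the prefixed element needs an entry in the mapping, because the path never names the root element itself.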
To write our command line script, I’ll set up a Python virtual environment for any modules we’ll need. The script requests our OCHRE data over HTTPS with urllib, which is part of Python’s standard library, so there is nothing extra to install yet. I used Python 3 for these examples, so I’ll start by using Homebrew to install Python 3.8.
$ brew install python@3.8
$ alias python=/usr/local/opt/python@3.8/bin/python3
$ python -m venv env
$ source env/bin/activate
You may need to modify the steps above for something more appropriate for your system. In any case, what we want to do is install Python 3 via Homebrew, and set it up so that you can easily run that version of Python instead of one of the others that are probably installed on your system. python -m venv env is the command that sets up a virtual environment for a specific project. This lets you easily switch from one Python project to another, where each project can use its own version of Python and its own modules. You’ll need to run source env/bin/activate to activate the virtual environment whenever you want to start working on your code. You can use the deactivate command to exit the virtual environment for a project when you’re ready to work on something else.
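If you are ever unsure whether the virtual environment is active, Python can report it. This is a small sketch, assuming the environment was created with python -m venv as above; the function name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the Python installation it was made from.
    return sys.prefix != sys.base_prefix


if __name__ == '__main__':
    print('virtual environment active:', in_virtualenv())
    print('interpreter:', sys.executable)
```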
Now let’s put these pieces together. Here is how to iterate over each section element in that data in Python. I’ll call this example hello.py:
import urllib.request
import xml.etree.ElementTree as ET

def main():
    ochre_xml = ET.ElementTree(
        file=urllib.request.urlopen(
            'https://ochre.lib.uchicago.edu/ochre?uuid=0ad43a89-09d6-4292-88cb-b6fd6dfe41e5'
        )
    )
    for section in ochre_xml.findall('./xq:result/ochre/text/discourseHierarchy/section', {
        'xq': 'http://namespaces.softwareag.com/tamino/XQuery/result'
    }):
        print(section)

if __name__=='__main__':
    main()
In the script above, I use Python’s urllib module to request data from OCHRE. It’s one way to request a URL, the way you might with a web browser or with the curl command in the terminal. The urlopen function returns a file-like response object, which I pass to ElementTree so I can manipulate the XML. Then I loop over the <section> elements in that XML using ElementTree’s findall method. This is just a skeleton we can expand once we’re sure that we can get at our data, and that our XPath is correct.
Running this script, you should see something like this:
$ python hello.py
<Element 'section' at 0x10338aad0>
<Element 'section' at 0x10338fd70>
<Element 'section' at 0x103396dd0>
<Element 'section' at 0x10339f6b0>
<Element 'section' at 0x103d4c170>
<Element 'section' at 0x103d525f0>
<Element 'section' at 0x103d61a10>
<Element 'section' at 0x103d67dd0>
...
Next, we’ll want to get the speaker names, transcriptions, and translations. This will let us look at more of the data, to be sure we’re able to extract it from the XML without any trouble. Modify your script like this:
import urllib.request
import xml.etree.ElementTree as ET

def main():
    ochre_xml = ET.ElementTree(
        file=urllib.request.urlopen(
            'https://ochre.lib.uchicago.edu/ochre?uuid=0ad43a89-09d6-4292-88cb-b6fd6dfe41e5'
        )
    )
    for section in ochre_xml.findall('./xq:result/ochre/text/discourseHierarchy/section', {
        'xq': 'http://namespaces.softwareag.com/tamino/XQuery/result'
    }):
        try:
            print('{:>15}{}\n{:>15}{}\n{:>15}{}\n'.format(
                'Speaker: ',
                section.find(
                    './/property/label[@uuid="9dc5fbbe-b8db-417f-b9d4-68efa3576e80"]/../value'
                ).text,
                'Transcription: ',
                section.find('./transcription/content').text,
                'Translation: ',
                section.find('./translation/content').text
            ))
        except AttributeError:
            pass

if __name__=='__main__':
    main()
The main difference in the script above is that I used Python’s string formatting method to get nicely formatted output in the terminal. Running this script you should see:
$ python hello.py
      Speaker: Pedro
Transcription: ¡Hola, sukuʾun!
  Translation: Hi there, brother!

      Speaker: Pedro
Transcription: Baʾax ka waʾalik teech.
  Translation: What do you say?

      Speaker: Marcelino
Transcription: Mix baʾal.
  Translation: Nothing!

      Speaker: Marcelino
Transcription: Kux teech. Bix a beel.
  Translation: And you, how are you?

      Speaker: Pedro
Transcription: Chéen beyaʾ.
  Translation: So-so.

      Speaker: Pedro
Transcription: ¡Jach kiʾimak in wóol in wilikech!
  Translation: Iʾm very happy to see you!

      Speaker: Marcelino
Transcription: Bey xan teen.
  Translation: Me, too.

...
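The script above silently skips any section where one of the three lookups fails. While exploring the data it can be useful to see which piece is missing instead. The sketch below is one alternative, not the approach used on the production site: it uses ElementTree’s findtext method, which returns None when nothing matches. The sample XML is hypothetical and simplified, and the function name `extract` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# UUID of the speaker label, copied from the script above.
SPEAKER_LABEL = '9dc5fbbe-b8db-417f-b9d4-68efa3576e80'

# Hypothetical, simplified sections: the second one has no speaker property.
SAMPLE = """\
<discourseHierarchy>
  <section>
    <properties>
      <property>
        <label uuid="9dc5fbbe-b8db-417f-b9d4-68efa3576e80">Speaker</label>
        <value>Pedro</value>
      </property>
    </properties>
    <transcription><content>Chéen beyaʾ.</content></transcription>
    <translation><content>So-so.</content></translation>
  </section>
  <section>
    <transcription><content>Mix baʾal.</content></transcription>
    <translation><content>Nothing!</content></translation>
  </section>
</discourseHierarchy>
"""


def extract(section):
    """Return (speaker, transcription, translation); missing parts are None."""
    speaker = section.findtext(
        './/property/label[@uuid="{}"]/../value'.format(SPEAKER_LABEL)
    )
    return (
        speaker,
        section.findtext('./transcription/content'),
        section.findtext('./translation/content'),
    )


for section in ET.fromstring(SAMPLE).findall('./section'):
    print(extract(section))
```

A None in the result points directly at the lookup that found nothing, which helps when checking an XPath against unfamiliar data.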
Creating a Flask app for OCHRE
Now let’s create a simple Flask app to serve this content up in a web server. Be sure your virtual environment is activated. If you need to activate it, run source env/bin/activate in your project folder. Then use pip to install the Flask package:
$ pip install flask
Then, copy the following Python code into a new file to create a minimal Flask app. I’ll call that file lucy.py:
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello from Flask.'

if __name__=='__main__':
    app.run()
In just a few lines of code, Flask routes a URL (in this case, ‘/’) to a specific function. Here, all our function does is return a short greeting. Now run a development version of your new Flask app like this:
$ python lucy.py
Flask’s development server will then start serving your Flask site at http://localhost:5000/. Flask will warn you that it’s running a server meant for development only, so you’ll want to set up something more robust for a production environment. But for now, open this URL in your browser to see a plain text message, “Hello from Flask.”
Extending our Flask app with OCHRE
Let’s extend lucy.py so it can return some data about lesson one:
import urllib.request
import xml.etree.ElementTree as ET
from flask import Flask

app = Flask(__name__)

@app.route('/')
def main():
    ochre_xml = ET.ElementTree(
        file=urllib.request.urlopen(
            'https://ochre.lib.uchicago.edu/ochre?uuid=0ad43a89-09d6-4292-88cb-b6fd6dfe41e5'
        )
    )
    sections = []
    for section in ochre_xml.findall('./xq:result/ochre/text/discourseHierarchy/section', {
        'xq': 'http://namespaces.softwareag.com/tamino/XQuery/result'
    }):
        try:
            sections.append('{:>15}{}\n{:>15}{}\n{:>15}{}\n'.format(
                'Speaker: ',
                section.find(
                    './/property/label[@uuid="9dc5fbbe-b8db-417f-b9d4-68efa3576e80"]/../value'
                ).text,
                'Transcription: ',
                section.find('./transcription/content').text,
                'Translation: ',
                section.find('./translation/content').text
            ))
        except AttributeError:
            pass
    return ''.join(sections)

if __name__=='__main__':
    app.run(host="0.0.0.0", port=5000)
Use the built-in development server to see what this looks like:
$ python lucy.py
This should return the correct text to your browser, but the line breaks and alignment will be lost. If you view the source of the page you’ll see the correctly formatted text, but let’s add templates next so that we can return something that will render properly as HTML.
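The line breaks disappear because the browser treats the response as HTML, where runs of whitespace collapse into a single space. As a quick check before moving on to templates, the text could be wrapped in a <pre> element. This is only a sketch using the standard library; the helper name `as_preformatted` is made up for illustration.

```python
import html


def as_preformatted(text):
    """Wrap plain text in <pre> so a browser keeps its line breaks."""
    # html.escape() protects characters such as < and & that would
    # otherwise be read as markup.
    return '<pre>{}</pre>'.format(html.escape(text))


if __name__ == '__main__':
    print(as_preformatted('Speaker: Pedro\nTranslation: So-so.\n'))
```

In lucy.py the view function could return as_preformatted(''.join(sections)), although templates remain the more maintainable route as the site grows.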
Adding templates to Flask
We’ll start by making a data structure to send. Instead of joining my sections list into a single string, I’ll set each item in the list to its own Python dictionary with three keys: speaker, translation, and transcription. Then I’ll loop over them in the template to display them properly.
Modify your lucy.py so it looks like this:
import urllib.request
import xml.etree.ElementTree as ET
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def main():
    ochre_xml = ET.ElementTree(
        file=urllib.request.urlopen(
            'https://ochre.lib.uchicago.edu/ochre?uuid=0ad43a89-09d6-4292-88cb-b6fd6dfe41e5'
        )
    )
    sections = []
    for section in ochre_xml.findall('./xq:result/ochre/text/discourseHierarchy/section', {
        'xq': 'http://namespaces.softwareag.com/tamino/XQuery/result'
    }):
        try:
            sections.append({
                'speaker': section.find(
                    './/property/label[@uuid="9dc5fbbe-b8db-417f-b9d4-68efa3576e80"]/../value'
                ).text,
                'translation': section.find('./translation/content').text,
                'transcription': section.find('./transcription/content').text
            })
        except AttributeError:
            pass
    return render_template(
        'base.html',
        sections=sections
    )

if __name__=='__main__':
    app.run(host="0.0.0.0", port=5000)
One of the changes in the script above is that we’re now using Flask’s render_template() function. This allows us to keep our code organized: we request data from OCHRE in Python and turn it into a data structure, and render_template() passes that data structure along to a template file, where it can be inserted into HTML for display.
It’s worth thinking about the way you set this data structure up—as your site grows in complexity you will find yourself debugging it from the perspective of either the Python code where it is generated or the template code where it is consumed. In the script above I create a list of dictionaries, e.g.:
[
    {
        'speaker': 'Pedro',
        'transcription': '¡Hola, sukuʾun!',
        'translation': 'Hi there, brother!'
    },
    {
        'speaker': 'Pedro',
        'transcription': 'Baʾax ka waʾalik teech.',
        'translation': 'What do you say?'
    },
    ...
]
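When a page renders with blanks, it helps to check the data structure on the Python side before looking at the template. The sketch below is one possible check, assuming the three keys shown above; the names `EXPECTED_KEYS` and `check_sections` are made up for illustration.

```python
# The three keys each dictionary is expected to carry.
EXPECTED_KEYS = {'speaker', 'transcription', 'translation'}


def check_sections(sections):
    """Return a list of problems found in the list-of-dictionaries structure."""
    problems = []
    for index, section in enumerate(sections):
        missing = EXPECTED_KEYS - set(section)
        if missing:
            problems.append(
                'item {} is missing: {}'.format(index, ', '.join(sorted(missing)))
            )
    return problems


if __name__ == '__main__':
    sample = [
        {'speaker': 'Pedro', 'transcription': '¡Hola, sukuʾun!',
         'translation': 'Hi there, brother!'},
        {'speaker': 'Pedro', 'translation': 'What do you say?'},
    ]
    print(check_sections(sample))
```

An empty list means every item has the keys the template expects, so any remaining display problem is likely on the template side.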
To add a template to your project, create a directory called “templates” next to your lucy.py file. Inside that directory, make a file called base.html:
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>base template</title>
    <link href="/css/lucy.css" rel="stylesheet" type="text/css">
  </head>
  <body>
    <div id="basic_sentences">
      {% for section in sections %}
        <div>{{ section.speaker }}</div>
        <div>{{ section.translation }}</div>
        <div>{{ section.transcription }}</div>
      {% endfor %}
    </div>
  </body>
</html>
In the template above, the {% for section in sections %}...{% endfor %} block is where we loop over our data structure to display the speaker, translation, and transcription for each sentence.
Create a directory called css and inside it a file called lucy.css:
body {
    font-family: Helvetica, sans-serif;
}
div#basic_sentences {
    display: grid;
    grid-template-columns: max-content 1fr 1fr;
    grid-column-gap: 1em;
    grid-row-gap: 1em;
}
Finally, create a one-line file called lucy.wsgi and put it next to lucy.py:
from lucy import app as application
Your directory hierarchy should now look like this:
css
    lucy.css
lucy.py
lucy.wsgi
templates
    base.html
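A misplaced file is an easy mistake at this stage, so here is a small sketch that checks the layout above from the project folder. The names `EXPECTED` and `missing_files` are made up for illustration.

```python
from pathlib import Path

# Relative paths taken from the layout above.
EXPECTED = ['css/lucy.css', 'lucy.py', 'lucy.wsgi', 'templates/base.html']


def missing_files(project_dir='.'):
    """Return the expected project files that are not present."""
    root = Path(project_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]


if __name__ == '__main__':
    # Run from the folder that contains lucy.py.
    print(missing_files() or 'all files in place')
```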
Now we’ll want to switch from the built-in Flask server to something that can also serve static files like CSS. We’ll use the mod_wsgi Python package to do that. Inside your Python virtual environment, run the following commands:
$ pip install mod_wsgi-httpd
$ pip install mod_wsgi
This will install a copy of Apache into your virtual environment, along with a handy command for running an Apache server with WSGI for your Flask app. Please be patient while installing Apache, as that part of the installation can take a while. Run your test server with the following command:
$ mod_wsgi-express start-server --url-alias /css css lucy.wsgi
Now if you open your browser to http://localhost:8000/ you should see the site serving both static files and dynamic content. Although you do not need to re-activate your virtual environment when you make changes to your code, you will need to restart the development server or at least run touch lucy.wsgi for any changes to take effect.
Please note: I ran into some trouble running mod_wsgi-express with the default Python 3.7 on my Mac. This seems to be a common issue based on the GitHub page for that software, but using Homebrew’s Python 3.8 worked around that problem.
Now that you have a basic setup up and running, you can continue configuring other aspects of Spoken Yucatec Maya. Whenever you want to continue working, just run source env/bin/activate to reactivate your virtual environment, and use the mod_wsgi-express command to start your development server.
If you go to https://github.com/uchicago-library/lucy you can see production code for the site, which expands on the ideas here.