From Text to Tech

"Corpus and computational linguistics for powerful historical text processing"

Workshop Organisers: Gard B. Jenset, The Oxford Research Centre in the Humanities, University of Oxford and Kerri Russell, Faculty of Oriental Studies, University of Oxford

Note: This workshop expects you to bring your own laptop. Please see our Laptop Guidance on the registration page for more information.


This workshop, originating from the HiCor research network (http://www.torch.ox.ac.uk/hicor), will impart the understanding and skills required to work computationally and quantitatively with historical texts. It will take a hands-on technical approach to building a corpus from scratch, paying particular attention to the unique challenges posed by historical texts, with examples drawn from the presenters' own research and covering a range of languages and time periods. The workshop will proceed stepwise, first introducing students to historical corpus linguistics and to issues specific to historical texts, such as digitization and character encoding. Participants will then be introduced to forms of annotation before a hands-on introduction to XML, TEI, XPath, and XSLT. Next, students will be introduced to basic exploratory statistics and the R software package. Finally, students will be given an overview of state-of-the-art historical Natural Language Processing with Python, including some hands-on practice, followed by a comprehensive problem-solving session covering the topics of the week. While the focus is on historical texts, the skills attained in this workshop will be transferable to modern texts. No prior knowledge of XML or a programming language is required, but familiarity with HTML is highly recommended.


Monday 20 July

11:00 - 12:30: From corpus data to historical interpretation
Gabor M. Toth
The session will introduce how quantitative and qualitative data extracted from linguistic corpora can support historical analysis. It will present interpretative models developed by various historiographical schools, such as German Conceptual History and the Cambridge School, which drew on linguistic data to study social and intellectual history.

14:00 - 17:30 (inc. break): Challenges of historical texts – Corpus creation, digitizing historical texts, OCR, encoding of special characters
Alessandro Vatri
This module expounds the theoretical issues connected with the creation of a corpus of historical texts (especially size, representativeness, and diachrony) and explains the digitization process.

Tuesday 21 July

11:00 - 12:30: Introduction to corpus linguistics
Barbara McGillivray
The lecture will introduce the key concepts underlying modern corpus linguistics. It will cover the difference between texts and corpora and the different types of corpora, and discuss notions like sampling, balance, and representativity, with a particular emphasis on historical corpora.

14:00 - 17:30 (inc. break): Introduction to annotation of historical texts (lecture) / XML and TEI – Introduction to XML, metadata, and identifiers (lab)
Kerri Russell
This session focuses on the annotation of historical texts following the guidelines of the Text Encoding Initiative (TEI) and using Extensible Markup Language (XML). Annotation of linguistic as well as non-linguistic information will be covered. We will discuss how annotation can be customized to a particular language and/or research goals.

Wednesday 22 July

11:00 - 12:30: Searching historical texts – XPath and RegEx (lab)
Kerri Russell
The first part of this session will introduce XPath. We will use it to find items based on patterns in our annotation, and practise using basic regular expressions (RegEx) to find patterns in text strings, before combining the two.

14:00 - 17:30 (inc. break): Searching historical texts (continued) – XSLT (lab)
Kerri Russell
In the second part of this lab session we will use Extensible Stylesheet Language Transformations (XSLT) to extract data from the corpus and choose ways to display the data. The session is designed to give participants a starting point for using XSLT for searches and displays.

Thursday 23 July

11:00 - 12:30: Quantitative historical corpus linguistics (lecture)
Gard B. Jenset
Corpora are above all a source of quantitative data, and this lecture will give an introduction to how we can make sense of corpus data as numbers, as well as some of the challenges of dealing with historical data in this manner.

14:00 - 17:30 (inc. break): Exploratory analysis with R (lab)
Gard B. Jenset
The lab session will provide a hands-on introduction to corpus linguistics with R. Among the topics that will be covered are reading data into R, transforming and summarizing data, and producing high-quality plots and figures.

Friday 24 July

11:00 - 12:30: Introduction to Natural Language Processing (NLP) with Python
Gabor M. Toth
This session will offer a concise overview of the Python NLTK library. Participants will get acquainted with the basic architecture of NLTK, as well as the way this architecture can be related to other technologies such as TEI and information visualization.

14:00 - 17:30 (inc. break): Problem solving session
The session will provide an opportunity to apply the skills taught during the week, with instructors present to provide guidance.
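
The "Searching historical texts – XPath and RegEx" lab in the programme above describes selecting items by their annotation and then matching patterns in the text strings. The following is only an illustrative sketch of that combination, not workshop material: it uses Python's standard library rather than Oxygen, and the sample markup, the `pos` attribute values, and the `find_words` helper are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-like sample (no namespace, invented annotation).
SAMPLE = """<text>
  <s n="1">
    <w pos="PRON">He</w>
    <w pos="VERB">loveth</w>
    <w pos="NOUN">trouthe</w>
  </s>
  <s n="2">
    <w pos="PRON">She</w>
    <w pos="VERB">sings</w>
    <w pos="VERB">goeth</w>
  </s>
</text>"""


def find_words(xml_text, pos, pattern):
    """Return the forms of <w> elements with the given pos that match pattern."""
    root = ET.fromstring(xml_text)
    # XPath step: select words by annotation.
    # ElementTree supports only a limited subset of XPath.
    candidates = root.findall(f".//w[@pos='{pos}']")
    # RegEx step: filter the selected words by their surface form.
    regex = re.compile(pattern)
    return [w.text for w in candidates if w.text and regex.search(w.text)]


# Verbs whose form ends in "-eth".
eth_verbs = find_words(SAMPLE, "VERB", r"eth$")
print(eth_verbs)
```

Keeping the annotation query and the string pattern as separate steps means either one can be changed without touching the other, for example searching nouns instead of verbs with the same pattern.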

There are 5 individual speakers in this workshop.

  • Gard B. Jenset
    The Oxford Research Centre in the Humanities, University of Oxford

    Gard B. Jenset has a PhD in English linguistics from the University of Bergen. He currently works with language technology and is also a Visiting Researcher at The Oxford Research Centre in the Humanities. Among his research interests are corpus linguistics and quantitative methods in historical linguistics.

  • Barbara McGillivray
    The Oxford Research Centre in the Humanities, University of Oxford

    Barbara McGillivray (PhD, University of Pisa) is a data scientist at Nature Publishing Group and Visiting Researcher at The Oxford Research Centre in the Humanities. Her research interests include computational and quantitative corpus linguistics for historical languages and Latin in particular.

  • Kerri Russell
    Faculty of Oriental Studies, University of Oxford

    Kerri Russell received her PhD from the University of Hawaiʻi at Mānoa and is currently a Research Officer at Oxford. She works on the development of the Oxford Corpus of Old Japanese and the Old Japanese/English dictionary, which is linked to the corpus, making cross-referencing in both directions possible.

  • Gabor M. Toth
    University of Passau / The Oxford Research Centre in the Humanities, University of Oxford

    Gabor M. Toth is an assistant professor at the University of Passau and a visiting fellow of the Oxford Research Centre in the Humanities. He completed his studies at the University of Oxford in 2014. In addition to the history of the Italian Renaissance, his main research interest is the application of corpus and computational linguistics to text analysis.

  • Alessandro Vatri
    Faculty of Classics and Faculty of Linguistics, Philology & Phonetics, University of Oxford

    Alessandro Vatri (D.Phil. Oxon) is Research Assistant in Comparative Philology and Junior Research Fellow of Wolfson College, Oxford. His research focuses on ancient Greek oratory, rhetoric, and linguistics. He has worked with treebanks and has taken part in the NEH Institute for Advanced Technology in the Digital Humanities "Working with Text in a Digital Age" (Perseus Project, Tufts University).


Workshop Venue: All of your sessions will be in the Danson Room at St Anne's College.

AM and PM Refreshment Breaks: All breaks will be in the Ruth Deech Building, St Anne's College.

Lunch Arrangements: Lunch each day will be in the Ruth Deech Building, St Anne's College.

Computers: The workshop is a hands-on technical introduction, which requires you to bring a laptop with the following software installed. Note that you do not need to install Python.

XML processing: Oxygen

Everyone should download Oxygen 17. It is available for a number of platforms from: http://www.oxygenxml.com/xml_editor/download_oxygen_xml_editor.html

Anyone affiliated with the University of Oxford can get a free license from Oxford's self-registration page: http://help.it.ox.ac.uk/registration/index. Simply log in with SSO, and then choose "Software" from the menu on the left. Choose Oxygen XML Editor from the list of software options, and then "Oxygen product licence" for the license. Instructions for doing this are on the self-registration page.

Anyone not affiliated with Oxford can use the free trial license (http://www.oxygenxml.com/register.html).

Statistical software: R

You will need to install a recent version of R, freely available for Windows, Mac, and Linux operating systems from http://cran.r-project.org/

You will also need a text editor (not a word processor like MS Word). Most computers come with a basic text editor (such as Notepad for Windows), which will be sufficient. However, the recommended editor environment is RStudio (an integrated R environment), freely available for Windows, Mac, and Linux operating systems from http://www.rstudio.com/products/RStudio/. Choose the open source desktop version.

Group Colour: Light blue

Site last updated: 2015-07-15 -- Contact: events@it.ox.ac.uk