To help out a friend of mine with their thesis, I decided to attempt an AI-based lemmatization of Old English, or at least grouping words with common lemmas. My background includes pretty much no formal linguistics: I'm just an HRI researcher who read a couple of linguistics papers and thought they could do better (hence the name of this blog post). Still, I do think there might be some potential here.
The study of Old English is more of an art than a science, where translator interpretation and understanding of context play a big role.
From the little I've researched, machine learning tools see very little use in its study. One big thing that makes it difficult to analyze programmatically is the variability of word spelling - it predates standardized spelling, so the same word can be read and pronounced the same way while being spelled in drastically different ways. Proper lemmatization can solve part of this problem, resolving words with ambiguous spellings to a common lemma (or word root).
## Project goal
This project aims to create a neural network for lemmatization of Old English. The ideal end goal is an engine that takes an Old English word as input, and outputs a lemma - but what I believe would be simpler and more reliable from a machine learning standpoint is for it to take two Old English words as inputs and output a probability of them having a common lemma. Additionally, the latter comparative engine could help to create the former transformative one.
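Just to pin the two designs down, here is roughly what their interfaces would look like (the function names are mine, purely illustrative):

```python
def lemmatize(word: str) -> str:
    """Transformative engine: map an Old English surface form to its lemma."""
    raise NotImplementedError  # e.g. "hlafordes" -> "hlaford"


def common_lemma_probability(word_a: str, word_b: str) -> float:
    """Comparative engine: how likely are two surface forms to share a lemma?"""
    raise NotImplementedError  # e.g. ("hlaford", "hlafordes") -> close to 1.0
```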
Update: it seems there's a ton of well-annotated Old English data available on Wiktionary. I'll ask a few OE experts what they think of it, and if it checks out I'll use it. That also means a different engine design is possible, one that does direct lemmatization.
## Training data
As with most ML projects, its success hinges on the data available. To train the engine, we need a dataset of word pairs with common lemmas; these pairs act as positive examples, with a high target probability for backpropagation. For negative, low-probability training data we can simply take two random words from the entire corpus of Old English.
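To make that concrete, here's a rough sketch of how the pairs could be generated, assuming we already have a mapping from each lemma to its attested forms (building that mapping is the actual hard part; the entries below are just the example words from later in this post):

```python
import random

# Assumed structure: lemma -> attested spellings / inflected forms.
# Populated here only with the example words used in this post.
forms_by_lemma = {
    "hlaford": ["hlaford", "hlavord", "hlafordes"],
    "lagu": ["lagu"],
    "wæccend": ["wæccend"],
}

all_forms = [w for forms in forms_by_lemma.values() for w in forms]


def positive_pairs():
    """Word pairs that share a lemma -> target probability 1.0."""
    for forms in forms_by_lemma.values():
        for i in range(len(forms)):
            for j in range(i + 1, len(forms)):
                yield forms[i], forms[j], 1.0


def random_negative_pair():
    """Two random words from the corpus -> target probability 0.0.
    A few of these will accidentally share a lemma, which I'm assuming
    is rare enough to be tolerable noise."""
    a, b = random.sample(all_forms, 2)
    return a, b, 0.0
```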
So far, there are some widely respected dictionaries out there that might have what we need (like the Bosworth-Toller - thanks to Dr Ondřej Tichý for its upkeep), but I'm very interested in what Wiktionary can provide. Its formatting and layout (and the fact that it includes declensions and conjugations for words) could give us a ton of useful data for training. The fact that I haven't seen it used anywhere else so far is a bit of a concern - maybe the data it catalogues is of bad quality? Will have to reconvene with people who actually speak Old English.
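As a quick sanity check on how machine-readable Wiktionary actually is, entries can be pulled through the standard MediaWiki API. A rough sketch (the real work would be parsing the declension and conjugation templates out of the returned wikitext, which isn't shown):

```python
import requests


def fetch_wikitext(word: str) -> str:
    """Fetch the raw wikitext of a Wiktionary entry via the MediaWiki API."""
    resp = requests.get(
        "https://en.wiktionary.org/w/api.php",
        params={"action": "parse", "page": word,
                "prop": "wikitext", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["parse"]["wikitext"]["*"]


# Old English material sits under the "==Old English==" language heading;
# the inflection tables in that section come from templates, which is
# where the declined and conjugated forms would be extracted from.
text = fetch_wikitext("hlaford")
print("==Old English==" in text)
```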
## Technical plan
- Use word embeddings for input
- Common lemma pair approach:
  - Use TensorFlow Similarity (probably with Triplet Loss) - a rough sketch follows this list
- Lemmatizer approach:
  - Use some other TensorFlow Text approach - likely something simpler, maybe an RNN
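To make the pair approach a bit more concrete, here is a minimal sketch of what it might look like with TensorFlow Similarity. Everything about the architecture is my own assumption - the sizes, the GRU encoder, the maximum word length, the character vocabulary, and the choice to learn the "word embedding" from characters (which seems sensible when spelling variation is exactly the signal we care about):

```python
import tensorflow as tf
from tensorflow_similarity.layers import MetricEmbedding
from tensorflow_similarity.losses import TripletLoss
from tensorflow_similarity.models import SimilarityModel

MAX_LEN = 20     # assumed maximum word length in characters
VOCAB_SIZE = 60  # assumed size of the Old English character inventory

# Character-level encoder: each word is a zero-padded sequence of character IDs.
inputs = tf.keras.layers.Input(shape=(MAX_LEN,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, 32, mask_zero=True)(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64))(x)
outputs = MetricEmbedding(64)(x)  # L2-normalised embedding space

model = SimilarityModel(inputs, outputs)
model.compile(optimizer="adam", loss=TripletLoss())

# Training would feed (character_ids, lemma_id) examples, with the lemma
# acting as the class label that triplet loss pulls together / pushes apart:
# model.fit(char_id_batches, lemma_id_batches, epochs=...)
```

The nice property of this setup is that the comparative engine falls out of it almost for free: embed two words, and their distance in the embedding space becomes the "probability of a common lemma" after some calibration.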
## Potential issues
### Overfitting to Levenshtein distance
Word pairs with common lemmas likely have a low Levenshtein distance between them, and this strikes me as something that could result in overfitting: the network might end up learning little more than edit distance. The words *hlaford*, *hlavord*, and *hlafordes* share a lemma, with Levenshtein distances of 1 to 4 between them. However, *lagu* and *wæccend* (different lemmas, chosen randomly) have a much higher distance of 7 between them.
To adjust for this, the ideal training data would emphasize word pairs with different lemmas that have low Levenshtein distance.
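A rough sketch of how those hard negatives could be mined - `forms_by_lemma` is the same assumed lemma-to-forms mapping as in the earlier sketch, and the distance cutoff is arbitrary:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (no external dependency)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def hard_negative_pairs(forms_by_lemma, max_distance=3):
    """Different-lemma pairs whose spellings are close, so the network
    can't get away with just measuring edit distance."""
    lemmas = list(forms_by_lemma)
    for i, lemma_a in enumerate(lemmas):
        for lemma_b in lemmas[i + 1:]:
            for word_a in forms_by_lemma[lemma_a]:
                for word_b in forms_by_lemma[lemma_b]:
                    if levenshtein(word_a, word_b) <= max_distance:
                        yield word_a, word_b, 0.0
```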
So yeah, that's all for now. The first task will be to get the data; I think the Wiktionary data is more than enough to get started, and even if it isn't, at least I'll have gotten something working along the way. I'll be chipping away at it on my GitLab over the next few weeks.