Old English Lemmatisation: Resuming work
  • Last updated on 4th Sep 2024

I'm looking to revisit the Old English Lemmatisation project. The first attempt was interesting, with some promising but inherently limited results - there was only so much that could be done while I was writing my thesis. Now that I'm back in academia, there's a lot more that can be done, especially with the hindsight of that attempt.

As mentioned in the previous post on the topic, the first attempt required encoding Old English's special characters (like þ, ð and æ). That wasn't too complicated: Old English doesn't use letters like J, so the special characters could be swapped for unused letters before training, and the output converted back afterwards. From what I remember, using a tighter alphabet did produce better results, but the model's performance still never rose above 65%.
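
As a rough illustration, the substitution can be as simple as a character translation table - the mapping below (þ, ð and æ to j, q and v) is just a guess at the sort of table the original code used, not the actual one:

```python
# A guess at the kind of mapping used: OE-specific characters swapped for
# ASCII letters that don't occur in Old English (the repo's actual table
# may have differed).
TO_ASCII = str.maketrans({"þ": "j", "ð": "q", "æ": "v",
                          "Þ": "J", "Ð": "Q", "Æ": "V"})
FROM_ASCII = str.maketrans({"j": "þ", "q": "ð", "v": "æ",
                            "J": "Þ", "Q": "Ð", "V": "Æ"})

encoded = "þæt".translate(TO_ASCII)      # -> "jvt"
decoded = encoded.translate(FROM_ASCII)  # -> "þæt"
```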

At the time, this is where it started to feel a bit demoralising - I just didn't know what to improve. And that's because I hadn't written the code; I was just cloning a repository and hoping for the best. The code itself was definitely good! But with hindsight, there's so much we missed out on because of my lack of domain knowledge.

Without the responsibilities of the PhD, I want to invest more time into this and do things properly.

🔗Things that can be done better

There are a few things I want to improve on, and most of them tie into the general approach: Old English is a dead language with a known, limited corpus, and that's something we need to make the most of for machine learning. In particular, we can create more informed word embeddings, build a more robust model, create better training data and produce more usable output.

🔗Word embeddings

Most word embeddings are trained on text so that similar words get similar vectors - for example, "cake" and "pie" would sit close to each other in the vector space. word2vec does this with sliding context windows, so words that appear in similar contexts end up with similar embeddings. With the pre-existing lemmatiser, I'm not entirely sure how the embeddings were built - probably in a fairly uninformed way, since the only input was the word list itself, without much other information. But we already have the entire Old English corpus, so nothing stops us from using it to create the embeddings in the first place, the way modern English embeddings are created by pumping all of Wikipedia into a neural net.

This might also let us incorporate word position within sentences into the model. Could that help disambiguate some lemmas down the line, e.g. for short words that share inflections? Unsure, but any extra data we can use is good.
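
As a sketch of what that might look like (with a toy two-sentence stand-in for the corpus, and gensim 4.x parameter names), training word2vec directly on the corpus is only a few lines:

```python
from gensim.models import Word2Vec

# Toy stand-in for the tokenised corpus; in practice every sentence of the
# full Old English corpus would go in here.
sentences = [
    ["se", "cyning", "heold", "þæt", "rice"],
    ["se", "cyning", "wæs", "swiþe", "mihtig"],
]

# Words that appear in similar contexts end up with similar vectors.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

vector = model.wv["cyning"]               # embedding for one word
print(model.wv.most_similar("cyning"))    # its nearest neighbours in the space
```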

🔗Statistical model

The previous version of this used seq2seq, a powerful ML architecture that takes a text sequence as input and produces a text sequence as output. At first I appreciated the open-ended nature of this approach - what if we didn't know about every lemma? The neural network could create whatever lemma it thought made sense for the input.

But that's the thing: we do know every single Old English word, so why not treat each known lemma as a potential output for the neural net? That way, instead of generating a single, potentially garbage lemma, the engine could rank the most likely known lemmas for the input word. Not only would that guarantee the output is an existing lemma, it would also let researchers check the 2nd- or 3rd-best fits for the input word, in case the 1st doesn't seem quite right.
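
To make that concrete, here's a minimal PyTorch sketch of the idea - a classifier that scores every known lemma for an input word, where the vector size and lemma count are made-up placeholders:

```python
import torch
import torch.nn as nn

class LemmaRanker(nn.Module):
    """Scores every known lemma for an input word representation."""
    def __init__(self, embedding_dim: int, hidden_dim: int, num_lemmas: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_lemmas),  # one score per known lemma
        )

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        return self.net(word_vectors)

# Made-up sizes: 100-dim word vectors, ~15,000 known lemmas.
model = LemmaRanker(embedding_dim=100, hidden_dim=256, num_lemmas=15_000)
scores = model(torch.randn(1, 100))        # one inflected word as a vector
best3 = torch.topk(scores, k=3, dim=-1)    # the 2nd- and 3rd-best fits come for free
```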

🔗Better training data

Before, I was generating a full list of lemmas and inflected forms from Wiktionary data, pretty much treating each word the same. However, using the entire corpus to inform the training set lets us be a bit more frequentist about it, so that common lemmas and inflected forms appear more often in the training data. That should give a more informed lemmatisation, so rare lemmas aren't constantly spat out when they shouldn't be.
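
As a rough sketch (with toy stand-ins for the corpus and the Wiktionary pairs), the weighting could be as simple as sampling training pairs in proportion to corpus frequency:

```python
import random
from collections import Counter

# Toy stand-ins: in practice these come from the full OE corpus and the
# Wiktionary-derived form-to-lemma pairs.
corpus_tokens = ["cyning", "cyning", "cyninges", "wordum", "word"]
form_to_lemma = {"cyning": "cyning", "cyninges": "cyning",
                 "wordum": "word", "word": "word"}

# How often each attested inflected form appears in the corpus.
freq = Counter(t for t in corpus_tokens if t in form_to_lemma)
forms = list(freq)

# Sample (form, lemma) training pairs in proportion to that frequency,
# so common words dominate the training set rather than rare ones.
training_pairs = [
    (form, form_to_lemma[form])
    for form in random.choices(forms, weights=[freq[f] for f in forms], k=10)
]
```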

🔗Sharing the output

One thing we were a bit concerned about was sharing the results publicly. We wanted to make sure any OE researchers could access them, and a Git repository does in theory tick that box, but part of the motivation for this project is that Old English researchers don't tend to be computer scientists - there's no way they would clone the repo, install the prerequisites and run the model. Even foolproof, user-friendly instructions would be a slog. But Rian came up with a great way of sharing the data online: a static database.

Since the corpus is limited, we can simply run the model for each and every Old English word and save the results. Then a simple static website can let users search for any Old English word and view the lemma (or ranked lemma candidates) the neural network produced.
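
A sketch of that precomputation step, assuming a hypothetical lemmatise() wrapper around the trained model and a list of every attested form:

```python
import json

# Hypothetical pieces: `all_forms` would be every attested OE word form,
# and `lemmatise()` a wrapper returning the model's ranked lemma candidates.
all_forms = ["cyninges", "wordum"]                       # toy placeholder
def lemmatise(form: str) -> list[str]:                   # stand-in for the model
    return {"cyninges": ["cyning"], "wordum": ["word"]}[form]

# Run the model once for every word and dump the results; the static site
# then only has to look up a key in this file.
lookup = {form: lemmatise(form) for form in all_forms}

with open("lemmas.json", "w", encoding="utf-8") as f:
    json.dump(lookup, f, ensure_ascii=False, indent=2)
```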

🔗Other things

Aside from that, there are a few things I want to try doing better:

  • Using PyTorch instead of TensorFlow, and building the model from scratch
  • Creating "artificial data" by manually generating spelling variations of known lemmas (possibly using that to do a more frequentist approach with the full corpus) - there's a quick sketch of this after the list
  • Incorporating available data wherever possible!
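
On the artificial-data point, a rough sketch of what I mean - the swaps here (þ/ð and i/y) are only illustrative examples of the kind of variation to generate, not a real table:

```python
# Illustrative swaps only; a real table would cover far more OE spelling
# variation than these two pairs.
SWAPS = {"þ": "ð", "ð": "þ", "i": "y", "y": "i"}

def spelling_variants(word: str) -> set[str]:
    """Return the word plus every variant with one character swapped."""
    variants = {word}
    for i, ch in enumerate(word):
        if ch in SWAPS:
            variants.add(word[:i] + SWAPS[ch] + word[i + 1:])
    return variants

print(spelling_variants("þing"))   # {'þing', 'ðing', 'þyng'}
```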

That last one might lead me to connect with other researchers out there - gotta build from the shoulders we stand on, etc.