Old English Lemmatisation: Data source quality and initial results
  • Last updated on 20th Oct 2023

A lot of progress has been made on the lemmatisation project since my initial post. I've since met with a number of Old English experts, and a lot has been cleared up.

🔗Paradigm changes

I learned a few important things that are dramatically changing my approach to this:

🔗Lemmatisation is more basic than one might think

Old English, like its Germanic relatives Swedish and German, has a lot of compound words, such as restendæg - "rest-day" (i.e. the Sabbath day of rest), literally combining rest and dæg. I initially thought lemmatisation meant yielding the base word(s) here, so probably dæg (to be honest, I hadn't considered how it would work). Thankfully, lemmatisation is actually just yielding the uninflected form of a word - a more accurate example would be that, for the input restendagum (the dative plural), it yields restendæg as the lemma.
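
In code terms, the target behaviour is just a mapping from inflected forms to their dictionary form. A toy illustration using the forms above (a hard-coded lookup, purely to show the shape of the problem - the whole point of the project is to learn this mapping for forms that aren't listed anywhere):

    # Purely illustrative: the mapping a lemmatiser should produce for two
    # forms of restendæg ("rest-day"). A real lemmatiser has to generalise
    # this to inflected forms it has never seen before.
    LEMMAS = {
        "restendæg": "restendæg",    # nominative singular -> already the lemma
        "restendagum": "restendæg",  # dative plural -> same lemma
    }

    print(LEMMAS["restendagum"])  # restendæg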

🔗Machine learning has a lot more steps than I was taught

I've started learning TensorFlow. It's a very powerful tool, but it comes with a ton of machine-learning jargon that I'm not at all familiar with. In my master's I successfully set up and trained neural networks by hand in Python, and TensorFlow enables so much more than that, but it's hidden behind terms like "LSTM" and "Bahdanau attention". So it's been a steeper learning curve than anticipated.
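
For anyone else staring at that jargon: an LSTM is a recurrent layer that reads a sequence (here, a word's characters) one step at a time, and Bahdanau attention is the "additive" kind of attention that lets the decoder look back over every position of the input while it writes each output character. Here's a minimal sketch of that sort of architecture in modern Keras - not code I'm actually running, just an illustration, and all the sizes are made up:

    import tensorflow as tf
    from tensorflow.keras import layers

    # Made-up sizes, purely for illustration.
    vocab_size = 40   # distinct characters plus padding/start/end tokens
    embed_dim = 64
    units = 128

    # Encoder: embeds each character of the inflected form and reads it with an LSTM.
    enc_inputs = layers.Input(shape=(None,), dtype="int32", name="source_chars")
    enc_emb = layers.Embedding(vocab_size, embed_dim)(enc_inputs)
    enc_seq, enc_h, enc_c = layers.LSTM(units, return_sequences=True,
                                        return_state=True)(enc_emb)

    # Decoder: another LSTM that generates the lemma character by character,
    # starting from the encoder's final state.
    dec_inputs = layers.Input(shape=(None,), dtype="int32", name="target_chars")
    dec_emb = layers.Embedding(vocab_size, embed_dim)(dec_inputs)
    dec_seq, _, _ = layers.LSTM(units, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[enc_h, enc_c])

    # Bahdanau-style (additive) attention: each decoder step gets to look back
    # over all the encoder outputs, not just the final state.
    context = layers.AdditiveAttention()([dec_seq, enc_seq])
    logits = layers.Dense(vocab_size)(layers.Concatenate()([dec_seq, context]))

    model = tf.keras.Model([enc_inputs, dec_inputs], logits)
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.summary()

The lemmatiser I actually ended up using (more on that below) is a much older TensorFlow 1.x codebase, but the underlying idea is the same.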

🔗Lemmatisation can be spelled with an s in non-American English

Just to stay consistent :)

🔗Better data exists, but with caveats

The be-all and end-all for Old English attested spellings would appear to be the Dictionary of Old English, digitised by folks at the University of Toronto. It lists all attested spellings for each entry, along with copious amounts of well-organised data.

There are two issues with using this data: first, it's not publicly accessible, and second, it only covers words from A to I (digitisation is a tough job). So it could be well worth trying to get access to it for future projects, but for now the Wiktionary data will have to do.

🔗Current direction

Because of the above, I've shifted away from the common-lemma-pair approach and moved towards making a lemmatiser.

My colleague Rian (one of the OE experts) and I have taken the time to clean up my training data, and we managed to get someone else's lemmatiser from GitHub working.

🔗Technical implementation

We found a Finnish-language lemmatiser by Jesse Myrberg, built on TensorFlow. Since Finnish likewise has a range of potential declensions for each word, we figured it was close enough to Old English to use the same approach. In technical terms, it uses seq2seq (sequence-to-sequence transformation). The input data takes the form of simple CSV files, with the first ("source") column holding the inflected form and the second ("target") column the lemma.
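
To make the format concrete, here's a tiny script that writes a training file in that source/target layout, using a few pairs from my data. The file name is arbitrary, and whether the tool expects a header row is an assumption worth double-checking:

    import csv

    # A few (inflected form, lemma) pairs in the "source,target" layout the
    # lemmatiser expects. The rows and the file name are illustrative.
    pairs = [
        ("restendagum", "restendæg"),
        ("hlæfdigan", "hlæfdige"),
        ("monandaga", "monandæg"),
    ]

    with open("train.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target"])  # header: inflected form, lemma
        writer.writerows(pairs)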

It took an entire day to fix compatibility issues, as the project dates back to roughly 2017 and uses a completely outdated version of TensorFlow. It was a challenge, but I managed to get it working through Anaconda by creating a virtual environment running Python 3.6 in order to install TensorFlow 1.3.

🔗Initial results

For the first run, we only included the Wiktionary OE nouns, and the results were pretty abysmal - gibberish, mostly vowels, with only a vaguely OE energy to them. So Rian suggested we add adjectives as well, figuring that they inflect quite similarly. This cleaned up the results substantially, but it was still far from yielding anything close to accurate. Here's a small excerpt:

source	target	0
hlæfdigan	hlæfdige	firices	
heape	heap	feat	
hornungsuna	hornungsunu	oreangng	
castelle	castel	scellat	
sweflas	swefl	sesal	
hilce	hilc	icld	
getwinnas	getwinn	seanges	
monandaga	monandæg	eangang	
feterum	feter	etere	
ondan	onda	onda	
petersiligena	petersilige	geiweliteal	
nowende	nowend	eneol	
gifeþe	gifeþe	giege	
gesealdnissa	gesealdnis	seselanfes	
wintergeweorpe	wintergeweorp	geseliterer	
huntunge	huntung	egeng	
toþsticcan	toþsticca	ctticeol

("0" is what the engine generated for "source" as input, when it was supposed to yield "target").

Out of the 100 words in the testing data, only one was lemmatised correctly - onda, shown above. But it's getting better - I'd say more data is all it needs at the moment.
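
(A count like that is just an exact-match comparison between the "target" column and the generated one - something like the snippet below, assuming the tab-separated source/target/"0" layout shown above; the file name is hypothetical.)

    import csv

    # Exact-match accuracy over the prediction file, assuming tab-separated
    # columns "source", "target" and "0" (the model's output). The file name
    # "predictions.tsv" is made up for this example.
    correct = 0
    total = 0
    with open("predictions.tsv", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            total += 1
            if row["0"].strip() == row["target"].strip():
                correct += 1

    print(f"exact-match accuracy: {correct}/{total} ({correct / total:.1%})")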

The other issue is that any input containing the letter ash (æ) seemed to produce an output with the letters ae instead. For subsequent attempts, I'm going to encode special characters before training.

🔗What next?

I'm a bit too busy to work on it right now, but here are the next steps:

  • Add more data - probably in the form of verbs and their conjugated forms. Even though verbs inflect very differently from nouns and adjectives, they'll still give the model a feel for the language, and help it learn that it's primarily the end of the word that has to change.
  • Encode special characters - there are three characters in Old English that aren't in Modern English - thorn (þ), eth (ð) and ash (æ) - and three characters in Modern English that aren't used in OE: j, q and v. That makes it pretty easy to encode the inputs and outputs as required (see the sketch below).
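
Here's a minimal sketch of that encoding step, assuming a straight one-to-one swap between the two sets of letters; the particular pairing (þ→j, ð→q, æ→v) is just an arbitrary choice for illustration:

    # Swap the three OE-only letters for the three letters OE doesn't use,
    # so the model only ever sees plain ASCII, and reverse it on the way out.
    ENCODE = str.maketrans({"þ": "j", "ð": "q", "æ": "v",
                            "Þ": "J", "Ð": "Q", "Æ": "V"})
    DECODE = str.maketrans({"j": "þ", "q": "ð", "v": "æ",
                            "J": "Þ", "Q": "Ð", "V": "Æ"})

    def encode(word: str) -> str:
        return word.translate(ENCODE)

    def decode(word: str) -> str:
        return word.translate(DECODE)

    assert decode(encode("restendæg")) == "restendæg"
    print(encode("toþsticcan"))  # -> tojsticcan

The appeal is that the encoded words are plain ASCII, so the model never touches multi-byte characters - which is presumably where the æ → ae mangling crept in in the first place.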