A lot of progress has been made on the lemmatisation project since my initial post. Since then, I have met with a number of Old English experts, and a lot has been cleared up.
🔗Paradigm changes
I learned a few important things that are dramatically changing my approach to this:
🔗Lemmatisation is more basic than one might think
Old English, like its Germanic relatives Swedish and German, has a lot of compound words, such as restendæg - "rest-day" (i.e. the Sabbath day of rest), literally combining rest and dæg. I initially thought lemmatisation would yield the base word(s) here, so probably dæg (to be honest, I hadn't considered how it would work). Thankfully, lemmatisation is actually just yielding the uninflected form of a word - a more accurate example would then be: for the input restendagum (the dative plural), it should yield restendæg as the lemma.
🔗Machine learning has a lot more steps than I was taught
I've been starting to learn TensorFlow. It's a very powerful tool, but it comes with a ton of machine-learning jargon that I'm very unfamiliar with. In my master's I successfully set up and trained neural networks manually in Python; TensorFlow enables so much more than that, but it's obscured behind terms like "LSTM" and "Bahdanau attention". So it's been a steeper learning curve than anticipated.
🔗Lemmatisation can be spelled with an s in non-American English
Just to stay consistent :)
🔗Better data exists, but with caveats
The be-all and end-all for Old English attested spellings would appear to be the Dictionary of Old English, digitised by folks at the University of Toronto. It has all attested spellings for its entries, as well as copious amounts of well-organised data for each.
There are two issues with using this data: first, it's not publicly accessible, and second, it only includes words from A to I (digitisation is a tough job). So it could be very useful to try to get access to it for future projects, but for now the Wiktionary data will have to do.
🔗Current direction
Because of the above, I've shifted away from the common-lemma-pair approach and moved towards making a lemmatiser.
My colleague Rian (one of the OE experts) and I have taken the time to clean up my training data, and we managed to get someone else's lemmatiser working from GitHub.
🔗Technical implementation
We found a Finnish-language lemmatiser made by Jesse Myrberg on TensorFlow. Since Finnish has a wealth of potential declensions for each word, we figured it was close enough to Old English to use the same approach. In theoretical terms, it uses seq2seq (sequence-to-sequence transformation). The input data takes the form of simple CSV files, with the first "source" column being the inflected form and the second "target" column the lemma.
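As a rough sketch of that input format (the file name and the example pairs here are my own illustration, not the actual training set), the two-column CSV can be produced like this:

```python
import csv

# Hypothetical inflected-form -> lemma pairs for illustration;
# the real training data comes from the cleaned Wiktionary export.
pairs = [
    ("restendagum", "restendæg"),  # dative plural -> lemma
    ("hlæfdigan", "hlæfdige"),
    ("monandaga", "monandæg"),
]

# Write the two-column CSV the seq2seq model trains on:
# first column "source" (inflected form), second "target" (lemma).
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "target"])
    writer.writerows(pairs)
```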
It took an entire day to fix compatibility issues, as the project dates back to roughly 2017 and uses a completely outdated version of TensorFlow. It was a challenge, but I managed to get it working through Anaconda, by creating a virtual environment running Python 3.6 in order to install TensorFlow 1.3.
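For reference, the environment setup boiled down to something like this (the environment name is my own choice; the Python and TensorFlow versions are the ones mentioned above):

```shell
# Create an isolated environment with the old Python the 2017 code expects
conda create -n oe-lemmatiser python=3.6

# Activate it and install the matching TensorFlow release
conda activate oe-lemmatiser
pip install tensorflow==1.3.0
```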
🔗Initial results
On the first run, we only included the Wiktionary OE nouns, and the results were pretty abysmal - gibberish, mostly vowels, with very vaguely OE energy to it. So Rian suggested we add adjectives as well, figuring that they work quite similarly. This cleaned up the results substantially, but it was still far from yielding anything close to accurate. Here's a small excerpt:
| source | target | 0 |
| --- | --- | --- |
| hlæfdigan | hlæfdige | firices |
| heape | heap | feat |
| hornungsuna | hornungsunu | oreangng |
| castelle | castel | scellat |
| sweflas | swefl | sesal |
| hilce | hilc | icld |
| getwinnas | getwinn | seanges |
| monandaga | monandæg | eangang |
| feterum | feter | etere |
| ondan | onda | onda |
| petersiligena | petersilige | geiweliteal |
| nowende | nowend | eneol |
| gifeþe | gifeþe | giege |
| gesealdnissa | gesealdnis | seselanfes |
| wintergeweorpe | wintergeweorp | geseliterer |
| huntunge | huntung | egeng |
| toþsticcan | toþsticca | ctticeol |
("0" is what the engine generated for "source" as input, when it was supposed to yield "target").
Out of the 100 words in the testing data, only one was accurate - onda, shown above. But it's getting better - I'd say more data is all it needs at the moment.
The other issue is that any input that included the letter ash (æ) seemed to result in an output with the letters ae instead. For subsequent tries, I'm going to encode special characters before training it.
🔗What next?
I'm a bit too busy to work on it right now, but here's the next steps:
- Add more data - probably in the form of verbs and their conjugated forms. Even if verbs work very differently to nouns and adjectives, they'll still give the model a feel for the language, and help it understand that the end of the word is primarily what has to change.
- Encode special characters - there are three characters in Old English that aren't in Modern English - thorn (þ), eth (ð) and ash (æ) - and three characters in Modern English that aren't in OE: j, q and v. That makes it pretty easy to encode the inputs and outputs as required.
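The substitution idea above can be sketched like this (the specific þ→j, ð→q, æ→v pairing is my illustrative choice; any one-to-one mapping works, as long as it's reversed on output):

```python
# Map the three OE-only characters onto the three ModE-only letters,
# so the model only ever sees characters it can also emit; reverse
# the map on output. The exact pairing here is an assumption.
ENCODE = str.maketrans({"þ": "j", "ð": "q", "æ": "v"})
DECODE = str.maketrans({"j": "þ", "q": "ð", "v": "æ"})

def encode(word: str) -> str:
    """Replace OE-specific characters before feeding the model."""
    return word.translate(ENCODE)

def decode(word: str) -> str:
    """Restore OE-specific characters in the model's output."""
    return word.translate(DECODE)

print(encode("restendæg"))          # restendvg
print(decode(encode("toþsticca")))  # round-trips back to toþsticca
```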