Seminar | Part-of-Speech Tagging & Lemmatisation in Unedited Greek: Simple Tasks, Complex Challenges?

Event organised by the Computational Humanities research group.

To register for the seminar, please fill in this form by 1 December 2024.

10 December 2024 – 1.10pm GMT

Remote – Via Microsoft Teams.

Colin Swaelens (Ghent University), Part-of-Speech Tagging & Lemmatisation in Unedited Greek: Simple Tasks, Complex Challenges?

Abstract

In a language technology landscape dominated by large language models, tasks like part-of-speech tagging and lemmatisation receive less attention in current NLP research. However, these tasks still pose significant challenges, especially for under-resourced, morphologically rich languages like Ancient Greek. Our project focuses on the verbatim transcriptions of Byzantine marginal poetry stored in the Database of Byzantine Book Epigrams (DBBE). Because the poems are highly interconnected, we aim to eventually perform similarity detection across the corpus. As a first step, we sought to annotate the DBBE with part-of-speech tags, morphological analyses, and lemmas. Although research on these tasks dates back to the more straightforward rule-based systems of the 1970s, current taggers struggle with these unedited texts. The inconsistent orthography, largely due to itacism, adds to this complexity. To mitigate these issues, we trained a transformer-based language model encompassing classical, medieval, and modern Greek. Our experiments, however, revealed that fine-tuning the model for each annotation task was not always fruitful. There is a growing tendency to address such challenges with a multi-task head, an approach inspired by cognitive psychology that allows the model to process multiple annotations concurrently. This raises the question: will this more intricate solution outshine the seemingly more transparent methods of the past?

Bio

Colin Swaelens is a PhD student at the Language & Translation Technology Team (LT3) and the Database of Byzantine Book Epigrams (DBBE) at Ghent University, under the supervision of Dr Ilse De Vos (Flanders AI Academy) and Prof. Els Lefever (LT3). His PhD is embedded in the project "Interconnected texts: a graph-based computational approach to Byzantine paratexts as nodes between textual transmission and cultural and linguistic developments". Within this project, he is developing an annotation pipeline to provide all texts in DBBE with a part-of-speech tag, morphological analysis and lemma. At a later stage, this linguistic information will be used to develop a tool that detects similar verses in the corpus, serving the other subprojects on manuscript culture and formulaicity.

