Prologue to Part 2

The second part of this dissertation is dedicated to HAMLET – Hybrid Adaptable Machine Learning approach to Extract Terminology. HAMLET is based on the traditional hybrid methodology for automatic term extraction but interprets this as a supervised machine learning problem. The goal of this part of the project was not necessarily to develop the best possible system for automatic term extraction or to obtain the highest f1-scores, but rather to examine the potential impact of machine learning approaches on automatic term extraction. How do machine learning approaches compare to non-machine learning methodologies? Do they have similar strengths and weaknesses? Which types of features contribute to the successful identification of terms? What is the effect of the training data? How robust is performance across different languages and domains? The ACTER dataset was specifically designed with such approaches in mind, so it could be used as training and evaluation data.

It was decided to base this first machine learning approach on the more traditional, hybrid approach to automatic term extraction. Roughly, this approach starts by linguistically preprocessing a text (tokenisation, lemmatisation, part-of-speech tagging, etc.), and selecting an initial list of unique candidate terms based on part-of-speech patterns. For instance, we know that nouns can be terms, so all nouns will be extracted as candidate terms. Similarly, combinations of one adjective and one noun could be terms, so all bigrams (two sequential tokens) that consist of one adjective, followed by one noun will be extracted. This strategy will, of course, result in an initial list of candidate terms with a lot of noise (false positives, i.e., candidate terms that are not real, valid terms). The list can be filtered, e.g., by removing stopwords (common words in general language that will rarely be terms) or applying a frequency threshold and removing all candidate terms that do not occur frequently enough. Most of the filtering, however, will be performed based on statistical metrics designed to measure termhood (how relevant the candidate term is to the domain), and, for multi-word candidate terms, unithood (whether the different parts of the multi-word candidate term form a cohesive unit) (Kageura & Umino, 1996). The candidate terms can then be ranked based on how they score on one or more of these metrics, and only the best ones are kept.
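The traditional pipeline described above can be illustrated with a minimal sketch. The POS tags, patterns, stopword list, and frequency threshold below are purely illustrative; a real system would use a proper tagger and corpus-scale statistics.

```python
from collections import Counter

# Toy POS-tagged input: (token, part-of-speech) pairs.
# Tags follow a simplified universal tagset; the sentence is made up.
tagged = [
    ("the", "DET"), ("heart", "NOUN"), ("rate", "NOUN"),
    ("is", "VERB"), ("a", "DET"), ("vital", "ADJ"), ("sign", "NOUN"),
    ("and", "CCONJ"), ("the", "DET"), ("heart", "NOUN"), ("rate", "NOUN"),
    ("matters", "VERB"),
]

# Manually defined POS patterns, as in the traditional approach:
# single nouns, adjective+noun bigrams, and noun+noun bigrams.
PATTERNS = [("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")]
STOPWORDS = {"the", "a", "is", "and"}

def extract_candidates(tagged, patterns):
    """Collect every n-gram whose POS sequence matches a pattern."""
    counts = Counter()
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if tuple(pos for _, pos in window) == pattern:
                counts[" ".join(tok for tok, _ in window)] += 1
    return counts

candidates = extract_candidates(tagged, PATTERNS)

# Filter the noisy initial list: drop stopwords and apply a
# (toy) frequency threshold of 2 occurrences.
filtered = {ct: freq for ct, freq in candidates.items()
            if ct not in STOPWORDS and freq >= 2}
# "heart", "rate", and "heart rate" survive; "sign" and "vital sign" do not.
```

In a real system, the surviving candidates would next be scored with termhood and unithood metrics (Kageura & Umino, 1996) and ranked, rather than kept or discarded on raw frequency alone.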

There are two main differences between the traditional hybrid approach and my machine learning hybrid approach. The first difference concerns the part-of-speech patterns. With HAMLET, the patterns that are used to extract the initial list of candidate terms do not need to be manually defined beforehand but can be automatically extracted from the training data. So, if the training data contains adjectives that are annotated as terms, then all adjectives in the test data will be extracted as candidate terms. The second difference relates to how the initial list of candidate terms is filtered and sorted. Instead of using just one or two statistical metrics and manually setting a threshold to select the best candidate terms, HAMLET calculates dozens of different features with various types of information for each candidate term, and automatically learns the optimal combinations of features from the training data. Therefore, the final selection of candidate terms can be based on much more information, and no manual threshold needs to be determined.
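The first difference, learning the POS patterns from annotated training data rather than defining them by hand, can be sketched as follows. The training sentences, annotated term spans, and tags are invented for illustration; HAMLET derives its patterns from real ACTER annotations.

```python
# Toy training data: POS-tagged tokens plus the spans (start, end)
# that annotators marked as terms. Both are made up for this sketch.
train_tagged = [("gene", "NOUN"), ("expression", "NOUN"), ("is", "VERB"),
                ("regulated", "VERB"), ("by", "ADP"),
                ("epigenetic", "ADJ"), ("factors", "NOUN")]
train_term_spans = [(0, 2), (5, 7)]  # "gene expression", "epigenetic factors"

# Step 1: learn the POS patterns of the annotated terms.
learned_patterns = {
    tuple(pos for _, pos in train_tagged[start:end])
    for start, end in train_term_spans
}
# learned_patterns is {("NOUN", "NOUN"), ("ADJ", "NOUN")}

# Step 2: apply the learned patterns to unseen text to get candidates.
test_tagged = [("protein", "NOUN"), ("folding", "NOUN"),
               ("requires", "VERB"), ("molecular", "ADJ"),
               ("chaperones", "NOUN")]

def extract(tagged, patterns):
    out = []
    for n in {len(p) for p in patterns}:
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if tuple(pos for _, pos in window) in patterns:
                out.append(" ".join(tok for tok, _ in window))
    return out

candidates = extract(test_tagged, learned_patterns)
# Matches "protein folding" (NOUN NOUN) and "molecular chaperones" (ADJ NOUN).
```

The second difference takes over from here: instead of ranking these candidates with one or two hand-thresholded metrics, each candidate is represented by dozens of features, and a supervised classifier trained on the annotated data decides which candidates are terms.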

Like the previous part, this one is based on two publications:

3. Rigouts Terryn, A., Drouin, P., Hoste, V., & Lefever, E. (2019). Analysing the Impact of Supervised Machine Learning on Automatic Term Extraction: HAMLET vs TermoStat. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 1012–1021. https://doi.org/10.26615/978-954-452-056-4_117

4. Rigouts Terryn, A., Drouin, P., Hoste, V., & Lefever, E. (2021). HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology. Terminology, 27(2).

The first publication is a conference paper that reports on a pilot study for the HAMLET project. In this publication, a first version of HAMLET is compared to a traditional hybrid methodology that does not use machine learning: TermoStat. This tool was developed by Prof. Dr. Patrick Drouin, with whom I collaborated during a three-month research stay. One of the projects on which we collaborated during this stay was the development of a Dutch version of TermoStat. The comparison of the results goes beyond reporting precision, recall, and f1-scores, and includes a manual error analysis. The Dutch corpus on dressage was chosen for this purpose, since this is a corpus in my native language on a subject for which I am a domain expert. That way, I was able to annotate the errors myself.

The second publication is a journal paper with an updated version of HAMLET, so the methodology is very similar, but not identical, to the one in the first paper. This journal paper contains a much more detailed system description and further investigates how machine learning can best be applied to extract terms, and which factors influence the results most. Additionally, it starts with an elaborate overview of the state of the art in automatic term extraction and suggests a new typology for discussing the various methodologies, since the research has outgrown the traditional distinction between statistical, linguistic, and hybrid approaches. HAMLET is tested on all corpora of the ACTER dataset to assess how robust performance is across different languages and domains. The training data is varied as well, to assess the impact of language- and/or domain-specific training data. The error analysis further explores the role of the different types of

annotations, and that of term frequency and term length. Finally, the relative importance of the different types of features is investigated to learn how various types of information contribute towards the identification of terms.

Paper 3