
HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology

4.2.2 Candidate Term Selection

Candidate term selection refers to the preprocessing step of ATE in which it is decided which lexical units are to be considered potential terms. As mentioned, in the traditional, hybrid methodology, this is done based on a predefined list of POS patterns. Examples of systems using this strategy are TermoStat (Drouin, 2003), TermSuite (Cram & Daille, 2016), and TExSIS (Macken et al., 2013). Rather than starting from a predefined list of POS patterns, the POS patterns can also be derived from training data, as was done in the work of Patry and Langlais (2005), who trained a POS-based language model to determine appropriate POS patterns for terms. Another strategy is looking at n-grams (any sequence of n tokens), regardless of POS. This approach was tested by, among others, Wang et al. (2016), who compare an approach based on a common POS pattern (a number of adjectives, followed by a number of nouns) with one where all possible n-grams are selected as candidate terms. However, since selecting all possible n-grams leads to too many candidate terms, they use stopwords as delimiters between possible n-grams. Their POS-based strategy still obtains higher F1-scores. The MWEtoolkit also extracts all raw n-grams, but does allow a posteriori filtering based on POS (Ramisch et al., 2010a, 2010b).
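To make this concrete, the stopword-delimited n-gram strategy can be sketched as follows. This is a minimal illustration, not any system's actual implementation; the stopword list and example sentence are invented.

```python
# Minimal sketch: candidate term selection via n-grams, using stopwords
# as delimiters between candidate spans. Illustrative only.

STOPWORDS = {"the", "is", "of", "a", "an", "in", "for"}

def candidate_ngrams(tokens, max_n=3):
    """Return all n-grams (n <= max_n) that contain no stopword."""
    # Split the token stream on stopwords, then enumerate n-grams
    # inside each stopword-free chunk.
    candidates = set()
    chunk = []
    for token in tokens + [None]:          # sentinel to flush the last chunk
        if token is None or token.lower() in STOPWORDS:
            for i in range(len(chunk)):
                for n in range(1, max_n + 1):
                    if i + n <= len(chunk):
                        candidates.add(" ".join(chunk[i:i + n]))
            chunk = []
        else:
            chunk.append(token)
    return candidates

tokens = "the automatic term extraction of neural networks".split()
print(sorted(candidate_ngrams(tokens)))
```

Note how no candidate ever crosses a stopword: "extraction of neural" is never generated, which is exactly how the combinatorial explosion of raw n-grams is curbed.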

D-TERMINE


Besides using POS patterns or n-grams to select a list of unique candidate terms, there is a third approach as well, in which candidate terms are classified within the text itself: the sequence labelling approach. This approach, rather than extracting a flat list of unique candidate terms, considers each token in the text sequentially, and in relation to the surrounding tokens. Candidate terms can then, for instance, be identified using IOB labels, so that each token is labelled as the Beginning of a term, Inside a term, or Outside any term.

Sequence labelling approaches are still very rare for ATE, but there have been a few attempts in recent years (Kucza et al., 2018; McCrae & Doyle, 2019).
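The IOB scheme can be illustrated with a small sketch. The helper function and example span below are invented for illustration.

```python
# Minimal sketch: converting term span annotations into IOB labels
# for sequence labelling. The sentence and term span are invented.

def iob_labels(tokens, term_spans):
    """Label each token B (Beginning), I (Inside), or O (Outside a term)."""
    labels = ["O"] * len(tokens)
    for start, end in term_spans:          # end is exclusive
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

tokens = ["The", "neural", "network", "converges", "quickly"]
print(iob_labels(tokens, [(1, 3)]))
# → ['O', 'B', 'I', 'O', 'O']
```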

Most approaches fit into one of these three categories relatively easily, though there may be some exceptions. For instance, Gao and Yuan (2019a) use a sequential approach with deep learning and, rather than traditional IOB labelling, they work with all possible term spans in each sentence, with spans up to a maximum term length k, where k must be smaller than or equal to the sentence length. For instance, for the sentence “This is an example”, the maximum term length k is four, since there are four tokens in the sentence. The candidate terms based on n-grams then include four unigrams (this; is; an; example), three bigrams (this is; is an; an example), two trigrams (this is an; is an example), and one 4-gram (this is an example). This strategy allows them to take a sequence labelling approach that can also detect all nested terms, which is not easily possible with an IOB labelling scheme. This way of selecting candidate terms could be regarded as a hybrid of the second and third categories proposed for the candidate term selection aspect: a sequence labelling approach with n-grams.
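The span enumeration for the example sentence can be sketched in a few lines; the function name is invented for illustration.

```python
# Minimal sketch: enumerating all candidate spans of a sentence up to a
# maximum term length k (here k equals the sentence length, as in the
# example above).

def all_spans(tokens, k=None):
    k = k or len(tokens)
    return [" ".join(tokens[i:i + n])
            for n in range(1, k + 1)              # span length 1..k
            for i in range(len(tokens) - n + 1)]  # every start position

spans = all_spans("This is an example".split())
print(len(spans))   # 4 unigrams + 3 bigrams + 2 trigrams + 1 four-gram = 10
print(spans)
```

For a sentence of length n with k = n, this yields n(n+1)/2 spans, which is why a maximum term length k is typically imposed in practice.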

4.2.3 Algorithm

The algorithm can relatively easily be split into rule-based and machine learning methodologies. However, especially in the case of machine learning methodologies, many more distinctions can be made, e.g., supervised vs. semi-supervised vs. unsupervised. An example of an unsupervised deep learning approach is the work of Shah et al. (2019), who use statistical termhood and unithood features to find the most likely candidate terms, and then find similar terms through a siamese neural network with word embeddings.

We refer to their work for more information about supervised vs. unsupervised methodologies for ATE. A semi-supervised approach is also demonstrated by Rokas et al. (2020), who extract Lithuanian terms in the domain of cybersecurity. Using deep neural networks, they are able to train an efficient classifier with very little data.

Machine learning algorithms can, of course, also be divided according to the learner.

Many different kinds have already been used for ATE, e.g., logistic regression (Bolshakova et al., 2013; Fedorenko et al., 2013), the ROGER evolutionary algorithm (Azé et al., 2005), rule induction with RIPPER (Foo & Merkel, 2010), CRF++ (Judea et al., 2014), decision trees (Karan et al., 2012), support-vector machines (Ljubešić et al., 2018), AdaBoost (Patry & Langlais, 2005), k-nearest neighbours (Qasemizadeh & Handschuh, 2014a), and many types of neural networks (Amjadian et al., 2018; Hätty & Schulte im Walde, 2018b; Kucza et al., 2018; Shah et al., 2019; Wang et al., 2016). When it comes to rule-based approaches, classifying a candidate term usually happens based on statistical termhood and unithood measures. Based on a single score, the top n (percent) of candidate terms can be selected, or all candidate terms above a predetermined threshold value. Multiple metrics can also be combined without resorting to machine learning, by using simple voting algorithms instead. This was done by Vivaldi et al. (2001), who contrasted several voting techniques (simple democratic voting, simple non-democratic voting, and numeric voting) to combine multiple features. They conclude that results benefit most from a more sophisticated additional learning step with the AdaBoost algorithm. Similar conclusions, i.e., that combining multiple features with machine learning algorithms is beneficial, have been reached by many researchers in recent years (Dobrov & Loukachevitch, 2011; Loukachevitch, 2012; Nokel et al., 2012; Šajatović et al., 2019).
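The general idea of democratic voting over multiple metrics can be sketched as follows. This is not a reconstruction of any cited system; the metric names, scores, and top-n parameter are invented for illustration.

```python
# Minimal sketch: combining several termhood scores without machine
# learning, via simple democratic voting. Each metric votes for its
# top-n candidates; a candidate is accepted if it receives a majority
# of votes. All scores below are invented.

def democratic_vote(scores_per_metric, top_n=2):
    """scores_per_metric: list of {candidate: score} dicts, one per metric."""
    votes = {}
    for scores in scores_per_metric:
        top = sorted(scores, key=scores.get, reverse=True)[:top_n]
        for cand in top:
            votes[cand] = votes.get(cand, 0) + 1
    majority = len(scores_per_metric) / 2
    return {cand for cand, v in votes.items() if v > majority}

tfidf   = {"neural network": 3.2, "big": 0.4, "term extraction": 2.9}
cvalue  = {"neural network": 5.1, "big": 2.8, "term extraction": 4.0}
loglike = {"neural network": 7.7, "big": 0.1, "term extraction": 6.2}
print(democratic_vote([tfidf, cvalue, loglike]))
```

Each metric keeps its own scale, so no score normalisation is needed: only the per-metric ranking matters, which is precisely what makes such voting schemes attractive as a baseline before resorting to a trained combiner.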

4.2.4 Features

For the third aspect, features, we do not attempt an exhaustive classification, since the variety and creativity of features that are invented to detect terms is too great. However, we do propose a number of categories for some of the most common types of features, with the caveat that methodologies can combine any number of these types of features.

The first two categories have already been mentioned: linguistic features (using, e.g., POS patterns, parsing, stopwords, etc.) and statistical features (consisting mostly of termhood and/or unithood measures). For more information, especially about statistical features, we refer to a survey of methods for ATE by Astrakhantsev et al. (2015). Another type of feature is morphological or shape-related, e.g., length, capitalisation, presence of special characters, Greek or Latin forms, etc. Related to the statistical features are raw frequency features (term frequencies and document frequencies that have not yet been transformed into statistical measures). Another large category is reserved for features based on external resources, such as existing terminologies and ontologies, Wikipedia, or internet searches. For instance, Vivaldi and Rodríguez (2001) rely on the lexical database EuroWordNet, Loukachevitch (2012) uses both features based on an internet search and features based on a domain-specific thesaurus, and Ramisch et al. (2010b) use the results of internet search engines as well. The next type of features are those based on topic modelling, as in the works of Šajatović et al. (2019) and Loukachevitch and Nokel (2013).
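The morphological or shape-related category can be made concrete with a small sketch; the exact feature set is invented for illustration and is not a fixed standard.

```python
# Minimal sketch: a few morphological / shape-related features of the
# kind mentioned above (length, capitalisation, special characters).

import re

def shape_features(candidate):
    tokens = candidate.split()
    return {
        "n_tokens": len(tokens),                                 # term length in tokens
        "n_chars": len(candidate),                               # term length in characters
        "has_capital": any(c.isupper() for c in candidate),      # capitalisation
        "has_digit": any(c.isdigit() for c in candidate),        # digits
        "has_special": bool(re.search(r"[^A-Za-z0-9 ]", candidate)),  # special characters
    }

print(shape_features("COVID-19 vaccine"))
```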

Two less commonly used types of features are those based on language models, like measuring perplexity (Foo, 2009), and features related to the layout and position of the term. The latter have been used for related tasks such as indexing (Koutropoulou & Efstratios, 2019) or for unsupervised training data generation (Judea et al., 2014). In that study, they exploit the typical structure of patents, where the captions of or references to figures often contain terms. Since terms in these locations can be automatically extracted with high accuracy, they use them to automatically generate a training set for a machine learning approach to ATE. A hypothetical reason for the relative absence of such potentially informative features, like the occurrence of candidate terms in bold or italics, or in the titles of texts, may be the fact that most systems work with plain text files, in which such information is not readily available. Another category can be reserved for features relating to context, such as the proximity of a candidate term to other highly scored candidate terms (Vivaldi et al., 2001). NC-Value (Frantzi & Ananiadou, 1999) is a commonly used statistical measure that uses contextual information.
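The intuition behind such context-based measures can be sketched in simplified form: context words that frequently accompany known terms receive a weight, and a candidate is re-scored by the weighted frequencies of its context words. This is only the underlying idea, not the exact NC-Value formula; all data below is invented for illustration.

```python
# Simplified sketch of context weighting as used in context-based
# measures: weight(b) = (number of terms word b co-occurs with) / (total
# number of terms). A candidate's context score is the frequency-weighted
# sum over its context words. Data is invented.

def context_weight(context_word, terms_seen_with, n_terms):
    return len(terms_seen_with.get(context_word, set())) / n_terms

# Which known terms each context word has been observed with:
terms_seen_with = {"present": {"basal cell", "adenoid"}, "show": {"adenoid"}}
n_terms = 2

# Context word frequencies observed around one candidate term:
context_freqs = {"present": 3, "show": 1}
score = sum(freq * context_weight(w, terms_seen_with, n_terms)
            for w, freq in context_freqs.items())
print(score)  # 3 * 1.0 + 1 * 0.5 = 3.5
```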

The final category is devoted to features that use word or character embeddings, which are becoming ever more prevalent. Recently, embeddings have been used in both feature-based and so-called “featureless” methodologies (Gao & Yuan, 2019a; Wang et al., 2016).

In the TermFrame project (Pollak et al., 2019), FastText embeddings trained on a small, domain-specific corpus are used to extend the list of candidate terms obtained through a traditional, hybrid approach. The work of Hätty (2020) also illustrates how word embeddings can be used in the context of ATE. In a first study (Hätty & Schulte im Walde, 2018b), word2vec (Mikolov et al., 2013) embeddings are pre-trained on Wikipedia and post-trained on a domain-specific corpus to detect German compound terms in the domain of cooking. The methodology is elaborated in a later study (Hätty et al., 2020), where corpora are added in the DIY, hunting, and automotive domains, and comparative embeddings are used for fine-grained term prediction. Both general embeddings and domain-specific embeddings are trained, and multiple ways are explored to contrast and combine the information of both embeddings for ATE.
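The general idea of extending a candidate list with embedding neighbours can be sketched as follows. The toy vectors, threshold, and function names are invented; a real system would query trained FastText or word2vec embeddings instead.

```python
# Minimal sketch: extending a set of seed candidate terms with words
# whose embeddings are close (by cosine similarity) to a seed term.
# The three-dimensional toy vectors below are invented.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

embeddings = {
    "term":    [0.9, 0.1, 0.0],
    "concept": [0.8, 0.2, 0.1],   # close to "term"
    "banana":  [0.0, 0.1, 0.9],   # unrelated
}

def expand(seed_terms, threshold=0.9):
    """Add vocabulary words whose similarity to a seed exceeds threshold."""
    expanded = set(seed_terms)
    for seed in seed_terms:
        for word, vec in embeddings.items():
            if word not in expanded and cosine(embeddings[seed], vec) >= threshold:
                expanded.add(word)
    return expanded

print(expand({"term"}))
```

With small domain-specific corpora, subword-based embeddings such as FastText are a natural fit for this kind of expansion, since they can produce vectors even for rare or unseen word forms.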

In conclusion, the types of features have evolved far beyond only the traditional linguistic and statistical information of early ATE methodologies. The proposed categories are not exhaustive but may still be useful to illustrate the variety in features, and to serve as a starting point for informative system descriptions.