
HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology

4.4 Methodology and Experiments

4.4.1 Experimental Setup

4.4.1.1 Preprocessing and Candidate Term Selection based on POS

All corpora are linguistically preprocessed using LeTs Preprocess (van de Kauter et al., 2013), which includes tokenisation, lemmatisation, POS-tagging, chunking, and Named Entity Recognition (NER). The POS-tagging module in LeTs Preprocess is language-dependent: the English, French, and Dutch tagsets come from the Penn Treebank, TreeTagger, and CGN (Corpus Gesproken Nederlands) respectively. The English and French tagsets are similar in size, with 37 and 34 POS tags respectively, but the Dutch tagset is much more fine-grained, with 152 different POS tags. Since POS tags are such a crucial part of the HAMLET methodology, a single set of language-independent POS tags was required to allow fair cross-lingual experiments and multilingual models. To map the different sets of tags, Universal Dependencies (UD) (Petrov et al., 2012) were used as a basis. A mapping was already available for the English and Dutch tagsets, and by comparing these existing mappings, one could also be derived for the French tags. However, since the original LeTs tagsets – even the English and French ones – were all more fine-grained, a few tags were added to the original UD set, resulting in a final set of 26 tags. It was then hypothesised that, in addition to this relatively fine-grained set of POS tags, a more coarse-grained approach might also be beneficial. Consequently, starting from this fine-grained POS tagset, a coarse-grained simple POS tagset was created with only eight tags. Both sets are shown in Table 11.

3 In this paper (paper 4, on HAMLET), all token counts include separate end-of-sentence (EOS) tokens, which is why these numbers are higher than in the previous papers. EOS tokens are treated as separate tokens in the HAMLET methodology, so it is logical to report the counts including them in the system description paper.


Table 11 standard and simple POS tagsets

standard POS   simple POS   description
ABR            X            Abbreviation
ADJ            ADJ          Adjective
ADP            FUNC         Adposition (prepositions etc.)
ADV            ADV          Adverb
CONJ           FUNC         Conjunction
DET            FUNC         Determiner
FW             X            Foreign word
INTJ           X            Interjection
NOUN           NN           Noun
NUM            X            Numeral
PART           FUNC         Particle
PNO            FUNC         Pronoun (other)
PNPR           FUNC         Pronoun (personal)
PNPS           FUNC         Pronoun (possessive)
PROPN          FUNC         Proper noun
PUNCT          PUNCT        Punctuation (general)
PUNQ           PUNCT        Punctuation (quotation marks)
PUNB           PUNCT        Punctuation (parentheses)
SYM            X            Symbol
VB             VB           Verb (infinitive)
VBG            VB           Verb (present participle or gerund)
VBN            VB           Verb (past participle)
VBPA           VB           Verb (past tense)
VBPR           VB           Verb (present tense or imperative)
VBX            VB           Verb (other)
EOS            EOS          End Of Sentence
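As a small illustration of how this mapping can be applied in practice, the Python sketch below collapses the standard tags into the simple set, following the correspondences listed in Table 11; the dictionary-based implementation and function name are illustrative assumptions, not the actual LeTs/HAMLET code.

```python
# Illustrative sketch: collapse the 26-tag standard set into the 8-tag simple set.
# The entries follow Table 11.
STANDARD_TO_SIMPLE = {
    "ABR": "X", "ADJ": "ADJ", "ADP": "FUNC", "ADV": "ADV",
    "CONJ": "FUNC", "DET": "FUNC", "FW": "X", "INTJ": "X",
    "NOUN": "NN", "NUM": "X", "PART": "FUNC",
    "PNO": "FUNC", "PNPR": "FUNC", "PNPS": "FUNC", "PROPN": "FUNC",
    "PUNCT": "PUNCT", "PUNQ": "PUNCT", "PUNB": "PUNCT", "SYM": "X",
    "VB": "VB", "VBG": "VB", "VBN": "VB", "VBPA": "VB",
    "VBPR": "VB", "VBX": "VB", "EOS": "EOS",
}

def to_simple_pos(standard_tags):
    """Map a sequence of standard POS tags to the simple POS tagset."""
    return [STANDARD_TO_SIMPLE[tag] for tag in standard_tags]

# e.g., the tags for "heart failure" plus an end-of-sentence token:
# to_simple_pos(["NOUN", "NOUN", "EOS"]) -> ["NN", "NN", "EOS"]
```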


Since HAMLET is based on traditional ATE and aims to extract a list of all unique terms rather than all occurrences of each term, the next step was to decide how to combine occurrences into terms. To make an informed decision, scenarios were tested with six types of variants. To avoid an overly fine-grained system that could be too sensitive to small tagging errors, the simple POS tagset (rather than the more fine-grained standard set) was used. The different variant forms, illustrated with the example term Co-morbidities, are listed below; a minimal sketch of how such variant keys can be derived follows the list.

1. Token_noPOS (token with original casing, without POS pattern), e.g., Co-morbidities
2. token_noPOS (lowercased token, without POS pattern), e.g., co-morbidities
3. Token_POS (token with original casing, with POS pattern), e.g., Co-morbidities(NN)
4. token_POS (lowercased token, with POS pattern), e.g., co-morbidities(NN)
5. lemma_POS (lowercased lemma, with POS pattern), e.g., co-morbidity(NN)
6. normalised_noPOS (normalised4 token, without POS pattern), e.g., comorbidities
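Assuming each candidate term occurrence is available with its tokens, lemmas, and simple POS tags, the sketch below shows how the six variant keys could be computed; the function and the "+"-joined notation for multi-token POS patterns are illustrative assumptions, not HAMLET's actual implementation.

```python
import re

def variant_forms(tokens, lemmas, simple_pos):
    """Return the six variant keys for one candidate term occurrence.

    `tokens`, `lemmas`, and `simple_pos` are parallel lists, e.g.,
    (["Co-morbidities"], ["co-morbidity"], ["NN"]). Illustrative only.
    """
    token = " ".join(tokens)
    lemma = " ".join(lemmas).lower()
    pos = "+".join(simple_pos)  # assumed notation for multi-token patterns
    # normalisation as described in footnote 4: keep only [a-z][0-9], else UNK
    normalised = re.sub(r"[^a-z0-9]", "", token.lower()) or "UNK"
    return {
        "Token_noPOS": token,                    # Co-morbidities
        "token_noPOS": token.lower(),            # co-morbidities
        "Token_POS": f"{token}({pos})",          # Co-morbidities(NN)
        "token_POS": f"{token.lower()}({pos})",  # co-morbidities(NN)
        "lemma_POS": f"{lemma}({pos})",          # co-morbidity(NN)
        "normalised_noPOS": normalised,          # comorbidities
    }
```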

Table 12 total number of (unique) annotations per variant and annotation consistency when using this variant, i.e., for each term/Named Entity annotated in the specified variant form, the percentage of occurrences of that string in the text that are annotated

variant        # annotations   % of occurrences annotated (average)
Token_noPOS    21,270          82.1%

On the one hand, it was important to know how consistently each form was annotated, i.e., out of all occurrences of that variant form in the corpus, how often it was annotated. While a small margin of inconsistency will always remain, to account both for human error in the manually annotated data and for terms which may only be valid terms in some contexts, the goal was to limit this inconsistency.

On the other hand, since low-frequency terms are notoriously difficult for ATE, variants that capture more annotations under a single entry (e.g., combining identical terms with different capitalisation) could also be beneficial, since they reduce the total number of unique terms (in that variant) and the number of very rare terms. Table 12 shows the average consistency and total number of gold standard terms per variant.

4 Normalised in this case meant converting all tokens to only [a-z][0-9] characters, unless this meant no characters were left, in which case UNK was used as a placeholder.

As can be seen, there are three variants that lead to over 90% consistent annotations on average: Token_POS, token_POS, and lemma_POS. Since the token_POS variant considerably reduces the number of different annotations compared to Token_POS, while at the same time scoring very high on consistency, this is the variant that will be used for all experiments. Nevertheless, the same methodology can be applied with the other variants as well.

After linguistically preprocessing all texts, mapping all POS tags to a shared, language-independent set, and deciding to work with the token_POS variant, a preliminary list of unique candidate terms was extracted. HAMLET follows the traditional hybrid method for ATE and selects candidate terms based on POS patterns. However, contrary to traditional methods, these patterns are not manually defined, but extracted from the training data. This means that no restrictions had to be predefined with respect to term length, frequency, or POS type: all of this information is derived from the training data. Since the POS patterns are derived from automatically tagged data, wrongly tagged patterns will be included as well. While this may lead to more noise among the candidate terms, it could also benefit recall when similar tagging errors are made in the test data.
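A minimal sketch of this data-driven candidate selection is given below, assuming the training annotations are available as POS-tag sequences and the corpus as POS-tagged sentences; all names and data structures are illustrative, not HAMLET's actual implementation.

```python
from collections import Counter

def learn_pos_patterns(annotated_terms):
    """Collect all POS patterns observed on gold-standard annotations.

    `annotated_terms` is an iterable of POS-tag tuples, one per annotated
    term occurrence in the training data, e.g., ("NOUN", "NOUN").
    """
    return Counter(tuple(pattern) for pattern in annotated_terms)

def extract_candidates(sentences, patterns, max_len=None):
    """Select candidate terms: every n-gram whose POS sequence matches a
    pattern seen in the training data (no predefined length or POS limits)."""
    if max_len is None:
        max_len = max(len(p) for p in patterns)   # longest pattern in training data
    candidates = set()
    for tokens, pos_tags in sentences:            # parallel lists per sentence
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                if tuple(pos_tags[i:j]) in patterns:
                    term = " ".join(tokens[i:j]).lower()
                    # stored as the token_POS variant described above
                    candidates.add(f"{term}({'+'.join(pos_tags[i:j])})")
    return candidates
```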

Table 13 term and Named Entity POS patterns in the ACTER corpus

corpora        # patterns (incl. NEs)   # patterns (excl. NEs)   % with (proper) noun (incl. NEs)
English        436                      331                      61%
French         353                      277                      87%
Dutch          283                      202                      79%
all combined   863                      638                      88%

Table 13 shows how many different POS patterns are found per language, both including and excluding Named Entities, and how many of those contain at least one noun or proper noun. This is all based on the standard POS tagset of 26 tags. Since Dutch has many more single-word compound terms, it is not surprising that there are fewer different POS patterns in this language. Despite some similarities between terms and Named Entities, it is also clear that Named Entities follow at least some different POS patterns, given the substantial difference in the number of patterns when including and excluding Named Entities. In accordance with the accepted knowledge about terms, most of them do contain at least one noun or proper noun, but certainly not all do, especially in English.


Of course, some of these may be due to tagging mistakes, but it does emphasise that methodologies limited to the extraction of only nouns and/or noun phrases are bound to miss relevant terms. Another striking observation from this table is just how different the POS patterns are between the languages. Of course, it is not surprising that POS patterns are language-dependent, but the limited overlap of patterns between languages is still remarkable. Looking at POS patterns including Named Entities, 317 patterns are only found in English, 243 only in French, and 151 only in Dutch; only 57 patterns are found in all three languages. However, this finding does need to be nuanced, because, when considering the total number of annotations with these patterns, it turns out that these 57 patterns cover 94% of all annotations. So, the number of unique patterns can be misleading due to the many very rare ones. For instance, of the 863 unique patterns in all corpora, 581 occur only three times or less. Some of these may also be due to tagging errors.

Similarly, there are considerable differences between the four domains. To illustrate this, consider the English POS patterns (including those of Named Entities). Out of 436 unique patterns, only 26 are present in all four domains; 58 are unique to the domain of corruption, 48 only occur in the dressage corpus, 177 are unique to heart failure, and another 49 only occur in the wind energy corpus. The big difference for heart failure is probably due to the fact that there are more annotations in that domain in general. Again, however, it must be emphasised that these few shared POS patterns (26) cover 95% of all individual annotations. In conclusion, these findings show how POS patterns are very language- and domain-dependent, with many possible unique patterns, but that a small portion of these patterns covers the vast majority of all individual annotations.

To select candidate terms based on these POS patterns, it may therefore be beneficial to only use POS patterns that occur in the training data above a minimum frequency threshold, as this would reduce the amount of noise. However, these candidate terms are the basis for all further processing: any candidate term not extracted in this first selection step cannot be retrieved later, so it was decided to focus on recall over precision at this stage.

Therefore, all POS patterns from the training data are included, without any frequency threshold, so as not to lose (m)any valid terms. In the future, practical implementations of the tool may resort to a POS pattern frequency threshold, but since the current contribution focuses on the machine learning classifier, recall was prioritised for the initial selection of candidate terms.

Nevertheless, this decision does come at a cost to precision. When extracting candidate terms (including Named Entities) based on POS patterns, only the POS patterns from the same language are included. Depending on the setup (see section 4.4.2), POS patterns from the test corpus are either included or excluded. As the previous analysis showed, some POS patterns only occur in a single corpus, so this will influence results. When the POS patterns of the corpus itself are included, precision of POS-based candidate term selection (averaged over all corpora) is only 5.9%, but recall is perfect. In the different corpora, precision ranges between 3.2% and 8.4%. When POS patterns from the test corpus are excluded and only patterns from the other domains are used, precision is similar at 6.5% on average (between 3.1% and 11.2% per corpus), but recall is no longer perfect and ranges between 91.1% and 98.4% (95% on average). While the loss in recall is limited, this stresses the impact of both volume and relevance of training data.

In conclusion, all corpora were linguistically preprocessed and the language-dependent POS tags were mapped to a shared set to allow interlingual comparisons and multilingual models. Individual occurrences of tokens and candidate terms were combined only if they had the same lowercased form and the same POS. An analysis of candidate term selection based on POS patterns showed that the patterns are very language- and domain-dependent, but that a limited subset of all patterns captures most candidate terms, since many POS patterns are very rare. While this means a frequency threshold for POS patterns could benefit ATE performance by reducing noise among the candidate terms, the focus of this research is on the classifier, so it was decided to aim for the highest possible recall, at the cost of precision, and include all POS patterns from the training data.

4.4.1.2 Features

For each of the extracted candidate terms, 177 features were calculated. Most of these were based on the typical information used for hybrid ATE, such as termhood and unithood. Since previous research has repeatedly shown that no single feature appears to work best for all cases (Loukachevitch, 2012), we investigated different feature combinations. Some of the features have not (often) been used for ATE and were based on findings during the annotation process. For instance, especially in the medical domain, terms often occur both in full and in an abbreviated version. In such cases, they are introduced in the full form, followed by the abbreviation between parentheses, e.g., heart failure (HF). Therefore, features were added to indicate whether a candidate term occurs in the vicinity of parentheses. All features were divided into six groups and 18 subgroups.
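As an illustration of such a feature, the sketch below checks whether a candidate term occurs between parentheses or directly next to one; the exact definition of "vicinity" used in HAMLET may differ, so this is a simplified assumption.

```python
def parenthesis_features(sentence_tokens, start, end):
    """Contextual features for a candidate occupying tokens [start:end).

    Returns whether the candidate is enclosed in parentheses and whether it
    is immediately followed or preceded by one, as in "heart failure (HF)".
    Illustrative sketch only.
    """
    before = sentence_tokens[start - 1] if start > 0 else ""
    after = sentence_tokens[end] if end < len(sentence_tokens) else ""
    return {
        "between_parentheses": before == "(" and after == ")",   # e.g., "HF"
        "before_parenthesis": after == "(",                      # e.g., "heart failure"
        "after_parenthesis": before == ")",
    }
```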

A summary per subgroup is given below in Table 14 and a complete overview is included in the appendix.

A few of the features rely not only on the domain-specific corpora, but also on general language reference corpora. Two separate types of reference corpora were used per language: a Wikipedia reference corpus, based on Wikipedia dumps, and a newspaper reference corpus5. All reference corpora were limited to 10M tokens and artificially split into 5000 documents. Features that make use of reference corpora are always calculated twice, i.e., once for each type of reference corpus.

5 The newspaper reference corpora per language were: the English News on Web corpus (Davies, 2017), the French Gigaword corpus (Graff et al., 2011), and the news-related subcorpora of the Dutch openSONAR (Oostdijk et al., 2013).


Non-numeric features are converted to (one-hot) vectors. A non-trivial task was finding a way to encode the POS pattern into informative features without having to add a separate feature for each of the 300+ possible patterns. Based on preliminary experiments, we decided to work with three vector representations: two one-hot vectors over all POS tags (not patterns), representing the POS of the first and the last token, and one frequency vector for the tags of all tokens of the candidate term. For instance, a term like heart failure (noun+noun) would get three vectors spanning all POS tags, with zeros everywhere except for the noun position in the first vector (1), the noun position in the second vector (1), and the sum of all nouns in the last one (2).
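The sketch below illustrates this three-vector encoding, assuming a fixed, ordered tag inventory; it is a simplified reconstruction of the described representation, not the exact feature code.

```python
import numpy as np

def encode_pos_pattern(pattern, tag_inventory):
    """Encode a POS pattern as two one-hot vectors (first and last tag)
    plus one frequency vector over all tags in the candidate term."""
    index = {tag: i for i, tag in enumerate(tag_inventory)}
    first = np.zeros(len(tag_inventory))
    first[index[pattern[0]]] = 1          # one-hot for the first token's tag
    last = np.zeros(len(tag_inventory))
    last[index[pattern[-1]]] = 1          # one-hot for the last token's tag
    freq = np.zeros(len(tag_inventory))
    for tag in pattern:                   # tag counts over the whole pattern
        freq[index[tag]] += 1
    return np.concatenate([first, last, freq])

# e.g., "heart failure" (NOUN NOUN): 1 for NOUN in the first and last vectors,
# and 2 for NOUN in the frequency vector.
```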

Table 14 description of features per group and subgroup

Shape features (SHAP)
length: number of characters & number of tokens
alphanumeric: whether the candidate term is alphabetic, numeric, alphanumeric, etc. & the number of digits and non-alphabetic characters
capitalisation: out of all occurrences of the candidate term, how often (%) it is all lowercase, all uppercase, title case, etc.

Linguistic features (LING)
first POS: POS tag of the first token of the candidate term (simple & standard POS)
last POS: POS tag of the last token of the candidate term (simple & standard POS)
freq. POS: how frequently each POS tag (simple & standard) occurs within the candidate term
NER: whether the candidate term was tagged (completely, partially, etc.) as a Named Entity during preprocessing
chunk: which chunk tag(s) were assigned to the candidate term during preprocessing
stopword: whether the candidate term contains a stopword or is a stopword6

6 The ISO stopwords were used for all languages: https://github.com/stopwords-iso

Frequency features (FREQ)
spec. freq.: relative (document) frequency in the specialised corpus
ref. freq.: relative (document) frequency in the news and Wikipedia reference corpora

Statistical features (STAT)
stats without ref.: metrics to calculate termhood/unithood without comparing to a reference corpus: C-Value (Barrón-Cedeño et al., 2009), TF-IDF (Astrakhantsev et al., 2015), Lexical Cohesion and Basic (Bordea et al., 2013)
stats with ref. (basic): metrics to calculate termhood/unithood by comparing frequencies to a reference corpus: Domain Pertinence (Meijer et al., 2014), Domain Relevance (Bordea et al., 2013), Weirdness (Astrakhantsev et al., 2015), Relevance (Peñas et al., 2001), Log-Likelihood Ratio (Macken et al., 2013)
stats with ref. (advanced): similar to the basic termhood/unithood measures, but using not only the frequencies of the entire candidate term in the reference corpora, but also those of all separate tokens that make up the candidate term: Vintar's termhood measure (S. Vintar, 2010), Domain Specificity (Kozakov et al., 2004)

Contextual features (CTXT)
parentheses: candidate term occurs between parentheses or right before/after parentheses

Variational features (VARI)
var. numbers: number of different variations of the candidate term for each variant type (types explained in section 4.4.1.1), and the combined relative frequency in the domain-specific corpus of the candidate term in each variation per variant
var. stats: sum of the Domain Specificity and Vintar termhood scores of all different variations of the candidate term for each variant type
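As one concrete illustration of the termhood measures in the STAT group, Weirdness compares a candidate term's relative frequency in the domain-specific corpus to its relative frequency in a reference corpus. The formulation below is the commonly cited one; the exact variant implemented in HAMLET may differ slightly.

$$
\mathrm{Weirdness}(t) = \frac{f_{\mathrm{dom}}(t) / N_{\mathrm{dom}}}{f_{\mathrm{ref}}(t) / N_{\mathrm{ref}}}
$$

where $f_{\mathrm{dom}}(t)$ and $f_{\mathrm{ref}}(t)$ are the frequencies of candidate term $t$ in the domain-specific and reference corpora, and $N_{\mathrm{dom}}$ and $N_{\mathrm{ref}}$ are the respective corpus sizes in tokens.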

Before training, all statistical features (including those among the variational features) were scaled using scikit-learn's (Pedregosa et al., 2011) RobustScaler, which is robust towards outliers. Features without any variance were automatically removed; this mostly concerned the POS-related features, since not all POS tags can occur in first/last position. Out of the 177 possible features, 150-160 usually remained (depending on the setup).
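A minimal sketch of this scaling and feature-pruning step with scikit-learn is shown below; the way the two steps are chained and the placeholder data are assumptions for illustration (in HAMLET, only the statistical features are scaled).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import VarianceThreshold

# Placeholder data: one row per candidate term, one column per feature.
X_train = np.random.rand(1000, 177)

scaler = RobustScaler()            # scaling robust to outliers (median and IQR)
selector = VarianceThreshold(0.0)  # drop features without any variance

X_scaled = scaler.fit_transform(X_train)
X_reduced = selector.fit_transform(X_scaled)
print(X_reduced.shape)             # typically 150-160 features remain
```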

4.4.1.3 Algorithm, Evaluation, and Optimisation

Evaluation and optimisation of the models were based on f1-scores (the harmonic mean of precision and recall). In this context, precision is defined as the percentage of true terms among all extracted candidate terms (number of true positives divided by the number of extracted terms), and recall as the percentage of all true terms that have been extracted (number of true positives divided by the number of gold standard terms). Evaluation is strict, in the sense that only exact matches are counted as correct. Relatively low scores were expected due to the inherent difficulty of the ACTER dataset (no minimum or maximum term length, no minimum frequency, no limitations on POS patterns, inclusion of nested terms). Preliminary experiments were performed to choose the best algorithm for this task. Since there is no way to predict the best algorithm for a specific task and dataset beforehand (no free lunch theorem (Wolpert, 1996)), a relatively wide range of classifiers was tested. With scikit-learn (Pedregosa et al., 2011), the decision tree classifier, random forest classifier (RFC), multi-layer perceptron, and logistic regression were compared, all with hyperparameter optimisation. In these preliminary experiments, the best average f1-scores were obtained with the random forest classifier, followed by the decision tree classifier (-5.6 percentage points), logistic regression (-19.4), and the multi-layer perceptron (-20.5). All experiments reported in the current contribution were, therefore, performed with scikit-learn's random forest classifier. Hyperparameter optimisation was performed through grid search with five folds. Whenever k-fold cross-validation was used, five folds were used, with nested hyperparameter optimisation.
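The sketch below shows how this evaluation and tuning setup could look with scikit-learn: strict precision/recall/f1 over unique extracted terms, and a random forest classifier tuned via grid search with five folds. The hyperparameter grid values are placeholders, not the grid actually used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def strict_f1(extracted, gold):
    """Strict evaluation: only exact matches between the extracted and
    gold-standard term lists count as true positives."""
    tp = len(set(extracted) & set(gold))
    precision = tp / len(set(extracted)) if extracted else 0.0
    recall = tp / len(set(gold)) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical hyperparameter grid for illustration only.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # optimisation based on f1-score
    cv=5,           # grid search with five folds
)
# search.fit(X_train, y_train)  # y: 1 for true terms, 0 for other candidates
```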
