Building the PIBOSO Elementary Rules in ARDAKE

CHAPTER 8 Rules Development

8.2 Building the PIBOSO Elementary Rules in ARDAKE

Based on the visual and non-visual analysis described earlier, we created a number of elementary rules to identify Population and Intervention annotations in the PIBOSO corpus. For example, population sentences commonly mention an age or an age range, so we needed elementary annotation rules that match age and age range patterns. An age or age range pattern in a sentence indicates that it may be a population sentence. To increase the confidence level, the same sentence is then searched for other population-related patterns. This is done by building more elementary annotation rules and combining them to calculate the final confidence level, which decides whether or not the sentence is a population annotation.

Building annotation rules in ARDAKE is done by simple mouse drag/drop of predefined (built-in) or user-defined patterns, conditions, and actions. Figure 8.1 (A) shows an ARDAKE rule for matching age patterns. The AgeUnit pattern in the rule matches any age unit such as year, month, day, etc. The AgeKeyword pattern is either “old” or “of age”. The rule in Figure 8.1 (A) defines the age as a number (written in digits or in letters), followed by any characters (spaces or other), followed by an age unit, then any characters, and ending with an age keyword. This allows matching age patterns in different forms such as “3 weeks old”, “fifty three years of age”, etc.
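The age rule can be approximated outside ARDAKE with a regular expression. The following is only a minimal sketch, not the ARDAKE implementation: the number-word list, the unit list, and the 10-character limit on the "any characters" gaps are assumptions made for illustration, and `find_ages` is a hypothetical helper name.

```python
import re

# Illustrative word lists (assumptions, not ARDAKE's actual patterns).
NUMBER_WORD = (
    r"(?:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|"
    r"thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|"
    r"twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred)"
)
# A number is either digits or one or more number words ("fifty three").
NUMBER = rf"(?:\d+|{NUMBER_WORD}(?:[\s-]+{NUMBER_WORD})*)"
AGE_UNIT = r"(?:year|month|week|day)s?"
AGE_KEYWORD = r"(?:old|of\s+age)"

# number, short gap, age unit, short gap, age keyword.
# The gap is capped at 10 characters here (an assumption) to limit overmatching.
AGE_PATTERN = re.compile(
    rf"\b{NUMBER}\b.{{0,10}}?\b{AGE_UNIT}\b.{{0,10}}?\b{AGE_KEYWORD}\b",
    re.IGNORECASE,
)


def find_ages(sentence):
    """Return every age expression found in the sentence."""
    return [m.group(0) for m in AGE_PATTERN.finditer(sentence)]
```

For instance, `find_ages("Infants 3 weeks old were enrolled.")` finds “3 weeks old”, while a sentence that mentions a duration without an age keyword, such as “The study lasted 3 weeks.”, yields no match.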


Figure 8.1: Age Rule (A) and Age Indicator Rule (B) in ARDAKE

Figure 8.1 (B) shows another ARDAKE rule to match age indicators in sentences. This rule uses the “Mark fast” action, which matches any word from a selected word list. Our age indicator word list contains all variations of English age indicators such as “infant”, “toddler”, “baby”, “child”, “teenager”, etc., in their singular and plural forms.
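A word-list lookup of this kind could be sketched as follows. This is an assumption-laden illustration rather than the behaviour of the “Mark fast” action itself: the list holds only the five indicators named above, the plural handling is deliberately naive, and `build_word_list` and `mark_fast` are hypothetical names.

```python
import re

# The five indicators quoted in the text; the real list is larger.
AGE_INDICATORS = ["infant", "toddler", "baby", "child", "teenager"]
# Plurals that do not follow the simple "+s" rule (illustrative).
IRREGULAR_PLURALS = {"baby": "babies", "child": "children"}


def build_word_list(singulars, irregular):
    """Return a set holding each word in its singular and plural form."""
    words = set()
    for word in singulars:
        words.add(word)
        words.add(irregular.get(word, word + "s"))
    return words


WORD_LIST = build_word_list(AGE_INDICATORS, IRREGULAR_PLURALS)


def mark_fast(sentence, word_list):
    """Return the tokens of the sentence that appear in the word list."""
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    return [token for token in tokens if token in word_list]
```

Storing the list as a set keeps each lookup constant-time, which matters when the same list is applied to every token of a large corpus.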


Figure 8.2: Age Range Rules in ARDAKE

The rules in Figure 8.2 match age ranges in different forms including “between 30 and 40”, “between ten and eleven”, “under twenty”, “16 and over”, and so on.
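The example forms above can be covered with three regular expressions. The sketch below is reverse-engineered from those examples only; it does not claim to reproduce the three rules in Figure 8.2, and the number-word list and the `find_age_ranges` helper are assumptions.

```python
import re

# Illustrative number words (assumption); digits are also accepted.
NUMBER_WORD = (
    r"(?:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|"
    r"thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|"
    r"twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)"
)
NUMBER = rf"(?:\d+|{NUMBER_WORD})"

AGE_RANGE_PATTERNS = [
    # "between 30 and 40", "between ten and eleven"
    re.compile(rf"\bbetween\s+{NUMBER}\s+and\s+{NUMBER}\b", re.IGNORECASE),
    # "under twenty" ("over" is added here as an assumption)
    re.compile(rf"\b(?:under|over)\s+{NUMBER}\b", re.IGNORECASE),
    # "16 and over" ("and under" is added here as an assumption)
    re.compile(rf"\b{NUMBER}\s+and\s+(?:over|under)\b", re.IGNORECASE),
]


def find_age_ranges(sentence):
    """Return all age range expressions matched by any of the patterns."""
    matches = []
    for pattern in AGE_RANGE_PATTERNS:
        matches.extend(m.group(0) for m in pattern.finditer(sentence))
    return matches
```

Note that these patterns match any numeric range, not only ages; in a rule-based setting that ambiguity is what the later scoring step is meant to absorb.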

The ARDAKE “Mark fast” action can also be used to match positive and negative n-grams. To do so, save the positive and negative n-grams identified by the ARDAKE Corpus Analyser in text files, then create one “Mark fast” rule for each file by setting the “Word List” property of the action to point to that file.

Once this is done, the two resulting rules (one for positive n-grams and one for negative n-grams) can be combined into a new rule that marks any sentence containing one or more positive population n-grams and no negative population n-grams as a population candidate sentence, as shown in Figure 8.3.

Figure 8.3: Population Sentence Candidate Rule in ARDAKE
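The logic of the candidate rule (at least one positive n-gram, no negative n-gram) can be sketched as below. The n-grams are invented for illustration and are not output of the ARDAKE Corpus Analyser; in practice they would be read from the saved text files, one n-gram per line. All function names are hypothetical.

```python
import re


def parse_word_list(text):
    """Parse word-list content with one n-gram per line into a set."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


# Made-up n-grams standing in for the contents of the two word-list files.
POSITIVE_NGRAMS = parse_word_list("patients with\nwere recruited\nparticipants\n")
NEGATIVE_NGRAMS = parse_word_list("we conclude\nfurther research\n")


def contains_ngram(sentence, ngrams):
    """True if any n-gram occurs in the sentence on word boundaries."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    normalized = " " + " ".join(words) + " "
    return any(f" {ngram} " in normalized for ngram in ngrams)


def is_population_candidate(sentence, positive, negative):
    """One or more positive n-grams and no negative n-gram."""
    return contains_ngram(sentence, positive) and not contains_ngram(sentence, negative)
```

A sentence such as “Sixty patients with spinal cord injury were recruited.” qualifies under these toy lists, whereas one that also contains “we conclude” does not.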

PIBOSO semantic rules are as easy to create as the rules shown earlier in this section. It is common for population sentences to include the problem or the disease of interest. A generic ARDAKE annotation rule can be created to match any disease in the text being analysed. This can be done by simply using the “Subclass of” condition and setting the “Parent Concept” property to the top-most disease class in MeSH or a similar ontology. However, this can generate too many concepts to look for and can produce many false positives.

To obtain better results, we should be as specific as possible about the concepts (diseases or problems) to match in the text. Ideally, a dedicated ontology for the corpus being analysed should be used. However, since we could not find such an ontology for the PIBOSO corpus and developing one would be a tedious task, we decided to use the MeSH ontology and be more specific about the parent concepts to look for in population sentences. The main reason for choosing MeSH over other medical ontologies is that NLM uses MeSH to index the articles in the MEDLINE/PubMed database.

Reading some of the PIBOSO training population annotations shows that they are mostly about spinal cord issues and brain injuries. Figure 8.4 shows a rule that defines any occurrence of a sub-concept of “Spinal Cord Diseases” or “Brain Injuries” as a PIBOSO disease.

Figure 8.4: PIBOSO Diseases Annotation Rule in ARDAKE
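The effect of choosing narrow parent concepts can be illustrated with a toy hierarchy. The child-to-parent table below is a hand-made simplification loosely modelled on MeSH headings, not an extract of MeSH, and it assumes a single parent per concept and that only strict descendants count as sub-concepts; `is_subclass_of` and `piboso_diseases` are hypothetical names.

```python
# Toy child -> parent table (assumption: one parent per concept).
TOY_PARENTS = {
    "Spinal Cord Diseases": "Diseases",
    "Brain Injuries": "Diseases",
    "Asthma": "Diseases",
    "Myelitis": "Spinal Cord Diseases",
    "Spinal Cord Injuries": "Spinal Cord Diseases",
    "Brain Concussion": "Brain Injuries",
}

# The two parent concepts used by the rule in Figure 8.4.
PIBOSO_PARENT_CONCEPTS = {"Spinal Cord Diseases", "Brain Injuries"}


def is_subclass_of(concept, ancestors, parents=TOY_PARENTS):
    """Walk up the hierarchy; True if any ancestor is in `ancestors`."""
    current = parents.get(concept)
    while current is not None:
        if current in ancestors:
            return True
        current = parents.get(current)
    return False


def piboso_diseases(concepts):
    """Keep only concepts that fall under the PIBOSO parent concepts."""
    return [c for c in concepts if is_subclass_of(c, PIBOSO_PARENT_CONCEPTS)]
```

With the top-most class as parent, every entry including “Asthma” would match; restricting the parents to the two PIBOSO concepts keeps “Myelitis” and “Brain Concussion” and drops “Asthma”.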

Other population-related rules, such as the ones identified during the visual analysis, are trivial to create in ARDAKE. This is done by using the Position or Length conditions as shown in Figure 8.5.

The rule in Figure 8.5 covers frequent positions of population sentences in both structured and unstructured abstracts. For better results, the rule in Figure 8.5 could be replaced with two other rules: one for structured abstracts that looks for sentences at positions 2, 4, 5, 6, 7, 8, 9, and 10, and one for unstructured abstracts that looks for sentences at positions 1 to 6 inclusive. This is based on numbers and observations from Figure 7.8 and Figure 7.9 in the previous chapter.

Figure 8.5: ARDAKE Rule to Identify Sentences at Population Position
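The two-rule variant described above amounts to a membership test on the sentence position. The sketch below assumes positions are counted from 1 and that the caller already knows whether the abstract is structured; both points, and the function name, are assumptions.

```python
# Positions quoted in the text for each abstract type.
STRUCTURED_POSITIONS = frozenset({2, 4, 5, 6, 7, 8, 9, 10})
UNSTRUCTURED_POSITIONS = frozenset(range(1, 7))  # 1 to 6 inclusive


def at_population_position(position, structured):
    """True if a sentence at `position` (1-based, assumed) is at a
    position where population sentences are frequent."""
    allowed = STRUCTURED_POSITIONS if structured else UNSTRUCTURED_POSITIONS
    return position in allowed
```

Keeping the two position sets separate means each can be tuned independently if the distributions for structured and unstructured abstracts change.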

Once all elementary rules are created, new rules can be created to assign scores to the results of each elementary rule. This is done using the Mark Score action as in Figure 8.6.

Figure 8.6: Mark Score Action for Sentences Containing PIBOSO Disease Annotations

Figure 8.6 shows two sub-rules. The first annotates any sentence that contains a PIBOSO Disease pattern as SentenceWithPIBOSODisease. The second has no conditions, so it creates an annotation of type PopulationSentence with a score of 20 for every SentenceWithPIBOSODisease. Note that if a sentence was already annotated as a PopulationSentence by another sub-rule and also contains a PIBOSO_Disease annotation, the score of the existing PopulationSentence annotation is increased by 20.
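The accumulating behaviour of the score can be sketched with a dictionary keyed by sentence. Only the value 20 comes from the text; the sentence identifiers, the pre-existing score of 10, and the function names are hypothetical, and this is not a description of how ARDAKE stores annotations.

```python
DISEASE_SCORE = 20  # score assigned by the rule in Figure 8.6


def mark_score(scores, sentence_id, score):
    """Create the PopulationSentence score, or add to an existing one."""
    scores[sentence_id] = scores.get(sentence_id, 0) + score


def apply_disease_rule(sentences_with_disease, population_scores):
    """Add DISEASE_SCORE for every sentence holding a disease annotation."""
    for sentence_id in sentences_with_disease:
        mark_score(population_scores, sentence_id, DISEASE_SCORE)
    return population_scores
```

Starting from `{"s1": 10}` (a sentence already scored by some other sub-rule) and applying the disease rule to `["s1", "s2"]` gives `{"s1": 30, "s2": 20}`: the existing annotation is increased and a new one is created.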

From the document THESIS PROPOSAL PRESENTED TO UNIVERSITÉ DU QUÉBEC EN OUTAOUAIS IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF PH.D (pages 145-150)