BAG OF WORDS - John Fulcher, University of Wollongong, Australia

Given a set of m documents D = [d₁, d₂,..., d_m], it is natural to represent them in a vector space. From this set of documents it is simple to extract the vocabulary used. To make the vocabulary meaningful, we apply stemming, which treats words with the same stem as one word. For example, the words “representation” and “represent” are considered one word, rather than two distinct words, as they have the same stem.

Secondly, so that the vocabulary is useful for distinguishing documents, we eliminate common words, like “the”, “a”, and “is”, from it. After these two steps, we have a vocabulary w₁, w₂,..., w_n which represents the words used in the set of documents D. Each document d_i can then be represented as an n-vector whose j-th element is the frequency of occurrence of the word w_j in d_i, and 0 if w_j does not occur in d_i. Thus, from a representation point of view, the set of documents D can be equivalently represented by a set of vectors V = [v₁, v₂,..., v_m], where v_i is an n-vector. Note that the vectors in V may be sparse, as not every word in the vocabulary occurs in every document (Nigam et al., 2000). The set of vectors V can then be grouped into clusters using standard techniques (Duda, 2001).
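The two preprocessing steps and the resulting count vectors can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the stop-word list beyond “the”, “a”, and “is”, the toy suffix-stripping function `crude_stem`, and the two sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed stop-word list; the text names only "the", "a", and "is".
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}

def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, keep alphabetic runs, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    token_lists = [tokenize(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # A Counter yields 0 for words that do not occur in the document.
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors

# Hypothetical documents, for illustration only.
docs = [
    "The representation of a document is a vector",
    "Vectors represent documents; sparse vectors are common",
]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

With a vocabulary of several thousand words and short descriptions, most entries of each vector are zero, which is why a sparse storage format would normally be preferred over the dense lists used here.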

Table 1. An overview of the seven categories in the MBS

Category    Number of items
1           158
2           108
3           2734
4           162
5           504
6           302
7           62
Total       4030

In our case, we consider each description of an MBS item as a document. We have a total of 4030 documents; each document may be of varying length, dependent on the description of the particular medical procedure. Table 1 gives a summary of the number of documents in each category.

After removing commonly occurring words, merging words that share the same stem, and so on, we find that there are a total of 4569 distinct words in the vocabulary.

We will use 50% of the total number of items as the training data set, while the other 50% will be used as a testing data set to evaluate the generalisability of the techniques used. In other words, we have 2015 documents in the training data set, and 2015 in the testing data set. The content of the training data set is obtained by randomly choosing items from a particular group so as to ensure that the training data set is sufficiently rich and representative of the underlying data set.
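One way to obtain such a split is to draw half of each category at random, as sketched below. Per-category halving is an assumption: the text says only that items are chosen randomly from a group. It is, however, consistent with the 81 category-4 and 54 category-2 test items mentioned later. The function name `stratified_half_split` and the fixed seed are illustrative.

```python
import random
from collections import defaultdict

def stratified_half_split(labels, seed=0):
    """Return (train_idx, test_idx), drawing half of each category at random."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, label in enumerate(labels):
        by_category[label].append(index)
    train, test = [], []
    for label in sorted(by_category):
        indices = by_category[label]
        rng.shuffle(indices)
        half = len(indices) // 2
        train.extend(indices[:half])
        test.extend(indices[half:])
    return sorted(train), sorted(test)

# Category sizes taken from Table 1; the labels stand in for the real items.
category_sizes = {1: 158, 2: 108, 3: 2734, 4: 162, 5: 504, 6: 302, 7: 62}
labels = [c for c, n in category_sizes.items() for _ in range(n)]
train_idx, test_idx = stratified_half_split(labels)
print(len(train_idx), len(test_idx))
```

Because every category size in Table 1 is even, this yields exactly 2015 training and 2015 testing items.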

Once we represent the data in this manner, we can classify the documents using a simple technique, such as the naïve Bayes classification method (Duda, 2001). The results of this classification are shown in Table 2.
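A naïve Bayes classifier over word-count vectors can be sketched as below. The text does not say which variant was used; the multinomial model with add-one smoothing is an assumption, and the four-word vocabulary and training vectors are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(vectors, labels):
    """Fit log priors and smoothed log word likelihoods per category."""
    n_words = len(vectors[0])
    class_docs = Counter(labels)
    word_totals = defaultdict(lambda: [0] * n_words)
    for vector, label in zip(vectors, labels):
        totals = word_totals[label]
        for j, count in enumerate(vector):
            totals[j] += count
    model = {}
    for label, totals in word_totals.items():
        denom = sum(totals) + n_words  # add-one (Laplace) smoothing
        log_prior = math.log(class_docs[label] / len(labels))
        log_likelihood = [math.log((t + 1) / denom) for t in totals]
        model[label] = (log_prior, log_likelihood)
    return model

def classify(model, vector):
    """Return the category with the highest posterior log score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(c * ll for c, ll in zip(vector, log_likelihood))
    return max(model, key=score)

# Hypothetical vocabulary: [attendance, patient, repair, wound]
train_vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 1, 2, 1]]
train_labels = [1, 1, 3, 3]
model = train_naive_bayes(train_vectors, train_labels)
print(classify(model, [0, 0, 1, 1]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.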

The percentage accuracy is, on average, 91.61%, with 1846 documents out of 2015 correctly classified. It is further noted that some of the categories are badly classified, for example, category-2 and category-4. Indeed, it is found that 62 out of 81 category-4 items are misclassified as category-3. Similarly, 12 out of 54 category-2 items are misclassified as category-5 items.
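The figures quoted above follow from simple ratios over the confusion counts. The sketch below builds a confusion count from hypothetical labels and then rechecks the reported percentages; only the numbers 1846, 2015, 62, 81, 12, and 54 come from the text, and the function names are illustrative.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs; pairs with equal entries are correct."""
    return Counter(zip(actual, predicted))

def percent_accuracy(counts):
    """Share of correctly classified items, as a percentage."""
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return 100 * correct / sum(counts.values())

# Hypothetical labels, only to exercise the two functions.
actual = [3, 3, 4, 4, 4, 1]
predicted = [3, 3, 3, 3, 4, 1]
toy = confusion_counts(actual, predicted)

# Figures reported in the text.
overall = 100 * 1846 / 2015     # overall accuracy on the testing data set
cat4_as_cat3 = 100 * 62 / 81    # category-4 items labelled as category-3
cat2_as_cat5 = 100 * 12 / 54    # category-2 items labelled as category-5
print(f"{overall:.2f} {cat4_as_cat3:.1f} {cat2_as_cat5:.1f}")
```

Since 62 of the 81 category-4 test items go to category-3, at most 19 of them (about 23%) can have been classified correctly, even though the overall accuracy exceeds 91%.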

This result indicates that the hypothesis is not valid; there are ambiguities in the description of the items in each category, apart from category-1, which could be confused with those in other categories. In particular, there is a high risk of confusing those items in category-4 with those in category-3.

A close examination of the list of the 62 category-4 items which are misclassified as category-3 items by the naïve Bayes classification method indicates that they are indeed very similar to those in category-3. For simplicity, when we say items in category-3, we mean that those items are also correctly classified into category-3 by the classification method. Tables 3 and 4 give an illustration of the misclassified items. It is noted that misclassified items 52000, 52003, 52006, and 52009 in Table 3 are very similar to the category-3 items listed in Table 4.

Table 2. A confusion table showing the classification of documents (the actual classifications as indicated in the MBS are given horizontally; classifications as obtained by the naïve Bayes method are presented vertically)

Category 1 2 3 4 5 6 7 Total % Accuracy

Table 3. Category-4 Items 52000, 52003, 52006, and 52009 misclassified by the naïve Bayes method as Category-3 items

Item No    Item Description

52000    Skin and subcutaneous tissue or mucous membrane, repair of recent wound of, on face or neck, small (not more than 7 cm long), superficial

52003    Skin and subcutaneous tissue or mucous membrane, repair of recent wound of, on face or neck, small (not more than 7 cm long), involving deeper tissue

52006    Skin and subcutaneous tissue or mucous membrane, repair of recent wound of, on face or neck, large (more than 7 cm long), superficial

52009    Skin and subcutaneous tissue or mucous membrane, repair of recent wound of, on face or neck, large (more than 7 cm long), involving deeper tissue

Table 4. Some items in Category 3 which are similar to items 52000, 52003, 52006, and 52009

Item No    Item Description

30026    Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, not on face or neck, small (not more than 7cm long), superficial, not being a service to which another item in Group T4 applies

30035    Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, on face or neck, small (not more than 7cm long), involving deeper tissue

30038    Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, not on face or neck, large (more than 7cm long), superficial, not being a service to which another item in Group T4 applies

30041    Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, not on face or neck, large (more than 7cm long), involving deeper tissue, not being a service to which another item in Group T4 applies

30045    Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, on face or neck, large (more than 7cm long), superficial

30048    Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, on face or neck, large (more than 7cm long), involving deeper tissue

It is observed that the way items 5200X are described is very similar to those represented in items 300YY. For example, item 52000 describes a medical procedure to repair small superficial cuts on the face or neck. On the other hand, item 30026 describes the same medical procedure except that it indicates that the wounds are not on the face or neck, with the distinguishing feature that this is not a service to which another item in Group T4 applies. It is noted that the description of item 30026 uses the word “not” to distinguish this from that of item 52000, as well as appending an extra phrase “not being a service to which another item in Group T4 applies”. From a vector space point of view, the vector representing item 52000 is very close³ to item 30026, closer than other items in category-4, due to the few extra distinguishing words between the two. Hence, item 52000 is classified as “one” in category-3, instead of “one” in category-4. Similar observations can be made for other items shown in Table 3, when compared to those shown in Table 4.
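The closeness of two descriptions can be made concrete with a similarity measure on their word-count vectors. The sketch below uses cosine similarity on raw counts, without stemming or stop-word removal; the choice of cosine is an assumption, since the text does not name the measure it has in mind. The descriptions of items 52000 and 30026 are taken from Tables 3 and 4, and item 3 from Table 5 serves only as a contrast from another category.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count alphabetic tokens, ignoring case, digits, and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

item_52000 = word_counts(
    "Skin and subcutaneous tissue or mucous membrane, repair of recent wound "
    "of, on face or neck, small (not more than 7 cm long), superficial"
)
item_30026 = word_counts(
    "Skin and subcutaneous tissue or mucous membrane, repair of wound of, "
    "other than wound closure at time of surgery, not on face or neck, small "
    "(not more than 7cm long), superficial, not being a service to which "
    "another item in Group T4 applies"
)
item_3 = word_counts(
    "Professional attendance at consulting rooms (not being a service to "
    "which any other item applies) by a general practitioner for an obvious "
    "problem characterised by the straightforward nature of the task that "
    "requires a short patient history and, if required, limited examination "
    "and management -- each attendance"
)

sim_cat3 = cosine_similarity(item_52000, item_30026)
sim_cat1 = cosine_similarity(item_52000, item_3)
print(sim_cat3 > sim_cat1)
```

In a bag-of-words representation the single word “not” shifts the vector only slightly, although it reverses the meaning of the description, which is the kind of ambiguity the misclassifications expose.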

Table 5. Some correctly classified Category-1 items

Item No Item description

3 Professional attendance at consulting rooms (not being a service to which any other item applies) by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- each attendance

4 Professional attendance, other than a service to which any other item applies, and not being an attendance at consulting rooms, an institution, a hospital, or a nursing home by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- an attendance on 1 or more patients on 1 occasion -- each patient

13 Professional attendance at an institution (not being a service to which any other item applies) by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- an attendance on 1 or more patients at 1 institution on 1 occasion -- each patient

19 Professional attendance at a hospital (not being a service to which any other item applies) by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- an attendance on 1 or more patients at 1 hospital on 1 occasion -- each patient

20 Professional attendance (not being a service to which any other item applies) at a nursing home including aged persons' accommodation attached to a nursing home or aged persons' accommodation situated within a complex that includes a nursing home (other than a professional attendance at a self contained unit) or professional attendance at consulting rooms situated within such a complex where the patient is accommodated in a nursing home or aged persons' accommodation (not being accommodation in a self contained unit) by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- an attendance on 1 or more patients at 1 nursing home on 1 occasion -- each patient

On the other hand, Tables 5 and 6 show items which are correctly classified in category-1 and category-5 respectively. It is observed that the items shown in Table 5 are distinct from those shown in Table 6 in their descriptions. A careful examination of the correctly-classified category-1 items, together with a comparison of their descriptions with those of the correctly-classified category-5 items, confirms the observations shown in Tables 5 and 6. In other words, the vectors representing correctly-classified category-1 items are closer to other vectors in the same category than to vectors representing other categories.

SUPPORT VECTOR MACHINE AND
