Datamart aggregation for Association Discovery

Chapter 6. Can we optimize medical prophylaxis tests?

6.3 Sourcing and evaluating data

6.3.2 Datamart aggregation for Association Discovery

If we look for medical variables we only have a view from the patient side. These records allow an analysis of patients independently but not an analysis of the test values in a transactional form.

Therefore, it may be interesting to have a transactional view as it occurs in the diagnoses data. Instead of applying techniques to unique patient records, we also could apply Association Discovery to confirm our results or to find potentially new information. The solution to this idea can be realized by aggregating the medical records to a new datamart. Practically speaking, this can be done by using the pivot technique of the IM for Data.

The following steps give one possible way to aggregate the new datamart.

򐂰 The new datamart consists of only two columns. The first column contains the patient ID (the Transaction Group). The second column contains the test result for a specific variable (the Transaction Item).

򐂰 Because the new datamart will have a transactional view we have to prepare the second column by adding the variable name to the value.

򐂰 To avoid too many values for a variable, for example, we probably have more than 100 different values for only one variable, such as for AGE. We therefore need to find a way to reduce the number of values.

򐂰 The method we use is discretization which means a transfer of numerical to categorical values. This was done as following:

– We calculate the average value of each variable as well as the standard deviation value.

– By applying these calculated numbers we get three intervals, namely:

• All values that are lower than the average minus standard deviation

• All values that are within the standard deviation

• All values that are higher than the average plus standard deviation – We then have to add new values as shown in Figure 6-4 (where ID

represents the patient_id): a value is then “normal” if it occurs within the standard deviation range.

Figure 6-4 Aggregated datamart calculated by pivot of the medical records

Practically speaking, a small program (written in the programming language awk) and a preprocessing function of the IM for Data (Figure 6-5) was used to perform this datamart creation.

򐂰 The program operates on flat files. It concatenates the variable name to this value. The concatenated variable then becomes automatically categorical.

򐂰 The preprocessing function of the IM for Data can then be used to build the datamart (Figure 6-5).

Figure 6-5 shows the preprocessing function of the IM for Data. All the variables (those with suffix “_D“) that are in the selected list box (“add to pivot”) are categorical.

Figure 6-5 Pivoting the data by applying the preprocessing function of the IM for Data

6.4 Choosing the mining technique

Choosing the mining techniques to use is the fifth stage in our generic mining method.

The characterization of the underlying medical datamart is done by several algorithms of the IM for Data. As it was already shown in chapter 6.3, “Sourcing and evaluating data” on page 115 the bivariate statistics represent a suited way to describe patients with and without diabetes mellitus. However, they do not deliver predictive rules but only a semantic description of the visualized data distribution.

We therefore turn our attention to predictive modeling methods, for example, decision trees and Radial Basis Functions, which allow the generation of predictive rules that can be applied to new patients by using the corresponding application mode. Additionally, the models deliver some kind of information about the characteristics of the predicted classes.

It is difficult to calculate good prediction models, because variables may be correlated or medical knowledge is not present. Before you start building predictive modeling, we recommend that you test the underlying variables for their relevance. This can be done using decision trees as a preprocessing technique, because decision trees try to use variables that split the datamart each time. It is important to know which variables occur on top of the tree: those that are not in the tree or are far away from the root of the tree may be negligible.

For the generation of a new prophylactic test this may be very interesting to know, because medical tests can then be ordered according to their relevance;

and, some time could be saved by getting relevant information at the right time.

Another aspect is quality: depending on the importance of each variable a weighting of a variable may be suited. As described above, the concentration of blood plasma is surely a main contributor in deciding whether a patient has diabetes mellitus or not. From the analytical standpoint it may be interesting to see how the quality of the model behaves if we reduce or increase the weight of this variable.

In general a deeper understanding of the variables’ relevance is not really necessary, because only eight variables are present. However, if a larger number of records with a higher variable concentration is available, this may be a good strategy to get an impression of the variables’ importance.

An alternative way of getting expressible rules is the computation of associative rules by Association Discovery. Using the aggregated datamart that was based on the medical records, we then get new information that can describe the characteristics of negative and positive test results.

6.5 Interpreting the results

The sixth stage in our generic mining method is to interpret the results that we have obtained and determine how we can map them to our business.

This section contains evaluation results by applying two techniques of the IM for Data:

򐂰 Classification tree

򐂰 Radial Basis Functions

Both techniques are predictive techniques that can build up predictive models based on historical datamarts.

Dans le document Mining Your Own Business in Health Care (Page 131-135)