
8. Active Learning

All the methods above considered a ‘passive’ use of unlabeled data. A different approach is known as active learning, in which an oracle is queried for the label of some of the unlabeled data. Such an approach increases the size of the labeled data set, reduces the classifier’s variance, and thus reduces the classification error. There are different ways to choose which unlabeled data to query. The straightforward approach is to choose a sample randomly.

Figure 7.5. The effect of varying the weight of the unlabeled data on the error for the Artificial data problem (a) and Shuttle database (b). The number of labeled and unlabeled records is specified in Table 7.3. Both panels plot the probability of error against the weight of the unlabeled data (λ).

This approach ensures that the data distribution p(C, X) is unchanged, a desirable property when estimating generative classifiers. However, the random sampling approach typically requires many more samples to achieve the same performance as methods that choose to label data close to the decision boundary. We note that, for generative classifiers, the latter approach changes the data distribution, therefore leading to estimation bias. Nevertheless, McCallum and Nigam [McCallum and Nigam, 1998] used active learning with generative models with success. They proposed to first actively query the labels of some of the data, followed by estimation of the model’s parameters with the remainder of the unlabeled data.
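To make the two query strategies concrete, the following is a minimal sketch: random sampling versus querying the samples the current model is least certain about, using predictive entropy as a common proxy for "close to the decision boundary". The function names and the use of scikit-learn's GaussianNB are illustrative assumptions, not the chapter's implementation.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def query_random(pool_X, n_queries, rng):
    # Uniform random queries: leaves the sampled distribution p(C, X) unchanged.
    return rng.choice(len(pool_X), size=n_queries, replace=False)

def query_uncertain(model, pool_X, n_queries):
    # Query the pool points with the highest predictive entropy, i.e. the
    # samples closest to the decision boundary; this biases the labeled set.
    proba = model.predict_proba(pool_X)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return np.argsort(entropy)[-n_queries:]

A model trained on an initial labeled seed, e.g. model = GaussianNB().fit(X0, y0), would then be retrained after each batch of queried labels.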

We are interested in seeing the effect of active learning when learning a generative Bayesian network. Figure 7.6 provides an interesting example. The figure shows the classification error as a function of the number of (actively queried) labeled data when estimating an arbitrary generative Bayesian network with a Bayes error rate of approximately 6.3%. The classifier includes nine binary observables, with many arcs between the observable nodes, and a binary class variable. We use the simple random active learning approach (labeling of random samples) and the approach of choosing samples with high information gain (i.e., around the decision boundary). We use an initial pool of 20 labeled samples and query additional samples from a pool of 10,000 unlabeled samples. We also test McCallum and Nigam’s approach of using the remainder of the unlabeled data after a portion is labeled actively. Figure 7.6(a) shows this approach when the correct structure is specified. We see that for the methods using only labeled data, performance improves at more or less the same rate as the number of labeled data increases. We also see that the addition of labeled samples does not make any difference when adding the unlabeled data (Random+EM, Info Gain+EM): the unlabeled data are already sufficient to achieve an error rate close to the Bayes error. Figure 7.6(b) shows the performance when an incorrect model (TAN) is assumed. Here we see that querying samples close to the decision boundary allows a faster improvement in performance compared to random sampling; in fact, the bias introduced by the boundary sampling allows the classifier to achieve better overall performance even after 1000 samples. When adding the unlabeled data, performance degradation occurs, and as more labeled data are added (changing the ratio between labeled and unlabeled data), the classification performance does improve, but, at 1000 labeled samples, is still far from that using no unlabeled data.
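The experimental procedure can be paraphrased as the following hedged sketch, reusing the query helpers and imports from the previous block: several active-learning rounds, followed by EM that folds the remaining unlabeled pool back in as fractionally weighted replicates. The oracle callback, batch sizes, and the soft-weight refit are assumptions made for illustration, not the book's code.

def active_then_em(X_lab, y_lab, X_pool, oracle, n_rounds=10, batch=20, em_iters=5, rng=None):
    rng = rng or np.random.default_rng(0)
    model = GaussianNB().fit(X_lab, y_lab)
    # Active phase: query a batch, obtain its true labels, retrain.
    for _ in range(n_rounds):
        idx = query_uncertain(model, X_pool, batch)   # or query_random(X_pool, batch, rng)
        X_lab = np.vstack([X_lab, X_pool[idx]])
        y_lab = np.concatenate([y_lab, oracle(X_pool[idx])])  # oracle returns true labels
        X_pool = np.delete(X_pool, idx, axis=0)
        model.fit(X_lab, y_lab)
    # EM phase: treat each remaining unlabeled point as one replicate per
    # class, weighted by the current posterior (a soft E-step), then refit.
    n_classes = len(model.classes_)
    for _ in range(em_iters):
        post = model.predict_proba(X_pool)                      # E-step
        X_all = np.vstack([X_lab] + [X_pool] * n_classes)
        y_all = np.concatenate([y_lab] + [np.full(len(X_pool), c) for c in model.classes_])
        w_all = np.concatenate([np.ones(len(y_lab))] + [post[:, k] for k in range(n_classes)])
        model.fit(X_all, y_all, sample_weight=w_all)            # M-step
    return model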

The above results suggest the following statements. With correctly specified generative models and a large pool of unlabeled data, “passive” use of the unlabeled data is typically sufficient to achieve good performance. Active learning can help reduce the chances of numerical errors (for example, by improving the starting point of EM), and help in the estimation of the classification error. With incorrectly specified generative models, active learning is very profitable in quickly reducing the error, while adding the remainder of the unlabeled data might not be desirable.

9. Concluding Remarks

Figure 7.6. Learning generative models with active learning. (a) Assuming correct structure; (b) assuming incorrect structure (TAN). Both panels plot the classification error against the number of labeled data for the Random, Random+EM, Info Gain, and Info Gain+EM methods.

The discussion in this chapter illustrated the importance of structure learning when using unlabeled data for Bayesian network classifiers. We discussed different strategies that can be used, such as model switching and structure learning. We introduced a classification driven structure search algorithm based on Metropolis-Hastings sampling, and showed that it performs well both on fully labeled datasets and on labeled and unlabeled training sets.
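As a rough illustration of the stochastic search idea, the following is a minimal Metropolis-Hastings sketch over network structures. The neighbor proposal, the scoring function, and the temperature are placeholders supplied by the caller; the chapter's classification driven score and acceptance rule may differ in detail.

import math
import random

def mh_structure_search(initial_graph, propose, score, n_steps=1000, temp=1.0, rng=random):
    # propose(G) returns a random single-edge modification (add/remove/reverse);
    # score(G) is the classification-driven score to be maximized. A symmetric
    # proposal distribution is assumed, so the Hastings correction cancels.
    current, current_score = initial_graph, score(initial_graph)
    best, best_score = current, current_score
    for _ in range(n_steps):
        candidate = propose(current)
        candidate_score = score(candidate)
        # Accept with probability min(1, exp((S_candidate - S_current) / temp)).
        if math.log(rng.random() + 1e-300) < (candidate_score - current_score) / temp:
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best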

The idea of structure search is particularly promising when unlabeled data are present. It seems that simple heuristic methods, such as the solution proposed by Nigam et al. [Nigam et al., 2000] of weighing down the unlabeled data, are not the best strategy for unlabeled data. We suggest that structure search, and in particular stochastic structure search, holds the most promise for handling large amounts of unlabeled data and relatively scarce labeled data for classification. We also believe that the success of structure search methods for classification significantly increases the breadth of applications of Bayesian networks.

To conclude, this chapter presented and compared several different approaches for learning Bayesian network classifiers that attempt to improve classification performance when learning with unlabeled data.

While none of these methods is universally superior, each method is useful under different circumstances:

Learning TANs using EM-TAN is a fast and useful method. Although the expressive power of TAN structures is still limited, it usually outperforms the Naive Bayes classifier in semi-supervised learning.

Structure learning in general, and classification driven structure search in particular, is usually the best approach to learn with unlabeled data. However, this approach comes at a computational cost and would work only with reasonably large datasets.

It is difficult to learn structure using independence tests and unlabeled data: the reliability of the independence tests is not sufficient.

Active learning is a promising approach and should be used whenever model uncertainty is prevalent. An interesting idea is to try to use active learning to assist in learning the model, querying data that will help determine whether a model change is called for. Pursuit of this idea is the subject of future research.

In a nutshell, when faced with the option of learning with labeled and unlabeled data, our discussion suggests the following path, sketched in code after the list.

Start with Naive Bayes and TAN classifiers, learn with only labeled data and test whether the model is correct by learning with the unlabeled data, using EM and EM-TAN.

If the result is not satisfactory, then, given enough computational resources, SSS (stochastic structure search) can be used to attempt to further improve performance.

If none of the methods using the unlabeled data improve performance over the supervised TAN (or Naive Bayes), active learning can be used, as long as there are resources to label some samples.
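This path can be summarized in a short, hypothetical driver. Every trainer name below is a placeholder for the corresponding algorithm in this chapter, passed in by the caller, and the selection logic is only a sketch of the recommended order, not a prescribed implementation.

def choose_classifier(trainers, evaluate, labeled, unlabeled):
    # trainers: dict mapping a name ("NB", "TAN", "EM-NB", "EM-TAN", "SSS",
    # "ActiveLearning") to a callable taking (labeled, unlabeled) and
    # returning a trained model; evaluate(model) returns validation accuracy.
    results = {name: trainers[name](labeled, unlabeled)
               for name in ("NB", "TAN", "EM-NB", "EM-TAN")}
    best = max(results, key=lambda name: evaluate(results[name]))
    if best in ("EM-NB", "EM-TAN"):
        return results[best]          # the unlabeled data already help
    # Unlabeled data hurt: likely an incorrect structure, so try SSS.
    if "SSS" in trainers:
        sss_model = trainers["SSS"](labeled, unlabeled)
        if evaluate(sss_model) > evaluate(results[best]):
            return sss_model
    # Last resort: spend labeling resources via active learning.
    if "ActiveLearning" in trainers:
        return trainers["ActiveLearning"](labeled, unlabeled)
    return results[best]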

Chapters 10 and 11 present two different human-computer interaction applications (facial expression recognition and face detection) that use the algorithms suggested in this chapter.

