
2.8 Experimental Comparison on Approximately Monotonic CP

2.8.1 Accuracy Comparison

We begin the accuracy evaluation by comparing all the presented Cube Pruning algorithms with the beam size set to 30. Notice that this implies k = 30 for the CP sub-problems. In Table 2.1 we report the collected measures. One of the most widely used accuracy metrics for Machine Translation systems is the BLEU score (Papineni et al., 2002). In the second column of Table 2.1 we report the BLEU score achieved by each algorithm. NCP has a noticeably lower score, while all the other algorithms reach approximately the same BLEU score; the BLEU score variations among these algorithms cannot be considered significant. Since we compare decoding algorithms on the same search space, the best way to compare their accuracy is in terms of search score. The search score is the score on which the decoder bases the bottom-up search and the pruning actions. For each algorithm we compute the average score of the best translation found for the test sentences.
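As a rough illustration of how this measurement could be scripted, here is a minimal sketch; the decode() helper (which is assumed to run one of the decoders on a sentence and return the best translation together with its search score) and the algorithm labels are hypothetical placeholders, not the actual implementation used in these experiments.

```python
# Minimal sketch of the per-algorithm accuracy measurement; decode() and the
# algorithm labels are hypothetical placeholders, not the real system.

ALGORITHMS = ["SCP", "FCP1", "FCP2", "FCP3", "LCP", "NCP"]  # assumed labels
BEAM_SIZE = 30  # also the k of the CP sub-problems


def average_search_score(test_sentences, algorithm, decode, beam_size=BEAM_SIZE):
    """Average search score (log of the probability mass) of the 1-best translation."""
    scores = []
    for sentence in test_sentences:
        _best_translation, search_score = decode(sentence, algorithm, beam_size)
        scores.append(search_score)
    return sum(scores) / len(scores)
```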

The search score can be interpreted as the logarithm of the probability mass assigned to each translation. In the third column of Table 2.1 we report the average search score for each algorithm. We notice that the search score values increase with the amount of search space explored by each algorithm. SCP is the algorithm exploring the largest portion of the search space, and it reaches the best search score. The FCP algorithms prune more than SCP and obtain a score that is slightly lower. Among the FCP algorithms, FCP1 is the one obtaining the best score, and it is also the one exploring the largest portion of the search space. FCP2 and FCP3 have the same accuracy, since they insert the candidate elements in the same order and apply the same update policy, as explained in Section 2.6.3. LCP performs slightly worse. NCP performs particularly badly.

It is not easy to interpret these scores as they appear. To give a more meaningful comparison, we compute the variation with respect to the SCP score in terms of probability. Given an average search score X, the probability mass associated with it is C·e^X, where C is a constant factor. The probability variation with respect to SCP is:

\[
\frac{C \cdot e^{X} - C \cdot e^{X_{SCP}}}{C \cdot e^{X_{SCP}}} \;=\; \frac{e^{X} - e^{X_{SCP}}}{e^{X_{SCP}}}
\tag{2.30}
\]

where X_SCP is the score associated with SCP.
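Since the constant C cancels, the variation in Equation 2.30 reduces to e^(X − X_SCP) − 1 and can be computed directly from the two average scores. A minimal sketch follows; the function name is illustrative, and math.expm1 is used only for numerical stability when the score difference is small.

```python
import math


def relative_prob_variation(avg_score, avg_score_scp):
    """Equation 2.30: (C*e^X - C*e^X_SCP) / (C*e^X_SCP) = e^(X - X_SCP) - 1."""
    return math.expm1(avg_score - avg_score_scp)
```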

In the fourth column of Table 2.1 we report the variations in terms of probability with respect to SCP. We can see that FCP1 performs well, losing only ≈ 1%. The other algorithms have a moderate loss of less than 7%, except for NCP, which loses more than one third of the probability mass. This result shows that LCP performs much better than a naive linear-time algorithm, and its performance is actually close to that of the O(n log n) algorithms.
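To make the mapping from score gaps to these percentages concrete, the following illustrative round-number gaps (not values taken from Table 2.1) show how the loss grows with the score difference:

\[
e^{-0.01} - 1 \approx -1.0\%, \qquad
e^{-0.07} - 1 \approx -6.8\%, \qquad
e^{-0.4} - 1 \approx -33\%.
\]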

      BLEU     avg. score    relative prob. var.
SCP   32.60    -118.5016     –

Table 2.2: Evaluation and comparison of the accuracy with beam size = 300. For each of the algorithms presented we report the BLEU score (second column), the average internal search score (third column), and the variation in terms of probability with respect to SCP (fourth column).

For the first series of experiments the beam size was set to 30. Beams of this magnitude are typical of systems that focus on optimizing speed. Systems optimizing for accuracy generally use larger beam values, on the order of hundreds. We repeat the previous experiments with a beam size of 300 and report the results in Table 2.2. The ordering of the algorithms is similar to the one obtained in the previous set of experiments. The main difference is that the gaps between the algorithms are smaller. This can be justified by observing that, by increasing the beam size, all the CP algorithms converge to the exhaustive search.

Figure 2.10: Score variation in terms of probability with respect to SCP, for each algorithm presented, at different values of the beam size.

Also notice that FCP1 performs slightly better than SCP. This is due to the fact that with this beam size the performance of the two algorithms is really close, and a perturbation in an intermediate pruning step may lead FCP1 to prune a sub-tree that has a better partial score but nonetheless end up with complete derivations that have a better global score. Thus the +0.1101% increase in average probability mass obtained by FCP1 should not be considered significant, and the accuracy of FCP1 with k = 300 should be considered at the same level as SCP's accuracy.

Now we wish to investigate in more detail how the accuracy changes with the beam size. We compute the average score of the best translation found for the test sentences, for all the algorithms, at different beam sizes in the range [1, 10000]. We can safely assume that the loss in accuracy of each of these algorithms is bounded, because for larger beam sizes all the algorithms tend to converge to the exhaustive search; thus the score-variation curves will tend to 0 without significant variations. The collected measures are plotted in Figure 2.10.
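A compact sketch of this sweep is given below, reusing the hypothetical average_search_score() and relative_prob_variation() helpers from the earlier sketches; the beam-size grid and the plotting calls are illustrative, not the exact setup used to produce Figure 2.10.

```python
# Sketch: probability variation w.r.t. SCP as a function of the beam size.
# average_search_score() and relative_prob_variation() are the hypothetical
# helpers sketched earlier; the grid below is an illustrative sample of [1, 10000].
import matplotlib.pyplot as plt

BEAM_SIZES = [1, 3, 10, 30, 100, 300, 1000, 3000, 10000]


def sweep_beam_sizes(test_sentences, algorithms, decode):
    curves = {name: [] for name in algorithms if name != "SCP"}
    for k in BEAM_SIZES:
        scp_avg = average_search_score(test_sentences, "SCP", decode, beam_size=k)
        for name in curves:
            avg = average_search_score(test_sentences, name, decode, beam_size=k)
            curves[name].append(relative_prob_variation(avg, scp_avg))
    return curves


def plot_curves(curves):
    for name, values in curves.items():
        plt.plot(BEAM_SIZES, values, label=name)
    plt.xscale("log")
    plt.xlabel("beam size")
    plt.ylabel("probability variation w.r.t. SCP")
    plt.legend()
    plt.show()
```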

The blue curve plots the FCP accuracy variation; it stays close to the x axis, which means that FCP's accuracy is really close to SCP's. At many points FCP's accuracy is slightly better than SCP's, and the blue curve crosses the x axis. As we discussed above in more detail, this is evidence that FCP's accuracy is not significantly different from SCP's, and noise can cause FCP to perform better than SCP in a few instances. In general, FCP1's variation has a lower bound of