A Benchmark Study on the Effectiveness of
Search-based Data Selection and Feature Selection for Cross Project Defect Prediction
Seyedrebvar Hosseini (a), Burak Turhan (b), Mika Mäntylä (a)

(a) M3S, Faculty of Information Technology and Electrical Engineering, University of Oulu, 90014, Oulu, Finland
(b) Department of Computer Science, Brunel University London, UB8 3PH, United Kingdom
Abstract
Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, feature selection and data quality are issues to consider in CPDP.
Objective: We aim at utilizing the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, to produce validation sets for generating evolving training datasets to tackle CPDP while accounting for potential noise in defect labels. We also investigate the impact of using different feature sets.
Method: We extend our proposed approach, Genetic Instance Selection (GIS), by incorporating feature selection in its setting. We use 41 releases of 11 multi-version projects to assess the performance of GIS in comparison with benchmark CPDP approaches (NN-filter and Naive-CPDP) and within project approaches (Cross-Validation (CV) and Previous Releases (PR)). To assess the impact of feature sets, we use two sets of features, SCM+OO+LOC (all) and CK+LOC (ckloc), as well as iterative info-gain subsetting (IG) for feature selection.
Results: The GIS variant with info gain feature selection is significantly better than NN-Filter (all, ckloc, IG) in terms of F1 (p-values < 0.001, Cohen's d = {0.621, 0.845, 0.762}) and G (p-values < 0.001, Cohen's d = {0.899, 1.114, 1.056}), and better than Naive CPDP (all, ckloc, IG) in terms of F1 (p-values < 0.001, Cohen's d = {0.743, 0.865, 0.789}) and G (p-values < 0.001, Cohen's d = {1.027, 1.119, 1.050}). Overall, the performance of GIS is comparable to that of the within project defect prediction (WPDP) benchmarks, i.e. CV and PR. In terms of the multiple comparisons test, all variants of GIS belong to the top ranking group of approaches.

Email addresses: rebvar@oulu.fi (Seyedrebvar Hosseini), burak.turhan@brunel.ac.uk (Burak Turhan), mika.mantyla@oulu.fi (Mika Mäntylä)
Conclusions: We conclude that datasets obtained from search based approaches combined with feature selection techniques are a promising way to tackle CPDP. In particular, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS relies on high recall at the expense of a loss in precision. Using different optimization goals, utilizing other validation datasets and applying other feature selection techniques are possible future directions to investigate.
Keywords: Cross Project Defect Prediction, Search Based Optimization, Genetic Algorithms, Instance Selection, Training Data Selection
1. Introduction
Despite the extensive body of studies and its long history, software defect prediction is still a challenging problem in the field of software engineering [1]. Software teams mainly use software testing as their primary method of detecting and preventing defects during development. On the other hand, software testing can be very time consuming while the resources for such tasks might be limited; hence, detecting defects in an automated way can save a great deal of time and effort [2, 3, 4].
Defect data from previous versions of the same project can be used to detect defect prone units in new releases. Prediction based on historical data collected from the same project is called within project defect prediction (WPDP) and has been studied extensively [5, 6, 7, 8, 9, 10, 11]. New code and releases in the same project usually share many common characteristics, which makes them a good match for constructing prediction models. But this approach is subject to criticism, as within project data is usually not available for new projects. Moreover, this problem only tends to grow as the need for software based platforms and services increases at a rapid pace. On the other hand, there are plenty of relevant public datasets available, especially in open source repositories [12], that can act as candidates to identify and prevent bugs in the absence, and even the presence, of within project data. Using the available public datasets, one can investigate the usefulness of models created on data from other projects, especially for projects with limited or no defect data repository [3, 4, 13].
Various studies have focused on different aspects of defect prediction, including data manipulation approaches (such as re-sampling [14, 15, 16, 17], re-weighting [17, 18, 19, 20], filtering [4, 13, 21, 22], etc.), learning technique optimization (boosting [14, 15, 17, 23], bagging [23], ensembles [22, 23, 24, 25], etc.) and metrics (feature selection [23, 26, 27, 28], different sets of metrics [29]), to mention some. Predictions usually target the effectiveness of the approaches in two main categories: binary and continuous/multi-class predictions [1]. The features also come at different levels of code, including function/method, class, file and even component levels [1]. As will be discussed in later sections, the experiments in this study are class level binary predictions performed on 41 Java projects. A summary of the related studies is presented in Section 2.
The learning approach and the training data are two of the major elements in building high performance prediction models. Finding a suitable dataset (instances and features) with similar defect distribution characteristics to the test set is likely to increase the performance of prediction models [30]. Since defect detection and label assignment are based on mining version control systems [31, 32, 33], the process can be prone to errors and the data quality can be questionable [34, 35]. In other words, the labels of some of the instances might not have been identified correctly, two or more instances with the same measurements can have different labels, or undetected defects may not be captured in the dataset at the time of dataset compilation.

In this study, we address these problems with a search based instance selection approach, in which a mutation operation is designed to account for data quality. Using the genetic algorithm, we guide the instance selection process with the aim of converging to datasets that match the characteristics of the test set more precisely. The fitness function at each generation is evaluated on a validation set generated via the (NN)-Filter. With these, we handle the potential noise in the data while tackling the training data instance selection problem with GIS.
In our earlier work [35], we combined training data instance selection with search based methods to address potential defect labeling errors, and we compared our method's performance with benchmark CPDP and WPDP (cross validation) approaches. This study not only acts as a replication of our original study but also extends it in the following ways:
• Including a more comprehensive set of datasets. In the original study, only the last available dataset of each multi-version software project was used in our experiments. Overall, 13 datasets from the PROMISE repository were used to assess the performance of our proposed approach. We extend the number of datasets to 41 from 11 projects. The reason for choosing these datasets is their maturity (multiple versions) and the choice of the WPDP previous releases benchmark.
• Including an extra within project benchmark (previous releases). We used a single WPDP benchmark in our original study, namely 10 fold stratified cross validation. Since multiple releases are available for each project in this extension, we compare the performance of GIS with this WPDP benchmark as well.
• Investigating the effect of different sets of metrics and the sensitivity of GIS toward them. Chidamber & Kemerer (CK)+LOC was used in the original study, due to its reported good performance by Hall et al. [1] as well as in recent CPDP studies [25, 36]. To better understand how GIS reacts to the choice of metrics, we used all features as well as one feature selection technique, i.e. iterative infogain subsetting. We conducted these sets of experiments to address one of the threats to the validity of our original conclusions, i.e. the selection of software metrics. Interestingly, adding SCM to CK+LOC has a positive effect on the WPDP performance despite its negative effect on GIS, making it necessary to include this setting, as failing to do so would be a threat to the conclusion validity.
• Presenting more extensive analysis, related work and discussions. A wider range of datasets, benchmarks and results requires more comprehensive analysis. The results are presented through multiple types of diagrams (violin plots, critical difference (CD) diagrams, line plots) and the approaches are compared through statistical tests for pairwise and multiple comparisons, exploring different perspectives of the achieved results.
Accordingly, the aim of this study is to answer the following research questions:
RQ1: How does the performance of GIS compare with benchmark cross project defect prediction approaches?

RQ2: How does the performance of GIS compare with within project defect prediction approaches?

RQ3: How do different feature sets affect the performance of GIS?
This paper is organized as follows: The next section summarizes the related studies on CPDP and briefly describes how our study differs. The proposed approach, datasets and experimental procedures are presented in Section 3. Section 4 presents the results of our analysis and discussions. Section 5 addresses some of the concerns that arise from the analysis and wraps up our findings. Section 6 discusses the threats to the validity of our study. Finally, the last section concludes the paper with a summary of the findings as well as directions for future work.
2. Related Work
Cross project defect prediction (CPDP) has drawn a great deal of interest recently. To predict defects in projects without sufficient training data, many researchers have attempted to build novel and competitive CPDP models [4, 13, 18, 21, 31, 37, 38]. However, not all studies report good performances for CPDP [4, 16, 38].
In a series of experiments, Turhan et al. [4] observed that CPDP underperforms WPDP. They also found that despite its good probability of detection rates, CPDP causes excessive false alarms. They proposed to use the (NN)-Filter to select the most relevant training data instances based on a similarity measure. Through this method, they were able to lower the high false alarm rates dramatically, but their model performance was still lower than WPDP.
Zimmermann et al. [38] performed a large scale set of CPDP experiments by creating 622 pair-wise prediction models on 28 datasets from 12 projects (open source and commercial) and observed only 21 pairs (3.4%) that matched their performance criteria (precision, recall and accuracy all greater than 0.75). This observation suggests that the majority of predictions will probably fail if the training data is not selected carefully. They also found that CPDP is not symmetrical: data from Firefox can predict Internet Explorer defects, but the opposite does not hold. They argued that the characteristics of data and process are crucial factors for CPDP.
He et al. [13] proposed to use distributional characteristics (median, mean, variance, standard deviation, skewness, quantiles, etc.) for training dataset selection. They concluded that, in the best cases, cross project data may provide acceptable prediction results. They also state that training data from the same project does not always lead to better predictions and that carefully selected cross project data may provide better prediction results than within-project (WP) data. They also found that data distributional characteristics are informative for training data selection. They used a meta-learner, built on top of the prediction results of a decision table learner, to predict the outcome of the models before making actual predictions.
Herbold [18] proposed distance-based strategies for the selection of training data based on the distributional characteristics of the available data. They presented two strategies, based on EM (Expectation Maximization) clustering and the NN (Nearest Neighbor) algorithm with distributional characteristics as the decision strategy. They evaluated the strategies in a large case study with 44 versions of 14 software projects and observed that i) weights can be used to successfully deal with biased data and ii) training data selection provides a significant improvement in the success rate and recall of defect detection. However, their overall success rate was still too low for the practical application of CPDP.
Turhan et al. [21] evaluated the effects of mixed project data on predictions. They tested whether mixing WP and CP data improves prediction performance. They performed their experiments on 73 versions of 41 projects using the Naïve Bayes classifier. They concluded that mixed project data significantly improves the performance of the defect predictors.
Zhang et al. [39] created a universal defect prediction model from a large pool of 1,385 projects with the aim of relieving the need to build prediction models for individual projects. They approached the problem of variations in the distributions by clustering and rank transformation using the similarities among the projects. Based on their results, their model obtained prediction performance comparable to the WP models when applied to five external projects, and it performed similarly among projects with different context factors.
Ryu et al. [22] presented a Hybrid Instance Selection with the Nearest Neighbor (HISNN) method using hybrid classification to address class imbalance for CPDP. Their approach used a combination of the Nearest Neighbour algorithm and the Naïve Bayes learner to address the instance selection problem.
He et al. [26] considered CPDP from the viewpoint of metrics and features by investigating the usefulness of simplified metric sets. They used a greedy approach to filter the list of available metrics and proposed to use different sets of metrics according to the defined criteria. They observed that minimum feature subsets and TOPK metrics could provide acceptable results compared with their benchmarks. They further concluded that the minimum feature subset can improve the predictions despite an acceptable loss of precision.
Feature selection and, more specifically, feature matching was studied by Nam et al. [28]. Their proposed approach enables predictions on training and test datasets with different sets of metrics. They used statistical procedures for the feature matching process and observed that their CPDP approach outperforms WPDP in 68% of the predictions.
While the above-mentioned studies focus on the dataset, instance and feature selection problems, none of them uses a search based approach. One such approach in the defect prediction context was considered by Liu et al., who tried to come up with mathematical expressions as solutions that maximize the effectiveness of their approach [36]. They compared their approach with 17 non-evolutionary machine learning algorithms and concluded that the search-based models consistently decrease the misclassification rate compared with the non-search-based models.

Canfora et al. proposed a search based multi-objective optimization approach [40] for CPDP. Using the multi-objective genetic algorithm NSGA-II, they tried to come up with an optimal cost effectiveness model for CPDP. They concluded that their approach outperforms the single objective, trivial and local prediction approaches. Recently, Xia et al. [41] conducted a search based experiment consisting of a genetic algorithm phase and an ensemble phase. They utilized logistic regression as their base learner and small chunks of within project data in their settings. They compared their proposed approach, i.e. HYDRA, with some of the most recent CPDP approaches and observed that it outperformed the benchmarks significantly. Even though these studies use a search based approach, they are focused neither on the instance/dataset selection problem nor on the data quality problem and hence differ from our approach.
3. Research Methodology
This section describes the details of our study, starting with a discussion of our motivation. We then present the proposed approach as well as the benchmark methods, datasets and metrics, and the performance evaluation criteria used in our study.
3.1. Motivation
If selected carefully, a dataset from other projects can provide better predictive power than WP data [13], as the large pool of available CP data has the potential to cover a larger range of the feature space. This may lead to a better match between training and test datasets and, consequently, to better predictions.
One of the first attempts in this area was the idea of filtering the training dataset instances [4]. In this approach, the most similar instances from a large pool containing all the training instances from other projects are selected using the k-NN algorithm. Since these instances are closer to the test set based on a particular distance measure, they could potentially lead to better predictions. Using the distributional characteristics of the test and training datasets is another approach used in multiple studies [13, 42]. Clustering the instances is yet another approach used in other studies [18, 31, 39]. While these methods have been shown to be useful, the search based approach to selection is not considered by any of these papers. An evolving dataset, starting with initial datasets generated using one or a combination of these approaches, can be a good candidate for a search based data selection problem.
Data Quality. Another inspiration for this work is the fact that the public datasets are prone to quality issues and contain noisy data [34, 35, 43]. Since defects are discovered over time, certain defects might not have been discovered by the time the datasets were compiled; hence, some of the instances in the training set may misrepresent themselves as non-defective while defective instances with similar measurements exist in the test set. In contrast, while some test instances are not defective, the most similar items in the training set might be labeled as defective. In short, some of the instances in the test set can have similar measurements to instances in the training set, yet different class labels. Please note that mislabeling may not be the only reason for such situations; they can also occur naturally, i.e. the class labels of similar measurements can differ depending on the metric set used. Acknowledging the noise in the data and guiding the learning algorithm to account for it can lead to better predictions, as we proposed in our original paper [35] and validate in this study.
Features. We used the CK+LOC metric set originally to assess the performance of GIS and the other benchmarks. Hall et al. [1] asserted that OO (Object-Oriented) and LOC metrics have acceptable prediction power and outperform SCM (static code metrics). They also observed that adding SCM to OO and LOC is not related to better performance [1]. Moreover, the usefulness of CK+LOC was validated in multiple studies in the context of CPDP [25, 37, 40], of which [40] involves using search based approaches. We extend our work by considering not only CK+LOC but also OO+SCM+LOC, as well as feature selection.

[Figure 1: Summary of the search based training data instance selection and performance evaluation process used in this paper.]
We used the information gain concept to select the top ranked features, i.e. those which explain the highest amount of entropy in the data. The information gain method is relatively fast and the ranking process does not need any actual predictions. Please note that there certainly exist more sophisticated and powerful feature selection approaches that could be used in the context of CPDP. The use of information gain feature subsetting is due to its simplicity and speed, and it serves as a proof of concept that even a simple refinement of the data through a guided procedure can lead to practical improvements.

Infogain is defined as follows: for an attribute A and a class C, the entropy of the class before and after observing A is as follows:
H(C) = -\sum_{c \in C} p(c) \log_2 p(c)    (1)

H(C|A) = -\sum_{a \in A} p(a) \sum_{c \in C} p(c|a) \log_2 p(c|a)    (2)

The amount of entropy explained by including A reflects the additional information acquired and is called information gain. Using this approach, the features of the datasets are ranked from the highest to the lowest amount of entropy explained. We used iterative InfoGain subsetting [44] to select the appropriate set of features for our experiments. Iterative InfoGain subsetting starts by training the predictors using the top n ranked attributes for n ∈ {1, 2, ...} and continues until having j + 1 attributes instead of j no longer improves the predictions. An improvement in predictions was measured using the F1 values achieved from a 1 × 10 fold cross validation on the training dataset. The train/test splits were identical when adding the features iteratively during the feature selection operation.

Algorithm 1 Pseudo code for GIS
1:  Set numGens = the number of generations of each genetic optimization run
2:  Set popSize = the size of the population
3:  Set DATA = 41 releases from {ant, camel, ivy, jedit, log4j, lucene, poi, synapse, velocity, xalan, xerces}
4:  Set FEATURES = {CKLOC, All, IG}
5:  for FSET in FEATURES do
6:    for RELEASE in DATA do
7:      Set TEST = instances from RELEASE with metric set FSET
8:      Set TRAIN = instances from all other projects with metric set FSET
9:      tdSize = 0.02 * number of instances in TRAIN
10:     for i = 1 to 30 do
11:       Set TestParts = split TEST instances into p parts
12:       for each testPart in TestParts do
13:         Set vSet = generate a validation dataset using the (NN)-Filter method (with three distance measures)
14:         Set TrainDataSets = create popSize datasets from TRAIN with replacement, each with tdSize instances
15:         for each td in TrainDataSets do
16:           Evaluate td on vSet and add it to the initial generation
17:         end for
18:         for g in range(numGens) do
19:           Create a new generation using the defined crossover and mutation functions and the elites from the current generation
20:           Combine the two generations and extract a new generation
21:         end for
22:         Set bestDS = select the top dataset from the GA's last iteration
23:         Evaluate bestDS on testPart and append the results to the pool of results
24:       end for
25:       Calculate Precision, Recall, F1 and G from the predictions
26:     end for
27:     Report the median of all 30 experiments
28:   end for
29: end for
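As an illustration of the ranking and subsetting steps above, the following is a minimal Python sketch, not the original implementation (which used WEKA); the 10-bin discretization of continuous metrics and the evaluate_f1 callback standing in for the 1 × 10 fold cross validation are our assumptions:

import numpy as np

def entropy(labels):
    """Shannon entropy H(C) of a label vector, per Eq. (1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels, bins=10):
    """IG = H(C) - H(C|A); continuous metrics are discretized into bins."""
    binned = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    h_cond = 0.0
    for a in np.unique(binned):
        mask = binned == a
        h_cond += mask.mean() * entropy(labels[mask])   # p(a) * H(C|A=a)
    return entropy(labels) - h_cond

def iterative_subset(X, y, evaluate_f1):
    """Add features from highest- to lowest-ranked until F1 stops improving."""
    order = np.argsort([-info_gain(X[:, j], y) for j in range(X.shape[1])])
    best_f1, selected = -1.0, []
    for j in order:
        f1 = evaluate_f1(X[:, selected + [j]], y)       # e.g. 1x10-fold CV F1
        if f1 <= best_f1:
            break                                       # j+1 features no better than j
        best_f1, selected = f1, selected + [j]
    return selected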
3.2. Proposed Approach
268
Figure 1 visualizes the whole research process reported in this paper. The
269
details of the search based optimizer are not present in the figure and instead,
270
they are provided in Algorithm 1 and discussed below.
271
The process starts with splitting the test set into p parts randomly (p = 5
272
in our experiments). Partitioning the test set into smaller chunks plays an im-
273
portant role in the overall procedure. By creating smaller chunks, the process
274
of optimizing and adjusting the dataset is easier as there are less elements to
275
consider and the datasets generated could be better representatives for these
276
smaller chunks than the whole dataset. This procedure however, adds extra
277
complexity to the model and the run-time would increase consequently since
278
a search based optimizer is required for each part.
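For concreteness, the random partitioning could look like the sketch below; the seed and the use of numpy are illustrative assumptions:

import numpy as np

def split_test_set(n_test, p=5, seed=0):
    """Randomly split test-instance indices into p roughly equal chunks."""
    idx = np.random.default_rng(seed).permutation(n_test)
    return np.array_split(idx, p)

# Example: a 745-class release (e.g. ant-1.7, Table 2) yields five chunks of ~149 classes.
parts = split_test_set(745)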
Each part (without the labels) is fed into the (NN)-Filter instance selection method in order to select the most relevant instances from the training set for the purpose of reserving a validation set, on which we optimize the search process. Please note that the training set is a combination of all the instances from other projects. We used the closest three instances with multiple distance measures to account for the possible error of using a specific distance measure. The unique instances from the generated set were selected to act as the validation dataset used to guide our instance selection process. The availability of mixed data, as used in [17, 21, 41], could also potentially act as a replacement for the aforementioned similarity measures and boost the performance of our approach.
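A possible reading of this step in Python is sketched below; the paper does not name the distance measures, so euclidean, cityblock and cosine are placeholders chosen for illustration only:

import numpy as np
from scipy.spatial.distance import cdist

def nn_filter_validation(test_part, train_X, k=3,
                         metrics=("euclidean", "cityblock", "cosine")):
    """For each test instance, take the k nearest training rows under each
    distance measure; the union of unique picks forms the validation set."""
    picked = set()
    for m in metrics:
        d = cdist(test_part, train_X, metric=m)    # shape: (n_test, n_train)
        nearest = np.argsort(d, axis=1)[:, :k]     # k closest per test row
        picked.update(nearest.ravel().tolist())
    return sorted(picked)                          # indices into train_X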
The process then randomly creates an initial population containing popSize datasets (popSize = 30 in our experiments). Each population element is a dataset selected randomly and with replacement from the large pool of training set instances (see Table 1). The selected number of population members and their sizes leads to an average coverage of 94.99% (std = 0.031) of the instances in the large pool of available training instances (with multiple copies of some) in each iteration.
Each population member is then evaluated on the validation set, which was acquired via the (NN)-Filter in the previous step. Then, for numGens generations, a new population is generated and the top elements are selected to survive and move to the next generation. There is also an alternative stopping criterion for GIS (described below). These procedures are repeated 30 times to address the randomness introduced by both the dataset selection and the genetic operations. Below, the genetic operations and parameters are discussed in more detail:
Initial Population: The initial population is generated using random instance selection with replacement from a large pool of instances containing all elements from projects other than the test project. The instances might contain elements included in the validation dataset, as they are not removed from the large pool of candidate training instances due to their possible usefulness for the learning process. The selection process consumes 94.99% of the initial training data on average and eliminates a group of them with each passing generation.

Table 1: Chromosome structure

        F1    F2    ...   Fm-2   Fm-1   Fm     L
C1      6     0     ...   0      1      1      0
C2      3     0.97  ...   12     3      1      1
C3      5     0.69  ...   12.6   4      1.4    0
...     ...   ...   ...   ...    ...    ...    ...
Cn-2    3     0.98  ...   16     2      1      0
Cn-1    3     0.82  ...   8.33   1      0.67   0
Cn      16    0.73  ...   28.3   9      1.56   1
Chromosome Representation: Each chromosome contains a number of instances from a list of projects. A chromosome is a dataset sampled from the large training dataset randomly and with replacement. A typical chromosome example can be seen in Table 1. F_i represents the i-th selected feature and L represents the class label. A typical chromosome contains n instances, denoted by C_1 to C_n.

We used fixed size chromosomes in our experiments. The size of each chromosome (dataset) was set to 0.02 of the size of the large pool of training data from other projects (i.e. 2%, matching tdSize = 0.02 * |TRAIN| in Algorithm 1). The fixed size chromosome was selected to show the effectiveness of our proposed approach with respect to the small candidate training sets generated. One might find varying size chromosomes more useful in practice, as the candidates in this version might be able to capture more properties of the test set subject to prediction.
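A chromosome can thus be represented simply as an array of row indices into the training pool; a minimal sketch of the sampling, with sizes per the text and an assumed seed:

import numpy as np

def initial_population(n_train, pop_size=30, frac=0.02, seed=1):
    """Each chromosome is a fixed-size sample (with replacement) of row
    indices into the big cross-project training pool."""
    rng = np.random.default_rng(seed)
    td_size = max(1, int(frac * n_train))                 # 2% of the pool
    return [rng.integers(0, n_train, size=td_size)        # repeated rows allowed
            for _ in range(pop_size)]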
Selection: Tournament selection is used as the selection operator of GIS. Since the population size is small in our experiments, the tournament size was set to two.
Elites: A proportion of the population, namely those elements that provide the best fitness values, is moved directly to the next generation. We transfer the two top parents to the next generation.
Stopping Criteria: We place two limits on the number of iterations that the genetic algorithm can run. The first is the maximum number of generations allowed, which was set to 20. The reason for selecting a relatively small number of generations (20) is the small population size, which in turn was selected for run-time considerations. Despite their sizes, however, the initial populations cover 94.99% of the original training instances on average in every iteration. Further, we observed that the process converges quickly, making 20 an acceptable maximum number of generations. The other stopping criterion is the amount of benefit gained from the generated population: if the difference between the mean fitness of two consecutive populations is less than ε = 0.0001, the genetic algorithm stops. The mentioned epsilon was selected arbitrarily to be a small number. One can expect to achieve better results by tuning these parameters.
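Putting the selection, elitism and stopping rules together, one possible simplified generational loop is sketched below; unlike Algorithm 1, which combines two generations before extracting the next one, this sketch simply fills the new generation with elites and tournament offspring, and make_child is a placeholder for the crossover-plus-mutation step:

import numpy as np

def tournament(population, fit, rng):
    """Size-2 tournament: the fitter of two random members wins."""
    a, b = rng.choice(len(population), size=2, replace=False)
    return population[a] if fit[a] >= fit[b] else population[b]

def evolve(population, fitness_fn, make_child, rng, num_gens=20, n_elites=2, eps=1e-4):
    """Generational loop with elitism and the mean-fitness stopping test."""
    fit = np.array([fitness_fn(c) for c in population])
    for _ in range(num_gens):
        elite_idx = np.argsort(-fit)[:n_elites]
        nxt = [population[i] for i in elite_idx]          # elites survive as-is
        while len(nxt) < len(population):
            p1 = tournament(population, fit, rng)
            p2 = tournament(population, fit, rng)
            nxt.append(make_child(p1, p2))                # crossover + mutation
        new_fit = np.array([fitness_fn(c) for c in nxt])
        converged = abs(new_fit.mean() - fit.mean()) < eps
        population, fit = nxt, new_fit
        if converged:
            break
    return population[int(np.argmax(fit))]                # best dataset found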
Fitness Function: F1 × G is used as the fitness value of each population element. Each population element (a dataset) is evaluated on the validation set and a fitness value is assigned to it. The selection of this fitness function is not random, as both of these values (F1 and G) measure the balance between precision and recall, but in different ways.
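As a sketch, the fitness evaluation of a single chromosome might look as follows; learner is any classifier with fit/predict (the paper used Naive Bayes), and f1_g is a placeholder for an F1/G scorer such as the one sketched in Section 3.5:

def ga_fitness(chromosome, pool_X, pool_y, val_X, val_y, learner, f1_g):
    """Fit the learner on the chromosome's rows of the training pool and
    return F1 * G measured on the (NN)-Filter validation set."""
    learner.fit(pool_X[chromosome], pool_y[chromosome])
    f1, g = f1_g(learner.predict(val_X), val_y)
    return f1 * g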
Mutation: The mutation function handles potential data quality issues (e.g. noise, mislabelling, etc.). By randomly changing the class value of instances from non-defective to defective (and vice versa), the mutation operation guides the process through multiple generations toward yielding more similar datasets. This can, to some extent, account for both undiscovered bugs and contradictory labels for similar measurements in different projects (training and test data). With a probability of mProb = 0.05, a number of training set instances are mutated by flipping their labels (defective → non-defective or non-defective → defective). Note that since the datasets can contain repetitions of an element (from the initial population generation and later from the crossover operation), if an instance is mutated, all of its repetitions are also mutated. This way, we avoid conflicts between the items in the same dataset. The mutation process is described formally in Algorithm 2.
Crossover: The training datasets used in the population can be large. The time for training a learner on a large dataset and validating it on a medium size validation set grows as the sizes of the train and validation datasets grow. To avoid producing very large datasets, one point crossover was used during the crossover operation. Nevertheless, some might find two point crossover more useful. As mentioned earlier, the chromosomes are fixed size lists of instances selected from the large training set randomly and with replacement. Hence, one point crossover does not increase the size of the datasets when combining them. Since the chromosomes possibly contain repetitions of an item and the mutation operation changes the label of an instance, conflicts might occur in the chromosomes generated from combining the two selected parents. In case of conflicts, majority voting is used to select the label of such instances. Algorithm 3 provides the pseudo-code for the crossover operation.

Algorithm 2 Mutation
1:  Input → DS: a dataset
2:  Output → a dataset with possibly mutated items
3:  set mProb = p   // mutation probability
4:  set mCount = c  // number of instances to mutate
5:  set r = random value between 0 and 1
6:  if r < mProb then
7:    for i in range(mCount) do
8:      Randomly select an instance that has not been mutated in the same round
9:      Find all repeats of the same item and flip their labels
10:   end for
11: end if

Algorithm 3 One point crossover
1:  Input → DS1 and DS2
2:  Output → two new datasets generated from DS1 and DS2
3:  Set nDS1 = empty dataset
4:  Set nDS2 = empty dataset
5:  Set point = random index in the range of either DS1 or DS2
6:  Shuffle DS1 and DS2
7:  for i = 1 to point do
8:    Append DS1(i) to nDS1
9:    Append DS2(i) to nDS2
10: end for
11: for i = point+1 to DS1's length do
12:   Append DS1(i) to nDS2
13:   Append DS2(i) to nDS1
14: end for
15: for each unique instance in nDS1 and nDS2 do
16:   Use majority voting to decide the label of the instance and its repetitions
17: end for
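For readers who prefer executable code over pseudo-code, a compact Python rendering of the core of Algorithms 2 and 3 follows; m_count and the seeds are illustrative assumptions, labels is assumed to be a dict keyed by pool index (so flipping one entry flips every repeat of that instance), and the majority-voting conflict resolution of Algorithm 3 is omitted for brevity:

import random

def mutate(chromosome, labels, m_prob=0.05, m_count=5, rng=random.Random(2)):
    """Algorithm 2: with probability m_prob, flip the labels of m_count
    distinct instances; repeats share a pool index, so they flip together."""
    if rng.random() < m_prob:
        pool = sorted(set(chromosome))
        for idx in rng.sample(pool, min(m_count, len(pool))):
            labels[idx] = 1 - labels[idx]
    return labels

def one_point_crossover(ds1, ds2, rng=random.Random(3)):
    """Algorithm 3: shuffle both parents, then swap the tails at a random
    cut point; children keep the parents' fixed size."""
    rng.shuffle(ds1)
    rng.shuffle(ds2)
    point = rng.randrange(1, len(ds1))
    return ds1[:point] + ds2[point:], ds2[:point] + ds1[point:]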
3.3. Benchmark Methods
To gain better insight into the performance achieved by GIS, it is compared with the following benchmarks:
(NN)-Filter (CPDP): In this approach, the most relevant training instances are selected based on a distance measure [4]. In this case, we used the 10 nearest neighbours and Euclidean distance. This value is similar to that utilized by multiple previous studies [4, 14, 18, 45]. k = 10 was the value of choice in the study by Turhan et al. [4], which proposed the NN-Filter. The selection of k as 10 was followed by later studies such as He et al. [45] and Chen et al. [14], both of which focus on CPDP. Another CPDP study, by Herbold [18], used different values of k ∈ {3, 5, 10, 15, 20, 25, 30} and observed the best results for larger k values. The simplicity of the method and the comprehensive number of studies that have tested the approach are the reasons for choosing this method as a benchmark [4, 17, 22]. Also, GIS uses the (NN)-Filter to select the validation dataset, and a benchmark is required to measure the performance difference between the (NN)-Filter and GIS.
Naive (CPDP): In this approach, the whole training set is fed into the learner and the model is trained with all the training data points. This method has also been tested in many studies and provides a baseline for the comparisons [4, 17, 22]. The approach is simple and at the same time demonstrates that while the availability of large pools of data can be useful, not all the data items are.
10-Fold cross validation (WPDP): In this benchmark, we perform stratified cross validation on the test set. Many studies have reported the good, or at least better, performance of this approach compared with that of cross project methods [4]. Outperforming and improving WPDP is the main goal of many such studies. We refer to this benchmark as CV throughout our analysis.
Previous Releases (WPDP): Previous releases of the same project are used to train the prediction model. Similar to 10-fold cross validation, a good performance of this approach is expected in comparison with that of cross project methods, as these older releases are more similar to the test set than datasets from other projects are. More importantly, there is a higher possibility of finding even identical classes in the old and new releases of a project. Previous releases are another target of CPDP studies, as acquiring such data is still difficult in some cases. Note that the first release of each project does not have a previous release and therefore no prediction could be performed for it in this category. We use the 10 fold cross validation result for the first release of each project in order to make the comparisons easier. We denote this benchmark by PR in the following.

Table 2: Utilized datasets and their properties

Dataset      #Classes  #DP  DP%   #LOC     Dataset       #Classes  #DP  DP%   #LOC
ant-1.3      125       20   16    37699    lucene-2.0    195       91   46.7  50596
ant-1.4      178       40   22.5  54195    lucene-2.2    247       144  58.3  63571
ant-1.5      293       32   10.9  87047    lucene-2.4    340       203  59.7  102859
ant-1.6      351       92   26.2  113246   poi-1.5       237       141  59.5  55428
ant-1.7      745       166  22.3  208653   poi-2.0       314       37   11.8  93171
camel-1.0    339       13   3.8   33721    poi-2.5       385       248  64.4  119731
camel-1.2    608       216  35.5  66302    poi-3.0       442       281  63.6  129327
camel-1.4    872       145  16.6  98080    synapse-1.0   157       16   10.2  28806
camel-1.6    965       188  19.5  113055   synapse-1.1   222       60   27    42302
ivy-1.1      111       63   56.8  27292    synapse-1.2   256       86   33.6  53500
ivy-1.4      241       16   6.6   59286    velocity-1.4  196       147  75    51713
ivy-2.0      352       40   11.4  87769    velocity-1.5  214       142  66.4  53141
jedit-3.2    272       90   33.1  128883   velocity-1.6  229       78   34.1  57012
jedit-4.0    306       75   24.5  144803   xalan-2.4     723       110  15.2  225088
jedit-4.1    312       79   25.3  153087   xalan-2.5     803       387  48.2  304860
jedit-4.2    367       48   13.1  170683   xalan-2.6     885       411  46.4  411737
jedit-4.3    492       11   2.2   202363   xalan-2.7     909       898  98.8  428555
log4j-1.0    135       34   25.2  21549    xerces-1.2    440       71   16.1  159254
log4j-1.1    109       37   33.9  19938    xerces-1.3    453       69   15.2  167095
log4j-1.2    205       189  92.2  38191    xerces-1.4    588       437  74.3  141180
                                           xerces-init   162       77   47.5  90718
Feature Selection: Each of the aforementioned benchmarks is trained and tested using three different sets of features. CK+LOC, used in our original study [35], as well as the whole set of features in the datasets, which consists of OO+SCM+LOC, are considered for all benchmarks. Beside these feature sets, a portion of the features, ranked based on their respective information gain, is used to prepare another set of benchmarks. We used the iterative InfoGain subsetting method to select the appropriate features for each benchmark.
The first two benchmarks (CPDP) are used to answer RQ1 and the latter two are utilized to answer RQ2. The results of the different versions of GIS are used to answer the last research question, i.e. RQ3. Each experiment is repeated 30 times to address the randomness introduced by CV and GIS.
3.4. Datasets and Metrics
We used 41 releases of 11 projects from the PROMISE repository for our experiments. These projects are open source and all of them have multiple versions. Due to the inclusion of the multi version WPDP benchmark, we skipped the use of datasets with a single version. The datasets were collected by Jureczko, Madeyski and Spinellis [31, 32]. The list of the datasets is presented in Table 2 with the corresponding size and defect information.
Table 3: List of the metrics used in this study

ID  Variable  Description
1   WMC       Weighted Methods per Class
2   DIT       Depth of Inheritance Tree
3   NOC       Number of Children
4   CBO       Coupling between Object classes
5   RFC       Response for a Class
6   LCOM      Lack of Cohesion in Methods
7   CA        Afferent Couplings
8   CE        Efferent Couplings
9   NPM       Number of Public Methods
10  LCOM3     Normalized version of LCOM
11  LOC       Lines Of Code
12  DAM       Data Access Metric
13  MOA       Measure Of Aggregation
14  MFA       Measure of Functional Abstraction
15  CAM       Cohesion Among Methods
16  IC        Inheritance Coupling
17  CBM       Coupling Between Methods
18  AMC       Average Method Complexity
19  MAX_CC    Maximum cyclomatic complexity
20  AVG_CC    Mean cyclomatic complexity
The reason for using these datasets is driven by our goal to account for noise in the data, which is a threat specified by the donors of these datasets. Each dataset contains a number of instances corresponding to the classes in the release. Originally, each instance has 20 static code metrics, listed in Table 3. Three scenarios were considered for selecting the metric suites. In the first scenario, we used the CK+LOC portion of the metrics as the basis of our experiments. CK+LOC is used and validated in previous CPDP studies [2, 46], and Hall et al. [1] have addressed the usefulness of these metrics in comparison with static code metrics. In the second scenario, the full set of metrics was considered for our experiments, and finally, in the last scenario, we used a very simple and fast feature selection approach based on the ranking of the features according to their information gain. The selection of the metrics in our original study was skipped, as its primary focus was only on the instance selection problem, and using a reduced set that had been tried in other studies allowed us to demonstrate the feasibility of our approach as a proof of concept. While this paper includes the same feature set, it also involves the feature selection concept to some extent, and detailed analyses are presented accordingly.
3.5. Performance Measures and Tools
Naïve Bayes (NB) is used as the base learner in all experiments. NB is a member of the family of probabilistic classifiers based on applying Bayes' theorem with strong (naïve) independence assumptions between the features [47]. The good performance of NB has been shown in many studies. Menzies et al. [3, 48] and Lessmann et al. [49] have demonstrated the effectiveness of NB with a set of data mining experiments performed on the NASA MDP datasets. Lessmann et al. compared the most common classifiers on the NASA datasets and concluded that there is no significant difference between the performances of the top 15 classifiers, one of which is NB [49].
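For illustration only, a present-day equivalent of training such a model could look like this sketch, with scikit-learn's GaussianNB standing in for the WEKA Naive Bayes actually used in the experiments; the random data is a placeholder for the PROMISE metrics:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(4)
X_train = rng.random((200, 20))      # 20 static code metrics (Table 3)
y_train = rng.integers(0, 2, 200)    # defective / non-defective labels
X_test = rng.random((50, 20))

model = GaussianNB().fit(X_train, y_train)
predictions = model.predict(X_test)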
To assess the performance of the models, four indicators are used: Precision, Recall, F1 and G. These indicators are calculated by comparing the outcome of the prediction model with the actual labels of the data instances. To that end, the confusion matrix is created using the following values:

TN: The number of correct predictions that instances are defect free.
FN: The number of incorrect predictions that instances are defect free.
TP: The number of correct predictions that instances are defective.
FP: The number of incorrect predictions that instances are defective.

Using the confusion matrix, the mentioned indicators are calculated as follows:

Precision: The proportion of the predicted positive cases that were correct:

Precision = \frac{TP}{TP + FP}    (3)

Recall: The proportion of positive cases that were correctly identified:

Recall = \frac{TP}{TP + FN}    (4)

F1: To capture the trade-off between precision and recall, F1 (F-Measure) is calculated from the values of recall and precision. The most common version of this measure is the F1-score, which is the harmonic mean of precision and recall. This measure is approximately the average of the two when they are close, and is more generally the square of the geometric mean divided by the arithmetic mean. We denote the F1-measure by F1 in the following.

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}    (5)

G: While F1 is the harmonic mean of Recall and Precision, G (GMean) is their geometric mean.

G = \sqrt{Precision \times Recall}    (6)

In this study, F1 and G are selected as the principal measures for reporting our results and performing comparisons in order to detect the best approach(es). F1 and G are also used as parts of the fitness function in GIS, as discussed earlier. Finally, F1 is used in the context of iterative infogain subsetting to select the best set of features according to their information gain.
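The four indicators can be computed directly from the confusion-matrix counts; a small self-contained sketch (the zero-denominator guards are our addition):

def confusion(pred, actual):
    """Counts for the four cells defined above (defective = 1)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(pred, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(pred, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(pred, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(pred, actual))
    return tp, fp, fn, tn

def scores(pred, actual):
    """Precision, Recall, F1 and G per Eqs. (3)-(6)."""
    tp, fp, fn, _ = confusion(pred, actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    g = (precision * recall) ** 0.5
    return precision, recall, f1, g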
All the experiments are conducted using the WEKA machine learning library, version 3.6.13 (1). The statistical tests are carried out using the scipy.stats library version 0.16.1 (2), Python version 3.4.4 (3) and the statistics library from Python. The violin plots and CD diagrams are generated using the matplotlib library version 1.5.3 (4) and the evaluation package from the Orange library version 3.3.8 (5), respectively. A replication package for GIS is available online (6).
4. Results
Tables 5, 6, 7 and 8 provide the median F1 and G values from the experiments performed for the CPDP and WPDP benchmarks, respectively. In these tables, the reported results are without variation for the (NN)-Filter and Naive CPDP methods as well as PR, since there is no randomness involved in their settings. For the other benchmarks, the experiments are repeated 30 times to account for the randomness in the design of their experiments. The results of within and cross project predictions are presented separately to evaluate the differences in both cases and to answer the corresponding research questions properly. The results of GIS are duplicated in the cross and within project tables to make the comparisons easier. In both sets of tables, the last two rows present the median and mean values of all predictions.
These results are depicted through diagrams and plots in Figures 4 and 13. The rankings in the first figure are based on the median and the critical difference scheme. The third figure provides per dataset results for GIS, CPDP and WPDP. These plots are described in the following.
1 http://www.cs.waikato.ac.nz/ml/weka/
2 https://www.scipy.org/
3 http://www.python.org
4 http://matplotlib.org/
5 http://orange.biolab.si/
6 https://doi.org/10.5281/zenodo.804413
[Table 4: Violin plots and CD diagrams for F1 and G. Panels: (a) F1 violin plots, (b) G violin plots, (c) F1 CD diagram, (d) G CD diagram, each covering the GIS, NN-Filter, Naive, CV and PR approaches with the all, ckloc and IG feature sets. Plots not reproducible in text form.]
Table 5: F1: GIS vs Cross Project Benchmarks

             GIS(C)                NN-Filter(C)          Naive(C)
file         All   ckloc IG        All   ckloc IG        All   ckloc IG
ant-1.3      0.292 0.326 0.314     0.372 0.444 0.488     0.294 0.250 0.375
ant-1.4      0.370 0.355 0.361     0.219 0.185 0.274     0.190 0.129 0.197
ant-1.5      0.211 0.245 0.251     0.313 0.444 0.338     0.338 0.418 0.423
ant-1.6      0.442 0.465 0.513     0.410 0.426 0.450     0.408 0.420 0.444
ant-1.7      0.361 0.395 0.416     0.497 0.424 0.485     0.465 0.437 0.504
camel-1.0    0.102 0.078 0.089     0.188 0.238 0.128     0.333 0.333 0.194
camel-1.2    0.491 0.525 0.483     0.271 0.238 0.240     0.192 0.170 0.167
camel-1.4    0.302 0.303 0.320     0.281 0.239 0.269     0.204 0.201 0.246
camel-1.6    0.329 0.342 0.329     0.219 0.214 0.226     0.196 0.235 0.212
ivy-1.1      0.664 0.703 0.589     0.375 0.222 0.222     0.274 0.225 0.243
ivy-1.4      0.150 0.157 0.173     0.318 0.129 0.182     0.300 0.273 0.292
ivy-2.0      0.277 0.300 0.338     0.364 0.434 0.412     0.391 0.391 0.421
jedit-3.2    0.543 0.575 0.587     0.336 0.226 0.303     0.486 0.359 0.424
jedit-4.0    0.423 0.451 0.463     0.422 0.302 0.368     0.468 0.468 0.500
jedit-4.1    0.486 0.486 0.520     0.493 0.414 0.403     0.601 0.523 0.567
jedit-4.2    0.300 0.259 0.342     0.443 0.378 0.420     0.460 0.473 0.481
jedit-4.3    0.043 0.034 0.047     0.096 0.164 0.152     0.079 0.119 0.108
log4j-1.0    0.447 0.523 0.442     0.519 0.391 0.348     0.256 0.162 0.111
log4j-1.1    0.575 0.619 0.579     0.576 0.500 0.462     0.233 0.053 0.150
log4j-1.2    0.730 0.668 0.781     0.217 0.138 0.165     0.119 0.071 0.071
lucene-2.0   0.633 0.640 0.609     0.446 0.324 0.383     0.175 0.175 0.198
lucene-2.2   0.643 0.680 0.612     0.282 0.226 0.235     0.185 0.127 0.127
lucene-2.4   0.691 0.714 0.668     0.358 0.217 0.252     0.280 0.211 0.194
poi-1.5      0.681 0.706 0.741     0.314 0.210 0.210     0.284 0.200 0.222
poi-2.0      0.215 0.216 0.219     0.267 0.197 0.230     0.234 0.215 0.257
poi-2.5      0.768 0.761 0.796     0.233 0.165 0.179     0.262 0.176 0.246
poi-3.0      0.766 0.803 0.786     0.263 0.185 0.196     0.269 0.194 0.287
synapse-1.0  0.196 0.220 0.223     0.421 0.311 0.410     0.333 0.276 0.320
synapse-1.1  0.415 0.457 0.455     0.463 0.311 0.442     0.370 0.237 0.240
synapse-1.2  0.520 0.527 0.537     0.560 0.310 0.431     0.431 0.262 0.273
velocity-1.4 0.564 0.642 0.724     0.188 0.088 0.132     0.120 0.099 0.133
velocity-1.5 0.628 0.583 0.712     0.228 0.116 0.185     0.164 0.116 0.198
velocity-1.6 0.506 0.521 0.558     0.291 0.205 0.317     0.250 0.237 0.283
xalan-2.4    0.304 0.287 0.311     0.390 0.317 0.344     0.367 0.327 0.400
xalan-2.5    0.569 0.583 0.577     0.373 0.301 0.301     0.395 0.281 0.294
xalan-2.6    0.514 0.567 0.589     0.511 0.404 0.413     0.490 0.375 0.382
xalan-2.7    0.798 0.831 0.763     0.402 0.248 0.251     0.416 0.255 0.261
xerces-1.2   0.234 0.256 0.239     0.240 0.171 0.200     0.244 0.200 0.240
xerces-1.3   0.379 0.329 0.327     0.331 0.291 0.288     0.331 0.327 0.295
xerces-1.4   0.646 0.638 0.710     0.310 0.189 0.198     0.250 0.171 0.184
xerces-init  0.408 0.433 0.516     0.318 0.258 0.277     0.318 0.295 0.303
Median       0.457 0.486 0.498     0.331 0.239 0.277     0.284 0.237 0.257
Mean         0.453 0.467 0.478     0.344 0.273 0.298     0.304 0.255 0.280
Table 6: F1: GIS vs Within Project Benchmarks

             GIS(C)                CV(W)                 PR(W)
file         All   ckloc IG        All   ckloc IG        All   ckloc IG
ant-1.3      0.292 0.326 0.314     0.427 0.303 0.441     0.427 0.303 0.441
ant-1.4      0.370 0.355 0.361     0.400 0.394 0.444     0.308 0.154 0.278
ant-1.5      0.211 0.245 0.251     0.370 0.448 0.507     0.429 0.430 0.500
ant-1.6      0.442 0.465 0.513     0.576 0.431 0.586     0.601 0.514 0.477
ant-1.7      0.361 0.395 0.416     0.556 0.497 0.498     0.531 0.438 0.518
camel-1.0    0.102 0.078 0.089     0.300 0.286 0.118     0.300 0.286 0.118
camel-1.2    0.491 0.525 0.483     0.322 0.288 0.205     0.208 0.178 0.053
camel-1.4    0.302 0.303 0.320     0.265 0.245 0.264     0.300 0.304 0.288
camel-1.6    0.329 0.342 0.329     0.312 0.261 0.204     0.306 0.287 0.259
ivy-1.1      0.664 0.703 0.589     0.574 0.449 0.538     0.574 0.449 0.538
ivy-1.4      0.150 0.157 0.173     0.176 0.080 0.000     0.267 0.250 0.278
ivy-2.0      0.277 0.300 0.338     0.389 0.425 0.425     0.375 0.380 0.424
jedit-3.2    0.543 0.575 0.587     0.572 0.467 0.462     0.572 0.467 0.462
jedit-4.0    0.423 0.451 0.463     0.421 0.294 0.237     0.517 0.394 0.481
jedit-4.1    0.486 0.486 0.520     0.500 0.361 0.398     0.526 0.400 0.323
jedit-4.2    0.300 0.259 0.342     0.432 0.320 0.400     0.475 0.405 0.465
jedit-4.3    0.043 0.034 0.047     0.211 0.214 0.077     0.102 0.164 0.233
log4j-1.0    0.447 0.523 0.442     0.632 0.607 0.584     0.632 0.607 0.584
log4j-1.1    0.575 0.619 0.579     0.725 0.687 0.697     0.708 0.698 0.667
log4j-1.2    0.730 0.668 0.781     0.657 0.578 0.686     0.453 0.417 0.474
lucene-2.0   0.633 0.640 0.609     0.553 0.507 0.556     0.553 0.507 0.556
lucene-2.2   0.643 0.680 0.612     0.487 0.423 0.451     0.500 0.452 0.426
lucene-2.4   0.691 0.714 0.668     0.529 0.466 0.526     0.525 0.435 0.525
poi-1.5      0.681 0.706 0.741     0.454 0.409 0.592     0.454 0.409 0.592
poi-2.0      0.215 0.216 0.219     0.207 0.218 0.162     0.288 0.269 0.288
poi-2.5      0.768 0.761 0.796     0.578 0.281 0.812     0.256 0.208 0.238
poi-3.0      0.766 0.803 0.786     0.477 0.369 0.688     0.294 0.264 0.338
synapse-1.0  0.196 0.220 0.223     0.385 0.296 0.432     0.385 0.296 0.432
synapse-1.1  0.415 0.457 0.455     0.527 0.433 0.461     0.500 0.427 0.469
synapse-1.2  0.520 0.527 0.537     0.577 0.505 0.530     0.510 0.397 0.489
velocity-1.4 0.564 0.642 0.724     0.892 0.831 0.880     0.892 0.831 0.880
velocity-1.5 0.628 0.583 0.712     0.433 0.311 0.494     0.752 0.756 0.765
velocity-1.6 0.506 0.521 0.558     0.360 0.305 0.410     0.526 0.505 0.530
xalan-2.4    0.304 0.287 0.311     0.363 0.299 0.354     0.363 0.299 0.354
xalan-2.5    0.569 0.583 0.577     0.377 0.302 0.555     0.306 0.297 0.301
xalan-2.6    0.514 0.567 0.589     0.598 0.535 0.575     0.428 0.407 0.407
xalan-2.7    0.798 0.831 0.763     0.913 0.820 0.930     0.349 0.260 0.285
xerces-1.2   0.234 0.256 0.239     0.231 0.175 0.162     0.231 0.175 0.162
xerces-1.3   0.379 0.329 0.327     0.372 0.274 0.466     0.291 0.265 0.247
xerces-1.4   0.646 0.638 0.710     0.718 0.645 0.707     0.254 0.189 0.000
xerces-init  0.408 0.433 0.516     0.333 0.327 0.330     0.349 0.312 0.362
Median       0.457 0.486 0.498     0.450 0.373 0.472     0.429 0.394 0.424
Mean         0.453 0.467 0.478     0.468 0.399 0.460     0.430 0.377 0.402

To measure the performance difference across the benchmarks, two different approaches were considered. First, the performance of GIS in comparison with the other benchmarks was assessed through Wilcoxon signed rank tests. Tables 9 and 10 summarize the results of the pairwise statistical tests based on F1 and G values, respectively, for all 30 runs. The first column of each entry in these tables is the p-value obtained from the tests and the second column is the Cohen's d value associated with the performance obtained from the two treatments subject to comparison. The following equation is used to calculate Cohen's d:

d = \frac{\bar{X}_{gis} - \bar{X}_{b}}{std_p}    (7)
Here, \bar{X}_{b} and \bar{X}_{gis} are the means of the benchmark and GIS groups, respectively. Hence, a positive Cohen's d means that GIS yields better results than the compared counterpart. std_p represents the pooled standard deviation, which can be calculated as follows:

std_p = \sqrt{\frac{(n_{gis} - 1)(s_{gis})^2 + (n_b - 1)(s_b)^2}{n_{gis} + n_b - 2}}    (8)

where s_{gis}, n_{gis}, s_{b} and n_{b} are the standard deviation of the GIS measurements, the number of subjects in the GIS group, the standard deviation of the benchmark group and the number of subjects in the benchmark group, respectively. Cohen's d is a way of representing the standardized difference between two means.

Table 7: G: GIS vs Cross Project Benchmarks

             GIS(C)                NN-Filter(C)          Naive(C)
file         All   ckloc IG        All   ckloc IG        All   ckloc IG
ant-1.3      0.381 0.434 0.418     0.373 0.447 0.488     0.299 0.258 0.387
ant-1.4      0.446 0.397 0.410     0.220 0.190 0.275     0.198 0.135 0.207
ant-1.5      0.314 0.353 0.359     0.322 0.447 0.343     0.340 0.418 0.425
ant-1.6      0.507 0.528 0.558     0.417 0.447 0.461     0.422 0.438 0.463
ant-1.7      0.450 0.486 0.501     0.502 0.449 0.504     0.477 0.454 0.523
camel-1.0    0.208 0.193 0.201     0.233 0.258 0.143     0.347 0.347 0.196
camel-1.2    0.520 0.580 0.512     0.299 0.299 0.292     0.256 0.257 0.242
camel-1.4    0.393 0.416 0.388     0.285 0.266 0.282     0.212 0.226 0.266
camel-1.6    0.402 0.445 0.388     0.226 0.246 0.249     0.207 0.267 0.234
ivy-1.1      0.664 0.705 0.604     0.458 0.336 0.336     0.398 0.356 0.342
ivy-1.4      0.247 0.270 0.271     0.331 0.129 0.182     0.306 0.283 0.309
ivy-2.0      0.366 0.391 0.419     0.371 0.434 0.419     0.395 0.395 0.426
jedit-3.2    0.563 0.612 0.603     0.374 0.274 0.352     0.502 0.393 0.455
jedit-4.0    0.467 0.502 0.515     0.428 0.332 0.388     0.469 0.478 0.511
jedit-4.1    0.520 0.529 0.561     0.501 0.444 0.427     0.602 0.536 0.576
jedit-4.2    0.389 0.364 0.435     0.453 0.379 0.420     0.484 0.477 0.484
jedit-4.3    0.114 0.098 0.129     0.156 0.213 0.203     0.141 0.176 0.166
log4j-1.0    0.494 0.565 0.517     0.537 0.446 0.396     0.383 0.297 0.243
log4j-1.1    0.581 0.635 0.615     0.596 0.552 0.509     0.336 0.164 0.285
log4j-1.2    0.746 0.697 0.790     0.349 0.272 0.300     0.252 0.192 0.192
lucene-2.0   0.638 0.654 0.610     0.517 0.422 0.471     0.272 0.272 0.331
lucene-2.2   0.643 0.681 0.614     0.363 0.323 0.327     0.295 0.231 0.231
lucene-2.4   0.693 0.718 0.669     0.439 0.338 0.356     0.377 0.344 0.315
poi-1.5      0.681 0.709 0.748     0.408 0.312 0.312     0.382 0.309 0.331
poi-2.0      0.287 0.318 0.327     0.267 0.201 0.235     0.234 0.217 0.258
poi-2.5      0.769 0.762 0.803     0.325 0.267 0.285     0.350 0.265 0.341
poi-3.0      0.767 0.809 0.794     0.361 0.301 0.308     0.365 0.300 0.390
synapse-1.0  0.292 0.337 0.343     0.469 0.325 0.417     0.334 0.277 0.333
synapse-1.1  0.461 0.521 0.514     0.466 0.330 0.458     0.388 0.290 0.300
synapse-1.2  0.554 0.574 0.577     0.566 0.354 0.455     0.455 0.329 0.330
velocity-1.4 0.575 0.645 0.725     0.275 0.160 0.203     0.189 0.176 0.214
velocity-1.5 0.635 0.602 0.712     0.319 0.209 0.281     0.265 0.209 0.300
velocity-1.6 0.511 0.538 0.569     0.340 0.322 0.378     0.320 0.322 0.346
xalan-2.4    0.392 0.385 0.402     0.397 0.322 0.347     0.383 0.327 0.400
xalan-2.5    0.575 0.590 0.583     0.399 0.360 0.360     0.414 0.331 0.344
xalan-2.6    0.522 0.580 0.596     0.540 0.478 0.486     0.509 0.440 0.445
xalan-2.7    0.813 0.842 0.785     0.502 0.376 0.379     0.513 0.382 0.388
xerces-1.2   0.249 0.278 0.287     0.242 0.183 0.201     0.247 0.209 0.242
xerces-1.3   0.428 0.356 0.400     0.334 0.310 0.290     0.334 0.338 0.298
xerces-1.4   0.665 0.652 0.716     0.418 0.308 0.310     0.371 0.303 0.301
xerces-init  0.409 0.436 0.519     0.354 0.342 0.326     0.354 0.376 0.364
Median       0.497 0.531 0.537     0.373 0.323 0.343     0.350 0.303 0.331
Mean         0.496 0.515 0.523     0.384 0.327 0.345     0.351 0.312 0.335
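A direct transcription of Eqs. (7) and (8) into Python, for reference (pure Python, no dependencies):

import math

def cohens_d(gis_scores, bench_scores):
    """Cohen's d with the pooled standard deviation of Eqs. (7)-(8);
    positive values favour GIS."""
    n_g, n_b = len(gis_scores), len(bench_scores)
    mean_g = sum(gis_scores) / n_g
    mean_b = sum(bench_scores) / n_b
    var_g = sum((x - mean_g) ** 2 for x in gis_scores) / (n_g - 1)
    var_b = sum((x - mean_b) ** 2 for x in bench_scores) / (n_b - 1)
    std_p = math.sqrt(((n_g - 1) * var_g + (n_b - 1) * var_b) / (n_g + n_b - 2))
    return (mean_g - mean_b) / std_p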