A Benchmark Study on the Effectiveness of
Search-based Data Selection and Feature Selection for Cross Project Defect Prediction
Seyedrebvar Hosseini (a), Burak Turhan (b), Mika Mäntylä (a)

(a) M3S, Faculty of Information Technology and Electrical Engineering, University of Oulu, 90014, Oulu, Finland
(b) Department of Computer Science, Brunel University London, UB8 3PH, United Kingdom
Abstract
Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, feature selection and data quality are issues to consider in CPDP.
Objective: We aim at utilizing the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, to produce validation sets for generating evolving training datasets to tackle CPDP while accounting for potential noise in defect labels. We also investigate the impact of using different feature sets.
Method: We extend our proposed approach, Genetic Instance Selection (GIS), by incorporating feature selection in its setting. We use 41 releases of 11 multi-version projects to assess the performance of GIS in comparison with benchmark CPDP approaches (NN-filter and Naive-CPDP) and within project approaches (Cross-Validation (CV) and Previous Releases (PR)). To assess the impact of feature sets, we use two sets of features, SCM+OO+LOC (all) and CK+LOC (ckloc), as well as iterative info-gain subsetting (IG) for feature selection.
Results: The GIS variant with info gain feature selection is significantly better than NN-Filter (all, ckloc, IG) in terms of F1 (p-values < 0.001, Cohen's d = {0.621, 0.845, 0.762}) and G (p-values < 0.001, Cohen's d = {0.899, 1.114, 1.056}), and better than Naive CPDP (all, ckloc, IG) in terms of F1 (p-values < 0.001, Cohen's d = {0.743, 0.865, 0.789}) and G (p-values < 0.001, Cohen's d = {1.027, 1.119, 1.050}). Overall, the performance of GIS is comparable to that of the within project defect prediction (WPDP) benchmarks, i.e. CV and PR. In terms of the multiple comparisons test, all variants of GIS belong to the top ranking group of approaches.

Email addresses: rebvar@oulu.fi (Seyedrebvar Hosseini), burak.turhan@brunel.ac.uk (Burak Turhan), mika.mantyla@oulu.fi (Mika Mäntylä)
Conclusions: We conclude that datasets obtained from search based approaches combined with feature selection techniques are a promising way to tackle CPDP. In particular, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS relies on high recall at the expense of a loss in precision. Using different optimization goals, utilizing other validation datasets and applying other feature selection techniques are possible future directions to investigate.
Keywords: Cross Project Defect Prediction, Search Based Optimization, Genetic Algorithms, Instance Selection, Training Data Selection
1. Introduction
Despite the extensive body of studies and its long history, software defect prediction is still a challenging problem in the field of software engineering [1]. Software teams mainly use software testing as their primary method of detecting and preventing defects during development. On the other hand, software testing can be very time consuming while the resources for such tasks might be limited; hence, detecting defects in an automated way can save a great deal of time and effort [2, 3, 4].
Defect data from previous versions of the same project can be used to detect defect prone units in new releases. Prediction based on historical data collected from the same project is called within project defect prediction (WPDP) and has been studied extensively [5, 6, 7, 8, 9, 10, 11]. New code and releases in the same project usually share many common characteristics, which makes them a good match for constructing prediction models. But this approach is subject to criticism, as within project data is usually not available for new projects. Moreover, this problem only tends to grow as the need for software based platforms and services increases at a rapid pace. On the other hand, there are plenty of relevant public datasets available, especially in open source repositories [12], that can act as candidates to identify and prevent bugs in the absence, and even the presence, of within project data. Using the available public datasets, one can investigate the usefulness of models created on data from other projects, especially for projects with limited or no defect data repository [3, 4, 13].
Various studies have focused on different aspects of defect prediction, including data manipulation approaches (such as re-sampling [14, 15, 16, 17], re-weighting [17, 18, 19, 20], filtering [4, 13, 21, 22], etc.), learning technique optimization (boosting [14, 15, 17, 23], bagging [23], ensembles [22, 23, 24, 25], etc.) and metrics (feature selection [23, 26, 27, 28], different sets of metrics [29]), to mention some. Predictions usually target the effectiveness of the approaches in two main categories: binary and continuous/multi-class predictions [1]. The features also come at different levels of code, including function/method, class, file and even component levels [1]. As will be discussed in later sections, the experiments in this study are class level binary predictions performed on 41 Java projects. A summary of the related studies is presented in Section 2.
The learning approach and the training data are two of the major elements in building high performance prediction models. Finding a suitable dataset (instances and features) with similar defect distribution characteristics to the test set is likely to increase the performance of prediction models [30]. Since defect detection and label assignment are based on mining version control systems [31, 32, 33], the process can be prone to errors and the data quality can be questionable [34, 35]. In other words, the labels of some of the instances might not have been identified correctly, two or more instances with the same measurements can have different labels, or undetected defects may not be captured in the dataset at the time of dataset compilation.

In this study, we address these problems with a search based instance selection approach, in which a mutation operation is designed to account for data quality. Using the genetic algorithm, we guide the instance selection process with the aim of converging to datasets that match the characteristics of the test set more precisely. The fitness function at each generation is evaluated on a validation set generated via the (NN)-Filter. With these, we handle the potential noise in the data while tackling the training data instance selection problem with GIS.
In our earlier work [35], we combined training data instance selection with search based methods to address potential defect labeling errors, and we compared our method's performance with benchmark CPDP and WPDP (cross validation) approaches. This study not only acts as a replication of our original study but also extends it in the following ways:
• Including a more comprehensive set of datasets. In the original study, only the last available dataset of each multi-version software project was used in our experiments. Overall, 13 datasets from the PROMISE repository were used to assess the performance of our proposed approach. We extend the number of datasets to 41 from 11 projects. The reason for choosing these datasets is their maturity (multiple versions) and the choice of the WPDP previous releases benchmark.
• Including an extra within project benchmark (previous releases). We used a single WPDP benchmark in our original study, namely 10 fold stratified cross validation. Since multiple releases are available for each project in this extension, we compare the performance of GIS with this WPDP benchmark as well.
• Investigating the effect of different sets of metrics and the sensitivity of GIS toward them. Chidamber & Kemerer (CK)+LOC was used in the original study, due to its reported good performance by Hall et al. [1] as well as in recent CPDP studies [25, 36]. To better understand how GIS reacts to the choice of metrics, we used all features as well as one feature selection technique, i.e. iterative infogain subsetting. We conducted these sets of experiments to address one of the threats to the validity of our original conclusions, i.e. the selection of software metrics. Interestingly, adding SCM to CK+LOC has a positive effect on the WPDP performance despite its negative effect on GIS, making it necessary to include this setting, as failing to do so would be a threat to the conclusion validity.
• Presenting more extensive analysis, related work and discussions. A wider range of datasets, benchmarks and results requires more comprehensive analysis. The results are presented through multiple types of diagrams (violin plots, critical difference (CD) diagrams, line plots) and the approaches are compared through statistical tests for pairwise and multiple comparisons, exploring different perspectives of the achieved results.
Accordingly, the aim of this study is to answer the following research questions:
RQ1: How does the performance of GIS compare with benchmark cross project defect prediction approaches?

RQ2: How does the performance of GIS compare with within project defect prediction approaches?

RQ3: How do different feature sets affect the performance of GIS?
This paper is organized as follows: The next section summarizes the related studies on CPDP and briefly describes how our study differs. The proposed approach, datasets and experimental procedures are presented in Section 3. Section 4 presents the results of our analysis and discussions. Section 5 addresses some of the concerns that arise from the analysis and wraps up our findings. Section 6 discusses the threats to the validity of our study. Finally, the last section concludes the paper with a summary of the findings as well as directions for future work.
2. Related Work
Cross project defect prediction (CPDP) has drawn a great deal of interest recently. To predict defects in projects without sufficient training data, many researchers have attempted to build novel and competitive CPDP models [4, 13, 18, 21, 31, 37, 38]. However, not all studies report good performances for CPDP [4, 16, 38].
In a series of experiments, Turhan et al. [4] observed that CPDP underperforms WPDP. They also found that despite its good probability of detection rates, CPDP causes excessive false alarms. They proposed to use the (NN)-Filter to select the most relevant training data instances based on a similarity measure. Through this method, they were able to lower the high false alarm rates dramatically, but their model performance was still lower than WPDP.
Zimmermann et al. [38] performed a large scale set of CPDP experiments by creating 622 pair-wise prediction models on 28 datasets from 12 projects (open source and commercial) and observed only 21 pairs (3.4%) that matched their performance criteria (precision, recall and accuracy all greater than 0.75). This observation suggests that the majority of predictions will probably fail if the training data is not selected carefully. They also found that CPDP is not symmetrical: data from Firefox can predict Internet Explorer defects, but the opposite does not hold. They argued that the characteristics of data and process are crucial factors for CPDP.
He et al. [13] proposed to use distributional characteristics (median, mean, variance, standard deviation, skewness, quantiles, etc.) for training dataset selection. They concluded that, in the best cases, cross project data may provide acceptable prediction results. They also state that training data from the same project does not always lead to better predictions and that carefully selected cross project data may provide better prediction results than within-project (WP) data. They also found that data distributional characteristics are informative for training data selection. They used a meta-learner, built on top of the prediction results of a decision table learner, to predict the outcome of the models before making actual predictions.
Herbold [18] proposed distance-based strategies for the selection of training data based on the distributional characteristics of the available data. They presented two strategies, based on EM (Expectation Maximization) clustering and the NN (Nearest Neighbor) algorithm with distributional characteristics as the decision strategy. They evaluated the strategies in a large case study with 44 versions of 14 software projects and observed that i) weights can be used to successfully deal with biased data and ii) training data selection provides a significant improvement in the success rate and recall of defect detection. However, their overall success rate was still too low for the practical application of CPDP.
Turhan et al. [21] evaluated the effects of mixed project data on predictions. They tested whether mixing WP and CP data improves prediction performance. They performed their experiments on 73 versions of 41 projects using the Naïve Bayes classifier. They concluded that mixed project data significantly improves the performance of the defect predictors.
Zhang et al. [39] created a universal defect prediction model from a large pool of 1,385 projects with the aim of relieving the need to build prediction models for individual projects. They approached the problem of variations in the distributions by clustering and rank transformation using the similarities among the projects. Based on their results, their model obtained prediction performance comparable to the WP models when applied to five external projects, and it performed similarly among projects with different context factors.
Ryu et al. [22] presented a Hybrid Instance Selection with the Nearest Neighbor (HISNN) method using hybrid classification to address class imbalance for CPDP. Their approach used a combination of the Nearest Neighbour algorithm and the Naïve Bayes learner to address the instance selection problem.
He et al. [26] considered CPDP from the viewpoint of metrics and features by investigating the usefulness of simplified metric sets. They used a greedy approach to filter the list of available metrics and proposed to use different sets of metrics according to the defined criteria. They observed that minimum feature subsets and TOPK metrics could provide acceptable results compared with their benchmarks. They further concluded that the minimum feature subset can improve the predictions despite an acceptable loss of precision.
Feature selection and, more specifically, feature matching was studied by Nam et al. [28]. Their proposed approach enables predictions on training and test datasets with different sets of metrics. They used statistical procedures for the feature matching process and observed that their CPDP approach outperforms WPDP in 68% of the predictions.
While the above-mentioned studies focus on the dataset, instance and feature selection problems, none of them uses a search based approach. One such approach in the defect prediction context was considered by Liu et al., who tried to come up with mathematical expressions as solutions that maximize the effectiveness of their approach [36]. They compared their approach with 17 non-evolutionary machine learning algorithms and concluded that the search-based models consistently decrease the misclassification rate compared with the non-search-based models.

Canfora et al. proposed a search based multi-objective optimization approach [40] for CPDP. Using the multi-objective genetic algorithm NSGA-II, they tried to come up with an optimal cost effectiveness model for CPDP. They concluded that their approach outperforms the single objective, trivial and local prediction approaches. Recently, Xia et al. [41] conducted a search based experiment consisting of a genetic algorithm phase and an ensemble phase. They utilized logistic regression as their base learner and small chunks of within project data in their settings. They compared their proposed approach, i.e. HYDRA, with some of the most recent CPDP approaches and observed that it outperformed the benchmarks significantly. Even though these studies use a search based approach, they are focused neither on the instance/dataset selection problem nor on the data quality problem and hence differ from our approach.
3. Research Methodology
This section describes the details of our study, starting with a discussion of our motivation. We then present the proposed approach as well as the benchmark methods, datasets and metrics, and the performance evaluation criteria used in our study.
3.1. Motivation
If selected carefully, a dataset from other projects can provide better predictive power than WP data [13], as the large pool of available CP data has the potential to cover a larger range of the feature space. This may lead to a better match between training and test datasets and, consequently, to better predictions.
One of the first attempts in this area was the idea of filtering the training dataset instances [4]. In this approach, the most similar instances from a large pool containing all the training instances from other projects are selected using the k-NN algorithm. Since these instances are closer to the test set based on a particular distance measure, they could potentially lead to better predictions. Using the distributional characteristics of the test and training datasets is another approach used in multiple studies [13, 42]. Clustering the instances is yet another approach used in other studies [18, 31, 39]. While these methods have been shown to be useful, the search based approach to selection is not considered by any of these papers. An evolving dataset, starting with initial datasets generated using one or a combination of these approaches, can be a good candidate for a search based data selection problem.
Data Quality. Another inspiration for this work is the fact that the public datasets are prone to quality issues and contain noisy data [34, 35, 43]. Since defects are discovered over time, certain defects might not have been discovered by the time the datasets were compiled; hence, some of the instances in the training set may misrepresent themselves as non-defective while defective instances with similar measurements exist in the test set. In contrast, while some test instances are not defective, the most similar items in the training set might be labeled as defective. In short, some of the instances in the test set can have similar measurements to instances in the training set, yet different class labels. Please note that mislabeling may not be the only reason for such situations; they can also occur naturally, i.e. the class labels of similar measurements can differ depending on the metric set used. Acknowledging the noise in the data and guiding the learning algorithm to account for it can lead to better predictions, as we proposed in our original paper [35] and validate in this study.
Features. We used the CK+LOC metric set originally to assess the performance of GIS and the other benchmarks. Hall et al. [1] asserted that OO (Object-Oriented) and LOC metrics have acceptable prediction power and outperform SCM (static code metrics). They also observed that adding SCM to OO and LOC is not related to better performance [1]. Moreover, the usefulness of CK+LOC was validated in multiple studies in the context of CPDP [25, 37, 40], of which [40] involves using search based approaches. We extend our work by considering not only CK+LOC but also OO+SCM+LOC, as well as feature selection.

[Figure 1: Summary of the search based training data instance selection and performance evaluation process used in this paper.]
We used the information gain concept to select the top ranked features, i.e. those which explain the highest amount of entropy in the data. The information gain method is relatively fast and the ranking process does not need any actual predictions. Please note that there certainly exist more sophisticated and powerful feature selection approaches that could be used in the context of CPDP. The use of information gain feature subsetting is due to its simplicity and speed, and it serves as a proof of concept that even a simple refinement of the data through a guided procedure can lead to practical improvements.

Infogain is defined as follows: for an attribute A and a class C, the entropy of the class before and after observing A is as follows:
H(C) = -\sum_{c \in C} p(c) \log_2 p(c)    (1)

H(C|A) = -\sum_{a \in A} p(a) \sum_{c \in C} p(c|a) \log_2 p(c|a)    (2)

The amount of entropy explained by including A reflects the additional information acquired and is called information gain. Using this approach, the features of the datasets are ranked from the highest to the lowest amount of entropy explained. We used iterative InfoGain subsetting [44] to select the appropriate set of features for our experiments. Iterative InfoGain subsetting starts by training the predictors using the top n ranked attributes for n ∈ {1, 2, ...} and continues until having j + 1 attributes instead of j no longer improves the predictions. An improvement in predictions was measured using the F1 values achieved from a 1 × 10 fold cross validation on the training dataset. The train/test splits were identical when adding the features iteratively during the feature selection operation.

Algorithm 1 Pseudo code for GIS
1:  Set numGens = the number of generations of each genetic optimization run
2:  Set popSize = the size of the population
3:  Set DATA = 41 releases from {ant, camel, ivy, jedit, log4j, lucene, poi, synapse, velocity, xalan, xerces}
4:  Set FEATURES = {CKLOC, All, IG}
5:  for FSET in FEATURES do
6:    for RELEASE in DATA do
7:      Set TEST = instances from RELEASE with metric set FSET
8:      Set TRAIN = instances from all other projects with metric set FSET
9:      tdSize = 0.02 * number of instances in TRAIN
10:     for i = 1 to 30 do
11:       Set TestParts = split TEST instances into p parts
12:       for each testPart in TestParts do
13:         Set vSet = generate a validation dataset using the (NN)-Filter method (with three distance measures)
14:         Set TrainDataSets = create popSize datasets from TRAIN with replacement, each with tdSize instances
15:         for each td in TrainDataSets do
16:           Evaluate td on vSet and add it to the initial generation
17:         end for
18:         for g in range(numGens) do
19:           Create a new generation using the defined crossover and mutation functions and the elites from the current generation
20:           Combine the two generations and extract a new generation
21:         end for
22:         Set bestDS = select the top dataset from the GA's last iteration
23:         Evaluate bestDS on testPart and append the results to the pool of results
24:       end for
25:       Calculate Precision, Recall, F1 and G from the predictions
26:     end for
27:     Report the median of all 30 experiments
28:   end for
29: end for
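As an illustration of the ranking and subsetting steps above, the following is a minimal Python sketch, not the original implementation (which used WEKA); the 10-bin discretization of continuous metrics and the evaluate_f1 callback standing in for the 1 × 10 fold cross validation are our assumptions:

import numpy as np

def entropy(labels):
    """Shannon entropy H(C) of a label vector, per Eq. (1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels, bins=10):
    """IG = H(C) - H(C|A); continuous metrics are discretized into bins."""
    binned = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    h_cond = 0.0
    for a in np.unique(binned):
        mask = binned == a
        h_cond += mask.mean() * entropy(labels[mask])   # p(a) * H(C|A=a)
    return entropy(labels) - h_cond

def iterative_subset(X, y, evaluate_f1):
    """Add features from highest- to lowest-ranked until F1 stops improving."""
    order = np.argsort([-info_gain(X[:, j], y) for j in range(X.shape[1])])
    best_f1, selected = -1.0, []
    for j in order:
        f1 = evaluate_f1(X[:, selected + [j]], y)       # e.g. 1x10-fold CV F1
        if f1 <= best_f1:
            break                                       # j+1 features no better than j
        best_f1, selected = f1, selected + [j]
    return selected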
3.2. Proposed Approach
268
Figure 1 visualizes the whole research process reported in this paper. The
269
details of the search based optimizer are not present in the figure and instead,
270
they are provided in Algorithm 1 and discussed below.
271
The process starts with splitting the test set into p parts randomly (p = 5
272
in our experiments). Partitioning the test set into smaller chunks plays an im-
273
portant role in the overall procedure. By creating smaller chunks, the process
274
of optimizing and adjusting the dataset is easier as there are less elements to
275
consider and the datasets generated could be better representatives for these
276
smaller chunks than the whole dataset. This procedure however, adds extra
277
complexity to the model and the run-time would increase consequently since
278
a search based optimizer is required for each part.
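For concreteness, the random partitioning could look like the sketch below; the seed and the use of numpy are illustrative assumptions:

import numpy as np

def split_test_set(n_test, p=5, seed=0):
    """Randomly split test-instance indices into p roughly equal chunks."""
    idx = np.random.default_rng(seed).permutation(n_test)
    return np.array_split(idx, p)

# Example: a 745-class release (e.g. ant-1.7, Table 2) yields five chunks of ~149 classes.
parts = split_test_set(745)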
Each part (without the labels) is fed into the (NN)-Filter instance selection method in order to select the most relevant instances from the training set for the purpose of reserving a validation set, on which we optimize the search process. Please note that the training set is a combination of all the instances from other projects. We used the closest three instances with multiple distance measures to account for the possible error of using a specific distance measure. The unique instances from the generated set were selected to act as the validation dataset used to guide our instance selection process. The availability of mixed data, as used in [17, 21, 41], could also potentially act as a replacement for the aforementioned similarity measures and boost the performance of our approach.
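A possible reading of this step in Python is sketched below; the paper does not name the distance measures, so euclidean, cityblock and cosine are placeholders chosen for illustration only:

import numpy as np
from scipy.spatial.distance import cdist

def nn_filter_validation(test_part, train_X, k=3,
                         metrics=("euclidean", "cityblock", "cosine")):
    """For each test instance, take the k nearest training rows under each
    distance measure; the union of unique picks forms the validation set."""
    picked = set()
    for m in metrics:
        d = cdist(test_part, train_X, metric=m)    # shape: (n_test, n_train)
        nearest = np.argsort(d, axis=1)[:, :k]     # k closest per test row
        picked.update(nearest.ravel().tolist())
    return sorted(picked)                          # indices into train_X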
The process then randomly creates an initial population containing popSize datasets (popSize = 30 in our experiments). Each population element is a dataset selected randomly and with replacement from the large pool of training set instances (see Table 1). The selected number of population members and their sizes leads to an average coverage of 94.99% (std = 0.031) of the instances in the large pool of available training instances (with multiple copies of some) in each iteration.
Each population member is then evaluated on the validation set, which was acquired via the (NN)-Filter in the previous step. Then, for numGens generations, a new population is generated and the top elements are selected to survive and move to the next generation. There is also an alternative stopping criterion for GIS (described below). These procedures are repeated 30 times to address the randomness introduced by both the dataset selection and the genetic operations. Below, the genetic operations and parameters are discussed in more detail:
Initial Population: The initial population is generated using random instance selection with replacement from a large pool of instances containing all elements from projects other than the test project. The instances might contain elements included in the validation dataset, as they are not removed from the large pool of candidate training instances due to their possible usefulness for the learning process. The selection process consumes 94.99% of the initial training data on average and eliminates a group of them with each passing generation.

Table 1: Chromosome structure

        F1    F2    ...   Fm-2   Fm-1   Fm     L
C1      6     0     ...   0      1      1      0
C2      3     0.97  ...   12     3      1      1
C3      5     0.69  ...   12.6   4      1.4    0
...     ...   ...   ...   ...    ...    ...    ...
Cn-2    3     0.98  ...   16     2      1      0
Cn-1    3     0.82  ...   8.33   1      0.67   0
Cn      16    0.73  ...   28.3   9      1.56   1
Chromosome Representation: Each chromosome contains a number of instances from a list of projects. A chromosome is a dataset sampled from the large training dataset randomly and with replacement. A typical chromosome example can be seen in Table 1. F_i represents the i-th selected feature and L represents the class label. A typical chromosome contains n instances, denoted by C_1 to C_n.

We used fixed size chromosomes in our experiments. The size of each chromosome (dataset) was set to 0.02 of the size of the large pool of training data from other projects (i.e. 2%, matching tdSize = 0.02 * |TRAIN| in Algorithm 1). The fixed size chromosome was selected to show the effectiveness of our proposed approach with respect to the small candidate training sets generated. One might find varying size chromosomes more useful in practice, as the candidates in this version might be able to capture more properties of the test set subject to prediction.
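A chromosome can thus be represented simply as an array of row indices into the training pool; a minimal sketch of the sampling, with sizes per the text and an assumed seed:

import numpy as np

def initial_population(n_train, pop_size=30, frac=0.02, seed=1):
    """Each chromosome is a fixed-size sample (with replacement) of row
    indices into the big cross-project training pool."""
    rng = np.random.default_rng(seed)
    td_size = max(1, int(frac * n_train))                 # 2% of the pool
    return [rng.integers(0, n_train, size=td_size)        # repeated rows allowed
            for _ in range(pop_size)]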
Selection: Tournament selection is used as the selection operator of GIS. Since the population size is small in our experiments, the tournament size was set to two.
Elites: A proportion of the population, namely those elements that provide the best fitness values, is moved directly to the next generation. We transfer the two top parents to the next generation.
Stopping Criteria: We place two limits on the number of iterations that the genetic algorithm can run. The first is the maximum number of generations allowed, which was set to 20. The reason for selecting a relatively small number of generations (20) is the small population size, which in turn was selected for run-time considerations. Despite their sizes, however, the initial populations cover 94.99% of the original training instances on average in every iteration. Further, we observed that the process converges quickly, making 20 an acceptable maximum number of generations. The other stopping criterion is the amount of benefit gained from the generated population: if the difference between the mean fitness of two consecutive populations is less than ε = 0.0001, the genetic algorithm stops. The mentioned epsilon was selected arbitrarily to be a small number. One can expect to achieve better results by tuning these parameters.
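Putting the selection, elitism and stopping rules together, one possible simplified generational loop is sketched below; unlike Algorithm 1, which combines two generations before extracting the next one, this sketch simply fills the new generation with elites and tournament offspring, and make_child is a placeholder for the crossover-plus-mutation step:

import numpy as np

def tournament(population, fit, rng):
    """Size-2 tournament: the fitter of two random members wins."""
    a, b = rng.choice(len(population), size=2, replace=False)
    return population[a] if fit[a] >= fit[b] else population[b]

def evolve(population, fitness_fn, make_child, rng, num_gens=20, n_elites=2, eps=1e-4):
    """Generational loop with elitism and the mean-fitness stopping test."""
    fit = np.array([fitness_fn(c) for c in population])
    for _ in range(num_gens):
        elite_idx = np.argsort(-fit)[:n_elites]
        nxt = [population[i] for i in elite_idx]          # elites survive as-is
        while len(nxt) < len(population):
            p1 = tournament(population, fit, rng)
            p2 = tournament(population, fit, rng)
            nxt.append(make_child(p1, p2))                # crossover + mutation
        new_fit = np.array([fitness_fn(c) for c in nxt])
        converged = abs(new_fit.mean() - fit.mean()) < eps
        population, fit = nxt, new_fit
        if converged:
            break
    return population[int(np.argmax(fit))]                # best dataset found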
Fitness Function: F1 × G is used as the fitness value of each population element. Each population element (a dataset) is evaluated on the validation set and a fitness value is assigned to it. The selection of this fitness function is not random, as both of these values (F1 and G) measure the balance between precision and recall, but in different ways.
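As a sketch, the fitness evaluation of a single chromosome might look as follows; learner is any classifier with fit/predict (the paper used Naive Bayes), and f1_g is a placeholder for an F1/G scorer such as the one sketched in Section 3.5:

def ga_fitness(chromosome, pool_X, pool_y, val_X, val_y, learner, f1_g):
    """Fit the learner on the chromosome's rows of the training pool and
    return F1 * G measured on the (NN)-Filter validation set."""
    learner.fit(pool_X[chromosome], pool_y[chromosome])
    f1, g = f1_g(learner.predict(val_X), val_y)
    return f1 * g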
Mutation: The mutation function handles potential data quality issues (e.g. noise, mislabelling, etc.). By randomly changing the class value of instances from non-defective to defective (and vice versa), the mutation operation guides the process through multiple generations toward yielding more similar datasets. This can, to some extent, account for both undiscovered bugs and contradictory labels for similar measurements in different projects (training and test data). With a probability of mProb = 0.05, a number of training set instances are mutated by flipping their labels (defective → non-defective or non-defective → defective). Note that since the datasets can contain repetitions of an element (from the initial population generation and later from the crossover operation), if an instance is mutated, all of its repetitions are also mutated. This way, we avoid conflicts between the items in the same dataset. The mutation process is described formally in Algorithm 2.
Crossover: The training datasets used in the population can be large. The time for training a learner on a large dataset and validating it on a medium size validation set grows as the sizes of the train and validation datasets grow. To avoid producing very large datasets, one point crossover was used during the crossover operation. Nevertheless, some might find two point crossover more useful. As mentioned earlier, the chromosomes are fixed size lists of instances selected from the large training set randomly and with replacement. Hence, one point crossover does not increase the size of the datasets when combining them. Since the chromosomes possibly contain repetitions of an item and the mutation operation changes the label of an instance, conflicts might occur in the chromosomes generated from combining the two selected parents. In case of conflicts, majority voting is used to select the label of such instances. Algorithm 3 provides the pseudo-code for the crossover operation.

Algorithm 2 Mutation
1:  Input → DS: a dataset
2:  Output → a dataset with possibly mutated items
3:  set mProb = p   // mutation probability
4:  set mCount = c  // number of instances to mutate
5:  set r = random value between 0 and 1
6:  if r < mProb then
7:    for i in range(mCount) do
8:      Randomly select an instance that has not been mutated in the same round
9:      Find all repeats of the same item and flip their labels
10:   end for
11: end if

Algorithm 3 One point crossover
1:  Input → DS1 and DS2
2:  Output → two new datasets generated from DS1 and DS2
3:  Set nDS1 = empty dataset
4:  Set nDS2 = empty dataset
5:  Set point = random index in the range of either DS1 or DS2
6:  Shuffle DS1 and DS2
7:  for i = 1 to point do
8:    Append DS1(i) to nDS1
9:    Append DS2(i) to nDS2
10: end for
11: for i = point+1 to DS1's length do
12:   Append DS1(i) to nDS2
13:   Append DS2(i) to nDS1
14: end for
15: for each unique instance in nDS1 and nDS2 do
16:   Use majority voting to decide the label of the instance and its repetitions
17: end for
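For readers who prefer executable code over pseudo-code, a compact Python rendering of the core of Algorithms 2 and 3 follows; m_count and the seeds are illustrative assumptions, labels is assumed to be a dict keyed by pool index (so flipping one entry flips every repeat of that instance), and the majority-voting conflict resolution of Algorithm 3 is omitted for brevity:

import random

def mutate(chromosome, labels, m_prob=0.05, m_count=5, rng=random.Random(2)):
    """Algorithm 2: with probability m_prob, flip the labels of m_count
    distinct instances; repeats share a pool index, so they flip together."""
    if rng.random() < m_prob:
        pool = sorted(set(chromosome))
        for idx in rng.sample(pool, min(m_count, len(pool))):
            labels[idx] = 1 - labels[idx]
    return labels

def one_point_crossover(ds1, ds2, rng=random.Random(3)):
    """Algorithm 3: shuffle both parents, then swap the tails at a random
    cut point; children keep the parents' fixed size."""
    rng.shuffle(ds1)
    rng.shuffle(ds2)
    point = rng.randrange(1, len(ds1))
    return ds1[:point] + ds2[point:], ds2[:point] + ds1[point:]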
3.3. Benchmark Methods
To gain better insight into the performance achieved by GIS, it is compared with the following benchmarks:
(NN)-Filter (CPDP): In this approach, the most relevant training instances are selected based on a distance measure [4]. In this case, we used the 10 nearest neighbours and Euclidean distance. This value is similar to that utilized by multiple previous studies [4, 14, 18, 45]. k = 10 was the value of choice in the study by Turhan et al. [4], which proposed the NN-Filter. The selection of k as 10 was followed by later studies such as He et al. [45] and Chen et al. [14], both of which focus on CPDP. Another CPDP study, by Herbold [18], used different values of k ∈ {3, 5, 10, 15, 20, 25, 30} and observed the best results for larger k values. The simplicity of the method and the comprehensive number of studies that have tested the approach are the reasons for choosing this method as a benchmark [4, 17, 22]. Also, GIS uses the (NN)-Filter to select the validation dataset, and a benchmark is required to measure the performance difference between the (NN)-Filter and GIS.
Naive (CPDP): In this approach, the whole training set is fed into the learner and the model is trained with all the training data points. This method has also been tested in many studies and provides a baseline for the comparisons [4, 17, 22]. The approach is simple and at the same time demonstrates that while the availability of large pools of data can be useful, not all the data items are.
10-Fold cross validation (WPDP): In this benchmark, we perform stratified cross validation on the test set. Many studies have reported the good, or at least better, performance of this approach compared with that of cross project methods [4]. Outperforming and improving WPDP is the main goal of many such studies. We refer to this benchmark as CV throughout our analysis.
Previous Releases (WPDP): Previous releases of the same project are used to train the prediction model. Similar to 10-fold cross validation, a good performance of this approach is expected in comparison with that of cross project methods, as these older releases are more similar to the test set than datasets from other projects are. More importantly, there is a higher possibility of finding even identical classes in the old and new releases of a project. Previous releases are another target of CPDP studies, as acquiring such data is still difficult in some cases. Note that the first release of each project does not have a previous release and therefore no prediction could be performed for it in this category. We use the 10 fold cross validation result for the first release of each project in order to make the comparisons easier. We denote this benchmark by PR in the following.

Table 2: Utilized datasets and their properties

Dataset      #Classes  #DP  DP%   #LOC     Dataset       #Classes  #DP  DP%   #LOC
ant-1.3      125       20   16    37699    lucene-2.0    195       91   46.7  50596
ant-1.4      178       40   22.5  54195    lucene-2.2    247       144  58.3  63571
ant-1.5      293       32   10.9  87047    lucene-2.4    340       203  59.7  102859
ant-1.6      351       92   26.2  113246   poi-1.5       237       141  59.5  55428
ant-1.7      745       166  22.3  208653   poi-2.0       314       37   11.8  93171
camel-1.0    339       13   3.8   33721    poi-2.5       385       248  64.4  119731
camel-1.2    608       216  35.5  66302    poi-3.0       442       281  63.6  129327
camel-1.4    872       145  16.6  98080    synapse-1.0   157       16   10.2  28806
camel-1.6    965       188  19.5  113055   synapse-1.1   222       60   27    42302
ivy-1.1      111       63   56.8  27292    synapse-1.2   256       86   33.6  53500
ivy-1.4      241       16   6.6   59286    velocity-1.4  196       147  75    51713
ivy-2.0      352       40   11.4  87769    velocity-1.5  214       142  66.4  53141
jedit-3.2    272       90   33.1  128883   velocity-1.6  229       78   34.1  57012
jedit-4.0    306       75   24.5  144803   xalan-2.4     723       110  15.2  225088
jedit-4.1    312       79   25.3  153087   xalan-2.5     803       387  48.2  304860
jedit-4.2    367       48   13.1  170683   xalan-2.6     885       411  46.4  411737
jedit-4.3    492       11   2.2   202363   xalan-2.7     909       898  98.8  428555
log4j-1.0    135       34   25.2  21549    xerces-1.2    440       71   16.1  159254
log4j-1.1    109       37   33.9  19938    xerces-1.3    453       69   15.2  167095
log4j-1.2    205       189  92.2  38191    xerces-1.4    588       437  74.3  141180
                                           xerces-init   162       77   47.5  90718
Feature Selection: Each of the aforementioned benchmarks is trained and tested using three different sets of features. CK+LOC, used in our original study [35], as well as the whole set of features in the datasets, which consists of OO+SCM+LOC, are considered for all benchmarks. Beside these feature sets, a portion of the features, ranked based on their respective information gain, is used to prepare another set of benchmarks. We used the iterative InfoGain subsetting method to select the appropriate features for each benchmark.
The first two benchmarks (CPDP) are used to answer RQ1 and the latter two are utilized to answer RQ2. The results of the different versions of GIS are used to answer the last research question, i.e. RQ3. Each experiment is repeated 30 times to address the randomness introduced by CV and GIS.
3.4. Datasets and Metrics
We used 41 releases of 11 projects from the PROMISE repository for our experiments. These projects are open source and all of them have multiple versions. Due to the inclusion of the multi version WPDP benchmark, we skipped the use of datasets with a single version. The datasets were collected by Jureczko, Madeyski and Spinellis [31, 32]. The list of the datasets is presented in Table 2 with the corresponding size and defect information.
Table 3: List of the metrics used in this study

ID  Variable  Description
1   WMC       Weighted Methods per Class
2   DIT       Depth of Inheritance Tree
3   NOC       Number of Children
4   CBO       Coupling between Object classes
5   RFC       Response for a Class
6   LCOM      Lack of Cohesion in Methods
7   CA        Afferent Couplings
8   CE        Efferent Couplings
9   NPM       Number of Public Methods
10  LCOM3     Normalized version of LCOM
11  LOC       Lines Of Code
12  DAM       Data Access Metric
13  MOA       Measure Of Aggregation
14  MFA       Measure of Functional Abstraction
15  CAM       Cohesion Among Methods
16  IC        Inheritance Coupling
17  CBM       Coupling Between Methods
18  AMC       Average Method Complexity
19  MAX_CC    Maximum cyclomatic complexity
20  AVG_CC    Mean cyclomatic complexity
The reason for using these datasets is driven by our goal to account for noise in the data, which is a threat specified by the donors of these datasets. Each dataset contains a number of instances corresponding to the classes in the release. Originally, each instance has 20 static code metrics, listed in Table 3. Three scenarios were considered for selecting the metric suites. In the first scenario, we used the CK+LOC portion of the metrics as the basis of our experiments. CK+LOC is used and validated in previous CPDP studies [2, 46], and Hall et al. [1] have addressed the usefulness of these metrics in comparison with static code metrics. In the second scenario, the full set of metrics was considered for our experiments, and finally, in the last scenario, we used a very simple and fast feature selection approach based on the ranking of the features according to their information gain. The selection of the metrics in our original study was skipped, as its primary focus was only on the instance selection problem, and using a reduced set that had been tried in other studies allowed us to demonstrate the feasibility of our approach as a proof of concept. While this paper includes the same feature set, it also involves the feature selection concept to some extent, and detailed analyses are presented accordingly.
3.5. Performance Measures and Tools
Naïve Bayes (NB) is used as the base learner in all experiments. NB is a member of the family of probabilistic classifiers based on applying Bayes' theorem with strong (naïve) independence assumptions between the features [47]. The good performance of NB has been shown in many studies. Menzies et al. [3, 48] and Lessmann et al. [49] have demonstrated the effectiveness of NB with a set of data mining experiments performed on the NASA MDP datasets. Lessmann et al. compared the most common classifiers on the NASA datasets and concluded that there is no significant difference between the performances of the top 15 classifiers, one of which is NB [49].
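For illustration only, a present-day equivalent of training such a model could look like this sketch, with scikit-learn's GaussianNB standing in for the WEKA Naive Bayes actually used in the experiments; the random data is a placeholder for the PROMISE metrics:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(4)
X_train = rng.random((200, 20))      # 20 static code metrics (Table 3)
y_train = rng.integers(0, 2, 200)    # defective / non-defective labels
X_test = rng.random((50, 20))

model = GaussianNB().fit(X_train, y_train)
predictions = model.predict(X_test)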
To assess the performance of the models, four indicators are used: Precision, Recall, F1 and G. These indicators are calculated by comparing the outcome of the prediction model with the actual labels of the data instances. To that end, the confusion matrix is created using the following values:

TN: The number of correct predictions that instances are defect free.
FN: The number of incorrect predictions that instances are defect free.
TP: The number of correct predictions that instances are defective.
FP: The number of incorrect predictions that instances are defective.

Using the confusion matrix, the mentioned indicators are calculated as follows:

Precision: The proportion of the predicted positive cases that were correct:

Precision = \frac{TP}{TP + FP}    (3)

Recall: The proportion of positive cases that were correctly identified:

Recall = \frac{TP}{TP + FN}    (4)

F1: To capture the trade-off between precision and recall, F1 (F-Measure) is calculated from the values of recall and precision. The most common version of this measure is the F1-score, which is the harmonic mean of precision and recall. This measure is approximately the average of the two when they are close, and is more generally the square of the geometric mean divided by the arithmetic mean. We denote the F1-measure by F1 in the following.

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}    (5)

G: While F1 is the harmonic mean of Recall and Precision, G (GMean) is their geometric mean.

G = \sqrt{Precision \times Recall}    (6)

In this study, F1 and G are selected as the principal measures for reporting our results and performing comparisons in order to detect the best approach(es). F1 and G are also used as parts of the fitness function in GIS, as discussed earlier. Finally, F1 is used in the context of iterative infogain subsetting to select the best set of features according to their information gain.
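The four indicators can be computed directly from the confusion-matrix counts; a small self-contained sketch (the zero-denominator guards are our addition):

def confusion(pred, actual):
    """Counts for the four cells defined above (defective = 1)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(pred, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(pred, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(pred, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(pred, actual))
    return tp, fp, fn, tn

def scores(pred, actual):
    """Precision, Recall, F1 and G per Eqs. (3)-(6)."""
    tp, fp, fn, _ = confusion(pred, actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    g = (precision * recall) ** 0.5
    return precision, recall, f1, g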
All the experiments are conducted using the WEKA machine learning library, version 3.6.13 (1). The statistical tests are carried out using the scipy.stats library version 0.16.1 (2), Python version 3.4.4 (3) and the statistics library from Python. The violin plots and CD diagrams are generated using the matplotlib library version 1.5.3 (4) and the evaluation package from the Orange library version 3.3.8 (5), respectively. A replication package for GIS is available online (6).
4. Results
Tables 5, 6, 7 and 8 provide the median F1 and G values from the experiments performed for the CPDP and WPDP benchmarks, respectively. In these tables, the reported results are without variation for the (NN)-Filter and Naive CPDP methods as well as PR, since there is no randomness involved in their settings. For the other benchmarks, the experiments are repeated 30 times to account for the randomness in the design of their experiments. The results of within and cross project predictions are presented separately to evaluate the differences in both cases and to answer the corresponding research questions properly. The results of GIS are duplicated in the cross and within project tables to make the comparisons easier. In both sets of tables, the last two rows present the median and mean values of all predictions.
These results are depicted through diagrams and plots in Figures 4 and 13. The rankings in the first figure are based on the median and the critical difference scheme. The third figure provides per dataset results for GIS, CPDP and WPDP. These plots are described in the following.
1 http://www.cs.waikato.ac.nz/ml/weka/
2 https://www.scipy.org/
3 http://www.python.org
4 http://matplotlib.org/
5 http://orange.biolab.si/
6 https://doi.org/10.5281/zenodo.804413
[Table 4: Violin plots and CD diagrams for F1 and G. Panels: (a) F1 violin plots, (b) G violin plots, (c) F1 CD diagram, (d) G CD diagram, each covering the GIS, NN-Filter, Naive, CV and PR approaches with the all, ckloc and IG feature sets. Plots not reproducible in text form.]
Table 5: F1: GIS vs Cross Project Benchmarks

             GIS(C)                NN-Filter(C)          Naive(C)
file         All   ckloc IG        All   ckloc IG        All   ckloc IG
ant-1.3      0.292 0.326 0.314     0.372 0.444 0.488     0.294 0.250 0.375
ant-1.4      0.370 0.355 0.361     0.219 0.185 0.274     0.190 0.129 0.197
ant-1.5      0.211 0.245 0.251     0.313 0.444 0.338     0.338 0.418 0.423
ant-1.6      0.442 0.465 0.513     0.410 0.426 0.450     0.408 0.420 0.444
ant-1.7      0.361 0.395 0.416     0.497 0.424 0.485     0.465 0.437 0.504
camel-1.0    0.102 0.078 0.089     0.188 0.238 0.128     0.333 0.333 0.194
camel-1.2    0.491 0.525 0.483     0.271 0.238 0.240     0.192 0.170 0.167
camel-1.4    0.302 0.303 0.320     0.281 0.239 0.269     0.204 0.201 0.246
camel-1.6    0.329 0.342 0.329     0.219 0.214 0.226     0.196 0.235 0.212
ivy-1.1      0.664 0.703 0.589     0.375 0.222 0.222     0.274 0.225 0.243
ivy-1.4      0.150 0.157 0.173     0.318 0.129 0.182     0.300 0.273 0.292
ivy-2.0      0.277 0.300 0.338     0.364 0.434 0.412     0.391 0.391 0.421
jedit-3.2    0.543 0.575 0.587     0.336 0.226 0.303     0.486 0.359 0.424
jedit-4.0    0.423 0.451 0.463     0.422 0.302 0.368     0.468 0.468 0.500
jedit-4.1    0.486 0.486 0.520     0.493 0.414 0.403     0.601 0.523 0.567
jedit-4.2    0.300 0.259 0.342     0.443 0.378 0.420     0.460 0.473 0.481
jedit-4.3    0.043 0.034 0.047     0.096 0.164 0.152     0.079 0.119 0.108
log4j-1.0    0.447 0.523 0.442     0.519 0.391 0.348     0.256 0.162 0.111
log4j-1.1    0.575 0.619 0.579     0.576 0.500 0.462     0.233 0.053 0.150
log4j-1.2    0.730 0.668 0.781     0.217 0.138 0.165     0.119 0.071 0.071
lucene-2.0   0.633 0.640 0.609     0.446 0.324 0.383     0.175 0.175 0.198
lucene-2.2   0.643 0.680 0.612     0.282 0.226 0.235     0.185 0.127 0.127
lucene-2.4   0.691 0.714 0.668     0.358 0.217 0.252     0.280 0.211 0.194
poi-1.5      0.681 0.706 0.741     0.314 0.210 0.210     0.284 0.200 0.222
poi-2.0      0.215 0.216 0.219     0.267 0.197 0.230     0.234 0.215 0.257
poi-2.5      0.768 0.761 0.796     0.233 0.165 0.179     0.262 0.176 0.246
poi-3.0      0.766 0.803 0.786     0.263 0.185 0.196     0.269 0.194 0.287
synapse-1.0  0.196 0.220 0.223     0.421 0.311 0.410     0.333 0.276 0.320
synapse-1.1  0.415 0.457 0.455     0.463 0.311 0.442     0.370 0.237 0.240
synapse-1.2  0.520 0.527 0.537     0.560 0.310 0.431     0.431 0.262 0.273
velocity-1.4 0.564 0.642 0.724     0.188 0.088 0.132     0.120 0.099 0.133
velocity-1.5 0.628 0.583 0.712     0.228 0.116 0.185     0.164 0.116 0.198
velocity-1.6 0.506 0.521 0.558     0.291 0.205 0.317     0.250 0.237 0.283
xalan-2.4    0.304 0.287 0.311     0.390 0.317 0.344     0.367 0.327 0.400
xalan-2.5    0.569 0.583 0.577     0.373 0.301 0.301     0.395 0.281 0.294
xalan-2.6    0.514 0.567 0.589     0.511 0.404 0.413     0.490 0.375 0.382
xalan-2.7    0.798 0.831 0.763     0.402 0.248 0.251     0.416 0.255 0.261
xerces-1.2   0.234 0.256 0.239     0.240 0.171 0.200     0.244 0.200 0.240
xerces-1.3   0.379 0.329 0.327     0.331 0.291 0.288     0.331 0.327 0.295
xerces-1.4   0.646 0.638 0.710     0.310 0.189 0.198     0.250 0.171 0.184
xerces-init  0.408 0.433 0.516     0.318 0.258 0.277     0.318 0.295 0.303
Median       0.457 0.486 0.498     0.331 0.239 0.277     0.284 0.237 0.257
Mean         0.453 0.467 0.478     0.344 0.273 0.298     0.304 0.255 0.280
Table 6: F1: GIS vs Within Project Benchmarks

             GIS(C)                CV(W)                 PR(W)
file         All   ckloc IG        All   ckloc IG        All   ckloc IG
ant-1.3      0.292 0.326 0.314     0.427 0.303 0.441     0.427 0.303 0.441
ant-1.4      0.370 0.355 0.361     0.400 0.394 0.444     0.308 0.154 0.278
ant-1.5      0.211 0.245 0.251     0.370 0.448 0.507     0.429 0.430 0.500
ant-1.6      0.442 0.465 0.513     0.576 0.431 0.586     0.601 0.514 0.477
ant-1.7      0.361 0.395 0.416     0.556 0.497 0.498     0.531 0.438 0.518
camel-1.0    0.102 0.078 0.089     0.300 0.286 0.118     0.300 0.286 0.118
camel-1.2    0.491 0.525 0.483     0.322 0.288 0.205     0.208 0.178 0.053
camel-1.4    0.302 0.303 0.320     0.265 0.245 0.264     0.300 0.304 0.288
camel-1.6    0.329 0.342 0.329     0.312 0.261 0.204     0.306 0.287 0.259
ivy-1.1      0.664 0.703 0.589     0.574 0.449 0.538     0.574 0.449 0.538
ivy-1.4      0.150 0.157 0.173     0.176 0.080 0.000     0.267 0.250 0.278
ivy-2.0      0.277 0.300 0.338     0.389 0.425 0.425     0.375 0.380 0.424
jedit-3.2    0.543 0.575 0.587     0.572 0.467 0.462     0.572 0.467 0.462
jedit-4.0    0.423 0.451 0.463     0.421 0.294 0.237     0.517 0.394 0.481
jedit-4.1    0.486 0.486 0.520     0.500 0.361 0.398     0.526 0.400 0.323
jedit-4.2    0.300 0.259 0.342     0.432 0.320 0.400     0.475 0.405 0.465
jedit-4.3    0.043 0.034 0.047     0.211 0.214 0.077     0.102 0.164 0.233
log4j-1.0    0.447 0.523 0.442     0.632 0.607 0.584     0.632 0.607 0.584
log4j-1.1    0.575 0.619 0.579     0.725 0.687 0.697     0.708 0.698 0.667
log4j-1.2    0.730 0.668 0.781     0.657 0.578 0.686     0.453 0.417 0.474
lucene-2.0   0.633 0.640 0.609     0.553 0.507 0.556     0.553 0.507 0.556
lucene-2.2   0.643 0.680 0.612     0.487 0.423 0.451     0.500 0.452 0.426
lucene-2.4   0.691 0.714 0.668     0.529 0.466 0.526     0.525 0.435 0.525
poi-1.5      0.681 0.706 0.741     0.454 0.409 0.592     0.454 0.409 0.592
poi-2.0      0.215 0.216 0.219     0.207 0.218 0.162     0.288 0.269 0.288
poi-2.5      0.768 0.761 0.796     0.578 0.281 0.812     0.256 0.208 0.238
poi-3.0      0.766 0.803 0.786     0.477 0.369 0.688     0.294 0.264 0.338
synapse-1.0  0.196 0.220 0.223     0.385 0.296 0.432     0.385 0.296 0.432
synapse-1.1  0.415 0.457 0.455     0.527 0.433 0.461     0.500 0.427 0.469
synapse-1.2  0.520 0.527 0.537     0.577 0.505 0.530     0.510 0.397 0.489
velocity-1.4 0.564 0.642 0.724     0.892 0.831 0.880     0.892 0.831 0.880
velocity-1.5 0.628 0.583 0.712     0.433 0.311 0.494     0.752 0.756 0.765
velocity-1.6 0.506 0.521 0.558     0.360 0.305 0.410     0.526 0.505 0.530
xalan-2.4    0.304 0.287 0.311     0.363 0.299 0.354     0.363 0.299 0.354
xalan-2.5    0.569 0.583 0.577     0.377 0.302 0.555     0.306 0.297 0.301
xalan-2.6    0.514 0.567 0.589     0.598 0.535 0.575     0.428 0.407 0.407
xalan-2.7    0.798 0.831 0.763     0.913 0.820 0.930     0.349 0.260 0.285
xerces-1.2   0.234 0.256 0.239     0.231 0.175 0.162     0.231 0.175 0.162
xerces-1.3   0.379 0.329 0.327     0.372 0.274 0.466     0.291 0.265 0.247
xerces-1.4   0.646 0.638 0.710     0.718 0.645 0.707     0.254 0.189 0.000
xerces-init  0.408 0.433 0.516     0.333 0.327 0.330     0.349 0.312 0.362
Median       0.457 0.486 0.498     0.450 0.373 0.472     0.429 0.394 0.424
Mean         0.453 0.467 0.478     0.468 0.399 0.460     0.430 0.377 0.402

To measure the performance difference across the benchmarks, two different approaches were considered. First, the performance of GIS in comparison with the other benchmarks was assessed through Wilcoxon signed rank tests. Tables 9 and 10 summarize the results of the pairwise statistical tests based on F1 and G values, respectively, for all 30 runs. The first column of each entry in these tables is the p-value obtained from the tests and the second column is the Cohen's d value associated with the performance obtained from the two treatments subject to comparison. The following equation is used to calculate Cohen's d:

d = \frac{\bar{X}_{gis} - \bar{X}_{b}}{std_p}    (7)
Here, \bar{X}_{b} and \bar{X}_{gis} are the means of the benchmark and GIS groups, respectively. Hence, a positive Cohen's d means that GIS yields better results than the compared counterpart. std_p represents the pooled standard deviation, which can be calculated as follows:

std_p = \sqrt{\frac{(n_{gis} - 1)(s_{gis})^2 + (n_b - 1)(s_b)^2}{n_{gis} + n_b - 2}}    (8)

where s_{gis}, n_{gis}, s_{b} and n_{b} are the standard deviation of the GIS measurements, the number of subjects in the GIS group, the standard deviation of the benchmark group and the number of subjects in the benchmark group, respectively. Cohen's d is a way of representing the standardized difference between two means.

Table 7: G: GIS vs Cross Project Benchmarks

             GIS(C)                NN-Filter(C)          Naive(C)
file         All   ckloc IG        All   ckloc IG        All   ckloc IG
ant-1.3      0.381 0.434 0.418     0.373 0.447 0.488     0.299 0.258 0.387
ant-1.4      0.446 0.397 0.410     0.220 0.190 0.275     0.198 0.135 0.207
ant-1.5      0.314 0.353 0.359     0.322 0.447 0.343     0.340 0.418 0.425
ant-1.6      0.507 0.528 0.558     0.417 0.447 0.461     0.422 0.438 0.463
ant-1.7      0.450 0.486 0.501     0.502 0.449 0.504     0.477 0.454 0.523
camel-1.0    0.208 0.193 0.201     0.233 0.258 0.143     0.347 0.347 0.196
camel-1.2    0.520 0.580 0.512     0.299 0.299 0.292     0.256 0.257 0.242
camel-1.4    0.393 0.416 0.388     0.285 0.266 0.282     0.212 0.226 0.266
camel-1.6    0.402 0.445 0.388     0.226 0.246 0.249     0.207 0.267 0.234
ivy-1.1      0.664 0.705 0.604     0.458 0.336 0.336     0.398 0.356 0.342
ivy-1.4      0.247 0.270 0.271     0.331 0.129 0.182     0.306 0.283 0.309
ivy-2.0      0.366 0.391 0.419     0.371 0.434 0.419     0.395 0.395 0.426
jedit-3.2    0.563 0.612 0.603     0.374 0.274 0.352     0.502 0.393 0.455
jedit-4.0    0.467 0.502 0.515     0.428 0.332 0.388     0.469 0.478 0.511
jedit-4.1    0.520 0.529 0.561     0.501 0.444 0.427     0.602 0.536 0.576
jedit-4.2    0.389 0.364 0.435     0.453 0.379 0.420     0.484 0.477 0.484
jedit-4.3    0.114 0.098 0.129     0.156 0.213 0.203     0.141 0.176 0.166
log4j-1.0    0.494 0.565 0.517     0.537 0.446 0.396     0.383 0.297 0.243
log4j-1.1    0.581 0.635 0.615     0.596 0.552 0.509     0.336 0.164 0.285
log4j-1.2    0.746 0.697 0.790     0.349 0.272 0.300     0.252 0.192 0.192
lucene-2.0   0.638 0.654 0.610     0.517 0.422 0.471     0.272 0.272 0.331
lucene-2.2   0.643 0.681 0.614     0.363 0.323 0.327     0.295 0.231 0.231
lucene-2.4   0.693 0.718 0.669     0.439 0.338 0.356     0.377 0.344 0.315
poi-1.5      0.681 0.709 0.748     0.408 0.312 0.312     0.382 0.309 0.331
poi-2.0      0.287 0.318 0.327     0.267 0.201 0.235     0.234 0.217 0.258
poi-2.5      0.769 0.762 0.803     0.325 0.267 0.285     0.350 0.265 0.341
poi-3.0      0.767 0.809 0.794     0.361 0.301 0.308     0.365 0.300 0.390
synapse-1.0  0.292 0.337 0.343     0.469 0.325 0.417     0.334 0.277 0.333
synapse-1.1  0.461 0.521 0.514     0.466 0.330 0.458     0.388 0.290 0.300
synapse-1.2  0.554 0.574 0.577     0.566 0.354 0.455     0.455 0.329 0.330
velocity-1.4 0.575 0.645 0.725     0.275 0.160 0.203     0.189 0.176 0.214
velocity-1.5 0.635 0.602 0.712     0.319 0.209 0.281     0.265 0.209 0.300
velocity-1.6 0.511 0.538 0.569     0.340 0.322 0.378     0.320 0.322 0.346
xalan-2.4    0.392 0.385 0.402     0.397 0.322 0.347     0.383 0.327 0.400
xalan-2.5    0.575 0.590 0.583     0.399 0.360 0.360     0.414 0.331 0.344
xalan-2.6    0.522 0.580 0.596     0.540 0.478 0.486     0.509 0.440 0.445
xalan-2.7    0.813 0.842 0.785     0.502 0.376 0.379     0.513 0.382 0.388
xerces-1.2   0.249 0.278 0.287     0.242 0.183 0.201     0.247 0.209 0.242
xerces-1.3   0.428 0.356 0.400     0.334 0.310 0.290     0.334 0.338 0.298
xerces-1.4   0.665 0.652 0.716     0.418 0.308 0.310     0.371 0.303 0.301
xerces-init  0.409 0.436 0.519     0.354 0.342 0.326     0.354 0.376 0.364
Median       0.497 0.531 0.537     0.373 0.323 0.343     0.350 0.303 0.331
Mean         0.496 0.515 0.523     0.384 0.327 0.345     0.351 0.312 0.335
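A direct transcription of Eqs. (7) and (8) into Python, for reference (pure Python, no dependencies):

import math

def cohens_d(gis_scores, bench_scores):
    """Cohen's d with the pooled standard deviation of Eqs. (7)-(8);
    positive values favour GIS."""
    n_g, n_b = len(gis_scores), len(bench_scores)
    mean_g = sum(gis_scores) / n_g
    mean_b = sum(bench_scores) / n_b
    var_g = sum((x - mean_g) ** 2 for x in gis_scores) / (n_g - 1)
    var_b = sum((x - mean_b) ** 2 for x in bench_scores) / (n_b - 1)
    std_p = math.sqrt(((n_g - 1) * var_g + (n_b - 1) * var_b) / (n_g + n_b - 2))
    return (mean_g - mean_b) / std_p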