Boosting Automated Program Repair for Adoption By Practitioners



PhD-FSTM-2020-080

The Faculty of Sciences, Technology and Medicine

Dissertation

Defence held on 07/12/2020 in Luxembourg

to obtain the degree of

Docteur de l’Université du Luxembourg

en Informatique

by

Anil KOYUNCU

Born on 3rd April 1986 in Karsiyaka, Turkey

Boosting Automated Program Repair for Adoption by Practitioners

Dissertation Defence Committee

Dr. Yves Le Traon, Dissertation Supervisor

Professor, University of Luxembourg, Luxembourg

Dr. Jacques Klein, Chairman

Associate Professor, University of Luxembourg, Luxembourg

Dr. Tegawendé Bissyandé, Vice Chairman

Assistant Professor, University of Luxembourg, Luxembourg

Dr. Earl T. Barr

Professor, University College London, UK

Dr. Michael Pradel


Abstract

Automated program repair (APR) attracts huge interest from research and industry as the ultimate target in the automation of software maintenance. Towards realising this automation promise, the research community has explored various ideas and techniques, which increasingly demonstrate that APR is no longer fictional. Although literature techniques constantly set new records in fixing a significant fraction of defects within well-established benchmarks, we are not aware of any large-scale adoption of APR in practice. Meanwhile, open-source and commercial organisations have started to reflect on the potential of integrating some automated steps in the software development cycle. Indeed, many current development settings already rely on a number of tools to automate and systematise various tasks such as code style checking, bug detection, and systematic patching. Our work is motivated by this fact. We advocate that a systematic and empirical exploration of the current practice that leverages tools to automate debugging tasks would provide valuable insights for rethinking and boosting the APR agenda towards its acceptability by developer communities. We have identified three investigation axes in this dissertation. First, mining software repositories towards understanding code change properties that could be valuable to guide program repair. Second, analysing communication channels in software development in order to assess to what extent they could be relevant in a real-world program repair scenario. Third, exploring generic concepts of patching in the literature for establishing a common foundation for program repair pipelines that can be integrated with industrial settings.

This dissertation makes the following contributions to the community:

• An empirical study of tool support in a real development setting, providing concrete insights on the acceptance, stability, and nature of bugs being fixed by manually-crafted patches vs. tool-supported patches, and revealing opportunities for improving automated repair techniques.

• A novel information retrieval based bug localisation approach that learns how to compute the similarity scores of various types of features.

• An automated mining strategy to infer fix patterns that can be integrated into automated program repair pipelines.

• A practical bug report driven program repair pipeline.


Acknowledgements

Throughout my Ph.D., I have received a great deal of support and assistance from many people, and without their guidance, this dissertation would not have been possible. I would like to express my gratitude to every one of them.

First and foremost, I would like to thank my supervisor, Yves Le Traon, for giving me this great opportunity to pursue my doctoral studies in his team. I am grateful for the time he spent discussing with me new ideas and contributions.

Then, I would like to thank my day-to-day advisor, Tegawendé Bissyandé. He is the one who has introduced me to the world of automated program repair. His expertise was invaluable in formulating the research questions and methodology. His insightful feedback pushed me to sharpen my thinking and brought my work to a higher level.

I am equally grateful to Jacques Klein and Dongsun Kim for their continuous support, valuable advice, and guidance.

I would like to thank Martin Monperrus for his constructive and insightful feedback on my work, as well as great advice.

I would like to thank all the members of my Ph.D. defence committee, including chairman Jacques Klein, Earl Barr, Michael Pradel, Dongsun Kim, my supervisor Yves Le Traon, and my daily adviser Tegawendé Bissyandé. It is my great honour to have them on my defence committee, and I appreciate very much their efforts to examine my dissertation and evaluate my Ph.D. work.

I would like to acknowledge my colleagues from my internship at LIP6 of Sorbonne University for their wonderful collaboration. I would particularly like to single out my supervisor at LIP6, Dr. Julia Lawall. Thank you for your patient support and for all of the opportunities I was given to further my research.

I would like to extend my thanks to all my co-authors, including Kui Liu, Kisub Kim, Haoye Tian, and Abdoul Kader Kaboré, for their valuable discussions and collaborations.

Lastly, I would like to thank my beloved family for their endless love and continuous support.


Contents

List of figures 11
List of tables 13
Contents 14
1 Introduction 15
1.1 This thesis . . . 17
1.2 Contributions . . . 17
1.3 Roadmap . . . 19

2 Background & Related Work 20
2.1 Automated Program Repair . . . 21

2.1.1 Fault Localisation . . . 21

2.1.2 Patch Generation. . . 22

2.1.3 Patch Validation . . . 23

3 An Empirical Study of Patching in Practice 26
3.1 Overview . . . 27

3.2 Background . . . 28

3.3 Methodology . . . 29

3.3.1 Dataset Collection . . . 30

3.3.2 Research Questions. . . 32

3.4 Empirical Study Findings . . . 32

3.4.1 Descriptive Statistics on the Data . . . 32

3.4.2 Acceptance of Patches (RQ1) . . . 34

3.4.3 Profile of Patch Authors (RQ2) . . . 36

3.4.4 Stability of Patches (RQ3). . . 37

3.4.5 Bug Kinds (RQ4). . . 38

3.5 Discussions . . . 41

3.5.1 Implications. . . 41

3.5.2 Exploiting Patch Redundancies . . . 42

3.5.3 Threats to Validity . . . 43

3.6 Related Work . . . 43

3.6.1 Program Repair . . . 43

3.6.2 Patch Acceptability . . . 44

3.6.3 Program Matching and Transformation . . . 44

3.7 Summary . . . 45

4 Mining Software Repositories 46
4.1 Overview . . . 48

4.1.1 Motivation . . . 49

4.2 Background . . . 50

4.2.1 Abstract Syntax Tree . . . 50

4.2.2 Code Differencing . . . 51


4.3 Approach . . . 53

4.3.1 Overview . . . 54

4.3.2 Step 0 - Patch Collection . . . 54

4.3.3 Step 1 – Rich Edit Script Computation . . . 55

4.3.4 Step 2 – Search Index Construction . . . 58

4.3.5 Step 3 – Tree Comparison . . . 59

4.3.6 Step 4 – Pattern Inference. . . 61

4.4 Experimental Evaluation . . . 62
4.4.1 Dataset . . . 62
4.4.2 Implementation Choices . . . 62
4.4.3 Statistics . . . 63
4.4.4 Research Questions . . . 65
4.5 Results . . . 65

4.5.1 RQ1: Comparison of FixMiner Clustering against Manual Dissection . . . . 65

4.5.2 RQ2: Compatibility between FixMiner’s patterns and APR literature patterns . . . 67
4.5.3 RQ3: Evaluation of Fix Patterns’ Relevance for APR . . . 71

4.6 Discussions and Threats to Validity. . . 76

4.6.1 Runtime performance. . . 76

4.6.2 Threats to external validity. . . 76

4.6.3 Threats to construct validity . . . 76

4.7 Related Work . . . 76

4.7.1 Automated Program Repair. . . 76

4.7.2 Code differencing. . . 77

4.7.3 Change patterns. . . 77

4.8 Summary . . . 79

5 Analysing Communication Channels 80
5.1 Learning Insights from Bug Reports . . . 84

5.1.1 Overview . . . 84

5.2 Background . . . 85

5.3 Empirical Study on IRBL tools . . . 87

5.3.1 Research Questions. . . 87

5.3.2 Experiment Setup . . . 88

5.3.3 Dataset . . . 88

5.3.4 Performance Metrics . . . 89

5.3.5 RQ-1: Affinities among state-of-the-art tools . . . 91

5.3.6 RQ-2: Feature Importance . . . 92

5.4 D&C: an Approach to Adaptively Learn the Weights of Similarity Scores . . . 98

5.4.1 Feature Space for the Classification Models . . . 99

5.4.2 Divide-and-Conquer via Multi-classification . . . 99

5.4.3 Ranking of Bug Localisation Recommendations . . . 102

5.5 Assessment . . . 102

5.5.1 Execution Times . . . 102

5.5.2 Validation experiment . . . 102

5.5.3 Comparison against the state-of-the-art . . . 103

5.5.4 Project-wise performance comparison . . . 104


5.7.2 Query Reformulation. . . 108

5.7.3 VSM in IRBL. . . 108

5.7.4 Topic modelling in IRBL . . . 108

5.7.5 Stack traces in IRBL. . . 108

5.7.6 Feature combinations in IRBL . . . 109

5.7.7 New approaches to IRBL . . . 109

5.7.8 IRBL-related studies. . . 109

5.8 Summary . . . 110

5.9 Bug Report driven Program Repair. . . 110

5.9.1 Overview . . . 110

5.10 Motivation . . . 112

5.10.1 Fault Localisation Challenges . . . 112

5.10.2 Patch Validation in Practice. . . 113

5.11 The iFixR Approach . . . 113

5.11.1 Input: Bug reports . . . 114

5.11.2 Fault Localisation w/o Test Cases . . . 114

5.11.3 Fix Pattern-based Patch Generation . . . 116

5.11.4 Patch Validation with Regression Testing . . . 117

5.11.5 Output: Patch Recommendation List . . . 118

5.12 Experimental Setup . . . 119

5.12.1 Dataset & Benchmark . . . 119

5.12.2 Implementation Choices . . . 120
5.12.3 Research Questions . . . 120
5.13 Assessment Results . . . 120
5.13.1 RQ1: [Fault Localisation] . . . 121
5.13.2 RQ2: [Overfitting] . . . 122
5.13.3 RQ3: [Patch Ordering] . . . 123
5.14 Discussion . . . 126
5.15 Threats to Validity . . . 127
5.16 Related Work . . . 128
5.17 Summary . . . 128

6 Exploring Generic Concepts of Patching 130
6.1 Overview . . . 131

6.2 Related Work . . . 132

6.3 The FlexiRepair Framework. . . 133

6.3.1 Execution steps of FlexiRepair . . . 134

6.3.2 Overview of the SmPL Language . . . 135

6.3.3 Patch clustering . . . 137

6.3.4 Generic Patch Inference . . . 137

6.3.5 Code Transformation with Generic Patches . . . 138

6.4 Study Design . . . 138
6.4.1 Subjects . . . 139
6.4.2 Assessment Benchmarks . . . 139
6.4.3 Implementation Choices . . . 140
6.5 Assessment . . . 140
6.5.1 Research Questions . . . 140
6.6 Results . . . 140

6.6.1 Generic Patch Inference Capability . . . 140

6.6.2 Generic Patch tractability . . . 144

6.6.3 Repairability . . . 145

6.6.4 Efficiency . . . 146


7 Conclusions and Future Work 149

7.1 Conclusions & Future Work . . . 150

7.1.1 Mining software repositories . . . 150

7.1.2 Communication channels in software development . . . 150

7.1.3 Generic concepts of patching . . . 151

Bibliography 155


List of figures

1 Introduction 16

2 Background & Related Work 21

3 An Empirical Study of Patching in Practice 27

3.1 Illustration of SmPL matching and patching. . . 31

3.2 Patch derived from the SmPL template in Figure 4.15a. . . 32

3.3 Temporal distributions of patches. . . 33

3.4 Temporal distributions of DLH patches broken down by tool. . . 33

3.5 Spatial distribution of patches. . . 34

3.6 Delay in commit acceptance. . . 34

3.7 # of Patches submitted / discussed / accepted. . . 35

3.8 Speciality of developers Vs. Patch types. . . 36

3.9 Commitment of developers Vs. Patch types. . . 36

3.10 Time lag between patch integration and reverting. . . 37

3.11 Distribution of patch sizes in terms of files. . . 38

3.12 Distribution of patch sizes in terms of hunks. . . 38

3.13 Distribution of patch sizes in terms of lines. . . 39

3.14 Distribution of change operations (Total # of operations & # of distinct operations in patches).. . . 40

3.15 Top-5 change operations appearing at least once in a patch from the three processes. . . 41
3.16 Example of Compound/If:add – Add an If block. . . 41

3.17 Searching for redundancies among patches that fix warnings of bug finding tools (i.e., DLH patches). . . 41

4 Mining Software Repositories 48
4.1 Example Java class. . . 51

4.2 AST representation of the Helloworld class. . . 51

4.3 GNU diff format. . . 51

4.4 Tangled commit. . . 53

4.5 The FixMiner Approach. At each iteration, the search index is refined, and the computation of tree similarity is specialised in specific AST information details. . . . 53

4.6 Patch of fixing bug Closure-93 in Defects4J dataset. . . 55

4.7 GumTree edit script corresponding to Closure-93 bug-fix patch represented in Figure 4.6. . . 55
4.8 Illustration of subtree extraction. . . 56

4.9 Excerpt AST of buggy code (Closure-93). . . 57

4.10 Rich Edit Script for Closure-93 patch in Defects4J. The ↩ symbol represents the carriage return character, which is necessary for presentation reasons. . . 57

4.11 ShapeTree of Closure-93.. . . 58

4.12 ActionTree of Closure-93. . . 58

4.13 TokenTree of Closure-93.. . . 58

4.14 Distribution of pairwise bug report similarity. Note: a red line represents the average similarity for all bug reports in a fold, and a blue line represents the average similarity of bug reports within a cluster. . . 68


4.16 The overall workflow of PARFixMiner program repair pipeline. . . 71

4.17 Example of fix patterns yielded by FixMiner. . . 72

4.18 Overlap of the correct patches by PARFixMiner and other APR tools. . . 74

5 Analysing Communication Channels 84
5.1 Example bug report with (1) Summary, (2) Description, (3) Stack Trace, (4) Summary Hint, (5) Description Hint, and (6) Code Element. . . 86

5.2 Successful recommendations of IRBL tools and the overlapping among them for Top1, Top5 and Top10. . . 90

5.3 Feature engineering process. . . 93

5.4 Mean Average Precision distributions for different sets of bug reports and with various query formulations (i.e., different combinations of features where the vertical axis refers to the bug report features and the horizontal axis to the source code features. The orange lines show the median values and the green arrows the mean values of the distributions). . . 96

5.5 Mean Reciprocal Rank distributions for different sets of bug reports and with various query formulations (i.e., different combinations of features). . . 96

5.6 Divide and Conquer Learning Approach. . . 99

5.7 Project-wise performance comparison (X and Y axes show MAP and MRR values, the red dots). . . 104

5.8 Example of Linux bug report addressed by R2Fix. . . 111

5.9 The iFixR Program Repair Workflow. . . 113

5.10 Illustration of “Insert Cast Checker” fix pattern. . . 117

5.11 Buggy code of Defects4J bug Math-75. . . 117

5.12 AST of bug Math-75 source code statement.. . . 118

5.13 Distribution of elapsed time (in days) between bug report submission and test case attachment. . . 121

5.14 Patched source code and test case of fixing Math-5.. . . 126

5.15 Distribution of # of comments per bug report. . . 127

6 Exploring Generic Concepts of Patching 131
6.1 The FlexiRepair pipeline. . . 133

6.2 An example code context and change operation presentation. . . 137

6.3 An example of generic patch. . . 138

6.4 Distribution of the patch cluster sizes. . . 141

6.5 Distribution of the generic patches in hunks.. . . 144

6.6 An example of patches sharing the same cluster but having distinct generic patches . . . 144
6.7 NPC of sensical patches. . . 147

6.8 NPC of all patches.. . . 147

6.9 NPC of sensical patches for various selection strategies . . . 148

6.10 NPC of all patches for various selection strategies. . . 148


List of tables

2 Background & Related Work 21

3 An Empirical Study of Patching in Practice 27

3.1 Statistics on H patches in Linux Kernel. . . 30

3.2 Statistics on DLH patches in Linux Kernel. . . 31

4 Mining Software Repositories 48
4.1 Comparison of fix pattern mining techniques in the literature. . . 49

4.2 Dataset. . . 63

4.3 Comparison space reduction. . . 64

4.4 Statistics on clusters.. . . 64

4.5 Statistics on Full vs Partial patterns. . . 64

4.6 Statistics on Pattern Spread. . . 65

4.7 Proportion of shared patterns between our study dataset and Defects4J. . . 66

4.8 Granularity example to FixMiner mined patterns. . . 67

4.9 Example FixMiner fix-patterns associated to APR literature patterns. . . 69

4.10 Compatibility of Patterns: FixMiner vs Literature Patterns.. . . 69

4.11 Example changes associated to FixMiner mined patterns. . . 71

4.12 Details of the benchmark. . . 73

4.13 Number of bugs fixed by different APR tools. . . 73

4.14 Defects4J bugs fixed by different APR tools.. . . 75

5 Analysing Communication Channels 84
5.1 Tools considered in this study. . . 88

5.2 Descriptive Statistics of Curated Bench4BL.. . . 89

5.3 Overlap analysis results. . . 92

5.4 IR features collected from bug reports. . . 94

5.5 IR features collected from source code files. . . 94

5.6 Results of Principal Component Analysis. . . 98

5.7 Validation Experiment. . . 103

5.8 Performance comparison against state-of-the-art IRBL tools. Datasets are cleaned to fit our criteria on pre-fix activities. . . 103

5.9 Project wise performance results. . . 105

5.10 Region-classifiers vs Multi-classifier. . . 106

5.11 Test case changes in fix commits of Defects4J bugs.. . . 112

5.12 Failing test cases after removing future test cases.. . . 113

5.13 Example bug report (Defects4J Lang-7). . . 114

5.14 Fix patterns implemented in iFixR. . . 117

5.15 Fault localization results: IRFL (IR-based) vs. SBFL (Spectrum-based) on Defects4J (Math and Lang) bugs. . . 121

5.16 Fault localization performance. . . 122

5.17 IRFL vs. SBFL impacts on the number of generated correct/plausible patches for Defects4J bugs. . . 122


5.20 iFixR vs state-of-the-art APR tools. . . 125

5.21 Change properties of iFixR’s correct patches. . . 125

5.22 Dissection of bugs successfully fixed by iFixR. . . 125

5.23 Dissection of bug reports related to Defects4J bugs. . . 127

6 Exploring Generic Concepts of Patching 131
6.1 Some of the selected repositories used in our study. . . 139

6.2 Basic statistics of Benchmarks. . . 140

6.3 Statistics on Patch Clusters . . . 141

6.4 Statistics on Patch Clusters Spread . . . 142

6.5 Inferred Generic Patch statistics. . . 142

6.6 Frequently observed generic patches. . . 143

6.7 Top-10 projects contributed to pattern inference. . . 143

6.8 Number of Introclass bugs fixed by APR tools. . . 145

6.9 Selected generic patches fixing Introclass defects. . . 146

6.10 Number of Codeflaws bugs fixed by APR tools. . . 146


1 Introduction


Fault-free software is an illusion. To cope with this reality, a huge effort has been invested by the research community to increase automation in software maintenance. An ultimate automation target in software maintenance is automated program repair (APR). APR is broadly about generating corrective patches in an automated manner in order to eliminate the bugs in programs without breaking any existing functionality. Towards achieving this ambition, research on automated program repair has explored various ideas, algorithms, and techniques for scoping and sorting the space of patch candidates effectively and efficiently. The literature includes a broad range of techniques that use heuristics (e.g., via random mutation operations [115]), constraint solving (e.g., via symbolic execution [172]), or machine learning (e.g., via building a code transformation model [63]) to drive patch generation. The associated literature has demonstrated incredible momentum subsequent to the seminal work of Weimer et al. [237] on generate-and-validate approaches. Over the years, the community has incrementally advanced the state of the art with numerous test-based approaches that have been shown effective in generating valid patches for a significant fraction of defects within well-established benchmarks [78,123,145,197].

Despite this excitement in the research community, adoption by practitioners remains limited. Meanwhile, however, practitioners benefit from a number of tools to automate and systematise various tasks such as code style checking, bug detection, and systematic patching. Our work is motivated by this fact. We advocate that exploring practitioners' realities and expectations would facilitate the adoption of automated program repair systems in practice. To support our endeavour, we have investigated the nature of the current practice and have observed the following aspects: (1) Identifying fault locations under conditions which appropriately reflect development settings remains a largely open problem. To the best of our knowledge, most current state-of-the-art APR approaches [35,39,84,88,110,111,114,117,127,132,137,138,140,155,172,238,253,255,257] leverage test suites to perform fault localisation, as test suites are an affordable approximation to program specifications, given the absence of formal specifications. While current test-based APR approaches are suitable for carefully crafted benchmarks, their adoption by practitioners is hindered because, in most development settings, many bugs are reported without any available test case being able to reveal them.

(2) The intractability of fix patterns and the generalisability (i.e., the scope of mining) of the mining strategies remain a challenge for deriving relevant patterns for program repair. Early techniques such as GenProg [117,238] relied on simple mutation operators to drive the genetic evolution of the code. More widespread today are approaches that build on fix patterns [88] (also referred to as fix templates [135] or program transformation schemas [72]) learned from existing patches. Several APR systems [48,72,88,99,129,131,132,133,135,153,200] in the literature rely on diverse sets of fix patterns obtained either via manual generation or automatic mining of bug-fix datasets. Manual summarisation of fix patterns is a heavy burden for APR practitioners; in addition, it cannot exhaustively enumerate the common and effective fix patterns. Recent automatic mining techniques rely on the frequency of code change actions [76,241], static analysis violations [129,196], and Q&A posts [135] to mine fix patterns. Template-based program repair systems, whether they leverage pre-defined mutation operators, infer code transformations on-the-fly, or rely on offline-inferred fix patterns, generally build on data from existing code bases (preferably with a large history of code changes). If the source of mining is not appropriate (e.g., limited recurrent changes or changes associated with domain-specific bugs), the mined patterns may be irrelevant for the program that is targeted for repair. Overall, although the literature approaches can come in handy for discovering diverse sets of fix patterns, the reality is that the intractability of the fix patterns and the generalisability of the mining strategies remain a challenge for deriving relevant patterns for program repair.

(3) Reliable automated repair techniques could be beneficial in a production development chain as debugging aids [221]. Consequently, it is crucial to ensure that all advancements can be measured and assessed rigorously in terms of efficiency, efficacy, and usability, in order to gauge their affordance for practitioners. Even though the automated repair research community has already started to reflect on the acceptability [88,162] and correctness [210,254] of the patches generated by APR tools, various steps and artefacts in automated program repair techniques remain largely intractable. This intractability remains a big obstacle to transparency. In addition to transparency, approaches in the literature are often provided as monolithic tooling, which prevents extension, adaptation, and even application in real-world development settings, as they require substantial engineering effort for experimental adaptation. These aspects call for additional research in building automatic patch generation systems that are based on flexible, transparent and practical techniques, to enable a better assessment of research advancements and to facilitate the adoption of APR by software maintainers.

1.1 This thesis

In this dissertation, we propose to go back to the basics, systematically and empirically exploring the current practice of repair to provide actionable insights into what can be repaired, how it can be repaired, and when it can be repaired. We aim to offer actionable insights for rethinking and boosting the automated repair agenda towards its acceptability by developer communities.

The main objectives of this thesis are as follows:

• (a) Systematically and empirically explore the current practice of repair to provide extensive insights into how the developer community accepts tool-supported patches (automated repair), and which kinds of fixes can be readily accepted by the community when automated.

• (b) Devise a new automated repair approach oriented towards fixing user-reported bugs under conditions which appropriately reflect development settings.

• (c) Devise a new transparent and flexible automated repair approach that builds on the concept of generic patch, which defines a unified representation/notation for specifying fix patterns (a.k.a. templates).

Concretely, towards realising these objectives, in this dissertation, we focus on:

• Mining software repositories towards understanding their characteristics, and explore insights on how to leverage them to facilitate program repair.

• Analysing communication channels in software development in order to assess to what extent they could be relevant in a real-world program repair scenario.

• Exploring generic concepts of patching in the literature for establishing a common foundation for program repair pipelines that can be integrated with industrial settings.

1.2 Contributions

We now summarise the contributions of this dissertation as follows:


• Impact of Tool Support in Patch Construction.

We perform an empirical study of patch construction in the Linux kernel development, comparing patches crafted entirely manually with patches derived from the warnings of bug detection tools and patches generated automatically from fix patterns (this study is detailed in Chapter 3).

This work has led to a research paper published in the Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2017 (ISSTA 2017).

• D&C: A Divide-and-Conquer Approach to IR-based Bug Localisation.

We extensively study the performance of state-of-the-art bug localisation tools, specifically focusing on investigating the query formulation (i.e., which bug report features should be compared against which features of source code files) and its importance with respect to the localisation performance. Building on insights from this study, we propose D&C, a novel IRBL approach which adaptively learns to compute the weight to associate to similarity scores of IRBL features. The training scenario builds on our findings that the various state-of-the-art localisation tools (hence the associated similarity features that they leverage) can be highly performant for specific sets of bug reports. Concretely, we leverage a gradient boosting supervised learning technique to build multi-classifiers by training on homogeneous sets of bug reports whose localisations appear to be successful with specific types of features.
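To make the weighting idea more concrete, the following minimal sketch (hypothetical class and feature names, not the actual D&C implementation) shows how per-feature similarity scores between a bug report and a source file could be combined, using weights produced offline by a learned model, into a single suspiciousness score used for ranking files:

```java
import java.util.Map;

/** Minimal sketch: rank source files by a weighted sum of IRBL similarity scores. */
public class WeightedIrblScorer {

    // Weights learned offline (e.g., by a gradient boosting model) for each
    // (bug-report feature, source-code feature) similarity; keys are illustrative.
    private final Map<String, Double> learnedWeights;

    public WeightedIrblScorer(Map<String, Double> learnedWeights) {
        this.learnedWeights = learnedWeights;
    }

    /**
     * @param similarities similarity score of each feature pair for one
     *                     (bug report, source file) couple, e.g. "summary-vs-comments" -> 0.42
     * @return suspiciousness score used to rank the file
     */
    public double score(Map<String, Double> similarities) {
        double s = 0.0;
        for (Map.Entry<String, Double> e : similarities.entrySet()) {
            s += learnedWeights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return s;
    }
}
```

In D&C itself, the weights do not form a single global vector: they come from multiple classifiers trained on homogeneous groups of bug reports, and the classifier outputs are combined to rank the candidate files.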

The results of this research will soon be submitted to the Springer Empirical Software Engineering Journal (EMSE).

• iFixR: Bug Report driven Program Repair.

We have investigated the feasibility of automating patch generation from bug reports. To that end, we implemented iFixR, an APR pipeline variant adapted to the constraint that test cases are unavailable when users report bugs. The proposed system revisits the fundamental steps, notably fault localisation, patch generation and patch validation, which are all tightly dependent on the positive test cases in a test-based APR system. In particular, iFixR replaces classical spectrum-based fault localisation with Information Retrieval (IR)-based fault localisation. We take as input the bug report in natural language submitted by the program user and rely on the information in this report to localise the bug positions. We make no assumptions on the availability of positive test cases that encode functionality requirements at the time the bug is discovered, and we assume only the presence of regression test cases to validate patch candidates. We further propose a strategy to prioritise patches for recommendation to developers, in order to increase the probability of placing a correct patch at the top of the list: in the absence of a complete test suite, we cannot guarantee that all patches that pass regression tests will fix the bug.
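As a rough, hedged illustration of this test-independent workflow (all names and types are simplified placeholders, not iFixR's actual API), the pipeline can be summarised as the following sequence of steps:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal, self-contained sketch of a bug-report-driven repair pipeline in the
 * spirit of iFixR; all types and method bodies are simplified placeholders.
 */
public class BugReportDrivenRepair {

    /** A candidate patch, reduced to a textual description for this sketch. */
    record Patch(String diff, double priority) {}

    public List<Patch> repair(String bugReportText, List<String> sourceFiles) {
        // 1. IR-based fault localisation: rank code locations by textual similarity to the report.
        List<String> suspiciousLocations = localise(bugReportText, sourceFiles);

        // 2. Template-based patch generation on the suspicious locations.
        List<Patch> candidates = new ArrayList<>();
        for (String loc : suspiciousLocations) {
            candidates.add(new Patch("apply fix pattern at " + loc, 0.0));
        }

        // 3. Validation restricted to regression tests (no bug-triggering test assumed),
        //    then 4. prioritisation so that a likely-correct patch appears on top.
        List<Patch> plausible = candidates; // placeholder: keep only patches passing regression tests
        plausible.sort((a, b) -> Double.compare(b.priority(), a.priority()));
        return plausible;
    }

    private List<String> localise(String report, List<String> files) {
        // Placeholder for IR-based fault localisation (e.g., TF-IDF similarity ranking).
        return new ArrayList<>(files);
    }
}
```

The key point is that step 3 only relies on regression tests, which is why the ordering in step 4 matters: several patches may pass all regression tests while only one of them is actually correct.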

This work has led to a research paper published in the Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE 2019).

• FixMiner: Mining Relevant Fix Patterns for Automated Program Repair.

We propose to investigate the feasibility of mining relevant fix patterns that can be easily integrated into an automated pattern-based program repair system. To that end, we propose FixMiner, an iterative, three-fold clustering strategy to discover relevant fix patterns automatically from atomic changes within real-world developer fixes. The goal of FixMiner is to infer separate and reusable fix patterns that can be leveraged in other patch generation systems. In order to convey the full syntactic and semantic meaning of the code changes, we introduce the concept of Rich Edit Script, a specialised tree data structure of the edit scripts that captures the AST-level context of code changes. FixMiner infers patterns by discovering clusters of patches that share a common representation, which is computed, for each round of the iteration, from the following information encoded in Rich Edit Scripts: abstract syntax tree, edit actions tree, and code context tree. We assess the consistency of the FixMiner patterns with the patterns in the literature. We further demonstrate, with the implementation of an automated repair pipeline, that the patterns mined by FixMiner are relevant for generating correct patches.
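As a simplified, hypothetical sketch of this three-round clustering idea (not FixMiner's actual implementation), patches can be grouped iteratively by successively finer fingerprints of their Rich Edit Scripts: first the shape of the affected AST, then the edit actions, then the tokens:

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Sketch: iterative clustering of patches by successive tree fingerprints. */
public class IterativeClustering {

    /** A patch abstracted by three fingerprints of its Rich Edit Script (illustrative strings). */
    record PatchRepr(String shapeTree, String actionTree, String tokenTree) {}

    /** Groups patches by one fingerprint within each previously found cluster. */
    static Collection<List<PatchRepr>> refine(Collection<List<PatchRepr>> clusters,
                                              Function<PatchRepr, String> fingerprint) {
        List<List<PatchRepr>> refined = new ArrayList<>();
        for (List<PatchRepr> cluster : clusters) {
            refined.addAll(cluster.stream()
                    .collect(Collectors.groupingBy(fingerprint))
                    .values());
        }
        return refined;
    }

    public static void main(String[] args) {
        List<PatchRepr> patches = List.of(
                new PatchRepr("IfStmt(InfixExpr)", "UPD InfixExpr", "x != null"),
                new PatchRepr("IfStmt(InfixExpr)", "UPD InfixExpr", "y != null"));
        Collection<List<PatchRepr>> clusters = List.of(patches);
        clusters = refine(clusters, PatchRepr::shapeTree);   // round 1: AST shape
        clusters = refine(clusters, PatchRepr::actionTree);  // round 2: edit actions
        clusters = refine(clusters, PatchRepr::tokenTree);   // round 3: tokens
        System.out.println(clusters.size() + " fine-grained clusters");
    }
}
```

Each finer-grained comparison only happens within a previously formed cluster, so the later, more detailed rounds operate on a reduced comparison space.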

This work has led to a research paper published in the Springer Empirical Software Engineering Journal (EMSE 2020).

• FlexiRepair: Transparent Program Repair with Generic Patches.

Template-based program repair research needs a common ground to express fix patterns in a standard and reusable manner. We propose to build on the concept of generic patch (also known as semantic patch), which is widely used in the Linux community to automate code evolution. We advocate that generic patches could provide at the same time a unified representation and a specification for fix patterns.

We present the design and implementation of a repair framework, FlexiRepair, that explores generic patches as the core concept. In particular, we show how generic patches can concretely be inferred and applied in an Automated Program Repair (APR) pipeline. With FlexiRepair, we address an urgent challenge in the template-based APR community to separate implementation details from actual scientific contributions, by providing an open, transparent and flexible repair pipeline on top of which all advancements in terms of efficiency, efficacy and usability can be measured and assessed rigorously. Furthermore, because the underlying tools and concepts have already been accepted by a wide practitioner community, we expect FlexiRepair's adoption by industry to be facilitated. Preliminary experiments with a FlexiRepair prototype on the IntroClass and CodeFlaws benchmarks suggest that it already constitutes a solid baseline with performance comparable to some of the state of the art.

This work has led to a research paper submitted to the 43rd ACM/IEEE International Conference on Software Engineering (ICSE 2021).

1.3 Roadmap


2 Background & Related Work

In this chapter, we provide the preliminary details that are necessary to understand the purpose, techniques and key concerns of the various research studies that we have conducted in this dissertation. Mainly, we revisit the literature of automated program repair, focusing on fault localisation, patch generation and patch validation.

Contents

2.1 Automated Program Repair . . . 21

2.1.1 Fault Localisation . . . 21

2.1.2 Patch Generation . . . 22


2.1 Automated Program Repair

The research on automated program repair has explored various ideas, algorithms, and techniques for traversing a search space of patch candidates that are generated by applying change operators to the buggy program code. Depending on how a technique conducts the search and constructs the patches, the literature includes a broad range of techniques that are heuristic-based [76,88,133,238], constraint-based [156,172,255] or learning-aided [63,137,140], following the taxonomy proposed by Le Goues et al. [118]. Most of these techniques take a buggy program and a correctness criterion (often test suites, as they represent an affordable approximation to program specifications) as their input. Generally, they follow a set of activities starting with (i) a fault localisation step that identifies code locations that are likely to be buggy, (ii) a patch generation step that conducts the search and construction of the patch candidates, and (iii) a patch validation step that checks whether the proposed fix actually corrects the bug.

2.1.1 Fault Localisation

Fault localisation (FL) research focuses on developing automated techniques to identify program entities (such as source code files, methods, and statements) that are likely to contain the defect. FL techniques leverage dynamic analysis, runtime information and static analysis to identify a potential buggy program element. Depending on what FL techniques leverage as the source of information to identify the suspicious program entity, they can be classified as follows:

Spectrum-based fault localisation (SBFL) techniques use test coverage information [3,65,251]; mutation-based fault localisation (MBFL) techniques use test results collected from mutating the program [164,182]; (dynamic) program slicing techniques use the dynamic program dependencies [7,194]; stack trace analysis techniques use error messages [247,250]; predicate switching techniques use test results obtained by mutating the results of conditional expressions [273]; information retrieval based fault localisation (IRFL) techniques use bug report information [199,242,247,267,276]; and history-based techniques use the development history to identify suspicious program elements that are likely to be defective [94,191]. Recently, the literature has leveraged machine learning and deep learning techniques to perform fault localisation. Ye et al. [263] proposed a learning-to-rank approach to bug localisation based on features representing the degree of suspiciousness. Kim et al. [89] dealt with bug report quality to improve bug localisation with a two-phase model focusing on high-quality bug reports. Lam et al. presented HyLoc [103] and DNNLoc [104], which use deep neural networks to learn the relevancy between tokens in bug reports and code elements in the source code. In this dissertation, we propose a novel IRFL approach, D&C, which adaptively learns to compute the weights to associate with the similarity scores of features, from sets of bug reports whose localisations appear to be successful with specific types of features.

In the scope of automated program repair, most APR systems use SBFL [6,36,62,63,76,99,127,132,133,140,152,154,200,241,253,255]. SBFL leverages execution coverage information of passing and failing test cases and a formula such as Ochiai [3] to spot the bug positions. In the literature, APR systems often rely on a testing framework [35,72,114,200] such as GZoltar [31], and a spectrum-based fault localisation formula [249,258,274], such as Ochiai [3]. The popularity of Ochiai is backed up by empirical evidence on its effectiveness in helping localise faults in object-oriented programs, as highlighted by several fault localisation studies [182,216,251,258]. Pearson et al. [186] have even shown that Ochiai outperforms current state-of-the-art ranking metrics, or at least offers similar performance measures.
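For reference, the Ochiai metric computes the suspiciousness of a program entity s from test coverage counts alone, where e_f(s) and n_f(s) denote the numbers of failing test cases that do and do not execute s, and e_p(s) denotes the number of passing test cases that execute s:

```latex
% Ochiai suspiciousness of a program entity s, computed from test coverage counts:
%   e_f(s) / n_f(s): failing test cases that do / do not execute s
%   e_p(s): passing test cases that execute s
\[
  \mathrm{Ochiai}(s) = \frac{e_f(s)}{\sqrt{\bigl(e_f(s) + n_f(s)\bigr)\,\bigl(e_f(s) + e_p(s)\bigr)}}
\]
```

Entities covered by many failing tests and few passing tests thus receive the highest suspiciousness scores.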


ssFix [253] prioritises statements from the stack trace of crashed programs that are executed before the statements ranked by the FL tool. ACS [255] uses predicate switching [273] to refine the list of suspicious code locations. SimFix [76] applies a test case purification approach to improve the accuracy of the fault localisation step before patch generation. Liu et al. [131] have investigated the impact of fault localisation on the repair performance of APR systems, highlighting potential biases due to the elision of assumptions and tweak details when presenting the results of repair tools. Meanwhile, a few studies [20,126] discussed the possibility of leveraging bug reports in the context of automated program repair. To the best of our knowledge, Liu et al. [126] proposed the most advanced study in this direction. However, their R2Fix approach does not use any fault localisation; rather, it focuses on bug reports [126, page 283] which explicitly include localisation information. In this dissertation, we have investigated the feasibility of leveraging bug reports in the context of APR. Concretely, we propose iFixR, an APR system driven by bug reports, which replaces classical spectrum-based fault localisation with Information Retrieval (IR)-based fault localisation.

2.1.2 Patch Generation

The literature includes a broad range of techniques that use heuristics (e.g., via random mutation operations [115]), constraints solving (e.g., via symbolic execution [172]), or machine learning (e.g., via building a code transformation model [63]) to drive patch generation.

Heuristic-based repair approaches employ a generate-and-validate methodology, which constructs and iterates over a search space of syntactic program modifications [118] that are then validated by running the modified program on the provided set of test cases. The search space denotes the set of considered modifications (also referred to as patch candidates [88]) for a buggy program that is generated by a dedicated repair approach. The validation proceeds with the evaluation of patch candidates by executing the provided set of test cases. Once a patch candidate makes the buggy program pass all tests, it is considered a patch for the buggy program. Notable APR tools for Java programs include jGenProg [152], GenProg-A [271], ARJA [271], RSRepair-A [271], SimFix [76], jKali [152], Kali-A [271], jMutRepair [152], HDRepair [114], PAR [88], S3 [110], ELIXIR [200], SOFix [135], and CapGen [241].
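The generate-and-validate loop itself can be summarised by the following minimal sketch (illustrative placeholder names, not the API of any particular APR tool): candidates produced by the search strategy are validated against the test suite, and the first candidate that passes all tests is returned as a plausible patch.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Minimal sketch of the generate-and-validate methodology. */
public class GenerateAndValidate {

    /**
     * @param candidates     patch candidates, ordered by the search strategy
     *                       (e.g., by suspiciousness of the modified location)
     * @param passesAllTests oracle that applies a candidate and runs the full test suite
     * @return the first plausible patch, i.e., the first candidate passing all tests
     */
    static Optional<String> repair(List<String> candidates, Predicate<String> passesAllTests) {
        for (String candidate : candidates) {
            if (passesAllTests.test(candidate)) {
                return Optional.of(candidate); // plausible patch (may still overfit the test suite)
            }
        }
        return Optional.empty(); // search space exhausted without a plausible patch
    }
}
```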

Constraint-based repair approaches proceed with a methodology different from heuristic-based repair: they construct repair constraints that are used to select donor code for patch generation [63]. The constraints describe the code fragments that should satisfy the variable types, values or behaviour specified by the constraints. Such code fragments are returned as potential matches, which are further synthesised into patch candidates with specific synthesis functions. For example, symbolic execution approaches extract properties about the function to be synthesised; these properties constitute the repair constraints. Solutions to the repair constraints can be obtained by constraint solving or other search techniques. Nopol [257], DynaMoth [49], ACS [255], Cardumen [153], SemFix [172] and Angelix [156] are notable constraint-based repair approaches.
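To illustrate the constraint-based idea with a small, hypothetical example (a deliberate simplification, not the algorithm of any specific tool named above): assume fault localisation points to an if-condition and that executing the failing tests reveals which "angelic" truth values would make them pass; synthesis then amounts to finding a candidate expression over visible variables that satisfies these recorded constraints.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

/** Sketch of condition synthesis from angelic-value constraints. */
public class ConditionSynthesis {

    /** One repair constraint: for this variable valuation, the condition must evaluate to 'expected'. */
    record Constraint(Map<String, Integer> env, boolean expected) {}

    /** Returns the first candidate condition satisfying every recorded constraint. */
    static Optional<String> synthesise(List<String> candidateNames,
                                       List<Predicate<Map<String, Integer>>> candidateSemantics,
                                       List<Constraint> constraints) {
        for (int i = 0; i < candidateNames.size(); i++) {
            Predicate<Map<String, Integer>> cond = candidateSemantics.get(i);
            boolean ok = constraints.stream().allMatch(c -> cond.test(c.env()) == c.expected());
            if (ok) {
                return Optional.of(candidateNames.get(i));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Angelic values: the condition must be true when x == 0, false when x == 5.
        List<Constraint> constraints = List.of(
                new Constraint(Map.of("x", 0), true),
                new Constraint(Map.of("x", 5), false));
        List<Predicate<Map<String, Integer>>> semantics =
                List.of(env -> env.get("x") < 0, env -> env.get("x") <= 0);
        Optional<String> fix = synthesise(List.of("x < 0", "x <= 0"), semantics, constraints);
        System.out.println(fix.orElse("no candidate satisfies the constraints")); // prints: x <= 0
    }
}
```

In real systems the candidate expressions are not enumerated by hand but produced by constraint solvers or component-based synthesis over the variables in scope; the sketch only conveys the constraint-satisfaction framing.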

Learning-based repair approaches explore advanced machine learning techniques, especially deep learning, to boost program repair [118]. To date, learning-based repair has been exploited in three ways: 1) learning models of correct code that are used to prioritise patch candidates in terms of correctness [140]; 2) learning code transformation patterns that summarise how human-written patches change buggy code into correct code [27,137], where the patterns can further be used to generate patch candidates; and 3) learning to improve the repair process and training models for end-to-end repair, where models are leveraged to predict the correct code for given buggy code without any other explicitly provided context information [36,228], or learning the context of the code surrounding a fix [121].

More widespread today are approaches that build on fix patterns [88] (also referred to as fix templates [135] or program transformation schemas [72]) learned from existing patches; these could be grouped into heuristic-based repair, as the main process of fix pattern-based repair is consistent with heuristic-based repair. Several APR systems [48,72,88,99,129,131,132,133,135,153,200] implement this strategy by using diverse sets of fix patterns obtained either via manual generation or automatic mining of bug-fix datasets. Depending on how a technique obtains fix patterns, it can be categorised into four groups: manual summarising, pre-definition, frequency, and mining.

1. Manual Summarizing: Pan et al. [181] manually identified 27 fix patterns from patches of five Java projects to characterize the fix ingredients of patches. Kim et al. [88] manually summarised 10 fix patterns from 62,656 human-written patches collected from Eclipse JDT.

2. Pre-definition: Durieux et al. [48] pre-defined 9 repair actions for null pointer exceptions by unifying the related fix patterns proposed in previous studies [44,85,141]. On top of PAR [88], Saha et al. [200] further defined 3 new fix patterns to improve the repair performance. Hua et al. [72] proposed an APR tool with six pre-defined so-called code transformation schemas. Xin and Reiss [253] proposed an approach to fixing bugs with 34 pre-defined code change rules at the AST level.

3. Frequency: Besides formatted fix patterns, researchers [76,241] have also explored automating program repair with code change instructions (at the abstract syntax tree level) that frequently recur in existing patches [76,130,151,240,275]. The strategy is then to select the top-n most frequent code change instructions as fix ingredients to synthesise patches.

4. Mining: Long et al. [137] proposed Genesis, to infer fix patterns for three kinds of defects from existing patches. Liu and Zhong [135] explored fix patterns from Q&A posts in Stack Overflow. Liu et al. [129] and Rolim et al. [196] proposed to mine fix patterns from static analysis violations reported by FindBugs and PMD, respectively.

This dissertation also focuses on fix pattern-based program repair. We consider the predefined sets of patterns used by literature APR systems, and we further investigate the feasibility of an automated approach to mine relevant and actionable fix patterns that can be easily integrated into an automated pattern-based program repair system.
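As a toy, hedged illustration of what applying a fix pattern means (a deliberately simplified string-level transformation, whereas the systems discussed above operate on ASTs), the common "insert null check" pattern can be sketched as wrapping a suspicious statement in a guard on the dereferenced variable:

```java
/** Toy illustration of an "insert null-check" fix pattern applied at the string level. */
public class NullCheckPattern {

    /** Wraps a suspicious statement in a guard on the dereferenced receiver. */
    static String apply(String receiver, String suspiciousStatement) {
        return "if (" + receiver + " != null) { " + suspiciousStatement + " }";
    }

    public static void main(String[] args) {
        // Statement flagged at the suspicious location dereferences 'user' without a check.
        String patched = apply("user", "name = user.getName();");
        System.out.println(patched); // if (user != null) { name = user.getName(); }
    }
}
```

Real template-based systems instantiate such patterns on the AST of the suspicious statement and emit one candidate patch per applicable pattern, which then goes through the validation step described above.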

In the last decade, most techniques proposed in the literature present repair pipelines where patch candidates are generated and then validated against a program specification, generally a (weak) test suite. We refer to them as generate-and-validate test-suite based repair approaches. The genetic programming-based approach proposed by Weimer et al. [238], as well as follow-up works, appeared valid only for hypothetical use cases, as the assumption that test cases are readily available still does not hold in practice [15,95,187]. Nevertheless, in the last couple of years, two independent reports have illustrated the use of literature techniques in actual development flows: in the open source community, the Repairnator project [230] has successfully demonstrated that automated repair engines can be reliable, with open source maintainers accepting and merging patches which were suggested by an APR bot; at Facebook, the SapFix repair system has been reported to be part of the continuous integration pipeline [148], while Getafix was used there at large scale [12]. In this dissertation, our aim is to devise automated repair approaches facilitating the adoption of automated program repair systems in practice. In this direction, we focus not only on improving current generate-and-validate repair approaches, but also on devising a new automated repair approach oriented towards fixing user-reported bugs under conditions which appropriately reflect development settings.

2.1.3 Patch Validation


In the beginning, the repair performance of APR tools was mainly assessed in terms of the number of bugs for which an APR tool can generate a patch that makes the buggy program pass all the test cases [88,114,152,238,257]. However, analysing patch correctness was largely ignored or unconcerned in the community until the systematic analysis of patch correctness conducted by Qi et al. [190]. Their analysis of the patches reported by three generate-and-validate program repair systems (i.e., GenProg, RSRepair and AE) showed that the overwhelming majority of the generated patches are not correct but merely overfit the test inputs in the test suites of the buggy programs. In another study, Smith et al. [210] uncovered that patches generated with lower-coverage test suites overfit more. Actually, these overfitting patches often simply break under-tested functionalities, and some of them even make the “patched” program worse than the unpatched program. Since then, the overfitting issue has been widely studied in the literature.

Eventually, to fairly assess the performance of APR tools on fixing real bugs, the number of bugs for which a correct patch (i.e., one that is semantically equivalent to the patch that the program developer accepted for fixing the bug) is generated appeared to be a more reasonable metric than the mere number of plausible patches [255]. This metric has since become standard among researchers, and is now widely accepted in the literature for evaluating APR tools [35,72,76,127,132,133,135,200,241]. Based on data presented with this metric, researchers explicitly rank APR systems, and use this ranking as a validation of new achievements in program repair. However, this has been a manual effort based on a recurrent criterion: a plausible patch is considered correct when it is semantically similar to the developer's patch in the benchmark.

Therefore, researchers have started to invest some effort in automating the identification of patch correctness [254]. Le et al. [112] revisit the overfitting problem in semantics-based APR systems. Le et al. [113] further assess the reliability of author and automated annotations in assessing patch correctness; they recommend making the patch correctness evaluations of the authors publicly available to the community. Yang and Yang [259] explore the difference between the runtime behaviour of programs patched with developers' patches and with APR-generated plausible patches. They unveil that the majority of APR-generated plausible patches lead to different runtime behaviour compared to correct patches. Liu et al. [134] propose to unveil the implicit rules that researchers use to make decisions on correctness.

In the recent literature, researchers have been investigating how to predict the correctness of patches. One of the first explored research directions relied on the idea of augmenting test inputs, i.e., proposing more tests. Yang et al. [260] design a framework to detect overfitting patches. This framework leverages fuzzing strategies on existing test cases in order to automatically generate new test inputs. In addition, it leverages additional oracles (i.e., memory-safety oracles) to improve the validation of APR-generated patches. In a contemporary study, Xin and Reiss [252] also explored generating new test inputs, based on the syntactic differences between the buggy code and its patched code, for validating the correctness of APR-generated patches. Complementarily, Xiong et al. [254] proposed to assess the patch correctness of APR systems by leveraging the automated generation of new test cases and measuring the behaviour similarity of the failing tests on buggy and patched programs. Through an empirical investigation, Yu et al. [270] summarised two common overfitting issues: incomplete fixing and regression introduction. To help alleviate the overfitting issue for synthesis-based APR systems, they further proposed UnsatGuided, which relies on additionally generated test cases to strengthen patch synthesis, and thus reduce the generation of incorrect overfitting patches. Ye et al. [261] propose ODS, an overfitting detection system that learns an ensemble probabilistic model for classifying and ranking potentially overfitting patches.

In a recent work, Csuvik et al. [40] exploit the textual and structural similarity between the buggy code and the APR-patched code with two representation learning models (BERT [43] and Doc2Vec [107]), considering three patch code representations (i.e., source code, abstract syntax tree and identifiers). Their results show that the source code representation is likely to be more effective for correct patch identification than the other two representations, and that similarity-based patch validation can filter out incorrect patches for APR tools. Tian et al. [224] focus on assessing representation learning techniques for predicting the correctness of patches generated by program repair tools.


3 An Empirical Study of Patching in Practice

In this work, we investigate the practice of patch construction in the Linux kernel development, focusing on the differences between three patching processes: (1) patches crafted entirely manually to fix bugs, (2) those that are derived from warnings of bug detection tools, and (3) those that are automatically generated based on fix patterns. With this study, we provide the research community with concrete insights on the practice of patching, as well as on how the development community is currently embracing research and commercial patching tools to improve productivity in repair. In particular, we investigate the extent of the acceptance of bug finding and patch application tools in a production environment, and study the opportunities of automation that the automated repair community can explore.

This chapter is based on the work published in the following research paper:

• A. Koyuncu, T. F. Bissyandé, D. Kim, J. Klein, M. Monperrus, and Y. Le Traon. Impact of tool support in patch construction. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 237–248. ACM, 2017

Contents

3.1 Overview . . . 27
3.2 Background . . . 28
3.3 Methodology . . . 29
3.3.1 Dataset Collection . . . 30
3.3.2 Research Questions . . . 32

3.4 Empirical Study Findings . . . 32
3.4.1 Descriptive Statistics on the Data . . . 32

3.4.2 Acceptance of Patches (RQ1) . . . 34

3.4.3 Profile of Patch Authors (RQ2) . . . 36

3.4.4 Stability of Patches (RQ3) . . . 37
3.4.5 Bug Kinds (RQ4) . . . 38
3.5 Discussions . . . 41
3.5.1 Implications . . . 41
3.5.2 Exploiting Patch Redundancies . . . 42
3.5.3 Threats to Validity . . . 43
3.6 Related Work . . . 43
3.6.1 Program Repair . . . 43
3.6.2 Patch Acceptability . . . 44

3.6.3 Program Matching and Transformation . . . 44


3.1 Overview

Patch construction is a key task in software development. In particular, it is central to the repair process when developers must engineer change operations for fixing the buggy code. In recent years, a number of tools have been integrated into software development ecosystems, contributing to reducing the burden of patch construction. The process of patch construction indeed includes various steps that can more or less be automated: bug detection tools, for example, can help human developers characterise and often localise the piece of code to fix, while patch application tools can systematise the formation of concrete patches that can be applied within an identified context of the code. Tool support, however, can impact patch construction in a way that may influence acceptance or that focuses the patches on specific bug kinds. The growing field of automated repair [88,117,156,172], for example, is currently challenged by the nature of the patches that are produced and their eventual acceptance by development teams. Indeed, constructed patches must be applied to a code base and later maintained by human developers.

This situation raises the question of the acceptance of patches within a development team, with regard to the process that was relied upon to construct them. The goal of our study is therefore to identify different types of patches written by different construction processes, by exploring patches in a real-world project, to reflect on how program repair is conducted in current development settings. In particular, we investigate how advances in static bug detection and patch application have already been exploited to reduce human effort in repair.

We formulate research questions for comparing different types of patches, produced with varying degrees of automation, to offer to the community some insights on i) whether tool-supported patches can be readily adopted, ii) whether tool-supported patches target specific kinds of bugs, and iii) where further opportunities lie for improving automated repair techniques in production environments. In this work, we consider the Linux operating system development since it has established an important code base in the history of software engineering. Linux is furthermore a reliable artefact [74] for research as patches are validated by a strongly hierarchical community before they can reach the mainline code base. Developers involved in Linux development, especially maintainers who are in charge of acknowledging patches, have relatively extensive experience in programming. Linux’s development history constitutes valuable information for repair studies as a number of tools have been introduced in this community to automate and systematise various tasks such as code style checking, bug detection, and systematic patching. Our analysis unfolds as an empirical comparative study of three patch construction processes:

• Process H: In the first process, developers must rely on a bug report written by a user to understand the problem, locate the faulty part of source code, and manually craft a fix. We refer to it as Process H, since all steps in the process appear to involve Human intervention.

• Process DLH: In the second process, static analysis tools first scan the source code and report on lines which are likely faulty. Fixing the reported lines of code can be straightforward since the tools may be very descriptive on the nature of the problem. Nevertheless, dealing with static debugging tools can be tedious for developers with little experience as these tools often yield too many false positives. We refer to this process as Process DLH, since Detection and Localisation are automated but Human intervention is required to form the patch.

• Process HMG: Finally, in the third process, developers may rely on a systematic patching tool to search for and fix a specific bug pattern. We refer to this process as Process HMG, since Human input is needed to express the bug/fix patterns which are Matched by a tool to a code base to Generate a concrete patch.


Acceptance of patches: development communities, such as the Linux kernel team, are becoming aware of the potential of tool support in patch construction i) to gain time by prioritising engineering tasks and ii) to attract contributions from novice developers seeking to join a project.

Kinds of bugs: Tool-supported patches do not target the same kinds of bugs as manual patches. However, we note that patches fixing warnings outputted by bug detection tools are already complex, requiring several change operations over several lines, hunks and even files of code.

Opportunities for automated repair: We have performed preliminary analyses which show that bug detection tools can be leveraged as a stepping stone for automated repair in conjunction with patch generation tools, to produce patches that are consistent with human patches (for maintenance), correct (derived from past experience of fixing a specific bug type) and thus likely to be rapidly accepted by development teams.

3.2 Background

Linux is an open-source operating system that is widely used in environments ranging from embedded systems to servers. The heart of the Linux operating system is the Linux kernel, which comprises all the code that runs with kernel privileges, including device drivers and file systems. It was first introduced in 1994, and has grown to 14.3 million lines of C code with the release of Linux 4.8 in Oct. 2016 [1]. All data used in this work are related to changes propagated to the mainline code base until Oct. 2, 2016 [2].

A recent study has shown that, for a collection of typical types of faults in C code, the number of faults is staying stable, even though the size of the kernel is increasing, implying that the overall quality of the code is improving [179]. Nevertheless, ensuring the correctness and maintainability of the code remains an important issue for Linux developers, as reflected by discussions on the kernel mailing list [214].

Development Model of Linux

The Linux kernel is developed according to a hierarchical open source model referred to as Benevolent dictator for life (BDFL) [245], in which anyone can contribute, but ultimately all contributions are integrated by a single person, Linus Torvalds. A Linux kernel maintainer receives patches related to a particular file or subsystem from developers or more specialised maintainers. After evaluating and locally committing them, he/she propagates them upwards in the maintainer hierarchy eventually towards Linus Torvalds.

Finally, Linux developers are urged to “solve a single problem per patch”³, and maintainers are known to enforce this rule as revealed by discussions on contributors’ patches in the Linux Kernel Mailing List (LKML) [214] archive.

Patching and Repair in Linux

Recently, the development and maintenance of the Linux kernel have become a massive effort, involving a huge number of people. 1,731 distinct commit authors have contributed to the development of Linux 4.8⁴. The patches written by these commit authors are then validated by the 1,142 maintainers of Linux 4.8⁵, who are responsible for the various subsystems.

¹Computed with David A. Wheeler’s ‘SLOCCount’.
²Kernel’s Git HEAD commit id is c8d2bc9bc39ebea8437fd974fdbc21847bb897a3.
³See Documentation/SubmittingPatches in the linux tree.

Since the release of Linux 2.6.12 in June 2005, the Linux kernel has used the source code management system git [50]. The current Linux kernel git tree [227] only goes back to Linux 2.6.12, and thus we use this version as the starting point of our study. Between Linux 2.6.12 and Linux 4.8 there were 616,291 commits, by 20,591 different developers⁶. These commits are retrievable from the git repository as patches. Basically, a patch is an excerpt of code in which lines beginning with - are to be removed and lines beginning with + are to be added.
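For illustration, the following is a minimal, hypothetical hunk in the unified diff format produced by git; the file path, function, and change are invented for the example and do not correspond to an actual kernel commit:

--- a/drivers/example/foo.c
+++ b/drivers/example/foo.c
@@ -42,3 +42,3 @@
-	data = kzalloc(sizeof(*data), GFP_ATOMIC);
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
 	if (!data)
 		return -ENOMEM;

Here, a single line is removed (the allocation using the GFP_ATOMIC flag) and replaced by an added line using GFP_KERNEL, while the surrounding context lines are left unchanged.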

The Linux kernel community actively uses the Bugzilla [86] issue tracking system to report and manage bugs. As of November 2016, over 28,000 bug reports had been filed in the kernel tracking system, with about 6,000 marked as highly severe or even blocking.

The Linux community has also built, or integrated, a number of tools for improving the quality of its source code in a systematic way. For example, the mainline code base includes the coding style checker checkpatch, which was released in July 2007, in Linux 2.6.22. The use of checkpatch is supported by the Linux kernel guidelines for submitting patches⁷, and checkpatch has been regularly maintained and extended since its inception. Sparse [244] is another example of the tools built by Linus Torvalds and colleagues to enforce type checking.
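To illustrate the kind of change that checkpatch prompts, the following hypothetical hunk fixes a purely stylistic issue (a conditional and its statement written on a single line), which is the kind of style rule checkpatch is commonly used to enforce; the code is invented for the example:

-	if (err) return err;
+	if (err)
+		return err;

Such patches do not change the behaviour of the code, but they keep the code base consistent with the kernel coding style.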

Commercial tools, such as Coverity [217], also often help to fix Linux code. More recently, researchers at Inria/LiP6 have developed the Coccinelle project [125] for Linux code matching and transformation. Initially, the project was designed to help developers perform collateral evolutions [177]. It is now intensively used by Linux developers to apply fix patterns to the whole code base.

3.3 Methodology

Our objective is to empirically check the impact of tool support in the patch construction process in Linux. To achieve this goal, we must collect a large, consistent and clean set of patches constructed in different processes. Specifically, we require:

(1) patches that have been manually prepared a priori by developers based on knowledge of a potential bug somewhere in the code. For this type of patch, we assume that a user may have reported an issue while running the code. In the Linux ecosystem, such reporters are often kernel developers.

(2) patches that have been constructed by using the output of bug finding tools, which are integrated into the development chain. We consider this type of patch to be tool-supported, as debugging tools often provide reliable information on what the bug is (hence, how to fix it) and where it is located.

(3) patches that have been constructed fully by a tool, based on change rules. Such fixes, validated by maintainers, are actually based on templates of fix patterns which are used to i) match (i.e., locate) incorrect code in the project and ii) generate a corresponding concrete fix.

⁴Obtained using git log v4.7..v4.8 | grep ^Author | sort -u | wc -l, without controlling for variations in names or email addresses.
⁵Obtained using grep ^M: MAINTAINERS | sort -u | wc -l, without controlling for variations in names or email addresses.
⁶Again, we have not controlled for variations in names or email addresses.


3.3.1 Dataset Collection

To collect patches constructed via Process H, hereafter referred to as H patches, we consider patches whose commits are explicitly linked to a bug report from the kernel Bugzilla tracking system or from the bug tracking systems of other Linux distributions. We consider that such patches have been engineered manually after careful consideration of the report filed by a user, and often after a replication step where developers dynamically test the software.

Up to Linux 4.8, we found 5,758 patches fixing defects described in bug reports. Unfortunately, for some of these patches, the link to the bug report provided in the commit log was not accessible (e.g., because of access restrictions on some Red Hat bug reports or because the web page was no longer live). Consequently, we were able to collect 4,417 bug patches corresponding to a bug report (i.e., ∼ 77% of H patches). Table 3.1 provides statistics on the bugs associated with those patches.

Table 3.1: Statistics on H patches in Linux Kernel.

Severity # reports # patches

Severe 965 1,052

Medium 2,961 3,163

Minor 138 136

Enhancement 47 66

Total 4,111 4,417

First, we note that the severity of most bugs (2,961, i.e., 72.0%) is medium, and H patches have fixed substantially more severe bugs (965, i.e., 23.5%) than minor bugs (138, i.e., 3.3%). Only 47 (1.1%) bug reports represent mere enhancements. Second, exploring the data shows that there is not always a 1-to-1 relationship between bug reports and patches: a bug report may be addressed by several patches, while a single patch may relate to several bug reports. Nevertheless, we note that 4,270 out of 5,265 (i.e., 89%) patches address a single bug report. Third, a large number of unique developers (1,088 out of 18,733, i.e., 6.95%) have provided H patches to fix user bug reports. Finally, H patches have touched about 17% (= 9,650/57,195) of the files in the code base. Overall, these statistics suggest that the dataset of H patches is diverse, as the patches are written by a variety of developers to fix bugs of varying severity spread across different files of the program.

We identify patches constructed via Process DLH, hereafter referred to as DLH patches, by matching in commit logs messages of the form “found by <tool>”⁸, where <tool> refers to a tool used by kernel developers to find bugs. In this work, we consider the following notable tools, for static analysis:

• checkpatch: a coding style checker for ensuring some basic level of patch quality.

• sparse: an in-house tool for static code analysis that helps kernel developers to detect coding errors based on developer annotations.

• Linux driver verification (LDV) project: a set of programs dedicated to improving the quality of kernel driver modules, such as the Berkeley Lazy Abstraction Software verification Tool (BLAST), which solves the reachability problem.

• Smatch: a static analysis tool.

• Coverity: a commercial static analysis tool.

• Cppcheck: an extensible static analysis tool that relies on checking rules to detect bugs.

and for dynamic analysis:

⁸We also use “generated by <tool>”, since the commit authors also often refer to warnings as “generated by” a given tool.


• Strace: a tracer for system calls and signals, to monitor interactions between processes and the Linux kernel.

• Syzkaller: an unsupervised, coverage-guided Linux syscall fuzzer for testing the kernel’s handling of untrusted user input.

• Kasan: the Linux Kernel Address SANitizer, a dynamic memory error detector for finding use-after-free and out-of-bounds bugs.

After collecting patches referring to those tools, we further check that the commit logs include the terms “bug” or “fix”, to focus on bug fix patches. Table 3.2 provides details on the distribution of patches produced based on the output of those tools.

Table 3.2: Statistics on DLH patches in Linux Kernel.

Tool # patches Tool # patches

checkpatch 292 sparse 68

LDV 220 smatch 39

coverity 84 cppcheck 14

strace 4 syzkaller 7

kasan 1

Checkpatch and the Linux driver verification project tools are the most mentioned in commit logs.

The Coverity commercial tool and the sparse internal tool also helped to find and fix dozens of bugs in the kernel. Finally, we note that static tools are more frequently referred to than dynamic tools.

HMG patches in Linux are mainly produced with Coccinelle, which was originally designed to document and automate collateral evolutions in the kernel source code [177]. Coccinelle is built on an approach where the user guides the inference process using patterns of code that reflect the user’s understanding of the conventions and design of the target software system [106].

Static analysis by Coccinelle is specified by developers who write control-flow sensitive concrete syntax matching rules [28]. Coccinelle provides a language, SmPL⁹, for specifying searches and transformations, referred to as semantic patches. It also includes a transformation engine for performing the specified semantic patches. To avoid confusion with semantic patches in the context of the automated repair literature, we will refer to Coccinelle-generated patches as SmPL patches.

@@
expression E;
constant c;
type T;
@@
-kzalloc(c * sizeof(T), E)
+kcalloc(c, sizeof(T), E)

(a) Example of SmPL templates.

void main(int i)
{

	kzalloc(2 * sizeof(int), GFP_KERNEL);
	kzalloc(sizeof(int) * 2, GFP_KERNEL);

}

(b) C code matching the template on the left (iso-kzalloc.c).

Figure 3.1: Illustration of SmPL matching and patching.

Figure 3.1 illustrates an SmPL patch example. This SmPL patch is aimed at changing all function calls of kzalloc to kcalloc, with a reorganisation of call arguments. For more details on how SmPL patches are specified, we refer the reader to the project documentation¹⁰. Figure 3.2 represents the concrete Unix diff generated by the Coccinelle engine, which is included in the patch forwarded to mainline maintainers.
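Although the content of Figure 3.2 is not reproduced here, applying the SmPL patch of Figure 3.1a to the code of Figure 3.1b would yield a unified diff of roughly the following form. This is an illustrative sketch: the exact hunk headers and context depend on the file, and it assumes that Coccinelle’s isomorphisms let the pattern c * sizeof(T) also match sizeof(int) * 2, as the iso- prefix of the example file suggests:

--- a/iso-kzalloc.c
+++ b/iso-kzalloc.c
@@ -1,7 +1,7 @@
 void main(int i)
 {
 
-	kzalloc(2 * sizeof(int), GFP_KERNEL);
-	kzalloc(sizeof(int) * 2, GFP_KERNEL);
+	kcalloc(2, sizeof(int), GFP_KERNEL);
+	kcalloc(2, sizeof(int), GFP_KERNEL);
 
 }

It is this kind of concrete diff, rather than the SmPL specification itself, that is submitted to maintainers and recorded in the git history.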

In some cases, the fix is not directly implemented in the SmPL patch (which is then referred to as an SmPL match). Nevertheless, since each bug pattern must be clearly defined with SmPL, the associated fix is straightforward to engineer. Overall, we have collected 4,050 HMG patches mentioning “coccinelle” or “semantic patch” and applied to C code¹¹.

⁹Semantic Patch Language.
¹⁰http://coccinelle.lip6.fr/documentation.php
¹¹We have controlled with a random subset of 100 commits that this grep-based approach indeed yielded only relevant
