

THE IMPACT OF OPERATING SYSTEMS AND ENVIRONMENTS ON BUILD RESULTS

MAHDIS ZOLFAGHARINIA

DÉPARTEMENT DE GÉNIE INFORMATIQUE ET GÉNIE LOGICIEL
ÉCOLE POLYTECHNIQUE DE MONTRÉAL

MÉMOIRE PRÉSENTÉ EN VUE DE L’OBTENTION DU DIPLÔME DE MAÎTRISE ÈS SCIENCES APPLIQUÉES

(GÉNIE INFORMATIQUE) DÉCEMBRE 2017

© Mahdis Zolfagharinia, 2017.


ÉCOLE POLYTECHNIQUE DE MONTRÉAL

Ce mémoire intitulé :

THE IMPACT OF OPERATING SYSTEMS AND ENVIRONMENTS ON BUILD RESULTS

présenté par : ZOLFAGHARINIA Mahdis

en vue de l’obtention du diplôme de : Maîtrise ès sciences appliquées
a été dûment accepté par le jury d’examen constitué de :

M. QUINTERO Alejandro, Doctorat, président

M. ADAMS Bram, Doctorat, membre et directeur de recherche

M. GUÉHÉNEUC Yann-Gaël, Doctorat, membre et codirecteur de recherche
M. KHOMH Foutse, Ph. D., membre


DEDICATION

To my grandma, Who always inspired me to achieve my goals

And to my parents, Who always support and encourage me


ACKNOWLEDGEMENTS

I believe changing my program to a research-based master was one of the best decisions of my academic life. It gave me the opportunity to meet new people and attend amazing conferences, which are all now part of this thesis. A special thanks to my supervisor, Dr. Bram Adams, for his kind guidance, great support and boundless patience all along the way. I learned a lot from working with a humble, talented and hard-working supervisor like you.

I would also like to thank my co-supervisor Dr. Yann-Gaël Guéhéneuc for his kindness, motivation and immense knowledge. Thank you Yann for always being by my side, not just as a co-supervisor but as a mentor and a friend.

My sincere thanks to Dr. Foutse Khomh for accepting my invitation to be a jury member and to Dr. Alejandro Quintero for accepting to be president of my defense.

I am grateful to my parents, and my siblings for their unconditional love. Without your support, this thesis would not have been possible.

I would like to thank all my colleagues in our lab and my friends: Amir, Parastou, Yujuan, Rodrigo, Bani, Alexandre, Ruben, Sepideh, Asana, and Antoine for all the moments we spent together, and all the memories we made.

I am grateful to the following university staff: Louise Longtin, Nathalie Audelin, Chantal Balthazard, and Brigitte Hayeur for their unfailing assistance.


RÉSUMÉ

L’intégration continue (IC) est une pratique d’ingénierie logicielle permettant d’identifier et de corriger les fautes logicielles le plus rapidement possible après l’intégration d’un changement de code dans le système de contrôle de versions. L’objectif principal de l’IC est d’informer les développeurs des conséquences des changements effectués dans le code. L’IC s’appuie sur différents systèmes d’exploitation et environnements d’exécution pour vérifier si un système fonctionne toujours après l’intégration des changements. Ainsi, de nombreux "builds" sont créés, alors que seulement quelques-uns révèlent de nouvelles fautes. En d’autres termes, un phénomène d’inflation des builds se produit, où le nombre croissant de builds a un rendement décroissant. Cette inflation rend l’interprétation des résultats des builds difficile, car l’inflation augmente l’importance de certaines fautes, alors qu’elle cache l’importance d’autres. Cette thèse fait progresser notre compréhension de l’impact des systèmes d’exploitation et des environnements d’exécution sur les fautes des builds et le biais potentiel encouru à cause de l’inflation des builds par une étude à grande échelle de 30 millions de builds de l’écosystème CPAN. Nous choisissons CPAN parce que CPAN fournit un riche ensemble de données pour l’analyse automatisée des builds sur des douzaines d’environnements (versions de Perl) et systèmes d’exploitation. Cette thèse rapporte une analyse quantitative et qualitative sur les fautes dans les builds pour classer ces fautes et trouver la raison de leur apparition. Nous observons : (1) l’évolution des fautes des builds au fil du temps et rapportons que plus de builds sont effectués, plus le pourcentage de fautes de builds diminue, (2) différents environnements et systèmes d’exploitation mettent en avant différentes fautes, (3) les résultats des builds doivent être filtrés pour identifier des fautes fiables, et (4) la plupart des fautes des builds sont dues à leur dépendance à l’API. Les chercheurs et les praticiens devraient tenir compte de l’impact de l’inflation des builds lorsqu’ils analysent ou exécutent des builds.


ABSTRACT

Continuous Integration (CI) is a software engineering practice to identify and correct defects as soon as possible after a code change has been integrated into the version control system. The main purpose of CI is to give developers quick feedback on code changes. These changes are built on different OSes and runtime environments to check backward compatibility as well as to check if the product still works with the new changes. Thus, many builds are performed, while only a few of them identify new failures. In other words, a phenomenon of build inflation can be observed, where the increasing number of builds has diminishing returns in terms of identified failures vs. the cost of running the builds. This inflation makes interpreting build results challenging as it increases the importance of some failures, while it hides the importance of others. This thesis advances our understanding of the impact of OSes and runtime environments on build failures and build inflation through a large-scale study of 30 million builds of the CPAN ecosystem. We choose CPAN because it provides a rich data set for the analysis of automated builds on dozens of environments (Perl versions) and operating systems.

This thesis performs quantitative and qualitative analyses of build failures to classify these failures and find out the reasons for their occurrence. We observe: (1) the evolution of build failures over time, reporting that while more builds are being performed, the percentage of them identifying a failure drops, (2) different OSes and environments are not equally reliable, (3) the build results of CI must be filtered to identify reliable failure data, and (4) most build failures are due to API dependencies. Researchers and practitioners should consider the impact of build inflation when they are analyzing and-or performing builds.


TABLE OF CONTENTS

DEDICATION

ACKNOWLEDGEMENTS

RÉSUMÉ

ABSTRACT

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF SYMBOLS AND ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.1 Research Hypothesis: Build Inflation Consequences
1.2 Thesis Contributions: The Impact of OSes/Environments on Build Inflation
1.3 Organization of Thesis

CHAPTER 2 LITERATURE REVIEW
2.1 State-of-the-practice
2.1.1 BuildBot
2.1.2 Jenkins
2.1.3 Travis CI
2.1.4 Treeherder
2.1.5 CPAN
2.2 State-of-the-art
2.3 Build Systems
2.4 Build Failures

CHAPTER 3 RESEARCH PROCESS AND ORGANIZATION OF THE THESIS

CHAPTER 4 ARTICLE 1: DO NOT TRUST BUILD RESULTS AT FACE VALUE – AN EMPIRICAL STUDY OF 30 MILLION CPAN BUILDS
4.1 Introduction
4.2 Background
4.2.1 CPAN
4.2.2 Related Work
4.3 Observational Study Design
4.3.1 Study Object
4.3.2 Study Subject
4.3.3 Quantitative Study Sample
4.3.4 Qualitative Study Sample
4.4 Observational Study Results
4.5 Discussion
4.5.1 Explanatory Classification Model
4.5.2 Comparison to Prior Build Failure Research
4.6 Threats To Validity
4.7 Conclusion

CHAPTER 5 GENERAL DISCUSSION

CHAPTER 6 CONCLUSION
6.1 Summary
6.2 Limitations and Future Work


LIST OF TABLES

Table 1.1 Target expression in a makefile
Table 4.1 Number of builds, distversions, and average numbers of builds per distversion in each period of six months between January 2011 and June 2016.
Table 4.2 Total percentage of occurrences of the four different build failure evolution patterns, across all OSes. “Pure” refers to occurrences of the patterns without fluctuation (e.g., [1, 1, 1]), while “Noisy” refers to occurrences with fluctuation (e.g., [1, 0, 1]).
Table 4.3 Percentages of vectors in Ci that fail inconsistently, as well as percentages of those vectors for which a minority of OSes is failing (Cim). The latter percentages are then broken down across all studied OSes (i.e., they sum up to the percentages in the third column).
Table 4.4 Failure types and their percentages in Perl versions with minority failures.


LIST OF FIGURES

Figure 2.1 BuildBot
Figure 2.2 Jenkins CI tool
Figure 2.3 TreeHerder CI tool dashboard
Figure 4.1 Example of CPAN build report summary. A vertical ellipse represents an “environment build vector” (RQ3), while a horizontal ellipse represents an “OS build vector” (RQ4).
Figure 4.2 Distribution of the number of builds and versions across CPAN dists. This Hexbin plot summarizes where the majority of the data can be found (darker cells), where a cell represents the number of CPAN dists with a given median number of builds (x-axis) and the number of versions (y-axis). The black lines correspond to the thresholds used to filter the data, dividing the data into 9 quadrants. For each quadrant, the number of CPAN dists within it is mentioned on the plot. The central quadrant contains the final data set.
Figure 4.3 Hierarchy of fault types across all operating systems and environments.
Figure 4.4 Distribution of failure ratios in six-month periods. The linear regression line shows the corresponding trend of the ratios over the six studied years.
Figure 4.5 Distribution of the numbers of builds per CPAN dist, as well as of the numbers of OSes and environments on which these dists’ builds took place.
Figure 4.6 Distribution of the ratio of all builds performed on a given environment (black y-axis), and the proportion of those builds failing (blue y-axis).
Figure 4.7 Distribution of the ratio of all builds performed on a given OS (black y-axis), and the proportion of those builds failing (blue y-axis).
Figure 4.8 For a given OS, the distribution across CPAN distversions of the percentages of environments for which no builds have been performed.
Figure 4.9 Percentages of the four build failure evolution patterns for the Linux OS. The blue bars represent occurrences of the pure patterns (no fluctuation) while gray bars are noisy occurrences (including fluctuations).
Figure 4.10 Distribution of fault categories in OS vectors with minority failures vs. majority failures.
Figure 4.11 Distribution of fault categories across all OSes in the minority dataset.
Figure 4.12 Distribution of failures due to dependency vs. non-dependency faults in different build patterns when the majority (6-10) of builds fail. The y-axis shows the failure ratio and the x-axis shows fault types and build patterns.
Figure 4.13 Distribution of failures due to dependency vs. non-dependency faults in different build patterns when a minority (1-3) of builds fail. The y-axis shows the failure ratio and the x-axis shows fault types and build patterns.
Figure 4.14 Bean-plot showing the distributions of AUC, true negative recall and true positive recall across all dists. The horizontal lines show median values, while the black shape shows the density of the distributions.


LIST OF SYMBOLS AND ABBREVIATIONS

OS Operating System
CI Continuous Integration
VCS Version Control System
DevOps Development and Operations
POSIX Portable Operating System Interface
API Application Programming Interface
MSR Mining Software Repositories
POM Project Object Model
PBP Perl Best Practice
AWS Amazon Web Services
QA Quality Assurance


CHAPTER 1 INTRODUCTION

Continuous integration (CI) automates the compilation, building, and testing of software.

A CI build server automates the build process of a software project so that it is built as soon as developers make any change, making sure to compile, test and analyze the impact of those changes on the system. Each new commit entering the VCS triggers the CI tool (e.g., Jenkins), which compiles and tests the project partially or fully. The main purpose of CI is to help developers detect build failures as soon as possible [1] to reduce the risk of defective releases. In the last decade, CI has become popular in both the industrial and open-source software (OSS) communities [2, 3]. Automation of CI can help increase the speed of project development and expose potential failures as soon as possible. Many well-known companies practice CI, such as Google, Facebook, LinkedIn and Netflix [4].

Build systems automate the process of compiling a software project into executable artifacts rapidly across a variety of operating systems and programming languages [5] by using compilers, scripts and other tools. They play a crucial role in the software development process. Developers rely on build systems to test their modifications to the source code, while testers use build systems to execute automated tests that check whether the expected output is still provided after changes. Many researchers have studied build systems [6, 7, 8]. Initially, developers wrote ad hoc programs to do the whole compilation and test, until the build language Make was designed at Bell Labs in 1979 by Feldman as a build automation tool [9]. It became the initial build system for Unix-like systems. Whenever a part of the program is changed, Make performs just the necessary commands to recompile the related files to create an up-to-date executable. Make does this by finding the name of a required target in the expression, assuring all of the dependent files exist, and then creating the target. Table 1.1 shows an example of a Make target expression. Make represents dependencies among files in a directed acyclic graph (DAG). The DAG can vary among build executions, different tools, and configurable features. When a change happens, the DAG informs Make about what has to be rebuilt.

Table 1.1 Target expression in a makefile
Makefile Expression     Example
target: dependencies    main.o: main.c text.h
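To make the rebuild decision concrete, here is a minimal Python sketch (an illustration only, not Make's actual implementation) of the timestamp comparison that decides whether a target such as the main.o example of Table 1.1 must be rebuilt; the file names are purely illustrative.

```python
import os

def needs_rebuild(target, dependencies):
    """Return True if the target must be rebuilt, following Make's rule:
    rebuild when the target is missing or older than any of its dependencies.
    (For simplicity, all dependency files are assumed to exist.)"""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A target is out of date as soon as one dependency is newer than it.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

# Illustrative use, mirroring the "main.o: main.c text.h" rule of Table 1.1.
if needs_rebuild("main.o", ["main.c", "text.h"]):
    print("recompile main.o")   # Make would run the recipe, e.g., cc -c main.c
else:
    print("main.o is up to date")
```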


As different programming languages gained popularity, different build automation systems were developed. Unix-like systems mainly use Make-based tools such as GNU Make and BSD Make. Other ecosystems use non-Make-based tools, such as Ruby (Rake), LISP (ASDF), C# (Cake), and Java (Ant, Maven, and Gradle). Finally, Facebook’s Buck and Google’s Blaze are other recent build systems.

As another example, Maven uses build-by-convention rather than a complicated build language, while other build systems like Ant need developers to write all the commands. Maven is based on the core concept of a build lifecycle, which refers to an explicit, standardized process for building each code or other artifact. The developer building a project must use a small set of commands to compile a Maven project; the instructions in the pom.xml file then ensure the proper outputs. Furthermore, dependency management is a mechanism to centralize API dependencies, using an online repository to hold build artifacts and specifying constraints to control the versions of API libraries to be used for building the software project.

Testing is a critical phase of the build process to guarantee software quality. The build system runs the compiled software project through a series of automated tests [10]. After compiling the new code changes, the CI system executes the compiled system using a variety of test types, such as unit tests, component tests, and regression tests, to assure that the product works properly [11]. Developers often run other types of tests, such as performance testing, to check if all performance requirements have been met [8].

Build inflation is a phenomenon produced by the massive number of builds in the CI practice. In theory, every additional build/test adds more information, because a failed run clearly identifies a problem with the new code changes, while a successful run makes it possible to eliminate a certain build from the list of suspicious builds. However, in practice, some builds/tests add more information than others. Each code change typically is built on multiple OSes, runtime environments and hardware architectures to check if a software product works on that particular OS or environment. Therefore, for every new commit/release, we have large numbers of builds. Figure 2.3 shows how one commit in Treeherder was built on 54 different OSes, and just for the first OS, there are 74 different test suites. While every new build provides new information, the number of builds across environments and OSes is not homogeneously distributed, which gives more weight to some failures than others. Therefore, even though each additional build brings new information, these builds have different values and are not equally informative.
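To get a feel for the scale of this inflation, consider a rough upper bound (a back-of-the-envelope illustration using the 27 OSes and 103 Perl environments studied later in this thesis, ignoring OS versions, hardware architectures, and repeated builds per combination):

\[
27 \;\text{OSes} \times 103 \;\text{environments} = 2{,}781 \;\text{possible OS/environment combinations per release.}
\]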


It is essential to reduce build inflation for several reasons. First, each build by itself needs noticeable time and effort to execute. As such, scheduling additional builds only increases this time and effort, without corresponding gains in QA. Apart from time lost, there are also monetary losses associated with redundant builds. Although cloud services, e.g., AWS or Microsoft Azure, provide a faster and cheaper build infrastructure, they do have a non-negligible cost. O’Duinn, the former release engineering manager at Mozilla, discussed the financial cost of a checkin in his blog [12]. The Amazon prices for the two cheapest AWS regions (US-west-2 and US-east-1) for daily load are $0.45 per hour for an OnDemand builder and $0.12 per hour for an OnDemand tester. To answer “how much did that checkin actually cost Mozilla”, he estimated that a checkin costs at least USD$30.60. The cost included the following usage: USD$11.93 for Firefox builds/tests, USD$5.31 for Fennec builds/tests and USD$13.36 for B2G builds/tests. Thus, we can estimate the approximate cost that could be saved by deleting extra/unnecessary builds, and decide whether developer time is worth spending on scheduling only the valuable builds to avoid these costs. Moreover, as O’Duinn has mentioned, “network traffic is still a valid concern, both for reliability and for cost”. Hence, even in the presence of large-scale cloud infrastructure, awareness and reduction of build inflation is a must.

1.1 Research Hypothesis: Build Inflation Consequences

This thesis aims to study the impact of build inflation with the following hypothesis:

Build inflation is an actual phenomenon in open-source projects that impacts the interpretation of build results.

To explore this hypothesis, we analyse the impact of “build inflation” to study the returns of each additional build. Bias in build outcomes can be introduced due to the large number of builds on diverse combinations of OSes and environments. This bias makes it complicated to interpret build outcomes, and practitioners and researchers should consider it while investigating build outcomes. The following section describes my contribution and its connections with the research hypothesis.

1.2 Thesis Contributions: The Impact of OSes/Environments on Build Inflation

Just as Maven is for the Java programming language, CPAN (Comprehensive Perl Archive Network) [13] is an ecosystem of modules (APIs/libraries) for Perl. CPAN has its own CI system that builds and tests any new release of a module on a diverse range of operating systems and run-time environments (Perl versions). Similar to Maven and npm, but different from Travis CI and Jenkins, CPAN’s CI works at release-level, not at commit-level.

We selected CPAN for this study as it provides a rich data set for the analysis of builds and build inflation across different OSes and run-time environments. We performed quantitative and qualitative analyses of build failures to understand the impact of OS and run-time environment on 30 million builds of the CPAN ecosystem. We also studied the evolution of build failures over time, which allowed us to study build inflation. We mined the log files and meta-data of all CPAN modules to categorize build failures and the causes of their occurrences (faults). We studied 30 million builds over 5.5 years, covering 12,584 CPAN dists, 27 OSes, and 103 (Perl) environments.

Our first contribution is the finding that the build failure ratio across all modules decreases over time, while the number of builds per module increases about 10 times. Moreover, the build results on different OSes/environments are not equally reliable, not only due to the fewer number of builds on some OSes, e.g., 40% of builds on Linux vs. less than 5% on Windows, but also due to the lower popularity of some OSes among Perl developers. Our findings show that 86.5% of build results consistently succeed or fail across all OSes/environments, which is evidence of build inflation. Indeed, if a build failure that consistently succeeds/fails everywhere has (not) been found on one OS or environment, the remaining builds are no longer necessary from the perspective of that build failure. Our findings also show that failures occurring only on a minority of operating systems usually occur on the operating systems that are the least supported by the developers (Windows in the case of CPAN). As Linux and FreeBSD are the main OSes for CPAN, we should treat build failures on these two OSes as high priority. These findings show that different builds have different values and must be prioritized to avoid the consequence of build inflation, which can increase the importance of some failures while hiding the importance of others. We categorized build failures into six main categories and nine subcategories. We showed that API dependency is the most common reason for build failures, whether a minority or a majority of builds fail. We also showed that programming issues are the second most common reason for failure when a majority of OSes fail, while configuration issues are the second most common reason when a minority of OSes fail.


1.3 Organization of Thesis

This thesis is structured as follows: Chapter 2 discusses prior research related to our work. Chapter 3 presents the research process and the organization of the thesis. Chapter 4 introduces the comprehensive structure and details of our empirical studies, which is my accepted paper at MSR 2017, followed by a general discussion in Chapter 5, and the conclusion and future work in Chapter 6.


CHAPTER 2 LITERATURE REVIEW

In this chapter, we introduce prior research on CI and discuss the studies most relevant to this thesis.

2.1 State-of-the-practice

CI is vital for modern software development. A CI tool like Jenkins executes builds several times a day for patches under review, and for each new commit entering the VCS, to send an early notification to developers about any build failure [14, 15].

Continuous integration is done both at release-level (considering official releases of a project) and at commit-level (considering each new change to the source code). Some popular CI tools are Travis CI [16], Strider [17], Jenkins [18], TeamCity [19], Hudson, and Go. We briefly present some of these CI tools. Companies use the most appropriate tools based on the features they need. For example, the Eclipse community adopted Jenkins, while Mozilla and Google Chromium use Buildbot [20], which has a master-worker-based structure and can be customized on demand.

2.1.1 BuildBot

Buildbot is adopted by Mozilla, Chromium, and WebKit. It is a Python-based tool and can be deployed on POSIX-compliant operating systems [21]. Buildbot is a job scheduling system: it executes jobs whenever resources are available. Buildbot consists of one or more build masters and many workers; as soon as the masters detect changes, the workers run builds on a variety of operating systems. Figure 2.1 presents build results in the Buildbot dashboard across different operating systems.

2.1.2 Jenkins

Jenkins is an open-source CI tool that is popular among DevOps practitioners for being free and modular (with over 1K plugins) [21]. Figure 2.2 shows the Jenkins dashboard, which illustrates the build history of new commits.


Figure 2.1 BuildBot


2.1.3 Travis CI

Travis CI is a hosted CI service integrated with GitHub. Travis CI’s build environment provides several run-time environments for multiple programming languages, e.g., Ruby, PHP, Node.js. While a Travis CI repository is hosted on GitHub and can be set up quickly, Jenkins needs to be hosted, configured and installed.

2.1.4 Treeherder

Mozilla uses Treeherder as a reporting dashboard for commit-level CI results of its projects [22]. Treeherder also has a rich set of APIs that can be adapted by other projects to provide the required information. Figure 2.3 shows the results of automated builds and the related tests for only one commit on Treeherder [23], which were performed on November 3, 2017. It shows build results on 54(!) combinations of build configurations and operating systems (multiple versions of Linux, OS X, Windows, Android). A build configuration is a specific selection of features (e.g., QuantumRender or Stylo) and build tool parameters (e.g., opt and debug). Figure 2.3 also reports the results of test suites for each of the 54 builds. With different tests being run on different operating systems, build results are not straightforward to interpret for outsiders.

It seems that an explosion in builds is happening, and that there is an inflation in builds, i.e., many builds are happening, most of them succeeding, and very few of them failing, which leads to diminishing returns. In other words, too many builds are performed for each new commit/release, while only a few of them fail, e.g., for the first OS, only one failure happened among 74 builds.

2.1.5 CPAN

CPAN presents a rich data set of results of automated builds for the Perl programming language on a variety of operating systems and run-time environments. Figure 4.1 shows the build dashboard for one particular Perl package, providing a summary of build results across different operating systems and run-time environments. CPAN’s CI works at package release-level like Maven, not at commit-level like Travis CI and Jenkins. Release-level CI is the process of scheduling and controlling software builds on a variety of OSes and environments, which consists of testing and deploying software releases; it considers official releases of a project as the level of granularity instead of each new change to the code.


2.2 State-of-the-art

Continuous Integration focuses on constantly integrating code changes by several developers, avoiding unpredictable integration close to a release. Developers get quicker feedback on their changes with this automation process in CI [15]. A CI server monitors the version control system (VCS), builds the latest code snapshot, then runs tests. Therefore, if a new commit or release is available in the VCS, the build and test systems are triggered to compile and test it.

Seo et al. [5] concentrated on the compiler errors that occur in Google’s build process for Java and C++ environments. They assessed 26.6 million builds and found that the proportion of failures for C++ and Java builds is 37.4% and 29.7%, respectively. They presented multiple patterns of build failures and concluded that API dependencies are the major reason for compilation errors. They mentioned that the time to fix the failures varies greatly and that different tools are required to support developers’ needs. We complement this study with a larger-scale study of 68.9 million builds of CPAN, which shows that build failures due to programming issues occur with a ratio of 36.1% when they happen across most of the operating systems, and that this ratio drops to 15.5% when a failure impacts only a few operating systems. Also, we analyzed build failures for Perl, which is an interpreted language, while Seo et al. studied compiled languages (Java and C++).

Vasilescu et al. [24] conducted a preliminary study on 246 GitHub projects to investigate the impact of adopting CI systems (such as Travis CI) on productivity and software quality. They declared that Continuous Integration usage can increase productivity but does not necessarily increase software quality. Although they introduced evidence on the advantages of CI, they did not discuss the consequences of performing many builds in CI. This thesis analyzes the impact of adopting CI on build results considering factors such as operating systems and environments within the CPAN CI system. Armed with this understanding and the increasing demand for CI usage, researchers and developers must be careful about using CI to avoid build inflation and its consequences.

Researchers have also studied Travis CI as a data source [3]. They showed that CI services (CIS) like Travis CI [16], which is a globally accessible CIS, are widely used and increase developers’ productivity. They analyzed 34,544 open-source projects from GitHub and found that over 40% of these projects use CI. They analyzed almost 1.5 million builds of Travis CI to understand the reasons for its usage and popularity. The survey showed that CI is widely adopted in popular projects and helps to decrease the time between releases.


Leppanen et al. [25] presented more frequent releases as a perceived advantage of CI, by studying the state-of-the-art CI practice in 15 Finnish software companies. They showed that projects using CI release twice as fast as projects that do not. Other researchers [26] also studied CI adoption by interviewing 27 developers at Ericsson’s R&D to understand their perception of CI. These two works study CI usage and perception in industry, showing how CI gains popularity and is growing.

2.3 Build Systems

Build systems are the infrastructures converting a set of artifacts into an executable format. A build system has a critical role in software development. A poorly implemented build system frustrates developers and wastes time [27].

Developers run the build process to compile and test the changes they made to the code. Then, they must wait for the build process while it is executing. Build tools like Make, Ant, and Maven examine the last modification time of output and input files to perform only the commands required to update an executable. The waiting period frustrates developers and impacts their productivity [5]. Substantial research has been performed to accelerate the build process. Adams et al. [28] proposed to recompile by semantically analyzing the changes performed in a file to check if it and any of its dependencies must be recompiled. Yu et al. [29] improved build speed by removing unneeded dependencies among files and unnecessary code from header files. These two approaches accelerate incremental compilation, which is not performed in CI, since CI always builds from scratch.

Suvorov et al. [30] examined successful and failed migrations to a different build system in two open-source projects, Linux and KDE, and outlined four major challenges faced during the migration from one build system to another due to missing features of the build system. If the maintenance effort associated with the build system grows, developers choose to migrate to another build system. They stated that failed migrations usually did not collect enough build requirements prior to prototyping the migration.

Adams et al. [6, 31] analyzed the evolution of the Linux kernel build system. They studied the modifications in the size of the kernel’s makefiles (SLOC), and the dependencies in different releases. They found that the build system expanded and must regularly be maintained [32]. They found preliminary evidence of increasing complexity in the Linux kernel build dependency graphs. They showed that the build complexity co-evolves with the program source code at the release level.

McIntosh et al. [7] empirically studied ten open-source projects and observed that between 4% and 27% of tasks involving source code modifications need a modification of the associated build code as well. They also mentioned that build code frequently evolves and is expected to include bugs due to its high churn rate. McIntosh et al. [33] studied the evolution of ANT build systems in four open-source Java projects, and stated that build maintenance is mainly induced by source code changes. They observed that the complexity of the build code and the behaviour of the build system both increase and evolve.

2.4 Build Failures

While Schermann et al. [34] presented architectural issues as one of the major obstacles to CI adoption and causes of build failures, we show that configuration problems and platform-dependent failures are also among the problems.

Kerzazi et al. [35] analysed 3,214 builds in a company over 6 months to study build failures. They found that the 17.9% of builds that fail cost about 2,035 man-hours, assuming it takes one hour for each failed build to be made to succeed. We studied 30 million builds and analyzed build results to avoid unnecessary builds and decrease these failure costs.

Rausch et al. [36] reported build failures in 14 open-source Java projects and categorized failures into 14 different groups. They performed their analysis at commit-level and showed that more than 80% of failures are due to failed test cases, while we performed our analysis at release-level and showed that more than 80% of failures are due to programming, dependency, OS, and configuration issues. We investigated build failures in two groups, i.e., failures affecting only a few (minority) of the operating systems versus most (majority) of the operating systems: we reported 4.2% of majority failing builds categorized as test failures, in comparison with 10% failing tests in minority failures.

Kerzazi et al. [37] conducted a study of releases showing unexpected system behavior after deployment. Their findings show that source code is not the major reason of build failure, but defective configurations or database scripts are the main issues. Although our results verify that source code is not always the main reason of failure, we also concentrate on the importance of other factors (OS/environment) causing builds to fail.


Denny et al. [38] studied compile errors of Java code and observed that syntax is a crucial barrier for novice developers, and that students submitted non-compiling code even in simple assignments. They claimed that 48% of build failures occur because of compilation errors. In another study on predicting performance in a programming course, Dyke [39] assessed the frequency of compile errors by tracking Eclipse IDE usage with novice developers, and checked for correlation with productivity. They showed that completed runs and successful compilations are related to productivity. We also study build failures and the reasons for their occurrence, and categorize them into different groups.

Miller et al. [40] studied 66 build failures of Microsoft projects in 2007, using CI as a quality control mechanism in one distributed, collaborative team across 100 days. They categorized these failures into compilation, unit testing, static analysis, and server failures. We investigated 791 build failures across 6 months, and describe a classification of build failures in a different domain, i.e., Perl and release-level CI vs. Java and commit-level CI. Vassallo et al. [41] classified CI build errors in 418 Java-based projects at ING, and 349 Java-based open-source projects hosted on GitHub that use Travis CI. The open-source and ING projects reported different build failures. This classification does not provide any information regarding the role of OSes and environments and how an OS and environment can break builds. In contrast, our classification explicitly considers OS and environment to understand what type of errors is the reason for build failures; we study build failures considering these two factors. We also determine to what degree OS and environment can lead to build inflation.

McIntosh et al. [42] gathered information about build/source/test co-changes in four open-source projects: Eclipse-core, Jazz, Lucene, and Mozilla. They derived metrics like the numbers of files added/removed/modified, and provided a re-sampling approach to predict build co-changes. We integrate these research results with our explanatory model of build failures of CPAN.

Finally, our study finds that API dependencies, such as missing libraries or the wrong API version, cause over 41% of builds to break. CPAN is an ecosystem of modules (APIs/libraries) and if build commands do not find any of the dependent modules, the build will fail. The API of an ecosystem is indeed a major element of development costs [43]. Zibran et al. analyzed 1,513 bug reports of Eclipse, GNOME, MySQL, Python, and Android projects, and among them, 562 bug reports were about API issues. They found that about 175 of those issues were about API correctness.


Many researchers have studied APIs for different objectives, such as recommending appropriate APIs to developers when upgrading the dependencies of one module to a newer version [44]. McDonnel et al. [45] studied API migration in the context of services. A previous work summarized this research [46]; here we summarize two works related to APIs.

Wu et al. [46] analysed 22 releases of the Apache and Eclipse frameworks and their client systems. They showed that the different API modifications in the frameworks affect the clients, and they categorized them into API modifications and API usages. To determine API usages and reduce the impact of API modifications, they recommended applying analyses and tools to frameworks and their client programs.

Tufano et al. [47] showed that it is not feasible to rebuild most past versions of projects due to missing dependencies of the older versions. Kula et al. [48] empirically studied library migration on 4,600 GitHub software projects and 2,700 library dependencies. They showed that, although most of these projects rely on dependencies, 81.5% of the studied projects keep using their outdated dependencies. Our findings further detail the distribution of failures, including dependency-related ones.

Dig et al. [49] also conducted an analysis of five open-source systems (Eclipse, Log4J, Struts, Mortgage, and JHotDraw) to understand API changes. They showed that 80% of changes are due to refactoring. However, as our study is at a higher level of granularity, we observed that build failures related to programming issues account for 36.1% when the majority of builds fail and 15.5% when a minority of builds fail.


CHAPTER 3 RESEARCH PROCESS AND ORGANIZATION OF THE THESIS

This chapter presents the methodology and the structure of my thesis. This thesis aims to help practitioners understand the impact of operating systems and environments on build results, and to help researchers consciously analyze build results. In Chapter 4, we analyze build failures in the CPAN build environment to find out whether prior findings on build failures and error types in Java [5, 41] also hold for the Perl programming language and for release-level CI.

An earlier version of this work was published at the 14th International Conference on Mining Software Repositories [50]. Chapter 4 extends this work by adding three additional research questions involving a qualitative analysis of build failures. This extension has been submitted to the Springer journal on Empirical Software Engineering.

We aim to understand what factors play a role in the build process and in CI. This can help developers better contribute while integrating their code, as well as help researchers identify the valuable OSes/environments to build on each day. Optimizing build execution will decrease the cost and effort required for performing builds, especially in large-scale industrial settings.


CHAPTER 4 ARTICLE 1: DO NOT TRUST BUILD RESULTS AT FACE VALUE – AN EMPIRICAL STUDY OF 30 MILLION CPAN BUILDS

Submitted at: Springer journal on Empirical Software Engineering (Extension of published paper at: the 14th International Conference on Mining Software Repositories)

Mahdis Zolfagharinia, Bram Adams, Yann-Gaël Guéhéneuc

Abstract

Continuous Integration (CI) is a cornerstone of modern quality assurance. It provides on-demand builds (compilation and tests) of code changes or software releases. Despite the myriad of CI tools and frameworks, the basic activity of interpreting build results is not straightforward, due to the phenomenon of build inflation: one code change typically is built on dozens of different runtime environments, operating systems (OSes), and hardware architectures. While build failures due to configuration faults might require checking all possible combinations of environments, operating systems, and architectures, failures due to programming faults will be flagged on every such combination, artificially inflating the number of build failures (e.g., 20 failures reported due to the same, unique fault). As previous work ignored this inflation, this paper reports on a large-scale empirical study of the impact of OSes and runtime environments on build failures on 30 million builds of the CPAN ecosystem. We observe the evolution of build failures over time and investigate the impact of OSes and environments on build failures. We show that Perl distributions may fail differently on different OSes and environments and, thus, that the results of CI require careful filtering and selection to identify reliable failure data. Manual analysis of 791 build failures shows that dependency faults (missing modules) and programming faults (undefined values) are the main reasons for build failures, with dependency faults typically being responsible for build failures on only a few of the build servers. With this understanding, developers and researchers should take care in interpreting build results, while dashboard builders should improve the reporting of build failures.

Keywords: Continuous Integration, Build Failure, Perl, and CPAN

4.1 Introduction

Continuous integration (CI) is an important tool in the quality-assurance toolbox of software companies and organizations. It enables the swift detection of faults and other software-quality issues. A CI system, such as Jenkins, performs builds multiple times a day, for either each patch currently under review, each new commit entering the version control system, or at given times of the day (e.g., nightly builds). It combines build and test scripts to run compilers and other tools in the right order, then test the compiled system [31, 5]. It notifies developers as soon as possible of build failures [14, 15]. A more coarse-grained form of CI is used by package repositories of open-source Linux distributions (e.g., Debian or Ubuntu) and of library repositories, like Maven or CPAN. Their CI systems only receive official releases (instead of every commit), yet those must be built and tested before publishing the new releases. Thus, a CI system may perform dozens or hundreds of builds a day.

Contrary to popular belief, a single commit or release is not built just once, but separate builds are performed for different environments and operating systems (OSes), such as different Java versions or different versions of Windows. These multiple builds, for multiple environments and OSes, lead to the problem of build inflation: there are many build results across OSes/environments, which are not necessarily uniform, i.e., they do not fail or succeed for the same reasons for all environments and–or OSes. Thus, additional builds may bring diminishing returns. Testing in some environment/OS, like Darwin in Figure 4.1, would be more valuable than testing on other OSes, like FreeBSD, because Darwin builds have targeted fewer Perl versions (half of which succeeded), while the vast majority of FreeBSD builds have been failing. Future FreeBSD builds are hence expected to fail, while the Darwin builds could either fail or succeed.

The software-engineering research community, although interested in CI (e.g., the MSR’17 mining challenge was about CI), lacks knowledge on the breadth and depth of the adoption of CI by software companies and organizations. We need answers to questions such as: do CI’s advantages surpass its disadvantages? Do developers use CI, and what are its limitations? How does CI help developers? Consequently, we study the impact of build inflation, in particular its resulting bias, on build failures and CI. We empirically study 30 million builds of the Comprehensive Perl Archive Network (CPAN) [13] performed between 2011 and 2016 and extracted from its CI environment. We cover more than 12,000 CPAN packages (called distributions in the context of CPAN and in the following), 27 OSes, and 103 environments. We answer the following seven questions:

— RQ1: How do build failures evolve across time?
— RQ2: How do build failures spread across OSes/environments?
— RQ3: To what extent do environments impact build failures?
— RQ4: To what extent do OSes impact build failures?
— RQ5: What are the different types of build failures?
— RQ6: To what extent do OSes impact build failure types?
— RQ7: To what extent do environments impact build failure types?

We show an inflation in the numbers of builds and build failures, which hides the reality of builds in noise, e.g., comparing the results of millions of builds on Linux with thousands of builds on Cygwin can lead to wrong conclusions about the quality of Perl releases on Cygwin in comparison to Linux. We also show that unnecessary builds, e.g., on OSes/environments for which we already have many builds, could be avoided by investigating the impact of environments and OSes on build failures. Our observations provide empirical evidence of the bias introduced by build inflation and provide insights on how to deal with this inflation.

We provide the largest quantitative observational study to date on build failures and build inflation. The results of our study form the basis of future qualitative and quantitative studies on builds and CI. They can help researchers to better understand build results. They can also help practitioners to prioritize the environments/OSes with which to perform continuous integration, in terms of expected returns, i.e., numbers of builds vs. chance of failures.

We extend our previous work [50] by analyzing the types of build failures and the impacts of different environments and operating systems on failures and their types. We thus can distinguish between OS-dependent and OS-independent build failures. We then categorize these build failures into 13 groups based on the reasons for the failures. We added RQ5, RQ6, and RQ7 to describe each of these categories and impacts. Finally, Section 4.5 also provides a detailed comparison of the obtained build failure types with those reported in the literature.

We organize our paper as follows: Section 4.2 presents background information on the CPAN CI environment and major related work. Section 4.3 describes our observational study design while Section 4.4 presents our observations, followed by their discussion in Section 4.5. Section 4.6 describes threats to the validity of our observations and discussions. Finally, Section 4.7 concludes with insights and future work.

4.2 Background

4.2.1 CPAN

Overview: This section provides an overview of the software ecosystem whose build results we are studying in this paper, i.e., the Comprehensive Perl Archive Network (CPAN). Similar to Maven and npm for Java and Node.js, CPAN is an ecosystem of modules (APIs/libraries) for the Perl programming language. It contains more than 255,000 Perl modules packaged into 39,000 distributions, i.e., packages that combine one or more modules with their documentation, tests, build and installation scripts. Each distribution can have one or more versions. To simplify terminology, in the following, we refer to a distribution of a set of modules as a “dist” and to a distribution version as a “distversion”.

Build Reports: CPAN implements its own continuous integration system, which builds and tests any new beta or official version of a dist on a variety of operating systems (e.g., Windows vs. Linux) and runtime environments (e.g., Perl version 5.8 vs. 5.19). If successful, the new distversion can be made available to CPAN users. Unlike Travis CI and Jenkins, but similar to Maven and npm, CPAN CI does not work at commit-level, but at release-level. CPAN depends on its own build servers and build servers owned and hosted by volunteering CPAN members. Thus, distversions are not guaranteed to be built and tested on every OS and environment version (cf. white cells in Figure 4.1).

As Perl is an interpreted language, build scripts typically process or transform code and data instead of merely compiling. The test scripts then verify that the modules in the dist work correctly on a given OS and environment. For each build (we will refer to the execution of build and test scripts for a given OS and environment as a “build”), CPAN generates a build report, and all reports for a given version of a dist are summarized into an overview report, as shown in Figure 4.1. This figure shows the summary of build results for version 0.004002 of the “List-Objects-Types” dist for environments 5.8.8 to 5.19.3 (left column) and OSes CygWin to Solaris (top row). Red cells indicate that all builds for a combination of OS and environment failed, green cells show that all were successful, red/green cells indicate that some builds failed, and orange cells represent unknown results (e.g., build or tests were interrupted).

Rest API: CPAN provides a RESTful API [51] and a Web interface [13] to allow complex queries on all publicly-available modules and dists. Furthermore, the whole history of CPAN and all of its modules and dists are accessible via the GitPAN project. These data sources allow convenient access to CPAN for analysis, providing access to build results and module meta-data, such as module names, versions, dependencies, and other helpful information (a small usage sketch follows the list below):

— GUID: a unique global identifier that identifies each dist.

— CSSPATCH: a value (pat) or (unp) indicating if this module was tested with a patched version of Perl.

— CSSPERL: a value (rel or dev) indicating if this module was tested with a release or development version of Perl.
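As an illustration of how such meta-data can be used, the following Python sketch groups a locally downloaded set of build reports for one distversion into the OS-by-environment grid of Figure 4.1. The input format and the field names ('osname', 'perl', 'state') are assumptions for illustration only and do not necessarily match the exact CPAN Testers schema.

```python
import json
from collections import Counter, defaultdict

def build_grid(report_file):
    """Group the build reports of one distversion into the OS x Perl-environment
    grid of Figure 4.1, counting outcomes per cell.

    Assumes one JSON object per line with hypothetical field names
    'osname', 'perl' and 'state' ('pass', 'fail' or 'unknown')."""
    grid = defaultdict(Counter)
    with open(report_file) as reports:
        for line in reports:
            report = json.loads(line)
            cell = (report["osname"], report["perl"])
            grid[cell][report["state"]] += 1
    return grid

def cell_color(outcomes):
    """Mimic the dashboard coloring of Figure 4.1: green (all pass),
    red (all fail), red/green (mixed), orange otherwise (e.g., interrupted)."""
    passed, failed = outcomes.get("pass", 0), outcomes.get("fail", 0)
    if passed and not failed:
        return "green"
    if failed and not passed:
        return "red"
    if passed and failed:
        return "red/green"
    return "orange"
```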


Figure 4.1 Example of CPAN build report summary. A vertical ellipse represents an “environment build vector” (RQ3), while a horizontal ellipse represents an “OS build vector” (RQ4).

4.2.2 Related Work

There exists previous work related to builds, build failures, and CPAN or other ecosystems. We now summarize the studies most relevant to our own study.

Denny et al. [38] investigated compile errors in short pieces of Java code and how students fixed these errors. They showed that 48% of the builds failed due to compilation errors. Similarly, Dyke [39] assessed the frequency of compile errors by tracking Eclipse IDE usage with novice programmers.

Suvorov et al. [30] studied two popular open-source projects, Linux and KDE, and found that the migration from one build system to another might fail due to missing features. Adams et al. [6, 31] analyzed the changes to the Linux kernel build system and reported that the build system grew and had to frequently be maintained. McIntosh et al. [33] replicated the same study on the ANT build system. We complement these studies with our explanatory model of build failures for the build system of CPAN.

Other CI systems, similar to CPAN, but at the granularity of commits instead of releases, are Jenkins, Hudson, Bamboo, and TeamCity [19]. Ståhl and Bosch [2] surveyed CI practices and build failures in such CI systems. They observed that test failures during builds are sometimes accepted by developers because developers know that these particular failures will be fixed later [52]. We complement this study by showing the impact of OSes and environments on build failures within CPAN’s particular CI system.

Seo et al. [5] reported on a case study performed at Google, in which they assessed 26.6 million builds in C++ and Java. They showed that there exist multiple patterns of build failures and focused on compilation problems to conclude that dependencies are the main source of compilation errors, that the time to fix these errors varies widely, and that developers need dedicated tool support. We also show that there exist multiple patterns of build failures, and we additionally take into account the impact of OSes and environments on failures.

Vasilescu et al. [24] conducted an empirical study about CI usage on 246 projects from GitHub. They showed that CI significantly improves the productivity of GitHub teams. However, despite providing evidence on the benefits of CI, they do not provide any detailed information about CI usage, such as the consequences of continuously integrating many builds. Our findings provide evidence on the occurrence of inflation in builds, as CI has become more popular over time.

Hilton et al. [3] assessed 34,544 open-source projects from GitHub, 40% of which use CI. They analyzed approximately 1.5 million builds from Travis CI to understand how and why developers use CI. They found evidence that CI can help projects to release regularly. Leppanen et al. [25] also reported more frequent releases as a perceived benefit of CI by interviewing developers from 15 companies. Two other works [40, 26] have performed case studies on the use of CI and found a positive impact of CI. This growing usage of CI makes it important to perform a larger study of CI and builds.

Our study shows that 39.4% of the builds failing on only a few of the OSes are due to API dependencies (missing modules and libraries), compared to 27.8% of the builds failing on the majority of OSes. APIs are indeed an important factor of development costs [43]. Previous work studied APIs for various purposes: (1) to recommend relevant APIs to developers during development and–or during changes, typically when upgrading the dependencies of one module to newer versions of other modules [44], and (2) to migrate APIs, in particular in the context of services [45]. We refer the interested reader to a previous article summarizing this work [46]. We summarize here only two relevant works related to APIs and builds.

Wu et al. [46] analyzed changes in 22 releases of the Apache and Eclipse frameworks and their client programs. They observed the kinds of API changes in the frameworks impacting the clients and classified API changes and API usages. They suggested applying analyses and tools to frameworks and their client programs to identify different kinds of API usage and reduce the impact of API changes. We provide evidence that such API changes are the most important cause of build failures, with a different impact depending on the OSes and environments on which the builds are performed.


4.3 Observational Study Design

We now describe the design of our observational study. For the sake of locality, we present the research questions, their motivations, and their results in the next section.

4.3.1 Study Object

The object of our study is the impact of OSes and environments on build results, to analyze the phenomenon of “build inflation”, where an excessive number of builds on heterogeneous combinations of OSes and environments can introduce bias in build results. Such bias makes it difficult to interpret build results (and hence detect bugs), and should be taken into consideration by practitioners and researchers analyzing build results. In certain cases, one might even consider eliding (combinations of) OSes and environments from the CI server if those do not contribute useful information.

4.3.2 Study Subject

We choose CPAN to study the impact of OSes and environments on build failures because CPAN [13] provides the results of the automated builds of all Perl dists and distversions on dozens of OSes and environments. Hence, it provides a large and rich data set. Moreover, CPAN has a long history, even though it provides build data only at the release level.

Using the data sources mentioned in Section 4.2.1, we mined the build logs and meta-data of all distversions. Build logs contain the results of all CPAN builds, including the commands executed, build results (failed or succeeded), and the error messages generated by the build and–or test scripts. The META.yml meta-data files contain a dist name, version, dependencies, author and other dist-related information (e.g., supported OS/Perl version). Using the dist build logs and meta-data as the main data source for our observational study, we obtained a data set of 16 years of build results for 39,000 dists, 27 OSes and 103 (Perl) environments.

4.3.3 Quantitative Study Sample

First, we fetched the complete CPANtesters build repository, yielding 68.9 million builds over a period of 16 years, between January 2000 and August 2016. We found that most builds were performed between 2011 and 2016. Although, for each OS, builds were performed on different OS versions and architectures, 10 OSes and 13 Perl environments stood out. In particular, each of these 10 OSes had more than one million builds, while each of these 13 environments had more than 800,000 builds. To reduce computation time and (to some degree) simplify our analysis, we filtered out the other OSes and Perl environments, reducing the initial data set to 62.8 million builds on 10 OSes and 13 environments (Perl 5.8 to 5.21, excluding 5.09) for a period of about 5.5 years, between January 2011 and June 2016.
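A minimal sketch of this pre-filtering step, assuming the build observations are available as a list of BuildRecord objects like the hypothetical one above (the one-million and 800,000 thresholds are those reported in the text):

```python
from collections import Counter

def major_version(perl):
    """Map a Perl minor version (e.g., 5.20.1) to its major version (5.20)."""
    return perl.rsplit(".", 1)[0]

def prefilter(records, min_os_builds=1_000_000, min_env_builds=800_000):
    """Keep only builds on OSes and Perl environments with enough builds overall."""
    os_counts = Counter(r.os for r in records)
    env_counts = Counter(major_version(r.perl) for r in records)
    return [r for r in records
            if os_counts[r.os] >= min_os_builds
            and env_counts[major_version(r.perl)] >= min_env_builds]
```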

As mentioned in Section 4.2, every cell in Figure 4.1 shows all the builds for a combination of OS and Perl environment. Red cells show that all builds failed, green cells show that all builds succeeded, and red/green cells indicate that some builds failed (due to different architectures and OS versions). Following the different directions of the research questions, in RQ1 and RQ2 we count all builds of every cell; in RQ3 and RQ4 we summarize the build output of each cell by considering only the most common build outcome; while in RQ5 to RQ7 we summarize each cell’s build outcome by considering only the most recent failure. Although we investigate 13 Perl major versions, e.g., 5.8, the data set includes 103 Perl minor versions, e.g., 5.8.8. Results of a major version include the results of all its minor versions.

This data set included dists with only one build as well as dists with thousands of builds. For example, 13,522 dists have more than 1,000 builds, while 967 dists have fewer than 3 builds. Unfortunately, not all builds have build results, and we cannot draw any reliable conclusion for dists with too few builds or too few versions, while dists with too many builds or versions might not be representative either. Consequently, we filtered out builds without corresponding data, we determined lower and upper thresholds for the number of builds and number of versions, and we filtered out dists below or above those thresholds, respectively, as explained in the following.
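The two ways of summarizing a cell (all builds of one OS/environment combination) described above can be sketched as follows, again assuming hypothetical BuildRecord objects:

```python
from collections import Counter

def most_common_outcome(cell_builds):
    """RQ3/RQ4: summarize a cell by its most frequent build result."""
    return Counter(b.result for b in cell_builds).most_common(1)[0][0]

def most_recent_failure(cell_builds):
    """RQ5-RQ7: summarize a cell by its most recent failing build, if any."""
    failures = [b for b in cell_builds if b.result == "fail"]
    return max(failures, key=lambda b: b.reported) if failures else None
```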

Figure 4.2 illustrates the distribution of the median number of builds and the number of versions across all CPAN dists in our full data set (the number of dists within each quadrant is shown as well). Black lines show the lower and upper thresholds that we determined for the median number of builds and the number of versions of each dist. By looking at the data distribution, we filtered out CPAN dists with fewer than 10 build results and fewer than 5 versions. Then, to determine the upper thresholds for filtering outliers, we used the following formula [53] based on the inter-quartile range: ut = (uq − lq) × 1.5 + uq, where lq and uq are the 25th and 75th percentiles and ut is the upper threshold.
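A sketch of the resulting filter, combining the fixed lower thresholds (10 builds, 5 versions) with the inter-quartile-range formula quoted above; the percentile computation relies on NumPy:

```python
import numpy as np

def upper_threshold(values):
    """Upper outlier threshold: ut = (uq - lq) * 1.5 + uq."""
    lq, uq = np.percentile(values, [25, 75])
    return (uq - lq) * 1.5 + uq

def in_central_quadrant(median_builds, n_versions, all_median_builds, all_n_versions):
    """Keep a dist only if it lies in the central quadrant of Figure 4.2."""
    return (10 <= median_builds <= upper_threshold(all_median_builds)
            and 5 <= n_versions <= upper_threshold(all_n_versions))
```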

Of the nine quadrants shown in Figure 4.2, only the central one is used in our study. After removing 849,638 builds without data and filtering out the other quadrant data, we obtained a final data set of 12,584 CPAN dists with about 30 million builds. Including other quadrants would increase our data set size, but might introduce noise in the form of outliers. The resulting quantitative study sample is used in RQ1 to RQ7.


Figure 4.2 Distribution of the number of builds and versions across CPAN dists. This Hexbin plot summarizes where the majority of the data can be found (darker cells), where a cell represents the number of CPAN dists with a given median number of builds (x-axis) and the number of versions (y-axis). The black lines correspond to the thresholds used to filter the data, dividing the data into 9 quadrants. For each quadrant, the number of CPAN dists within it is mentioned on the plot. The central quadrant contains the final data set.


4.3.4 Qualitative Study Sample

We performed a qualitative study on build failures to categorize the different types of failures that occur during the build process and to understand the impact of the environment and OS on these types of failures. We manually analyzed CPAN build logs to identify the types of failures occurring in builds, then compared the frequencies of their occurrence.

First, we created a data set of OS build vectors, as shown in Figure 4.1, with at least one build failure among at least 10 OSes, to compare build failures happening in only a minority of OSes (inconsistent failures) to those happening in most of the OSes (consistent failures), for a given Perl version (see RQ4, RQ6 and RQ7). Since, for each OS/environment combination, multiple builds (and failures) can exist, with a maximum of 1,362 and a median of 64 failures per OS/environment, we had to pick one build per vector element (OS/environment). In contrast to RQ3/4, we picked the most recent failure for each vector element, because we wanted to find the most up-to-date reason for failures across all OSes/environments. This reduced the number of build failures from 76,748 to 1,421 across 804 OS vectors (note that most of the vectors did not contain any build failure, which is expected).
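A minimal sketch of how one OS build vector could be assembled from the builds of a given distversion and Perl version, keeping only the most recent failure per OS (record fields as in the hypothetical BuildRecord above):

```python
from collections import defaultdict

def os_build_vector(builds):
    """Map each OS to its most recent failing build (None if that OS never failed)."""
    by_os = defaultdict(list)
    for b in builds:
        by_os[b.os].append(b)
    vector = {}
    for os_name, os_builds in by_os.items():
        failures = [b for b in os_builds if b.result == "fail"]
        vector[os_name] = max(failures, key=lambda b: b.reported) if failures else None
    return vector
```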

Furthermore, we observed that 4% of the selected build results are marked as unknown. While such unknown results were considered as failures in RQ3/4, we cannot do this for RQ5, RQ6 and RQ7, because we need to understand from its build log why a failure happened, so we ignored these unknown results. Note that, although we removed 4% of the selected build results, none of the 804 OS vectors was removed entirely.

We separate the remaining OS vectors with 1 to 3 failing OSes into the group of “minority vectors” and those with 6 to 10 failing OSes into the group of “majority vectors”. We ignore vectors with 4 or 5 out of 10 failing OSes because those could have the characteristics of both minority and majority vectors. Thus, we obtained 752 vectors (963 failures) of minority failures and 52 vectors (458 failures) of majority failures.
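The classification rule above can be sketched directly from those boundaries (a vector is the OS-to-failure mapping of the previous sketch):

```python
def classify_vector(vector):
    """Label an OS build vector as minority, majority, or None (ignored)."""
    failing_oses = sum(1 for failure in vector.values() if failure is not None)
    if 1 <= failing_oses <= 3:
        return "minority"
    if 6 <= failing_oses <= 10:
        return "majority"
    return None  # no failures, or the ambiguous 4-5 range
```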

Because 1,421 failures across 804 vectors is a large number, we randomly sampled 791 failures (confidence level of 95% and confidence interval of 5%) across 306 vectors, yielding a set of 52 majority vectors with 458 failures and 254 minority vectors with 333 failures, which we analyzed manually.

To categorize the different failure types, the 1st and 2nd author explored the build log of each analyzed failure and organized and labeled failures according to the fault responsible for the failure, using the “card sorting” technique [54]. Card sorting allows classifying the 791 randomly-chosen build failures and the causes of these failures, and labeling the corresponding errors. This technique is used in empirical software engineering whenever qualitative investigations and classifications are required. Bacchelli et al. [55] used this technique to investigate code-review comments. Hemmati et al. [56] used it to examine survey analysis [57].

Card sorting involved multiple reviewing iterations. Initially, the first author investigated error messages in logs to analyze the reported symptoms of failures, then explored on-line reports and feedback for each error message to find evidence of and extract the real faults. Then, she added the error messages and their reasons (faults) into a card in Google Keep. Eventually, after analyzing the main rationale for each build failure and recording it onto a card, she grouped these cards into different categories based on a common rationale of build failures (faults). In the second and third iterations, the first and second author discussed differences among error types to categorize them. Major reasons for disagreement involved unclear error messages and too broad/narrow categories. Eventually, both authors agreed on six main categories of failures, containing nine subcategories. Figure 4.3 shows the categories that we studied in RQ5, RQ6, and RQ7.

4.4 Observational Study Results

We now present the motivations, approaches, and results of the seven observational research questions, RQ1 to RQ7.

RQ1: How do build failures evolve across time?

Motivation. This initial research question aims at understanding how often builds fail and whether the ratio of failing builds is constant or fluctuates across time. We investigate build inflation in terms of the number of builds. Beller et al. [14] found a median of 2.9% of Java builds and 12.7% of Ruby builds in Travis CI to be failing, while Seo et al. recorded failure ratios of 37.4% and 29.7% for C++ and Java builds at Google [5]. Unfortunately, apart from these aggregate numbers, not much more is known about build failure ratios, in particular about the evolution of this ratio over time. Furthermore, all existing CI studies have targeted commit-level builds and tests, while CPAN is a package release-level build infrastructure, typical of software ecosystems.

Figure 4.3 Hierarchy of fault types across all operating systems and environments.

Approach. To study the ratio of build failures in our data set, we consider all failing builds (red cells in Figure 4.1) and unknown build results (orange cells in Figure 4.1). When a particular OS/environment combination (one cell in Figure 4.1) saw multiple builds, we considered all of them in this RQ. For each CPAN dist in the data set of 30 million builds, we computed the ratio of build failures as #build failures / #builds. From 2010 to 2014, two Perl versions were released each year (a release cycle of six months), so we investigated the evolution of failures per period of six months. We did not distinguish between OSes and environments in this RQ.
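A sketch of the per-period ratio computation described in this approach; for simplicity, only results explicitly marked as failures are counted here, and records are again assumed to be shaped like the hypothetical BuildRecord:

```python
from collections import defaultdict

def failure_ratio_per_half_year(records):
    """Ratio of failing builds per CPAN dist and six-month period (e.g., 2013-A)."""
    counts = defaultdict(lambda: [0, 0])  # (dist, period) -> [failures, builds]
    for r in records:
        period = f"{r.reported.year}-{'A' if r.reported.month <= 6 else 'B'}"
        tally = counts[(r.dist, period)]
        tally[1] += 1
        if r.result == "fail":
            tally[0] += 1
    return {key: failures / builds for key, (failures, builds) in counts.items()}
```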

Findings. The median build failure ratio decreases across time from 17.7% in the first six months of 2011 to 6.3% in the first six months of 2016.

Figure 4.4 shows the distribution of failure ratios across all builds of all CPAN dists in the studied period of six years. From 2011 to 2013, the median failure ratio in the first half of a year is higher than that of the second half of the year, yet from 2014 on this trend is reversed. As the regression line in Figure 4.4 shows, the overall build failure ratio has a strong decreasing trend between 2011 and 2016, especially when taking into account the logarithmic scale used in the figure. This decreasing ratio might be due to several reasons, for example fewer builds being performed across time or fewer releases being made for dists. In the following, we explore these two hypotheses.

Until the first half of 2015, each year, more builds were being made, from one (2011) to several million (2016). Table 4.1 shows the number of builds per period of six months. We observe that, even though ever more builds are executed, they seem more successful over time, i.e., there is an inverse correlation between numbers of builds and build failures. It is not clear why the second half of 2015 and first half of 2016 show a decreasing number of builds; however, these observations might explain the plateau (instead of decrease) of median values for the rightmost box-plots in Figure 4.4.

Table 4.1 Number of builds, distversions, and average numbers of builds per distversion in each period of six months between January 2011 and June 2016.

                        2011-A  2011-B  2012-A  2012-B  2013-A  2013-B  2014-A  2014-B  2015-A  2015-B  2016-A
Number of Builds           626    946K  1,860K  2,404K  3,021K  3,482K  3,625K  4,082K  4,827K  3,394K  2,891K
Number of Distversions      14   7,185   8,085   8,338  10,443   9,387   9,549  11,682   9,621   7,829   7,003
#builds / #releases       44.7   131.7     230     288   289.2     371   379.6   349.5   501.7   433.5   412.9

Figure 4.4 Distribution of failure ratios in six-month periods (y-axis: % failure, logarithmic scale). The linear regression line shows the corresponding trend of the ratios over the six studied years.

The average number of builds per distversion shows a ten-fold increase from 44.7 to 412.9 across time, although there are some fluctuations from the second half of 2014 on (2014-B). To understand whether the decreasing build failure ratio is due to a drop in the number of releases being made over time, we counted the number of releases of dists in each six-month period. The average number of builds per release in Table 4.1 shows an increasing trend, growing from 44.7 in the first six months of 2011 to 501.7 in the first half of 2015 (with a slight dip at the end of 2014), after which the average ratio drops, but still remains higher than in 2014. We explain this observation as follows: although the number of builds dropped from the second half of 2015 on, the number of releases did not drop at the same rate.

Overall, the steady drop in build failure ratio can (at least partially) be explained by a strong increase in numbers of builds per release, i.e., strong increases in the numbers of builds and the numbers of releases per CPAN dist. While an increasing number of releases is typical for today’s release engineering strategies [58], the increasing numbers of builds cannot be explained intuitively. The next research question helps understand this build inflation by considering the impact of different OSes and environments on builds.





RQ1: The median build failure ratio decreases super-linearly across time, while the number of builds per distversion sees a 10-fold inflation.

RQ2: How do build failures spread across OSes/environments?

Motivation. Build inflation seems to occur due to the need to build and test a release on different versions of the OS and Perl environment. Our hypothesis to explain the decrease of the build failure ratio observed in RQ1 is that a given new distversion is built multiple times in such a way that most of these builds succeed, while only a few fail. Although each OS and Perl environment, of course, can show deviating behavior (which is why multiple builds are performed in the first place), they are essentially building and testing the same features. Hence, feature-related faults are expected to trigger failures across all OSes and environments, inflating the numbers of build failures. Indeed, observing a feature-related build failure on one environment or OS theoretically suffices, since additional builds are expected to fail as well. In contrast, OS- or environment-specific problems occur only for the problematic OS or environment, which is not trivial to predict. This deviation between different types of failures might not only explain our findings for RQ1, but also lead to bias in build results that must be addressed to avoid incorrect conclusions by build engineers and researchers.

Approach. For each build, we extracted build log data about the OSes and environments used during the build, then calculated build failure ratios per OS and environment. For the same reasons outlined in RQ1, we removed builds with unspecified status. Furthermore, as any CPAN community member can volunteer a machine for CPAN builds, a wide variety of hardware architectures and OS/environment versions are used. To make our analysis feasible, we again considered all build results recorded for a given operating system and environment. The analysis of the impact of hardware architecture is left for future work.
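The grouping step of this approach can be sketched as follows; the helper is generic, so the same function yields per-OS and per-environment ratios (field names remain hypothetical):

```python
from collections import defaultdict

def failure_ratio_by(records, key):
    """Build failure ratio grouped by an arbitrary key (e.g., OS or Perl environment)."""
    totals, failures = defaultdict(int), defaultdict(int)
    for r in records:
        if r.result == "unknown":
            continue  # builds with unspecified status are removed, as in the approach
        group = key(r)
        totals[group] += 1
        if r.result == "fail":
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}

# Per-OS and per-environment failure ratios:
# by_os  = failure_ratio_by(records, key=lambda r: r.os)
# by_env = failure_ratio_by(records, key=lambda r: r.perl.rsplit(".", 1)[0])
```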

Findings. The analyzed CPAN distversions have a median of 179 builds, which took place on a median of 22 environments and 7 OSes. Figure 4.5 shows the distribution of the numbers of builds, OSes, and environments across all dists. While the number of OSes is more or less stable around 7, the number of Perl environments to test is much higher, and the total number of builds for a dist correlates with the product of both. Through a manual analysis of the CPAN data, we observed that, when a new version of a dist
