Code Quality - Software Engineering and Bioinformatics Challenges

Chapter 1 Introduction

1.3. Software Engineering and Bioinformatics Challenges

1.3.2. Code Quality

Like any other software project, a challenge for bioinformatics research projects is maintaining good code quality. In bioinformatics code quality is important because it helps ensure that research which uses the code is valid and that software development time is optimized. The characteristics of good quality code which are important for bioinformatics are:

 Correctness

 Maintainability

 Ease of use

 Efficiency

Correctness refers to the codes ability to produce the correct result. Correctness is correlated with the number of bugs, both known and unknown, that the software has. Obviously, code that has bugs and produces the wrong result is not suitable for producing scientific results that are valid. Code containing many bugs is also a major time sink due to the amount of time that is required to find, understand and fix the problem without introducing new bugs.

Maintainability refers to how easy it is for a programmer to work with the software. It addresses questions such as: How easy is it to debug the code? How fast is it to produce a fix? How quickly can a new developer understand the code? And how quickly can new features be added to the code? Because of the rapid change in requirements and features that are inherent in software that is developed for research, maintainability is especially important.

Ease of use refers to how easy it is to learn the API’s that the code exposes and how easy it is to use the API’s once they are learnt.

Efficiency directly relates to the performance and speed of running the software. Efficiency is important because code that takes a long time to run slows down the development process. Not just because of the intrinsic time it takes, but also due to the cost of interrupting developer flow.

The literature on best practice for developing bioinformatics and scientific software^35–39 covers the basic of how to write good code. Such as using a Version Control System (VCS) for keeping

track of the code, how to organize and track data and results. As well as the basics of coding best practices, such as making incremental changes, not writing duplicated code and writing code that works before optimizing the code. However, there are best practices and development methodologies that are used in software engineering which can also help with writing and maintain good quality bioinformatics code. The best practices that are particularly useful when writing bioinformatics code are: testing, refactoring, measuring code quality and continuous integration.

Testing. In software engineering testing and especially automated testing is an active area of research. There are generally four levels of tests: unit tests, integration tests, user interface tests and system tests. For bioinformatics software that is developed while conducting research the testing levels that are of interest are unit tests and integration tests. Unit tests verify the correctness of a small unit of code, usually at the method level. And integration tests are used to check larger parts of the code such as parsing an input file. To help with write and automatically run the tests, frameworks and software libraries such as JUnit (http://junit.org/) and Mockito (http:// mockito.org/) can be used. Obviously having tests is useful to ensure that the code is correct. However, testing also helps with maintainability and usability of the code.

Software is far easier to maintain if there are tests that can give immediate feedback if changes to the existing codebase cause errors in other parts of the codebase. Additionally, writing code that is testable encourages developers to write small isolated units of code, which is a very good practice and makes the resulting code easier to maintain and use.

Refactoring. Refactoring is the process of incrementally restructuring existing code without changing or adding to the codes behavior⁴⁰. Refactoring techniques are used to take code that is correct and change the structure of the code to improve the code to make it easier to maintain and use. Refactoring relies heavily on unit tests to ensure that the code still works correctly after each refactoring step.

Measuring code quality. A commonly used tool for measuring code quality is SonarQube (http://www.sonarqube.org/). SonarQube combines static code analysis with the results of the unit tests to provide an overview of the overall code quality as well as the change to the quality that occurs whenever a change is made to the code.

Continuous Integration. Continuous integration (CI)⁴¹ is a software engineering practice in which changes to the code are automatically tested, and code quality is re-measured whenever the changes are committed to the VCS. The goal of CI is to provide rapid feedback so that changes that introduce issues or break the build can be corrected rapidly.

Test Driven Development. An agile development methodology that fits very well with bioinformatics software development is Test Driven Development(TDD)³⁹. In brief, TDD describes a short development cycle for adding improvements or new functionality. First the developer writes an, initially failing, automated test that defines the bug, improvement or new functionality. Code is then written to make the test pass. This is followed by refactoring the new code to bring it up to acceptable quality. Figure 7 summarizes the TDD cycle.

Figure 7. The test-driven development process that was used while developing MzJava. This figure is an adaptation of an image from the blog

https://softwareasscience.blogspot.ch/2014/02/tdd-test-driven-development-in-practice.html

Dans le document Developing algorithms to automate the identification of post translational modification in LC-MS/MS data (Page 21-24)