Program understanding using ontologies and dynamic analysis

Thesis

Reference

BELMONTE TORREJON, Javier

Abstract

Understanding software is essential to its maintenance. Indeed, no maintenance activity can be carried out without first understanding the part of the software that will be modified. Moreover, the complexity of software understanding is evidenced by its cost: it accounts for about half of the total cost of the software development cycle.

Consequently, our research aims to improve the efficiency of developers in software maintenance tasks by helping them understand the source code of the program concerned. Drawing on research in psychology, we identify the three kinds of knowledge required to understand a program: its purpose(s), the structure of its source code, and an argumentation on the ability of the latter to satisfy the former. Then, using an innovative ontology-based representation of elementary information processing, we propose an artifact composed of three layers and capable of modeling the three kinds of knowledge. Finally, after having verified [...]

BELMONTE TORREJON, Javier. Program understanding using ontologies and dynamic analysis. Thèse de doctorat : Univ. Genève, 2014, no. Sc. 4742

URN : urn:nbn:ch:unige-456861

DOI : 10.13097/archive-ouverte/unige:45686

Available at:

http://archive-ouverte.unige.ch/unige:45686

Disclaimer: layout of this document may differ from the published version.

UNIVERSITÉ DE GENÈVE FACULTÉ DES SCIENCES Département d’informatique

Haute École de Gestion de Genève

Professor D. Buchs Professor Ph. Dugerdil

Program Understanding Using Ontologies and Dynamic Analysis

THESIS

presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science, specialization in computer science

by

Javier Belmonte Torrejon

from

Bolivia

Thesis No. 4742

GENEVA

Atelier d’impression Uni-Mail, 7 January 2015

Acknowledgements

I would first like to thank Prof. Philippe Dugerdil for his supervision, which was as demanding as it was encouraging. It is undoubtedly thanks to his support that I have successfully reached the end of this research work. I would also like to thank Prof. Didier Buchs for his remarks during our meetings; on every occasion they advanced my research substantially.

Thanks also to Prof. Christophe Roche, whose help was essential in making the use of ontologies in my work a reality. Very special thanks to Prof. Prabhakar TV, to his family and to his team of students, doctoral candidates and developers at IIT-Kanpur; their warm welcome and invaluable help were decisive for the success of a large part of my work.

Finally, many thanks to my family, friends and colleagues, who encouraged me throughout these years. I am certain today of the positive influence that their thoughts for me and their confidence in my success have had on my state of mind.

Summary

It is well known that maintenance is the most expensive part of the development cycle: it represents between 60% and 90% of the total cost of a program [Erlikh, 2000]. Although it is extremely important as well, it is less well known that nearly two-thirds of maintenance costs are devoted to understanding the source code [Müller et al., 1993]. This is not surprising since, to be able to carry out any maintenance activity, it is necessary to understand how the source code works and why it works that way [Clayton et al., 1998], at least for the code that must be modified for the maintenance task.

The understanding required for maintenance is no different from the understanding that developers have of the program during its implementation. Indeed, developers need to be aware of the elements of the system responsible for the parts of the specification for which they are responsible. According to [von Mayrhauser and Vans, 1995], this knowledge is built unconsciously and takes the form of a mapping between the elements of the business domain (the problem domain) and those of the system domain (source code). These correspondences would then represent their understanding of the program under development.

To define the problem we tackled, we complemented the definition of software understanding proposed by Biggerstaff [Biggerstaff et al., 1993] with the results of the psychology research carried out by Perkins [Perkins, 1986]:

“Software understanding, considered not as a process but as the product of a process, is obtained when the following three kinds of knowledge are acquired: (1) a structure of the program, (2) the business functionalities automated by the program, and (3) the explanation of the program’s ability (in terms of its structure and behavior) to carry out said functionalities.”

Next, our definition of program understanding allowed us to establish a three-layer meta-model for the mapping proposed by von Mayrhauser. We suggest that the models built by following this meta-model, which we have called “understanding artifacts”, are similar to those that a developer would build mentally; they would therefore be easy to assimilate. We confirm this hypothesis empirically and then propose a process for the systematic production of these artifacts.

The innovative representation of the atomic business tasks requested of the program, i.e. the “manipulations”, is central to our production process. Indeed, being defined using concepts from the business domain and from the system domain (elementary actions implemented by computer systems), manipulations offer a middle point where business functionalities and source code can be connected. Since manipulations represent dynamic notions, the most interesting aspect of this mapping is that our understanding artifact is effectively able to explain not only the structure of the program but its behavior as well.

The use of manipulations is also essential to the automation of the production process. More precisely, it is the ontology-based model of manipulations that makes possible a static analysis technique for their automatic detection. Next, the manipulations detected during the execution of a program are used to define an intermediate model of behavior. Finally, we use a dynamic analysis technique to re-establish the relations between the source code and the elements of a decomposition of the business functionalities.

Abstract

It is commonly known that maintenance is the most expensive part in the life cycle of a software system; it represents between 60% and 90% of the total cost of a program [Erlikh, 2000]. While extremely important as well, it is less commonly known that almost two-thirds of the maintenance cost are devoted to software comprehension or understanding [Müller et al., 1993]. This is not surprising given that before being able to perform any kind of maintenance activity, at least the part of the program that needs to be modified has to be understood in terms of how it works and why it works like that [Clayton et al., 1998].

The understanding of the program required for maintenance is no different to that which the developers of the program had while implementing it. Indeed, during the process of implementing a program, the developers are also required to keep in mind which elements of the system satisfy what part of the specification. According to [von Mayrhauser and Vans, 1995], this understanding is built unconsciously and takes the form of a kind of mental mapping between the elements of the problem domain (business domain or real world) and the elements of the system domain (source code). This mapping represents their understanding of the program being developed.

We complement a popular definition proposed by Biggerstaff [Biggerstaff et al., 1993] with Perkins’ research in psychology [Perkins, 1986] to establish our definition of program understanding:

Program understanding, considered to be not a process but a process’s outcome, is achieved when three different kinds of knowledge are acquired: (1) a structure of the program, (2) the business domain functionalities carried out by the program and (3) an explanation of the program’s structure and behavior that justifies its ability to perform said functionalities.

Next, based on our definition of program understanding, we propose a three-layer meta-model for the understanding mapping put forward by von Mayrhauser. We suggest that the models built by following our meta-model (models that we have named “understanding artifacts”) are close to what a software engineer would build mentally; consequently, they would be very easy for developers to assimilate. We confirm this hypothesis empirically and subsequently propose a procedure for the systematic production of these understanding models.

The innovative way to represent the atomic business tasks implemented by the program, i.e. the “manipulations”, is at the center of the understanding artifact’s production procedure. Indeed, since manipulations are defined using business domain concepts and system domain concepts (elementary information processing actions), they offer a middle point where business domain functionalities and source code can be reconnected. Because manipulations represent a dynamic notion, the most interesting aspect of such reconnection is that our understanding artifacts are able to explain not only the program structure but its behavior as well.

The use of manipulations is also essential to the automation of the production procedure. Specifically, it is because of the ontologies at the center of our manipulation model that we can propose a static analysis technique for the detection of manipulations in the source code. Next, we use the manipulations detected while carrying out a functionality to define an intermediary representation of the program behavior. Finally, a dynamic analysis technique is applied to reestablish the relations between the source code and the elements of the manipulation-based decomposition of each business domain functionality.

Chapter 1 Introduction

It is commonly known that maintenance is the most expensive part in the life cycle of a software system; it represents between 60% and 90% of the total cost of a program [Erlikh, 2000]. While extremely important as well, it is less commonly known that almost two-thirds of the maintenance cost are devoted to software comprehension or understanding [Müller et al., 1993]. This is not surprising given that before being able to perform any kind of maintenance activity, at least the part of the program that needs to be modified has to be understood in terms of how it works and why it works like that [Clayton et al., 1998].

The understanding of the program required for maintenance is no different to that which the developers of the program had while implementing it. Indeed, during the process of implementing a program, the developers are also required to keep in mind which elements of the system satisfy what part of the specification. According to [Storey, 2005; von Mayrhauser and Vans, 1995], this understanding is built unconsciously and takes the form of a kind of mental mapping between the elements of the problem domain (business domain or real world) and the elements of the system domain (source code). This mapping represents their understanding of the program being developed.

Following research in psychology (cf. section 1.1.2), we propose a three-layer meta-model for this understanding mapping (cf. chapter 2). We suggest that models built with our meta-model are close to what a software engineer would build mentally, and are thus very easy for developers to assimilate. We confirm this hypothesis in chapter 3 and then, in chapter 4, we introduce a procedure for the systematic production of these understanding models. In chapter 5 and chapter 6, we describe how we automated the static and dynamic analysis specified in our procedure. We performed two experiments, which are discussed in chapter 7.

1.1 Defining the problem

1.1.1 Our research scope

Although necessary to all software engineering activities, the study of program understanding is most interesting in the context of maintenance tasks.

Indeed, unlike other activities, maintenance involves modifying the program after its delivery, potentially even after the program has stopped being developed. Thus, while most of the time other activities are performed by software engineers that have gradually obtained an understanding of the program during development, maintenance frequently involves having to gain understanding of an unfamiliar program.

We decided to focus our research on that latter case: program understanding in the context of maintenance tasks. Furthermore, to tackle the problem in its greatest generality, we consider the maintenance context that makes the smallest number of assumptions possible: a lack of documentation on the program paired with the unavailability of its original developers. However, because the program is being maintained, we suppose that it is still being used and that its business domain is still relevant. Therefore, we can assume that we have access to a user of the program and to an expert of the business domain.

1.1.2 Understanding in general

The understanding process begins with something that is not understood. Regardless of how much information is known about the latter, understanding involves the acquisition (by creation or transfer) of new knowledge. This seems indeed to be the case since, once something is understood, the process of understanding does not need to be repeated. In other words, the process’s result has become part of the knowledge base and can be recalled when needed.

According to Perkins [Baron, 2009; Perkins, 1986], this new knowledge involves three different aspects of what we wish to understand: its purpose or purposes, its structure, and an argumentation, based on the structure, supporting the fulfillment of the purposes.

Although the descriptions of the structure and of the purposes depend on existing knowledge and understanding, they are both self-contained and can be stated independently. However, since the goal of the argumentation is to explain the involvement of the structural elements in the fulfillment of the purposes, the argumentation depends on the other two kinds of knowledge. Indeed, the facts and beliefs that compose the argumentation depend not only on existing knowledge and understanding but also on the structural elements and purposes. From a different perspective, we could say that although all three kinds of knowledge are necessary for it, understanding is never achieved by adding a new structural element or a new purpose to the existing knowledge.

While the last piece of knowledge required to understand something is always of the argumentation kind, not every new fact or belief added to the argumentation produces understanding. Indeed, understanding is brought only by the facts or beliefs that explain (possibly indirectly) how some purpose is fulfilled by part of the structure. The understanding obtained by every such new piece of knowledge is naturally partial, limited to the involved structural elements. Thus, given a set of purposes and a structure, the argumentation explaining the satisfaction of all purposes can be built gradually, relying on a partial understanding of individual purposes.

Understanding depends on the established structure of what needs to be understood as well: it has an impact on the depth of the understanding. This impact is a logical consequence of the fact that structures can, in most cases, be described at different abstraction levels. Of course, the complexity of the argumentation to be built is at least proportional to the level of detail or depth of the structure. Thus, using a structure that is only as complex as necessary is important to ensure efficiency in the search for an argumentation.

1.1.3 Understanding programs

Program understanding in the context of maintenance activities means understanding a program’s source code. Thus, for the source code to be understood, we need to identify its purpose and an appropriate structure with which we can argue the source code’s potential to fulfill that purpose.

Although other purposes exist for programs, e.g. academic demonstrations, we focus only on the set of functionalities that each program has been implemented to carry out. In fact, since most pieces of source code exist to participate in at least one functionality, maintenance tasks can be expressed in terms of the functionalities that need to be added, fixed, improved or reused.

Because of our particular maintenance context, we further restrict the scope of purposes and focus exclusively on the program purposes that are meaningful or relevant to the program’s users: the business domain functionalities. This leaves most of the utility source code, e.g., test code or installer code, outside the scope of our work.

When the purposes to be analyzed are constrained to the program’s business functionalities, Perkins’ general definition of understanding is compatible with Brooks’ view that program understanding can be achieved by recreating the links between the problem domain and the program’s source code through hypothesis generation, refinement and validation [Brooks, 1983]. Indeed, if we consider the main hypothesis to be that the program has some particular business purposes, then the argumentation explaining how the structure of the program fulfills those purposes should validate it.

Even though there may be countless ways to build such an argumentation, we focus on one inspired by Biggerstaff’s description of program understanding in [Biggerstaff et al., 1993]:

“A person understands a program when they are able to explain the program, its structure, its behavior, its effects on its operational context, and its relationships to its application domain in terms that are qualitatively different from the tokens used to construct the source code of the program.”

In other words, every identified characteristic of a program (structure, behavior, etc.) needs to be explained in terms that are “qualitatively different” from the source code’s programming language tokens. Since the latter feature is satisfied by business domain terms, building an argumentation for how the source code structure is able to carry out a business functionality would not only immediately satisfy Brooks’ definition; it would also explain the program’s structure, behavior and its relationships to the application domain in a way that satisfies Biggerstaff’s definition as well.

Since our maintenance scope focuses only on the program’s source code, the program’s effect on its operational context, the last aspect in Biggerstaff’s definition, falls outside of our scope. Most of Biggerstaff’s definition is thus covered by applying Perkins’ view on understanding to programs, as long as their purpose is evaluated in terms of the business domain functionalities they carry out.

Definition 1.1. Our definition of program understanding in the maintenance context defined in section 1.1.1 is the following:

Program understanding, considered to be not a process but a process’s outcome, is achieved when three different kinds of knowledge are acquired: (1) a structure of the program, (2) the business domain functionalities carried out by the program and (3) an explanation of the program’s structure and behavior that justifies its ability to perform said functionalities.

Like Brooks’ and Biggerstaff’s, our definition takes the form of an artifact. We refer to it as the “understanding artifact”, to distinguish it from the understanding process by which understanding is achieved.

1.2 State of the art

In this section we focus on describing third-party research work explicitly aimed at helping developers during program understanding tasks. Although we have tried to give a general view of the domain, because of the nature of our research, we could not avoid having a bias towards the research aiming to provide source code elements with some form of meaning.

1.2.1 Concept assignment

Besides a definition of program understanding, which we have combined with research in psychology to state our definition of the problem (cf. section 1.1.3), [Biggerstaff et al., 1993] introduced “concept assignment” as a general approach to tackling the problem. Concept assignment refers to the search for “human-oriented” concepts and their assignment to the elements of the program code. Biggerstaff explained that during program understanding, the software engineer has to discover the human-oriented concepts in the program’s elements and interrelate them into a “human-oriented expression of computational intent,” something very similar to what we have called the purpose of the program.

While the human-oriented concepts may be implemented in multiple ways, most of the research in this field is focused on looking for them in the identifiers used by the developers in the source code. This might be because identifiers can be assumed to contain information about the concepts implemented in the program. Indeed, while programming, software engineers need to refer to the information handled by their program and identifiers are a straightforward way to do so. Then, using natural language words, programmers create identifiers for variables, functions and classes that connect them to the implemented concepts. We assume this is the case because assigning random identifiers is a way to obfuscate code, and it is very unlikely that developers would willingly obfuscate their source code since it would then become very hard or impossible to work with (cf. [Ratiu, 2009]). This assumption has been previously agreed on and used to extract domain knowledge from the source code in [Haiduc and Marcus, 2008; Zhou et al., 2008; Ratiu et al., 2008].
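The assumption that identifiers are built from natural language words can be illustrated with a minimal sketch. It is not taken from any of the cited works: the function names `split_identifier` and `vocabulary` are hypothetical, and the splitting rules (underscores, digits and camel-case boundaries, ASCII letters only) are a simplifying assumption.

```python
import re
from collections import Counter

def split_identifier(identifier: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase words.

    Assumption: words are made of ASCII letters; digits and punctuation
    only act as separators.
    """
    words = []
    for part in re.split(r"[_\W\d]+", identifier):
        # Acronym followed by a capitalized word, a regular word, or a trailing acronym.
        words.extend(re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+", part))
    return [word.lower() for word in words]

def vocabulary(identifiers: list[str]) -> Counter:
    """Count how often each word occurs across a list of identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts

print(split_identifier("getAccountBalance"))
print(vocabulary(["getAccountBalance", "account_id", "XMLParser"]).most_common(1))
```

The resulting word counts are the kind of raw material that the approaches discussed in this section compare against a list of domain terms.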

In [Anquetil, 2001], the author analyzes the information found in informal sources, i.e., comments and identifiers, to decide whether it is worth trying to use them in the recovery of traceability links between domain concepts and implementation components. Both sources are analyzed by a partially automated process to extract the terms that refer to domain concepts. These terms are then manually categorized into three domains: application, computer science and general domain. Anquetil remarks that although a high percentage of the concepts found in comments (~80%) and in identifiers (~60%) are general concepts, the category of the most repeated concepts in each source can be used to differentiate them: computer science concepts are repeated the most in comments and application concepts in identifiers.

While the difference in size between the domains seems to be the cause of the generally more dominant presence of general concepts, the author explains that the different needs fulfilled by each source can explain the difference in the categories of most repeated concepts. Indeed, identifiers need to convey which application concepts they implement and, most of the time, comments focus on explaining the implementation details. Because of their higher frequency in both sources, it should be possible to filter general concepts out, leaving only application and computer science concepts. The difference between the most repeated concepts justifies the use of identifiers in concept assignment approaches. Indeed, it is application domain concepts that need to be assigned to source code elements.

The same sources of informal knowledge are studied by Haiduc and Marcus in [Haiduc and Marcus, 2008]. In their work, the authors manually build a list of domain terms in graph theory and use it to analyze their use in six Application Programming Interfaces (APIs) related to that domain. The presence of domain terms in comments and identifiers is higher in the APIs than in the programs analyzed in [Anquetil, 2001]: ~90% in comments and ~77% in identifiers; this is probably the case because of the level of specialization of the analyzed source code. The authors then compare the use of domain terms between the different APIs, finding that while the probability of two people choosing the same term to refer to the same concept is in general as low as 20%, this percentage is as high as 63% in the case of developers choosing the terms used in comments and identifiers. Given that the APIs in the study were developed by different teams, such a high percentage is astonishing and shows that there is an important level of agreement in the choice of terms made by developers.

In [Antoniol et al., 2007], the authors focus on a different characteristic of the vocabulary used in identifiers: its stability through the program’s evolution. In particular, they wonder how it compares to the stability of the program’s structure. Some program entities, i.e., files, classes and methods, are followed between releases of the program to analyze their evolution. If two entities of consecutive releases of the program have the same name, then they are assumed to be the same entity. Entities that have no corresponding entity in a previous release by name are analyzed using a structural vector representation of entities the authors had previously developed; if the distance between the vector of an unmatched entity and that of any entity of a previous release is small enough, then a renaming of the entity is presumed to have taken place. Otherwise, the entity is considered to be new. This structural vector distance is also used to compute a measure of the program structure stability between different releases of the program. Comparing the rate of entity renaming and the structural stability of three large programs, the authors conclude that the stability of the identifiers’ lexicon is always higher. Moreover, the rate at which entities are renamed decreases over time, meaning that the lexicon tends to stabilize. The authors’ explanation for these findings is that the lexicon “forms an essential corner stone of the programmers’ understanding and mental models”.

1.2.2 Use of domain knowledge

The coherence between different developers’ choice of words in comments and identifiers, and the stability of identifiers, suggest that developers have enough understanding of the domain to use its concepts correctly. Indeed, high agreement rates on the terms used to refer to a concept can only happen if there is also a strong agreement on the referred concept. Hence, it would make sense to use a conceptualization of the domain, e.g., an ontology, as a source of information during software engineering activities involving identifiers, including concept assignment. It is, however, only recently that there has been an interest in incorporating ontologies into the software engineering process. In [Djuric and Devedzic, 2011], the authors explain that this might be the case because the effort in developing ontology languages and tools is focused on building the Semantic Web and not on programming or software engineering practices.

In [Zhou et al., 2008], the authors propose the construction of an Application Specific Ontology (ASO) to help maintainers understand programs. This ontology is a combination of a domain ontology, built by domain experts, and a class diagram ontology built by exploiting the similarities between Unified Modeling Language (UML) class diagrams and ontologies: hierarchical structure, properties and relations. The authors propose an ontology mapping algorithm to connect the concepts of these ontologies and present a case study. While the presented algorithm is very optimistic and, in our opinion, only works if the ontologies are extremely similar, the cited uses of such an ontology are interesting. First, because of the connections between classes and domain concepts, the ontology is a form of concept location; and second, since the domain ontology has to be respected by the implementation, the connections in the ontology could be used to detect certain kinds of design defects.

A similar approach was taken by Ratiu and Deißenböck in [Ratiu and Deißenböck, 2006a,b]. In [Ratiu and Deißenböck, 2006b], the authors describe an ontology mapping approach for concept location based on two ontologies: an ontology of the real world and another representing the identifiers in the code. Although WordNet¹ is only a thesaurus, the case study presented in these articles uses it as an ontology-like model of the real world. To create a similar representation of a Java program, the authors introduce a number of heuristics that translate certain program relations into those used in WordNet, e.g., synonymy, hypernymy and hyponymy. Since the relations are then the same in both ontology-like representations, the authors transform them both into directed graphs and use a graph matching technique to perform the ontology mapping. This technique was applied on JFreeChart², an open source Java library for drawing charts, finding that 20% of the identifiers in the source code could be mapped to a WordNet concept. The mapping obtained in that case study was used in [Ratiu and Deißenböck, 2006a] to semi-automatically detect semantic problems caused by a flawed use of real world concepts, a use of the ontology mapping also proposed in [Zhou et al., 2008]. The authors introduce a framework for the definition of these semantic problems, documenting all possible deviations from what they consider to be the ideal case, i.e., that every concept in the ontology is referred to by a unique word in the source code and implemented in exactly one program element. Using this framework they were able to detect that the terms “ellipse” and “oval” are used to refer to the same concept in the Standard Java API.

¹ http://wordnet.princeton.edu

² http://www.jfree.org
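One deviation from the ideal case described above, a concept referred to by more than one word, can be checked with a very small sketch once a word-to-concept mapping is available. The function name `concepts_with_several_words`, the concept labels and the example mapping are hypothetical; the graph matching step that produces the mapping in the cited work is not reproduced here.

```python
from collections import defaultdict

def concepts_with_several_words(word_to_concept: dict[str, str]) -> dict[str, list[str]]:
    """Return the concepts that are referred to by more than one word.

    `word_to_concept` maps a word found in identifiers to the concept it was
    mapped to (assumption: this mapping was produced by an earlier step).
    """
    words_by_concept = defaultdict(set)
    for word, concept in word_to_concept.items():
        words_by_concept[concept].add(word)
    return {concept: sorted(words)
            for concept, words in words_by_concept.items()
            if len(words) > 1}

# Made-up mapping; the concept labels are illustrative only.
mapping = {"ellipse": "ELLIPSE", "oval": "ELLIPSE", "rectangle": "RECTANGLE"}
print(concepts_with_several_words(mapping))
```

A symmetric check, counting the program elements that implement each concept, would cover the second half of the ideal case.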

However, business domain ontologies are not the only ones used to provide help in software engineering activities. In [Hyland-Wood et al., 2008], the authors aim to use an ontology of software engineering concepts that they have developed, to enhance the maintenance process. This ontology’s novelty is that it does not focus only on object-oriented software components but on requirements, tests and metrics as well. Once populated with concept instances concerning a program, the goal of their ontology is to answer questions like:

1. Which requirements have not yet been revalidated after a change?

2. Which tests have failed?

3. Which requirements relate to failed tests?

Most of the relationships in the ontology can be populated automatically; however, those concerning requirements still need to be entered by hand since they are usually stated in Natural Language (NL). Some previous successful experiments undertaken with this ontology led them to propose a software maintenance methodology for developers based on metadata annotations. Once in the ontology, such information could be exploited to direct maintenance activities. This is particularly interesting if the maintenance tasks are to be carried out in a decentralized environment, like that of open source projects. Indeed, the ontology could first be used to centralize all change requests, and then to automatically propose to the developers the maintenance activities that they are best suited to carry out.
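Questions 2 and 3 above can be illustrated with a minimal sketch in which the populated ontology is reduced to a set of subject, predicate, object triples. The relation names `verified_by` and `has_result`, the identifiers and the data are hypothetical and do not come from the ontology of [Hyland-Wood et al., 2008].

```python
# Hypothetical instance data: (subject, predicate, object) triples.
triples = {
    ("REQ-1", "verified_by", "test_login"),
    ("REQ-2", "verified_by", "test_export"),
    ("REQ-2", "verified_by", "test_import"),
    ("test_login", "has_result", "passed"),
    ("test_export", "has_result", "failed"),
    ("test_import", "has_result", "passed"),
}

def failed_tests(triples: set[tuple[str, str, str]]) -> set[str]:
    """Question 2: which tests have failed?"""
    return {s for s, p, o in triples if p == "has_result" and o == "failed"}

def requirements_with_failed_tests(triples: set[tuple[str, str, str]]) -> set[str]:
    """Question 3: which requirements relate to failed tests?"""
    failed = failed_tests(triples)
    return {s for s, p, o in triples if p == "verified_by" and o in failed}

print(failed_tests(triples))
print(requirements_with_failed_tests(triples))
```

A real ontology would typically be queried through a dedicated query language; the set comprehensions here only show which relations each question needs to traverse.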

Also in the requirements field, in [Jaroslaw, 2010], the authors remark that although several tools exist that analyze the consistency between the class model and other analytic models, none of them cover requirement specification. Therefore, the consistency between the class model and the requirements is more difficult to maintain over time. By using a domain ontology as an intermediary level, they propose to extract a glossary from the requirements expressed in natural language. This glossary could then be transformed into a basic starting version of the class model. Such a process would be iterative and would require a substantial amount of user interaction. However, the quality of the first set of classes thus obtained (in terms of completeness and consistency) would be superior since developers, who are usually not experts in the application domain, would be taking advantage of the “knowledge of the application experts stored in the domain ontologies”.

Recently, in [Carvalho, 2013] and [Carvalho et al., 2014], Carvalho et al. described a technique based on two ontologies: a program ontology, populated with the analyzed program’s elements, e.g., classes, methods and variables, and a problem domain ontology. Their technique focuses on using these ontologies to integrate the knowledge produced by a number of tools tackling specific problems during program comprehension activities. One of these tools, detailed in both articles, maps program elements to the concepts in the domain ontology they represent. However, the mapping process does not involve the ontologies since it is based exclusively on the probability of the concept names to be synonyms based on an a priori analysis of specialized software related corpora. In this research, the ontologies are used as a form of source code documentation.

While identifiers are ideal for the detection of application domain concepts, other uses have been imagined. In [Singer and Kirkham, 2008], the authors prove empirically that by looking at the last word of a class’s name, it is possible to expect certain micro patterns [Gil and Maman, 2005] to be satisfied by the class’s source code. Micro patterns are single-class syntactic properties that are implementation oriented, non-trivial but at the same time easily recognizable by static analysis techniques. Moreover, although most of the correspondences between class names and micro patterns seem to be respected only within a single group of developers, the share of correspondences that were global to all groups during their experiments is still notable (approximately 33%). The authors conclude that class names codify community conventions about programming, knowledge that could for example be used to search for badly implemented classes.

In [Fry et al., 2008], the authors focus on extracting representative verbs and direct objects from method signatures and comments. Their research is motivated by the fact that in most software maintenance tasks, the goal is to either add new actions or modify existing ones. Thus, the extraction of verbs is essential to the identification of actions in the source code. By combining available NL processing tools with a set of extraction rules, they establish an extraction process that is able to achieve 57% precision and 64% recall. However, because neither the verbs nor the direct objects have a structure, this assessment is done informally, attributing matching scores by individually considering the closeness of the detected actions to those annotated by human annotators.
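To make the idea of verb and direct object extraction more concrete, the following minimal sketch splits a method name into words and treats the first word as the verb and the remainder as its direct object. It is not the extraction process of [Fry et al., 2008], whose rules rely on NL processing tools; the camelCase convention, the "first word is the verb" heuristic and all function names are assumptions made for illustration only.

```python
import re

def split_identifier(name):
    """Split a camelCase identifier into lowercase words."""
    # Acronyms (e.g. "XML"), capitalized or lowercase words, and digit runs.
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+"
    return [word.lower() for word in re.findall(pattern, name)]

def verb_and_direct_object(method_name):
    """Naive heuristic (assumption): first word = verb, rest = direct object."""
    words = split_identifier(method_name)
    if not words:
        return None
    verb, rest = words[0], words[1:]
    direct_object = " ".join(rest) if rest else None
    return (verb, direct_object)

# Hypothetical method names, not taken from any analyzed program.
pair = verb_and_direct_object("addGuestToRoom")
acronym_words = split_identifier("parseXMLDocument")
```

Such a heuristic ignores comments and method parameters, which is one reason why a real extraction process needs additional rules.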

In fact, despite their general importance in human communication, actions and verbs do not seem to have been properly conceptualized yet. Nevertheless, we have been able to find some action ontologies, like the one built in [Kobayashi et al., 2011], where the authors try to align the set of possible actions that a robot can perform to the knowledge offered by Wikipedia. The set of atomic actions composing the ontology is extremely simple though, with almost all action concepts established in a single layer. Moreover, the authors have mixed verbs and objects from the domain in their ontology, thus reducing the chances that it could be reused in other domains.

Verbs and objects are differentiated in [Parisi et al., 2007], where the NL instructions in a script are analyzed in order to automatically produce a 3D animation. The authors propose to parse the scripts using the Stanford NLP parser³, which is able to identify the parts of the scripts’ sentences, e.g., nouns, verbs and adjectives; then, the identified parts are mapped to concepts of the ResearchCyc⁴ ontology, an upper ontology. The article is only a methodology proposal; thus, the difficulty of translating the abstract verbs extracted from the scripts into practical actions in the animation is raised but left unsolved.

1.2.3 Dynamic analysis

Another source of information used in program understanding is the behavior of a program while it is executed. Dynamic analysis involves the study of execution traces, the name usually given to the information gathered while the program is running. This kind of analysis has an intrinsic advantage over static analysis: it can benefit from runtime information like object identities or dynamic binding. However, another inherent characteristic of dynamic analysis is that it focuses on the part of the program that is executed. While this can be an advantage because it reduces the source code that needs to be analyzed, it is also a disadvantage since the results obtained are specific to the source code involved in the program execution and cannot, in general, be extended to the rest of the program.
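As an illustration of what an execution trace can contain, the sketch below records every method call made while a function runs, together with the runtime class of the receiving object. It relies on Python's `sys.settrace`; the classes, the `record_trace` helper and the example program are hypothetical, and a real tracer for Java programs would rely on different instrumentation.

```python
import sys

class Shape:
    def area(self):
        raise NotImplementedError

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14159 * self.radius ** 2

def total_area(shapes):
    total = 0
    for shape in shapes:
        total += shape.area()  # dynamically bound call
    return total

def record_trace(func, *args):
    """Run func(*args) and return the list of (runtime class, function) calls."""
    trace = []

    def tracer(frame, event, arg):
        if event == "call":
            receiver = frame.f_locals.get("self")
            owner = type(receiver).__name__ if receiver is not None else None
            trace.append((owner, frame.f_code.co_name))
        return None  # no line-level tracing needed

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return trace

# Objects are created before tracing starts, so constructors are not recorded.
trace = record_trace(total_area, [Square(2), Square(3)])
```

The trace shows both properties discussed above: the receiver of each `area` call is known to be a `Square` (runtime information), and `Circle.area` never appears because it was not executed in this scenario.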

Recently, in [Cornelissen et al., 2009], the authors surveyed the works in dynamic analysis published in the ten years preceding the survey. We looked through that survey to check if any reference was made to business domain knowledge being used to perform dynamic analysis but could not find anything. In fact, to the best of our knowledge, no other research group but ours is explicitly processing domain knowledge in dynamic analysis for program comprehension. It seems that most of the effort is applied to compressing or summarizing the execution traces, e.g., [Quante and Koschke, 2007] [Smit et al., 2008] [Kuhn and Greevy, 2006].

3http://nlp.stanford.edu/software/lex-parser.shtml

4http://www.cyc.com/platform/researchcyc

In [Asadi et al., 2010], the authors perform concept location based on Latent Semantic Indexing (LSI) techniques. Thus, their notion of concept is different from the one used in concept assignment. Indeed, in LSI approaches, a concept is an orthonormal dimension resulting from the reduction of dimensions of the original space, which, in this case, had as many dimensions as there were different terms in the methods called in an execution trace. Thus, concepts are sets of terms that, when considered together, differentiate methods from each other better than the original dimensions. The authors propose that methods executing a single feature of the program are conceptually cohesive and decoupled from those implementing other features. In other words, each segment of an execution trace where a single feature of the program is executed should be composed of methods forming a cluster in the LSI space. Their empirical results do not contradict their proposal, but are not enough to prove it right. However, even if demonstrated, the resulting kind of concept mapping represents very little help in program comprehension because these concepts do not represent anything that developers would readily understand. The authors might have recognized this problem since they introduce later on, in [Medini et al., 2012], a technique to label the obtained execution trace segments. However, while their technique improves their previous results, the obtained labels are simply smaller sets of terms than those obtained in [Asadi et al., 2010]. They do not represent any abstraction of the concept mapping.

1.2.4 Discussion

Recently, Biggerstaff’s definition (c.f. section 1.1.3) of program understanding has become popular among the articles in the domain. However, sometimes it is not clear how the research work contributes in the right direction. For example, while the clusters of methods obtained in [Asadi et al., 2010] and [Medini et al., 2012] could indeed correspond to the features executed by the program, the sets of terms mapped to the methods in each identified cluster do not represent a single notion or concept (in the sense that Biggerstaff gives to that word) that a developer could easily identify.


While ontologies are evidently a source of interest in program understanding activities, most approaches still exploit very small ontologies, or only analyze the relationships in the ontology that reduce their knowledge to an object-oriented representation of the business domain. In fact, the ontologies used in [Anquetil, 2001], [Haiduc and Marcus, 2008] and [Zhou et al., 2008] are so small and similar to the source code that one could think they had been created as a conceptual representation of the program’s class diagram instead of the domain. Moreover, although a real world ontology-like representation of the domain is used in [Ratiu and Deißenböck, 2006b] and [Ratiu and Deißenböck, 2006a], the analyses in these latter works and in those with small ontologies are all based on the concept hierarchy defined by the “is-a” relationship. Besides losing much of the knowledge in the ontology, doing so and then mapping the ontologies using graph matching techniques assumes that ontological concepts are implemented as classes in the program, which we have frequently observed not to be the case.

The novelty of ontology-based analyses might be one of the reasons why we have not been able to find any example of the use of business domain ontologies in dynamic analysis. Another important reason could be, in our opinion, that there has been very little development in understanding actions from a conceptual point of view. Indeed, because dynamic analysis focuses on the program’s behavior, the study of actions is essential to these techniques (c.f. [Fry et al., 2008] in section 1.2.2).

1.3 Research goal

Having reduced the scope of our research in section 1.1.1, we were able to clearly define the kinds of knowledge we believe need to be recovered to understand a program in that context (c.f. section 1.1.3). Expressed in more detail, two of our three research goals are:

1. To create a model for the program understanding artifact. In particular, the instances of the model should be capable of holding the three kinds of knowledge required by our program understanding definition.

2. To define a procedure the output of which would be such an understanding artifact.

The availability of such an artifact should help program understanding during maintenance by reducing the time necessary for developers to achieve understanding. Indeed, being close to what psychologists believe is needed to understand, we expect our artifact to facilitate the program understanding process by providing most of the knowledge necessary to achieve it (all three kinds of knowledge).

Of course, despite the artifact being useful during maintenance, to build it by hand would demand an effort at least equivalent to that currently required to understand a program without it. Thus, the third goal of our research is:

3. To automate the program understanding artifact building process.


Chapter 2 The understanding artifact

2.1 Introduction

The business domain was chosen to support the explanations because it is qualitatively different from source code. It is also a kind of knowledge necessary during any software development process. Indeed, most of a program’s requirements are expressed in terms of the business domain functionalities to be automated. Moreover, business domain functionalities and source code can be thought of as opposite ends of the development process, as shown in Figure 2.1.

Figure 2.1: Development process, from business domain functionalities to source code

During the development process, traceability links can be used to keep track of the source code elements implementing the requirements made for the program. Since this connection brings them close to Brooks’ definition of program understanding (c.f. section 1.1.3), it is not surprising that these links are helpful during program understanding. However, traceability links do not completely satisfy our definition of program understanding, notably because they lack the capability to deal with program behavior.

While the capacity to fulfill our definition of program understanding makes traceability links different from our understanding artifact, both models distinguish the connected elements (opposite ends of the development process) from the connecting elements (links and argumentation respectively). Thus, we have divided this section into two parts:

1. In section 2.2, we introduce a formalization for the types of knowledge to be connected, i.e. the source code structure and the business domain functionalities.

2. In section 2.3, we describe a model for the type of knowledge connecting the previous two, i.e. the argumentation links.

2.2 Knowledge to be connected

2.2.1 Source code

Source code is essentially a very specialized form of human readable text containing a sequence of computer instructions (c.f. section 1.1.1). Evidently, multiple structures exist for computer programs (e.g., file systems, software architectures), in particular the one defined by the source code’s programming language. In this section we will identify which one among the existing structures is the most appropriate for our goal.

Since any source code can be considered to be a sequence of computer instructions, we could start our search for a program’s structure there. However, the number of source code instructions is several orders of magnitude larger than the number of business functionalities. As we remarked in section 1.1.2, this difference would greatly increase the complexity of the argumentation to be made. Moreover, while program instructions are interesting because they are common to all kinds of source code, their granularity is such that they do not hold any information that could justify them being linked to one business functionality instead of another. Without a context, determined by the surrounding instructions in the program, a single instruction cannot be given any particular business meaning; identical instructions might very well be used to implement completely unrelated business functionalities in different parts of the code.

Consequently, our search for an appropriate source code structure can justifiably jump directly to the smallest block of instructions capable of having a business meaning or purpose: subroutines. Given our focus on procedural languages, particularly Java, we will henceforth refer to subroutines as methods. Methods are defined as a block of instructions working as a unit to perform a specific task, and although this task could have no more business meaning than any single computer instruction (e.g., perform an index by index sum of two integer arrays), it could just as well have a business domain level relevance (e.g., compute a resulting force vector). This is the feature we are interested in and it happens to differentiate methods from any other smaller block of instructions.
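The two examples given above can be sketched as follows. The function names and the physics setting are hypothetical; the point is only that the same instructions acquire a business meaning once they are wrapped in a method whose task is relevant to the domain.

```python
def add_arrays(a, b):
    """Index-by-index sum of two arrays: no business meaning on its own."""
    return [x + y for x, y in zip(a, b)]

def resulting_force(forces):
    """Resulting force vector: a task with business-level relevance
    in a (hypothetical) physics domain. Assumes a non-empty list of
    force vectors that all have the same number of components."""
    total = [0.0] * len(forces[0])
    for force in forces:
        # The very same technical operation, now used for a domain purpose.
        total = add_arrays(total, force)
    return total

net = resulting_force([[1.0, 0.0], [0.5, 2.0]])
```

Both functions execute identical additions; only the second one can be linked to a domain concept, which is why the method, and not the instruction, is taken as the base element.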

Should we choose a structure at a higher level of abstraction than methods, we would be forced to keep an aggregated business meaning (that of all the methods inside a structural element), thus losing the knowledge of the business tasks implemented by individual methods. The advantage, of course, would be having fewer structural elements to connect together into an explanation. However, the power of argumentation would be reduced as well, thus hindering the help that we would be able to provide to understand the code.

We decided to choose the methods as the base elements for the source code structure because: on the one hand, they are the smallest appropriate block of instructions; on the other hand, since most of the source code exists within methods, most elements of structures at a higher level of abstraction acquire their business meaning almost exclusively from the methods they contain.

2.2.2 Business functionalities

We chose to represent the program’s purposes as the fulfillment of business domain functionalities. Although it was not brought up in our definition, the formalization of business domain functionalities also requires finding a structure with an appropriate granularity for our goal.

If a functionality can be accomplished by a computer, it usually means that it can be decomposed into simpler functionalities; there are then multiple levels of granularity. Moreover, the decomposition of functionalities follows the direction of the development process, with code instructions fulfilling the tiniest functionalities. However, since source code instructions cannot represent business functionalities because of their level of granularity (c.f. section 2.2.1), they cannot represent business purposes either.

This means that, at some point in the development process, the decomposition of the functionalities reaches a point beyond which the resulting functionalities no longer represent business purposes. Since we use functionalities to represent the purpose of the program, we focus our attention on a granularity level before that tipping point. This granularity level is actually that of services, as defined by IBM’s Service-Oriented Modeling and Architecture (SOMA) [Arsanjani et al., 2008]:

“We have found that it is better to refrain from adding all existing fine-grained system functions as candidate services; only the ones that provide business value to the enterprise should be added.”

In other words, any business domain functionality that provides business value can be used to represent a purpose of the program. However, the further these functionalities are from the tipping point, the larger the gap between purpose and structure that our argumentation has to cover.

Definition 2.1. Consequently, we define the accepted business domain functionalities as the simplest functionalities that can still be considered to be a service to the business domain.

2.3 Connection knowledge

The third kind of knowledge is the argumentation, which connects the program’s source code to the business domain functionalities in a way that explains the capacity of the former to carry out the latter. Once again we need to define the most appropriate structure to hold this knowledge; however, in this case, there are no existing models to choose from.

The insights for the model representing this kind of knowledge in our understanding artifact were that:

1. Because they are two aspects of the same program, business functionalities and source code are already connected as a whole:

• To perform all business functionalities is the reason why a program is made.

• The sequence of instructions in the source code is how a program performs the business functionalities.

2. Two similar questions can be asked about each of these two aspects of the program in order to bring them closer.

• From the perspective of the purpose: what does the program need to do to perform each business functionality?


• From the perspective of the source code structure: what do source code methods accomplish that helps perform the business functionality in which they participate?

3. To be answered, both questions require an intermediary structure between source code and purpose. We propose to use a single structure for both of them, a structure that could at the same time answer the more general question: what does the program do?

We propose then to represent the three kinds of knowledge required for program understanding as three layers, linked to each other, each answering a different question about the program.

1. First layer, why was the program implemented?

2. Second layer, what is the program expected to do to fulfill each one of its purposes?

3. Third layer, how does the program carry out what it has to do?

These three layers can be assumed to be connected because they represent different aspects of a program. However, an explicit connection is also possible because each layer offers a finer granularity than the immediately more abstract layer: first, a whole business domain functionality, then a subdivision of the functionality into smaller tasks, and finally the methods, the smallest instruction block that can be considered a task unit. If we add to this the fact that business functionalities are already a subdivision of the “program’s purpose” as a whole, we identify a general structure for all three layers, represented in Figure 2.2:

• Each layer is established at a particular level of abstraction from the source code. This level of abstraction defines the granularity of the knowledge that each layer holds about the program (i.e., why, what and how).

• The elements of each layer build an explanation of the elements of the immediately more abstract layer. Therefore, a hierarchic mapping connects the elements of all layers:

– Business domain functionalities are subdivided into tasks to explain what the program needs to do to fulfill them.

– Tasks are linked to source code methods to explain how the program implements them.
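The hierarchic mapping described above can be sketched as a small data structure. This is only an illustration of the layering, not the model defined in the remainder of this chapter; the class names, field names and the hostelry example (including the Java-like method names) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Method:
    """Third layer ("how?"): a source code method."""
    qualified_name: str

@dataclass
class Task:
    """Second layer ("what?"): a subdivision of a functionality."""
    description: str
    methods: list = field(default_factory=list)

@dataclass
class Functionality:
    """First layer ("why?"): a business domain functionality."""
    name: str
    tasks: list = field(default_factory=list)

def explain(functionality):
    """Walk the hierarchy from the most abstract to the most concrete layer."""
    lines = ["WHY: " + functionality.name]
    for task in functionality.tasks:
        lines.append("  WHAT: " + task.description)
        for method in task.methods:
            lines.append("    HOW: " + method.qualified_name)
    return lines

# Hypothetical example from a hostelry domain.
register_guest = Functionality(
    "Register a new guest",
    tasks=[
        Task("Validate guest data", [Method("GuestForm.validate")]),
        Task("Create guest record", [Method("GuestDao.insert")]),
    ],
)
explanation = explain(register_guest)
```

Each element is explained only by the elements of the layer immediately below it, which is what keeps the argumentation between purpose and structure short at every step.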

In the following sections we define precisely the level of abstraction of each layer as well as the kind of decomposition they represent in relation with the elements at the immediately more abstract layer. Given that we have already defined the granularity level for the first and third layers (c.f. section 2.2), these descriptions are halfway done for them. It should be noted that since the program’s behavior is not described in the first and third layers, it is a requirement left for the second layer (c.f. section 2.3.2).

Figure 2.2: The layers’ lower abstraction elements composition hierarchy (program's purpose; 1st layer: "Why?"; 2nd layer: "What?"; 3rd layer: "How?"; source code)

2.3.1 First layer

The abstraction level at this layer is that of business domain services. In section 2.2.2 we defined the granularity of the layer by requesting that the business domain functionalities not be decomposable into simpler business domain services. Thus, the elements in this layer are considered atomic at their abstraction level.

Because they are atomic, the functionalities in this layer could certainly be combined into more complex functionalities. However, since explaining the program structure in terms of the most basic business functionalities is enough to satisfy our program understanding model, we leave widening the scope of our artifact to larger functionalities to future work. The program’s purpose is then represented in our model as a set of atomic business domain functionalities.

2.3.2 Second layer

We can assume that the carrying out of a functionality can be subdivided into smaller executable units since the source code is an example of this. We are, however, interested in an abstraction level higher than the source code’s, one that is at the same time related to the business domain and whose implementation can be related to the source code methods in the third layer.

Because the elements into which the business functionalities are decomposed in this layer represent what the system must do, we call them “tasks”.

Tasks should remain connected to the business domain so that they can be used to explain the source code. At the same time, because they are implemented by a computer program, they necessarily represent some sort of information processing to be accomplished. In both cases, tasks cannot be performed in an arbitrary order because they depend on each other. In fact, the way in which they interact while carrying out a functionality is what allows them to explain the program behavior as well as its structure. Indeed, while mapping the source code methods to the business knowledge in this layer’s tasks explains the source code’s structure, the knowledge of how these tasks depend on each other to carry out the functionality should explain why the source code methods intervene when they do.

The tasks’ interdependence gives rise to a number of acceptable task sequences that can be formalized using a control flow diagram. Multiple options with similar expressive power are used in computer science, e.g., flowcharts and activity diagrams. We chose to use the Business Process Model and Notation (BPMN) because it aims at supporting business users as well as technical users, making it easier to understand for non-developers.
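The notion of acceptable task sequences can be illustrated with a deliberately reduced sketch in which the interdependence of tasks is expressed as simple "must be completed before" constraints. BPMN is considerably more expressive (gateways, events, loops); the function name, the dependency table and the task names below are hypothetical.

```python
def is_acceptable(sequence, depends_on):
    """Return True if every task appears only after all the tasks it depends on."""
    done = set()
    for task in sequence:
        required = depends_on.get(task, set())
        if not required <= done:
            return False  # a prerequisite has not been carried out yet
        done.add(task)
    return True

# Hypothetical dependencies for a "register a new guest" functionality.
depends_on = {
    "create guest record": {"validate guest data"},
    "display confirmation": {"create guest record"},
}

valid_order = ["validate guest data", "create guest record", "display confirmation"]
invalid_order = ["create guest record", "validate guest data", "display confirmation"]

valid_result = is_acceptable(valid_order, depends_on)
invalid_result = is_acceptable(invalid_order, depends_on)
```

Knowing these dependencies is what allows an observed order of method executions to be explained rather than merely recorded.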

2.3.2.1 Task description

Being at the same time a business-related operation and a kind of information processing, the tasks represent some form of manipulation of business domain information. To formalize them it is therefore necessary to formalize both the manipulation and the business domain information. To distinguish between the kind of processing that could be applied to any information and the actual processing of particular business domain information, we refer to the first kind as “actions” and to the second kind as “manipulations”.

Business domain information can be represented formally using an ontology. However, since the business domain is related to the program under analysis, we cannot give any further details about this ontology in general terms. In our research we have used two business domain ontologies, which we have described next to the description of the program being analyzed in our experiments (c.f. chapter 3 and section 7.2.1). The details of how we effectively use the business domain ontology are given in section 4.4.

Since actions represent only the kind of processing to be applied, their formalization can be achieved independently from the business domain. This allows the resulting ontology to be reused during the analysis of any particular program. Reusability is a key feature of this ontology, making it possible to reduce the cost of applying it to a new programming language (c.f. section 5.3).

2.3.2.2 The Action Ontology

While we were able to customize existing business ontologies to represent the business domain of our case studies’ programs, we could not find any ontology of actions. Therefore, we built the Action Ontology, a conceptualization of the kinds of processing that a computer program can perform.

We believe the absence of actions in business ontologies is a consequence of the fact that they are not, in their vast majority, domain specific. Indeed, actions cannot, in general, constitute a domain-specific action by themselves. For example, while “register a new guest” is specific to the hostelry domain, the action “register” has multiple meanings by itself (e.g., the most general being the action of putting something in a register); therefore, although it may be frequently used or otherwise important in a specific domain, it cannot be said to belong to any business domain in particular. It is only when associated with a domain concept (like “guest” from the hostelry domain) that an action becomes a domain-specific action.
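The distinction between a domain-independent action and a domain-specific manipulation can be sketched as a simple pairing of an action with a domain concept. The class below is a hypothetical illustration, not the formalization used in our artifact, and the "library" domain is an invented second example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Manipulation:
    """An action applied to a concept of a given business domain."""
    action: str    # domain independent, e.g. "register"
    concept: str   # taken from a business domain ontology, e.g. "guest"
    domain: str

    def label(self):
        return self.action + " " + self.concept

# The same action becomes domain specific only through its concept.
hostelry = Manipulation("register", "guest", domain="hostelry")
library = Manipulation("register", "borrower", domain="library")

shared_actions = {hostelry.action, library.action}
```

Because the action is kept separate from the concept, the set of actions can be reused unchanged when a program from another business domain is analyzed.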

The root concept of the ontology is [Action]. It is specialized by [Information delivery] and [Information manipulation], which represent the most general actions that can be used on information, i.e., either handle it or communicate it. We specialized the exchange of information into [Receive] and [Send] only, while the handling of information is specialized in much greater detail following the Model View Controller (MVC) pattern:

• [Manipulation for Control], decision making actions based on information

• [Manipulation for View], actions on information with visible outcome for the user

• [Manipulation for Model], actions that handle the information itself

This first layer of concepts is shown in Figure 2.3, using the Visual Notation for OWL Ontologies¹ (VOWL).

Figure 2.3: First group of Action ontology concepts

The specialization of [Manipulation for Control] is short as well, as we have only identified [Validate input], the process of checking that whatever input was made by the user is valid. [Manipulation on Model] is slightly more complex since it reflects the Create, Read, Update and Delete (CRUD) basic functions of information handling in persistent storage. Although some people add searching as a basic function as well, modifying the acronym to SCRUD, we found it more suitable to add [Search] as a kind of [Read] action since in our case it refers to the action of retrieving specific information. Thus, the action ontology is expanded as shown in Figure 2.4.

The concept [Manipulation for View] is first specialized into [Configure view] and [Render view]. [Configure view] is the action of configuring any view of domain information whereas [Render view] is the action of displaying a view for the user. Two different kinds of views can be displayed for any type of business information: a visualization of the information or an interface through which the user can change the business information. Thus, [Render view] is specialized into [Display content] and [Display management interface].

1http://purl.org/vowl/spec/


Figure 2.4: Second group of Action ontology concepts

We specialize [Display content] according to the different kinds of visualization possible:

• [Display collection], displaying a collection of instances of a business concept.

– [Display as graph], [Display as list], [Display as table] and [Display as tree].

• [Display individual], displaying a single instance of a business concept.

– [Play audio], [Play video] and [Display statically] when the instance of the business concept can be displayed as an image, text or even as a web page.

We specialize [Display management interface] into:

• [View configuration], the rendering of an interface from which to configure the views produced by any [Display content].

• [Content management], the rendering of an interface used to modify the business information.

– [Creation interface], [Deletion interface] and [Modification interface].

This third and last group of concepts is shown in Figure 2.5.
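The specialization hierarchy described in this section can be summarized as a child-to-parent table, which is enough to answer “is-a” questions about actions. This is a minimal sketch and not the OWL encoding of the Action Ontology; the concept names follow the text (the figures label the three MVC concepts “Manipulation on ...”), and the helper functions are hypothetical.

```python
# Child concept -> parent concept, as described in the text of section 2.3.2.2.
PARENT = {
    "Information delivery": "Action",
    "Information manipulation": "Action",
    "Receive": "Information delivery",
    "Send": "Information delivery",
    "Manipulation for Control": "Information manipulation",
    "Manipulation for View": "Information manipulation",
    "Manipulation for Model": "Information manipulation",
    "Validate input": "Manipulation for Control",
    "Create": "Manipulation for Model",
    "Read": "Manipulation for Model",
    "Update": "Manipulation for Model",
    "Delete": "Manipulation for Model",
    "Search": "Read",
    "Configure view": "Manipulation for View",
    "Render view": "Manipulation for View",
    "Display content": "Render view",
    "Display management interface": "Render view",
    "Display collection": "Display content",
    "Display individual": "Display content",
    "Display as graph": "Display collection",
    "Display as list": "Display collection",
    "Display as table": "Display collection",
    "Display as tree": "Display collection",
    "Play audio": "Display individual",
    "Play video": "Display individual",
    "Display statically": "Display individual",
    "View configuration": "Display management interface",
    "Content management": "Display management interface",
    "Creation interface": "Content management",
    "Deletion interface": "Content management",
    "Modification interface": "Content management",
}

def ancestors(concept):
    """List the concepts that generalize `concept`, from nearest to root."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def is_a(concept, ancestor):
    """True if `concept` is `ancestor` or one of its specializations."""
    return concept == ancestor or ancestor in ancestors(concept)

search_chain = ancestors("Search")
```

For instance, `is_a("Display as table", "Manipulation for View")` holds, whereas `is_a("Send", "Information manipulation")` does not, since [Send] specializes [Information delivery].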
