Découverte et réconciliation de données numériques relatives aux personnes pour la gestion des ressources humaines

HAL Id: tel-01677841

https://tel.archives-ouvertes.fr/tel-01677841

Submitted on 8 Jan 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Découverte et réconciliation de données numériques relatives aux personnes pour la gestion des ressources humaines

Mohammad Ghufran

To cite this version:

Mohammad Ghufran. Découverte et réconciliation de données numériques relatives aux personnes pour la gestion des ressources humaines. Autre. Université Paris Saclay (COmUE), 2017. Français. ⟨NNT : 2017SACLC062⟩. ⟨tel-01677841⟩


Doctoral thesis
NNT: 2017SACLC062

Découverte et réconciliation de données numériques relatives aux personnes pour la gestion des ressources humaines

Doctoral thesis of Université Paris-Saclay, prepared at CentraleSupélec
Doctoral School no. 580: Information and Communication Sciences and Technologies (STIC)
Doctoral specialty: Computer Science

Thesis presented and defended in Orsay, on November 27, 2017, by

Mohammad Ghufran

Jury composition:

President: Nicolas Sabouret, Professor, Université Paris-Sud (LIMSI-CNRS)
Reviewer: Mathieu Roche, Researcher, CIRAD (TETIS)
Reviewer: Patrick Marcel, Maître de Conférences, Université François Rabelais (LIFAT)
Examiner: Dario Colazzo, Professor, Université Paris-Dauphine (LAMSADE)
Thesis Advisor: Nacéra Bennacer Seghouani, Professor, CentraleSupélec (MODHEL)
Thesis Co-Advisor: Gianluca Quercini, Maître de Conférences, CentraleSupélec (MODHEL)
Invited: Florent André, MindMatcher


DOCTORAL SCHOOL STIC

SCIENCES ET TECHNOLOGIES DE L’INFORMATION ET DE LA COMMUNICATION

PHD THESIS

Digital Identity Discovery and Reconciliation for Human Resources Management

Defended by

Mohammad Ghufran

on November 27, 2017

Jury:

President of jury: Nicolas Sabouret - Paris-Sud University
Reviewer: Mathieu Roche - CIRAD
Reviewer: Patrick Marcel - François Rabelais University
Examiner: Dario Colazzo - Paris-Dauphine University
Advisor: Nacéra Bennacer Seghouani - LRI, CentraleSupélec
Co-Advisor: Gianluca Quercini - LRI, CentraleSupélec
Invited: Florent André - MindMatcher


Acknowledgments

I would like to express my greatest gratitude to the reviewers, Dr. Patrick Marcel and Dr. Mathieu Roche, for taking the time to review my thesis and sharing their valuable comments. Without their feedback, this thesis would not have been possible. At the same time, I would like to thank Dr. Dario Colazzo and Dr. Nicolas Sabouret, members of the jury, for sharing their expert opinion and for their time and participation.

To my supervisors Dr. Nacera Bennacer Seghouani and Dr. Gianluca Quercini, for their continuous guidance, help and mentoring. Without their unfailing support, this thesis would not have been possible. I would also like to thank the university staff for their assistance.

To my mother and late father, because I owe it all to them. Many thanks! They have always been encouraging and supportive, and have made countless sacrifices for me. I am also grateful to my siblings, who have provided me with guidance, inspiration and unfettered support throughout my life.

I am also grateful to MindMatcher for funding the research, providing resources, and offering valuable insight into Human Resource Management, through the highs as well as the lows.

And finally, last but by no means least, a very special gratitude goes out to my friends for their support, encouragement and for expecting nothing less than excellence from me.


Contents

I Introduction
    1 Context
    2 Objectives and challenges
    3 Contributions
    4 Organization of this manuscript

II Resume Segmentation
    1 Identification of sections
        1.1 Features
        1.2 Evaluation Results
    2 Identification of the type of sections
        2.1 Features
        2.2 Evaluation Results
    3 Conclusion

III Extracting Key Information from Resumes
    1 State of the art
        1.1 Entity Recognition
        1.2 Entity Linking
        1.3 Entity Linking in resumes
    2 Preliminaries
        2.1 Wikipedia graph
    3 Our Approach
        3.1 N-gram generator
        3.2 N-gram interpretation graph constructor
        3.3 Interpretation filters
    4 Evaluation
        4.1 Data set
        4.2 Evaluation setup
        4.3 Results
        4.5 Comparison with TagMe
        4.6 Comparison with Babelfy
    5 Concluding Remarks

IV Reconciliation of Social Network Profiles
    1 Related Work
    2 Background
    3 Attribute Comparison
        3.1 Nicknames
        3.2 Real name
        3.3 Locations
        3.4 Emails
        3.5 Websites and Profiles
    4 LIAISON
        4.1 Candidate Selection
        4.2 Cross-links determination
        4.3 The algorithm
    5 Evaluation Results
        5.1 Evaluation of LIAISON
        5.2 Comparison with Existing Work
    6 Concluding Remarks

V Toponym Disambiguation
    1 Related work
    2 Preliminaries
    3 LocusRank Algorithm
        3.1 Location Graph
        3.2 Disambiguation
        3.3 Inference
    4 Evaluation
        4.1 Data Set
        4.2 Methodology
        4.3 Disambiguation Results
        4.4 Inference Results
    5 Concluding remarks

VI Searching Resources about a Job Seeker
    1 Query Generation
    2 Obtaining Web Resources
    4 Data set
    5 Experimental Results
        5.1 Results
    6 Conclusion

VII Conclusion & Perspectives

Bibliography

Appendices


Chapter I


1 Context

Finding the appropriate individual to hire and, more generally, Human Resource Management, is a crucial part of any organization. The performance of a company, the progress of its projects, and its development all rely on its human capital. With the advent of the Internet, this process has moved from a paper-based to an internet-based system, giving applicants easy access to job offers and organizations easy access to applicants. Instead of paper-based applications and advertisements in print media, there now exist online portals and services that enable companies to post job openings and applicants to apply for them. At the same time, the Internet has introduced a new set of challenges and problems for human resource experts, as well as new requirements of them.

Human Resource Management is an important and wide-ranging subject. It incorporates a wide array of tasks and workflows that deal with its different aspects, including employee management tasks such as keeping track of vacations, payroll management, and recruitment. Most of these aspects are governed by rules and processes already in place in organizations: for example, employees are allowed a fixed number of days off according to established rules, and salaries are predefined. The task of recruitment, however, is dynamic in nature and cannot be fulfilled by simply following a set of rules or guidelines.

The recruitment process is complicated for many different reasons. First and foremost is the diverse nature of the job descriptions to be filled. An organization needs to hire personnel to carry out the different aspects of the business it operates in. Even when an organization is focused on a certain business domain, it still needs to hire personnel for support functions. For example, a pharmaceutical company needs not only experts capable of carrying out research and developing new drugs, but also experts to carry out trials and ensure compliance with industry standards, as well as marketing and sales experts, information technology experts, procurement experts, supply chain experts, support staff, and human resources experts. Hiring personnel to fill all these roles is typically the responsibility of the Human Resources Department of a business. Different branches or departments submit job descriptions for positions to be filled, and it is the responsibility of the Human Resources Department to fill the open positions by advertising them, screening the applications, and finding appropriate applicants. While the department requiring new resources is typically involved in the hiring process, conducting domain-specific interviews to better judge the compatibility of an applicant with the open position, it is the task of Human Resource experts to identify and short-list the applicants to be interviewed in further detail. This poses a problem for recruiters, since they are not familiar with most, if not all, of the domains they have to hire personnel for.


Human Resource experts frequently resort to searching online to identify web pages relevant to an individual applicant. This is done in order to validate the experience and expertise of an applicant, to augment the information available in the resume, to identify the applicant's soft skills, and to discover their activity on social networks. It is typically done manually, by issuing queries to a search engine and inspecting the results to identify the web resources (such as a Facebook profile, a personal page on a company website, a blog, or a list of publications) that are related to the applicant. This task is laborious, unsystematic and arbitrary. Typically, a recruiter identifies personal information such as the applicant's name and searches online for web resources. They may also combine the name with an educational or professional institute frequented by the applicant to restrict the results obtained. It is desirable to automate this process, but its automation is a complex problem. The first and foremost problem in this regard is the generation of queries, which in turn depends on identifying key information within a resume that can be used to generate them.

Given this context, there is a need for solutions that automate the process of enriching the information found in a resume with information found online from diverse sources. Two primary objectives must be achieved to automate this process: firstly, the annotation of resumes to identify the key information in them; and secondly, the automated generation of queries, using the detected information, to identify resources relevant to an applicant.

This thesis was undertaken in collaboration with, and with the support of, MindMatcher, an SME (Small and Medium-sized Enterprise) based in Paris working in the Human Resource Management domain. MindMatcher focuses on intelligent matching between job offers and applicants through the use of technology. Its current approach computes similarity scores between documents based on the frequencies of the keywords they contain. Each document is represented as a vector, where each element contains the tf-idf normalized frequency of a particular word present in the document. The distinct words across all documents in the system make up the vocabulary, and each vector contains as many elements as there are words in the vocabulary. Using this vector representation, the relevance of a resume to a job offer can be quantified by computing the cosine similarity between the corresponding vectors. Since all documents are represented in the same vector space, it is also possible to compute the similarity between two job offers, or between two resumes.
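The vector-space matching just described can be sketched in a few lines. The code below is an illustrative reconstruction, not MindMatcher's actual implementation; the tokenization (whitespace, lowercased) and the exact tf-idf weighting scheme are assumptions.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Build one tf-idf vector (as a sparse dict) per document,
    over the shared vocabulary of all `documents`."""
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (count / len(tokens)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0
```

A resume sharing more distinctive vocabulary with a job offer then scores a higher cosine than an unrelated one, which is the quantity used for ranking.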

Other works have tackled various aspects of Human Resource Management, notably the matching of job offers and resumes. This task is also known as a Job Recommender System (JRS), a term that encompasses a wide variety of methods and techniques. For instance, [Dane, 2012] proposed a method that first detects different entities, then normalizes them against a standardized and fixed knowledge base, and finally computes similarity between these normalized entities to match job offers and applicant resumes. PROSPECT [Singh et al., 2010] is a resume short-listing platform that uses Conditional Random Fields to first extract information from resumes. This information is then used to filter and search applicant profiles relevant to a job offer, even though no explicit matching between job offers and applicants is made. [Kessler et al., 2012] propose a method that first applies a statistical summarization system to applicant profiles and then computes similarity between job offers and the summaries to obtain relevant applicants for a job offer.

[Kumaran and Sankar, 2013, Yu et al., 2005] both segment resumes before extracting information from them using Hidden Markov Models (HMM); they use GATE, a rule-based entity extractor, and Support Vector Machines, respectively, to detect entities in the resumes. Note that both use part-of-speech tagging as input to their systems. [Paparrizos et al., 2011] used a decision table/naïve Bayes classifier to predict applicants' future jobs as a way to find job recommendations for them. [Zaroor et al., 2018] reduce the scope of the matching problem by first classifying both resumes and job offers into groups. They employ a combination of part-of-speech tagging, a rule-based approach, and a knowledge base to identify skills and occupation categories within resumes and job offers. Job offers and applicants in each group are then ranked separately, and recommendations are obtained by computing semantic similarity between the extracted skills and concepts. Hong et al. [Hong et al., 2013] first cluster applicants into separate groups based on their characteristics and then apply different job recommendation strategies to each cluster. Reciprocal or bilateral recommendation approaches also exist, in which recommendations for both job offers and applicants are produced at the same time, taking into account both the local utility of the system (satisfying individuals' preferences) and its global utility (the overall quality of all recommendations) [ZHANG, 2011, Malinowski et al., 2006, Li and Li, 2012].

These approaches use a wide range of methods to perform some form of matching between job offers and applicant profiles. However, none of them attempt to enhance the applicant profiles by looking up additional information about them online, which is the primary objective of the work presented in this thesis. Moreover, many of these methods, such as recommender systems, rely on applicants' and employers' preferences, usually based on their interactions with an online portal, to provide recommendations. This is not ideal since, for example, the position an applicant desires may not match their skill set or experience level. Some of these approaches use methods that are not well suited to resumes, such as HMMs or CRFs, which rely heavily on proper sentence and language structure so that natural language models such as part-of-speech tagging can be applied. These methods have limitations in terms of multilingual support, as separate models need to be trained for every language to be supported, and they require large amounts of annotated training data.


Online tools that aggregate web pages about a person, such as Reputation, Spokeo and Pipl, also exist. However, they accept no input beyond a name and provide no control over the search criteria. Typically, only manual searching by name is possible, with no possibility of filtering or providing contextual information. Moreover, they are generally limited in the types of online resources returned. The responses are frequently ambiguous, do not necessarily pertain to the person searched for, and all the results have to be inspected manually.

There is a need to improve various aspects of the extraction of information from resumes, so that little or no training data is required for the different stages of the process, different languages are supported without requiring separate language models, and the process is independent of writing style and natural language constructs. Additionally, no approach exists that enhances the resume of an applicant by discovering additional information from the Web, which has the potential to improve the quality of matching between job offers and applicant profiles. In this thesis, we discuss the various challenges that must be overcome to alleviate the shortcomings of existing approaches and pave the way for enhancing resumes by discovering additional information that complements them.

2 Objectives and challenges

The global objective of this thesis is to propose solutions that enable the automated enhancement of the information already available about an applicant in the form of a resume. This overarching objective requires us to achieve several intermediary objectives. Broadly speaking, these correspond to the core problems that must be solved to achieve automated discovery of information about applicants. They are as follows, and are discussed in greater detail in this section:

— Extraction of key information from resumes.

— Discovery of online resources pertinent to an applicant.

— Disambiguation & Reconciliation of social network profiles.

Extraction of key information from resumes

The extraction of information from resumes is a key component of the work presented in this thesis. The process of searching for web resources, whether carried out manually or automatically, needs keywords with which to generate queries for a Web search engine. When this process is carried out manually, the recruiter is responsible for this task; in order to automate it, a mechanism to identify key information is required. More precisely, the objective is to identify, in the resume, the educational and professional institutes that an applicant has frequented. This information is most likely to help identify online web pages


and also to help disambiguate them i.e. to determine whether a particular web page is relevant to an applicant or not.

The first challenge in this regard is the diverse nature of the input documents, i.e. resumes. Since there is no set standard for resumes, they vary in terms of the file formats used to write them, writing style, document structure, and the domain they belong to. For instance, one applicant may use tables to describe different aspects of their professional experience, another may use bullet points, and yet another may write complete sentences and paragraphs. In each case, the formatting of the document (fonts, sizes, headings, labels, etc.) is also different. Existing information extraction systems assume that the input documents are written in a similar style, use a similar vocabulary, and even share the domain of their content [Kumaran and Sankar, 2013, Yu et al., 2005, Singh et al., 2010]. These restrictions are enforced by the underlying algorithms used by these methods, which are therefore not ideal in scenarios where the restrictions cannot be guaranteed - which is the case for resumes as input documents.

Another challenge with respect to the extraction of information from resumes is the language of the input documents. Resumes can be written in the language of the applicant's choice (typically, the language of the country or region where they are located). In addition, resumes often contain text or phrases in multiple languages. For example, an applicant writing their resume in English could write that they attended "Université Paris-Sud", which is the name of a university in French. This is a pervasive practice, as we have observed while working with resumes. While in general the amount of such text is not large, it typically concerns the names of institutes an applicant has frequented. It is for this reason that strong support for documents written in multiple languages is desired in the context of this thesis. There are numerous existing approaches that do support documents in multiple languages for various applications of information extraction, depending on the requirements of the authors [Ferragina and Scaiella, 2012, Moro et al., 2014]. However, these systems still need the input language to be defined a priori, therefore enforcing a single language for each document. While this is acceptable in most use cases, it is a serious challenge to be addressed in the context of this thesis.

The task of extracting information from resumes is further complicated by the ambiguity of the entities present in resumes. This gives rise to mis-classification of entities, caused by the limited context available in such documents. For example, the entity "Pierre and Marie Curie University" could appear in a resume as an educational institute frequented by a student, or it could appear in the resume of a professor or staff member who works there. Tackling this problem is a challenge that we aim to address by identifying sections in resumes, thereby providing additional contextual information. Applicants usually group information regarding different aspects of their education and professional experience into sections - blocks of text which focus on a particular aspect of


their experience. If this information is known, mis-classifications can be reduced.

In addition, it is hard to use existing information extraction methods on resumes because of the excessive time and effort needed to annotate the training data required by the supervised or semi-supervised methods that most existing approaches propose. It is therefore ideal that the proposed approach be unsupervised in nature and require minimal or no annotated training data.

Retrieval of web resources

The definition of an approach that can use the key information extracted from a resume to obtain online resources is the primary objective of this thesis. Online resources are any web pages or documents available on the Web. These could be personal pages created by the applicant, such as Facebook or LinkedIn profiles, profile pages at the institutions where they work, publications or pages containing information about their co-authors, blog posts, open source project contributions, or any other types of pages relevant to the applicant. Identifying such resources enriches the resume, and the newly discovered resources can be used in multiple ways. Not only do they help human resource experts enrich resumes with additional information not present in them, they also enable experts to verify the information in the resumes. Both of these aspects have the potential to improve the quality of matching between job offers and applicant profiles, since these documents can be fed into the matching algorithms to provide information that was previously unavailable to the system. Recruiters can verify or gauge the level of expertise of an applicant by looking at their contributions and publications. They can also gain a better understanding of the applicant's soft skills, interests and extracurricular activities by examining personal pages or publicly available social media profiles. To the best of our knowledge, no effective solutions currently exist to discover online resources pertinent to an applicant.

Discovering online resources relevant to an applicant presents several research challenges. When searching for information online, the results returned by a search engine are often ambiguous, because first and last names are rarely unique. For instance, searching for "Mohammad Ghufran" online yields multiple Facebook profiles of individuals with that exact name. Any of these profiles could be the actual profile created by the job applicant. It is necessary to ascertain which of these resources are relevant to a particular applicant and which are not; only then is it possible to use them effectively. This is especially difficult in the case of social network profiles, where often very limited and ambiguous information is available about an individual.

The major challenge in achieving this objective is the selection of the query patterns to be issued to the search engines. The choice of queries directly influences the results obtained, and it is important to select query patterns that are


neither too generic, yielding results that are not relevant to the applicant, nor so restrictive that they omit results that could have been relevant. This is especially difficult since it is impossible to perform an exhaustive search of query patterns, and also because it is not possible to perform an exhaustive evaluation (which would mean looking at all the search results obtained for a query).

The second challenge in discovering online resources is to propose an approach to rank and select online resources with respect to their relevance to an applicant. To conclude that an online resource is relevant to an applicant, the content and named entities mentioned in the resource must be similar to those mentioned in the resume. Since a search engine can return numerous resources (in the hundreds or even thousands), it is important to rank and select a subset of them, so that they can be examined by recruiters and potentially used to augment their knowledge about the applicant, or to improve the matching process by providing additional information.
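To make the two challenges above concrete, the sketch below combines an applicant's name with extracted institution names to form queries, and ranks retrieved pages by entity overlap with the resume. The query patterns and the `results` record layout are illustrative assumptions, not the actual patterns or ranking function studied in this thesis.

```python
def generate_queries(name, institutions, max_queries=5):
    """Start with the quoted name alone (generic), then narrow it
    with one institution at a time (more restrictive)."""
    queries = [f'"{name}"'] + [f'"{name}" "{inst}"' for inst in institutions]
    return queries[:max_queries]

def rank_results(results, resume_entities):
    """Rank retrieved pages by how many resume entities they mention,
    a crude proxy for relevance to the applicant. Each result is assumed
    to be a dict with an "entities" list extracted from the page."""
    resume_entities = set(resume_entities)
    return sorted(results,
                  key=lambda page: len(set(page["entities"]) & resume_entities),
                  reverse=True)
```

The `max_queries` cap reflects the impossibility of exhaustive search: only a small, well-chosen set of patterns can realistically be issued and evaluated.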

Disambiguation & Reconciliation of social network profiles

Social network profiles are important resources for a recruiter because they provide information about the soft skills, extra-curricular activities and general interests of an applicant. Profiles of an applicant are also particularly hard to identify, because applicants may or may not disclose their social network profiles in their resumes, and the information present in the profiles (such as name and location) may not be accurate or up to date. Identifying profiles that have not been explicitly disclosed can be particularly helpful, but some mechanism is needed to discover them. Specifically, given the profile of an individual in a social network, we wish to identify other profiles of the same individual that are not already known. For example, if an applicant's LinkedIn profile is known, it would be helpful to identify their Facebook profile based on the publicly available information in both social networks.

Tackling the ambiguity of the publicly available information within social network profiles is also an important challenge. Information such as nicknames, names and locations, which is often publicly available in social network profiles, is ambiguous, resulting in uncertainty when determining whether a profile belongs to a particular individual. Resolving this ambiguity has the potential to help in the reconciliation of profiles. In the same way, inferring such information when it is missing can also be beneficial. The disambiguation and inference of ambiguous or missing information in social network profiles, with the goal of completing these profiles, is thus a challenge that we aim to address in this thesis.
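As a small illustration of comparing ambiguous profile attributes, character-trigram Jaccard similarity tolerates the spelling variations typical between a name and a nickname. This is only a baseline sketch of the idea; the actual attribute-specific similarity measures are developed in Chapter IV.

```python
def trigrams(s):
    # Pad with spaces so short names still yield boundary trigrams.
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def name_similarity(a, b):
    """Jaccard similarity over character trigrams: 1.0 for identical
    names (ignoring case), and graded credit for partial matches
    such as nicknames or minor spelling differences."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

A score like this lets a reconciliation algorithm treat "Jon" as closer to "Jonathan" than to an unrelated name, instead of demanding exact string equality.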


3 Contributions

In light of the objectives and challenges outlined earlier, several scientific contributions are made in this thesis. The major challenges include the identification of the salient entities present in a resume, the generation of queries using these entities, and finally the selection of online resources pertinent to an applicant. The contributions made in this thesis propose solutions to all of the challenges related to automating the process of enhancing a resume. These contributions are outlined below:

— A method to divide a resume into sections is proposed, with the aim of reducing the ambiguity of the named entities present in a resume.

— A method for classifying the previously determined sections is also proposed. This method assigns each section to a class based on the content present in it. This helps identify the type of information present in a particular section, thereby helping reduce the ambiguity of the entities found in resumes.

— An approach for the extraction of key information from resumes is proposed that uses the previously detected sections and their types to identify educational and professional institutes using an unsupervised method. The proposed approach links named entities to Wikipedia, the largest freely available knowledge base. For each entity to be annotated, the approach constructs an interpretation graph: a set of all the possible senses it could be interpreted as. The method uses Wikipedia, along with information about the type of section an entity appears in, to disambiguate and filter entities. In particular, two filters are defined: one based on the similarity of detected entities with the context of the input document, and a second based on Wikipedia infobox information.

— The algorithm LocusRank is presented in this thesis, which disambiguates the location information provided by an individual by looking at the location information provided by their friends. Contrary to existing approaches that tackle the disambiguation problem, this algorithm uses no contextual information. Furthermore, LocusRank goes a step further and infers the location of individuals from the location information provided by their connections in a social network. The algorithm shows promising results for both the disambiguation and the inference of location information.

— A method for generating queries using the key information extracted from resumes is also proposed. This approach combines personal information with educational and professional institutes to retrieve salient online resources. Furthermore, an approach to rank the obtained results in order of their relevance to an applicant is defined. We show that both approaches are effective in identifying resources that have the potential to extend an expert's knowledge about an applicant, and that the ranking approach reduces the number of irrelevant resources recovered from search engines while increasing the number of relevant ones.

— Contributions are made to an algorithm which discovers previously unknown profiles of individuals across multiple social networks. The algorithm achieves this by defining similarity measures for the different types of publicly available information in social networks. My contribution to this algorithm focuses on computing the similarity between locations.
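To illustrate the intuition behind the location-inference contribution, the sketch below falls back on a simple majority vote over the locations declared by a user's connections. LocusRank itself works on a location graph and also disambiguates place names; this baseline, with its hypothetical data layout (plain dicts), only conveys the underlying assumption that an individual's connections tend to cluster geographically.

```python
from collections import Counter

def infer_location(user, declared, friends):
    """Return the user's declared location if present; otherwise take a
    majority vote over the locations declared by their connections.
    `declared` maps user -> location (or None); `friends` maps
    user -> list of connected users."""
    if declared.get(user):
        return declared[user]
    votes = Counter(declared[f] for f in friends.get(user, ())
                    if declared.get(f))
    return votes.most_common(1)[0][0] if votes else None
```

The vote degrades gracefully: a user with no connections, or whose connections declare nothing, simply gets no inferred location rather than a wrong one.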

4 Organization of this manuscript

This manuscript is organized into chapters that deal with broad categories of research challenges encountered when pursuing the global objective of automating the discovery of online resources relevant to an applicant's profile.

The task of identification and classification of sections in resumes is addressed in Chapter II. This is used as a basis for the task of extracting information from resumes. Chapter III focuses on the task of information extraction from resumes. It outlines an approach for identifying key organizations within a resume, which is later used in an approach to discover additional online resources pertaining to an individual, as discussed in Chapter VI.

In Chapter V, we focus on methods to resolve ambiguity when dealing with online resources. An approach is proposed to disambiguate location information, which is often the most ambiguous information in social network profiles. We also outline an approach to compute similarity between different social network profiles by utilizing basic information to identify new profiles of the same individual in different social networks.


Chapter II


Introduction

Resumes are free-form texts and applicants take the liberty to choose their own design, document file format, fonts, colors and text structures (tables, lists, free-form text). In addition, the kind of information provided as well as the sequence in which it appears is not fixed. While most applicants may put their personal information at the beginning of the document, some choose to put it at the end. Similarly, some applicants may put their education experience before their professional experience. Some may put a detailed listing of their projects or positions, while others may not. Yet more applicants may put their publications or their skills in a separate section. An example of a resume is shown in Figure II.1. In this document, the author uses clearly marked sections which give details about different aspects: their personal details, academic and professional experience, skills and publications. Each of these sections has been annotated with a different color on the left-hand side margin in the figure. In addition, Figures II.2 and II.3 show two other resumes which are written in completely different styles and formats. These examples show the diversity encountered in resumes in terms of:

— Domain of expertise of the applicants

— Format and structure of the documents

— Types of sections used

— Writing styles and language constructs used, such as paragraphs or bullet lists

It is clear from these examples that applicants choose to provide diverse information in their resumes, grouped together in the form of sections. Not only does the writing style vary widely from one applicant to another, but it also varies across different sections within the same resume. For instance, in the first example resume, the skills section is presented as a list of competencies, publications are given as a numbered list, and education and professional experience are given as a mix of tables and bullet points.

Due to this high amount of diversity, dividing the problem into a set of simpler, more tangible problems is likely to make it more manageable by reducing the context. While there is an incredible amount of diversity within resumes, the fact that they all cluster information together into broad categories can be harnessed to help in the process of identifying and classifying the different entities that appear in resumes. Dividing a document into sections before trying to identify entities within each section not only limits the possible classes an entity can be classified as, but may also give clues to what type the entity may be.

In this chapter, we define an approach to divide a resume into its constituent sections and to identify the type of these sections, so that this information can be used in identifying and classifying detected entities. The proposed approach comprises two steps. The first step is dedicated to identifying the boundaries of sections, given a resume as input. We explore different supervised machine learning algorithms to determine the best one. The second step comprises classification of the contents within each section using supervised machine learning algorithms. This final step enables us to limit the context of any entities present in the section. If it is determined that a section contains information regarding the academic experience of an applicant, then professional entities, skills, personal details and references are not likely to be present in it. These two steps are discussed in detail in Sections 1 and 2 respectively.

1 Identification of sections

In this section, a method to divide a resume into its constituent sections is outlined. In this method, a resume is treated as a collection of lines l1, l2, l3, …, ln. To identify the sections, the lines that mark the beginning and end of all the sections need to be identified. If these boundaries are known, then the sections can easily be identified by simply splitting the resume at the lines that constitute these boundaries. In fact, we observe that the end of a section also marks the beginning of the section following it. Therefore, if the beginnings of all the sections in a resume are known (conversely, if the ends of all the sections are known), all the sections can be uniquely identified. Thus, only the beginnings or the ends of sections need to be identified to determine all the sections in a resume. Since most resumes have a distinctive heading at the beginning of each section, we have chosen to identify these lines through supervised machine learning methods. Given only the lines that mark the beginning of a section, the content of all the sections in a resume can be identified as follows:

— A section comprises all the text between two lines identified as beginnings of sections. In the first example resume, the section detailing the work experience of the applicant is an example of such a section.

— A section comprises all the text from the beginning of the resume to the first line identified as the beginning of a section. Sections of this type do not have a particular line marking their beginning and are usually the first section in a resume.

— A section comprises all the text from the last line identified as the beginning of a section to the end of the resume. These sections are typically the last sections in resumes and are not followed by another section. An example of such a section in Figure II.1 is the section labeled as publications.
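The three cases above reduce to a simple splitting routine once the indices of the lines classified as “BoS” are known. The sketch below is illustrative (Python, with invented names), not the thesis implementation:

```python
def split_into_sections(lines, bos_indices):
    """Split a resume (a list of lines) into sections, given the indices
    of the lines classified as "BoS" (beginning of section)."""
    sections = []
    bos = sorted(bos_indices)
    # Case 2: text before the first "BoS" line forms an implicit section.
    if bos and bos[0] > 0:
        sections.append(lines[:bos[0]])
    # Cases 1 and 3: each "BoS" line opens a section that runs until the
    # next "BoS" line, or until the end of the resume for the last one.
    for start, end in zip(bos, bos[1:] + [len(lines)]):
        sections.append(lines[start:end])
    return sections

resume = ["John Doe", "john@example.com", "Education",
          "MSc, 2015", "Work Experience", "Acme Corp, 2016-2020"]
print(split_into_sections(resume, [2, 4]))
```

Here, lines 2 and 4 (“Education”, “Work Experience”) play the role of detected section headings, so the routine yields three sections.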

This method requires minimal training data at the level of lines in the resume, as opposed to training data at the level of words in the document, as is necessary in the case of other approaches used to identify sections in resumes [Yu et al., 2005, Singh et al., 2010, Lafferty et al., 2001]. Since the input resumes come in a wide range of formats - both in terms of file types (pdf, doc, docx etc.) and in terms of written structure (due to the document structure/style chosen by the author) - it is particularly hard to extract any structural information from them. Even the basic task of identifying the beginning and end of sections is difficult, since most of the formatting information and clues are lost when extracting plain text from the original file formats. For example, typographic information such as font sizes or transformations like bold face or italic writing is lost during this process.

To identify the lines that mark the beginning of sections, a supervised model is created to help identify these boundaries. More formally, given a resume D containing n lines l1, l2, . . . , ln, we train a binary classifier which classifies each line of a resume into class “BoS” (Beginning of Section) or “N” (Not beginning of section). This annotation is shown in Figure II.1 with markings “BoS” and “N” in red at the end of each line. In Figure II.1, lines annotated as “BoS” contain the texts “personal data”, “work experience”, “education”, “teaching”, “skills” and “publications”. The complete sections are also demarcated with rectangles of different colors that identify their boundaries. Two lines that are classified as “BoS” mark the boundaries of a section. Similarly, in Figure II.3, lines classified as “BoS” contain the texts “education”, “professional and management experience” and “extra-curricular activities”.

Each line in the resume is treated as a candidate for the class “BoS” based on a set of features we extract from the line itself and its context. We then train different models over the training set to determine the best model for classifying each line as class “BoS” or “N”. We note that this classification task is independent of the type of section: a line either marks the beginning of a section or not, and is not affected by the type of the section it demarcates.

1.1 Features

To build the model to classify a line into one of the two classes “BoS” (beginning of section) or “N” (not beginning of section), we used several features. These features are identified from the information present in each line and its context, i.e. the lines that precede and follow it. A wide variety of features is used, based on the presence of keywords, part-of-speech tags, and different statistics computed from the lines and their context. The features extracted for each line include:

— Keyword features: these are binary features for each of the selected keywords which appear frequently within lines that mark the beginning of sections. The selected keywords include “education”, “experience” and “skills”.

— Word count: the total number of words in the line.

— Newlines: whether the line has one or more newlines preceding it.

— Noun phrase ratio: the ratio of tokens that are part of a noun phrase over the total number of tokens in the line.

— Capital case, title case: whether the text in the line is written in capital case or title case.
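As a sketch, these features could be assembled per line as follows (illustrative Python; the noun-phrase ratio is omitted because it requires a POS tagger, and the keyword list is abridged):

```python
SECTION_KEYWORDS = ("education", "experience", "skills")

def line_features(line, prev_line):
    """Illustrative feature vector for one resume line. The thesis also
    uses a noun-phrase ratio, which needs a POS tagger and is left out
    of this sketch."""
    words = line.split()
    lower = line.lower()
    features = {}
    # Binary keyword features for headings frequent at section starts.
    for kw in SECTION_KEYWORDS:
        features["kw_" + kw] = int(kw in lower)
    # Total number of words in the line.
    features["word_count"] = len(words)
    # Whether the line is preceded by an empty line.
    features["preceded_by_newline"] = int(prev_line.strip() == "")
    # Capital-case / title-case indicators.
    features["is_upper"] = int(line.isupper())
    features["is_title"] = int(line.istitle())
    return features

print(line_features("WORK EXPERIENCE", ""))
```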


These features are the most helpful in discriminating between lines belonging to the two classes.

1.2 Evaluation Results

Different supervised models are trained and evaluated in order to determine the best one. The features described previously are extracted from each line in the data and are used for training and testing all the different models. Models that we have experimented with include Support Vector Machines, Random Forests, Logistic Regression, and Adaboost. Parameters for all the methods have been chosen using a grid search over a range of possible values. Grid search is an exhaustive search for the best parameters of a learning algorithm over a manually specified subset of all the possible parameters for the algorithm. For example, a Support Vector classifier with rbf kernel has two parameters for which the best values need to be determined: C, a regularization penalty, and γ, the coefficient for the kernel. A set of values for each of these parameters is selected, say C = {10, 100, 1000, 10000} and γ = {10−5, 10−4, 10−3, 10−2, 0.1, 0.2}. Using these possible values, separate models are trained for each combination of the parameters and their performance is measured by cross-validation.
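This procedure can be sketched with scikit-learn's GridSearchCV (an assumed toolkit, not necessarily the one used in the thesis), using the parameter sets quoted above on synthetic data:

```python
# Illustrative grid search over C and gamma for an rbf-kernel SVC.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the line-level feature vectors.
X, y = make_classification(n_samples=200, random_state=0)

param_grid = {
    "C": [10, 100, 1000, 10000],
    "gamma": [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 0.2],
}
# One model is trained per (C, gamma) pair; each is scored by
# cross-validation and the best combination is retained.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```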

By employing grid search, the best parameters for all the methods used in the evaluation are selected. In the case of the Support Vector classifier, the penalty parameter C is set to 10,000 and the rbf kernel is used with its coefficient γ set to 10−4 for the non-linear kernel, whereas the linear kernel is used with C set to 10. An SVM model is also trained using Stochastic Gradient Descent with linear kernel, C = 10 and hinge loss. The Random Forest classifier is trained with the number of estimators set to 50 and the minimum number of samples required for each leaf node set to 2. The logistic regression model is trained with the ridge estimator λ set to 10−8. Adaboost is used with two different base classifiers: decision trees and support vector machines. In both cases, 10 iterations are used to train the Adaboost classifiers.

Since the classes are heavily imbalanced, the models were trained and tested using stratified cross-validation, meaning that the folds are constructed so that each one preserves the class proportions of the full data set. In all experiments, training and test sets are generated without overlap, i.e. examples which are used for testing the performance of the system are not used for training.
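A small illustration of stratified folds on toy labels mimicking this imbalance (scikit-learn is an assumption here, and the label counts are invented):

```python
# Each stratified fold preserves the class ratio of the full data set,
# which matters because "BoS" lines are rare compared to "N" lines.
from collections import Counter
from sklearn.model_selection import StratifiedKFold

# Toy labels: 10 rare "BoS" lines, 90 common "N" lines.
y = ["BoS"] * 10 + ["N"] * 90
X = list(range(len(y)))

skf = StratifiedKFold(n_splits=10)
for train_idx, test_idx in skf.split(X, y):
    # Every test fold keeps roughly the 1:9 ratio of the whole set.
    print(dict(Counter(y[i] for i in test_idx)))
```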

The data set consists of 90 resumes. Each resume is first transformed into plain text using the Apache Tika open source project. Starting from this plain text representation, each line within the documents is annotated by the author into one of the two classes, i.e. whether the line marks the beginning of a section (“BoS”) or not (“N”). This ground truth is established in order to evaluate the different classifiers used for the classification problem. The annotation is carried out with the help of both the original resumes and the plain text versions in order to aid the process. This is necessary since, without the visual clues of the original document, it is difficult to decide which lines mark the beginning of a section.

Overall, the dataset contains 10,344 individual lines including empty ones. A total of 264 lines are annotated as “BoS” and the remaining are annotated as “N”. Starting from this data set, we perform stratified 10-fold cross-validation using several supervised classification methods. The methods are compared using the Precision, Recall and F1-score metrics.

The definitions of these measures in the case of a binary classification are as follows:

Precision = TP / (TP + FP)    (II.1)

Recall = TP / (TP + FN)    (II.2)

F1-score = (2 × Precision × Recall) / (Precision + Recall)    (II.3)

where:

— TP is the number of true positives, i.e. examples that are classified as beginning of section and actually are.

— FP is the number of false positives, i.e. examples that are classified as beginning of section but actually are not.

— FN is the number of false negatives, i.e. examples that actually mark the beginning of a section but are not classified as such.

In other words, recall measures the ratio of actual “BoS” examples that are correctly identified, whereas precision measures the ratio of examples which have been correctly classified over all the examples which have been classified as “BoS”. The F1-score combines both these scores into a single metric using their harmonic mean.
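As a worked example of Equations (II.1)-(II.3), with made-up counts rather than the experimental ones:

```python
# Precision, recall and F1 computed directly from confusion counts.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Say 20 lines were predicted "BoS", 16 of them correctly, and 8 true
# "BoS" lines were missed (illustrative numbers only):
p, r, f = prf1(tp=16, fp=4, fn=8)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.8 0.667 0.727
```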

The results for binary classification obtained for the different learning algorithms are provided in Table II.1. As can be seen from the table, most of the methods perform well on the data used. Most notably, all methods using support vector machines perform very well, along with the random forest method. The table shows the results for each of the binary classes as well as the average for each of the methods used. For the class “BoS”, we see that models based on SVM using the linear and rbf kernels give the best results, with observed precision of 0.860 and 0.849 respectively. They are followed by the logistic regression and random forest models with precision of 0.824 and 0.823. When looking at the recall for class “BoS”, Naive Bayes is the best method in our experiments, with a recall of 0.849. Random forest is the second best model for class “BoS” with a recall of 0.664, followed by Adaboost with decision trees and linear SVM. When looking at the F1 score for class “BoS”, the random forest model performs best with an F1 score of 0.735, followed by the linear SVM model with an F1 score of 0.722.


Model               | Class   | TP    | FP    | Precision | Recall | F1-score
SVM                 | “N”     | 0.997 | 0.479 | 0.986     | 0.997  | 0.991
                    | “BoS”   | 0.521 | 0.003 | 0.849     | 0.521  | 0.646
                    | Average | 0.983 | 0.465 | 0.982     | 0.983  | 0.981
SVM SGD             | “N”     | 0.996 | 0.412 | 0.988     | 0.996  | 0.992
                    | “BoS”   | 0.588 | 0.004 | 0.805     | 0.588  | 0.680
                    | Average | 0.984 | 0.400 | 0.982     | 0.984  | 0.982
SVM Linear          | “N”     | 0.997 | 0.378 | 0.989     | 0.997  | 0.993
                    | “BoS”   | 0.622 | 0.003 | 0.860     | 0.622  | 0.722
                    | Average | 0.986 | 0.367 | 0.985     | 0.986  | 0.985
Logistic Regression | “N”     | 0.996 | 0.412 | 0.988     | 0.996  | 0.992
                    | “BoS”   | 0.588 | 0.004 | 0.824     | 0.588  | 0.686
                    | Average | 0.984 | 0.400 | 0.983     | 0.984  | 0.983
Naive Bayes         | “N”     | 0.967 | 0.151 | 0.995     | 0.967  | 0.981
                    | “BoS”   | 0.849 | 0.033 | 0.437     | 0.849  | 0.577
                    | Average | 0.963 | 0.148 | 0.979     | 0.963  | 0.969
Random Forest       | “N”     | 0.996 | 0.336 | 0.990     | 0.996  | 0.993
                    | “BoS”   | 0.664 | 0.004 | 0.823     | 0.664  | 0.735
                    | Average | 0.986 | 0.326 | 0.985     | 0.986  | 0.985
Adaboost J48        | “N”     | 0.995 | 0.353 | 0.989     | 0.995  | 0.992
                    | “BoS”   | 0.647 | 0.005 | 0.786     | 0.647  | 0.710
                    | Average | 0.984 | 0.343 | 0.983     | 0.984  | 0.984
Adaboost SVM        | “N”     | 0.996 | 0.412 | 0.988     | 0.996  | 0.992
                    | “BoS”   | 0.588 | 0.004 | 0.833     | 0.588  | 0.690
                    | Average | 0.984 | 0.400 | 0.983     | 0.984  | 0.983

Table II.1 – Results for the binary classification of lines into classes “BoS” and “N”



In the case of class “N”, all the models perform well when considering precision, which remains over 0.986 in all cases. The best model is Naive Bayes followed by random forest, with precision of 0.995 and 0.990 respectively. Similarly, high recall is also observed for this class. The best model in terms of recall for class “N” is the support vector machine (with both linear and rbf kernels), followed by logistic regression and random forests. The best overall performance for this class, as measured by F1 score, is obtained with the linear SVM and random forest models. These two models are also the best when considering the overall performance on both classes, with an F1 score of 0.985 in both cases. They are thus the best models to use for the task of detecting section boundaries within resumes.

We note that there is a high class imbalance in the data set. The ratio of “N” to “BoS” is about 33:1; therefore, misclassifying one example of the latter class does not strongly affect the overall performance of an algorithm. For example, the Naive Bayes classifier has an F1 score of 0.577 for class “BoS”, but its overall F1 score is still 0.969, which is very high. It is for this reason that the results for the class with fewer samples must be observed and given more importance. Given this consideration, the best classifier is the random forest, which has the highest F1 score for class “BoS” as well as the highest F1 score overall.

We stress that since there is class imbalance in the data set, it is important to give more weight to the results for the minority class: when the number of samples for one class is considerably lower than for the other classes, the overall metrics are dominated by the majority class and can mask poor performance on the minority class.

The training times for the various methods vary considerably. For instance, Adaboost with SVM takes about 27 seconds to train on a Core i5 computer, whereas a model using a single SVM with linear kernel only takes about 0.15 seconds and a random forest takes about 0.51 seconds. In addition, no improvement in results is observed with Adaboost over single-model classifiers.

2 Identification of the type of sections

In the previous section, we discussed an approach to identify the boundaries of sections present in resumes. Once the boundaries have been identified and the sections isolated, the next step is to determine what type of information is present in them. A section could, for example, give details about the education institutes frequented by the applicant or about their skills. We choose a set of classes which broadly describe the different aspects of an applicant's profile covered by each section. These classes are based on the HRM Ontology [Gómez-Pérez et al., 2007], which defines an ontology for Human Resource Management. The selected classes are:

— Personal: containing personal information such as name, date of birth and contact details.

— Education: containing details about academic training, such as the schools and universities frequented and the degrees obtained.

— Professional: containing information about professional experience, such as the organizations an individual worked at and the positions held.

— Skills: containing information about professional or personal skills of the applicant.

— Other: all the other sections.

The classes “Personal”, “Education”, “Professional” and “Skills” are chosen because they are most commonly found in resumes and are the most important in describing different aspects of an applicant's background and experience. They provide personal information such as name, age and contact information, the educational and professional background, and the applicant's skills. The class “Other” is used as a label for all the sections which should not be classified as any of the other four classes. This is done to avoid misclassification of sections into one of the above-mentioned classes. We also note that we do not assume any particular order for these sections, nor do we enforce any limitation on the number of times each kind of section may appear in a resume. For instance, an applicant may choose to detail their education experience in one section and their trainings and certifications in another. Similarly, they may choose to separate their soft skills and language skills from their technical skills and create two separate sections for them.

In Figure II.1, six sections are highlighted in the example resume. The first section in the figure concentrates on personal information, the second and fourth focus on professional experience, the third focuses on education experience, and the last two are classified as “Other” since they should not be classified as any other class.

In order to train the models for classification, a training set is created using the output of the binary classifier described in the previous section. Resumes are split into sections at the lines which are identified as marking the beginning of a section (“BoS”). These sections are then manually annotated by the author into one of the classes mentioned previously.

2.1 Features

A combination of features is used to classify sections into one of the five selected classes. First, the text present in the section is transformed into a vector space which comprises all the distinct words across all the sections. A vector space representation of a resume essentially associates a statistic with each of the terms present in the document. For instance, in a simple frequency-based representation, each document could be represented as a vector having as many elements as there are words in the vocabulary, each element being the frequency with which that word appears in the document. However, simple frequencies introduce numerous problems, such as a bias towards longer documents and common words (such as stop-words) having high frequencies. Because of these shortcomings, the tf-idf statistic, a widely used document representation technique, is used to address these issues. In fact, tf-idf is the product of two statistics, namely term frequency (tf) and inverse document frequency (idf). The term frequency tf(t, d) is


Class        | Keywords and patterns
Education    | education, ecole, univ.*, college, degree, school, diploma, diplome, grade, GPA
Professional | inc, limited, ltd
Personal     | Phone: 0[0-9]([ .-]?[0-9]{2}){4}
             | Email: ([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4})

Table II.2 – Keywords and patterns used as features for section detection

the number of times that the term t appears in the document d.

The inverse document frequency is a measure of how much information a term provides. In other words, tfidf promotes terms that are frequent within a section while being rare in other sections. Such terms are likely to be important for that section, as opposed to terms that are used frequently in any section. For instance, sections describing the educational experience may use terms like “university”, “college” and “Master” - thereby giving clues to the classifier. On the contrary, a section containing personal information may contain terms like “phone” and “email”.

Formally, the idf of a term t in a corpus D is given by:

idf(t, D) = log( N / (1 + |{d ∈ D : t ∈ d}|) )

where N is the number of documents in the corpus and the denominator counts the number of documents in which the term t appears, adjusted by adding 1 in order to avoid division by zero. Finally, tf-idf is computed as:

tfidf(t, d, D) = tf(t, d) · idf(t, D)
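These two formulas translate directly into code. The sketch below applies them to a toy corpus of three sections (illustrative data and names, keeping the “+1” smoothing from the formula above):

```python
import math

def tf(term, doc):
    # Number of times the term appears in the document (a token list).
    return doc.count(term)

def idf(term, corpus):
    # log(N / (1 + number of documents containing the term)).
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + n_containing))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["university", "master", "degree"],    # an education section
    ["engineer", "company", "project"],    # a professional section
    ["python", "java", "english"],         # a skills section
]
# "university" occurs in one of three sections: tf = 1, idf = log(3/2).
print(tfidf("university", corpus[0], corpus))
```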

In addition to the vector representation of sections, we also use frequency-based features for particularly interesting keywords and patterns which are helpful in determining the type of each section. Each group of keywords is treated as a feature whose value is the number of times the patterns associated with it appear in a section. A detailed list of these keyword patterns is given in Table II.2.

In the table, “.*” is a regular expression pattern which matches any characters, meaning that the pattern “univ.*” will match both “university” and “université”. Similarly, the patterns for phone numbers and email addresses are expressed as regular expressions.
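Counting pattern matches per section can be sketched with Python's re module (the pattern list below is abridged from Table II.2, and the section text is invented):

```python
import re

# Abridged keyword/pattern groups; each group yields one feature whose
# value is the total number of matches within the section.
PATTERNS = {
    "education": [r"education", r"univ.*", r"college", r"degree"],
    "personal": [r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"],
}

def pattern_features(section_text):
    text = section_text.lower()
    return {
        group: sum(len(re.findall(p, text)) for p in patterns)
        for group, patterns in PATTERNS.items()
    }

section = "MSc degree, Université Paris-Saclay. Contact: jane@example.com"
print(pattern_features(section))  # → {'education': 2, 'personal': 1}
```

Here “univ.*” and “degree” each match once, giving the “education” feature a value of 2, while the email regex fires once for the “personal” feature.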

2.2 Evaluation Results

The models trained using the data set include Support Vector Classifier, Naive Bayes, Stochastic Gradient Descent (SGDC), Bagging, Adaboost, and Random Forests. The parameters for these methods are selected through grid search. For the Random Forest model, the number of estimators is set to 50, the gini index is used to measure the quality of the splits, and the minimum number of samples required for each leaf is 2. The number of estimators for the Adaboost classifier is 300 and decision trees are used as the constituent weak learners for this method. For Naive Bayes, the smoothing parameter α is set to one and uniform prior probabilities are used.

Three different versions of Support Vector classification are experimented with. In the first version, a one-vs-rest strategy is employed, whereby a separate binary classifier is trained for each class in the data set. This technique takes the samples of one class as positive examples, while the samples of all the other classes are treated as negative examples. In the second version, a one-vs-one strategy is employed, where a separate classifier is trained for each pair of classes present in the data set. The third version uses the one-vs-rest scheme with stochastic gradient descent learning. For the first two versions, the penalty parameter C is set to 10,000 while γ, the coefficient for the rbf kernel, is set to 10−4. The learning rate for SGD is set to 10−3. Additionally, an ensemble classifier using Support Vector classifiers with 10 estimators is also employed. This method averages the predictions provided by each of the estimators for each example.
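The one-vs-rest and one-vs-one strategies can be illustrated with scikit-learn's meta-estimators (an assumed toolkit) on synthetic five-class data:

```python
# One-vs-rest trains one binary classifier per class; one-vs-one trains
# one per pair of classes (5 vs. 5*4/2 = 10 estimators here).
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the five section classes.
X, y = make_classification(n_samples=300, n_classes=5, n_informative=8,
                           random_state=0)

svc = SVC(kernel="rbf", C=10000, gamma=1e-4)
ovr = OneVsRestClassifier(svc).fit(X, y)
ovo = OneVsOneClassifier(svc).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))  # → 5 10
```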

In order to build a classifier for identifying the type of a section, the section boundaries identified previously are used to isolate sections. Sections are identified by splitting the resumes at the lines which were classified as “BoS”. This yields 459 sections, which have been manually annotated by the author into one of the 5 classes previously described. This set of sections is used to train and evaluate the different classification models, which utilize the features described earlier in this section.

The results obtained for these models are summarized in Table II.3. From this table, we can observe that determining the appropriate class for each section is a difficult task, whereby most of the classifiers struggle with multiple types of sections. Overall, the one-vs-rest Support Vector Classifier performs best among the different classifiers tried: it yields the highest average F1 score, 81%, over all the classes. Both the SGD-based Support Vector Classifier and the Random Forest classifier have very similar, but slightly lower, overall performance. We can observe that the Random Forest classifier is the best for the “Personal” and “Professional” classes, whereas the one-vs-rest SVC performs best for the classes “Skills” and “Other”. In the case of the class “Education”, both the one-vs-one SVC and the one-vs-rest SVC with stochastic gradient descent training perform best, with an F1 score of 88%. In general, the “Personal” and “Education” classes are the easiest to predict, with high F1 scores achieved for most of the classifiers, followed by the “Professional” sections.

The sections “Skills” and “Other” are the most difficult to predict, with best F1 scores of 71% and 77% respectively. This is because these sections are the most diverse and their texts can be starkly different from one resume to another. The following excerpts from two different sections, both classified as “Skills”, illustrate this problem:


Classifier              | Section      | Precision | Recall | F1-score
Random Forest           | Education    | 0.84      | 0.87   | 0.85
                        | Other        | 0.68      | 0.81   | 0.74
                        | Personal     | 0.90      | 0.90   | 0.90
                        | Professional | 0.86      | 0.88   | 0.87
                        | Skills       | 0.82      | 0.48   | 0.61
                        | Average      | 0.81      | 0.80   | 0.79
One Vs. Rest SVC        | Education    | 0.81      | 0.87   | 0.84
                        | Other        | 0.71      | 0.83   | 0.77
                        | Personal     | 0.87      | 0.90   | 0.89
                        | Professional | 0.88      | 0.83   | 0.85
                        | Skills       | 0.89      | 0.59   | 0.71
                        | Average      | 0.82      | 0.81   | 0.81
Adaboost (DTrees)       | Education    | 1.00      | 0.66   | 0.79
                        | Other        | 0.50      | 0.84   | 0.63
                        | Personal     | 0.88      | 0.76   | 0.81
                        | Professional | 0.78      | 0.55   | 0.65
                        | Skills       | 0.86      | 0.55   | 0.67
                        | Average      | 0.76      | 0.68   | 0.69
SVC                     | Education    | 0.90      | 0.87   | 0.88
                        | Other        | 0.68      | 0.85   | 0.76
                        | Personal     | 0.87      | 0.87   | 0.87
                        | Professional | 0.87      | 0.81   | 0.84
                        | Skills       | 0.85      | 0.59   | 0.69
                        | Average      | 0.82      | 0.80   | 0.80
Bagging SVC             | Education    | 0.89      | 0.83   | 0.86
                        | Other        | 0.62      | 0.77   | 0.69
                        | Personal     | 0.87      | 0.87   | 0.87
                        | Professional | 0.82      | 0.74   | 0.78
                        | Skills       | 0.73      | 0.55   | 0.63
                        | Average      | 0.77      | 0.76   | 0.76
SGDC                    | Education    | 0.90      | 0.87   | 0.88
                        | Other        | 0.70      | 0.83   | 0.76
                        | Personal     | 0.84      | 0.87   | 0.85
                        | Professional | 0.85      | 0.83   | 0.84
                        | Skills       | 0.85      | 0.59   | 0.69
                        | Average      | 0.81      | 0.80   | 0.80
Multinomial Naive Bayes | Education    | 0.57      | 0.90   | 0.70
                        | Other        | 0.84      | 0.40   | 0.54
                        | Personal     | 0.89      | 0.83   | 0.86
                        | Professional | 0.61      | 0.86   | 0.71
                        | Skills       | 0.76      | 0.66   | 0.70
                        | Average      | 0.74      | 0.70   | 0.68

Table II.3 – Results for the classification of sections into the five selected classes


with C, Assembler, Fortran, Basic, MATHCAD . . .

. . . Languages: Fluent in English (TOEIC: 880 points obtained in 2004) . . .

Some contributing factors include the different domains of expertise of the applicants and the diverse types of information provided in sections perceived to describe the “Skills” of an individual. For instance, the text present in the skills section of a computer scientist may be completely different from the skills of an economist, with no overlap in the terms within. In the above examples, the first individual mentions technical skills whereas the second one outlines their language skills. Similarly, since “Other” is an all-encompassing class, it is possible that sections with very different content are combined together under the same label. For example, sections with details about publications are grouped together with honors, awards and extra-curricular activities. This is due to the vast variety of section types possible in resumes and the impossibility of exhaustively annotating the different types of sections.

Even though a direct comparison with other approaches such as PROSPECT [Singh et al., 2010] is not possible since the data sets are not the same, a brief comparison is provided here. In their approach, only two sections, namely “Education” and “Professional”, are detected, with the best reported F1 scores being 92.1% and 78.5% respectively. Compared with 88% and 87% respectively for our methods, this is a mixed result: PROSPECT performs better for class “Education” by 4%, whereas the Random Forest model proposed in this section performs better for the “Professional” class by 8.5%. However, the list of features used by PROSPECT is extensive, including named entity recognition and visual features such as font sizes and tables present in the documents. These visual clues and features are not available in our case.

3 Conclusion

In this chapter, we have discussed the importance of determining sections within a resume for the purpose of extracting information from them. Since resumes can be written in a wide range of formats and the writing styles of individuals are versatile, it is important to isolate the different parts of resumes in order to get contextual clues. This contextual information can be leveraged in the information extraction phase. We have presented a two-phase classification process in this chapter. In the first phase, section boundaries are determined within resumes using a binary classifier, where each line in a resume is treated as an example. Different classifiers are used to train a model to detect the lines which mark the beginning of sections. The support vector classifier with linear kernel as well as the random forest model perform best and yield an F1 score of 0.985.

In the second phase, we isolate sections based on the boundaries previously determined by the binary classifier in the first phase. A new classifier is then trained to detect the type of each section. Each section is classified into one of 5 different classes commonly found in resumes. The experiments show that different classifiers give the best results for different classes. Overall, high performance is achieved using the support vector and random forest classifiers, with F1 scores ranging between 71% and 90% across all the classes.
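The grouping-and-typing step of the second phase can be illustrated as follows. The TF-IDF representation, the LinearSVC model and the toy training data here are assumptions made for the sketch, not the exact configuration evaluated in this chapter.

```python
# Sketch of phase 2: group lines into sections using the boundary
# flags from phase 1, then classify each section's text by type.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def split_sections(lines, boundary_flags):
    """Group lines into sections; a flag of 1 starts a new section."""
    sections, current = [], []
    for line, flag in zip(lines, boundary_flags):
        if flag and current:
            sections.append(" ".join(current))
            current = []
        current.append(line)
    if current:
        sections.append(" ".join(current))
    return sections

# Toy labeled sections (invented for illustration).
train_texts = ["B.Sc. Computer Science University 2014",
               "Software Engineer Microsoft 2015 2018",
               "Python Java SQL fluent English"]
train_types = ["Education", "Professional", "Skills"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_types)

sections = split_sections(
    ["EDUCATION:", "M.Sc. Computer Science University of X"], [1, 0])
predicted = model.predict(sections)
```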

We observe that the classes “Skills” and “Other” are particularly hard to classify due to the diverse nature of their content. This aspect is a good direction for future work on this topic. In particular, it may be possible to use topic modeling, along with more elaborate features and representations, to improve the performance of the classification.


Chapter III

Extracting Key Information from Resumes


Introduction

In this chapter, we define an approach that automatically extracts key information from resumes. We make use of the multilingual encyclopedic knowledge provided by Wikipedia to help identify and filter the key information present within resumes. The input to our approach is a section, obtained as a result of automatic classification. A two-step classification approach was used to identify and classify sections into the classes Education, Professional, Skills and Other. The details of this classification process have been discussed in Chapter II. The class associated with a section indicates the type of information that is present in it. For example, a section classified as “Education” gives details about the schools and universities attended by an applicant. In this chapter, we focus on extracting information from two types of sections, namely “Education” and “Professional”. The entities detected within these sections also have the same types associated with them. For instance, the entity “University of Paris-Sud” is classified as having type “Education”, whereas the entity “Microsoft Inc.” is classified as having type “Professional”.

More specifically, our approach links n-grams (i.e. phrases) in the resume to Wikipedia articles that describe the entities denoted by these phrases. The rationale of our approach is to find all possible interpretations (i.e. meanings), and select the most appropriate one, given the context. For example, the n-gram “Marie Curie” is ambiguous and could be interpreted as a scientist or a university, among others. If this n-gram appears in the Education section, its interpretation as a university is preferred since it is more likely and appropriate in this context. This is because education institutes are the targeted types of entities for sections related to education experience of applicants. Other interpretations of the n-gram (as a scientist in this example), can be discarded by using information present about the entity in Wikipedia. The main contributions of this approach are as follows:
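The context-based selection described above can be sketched in a few lines. The candidate catalogue and the section-to-type mapping below are hand-made stand-ins for a real Wikipedia lookup, introduced only for illustration.

```python
# Sketch: each candidate interpretation of an n-gram is kept only if
# its types match the types targeted by the section it appears in.
CANDIDATES = {
    "Marie Curie": [
        {"title": "Marie Curie", "types": {"scientist", "person"}},
        {"title": "Pierre and Marie Curie University",
         "types": {"university", "education"}},
    ],
}

SECTION_TARGET_TYPES = {
    "Education": {"university", "school", "education"},
    "Professional": {"company", "organization"},
}

def best_interpretation(ngram, section_class):
    """Prefer the contextually appropriate meaning of the n-gram."""
    targets = SECTION_TARGET_TYPES[section_class]
    matches = [c for c in CANDIDATES.get(ngram, [])
               if c["types"] & targets]
    return matches[0]["title"] if matches else None

choice = best_interpretation("Marie Curie", "Education")
```

Under this filter, the university reading of “Marie Curie” survives in an Education section, while the scientist reading is discarded.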

1. it does not require a priori POS tagging and is not dependent on lexical information.

2. it can be easily adapted to detect entities of any given type, such as education institutes or professional organizations.

3. it is language independent.

4. it is not supervised, thus it requires no manually labeled training data.

Many existing state of the art approaches for information extraction are discussed at length in Section 1. None of these approaches fulfill all of the above mentioned properties and therefore they are not suitable for use in this setting. Within the widely studied subject of entity linking to a knowledge base, there is room for improvement on the aspects of multi-language support, the ability to filter the type of entities that are output, and the ability to handle input documents of different types and domains.

We have evaluated this approach on a real data set of resumes for two target classes, i.e. Education and Professional institutes. Comparisons are also conducted against TagMe [Ferragina and Scaiella, 2012] and Babelfy [Moro et al., 2014], which show that this approach achieves better results for both target classes.

The remainder of this chapter is organized as follows. After an overview of related work in Section 1, we introduce preliminary concepts and notations in Section 2. Our approach is detailed in Section 3, while the evaluation process and results are discussed in Section 4, followed by concluding remarks in Section 5.

1 State of the art

The approach described in this chapter touches on different research areas. In this section, prominent works in these areas are discussed to give an overview of existing state of the art methods.

1.1 Entity Recognition

Entity Recognition, also known as Named Entity Recognition and Classification, is a widely studied topic [Singh et al., 2013, Nadeau and Sekine, 2007]. There are several key factors that affect any entity recognition system, such as the type and language of input documents, and the use (or not) of a knowledge base. A wide variety of methods for entity recognition also exist, which impact the input training data required and the result of the system. Supervised methods use techniques such as HMMs, CRFs or SVMs [McCallum and Li, 2003, Asahara and Matsumoto, 2003]. The most basic approaches involve identification of a text boundary within an input text followed by classifying the text into one of predefined classes [Finkel et al., 2005, Siencnik, 2015, Atdag and Labatut, 2013]. More complex systems involve not only identifying a text boundary, but also associating a concept within a curated knowledge base [Cucerzan, 2007, Kulkarni et al., 2009, Meij et al., 2012, Milne and Witten, 2008, Mihalcea and Csomai, 2007, Moro et al., 2014, Carmel et al., 2014, Ferragina and Scaiella, 2012, Atdag and Labatut, 2013].
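The basic "identify a span, then classify it" pattern mentioned above can be illustrated with a deliberately naive toy: a capitalisation heuristic for span detection and a small lexicon for labeling. Both are invented here; real systems replace them with the statistical models cited in the text (HMMs, CRFs, SVMs).

```python
# Toy illustration of the two-step NER pattern:
# (1) detect candidate entity spans, (2) assign each span a class.
import re

LEXICON = {"Microsoft": "ORGANIZATION", "Paris": "LOCATION"}

def recognize(text):
    spans = re.findall(r"\b[A-Z][a-z]+\b", text)       # step 1: spans
    return [(s, LEXICON.get(s, "O")) for s in spans]   # step 2: labels

result = recognize("She joined Microsoft in Paris.")
```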

The type of data used for training and testing a named entity recognition system influences the choice of method used. For example, the input data could be web pages, news articles, biological, chemical, medical or legal texts, or even documents that differ vastly from conventional writing styles, such as Tweets or resumes. There exist methods which are specialized to handle such non-conventional documents [Derczynski et al., 2015, Wang et al., 2015, Zhang and Elhadad, 2013, Rocktäschel et al., 2012].

The design of entity recognition systems is also influenced by the type of, and number of, entities that are targeted. In traditional entity recognition and classification systems, the most prevalent entity types in the literature are "persons", "organizations" and "locations", known collectively as "enamex" [Sundheim, 1995]. Some approaches also attempt to target more fine-grained entity classes such as “politician” or “entertainer” as
