Information Retrieval Features for Personality Traits

(1)

Information Retrieval Features for Personality Traits

Edson Roberto Duarte Weren [email protected]

Abstract. This paper describes the methods employed to solve the Au- thor Profiling task at PAN-2015. The main goal was to test the use of features derived from Information Retrieval to identify the personality traits of the author of a given text. This paper describes the features, the classification algorithms employed, and how the experiments were run.

Also, I provide a comparative analysis of my results compared to those of other groups.

Keywords: Information Storage and Retrieval, Document and Text Processing.

1 Introduction

Author Profiling, which has a growing importance in applications in forensics, marketing, and security[1], deals with the problem of finding as much information as possible about an author, just by analyzing an text produced by that author.

This paper reports on the my participation at the third edition of the Au- thor Profiling task, organized in the scope of the PAN Workshop series, which is collocated with CLEF2015. More details about the task and the workshop can be found in the overview paper [3]. The task requires that participating teams come up with approaches that take a text as input and predict the gender (male/female), the age group (18-24, 25-34, 35-49, or 50+) and the author’s personality traits (extroverted, stable, agreeable, conscientious, or open in a range from -0.5 to 0.5).

2 Identifying Author Profiles

The underlying assumption was that authors from the same gender, age group or personality traits tend to use similar terms and that the distribution of these terms would be different across genders, age groups and personality traits. To implement this notion, all conversations were indexed using an Information Re- trieval engine and then, the conversation to be classified was treated as a query.

The idea is that the conversations retrieved (i.e., the most similar to the query) are the ones from the same gender, age group, and personality traits.

(2)

The training dataset was composed of conversations (XML files) about var- ious topics grouped by author. Conversations were in English, Spanish, Italian, and Dutch and were annotated with the gender, age group, and the personality traits of the author. A complete description of the dataset may be found in http://pan.webis.de/.

2.1 Features

The texts from each author, or the documents, were represented by a set of 288 features (or attributes).

The complete set of texts was indexed by an Information Retrieval (IR) System in a manner similar to that used in [4–6]. Then, the text that to be classified was used as a query and the kmost similar texts were retrieved. The ranking is given by the Cosine or Okapi metrics as explained below.

Cosine These features are computed as an aggregation function over the top- k results for each age, gender, and personality trait obtained in response to a query composed by the keywords in the text to be classified. Three types of aggregation functions were tested, namely: count, sum, and average. For this featureset, queries and documents were compared using the cosine similarity (Eq.

1). For example, if we retrieve 10 documents in response to a query composed by the keywords in q, and 5 of the retrieved documents were in the 18-24 age group, then the value for 18-24 cosine avg is the the average of the 5 cosine scores for this class. Similarly, 18-24 cosine sum is the summation of such scores, and 18-24 cosine count simply counts how many retrieved documents fall into the 18-24 cosine count category.

COSIN E= (c, s) c.q

|c||q| (1)

where−→c and−→q are the vectors for the document and the query, respectively.

The vectors are composed oftf_i,c×idf_i weights wheretf_i,c is the frequency of term i in document c, and IDF_i = log_n(i)^N where N is the total number of documents in the collection, andn(i) is the number of documents containingi.

Okapi Similar to the previous, these features compute an aggregation function (average, sum, and count) over the retrieved results from each gender, age, and personality traits group that appeared in the top-k ranks for the query composed by the keywords in the document. For this featureset, queries and documents were compared using the Okapi BM25 score (Eq. 2).

BM25(c, q) =

n

X

i=1

IDFi

tfi,c·(k1+ 1)

tf_i,c+k₁(1−b+b_avgdl^|D| ) (2) wheretfi,candIDFiare as in Eq. 1|d|is the length (in words) of documentc, avgdlis the average document length in the collection,k1 andb are parameters

(3)

that tune the importance of the presence of each term in the query and the length of the text. In my experiments, i usedk1= 1.2 andb= 0.75.

2.2 Experiments

The steps taken to process the datasets and run the experiments were the fol- lowing:

1. Pre-process the conversation in the training data to tokenize (only during testing, stemming and stopword removal was performed but without significant gains).

2. Use each conversation as queries.

3. Index 100% of the pre-processed conversations with a retrieval engine. Zettair¹, which is a compact and fast search engine developed by RMIT University (Australia), was used for indexing and querying. Zettair implements several methods for ranking documents in response to queries and calculates cosine and Okapi BM25.

4. Compute the features using the results from the queries submitted to Zettair.

The top-10 scoring conversations were retrieved.

5. Train the classifiers and generate the models. Weka [2] was used to build the classification models.

6. Use the trained classifiers to predict the classes of the conversations used as queries. Once the classifiers are trained, they can be used to predict the classes for new unlabelled conversations. Thus, the conversations from the test data were treated as queries and went through steps 1, 4 and 6.

2.3 Training the Classifiers

Twenty-six classifiers are necessary, since there are four languages and seven dimensions in each (age [only English and Spanish], gender, and personality traits [extroverted, stable, agreeable, conscientious, and open]). All results in this section refer to experiments run on thetraining data only. The predictions of the classifiers were compared two settings (i) using all 288 features an (ii) using just a subset of 6 to 22 features (produced by BestFirst subset evaluator).

Figure 1 shows the results comparing the accuracy of the runs that use all features and the runs that use just the subset. Using the subsets had advantages in nearly all cases. The only exception was for age. This can be confirmed by Table 1, which shows results of pairedt-tests that assess the significance of the difference between the runs that use all features and the runs that use a subset only. For the majority of the learning algorithms, the results of all runs are very close, with a slight advantage in favor of the runs with the selected subset of attributes.

Figure 2 shows the most accurate classifiers on the training data grouped by language. We can see that some languages had better performance than

1 http://www.seg.rmit.edu.au/zettair/

(4)

0.735 0.764 0.735 0.735

0.68

0.898 0.921

0.794 0.852

0.763 0.824

0.711

0.829

0.947

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

Agreeable Consicentious Extroverted Open Stable Age Gender

All Subset

Fig. 1.Accuracy considering all and a subset of features

Table 1.Significance differences between pairs all/subset of features for four languages and seven dimensions

Dimensions Language P-value Significant Advantage of Use of the Features

Agreeable

EN 0.00 Yes All

ES 0.03 Yes Subset

IT-NL 0.00 Yes Subset

Conscientious

EN 0.30 No -

ES 0.40 No -

IT-NL 0.16 No -

NL 0.04 Yes Subset

Extroverted

EN 0.00 Yes All

IT-NL 0.07 No -

NL-ES 0.00 Yes Subset

Gender ES 0.01 Yes Subset

EN-IT-NL 0.00 Yes Subset

Age EN 0.00 Yes Subset

ES 0.07 No -

Open All 0.00 Yes Subset

Stable ES 0.06 No -

EN-IT-NL 0.00 Yes Subset

others. While Italian had the best scores, English had the lowest. This difference could be explained by the fact that Italian has a more diverse morphology and a greater vocabulary compared to English and this may provide the classifier with more distinctive features. Regarding the choice of classifier, we can see that different languages had different classifiers as best performers. Classification via Regression, Random Committee, and Rotation Forest were among the top 5 in two cases each.

Figure 3 shows the most accurate classifiers for Age and Gender prediction.

For age, we notice that a number of algorithms achieved similar results (around 0.8). For gender, RBFNetwork was the best.

Figure 4 shows the best classifiers for modelling personality traits. The Mul- tilayer Perceptron is among the top performers in three out of five traits. The

(5)

0.829 0.822

0.783 0.783 0.783

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

Class.ViaRegression FT LMT SimpleLogistic LogitBoost

(a) English

0.89 0.87 0.87 0.86 0.84

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

RotationForest LMT SimpleLogistic RndCommittee IB1

(b) Spanish

0.921

0.895 0.895

0.684 0.658

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

Class.ViaRegression DecisionStump RndSubSpace FT Nnge

(c) Italian

0.853

0.824 0.824 0.824 0.824

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

RBFNet NaiveBayes NBUpdateable RndCommittee RotationForest

(d) Dutch Fig. 2.Best classifiers based on Accuracy by Language

most notable cases in which there were large differences in the accuracies of the classifiers were for Conscientious and Open.

(6)

0.829 0.822 0.816 0.816 0.816

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

RndCommittee RotationForest BayesNet Decorate SMO

(a) Age

0.900

0.947 0.921 0.921 0.895

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

RBFNetwork Decorate IB1 RndCommittee DecisionStump

(b) Gender Fig. 3.Best classifiers based on Accuracy by Age and Gender

0.794 0.794 0.794 0.794

0.737

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

LMT RndCommittee SimpleLogistic OrdinalClassClassifier END

(a) Agreeable

0.852941 0.852941

0.684211 0.684211 0.66

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

Logistic MultilayerPerceptron FT MultiClassClassifier Decorate

(b) Conscientious

0.763 0.740 0.735 0.735 0.735

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

Nnge RndCommittee END LogitBoost SimpleLogistic

(c) Extroverted

0.824 0.794 0.794

0.664 0.658

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

MultilayerPerceptron LMT SimpleLogistic RndForest FT

(d) Open

0.711 0.711 0.711 0.706 0.684

0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

IB1 LMT SimpleLogistic MultilayerPerceptron Decorate

(e) Stable

Fig. 4.Best classifiers based on Accuracy by Trait

(7)

3 Official Experiments

A pairwise comparison of the accuracies of the twenty-two teams which partici- pated on the Author Profiling task was held, considering age, gender, personality traits and language. In this work, for the systems to be significantly different from each other, they had to have p<0.05.

As a result, system proposed in study is not significantly different from systems that scored best, considering as an example the small set of training data used: English, Spanish, Italian, and Dutch - 152, 100, 38 and 34 files, respectively.

Comparing the results on the training and test datasets, a drop of about ten percentage points was observed. Overall results on the training data were 0.8171 and on the test data the final score was 0.7223.

4 Conclusion

In this paper, i presented an empirical evaluation of a number of features and learning algorithms for the task of identifying author profiles. More specifically, the task here is, for a given text, to identify gender, age group and personality traits of its author.

The goal was to validate the use of Information Retrieval-based features to identify personality traits. The results show that they are suitable to the task.

Acknowledgments. I thank to Viviane Pereira Moreira for their help in the final revision of this paper.

References

1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of an anonymous text. Communications of the ACM 52(2), 119–123 (2009) 2. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The

weka data mining software: an update. ACM SIGKDD explorations newsletter 11(1), 10–18 (2009)

3. Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd Author Profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., SanJuan (eds.) CLEF 2015 Labs and Workshops, Notebook Papers CEUR-WS.org vol. 1391 (2015)

4. Weren, E.R., Kauer, A.U., Mizusaki, L., Moreira, V.P., de Oliveira, J.P.M., Wives, L.K.: Examining multiple features for author profiling. Journal of Information and Data Management 5(3), 266 (2014)

5. Weren, E.R., Moreira, V.P., de Oliveira, J.P.: Exploring information retrieval features for author profiling – notebook for pan at clef 2014. Cappellato et al.[6]

6. Weren, E.R., Moreira, V.P., de Oliveira, J.P.: Using simple content features for the author profiling task. In: Notebook for PAN at Cross-Language Evaluation Forum.

Valencia, Spain (2013)