A two-stage contagious Naive Bayes classifier for detecting sociolinguistic features in text

Iena Petronella Derks 1 and Alta de Waal 1,2

1 Department of Statistics, University of Pretoria

2 Center for Artificial Intelligence Research (CAIR)

1 Introduction

Online platforms allow users to disguise their identities, which makes virtual interactions anonymous or misleads the recipients of those interactions. This also creates an environment that facilitates cybercrime, allowing users to take advantage of others and commit heinous acts. An important concern about social media usage, in particular, is the safety of under-age users who have access to the Internet.

Children are more vulnerable to threatening situations, such as harassment [3], cyberbullying [7], and inappropriate conversations [8]. Natural language processing (NLP) techniques can be used to process and understand social media data [1]. In the area of sociolinguistics, there is evidence that links natural word use to personality and social fluctuations [5]. In NLP, the term burstiness describes the tendency of a word to recur. Burstiness is frequently exhibited in real text: an informative word is more likely to occur again if it has already appeared in the text [2]. State-of-the-art NLP models, such as the multinomial Naive Bayes model, are often used to model text documents [4].
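The difference between a model that ignores burstiness and one that captures it can be illustrated with a small sketch. It assumes a Pólya urn (Dirichlet compound multinomial) scheme as the contagious model, which is one common way to model burstiness and not necessarily the formulation used in this paper. The function names and the pseudo-count values are hypothetical.

```python
from fractions import Fraction


def multinomial_next_prob(theta_w, seen_w, seen_total):
    """Multinomial model: the chance of word w at the next position
    is the fixed parameter theta_w, whatever has been seen so far."""
    return theta_w


def polya_next_prob(alpha_w, alpha_total, seen_w, seen_total):
    """Polya urn model (assumed here as the contagious model):
    every earlier occurrence of w raises its chance of recurring.

    alpha_w      pseudo-count of word w
    alpha_total  sum of the pseudo-counts over the vocabulary
    seen_w       occurrences of w in the document so far
    seen_total   number of tokens in the document so far
    """
    return Fraction(alpha_w + seen_w, alpha_total + seen_total)


# Hypothetical word with base probability 1/10 under both models.
base = Fraction(1, 10)
for seen_w, seen_total in [(0, 0), (1, 1), (3, 5)]:
    fixed = multinomial_next_prob(base, seen_w, seen_total)
    bursty = polya_next_prob(1, 10, seen_w, seen_total)
    print(seen_w, seen_total, fixed, bursty)
```

Under the multinomial model the probability stays at 1/10, whereas under the urn scheme it rises once the word has appeared, which is the behaviour the term burstiness refers to.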

2 Methodology

One application area of NLP is sociolinguistics, which can be defined as the study of the relationship between social factors and language. Sociolinguistics aims to isolate features that determine linguistic variation across social conditions [6]. This paper investigates classification models that can capture the burstiness, or contagious effects, of text, because the texts of interest are manifestations of different social groups of people. For example, a teenager and an adult impersonating a teenager will use different sentence structures. More formally, the purpose of this work is to:

1. Learn the linguistic patterns among different social groups of people, and classify unknown authors according to these patterns; and

2. Represent these patterns as Bayesian networks to gain an understanding of the dependency structure of words used among different social groups.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

To identify these linguistic patterns, a comparison is made between two classification techniques, namely the Naive Bayes (NB) classifier and a contagious counterpart thereof. The NB classifier assumes that the words occurring in a document are independent of each other. The contagious classifier, on the other hand, captures the burstiness phenomenon. To go one step further, we investigate the dependencies between words using a Bayesian network. This allows us to understand why certain word patterns result in a particular classification.
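As a point of reference for the baseline, the following is a minimal sketch of a multinomial NB classifier with add-one smoothing, written in plain Python. It is not the authors' implementation; the function names, the smoothing constant and the toy tokens are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class.

    docs   list of token lists
    labels list of class labels, one per document
    alpha  smoothing constant (add-one smoothing by default, an assumption)
    """
    vocab = {word for doc in docs for word in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in docs_per_class.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {
                word: math.log((word_counts[label][word] + alpha) / denom)
                for word in vocab
            },
        }
    return model


def predict(model, doc):
    """Return the class with the highest posterior score.

    Words are treated as independent given the class, so the score is a
    sum of per-word log likelihoods; unseen words are skipped.
    """
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_lik"][word] for word in doc if word in params["log_lik"]
        )
    return max(model, key=score)


# Toy corpus with hypothetical tokens and labels.
docs = [
    ["lol", "omg", "lol"],
    ["omg", "lol", "school"],
    ["regards", "meeting", "schedule"],
    ["meeting", "regards", "invoice"],
]
labels = ["teen", "teen", "adult", "adult"]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["lol", "school"]))
```

Because the score is a plain sum over tokens, a repeated word contributes the same amount each time it occurs. This is the independence assumption that a contagious classifier relaxes.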

2.1 Data Application

This paper presents a comparison between the baseline NB classifier and the proposed contagious counterpart thereof. Two data sets will be used to evaluate the performance of each method, namely the IMDB data set and the PAN 2012 data set. The IMDB data set consists of movie reviews with binary sentiment labels. The PAN 2012 data set was originally used to identify potential predators in online conversations and contains 66 927 conversations. The problem is addressed with a two-stage solution: stage 1 is based on text classification techniques, and stage 2 makes use of Bayesian networks to understand the structural dependencies among words in a document. Stage 1 is evaluated with typical classification performance measures, whereas the visual structure learning of the Bayesian network in stage 2 provides exploratory data analysis for understanding the conditional dependencies between words.
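The paper does not state which structure learning algorithm is used in stage 2. As a rough illustration of what exploring dependencies between words can look like, the sketch below scores word pairs by the mutual information of their presence indicators across documents. Such pairwise scores serve as edge weights in, for example, Chow-Liu tree learning, but the sketch itself does not build a Bayesian network. All names and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def mutual_information(docs, word_a, word_b):
    """Mutual information (in nats) between the presence of two words."""
    n = len(docs)
    joint = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for doc in docs:
        tokens = set(doc)
        joint[(int(word_a in tokens), int(word_b in tokens))] += 1

    mi = 0.0
    for (a, b), count in joint.items():
        if count == 0:
            continue  # empty cells contribute nothing
        p_ab = count / n
        p_a = (joint[(a, 0)] + joint[(a, 1)]) / n
        p_b = (joint[(0, b)] + joint[(1, b)]) / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def rank_word_pairs(docs, vocab):
    """Sort all word pairs from most to least mutually informative."""
    scored = [
        (mutual_information(docs, a, b), a, b)
        for a, b in combinations(sorted(vocab), 2)
    ]
    return sorted(scored, reverse=True)


# Toy documents: "a" and "b" always appear together, "c" is unrelated.
toy_docs = [["a", "b"], ["a", "b", "c"], ["c"], []]
for score, a, b in rank_word_pairs(toy_docs, {"a", "b", "c"}):
    print(a, b, round(score, 3))
```

Pairs with high scores are candidates for edges in a dependency structure. A full structure learning procedure would also need to decide edge directions and guard against spurious dependencies.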

References

1. Chowdhury, G.G.: Natural language processing. Annual review of information science and technology 37(1), 51–89 (2003)

2. Doyle, G., Elkan, C.: Accounting for burstiness in topic models. In: Proceedings of the 26th Annual International Conference on Machine Learning. pp. 281–288. ACM (2009)

3. Kennedy, G., McCollough, A., Dixon, E., Bastidas, A., Ryan, J., Loo, C., Sahay, S.: Technology solutions to combat online harassment. In: Proceedings of the first workshop on abusive language online. pp. 73–77 (2017)

4. Madsen, R.E., Kauchak, D., Elkan, C.: Modeling word burstiness using the Dirichlet distribution. In: Proceedings of the 22nd international conference on Machine learning. pp. 545–552. ACM (2005)

5. Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological aspects of natural language use: Our words, our selves. Annual review of psychology 54(1), 547–577 (2003)

6. Spolsky, B., Widdowson, H., et al.: Sociolinguistics, vol. 1. Oxford University Press (1998)

7. Van Hee, C., Lefever, E., Verhoeven, B., Mennes, J., Desmet, B., De Pauw, G., Daelemans, W., Hoste, V.: Automatic detection and prevention of cyberbullying. In: International Conference on Human and Social Analytics (HUSO 2015). pp. 13–18. IARIA (2015)

8. Yenala, H., Jhanwar, A., Chinnakotla, M.K., Goyal, J.: Deep learning for detecting inappropriate content in text. International Journal of Data Science and Analytics 6(4), 273–286 (2018)
