Automatic Identification of Multipage News: A Machine Learning Approach

Pashutan Modaresi

Heinrich-Heine-University of Düsseldorf Institute of Computer Science, Düsseldorf, Germany

modaresi@cs.uni-duesseldorf.de

Online news contains valuable information that can be used for private or commercial purposes. In the commercial context, online media monitoring services systematically supply companies and individuals with the information they require. This is accomplished by crawling a large number of news websites.

Numerous news websites use pagination to split stories across multiple pages. Identifying such multipage stories has so far required manually defined rules, and because the HTML of these pages changes frequently, maintaining the rules takes a tremendous amount of effort. With this in mind, we propose an automatic approach to identifying multipage news stories.

We collected a list of web pages on which news stories were split across multiple pages and annotated them manually. Each link on a page was assigned a label: the link either points to the next page of the story or it does not. Because links that do not point to a next page far outnumber links that do, the data set is highly imbalanced. Moreover, to keep the algorithm language independent, we included news pages originating from different countries.

For each link, the class and id attributes of the corresponding anchor element, together with the text content of the anchor, were concatenated and fed into a Naive Bayes classifier. The same set of features, extracted from the parent elements of the link, was fed into a second Naive Bayes classifier.
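The anchor-based classifier can be illustrated with a minimal sketch. It is not the author's implementation: the tokenizer, the Laplace smoothing, the class names `AnchorCollector` and `NaiveBayes`, and the toy HTML with its labels are all assumptions made for illustration. Features from parent elements could be collected in the same way and passed to a second instance of the classifier.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects one feature string per <a> element: class + id + text."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished feature strings, in document order
        self._current = None   # parts of the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            self._current = [a.get("class") or "", a.get("id") or ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(" ".join(self._current))
            self._current = None


def tokenize(text):
    # Assumed tokenizer: runs of digits, letters (any script), or symbols.
    return re.findall(r"\d+|[^\W\d_]+|[^\w\s]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(tokenize(doc))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict_proba(self, doc):
        """Returns P(label | doc) for every label seen in training."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            # "+ 1" reserves probability mass for tokens never seen in training
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(n / total)
            for tok in tokenize(doc):
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


# Hypothetical page fragment and hand-assigned labels.
html = """
<div class="pager">
  <a class="prev" href="/story?p=1">Back</a>
  <a class="next-page" id="pg-next" href="/story?p=3">Next</a>
</div>
<a class="nav" href="/sports">Sports</a>
"""
collector = AnchorCollector()
collector.feed(html)
model = NaiveBayes().fit(collector.links, ["other", "next", "other"])
probs = model.predict_proba("next-page Next")
```

Because the class prior enters the score directly, a real data set with the imbalance described above would likely need many more negative examples than this toy fragment contains.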

Moreover, the relative position of a link on the news page (calculated by means of a heuristic) was used to train a regression model. Additional features, such as the structure of the href attribute of an anchor and the length of its text content, were also integrated. The similarity between the content of the base page and that of the target page was intentionally ignored, because computing this feature requires network access, which is not always available.
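The abstract does not spell out the position heuristic or the exact href features, so the following sketch only shows what such features might look like. The functions `relative_position` and `href_features`, the feature names, and the use of the link index as a stand-in for position are assumptions. Note that every feature is computed from the base page alone, without fetching the target page.

```python
import re
from urllib.parse import urlparse, parse_qs


def relative_position(link_index, total_links):
    """Position of a link among all links on the page, scaled to [0, 1].

    Assumed heuristic: document order of the link, not its rendered position.
    """
    if total_links <= 1:
        return 0.0
    return link_index / (total_links - 1)


def href_features(href, text):
    """Numeric features describing the href structure and the anchor text."""
    parsed = urlparse(href)
    query = parse_qs(parsed.query)
    last_segment = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return {
        # a query parameter whose value is a plain number, e.g. ?page=2
        "numeric_query_param": float(any(
            re.fullmatch(r"\d+", v) for values in query.values() for v in values)),
        # a path that ends in a number, e.g. /story/2
        "numeric_last_segment": float(bool(re.fullmatch(r"\d+", last_segment))),
        # number of non-empty path segments
        "path_depth": float(len([s for s in parsed.path.split("/") if s])),
        # length of the visible anchor text without surrounding whitespace
        "text_length": float(len(text.strip())),
    }


# Hypothetical links for illustration.
query_style = href_features("/story?page=2", "Next")
path_style = href_features("/news/article/3/", " 3 ")
```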

Because several learning algorithms are used, the final binary decision has to be made by combining the results of the individually constructed models. For this we use a stacking technique, in which we train a learning algorithm to combine the predictions of the constructed models.
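Stacking can be sketched as follows. The abstract does not name the combining algorithm, so the logistic regression meta-learner used here is an assumption, as are the function names, the hyperparameters, and the toy scores. Each row holds the outputs of three hypothetical base models for one link, ideally computed on data the base models were not trained on.

```python
import math


def sigmoid(z):
    # numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, row):
    """Probability that the link points to the next page."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def train_meta_learner(base_scores, labels, lr=0.5, epochs=2000):
    """Fits logistic regression on the base models' outputs.

    base_scores: one row per link, one column per base model.
    labels: 1 if the link points to the next page, else 0.
    """
    n_models = len(base_scores[0])
    n = len(labels)
    weights = [0.0] * n_models
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for row, y in zip(base_scores, labels):
            error = predict(weights, bias, row) - y
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        # full-batch gradient descent step on the mean log loss
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy held-out scores (invented): anchor model, parent model, position model.
held_out_scores = [
    [0.9, 0.8, 0.7],
    [0.8, 0.9, 0.6],
    [0.2, 0.1, 0.3],
    [0.1, 0.3, 0.2],
    [0.3, 0.2, 0.4],
    [0.2, 0.2, 0.1],
]
held_out_labels = [1, 1, 0, 0, 0, 0]
weights, bias = train_meta_learner(held_out_scores, held_out_labels)
is_next_page = predict(weights, bias, [0.85, 0.85, 0.65]) >= 0.5
```

The final binary decision is the thresholded output of the meta-learner; the 0.5 threshold is likewise an assumption and could be tuned for an imbalanced data set.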

Our first experimental results show very high precision and recall values (≥0.9) for both labels under analysis.
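On an imbalanced data set, reporting precision and recall separately for each label is more informative than overall accuracy. The sketch below shows how the two measures are computed per label. The function name `precision_recall`, the label names, and the toy predictions are invented for illustration and do not reproduce the reported results.

```python
def precision_recall(true_labels, predicted_labels, label):
    """Precision and recall for one label, treating it as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy annotations and predictions, not data from the paper.
true_labels = ["next", "other", "other", "other", "next", "other"]
predicted = ["next", "other", "next", "other", "next", "other"]

next_precision, next_recall = precision_recall(true_labels, predicted, "next")
other_precision, other_recall = precision_recall(true_labels, predicted, "other")
```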

Copyright © 2015 by the paper's authors. Copying permitted only for private and academic purposes. In: R. Bergmann, S. Görg, G. Müller (Eds.): Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015, published at http://ceur-ws.org
