
Detection, Representation, and Exploitation of Events in the Semantic Web

Workshop in conjunction with the

11th International Semantic Web Conference 2012, Boston, Massachusetts, USA, 12 November 2012

Edited by:

Marieke van Erp, Laura Hollink, Willem Robert van Hage, Raphaël Troncy, David A. Shamma


In recent years, researchers in several communities involved in aspects of the web have begun to realise the potential benefits of assigning an important role to events in the representation and organisation of knowledge and media – benefits which can be compared to those of representing entities such as persons or locations instead of just dealing with more superficial objects such as proper names and geographical coordinates. While a good deal of relevant research – for example, on the modelling of events – has been done in the semantic web community, much complementary research has been done in other, partially overlapping communities, such as those involved in multimedia processing and information retrieval.

However, there is a shift in semantics in multimedia research, one that moves away from content semantics towards the conversation semantics contained in social media. With respect to events and information, what happens in an event becomes secondary to how people react to it and what they talk about.

The attendance at DeRiVE 2011 showed that there is great interest from many different communities in the role of events.

The goal of DeRiVE 2012 is to build on the results from DeRiVE 2011 and to strengthen the participation of the semantic web community in the recent surge of research on the use of events as a key concept for representing knowledge and organising and structuring media on the web.

The workshop invited contributions to three central questions, with the goal of formulating answers that advance and reflect the current state of understanding. Each submission was expected to address at least two of these questions explicitly, if possible including a system demonstration. This year, we also specifically focused on event and conversation semantics in multimedia and social media.

The questions we aim to address are the following:

Question 1: How can events be detected and extracted for the semantic web?

• How can events be recognised in particular types of material on the web, such as calendars of public events, social networks, microblogging sites, semantic wikis, and regular web pages?

• How can events be summarised, segmented and described using social media?

• How can the quality and veracity of the events mentioned in noisy microblogging sites such as Twitter be verified?

• How can a system recognise a complex event that comprises separately recognisable sub-events?


Question 2: How can events be modelled and represented in the semantic web?

• How can we improve the interoperability of the various event vocabularies such as EVENT, LODE, SEM, or F to name a few?

• How widely deployed is the schema.org Event class on the web?

• To what extent can the many different event infoboxes of Wikipedia be reconciled for Wikidata?

• What are the requirements for event representations for qualitatively different types of events (e.g., historical events such as wars; cultural events such as upcoming concerts; personal events such as family vacations)?

• How can aspects of existing event representations developed in other communities be adapted to the needs of the semantic web?

• To what extent can/should a unified event model be employed for such different types of events?

• How do social contexts (Facebook, Twitter, etc.) change the implicit content semantics?

Question 3: What is the relationship between events, data, and applications?

• How can events be represented in a way to support conversation semantics, search, or enhanced browsing?

• How do tools for event annotation and consumption alter or change the content semantics of the event itself?

• How can we improve existing methods for visualising event representations and enabling users to interact with them in semantic web user interfaces?

• What are the requirements for event detection, representation, and systems creation implicitly or explicitly defined by these three questions?

Contributions of the Workshop Papers

In each of the seven accepted papers for DeRiVE 2012, two of the workshop topics are addressed. The first, fourth and fifth contributions to be presented are Automatic Classification and Relationship Extraction for Multi-Lingual and Multi-Granular Events from Wikipedia by Daniel Hienert, Dennis Wegener and Heiko Paulheim, Harnessing Disagreement for Event Semantics by Lora Aroyo and Chris Welty, and Using Syntactic Dependencies and WordNet Classes for Noun Event Recognition by Yoonjae Jeong and Sung-Hyon Myaeng.


The second and third contributions, Hyperlocal Event Extraction of Future Events by Tobias Arrskog, Peter Exner, Håkan Jonsson, Peter Norlander, and Pierre Nugues, and Automatic Extraction of Soccer Game Events from Twitter by Guido van Oorschot, Marieke van Erp and Chris Dijkshoorn, primarily focus on the extraction of events from real-world data, but also explore how wide deployment of their techniques would alter current methods of information processing around events.

The focus on detection in a majority of the submissions shows that this topic still deserves much attention, but the fact that a significant amount of (semi-)structured event data is already available, and that the results of event detection are reaching acceptable levels, has opened up interesting avenues for starting to use event data in real-world settings. This is showcased by the sixth and seventh contributions accepted for presentation, Bringing Parliamentary Debates to the Semantic Web by Damir Juric, Laura Hollink and Geert-Jan Houben, and Making Sense of the Arab Revolution and Occupy: Visual Analytics to Understand Events by Thomas Ploeger, Bibiana Armenta, Lora Aroyo, Frank de Bakker and Iina Hellsten. These contributions show what issues are encountered in working with event-based data and how these are being addressed by various (inter)disciplinary methods.

We hope that in compiling the programme and proceedings for DeRiVE 2012 we have succeeded in presenting various perspectives and discussion points on the problems around the detection, representation and exploitation of events, and that the workshop has brought us another step closer to understanding events and their uses.

September 2012

Marieke van Erp, VU University Amsterdam
Laura Hollink, VU University Amsterdam
Willem Robert van Hage, VU University Amsterdam
Raphaël Troncy, EURECOM
David A. Shamma, Yahoo! Research

Programme Committee

The following colleagues kindly served on the workshop's programme committee. Their joint expertise covers all of the questions addressed in the workshop, and they reflect the range of relevant scientific communities.

• Jans Aasman, Franz, Inc.

• Klaus Berberich, Max-Planck Institute for Informatics


• Diana Maynard, University of Sheffield

• Vasileios Mezaris, CERTH/ITI

• Yves Raimond, BBC

• Matthew Rowe, Knowledge Media Institute

• Ansgar Scherp, University of Koblenz-Landau

• Nicu Sebe, University of Trento

• Ryan Shaw, University of North Carolina

• Thomas Steiner, Google

• Denis Teyssou, AFP

• Sarah Vieweg, University of Colorado Boulder


Contents

Event Detection

Automatic Classification and Relationship Extraction for Multi-Lingual and Multi-Granular Events from Wikipedia
Daniel Hienert, Dennis Wegener and Heiko Paulheim

Hyperlocal Event Extraction of Future Events
Tobias Arrskog, Peter Exner, Håkan Jonsson, Peter Norlander and Pierre Nugues

Automatic Extraction of Soccer Game Events from Twitter
Guido van Oorschot, Marieke van Erp and Chris Dijkshoorn

Harnessing Disagreement for Event Semantics
Chris Welty and Lora Aroyo

Event Representation and Visualisation

Using Syntactic Dependencies and WordNet Classes for Noun Event Recognition
Yoonjae Jeong and Sung-Hyon Myaeng

Bringing Parliamentary Debates to the Semantic Web
Damir Juric, Laura Hollink and Geert-Jan Houben

Making Sense of the Arab Revolution and Occupy: Visual Analytics to Understand Events
Thomas Ploeger, Bibiana Armenta, Lora Aroyo, Frank de Bakker and Iina Hellsten


Automatic Classification and Relationship Extraction for Multi-Lingual and Multi-Granular Events from Wikipedia

Daniel Hienert1, Dennis Wegener1 and Heiko Paulheim2

1 GESIS – Leibniz Institute for the Social Sciences, Unter Sachsenhausen 6-8, 50667 Cologne, Germany

{daniel.hienert, dennis.wegener}@gesis.org

2 Technische Universität Darmstadt, Knowledge Engineering Group, Hochschulstraße 10, 64283 Darmstadt, Germany

paulheim@ke.tu-darmstadt.de

Abstract. Wikipedia is a rich data source for knowledge from all domains. As part of this knowledge, historical and daily events (news) are collected for different languages on special pages and in event portals. As only a small amount of events is available in structured form in DBpedia, we extract these events with a rule-based approach from Wikipedia pages. In this paper we focus on three aspects: (1) extending our prior method for extracting events for a daily granularity, (2) the automatic classification of events and (3) finding relationships between events. As a result, we have extracted a data set of about 170,000 events covering different languages and granularities. On the basis of one language set, we have automatically built categories for about 70% of the events of another language set. For nearly every event, we have been able to find related events.

Keywords: Historical Events, News, Wikipedia, DBpedia

1 Introduction

Wikipedia is an extensive resource for different types of events, like historical events or news, that are user-contributed and quality-proven. Although there is plenty of information on historical events in Wikipedia, only a small fraction of these events is available in a structured form in DBpedia. In prior work we have focused on extracting and publishing these events for use in the semantic web and other applications [6]. In this paper, we focus on how the dataset can be enriched and its quality further improved. We address this question with two approaches: finding categories for events and extracting relationships between events. These features can later be used in end-user applications to list related events, browse between events or filter events from the same category.

The remainder of this paper is as follows: Section 2 presents related work. In Section 3 we describe the extraction of events from Wikipedia (Questions 1, 2 and 3). In Section 4 we present an approach on how events can be automatically classified with categories (Question 1). In Section 5 we show how relationships between events from different languages and granularities can be found (Question 1).

2 Related Work

There is a range of systems specialized for the extraction of events and temporal relations from free text. The TARSQI toolkit [16] can detect events, times and their temporal relations by temporal expressions in news articles. HeidelTime [14] is a rule-based system for the extraction and normalization of temporal expressions using mainly regular expressions. The TIE system [9] is an information extraction system that extracts facts from text with as much temporal information as possible and bounding start and end times.

Some work has been done for the extraction of events from Wikipedia articles with machine learning or rule-based approaches and the presentation for the end user in user interfaces with timelines and maps. The approach of Bhole [2] for example first classifies Wikipedia articles as persons, places or organizations on the basis of Support Vector Machines (SVM). Then text mining is used to extract links and event information for these entities. Entities and their events can be shown on a timeline. In another system [3] major events are extracted and classified for a historical Wikipedia article and shown in a user interface with a timeline, map for event locations and named entities for each event.

Other work concentrates on the extension of knowledge bases like DBpedia [1] or YAGO [15] with temporal facts. Exner and Nugues [4] have extracted events based on semantic parsing from Wikipedia text and converted them into the LODE model.

They applied their system to 10% of the English Wikipedia and extracted 27,500 events with links to external resources like DBpedia and GeoNames. Since facts in knowledge bases evolve over time, the system T-YAGO [17] extends the knowledge base YAGO with temporal facts, so that they can be queried with a SPARQL-style language. As a subsequent technology, Kuzey & Weikum [8] presented a complete information extraction framework on the basis of T-YAGO that extracts more than one million temporal facts from Wikipedia resources such as semi-structured data (infoboxes, categories, lists and article titles) and the free text of Wikipedia articles, with a precision over 90% for semi-structured and 70% for full-text extraction. Alternatively, the YAGO2 system [7] extends the YAGO knowledge base with temporal and spatial components. This information is extracted from infoboxes and other resources like GeoNames.

There is a collection of ontologies for the modeling of events in RDF, like EVENT1, LODE [13], SEM [5], EventsML2 and F [12]; a comparison can be found in [5].

However, most related work in this field is about the extraction of events from free text or knowledge bases like Wikipedia or YAGO and the enrichment of entities from text or knowledge bases with temporal information. Not much work has been done on the further enrichment of event datasets, such as adding relations or additional information like categorizations.

3 Events from Wikipedia

Wikipedia is a rich data source for events of different topics, languages and granularities. Most research focuses on the extraction of events from the full text of Wikipedia articles and on relating them to the appropriate entities. Major historical events have their own article, or events are collected in articles on a special topic. Events are also collected in time units of different granularity (i.e. years or months) available for different languages. These articles contain lists of events whose structure is relatively stable. In prior work we have focused on the extraction of events from year-based articles, which include information on individual years for different languages [6]. Table 1 gives an overview of the extracted events for different languages and their extraction quotients. The number of possible events for each language is based on the assumption that every event line in the Wiki markup starts with an enumeration sign. The extracted dataset has several unique characteristics: (1) it has a wide temporal coverage from 300 BC to today, (2) it is available for a lot of different languages, (3) different granularities (year or month) are available, (4) Wikipedia users have already chosen which events are important for different granularities, (5) events already contain links to entities, (6) events have categorizations or can be enriched with categorizations and relationships among each other.

Table 1. Number of extracted events per language/granularity and the extraction quotients

Language/Granularity    Possible Events    Extracted Events    Extraction Quotient

German/Year 36,713 36,349 99.01%

English/Year 39,739 34,938 87.92%

Spanish/Year 20,548 19,697 95.86%

Romanian/Year 13,991 10,633 76.00%

Italian/Year 14,513 10,339 71.24%

Portuguese/Year 8,219 7,395 89.97%

Catalan/Year 7,759 6,754 87.05%

Turkish/Year 3,596 3,327 92.52%

Indonesian/Year 2,406 1,963 81.59%

English/Month 38,433 35,633 92.71%

German/Month 11,660 11,474 98.40%

Total 178,502

3.1 Extraction, processing and provision

Figure 1 shows the overall extraction and processing pipeline. Our software crawls Wikipedia articles for different granularities (years and months) and different languages. For year-based articles, the German, English, Spanish, Romanian, Italian, Portuguese, Catalan, Turkish and Indonesian versions, with a temporal coverage from 300 BC to today, are crawled. For daily events, German and English articles from the year 2000 to today are crawled. The extraction comprises the identification of the event section in the article, the identification of events in the event section and the separation of date, description and links for each event. Events can be further described by categories that result from headings in the markup. Events and links are then stored in a MySQL database.
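For illustration, the following minimal sketch shows how a single event line from a year article's wiki markup could be split into a date, a description and entity links. The regular expressions, the helper name parse_event_line and the sample line are ours and only illustrate the idea; they are not the authors' actual extraction rules.

```python
import re

# Illustrative example of an event line as it may appear in the wiki markup of a
# year article: an enumeration sign, a date link, a dash, and free text with links.
line = "* [[January 14]] – [[Arab Spring]]: President [[Zine El Abidine Ben Ali]] flees to [[Saudi Arabia]]."

def parse_event_line(line):
    """Split one markup line into (date, description, links). Hypothetical rules."""
    if not line.lstrip().startswith("*"):      # the paper's assumption: events are enumerated
        return None
    body = line.lstrip("* ").strip()
    # The date is assumed to be the first wiki link, separated from the text by a dash.
    m = re.match(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s*[–-]\s*(.*)", body)
    if not m:
        return None
    date, description = m.group(1), m.group(2)
    # Collect all remaining internal links as candidate entities.
    links = re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", description)
    return date, description, links

print(parse_event_line(line))
```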

The resulting data set is then further processed: for the automatic classification see Section 4, for the extraction of relationships between events see Section 5. We also crawl the Wikipedia API to add an individual image to each event for use in the timeline.

We provide access to the extracted events via the Web-API, SPARQL endpoint, Linked Data Interface and in a timeline. The Web-API3 gives lightweight and fast access to the events. Events can be queried by several URL parameters like begin_date, end_date, lang, query, format, html, links, limit, order, category, granularity and related. Users can query for keywords or time periods, and results are returned in XML or JSON format. The Linked Data Interface4 holds a representation of the yearly English dataset in the LODE ontology [13]. Each event contains links to DBpedia entities. Users can query the dataset via the SPARQL endpoint (http://lod.gesis.org/historicalevents/sparql). Additionally, yearly events for the English, German and Italian dataset are shown in a Flash timeline (http://www.vizgr.org/historical-events/timeline/) with added images and links to Wikipedia articles. Users can search for years, scroll and scan the events and navigate to Wikipedia articles.
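As an illustration of how such a Web-API could be called from a client, the sketch below builds a query from the URL parameters named above. The base URL, the date format and the response handling are assumptions and may differ from the deployed service.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical call using parameters named in the text (begin_date, end_date,
# query, lang, format, limit); the base URL is an assumption.
BASE_URL = "http://www.vizgr.org/historical-events/search.php"

params = {
    "begin_date": "20110101",
    "end_date":   "20111231",
    "query":      "Arab Spring",
    "lang":       "en",
    "format":     "json",
    "limit":      "10",
}

url = BASE_URL + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as response:
    events = json.loads(response.read().decode("utf-8"))
print(events)
```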

Fig. 1. Processing, extraction and provision pipeline.

3.2 Extraction of daily events

In addition to the extraction of yearly events presented in [6], we have extracted daily events from the German and English Wikipedia versions. The German version provides events on a daily basis in month articles (i.e. http://de.wikipedia.org/wiki/Juni_2011) from the year 2000 to today. The English structure is more complicated and daily events are distributed over three different site structures: (1) most daily events are collected in the Portal:Current events (http://en.wikipedia.org/wiki/Portal:Current_events), (2) some events are collected in the Portal:Events (before July 2006) and (3) other events are collected in month collections similar to the German version. English daily events are also available for the years 2000 to today. First, we have extended the extraction software to query these site structures. Then, regular expressions for the identification of the event section and for the individual events have been added. The extraction algorithm had to be slightly modified to handle new structures specific to daily events. As a result, the software could extract 35,633 English daily events (extraction quotient: 92.17%) and 11,747 German daily events (extraction quotient: 98.40%).

3.3 Analyzing the data set

The overall data set has been analyzed as a prerequisite to the automatic classification and the search for relationships between events. The number of extracted events and extraction quotients for different languages and granularity are shown in Table 1. The categories in German events are created from subheadings on the corresponding Wikipedia page. Yearly German events are categorized with one or two categories by headings of rank 2 or 3, which can be used for the automatic classification of events.

Table 2 shows the ten most used categorizations for German events. In English or other languages categorizations are rarely used. The number of links and entities per event can be seen in Table 3. In the German and English dataset most events have between one and four links.

Table 2. Categories (translated) and their counts for yearly German events

Category                      Count
Politics and world events     18,887
Culture                        4,135
Science and technology         3,096
Religion                       2,180
Economy                        2,011
Sports                         1,434
Disasters                      1,351
Politics                         613
Culture and Society              309
Society                          286

Table 3. Distribution of links to entities within the German and English yearly dataset

Count of entities          English    German
No entity                    6,371     1,489
One entity                   5,773     7,815
Two entities                10,143     9,969
Three entities               8,405     8,086
Four entities                4,499     4,606
Five entities                2,376     2,457
Six entities                 1,271     1,234
Seven or more entities         901       693

4 Automatic Classification of Events

To provide a useful semantic description of events, it is necessary to have types attached to these events. Possible types could be "Political Event", "Sports Event", etc. In the crawled datasets, some events already have types extracted from the Wikipedia pages, while others do not. Therefore, we use machine learning to add the types where they are not present.

The datasets we have crawled already contain links to Wikipedia articles. In order to generate useful machine learning features, we have transformed these links to DBpedia entities. For inferring event types, we have enhanced our datasets consisting of events and their descriptions by more features: the direct types (rdf:type) and the categories of the linked DBpedia entities. For enhancing the datasets, we have used our framework FeGeLOD [11], which adds such machine learning features from Linked Open Data to datasets in an automatic fashion. The rationale of adding those features is that the type of an event can be inferred from the types of the entities involved in the event. For example, if an entity of type SoccerPlayer is involved in an event, it is likely that the event is a sports event.

As discussed above, the majority of events in our datasets comprises between one and four links to entities. Therefore, we have concentrated on such events in our analysis. We have conducted two experiments: first, we have inferred the event types on events from the German dataset, using cross validation for evaluation. Second, we have learned models on the German datasets and used these models to classify events from the English dataset, where types are not present. In the second experiment, we have evaluated the results manually on random subsets of the English dataset.

Figure 2 depicts the classification accuracy achieved in the first experiment, using 10-fold cross validation on the German dataset. We have used four random subsets of 1,000 events, which we have processed by adding features and classifying them with three commonly used machine learning algorithms: Naïve Bayes, Ripper (in the JRip implementation), and Support Vector Machines (using the Weka SMO implementation, treating the multi-class problem by using 1 vs. 1 classification with voting). As a baseline, we have predicted the largest class of the sample. It can be observed that the categories of related entities are more discriminative than the direct types. The best results (around 80% accuracy) are achieved with Support Vector Machines.
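A minimal sketch of the idea behind this feature construction follows, assuming each event is reduced to a binary bag of the types and categories of its linked entities and classified with a linear SVM (scikit-learn's LinearSVC standing in for the Weka SMO classifier used in the paper). The feature names and toy events are invented for illustration; this is not the FeGeLOD pipeline itself.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy events: each event is reduced to the set of DBpedia types/categories
# of the entities it links to (invented feature values, not real data).
events = [
    ({"type:SoccerPlayer": 1, "cat:Football_in_Germany": 1}, "Sports"),
    ({"type:Politician": 1, "type:Country": 1},              "Politics and world events"),
    ({"type:Museum": 1, "cat:Art_exhibitions": 1},           "Culture"),
    ({"type:SoccerClub": 1},                                  "Sports"),
    ({"type:Election": 1, "type:Politician": 1},              "Politics and world events"),
    ({"type:Opera": 1},                                       "Culture"),
]

features, labels = zip(*events)
vec = DictVectorizer()
X = vec.fit_transform(features)          # sparse binary feature matrix

# A linear SVM stands in for the Weka SMO classifier used in the paper.
clf = LinearSVC().fit(X, labels)

# Classify an unseen event that links to a soccer player.
print(clf.predict(vec.transform([{"type:SoccerPlayer": 1}])))
```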


Fig. 2. Classification accuracy on the German dataset, using ten-fold cross validation for evaluation

Since Support Vector Machines have yielded the best results in the first experiment, we have trained four SVMs for the second experiment, one for each number of entities per event. We have then used these models to classify four subsets of the English dataset, consisting of 50 events each. The results of that classification have been evaluated manually.

The results are shown in Figure 3. First, we have tested the best performing combination of the first experiment, using both categories and direct types of the related entities. Since the results were not satisfying, we have conducted a second evaluation using only direct types, which yielded better results. The most likely reason why categories work less well as features than classes is that the German and the English DBpedia use the same set of classes (i.e., DBpedia and YAGO ontology classes, among others), but different categories. In our experiments, we have observed that only a subset of the categories used in the German DBpedia have a corresponding category in the English DBpedia. Thus, categories, despite their discriminative power in a single-language scenario, are less suitable for training cross-language models.


Fig. 3. Classification accuracy achieved on English dataset, using Support Vector Machines trained on the German dataset

In summary, we have been able to achieve a classification accuracy of around 70% for the English dataset, using a model trained on the German dataset. The results of both experiments show that machine learning with features from DBpedia is a feasible way to achieve an automatic classification of the extracted events.

5 Relationships Between Events

With a dataset of events for different languages and granularities, it is interesting to know which relations exist between these events. To find relationships, different features of the events could be used: (1) time, (2) categories, (3) topic/content or (4) links. Time as a single criterion is by far not enough. The category is too simplistic and there are only a few categories. Relationships based on the topic/content of the event are not easy to find, as the events only include micro-text with a few words or sentences. Taking links as a criterion, we have to consider which links to take and how to use them for extracting relationships between events.

As described in Section 3.3, we have extracted 178,502 events in total. From these, 172,189 events include links. As a preprocessing step, we transform every non-English link to the English equivalent by querying the inter-language link from the Wikipedia API. As a result, every event from different languages contains links to English Wikipedia/DBpedia entities.
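A sketch of this normalization step, using the standard MediaWiki API (action=query, prop=langlinks) to look up the English inter-language link of a non-English title, might look as follows; the function name and error handling are illustrative.

```python
import json
import urllib.parse
import urllib.request

def english_equivalent(title, lang):
    """Map a non-English Wikipedia title to its English inter-language link."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": "en",
        "format": "json",
    })
    url = "https://%s.wikipedia.org/w/api.php?%s" % (lang, params)
    req = urllib.request.Request(url, headers={"User-Agent": "event-extraction-example/0.1"})
    with urllib.request.urlopen(req) as response:
        data = json.loads(response.read().decode("utf-8"))
    # The response maps page ids to pages; each page may carry a langlinks list.
    for page in data["query"]["pages"].values():
        for link in page.get("langlinks", []):
            return link["*"]
    return None

# e.g. the German article on the Arab Spring maps to its English counterpart
print(english_equivalent("Arabischer Frühling", "de"))
```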

In the following, we analyze this set of events. As a first step, we vary the number of links that two events have to share and count the events that share this number of links with at least one other event (see Table 4). In detail, we consider two events to share a link if they contain a link to the same DBpedia entity. From our analysis it can be seen that 95.8% of the events (that include links) share at least one link with at least one other event. As we are dealing with a multi-lingual set of events, it is interesting to know how many events share a link with at least one event of a different language. In our set of events, 155,769 events share at least one link with at least one other event of a different language, which is 90.5% of the events in the set. 75.7% of the events include a link to another granularity, i.e. from year to month or vice versa.

Table 4. Analysis of the number of shared links between events

# shared links    # events that share this number of links with at least one other event    in % (total events = 172,189)
1                 165,014                                                                    95.8%
2                 100,401                                                                    58.3%
3                  35,456                                                                    20.6%
4                   9,900                                                                     5.7%

So far, we have looked for events that share one link in the overall database. In the following, we vary the time interval in which we search for these events (see Table 5).

In detail, if we look at an event at time x, an interval of one month means that we search for events in the time interval [x-15 days : x + 15 days]. For the time-based analysis, we can only consider events where the date includes information on the day (and not only on the month and year). In our set these are 109,510 events.

Table 5. Analysis of the number of events that share links within a given time interval

Time interval                        Events sharing at least one link    in % (events with exact date = 109,510)
Overall                              105,042                             95.9%
Year [x-182 days : x+182 days]        90,193                             82.4%
Month [x-15 days : x+15 days]         74,499                             68.0%
Week [x-3 days : x+3 days]            61,246                             55.9%

Based on this analysis, we have been able to define the relatedness between two events in terms of the number of links they share and the time interval between them. We have found that in our dataset a large part of the events has at least one link in common with another event (95.8%) within a time interval of a year (82.4%), and we can also find links to other languages (90.5%) and granularities (75.7%). We have implemented the relatedness feature in the Web-API. To compute related events for an individual event, we query for events that have at least one link in common within a time interval of plus/minus ten years and then sort the results first by number of shared links and then by time distance to the original event.
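A minimal in-memory sketch of this ranking, assuming each event carries a date and a set of linked entities, could look as follows; the toy data and helper name are ours.

```python
from datetime import date

# Each event: (id, date, set of linked DBpedia entities). Toy data.
events = [
    (1, date(2011, 1, 14), {"Arab_Spring", "Zine_El_Abidine_Ben_Ali", "Tunisia"}),
    (2, date(2011, 1, 13), {"Tunisia", "Zine_El_Abidine_Ben_Ali"}),
    (3, date(2011, 2, 11), {"Arab_Spring", "Hosni_Mubarak", "Egypt"}),
    (4, date(1956, 3, 20), {"Tunisia", "France"}),
]

def related(event, candidates, max_years=10):
    """Rank candidates by shared links (descending), then by time distance (ascending)."""
    eid, edate, elinks = event
    scored = []
    for cid, cdate, clinks in candidates:
        if cid == eid:
            continue
        distance = abs((cdate - edate).days)
        if distance > max_years * 365:
            continue                      # outside the plus/minus ten-year window
        shared = len(elinks & clinks)
        if shared >= 1:
            scored.append((cid, shared, distance))
    return sorted(scored, key=lambda t: (-t[1], t[2]))

print(related(events[0], events))   # -> [(2, 2, 1), (3, 1, 28)]
```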

For example, the query for Arab Spring5 finds eleven events from the yearly English dataset and related events from other languages and granularities. For example, the event of 2011/01/14: “Arab Spring: The Tunisian government falls after a month of increasingly violent protests President Zine El Abidine Ben Ali flees to Saudi Arabia after 23 years in power.” lists equivalent events from different languages, i.e. Italian: “In Tunisia, dopo violente proteste…”, Spanish: “en Túnez el presidente Zine El Abidine Ben…”, German: “Tunis/Tunesien: Nach den schweren Unruhen der Vortage verhängt Präsident Zine el-Abidine…” and from a month/news view: “Thousands of people protest across the country demanding the resignation of President Zine El Abidine Ben Ali. [Link] (BBC)”

As a final step we have compiled an evaluation set with 100 events and 5 related events for each and analyzed them manually. We have found that the perceived relatedness between two events (1) depends on the time interval between events and (2) depends on the count (1 vs. 4), type (general types like Consul vs. finer types like Julius Caesar) and position (at the beginning or the end of the description) of shared links.

In summary, we have been able to find a related event for nearly every event in the dataset, also for events from other languages and granularities.

6 Conclusion

We have extracted an event dataset from Wikipedia with about 170,000 events for different languages and granularities. A part of these events includes categories which can be used to automatically build categories for about 70% of another language set on the basis of links to other Wikipedia/DBpedia entities. The same linking base is used together with a time interval to extract related events for nearly every event, also for different languages and granularities.

At the moment, we only use Wikipedia/DBpedia links that are already included in the events' descriptive texts. However, those links are not always complete or available in other data sets. Using automatic tools such as DBpedia Spotlight [10] would help increase the result quality and allow us to process text fragments without hyperlinks as well.

At the end of Section 5 we have shown that the perceived quality of related events also depends on the abstractness of links. The analysis of how the abstractness of links can be modeled and used as an additional feature for the ranking of related events remains for future work.

References

1. Auer, S. et al.: DBpedia: A Nucleus for a Web of Open Data. In: 6th International Semantic Web Conference, Busan, Korea, pp. 11–15. Springer (2007).
2. Bhole, A. et al.: Extracting Named Entities and Relating Them over Time Based on Wikipedia. Informatica (Slovenia) 31(4), 463–468 (2007).
3. Chasin, R.: Event and Temporal Information Extraction towards Timelines of Wikipedia Articles. Simile, 1–9 (2010).
4. Exner, P., Nugues, P.: Using Semantic Role Labeling to Extract Events from Wikipedia. In: Proceedings of the Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011), in conjunction with the 10th International Semantic Web Conference (ISWC 2011), Bonn (2011).
5. van Hage, W.R. et al.: Design and Use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web 9(2) (2011).
6. Hienert, D., Luciano, F.: Extraction of Historical Events from Wikipedia. In: Proceedings of the First International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data (KNOW@LOD 2012), Heraklion, Greece (2012).
7. Hoffart, J. et al.: YAGO2: Exploring and Querying World Knowledge in Time, Space, Context, and Many Languages. In: Proceedings of the 20th International Conference Companion on World Wide Web, pp. 229–232. ACM, New York, NY, USA (2011).
8. Kuzey, E., Weikum, G.: Extraction of Temporal Facts and Events from Wikipedia. In: Proceedings of the 2nd Temporal Web Analytics Workshop, pp. 25–32. ACM, New York, NY, USA (2012).
9. Ling, X., Weld, D.S.: Temporal Information Extraction. In: Fox, M., Poole, D. (eds.) AAAI. AAAI Press (2010).
10. Mendes, P. et al.: DBpedia Spotlight: Shedding Light on the Web of Documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics) (2011).
11. Paulheim, H., Fürnkranz, J.: Unsupervised Generation of Data Mining Features from Linked Open Data. In: International Conference on Web Intelligence, Mining and Semantics (WIMS'12) (2012).
12. Scherp, A. et al.: F – A Model of Events Based on the Foundational Ontology DOLCE+DnS Ultralight. In: Proceedings of the Fifth International Conference on Knowledge Capture, pp. 137–144. ACM, New York, NY, USA (2009).
13. Shaw, R. et al.: LODE: Linking Open Descriptions of Events. In: Proceedings of the 4th Asian Conference on the Semantic Web, pp. 153–167. Springer-Verlag, Berlin, Heidelberg (2009).
14. Strötgen, J., Gertz, M.: HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 321–324. Association for Computational Linguistics, Stroudsburg, PA, USA (2010).
15. Suchanek, F.M. et al.: YAGO: A Core of Semantic Knowledge. In: Proceedings of the 16th International Conference on World Wide Web, pp. 697–706. ACM, New York, NY, USA (2007).
16. Verhagen, M., Pustejovsky, J.: Temporal Processing with the TARSQI Toolkit. In: 22nd International Conference on Computational Linguistics: Demonstration Papers, pp. 189–192. Association for Computational Linguistics, Stroudsburg, PA, USA (2008).
17. Wang, Y. et al.: Timely YAGO: Harvesting, Querying, and Visualizing Temporal Knowledge from Wikipedia. In: Proceedings of the 13th International Conference on Extending Database Technology (EDBT), Lausanne, Switzerland, pp. 697–700 (2010).


Hyperlocal Event Extraction of Future Events

Tobias Arrskog, Peter Exner, Håkan Jonsson, Peter Norlander, and Pierre Nugues

Department of Computer Science, Lund University
Advanced Application Labs, Sony Mobile Communications

{tobias.arrskog,peter.norlander}@gmail.com hakan.jonsson@sonymobile.com {peter.exner,pierre.nugues}@cs.lth.se

Abstract. From metropolitan areas to tiny villages, there is a wide variety of organizers of cultural, business, entertainment, and social events. These organizers publish such information to an equally wide variety of sources. Every source of published events uses its own document structure and provides different sets of information. This raises significant customization issues. This paper explores the possibilities of extracting future events from a wide range of web sources, to determine if the document structure and content can be exploited for time-efficient hyperlocal event scraping. We report on two experimental knowledge-driven, pattern-based programs that scrape events from web pages using both their content and structure.

1 Introduction

There has been considerable work on extracting events from text available on the web; see [1] for a collection of recent works. A variety of techniques have been reported: [2] successfully used data-driven approaches for the extraction of news events, while knowledge-driven approaches have been applied to extract biomedical [3], historical [4], or financial events [5], among others.

Much previous research focuses on using the body text of the document, while some authors also use the document structure. For example, [4] apply semantic role labelling to unstructured Wikipedia text while [6] use both the document structure and body text to extract events from the same source.

The focus of this paper is on extracting future events using the body text of web pages as well as their DOM structure when the content has multiple levels of structure. We use the body text from the web page as it contains essential information, e.g. time, date, and location instances. We also exploit the DOM structure as a source of information. Although HTML embeds some sort of structure, the actual structure is not homogeneous across websites. We report on the problem of extracting event information from a variety of web pages and we describe two systems we implemented and the results we obtained.

(20)

1.1 Properties of Local Events

The events we are interested in are those that typically appear in calendars and listings, such as cultural, entertainment, educational, social, business (exhibitions, conferences), and sport events that the general public may have an interest in.

The end goal of this project is to be able to serve users with information about events that match their current interest and context, e.g. using location-based search, by aggregating these events from hyperlocal sources.

Event aggregators already exist, e.g. Eventful and Upcoming, that collect and publish event information, but they tend to only gather information about major events in cooperation with organizers or publishers. By contrast, we want to extract existing information directly from the publisher.

The main challenge is time-efficient scaling, since there is a great number of hyperlocal organizers and sources as well as variation in the formats and DOM structures of the sources. We may also have to deal with missing, ambiguous, or contradictory information. For example, locations can appear in the title:

Concert – Bruce Springsteen (This time in the new arena),

and contradict the location indicated elsewhere. Another example is a title:

Outdoor dining now every Friday and Saturday

containing date information which narrows or sometimes contradicts the dates indicated elsewhere on the page.

The domain we are interested in deals with future events. This is a very wide area, where only little annotated historical data is available. This makes a statistical approach problematic, at least initially. Instead, we chose a knowledge-driven, pattern-based approach, where we process both the structure of HTML documents and their content. We analyse the content using knowledge of the event domain, e.g. event keywords.

In this paper, we report on the problem of extracting event information from given web pages and we describe two systems we implemented and the results we obtained.

1.2 Applications and Requirements for Event Structures

From the possible properties of an event, we chose to extract the title, date, time, location, event reference (source), and publisher, which answer the when, where, and what questions about an event. These are, however, the most basic attributes, and for a useful application further information could be extracted, including topic, organizer, cost, and target audience.

In this paper, we do not cover the semantic representation of event data, but future research may need to address representing the above attributes in existing event data models.


2 System Architecture

2.1 Designing a Simple Scraper

For each site in the list, we created a unique script. These scripts contained a hand-crafted set of rules to extract the correct information for that specific site.

Expanding the list of sources requires additional hand-crafted scripts, which leads to high costs when scaling to many sources.

In order to limit scaling costs, the scripts need to be simple. A chosen limitation was that the internal structure of the information must be the same across events, so that a small set of rules can extract the information from all the events.

2.2 Designing a Generic Scraper

We investigated if it would be possible to create a generic scraper which could handle all websites without manual labour.

The first step in generically scraping a website is to find all the pages that contain events. This is currently done using domain knowledge, i.e. the system is given only pages which are known to contain events. The possibility of finding pages without manual labour is further discussed in Sect. 5. The system uses six steps to scrape the events from a given web page. Figure 1 shows the system architecture. We implemented the first three steps using the ad-hoc scripts of Sect. 2.1.

[Figure 1 pipeline steps: Classify; Extract default values and domain knowledge; Identify the event list; Identify each specific event within the list; Annotate; Rank and select attributes for each event; Reevaluate selected attributes by looking at the entire event list; Store.]

Fig. 1. The implemented generic scraper. Dashed boxes use manually written, site- dependent scripts.


2.3 Attribute Annotation and Interpretation

The system uses rules to annotate and interpret text. The benefit of a rule-based system is that it can both parse the text and create structured data. As previous work suggests, extracting the time and date of events can be solved through rules. Although more problematic, the system is also able to extract named entities, for example named locations. To do this, the system uses three major rules:

1. Keyword detection preceding a named location, e.g. looking for location: or arena:
2. Keyword detection succeeding a named location, for example a city
3. Structured keyword detection preceding a named location, e.g. looking for location or arena when isolated in a separate structure; as an example, location Boston, which corresponds to "<b>location</b> Boston" using HTML tags.

When the rules above return a named location, we query it against a named location database. Using these rules and a database lookup, we can minimize the false positives.
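A rough sketch of how such rules and a gazetteer lookup could be combined follows; the regular expressions, keyword list and toy gazetteer are illustrative assumptions, not the rules actually used by the system.

```python
import re

# Toy gazetteer standing in for the named-location database used for lookup.
GAZETTEER = {"Boston", "Malmö", "Lund", "Konserthuset"}
CITIES = {"Malmö", "Lund", "Boston"}
KEYWORDS = r"(?:location|arena|venue)"

def candidate_locations(text):
    """Apply the three illustrative rules, then validate against the gazetteer."""
    candidates = set()
    # Rule 1: keyword preceding a named location, e.g. "location: Boston"
    candidates |= set(re.findall(KEYWORDS + r"\s*:?\s*([A-ZÅÄÖ]\w+)", text, re.I))
    # Rule 2: a candidate immediately followed by a known city, e.g. "Konserthuset, Malmö"
    for m in re.finditer(r"([A-ZÅÄÖ]\w+),\s*([A-ZÅÄÖ]\w+)", text):
        if m.group(2) in CITIES:
            candidates.add(m.group(1))
            candidates.add(m.group(2))
    # Rule 3: structured keyword, e.g. "<b>location</b> Boston"
    candidates |= set(re.findall(r"<b>" + KEYWORDS + r"</b>\s*([A-ZÅÄÖ]\w+)", text, re.I))
    # Database lookup to reduce false positives
    return candidates & GAZETTEER

print(candidate_locations("Concert at Konserthuset, Malmö. <b>Location</b> Boston"))
```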

2.4 Attribute Ranking and Selection

The system uses domain knowledge to choose what data to extract:

– The system extracts only one title and chooses the most visually distinguished text it can find, as implied by the DOM structure.

– Dates and times follow a hierarchy of complexity, and the system takes those of highest complexity first. Some sites used a structure where event structures were grouped by date. To avoid false positives with dates in these event structures, the scraper chooses dates between the event structures if fewer than half of the event structures contained dates.

– The extraction of the location for the event was done in the following order (see the sketch below): if the event structure contains a location coordinate, choose it; otherwise use a default location; if the event site has no default location, use the most commonly referred-to city in the event structure.
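A compact sketch of this location fallback order, with hypothetical field names, might look as follows.

```python
from collections import Counter

def select_location(event_struct, site_default=None):
    """Fallback order described above: explicit coordinate, then the site's
    default location, then the most commonly mentioned city. Field names are hypothetical."""
    if event_struct.get("coordinate"):
        return event_struct["coordinate"]
    if site_default:
        return site_default
    cities = event_struct.get("mentioned_cities", [])
    if cities:
        return Counter(cities).most_common(1)[0][0]
    return None

# Event without coordinates on a site without a default location
print(select_location({"mentioned_cities": ["Lund", "Malmö", "Lund"]}))  # -> "Lund"
```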

3 Evaluation

3.1 Scoring

We evaluated the performances of the simple and generic scrapers and we compared them with a scoring defined in Table 1.

Table 1. Criteria for full and partial scoring for the test set.

Full match
Title       Lexicographic distance to correct = 0
Date        Resulting date(s) equal to correct date(s)
Time        Resulting start time equals correct start time (to the minute)
Location    Result within 1000 m of correct

Partial match
Title       Result contains correct title
Date        Full match, or result contains at least one of the correct date(s)
Time        Full match, or result contains at least one of the correct start time(s)
Location    Result within 5000 m of correct

3.2 Training

At the start of the project, we gathered a training set composed of nine different event sites found in the Lund and Malmö area, Sweden. With the help of the training set, we could change the rules or add new ones and easily monitor their overall effect. This concerned the rules of the annotator, the scraper, and the location lookup.

3.3 Evaluation

In order to evaluate the system, we gathered a test set of nine, previously unseen, event web sites. The goal was to extract information about all (max. 30) events.

The tests were conducted in three parts.

1. In the first part, we used the generic scraper (Sect. 2.2);

2. In the second one, we built simple scrapers (Sect. 2.1) for each of the test sites.

3. In the third part, we extracted the events manually.

The results from the first two parts were then compared against the third.

The generic scraper and the simple scrapers were compared on how accurately they extracted the title, date, time, and location of the events. The setup time was also compared for both the generic and simple scrapers.

We built a simple scraper for each site specifically to extract the text containing the title, date, time, and the location. The text strings containing the dates and times were then sent to the same algorithm that the generic scraper uses to parse the date and time. Once the text containing the location is extracted, we use the same location lookup in all the scrapers.

3.4 Bias Between the Training and Test Sets

The sites in the training set were all composed of a list of events where all the necessary information (title, date, time, location) could be found.


Table 2. F1 score for full and partial match on test data for the generic scraper.

              Full                                         Partial
Site          Title   Date    Time    Location  Average    Title   Date    Time    Location  Average
lu            0.0     0.967   0.767   0.433     0.542      0.4     0.967   0.933   0.633     0.733
mah           0.068   1.0     0.0     0.6       0.417      0.915   1.0     1.0     1.0       0.979
babel         0.0     0.818   0.0     1.0       0.830      1.0     0.909   0.818   1.0       0.932
lund.cc       1.0     0.667   1.0     0.652     0.714      1.0     0.967   1.0     0.652     0.905
ollan         0.0     0.857   1.0     1.0       0.75       0.0     0.857   1.0     1.0       0.714
nsf           1.0     1.0     1.0     0.0       0.673      1.0     1.0     1.0     0.286     0.822
malmö.com     1.0     1.0     0       0.691     0.543      1.0     1.0     0       0.963     0.741
burlöv        0.889   0.75    0.333   0.2       0.369      1.0     0.875   0.333   0.2       0.602
dsek          0.0     0.2     0.444   0.833     0.588      1.0     0.2     1.0     0.833     0.758
Average F1    0.440   0.807   0.505   0.601     0.603      0.813   0.864   0.787   0.730     0.799

Table 3. F1 score for full match on test data for the generic scraper without loading the event details page.

        Full                                Partial
Site    Title   Date    Time    Location    Title   Date    Time    Location
lu      1.0     1.0     0.967   N/A         1.0     1.0     0.967   N/A
mah     0.967   0.929   1.0     N/A         0.967   0.929   1.0     N/A
babel   0.0     0.0     N/A     1.0         1.0     0.0     N/A     1.0

Table 4. F1 score for full and partial match on test data for the simple scraper.

              Full                                         Partial
Site          Title   Date    Time    Location  Average    Title   Date    Time    Location  Average
lu            1.0     0.967   0.967   0.267     0.800      1.0     1.0     1.0     0.667     0.917
mah           1.0     1.0     0.0     0.7       0.675      1.0     1.0     1.0     1.0       1.0
babel         0.0     0.7     0.211   1.0       0.478      1.0     0.7     0.632   1.0       0.833
lund.cc       1.0     0.667   1.0     0.622     0.822      1.0     0.967   1.0     0.622     0.897
ollan         0.857   0.667   1.0     1.0       0.881      1.0     0.833   1.0     1.0       0.959
nsf           1.0     1.0     1.0     0.0       0.75       1.0     1.0     1.0     0.0       0.75
malmö.com     1.0     1.0     0.0     0.823     0.706      1.0     1.0     0       0.912     0.728
burlöv        1.0     1.0     0.0     0.0       0.5        1.0     1.0     0.0     0.0       0.5
dsek          0.952   0.706   0.778   1.0       0.859      0.952   0.706   0.889   1.0       0.887
Average F1    0.868   0.856   0.551   0.601     0.719      0.995   0.912   0.725   0.689     0.83

Table 5. Time taken for the setup for the test sites.

Site         Generic   Simple    Manual
lu           23 min    83 min    60 min
mah           7 min    24 min    68 min
babel        11 min    59 min    15 min
lund.cc       9 min    13 min    60 min
ollan         2 min    31 min    13 min
nsf           5 min    24 min    15 min
malmö.com    31 min    63 min    35 min
burlöv       10 min    30 min    22 min
dsek         11 min    23 min    21 min
Average      12 min    39 min    34 min

In the test set, most of the sites had a structure that did not contain all the required information: each event had a separate page, the event details page, with all the information. The information on the event details page was not composed in the typical compact structured form but rather had more body text. Of the nine sites in the test set, three sites (lund.cc, nsf, dsek) did not require an event details page for the necessary information. However, the information on the sites nsf and dsek was structurally more comparable to body text. A concept to handle this, concerning the extraction of the title, is presented in Sect. 4.1.

4 Conclusion

The setup for the generic scraper took on average 12 minutes, compared to creating a simple scraper for each site that took on average 39 minutes (Table 5).

The setup for the generic scraper is thus more than three times faster than creating a simple scraper for each site. This can be compared to pure manual labor, which took on average 34 minutes per site; both scrapers therefore essentially pay for themselves after one pass.

4.1 Title

The generic scraper performs rather poorly on the test set while it shows better results on the training set. This is either due to overfitting on the training set or to a significant mismatch between the training and test sites; Sect. 3.4 analyzes the mistakes and discusses this problem. When the system is used on these pages without loading the event details pages, it does yield better results, as shown in Table 3. The remaining test sites failed because the system relied too much on the structure where it should have analyzed the layout instead, i.e. it chose links when it should have chosen the text that was more visually prominent.

4.2 Date

The simple scraper is on average 5% better at date identification than the generic scraper for both the full and partial matches. Examining the scores for the full match more closely (Tables 2 and 4), the score for the generic scraper is the same or better than the score for the simple scraper for every site except burlöv and dsek. We even observe a complete failure for dsek. We investigated this and discovered that dsek expressed dates relative to the current date, e.g. today, tomorrow. This was not yet implemented, which made the generic scraper pick another strategy for choosing dates; as a result, the correct dates were forfeited.
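A small sketch of how such relative expressions could be resolved against a reference date follows; the token table (including the Swedish idag/imorgon) is an illustrative assumption, not part of the scrapers described here.

```python
from datetime import date, timedelta

RELATIVE = {
    "today": 0, "idag": 0,          # Swedish sites may use "idag"/"imorgon"
    "tomorrow": 1, "imorgon": 1,
}

def resolve_relative_date(token, reference=None):
    """Map a relative date expression to an absolute date (illustrative only)."""
    reference = reference or date.today()
    offset = RELATIVE.get(token.strip().lower())
    return reference + timedelta(days=offset) if offset is not None else None

print(resolve_relative_date("tomorrow", reference=date(2012, 11, 12)))  # 2012-11-13
```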

4.3 Time

The average scores for the time extraction of the generic and the simple scrapers are rather similar. The system does find the correct times but reports many false positives, which, according to the scoring defined in Sect. 3.1, yields only a partial match. The system tends to over-detect times. We programmed it to prefer times coupled with dates over solitary times, but in the test set it was rather common to have times and dates further apart. This makes the system choose all times where it should have chosen a subset. Another pattern was also found: for some sites, the system returned both start and end time separately, which shows that the system lacks rules to bind start and end times together.

4.4 Location

The difference between the simple and generic scrapers is negligible, and the problem of location is less about selection and more about actually finding and understanding the named locations (Tables 2 and 4). The system uses assumed knowledge to fill in what is left out of the events, i.e. a known city, region, or location which it can fall back to or base the search around. Using this assumed knowledge has proved useful for babel, möllan, dsek, lu and mah, and this should hold true for all hyperlocal websites. Even if the system has some basic knowledge about the web page, the location annotation and selection still have problems with disambiguation. This disambiguation problem is partly rooted in the fact that the named locations belong to the domain knowledge of the site.

As an example, a university website might give lecture halls or classrooms as the location of an event. These named locations could share a name with a pub in another city or a scientist, or might simply not exist in any named location database.

4.5 Final Words

At the end of the test cycle, we concluded that a generic scraper is not only feasible but in some cases even better than a simple one. The hardest problem with scraping sites is not necessarily understanding the structure, even when it is vague. The problem for a scraper is rather to understand what can only be described as domain knowledge. Sites use a lot of assumed knowledge which can be hard for a machine to understand, or whose interpretation could be completely wrong in the context. For example, lecture halls can be named the same as a pub in the same region, making it hard for a system to determine whether the location is correct or not. This might be attainable with better heuristics, e.g. if the location lookup can be made with some hierarchical solution and domain knowledge can be extracted from the sites prior to the extraction of events.

5 Future Work

5.1 Page Classification

On the Internet, sites show significant variation and most of them do not contain entertainment events. Therefore a first step in a generic system, the dashed box "Classify" in Figure 1, would be to identify whether the input web page contains events. If it does not, it makes no sense to scrape it, and doing so could even lead to false positives. If web pages could be classified with reasonable certainty, the classifier could also be used with a crawler to create an endless supply of event pages to scrape.

5.2 Exploring Repetitiveness

To solve the dashed box "Identify the event list" shown in Figure 1, we investigated the repetitiveness of the event list. Weighing in structural elements, e.g. P, STRONG, H3, yielded some interesting results on small sites.

This technique can potentially be further refined by calibrating weights if the page is annotated using what is described in Sect. 2.3.

5.3 Rank and Select with Help of Layout

While the system uses a very limited ranking and selection based on an implied layout for the title (preferring H3, H2, etc. over raw text), it would be interesting to have the selection make full use of layout. To attract attention and to create desire, the vital information about an event is among the first things the reader is supposed to notice and comprehend. Thus it is usually presented in a visually distinguishing way. This can be achieved by coloring the text differently, making it larger, or simply using a different font or type. This layout is bundled within the HTML document, possibly modified by the CSS, so looking at these clues with some heuristics allows finding the visually distinguishing sentences [7]. As an example, an event might use an H3 element for the title, bold for the location, or another background color for the date. If the entire system used layout to aid the selection, we believe it would perform better and yield fewer false positives.
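A rough sketch of such a layout-aided selection, scoring text nodes by tag-implied prominence (a stand-in for the fuller CSS-aware heuristics of [7]), might look as follows; the weights and sample HTML are illustrative.

```python
from bs4 import BeautifulSoup

# Tag-based prominence weights; a real system would also inspect CSS
# (font size, colour) as suggested by [7]. Weights are illustrative.
WEIGHTS = {"h1": 5, "h2": 4, "h3": 3, "strong": 2, "b": 2, "a": 1, "p": 0}

html = """
<div class="event">
  <h3>Concert - Bruce Springsteen</h3>
  <p><b>Location</b> Boston</p>
  <p>Friday 20:00. Tickets at the door.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
candidates = [(WEIGHTS.get(tag.name, 0), tag.get_text(strip=True))
              for tag in soup.find_all(True)
              if tag.get_text(strip=True)]
# Pick the most prominent (highest weight, then shortest) text node as the title.
title = max(candidates, key=lambda c: (c[0], -len(c[1])))[1]
print(title)
```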


References

1. Hogenboom, F., Frasincar, F., Kaymak, U., de Jong, F.: An Overview of Event Extraction from Text. In: van Erp, M., van Hage, W.R., Hollink, L., Jameson, A., Troncy, R. (eds.): Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011) at the Tenth International Semantic Web Conference (ISWC 2011). Volume 779 of CEUR Workshop Proceedings, CEUR-WS.org (2011) 48–57
2. Liu, M., Liu, Y., Xiang, L., Chen, X., Yang, Q.: Extracting key entities and significant events from online daily news. In: Fyfe, C., Kim, D., Lee, S.Y., Yin, H. (eds.): Intelligent Data Engineering and Automated Learning – IDEAL 2008. Volume 5326 of Lecture Notes in Computer Science, Springer Berlin/Heidelberg (2008) 201–209
3. Chun, H.w., Hwang, Y.s., Rim, H.C.: Unsupervised event extraction from biomedical literature using co-occurrence information and basic patterns. In: Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP'04), Berlin, Heidelberg, Springer-Verlag (2005) 777–786
4. Exner, P., Nugues, P.: Using Semantic Role Labeling to Extract Events from Wikipedia. In: van Erp, M., van Hage, W.R., Hollink, L., Jameson, A., Troncy, R. (eds.): Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011) at the Tenth International Semantic Web Conference (ISWC 2011). Volume 779 of CEUR Workshop Proceedings, CEUR-WS.org (2011) 38–47
5. Borsje, J., Hogenboom, F., Frasincar, F.: Semi-automatic financial events discovery based on lexico-semantic patterns. Int. J. Web Eng. Technol. 6(2) (January 2010) 115–140
6. Hienert, D., Luciano, F.: Extraction of historical events from Wikipedia. In: Proceedings of the First International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, CEUR-WS.org (2012)
7. Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: Extracting content structure for web pages based on visual representation. In: Proceedings of the 5th Asia-Pacific Web Conference on Web Technologies and Applications (APWeb'03), Berlin, Heidelberg, Springer-Verlag (2003) 406–417


Automatic Extraction of Soccer Game Events from Twitter

Guido van Oorschot, Marieke van Erp1, and Chris Dijkshoorn1

1 The Network Institute, Department of Computer Science, VU University Amsterdam

{marieke.van.erp,c.r.dijkshoorn}@vu.nl

Abstract. Sports event data is often compiled manually by companies, which rarely make it available for free to third parties. However, social media provide us with large amounts of data that discuss these very same matches for free. In this study, we investigate to what extent we can accurately extract sports data from tweets talking about soccer matches. We collected and analyzed tweets about 61 Dutch premier league soccer matches. For each of these matches we 1) extracted the minutes in which an event occurs, 2) classified the event type, and 3) assigned events to either the home or away team. Our results show that the aggregation of tweets is a promising resource for extracting game summaries, but further research is needed to overcome data messiness and sparsity problems.

1 Introduction

Soccer is a highly popular game, and so is information about the soccer matches played. Many soccer fans try to keep track of their favorite teams by reading or watching game summaries. Generally, these summaries provide an overview of the minutes in which game highlights such as goals, cards, and substitutions happen for both teams. This type of data is often created manually, a time-consuming and expensive process, and companies make good money selling these data to third parties. However, the rise of vast amounts of data on social media platforms such as Twitter1 is drawing the attention of the research community. [1], for example, mine Twitter to detect earthquakes and notify people more quickly and accurately than conventional methods are able to. [2] predict stock prices by analysing sentiment in tweets about stock tickers. Twitter is also a beloved medium for sports fans; during matches they often tweet about their teams and what is happening in the match. Preliminary work to extract useful information about sport matches from tweets has been carried out by [3]; they were able to successfully extract certain types of events from soccer and rugby games by analysing the number of tweets per minute. In this contribution, we build upon
