• Aucun résultat trouvé

Text and Documents 2

N/A
N/A
Protected

Academic year: 2022

Partager "Text and Documents 2"

Copied!
27
0
0

Texte intégral

(1)

Text and Documents 2

CS 7450 - Information Visualization March 18, 2004

John Stasko

InfoVis Tasks

• Two main tasks that Information

Visualization can assist with in this area

Enhance a person’s ability to read, understand and gain knowledge from a document

Understand the contents of a document or collection of documents without reading them

(2)

Spring 2004 CS 7450 3

More Specific Tasks

• Which documents contain text on topic XYZ?

• Which documents are of interest to me?

• Are there other documents that might be close enough to be worthwhile?

• What are the main themes of a document?

• How are certain words or themes distributed through a document?

Simple Taxonomy

Single document

Collection of

Enhanced presentation (syntax)

Concepts and relationships (semantics)

Today’s focus

(3)

Spring 2004 CS 7450 5

Document Collections

• Problem or challenge is how to present the contents/semantics/themes/etc of the documents to someone who does not have time to read them all

• Who cares?

Researchers, news people, CIA, InfoVis contest judges ….

Vector Space Analysis

• How does one compare the similarity of two documents?

• One model

Make list of each unique word in document Throw out common words (a, an, the, …)

Make different forms the same (bake, bakes, baked)

Store count of how many times each word appeared

Alphabetize, make into a vector

(4)

Spring 2004 CS 7450 7

Vector Space Analysis

• Model (continued)

Want to see how closely two vectors go in same direction, inner product

Can get similarity of each document to every other one

Use a mass-spring layout algorithm to position representations of each document

• Some similarities to how search engines work

Wiggle

• Not all terms or words are equally useful

• Often apply TFIDF

Term frequency, inverse document frequency

• Weight of a word goes up if it appears often in a document, but not often in the collection

(5)

Spring 2004 CS 7450 9

Process

Analysis Algorithms Visualization

Documents

Decomposition, statistics

Similarity, clustering, normalization

2D, 3D display Vectors,

keywords

Data tables for vis

Smart System

• Uses vector space model for documents

May break document into chapters and sections and deal with those as atoms

• Plot document atoms on circumference of circle

• Draw line between items if their similarity exceeds some threshold value

Salton et al Science ‘95

(6)

Spring 2004 CS 7450 11

Text Relation Maps

Label on line can indicate similarity value

Items evenly spaced

Doesn’t give viewer idea of how big each section/document is

Improved Design

Proportional to length of section

Links placed at correct relative position

(7)

Spring 2004 CS 7450 13

Text Themes

• Look for sets of regions in a document (or sets of documents) that all have common theme

Closely related to each other, but different from rest

• Need to run clustering process

Algorithm

• Recognize triangles in relation maps

Three with edges above threshold

• Make a new vector that is centroid of 3

• Triangles merged whenever centroid vectors are sufficiently similar

(8)

Spring 2004 CS 7450 15

Text Theme Example

Triangles shown

Colored in to help presentation

Skimming and Summarization

Can use graph traversal to follow specific themes throughout collection

Walk along connected edges

(9)

Spring 2004 CS 7450 17

Helpful

• What do you think?

VIBE System

• Smaller sets of documents than whole library

• Example: Set of 100 documents retrieved from a web search

• Idea is to understand contents of documents relate to each other

Olsen et al

Info Process & Mgmt ‘93

(10)

Spring 2004 CS 7450 19

Focus

• Points of Interest

Terms or keywords that are of interest to user

Example: cooking, pies, apples

• Want to visualize a document collection where each document’s relation to points of interest is show

• Also visualize how documents are similar or different

Technique

• Represent points of interest as vertices on convex polygon

• Documents are small points inside the polygon

• How close a point is to a vertex

represents how strong that term is within the document

P1

(11)

Spring 2004 CS 7450 21

Algorithm

• Example: 3 POIs

• Document (P1, P2, P3) (0.4, 0.8, 0.2)

• Take first two

P1

P2

0.4

0.4+0.8 0.333 1/3 of way from P2 to P1

Algorithm

• Combine weight of first two 1.2 and make a new point, P’

• Do same thing for third point

P1

P2

1.2

1.2+0.2 0.86

0.14 of way from P’ to P3

P’

P3

(12)

Spring 2004 CS 7450 23

Sample Visualization

VIBE Pro’s and Con’s

Effectively communications relationships

Straightforward

methodology and vis are easy to follow

Can show relatively large collections

Not showing much about a document

Single items lose “detail”

in the presentation

Starts to break down with large number of terms

(13)

Spring 2004 CS 7450 25

Visualizing Documents

• VIBE presented documents with respect to a finite number of special terms

• How about generalizing this?

Show large set of documents

Any important terms within the set become key landmarks

Not restricted to convex polygon idea

Basic Idea

• Break each document into its words

• Two documents are “similar” if they share many words

• Use mass-spring graph-like algorithm for clustering similar documents together and dissimilar documents far apart

(14)

Spring 2004 CS 7450 27

Kohonen’s Feature Maps

• AKA Self-Organizing Maps

• Expresses complex, non-linear

relationships between high dimensional data items into simple geometric

relationships on a 2-d display

• Uses neural network techniques

Map Display of SOM

(15)

Spring 2004 CS 7450 29

Map Attributes

• Different, colored areas correspond to different concepts in collection

• Size of area corresponds to its relative importance in set

• Neighboring regions indicate commonalities in concepts

• Dots in regions can represent documents

ai2.bpa.arizona.edu/ent/

More Maps

(16)

Spring 2004 CS 7450 31

More Maps

faculty.cis.drexel.edu/Sitemap/

Interactive demos

Xia Lin

Map Review

• Do you think the technique is useful?

• Strengths/weaknesses?

(17)

Spring 2004 CS 7450 33

WEBSOM

Self-organizing map of Net newsgroups and postings

websom.hut.fi/websom Demo

Map.net

maps.map.net/start Demo

(18)

Spring 2004 CS 7450 35

Search Visualization

www.kartoo.com

Work at PNL

• Group has developed a number of visualization techniques for document collections

Galaxies

Themescapes

ThemeRiver

...

Wise et al

(19)

Spring 2004 CS 7450 37

Galaxies

Presentation of documents where similar ones cluster together

Themescapes

• Self-organizing maps didn’t reflect density of regions all that well -- Can we improve?

• Use 3D representation, and have height represent density or number of

documents in region

(20)

Spring 2004 CS 7450 39

Themescape

Video

WebTheme

(21)

Spring 2004 CS 7450 41

ThemeRiver

Digesting News

• Current information infrastructure simply can’t handle exploding scale of news information and its cross correlation

• Need for an intelligent system that automatically builds the correlations and relationships between news articles

(22)

Spring 2004 CS 7450 43

Galaxy of News

• Purpose

Allows users to quickly gain a broad understanding of a news base

Allows users to explore and effectively browse through expanding news base

Allows users to find relationships between articles that would otherwise be unknown

Rennison UIST ‘94 powerful relationship

construction engine effective navigation

Approach

• Pyramidal representation

No global, explicit hierarchical representation

• Semanitc zooming and panning with fluidity and navigation

No fixed locations, space is dynamically constructed

• Use of galaxy as a metaphor gives space a freedom of dimensional constraints

(23)

Spring 2004 CS 7450 45

Key Features

• Users move freely, smoothly and continuously in 3D space

• Move through network of symbols that are sorted, close proximity means related

• x-y topic space, z (depth) is more detail

• User’s trajectory determines what appears

• Key idea: space is not statically defined and laid out

Display

(24)

Spring 2004 CS 7450 47

Discussion

• Why not use hypermedia?

• Credibility issues on news filters

• “Hyper”, sudden Jumping

• No idea what other info. is available

• Manually connect all documents ?

Do we trust news filters?

Provide access to all articles

Potential Application

Search engine

(25)

Spring 2004 CS 7450 49

Related Idea on Web

www.plumbdesign.com/thesaurus/

Another Challenge

• Visualize an entire book

• What does that mean?

• How about showing word appearances?

(26)

Spring 2004 CS 7450 51

TextArc

Brad Paley textarc.org

Upcoming

• Software Visualization

Reading

Kimelman et al

• Guest talk at GVU Convocation

No regular class

• Evaluation

(27)

Spring 2004 CS 7450 53

References

• Spence and CMS texts

• All referred to papers

• F ‘99 slides

Ho and Kim

Lewis

Miller

Références

Documents relatifs

Most popular feature selection methods for text classification (TC) are based on binary information concerning the presence/absence of the feature in each training document.. As

Au Kenya et en Ouganda, il est aussi présent, mais surtout sous sa forme miraa, dont on mastique les tiges plus fines que celle du khat éthiopien et, dans ces deux pays, la

La grève de Waihi s’inscrit dans le cadre plus général de la lutte engagée depuis 1908 par les Red Feds contre le gouvernement et le système de conciliation et d’arbitrage

Tollis, Graph Drawing: Algorithms for the Visualization of Graphs.?. Spring 2004 CS

• Want to view a page of a document but see context of surrounding text and

Первый вариант необходим в случае дальнейшей индексации текста как для геометрического, так и для полнотекстового поиска, а

A neural network algorithm for labeling images in text documents is developed on the basis of image preprocessing that simplifies the localization of individual parts of a

The analysis of short or medium-size EUTDs is performed in conditions of medium or significant degree of rubric intersection and in case of statistical data