EMERGENT VOCABULARY CONTROL IN WEB 2.0


Comparisons with conventional LIS theory and practice

VANDA BROUGHTON

The paper examines the phenomenon of user assigned metadata (tagging) and evaluates the extent to which elements of vocabulary control found in structured terminologies are replicated there. The ways in which vocabulary is controlled in traditional formats, such as thesauri, are discussed, and the advantages for information retrieval considered. Similar processes evolving in tagging systems are identified, and various techniques for introducing vocabulary control are described, and the results compared with formal controlled vocabularies. Ways in which controlled vocabularies and tags can be combined to enhance performance are considered. The paper concludes that tagging systems have useful features that complement conventional indexing, but that they lack the precision of intellectually built systems, particularly with respect to structural relationships, and that it is difficult to derive these relationships using machine methods.



It is generally accepted today that the addition of metadata to digital resources is a useful and effective way to improve retrievability. Keywords, descriptors, subject headings, and tags all help to reveal subject content that may not be explicit in the text itself, and, in the case of non-text resources, can be the only practical way to indicate their semantic aspects.

Metadata can be divided into four main categories according to its origins:

– professionally generated metadata, created by cataloguers and indexers using conventional controlled vocabularies

– author generated metadata, usually in the form of keywords added to the preliminaries of a paper

– automatically generated metadata, derived from the resource itself using text mining or natural language processing tools, or harvested from the resource’s existing metadata

– user generated metadata, attached by end users, usually on an informal and individual basis, but increasingly as a result of co-operative projects.

These are not mutually exclusive categories and there is some degree of overlap between them. Automatically generated metadata, for example, may contain some harvested metadata originally created by information professionals, and the boundaries between author generated and user generated tags in social bookmarking tools are mostly impossible to discern. Nevertheless, where the assessment of metadata value is concerned, this analysis provides a convenient starting point and it offers a basis for considering the feasibility of metadata generation.

This paper will focus on the way in which user generated metadata can enhance search and retrieval, as well as knowledge organization, but as most of the contemporary literature considers user metadata in comparison with professionally created metadata, and the advantages and disadvantages of each, it is useful to look at these in the first instance.

Professionally created metadata

Research indicates that in an ideal world professionally assigned metadata gives the best search results. There are some contraindications for professional metadata, however, at both the practical and the conceptual level. The most frequently identified problem is one of scalability; although there is little documentary evidence for the costs of cataloguing and metadata assignment (Cedars Final Report, 2002), there is general agreement that it consumes resources both at the point of generation and, over a longer period, in indexer training (Mathes, 2004). Other than in a managed environment, such as an individual institution or bibliographic database, adding such formal metadata is an unrealisable goal.

There are also objections to professional metadata on the grounds that it reflects the cultural and linguistic biases inherent in traditional library and information science tools, particularly terminologies. There is a history of criticism of this unacceptable bias from within the profession (Berman, 1971), and although it has to some extent been corrected, it remains a concern (Knowlton, 2005). More recent studies of user understanding of controlled vocabularies have indicated that nuances of meaning are entirely lost on most end users, and more worryingly, many professionals fail to comprehend the meaning of the majority of entries (Drabenstott et al., 1999). From outside the profession, comparisons with user generated tagging systems suggest that established LIS tools, while representing subject in a detailed and specific manner, usually fail to indicate other attributes of resources that end users regard as significant, and often lack currency in the terminology. For these reasons there are perceived benefits to user metadata other than the purely practical and economic ones.

Automatically generated metadata

Automatic metadata generation (AMG) is regarded as a cost-saving, if less effective, alternative to expert metadata, especially within the scholarly information scene, and has been the subject of a number of studies over recent years (Greenberg et al., 2005; Polfreman et al., 2008). Techniques include text and data mining, statistical text analysis and natural language processing, to derive appropriate descriptors from textual and data corpora. Image retrieval is a particular focus of interest, although much of the research on automatic classification of images takes place within the field of machine intelligence and has little immediate application to information retrieval. There have been other more relevant investigations, and a number of toolkits developed, for automatically assigning metadata for film and sound resources, and learning objects, as well as general textual material.

Some projects have made use of specific LIS bibliographic or documentary classifications, subject heading lists or thesauri, but they are relatively few.

Several older US studies considered the Library of Congress Classification, Library of Congress Subject Headings, and the Dewey Decimal Classification as a basis for automatic categorization or classification (Larson, 1992; Thompson et al., 1997; Chan, 2000). One UK example is the Wolverhampton project (Automatic RDF Metadata Generation for Resource Discovery, n.d.), which used an automatic classifier based on the Dewey Decimal Classification. A more recent project used automatic indexing techniques to extract MeSH (Medical Subject Headings) terms (Neveol et al., 2006). Assignment of controlled vocabulary terms can be achieved in different ways: for instance, Data Fountains’ iVia LCSH software (Ivia, 2009) assigns Library of Congress Subject Headings to new resources by choosing the LCSH terms that it has most frequently applied to other previously catalogued resources that contain those same keywords; alternatively, the extracted keywords may be mapped to a classification scheme such as DDC.
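The frequency-based approach to heading assignment described above can be illustrated with a short sketch. It is not the iVia implementation: the training records, the function names, and the scoring rule (a simple count of how often a heading has been assigned alongside a keyword) are all assumptions made for the purpose of the example.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-made training data: (keywords found in a catalogued
# resource, subject headings a cataloguer assigned to it).
CATALOGUED = [
    ({"tagging", "metadata", "web"}, ["Metadata", "Folksonomies"]),
    ({"thesaurus", "metadata", "indexing"}, ["Metadata", "Subject headings"]),
    ({"tagging", "bookmarks"}, ["Folksonomies"]),
]


def build_keyword_index(records):
    """Count how often each heading was assigned alongside each keyword."""
    index = defaultdict(Counter)
    for keywords, headings in records:
        for keyword in keywords:
            index[keyword].update(headings)
    return index


def suggest_headings(keywords, index, limit=2):
    """Rank headings by total co-assignment count across the new keywords."""
    scores = Counter()
    for keyword in keywords:
        scores.update(index.get(keyword, Counter()))
    # Sort by descending score, then alphabetically, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [heading for heading, _ in ranked[:limit]]


index = build_keyword_index(CATALOGUED)
print(suggest_headings({"tagging", "metadata"}, index))
```

The quality of the suggestions in such a scheme depends entirely on the size and consistency of the previously catalogued collection; a production system would need weighting and thresholds that are not shown here.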

There is evidence that value is placed on the role of controlled vocabularies in second level processing of extracted metadata, and the advantages gained by combining uncontrolled AMG metadata with controlled vocabularies may offer some lessons for the further exploitation of user generated metadata. A number of research studies have demonstrated that the use of a controlled vocabulary improves the performance of the tool or system (Liang et al., 2006; Cheung et al., 2005; Aula and Kaki, 2005; Ko et al., 2004), and document classification systems are likely to use a controlled vocabulary of some kind, whether this is a traditional system, a ‘built’ thesaurus or ontology, or a more limited set of pre-determined categories or subject headings.

In the commercial sector in particular there are numerous examples of automatic classification tools which combine term extraction of various kinds with a controlled vocabulary; the latter is unlikely to be a standard LIS terminology, and may only represent a fairly narrow domain, raising questions about the viability of universal subject tools. Autonomy (www.autonomy.com), one of the earliest operators in this field, offers a variety of specialist taxonomies to be used in combination with its automatic classifier, and many other companies utilise similar techniques, either using a pre-determined controlled vocabulary, or one generated during the process of text analysis (Blumberg and Atre, 2003). Newsindexer (www.newsindexer.com/index.html) demonstrates an interesting combination of text analysis and a controlled vocabulary where the choice of thesaurus terms is conditional on the context of extracted terms. There is also a body of work on the use of controlled vocabularies as search thesauri for the purpose of query formulation and query expansion (Shiri and Revie, 2006; Tudhope et al., 2006; Blocks, 2004). Here search terms rather than extracted metadata are automatically mapped to a pre-built thesaurus which provides control of synonyms and a hierarchical browsing structure. All of these tools and systems for the enhancement of mechanically derived metadata, and end-user search terms, could provide models for similar work in the realm of user generated metadata.
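The mapping of search terms to a pre-built thesaurus for query expansion can be sketched as follows. The miniature thesaurus, its terms, and the function name are hypothetical; the sketch does not reproduce any of the systems cited above.

```python
# Hypothetical miniature search thesaurus: entry terms point to a preferred
# term, and each preferred term lists its narrower terms.
USE = {
    "cars": "automobiles",
    "autos": "automobiles",
    "automobiles": "automobiles",
}
NARROWER = {"automobiles": ["electric cars", "sports cars"]}


def expand_query(term):
    """Map a search term to its preferred term and add narrower terms."""
    key = term.strip().lower()
    preferred = USE.get(key)
    if preferred is None:
        # Unknown terms pass through unchanged rather than being dropped.
        return [key]
    return [preferred] + NARROWER.get(preferred, [])


print(expand_query("Autos"))
```

The searcher types a single everyday term, and the system silently substitutes the preferred term and widens the search to more specific concepts.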


Benefits of controlled vocabularies

In many respects tagging is in a similar state to keyword or post-coordinate (sometimes called coordinate) indexing in the 1950s and 1960s, before vocabulary controls and standards for terminology construction were introduced, and there are correspondences between early post-coordinate keyword lists, and both user generated metadata and automatically generated metadata.

Where data (and metadata) is held in databases (of which the automated catalogue is one example) linear order becomes irrelevant as a means of retrieval, and, because there is no need to arrange the items systematically, individual keywords, or descriptors, can be used to indicate subject content.

These can appear on the item record as a series of independent ‘tags’ with no sense of priority or order, i.e. there is no pre-coordination of terms. Terms are only coordinated at the point of search, that is ‘post’ the indexing.
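A minimal sketch of post-coordination, assuming a handful of invented item records: descriptors are stored as unordered sets, and terms are combined only when a search is made.

```python
from collections import defaultdict

# Hypothetical item records, each carrying an unordered set of descriptors.
RECORDS = {
    "doc1": {"libraries", "cataloguing", "automation"},
    "doc2": {"libraries", "architecture"},
    "doc3": {"cataloguing", "automation"},
}


def build_inverted_index(records):
    """Map each descriptor to the set of items indexed with it."""
    index = defaultdict(set)
    for item, descriptors in records.items():
        for descriptor in descriptors:
            index[descriptor].add(item)
    return index


def search_all(index, *terms):
    """Coordinate terms at search time: return items indexed with all of them."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(RECORDS)
print(sorted(search_all(index, "libraries", "cataloguing")))
```

Nothing in the stored records says that ‘libraries’ and ‘cataloguing’ belong together; the combination exists only in the query.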

Tags are usually arrived at spontaneously or derived from the text in hand without reference to any local or general authority, and it is interesting to see that many of the criticisms of tagging reflect the problems that vocabulary control was designed to address in post-coordinate systems.

Like tagging systems, early keyword tools were largely uncontrolled, and descriptors were either assigned by indexers on the basis of their interpretation of content, or sometimes mechanically extracted from the text. However, contemporaneous research into the information retrieval process showed that such uncontrolled use of language did not make for very efficient retrieval (Cleverdon, 1962), and that the factors affecting retrieval performance could be identified (Cleverdon, 1966). The use of natural, uncontrolled language was shown to lead to inconsistency in indexing, and subsequently a more managed approach to indexing vocabularies began to emerge.

The result of introducing what came to be known as vocabulary control was the emergence of the thesaurus as the primary post-coordinate indexing tool.

The history of this process is well documented (Aitchison and Dextre Clarke, 2004), and today we have a number of national and international standards for good practice in the design of thesauri and related vocabularies, both monolingual and multilingual.

Vocabulary control consists of several elements that together help to ensure consistency in indexing, and improve the match between index terms and search terms to raise efficiency in retrieval. One set of processes deals with the meanings of terms, and correspondences between terms with the same or similar meanings (and sometimes different meanings associated with terms that look the same). These can be considered together under the general heading of ‘equivalence relationships’, the purpose being to select a ‘preferred term’ from a group of terms that are similar, or very similar, in meaning, structure or form. They can be listed as follows:

– one preferred term is selected from a pair or group of synonyms, or near-synonyms

– one form is selected from singular/plural, different parts of speech, variant spellings, etc.

– homonyms, or polysemous words, are differentiated (disambiguated) by the use of qualifiers.

This is quite a complex process in practice, so that, for instance, plural or singular form may be used for different categories of terms e.g. singular for abstract nouns (peace, unity, consciousness), plurals for concrete, or count nouns (rabbits, umbrellas, boxes, metals). In the same way, consistent policies are established for dealing with punctuation, numbers, and other symbols which occur in terms, and choices must be made between scientific and vernacular names, foreign and local spellings, forms of personal and corporate names, and so on. The formal standards, such as British Standard BS:8723 for thesauri and other controlled vocabularies (British Standards Institution, 2005), deal with these matters in some detail and large numbers of illustrative examples can be found there.
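The three kinds of equivalence control listed above can be illustrated with a small lookup. The entry vocabulary, the qualifiers, and the function name are invented for the example and are not taken from BS:8723 or any published thesaurus.

```python
# Hypothetical entry vocabulary: every variant (synonym, singular or plural
# form, variant spelling) points to one preferred term.
PREFERRED = {
    "rabbit": "rabbits",
    "rabbits": "rabbits",
    "bunnies": "rabbits",
    "aluminum": "aluminium",
    "aluminium": "aluminium",
}

# Homonyms carry qualifiers so that each sense has its own descriptor.
QUALIFIED = {
    "mercury": ["mercury (metal)", "mercury (planet)"],
}


def control(term):
    """Return the candidate descriptors for a raw term."""
    key = " ".join(term.lower().split())  # normalise case and spacing
    if key in QUALIFIED:
        return QUALIFIED[key]  # the indexer must choose a sense
    if key in PREFERRED:
        return [PREFERRED[key]]
    return []  # the term is not in the vocabulary


print(control("Bunnies"))
print(control("Mercury"))
```

An empty result marks a term that the vocabulary does not yet cover, which in a managed thesaurus would be referred to the editor as a candidate term.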

There is a remarkable degree of consensus about how this should be done, and there is little variation in practice to be found in the various national and international standards, or in the equivalent national subject heading systems (Lopes and Beall, 1999).

The second major pillar of conventional indexing tools is the identification of structural relationships between terms, which assist in the navigation of the vocabulary (by providing cross references between more general and more specific terms or concepts, and between terms related in other ways), and allow for the formulation and modification of metadata strings (such as compound terms and pre-coordinated subject headings) and search queries. A distinction is drawn between the hierarchical (also called semantic, or ontological) relationships of broader and narrower terms which are inherent and permanent, and syntactic relationships between terms which are not part of the permanent structure of a subject, but which occur as a result of combining concepts in compound subjects. Semantic relationships are by definition between two concepts of the same order e.g., two entities (dog - poodle) or two processes (breathing - inhalation); in the classification or taxonomy, they are class and sub-class, and in the thesaurus broader and narrower terms. Syntactic relationships represent the intersection of concepts of a different order e.g., entity and process (ruminant - digestion), operation and agent (weaving - handlooms); in the classification they are combined classes, in the thesaurus related, or associative, terms.
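A thesaurus fragment holding both kinds of relationship might be represented as below. The dog, poodle, breathing, ruminant and weaving examples come from the text; the term ‘mammal’, the table layout, and the function names are assumptions added for illustration.

```python
# Hierarchical (broader term) links between concepts of the same order.
BROADER = {"poodle": "dog", "dog": "mammal", "inhalation": "breathing"}

# Associative (related term) links between concepts of a different order.
RELATED = {"ruminant": ["digestion"], "weaving": ["handlooms"]}


def broader_chain(term):
    """Follow broader-term links upward; return broader terms, nearest first."""
    chain = []
    seen = {term}
    while term in BROADER:
        term = BROADER[term]
        if term in seen:  # guard against accidental cycles in the data
            break
        chain.append(term)
        seen.add(term)
    return chain


def narrower_terms(term):
    """Derive narrower-term links by inverting the broader-term table."""
    return sorted(child for child, parent in BROADER.items() if parent == term)


def related_terms(term):
    """Return associative terms, which carry no hierarchical implication."""
    return RELATED.get(term, [])


print(broader_chain("poodle"))
print(narrower_terms("dog"))
print(related_terms("weaving"))
```

Because the hierarchical links are permanent, they can be followed transitively, whereas the associative links are simply listed and imply nothing beyond the single pairing.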

The identification of relationships also provides a means of generating systematic views of subject fields through the realisation of implicit classificatory and ontological structures. In more mature forms of indexing tool, such as the standard thesaurus, the operation of the semantic and structural relationships is combined in the rules for the management of compound terms. Examples of these can be found in the major standards such as BS:8723, and are largely derived from the hierarchical status and relationships of individual terms. Today a degree of convergence can be seen in the interdependency of thesaurus standards at international level, and BS:8723 suggests a degree of coherence in the underlying theory for generating different kinds of vocabularies.

Of course this complexity in the design process is not a concern for the indexer, who will use the vocabulary as presented. This leads us to the final advantage of the controlled vocabulary: its role as a standard for implementation, relieving indexers of the need to consider these conceptual and linguistic problems themselves. The advantages of controlled vocabularies can be summarised as follows:

– the problem of managing variant terms is addressed and preferred terms are chosen which aid consistency in indexing and improve retrieval performance

– this is achieved at the conceptual level through synonym control, and at the practical level through the control of word forms (punctuation, spelling, pluralization, etc.)

– identification of structural relationships allows for navigation, and for the formulation of complex metadata and of search queries

– implicit classificatory and ontological structures may provide a subject overview and browsing tools

– the vocabulary acts as a standard, offering guidance to the indexer, and ensuring uniformity of practice.

Where the controlled vocabulary is embedded in the system it may operate ‘invisibly’ through automatic mapping of terms or automatic hierarchical browsing, without the need for any input on the part of either the indexer or the searcher (Manning et al. 2008, 191).


The origins of user generated metadata

Within the Web 2.0 context, user generated metadata appears to have originated in the social bookmarking site then called Del.icio.us (delicious.com/). Delicious combined two practices: the organization of personal information, and the sharing of this information through the world wide web. What additionally distinguished Delicious from other bookmarking systems was the facility to add keywords to the resources, the first example of user generated metadata. Other sites, such as Flickr (www.flickr.com/), soon followed suit, and the practice of adding personal keywords, or tags, to online resources became common. A distinction can be drawn between ‘broad’ folksonomies, like Delicious, where resources have been tagged by numerous users, and ‘narrow’ folksonomies, like Flickr, where a resource may be tagged only once, usually by its creator.

Although initially tags were assigned to items more or less randomly and independently, as the practice grew, and the tag mass increased, systems of representing and organizing tags began to emerge. The introduction of the term ‘folksonomy’ by Thomas Vander Wal in 2004 (Vander Wal, 2007), with its obvious associations with ‘taxonomy’, itself suggests that the free and easy tagging of content had somehow automatically begun to generate structures, these sharing some characteristics of ‘traditional’ classifications. (‘Personomy’ may be used as an alternative for the collection of tags used by an individual tagger (Jaschke, 2008, p. 231).) Vander Wal also uses the expression ‘emergent thesaurus’ (Vander Wal, 2007) which implies that, at some stage, the disorganized tag mass acquires some elements of structure, and that simple vocabulary tools in some way coalesce from the uncontrolled vocabulary.

Certainly, as the resources are accessed and shared by others, the vocabulary is used and added to in a collaborative manner, and the folksonomy can therefore be regarded as having indexing, retrieval, and organizational or browsing functions, even if these work in a relatively crude way. More sophisticated examples of tagging can be found in sites like Connotea (www.connotea.org/) and Citeulike (www.citeulike.org/), intended for the management of scholarly references, where the tags are likely to be highly specific, and where in most instances they serve to supplement existing professionally produced subject metadata since, in the first instance, references are derived from bibliographic databases.

It should be noted that, at least at the time of its appearance, there was a political dimension to tagging, based on a robust open source culture, where tagging was perceived as democratic and under the control of the users, as opposed to controlled vocabularies and other standards which are representative of authority and institutional power. This has given rise to expressions such as ‘social classification’ and ‘democratic classification’. The expression ‘social web’ is also used as a contrast or complement to the ‘semantic web’. Within the information sharing context, the social web has more emphasis laid on freedom of access, collective intelligence, open, uncontrolled aspects of organization, and the advantages thereof; semantic web approaches are more likely to make use of embedded technologies and intellectually built tools, and to draw on models from conventional indexing to support intelligent searching.

Evaluation of tagging systems

A relatively neutral approach is adopted in surveys of existing tools undertaken by Mathes (2004), Hammond et al. (2005), and Macgregor and McCulloch (2006). Otherwise, opinion tends to be divided between those who see the folksonomy as inherently suited to the unmanaged, more democratic, web environment in a way that formal indexing tools are not, and those who stress the better retrieval performance of a structured vocabulary. There is a considerable volume of literature representing both sides of this divide, much of it in the form of blog postings, and opinion pieces on the web.

The tagging process itself is still not easily understood, and Tonkin and Guy (2006) assert that "[a]t this time, little is known about the decision-making process behind tag selection, and quantitative data is relatively scarce". A model for the cognitive process is proposed by Sinha (2005), but has not been subjected to testing. This reflects a similar state of ignorance concerning more formal subject cataloguing processes. Shirky also identifies considerable variation in the skill of taggers, and in the number of tags used by individuals, ranging from a single figure to several hundred (Shirky, 2006).

Advantages of folksonomies

An overview of the political and human rights aspects of tagging is provided by Birdsall (2005; 2007) who stresses the democratic nature of the web and the opportunities it provides for interaction, collaboration, and communication, with the emphasis on open and equal participation. Viewed from this perspective, tagging is opposed to the authoritarian structures of the organized institutional information world, and any judgement based on its retrieval properties is irrelevant. Macgregor and McCulloch (2006) observe that this characteristic of tagging is frequently seen by those who adopt an ideological stance as over-riding any imperfections. An associated approach is that the very uncontrolled-ness of tagging vocabulary is more appropriate to the nature of the world wide web, with its inherent disorder.

Nevertheless, most commentators have assessed tagging in terms of its performance as a means of organization and retrieval. The primary advantages of tagging are that it is quick, cheap and intuitive to use, and tremendously easy to access. No training is required, no rules need to be learnt, no expensive tools are involved, and no constraints are imposed upon the user.

Flexibility and currency of the vocabulary used are also regarded as important.

Rapid responsiveness to changing terminology in fast evolving subject fields is seen as a major feature, and is often contrasted with the slowness of published terminologies to accommodate new terms. Within specialist communities, group familiarity with evolving technical vocabulary may impart a degree of uniformity of practice. Another significant feature of tagging is the identification of attributes of resources which are not addressed in controlled vocabularies, and, in this respect, resources which have been tagged may more accurately match end-user search terms. In a more general sense, correspondence with user practice is a notable feature, and "perhaps the most important strength of a folksonomy is that it directly reflects the vocabulary of users" (Mathes, 2004, p.10).

For the most part, the arguments for and against tagging concentrate on the conceptual basis (freedom of access, ease of use, etc.) and the vocabulary (currency, user orientation, etc.), but Hammond et al. (2005) note the added advantage of the excellent technical infrastructure of most tagging sites:

“[a]lmost without exception these social bookmarking tools are feature-rich, providing search on both users and tags (with Boolean operators), comments (and comment trails), simple linking syntaxes, and APIs (application programming interfaces) for posting to and from these tools (and to other tools such as blogs)”.

Shortcomings of folksonomies

The disadvantages of folksonomies arise out of the nature of the language employed, and their relative lack of semantic structure when compared with established controlled vocabularies.

Major criticisms of folksonomies and tagging include the lack of precision in virtually all systems (Peters and Stock, 2010, p.82), and that tags lack any sort of authority control (Noruzi, 2007, p. 1). Problems which are “inherent in an uncontrolled vocabulary” (Mathes, 2004) include the proliferation of synonyms, inconsistency of word forms such as plurals, and the polysemous nature of many terms (Guy and Tonkin, 2006; Noruzi, 2007). Tonkin and Guy (2006) additionally identify the personalised nature of some tags, and the use of ‘private’ terminology, which is not meaningful to a wider audience. Other linguistic difficulties are the lack of appropriate or consistent depth of indexing (Noruzi, 2007), the use of variant forms of the same tag leading to tag redundancy (Niwa et al., 2006), and ambiguity in the application of tags because of the lack of definitions and scope notes (Mathes, 2004). Guy and Tonkin (2006) also raise the issue of poor management of compound and conjugated terms. Taken together these disadvantages result in a great deal of ‘noise’ in the system, with high recall and generally poor performance (Macgregor and McCulloch, 2006). External to difficulties inherent in the vocabulary itself is the problem of tag literacy, and sloppiness in the use of tags which also impedes retrieval (Mejias, 2005).

A related problem is the very general nature of the tags. Delicious makes its most popular tags visible, and the great majority of these relate to formats (blog, software, web, video) or very broad subject categories (politics, education, business, science); many of them are apparently evaluative and subjective (fun, cool). Examples of popular tags from Flickr include: day, color, live, trip, and me. Hammond et al. (2005a) note that, whereas formal classifications are universally oriented, folksonomies tend to be context or user oriented, or sometimes action oriented (for example, ‘read later’). Research carried out by Ulises Ali Mejias of Ideant (Ideant, 2004) nevertheless came to the conclusion that, although many tags are very personal, the meaning of the majority is shared and understood by users. Golder and Huberman (2005) propose a formal division into extrinsic functions (subject, form, ownership, and scale) and intrinsic functions (evaluations, personal references, and task organizing), the former having a high degree of agreement, the latter being only relevant to the individual tagger.

Efforts to improve tagging

There seems to be fairly widespread agreement that tags are an easy way to create a vocabulary, and one which complements traditional terminologies in the nature and range of the tag terms, but that the lack of control creates some problems for retrieval. Given the useful aspects of folksonomy, such as its more current terminology, its closeness to user thinking, and the coverage of non-traditional attributes of tagged objects, the tag vocabulary has enough merits to make it worthwhile trying to overcome these problems.

Key questions are whether there is consensus regarding the vocabulary, whether such a vocabulary is stable over time, whether basic features of control are present, and whether there is any representation of relationships to support navigation and browsing. Secondary questions concern how the vocabulary is presented to the user, and the phenomenon of the ‘long tail’ of rarely used tags.

Vocabulary control and tagging

A variety of proposals have been made, and projects undertaken, to try and improve the degree of control of tags and increase their effectiveness. Among these, there are three main ways in which the problems of an uncontrolled vocabulary may be addressed:

– by the emergence of a semi-controlled vocabulary from the tag mass, based on popular tags,

– by use of an external controlled vocabulary in combination with the unorganized tag mass,

– by the imposition of editorial rules (this is rare but sometimes occurs in collaborative tagging projects like the Open Directory).

This replicates thesaurus practice, where the thesaurus may be built intellectually, or derived from a textual corpus automatically on the basis of co-occurrence analysis or other computational and bibliometric techniques. In the latter situation, it is the case that the relationships between terms are necessarily non-specific, and “[t]he quality of the associations is typically a problem. Term ambiguity easily introduces irrelevant statistically correlated terms” (Manning et al. 2008, 192), and we shall see that the same applies to the tagging environment.
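The non-specific character of statistically derived relationships can be seen in a small co-occurrence sketch. The tag sets and function names below are invented; the method is plain pair counting, not any particular published technique.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag sets, one per tagged resource.
POSTS = [
    {"python", "programming", "tutorial"},
    {"python", "programming"},
    {"python", "snake", "zoo"},
    {"programming", "tutorial"},
]


def cooccurrence_counts(posts):
    """Count how many resources carry each unordered pair of tags."""
    pairs = Counter()
    for tags in posts:
        for a, b in combinations(sorted(tags), 2):
            pairs[(a, b)] += 1
    return pairs


def related_tags(tag, pairs, min_count=1):
    """List tags that co-occur with `tag`, most frequent first."""
    scores = Counter()
    for (a, b), count in pairs.items():
        if count >= min_count and tag in (a, b):
            scores[b if a == tag else a] += count
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


pairs = cooccurrence_counts(POSTS)
print(related_tags("python", pairs))
```

In this toy data ‘snake’ and ‘tutorial’ are both returned as related to ‘python’ with equal weight: the counts show that an association exists, but not what kind it is, nor that two different senses of the word are involved.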

Folksonomy based vocabularies

The simplest and most obvious attempt at standardisation of tag use is through the medium of tag clouds, or similar visual displays, whose purpose is to indicate the most heavily used tags, and to promote their re-use by new taggers. Most systems contain a recommendation system of this kind, although they tend only to use very basic strategies, being generally reliant on the frequency with which tags are assigned (Jaschke, 2008, p. 231), and on their co-occurrence, rather than on any semantic analysis of tags (Shirky, 2006). Tag clouds may include related (Connotea) or unrelated tags (Delicious popular tags), and tags may be listed in alphabetical order, although popularity is usually indicated by font size. Sometimes the arrangement of tags is random (Candan et al., 2008, p. 77) ‘with the hope of projecting a more pleasant (if not highly informative) feeling’. Various software packages such as TagCrowd (tagcrowd.com/), Wordle (www.wordle.net/), ToCloud (www.tocloud.com/), and Tumblr Tag Clouds (tumblrtags.rivers.pro/) can be used to generate tag clouds from a textual corpus operating on the basis of word frequency.
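A frequency-based tag cloud of the kind described above can be sketched in a few lines. The linear scaling rule and the point sizes are assumptions chosen for the example; the packages named above may weight tags quite differently.

```python
from collections import Counter


def tag_cloud(tags, min_size=10, max_size=30):
    """Return (tag, font size) pairs in alphabetical order.

    Font size is scaled linearly between min_size and max_size according
    to how often each tag occurs.
    """
    counts = Counter(tags)
    if not counts:
        return []
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    cloud = []
    for tag in sorted(counts):
        # When every tag is equally frequent, use the smallest size for all.
        scale = (counts[tag] - low) / span if span else 0.0
        cloud.append((tag, round(min_size + scale * (max_size - min_size))))
    return cloud


sample = ["web", "blog", "web", "design", "web", "blog"]
print(tag_cloud(sample))
```

The display rests on frequency alone: nothing in the calculation knows what any tag means or how one tag relates to another.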

Most recent work (for example, Golder and Huberman, 2006; Halpin et al., 2007; Peters and Stock, 2010; Robu et al., 2008) confirms that the tag distribution does stabilize, and a consensus emerges as to significant tags. As the system matures, the common vocabulary tends to become self-controlling as new taggers opt for established tags rather than inventing new ones. As happens with controlled vocabularies, taggers are encouraged to use this limited vocabulary, thus reinforcing the ‘core’ set of tags, and tending towards greater convergence and uniformity in the assignment of tags.

Robu et al. (2007) use informetric methods to establish patterns of collaborative tagging and to answer the question of whether the tagging system gains stability over time. Their expectation that an open-ended number of non-expert users might generate a vocabulary in a constant state of flux is found not to be true, which challenges the idea that a powerful advantage of tagging is the currency, flexibility, and responsiveness of the vocabulary. In practice they find that tagging patterns tend to stabilize into power law distributions, demonstrating the emergence of a collective consensus in the use of vocabulary, and the presence of a limited set of high-use tags. They conclude that tags from the long tail tend to be those of individual interest to the tagger, and used for organization at a personal level. In an analysis of tags on Flickr and Delicious, Guy and Tonkin (2006) suggest that the ‘long tail’ of unusual and rarely used tags does not in fact exist, since only about 10-15% of tags are single use, although they are not able to indicate from their sample how popular the most frequently used tags are.
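The two measurements mentioned here, the shape of the rank-frequency distribution and the share of single-use tags, can be sketched as follows. This is an illustration only and not the procedure of the studies cited: the function names are invented, the data are synthetic, and a least-squares fit on log-log axes is a crude estimator of a power law exponent that would need more careful statistical treatment in real work.

```python
from collections import Counter
import math

def rank_frequency(tag_assignments):
    """Return tag frequencies sorted from most to least used."""
    return sorted(Counter(tag_assignments).values(), reverse=True)

def single_use_share(frequencies):
    """Fraction of distinct tags that were assigned exactly once."""
    return sum(1 for f in frequencies if f == 1) / len(frequencies)

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic data: the tag at rank r is assigned about 1000 / r times.
synthetic = [f"tag{rank}" for rank in range(1, 51) for _ in range(1000 // rank)]
slope = loglog_slope(rank_frequency(synthetic))  # close to -1 for this data

# A tiny hand-made sample in which two of the four distinct tags are single use.
sample = ["a", "a", "b", "c", "c", "c", "d"]
share = single_use_share(rank_frequency(sample))
```

A roughly straight line on log-log axes is consistent with a power law distribution, while the single-use share gives a simple indication of how long the tail actually is.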

Other informetric exercises include that of Peters and Stock (2010), who analyse document-specific tag distributions to identify what they call ‘power tags’: heavily used tags that have broad consensus, and which conform to the power law distribution. Opposed to these are ‘tail tags’, which reveal minority views but are nevertheless useful for those minorities. In narrow folksonomies, where there are limited numbers of item-specific assigned tags, a similar exercise is employed using search tags rather than index tags.

Various studies have examined how tag recommendations can be improved with more sophisticated methods. Jaschke et al. (2008) investigated the use of software to rank tags. A ranking algorithm, FolkRank, similar in function to Google PageRank is employed to weight the value of ‘important’ tags, taggers, and tagged resources. Comparing this with collaborative filtering techniques and frequency analysis over tag data from BibSonomy, CiteULike and Flickr, the ‘results show that the graph-based approach of FolkRank is able to provide tag recommendations that are significantly better than approaches based on tag counts’ (Jaschke et al., 2008, p. 244), but it is considered to be cost-intensive in comparison. Al-Khalifa and Davis (2007) also examine the folksonomy as a source for automatically generated metadata, and in later work (Al-Khalifa et al., 2007) develop a generator for tag suggestion, but the results show that semantic search is superior, because of the necessary limitations of folksonomy search, which is essentially keyword search.
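The general idea of a PageRank-like ranking over tags, taggers, and resources can be sketched as below. This is a simplified illustration and should not be read as the published FolkRank algorithm: the graph construction, the damping value, the way preference is given to the query user and resource, and all names and data are assumptions made for the example.

```python
from collections import defaultdict

def build_graph(assignments):
    """Undirected weighted graph over users, tags and resources."""
    graph = defaultdict(lambda: defaultdict(float))
    for user, tag, resource in assignments:
        u, t, r = ("user", user), ("tag", tag), ("res", resource)
        for a, b in ((u, t), (t, r), (u, r)):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph

def weighted_pagerank(graph, preference=None, damping=0.7, iterations=50):
    """Power iteration: w <- damping * (spread of w) + (1 - damping) * p."""
    nodes = list(graph)
    if preference is None:
        preference = {n: 1.0 / len(nodes) for n in nodes}
    weights = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) * preference.get(n, 0.0) for n in nodes}
        for node in nodes:
            out_total = sum(graph[node].values())
            for neighbour, w in graph[node].items():
                new[neighbour] += damping * weights[node] * w / out_total
        weights = new
    return weights

def recommend_tags(assignments, user, resource, k=3):
    """Rank tags by how much weight they gain when the ranking is biased
    towards a user and resource that already occur in the data."""
    graph = build_graph(assignments)
    nodes = list(graph)
    baseline = weighted_pagerank(graph)
    pref = {n: 1.0 for n in nodes}
    pref[("user", user)] += len(nodes)      # assumed preference weighting
    pref[("res", resource)] += len(nodes)
    total = sum(pref.values())
    biased = weighted_pagerank(graph, {n: v / total for n, v in pref.items()})
    gain = {n[1]: biased[n] - baseline[n] for n in nodes if n[0] == "tag"}
    return sorted(gain, key=gain.get, reverse=True)[:k]

# Hypothetical (user, tag, resource) assignments.
ASSIGNMENTS = [
    ("ann", "python", "r1"), ("ann", "code", "r1"),
    ("bob", "python", "r1"), ("bob", "python", "r2"),
    ("cat", "cooking", "r3"), ("cat", "recipes", "r3"),
    ("dan", "recipes", "r4"),
]
ranking = recommend_tags(ASSIGNMENTS, user="bob", resource="r2", k=2)
```

Even in this toy form the ranking requires repeated passes over the whole graph, which gives some sense of why graph-based recommendation is more costly than counting tag frequencies.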

Using external controlled vocabularies

A pre-built controlled vocabulary tool is very frequently used to manage automatically extracted metadata, and there is a body of thought that promotes a similar use of controlled vocabularies in conjunction with the tag metadata (Noruzi, 2007). Additional terms from the controlled vocabulary can be used to enrich the folksonomy, or the tags can be mapped to an external vocabulary to help with query formulation or modification, or to aid in browsing and navigation. This is particularly advantageous in the early stages of folksonomy development, where the vocabulary is too small to allow for any sort of analysis, and where controlled vocabulary terms can be employed to ‘cold start’ the folksonomy. Sometimes the reverse of this process is examined, where the folksonomy terms are used to expand an existing well-structured vocabulary and the two types of vocabulary exist in a symbiotic relationship (Rosenfeld, 2005).

A model close to those systems operating with automatically generated metadata is that of Calefato et al. (2007), which uses a text analysis technique to extract tags, which are then mapped to a controlled vocabulary, WordNet (wordnet.princeton.edu/), to achieve control of synonyms. However, in other research (Laniado et al., 2007) it was found that only about 8% of tags can be mapped to WordNet equivalents, suggesting that this vocabulary is not particularly useful for the purpose. It appears that performance can be greatly enhanced if primary processing of the tags is more sophisticated. Cantador et al. (2010) combine natural language processing techniques with an algorithm based on Google that effectively deals with spaces and misspellings to create a set of morphological variations of a tag that allow it to be mapped more successfully to an external vocabulary source.
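The principle of generating variant forms of a tag before attempting a mapping can be shown with a small sketch. It is not the method of Cantador et al.: the vocabulary is a hypothetical three-entry lookup table, the singularisation rule is deliberately naive, and misspellings are not handled at all.

```python
import re

# Hypothetical controlled vocabulary: normalised lookup key -> preferred term.
VOCABULARY = {
    "information retrieval": "Information retrieval",
    "library": "Libraries",
    "tag cloud": "Tag clouds",
}

def split_tag(tag):
    """Split a raw tag on punctuation, underscores and camelCase boundaries."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tag)
    return [w.lower() for w in re.split(r"[\s_\-.+/]+", spaced) if w]

def singular(word):
    """Very naive English singularisation; a real system would use a stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def variants(tag):
    """Candidate lookup keys for a raw tag, most literal first."""
    words = split_tag(tag)
    if not words:
        return []
    literal = " ".join(words)
    singular_form = " ".join(words[:-1] + [singular(words[-1])])
    return [literal] if literal == singular_form else [literal, singular_form]

def map_tag(tag):
    """Return the preferred term for a tag, or None if no variant matches."""
    for candidate in variants(tag):
        if candidate in VOCABULARY:
            return VOCABULARY[candidate]
    return None
```

With this table, `map_tag("InformationRetrieval")`, `map_tag("information_retrieval")` and `map_tag("Tag-Clouds")` all resolve to a preferred term, whereas a tag such as `web2.0` has no equivalent and remains unmapped.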

Tag harvesting from comparable resources identified using similarity metrics (Byde et al., 2007), and text mining (Niwa et al., 2006), are other techniques reflecting AMG approaches.

One of the few evaluations of the combined approach is the work carried out by Tudhope and his associates, who tested two tools, one designed for use by user-taggers with the Intute subject portal resource (www.intute.ac.uk) (Golub et al., 2009), and the other by author-taggers with the Science and Technology Facilities Council repository (Matthews et al., 2009). The Intute demonstrator has three interfaces to support searching, basic tagging, and enhanced tagging. The enhanced interface prompts the tagger with suggestions derived from external knowledge organization systems: Dewey Decimal Classification captions, index terms, and mappings to the Library of Congress Subject Headings; these are presented in a structured manner with broader and narrower options indicated. The participants were politics students, most of them familiar with tagging applications, but not regular taggers, and they were given a specific set of tasks to search for and tag resources on given topics.

Feedback showed that most enjoyed the task and found the demonstrator easy to use. They valued the tag suggestions, but did not make much use of hierarchical browsing. They considered that the tag cloud was a less useful feature.

The STFC demonstrator has a more complex interface with the underlying thesaurus more fully accessible, and was tested with a small number of participants. The controlled vocabulary was the ACM Computing Classification generated as a thesaurus through the use of the Simple Knowledge Organization System (SKOS) format. It was supplemented by a tag cloud based on term frequency. None of the participants were regular taggers, but they were research active scientists within the discipline. While there was quite a variety of response to the exercise, most of the participants opted to use the controlled vocabulary as a source of index terms, found it straightforward to operate, and exploited the browsing facilities of the structure to some extent. Most did not make any use of the tag cloud, finding its lack of structure less than helpful.
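The structured prompting used in both demonstrators, where a term is offered together with its broader and narrower neighbours, can be sketched with a very small data structure. The terms and the function names below are hypothetical and are not drawn from Dewey, LCSH, or the ACM scheme; the single broader-term table merely mimics the kind of relation that a SKOS thesaurus records.

```python
# Hypothetical hierarchy: each term points to its single broader term.
BROADER = {
    "political parties": "politics",
    "elections": "politics",
    "electoral systems": "elections",
}

def narrower(term):
    """All terms whose broader term is `term`, in alphabetical order."""
    return sorted(t for t, b in BROADER.items() if b == term)

def suggest(tag):
    """Offer the broader and narrower terms surrounding a chosen tag."""
    return {"broader": BROADER.get(tag), "narrower": narrower(tag)}

suggestion = suggest("elections")
```

It is this explicit record of relationships, absent from an unstructured tag cloud, that allows the tagger to move to a more general or more specific term.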

Editorial management of tags

Peters and Weller (2008) suggest a process of ‘tag gardening’ in which the tag vocabulary is subjected to editorial policy, and re-engineered to improve performance. This may take the form of guidelines for taggers, or manual editing by vocabulary managers, although they stress that this should take place within the context of the folksonomy and not in advance of it. Otherwise, most of their proposals are very similar to standard practices of vocabulary control in managing forms of tags, synonyms and homonyms. Guy and Tonkin (2006) propose a similar approach for the improvement of tag literacy, in an interesting discussion that reveals users’ attempts to represent semantic complexity through the use of compound and structured tags.

Realistically, although editorial control may be exercised in a managed environment, it seems unlikely that this could be scaled up to the web as a whole.

Constructing tag hierarchies

The results of Matthews et al. (2009) and Golub et al. (2009), discussed above, suggest that the unorganized tag cloud is not very useful in supporting either searching or tagging, although it should be borne in mind that they worked within an academic context. This finding is also confirmed for the biomedical field by the work of Kuo et al. (2007), who found that tag clouds are not the most effective way of revealing relationships between concepts. Related studies of tag cloud use (Halvey and Keane, 2007; Hearst and Rosner, 2008) show that they tend to be scanned by users, rather than closely examined, and Hearst and Rosner (2008) found objections to them (from users) on the grounds of questionable usability, and their apparently popular viewpoint.

Conventionally, the tag mass exists as a flat space, and there is no intention to represent structural relationships between tags; ‘there is no hierarchy, and no directly specified parent-child or sibling relationships between these terms’ (Mathes, 2004). Indeed, some writers deny the existence of such relationships, and see no value at all in hierarchical systems such as taxonomies and ontologies, suggesting that ‘there is no file system, only links’ (Shirky, 2005).

However, where tags exist only as metadata attached to individual items, there is no way that the end user can browse resources, and there is some evidence that many users prefer browsing as a method of resource discovery, particularly where the subject is unfamiliar, or searches are not productive. It has also been shown that unambiguous navigational links aid user understanding of resources (Mobrand and Spyridakis, 2007). In most tagging systems an individual tag may be searched for, but if the results are not useful there is no means of identifying broader, narrower, or related terms to modify the search. The folksonomy also lacks a conceptual ‘overview’ of a subject domain, since the visualization in tag clouds is normally of tag frequency or ranking, and not of tag inter-relations.

For this reason some recent researchers have attempted to derive hierarchical structures from tag clouds. Heymann, one of a number of researchers investigating the generation of hierarchies, challenges Shirky’s view that hierarchy is an irrelevance, and reinforces the idea that hierarchical structures are extremely useful for retrieval based on browsing (Heymann, 2008). Nevertheless, the published work reveals some interesting differences in the understanding of hierarchy, and of broader and narrower terms, when compared to conventional indexing.

Tsui et al. (2010) describe a method of automatic taxonomy construction based on linguistic processing methods using heuristic rules and syntactic analysis. This purports to be derived from tags, but closer reading reveals that there is some dependence on a textual corpus from which to infer conceptual relationships from sentence structure. Nevertheless, this study provides a very interesting parallel with the thesaurus practice of regarding a compound term as a ‘focus + modifier’, when it infers that a compound term of the type <adjective-noun> is likely to be in a parent-child relationship with the noun involved (e.g. ‘credit card’ and ‘card’).
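The ‘focus + modifier’ heuristic can be illustrated with a few lines of code. This is only a sketch under a strong simplifying assumption, namely that the last word of a multi-word tag is its head noun; it performs no part-of-speech tagging or syntactic analysis, and is not the procedure of Tsui et al.

```python
def infer_head_relations(tags):
    """Propose (parent, child) pairs where a compound tag ends in a word
    that is itself a tag, e.g. ('card', 'credit card')."""
    tagset = {t.lower().strip() for t in tags}
    relations = []
    for tag in sorted(tagset):
        words = tag.split()
        if len(words) < 2:
            continue  # single-word tags cannot be compounds
        head = words[-1]  # assumption: the final word is the focus
        if head in tagset:
            relations.append((head, tag))
    return relations

# Hypothetical tag set.
pairs = infer_head_relations(["credit card", "card", "debit card", "bank", "Card"])
```

For the sample tags this proposes ‘card’ as the parent of both ‘credit card’ and ‘debit card’, and leaves ‘bank’ unrelated to any other tag.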

Robu et al. (2007) attempt to derive a structure from a folksonomy; network analysis theory is employed to attempt some categorization of terms using a community detection algorithm, and the results compared to a similar analysis of search engine query logs. Both give reasonable results, the folksonomy performing rather better than the log data, but although they are able to generate groups of associated terms, the relationships are non-specific and the categorization lacking in any detailed structure. In a related paper, the same team conjecture that ‘this method might be useful in extracting a classification scheme (ontology) from a categorization scheme (folksonomy)’ (Halpin et al., 2007, p. 212). However, although they manage to achieve some clustering of strongly associated tags, there is no internal structure to the groups nor specified relationships between tags. They conclude that ‘the shared tag vocabularies... are not fully fledged semantic web ontologies’ and ‘the process of constructing proper formal ontologies from folksonomies... is not a straightforward task’ (Robu et al., 2007, p. 7).

A related project is that of Au Yeung et al. (2009), which categorizes tags automatically and without reference to an external source. Here the analysis is based on co-occurrence of tags, and serves to establish the general subject domain of tags, and hence to disambiguate tags; it does not, however, identify specific relationships between tags, only the existence of a general semantic relation.
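A minimal sketch of co-occurrence analysis shows both what it can and what it cannot deliver. The code is illustrative only and does not reproduce the method of Au Yeung et al.; the function names and the tagged resources are invented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(tagged_resources):
    """Count how often each unordered pair of tags shares a resource."""
    pair_counts = Counter()
    for tags in tagged_resources:
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def context_of(tag, pair_counts, top=3):
    """Tags most often assigned together with `tag`."""
    neighbours = Counter()
    for (a, b), n in pair_counts.items():
        if a == tag:
            neighbours[b] += n
        elif b == tag:
            neighbours[a] += n
    return [t for t, _ in neighbours.most_common(top)]

# Hypothetical resources, each represented by the set of tags assigned to it.
resources = [
    {"jaguar", "car", "speed"},
    {"jaguar", "car"},
    {"jaguar", "cat", "zoo"},
    {"cat", "zoo"},
]
counts = cooccurrence(resources)
context = context_of("jaguar", counts, top=1)
```

The companions of an ambiguous tag such as ‘jaguar’ indicate which subject domain is intended on a given resource, but the counts say only that two tags are associated, not whether one is broader than, narrower than, or merely related to the other.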

Heymann and Garcia-Molina (2006) observe that standard ways of presenting tags make it difficult to identify broader and narrower terms, and describe a method of generating tag hierarchies from very large unstructured sets of tags taken from Delicious and CiteULike. This is based on notions of similarity between tags, and requires that natural hierarchical relationships exist within the tag mass. However, their example of a tag hierarchy looks very unfamiliar by the standards of controlled vocabularies, with proposed hierarchical relationships of the kind ‘Google-map’, and ‘Software-free-downloads’.
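Why similarity-based construction produces chains of this kind can be seen in a simplified sketch. The code below is not the published algorithm: it uses tag frequency as a stand-in for any more refined measure of a tag's importance, an arbitrary similarity threshold, and invented data in which each tag is represented by the set of resources it has been assigned to.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sets of resource ids (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def greedy_hierarchy(tag_resources, threshold=0.3):
    """Attach each tag, most used first, under its most similar placed tag.

    Returns a dict child -> parent, with None marking a top-level tag.
    """
    order = sorted(tag_resources, key=lambda t: (-len(tag_resources[t]), t))
    parent = {}
    for tag in order:
        best, best_sim = None, threshold
        for placed in parent:
            sim = cosine(tag_resources[tag], tag_resources[placed])
            if sim > best_sim:
                best, best_sim = placed, sim
        parent[tag] = best
    return parent

# Hypothetical data: tag -> ids of the resources carrying that tag.
TAG_RESOURCES = {
    "google": {1, 2, 3, 4, 5, 6},
    "map": {1, 2, 3},
    "software": {7, 8, 9, 10},
    "free": {7, 8, 9},
    "downloads": {7, 8},
}
tree = greedy_hierarchy(TAG_RESOURCES)
```

On this data ‘map’ is placed under ‘google’, and ‘downloads’ under ‘free’ under ‘software’. The subordination reflects only the fact that the tags are assigned to the same resources, with the more frequent tag placed higher; nothing in the procedure tests whether one concept is a kind of the other.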

Work by Candan et al. (2008) exhibits a similar phenomenon. They also start from the premise that browsing is difficult in tagging systems, because it is ‘impossible to organize content for effective navigation’ (p. 75), and that there is a failure to display context. They propose a tag mining approach from which a contextual layout can be derived, although, again, the results do not indicate hierarchical relationships between individual tags. Rather, they resemble old-fashioned enumerative pre-coordinate classifications on the Library of Congress model. Tag hierarchies extracted from news articles about Hurricane Katrina display such unlikely subordinations as ‘Storm-US’, and ‘Employment-Month-August-President’, and, in the latter schema, ‘September’ is not subordinated to ‘Month’, but to ‘Job’.

In reality what these associations reveal are not the semantic relationships of the hierarchy, but the syntactic relationships between terms in combination. The US is clearly not a sub-class of ‘storm’, but the combination ‘storm [in the] US’ is a kind of subdivision of ‘storm’. What is being presented is a structured display of tagged objects rather than the tags themselves, and the parallel with expert metadata is the subject authority file, rather than the indexing language.

This is not to say that these structured displays are not useful for browsing purposes, but they clearly do not allow for navigation and modification of search in the way that controlled vocabularies do, with their exposure of varied and specific inter-concept relationships. While it is clear that a non-specific relationship can be established between ‘parrots’ and ‘birds’, and it may be inferred that ‘parrots of South America’ is a conceptually compound sub-class of ‘parrots’, there seems no obvious method of determining that ‘birds’ and ‘parrots’ are hierarchically related to each other. This also makes sense of Shirky’s assertion that there is no hierarchy, only links (Shirky, 2005), since the point of entry to the tag vocabulary is through the linked concepts (as attached to resources) rather than the concepts themselves.

This perhaps demonstrates that there can be no direct comparison between tagging systems and controlled vocabularies, since there is no formal representation of the tags per se, and they can only be identified and analysed through their application to individual resources. A comparable exercise would be to try and reconstruct the Dewey Decimal Classification, or the Library of Congress Subject Headings, from the classmarks on the spines of books, or the subject entries in a catalogue.

It is also a pity that most comparisons with library science standards tend to use systems of nineteenth century origin, such as Dewey or LCSH, rather than the more modern and better structured analytico-synthetic and faceted classifications. The latter show more evidence of a ‘bottom-up’ approach in their construction and design, and have structures which are inherently more hospitable to machine interpretation. Heymann is aware of the potential of faceted systems: “I think facets might be a really good fit, but then the question becomes: how do we determine groups of tags to call a facet?” (2008), and it will be interesting to see if, and how, this line of enquiry might develop. A prototype faceted organizer for tags is available in FaceTag (2006), discussed more fully by Quintarelli (2007); while the structure is doubtless hospitable to the generally simple nature of tags, it is again an intellectually built framework.

Some recent attempts to identify structure in tagging systems achieve a measure of success through a combination of automatic text analysis, and reference to an external semantic tool (Cantador et al., 2010). Although this semantic structure cannot be regarded as either explicit or implicit in the tag vocabulary, it does enable browsing and navigation without the need for intellectual processing and organization of the source tags. The external tool used is YAGO, a large semantic network developed at the Max-Planck Institute for Informatics (YAGO, 2009), which has itself been constructed automatically from other sources, such as WordNet and Wikipedia. By mapping tags to this source, Cantador’s tool exploits the taxonomic and hierarchical relationships within YAGO, and allows automatic categorization of tags to sub-categories such as artefact, animal, plant, organization, etc. When tested against intellectual mapping, automatic assignment of tags to categories was found to achieve a level of accuracy of around 80% for various categories related to subject content. The categories used here bear a striking resemblance to those employed in facet analysis. Faceted structures are particularly logical and transparent in the way in which they represent both semantic and syntactic relationships (based on intra- and inter-category relations), and allow for the generalization of relationships between terms in different categories. It looks as if there might be some potential in attempting to combine the two approaches.
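The two steps described here, assigning a tag to a category by following the hierarchical relationships of an external resource, and then checking the automatic assignment against an intellectual one, can be sketched as follows. The miniature hierarchy, the category list, and the manual judgements are all invented for illustration; they are not YAGO data and the figures produced bear no relation to the accuracy reported in the study.

```python
# Hypothetical fragment of an external hierarchy: term -> broader term.
HYPERNYM = {
    "parrot": "bird", "bird": "animal",
    "oak": "tree", "tree": "plant",
    "camera": "device", "device": "artefact",
}
CATEGORIES = {"animal", "plant", "artefact", "organization"}

def categorise(tag):
    """Walk up the hierarchy until a top-level category is reached."""
    node, seen = tag.lower(), set()
    while node is not None and node not in seen:
        if node in CATEGORIES:
            return node
        seen.add(node)  # guards against cycles in the source data
        node = HYPERNYM.get(node)
    return None  # the tag could not be mapped to any category

def accuracy(tags, manual):
    """Share of manually judged tags whose automatic category agrees."""
    judged = [t for t in tags if t in manual]
    hits = sum(1 for t in judged if categorise(t) == manual[t])
    return hits / len(judged) if judged else 0.0

# Hypothetical intellectual (manual) assignments used as the benchmark.
MANUAL = {"parrot": "animal", "oak": "plant", "camera": "artefact", "web2.0": "artefact"}
score = accuracy(["parrot", "oak", "camera", "web2.0"], MANUAL)
```

In this toy case three of the four tags are categorised in agreement with the manual judgement; the fourth has no entry in the external source and so cannot be categorised at all.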

Conclusions

It is clear from the foregoing that tagging is now generally regarded as a legitimate activity, and one which has something to offer in terms of indexing and retrieving online resources, particularly in the unorganized web, where expert cataloguing and indexing can never be achieved on a large scale.

Advantages of tagging are that it is accessible, easy, and cost-free, and that the language it uses may be more current, and more representative of user perceptions, than controlled vocabularies. As a result it is increasingly being used to complement conventional cataloguing and indexing by allowing end users to add tags of their own to the professional metadata; this phenomenon, which originated in specialist research environments, is now seen as a relevant way to enhance controlled vocabularies even in a managed library context.

It is generally observed that tagging lacks the precision and quality of controlled vocabularies, and, although there is a body of opinion that says tagging should be kept deliberately uncontrolled, many commentators, even those outside the LIS domain, find the principles of vocabulary control and structural relations of interest, and consider that some measure of control can only improve the performance of tagging.

An analysis of the literature demonstrates a state of affairs somewhat similar to that of early post-coordinate indexing, and the emergence of structured vocabularies at that time. The main foci of recent research activity into tagging replicate the two major theoretical foundations of the controlled vocabulary: on the one hand, the limiting and morphological control of tags themselves (management of equivalence relationships); and on the other, the relationships between tags representing different concepts (management of semantic relationships).

Vocabulary control of tags

As tagging began to be established as a common practice, many commentators reflected on the great variation, particularly in the form of tags, and proposed that some control mechanisms would improve tag performance.

Suggestions for how this might be addressed fall into three main groups:

– through editorial management of tags;

– through the use of informetric techniques to identify and reinforce significant tags;

– through secondary processing of tags using controlled vocabularies.

Formal editorial management of tags seems unlikely to succeed, other than in a limited environment, although it is feasible to include guidelines advising how tags should be formed. Although taggers tend to re-use tags, there is no guarantee that conventions of spelling, form, punctuation, pluralization, and so on, will be followed. Automatically built tag suggestion tools are a more realistic way of working towards a controlled vocabulary, and it is clear that taggers will make use of such tools where they are available, and see the advantages of using a ‘common’ set of tags. Emergent features of the managed tag mass tend to parallel existing conventions in traditional indexing languages, and some interesting features of the tagging process can be observed:

– analysis of tags and tag patterns has shown that a collective consensus emerges, and that ‘preferred’ tags can be determined;

– frequency of use is an important criterion for identifying a core vocabulary, but informetric techniques can also be used to establish tag distributions, and detect ‘power tags’;

– there is a tendency for taggers to follow the general trend and to confirm existing tags, thus reinforcing the collaborative vocabulary;

– the idea of a very large number of tags in regular use, the so-called ‘long tail’, seems to be contradicted, and it is suggested that many tags are used by only a limited number of individuals.

A major problem in managing tags is the variation in form of tags representing the same concept (plurals, variant spelling, management of punctuation and spaces, and so on). Mapping tags to an external authority is one way to minimise variation, again reflecting conventional indexing practice in respect of extracted metadata or automatically generated metadata. The use of controlled vocabularies to modify or supplement tags is a common feature of some tagging tools, but implies a degree of intellectual input in either the building of the vocabulary, the mapping, or both. Secondary processing of tags using natural language processing tools may make this easier, by identifying and aggregating linguistic variants of tags, and if a large open source lexical tool is used as the authority, intellectual effort can be reduced to a minimum.

Synonym control poses more problems than management of variant word forms, but can be addressed to a limited extent by clustering techniques based on co-occurrence of terms that help to establish context and disambiguate homonyms.

At present there is little in the literature to suggest that the problems of tags for compound topics and their representation are being considered.

Structural relationships in the vocabulary

While it is clear that some advances have been made in implementing controls on the tag vocabulary, the ‘controlled vocabulary’ so achieved does not exhibit many of the features of traditional controlled vocabularies that support browsing and navigation of the system and refinement of indexing or search. If some measure of success can be demonstrated in the choice and form of tags, formal conceptual relationships between tags appear to be extremely hard to determine using automatic methods.

This is not to say that no relationships can be established using informetric analysis techniques:

– a general semantic relationship of a non-specific kind can often be detected between two tags on the basis of frequent co-occurrence;

– this may be sufficient to support allocation of tags to a subject domain and achieve a level of categorization based on subject;

– such categorization can assist in the disambiguation of tags;

– there are algorithms that can discern syntactic relationships between groups of tags attached to resources, and support a semi-structured overview of tagged material.

However, the formal hierarchical relationships to be found in a taxonomy, ontology or classification system cannot be made explicit without some degree of intellectual analysis and input to the tagging tool, and there are no current methods of extracting them automatically. This means that hierarchical browsing can only be supported where there is access to a built vocabulary from which formal relationships can be derived.

Hence there is a good case for the use of external controlled vocabularies to add value to tagging vocabularies by exposing the conceptual relationships between tags. A number of prototypes exist which provide a conceptual structure to which tags can be mapped, or which prompt the tagger with additional tags presented in a structured format. If there is good preliminary linguistic processing of tags, this mapping can be carried out automatically, achieving a degree of automatic categorization of tags based on functional roles (entity, person, artefact, organization), as opposed to subject domain.

This suggests that faceted structures, which make use of such functional categories, have potential in this area, as some researchers have proposed.

Despite some tentative work on the use of facets in creating structural frameworks for tags, there appear to have been no recent developments in this area. It is to be hoped that this situation may change in the near future.

References

Aitchison J., Dextre Clarke S., “The thesaurus: a historical viewpoint with a look to the future”, in Thomas A. R., Roe S. E. (Editors), The thesaurus: review, renaissance, and revision, Binghamton, NY, Haworth Press, 2004.

Al-Khalifa H., Davis H., “Exploring the value of folksonomies for creating semantic metadata”, International Journal of Semantic Web and Information Systems, vol. 3, n° 1, 2007, p. 12-38.

Al-Khalifa H.S., Davis H.C., Gilbert L., “Creating structure from disorder: Using folksonomies to create semantic metadata”, Proceedings of the 3rd International Conference on Web Information Systems and Technologies, Barcelona, Spain, 2007.

Au Yeung A., Gibbins N., Shadbolt N., “Contextualising Tags in Collaborative Tagging Systems”, Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, 2009, p. 251-260, http://www.albertauyeung.com/papers/ht09-auyeung.pdf

Automatic metadata generation application (AMeGa) Project final report, http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf

Autonomy Automatic classification and taxonomy generation, http://www.autonomy.com/content/Functionality/idol-functionality-categorization/index.en.html

Aula A., Kaki M., “Findex: improving search result use through automatic filtering categories”, Interacting with Computers, vol. 17, n° 2, 2005, p. 187-206.

Automatic RDF metadata generation for resource discovery, School of Computing and IT, Wolverhampton, n.d., http://www8.org/w8-papers/2c-search-discover/automatic/automatic.html

Berman S., Prejudice and antipathies: a tract on the LC Subject Headings concerning people, Metuchen N.J., Scarecrow Press, 1971.

Birdsall W. F., “Libraries and the communicative citizen in the twenty-first century”, Libri, vol. 55, n° 2-3, 2005, p. 75-83, http://www.librijournal.org/pdf/2005-2-3pp74-83.pdf

Birdsall W. F., “Web 2.0 as a social movement”, Webology, vol. 4, n° 2, 2007, Article 40, http://www.webology.ir/2007/v4n2/a40.html

Blocks D., A qualitative study of thesaurus integration for end-user searching (PhD thesis), 2004, http://www.comp.glam.ac.uk/~facet/dblocks/DBlocks_ThesisOnline_Main.html

Blumberg R., Atre S., “Automatic classification: moving to the mainstream”, Information Management, April 2003, http://www.information-management.com/issues/20030401/6501-1.html?pg=1

British Standards Institution, BS 8723 Structured vocabularies for information retrieval, London, BSI, 2005.

Byde A., Wan H., Cayzer S., “Personalized tag recommendations via tagging and content-based similarity metrics”, Proceedings of the International Conference on Weblogs and Social Media, Boulder, CO, March 2007, http://www.icwsm.org/papers/4--Byde-Wan-Cayzer.pdf

Calefato F., Gendarmi D., Lanubile F., “Towards social semantic suggestive tagging”, Proceedings of SWAP 2007, the 4th Italian Semantic Web Workshop, Bari, Italy, December 18-20, 2007, CEUR Workshop Proceedings, http://cdg.di.uniba.it/cdg/gendarmi/papers/SWAP07_cameraready.pdf

Candan K. S., Di Caro L., Sapino M. L., “Creating tag hierarchies for effective navigation in social media”, in Proceedings of the 2008 ACM workshop on Search in social media, Napa Valley, California, USA. p. 75-82.

Cantador I., Konstas I., Jose J. M., “Categorising social tags to improve folksonomy-based recommendations”, Web Semantics: Science, Services and Agents on the World Wide Web, 2010.

Cedars Final Workshop, Manchester Conference Centre, UMIST, Manchester, 25-26 February 2002. Workshop summary by Michael Day (UKOLN) and Maggie Jones (Cedars Project Manager), http://www.leeds.ac.uk/cedars/pubconf/umist/final Workshop Rep.html

Chan L. M., “Exploiting LCSH, LCC and DDC to retrieve networked resources: issues and challenges”, Conference on Bibliographic Control in the New Millennium, Library of Congress, November 2000, http://lcweb.loc.gov/catdir/bibcontrol/chan_paper.html

Cheung C.F., Lee W.B., Wang Y., “A multi-facet taxonomy system with applications in unstructured knowledge management”, Journal of Knowledge Management, vol. 9, n° 6, 2005, p. 76-91.

Cleverdon Cyril W., Aslib Cranfield research project: report on the testing and analysis of an investigation into the comparative efficiency of indexing systems, Cranfield, [College of Aeronautics], 1962, http://hdl.handle.net/1826/8366

Cleverdon Cyril W., Factors determining the performance of indexing systems, Cranfield, Cranfield Institute of Technology, 1966.

Drabenstott K. M., Simcox S., Fenton E. G., “End-user understanding of subject headings in library catalogs”, Library Resources & Technical Services, vol. 43, n° 3, 1999, p. 140-160.

FaceTag http://www.facetag.org/

Golder S., Huberman B., “The structure of collaborative tagging systems”, CoRR (Computing Research Repository) abs/cs/0508082, 2006, http://arxiv.org/ftp/cs/papers/0508/0508082.pdf

Golub K., Jones C., Matthews B., Puzon B., Nielsen M. L., Moon J., Tudhope D., “EnTag: enhancing social tagging for discovery”, Joint Conference on Digital Libraries, JCDL 2009, Austin, Texas, June 15-19, 2009, http://www.ukoln.ac.uk/projects/enhanced-tagging/dissemination/entag-jcdl09.pdf

Greenberg J., Spurgin K., Crystal A., Final report for the AMeGA (Automatic metadata generation applications) Project, Washington, DC, Library of Congress, 2005, http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf

Guy M., Tonkin E., “Folksonomies: tidying up tags”, D-Lib Magazine, vol. 12, n° 1, 2006, http://www.dlib.org/dlib/january06/guy/01guy.html

Halpin H., Robu V., Shepherd H., “The complex dynamics of collaborative tagging”, in Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada [doi>10.1145/1242572.1242602]

Halvey M.J., Keane M.T., “An assessment of tag presentation techniques”, in Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada [doi>10.1145/1242572.1242826]

Hammond T., Hannay T., Lund B., Scott J., “Social bookmarking tools (I): a general review”, D-Lib Magazine, vol. 11, n° 4, 2005, http://www.dlib.org/dlib/april05/hammond/04hammond.html

Hearst M.A., Rosner D., “Tag clouds: data analysis tool or social signaller?”, in Proceedings of the 41st Annual Hawaii International Conference on System Sciences, January 07-10, 2008. [doi>10.1109/HICSS.2008.422]

Heymann P., Tag hierarchies, 2008, http://heymann.stanford.edu/taghierarchy.html

Heymann P., Garcia-Molina H., Collaborative creation of communal hierarchical taxonomies in social tagging systems, Technical Report 2006-10, Stanford InfoLab, April 2006.

Ivia project, 2009, http://ivia.ucr.edu/projects/Metadata/LCSH.shtml

Jaschke R., Marinho L., Hotho A., Schmidt-Thieme L., Stumme G., “Tag recommendations in social bookmarking systems”, AI Communications, vol. 21, 2008, p. 231-247.

Knowlton S., “Three decades since Prejudices and antipathies: a study of changes in the Library of Congress Subject Headings”, Cataloging & Classification Quarterly, vol. 40, n° 2, 2005, p. 123-129, http://www.haworthpress.com/web/CCQ

Ko Y., Park J., Seo J., “Improving text categorization using the importance of sentences”, Information Processing and Management, vol. 40, n° 1, 2004, p. 65-79.

Kuo B.Y.-L., Hentrich T., Good B.M., Wilkinson M.D., “Tag clouds for summarizing web search results”, in Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada [doi>10.1145/1242572.1242766]

Laniado D., Eynard D., Colombetti M., “Using WordNet to turn a folksonomy into a hierarchy of concepts”, in Proceedings of the 4th Italian Semantic Web Workshop, Bari, Italy, 2007, p. 192-201.

Larson R., “Experiments in automatic Library of Congress Classification”, Journal of the American Society for Information Science, vol. 43, n° 2, 1992, p. 130-148.

Liang C.-Y., Guo L., Xia Z., Nie F.-G., Xiao X., Su L., Yang Z.-Y., “Dictionary-based text categorization of chemical web pages”, Information Processing & Management, vol. 42, n° 4, 2006, p. 1017-1029.

Lopes M. I., Beall J., (eds.), Principles underlying subject heading languages, München, Saur, 1999.

Manning C. D., Raghavan P., Schütze H., Introduction to information retrieval, Cambridge, Cambridge University Press, 2008, http://informationretrieval.org/

Macgregor G., McCulloch E., “Collaborative tagging as a knowledge organisation and resource discovery tool”, Library Review, vol. 55, n° 5, 2006, p. 291-300, http://strathprints.strath.ac.uk/2335/

Mathes A., Folksonomies - cooperative classification and communication through shared metadata, 2004, http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html

Matthews B., Jones C., Puzon B., Moon J., Tudhope D., Golub K., Nielsen M. L., “An evaluation of enhancing social tagging with a knowledge organization system”, Paper presented at the ISKO UK Conference Content architecture: exploiting and managing diverse resources, June 23-24, 2009, University College London, http://www.iskouk.org/conf2009/papers/matthews_ISKOUK2009.pdf

Mejias U., Tag literacy, 2005, http://blog.ulisesmejias.com/2005/04/26/tag-literacy/

Merholz P., Metadata for the masses, 2004, http://www.adaptivepath.com/ideas/essays/archives/000361.php

Mobrand K.A., Spyridakis J.H., “Explicitness of local navigational links: comprehension, perceptions of use, and browsing behavior”, Journal of Information Science, vol. 33, n° 1, 2007, p. 41-61, http://jis.sagepub.com/cgi/content/abstract/33/1/41

Neveol A., Rogozan A., Darmoni S., “Automatic indexing of online health resources for a French quality controlled gateway”, Information Processing & Management, vol. 42, n° 3, 2006, p. 695-709.

Newsindexer, http://www.newsindexer.com/index.html

Niwa S., Doi T., Honiden S., “Web page recommender system based on folksonomy mining”, Information Processing Society of Japan (IPSJ) Journal, vol. 47, n° 5, 2006, p. 1382-1392.

Noruzi A., “Editorial. Folksonomies: why do we need controlled vocabulary?”, Webology, vol. 4, n° 2, 2007, http://webology.ir/2007/v4n2/editorial12.html

Peters I., Stock W., “« Power tags » in information retrieval”, Library Hi Tech, vol. 28, n° 1, 2010, p. 81-93.

Peters I., Weller K., “Tag gardening for folksonomy enrichment and maintenance”, Webology, vol. 5, n° 3, 2008, Article 58, http://www.webology.ir/2008/v5n3/a58.html

Polfreman M., Broughton V., Wilson A., Metadata generation for resource discovery, JISC, 2008, http://www.jisc.ac.uk/media/documents/programmes/resourcediscovery/metgenreport_final_v5.doc

Quintarelli E., Rosati L., Resmini A., “FaceTag: integrating bottom-up and top-down classification in a social tagging system”, Proceedings of the 8th Annual IA Summit, 2007, http://www.facetag.org/download/facetag.pdf

Robu V., Halpin H., Shepherd H., “Emergence of consensus and shared vocabularies in collaborative tagging systems”, ACM Transactions on the Web (TWEB), vol. 3, n° 4, 2009, Article 14, http://portal.acm.org/citation.cfm?id=1594173.1594176

Rosenfeld L., “Folksonomies? How about metadata ecologies?”, 2005, http://louisrosenfeld.com/home/bloug_archive/000330.html

Shiri A., Revie C., “Query expansion behavior within a thesaurus-enhanced search environment: a user-centred evaluation”, Journal of the American Society for Information Science and Technology, vol. 57, n° 4, 2006, p. 462-478.

Shirky C., Ontology is overrated: categories, links, and tags, 2005, http://www.shirky.com/writings/ontology_overrated.html

Sinha R., Cognitive analysis of tagging, 2005, http://rashmisinha.com/2005/09/27/a-cognitive-analysis-of-tagging/

Thompson R., Shafer K., Vizine-Goetz D., “Evaluating Dewey concepts as a knowledge base for automatic subject assignment”, First ACM Digital Libraries Workshop, January 1997, http://orc.rsch.oclc.org:6109/eval_dc.html

Tsui E., Wang W., Cheung C., Lau A., “A concept-relationship acquisition and inference approach for hierarchical taxonomy construction from tags”, Information Processing and Management, vol. 46, 2010, p. 44-57.

Tudhope D., Binding C., Blocks D., Cunliffe D., “Query expansion via conceptual distance in thesaurus indexed collections”, Journal of Documentation, vol. 62, n° 4, 2006, p. 509-533.

Vander Wal T., Folksonomy coinage and definition, 2007, http://vanderwal.net/folksonomy.html

YAGO, 2009, http://www.mpi-inf.mpg.de/yago-naga/yago/
