Notes on the Relationship between Search and Tagging

13 09 2009

I had a number of exciting and very inspiring conversations this week with Marc, Rakesh, Fabian, Cathy, and Pranam, as well as with Ed and Rowan. It was great talking to everybody and I wanted to share some of the issues that were discussed. Most conversations focused on the role of tagging, and how it relates to searching the web. I do not claim that any of these interesting thoughts are mine or that my notes offer answers.  They merely aim to serve as pointers for what I consider important issues.

A minority of resources on the web is tagged:

A number of current research projects study the question of how tagged resources can inform or improve search. However, only a minority of resources on the web is tagged, and the gap between tagged and non-tagged resources is likely increasing (although this seems difficult to predict; cf. Paul Heymann’s work). This would mean that a decreasing share of resources on the web has tag information associated with it. The question then becomes: Why bother analyzing tagging systems in the first place when their (relative) importance is likely to decrease over time?

Tagged resources exhibit topical bias (that’s a bad thing!):

Tagging is often a geek activity. I am not aware of any studies of delicious’ user population, but it is likely that delicious’ users are more geeky than the rest of the population. This is a bad thing because it would bias any broad attempt at leveraging tagging for search. The bias might depend on the particular tagging system though: Flickr seems to have a much broader, and thereby more representative, user base.

Bookmarks exhibit temporal bias (that’s a good thing!):

Bookmarking typically represents an event in time triggered by some user. Most tagging systems therefore provide timestamp information, which allows us to infer more about the context in which a given resource is being tagged. This in turn lets us use tagging systems for studying how information on the web is organized, filtered, diffused and consumed.
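
To make the timestamp idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `BOOKMARKS` records, their `(user, url, tags, timestamp)` layout, and the helper name `daily_bookmark_counts` are hypothetical and do not reflect the data format of any particular tagging system. It simply groups bookmarking events for one URL by day, which is one crude way to look at how a resource diffuses over time.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical bookmark events: (user, url, tags, Unix timestamp in seconds).
BOOKMARKS = [
    ("alice", "http://example.org/a", ["search", "tagging"], 1252800000),
    ("bob", "http://example.org/a", ["ir"], 1252886400),
    ("carol", "http://example.org/a", ["search"], 1252890000),
    ("bob", "http://example.org/b", ["photos"], 1252972800),
]


def daily_bookmark_counts(bookmarks, url):
    """Count how many times `url` was bookmarked on each UTC day."""
    counts = Counter()
    for _user, bookmarked_url, _tags, ts in bookmarks:
        if bookmarked_url == url:
            day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
            counts[day] += 1
    # Return the days in chronological order.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    print(daily_bookmark_counts(BOOKMARKS, "http://example.org/a"))
```

A real analysis would of course need far more data and would probably look at who bookmarked what after whom, not just at daily counts.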

Search supersedes any other form of information access/organisation:

I found this issue to be the most fundamental and controversial one. How do increasingly sophisticated search engines change the way we interact with information? What role do directories (such as Yahoo!) and personal resource collections (such as “Favorite folders”) play in a world where search engines can (re)find much of the information we require with increasing precision? To give an example: Would an electronic record of all resources that a user has ever visited – and a corresponding search interface to them – replace the need for information organization à la delicious or browser favorites (all privacy concerns set aside for a moment)? How would such a development relate to the desire of users to share information with friends?

Search intent is poorly understood:

While there has been some work on search queries and query log analysis, the intent behind queries remains largely elusive. Existing distinctions (such as the one by Broder) need further elaboration and refinement. An example would be what Rakesh called pseudo-navigational queries – where the user has a certain expectation about the information, but this information can be found on several sites (e.g. Wikipedia, an encyclopedia or other sites).

Conflict in tagging systems:

Tagging systems are largely tolerant of conflicts, for example with regard to tagging semantics. This is different from systems such as Wikipedia, where conflict is regarded as an important aspect of the collaboration process. Twitter seems to lie in between those extremes: conflict can emerge easily (e.g. around hashtags), with some rudimentary support for resolution.

I truly enjoyed these conversations, and hope that they will continue at some point in the future.




6 responses

15 11 2009
Michael Bernstein

Even if a minority of resources are tagged, it suggests an opportunity for semisupervised learning or similar approaches to help. The more dangerous part of this is the topical bias, since that’s more difficult to bootstrap.

15 11 2009
Markus Strohmaier

I agree – tagged resources still represent a very large manually-labeled set of documents that can be useful for improving IR methods.

However, I’m particularly curious about the apparent decline of unique visitors to social bookmarking sites such as delicious this fall:

I wonder whether this is a seasonal slump or whether users are adopting other strategies to re-find and share information (search? facebook? twitter?). If activity streams are the main beneficiaries, then the user-generated metadata for resources would be very different (tweet-context vs. tags).

16 11 2009
Marshall Clark

Great analysis, Markus. I’ve always felt that tagging, and really most kinds of explicit annotation, are subject to bias. Even linking has its well-documented issues (link spam, etc).

I’m curious what role, if any, you think implicit signals such as user behavioral tracking may play in search in the future?

17 11 2009
Markus Strohmaier

Marshall, thanks for your comment. To some extent, you could view search as a process where users implicitly tag resources with queries (via click-throughs for example). Anchor text would be another example where implicit user behavior contributes valuable metadata. It seems that focusing on tracking implicit behavior addresses issues of scale much better than explicit tagging.
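
As a rough illustration of what "implicitly tagging resources with queries" could look like, here is a small Python sketch. The click-through log, its `(query, clicked_url)` layout, and the function name `implicit_tags` are all assumptions made up for this example; real query logs are far messier, and a real system would need at least stop-word handling and spam filtering.

```python
from collections import Counter, defaultdict

# Hypothetical click-through log: (query, clicked_url).
CLICK_LOG = [
    ("tagging search", "http://example.org/notes"),
    ("Tagging systems", "http://example.org/notes"),
    ("flickr photos", "http://example.org/flickr"),
]


def implicit_tags(click_log, min_count=1):
    """Treat the terms of each clicked query as implicit tags for the URL.

    Returns a dict mapping each URL to its query terms, most frequent
    first, keeping only terms seen at least `min_count` times.
    """
    term_counts = defaultdict(Counter)
    for query, url in click_log:
        for term in query.lower().split():
            term_counts[url][term] += 1
    return {
        url: [term for term, count in counts.most_common() if count >= min_count]
        for url, counts in term_counts.items()
    }


if __name__ == "__main__":
    print(implicit_tags(CLICK_LOG, min_count=2))
```

The `min_count` threshold is one simple (and again hypothetical) way to keep only terms that several queries agree on.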

19 01 2010
Michael Muller

Tagging vocabularies may be constrained by context and intent. We compared tagging vocabularies across four enterprise social software systems, used by overlapping groups of users, in our paper at the GROUP 2007 conference. There was surprisingly little overlap of tagging vocabulary from one service to another. This pattern occurred even when we focused our analysis on each individual’s tags across the multiple services.

Tags are very useful *within* systems, but they don’t provide a very reliable “glue” to connect resources *across* systems.

19 01 2010
Markus Strohmaier


Thanks for your comment and the pointer to the article. The problem of cross-system tagging is well worth discussing. From what I read in your paper, it seems that the lack of overlap occurs because different resources are being tagged in those different systems (URLs, blogs, people, etc). I wonder whether two systems that focus on tagging the same kind of resources (say URLs), populated by similar user groups, still show these effects.

What I found interesting in your work is the suggestion that the emergent semantics that we can study in these systems are not only influenced by users and context, but also by the kind of resources that tags are applied to. Intuitively, this makes sense: One would categorize photos in a different way than one would categorize music, or files. I can see how mapping between those different folksonomical structures is an interesting problem.

