List of Social Tagging Datasets

22 07 2009

I’m currently compiling a list of social tagging datasets for our current / future research, but since it might be of interest to others as well I’m sharing it. Here’s the link:

A non-exhaustive list of Social Tagging Datasets that are available for research

If you are aware of other social tagging datasets available for research, please let me know by leaving a comment on this post.

Motivations for Tagging: Categorization vs. Description

21 07 2009

UPDATE March 17 2010: More results can be found in the following publication: M. Strohmaier, C. Koerner, R. Kern, Why do Users Tag? Detecting Users’ Motivation for Tagging in Social Tagging Systems, 4th International AAAI Conference on Weblogs and Social Media (ICWSM2010), Washington, DC, USA, May 23-26, 2010. (Download pdf)

In a past post, I talked about the role of tagging motivation in social tagging systems, and a distinction between users who use tags for Categorization and users who use tags for Description purposes.

One question that is interesting in this context is: “What do the tag clouds of Categorizers and Describers actually look like, and what can we learn from them?”

Categorizers vs. Describers: Our previous work suggests what the tag clouds of Categorizers and Describers should look like in theory: Categorizers would tend to use general terms for tagging, terms that are useful labels for categories based on their model of the world. Describers, on the other hand, would use terms that are specific to a resource, or concepts that can be found directly within a resource, based on characteristics of the resource. That’s the theory.

Christian Körner, one of my PhD students, looked into this question empirically based on his current work, where he applies previously discussed measures to detect tagging motivation (Conditional Tag Entropy and Orphaned Tags) to several tagging datasets. While we expected that in reality most tagging behaviour is the result of a combination of categorization and description motivation, Christian was particularly interested in “extreme” cases, i.e. cases of “extreme” Categorizers and “extreme” Describers. Here are selected results:

Example of an Extreme Categorizer: Among 445 delicious users, the following screenshot shows the tag cloud of the single user that scored highest on our “Categorization” measure (the most extreme Categorizer in our dataset).


An example tag cloud of an "Extreme Categorizer" (based on ~1900 bookmarks)

The results are quite intriguing: The above user clearly uses very general terms to annotate his resources, and introduces an elaborate taxonomy to categorize them. While some parts of his vocabulary are more elaborate and fine-grained (e.g. “fashion” and the corresponding sub-categories “fashion_blog” and “fashion_brand”), others are less elaborate (e.g. “games”, “health”, etc.). The user also produced a controlled vocabulary and stuck to it over the course of 1900 bookmarks, which I think can be seen as another indication of this user’s inclination to use tags for categorization purposes. The fact that a combination of our measures for tagging motivation (Conditional Tag Entropy and Orphaned Tags) has produced this interesting example of an extreme Categorizer provides some evidence for the plausibility of these measures. I think that’s great news.

Example of an Extreme Describer: The next screenshot shows an excerpt of a tag cloud of the user that scored highest on the “Description” measure (the most extreme Describer in our dataset).


An example tag cloud of an "Extreme Describer" (excerpt, based on ~1700 bookmarks)

Note that this tag cloud is only an excerpt; the original tag cloud of this user is roughly twice this size. The user clearly introduces a large set of tags, and uses many different variations of the same or similar concepts, without much regard for terminological or conceptual differences (e.g. exce, excel, Excel_Functions, Excel2007, Exceler, excelets, ExcelPoster, Excl, excxel). Again, the fact that our measures for tagging motivation produced this particular user as an extreme example of a Describer can be seen as an indicator for the general plausibility of our measures.

However, what is also apparent from this example is that even in the case of this extreme Describer, some categories seem to be present in his tag vocabulary (e.g. “ebooks”, “fun”, etc.). This suggests that a binary approach to understanding tagging motivation (a user is EITHER a Categorizer OR a Describer) is implausible.
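To make the idea of a graded (rather than binary) score a bit more concrete, here is a minimal Python sketch of a per-user "orphan ratio": the share of a user's distinct tags that are hardly ever reused. This is only an illustration. The function name `orphan_ratio`, the 1% threshold rule and the toy tag lists are assumptions made for this sketch, and it should not be read as the exact definition of the Orphaned Tags measure mentioned above.

```python
from collections import Counter
from math import ceil


def orphan_ratio(tag_assignments, fraction=0.01):
    """Share of a user's distinct tags that are rarely used.

    A tag counts as "orphaned" here if it was used at most
    ceil(fraction * usage count of the user's most frequent tag) times.
    This threshold rule is an assumption made for this sketch.
    """
    counts = Counter(tag_assignments)
    if not counts:
        return 0.0
    threshold = ceil(fraction * max(counts.values()))
    orphans = sum(1 for c in counts.values() if c <= threshold)
    return orphans / len(counts)


# Hypothetical toy users (not data from the study).
categorizer = ["fashion"] * 40 + ["games"] * 30 + ["health"] * 30
describer = ["excel"] * 5 + ["exce", "Excl", "excxel", "Exceler", "excelets"]

print(orphan_ratio(categorizer))  # small vocabulary, every tag reused
print(orphan_ratio(describer))    # many one-off tag variants
```

A score like this places each user somewhere between 0 and 1, which fits the observation that real users mix both motivations better than a hard Categorizer/Describer label would.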

Open Questions: Overall, the examples of two users driven by diametrically opposed motivations for tagging raise a number of interesting questions worth studying: What are the characteristics, utilities and properties of tags produced by Categorizers and Describers? How do these different types of tagging motivation influence the resulting folksonomies? And how do they influence quality attributes of algorithms (e.g. search, ranking) and applications (e.g. tag recommendation) that process folksonomy data? We are looking into some of these questions in our current research.

ACM Hypertext’09 Student Competition

4 07 2009

I just came back from a road trip to ACM Hypertext’09 with my students, and I’m particularly happy that one of them, Christian Körner, won 1st place in this year’s ACM Hypertext’09 Graduate Student Research Challenge. Bravo Christian!

The competition was strong, and Christian did a great job in presenting preliminary results from his PhD research on Tagging Motivation. Here are a few links to his competition material:

I would also like to congratulate all runners-up at the competition. All participating students worked really hard to present their research to conference attendees in an engaging way. I really liked the enthusiasm of the students, and the student competition as well as the conference as a whole gave me a bunch of new ideas and perspectives.

You might also be interested in my liveblogging notes on Ricardo Baeza-Yates’ and Lada Adamic’s very interesting keynotes at the conference.

ACM Hypertext'09 Student Research Competition

Liveblogging Wednesday @ Hypertext’09

1 07 2009

I’m sharing my live notes from the second Hypertext keynote, on Relating Content by Web Usage, by Ricardo Baeza-Yates at Hypertext ’09.

In case you have any additions, comments or links that would make my notes more complete / more useful, please leave a comment and fill in the blanks.

On the nature of search and intent:

Ricardo starts by stating that Search is not about document retrieval anymore. Given Ricardo’s history in document retrieval, this is an interesting thing to hear.

Search is rather about mediating user goals, in particular:

  1. identifying a user’s task
  2. providing means for task completion

For search to be successful, intent of searchers needs to be related to content available on the web. Ricardo argues that rather than focusing on content, search engines need to focus on objects, such as people, places, businesses, restaurants etc. Search intent then can be satisfied by exploiting and mapping characteristics of such objects and their corresponding attributes.

On the nature of content:

So how can we learn about objects and attributes? One approach is to look into metadata, where Ricardo distinguishes between explicit (Metadata, Y! Answers, Flickr, etc.) and implicit (anchor text, queries and clickthrough, etc.) metadata. Ricardo points out that some of this metadata is private, making usage more complicated.

A key question in this context is “What is the quality of different kinds of metadata?”. Ricardo mentions that although user-generated metadata is noisy, on an aggregate level, he believes that it outperforms metadata generated by experts.

Search in Social Media:

Ricardo introduces TagExplorer, a Yahoo research prototype for tag-based, faceted navigation/search of Flickr. Facets that are supported are locations, subjects, activities, time, names and others. I didn’t fully understand how these facets are identified or determined, but it seems the selection is based on / informed by previous empirical Yahoo research on different types of tags in Flickr.

Another prototype Ricardo demonstrates is the Correlator.

Web Usage:

Ricardo starts with the assumption that “when users use the web, they think”, and he suggests that we can/should tap into the outcome of these cognitive processes and exploit them for search. An example of that are query logs, where users actively make relevance judgements and engage in search query formulation / reformulation strategies.

Ricardo gives a number of examples where this might be useful, for example it might help in learning about relationships between queries, sessions and documents.
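As a rough illustration of what "learning about relationships between queries" from a query log could look like, here is a small Python sketch that relates two queries whenever they led to clicks on the same document. The log format (session id, query, clicked URL), the function name `related_queries` and the example entries are all my own assumptions for this sketch; this is not a method Ricardo presented.

```python
from collections import defaultdict
from itertools import combinations


def related_queries(click_log):
    """Relate queries that led to clicks on the same document.

    click_log: iterable of (session_id, query, clicked_url) tuples
    (a hypothetical log format assumed for this sketch).
    Returns a dict mapping an alphabetically sorted query pair to the
    number of documents that both queries have clicks on.
    """
    queries_by_doc = defaultdict(set)
    for _session, query, url in click_log:
        queries_by_doc[url].add(query)

    shared = defaultdict(int)
    for queries in queries_by_doc.values():
        for pair in combinations(sorted(queries), 2):
            shared[pair] += 1
    return dict(shared)


# Made-up example log.
log = [
    ("s1", "cheap flights", "example.com/flights"),
    ("s2", "low cost airlines", "example.com/flights"),
    ("s2", "low cost airlines", "example.com/airlines"),
    ("s3", "cheap flights", "example.com/airlines"),
    ("s3", "hotel vienna", "example.com/hotels"),
]

print(related_queries(log))
```

Here the two flight-related queries end up connected through two shared documents, while the hotel query stays unrelated. Session ids are carried along but unused; they would matter if one also wanted to relate queries issued within the same session.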

Open Issues:

Ricardo concludes his talk by discussing a number of issues he feels are important for future research. He discusses the interesting research question of studying explicit social networks (where links between users are made explicit) versus implicit social networks (where links between users are inferred). Related to this is the problem of implicit and explicit metadata. Ricardo refers to it as the virtuous cycle, where both implicit and explicit metadata can and should be used to inform search.

Another problem Ricardo mentions is the question of when it is necessary to acquire more data vs. when we need to tweak our algorithms. As researchers, I guess we tend to have a bias towards working on the algorithmic rather than the data aspect.

My impressions:

I think Ricardo’s talk gave a great overview of the many activities at Yahoo Research. Due to the number of projects being presented, it was difficult for me to capture everything that was presented, and I feel that my notes in this post capture only a small part of what Ricardo talked about in his keynote. So check out Ricardo’s website / Yahoo research website / the slides of this talk to get a more complete picture of their exciting projects.

Update: I just stumbled upon Alvin Chin’s notes on Ricardo’s keynote, which nicely complement my notes here.