I’m sharing my live notes from the second Hypertext keynote, “Relating Content by Web Usage” by Ricardo Baeza-Yates, at Hypertext ’09.
In case you have any additions, comments or links that would make my notes more complete / more useful, please leave a comment and fill in the blanks.
On the nature of search and intent:
Ricardo starts by stating that Search is not about document retrieval anymore. Given Ricardo’s history in document retrieval, this is an interesting thing to hear.
Search is rather about mediating user goals, in particular:
- identifying a user’s task
- providing means for task completion
For search to be successful, the intent of searchers needs to be related to the content available on the web. Ricardo argues that rather than focusing on content, search engines need to focus on objects, such as people, places, businesses, restaurants etc. Search intent can then be satisfied by exploiting and mapping the characteristics of such objects and their corresponding attributes.
On the nature of content:
So how can we learn about objects and attributes? One approach is to look into metadata, where Ricardo distinguishes between explicit (Metadata, Y! Answers, Flickr, etc) and implicit (anchor text, queries and clickthrough, etc) metadata. Ricardo points out that some of this metadata is private, making usage more complicated.
A key question in this context is “What is the quality of different kinds of metadata?”. Ricardo mentions that although user-generated metadata is noisy, on an aggregate level, he believes that it outperforms metadata generated by experts.
Search in Social Media:
Ricardo introduces TagExplorer, a Yahoo research prototype for tag-based, faceted navigation/search of Flickr. Facets that are supported are locations, subjects, activities, time, names and others. I didn’t fully understand how these facets are identified or determined, but it seems the selection is based on / informed by previous empirical Yahoo research on different types of tags in Flickr.
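To make the idea of tag-based faceted navigation a bit more concrete for myself, here is a minimal sketch. Everything in it is my own assumption and not how TagExplorer works: the `TAG_FACETS` lookup is a hand-written, hypothetical tag-to-facet mapping (how the real system assigns facets is exactly the part I did not catch), and the photos and tags are made up.

```python
# Hypothetical, hand-written mapping from tags to facets.
# How TagExplorer actually determines facets is NOT known to me.
TAG_FACETS = {
    "paris": "location",
    "torino": "location",
    "sunset": "subject",
    "hiking": "activity",
    "2009": "time",
}

# Made-up photo collection: photo id -> set of tags.
PHOTOS = {
    "photo1": {"paris", "sunset", "2009"},
    "photo2": {"torino", "hiking"},
    "photo3": {"paris", "hiking", "2009"},
}


def facet_counts(photos, selected=frozenset()):
    """Return the photos carrying all selected tags, plus, grouped by
    facet, how often each remaining tag occurs among those photos."""
    matching = {pid for pid, tags in photos.items() if selected <= tags}
    counts = {}
    for pid in matching:
        for tag in photos[pid] - selected:
            facet = TAG_FACETS.get(tag, "other")
            by_tag = counts.setdefault(facet, {})
            by_tag[tag] = by_tag.get(tag, 0) + 1
    return matching, counts


# Narrow the collection down to photos tagged "paris".
MATCHING, COUNTS = facet_counts(PHOTOS, frozenset({"paris"}))
```

Selecting a tag narrows the result set, and the per-facet counts are what a faceted interface would offer as the next refinement steps.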
Another prototype Ricardo demonstrates is the Correlator.
Web Usage:
Ricardo starts with the assumption that “when users use the web, they think”, and he suggests that we can/should tap into the outcome of these cognitive processes and exploit them for search. One example is query logs, where users actively make relevance judgements and engage in search query formulation / reformulation strategies.
Ricardo gives a number of examples where this might be useful; for example, it might help in learning about relationships between queries, sessions and documents.
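As a rough illustration of what learning relationships between queries from a log could look like, here is a minimal sketch that relates two queries when users clicked on the same documents for both. This is my own toy example, not a method from the talk: the log format, the example entries and the choice of Jaccard overlap as a similarity measure are all assumptions, and the session ids are carried along but not used.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy click log: (session_id, query, clicked_url).
CLICK_LOG = [
    ("s1", "cheap flights", "travel.example/flights"),
    ("s1", "low cost airlines", "travel.example/flights"),
    ("s2", "low cost airlines", "air.example/deals"),
    ("s2", "cheap flights", "air.example/deals"),
    ("s3", "python tutorial", "docs.example/python"),
    ("s3", "cheap flights", "travel.example/hotels"),
]


def clicked_urls_by_query(log):
    """Collect the set of clicked URLs for each query."""
    urls = defaultdict(set)
    for _session, query, url in log:
        urls[query].add(url)
    return urls


def query_similarity(log):
    """Jaccard overlap of clicked URLs for every pair of queries
    that share at least one clicked URL."""
    urls = clicked_urls_by_query(log)
    sims = {}
    for q1, q2 in combinations(sorted(urls), 2):
        overlap = urls[q1] & urls[q2]
        if overlap:
            sims[(q1, q2)] = len(overlap) / len(urls[q1] | urls[q2])
    return sims


SIMILARITIES = query_similarity(CLICK_LOG)
```

In this toy log, “cheap flights” and “low cost airlines” end up related because they share clicked documents, even though the query strings have no words in common.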
Open Issues:
Ricardo concludes his talk by discussing a number of issues he feels are important for future research. He discusses the interesting research question of studying explicit social networks (where links between users are made explicit) versus implicit social networks (where links between users are inferred). Related to this problem is the problem of implicit and explicit metadata. Ricardo refers to that problem as the virtuous cycle, where both implicit and explicit metadata can be used/should be used to inform search.
Another problem Ricardo mentions is the question of when it is necessary to acquire more data vs. when we need to tweak our algorithms. As researchers, I guess we tend to have a bias towards working on the algorithmic rather than the data aspect.
My impressions:
I think Ricardo’s talk gave a great overview of the many activities at Yahoo Research. Due to the number of projects being presented, it was difficult for me to capture everything that was presented, and I feel that my notes in this post capture only a small part of what Ricardo talked about in his keynote. So check out Ricardo’s website / Yahoo research website / the slides of this talk to get a more complete picture of their exciting projects.
Update: I just stumbled upon Alvin Chin’s notes of Ricardo’s keynote, which nicely complement my notes here.