Notes on the Relationship between Search and Tagging

13 09 2009

I had a number of exciting and very inspiring conversations this week with Marc, Rakesh, Fabian, Cathy, and Pranam, as well as with Ed and Rowan. It was great talking to everybody and I wanted to share some of the issues that were discussed. Most conversations focused on the role of tagging, and how it relates to searching the web. I do not claim that any of these interesting thoughts are mine or that my notes offer answers.  They merely aim to serve as pointers for what I consider important issues.

A minority of resources on the web is tagged:

A number of current research projects study the question of how tagged resources can inform and improve search. However, only a minority of resources on the web is tagged, and the gap between tagged and untagged resources is likely increasing (although this seems difficult to predict; cf. Paul Heymann’s work). This would mean that a decreasing share of resources on the web has tagging information associated with it. The question then becomes: why bother analyzing tagging systems in the first place when their (relative) importance is likely to decrease over time?

Tagged resources exhibit topical bias (that’s a bad thing!):

Tagging is often a geek activity. I am not aware of any studies of delicious’ user population, but it is likely that delicious’ users are geekier than the population at large. This is a bad thing because it would bias any broad attempt at leveraging tagging for search. The bias might depend on the particular tagging system, though: Flickr seems to have a much broader, and thereby more representative, user base.

Bookmarks exhibit timely bias (that’s a good thing!):

Bookmarking typically represents an event in time triggered by some user. Most tagging systems therefore provide timestamp information, allowing us to infer more about the context in which a given resource is being tagged. This allows us to use tagging systems for studying how information on the web is organized, filtered, diffused and consumed.

Search supersedes any other form of information access/organization:

I found this issue to be the most fundamental and controversial one. How do increasingly sophisticated search engines change the way we interact with information? What role do directories (such as Yahoo!) and personal resource collections (such as “Favorite folders”) play in a world where search engines can (re)find much of the information we require with increasing precision? To give an example: would an electronic record of all resources that a user has ever visited – and a corresponding search interface to them – replace the need for information organization à la delicious or browser Favorites (all privacy concerns set aside for a moment)? How would such a development relate to users’ desire to share information with friends?

Search intent is poorly understood:

While there has been some work on search queries and query log analysis, the intent behind queries remains largely elusive. Existing distinctions (such as the one by Broder) need further elaboration and refinement. An example would be what Rakesh called pseudo-navigational queries – queries where the user has a certain expectation about the information, but the information can be found on several sites (e.g. Wikipedia, an encyclopedia, or other sites).

Conflict in tagging systems:

Tagging systems are largely tolerant of conflicts, for example with regard to tagging semantics. This is different from systems such as Wikipedia, where conflict is regarded as an important aspect of the collaboration process. Twitter seems to lie between those extremes: conflict can emerge easily (e.g. around hashtags), with only rudimentary support for resolution.

I truly enjoyed these conversations, and hope that they will continue at some point in the future.





Why we can’t quit searching in Twitter

19 08 2009

I’m still trying to get my head around this recent Slate magazine article on Seeking: How the brain hard-wires us to love Google, Twitter, and texting. And why that’s dangerous.

In this blog post, I’m basically trying to tie this article on “Seeking” together with two related topics: “Information Foraging” and “Twitter”.

Seeking

The Slate magazine article starts by observing that we (humans) are insatiably curious, and that we gather data even if it gets us into trouble. To give an example:

Nina Shen Rastogi confessed in Double X, “My boyfriend has threatened to break up with me if I keep whipping out my iPhone to look up random facts about celebrities when we’re out to dinner.”

The article goes on to make several intertwined arguments, which I try to sort out here.

One of the arguments focuses on reporting how lab rats can be artificially put into “a constant state of sniffing and foraging”. It has been observed in experiments that rats will endlessly push a button if pushing it stimulates electrodes connected to the rat’s lateral hypothalamus, locking the rats into a state of endless repetitive behavior. Scientists have since concluded that the lateral hypothalamus represents the brain’s pleasure center.

Another point is made based on work by University of Michigan professor of psychology Kent Berridge (make sure to check out his website after reading this post) who argues that the mammalian brain has separate systems for wanting and liking. Think of it as the difference between wanting to buy a car, and liking to drive it.

Wanting [or seeking] and liking are complementary. The former catalyzes us to action; the latter brings us to a satisfied pause. Seeking needs to be turned off, if even for a little while, so that the system does not run in an endless loop.

Interestingly, our brain seems to have evolved into “being more stingy with mechanisms for pleasure than desire”: Mammals are rewarded by wanting (as opposed to liking), because creatures that lack motivation (but thrive on pleasure) are likely to lead short lives (due to negative selection).

There are lab experiments reporting on how the wanting system can take over the liking system. Washington State University neuroscientist Jaak Panksepp says that “a way to drive animals into a frenzy is to give them only tiny bits of food which sends their seeking system into hyperactivity.”

Information Foraging

This brings me to a book I’m currently reading on “Information Foraging Theory” by Peter Pirolli (I have not finished it yet!). At the beginning, the book argues that the way users search for information can be likened to the way animals forage for food. Information Foraging Theory makes a basic distinction between two types of states a forager is in: between-patch and within-patch states. In between-patch states, foragers are concerned with finding new patches to feed on, whereas in within-patch states, creatures are concerned with consuming a patch. Information Foraging Theory is in part concerned with modeling “optimal” strategies for foragers that would maximize some gain (e.g. information value) function, based on the Marginal Value Theorem, depicted in the illustration below.

In this illustration from wikipedia, Transit time refers to between patch (left side) and Time foraging in patch refers to within patch (right side of the diagram). The slope of the tangent corresponds to the optimal rate of gain. There is an interesting relationship between time spent within and between patches. If patches yield very little average gain (e.g. calories, or information value), patches are easily exhausted, quickly putting foragers into the between patch state again.
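
The Marginal Value Theorem described above can also be sketched numerically. The saturating gain function, its parameters, and the travel times below are illustrative assumptions of mine, not values from Pirolli’s book:

```python
# Minimal numeric sketch of the Marginal Value Theorem (MVT).
# Assumption (not from the book): within-patch gain saturates as
# g(t) = g_max * (1 - exp(-r * t)), and between-patch travel takes
# a fixed time T. The optimal residence time t* maximizes the
# long-run rate of gain g(t) / (T + t).

import math

def gain(t, g_max=10.0, r=1.0):
    """Diminishing-returns gain accumulated after t time units in a patch."""
    return g_max * (1.0 - math.exp(-r * t))

def optimal_residence_time(travel_time, dt=0.001, t_max=20.0):
    """Grid-search the residence time that maximizes the rate of gain."""
    best_t, best_rate = 0.0, 0.0
    steps = int(t_max / dt)
    for i in range(1, steps + 1):
        t = i * dt
        rate = gain(t) / (travel_time + t)
        if rate > best_rate:
            best_t, best_rate = t, rate
    return best_t, best_rate

# Longer travel between patches should push the optimal residence
# time up: it pays to exhaust each patch more thoroughly.
for T in (0.5, 2.0, 8.0):
    t_star, rate = optimal_residence_time(T)
    print(f"travel time {T:4.1f} -> stay {t_star:.2f}, rate {rate:.2f}")
```

This reproduces the relationship mentioned above: when patches are easily exhausted (or travel is cheap), the forager returns to the between-patch state quickly.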

Twitter, Seeking and Information Foraging

Now I’m trying to tie these two topics together (there might even be a common basis for the relationship between these topics in the literature).

Seeking and Information Foraging: it seems that the wanting and liking systems relate to the within- and between-patch states of Information Foraging Theory. When lab rats push a button in an experiment, the electrodes appear to modify their liking system in a way that prevents them from engaging within a patch and puts them into a between-patch state. When animals are sent into a frenzy by giving them tiny bits of food, within-patch time is minimized, sending them right back into a between-patch state. In this scenario, animals spend relatively more time searching than actually consuming food, effectively reducing their overall gain compared to scenarios where they are confronted with large bits of food (higher-gain patches).

Finally, how this all might relate to Twitter: I argue that Twitter’s restriction of messages to 140 characters (disregarding links that might be posted in tweets) artificially reduces within-patch time. The gain of a patch (a tweet) might still vary, but gain no longer depends on within-patch time. The average “within-patch/gain function” (right side of the above illustration) seems to be constant! It always takes approximately the same amount of time to read a tweet (assuming there are no URLs in it), and reading “longer” does not increase the gain.

In addition, Twitter’s particular user interface (a chronological listing of tweets) seems to be weak in terms of information scent: judging whether a tweet is relevant requires a forager to read the entire tweet, regardless of whether the patch (the tweet) contains a gain (informative value) or not. This seems to yield a situation where, in systems such as Twitter, users quickly alternate between within-patch (reading a tweet) and between-patch (finding the next tweet to read) states. The reason might be the following: when a forager has exhausted a patch, he switches back to a between-patch state. However, due to the deprivation of information scent in the Twitter user interface, the user is largely helpless in the between-patch state (he does not know where to search next, other than reading the next tweet). This leaves users with a desire to switch back to within-patch states as quickly as possible (only reading entire tweets can help to assess relevance), thereby potentially adopting chaotic and/or irrational strategies.
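
The two claims above (constant reading time, and a lack of scent that forces every tweet to be read in full) can be combined in a toy simulation. All numbers below are made-up illustrations, not measurements of Twitter:

```python
# Toy model of foraging on tweets without information scent.
# Assumptions (invented for illustration): every tweet takes the
# same time to read, finding the next tweet takes a fixed scan
# time, and a fraction of tweets carries any gain at all.

import random

random.seed(42)

READ_TIME = 1.0   # assumed constant within-patch time per tweet
SCAN_TIME = 0.5   # assumed between-patch time to find the next tweet

def simulate(n_tweets, p_relevant=0.2, relevant_gain=1.0):
    """Long-run rate of gain when every tweet must be read in full
    to judge its relevance (i.e. no information scent)."""
    total_gain, total_time = 0.0, 0.0
    for _ in range(n_tweets):
        total_time += SCAN_TIME + READ_TIME
        if random.random() < p_relevant:
            total_gain += relevant_gain
    return total_gain / total_time

# With no scent, the rate is pinned near
# p_relevant * relevant_gain / (SCAN_TIME + READ_TIME),
# no matter how long the forager "wants" to stay in a patch.
print(f"simulated rate of gain: {simulate(100_000):.3f}")
```

The point of the sketch is that the forager has no lever left: the only way to raise the rate of gain would be better scent (skipping irrelevant tweets), which the chronological interface does not provide.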

The above observation might also explain the frenzy that animals are sent into when being offered tiny bits of food while being deprived of “scent” to inform their between patch phases. The hypothesis would be that the frenzy would not occur if the animals were offered clues regarding where the next patch is to be expected, and what gain they could get from exhausting it (of course behavioural biologists might have already studied this question).

Returning to Twitter, it seems that the same effect that sends animals into a frenzy could be at play on Twitter, where users – due to a combination of small within-patch times and weak information scent – engage in uninformed foraging of artificially small information patches.

This of course is the provocative conclusion of the Slate Magazine article. What I found interesting is how these three topics – seeking, information foraging and Twitter – nicely tie into each other on a theoretical level.

I still have not figured out what a reduction of within patch times alone means from an Information Foraging Theoretic perspective – I’d like to figure that out at some point in the future.





Impressions from CIKM08 and SSM08 in Napa Valley

2 11 2008

CIKM’08 and SSM’08 just took place this week in Napa Valley.  The conference was well attended, and the papers were exciting. On Sunday, I took part in Christos Faloutsos’s tutorial on Large Graph Mining (tutorial material), which gave a great overview of current research on networks and network algorithms. I had a chance to briefly talk to Christos about similarities between vector-based and graph-based research, and whether it would be worth looking into synergies (as both can be understood as different perspectives on affiliation matrices). We agreed that it would be interesting to see whether there are certain problems in one domain (e.g. vector space) that are easier to solve in the other (e.g. networks). Overall, I really enjoyed the tutorial.

The program (program.pdf) of the main conference was pretty packed. The sessions I found most interesting were the ones on Web Search, Social Search, and Query Analysis. I found Bruce Croft’s keynote the most interesting, as it related most closely to the research of my group. He talked about the difficulty IR has in dealing with “long queries” (cf. also the intent descriptions of the TREC datasets), as IR in the past has often focused on short ones. One particularly interesting chart he presented illustrated that click-through ratios decline significantly with query length (using the MS click log). Bruce interpreted this as evidence of the difficulty existing search algorithms have with such queries. Many contributions emphasized the importance of understanding search intent (something I am interested in), and most contributions had a strong evaluation background. It seemed that Mechanical Turk created some buzz as a presumably easy and reliable (if you manage to avoid all the pitfalls, such as spam) form of large-scale, fast evaluation for search (and other problems).

The social interactions at the conference were also great: I had the opportunity to share dinner with Greg (whose blog I have read for a while now), and another very pleasant dinner with Ron and Peter, who work at Me.dium, a social search startup.

The SSM08 workshop was well attended, with 40 registered participants. I gave a presentation on “Purpose Tagging – Capturing User Intent to Assist Goal-Oriented Social Search” (paper.pdf) and I felt that people appreciated my ideas and research results. There were a couple of interesting questions at the end of my presentation (such as “are there dominant purposes for given resources?” Example: “eating food” would probably dominate restaurant websites. Maybe this question could be answered by looking into purpose frequency evaluations of resources) and some follow-up discussions. Some of the people I had only known through their papers were in the audience, including M. Smith, A. Chowdury, E. Agichtein, M. Hurst, M. Hearst, E. Chi, L. Getoor, S. Dumais and many others. The workshop was a great chance for me to meet and discuss issues of social search with them – thanks to Ian and Eugene for organizing such an excellent event. I would really like to see a successor to this event.

At the workshop, I was particularly excited to see a preview of MrTaggy (currently not public), a social search engine developed by Ed Chi at PARC. It allows people to purposefully re-arrange search results and share them with friends. M. Hurst also talked about uRank, a Microsoft social search product (currently available only in the US). Tusavvy and me.dium.com are further contenders. The discussion was lively, and focused on privacy and on different notions of search (search vs. browsing) and of social search (on/offline, synchronous/asynchronous, etc.). Abdur demoed Twitter search, a rather powerful tool and a rather unusual search problem.

Overall, the conference was an excellent event to get in touch with KM&IR researchers talking about their current research, as well as with people from industry (Yahoo, Google, Live Labs, Ask, etc). Next year’s conference is in Beijing.





Real Estate Intent on the Web

26 09 2008

The US housing crisis has been frequently reported as one major cause for the current financial turmoil. I was wondering whether indicators for the housing crisis can be found in Google’s Search Query Logs.

To look into this, I compared the search volume for the following explicit intentional queries: “buy a house”, “rent a house”, “sell my house” and “find a house”. The results are interesting, yet hardly surprising.

Tracking Real Estate Intent on Google Trends

“Buy a house” undergoes seasonal fluctuations, with peaks at the end/beginning of every year. Overall though, there seems to be a downward trend. At the same time, “sell my house” and “rent a house” are on the rise. “Find a house” is relatively stable, but slowly declining as well. Although subtle, the housing crisis can be identified in the data.

Google Trends seems to provide some interesting data (such as the above), yet I miss some features and several questions remain unanswered. I’d love to see a mashup with Google Maps, where I could not only plot the queries over time, but also map them onto different regions of the US. Questions that I would like to have answered include: 1) What is the absolute search volume of queries? Google does not give away the absolute number of queries per term of interest. 2) Does Google account for rising query volume? I assume that the total number of queries issued in 2004 is significantly lower than the total number issued in 2007. So does a decline in, for example, the blue curve refer to an absolute decline in numbers, or to a relative decline that factors in the overall increase in query volume? 3) “Sell my house” does not have any data before 2005 – what does this mean exactly? It seems odd that people hardly used the term “sell my house” on Google before 2005.
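
Question 2 can be made concrete with a hypothetical example (the volumes and shares below are invented, not Google data): a query’s relative share can fall while its absolute volume rises, as long as total query volume grows faster.

```python
# Hypothetical illustration of relative vs. absolute decline.
# All numbers are invented: an overall query-volume index that
# grows year over year, and a query whose share of that total falls.

years = [2004, 2005, 2006, 2007]
total_queries = [100, 150, 225, 340]           # assumed overall volume (index)
share_of_total = [0.010, 0.009, 0.008, 0.007]  # assumed share for one query

for year, total, share in zip(years, total_queries, share_of_total):
    absolute = total * share
    print(f"{year}: relative share {share:.3f}, absolute volume {absolute:.2f}")

# The share drops every year, yet the absolute volume rises
# (roughly 1.00 -> 1.35 -> 1.80 -> 2.38): a declining Trends curve
# need not mean fewer actual queries.
```

So whether Google normalizes by total volume changes the interpretation of every curve in the chart above.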

Still, the kind of analysis provided by Google hints at potential applications of taking an explicit, goal-oriented stance on the web.

Update (Oct 7 2008): It seems that Google Insights for Search covers many of the issues identified above. See the example here (you must be logged into your Google Account to see the numbers).