Trend Prediction = Search + Social Network Analysis?

15 02 2008

I just stumbled upon the New Scientist article about Peter Gloor’s web tool Condor. The article is a rather high-level introduction, but an interesting read. Here’s the basic description (from New Scientist):

“Condor starts by taking an ordinary search term – the name of a political candidate or a company – and plugging it into the Google web search engine. It then takes the URLs of the top 10 hits returned by Google and plugs them back into the Google search field, prefaced with the term “link:”. In response, Google returns the sites that link to the 10 original sites. Condor then repeats the process with the new set of sites.”
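To make this construction process concrete, here is a minimal sketch in Python of how I read the description. The google_search helper is hypothetical (programmatic access would require one of Google’s search APIs), and the use of networkx as well as the iteration depth are my assumptions, not details of Condor’s actual implementation:

    import networkx as nx

    def google_search(query, limit=10):
        """Hypothetical helper: return up to `limit` result URLs for `query`."""
        raise NotImplementedError("plug in a real search backend here")

    def build_condor_network(term, iterations=2):
        G = nx.DiGraph()
        frontier = google_search(term)               # top 10 hits for the term
        G.add_nodes_from(frontier)
        for _ in range(iterations):
            next_frontier = []
            for url in frontier:
                # "link:" asks Google for the sites that link TO this URL
                for linking_site in google_search("link:" + url):
                    G.add_edge(linking_site, url)    # edge: linking site -> linked site
                    next_frontier.append(linking_site)
            frontier = next_frontier
        return G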

This leads to networks similar to the one illustrated below (taken from these slides)

Gloor Ego-centric networks of search results

Basically, what Peter seems to propose is to use Google search for constructing ego-centric networks of websites related to a specific search term. Subsequently (as described in the full article) he calculates the betweenness value (the number of shortest paths going through each node) for each of the websites in the network and averages these values into something like a “popularity factor” for the search term (Peter doesn’t actually call it that, but this seems to be the intention). Peter speculates that this might be helpful to predict the outcome of elections, pick future Oscar winners or predict stocks.
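In networkx terms, such a “popularity factor” would amount to something like the following; whether Condor normalizes betweenness this way, or computes it on the directed or the undirected graph, is my assumption:

    import networkx as nx

    def popularity_factor(G):
        # betweenness = fraction of shortest paths passing through each node
        betweenness = nx.betweenness_centrality(G)
        return sum(betweenness.values()) / len(betweenness)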

The available information about the specifics of Condor seems to be scarce (I didn’t find a published article, but I found slides and some introductory text), so I definitely don’t have the complete picture of Condor’s architecture. But from what I’ve seen so far (my registration for a test account on condorview.com is pending), I observe the following issues:

  • The network created is almost a forest (due to the nature of the graph construction process, see the image above), but not quite: when a website appears in several different search results, the shared vertex introduces additional edges that let the graph deviate from the “pure” forest structure towards a denser graph.
  • So what does this mean in terms of the phenomenon that Peter wants to observe? “Popularity” in Condor seems to be measured by averaging betweenness values. The more the network deviates from the forest structure, the more “shortcuts” for navigating it are introduced, and the lower the average betweenness becomes (a toy illustration of this effect follows after this list).
  • This would mean that, for example, movies with the highest average betweenness (i.e. those more likely to get an Oscar) produced ego-centric networks that are closer to pure forest structures than movies with lower average betweenness (i.e. those less likely to get an Oscar).
  • Since Google ranks websites (at least in part) with PageRank, this would mean that the most popular websites for an “Oscaresque” movie do not have extensive links among their neighbours in their ego-centric networks.
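Here is the toy illustration promised above: take a pure tree and add “shortcut” edges among its leaves, and the average betweenness drops, since shortcuts reduce the shortest-path lengths that betweenness counts. The graph and the particular shortcuts are of course made up for illustration:

    import networkx as nx

    tree = nx.balanced_tree(r=3, h=3)          # a pure tree with 40 nodes
    dense = tree.copy()
    leaves = [n for n in dense if dense.degree(n) == 1]
    dense.add_edges_from(zip(leaves[:-1], leaves[1:]))   # chain the leaves as shortcuts

    def avg_betweenness(g):
        return sum(nx.betweenness_centrality(g).values()) / g.number_of_nodes()

    print(avg_betweenness(tree))    # higher: every path runs through the tree
    print(avg_betweenness(dense))   # lower: shortcuts bypass interior nodes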

This seems odd and somewhat counterintuitive, and I do not see an obvious explanation for it – any thoughts? It would also be interesting to conduct evaluations, for example comparing different network parameters (in particular the clustering coefficient and in- and out-degree values) to see how they would perform compared to average betweenness.
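The alternative parameters are easy to obtain with networkx, assuming a directed graph G built as sketched earlier; actually evaluating their predictive power would of course require labeled outcomes (election results, Oscar winners), which I don’t have. Note that the mean in- and out-degrees of a directed graph always coincide, so the maxima are the more informative degree-based features:

    import networkx as nx

    def candidate_features(G):
        return {
            "avg_betweenness": sum(nx.betweenness_centrality(G).values()) / len(G),
            "avg_clustering": nx.average_clustering(G.to_undirected()),
            # mean in-degree equals mean out-degree (= edges/nodes),
            # so report the maxima instead
            "max_in_degree": max(d for _, d in G.in_degree()),
            "max_out_degree": max(d for _, d in G.out_degree()),
        }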

I must say that my analysis might be flawed, since it is unclear how many iterations of Google search were performed, how many of the returned websites are included, and how they are included (top-level domains vs. specific pages) – all issues that would obviously influence the resulting graph.

Peter (who also has a blog) was recently in Vienna, so it seems I’ve missed my chance to discuss his very interesting approach and my observations in person – something I definitely would have enjoyed doing.





CSKGOI Workshop Proceedings online

13 02 2008

Mathias just announced that the proceedings of our CSKGOI workshop at the IUI conference are now available at the CEUR workshop proceedings server (CSKGOI Workshop proceedings CEUR-WS Vol-323). Thanks, Mathias, for putting all this together, and thanks to everyone who contributed to this nice little event, which was an excellent occasion to discuss current research on commonsense knowledge and goal-oriented interfaces with like-minded colleagues.

Mathias, Peter and I also presented some results of our own research, which focused on studying the nature and structure of user goals in the AOL search query log – a log that contains more than 20 million search queries.

Here’s the link to our paper in case you are interested:

Different Degrees of Explicitness in Intentional Artifacts: Studying User Goals in a Large Search Query Log, Markus Strohmaier, Peter Prettenhofer and Mathias Lux





Studying Goals on 43things.com

8 02 2008

In a recent effort, Thomas, a student in my research group, explored goals and goal relations on 43things.com. You can see an outcome of his effort in the Pajek screenshot below (click here for a larger version of the graph):

Goal Association Graph

We used the 43things.com API to crawl a small set of goals and the relations between them. The relations were inferred via a rather naive approach, using simple weights for user and tag co-occurrences as well as 43things.com’s similar-goal API call. Surprisingly, many of the strongest inferred associations are rather intuitive. This small exercise illustrates the potential of large socially-constructed corpora for collecting common sense knowledge (such as “fall in love” helps to “be happy”, as illustrated in the screenshot above). Making such common sense knowledge more explicit might, on the one hand, help users formulate, progress towards and satisfy their goals on the web, and, on the other hand, help systems identify, understand, assess and reason about users’ requirements and goals.
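For the curious, here is a rough sketch of what such a naive co-occurrence weighting could look like; the weights w_user and w_tag are hypothetical, and the actual scheme Thomas used (including the similar-goal API call) is simplified away:

    from collections import defaultdict
    from itertools import combinations

    def association_weights(goal_users, goal_tags, w_user=1.0, w_tag=0.5):
        """goal_users / goal_tags map each goal to a set of users / tags."""
        goals = sorted(set(goal_users) | set(goal_tags))
        weights = defaultdict(float)
        for g1, g2 in combinations(goals, 2):
            shared_users = goal_users.get(g1, set()) & goal_users.get(g2, set())
            shared_tags = goal_tags.get(g1, set()) & goal_tags.get(g2, set())
            weights[(g1, g2)] = w_user * len(shared_users) + w_tag * len(shared_tags)
        return {pair: w for pair, w in weights.items() if w > 0}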





Hello world!

8 02 2008

So here we go – yet another blog injected into the blogosphere. With the blogosphere growing exponentially, I guess a new blog needs to provide at least some justification or an outline of its purpose. I intend to use this blog mainly to talk about issues related to my research on agents and social computation on the web. A sort of minimum requirement is that this blog should not contribute to worsening the signal-to-noise ratio in the blogosphere. Let’s see how that goes.