I just stumbled upon the New Scientist article about Peter Gloor’s web tool Condor. The article is a rather high-level introduction, but an interesting read. Here’s the basic description (from New Scientist):
“Condor starts by taking an ordinary search term – the name of a political candidate or a company – and plugging it into the Google web search engine. It then takes the URLs of the top 10 hits returned by Google and plugs them back into the Google search field, prefaced with the term “link:”. In response, Google returns the sites that link to the 10 original sites. Condor then repeats the process with the new set of sites.”
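To make the quoted procedure concrete, here is a minimal Python sketch of how I read it. Everything in it is my own assumption rather than Condor's actual code: the function name `build_link_network`, the `backlinks` callable (a stand-in for whatever returns the pages linking to a URL, e.g. a "link:" query), and the toy data.

```python
from typing import Callable, Iterable, List, Set, Tuple

def build_link_network(
    seeds: Iterable[str],
    backlinks: Callable[[str], List[str]],  # hypothetical: URL -> pages linking to it
    rounds: int = 2,
) -> Set[Tuple[str, str]]:
    """Collect (source, target) link edges by repeatedly expanding backlinks."""
    edges: Set[Tuple[str, str]] = set()
    frontier = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    seen = set(frontier)
    for _ in range(rounds):
        next_frontier = []
        for target in frontier:
            for source in backlinks(target):
                edges.add((source, target))
                if source not in seen:  # expand each site only once
                    seen.add(source)
                    next_frontier.append(source)
        frontier = next_frontier
    return edges

# Toy data standing in for real search results (assumption, not real sites).
TOY_BACKLINKS = {
    "a.example": ["c.example", "d.example"],
    "b.example": ["d.example", "e.example"],
    "c.example": ["f.example"],
}

edges = build_link_network(
    ["a.example", "b.example"],
    lambda url: TOY_BACKLINKS.get(url, []),
    rounds=2,
)
print(sorted(edges))
```

In the toy data, `d.example` links to both seed sites, so the result is already not a pure forest.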
This leads to networks similar to the one illustrated below (taken from these slides).
Basically, what Peter seems to propose is to use Google search for constructing ego-centric networks of websites related to a specific search term. Subsequently (as described in the full article) he calculates the betweenness value of each website in the network (roughly, the number of shortest paths that pass through that node) and averages these values, yielding something like a “popularity factor” for the search term (Peter doesn’t really call it that, but this seems to be the intention). Peter speculates that this might be helpful to predict the outcome of elections, pick future Oscar winners or predict stocks.
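As a rough illustration of what "average betweenness" means, here is a small Python sketch. I am assuming an undirected, unweighted graph and unnormalised betweenness (each pair of nodes counted once); I do not know which variant Condor actually uses. The function computes betweenness with Brandes' algorithm, and the star-shaped toy graph is made up.

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness for an undirected, unweighted graph.

    adj maps each node to the set of its neighbours (Brandes' algorithm).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []                      # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:       # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                    # accumulate dependencies back to front
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # every unordered pair was visited from both ends
    return {v: b / 2 for v, b in bc.items()}

# Toy star graph: one hub linked to three otherwise unconnected sites.
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
scores = betweenness(star)
average = sum(scores.values()) / len(scores)
print(scores["hub"], average)
```

In the star, all three shortest paths between the outer sites run through the hub, so the hub scores 3 and the outer sites score 0.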
The available information about the specifics of Condor seems to be scarce (I didn’t find a published article, but I found slides and some introductory text), so I definitely don’t have the complete picture of Condor’s architecture, but from what I’ve seen so far (my registration for a test account on condorview.com is pending), I observe the following issues:
- The network created is almost a forest (due to the nature of the graph construction process, see the image above), but not quite: when a website appears in several different search results, this introduces vertices that are shared between branches and let the graph deviate from the “pure” forest structure towards graphs that are denser.
- So what does this mean in terms of the phenomenon that Peter wants to observe? “Popularity” in Condor seems to be measured by averaging betweenness values. The more the network deviates from the forest structure, the more “shortcuts” through the forest are introduced, and the lower the average betweenness becomes.
- This would mean that, for example, the movies with the highest average betweenness (i.e. more likely to get an Oscar) produce ego-centric networks that are closer to pure forest structures than movies with lower average betweenness (i.e. less likely to get an Oscar).
- Since Google ranks websites (at least in part) with PageRank, this would mean that the most popular websites for an “Oscaresque” movie do not have extensive links among their neighbours in their ego-centric networks.
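The "shortcuts lower the average" step in the list above can be checked on a toy example. The sketch below assumes a connected, undirected graph and unnormalised betweenness, and it uses a standard identity instead of computing betweenness per node: summed over all nodes, betweenness equals the sum over all node pairs of (shortest-path length minus 1), because every shortest path of length d has d - 1 intermediate nodes. The helper names and the seven-node tree are my own.

```python
from collections import deque

def undirected(edges):
    """Build a neighbour-set adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def average_betweenness(adj):
    """Average unnormalised betweenness of a connected undirected graph.

    Uses: sum of betweenness over nodes = sum over pairs of (distance - 1).
    """
    total = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:                    # plain breadth-first search from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(d - 1 for node, d in dist.items() if node != s)
    return total / 2 / len(adj)         # each pair was counted from both ends

# A small tree: root r, two children, four grandchildren.
tree_edges = [("r", "a"), ("r", "b"), ("a", "a1"), ("a", "a2"),
              ("b", "b1"), ("b", "b2")]
tree_avg = average_betweenness(undirected(tree_edges))

# The same tree plus one cross link between two branches.
shortcut_avg = average_betweenness(undirected(tree_edges + [("a1", "b1")]))

print(tree_avg, shortcut_avg)
```

In this toy case the single cross link shortens several paths and the average drops, which is the effect the argument in the list relies on.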
This conclusion seems odd and somewhat counterintuitive, and I do not see an obvious explanation for it. Any thoughts? It would also be interesting to conduct evaluations comparing different network parameters, in particular the clustering coefficient and in-degree and out-degree values, to see how they perform compared to average betweenness.
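For such a comparison, the alternative parameters are cheap to compute from the same edge list. Here is a minimal Python sketch under my own assumptions: edges are directed (source links to target), the clustering coefficient is the local one computed on the undirected version of the graph, and the function name and toy edges are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def degree_and_clustering(edges):
    """Return (in_degree, out_degree, local clustering) for directed edges."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    nbrs = defaultdict(set)             # undirected neighbourhood
    for src, dst in edges:
        if src == dst:                  # ignore self-links
            continue
        out_deg[src] += 1
        in_deg[dst] += 1
        nbrs[src].add(dst)
        nbrs[dst].add(src)
    clustering = {}
    for node, ns in nbrs.items():
        k = len(ns)
        if k < 2:                       # fewer than two neighbours: no triangles
            clustering[node] = 0.0
            continue
        # count neighbour pairs that are themselves connected
        links = sum(1 for u, v in combinations(ns, 2) if v in nbrs[u])
        clustering[node] = 2 * links / (k * (k - 1))
    return dict(in_deg), dict(out_deg), clustering

# Toy link network (made-up site names).
toy_edges = [("c", "a"), ("d", "a"), ("d", "b"), ("e", "b"), ("c", "d")]
in_deg, out_deg, clustering = degree_and_clustering(toy_edges)
print(in_deg, out_deg, clustering)
```

Each of these could be averaged over the network in the same way as betweenness and then compared against the outcomes Peter wants to predict.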
I must say that my analysis might be flawed, since it is unclear how many iterations with Google search were performed, how many of the returned websites are included, and how they are included (top-level domain vs. specific pages). All of these would obviously influence the resulting graph.
Peter (who also has a blog) was recently in Vienna, so it seems I’ve missed my chance to discuss his very interesting approach and my observations in person – something I definitely would have enjoyed doing.