Open Data for Cities: Enabling Citizens to Have the Apps They Want/Need

8 10 2009

During my recent research visit to PARC / the SF Bay Area, I came across a quite impressive initiative by the San Francisco Municipal Government aimed at opening up city data.

While I was aware of Obama’s data.gov initiative at the federal level, opening up municipal data seems particularly interesting because in many cases it is closer to people’s everyday concerns, such as finding a parking spot or avoiding areas with high levels of crime.

http://datasf.org is the website of the San Francisco initiative, aiming to create transparency about the datasets made available by the city so far, such as the Disabled Parking Blue Zones dataset (.zip download). The general idea is to expose municipal data to the public, in order to enable the public to come up with innovations they feel are useful and/or important. Examples of such innovations can be found in a showcase, including an app for public health scores of SF restaurants or an iPhone application for finding kid-friendly locations in the city.

Brilliant! What is also remarkable about these applications is that these innovations came to the city of San Francisco at no cost other than the cost of publishing the data. Application development was done by developers who cared about a problem or by companies who spotted a business opportunity.

In addition, publishing this data shifts – to some extent – responsibility from cities to citizens. If an application does not exist, people can certainly demand that it be provided – but more importantly – they can decide to develop it themselves, or organize in a way that gets the applications they want developed independent of municipal approval.

After some further research, I was excited to see that the city of Toronto has a similar initiative, http://www.toronto.ca/open. Toronto Mayor David Miller announced it at Mesh09 (watch the video here, the interesting stuff starts at ~12:40).

In these excerpts from a transcript of his speech, David Miller nicely captures the vision of such initiatives:

I am very pleased to announce today at Mesh09 the development of http://toronto.ca/open, which will be a catalogue of city generated data.  The data will be provided in standardized formats, will be machine readable, and will be updated regularly.  This will be launched in the fall of 2009 with an initial series of data sets, including static data like schedules, and some feeds updated in real time.

The benefits to the city of Toronto are extremely significant.  Individuals will find new ways to apply this data, improve city services, and expand their reach.  By sharing our information, the public can help us to improve services and create a more liveable city.  And as an open government, sharing data increases our transparency and accountability.

In his speech, Mayor Miller also challenged the audience to develop apps that would help the government spot deficiencies and potential improvements based on the published data (e.g. which contractor fixes reported road damage fastest / most sustainably / etc.?). Citizens (or better: “developers”) can come up with new ways of tapping into the data to develop new and innovative applications that provide unique services to municipal communities.

In Graz (Wikipedia), I am currently teaching – among other courses – a course on Web Science at Graz University of Technology, with more than 100 students per semester. I see a huge opportunity to combine the latest web algorithms and hands-on experience on the web with the creative potential of students in order to come up with a vast number of new and innovative applications that could have an exciting impact on the city.

A quick review of related efforts in Graz, however, was somewhat disappointing. The only resource I found was the GeoDataServer Graz (if you are aware of other resources please post them as a comment!), which provides web interfaces to mostly static, geographic information, such as “rivers in Graz” or a “3D model of Graz” – which are fine and exciting examples. But for open data, these initiatives would need to be expanded significantly, to include up-to-date data feeds, APIs, common data representation formats and – most importantly – a grand strategy that provides a common vision of how the city wants to go about governing its data. I think this will eventually take place. In any case, I’m looking forward to getting students excited to participate and contribute to such initiatives, as they can probably serve as an excellent vehicle to let students have an impact, and at the same time teach them about the importance of service and responsibility in societies.

This development also ties in nicely with some of my research interests on people’s motivations on the web: Enabling people to develop and have access to the applications they want seems to be a tremendous shortcut to a more goal-oriented, useful, and ultimately more effective web. And with the advent of end-user programming and tools such as Yahoo Pipes, users no longer even need extensive programming skills to come up with useful applications or mashups.





Why we can’t quit searching in Twitter

19 08 2009

I’m still trying to get my head around this recent Slate magazine article on Seeking: How the brain hard-wires us to love Google, Twitter, and texting. And why that’s dangerous.

In this blog post, I’m basically trying to tie this article on “Seeking” together with two related topics: “Information Foraging” and “Twitter”.

Seeking

The Slate magazine article starts by observing that we (humans) are insatiably curious, and that we gather data even if it gets us into trouble. To give an example:

Nina Shen Rastogi confessed in Double X, “My boyfriend has threatened to break up with me if I keep whipping out my iPhone to look up random facts about celebrities when we’re out to dinner.”

The article goes on to make several intertwined arguments, which I try to sort out here.

One of the arguments focuses on reporting how lab rats can be artificially put into “a constant state of sniffing and foraging”. Experiments have observed that rats tend to endlessly push a button if the button stimulates electrodes connected to the rat’s lateral hypothalamus. This locks rats into a state of endless repetitive behavior. Scientists have since concluded that the lateral hypothalamus represents the brain’s pleasure center.

Another point is made based on work by University of Michigan professor of psychology Kent Berridge (make sure to check out his website after reading this post) who argues that the mammalian brain has separate systems for wanting and liking. Think of it as the difference between wanting to buy a car, and liking to drive it.

Wanting [or seeking] and liking are complementary. The former catalyzes us to action; the latter brings us to a satisfied pause. Seeking needs to be turned off, if even for a little while, so that the system does not run in an endless loop.

Interestingly, our brain seems to have evolved into “being more stingy with mechanisms for pleasure than desire”: Mammals are rewarded by wanting (as opposed to liking), because creatures that lack motivation (but thrive on pleasure) are likely to lead short lives (due to negative selection).

There are lab experiments reporting on how the wanting system can take over the liking system. Washington State University neuroscientist Jaak Panksepp says that “a way to drive animals into a frenzy is to give them only tiny bits of food, which sends their seeking system into hyperactivity.”

Information Foraging

This brings me to a book I’m currently reading, “Information Foraging Theory” by Peter Pirolli (I have not finished it yet!). At the beginning, the book argues that the way users search for information can be likened to the way animals forage for food. Information Foraging Theory makes a basic distinction between two states a forager can be in: between-patch and within-patch. In the between-patch state, foragers are concerned with finding new patches to feed on, whereas in the within-patch state, they are concerned with consuming a patch. Information Foraging Theory is in part concerned with modeling “optimal” strategies for foragers that maximize some gain function (e.g. information value), based on the Marginal Value Theorem, depicted in the illustration below.

In this illustration from Wikipedia, “Transit time” refers to between-patch time (left side) and “Time foraging in patch” refers to within-patch time (right side of the diagram). The slope of the tangent corresponds to the optimal rate of gain. There is an interesting relationship between time spent within and between patches: if patches yield very little average gain (e.g. calories, or information value), patches are easily exhausted, quickly putting foragers back into the between-patch state.
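To make the Marginal Value Theorem a bit more concrete, here is a minimal numerical sketch (my own construction, not an example from the book), assuming a saturating within-patch gain function; the parameter values are arbitrary:

```python
import numpy as np

# Assumed gain function: g(t) = G * (1 - exp(-t / tau)), i.e. a patch of
# total value G that depletes with time constant tau. Both parameters are
# illustrative choices, not values from Pirolli's book.

def rate_of_gain(t_within, t_between, G=10.0, tau=2.0):
    """Average rate of gain R = g(t_within) / (t_between + t_within)."""
    gain = G * (1.0 - np.exp(-t_within / tau))
    return gain / (t_between + t_within)

t = np.linspace(0.01, 20.0, 2000)
for t_between in (1.0, 5.0):
    rates = rate_of_gain(t, t_between)
    t_opt = t[np.argmax(rates)]  # residence time that maximizes R
    print(f"between-patch time {t_between:.1f} -> optimal within-patch time {t_opt:.2f}")
```

The theorem’s qualitative prediction falls out directly: the longer it takes to reach a new patch, the longer a forager should stay in the current one before moving on.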

Twitter, Seeking and Information Foraging

Now I’m trying to tie these two topics together (there might even be a common basis for the relationship between these topics in the literature).

Seeking and Information Foraging: It seems that the wanting and liking systems relate to the within-patch and between-patch states of Information Foraging Theory. If lab rats push a button in an experiment, it seems that the electrodes modify their liking system in a way that prevents them from engaging within a patch, and puts them into a between-patch state. When animals are sent into a frenzy by being given tiny bits of food, within-patch time is minimized, sending them right back into a between-patch state. In this scenario, animals spend relatively more time searching than actually consuming food, effectively reducing their overall gain in comparison to scenarios where they are confronted with large bits of food (higher-gain patches).

Finally, how this all might relate to Twitter: I argue that Twitter’s restriction of messages to 140 characters (disregarding links that might be posted in Twitter messages) artificially reduces within-patch time. The gain of a patch (a tweet) might still vary, but the gain no longer depends on within-patch time. The average “within-patch gain function” (right side of the above illustration) seems to be constant! It always takes approximately the same amount of time to read a tweet (assuming there are no URLs in it), and reading “longer” does not increase the gain.
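As a toy illustration of this point (again my own construction, with arbitrary numbers rather than measured ones), consider a forager that alternates between searching for and consuming patches. With tweet-sized patches, within-patch time and gain are small and roughly fixed, so the search phase dominates the time budget:

```python
def forage(total_time, t_between, t_within, gain_per_patch):
    """Alternate search (between-patch) and consumption (within-patch)
    until the time budget runs out; return total gain and patch count."""
    t, gain, patches = 0.0, 0.0, 0
    while t + t_between + t_within <= total_time:
        t += t_between + t_within
        gain += gain_per_patch
        patches += 1
    return gain, patches

# Tweets: tiny, roughly constant reading time and a small fixed gain.
tweet_gain, tweets_read = forage(60.0, t_between=0.5, t_within=0.2, gain_per_patch=0.1)
# Longer documents: more reading time, proportionally less searching.
doc_gain, docs_read = forage(60.0, t_between=2.0, t_within=10.0, gain_per_patch=6.0)

print(f"tweets:    {tweets_read} patches consumed, total gain {tweet_gain:.1f}")
print(f"documents: {docs_read} patches consumed, total gain {doc_gain:.1f}")
```

With these (made-up) numbers, the tweet forager completes many more search/consume cycles but spends over two thirds of its budget in the between-patch state.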

In addition, Twitter’s particular user interface (a chronological listing of tweets) seems to be weak in terms of information scent: judging whether a tweet is relevant requires a forager to read the entire tweet, regardless of whether the patch (the tweet) contains a gain (an informative value) or not. This seems to yield a situation where, in systems such as Twitter, users quickly alternate between within-patch (reading a tweet) and between-patch (finding the next tweet to read) states. The reason might be the following: when a forager has exhausted a patch, he switches back to a between-patch state. However, due to a deprivation of information scent in the Twitter user interface, the user is largely helpless in the between-patch state (he does not know where to search next, other than reading the next tweet). This leaves users with a desire to change back to within-patch states as quickly as possible (only reading entire tweets can help to assess relevance), thereby potentially adopting chaotic and/or irrational strategies.

The above observation might also explain the frenzy that animals are sent into when offered tiny bits of food while being deprived of “scent” to inform their between-patch phases. The hypothesis would be that the frenzy would not occur if the animals were offered clues about where the next patch is to be expected, and what gain they could get from exhausting it (of course, behavioural biologists might have already studied this question).

Returning to Twitter, it seems that the same effect that sends animals into a frenzy could be at play on Twitter, where users – due to a combination of small within-patch times and weak information scent – engage in uninformed foraging of artificially small information patches.

This of course is the provocative conclusion of the Slate Magazine article. What I found interesting is how these three topics – seeking, information foraging and Twitter – nicely tie into each other on a theoretical level.

I still have not figured out what a reduction of within-patch times alone means from an Information Foraging Theory perspective – I’d like to figure that out at some point in the future.





List of Social Tagging Datasets

22 07 2009

I’m currently compiling a list of social tagging datasets for our current / future research, but since it might be of interest to others as well I’m sharing it. Here’s the link:

A non-exhaustive list of Social Tagging Datasets that are available for research

If you are aware of other social tagging datasets available for research, please let me know by leaving a comment to this post.





Dynamic presentation adaptation based on user intent classification on Flickr

10 06 2009

Mathias just pointed me to a recent demonstration of their current research on dynamically adapting the user interface of an image-sharing system – in their case Flickr.com – based on a classification of user intent.

The problem Mathias and his student, Christoph Kofler, are addressing is interesting, and can be described in the following way.

The basic underlying assumption is that in addition to learning more about the content of image-sharing systems, we also need to know more about the users’ intent in order to improve search.

A majority of research on image-sharing systems such as Flickr has focused on leveraging and improving the utilization of content-specific (e.g. MPEG-7) as well as user-generated (e.g. tags) metadata to better describe the content of photos and images. This allows systems to better reflect what a given image is about. However, when searching for content, the intent of users comes into play. Depending on the user’s search intent, only a subset of resources might be relevant. In other words, a successful search result can be considered one that successfully matches users’ intent with the content available in image-sharing systems.

I’d like to give an example of a particular search-intent category in image-sharing systems where recognizing user intent would be useful:

A user who wants to download an image for later commercial use (e.g. to include it in marketing material) might only want to retrieve items that specifically allow him to do that. While this licensing data is in principle available in image-sharing systems in the form of metadata (e.g. Creative Commons licenses), these systems need the ability to capture and approximate users’ intent in order to map it onto relevant resources. This is where existing search in image-sharing systems has enormous potential for improvement.

Mathias and his student are interested in the different possible categories of search intent in image-sharing systems, and in how these categories can help to inform search. They are currently developing a taxonomy of search intent in image-sharing systems, and they have already developed an early prototype that aims to demonstrate the potential of learning about user intent and using this knowledge to adapt the presentation of search results. While the prototype appears to be at an early stage, using simple rule-based mechanisms, I think it excellently demonstrates the difficulty and importance of learning more about users’ search intent in image-sharing systems.
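To give a flavor of what a simple rule-based mechanism for this could look like, here is a minimal sketch. The intent categories and keyword rules below are my own guesses for illustration; they are not taken from the actual prototype:

```python
# Hypothetical intent categories and keyword rules; an actual system would
# use a proper taxonomy and likely a learned classifier instead.
RULES = {
    "commercial_reuse": ["stock", "license", "commercial", "royalty"],
    "knowledge":        ["diagram", "anatomy", "map", "how"],
    "entertainment":    ["funny", "cute", "wallpaper"],
}

def classify_intent(query: str) -> str:
    """Return the first intent category whose keywords appear in the query."""
    q = query.lower()
    for intent, keywords in RULES.items():
        if any(kw in q for kw in keywords):
            return intent
    return "undetermined"

# An adaptive UI could then, for example, pre-filter results to suitably
# licensed images whenever a commercial-reuse intent is detected.
print(classify_intent("stock photo of a bridge"))  # commercial_reuse
print(classify_intent("funny cat pictures"))       # entertainment
```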

Dynamic presentation adaptation based on user intent classification on Flickr

Other work on user intent in image-sharing systems focuses on, for example, tag intent aiming to study the different reasons why users tag (Ames and Naaman 2007).

Click here to watch the 6 min demonstration video.

References:

Ames, M. and Naaman, M. 2007. Why we tag: motivations for annotation in mobile and online media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (San Jose, California, USA, April 28 – May 03, 2007). CHI ’07. ACM, New York, NY, 971-980. DOI= http://doi.acm.org/10.1145/1240624.1240772





A research study on “Why users tag”

1 06 2009

I’d kindly like to request your help in a current study I am working on:

Please consider participating in the brief survey on “Why do users tag?”. The entire survey should take you no longer than 1-2 minutes to complete (2 questions only!).

More background on this research can be found in a previous post. Your help would be greatly appreciated.

Please click here to take the survey.





Annotating Textual Resources on the Web with Human Intentions

23 04 2009

I have recently written about current research in our group attempting to annotate textual resources with intentions. A preliminary report on a study that focused on annotating transcripts of speeches given by candidates in the 2008 US presidential election has recently been accepted at the Hypertext’09 conference as a 2-page poster.

Intent Tag Clouds vs. Traditional Tag Clouds, based on transcripts of political speeches given by the two US presidential candidates in 2008.

The image above aims to give an impression of what we were trying to achieve; further details can be found in the conference proceedings:

M. Strohmaier, M. Kroell and C. Körner, Automatically Annotating Textual Resources with Human Intentions, Hypertext 2009, 20th ACM Conference on Hypertext and Hypermedia, Torino, Italy, 2009. Poster (download 2-page pdf)

We are currently working on an expanded version of it.

In related news, I’m also happy that one of Austria’s largest nationwide newspapers – DerStandard – has covered aspects of my research in their “Science” section.

DerStandard APR22 2009

Read the (German!) article online or in print.





Why do users tag? Detecting user motivation in tagging systems

5 04 2009

On the “social web” or “web 2.0”, where user participation is entirely voluntary, user motivation has been identified as a key factor in the mechanisms contributing to the success of tagging systems. Web researchers have been trying for a couple of years now to identify the reasons why tagging systems work, evident, for example, in the organization of a panel at CHI 2006 and a number of conferences and workshops on this topic.

Recent research on tagging motivation suggests that it is a rather complex construct. However, there seems to be emerging consensus that a distinction between at least two categories of tagging motivation appears useful: Categorization vs. Description. (Update May 30 2009: I was able to trace back the earliest mention of this distinction to a blog post by Tom Coates from 2005).

UPDATE March 15 2010 – More results can be found in: M. Strohmaier, C. Koerner, R. Kern, Why do Users Tag? Detecting Users’ Motivation for Tagging in Social Tagging Systems, 4th International AAAI Conference on Weblogs and Social Media (ICWSM2010), Washington, DC, USA, May 23-26, 2010. (Download pdf)

UPDATE April 23 2010 – Even more results in: C. Körner, R. Kern, H.P. Grahsl, M. Strohmaier, Of Categorizers and Describers: An Evaluation of Quantitative Measures for Tagging Motivation, 21st ACM SIGWEB Conference on Hypertext and Hypermedia (HT2010), Toronto, Canada, June 13-16, ACM, 2010. (download pdf)

Categorization vs. Description

Categorization: Users who are motivated by Categorization engage in tagging because they want to construct and maintain a navigational aid to the resources (URLs, photos, etc.) being tagged. This typically implies a limited set of tags (or categories) that is rather stable. Resources are assigned to tags whenever they share some common characteristic important to the mental model of the user (e.g. ‘family photos’, ‘trip to Vienna’ or ‘favorite list of URLs’). Because the tags assigned are very close to the mental models of users, they can act as suitable facilitators for navigation and browsing.

Description: On the other hand, users who are motivated by Description engage in tagging because they want to accurately and precisely describe the resources being tagged. This typically implies an open set of tags, with a rather dynamic and unlimited tag vocabulary. The goal of tagging is to identify those tags that match the resource best. Because the tags assigned are very close to the content of the resources, they can act as suitable facilitators for description and searching.

Related Research: This basic distinction can be identified in the work of a number of researchers who have made similar distinctions: Xu et al 2006 (“Context-based” vs. “Content-based”), Golder and Huberman 2006 (“Refining Categories” vs. “Identifying what it is/is about”), Marlow et al 2006 (“Future retrieval” vs. “Contribution and Sharing”), Ames and Naaman 2007 (“Organization” vs. “Communication”) and Heckner et al 2008 (“Personal Information Management” vs. “Sharing”) – just to give a few examples – all represent recent research aiming to demystify and conceptualize the reasons why users participate in tagging systems.

Why should we care?

“In the wild”, user behavior in social tagging systems is often a combination of both. So why is this distinction interesting? I believe that this distinction is interesting because it has a number of important implications, including but not limited to:

  1. Tag Recommender Systems: Assuming that a user is a “Categorizer”, she will more likely reject tags that are recommended from a larger user population because she is primarily interested in constructing and maintaining “her” taxonomy, using “her” individual tag vocabulary.
  2. Search: Tags produced by “Describers” are more likely to be helpful for search and retrieval because they focus on the content of resources, whereas tags produced by “Categorizers” focus on their mental model. Tags by Categorizers are thus more subjective, whereas tags by Describers are more objective.
  3. Knowledge Acquisition: Folksonomies, i.e. the conceptual structures that can be inferred from the tripartite graph of tagging systems, are likely to be influenced by the “mixture” or dominance of Categorizers and Describers in the system. A tagging system primarily populated by Categorizers is likely to give rise to a completely different set of possible folksonomies than a tagging system primarily populated by Describers. More importantly, it is plausible to assume that even within a given tagging system, tagging motivation varies among users.

This brings me to a small research project I am currently working on: Assuming that a) this distinction in user motivation exists in real-world tagging systems and b) it has important implications, it would be interesting to measure and detect the degree to which users are Categorizers or Describers. Due to the latent nature of “tagging motivation”, past research has mostly focused on questionnaire- or sample-based studies of motivation, asking users how they themselves interpret their tagging behavior. While this early work has provided fundamental insights into tagging motivation and contributed significantly to theory building, as a research community we currently lack robust metrics and automatic methods to detect tagging motivation in tagging systems without direct user interaction.

Detecting Tagging Motivation

I think there are several approaches to detecting whether users are Categorizers or Describers without the need to ask them directly. One approach would focus on analyzing the semantics of tags, using WordNet and other knowledge bases to determine the meaning of tags and infer user motivation. This would require parsing text and performing linguistic analysis, which I believe is difficult in the presence of typos, named entities, combined tags (“toread”) and other issues. Another approach would focus on comparing the tag vocabulary of users to the tag vocabulary of “the crowd”: users who share a large amount of common tag vocabulary might be Describers, whereas users with a highly individual vocabulary might be Categorizers. Again there are problems: tagging systems that accommodate users with different language backgrounds might lead such an approach to detect user motivation based on false premises.

So what would be a more robust way of detecting user motivation? I am currently interested in developing a model that is agnostic to language, semantics and social context, focusing solely on statistical properties of individual tagging histories. This way, a determination of user motivation could be made without linguistic analysis and without acquiring complete folksonomies from tagging systems, based on a single user’s tagging log. Let me explain what I mean. I hypothesize that the following statistical properties of a user’s tagging history allow for interesting analyses (a small sketch of how they might be computed follows the list):

  1. Tag vocabulary size over time: Over time, an ideal Categorizer’s tag vocabulary would reach a plateau, because there is only a limited set of categories that are of interest to him. An ideal Describer is not limited in terms of her tagging vocabulary. This should be easy to observe.
  2. Tag entropy over time: A Categorizer has an incentive to maintain high entropy (or “information value”) in his tag cloud. Tags need to be as discriminative as possible in order to serve as a navigational aid; otherwise tags would be of little use in browsing. A Describer has no interest in maintaining high entropy.
  3. Percentage of tag orphans over time: Categorizers have an interest in a low ratio of tag orphans (tags that are used only once), because lots of orphans would inhibit the usage of their tags for browsing. Describers naturally produce lots of orphans when trying to find the most descriptive and complete set of tags for resources.
  4. Tag overlap: While a Describer would be perfectly fine assigning two or more synonymous tags to the same resource (he might not know which term to use when searching for this resource at a later point), a Categorizer has no interest in creating two categories that contain the exact same set of resources. This would again inhibit the usage of tags for browsing, a Categorizer’s main motivation for tagging.
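Here is a minimal sketch of how these four indicators might be computed from a single user’s tagging log. The log format (a chronological list of tag sets, one per tagged resource) and the toy data are assumptions of mine for illustration, not the format of any particular dataset:

```python
import math
from collections import Counter
from itertools import combinations

# Toy tagging log: one set of tags per tagged resource, in chronological order.
log = [{"web", "research"}, {"web", "toread"}, {"research", "tagging"},
       {"web", "research"}, {"photos"}]

# 1. Tag vocabulary size over time: cumulative number of distinct tags.
vocab, growth = set(), []
for tags in log:
    vocab |= tags
    growth.append(len(vocab))

# 2. Tag entropy: Shannon entropy of the user's tag frequency distribution.
counts = Counter(tag for tags in log for tag in tags)
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

# 3. Orphan rate: fraction of tags used exactly once.
orphan_rate = sum(1 for c in counts.values() if c == 1) / len(counts)

# 4. Tag overlap: pairs of tags labeling exactly the same set of resources.
resources_by_tag = {t: {i for i, tags in enumerate(log) if t in tags} for t in counts}
overlap = sum(1 for a, b in combinations(resources_by_tag, 2)
              if resources_by_tag[a] == resources_by_tag[b])

print(f"vocabulary growth: {growth}")
print(f"entropy: {entropy:.3f} bits, orphan rate: {orphan_rate:.2f}, overlapping pairs: {overlap}")
```

Tracking how these quantities evolve as the log grows (rather than only their final values) is what should separate Categorizers from Describers.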

Preliminary Investigations

I have done some preliminary investigations to explore whether these statistical properties of users’ tagging history can actually serve as indicators of tagging motivation. Here are my preliminary results:


Growth of tag vocabulary in different tagging systems

The diagram above shows the growth of the tag vocabulary of different taggers. The uppermost red line represents the tagging behavior of an almost “ideal” Describer, in this case tags produced by the ESP game, which represent valid descriptions of the resources they are assigned to. The lowermost green line represents the tagging behavior of an almost “ideal” Categorizer: tags (in this case, a number of photo sets) produced by a Flickr user who categorized photos into a limited set of categories (>100 sets). All other lines represent the tagging behavior of real users on different tagging platforms (Bibsonomy, Delicious, Flickr). It is worth noting that all other data lies between the two identified extremes.

In the following, I will discuss the suitability of the tag entropy of single users (as opposed to the work by Chi and Mytkowicz 2008, which focuses on large sets of users) as an indicator for detecting tagging motivation:

Change of tag entropy over time

In this diagram, we can see that while our “ideal” Categorizer and our “ideal” Describer almost mark the extremes, there are some users “outdoing” them (e.g. “u5 bibsonomy bookmarks” has even lower entropy than the tags acquired from the “ideal” Describer, the ESP game). Entropy thus seems to be – to some extent – a useful indicator of tagging motivation.

Next, I’ll discuss data comparing the rate of tag orphans in different datasets:

Rate of Tag Orphans over time

As in the previous diagram, extreme behavior represents a good (but not optimal) upper and lower bound for real tagging behavior. While the “ideal” Categorizer (Flickr sets, green line at the bottom) has a very small number of tag orphans, the “ideal” Describer (ESP game data, red line at the top) has a much higher tag orphan rate.

If we can identify the functions of extreme user motivation (“ideal” Categorizers and Describers), and position real user motivation between those extremes, we might be able to come up with scores indicative of user motivation in tagging systems – e.g. a user might be 80% Categorizer and 20% Describer. Having such a model could help explore the implications of the different user motivations outlined above. Together with students (in particular Christian Körner, Hans-Peter Grahsl and Roman Kern), I am working on constructing and validating such a model, which we are aiming to submit to a conference this year.
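As a toy sketch of this scoring idea (my own simplification, with made-up numbers): given one of the measures above for the two “ideal” extremes and for a real user, the user’s position between the extremes could be expressed as a simple linear score:

```python
def describer_score(user_value, categorizer_bound, describer_bound):
    """Linearly position a user's measure (e.g. orphan rate) between the
    'ideal' Categorizer and 'ideal' Describer values, clipped to [0, 1]."""
    span = describer_bound - categorizer_bound
    score = (user_value - categorizer_bound) / span
    return min(max(score, 0.0), 1.0)

# Made-up orphan rates: ideal Categorizer 0.05, ideal Describer 0.55.
score = describer_score(user_value=0.15, categorizer_bound=0.05, describer_bound=0.55)
print(f"{1 - score:.0%} Categorizer, {score:.0%} Describer")  # 80% Categorizer, 20% Describer
```

A validated model would of course combine several such measures and calibrated bounds rather than rely on a single linear interpolation.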

UPDATE March 15 2010: More results can be found in the following publication: M. Strohmaier, C. Koerner, R. Kern, Why do Users Tag? Detecting Users’ Motivation for Tagging in Social Tagging Systems, 4th International AAAI Conference on Weblogs and Social Media (ICWSM2010), Washington, DC, USA, May 23-26, 2010. (Download pdf)

UPDATE April 23 2010 – Even more results in: C. Körner, R. Kern, H.P. Grahsl, M. Strohmaier, Of Categorizers and Describers: An Evaluation of Quantitative Measures for Tagging Motivation, 21st ACM SIGWEB Conference on Hypertext and Hypermedia (HT2010), Toronto, Canada, June 13-16, ACM, 2010. (download pdf)

References:

Xu, Z., Fu, Y., Mao, J. and Su, D. Towards the Semantic Web: Collaborative Tag Suggestions. Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland, 2006.

Bischoff, K., Firan, C.S., Nejdl, W. and Paiu, R. Can all tags be used for search? CIKM ’08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, 193–202, ACM, New York, NY, USA, 2008.

Marlow, C., Naaman, M., boyd, d. and Davis, M. HT06, tagging paper, taxonomy, Flickr, academic article, to read. HYPERTEXT ’06: Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, 31–40, ACM, New York, NY, USA, 2006.

Golder, S.A. and Huberman, B.A. Usage patterns of collaborative tagging systems. Journal of Information Science, 32(2):198–208, 2006.

Heckner, M., Heilemann, M. and Wolff, C. Personal Information Management vs. Resource Sharing: Towards a Model of Information Behaviour in Social Tagging Systems. Int’l AAAI Conference on Weblogs and Social Media (ICWSM), San Jose, CA, USA, 2009.

Ames, M. and Naaman, M. Why we tag: motivations for annotation in mobile and online media. CHI ’07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 971–980, ACM, New York, NY, USA, 2007.

Chi, E.H. and Mytkowicz, T. Understanding the efficiency of social tagging systems using information theory. HT ’08: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, 81–88, ACM, New York, NY, USA, 2008.
