Monday, August 28, 2006


Triple stores containing specimens, sequences, images, literature, etc. are all very well, but there is a lot of information that is not captured by such a system. For human users (as opposed to dumb computers), often a simple summary is more informative than a set of images and a map, especially if that summary mentions something interesting, such as why the "Google ant" was so named, or that the trap-jaw ant Odontomachus bauri has incredibly fast jaws (doi:10.1073/pnas.0604290103). There is also a lot of information that may one day be semantically encoded, but for now will only be captured as text (such as extensive commentaries on web sites).

Lastly, there is the issue of getting people involved. One of the striking things about biodiversity web sites is the lack of community involvement. My question is "why is this the case?" Here are some thoughts:
  1. If feedback consists of sending an email to the person/organisation running the site, then there's little incentive to get involved. Why bother typing a detailed summary of why a particular piece of information is wrong, only to have that essay disappear into a black hole?

  2. Annotations can be valuable, but there is an issue of trust. Why invest effort in a site that may be a short-lived toy? There are so many sites competing for attention.

One way to foster involvement might be through a Wiki, whereby anybody can contribute to writing a page about each organism. However, biodiversity Wikis such as Wikispecies have, in my opinion, been a spectacular failure, as evidenced by the number of missing pages or stubs. Perhaps part of the reason is the lack of content, which could be addressed by pre-populating each page with basic information from a database (such as name, any specimens, images, literature, etc.). In other words, each page would start with the level of detail of an iSpecies report (for background on iSpecies, visit my iSpecies blog). As Kevin Kelly commented at the recent Google Sci Foo camp, people are much more likely to edit existing content than to create content de novo.

Wikis are all very well, but my major worry is the potential to lose information. Here's the problem. Suppose I generate a Wiki page for an ant, and include information on its distribution. What happens if the underlying distribution data changes? Because the page is now a static copy in Wiki form, it will be out of date. Furthermore, I'm not sure I want users editing distribution records -- these should really be edited at the level of the source database, so that the changes propagate to other users of those data.

One possibility is to use custom tags in the Wiki. When the HTML page is generated, Wiki tags are rendered in the usual way, but the custom tags are replaced by the results of a database call (for example, a SPARQL query). Hence, something like %DISTRIBUTION would be replaced by a Google map of the specimens for that taxon. This would mean that the distribution map would always reflect the current database, and the user (assuming they don't delete the %DISTRIBUTION tag itself) won't be able to alter that information. Of course, this means we need some mechanism for users to inform the curators of the source data of any potential errors. This is particularly important if we pre-populate the Wiki page with information that may be incorrect (such as images harvested from a search engine).
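To make the custom tag idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions of my own: the tag syntax (a percent sign followed by an upper-case name), the function names, and the stub handler are all hypothetical, and the stub simply stands in for the real database call and map rendering.

```python
import html
import re
from typing import Callable, Dict

# Assumed tag syntax: "%" followed by an upper-case name, e.g. %DISTRIBUTION.
TAG_PATTERN = re.compile(r"%([A-Z_]+)")


def render_custom_tags(
    wiki_text: str,
    taxon: str,
    handlers: Dict[str, Callable[[str], str]],
) -> str:
    """Replace each known custom tag with freshly generated content.

    Tags without a registered handler are left exactly as written, so
    ordinary text containing a percent sign is not disturbed.
    """

    def substitute(match: "re.Match[str]") -> str:
        handler = handlers.get(match.group(1))
        if handler is None:
            return match.group(0)
        return handler(taxon)

    return TAG_PATTERN.sub(substitute, wiki_text)


def distribution_stub(taxon: str) -> str:
    """Hypothetical handler: a real one would query the triple store
    (for example with SPARQL) and emit a map of the specimens."""
    return '<div class="map" data-taxon="%s"></div>' % html.escape(taxon, quote=True)


if __name__ == "__main__":
    page = "This ant has very fast jaws.\n\n%DISTRIBUTION\n"
    print(render_custom_tags(page, "Odontomachus bauri",
                             {"DISTRIBUTION": distribution_stub}))
```

Because the handler runs every time the page is rendered, the stored Wiki text only ever contains the tag, never a copy of the data, which is the property I'm after.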

We could also help things by encouraging standard ways of linking to other resources, or storing data. For example, say a user edits a page and adds a citation to a paper that isn't in the underlying triple store. Ideally we would get that new paper into the triple store (rather than have it languish in the Wiki text). There are some ways to do this, such as extracting metadata from DOIs, and using local links [need to think about this].
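As a rough illustration of the first step, the sketch below scans Wiki text for strings that look like DOIs and reports those not already known to the triple store. The regular expression is a common heuristic rather than an official DOI grammar, and the function name and the idea of passing in a plain list of known DOIs are assumptions made for this example; fetching the metadata for each new DOI is left out.

```python
import re
from typing import Iterable, List

# Heuristic pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# running up to the next whitespace, quote, or angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_new_dois(wiki_text: str, known_dois: Iterable[str]) -> List[str]:
    """Return DOIs cited in wiki_text that are not in known_dois.

    DOIs are compared case-insensitively and returned in lower case,
    in order of first appearance, without duplicates.
    """
    known = {doi.lower() for doi in known_dois}
    found: List[str] = []
    for match in DOI_PATTERN.finditer(wiki_text):
        # Drop trailing punctuation picked up from the surrounding sentence.
        doi = match.group(0).rstrip(".,;:)]").lower()
        if doi not in known and doi not in found:
            found.append(doi)
    return found


if __name__ == "__main__":
    text = "The trap-jaw ant has incredibly fast jaws (doi:10.1073/pnas.0604290103)."
    print(find_new_dois(text, known_dois=[]))
```

Each DOI returned this way is a candidate for a metadata lookup and insertion into the triple store, so that the citation doesn't languish in the Wiki text.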

Likewise, with images, if we have a convention that images get posted to, say, Flickr, then we have a means for storing metadata about those images directly in our triple store.

These ideas have partly come out of conversations with Rebecca Shapley at Google, and Dave Thau at the California Academy of Sciences.

