Friday, June 30, 2006


Displaying a name by itself isn't very useful, so I'm exploring adding a classification to the ant demo. The question is which one? As a trial I've decided to use ITIS, based on the October 12, 2005 dump used by uBio (you can see it here). There are some 15,000 ant taxa in ITIS.

The plan is to retrieve the classification by spidering the uBio site, starting with the RDF for Formicidae in the ITIS classification. By following the <ubio:hasChild> tags, we can traverse the complete tree.
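The traversal described above can be sketched roughly as follows. This is only an illustration, not the code behind the demo: the ubio namespace URI, the assumption that each <ubio:hasChild> carries an rdf:resource attribute, the `fetch` callable, and the identifiers in the offline example are all made up for the sketch.

```python
import xml.etree.ElementTree as ET
from collections import deque

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace URI: the real uBio vocabulary URI is not given in this post.
UBIO_NS = "http://example.org/ubio#"


def child_ids(rdf_xml):
    """Return identifiers referenced by <ubio:hasChild rdf:resource="..."/> (assumed form)."""
    root = ET.fromstring(rdf_xml)
    tag = "{%s}hasChild" % UBIO_NS
    attr = "{%s}resource" % RDF_NS
    return [el.get(attr) for el in root.iter(tag) if el.get(attr)]


def spider(root_id, fetch):
    """Breadth-first traversal; returns a dict mapping each taxon id to its child ids."""
    tree = {}
    queue = deque([root_id])
    while queue:
        node = queue.popleft()
        if node in tree:  # already visited
            continue
        tree[node] = child_ids(fetch(node))
        queue.extend(tree[node])
    return tree


def make_rdf(children):
    """Build a tiny RDF document for the offline example below."""
    items = "".join('<ubio:hasChild rdf:resource="%s"/>' % c for c in children)
    return ('<rdf:RDF xmlns:rdf="%s" xmlns:ubio="%s"><rdf:Description>%s'
            '</rdf:Description></rdf:RDF>' % (RDF_NS, UBIO_NS, items))


# Stand-in for HTTP requests, using made-up identifiers.
FAKE_SITE = {
    "family": make_rdf(["genus-a", "genus-b"]),
    "genus-a": make_rdf(["species-a1"]),
    "genus-b": make_rdf([]),
    "species-a1": make_rdf([]),
}

classification = spider("family", FAKE_SITE.__getitem__)
print(classification)
```

Passing `fetch` in as a parameter keeps the traversal separate from however the RDF is actually retrieved, so the same loop can be tried offline before pointing it at a real site.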

One issue is getting my head around uBio's name structure. I use their FindIT SOAP service to get LSIDs for names from NameBank. FindIT returns canonical name ids, but ClassificationBank uses different LSIDs for the names (the "display name"). To give a concrete example, using FindIT to search on "Melissotarsus insularis" yields one LSID, whereas in the ITIS classification the ClassificationBank record links to a different one.

Wednesday, June 28, 2006

TreeBASE rocks

I gave a talk today ("Dude, where's my tree?") at the Evolution 2006 meeting at Stony Brook. It was intended as a somewhat tongue-in-cheek overview of some issues concerning TreeBASE, and broader areas of biodiversity informatics, making use of ants as an example (see my SemAnt project).
Michael Donoghue took me aside after the talk and made some interesting points. He was a little tired, understandably, of hearing that "TreeBASE sucks" (e.g., my CIPRES talk), and felt that my constantly saying this was counterproductive. It could also lead to people not putting their data in TreeBASE because they'd heard that it "sucks".
There is an element of social responsibility here, I guess. I resolutely avoid politics. I don't mean this in a pejorative sense, it's just that I don't have the temperament or skill for it, unlike Michael himself (Lee Belbin is another person in this area who strikes me as a very skilled manager).
Now, my talk was intended to be fun, and I was taking the piss out of myself as much as anything. I also think the things we criticise are the things we value the most. But that said, let me make it clear that TreeBASE is very important. As editor of Systematic Biology I've made authors submit data to it. I have a lot of respect for the work Michael, Bill Piel, and Mike Sanderson put into TreeBASE. If you have phylogenetic data, submit it to TreeBASE. It's the best we have. It's just that, well, as a community we could do better.

Taxonomic names, metadata, and the Semantic Web

My paper "Taxonomic names, metadata, and the Semantic Web" has appeared in Biodiversity Informatics.

Life Science Identifiers (LSIDs) offer an attractive solution to the problem of globally unique identifiers for digital objects in biology. However, I suggest that in the context of taxonomic names, the most compelling benefit of adopting these identifiers comes from the metadata associated with each LSID. By using existing vocabularies wherever possible, and using a simple vocabulary for taxonomy-specific concepts we can quickly capture the essential information about a taxonomic name in the Resource Description Framework (RDF) format. This opens up the prospect of using technologies developed for the Semantic Web to add "taxonomic intelligence" to biodiversity databases. This essay explores some of these ideas in the context of providing a taxonomic framework for the phylogenetic database TreeBASE.

Sunday, June 25, 2006

Publications via iTunes - kewl!

Greg Restall has described how he put his papers into the iTunes music store. How cool is that! A nice demonstration of how RSS makes all sorts of interesting applications possible.

(Via All My Eye.)

Tuesday, June 20, 2006


Building 3Store3 3.0.14 on Mac OS X is an absolute pain. It's pretty clear the developers haven't done so, because there are a slew of dependencies that aren't mentioned. It builds fine on Linux, so it's a case where the developers haven't realised that the assumptions they make on Linux don't always hold on other platforms (such as my beloved Mac).

So, what happens after we type ./configure?

Firstly, we couldn't find rasqal (part of Redland)

configure: error: Package requirements (rasqal >= 0.9.11) were not met:
Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

OK, so edit your .bash_profile to contain these lines:


(/usr/local/lib/pkgconfig is where Redland package config files are stored, /opt/local/lib/pkgconfig is used by Darwin Ports, see below.)

Next, we don't have glib, so it's off to Darwin Ports, which packages a lot of Open Source tools for Mac OS X. Install it, then at the Terminal type:

sudo /opt/local/bin/port install glib2-devel

(this may take a while...). Make sure you have /opt/local/lib/pkgconfig in your PKG_CONFIG_PATH variable (see above). Now, we get

configure: error: Cannot find Berkeley DB library version 4

Sigh. So, we grab Berkeley DB 4, cd to the directory, and

cd build_unix
../dist/configure
make
sudo make install

So we're there, right? Not so fast, did you think this was meant to be easy? 3Store3 assumes Berkeley DB4 is somewhere it ain't.

sudo mkdir /usr/include/db4
cd /usr/include/db4
sudo ln -s /usr/local/BerkeleyDB.4.4/include/db.h

Then, back in the 3Store3 source directory:

./configure LDFLAGS=-L/usr/local/BerkeleyDB.4.4/lib

Are we there yet?! Yes.

sudo make install

To set up the triple store:


Then $£@!#, I discover I need MySQL version 4.1.x (I'm running 4.0.21). To be fair, the documentation (what's that?) states this pretty clearly. OK, so we move the data safely out of the way, grab 4.1, and install it. I was running CompleteMySQL, so I need to remove that from my path, otherwise we get the wrong mysql-config when rebuilding 3Store3.

So, why did I do all this? In a word, SPARQL. Hope it's worth it...

Monday, June 19, 2006

Donat Agosti enters the blogosphere

Donat Agosti has three blogs, biosyscontext, biodivcontext, and antbase (the latter is not populated yet). Not one to mince words, Donat has some pithy things to say about the E. O. Wilson Biodiversity Foundation, which looks like another example of the triumph of hype over craft (for another approach see my iSpecies).

Donat is also rather critical of the authors of a recent paper on ant phylogeny (doi:10.1126/science.1124891). Nobody can accuse ants of being dull!

Monday, June 12, 2006

Using links to rank specimens, sequences, etc.

The "abundance" problem arises when a search returns too many hits. How does the user decide which ones are relevant (other than wading through the list)? The classical example is web search, where potentially millions of web pages may be returned. So, the challenge is to rank the results so that the user need look at only the top 10 or so (and is confident that what she is after is in the top 10).
For biodiversity searches this is also relevant, especially if a search may return hundreds of specimens. How do we rank these (assuming some are more interesting than others)? Well, one option is to do what Google does: rank things based on links. In the case of specimens, we could use links to sequences, images, and publications as evidence that a specimen "matters" (i.e., people have done work on it, and therefore it is likely to be of interest).
I'd suggested something like this for phylogenies over on iPhylo, but it dawned on me today that the same idea might make sense for the SemAnt project. As an aside, Ben Szekely (another participant at TDWG GUID2) and Elias Torres have a cool paper on extending PageRank to tags.
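As a rough illustration of the idea, here is a minimal sketch that ranks specimens by how many distinct things link to them. It is plain link counting, which is far simpler than PageRank, and the specimen and object identifiers are made up.

```python
from collections import Counter


def rank_by_links(links):
    """Rank specimens by their number of distinct linked objects.

    links: iterable of (specimen, linked_object) pairs, where the linked
    object might be a sequence, an image, or a publication.
    Returns a list of (specimen, link_count), most linked first.
    """
    distinct = set(links)  # ignore duplicate links
    counts = Counter(specimen for specimen, _ in distinct)
    # Sort by descending count, then by name so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up example data.
LINKS = [
    ("specimen-a", "seq-1"),
    ("specimen-a", "image-1"),
    ("specimen-a", "paper-1"),
    ("specimen-b", "seq-2"),
    ("specimen-c", "seq-3"),
    ("specimen-c", "image-2"),
    ("specimen-a", "seq-1"),  # duplicate, counted once
]

ranked = rank_by_links(LINKS)
print(ranked)
```

A fuller version might weight links by type (a publication arguably counts for more than an image), but even raw counts would push the specimens people have worked on towards the top.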

Friday, June 09, 2006


Just in time for the second meeting of TDWG GUID, I've got a version of the ant triple store up and running here. There are some technical details posted on the TDWG GUID wiki.
Note that the first time you go to the site you get a warning from Google Maps that the API key is invalid. It is; there's an issue to do with the server having multiple IPs (a temporary situation while I restore another machine that got hacked). Just click on OK and it should work fine.
Update: Figured it out. In my haste I'd left some redundant code in the file index.php, and it contained the wrong Google API key. Simple really...duh!

Thursday, June 08, 2006

Discovering new things

So, why bother with all this effort to aggregate information into a triple store, I hear you ask? Well, the expectation is that we can learn things we previously didn't know.
For example, consider the ant specimen casent0500379, which is recorded as the source of several sequences in GenBank. These sequences have been obtained by different research groups, and published in different papers.

We can see this immediately if we construct a graph based on the RDF in the triple store. Each node in the graph is a subject. Two nodes, x and y are connected by an edge if there is a triple corresponding to either (?x, ?pred, ?y) or (?y, ?pred, ?x). The neighbourhood of a node is all the nodes adjacent to that node.
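The construction just described can be sketched in a few lines. This is an illustrative sketch rather than the code behind the demo: the predicate names are hypothetical, and for simplicity every subject and object becomes a node (a real store would skip literal values). The identifiers are the ones discussed in this post.

```python
from collections import defaultdict


def build_graph(triples):
    """Undirected graph: x and y are adjacent if (x, pred, y) or (y, pred, x) exists."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def neighbourhood(graph, node, radius=1):
    """All nodes within `radius` edges of `node`, excluding the node itself."""
    seen = {node}
    frontier = {node}
    for _ in range(radius):
        frontier = {n for f in frontier for n in graph.get(f, ())} - seen
        seen |= frontier
    return seen - {node}


# Predicate names ("source", "publishedIn") are hypothetical.
TRIPLES = [
    ("gi:87047406", "source", "casent0500379"),
    ("gi:89477179", "source", "casent0500379"),
    ("gi:87047406", "publishedIn", "pmid:16601190"),
    ("gi:89477179", "publishedIn", "pmid:16630727"),
]

graph = build_graph(TRIPLES)
print(sorted(neighbourhood(graph, "casent0500379")))
print(sorted(neighbourhood(graph, "casent0500379", radius=2)))
```

Setting `radius=2` corresponds to expanding the graph to the neighbours of the neighbours.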

This graph represents the neighbourhood of specimen casent0500379. I've expanded the graph by finding the neighbours of all the neighbours of casent0500379. Note that this graph shows that five sequences have been obtained from this specimen (identified by their "gi" numbers), and those five sequences are associated with three different taxonomic names in GenBank(!).

Two labs have sequenced the 28S ribosomal RNA gene from this specimen, one with accession number DQ353560 (gi:87047406), published by Moreau et al. in Science (pmid:16601190, doi:10.1126/science.1124891) and one with accession number DQ401020 (gi:89477179), published by Ouellette et al. in MPE (pmid:16630727, doi:10.1016/j.ympev.2006.03.017).

Now, the specimen has not been identified beyond being assigned to the genus Proceratium (which includes the Google ant). These two research groups have given it different informal names, hence GenBank doesn't realise that these two taxa are the same.

In fact there is a third taxon, "Proceratium sp. CSM-2006" for this same specimen, which has been sequenced for wingless.

This is a small example of where aggregating and visualising links between multiple data sources can tell us something we didn't know before.

My triple store misses the sixth sequence (yet another 28S rRNA sequence) from this specimen (AY325951, gi:34398469), because the specimen is recorded in the "isolate" field, not the specimen_voucher field:
     source          1..1233
                     /organism="Proceratium sp. CS-2003-1"
                     /mol_type="genomic DNA"
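One way to avoid missing records like this would be to look in more than one qualifier when harvesting specimen codes. A minimal sketch, assuming the source feature is available as text; the record and its values below are made up for illustration and are not a real GenBank entry.

```python
import re

# Matches qualifiers of the form /name="value" in a GenBank feature.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def specimen_codes(source_feature, fields=("specimen_voucher", "isolate")):
    """Return {field: value} for each qualifier that might hold a specimen code."""
    qualifiers = dict(QUALIFIER.findall(source_feature))
    return {field: qualifiers[field] for field in fields if field in qualifiers}


# Made-up record: the specimen code sits in "isolate", not "specimen_voucher".
EXAMPLE_FEATURE = '''
     source          1..100
                     /organism="Example sp."
                     /mol_type="genomic DNA"
                     /isolate="EXAMPLE-0001"
'''

print(specimen_codes(EXAMPLE_FEATURE))
```

Of course, "isolate" is a free-form field, so anything found there would still need checking against a list of known specimen codes.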

Tuesday, June 06, 2006


YeastHub is an interesting example of data integration in bioinformatics using RDF and a triple store. See Cheung et al. for details (doi:10.1093/bioinformatics/bti1026). To quote from the abstract:
As the semantic web technology is maturing and the need for life sciences data integration over the web is growing, it is important to explore how data integration needs can be addressed by the semantic web. The main problem that we face in data integration is a lack of widely-accepted standards for expressing the syntax and semantics of the data. We address this problem by exploring the use of semantic web technologies—including resource description framework (RDF), RDF site summary (RSS), relational-database-to-RDF mapping (D2RQ) and native RDF data repository—to represent, store and query both metadata and data across life sciences datasets.

Pity the actual site seems broken...

Monday, June 05, 2006

Taxonomic Markup and GUIDs

These notes were put together partly in response to discussions with Donat Agosti, but also as part of my experiments with storing ant data in a triple store. The idea is to mark up a taxonomic paper on ants with links to external sources of information (such as names, specimens, images, etc.).

If you want some related inspiration, see Leigh Dodds' post on the scientific paper as a modern palimpsest.

Firstly, I'm going to distinguish between mark up and metadata. I'm going to use mark up to mean tagging a manuscript to identify the relevant bits. For taxonomic literature this is largely after the fact, but for modern journals the article itself is represented in XML, which is then converted to a nice display using XSL. The BMC journals are a good example of this. What I'm interested in is how to mark up an article so that metadata about that article, its contents, and its relationships to other articles can be easily recovered (and output, presumably as RDF). Much of the mark up concerns the structure of a document, which in turn is important for presenting the document (say, in a web browser). I'm interested in just those bits relevant to metadata.

I'm assuming that we have identifiers for the items of interest (i.e., URIs such as URLs, DOIs, Handles, LSIDs). Ideally, there is a way to extract metadata about the object the identifier refers to. LSIDs provide an explicit mechanism for doing this, and CrossRef provides a service to return an XML summary of metadata held for a given DOI.
In my Taxonomic Search Engine I used the Hymenoptera Name Server's SEEK prototype to get metadata about a name; for example, one query returns an XML document about the "Google ant", Proceratium google. This document contains these identifiers:
  • HNS153344: a taxonomic concept
  • HNS195070: a taxonomic name
  • pubHNS153344: a publication

Only the name identifier (HNS195070) has metadata that can be easily accessed in XML, as far as I can see.

How to refer to identifiers
Given the uncertainty about resolving identifiers (i.e., will LSIDs take off?), one might adopt the convention used by PubMed and the BMC journals and just include the "local" part of the identifier (see example below), rather than the full-blown identifier. Otherwise, a document marked up with the complete identifier will be rendered out of date if the resolution mechanism changes. In English, just use Hymenoptera Name Server ids, not LSIDs.

Literature cited
Literature is perhaps the least problematic topic, because there are identifiers for many publications (e.g., DOIs), and tools for looking up identifiers for publications (e.g., CrossRef OpenURL, Google Scholar, PubMed, etc.).
BMC uses the following markup for a bibliography entry:

<bibl id="B21">
<p>Inter-familial relationships of the shorebirds
(Aves: Charadriiformes) based on nuclear DNA sequence data</p>
<source>BMC Evol Biol</source>
<pubid idtype="pmcid">184354</pubid>
<pubid idtype="pmpid" link="fulltext">12875664</pubid>
<pubid idtype="doi">10.1186/1471-2148-3-16</pubid>
</bibl>

Note that individual elements of the item (such as volume, pagination, etc.) are identified, but more importantly, identifiers are provided (in this case from PubMed Central, PubMed, and CrossRef). BMC is better than PLoS in this respect, as PLoS don't embed the identifiers.

Note that the mark up above embeds identifiers, not URLs (for example). URLs are fragile and can break. By just using identifiers, BMC avoids this problem, but it means that the user has to know how to make the identifier actionable.
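To show what "making the identifier actionable" might involve, here is a minimal sketch that maps each idtype in a BMC-style entry to a URL. The URL templates are assumptions about resolvers available at the time of writing, not part of BMC's markup, and they may well change (which is rather the point of not embedding them).

```python
import xml.etree.ElementTree as ET

# Assumed resolver URL templates, keyed by the idtype values used in the entry above.
RESOLVERS = {
    "doi": "https://doi.org/{}",
    "pmpid": "https://pubmed.ncbi.nlm.nih.gov/{}/",
    "pmcid": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{}/",
}


def actionable_links(bibl_xml):
    """Return {idtype: url} for every <pubid> whose idtype we know how to resolve."""
    bibl = ET.fromstring(bibl_xml)
    links = {}
    for pubid in bibl.iter("pubid"):
        idtype = pubid.get("idtype")
        template = RESOLVERS.get(idtype)
        if template and pubid.text:
            links[idtype] = template.format(pubid.text.strip())
    return links


# Cut-down copy of the entry above, closed so that it parses.
ENTRY = """<bibl id="B21">
<source>BMC Evol Biol</source>
<pubid idtype="pmcid">184354</pubid>
<pubid idtype="pmpid" link="fulltext">12875664</pubid>
<pubid idtype="doi">10.1186/1471-2148-3-16</pubid>
</bibl>"""

links = actionable_links(ENTRY)
for idtype, url in links.items():
    print(idtype, url)
```

If a resolver moves, only the `RESOLVERS` table needs updating; the marked-up documents stay as they are.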

Marking up literature to this level of detail within even a single paper would be time consuming, but as I've noted elsewhere on iSpecies, tools like ParaCite would make this tractable. ParaCite includes code to generate OpenURL requests, which means finding DOIs would be straightforward.

What's needed?
Tools to extract citations from text and locate identifiers. ParaCite would help. Relying on CrossRef's OpenURL server will be limited to those cases where CrossRef knows about the article (i.e., it has a DOI). It would be useful to have similar tools for searching PubMed, taxon-specific bibliographic databases such as FORMIS, and the Hymenoptera Name Server. By tool I mean a Web API (it can be as simple as an HTTP GET interface). Google Scholar would also be useful, although there are issues with using it. There is also literature in DSpace repositories, such as the AMNH's wonderful collection of their scientific publications. How do we query this on the fly? In summary, having an OpenURL interface to taxonomic literature would greatly facilitate automated mark up.
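For a sense of how simple such an HTTP GET interface could be, here is a minimal sketch that builds an OpenURL-style query from the parts of a citation. The resolver address is hypothetical, the citation values are made up, and the key names (title for the journal, volume, spage, date, aulast) follow common OpenURL usage as I understand it; a given resolver may expect something different.

```python
from urllib.parse import urlencode


def openurl_query(base_url, journal, volume, first_page, year=None, author_last=None):
    """Build a GET URL asking a resolver to look up one article."""
    params = {"title": journal, "volume": volume, "spage": first_page}
    if year is not None:
        params["date"] = year
    if author_last is not None:
        params["aulast"] = author_last
    return base_url + "?" + urlencode(params)


# Hypothetical resolver and made-up citation.
url = openurl_query(
    "http://resolver.example.org/openurl",
    journal="Example Journal",
    volume=12,
    first_page=345,
    year=2006,
)
print(url)
```

The same function would work against any literature database that accepted these keys, which is the attraction of a shared interface.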

Taxonomic names
I subscribe to uBio's view (see doi:10.1080/10635150500541680) that names by themselves are useful and should be indexed. A paper may mention a name with nothing to tell us what taxonomic concept is being used. For example, this is how the Google ant paper describes the site the ants were collected from:

Exotic vegetation dominates, most notably a scrub of strawberry guava (Psidium cattleianum) and privet (Ligustrum robustrum)—but grassland and Eucalyptus plantations also occur.

The paper is about ants, and for Discothyrea berlita Fisher, Proceratium avium Brown, Proceratium avioide de Andrade, and Proceratium google Fisher, we have a clear concept of what those names refer to (at a minimum, the specimens listed). For the other names (which include ants and plants), we have little to go on apart from the names. So, every occurrence of a taxonomic name in the document should be flagged. BioOne journals do this already. Any scientific name in the HTML is linked to ITIS. The linking is not intelligent as it is a search, not a link to an identifier (i.e., nobody actually checks that ITIS has the name).

What's needed?
Names in the document should be linked to uBio namebank LSIDs. For ants we could also use the Hymenoptera Name Server.
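Given a lookup table from names to identifiers, the flagging itself is straightforward. A minimal sketch follows, assuming the names have already been found and matched to identifiers: the <tax:name> element is a hypothetical tag, HNS195070 is the name identifier mentioned earlier in this post (which I take to belong to Proceratium google), and the other identifier is made up.

```python
import re
from html import escape


def tag_names(text, name_ids):
    """Wrap each known name in a hypothetical <tax:name id="..."> element."""
    if not name_ids:
        return text
    # Longest names first, so a binomial is matched before its genus alone.
    names = sorted(name_ids, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def wrap(match):
        name = match.group(1)
        return '<tax:name id="%s">%s</tax:name>' % (escape(name_ids[name], quote=True), name)

    return pattern.sub(wrap, text)


NAME_IDS = {
    "Proceratium google": "HNS195070",
    "Psidium cattleianum": "example:0001",  # made-up identifier
}

tagged = tag_names("Proceratium google was collected near Psidium cattleianum.", NAME_IDS)
print(tagged)
```

Note that only the local part of each identifier is embedded, in keeping with the convention suggested above.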

Taxonomic concepts
If we think of a concept as "what the name means", then this is most relevant to taxonomic papers describing names (e.g., where the author lists specimens, describes features of the taxon, argues that two taxa are synonyms, etc.). uBio has a notion of a concept in the sense that a name may exist in multiple classifications, and each combination of name and classification has its own identifier. The Hymenoptera Name Server has concepts (my understanding is that this is what one sees when one searches the Name Server through the web interface).

The taxonomic treatment is probably the same as the taxonomic concept (or perhaps, a treatment can be regarded as a detailed taxonomic concept). In my early experiments I assigned GUIDs to the taxonomic treatment, and used the Dublin Core tag <dcterms:isPartOf> to associate each treatment with the publication. As a quick hack the identifier for each treatment within a paper included the XPath query that would locate that treatment in the larger document, e.g. //tax:treatment[1]. Embedding this much meaning in an identifier is probably not wise, but it meant that given just the identifier, one could extract the treatment from the document.
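To make the quick hack concrete, here is a minimal sketch of pulling a treatment out of a document given only its identifier. The identifier layout (a document id, a "#", then the XPath query), the tax namespace URI, and the sample document are all hypothetical; only the idea of embedding the XPath comes from the experiment described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the "tax" prefix.
TAX_NS = {"tax": "http://example.org/taxonomy#"}


def split_identifier(identifier):
    """Split 'document-id#xpath' (an assumed layout) into its two parts."""
    doc_id, _, xpath = identifier.partition("#")
    return doc_id, xpath


def extract_treatment(document_xml, identifier):
    """Return the element the identifier's XPath points at, or None."""
    _, xpath = split_identifier(identifier)
    root = ET.fromstring(document_xml)
    # ElementTree only accepts relative paths, so "//x" becomes ".//x".
    if xpath.startswith("//"):
        xpath = "." + xpath
    return root.find(xpath, TAX_NS)


# Made-up document containing two treatments.
DOCUMENT = """<doc xmlns:tax="http://example.org/taxonomy#"><body>
<tax:treatment><tax:name>First taxon</tax:name></tax:treatment>
<tax:treatment><tax:name>Second taxon</tax:name></tax:treatment>
</body></doc>"""

treatment = extract_treatment(DOCUMENT, "example-paper#//tax:treatment[2]")
print(treatment.find("tax:name", TAX_NS).text)
```

The fragility is easy to see: insert a new treatment at the top of the document and every identifier after it points at the wrong one.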

The lack of GUIDs for authors is a long standing issue.

Most taxonomic images probably reside solely within the publication, but some may be stored in external databases (such as AntWeb). In the latter case, the mark up should make the link to the external source.

If the specimen has an electronic existence, then link to that.