SemAnt

Sunday, July 16, 2006

Taxonomic treatments

For a demo for Donat Agosti, I've added some taxonomic treatments to iSpecies.org. I'd done this before, but after the server got hacked I didn't restore the treatments because they are served via a triple store, and I hadn't got that running. Now that the triple store is up, I looked at this again. Here's what is involved.

Treatments

TaxonX is an XML markup for taxonomic descriptions. The idea is to locate and mark blocks of text that describe a taxon. For more details see the AMNH's NSF Taxonomic Literature Project pages.

Donat has been marking up various ant papers manually as proof of concept, but the process will be automated. I want to be able to serve up a taxonomic description of a name, e.g. "Proceratium google". Because I want everything to be in a triple store, I need to map TaxonX to RDF. Here's what I do:

  1. The URI of the paper is the link to the PDF in AntBase. This should really be something else (LSID, DOI, Handle, PURL), but it will do for now.

  2. Each treatment is extracted from the TaxonX document using XPath. I use a Perl script to pull out each node matching //tax:treatment

  3. Each treatment gets a URI, based on the URI of the paper containing the treatment, and the XPath to the treatment, e.g. http://antbase.org/ants/publications/8538_fisher//tax:treatment[1]. The idea is that one could use the identifier to extract the relevant block of text from the TaxonX XML document (i.e., the identifier would be useful beyond my triple store). Although I worry that this is not semantically opaque, it seems a useful idea, and my worries eased when I discovered that Annotea uses the same idea.

  4. The actual treatment is stored as a block of <![CDATA[..]]>, so the original TaxonX markup is preserved.

  5. Each treatment is linked to the containing paper by the Dublin core term <dcterms:isPartOf>. I also have the inverse link <dcterms:hasPart> to link the publication to the treatments it contains.

  6. I have some minimal metadata about the publication (title, format), and about each treatment (name of taxon stored in <dc:subject>). This is extracted from what is in the TaxonX document. Clearly TaxonX needs more information on the source.

  7. Each treatment is typed using <dc:type>treatment</dc:type>. I do this so that I can classify results for a query (as part of another project).
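
The mapping in steps 2 to 7 can be sketched roughly as follows. This is a minimal illustration in Python rather than the Perl script mentioned above, and it makes several assumptions: the <doc> wrapper, its title attribute, and the tax:name element are hypothetical stand-ins for real TaxonX markup, and gla:treatment is assumed to be the predicate that holds the stored markup (it is the one used in the SPARQL query in this post). Triples are plain tuples, not a real triple store.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the "PREFIX tax:" line of the SPARQL query in this post.
TAX_NS = "http://research.amnh.org/informatics/taxlit/taxonx/taxonx1"
NS = {"tax": TAX_NS}
ET.register_namespace("tax", TAX_NS)  # keep the tax: prefix when serialising

# Toy input. Only tax:treatment is taken from the post; the <doc> wrapper,
# its title attribute and tax:name are hypothetical stand-ins.
SAMPLE = f"""<doc xmlns:tax="{TAX_NS}" title="Sample paper">
  <tax:treatment><tax:name>Proceratium google</tax:name> Description text.</tax:treatment>
  <tax:treatment><tax:name>Another taxon</tax:name> More text.</tax:treatment>
</doc>"""


def treatments_to_triples(xml_text, paper_uri):
    """Return (subject, predicate, object) tuples for a paper and its treatments."""
    root = ET.fromstring(xml_text)
    triples = [(paper_uri, "dc:title", root.get("title", ""))]
    # Number the treatments in document order, starting at 1.
    for position, node in enumerate(root.findall(".//tax:treatment", NS), start=1):
        uri = f"{paper_uri}//tax:treatment[{position}]"
        name = node.findtext("tax:name", default="", namespaces=NS)
        # Serialise the node so the original markup is preserved as a literal.
        markup = ET.tostring(node, encoding="unicode").strip()
        triples += [
            (uri, "dc:type", "treatment"),
            (uri, "dc:subject", name),
            (uri, "gla:treatment", markup),
            (uri, "dcterms:isPartOf", paper_uri),
            (paper_uri, "dcterms:hasPart", uri),
        ]
    return triples


if __name__ == "__main__":
    paper = "http://antbase.org/ants/publications/8538_fisher"
    for triple in treatments_to_triples(SAMPLE, paper):
        print(triple)
```

Building the treatment URI from the paper URI plus the XPath means the same identifier can, in principle, be used to pull the block back out of the source document.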


So, a publication is modelled like this:

And a treatment is modelled like this.



SPARQL

Currently iSpecies treatments are retrieved using RDQL, but SPARQL is rather nicer. Finding the treatment for a taxon is a simple SPARQL query, e.g.:

PREFIX gla: <urn:lsid:lsid.zoology.gla.ac.uk:predicates:>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX tax: <http://research.amnh.org/informatics/taxlit/taxonx/taxonx1>
SELECT ?uri ?publication ?title ?treatment
WHERE {?uri dc:subject 'Proceratium google' .
?uri gla:treatment ?treatment .
?publication dcterms:hasPart ?uri .
?publication dc:title ?title
}


Display

To display the results I take the SPARQL XML result, convert the encoded TaxonX block to XML mark up, then apply a simple XSLT style sheet. The results aren't pretty, but it works.
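
The decoding step can be sketched like this. It is a rough illustration, not the actual iSpecies code: it assumes the results arrive in the W3C SPARQL Query Results XML Format, uses a made-up result with unprefixed element names, and flattens the treatment to plain text instead of applying an XSLT style sheet.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format.
SR = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Made-up result for illustration; the treatment markup arrives entity-encoded.
SAMPLE_RESULT = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="title"/><variable name="treatment"/></head>
  <results>
    <result>
      <binding name="title"><literal>Sample paper</literal></binding>
      <binding name="treatment"><literal>&lt;treatment&gt;&lt;name&gt;Proceratium google&lt;/name&gt; Description text.&lt;/treatment&gt;</literal></binding>
    </result>
  </results>
</sparql>"""


def decode_treatments(result_xml):
    """Return (title, treatment element) pairs from a SPARQL XML result."""
    root = ET.fromstring(result_xml)
    rows = []
    for result in root.findall("sr:results/sr:result", SR):
        row = {
            binding.get("name"): binding.findtext("sr:literal", default="", namespaces=SR)
            for binding in result.findall("sr:binding", SR)
        }
        # The parser has already turned &lt; back into <, so the literal
        # can be parsed a second time to recover the markup.
        rows.append((row.get("title", ""), ET.fromstring(row["treatment"])))
    return rows


def to_plain_text(element):
    """Crude stand-in for the style sheet: join all text and tidy whitespace."""
    return " ".join("".join(element.itertext()).split())


if __name__ == "__main__":
    for title, treatment in decode_treatments(SAMPLE_RESULT):
        print(title, "|", to_plain_text(treatment))
```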

Future directions

As I've mentioned in an earlier post, what I'd really like is to have GUIDs for these publications sorted out, and more mark up. In particular, literature cited, specimens, and other taxonomic names should be marked up so that these links can be extracted. If this is done well, then we could do things like:

  • Generate distribution maps for papers that don't have maps

  • Generate synonymies from lists of names

  • Infer type status even if specimen databases don't have this information

  • etc.


The trick will be transforming TaxonX to RDF.

Currently playing in iTunes: Shelter by Ray LaMontagne

Saturday, July 08, 2006

Disconnected databases

One consequence of having multiple databases is that they can get out of sync, that is, information in one database might not be updated to reflect changes in another. I've touched on this earlier when discussing unidentified ants in GenBank (Discovering new things and Ants, RDF, and triple stores).


I've also come across cases where AntWeb is out of date. For example, the ant Strumigenys rubigus was described in 2000 by Brian Fisher. In the TaxonX marked up version of the original paper (available here), the holotype is listed as:

Holotype worker, Madagascar: Prov. Toamasina, F.C. Andrianantantely, 18 deg. 41.7 min. S, 48 deg. 48.8 min. E, 530 m 4-10.xii.1998, ex rotten log, rainforest, #49-2 (H.J. Ratsirarson) (MCZ).


Now, in an ideal, joined-up world, we'd have a link from the Fisher paper to the actual specimen. A bit of fussing (i.e., searching for "Strumigenys rubigus" on AntWeb) reveals that the holotype is casent0005630. The AntWeb page for this ant gives no indication that this is the holotype, although there is a picture of the specimen labels that makes it clear that this is what it is.

Having multiple sources of information makes it harder to keep things up to date, which is another reason why I think RDF and triple stores (or distributed queries) will help. So long as we have metadata about the specimen and the publication, we can make the inference that casent0005630 is the holotype of Strumigenys rubigus. This may ease the burden on individual databases. Rather than the curators of AntWeb having to update AntWeb manually every time a new name is published, a portal along the lines of my SemAnt toy could summarise this new information easily, if (and it's a big if) we have the links between publication and specimen.
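
To make the inference concrete, here is a toy sketch. Everything except the taxon name and the specimen code is an assumption: the ex: predicates and the treatment identifier are hypothetical, and the link from treatment to specimen is exactly the piece that does not yet exist.

```python
# Hypothetical triples. Only "Strumigenys rubigus" and "casent0005630"
# come from the post; the ex: names are invented for illustration.
TRIPLES = [
    ("ex:fisher2000-treatment", "dc:subject", "Strumigenys rubigus"),
    ("ex:fisher2000-treatment", "ex:citesHolotype", "casent0005630"),
]


def infer_type_status(triples):
    """If a treatment of taxon N cites specimen S as holotype, infer S holotypeOf N."""
    subject_of = {s: o for s, p, o in triples if p == "dc:subject"}
    inferred = []
    for s, p, o in triples:
        if p == "ex:citesHolotype" and s in subject_of:
            inferred.append((o, "ex:holotypeOf", subject_of[s]))
    return inferred


if __name__ == "__main__":
    print(infer_type_status(TRIPLES))
```

The point is that neither database has to store the type status itself; it falls out of joining the publication metadata with the specimen link.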

Saturday, July 01, 2006

SPARQL query for classification

After spidering uBio, the next task is to display the classification of a taxon. The following SPARQL query does the trick:

PREFIX ubio: <urn:lsid:ubio.org:predicates:>
PREFIX gla: <urn:lsid:lsid.zoology.gla.ac.uk:predicates:>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?node ?title ?rank ?name ?description

WHERE {
<urn:lsid:ubio.org:namebank:2735665> gla:objectiveSynonym ?display .
?class ubio:namebankIdentifier ?display .
?class ubio:classificationName ?name .
?class ubio:classificationDescription ?description .
?class gla:lineage ?seq .
?seq ?li ?node .
?node dc:title ?title .
?node gla:rank ?rank
}

This query takes a canonical name in uBio, finds the display form, and from that the classification. The lines

?class gla:lineage ?seq .
?seq ?li ?node .

fetch the lineage which is stored as a sequence:

<gla:lineage>
<rdf:Seq>
<rdf:li rdf:resource="urn:lsid:ubio.org:classificationbank:5178917"/>
<rdf:li rdf:resource="urn:lsid:ubio.org:classificationbank:5095531"/>
.
.
.
<rdf:li rdf:resource="urn:lsid:ubio.org:classificationbank:5131593"/>
</rdf:Seq>
</gla:lineage>

This way of representing a taxonomic lineage was described in my paper in Biodiversity Informatics. uBio serves the lineage from lower to higher taxon (i.e., bottom up), but I want to display it top down. I do all display using XSLT style sheets, so I use the <xsl:sort order="descending"> trick (see here for an example). Then it's simply a case of indenting each node using <xsl:value-of select="18 * (position()-1)"/>, and borrowing uBio's end.png to get the tree effect.
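
The same reversal and indentation can be sketched outside XSLT. This is only an illustration in Python of the arithmetic: it assumes the gla:lineage element is available as a standalone fragment with its namespaces declared, and it uses just the three rdf:li entries shown above (the elided ones are left out).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GLA_NS = "urn:lsid:lsid.zoology.gla.ac.uk:predicates:"
NS = {"rdf": RDF_NS, "gla": GLA_NS}

# Only the three entries shown in the post; the elided ones are omitted.
SAMPLE = f"""<gla:lineage xmlns:gla="{GLA_NS}" xmlns:rdf="{RDF_NS}">
<rdf:Seq>
<rdf:li rdf:resource="urn:lsid:ubio.org:classificationbank:5178917"/>
<rdf:li rdf:resource="urn:lsid:ubio.org:classificationbank:5095531"/>
<rdf:li rdf:resource="urn:lsid:ubio.org:classificationbank:5131593"/>
</rdf:Seq>
</gla:lineage>"""


def top_down_indents(lineage_xml, step=18):
    """Return (indent, uri) pairs with the highest taxon first."""
    root = ET.fromstring(lineage_xml)
    uris = [li.get(f"{{{RDF_NS}}}resource") for li in root.findall(".//rdf:li", NS)]
    # uBio lists the lineage bottom up, so reverse it; position starts at 1
    # to mirror position() in the style sheet.
    return [
        (step * (position - 1), uri)
        for position, uri in enumerate(reversed(uris), start=1)
    ]


if __name__ == "__main__":
    for indent, uri in top_down_indents(SAMPLE):
        print(f"{indent:>3}px  {uri}")
```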