Thursday, November 30, 2006

Damn DiGIR

I've been playing with DiGIR providers, retrieving records to be massaged into RDF as part of a project to aggregate specimens, sequences, publications, and phylogenies. DiGIR was a major advance on what went before (i.e., basically nothing), but in addition to variations in the schema, and the fact that a good portion of the providers are offline at any one time (see Perils of Federation), I'm coming up against the wide range of ways people have of writing dates.

Ideally, I'd like dates in the variation of the ISO 8601 YYYY-MM-DD format described by the W3C, and recommended by Dublin Core. What we can get in DiGIR records is all manner of formats, such as (with specimen codes following)

11/28/2000 12:28:30 PM [KU 195138]
2004-02-20 00:00:00 [MVZ 149006]
2006-08-21 [FMNH 145699]
12/Jun/1983 [KU 195138]
29 Jun 1974 [MVZ 149006]

Now, variation in the <darwin:VerbatimCollectingDate> tag (the last two dates) is expected, but for a computer-generated field such as <darwin:DateLastModified> this is a bit much.

Since I do most of my harvesting in Perl, I came across Date::Manip, which manages to convert these into a sensible form (for example 11/28/2000 12:28:30 PM becomes 2000-11-28T12:28:30).
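Date::Manip is a Perl module, so for anyone not working in Perl, here is a minimal Python sketch of the same kind of normalisation. It assumes only the five date shapes listed above and an English locale for month names and AM/PM. The names FORMATS and to_iso8601 are purely illustrative, and this is nowhere near as forgiving as Date::Manip.

```python
from datetime import datetime

# (strptime format, carries a time of day?) for the five shapes quoted above.
# Assumption: month abbreviations and AM/PM are in English.
FORMATS = [
    ("%m/%d/%Y %I:%M:%S %p", True),   # 11/28/2000 12:28:30 PM
    ("%Y-%m-%d %H:%M:%S", True),      # 2004-02-20 00:00:00
    ("%Y-%m-%d", False),              # 2006-08-21
    ("%d/%b/%Y", False),              # 12/Jun/1983
    ("%d %b %Y", False),              # 29 Jun 1974
]


def to_iso8601(text):
    """Return an ISO 8601 string for a recognised date, or None otherwise."""
    cleaned = text.strip()
    for fmt, has_time in FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # not this shape, try the next one
        if has_time:
            return parsed.isoformat()
        return parsed.date().isoformat()
    return None


if __name__ == "__main__":
    for raw in ["11/28/2000 12:28:30 PM", "12/Jun/1983", "sometime in June"]:
        print(raw, "->", to_iso8601(raw))
```

Returning None for anything unrecognised is a deliberate choice in this sketch: a harvester can then log the odd cases rather than silently guessing.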

Integration is not easy...

Wednesday, November 22, 2006

More GUIDs

Idly Googling, I stumbled across The Knowledge Bank at OSU, a DSpace server at Ohio State University. A number of publications on ants are listed there, complete with PDFs. For example, "The Mating Activities of the Ant Myrmica Americana Weber" by Kannowski and Kannowski, published in The Ohio Journal of Science in 1957. This paper has a GUID in the form of a Handle (hdl:1811/4489). What I find interesting is that digitisation efforts by libraries are putting biodiversity literature online as part of a broader effort (i.e., we get these papers, and a GUID, "for free"). This also raises issues about duplication of effort: clearly, if a library has put a PDF online, we don't want to duplicate this. Hence, we need a simple way of finding whether a paper has already been digitised. Google Scholar may be useful for this, although in this case Google finds the paper but Google Scholar doesn't.

Monday, November 20, 2006

Copyright on images

This post is a copy of a comment I wrote on The Ant Room post Synchronizing and Copyrighting Images, which I've repeated here so I don't lose track of it.

Two thoughts on copyrighting images. The first is why choose copyright © as opposed to a Creative Commons (cc) license? With a cc license you get to specify what I can and can't do with the image, without me having to ask you. By sticking "© K. T. Ryder Wilkie 2005" on an image (e.g., your gorgeous picture of Acanthoponera peruviana), I then have to contact you to ask your permission. For one or two images, that's OK I guess, but what if I want to use lots of images? What if you are on holiday?

The second comment is that I can read "© K. T. Ryder Wilkie 2005" but computers can't (at least, not easily). There are other ways to tag images so that computers can read this information. Examples include EXIF tags (as used by Antweb, as mentioned on my iSpecies blog), which get embedded in the image file itself, XMP information added by Photoshop, or Flickr tags (for example, this image of Strumigenys precava). My point is that if people are going to make use of your work on a large scale, using Creative Commons licenses and embedding that information electronically in the image in the form of metadata will make your hard work even more useful.

If sharing information on biodiversity is going to take off, then we need to start thinking about how to share, and how to make our information accessible to computers, not just people.

Saturday, November 11, 2006

SVG specimen maps from SPARQL results

One reason why I've put off adding specimen maps to iSpecies, despite repeated requests, is that Google Maps (my preferred mapping tool) is slow if you have lots of specimen records. I've played with some other tools, notably Map Bureau's Flash-based pointMapper, but what I'd really like is a quick and simple way to display a bunch of specimen records. Because the same issue comes up with SemAnt, I thought it's time to do something about it. I stress that I really like Google Maps, but for some purposes it's overkill. Furthermore, loading hundreds of points will take too long.

So, the idea is to take georeferenced specimen records and put them on a map of the world. Slowly it dawned on me that this was trivially easy. Firstly, take a map of the world drawn using the equirectangular (or plate carrée) projection (Wikipedia provided the example below).

This projection has a simple connection between geographic location and pixel position. For example, if the map is scaled to 180 pixels high and 360 pixels wide, then you have a 1 pixel/degree grid. Hence, plotting localities is no harder than plotting an X-Y scatter plot.
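To make the pixel/degree idea concrete for maps that aren't exactly 360 × 180, here is a small Python sketch. It assumes the image spans the whole globe (longitude -180 to 180, latitude 90 to -90) and that pixel rows are counted downwards from the top-left corner, as in most image formats. The function name lonlat_to_pixel is just for illustration.

```python
def lonlat_to_pixel(lon, lat, width=360, height=180):
    """Map decimal longitude/latitude to pixel x, y on an equirectangular image.

    Assumes the image covers the full globe and that (0, 0) is the
    top-left pixel, with y increasing downwards.
    """
    x = (lon + 180.0) * width / 360.0   # -180 -> 0, +180 -> width
    y = (90.0 - lat) * height / 180.0   # +90 -> 0, -90 -> height
    return x, y


if __name__ == "__main__":
    # Equator and Greenwich meridian land in the middle of a 360 x 180 map.
    print(lonlat_to_pixel(0, 0))
    # The same point on a 720 x 360 map (2 pixels per degree).
    print(lonlat_to_pixel(0, 0, width=720, height=360))
```

With the default 360 × 180 size the scale factors are both 1, which is why the 1 pixel/degree case needs no arithmetic beyond shifting the origin.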

Now, all I need to do is take a SPARQL result with latitudes and longitudes and draw the localities on this map. One way to do this is to draw the points using SVG, so I can use an XSL transformation to generate the map. If I wanted to support zooming then ideally I'd have the map itself in SVG, but I just want a small world map, so I "cheat" and use a bitmap as the base map. This can be included like this:

<image x="0" y="0" width="360" height="180"
xlink:href="http://...360px-Equirectangular-projection.jpg" />

The trick is to convert latitude and longitude to coordinates on the bitmap. For example, specimen casent0008682-d03 of Melissotarsus emeryi was collected from 31°58'0'' S, 18°51'0'' E, which in decimal values is latitude -31.966667, longitude 18.85. Now, how do I convert these values into a location on a 360 × 180 image? In SVG the coordinates grow from the upper left, whereas on the map shown above 0,0 is in the centre, such that southern latitudes are negative, as are western longitudes. We can use a transform to move the origin of the x- and y-axes to the right 180 pixels, and down 90 pixels, so that the origin of the graph is the intersection of the equator and Greenwich Meridian. We also have to invert the y-axis because in SVG it goes from top to bottom. This diagram shows the difference between SVG and geographical coordinates:

This transformation is achieved by this statement

<g transform="translate(180,90) scale(1,-1)" >
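To check what this transform does to the Melissotarsus emeryi locality above, here is a small Python sketch that mimics how SVG applies a transform list. In SVG, a list such as "translate(180,90) scale(1,-1)" behaves like nested groups, so a point is affected by the rightmost transform first. The helper names are hypothetical, not part of any SVG library.

```python
def translate(tx, ty):
    """Return a function that shifts a point by (tx, ty)."""
    return lambda x, y: (x + tx, y + ty)


def scale(sx, sy):
    """Return a function that scales a point by (sx, sy)."""
    return lambda x, y: (x * sx, y * sy)


def apply_transform_list(transforms, x, y):
    """Apply an SVG-style transform list to a point.

    The rightmost transform acts on the point first, so we walk
    the list in reverse.
    """
    for transform in reversed(transforms):
        x, y = transform(x, y)
    return x, y


# The group transform used for the map: translate(180,90) scale(1,-1)
GROUP = [translate(180, 90), scale(1, -1)]

if __name__ == "__main__":
    lon, lat = 18.85, -31.966667
    px, py = apply_transform_list(GROUP, lon, lat)
    # Roughly 198.85 across and 121.97 down from the top-left corner
    print(round(px, 2), round(py, 2))
```

The y-flip happens before the shift, which is why a southern (negative) latitude ends up below the middle of the image rather than above it.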

This idea came from hack #55 in Michael Fitzgerald's book XML Hacks. Here is the XSLT I use.

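<?xml version='1.0'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:res="http://www.w3.org/2005/sparql-results#" xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" exclude-result-prefixes="res xsl">
<xsl:output method="xml" version="1.0" indent="yes"/>
<!-- Root template: SVG wrapper, base map, and the group with the shifted origin -->
<xsl:template match="/">
<svg>
<xsl:attribute name="width">360px</xsl:attribute>
<xsl:attribute name="height">180px</xsl:attribute>
<defs>
<rect id="dot" x="0" y="0" width="4" height="4" style="stroke:none; stroke-width:1; fill:solid"/>
</defs>
<image x="0" y="0" width="360" height="180" xlink:href=""/>
<g transform="translate(180,90) scale(1,-1)">
<xsl:apply-templates select="//res:result"/>
</g>
</svg>
</xsl:template>
<!-- One dot per SPARQL result, placed at (long, lat) -->
<xsl:template match="//res:result">
<use xlink:href="#dot">
<xsl:attribute name="transform">
<xsl:text>translate(</xsl:text>
<xsl:value-of select="res:binding[@name='long']/res:literal"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="res:binding[@name='lat']/res:literal"/>
<xsl:text>)</xsl:text>
</xsl:attribute>
</use>
</xsl:template>
</xsl:stylesheet>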

This transforms a SPARQL result that looks something like this:

<binding name="lat"><literal>10.266666</literal></binding>
<binding name="long"><literal>-84.083336</literal></binding>
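For anyone who would rather not use XSLT, here is a rough Python sketch of the same step using only the standard library. It assumes the bindings sit inside a standard SPARQL Query Results XML document (the SAMPLE wrapper below is made up around the two bindings shown above), and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace of the SPARQL Query Results XML Format
SPARQL_NS = {"res": "http://www.w3.org/2005/sparql-results#"}

# Hypothetical minimal result document wrapping the two bindings above
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="lat"/><variable name="long"/></head>
  <results>
    <result>
      <binding name="lat"><literal>10.266666</literal></binding>
      <binding name="long"><literal>-84.083336</literal></binding>
    </result>
  </results>
</sparql>"""


def result_points(xml_text):
    """Return a list of (longitude, latitude) floats, one per result."""
    root = ET.fromstring(xml_text)
    points = []
    for result in root.iterfind(".//res:result", SPARQL_NS):
        lat = result.findtext(
            "res:binding[@name='lat']/res:literal", namespaces=SPARQL_NS)
        lon = result.findtext(
            "res:binding[@name='long']/res:literal", namespaces=SPARQL_NS)
        if lat is not None and lon is not None:  # skip ungeoreferenced rows
            points.append((float(lon), float(lat)))
    return points


def use_elements(points):
    """Build one SVG <use> element per point, mirroring the XSLT output."""
    template = '<use xlink:href="#dot" transform="translate(%s,%s)"/>'
    return [template % (lon, lat) for lon, lat in points]


if __name__ == "__main__":
    for element in use_elements(result_points(SAMPLE)):
        print(element)
```

The resulting elements would still need to go inside the translated and flipped group to land in the right place on the map.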

One thing which drove me nuts for a while was that the SVG rendered fine in Safari using Adobe's plugin, but not in Camino, which uses the same rendering engine as Mozilla. It turns out Camino needs SVG to be the default namespace, so xmlns="http://www.w3.org/2000/svg" is fine, but it barfs over xmlns:svg="http://www.w3.org/2000/svg". Sigh.

Here is an example SVG file rendered using the XSLT style sheet, but using a different background map, showing the distribution of the ant Azteca constructor. The source SVG is here.

A nice, simple map, with minimal effort.

Wednesday, November 08, 2006

Perils of federation

Dave Vieglais gave what looks like an interesting presentation at TDWG 2006. BigDig monitors the status of DiGIR providers that serve museum specimen records. I've not managed to get the map background to appear, but here's a snapshot of the geographical distribution of providers, and their status:

The happy faces are DiGIR providers that are live; the sad faces are providers that are not responding. What is interesting is that a fair chunk are offline: of the 180 registered providers, 25 have never responded, and there are something like 17 variations of the DiGIR schema out there.

It's a little scary that so many providers are offline, and that they differ in the format of the messages they accept and return. For federated searches that are "live," this spells disaster. His presentation is here (rather unhelpfully in Open Office format, so I've made a PDF).