SemAnt: September 2006

Friday, September 29, 2006

Why can't we spell!?

I know I'm not a great speller, but it gets frustrating when you discover how many potentially useful links to information are broken due to typos. For example, I stumbled across this page, which states that LACM ENT 164470 is the type specimen of Apterostigma gonides. Hmmm ... problem is, there is no such species. What they meant was Apterostigma goniodes (note the missing "o"). The fact that the only Google hit for "Apterostigma gonides" is the LACM page itself is a clue that something's up.

So, a potentially useful resource listing types housed at the Natural History Museum of Los Angeles County loses value through a typo. Argh!

Thursday, September 28, 2006

Organizing the Ant Internet - from The Ant Room

From The Ant Room:

Organizing the Ant Internet
This is what I am thinking about today -- I have this organization thing. I have a very strong need to have things in my life organized. A big pile of papers and junk drives me crazy. I just want to go through them all and put them into categories and file them away or throw away the trash and make everything look nice and neat. I get this feeling sometimes when I am browsing through the internet and looking at ant sites. I want to take them all off the web, look at them, clean them up a bit, throw away the junk, and put them all together in one well-organized drawer. There are so many ant sites nowadays and each and everyone seems to want to have everything you could ever want from an ant site, but none of them do. And I just think, if they could all get together, you really would have the best ant site ever. I'm not really sure why they don't. Even just a little bit of information sharing would be helpful. For instance, you've got AntWeb, which is a fabulous website if you are hoping to look up ants from Madagascar, but not if you are hoping to look up ants from Costa Rica. Why is that? The ants of Costa Rica have a fabulous webpage. It doesn't seem like it would be that difficult to import all of those costa rican ants onto AntWeb. DiscoverLife has done it. They don't have any checklists from Australia or Japan, though, which are also two groups of ant fauna with great webpages. Shouldn't we be trying to incorporate all of this information together? Even just a link to the other websites would be nice. It took me forever to figure out where the good websites were. I don't even trust the lists that are on DiscoverLife now -- I have a list of Tiputini ants on Discoverlife -- it is terribly out of date and I can't figure out how to update the list so I've just let it go. Tree of Life is another webpage that is basically useless to me. 
There are these beautiful photos but when you get down to the species level, you get a statement like "127 described species" but no actual list of species. Pseudomyrmex, for instance, has no species list on the Tree of Life website. Why not? A list certainly exists. And there are even labs that have been looking at this genus for years. Cephalotes, Procryptocerus, the Attini, Megalomyrmex, Pheidole, and Dolichoderus -- all genera that have no list of species on their tree of life webpage. Why hasn't someone added more ant information to this website? Or other websites? It is very frustrating to me. I wish someone would put me in charge of making one fantastic ant webpage that would incorporate everything. I know it would drive me insane but it would be very satisfying work. [The Ant Room]

I guess this is one of the motivations behind SemAnt -- exploring how to integrate diverse resources on ants into a single framework.

Monday, September 25, 2006

More on Wikis

Who writes Wikipedia?

That's the title of a longish, well-argued, very readable piece by Aaron Swartz, the highlight for me being a conclusion he arrived at by mining the data:

When you put it all together, the story become clear: an outsider makes one edit to add a chunk of information, then insiders make several edits tweaking and reformatting it. In addition, insiders rack up thousands of edits doing things like changing the name of a category across the entire site -- the kind of thing only insiders deeply care about. As a result, insiders account for the vast majority of the edits. But it's the outsiders who provide nearly all of the content.

(My emphasis). This contrasts with the results of some shallower research done by Jimbo Wales, and this kind of thing is why I for one would like to see Aaron on the Wikipedia board (although I couldn't vote because I've done <400 edits).

[Raw]

Wednesday, September 20, 2006

Adding triples using EditGrid

Vince Smith has constantly been telling me that for many biologists, "database" means an Excel spreadsheet, and that a big problem is simply getting data into a form that can be used online. Bearing that in mind, and also mindful of how much data is kicking around that isn't in "real" databases, I've been playing with EditGrid as a tool for adding triples to a triple store. I've commented on EditGrid elsewhere in the context of collaborative data matrices.

So, here's the situation. In my triple store I have information on ant specimen INBIOCRI001284215, obtained from AntWeb. Now, AntWeb has no pictures of this specimen. However, John Longino's pages on Acromyrmex coronatus include pictures of this specimen. How do I get that information into my triple store, without writing RDF?

One approach is to create a spreadsheet with three columns (subject, predicate, object), and create the triples, one per row. Now, I could just do this on my computer using, say, Excel, but that's not nearly cool enough, so I'll use EditGrid. But seriously, I'm going to use EditGrid because:

You can see it, whereas you can't see a file on my computer
You and I could collaborate on editing the data in EditGrid

The spreadsheet contains triples, such as these:

subject	predicate	object
http://www.evergreen.edu/ ... /INBIOCRI001284215_face_orig.jpg	foaf:depicts	http://www.antweb.org/ ... inbiocri001284215

In this case the subject and the object are represented by URIs (here they are URLs, but they could also be LSIDs or DOIs). You can see the complete spreadsheet here. The triples link the picture to the specimen, tell us that http://www.evergreen.edu/ants/genera/acromyrmex/species/coronatus/INBIOCRI001284215_face_orig.jpg is a picture (dc:type image), that the picture has a thumbnail, and is of Acromyrmex coronatus. Armed with these triples, I can now find a picture of this ant in my triple store.

Fine so far, but how do we get this into the triple store, I hear you ask? EditGrid's permalink feature can export the spreadsheet in a range of formats, including XML. So, what I do is grab the XML, apply an XSL style sheet to convert it to RDF, then import the resulting RDF into the triple store. The key thing is that once the data is in the spreadsheet, the rest is trivial. Here's the XSL style sheet. It has limitations, notably the assumption that URIs are URLs.


<?xml version='1.0' encoding='iso-8859-1'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:foaf="http://xmlns.com/foaf/0.1/" 
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <xsl:output method="xml" version="1.0" encoding="iso-8859-1" indent="yes"/>
    <xsl:template match="workbook">
        <rdf:RDF>
            <xsl:apply-templates select="//row"/>
        </rdf:RDF>
    </xsl:template>
    <xsl:template match="row">
        <!-- Row 0 is the header row (subject, predicate, object), so skip it -->
        <xsl:if test="@row != '0'">
            <xsl:element name="rdf:Description">
                <xsl:attribute name="rdf:about">
                    <xsl:value-of select="cell[1]/@input"/>
                </xsl:attribute>
                <xsl:variable name="predicate" select="cell[2]/@input"/>
                <xsl:variable name="object" select="cell[3]/@input"/>
                <xsl:choose>
                    <!-- An object containing 'http://' is treated as a resource; anything else becomes a literal -->
                    <xsl:when test="contains($object, 'http://')">
                        <xsl:element name="{$predicate}">
                            <xsl:attribute name="rdf:resource">
                                <xsl:value-of select="$object"/>
                            </xsl:attribute>
                        </xsl:element>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:element name="{$predicate}">
                            <xsl:value-of select="$object"/>
                        </xsl:element>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:element>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
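
For anyone without an XSLT processor to hand, the same row-to-triple mapping can be sketched in Python. This is a rough equivalent under stated assumptions, not the code used for SemAnt: the rows are assumed to be already parsed into (subject, predicate, object) tuples, the names PREFIXES, expand and rows_to_ntriples are made up for illustration, the example.org URIs are placeholders, and the output is N-Triples rather than RDF/XML. It also uses a slightly stricter test than the style sheet, treating an object as a resource only if it starts with http://.

```python
# Namespace prefixes allowed in the predicate column (same URIs as the style sheet).
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def expand(qname):
    """Expand a prefixed name such as 'foaf:depicts' to a full URI."""
    prefix, _, local = qname.partition(":")
    return PREFIXES[prefix] + local


def escape_literal(text):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def rows_to_ntriples(rows):
    """Turn (subject, predicate, object) rows into N-Triples lines.

    The first row is assumed to be the header and is skipped,
    mirroring the @row != '0' test in the style sheet.
    """
    lines = []
    for subject, predicate, obj in rows[1:]:
        if obj.startswith("http://"):
            obj_term = "<%s>" % obj          # resource
        else:
            obj_term = '"%s"' % escape_literal(obj)  # plain literal
        lines.append("<%s> <%s> %s ." % (subject, expand(predicate), obj_term))
    return lines


# Placeholder URIs standing in for the abbreviated ones in the table above.
rows = [
    ("subject", "predicate", "object"),
    ("http://example.org/picture.jpg", "foaf:depicts", "http://example.org/specimen"),
    ("http://example.org/picture.jpg", "dc:type", "image"),
]
for line in rows_to_ntriples(rows):
    print(line)
```

N-Triples is used here only because it is trivial to write one triple per line; a store that only accepts RDF/XML would still need the style sheet or an equivalent serializer.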

This particular spreadsheet makes some assumptions about the user, namely that they can figure out what is the subject and what is the object, and are comfortable choosing predicates. However, being collaborative, others could help out by editing the spreadsheet. Furthermore, one could create spreadsheets that aren't quite so complicated, and aren't geared towards the developer. For example, one basic source of information I'd like to capture is geographic location, and there is probably a lot more information available in papers than in georeferenced museum collections. Hence, a spreadsheet like this

observation	lat	long
locality	-34.0	156.26

could be used to capture locality information, and would require minimal effort to convert into RDF. We'd just have to modify the XSL style sheet shown above.
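
To give a feel for how little is involved, here is a sketch of that conversion in Python rather than XSL. It rests on a couple of assumptions that are not settled above: the W3C Basic Geo vocabulary (geo:lat, geo:long) is one possible choice of predicates, the subject URIs are minted from a hypothetical base (http://example.org/observation/), and the sheet is assumed to arrive as tab-separated text.

```python
import csv
import io
from urllib.parse import quote

GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"  # W3C Basic Geo vocabulary
BASE = "http://example.org/observation/"          # hypothetical base URI


def localities_to_ntriples(tsv_text):
    """Convert observation/lat/long rows into N-Triples lines."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        lat, lon = float(row["lat"]), float(row["long"])
        # Reject coordinates that cannot be a point on the globe.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError("bad coordinates in row: %r" % row)
        subject = BASE + quote(row["observation"])
        lines.append('<%s> <%slat> "%s" .' % (subject, GEO, row["lat"]))
        lines.append('<%s> <%slong> "%s" .' % (subject, GEO, row["long"]))
    return lines


# The example sheet from above, as tab-separated text.
sheet = "observation\tlat\tlong\nlocality\t-34.0\t156.26\n"
for line in localities_to_ntriples(sheet):
    print(line)
```

The range check is there because a spreadsheet filled in by hand is exactly where swapped or mistyped coordinates tend to creep in.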

The key point of all of this is that with minimal effort we can capture information that is not in the triple store, and we can make it easy(ish) for people with data to contribute. Given that EditGrid can import Excel files, somebody interested in sharing their data could do the grunt work in Excel on their own computer, then move everything to EditGrid, which makes it accessible to others.

Simple and open wins...

Thursday, September 07, 2006

Connotea tags

The following SPARQL query returns the "tags" for a Connotea reference using the DOI as the search term (in this case doi:10.1007/bf02224026):


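PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?subject
WHERE {
  # The variables in predicate position (?bnode, ?identifier, ?connotea)
  # act as wildcards: they match whatever property links the nodes.
  ?doi ?bnode 'doi:10.1007/bf02224026' .
  ?connoteaURI ?identifier ?doi .
  ?item ?connotea ?connoteaURI .
  ?item dc:subject ?subject
}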

The graph being queried is this RSS file, which I've put in a triple store.
This query is simply following the path in the RDF from the DOI <connotea:idValue>10.1007/bf02224026</connotea:idValue>.

Note: One potential "gotcha" is that DOIs are not case-sensitive, but SPARQL queries are (uh-oh).
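
One way around the case problem is to normalise the DOI before it ever reaches the query. The sketch below does this in Python; it assumes the triple store holds DOIs in lower case with a "doi:" prefix (as in the example above), and the function names normalise_doi and tags_query are made up for illustration.

```python
def normalise_doi(doi):
    """Lower-case a DOI and make sure it carries a single 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return "doi:" + doi.lower()


def tags_query(doi):
    """Build the Connotea tag query for a DOI typed in any mix of case."""
    literal = normalise_doi(doi)
    # Refuse characters that would break out of the quoted SPARQL literal.
    if "'" in literal or "\\" in literal:
        raise ValueError("DOI contains characters unsafe for this query")
    return (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?subject\n"
        "WHERE {\n"
        "  ?doi ?bnode '%s' .\n"
        "  ?connoteaURI ?identifier ?doi .\n"
        "  ?item ?connotea ?connoteaURI .\n"
        "  ?item dc:subject ?subject\n"
        "}\n"
    ) % literal


print(tags_query("10.1007/BF02224026"))
```

This only helps if the stored values are themselves consistently lower case; if they are not, the data would need the same normalisation on the way in.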

Wednesday, September 06, 2006

GenBank extras

Idly playing with ants, it is time to blog two things that have come up a few times. The first is that GenBank has links to literature that could do with updating. For example, the sequence U11912 from the fungus "Atta mexicana symbiont JF-1" is listed as being published in


  AUTHORS   Rehner,S.A., Chapela,I.H., Schultz,T.R. and Mueller,U.G.
  TITLE     Evolutionary history of the symbiosis between fungus-growing ants
            and their fungi
  JOURNAL   Unpublished

Well, this was published in Science (doi:10.1126/science.266.5191.1691) in 1994. The DOI seems broken (sigh), so here is a direct link. Ulrich Mueller's web site has a link to the PDF.
The other point is that searching the nucleotide database for "Atta mexicana" turns up no ants, only the above-mentioned fungus. We get the hit because there is a line in the GenBank record that lists the ant host.


/specific_host="Atta mexicana"

As I've mentioned over on the iSpecies blog, GenBank records often contain this sort of useful information. Hence, we could search for ants and extract information about their fungal associates.
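
As a rough illustration of how such associations might be pulled out, the Python sketch below scans the text of a GenBank flat-file record for the /organism and /specific_host qualifiers. It is a minimal sketch, not a full GenBank parser: the function name host_association is made up, and the sample text is a hand-typed fragment containing only the two qualifiers mentioned in this post, not the complete U11912 record.

```python
import re

# Matches /organism="..." and /specific_host="..." qualifiers; the value may
# wrap over several lines in a flat file, so newlines are allowed inside it.
QUALIFIER = re.compile(r'/(organism|specific_host)="([^"]*)"')


def host_association(record_text):
    """Return (organism, host) for one record, or None if no host is listed."""
    qualifiers = {}
    for name, value in QUALIFIER.findall(record_text):
        # Keep the first occurrence and collapse wrapped whitespace.
        qualifiers.setdefault(name, " ".join(value.split()))
    if "specific_host" not in qualifiers:
        return None
    return qualifiers.get("organism"), qualifiers["specific_host"]


# Hand-typed fragment for illustration only.
fragment = '''
    /organism="Atta mexicana symbiont JF-1"
    /specific_host="Atta mexicana"
'''
print(host_association(fragment))
```

Each (organism, host) pair could then be added to the triple store as a link between the fungus and the ant, using whatever predicate seems appropriate.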
