Thursday, June 08, 2006

Discovering new things

So, why bother with all this effort to aggregate information into a triple store, I hear you ask? Well, the expectation is that we can learn things we previously didn't know.
For example, consider the ant specimen casent0500379, which is recorded as the source of several sequences in GenBank. These sequences have been obtained by different research groups, and published in different papers.

We can see this immediately if we construct a graph based on the RDF in the triple store. Each node in the graph is a subject. Two nodes, x and y, are connected by an edge if there is a triple corresponding to either (?x, ?pred, ?y) or (?y, ?pred, ?x). The neighbourhood of a node is the set of all nodes adjacent to that node.
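The construction described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the triple store: the predicate names ("source", "publishedIn") are made up for the example, and it assumes every object in a triple is itself a resource rather than a literal.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an undirected adjacency map from (subject, predicate, object) triples.

    The predicate is ignored: x and y are joined if either (x, p, y)
    or (y, p, x) is present.
    """
    adjacency = defaultdict(set)
    for subj, _pred, obj in triples:
        if subj == obj:
            continue  # skip self-loops
        adjacency[subj].add(obj)
        adjacency[obj].add(subj)
    return adjacency


def neighbourhood(adjacency, node):
    """Return the set of nodes adjacent to `node`."""
    return set(adjacency.get(node, ()))


# Identifiers are taken from this post; the predicate names are hypothetical.
triples = [
    ("gi:87047406", "source", "casent0500379"),
    ("gi:89477179", "source", "casent0500379"),
    ("gi:87047406", "publishedIn", "pmid:16601190"),
    ("gi:89477179", "publishedIn", "pmid:16630727"),
]

graph = build_graph(triples)
print(sorted(neighbourhood(graph, "casent0500379")))
```

Because edges are undirected, it does not matter whether the specimen appears as the subject or the object of a triple.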

This graph represents the neighbourhood of specimen casent0500379. I've expanded the graph by finding the neighbours of all the neighbours of casent0500379. Note that this graph shows that five sequences have been obtained from this specimen (identified by their "gi" numbers), and those five sequences are associated with three different taxonomic names in GenBank(!).
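Expanding a neighbourhood to "neighbours of neighbours" amounts to a breadth-first walk limited to two hops. A minimal sketch, assuming the graph is held as a plain dictionary mapping each node to the set of its adjacent nodes (the node names and the function name here are placeholders, not real identifiers):

```python
def within_hops(adjacency, start, hops=2):
    """Return every node reachable from `start` in at most `hops` edges."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        # Nodes one step beyond the current frontier that we have not yet visited.
        frontier = {
            neighbour
            for node in frontier
            for neighbour in adjacency.get(node, ())
            if neighbour not in seen
        }
        seen |= frontier
    return seen


# Toy chain: specimen - sequence - paper - journal (placeholder labels).
toy = {
    "specimen": {"sequence"},
    "sequence": {"specimen", "paper"},
    "paper": {"sequence", "journal"},
    "journal": {"paper"},
}

print(sorted(within_hops(toy, "specimen", hops=2)))
```

With hops=1 this gives the plain neighbourhood plus the starting node; with hops=2 it gives the expanded graph described above.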

Two labs have sequenced the 28S ribosomal RNA gene from this specimen, one with accession number DQ353560 (gi:87047406), published by Moreau et al. in Science (pmid:16601190, doi:10.1126/science.1124891) and one with accession number DQ401020 (gi:89477179), published by Ouellette et al. in MPE (pmid:16630727, doi:10.1016/j.ympev.2006.03.017).

Now, the specimen has not been identified beyond being assigned to the genus Proceratium (which includes the Google ant). These two research groups have given it different informal names, hence GenBank doesn't realise that these two taxa are the same.

In fact there is a third taxon, "Proceratium sp. CSM-2006" for this same specimen, which has been sequenced for wingless.

This is a small example of where aggregating and visualising links between multiple data sources can tell us something we didn't know before.

My triple store misses the sixth sequence (yet another 28S rRNA sequence) from this specimen (AY325951, gi:34398469), because the specimen is recorded in the "isolate" field, not the specimen_voucher field:
     source          1..1233
                     /organism="Proceratium sp. CS-2003-1"
                     /mol_type="genomic DNA"
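One way to catch records like this would be to fall back to the "isolate" qualifier when specimen_voucher is absent. The sketch below assumes the source qualifiers have already been parsed into a dictionary; the function name and the example values are hypothetical, and this is not how the triple store is currently populated.

```python
def specimen_code(qualifiers, fields=("specimen_voucher", "isolate")):
    """Return the first non-empty value among `fields`, or None if there is none.

    `qualifiers` maps GenBank source qualifier names to their string values.
    """
    for field in fields:
        value = qualifiers.get(field, "").strip()
        if value:
            return value
    return None


# Placeholder records: the values are invented for illustration.
with_voucher = {"organism": "Example sp.", "specimen_voucher": "EXAMPLE-0001"}
isolate_only = {"organism": "Example sp.", "isolate": "EXAMPLE-0002"}

print(specimen_code(with_voucher))
print(specimen_code(isolate_only))
```

The catch is that "isolate" is used for many things other than specimen codes, so values found this way would need checking before being treated as vouchers.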


  • I'm using this neat example of yours Rod in a talk I am about to give on the utility of Internet standards in science research.

    You have to wonder what percentage of 'anomalies' one is likely to discover using LSIDs and RDF. If these issues arise from one specimen, one wonders.

    I get the impression that it is only with the rollout of LSIDs that future problems will be curtailed. Existing issues, as in your example, are probably never going to be significantly addressed?

    Meanwhile, I suppose it will be (hopefully) 'detect the anomalies on the fly'.

    By Blogger Lee, at 6:36 am  
