SemAnt

Thursday, November 30, 2006

Damn DiGIR

I've been playing with DiGIR providers, retrieving records to be massaged into RDF as part of a project to aggregate specimens, sequences, publications, and phylogenies. DiGIR was a major advance on what went before (i.e., basically nothing), but on top of variations in the schema, and the fact that a good portion of the providers are offline at any one time (see Perils of Federation), I'm coming up against the wide range of ways people have of writing dates.

Ideally, I'd like dates in the profile of the ISO 8601 YYYY-MM-DD format described by the W3C and recommended by Dublin Core. What we actually get in DiGIR records is all manner of formats, such as these (specimen codes follow in brackets):

11/28/2000 12:28:30 PM [KU 195138]
2004-02-20 00:00:00 [MVZ 149006]
2006-08-21 [FMNH 145699]
12/Jun/1983 [KU 195138]
29 Jun 1974 [MVZ 149006]

Now, variation in the <darwin:VerbatimCollectingDate> tag (the last two dates) is to be expected, but for a computer-generated field such as <darwin:DateLastModified> this is a bit much.

I do most of my harvesting in Perl, so I went looking and came across Date::Manip, which manages to convert these into a sensible form (for example, 11/28/2000 12:28:30 PM becomes 2000-11-28T12:28:30).
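For anyone harvesting outside Perl, here is a rough sketch of the same normalisation in Python, using only the standard library. To be clear, this is not Date::Manip and not what I actually run: the function name normalise_date and the FORMATS list are made up for illustration, the list covers only the five examples above, and it assumes slash-separated numeric dates are month-first (US style) and that month abbreviations are in English.

```python
from datetime import datetime

# Candidate layouts, one per example date in the post. This list is an
# assumption: other DiGIR providers will emit layouts not covered here.
FORMATS = [
    "%m/%d/%Y %I:%M:%S %p",  # 11/28/2000 12:28:30 PM (month first assumed)
    "%Y-%m-%d %H:%M:%S",     # 2004-02-20 00:00:00
    "%Y-%m-%d",              # 2006-08-21
    "%d/%b/%Y",              # 12/Jun/1983
    "%d %b %Y",              # 29 Jun 1974
]


def normalise_date(text):
    """Return an ISO 8601 string, or None if no known layout matches."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # wrong layout, try the next one
        # Keep the time part only when the source layout carried one.
        if "%H" in fmt or "%I" in fmt:
            return parsed.strftime("%Y-%m-%dT%H:%M:%S")
        return parsed.strftime("%Y-%m-%d")
    return None


if __name__ == "__main__":
    for raw in ["11/28/2000 12:28:30 PM", "2004-02-20 00:00:00",
                "2006-08-21", "12/Jun/1983", "29 Jun 1974"]:
        print(raw, "->", normalise_date(raw))
```

Returning None for anything unrecognised, rather than guessing, means the oddball records can be logged and looked at by hand. A fixed list of layouts is much less forgiving than Date::Manip's parser, so expect to keep adding entries as new providers turn up.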

Integration is not easy...
