Monday, October 23, 2006

Automatically growing an ant bibliography

Earlier on iPhylo I'd mentioned the issue of updating a triple store of ants, or indeed, any data base. As an experiment, I've put together a Perl script that can be used to update a data base in Connotea with recent papers on ants. The script makes use of a number of web services and uBio's RSS feeds. It does the following:

  1. Takes an RSS feed for Formicidae from uBio. This feed lists recent papers on ants, as identified using uBio's taxonomic name recognition algorithms.

  2. Extracts DOIs or PubMed identifiers from the RSS feed. If a DOI isn't found, I see if we can extract one from the <link> tag (typically a URL to the article). uBio does a pretty good job of getting DOIs, but misses some (e.g., for Blackwell and BioOne journals).

  3. Extracts taxonomic names from the content of the <title> and <description> tags using a SOAP call to uBio's FindIT web service. Ideally, uBio would do this for us, since it has already parsed the journal feed, but for now I do it.

  4. Uses Yahoo's term extraction web service to extract keywords.

  5. Submits the article GUID (DOI or PubMed id) and the tags to Connotea using the web API.
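
Step 2 above can be sketched roughly as follows. This is a minimal Python illustration, not the Perl script itself. The sample feed, the regular expressions, and the function names (find_guid, extract_guids) are all assumptions made for the example, and the real uBio feed may lay out its items differently.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the uBio Formicidae feed (identifiers are made up).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Formicidae</title>
    <item>
      <title>Paper with a DOI in the description</title>
      <link>http://example.org/article/1</link>
      <description>Some abstract text. doi:10.1234/example.0001.</description>
    </item>
    <item>
      <title>Paper with a DOI only in the link</title>
      <link>http://example.org/doi/abs/10.1234/example.0002</link>
      <description>Some abstract text.</description>
    </item>
    <item>
      <title>Paper known only by PubMed id</title>
      <link>http://example.org/pubmed/12345678</link>
      <description>Some abstract text.</description>
    </item>
  </channel>
</rss>"""

# Assumed patterns: a loose DOI matcher and a PubMed id embedded in a URL.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>&]+')
PMID_RE = re.compile(r'(?:list_uids=|pubmed/)(\d+)')


def find_guid(item):
    """Return 'doi:...' or 'pmid:...' for one <item>, or None if neither is found."""
    title = item.findtext("title", default="")
    desc = item.findtext("description", default="")
    link = item.findtext("link", default="")
    # Look in the item text first, then fall back to the <link> URL.
    for text in (desc, title, link):
        match = DOI_RE.search(text)
        if match:
            # Drop trailing sentence punctuation picked up by the loose pattern.
            return "doi:" + match.group(0).rstrip(".,;)")
    match = PMID_RE.search(link)
    if match:
        return "pmid:" + match.group(1)
    return None


def extract_guids(feed_xml):
    """Return one GUID (or None) per <item> in the feed, in feed order."""
    root = ET.fromstring(feed_xml)
    return [find_guid(item) for item in root.iter("item")]


guids = extract_guids(SAMPLE_FEED)
print(guids)
```

Items for which neither identifier can be found come back as None, so a caller could skip them rather than submit an unresolvable reference.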

Here's a sketch of the process.

The papers are stored in my semant library. Because it is entirely automated, it could be run regularly (as a cron job, say) to update the library, hence the list of ant papers would grow without any human intervention. At the same time, however, users with access to the semant library could manually edit the tags if they feel Yahoo and uBio have missed some relevant terms.

Note also that names recognised by uBio are tagged with LSIDs for the names as well, which means we could resolve those to RDF. In the same way, the Connotea data base itself can serve RDF (here are the ant papers in RDF). Hence, we could easily populate a triple store with metadata about papers and names.
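
As a rough illustration of what populating a triple store might involve, here is a small Python sketch that writes N-Triples statements linking a paper to the names found in it. The predicate URI, the DOI, and the LSID value are placeholders invented for the example, and to_ntriples is a hypothetical helper, not something provided by uBio or Connotea.

```python
def to_ntriples(doi, name_lsids,
                predicate="http://example.org/terms/mentionsName"):
    """Return N-Triples lines linking a paper (by DOI) to taxonomic name LSIDs.

    The default predicate is a made-up URI; a real store would use a
    term from whatever vocabulary it has settled on.
    """
    subject = "<https://doi.org/" + doi + ">"
    return ["%s <%s> <%s> ." % (subject, predicate, lsid)
            for lsid in name_lsids]


# Placeholder identifiers, for illustration only.
lines = to_ntriples("10.1234/example.0001",
                    ["urn:lsid:ubio.org:namebank:0000000"])
for line in lines:
    print(line)
```

Each line is one subject, predicate, object statement, which is the form most triple stores can load directly.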

What I like about this script is that it brings together a number of themes.

GUIDs play a key role here. Connotea knows which papers uBio has extracted by using the DOI (or PubMed identifier). Not only does this enable Connotea to know which paper I want, but it uses that identifier to extract metadata about the paper, for example via CrossRef. It also knows whether any other user has already added that paper.

Web services mean that I don't have to reinvent the wheel. If I want to pick out taxonomic names, I use uBio. To extract keywords for tagging, I use Yahoo. To store data, I use Connotea's API.

Tagging makes it easy to add information to a reference.

Social networking comes from using an open database like Connotea. People can discover other people's libraries through shared papers or shared tags.

RSS pops up at the start and at the end. The whole process starts with an RSS feed (itself an aggregation of numerous journal RSS feeds), and the resulting Connotea data base serves RSS, so others can readily make use of the results.
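
To make the GUID and tagging themes a little more concrete, here is a hedged Python sketch of preparing a submission: the GUID is turned into a resolvable URI, the uBio names and Yahoo keywords are merged into one tag list, and the result is form-encoded. The field names uri and tags, and the convention of quoting multi-word tags, are my assumptions about what a bookmarking API such as Connotea's expects. The helper names are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode


def guid_to_uri(guid):
    """Map 'doi:...' or 'pmid:...' to a URI that a bookmarking service can resolve."""
    scheme, _, value = guid.partition(":")
    if scheme == "doi":
        return "https://doi.org/" + value
    if scheme == "pmid":
        return "https://pubmed.ncbi.nlm.nih.gov/" + value + "/"
    raise ValueError("unsupported GUID: " + guid)


def merge_tags(*tag_lists):
    """Merge tag lists in order, dropping blanks and case-insensitive duplicates."""
    seen = set()
    merged = []
    for tags in tag_lists:
        for tag in tags:
            key = tag.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(tag.strip())
    return merged


def build_payload(guid, names, keywords):
    """Return a form-encoded body; 'uri' and 'tags' are assumed field names."""
    tags = merge_tags(names, keywords)
    # Assumed convention: multi-word tags are wrapped in double quotes.
    quoted = ['"%s"' % t if " " in t else t for t in tags]
    return urlencode({"uri": guid_to_uri(guid), "tags": " ".join(quoted)})


payload = build_payload("doi:10.1234/example.0001",
                        ["Formicidae", "Atta"],
                        ["foraging", "formicidae", "leaf litter"])
print(payload)
```

Because the identifier, not the title or the URL of the journal page, is what gets submitted, two users adding the same paper from different sources end up pointing at the same record.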


  • rod

    nice stuff

    is the "in the news section" in actually based on yours, or is this "only" based on UBIO's RSS feed?

    This service raises the question again of how to build up a complementary alert system for gene sequences (you have it in updating ants blog), or one to alert for new ant taxa.

    This certainly would fulfil some of the users' wishes/dreams, especially if the data is provided in a form that could easily be extracted into their current databases, such as EndNote for literature, Excel or Access databases...

    of course, it's one step closer to a species web page kept automatically updated, including the addition of all descriptions of new species.


    By Blogger Donat Agosti, at 8:43 am  

  • Hi everybody!
    TermExtractor, my master thesis, is (FREE) online at the

    TermExtractor is a software package for Terminology Extraction. The software helps a web community to extract and validate relevant domain terms in their domain of interest, by submitting an archive of domain-related documents in any format.

    TermExtractor extracts terminology that is consensually referred to in a specific application domain. The software takes as input a corpus of domain documents, parses the documents, and extracts a list of "syntactically plausible" terms (e.g. compounds, adjective-nouns, etc.). Document parsing assigns greater importance to terms with text layouts (title, bold, italic, underlined, etc.). Two entropy-based measures, called Domain Relevance and Domain Consensus, are then used. Domain Consensus is used to select only the terms which are consensually referred to throughout the corpus documents. Domain Relevance is used to select only the terms which are relevant to the domain of interest; it is computed with reference to a set of contrastive terminologies from different domains. Finally, extracted terms are further filtered using Lexical Cohesion, which measures the degree of association of all the words in a terminological string. Accepted file formats are: txt, pdf, ps, dvi, tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and also zip archives.

    I'd like you to participate in the TermExtractor evaluation task. The result of your evaluation will be put in a paper (I enclose a draft). Please contact me if you want to participate (this is very important for


    Francesco Sclano
    home page:
    skype: francesco978

    By Anonymous Francesco Sclano, at 5:21 pm  
