AstroInformatics I: From Data to Knowledge

Optical layout of LSST, the catalyst for many semantic headaches

Like many sciences, astronomy is becoming increasingly data-rich. The next generation of observatories, such as the Large Synoptic Survey Telescope, will produce staggering amounts of data every night and push the subject into the petabyte regime. The large surveys that feed a substantial portion of the research community today, such as the Sloan Digital Sky Survey, are already demonstrating the difficulties of converting large datasets into knowledge: converting the data into catalogues, estimating selection biases and performing robust statistics are all common problems for those working with the data. Astroinformatics, or the science behind the information captured in our wealth of astronomical data, is therefore becoming an increasingly relevant field of study. The AstroInformatics 2010 conference was organised with the aim of essentially defining this emerging field.

Given that the number of astronomers in the world is unlikely to increase at the same rate as our data volumes, the key to continuing our current rate of discovery is information. The only way to mine large datasets successfully is to have sufficient meta-data provided alongside them to judge their content and value. In astronomy we already do this very well with our standard FITS format for images and tables. FITS was introduced as a standard in 1981, and its files carry standardised text headers containing much of the provenance of the data. But for the flood of data that’s about to be produced in the next few years, something altogether more sophisticated will be required, and that’s where semantic astronomy will come in.
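
To make the idea of a "standardised text header" concrete, here is a minimal sketch of how such a header can be read by a machine. It assumes only the basic FITS card layout (80-character cards, keyword in the first 8 characters, `= ` as the value indicator, an optional comment after a `/`). The function name `parse_fits_header` and all the header values are made up for illustration, and the parser deliberately ignores details such as escaped quotes in strings and continuation cards. Real work would use an established FITS library rather than this.

```python
def parse_fits_header(raw: str) -> dict:
    """Parse concatenated 80-character FITS header cards into a dict.

    Simplified sketch: string values containing quotes or slashes,
    continuation cards and complex numbers are not handled.
    """
    header = {}
    for start in range(0, len(raw), 80):
        card = raw[start:start + 80]
        keyword = card[:8].strip()
        if keyword == "END":
            break  # END marks the last card of the header
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY and blank cards carry no value
        field = card[10:]
        if field.lstrip().startswith("'"):
            # String value: take the text between the single quotes
            value = field.lstrip()[1:].split("'", 1)[0].rstrip()
        else:
            # Everything after the first "/" is a comment
            token = field.split("/", 1)[0].strip()
            if token == "T":
                value = True
            elif token == "F":
                value = False
            else:
                try:
                    value = int(token)
                except ValueError:
                    try:
                        value = float(token)
                    except ValueError:
                        value = token  # leave unrecognised values as text
        header[keyword] = value
    return header


# Made-up example header; none of these values come from a real file.
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                  -32 / 32-bit floating point",
    "NAXIS   =                    2",
    "TELESCOP= 'SDSS 2.5m'          / hypothetical value",
    "DATE-OBS= '2010-06-16'",
    "COMMENT made-up example header",
    "END",
]
raw = "".join(c.ljust(80) for c in cards)
header = parse_fits_header(raw)
print(header)
```

The point is that the provenance (telescope, observation date, data format) comes back as plain key-value pairs, which is exactly the kind of meta-data a search tool can filter on without a human opening the file.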

With semantic astronomy, we essentially want to capture all the information that humans can extract from a dataset in a machine-readable form. In a way, this is just like the tagging and trackbacks we perform on social networking sites and blogs, like this one. But formalising and standardising this so that we can compare and cross-match data from different sources or epochs requires the definition of ontologies. We spent a whole day at the conference discussing developments in the semantic web, or web 3.0, and ontologies, drawing parallels between the web at large and the needs of astronomy in particular. These discussions veered off into a level of abstraction that was a little beyond my understanding – but the ideas are very stimulating and well worth thinking about.

David Hogg of NYU gave an excellent, provocative talk questioning everything we hold dear about metadata and semantics. “Semantic astronomy is doomed,” [pdf] he says, arguing that catalogues are essentially just meta-data – and that meta-data is all just interpretation rather than fact. So rather than the kind of cataloguing we do today to characterise our datasets, we need a probabilistic approach to meta-data. While his assertions make my head spin, I do think he makes some good points. Even with the best software for data discovery and the best catalogues, how do we ensure that the best data do indeed rise to the top? This takes us into search engine territory.

The development of a solid data mining infrastructure has so far resulted in the Virtual Observatory, a kind of super-repository that provides links to data sources with a high level of standardisation that allows astronomers to search for and work with data via well-defined protocols. Several software packages have been developed using the VO protocols with varying levels of uptake in the community.
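
As an illustration of what a "well-defined protocol" looks like in practice, here is a rough sketch based on the IVOA Simple Cone Search convention, in which a position and search radius are passed as the query parameters `RA`, `DEC` and `SR`, all in decimal degrees. The service address `https://vo.example.org/cone` and the function name `cone_search_url` are placeholders I have made up; a real service would publish its own base URL and return its results as a VOTable.

```python
from urllib.parse import urlencode


def cone_search_url(base_url: str, ra_deg: float, dec_deg: float,
                    radius_deg: float) -> str:
    """Build a cone-search style query URL (all angles in decimal degrees)."""
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("RA must be in the range [0, 360) degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("DEC must be in the range [-90, 90] degrees")
    if not 0.0 <= radius_deg <= 180.0:
        raise ValueError("SR must be in the range [0, 180] degrees")
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    # Some base URLs already carry a query string; extend it if so
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


# Hypothetical service address; the coordinates are roughly those of M31.
url = cone_search_url("https://vo.example.org/cone", 10.6847, 41.2687, 0.1)
print(url)
```

Because every compliant service accepts the same three parameters, a tool only needs to learn the query once to search many archives.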

I recently attended a VO workshop in Groningen, and by working through the example science cases it was obvious that some of the packages, like TOPCAT, offer excellent added value over currently available tools, whereas others were variations on existing themes. There were a lot of calls at AstroInformatics for this software to be more user-friendly, but personally I think it’s simply the functionality that’s key. Astronomers’ time is valuable, and we don’t want to spend a month getting to grips with a new software tool unless it does something that we need much better than our current tool of choice. Why else would we love Linux so much?

Also very interesting were the talks on what’s going on with semantics in some of our favourite search tools in astronomy: ADS for literature searches, and CDS and NED for data discovery. These services too are experimenting heavily with the new opportunities offered by the social web. Simbad, CDS’ database of astronomical objects, started allowing annotations to object records in March, and the CDS folks are letting this happen with as little intervention or moderation as possible. After 100 days, 57 users had posted 333 annotations – although two thirds of those came from 4 “power users”. Amazingly, the system has received no spam. Slides on this by Sebastien Derriere (CDS) are available online here [pdf]. While the uptake of the CDS annotations service is relatively successful, other initiatives appear less attractive to the community.

Alberto Accomazzi’s talk on the curation of the bibliographic record showed [pdf] that despite the numerous new features offered by the journals, and many now having adopted a delayed open access model, the main way people read the literature via ADS is still to look at the article’s pdf, rather than the interactive html version with access to additional features. Since 2004 AASTeX has offered authors the opportunity to mark their text up with tags such as \object{}, \dataset{} or \facility{}, which automatically produce machine-readable annotations with the paper; however, the uptake of these features is very low. The vast majority of links from journal articles to data or object identifiers were manually extracted by the editors, rather than provided by the authors. So right now, curation is still crucial to the process.
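
To show why these tags matter, here is a small sketch of how author-supplied markup could be harvested automatically. It is not how the journals or ADS actually do it; it simply scans LaTeX source for the three tag names mentioned above with a regular expression. The function name `extract_tags`, the sample text and the dataset identifier in it are all invented for the example, and the pattern assumes the tags may carry one optional argument in square brackets and contain no nested braces.

```python
import re
from collections import defaultdict

# Matches \object{...}, \dataset{...} and \facility{...}, optionally with a
# bracketed argument before the braces (an assumption of this sketch).
TAG_PATTERN = re.compile(
    r"\\(object|dataset|facility)(?:\[[^\]]*\])?\{([^{}]*)\}"
)


def extract_tags(latex_source: str) -> dict:
    """Collect the contents of each tag type, in order of appearance."""
    tags = defaultdict(list)
    for name, content in TAG_PATTERN.findall(latex_source):
        tags[name].append(content.strip())
    return dict(tags)


# Invented sample text; the dataset identifier is a placeholder.
sample = r"""
We observed \object{M31} and \object{NGC 1068} with \facility{Spitzer}.
The reduced spectra are available as \dataset{ADS/Sa.Spitzer#0001234}.
"""
tags = extract_tags(sample)
print(tags)
```

With tagged source, a few lines like these recover the objects, facilities and datasets a paper refers to. Without the tags, that same information has to be dug out by hand, which is the editors' job today.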

All these developments form part of the research cycle that is becoming ever more integrated via the social and semantic web. This closer integration is absolutely required to enable us to continue our rate of discovery in astronomy amidst a deluge of data. Conferences like AstroInformatics are essential to bring this issue to the attention of the community, and to create a debate about strategies to tackle it. On the conference blog, there’s a discussion over some concrete lines of action we can take in the immediate future – go read and give your opinion.

Image: Todd Mason, Mason Productions Inc. / LSST Corporation

Comments

  1. Anders Feder says:

    Hi,

    I am not an astronomer in the traditional sense, but I am working with computer systems for astronomical data and I am very interested in combining this work with my interest in the Semantic Web. Do you know if any kind of online groups exist discussing subjects related to ‘astrosemantics’, e.g. how to apply RDF to astronomy, relevant ontologies or schema etc. I have a lot of trouble finding anything other than slides from the AstroInformatics 2010 conference. Thanks,

    Anders Feder

  2. Matthew Graham says:

    Two websites to look at are:
    http://www.practicalsemanticastronomy.org and http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaSemantics. The latter one is for the Semantics WG in the International Virtual Observatory Alliance and there is an associated mailing list: semantics@ivoa.net.
