Some call it the data deluge, others the Fourth Paradigm – whatever your phrase of choice, it’s undeniable that science is increasingly driven by the easy availability of large amounts of data. The web is instrumental in their dissemination around the world. Web service providers such as Amazon enable storage of and access to data in the cloud. Continuing our progress in the exploration of the natural world depends ever more crucially on our ability to curate data and extract information from it.
On the last day of .Astronomy, David Hogg gave a talk on the paper he posted with collaborator Dustin Lang to astro-ph last week. In it, Lang & Hogg describe how they reconstructed the orbit of Comet 17P/Holmes, which was prominently visible in the night sky in 2007, from images posted to the web by amateur photographers. After performing a Yahoo! image search and sorting out the relevant pictures, they ran their image set through the Astrometry.net system. Astrometry.net, created by Lang, cleverly attempts to calculate an astrometric calibration of astronomical images that contain no positional information, by fitting the positions of stars to known asterisms.
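The core trick behind that blind calibration is geometric hashing: small groups of stars are reduced to codes that don't change under translation, rotation or scaling, so an uncalibrated image can be matched against a pre-built index of the sky. Here's a much-simplified sketch of the idea (a toy illustration, not Astrometry.net's actual code):

```python
import math

def quad_hash(stars):
    """Compute a translation/rotation/scale-invariant hash code for a
    quad of stars, in the spirit of Astrometry.net's geometric hashing
    (a simplified illustration, not the real implementation).

    stars: list of four (x, y) positions.
    Returns the positions of the two "inner" stars expressed in a frame
    where the most widely separated pair maps to (0, 0) and (1, 1).
    """
    # The most widely separated pair of stars defines the frame.
    pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
    a, b = max(pairs, key=lambda p: math.dist(stars[p[0]], stars[p[1]]))
    (ax, ay), (bx, by) = stars[a], stars[b]
    dx, dy = bx - ax, by - ay
    scale = dx * dx + dy * dy

    def to_frame(p):
        # Complex-ratio transform z -> (z - A) / (B - A), which is
        # invariant under any similarity transform of the whole field...
        px, py = p[0] - ax, p[1] - ay
        u = (px * dx + py * dy) / scale
        v = (py * dx - px * dy) / scale
        # ...then rotated/scaled so that B lands on (1, 1).
        return (u - v, u + v)

    inner = [to_frame(stars[i]) for i in range(4) if i not in (a, b)]
    return tuple(coord for p in inner for coord in p)
```

Because the code is invariant, the same four stars photographed at any orientation, zoom or offset hash to (nearly) the same value, which is what makes a fast index lookup against known asterisms possible.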
The position of the comet itself is not fitted by the algorithm, since it is a fast-moving object rather than a star at a fixed position. Working on the assumption that photographers try to place the object of note near the centre of the frame, Lang & Hogg essentially fitted the pointing direction of the photographers' cameras rather than the comet itself. The only available metadata is the EXIF information generated by the digital cameras, which contains a timestamp (but no timezone, which adds a bit of confusion).
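To see why the missing timezone matters: an EXIF `DateTimeOriginal` string records local clock time only, so the same string corresponds to different UTC instants (and hence different points along the comet's orbit) depending on where the photographer was. A small sketch, with a made-up timestamp:

```python
from datetime import datetime, timedelta, timezone

# A typical EXIF DateTimeOriginal string: local time, no timezone.
exif_stamp = "2007:11:03 21:15:42"  # hypothetical example
naive = datetime.strptime(exif_stamp, "%Y:%m:%d %H:%M:%S")

# Each plausible UTC offset turns the same string into a different
# UTC instant -- an ambiguity of several hours in the observation time.
candidates = {
    off: naive.replace(tzinfo=timezone(timedelta(hours=off)))
    for off in (-8, 0, 2)  # e.g. US Pacific, UTC, Central European summer
}
utc_instants = {off: dt.astimezone(timezone.utc)
                for off, dt in candidates.items()}
```

A few hours either way is a small fraction of the comet's orbit, which is presumably part of why the method still works despite the ambiguity.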
The resulting orbit, obtained by Bayesian inference from the astrometrically calibrated images, is pretty close to that previously published in the literature. That's a remarkable result, considering that the authors didn't even attempt to locate the comet itself in the images. The method is crude but effective.
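The paper's actual machinery is a full probabilistic model of photographer behaviour; as a far cruder stand-in for the same intuition (all numbers below invented), one can treat each calibrated field centre as a noisy measurement of the comet's position and recover its motion by simple least squares:

```python
import random

# Toy data: a comet drifting linearly in RA (degrees) over time (days),
# "observed" only through noisy image centres, on the assumption that
# photographers roughly centre the comet. All values are made up.
random.seed(42)
true_ra0, true_rate = 50.0, 0.4          # starting RA, drift in deg/day
times = [i * 0.5 for i in range(40)]
centres = [true_ra0 + true_rate * t + random.gauss(0, 0.3) for t in times]

def fit_line(ts, ys):
    """Closed-form least-squares fit of y = a + b * t."""
    n = len(ts)
    st, sy = sum(ts), sum(ys)
    stt = sum(t * t for t in ts)
    sty = sum(t * y for t, y in zip(ts, ys))
    b = (n * sty - st * sy) / (n * stt - st * st)
    a = (sy - b * st) / n
    return a, b

ra0, rate = fit_line(times, centres)
# ra0 and rate land close to 50.0 and 0.4 despite the pointing scatter.
```

The real problem is of course harder (two sky coordinates, a genuine orbital model, outlier images), but the principle is the same: many sloppy pointings, properly modelled, pin down a precise trajectory.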
It’s a light-hearted paper and a fun read (with lolcats!), notably posted to astro-ph on April 1st. But it nonetheless carries an important message.
The orbit of Comet Holmes was well constrained before this work was published; that’s not the novelty of this work. The method, however, is both neat and meaningful, in that it combines both the ready availability of seemingly (scientifically) useless data and clever modelling to get to a real scientific result. It’s citizen science with unwitting participants. Compared with the traditional citizen science projects in astronomy, like Galaxy Zoo or the ε Aur observing campaign, this work is cheap as chips, as the data are already there.
This was one of Hogg’s main points in his .Astronomy talk: there’s a huge amount of highly informative and unexplored data available, both in our science archives and out there, in the wild (world web). Exploring and exploiting these data using well-chosen probabilistic models, as demonstrated in the paper, is far cheaper than building new hardware to produce yet more data.
The community is continuously chasing the next big instrument, the next big milestone, and this often means we don't get to maximise our return from existing data and facilities. If it seems like I'm betraying my trade as an instrumentalist by agreeing with this, that's not my intention. Too often instruments don't reach their full capability because not enough resources are invested into understanding their behaviour, developing the most appropriate processing software, or getting to the bottom of unexpected results. And that's a frustrating situation for everyone involved.
The ability to manipulate and describe large datasets in a physically meaningful way, through models and carefully constructed catalogues, will become ever more crucial with the next generation of observatories optimised for large surveys, like Pan-STARRS (already here), the Square Kilometre Array (SKA) and the Large Synoptic Survey Telescope (LSST). These facilities will dominate the astronomical data landscape in the next decade. Hogg and his collaborators may well be leaders in the field of astronomical data inference techniques today – by Hogg's own entertaining account, they are attempting to produce "a probabilistic model of everything". But their methods should become standard ingredients of the astronomer's toolkit in the future. That will require a change in mindset and in education.
As the money to fund new hardware appears to be drying up around the world, smaller science budgets may well help accelerate our transition from hardware- to data-driven science. We’ll simply have to get more creative with the stuff we have. This paper and Hogg’s talk have certainly inspired me to deal with data in a more thoughtful way than I have in the past.
Lang, D., & Hogg, D. W. (2011). Searching for comets on the World Wide Web: the orbit of 17P/Holmes from the behavior of photographers. arXiv:1103.6038
Lang, D., et al. (2010). Astrometry.net: Blind astrometric calibration of arbitrary astronomical images. AJ, 139(5), 1782–1800 (ADS)