I’ve been giving some thought to software development in astronomy, which is a difficult topic. All astronomers agree that good data processing, and hence good software, is crucial to doing rigorous science. Interpreting observational data – translating electrons on a detector into scientific knowledge – requires a solid understanding of the instrument, the observing conditions, and the exact process by which the data were treated. Many large ground- and space-based observatories, like those run by ESO, Gemini and NASA, strive to provide the community with “science-ready” data. This means that the data are processed to remove all instrumental signatures, allowing astronomers to dive straight into the analysis.
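To make “removing instrumental signatures” concrete, the sketch below shows the kind of basic CCD calibration such a pipeline performs for a simple imager: bias subtraction and flat-field correction. It is a minimal illustration only, not any observatory’s actual pipeline, and the file names are hypothetical placeholders.

```python
# A minimal sketch of instrumental-signature removal for a simple CCD imager:
# bias subtraction and flat-field correction. Illustrative only; real pipelines
# also handle dark current, cosmic rays, bad pixels, astrometry, etc.
import numpy as np
from astropy.io import fits

def basic_ccd_reduction(science_file, bias_files, flat_files):
    """Return a science frame with bias and pixel-to-pixel response removed."""
    # Median-combine the bias frames to build a master bias
    master_bias = np.median(
        [fits.getdata(f).astype(float) for f in bias_files], axis=0
    )

    # Build a normalised master flat from bias-subtracted flat-field exposures
    flats = [fits.getdata(f).astype(float) - master_bias for f in flat_files]
    master_flat = np.median(flats, axis=0)
    master_flat /= np.median(master_flat)

    # Apply the calibration to the science exposure
    science = fits.getdata(science_file).astype(float) - master_bias
    return science / master_flat
```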
The rationale is that providing science-ready data essentially makes them usable by a much wider community than those involved in the observing campaign, or those used to working with a given instrument. Indeed, a big driver behind the global Virtual Observatory initiative is the “democratisation of astronomy” by providing anyone in the world with ready-to-use astronomical data, irrespective of their location or affiliation to large organisations.
Given that many small and large observatories lack the resources to make their data available to a broad community of astronomers, or simply don’t consider it a priority, the efforts of the large organisations are entirely commendable. In their paper for the Astro2010 Decadal Survey, Richard White and colleagues convincingly showed the impact of the Hubble and Chandra archives – for the Hubble Space Telescope, papers using only archival data have outnumbered PI-led ones since 2006. Indeed, the Hubble and Chandra observatories seem to be examples of how to process and archive data well.
But not all observatories are managing to provide “science-ready” data to the community – see for example this discussion in ESO’s Messenger from 2004 by Silva & Peron. The problem is that writing processing software requires resources and manpower, particularly given the complexity of today’s astronomical instruments, and developers often don’t stick around on a project long enough to see it through extensive testing and debugging. Furthermore, the software developers at large organisations like ESO are not the end users of the science data, and they often lack input from the science community on the requirements for science pipelines, or feedback on the pipelines’ shortcomings.
So what very often happens is that an observatory spends a considerable amount of money on pipelines to help astronomers process their data, only for those pipelines to turn out to be (i) too complex for non-specialists to use, rendering the data useless; and (ii) not transparent or flexible enough for the specialists, who end up writing their own pipelines for processing and analysis.
Part of the problem is that the development of pipeline software is one of the last steps in the development of new instrumentation, when budgets are running low and deadlines become increasingly critical, so it is a prime target for cutting costs or saving time. In the case of instruments built for large organisations, software development is often still ongoing when the consortium hands the instrument over to the observatory. Whose territory the software falls under is then unclear, and it has a tendency to fall through the cracks.
This can turn pipeline development into a frustrating endeavour for all involved. Software engineers care about their work and don’t like hearing that their products are of little use. Non-specialist users find it prohibitively complicated to use archive data to do science. And specialists around the world end up writing their own software, which results in a massive duplication of effort in the observing community. To me this always seemed to be a huge waste of effort and resources, crying out for an alternative model of development.
But last week I was chatting to a good astronomer friend of mine, a seasoned observer and data analyst, and her perspective was quite different. She argued that the current model is about right: specialist observers will never ever trust a black-box pipeline and will always prefer to write their own analysis code, because data reduction is simply part of doing robust science. The observatories should focus on providing “quick and dirty” pipelines that let astronomers visualise their data – something most observatories already do pretty well – rather than trying to make the data “science-ready”. The definition of “science-ready” can’t be standardised in any case; it depends entirely on the information the observer is trying to extract from the data. That kind of makes sense too, and now I’m confused.
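For contrast with the full calibration sketch above, a “quick and dirty” quick-look step can be as small as the snippet below: load a raw frame and display it with a simple percentile stretch so the observer can judge data quality. Again, this is only an illustration of the idea, not any observatory’s actual tool, and the file name is a hypothetical placeholder.

```python
# A minimal "quick-look" sketch: display a raw frame with a percentile stretch
# so an observer can eyeball data quality. The file name is a placeholder.
import numpy as np
import matplotlib.pyplot as plt
from astropy.io import fits

data = fits.getdata("raw_exposure.fits").astype(float)
vmin, vmax = np.percentile(data, [5, 99])  # clip the display range
plt.imshow(data, origin="lower", cmap="gray", vmin=vmin, vmax=vmax)
plt.colorbar(label="counts")
plt.title("Quick-look: raw_exposure.fits")
plt.show()
```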
There must be a middle ground in all this. Another paper submitted to the Astro2010 Decadal Survey, by Benjamin Weiner of Steward Observatory and colleagues, presents the problem very clearly and suggests several solutions. The one I particularly like is the establishment of a code repository where astronomers can make their software publicly available. Many scientists already post scripts they’ve written on their webpages; for larger software initiatives, a descriptive paper can provide a means of citation. A few other recent initiatives have made promising moves in that direction, although an actual repository has yet to emerge.
In the category of citable, community-led software efforts, a paper was posted to astro-ph last week describing p3d, a generic data-reduction package for fibre-fed integral field spectrographs that has been tested on several instruments on 2–4 m-class telescopes. These are often the observatories that lack the resources for a large-scale software development effort, so the p3d project is a really great initiative. Last year a group of postdocs set up the Integral Field Spectroscopy wiki, which has extensive resources on the growing number of integral field spectrographs, including many links to software packages and useful code. It’s a very useful website, but, as with many wikis (Wikipedia included), it suffers from a lack of specialist contributions – which in turn stems from a lack of motivation to spend time contributing.
What is the right approach? I have no idea. The current model seems to work fine for some, but leaves others frustrated – and it’s not clear to me that the money invested in software by the observatories is optimally spent. But do we need an entirely different model, or maybe just more discussion, collaboration – perhaps more data reduction workshops to get people to share their knowledge? If you have any ideas, post them below.
C. Sandin, T. Becker, M. M. Roth, J. Gerssen, A. Monreal-Ibero, P. Böhm & P. Weilbacher (2010). p3d: a general data-reduction tool for fiber-fed integral-field spectrographs. Accepted by A&A. arXiv:1002.4406v1
D. M. Silva & M. Peron (2004). VLT Science Products Produced by Pipelines: A Status Report. The Messenger, No. 118, pp. 2–7 (ADS)
M. S. Westmoquette, K. M. Exter, L. Christensen, M. Maier, M. Lemoine-Busserolle, J. Turner & T. Marquart (2009). The integral field spectroscopy (IFS) wiki. arXiv:0905.3054v1 (accompanying IFS wiki site)