On Software in Astronomy

Importance of the Hubble archive. The number of archival papers has exceeded the number of PI-led papers since 2006 (from White et al., 2009)

I’ve been giving some thought to software development in astronomy, which is a difficult topic. All astronomers agree that good data processing, and hence good software, is crucial to doing rigorous science. Interpreting observational data – translating electrons on a detector into scientific knowledge – requires a solid understanding of the instrument, the observing conditions, and the exact process by which the data were treated. Many large ground- and space-based observatories, like those run by ESO, Gemini and NASA, strive to provide the community with “science-ready” data. This means that the data are processed to remove all instrumental signatures, allowing astronomers to dive straight into the analysis.

The rationale is that providing science-ready data makes them usable by a much wider community than those involved in the observing campaign, or those used to working with a given instrument. Indeed, a big driver behind the global Virtual Observatory initiative is the “democratisation of astronomy”: providing anyone in the world with ready-to-use astronomical data, irrespective of their location or affiliation with large organisations.

Given that many small and large observatories don’t have the resources, or simply don’t consider it a priority, to make their data available to a broad community of astronomers, the efforts of the large organisations are entirely commendable. In their paper submitted to the Astro2010 Decadal Survey, Richard White and colleagues convincingly showed the impact of the Hubble and Chandra archives – for the Hubble Space Telescope, papers using only archival data have outnumbered PI-led ones since 2006. Indeed, the Hubble and Chandra observatories seem to be examples of how to process and archive data well.

But not all observatories are managing to provide “science-ready” data to the community – see for example this discussion in ESO’s Messenger from 2004 by Silva & Péron. The problem is that writing processing software requires resources and manpower, particularly given the complexity of today’s astronomical instruments, and developers often don’t stick around on a project long enough for extensive testing and debugging. Furthermore, the software developers at large organisations like ESO are not the end users of the science data, and they often lack input from the science community on the requirements for science pipelines, or feedback on their problems.

So what happens very often is that an observatory spends considerable amounts of money on the provision of pipelines to help astronomers process their data, only for these pipelines to turn out to be (i) too complex for non-specialists to use, leaving the data effectively unusable; and (ii) not transparent or flexible enough for the specialists, who end up writing their own pipelines for processing and analysis.

Part of the problem is that the development of pipeline software is one of the last steps in the development of new instrumentation, when budgets are running low and deadlines become increasingly critical – a prime target, therefore, for cutting costs or saving time. In the case of instruments built for large organisations, software development is often still ongoing when the consortium hands an instrument over to the observatory. Whose territory the software falls under is unclear, and it has a tendency to fall between the cracks.

This can turn pipeline development into a frustrating endeavour for all involved. Software engineers care about their work and don’t like hearing that their products are of little use. Non-specialist users find it prohibitively complicated to use archive data to do science. And specialists around the world end up writing their own software, which results in a massive duplication of effort in the observing community. To me this always seemed to be a huge waste of effort and resources, crying out for an alternative model of development.

But last week I was chatting to a good astronomer friend of mine, a seasoned observer and data analyst, and her perspective was quite different. She argued that the current model is about right: specialist observers will never trust a black-box pipeline and will always prefer to write their own analysis code. Data reduction is simply part of doing robust science. The observatories should focus on providing “quick and dirty” pipelines that allow astronomers to visualise the data – which most of them already do pretty well – rather than try to make the data “science-ready”. The definition of “science-ready” can’t be standardised anyway, since it depends entirely on the information the observer is trying to extract from the data. That kind of makes sense too, and now I’m confused.

There must be a middle ground in all this. In another paper submitted to the Astro2010 Decadal Survey, Benjamin Weiner of Steward Observatory and colleagues present the problem very clearly and suggest several solutions. The one I particularly like is the establishment of a code repository, where astronomers can make software publicly available. Many scientists already post scripts they’ve written on their webpages; for larger software initiatives, a descriptive paper can provide a method for citation. A few other recent initiatives have made promising moves in that direction – although an actual repository has yet to emerge.

In the category of citable, community-led software efforts, a paper was posted to astro-ph last week describing p3d, a generic data-reduction package for fibre-fed integral field spectrographs that has been tested on several instruments on 2–4 m class telescopes. These are often the observatories that lack the resources for a large-scale software development effort, so the p3d project is a really great initiative. Last year a group of postdocs set up the Integral Field Spectroscopy wiki, which has extensive resources on the growing number of integral field spectrographs, including many links to software packages and useful code. It’s a very useful website but, as with many wikis, including Wikipedia, it suffers from a lack of specialist contributions – driven in turn by the lack of incentives to spend time contributing.

What is the right approach? I have no idea. The current model seems to work fine for some, but leaves others frustrated – and it’s not clear to me that the money invested in software by the observatories is optimally spent. But do we need an entirely different model, or maybe just more discussion, collaboration – perhaps more data reduction workshops to get people to share their knowledge? If you have any ideas, post them below.

C. Sandin, T. Becker, M. M. Roth, J. Gerssen, A. Monreal-Ibero, P. Böhm, & P. Weilbacher (2010). p3d: a general data-reduction tool for fiber-fed integral-field spectrographs. Accepted by A&A. arXiv:1002.4406v1

D. M. Silva & M. Péron (2004). VLT Science Products Produced by Pipelines: A Status Report. The Messenger, No. 118, pp. 2–7 (ADS)

M. S. Westmoquette, K. M. Exter, L. Christensen, M. Maier, M. Lemoine-Busserolle, J. Turner, & T. Marquart (2009). The integral field spectroscopy (IFS) wiki. Paper accompanying the IFS wiki site. arXiv:0905.3054v1

Comments

  1. There are a few issues as I see it –

    Some ‘big science’ missions (even future missions) consider the data to be the property of the PI or some other group that limits its dissemination; in these cases, software is a moot point, as you don’t have the data. Hopefully, AGU’s data policy and the Astronomer’s Data Manifesto will help to change some of the attitudes towards data release.

    As for issues with the data pipeline approach – you have to be sure that whenever the PI adjusts their calibration, they go back and adjust *everything*, which for some missions might take months or longer to reprocess and redistribute to mirror sites. Sometimes a scientist might disagree with the assumptions that were made going into the initial analysis, or they might want to ensure that all of the files were identically processed. (I can think of a case where the software had a flaw when run on PPC processors that wasn’t discovered until the code was being ported to Intel Macs years later.)
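
    To make that “identically processed” check concrete: this is my own sketch, not any mission’s toolkit, and the CAL_VER keyword is hypothetical – real missions each have their own header conventions. A few lines of Python with pyfits can flag files reduced with mismatched calibration versions:

        # Group FITS files by the calibration version recorded in their
        # headers; "CAL_VER" is a hypothetical keyword name.
        import glob
        import pyfits

        versions = {}
        for filename in glob.glob("*.fits"):
            header = pyfits.getheader(filename)
            versions.setdefault(header.get("CAL_VER", "unknown"), []).append(filename)

        if len(versions) > 1:
            print("Warning: inconsistent calibration versions:")
            for version, files in versions.items():
                print("  %s: %d file(s)" % (version, len(files)))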

    The solar physics community *does* have a system for distributing its analysis software (solarsoft: http://en.wikipedia.org/wiki/Solarsoft) … unfortunately, the majority of the software requires a proprietary software license to run (the same one used by p3d), which limits its use by outside groups. And since solar physics is a driver for many other fields, our data are of interest (in higher processed forms, typically not with the same requirements as for solar physics) to other earth and space disciplines.

    What it comes down to is that different communities have different needs. Yes, there’s wasteful duplication of effort, and there should be reward systems for distributing data and software. (I can’t remember if it was at the AGU Town Hall on Data Citation or the ASIS&T Panel on the DataNet Partners, but someone pointed out that writing a bad paper counts towards tenure, while releasing a great data set doesn’t. The same is true for great software – there should be a mechanism to recognize the software that helped you find/reduce/visualize/etc. the data, and to reward the people who write it.)

    … but in some cases people want the processed data and trust the PIs (in situ heliophysics data and dopplergrams come to mind), or scientists want data from other fields where ‘good enough’ data is okay (e.g., was something happening at the sun that might’ve been responsible for this reading in the magnetosphere?), while others prefer the level 0 data to ensure it’s all identically processed. Sometimes the issue is speed (e.g., space weather; instrument planning) over absolute calibration … what I’m trying to say is that the ‘best’ form of the data is a function of the user and the intended use, not just the instrument or type of measurement.

    … um … I should stop now, as I could go on about this topic for a while …

    (insert standard disclaimer about these being personal observations and conclusions, and that I’m not authorized to speak on behalf of my employer or place of work, etc.)

  2. I actually think this problem – grad students constantly re-inventing the wheel of data reduction pipelines, and the lack of pipelines for new instruments – is one of the largest inefficiencies in astronomy. This is a tough problem and I’m too young and naive to have a good opinion about how to fix it permanently.

    In the meantime, however, we have set up the AstroBetter Wiki (in the Astronomical Methods section) to serve as a general code and cookbook repository – similar to the IFS wiki, except open to any instrument or observing method. The idea is that once someone figures something out, they post it to the wiki, where others can comment on and modify it. The current model of putting things on personal web pages is too static and results in lots of outdated cookbooks.

    http://www.astrobetter.com/wiki/

  3. @Joe – you’re right, whether or not data are deemed “public” or “private” is a whole other can of worms, and I deliberately tried to stay away from that issue. Not that I don’t care, in fact I’m deeply interested in it, but I wanted to focus on software. It’s true that writing software is under-appreciated and it should (i) be easier to write papers about software to allow citation and hence credit, or (ii) be possible to directly cite a software package, in some way.

    @Kelle – I knew I was forgetting an example! The Astrobetter wiki is a really good initiative that can help share resources and code. Thanks for the reminder :-)

  4. Telescopes like ALMA and the SKA are going to provide pipelined data to end users, which will help non-expert users get science out of what are going to be very complex systems. As I understand it, ALMA is also going to archive the raw data so that those users who want to reprocess it can do so, but with the expected data rates of the SKA this may not be possible. While I can see that it may be impractical with future telescopes, the thought of not being able to go back to the raw data does make me a bit uneasy…

  5. Those results from the White et al paper are very striking! I hope they (or someone) do a similar analysis for the SDSS at some point.

    I don’t agree with the Weiner et al paper on the need for a new repository. I think a much better idea is to use an existing solution, like github (say), and a common tag, say “astro”. This has the following advantages:

    (1) It can be done immediately.

    (2) If a few people start to do it then that will drive adoption by other people.

    (3) The site is really superb, and battle-tested.

    There are some potential problems – e.g., what happens if github (or whatever repository is used) goes out of business? I think the solution there is to work with github to back up all the data at a separate facility. I’d be shocked if this couldn’t be arranged easily.
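
    Incidentally, git’s distributed design makes a do-it-yourself backup straightforward even without github’s cooperation. As a minimal sketch – the repository URLs below are made-up examples – a short Python script can maintain local mirrors of the repositories you care about:

        # Keep local mirror copies of a list of git repositories, as a
        # hedge against a hosting site disappearing. URLs are examples.
        import os
        import subprocess

        REPOS = [
            "git://github.com/example/astro-photometry.git",  # hypothetical
            "git://github.com/example/astro-pipeline.git",    # hypothetical
        ]
        MIRROR_DIR = os.path.expanduser("~/repo-mirrors")

        if not os.path.isdir(MIRROR_DIR):
            os.makedirs(MIRROR_DIR)

        for url in REPOS:
            target = os.path.join(MIRROR_DIR, os.path.basename(url))
            if os.path.isdir(target):
                # Update an existing mirror in place.
                subprocess.check_call(["git", "--git-dir", target, "fetch", "--prune"])
            else:
                # Create a full bare mirror, including branches and tags.
                subprocess.check_call(["git", "clone", "--mirror", url, target])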

  6. Megan – yes, ESO really emphasise this for ALMA. It’s not entirely clear to me in what way they’ll do things differently than with their optical/IR data, given that they’re not quite managing to provide science-ready data to the community (by their own admission). Maybe they’re just hiring more people, increasing the manpower rather than taking some fundamentally new approach. A grad student here once asked Tim de Zeeuw (the ESO DG) at a talk about changes in ESO’s approach to pipeline development, but didn’t really get a clear answer.

    Michael – what I’d actually quite like to see in the plot by White et al is how many of those archival papers come from legacy datasets (large Hubble observing programmes have to provide justification for the “legacy” value of the resulting data at the proposal stage) vs. from small programmes. It would be interesting to see whether such planning ahead has an impact.

    Re: the software repository – yes, that would actually be very sensible. However, it’s been my experience that astronomers are generally skeptical of any kind of commercially based initiative. I think the younger generation are much more web-savvy, so maybe things will change.

  7. Having a community-wide software repository is a great idea. There are several distributed flavors (Git, Mercurial, Bazaar, …) that allow anyone to keep a full copy of the repository, so the lifetime of the web host should be less of an issue.
    I imagine people will only start using and contributing to such a project once their favorite standard packages are included and supported (cfitsio, the terapix packages, IRAF CL-related packages, stsci software, …). Having a common interface between components is also important, so that you can slip in a different astrometry, photometry or registration piece of software depending on your preferences. Ideally this interface would be able to handle projects developed in a variety of programming languages, and have multi-language hooks to allow inclusion into an astronomer’s existing pipeline. This is important since astronomers program in a wide variety of languages, but tend to develop in only one or two.
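
    To make that concrete, here is a minimal Python sketch of what such a common interface could look like – purely my illustration, with made-up names, not an existing package; the external wrapper stands in for the multi-language hook:

        # Sketch of a common component interface: any photometry
        # implementation following the same calling convention can be
        # slotted into a pipeline. All names here are illustrative.

        class PhotometryComponent(object):
            # Interface: take an image array, return a list of measurements.
            def run(self, image):
                raise NotImplementedError

        class AperturePhotometry(PhotometryComponent):
            def __init__(self, radius):
                self.radius = radius

            def run(self, image):
                # ... real aperture measurement would go here ...
                return []

        class ExternalPhotometry(PhotometryComponent):
            # Wrapper around a tool written in another language,
            # invoked e.g. via subprocess -- the "multi-language hook".
            def __init__(self, command):
                self.command = command

            def run(self, image):
                # ... write image to disk, run the tool, parse its output ...
                return []

        def run_pipeline(image, photometry):
            # The pipeline depends only on the interface, so users can
            # swap in whichever implementation they prefer.
            return photometry.run(image)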

  8. Hi, thanks for reading the white paper on software development. It is also available on the arXiv at http://arxiv.org/abs/0903.3971 – I wouldn’t be surprised if the link at the National Academies eventually breaks (which of course is another symptom of the problem we are discussing).

    Michael – I don’t care if an astronomy software repository is new, is based on an existing one with a tag, or anything else. However, I think it’s got to be obviously astronomy-themed to attract users, and it has to be usable even by dinosaurs (like me) whose idea of version control is “rsync.” The point of the paper was that this is something the community should support with its resources, not just its mouth. The arXiv, and ADS, work because everyone uses them, but they need funding too.

    Pipelines for very large projects such as HST, Chandra, or ALMA work not only because they use a lot of money and scientists to develop the pipeline, but because the instruments are stable and have a relatively small number of observing modes. The flexibility of ground-based instruments makes it harder to develop end-to-end pipelines for them. I don’t expect that pipelines, even for HST, ALMA, LSST or JWST, will replace individual scientists writing code. Nor should they. People will still need to use their scientific judgement to figure out when the pipeline isn’t ideal, or to post-process the pipeline’s product.

    One of my main points is that if you decide to solve this problem by hiring an army of software engineers, you won’t get better software, you’ll just get bigger software. For similar reasons, I am reluctant to recommend any effort to mandate a big common interface design.

    What we need to do is change the incentive structure so that scientists are credited, rewarded, and funded for writing and releasing software as well as for getting their names first on author lists. This will benefit everybody in the long run, because more people will have access to useful software and spend less time reinventing for themselves. You still have to know the limitations of any software you use (SExtractor, for example) but imagine the duplication of effort if there was no public SExtractor or DAOphot.

  9. Thanks for commenting Ben!

    Some astronomers do write papers about software or algorithms they develop, which in essence makes them citable, and I wonder what makes someone decide whether or not to publish. The size of the effort and the science being produced with it seem obvious factors – but there must be more subtle reasons too. I’m in the process of putting a lot of my software online, and the idea of opening it up to public scrutiny (although I’ve never been at all secretive about it) is a little frightening…

    I suppose also our traditional journals, except maybe PASP, are just not suited to a pure-software paper. Do we need a new peer-reviewed journal purely for software/methods/algorithms in astronomy? It could even be designed as a front end to a repository.

  10. Software developers can (and do) publish their papers in PASP, but what’s missing is the respect and *incentive* factor. Writing data reduction software is just not perceived as being as important as other endeavors. I dream of a day when a department is composed of observers, theorists, instrumentalists, and software developers. The standard now seems to be that most software – even for widely used instruments on big telescopes, like TripleSpec and NIRES – is written in someone’s spare time and on the side.

    I’ve been thinking for a while that this problem needs to be addressed in the way the NSF took on Broader Impacts: any proposal to build a new instrument must also include a plan for the data reduction pipeline, and will not be considered for funding without addressing the issue. It seems common for big survey instruments to think about the pipeline from the beginning, but this does not seem true for the smaller workhorses.

  11. The desire to have an astronomy-specific software repository is a perfect example of astronomers wanting to yet again re-invent the wheel!

    Who would fund such a repository? What would happen if funding stopped? There is no reason why it would be more fail-safe than, say, github or sourceforge. In fact, the latter two are so widely used that I would actually rather trust my code to them than to an astronomy-specific repository. The fact that existing solutions are so popular also means that the quality of the websites is much higher than could be achieved by a couple of software developers on an astronomy grant trying to develop something ‘from scratch’.

    Developers should just use whatever existing solutions there are for code hosting. I do think it would be nice to have a centralized wiki with links to the different projects on e.g. sourceforge or github, but that would be relatively simple to develop in comparison to an ‘astronomy sourceforge’.

    I think a ‘Computational Astronomy & Astrophysics’ journal is long overdue. I want to release a radiation transfer code later this year (on sourceforge), and plan to write a paper, but it’s very hard to figure out what journal would accept such a paper!

  12. I don’t want to promote “software developer” as a separate category of scientist. That’s part of our current situation, where software is an activity that is somehow distinct from doing “real science.” I want us to recognize that developing good software is part of being an observer, theorist, or instrumentalist.

    In my opinion, much of the really useful public software in the community’s hands has been built by small numbers of scientists who needed to solve a problem they encountered in their own work – for example DAOphot, the SDSS “unofficial” spectroscopic pipeline (which was essentially assembled by a few postdocs), and GADGET; these are observer, instrument, and theory activities, respectively.

    One of the major problems is that cleaning up software for release, documenting it, and/or writing a paper about it take a lot of effort, and since the rewards for it are relatively small (and there is NO way to get funding for it – even for the page charges, unless it falls under the science topic of a grant you have), few people write it up or publish. This is one of the problems I claim that we need to change.

    I agree with Kelle’s comments about including software as part of the instrument design/proposal. The problems we discussed in the paper remain (even with good intentions, software comes last and the funding gets squeezed). However, there are also people out there who think that, because more or less everyone can program, reduction software can properly be fobbed off on grad students or postdocs after construction – an attitude that needs to change. Even big instruments for large telescopes in the US often have very little funding for software (Keck did not pay for the scientists who wrote the DEIMOS pipeline, for example). I think ESO has bigger teams and takes more of the hired-programmer approach, which has its own set of drawbacks.

    If you read our white paper, at no time did we recommend brewing one’s own repository software. The software underlying at least some of these repositories is open source. Several people seem really energized about this particular question. I think the choice of repository is much less important, and much less difficult, than figuring out how to get the community to reorient its funding and crediting priorities to recognize the effort that goes into writing and releasing software. If you put all your software on sourceforge and your employers and writers of letters of recommendation are fogeys who do not understand what sourceforge is (or any other repository), it is not going to help you in the long run. (Nothing against all you wonderful fogeys out there! I nearly am one myself.)

  13. Thomas, I think computational astronomy is perfectly on-topic for the existing journals. If you think a new journal would improve matters, that’s fine (I’d worry about ghetto-izing the field, though). However, I don’t think you should rule out the regular journals: the scope statement of the ApJ specifically welcomes this type of paper, and MNRAS also publishes computational papers. For example, here is a radiative transfer code paper in MNRAS: P. Jonsson 2006, http://adsabs.harvard.edu/abs/2006MNRAS.372....2J

  14. No astronomy project will ever have an army of software engineers. And that’s a good thing – see Fred Brooks’ http://en.wikipedia.org/wiki/The_Mythical_Man-Month. In general, with very few exceptions, development and maintenance activities for astronomical software are starved for resources.

    Astronomy is research. Research is exploration. Exploration requires unprecedented approaches to problem solving – for software systems as much as or more so than any other part of the empirical exercise.

    That said, there is a long history of software development in astronomy. Entire programming languages and computing environments have been invented to solve our problems. Modern software development is often more about logistics than algorithms.

    The regular ADASS conferences and the software sessions at SPIE are joined in any given year by a number of more targeted astronomy software meetings. And then there’s the IVOA umbrella of activities. As in any community, attending the right meetings is a good first step towards understanding the full context of a problem.

    Many of these meetings result in published volumes or online proceedings that have a shorter turnaround than the refereed literature. The shelf life of software papers is often quite short. Peer-reviewed journal articles remain an option for projects that amount to applied computer science research, rather than exercises in already-established best-practice engineering.

    Which is to say that there are numerous opportunities to collaborate on software projects of wide utility to the community without duplicating efforts already in progress. More traction might be gained from approaching well-established projects with suggestions for joint work on missing functionality than from undertaking a DIY program.

    Production software development is not as easy as it might seem on the surface. Robust software doesn’t just follow automatically, no matter how good the initial idea is. And unless your career goal is computer programming, somebody other than yourself has to be identified to maintain, fix and grow the software package once it is released into the wild.

  15. OK so a few ideas seem to emerge here:
    - repositories good!
    - existing hosting services like github and sourceforge, also good.
    - we should publish more papers about software, and the existing journals are suitable for this (I’m really liking git, btw)
    - developing software in astronomy, in terms of manpower and funding, is… a complicated issue?

    Irrespective of who does it or where it’s hosted, I’m still a little stuck on my original question: what is the best way for observatories to offer their data to observers? Provide both raw and reduced (via a “standard but non-optimal” pipeline) data? Provide only raw data plus a reduction pipeline? And what is the right balance between a black-boxy pipeline and a loose set of modules, transparent and flexible enough for advanced users to get the precision and insight they want? There seem to be a number of use cases for observatory archive data whose requirements conflict.
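
    One possible middle ground – and this is only a sketch of mine, with made-up step names, not any observatory’s actual design – is to build the pipeline as an explicit chain of small steps: running the default chain gives the observatory’s “standard but non-optimal” product, while an advanced user can replace any individual step.

        # A pipeline as an explicit chain of steps. The default chain is
        # the observatory's standard reduction; an advanced user swaps
        # in their own steps. All names are illustrative placeholders.

        def subtract_bias(data):
            return data  # placeholder for real bias subtraction

        def flat_field(data):
            return data  # placeholder for real flat-fielding

        def calibrate_flux(data):
            return data  # placeholder for the standard flux calibration

        DEFAULT_CHAIN = [subtract_bias, flat_field, calibrate_flux]

        def run_chain(data, chain=DEFAULT_CHAIN):
            for step in chain:
                data = step(data)
            return data

        # An advanced user replaces only the step they mistrust:
        def my_flux_calibration(data):
            return data  # the user's own, more careful calibration

        custom_chain = [subtract_bias, flat_field, my_flux_calibration]
        # reduced = run_chain(raw_data, chain=custom_chain)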

  16. Whatever happened to the SETI thing? My friend had the screensaver installed on his computer – the one that ran through all the data in the background, searching for extra-terrestrial contact. Did they ever find anything at all? If so, I wonder if they’d reveal it. :)

Trackbacks

  1. [...] thinking about software development in astronomy and talking about it with friends at work and on this blog, I thought it was about time I put my money where my mouth is. I too write software – in [...]

  2. [...] Sarah Kendrew. Originally posted at ‘One Small Step‘. Posted with Author’s [...]