Invisible Software, Dark Software | Computation Institute

By Rob Mitchum // January 14, 2014

Science is often driven by the instruments scientists use to answer questions and study the world. The historical inventions of the telescope, the microscope, the gene sequencer, and the spectrometer have propelled researchers into new frontiers of discovery. Today, as computing becomes integral to research in virtually every field, software is arguably the most important scientific instrument. Many researchers now write their own code or acquire software to manage and analyze data, run computer simulations, operate laboratory equipment, or direct other processes critical to modern science.

But software also has a publicity problem. While scientific publications often go into great detail about the laboratory equipment used during experiments, software is less likely to receive due credit. Even in studies using high-performance computing for simulation and modeling, software reporting is selective, with less information about what code is used before and after the centerpiece simulation. In two recent blog posts at his newly redesigned website, CI Director Ian Foster offers solutions to these issues that can lead to faster and better research discoveries.

In an article on “software invisibility,” Foster outlines why incomplete reporting of software usage is a problem for scientists, for scientific funding policy, and for the understanding of science itself. He also proposes a fix in the form of micrometrics, “low-level data about software usage that is collected by instrumenting software to provide low-level usage information.” Instead of counting on scientists to self-report how they used software on a project, statistics about what software they used and how often they used it can be automatically collected and reported.

“The resulting micrometrics will lack the specificity of explicit acknowledgements of use. For example, they will not tell us that a particular discovery was due to the use of package A. However, they can tell us that package A was used 10 times more than any other package by the discovering team. Similarly, we may see that use of a package B at institution C emerges at the same time that student D joins that institution. Neither correlation necessarily corresponds to causation, but in both cases we obtain useful data without any user effort.”
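
To make the idea concrete, here is a minimal sketch of what instrumenting software for usage counts could look like. It is not Foster's or Globus Online's implementation; the names `usage_counts`, `instrumented`, `usage_ratio`, and the two example packages are all hypothetical, and the counts are kept in memory, whereas a real service would presumably report them to a central collector.

```python
from collections import Counter
from functools import wraps

# Hypothetical in-memory store of usage events, keyed by package name.
usage_counts = Counter()


def instrumented(package_name):
    """Return a decorator that records one usage event per call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            usage_counts[package_name] += 1  # collected without any user effort
            return func(*args, **kwargs)
        return wrapper
    return decorator


@instrumented("package_a")
def analyze(values):
    """Stand-in for an analysis routine in a hypothetical package A."""
    return sum(values) / len(values)


@instrumented("package_b")
def transfer(path):
    """Stand-in for a data-movement routine in a hypothetical package B."""
    return f"moved {path}"


def usage_ratio(counts, name):
    """Ratio of `name`'s usage to the most-used other package, or None."""
    others = [n for pkg, n in counts.items() if pkg != name]
    if not others or max(others) == 0:
        return None
    return counts[name] / max(others)
```

A team that calls `analyze` ten times and `transfer` once would yield a `usage_ratio(usage_counts, "package_a")` of 10.0. As the quote notes, that figure shows relative usage but says nothing about which package was responsible for a discovery.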

Foster goes on to discuss how Globus Online has used micrometrics in providing its research data management software, and how similar software-as-a-service models can facilitate the use of micrometrics in monitoring scientific software use.

Another, related issue that Foster highlights on his website is “dark software.” In this case, Foster draws upon the astronomical concept of dark matter as a metaphor for the code that is typically not reported in extreme-scale computing projects, such as climate modeling and astrophysics simulations. While the central code used in these experiments is often well-documented, the extensive software methods needed to prepare data for simulation runs — the “on-ramps” — and to work with the data the simulation produces — the “off-ramps” — are less likely to be reported.

In a white paper for a Department of Energy workshop on software productivity in extreme-scale science, Foster outlines the scale of the problem presented by dark software, citing examples of simulations that produced millions of data files requiring complicated ex post facto software to manage and analyze.

“No project today has anything but ad-hoc solutions to the problems of capturing, mapping, analyzing, and managing such large and diffuse collections of information. Researchers may use diverse file formats, file naming strategies, and databases to record some information; much more often remains in their heads where it is neither long-term accessible nor easily sharable with others. Thus, tasks such as the following tasks become exceedingly difficult, and if encoded in software form (e.g., as scripts) are not easily re-used: comparing data with output from a run performed last year or from related applications; determining what parameter values and code versions were used for a run; determining which computations might need to be re-run to incorporate new experimental results; or identifying and locating special features across a range of runs. Scientific productivity is constrained by the lack of more structured support for such tasks.”
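
As an illustration of the kind of structured support the quote describes, the sketch below records the parameter values and code version of each run so that they can be queried later. It is only one possible approach under simple assumptions, not something proposed in the white paper; `RunRecord`, `to_json`, `from_json`, and `runs_needing_rerun` are hypothetical names, and parameters are assumed to be plain JSON-compatible values.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RunRecord:
    """Minimal description of one simulation run (hypothetical schema)."""
    run_id: str
    code_version: str
    parameters: dict
    output_files: list = field(default_factory=list)


def to_json(records):
    """Serialize run records so they outlive the researcher's memory."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


def from_json(text):
    """Rebuild run records from the JSON produced by to_json."""
    return [RunRecord(**item) for item in json.loads(text)]


def runs_needing_rerun(records, parameter, new_value):
    """Return ids of runs whose recorded parameter differs from new_value."""
    return [r.run_id for r in records if r.parameters.get(parameter) != new_value]
```

With records like these, a question such as "which runs used the old value of a given parameter?" becomes a one-line query rather than a search through file names and notes.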

Addressing these constraints, Foster suggests a research program for low-cost solutions to dark software, based around the question of “how discovery processes should be reconsidered in an era of massively computational and collaborative research.” In this area as well, software-as-a-service products may offer low-cost and easily implemented solutions to bring dark software back into the light.

To read and respond to these posts (as well as a post seeking comments for the second edition of a “History of the Grid” paper Foster wrote with Carl Kesselman in 2011), and view recent publications and presentations, visit Ian Foster’s website.
