I've lamented multiple times the negative influence on scientific culture of some trends in the use of computational tools to analyze large datasets, particularly in biology.

Over at Nobel Intent, John Timmer brings up another issue with computational analyses of complex phenomena, namely reproducibility:

In the past, reproduction was generally a straightforward affair. Given a list of reagents, and an outline of the procedure used to generate some results, other labs should be able to see the same things. If a result couldn't be reproduced, then it could be a sign that the original result was so sensitive to the initial conditions that it probably wasn't generally relevant; more seriously, it could be viewed as a sign of serious error or fraud...

But, when it comes to computational analysis, both the equivalent of reagents and procedures have a series of issues that act against reproducibility. The raw material of computational analysis can be a complex mix of public information and internally generated data—for example, it's not uncommon to see a paper that combines information from the public genome repositories with a gene expression analysis performed by an individual research team.

A lot of this data is in a constant state of flux; new genomes are being completed at a staggering pace, meaning that an analysis performed six months later may produce substantially different results unless careful versioning is used...

And that's just the data. An analysis pipeline may involve dozens of specialized software tools chained together in series, each with a number of parameters that need to be documented for their output to be reproduced. Like the data, some of these tools are proprietary, and many of them undergo frequent revisions that add new features, change algorithms, and so on. Some of them may be developed in-house, where commenting and version control often take a back seat to simply getting software that works. Finally, even the best commercial software has bugs.
The net result is that, even in cases where all the data and tools are public, it may simply be impossible to produce the exact same results.
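
To make the "careful versioning" point concrete, here is a minimal sketch of the kind of record-keeping that helps: a short Python script that writes out the exact data releases, checksums, tool versions, and parameters behind an analysis. This is my own illustration, not anything from Timmer's piece, and the file names, tool commands, and parameter values are hypothetical placeholders.

import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256(path: Path) -> str:
    """Checksum an input file so a later rerun can verify it is unchanged."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tool_version(cmd: list[str]) -> str:
    """Ask an external tool for its version string (e.g. ['samtools', '--version'])."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=False)
        text_out = (out.stdout or out.stderr).strip()
        return text_out.splitlines()[0] if text_out else "unknown"
    except FileNotFoundError:
        return "not installed"


manifest = {
    "run_date": datetime.now(timezone.utc).isoformat(),
    "inputs": {
        # Hypothetical inputs: a public genome build plus in-house expression data.
        "reference_genome": {
            "release": "GRCh38.p14",          # the exact public release used
            "sha256": sha256(Path("GRCh38.fa")),
        },
        "expression_counts": {
            "sha256": sha256(Path("counts.tsv")),
        },
    },
    "tools": {
        # Versions of every tool in the pipeline, proprietary or in-house alike.
        "aligner": tool_version(["bwa"]),
        "samtools": tool_version(["samtools", "--version"]),
    },
    "parameters": {
        # Every non-default parameter that affects the output.
        "min_mapping_quality": 30,
        "fdr_threshold": 0.05,
    },
}

Path("analysis_manifest.json").write_text(json.dumps(manifest, indent=2))

A manifest like this doesn't solve the problem, but it at least turns "an analysis performed six months later" from a guessing game into a diff.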


One proposed solution is that all software code used in such research should be open to inspection by other researchers; that's definitely a good start.

The other solution, at least in biology, is that conclusions generated by complex computational tools need to stay close to empirical results: they need to be focused on testable and relevant hypotheses. The experimental part of biology is not about to be replaced by computers and databases of sequence and interactome data.

