The problem usually crops up when computational biologists don't seem to care whether their computational results correspond with any biological reality. If a computer model or algorithm is able to (more or less) recapitulate existing data, then that's considered sufficient. But then what is your model contributing? We already knew the existing data, and chances are, your model hasn't contributed anything new to computer science.
Recently, I've moderated my stance on this a little: there is perhaps one legitimate niche for computational biologists who don't really care about testing their models with new experiments. It's conceivable that you can write an algorithm or develop a modeling approach that doesn't advance some fundamental computer science question, and that doesn't teach us anything new about biology, but nevertheless produces a new way of dealing with a computational problem.
An example: you might figure out a better way to incorporate prior information (like a model of how gene regulatory elements evolve) into a computational tool that searches for regulatory elements in genome sequence. Your improved method does better than others on the very data you used to build your model (say, gene expression data from the cell cycle or Drosophila embryonic development), and so you're satisfied; it's time to publish. Don't deceive yourself - you have not shown that your method is generally better at capturing real biology, because you haven't tested your model on new data in some new context. Checking for overfitting by training on only half your original data and testing on the rest doesn't count, because your success could just be due to some quirk of that particular data set. Your method might do significantly worse on data from a different context. (And yes, this happens frequently.)
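To make that distinction concrete, here's a toy sketch in Python. Everything in it is invented for illustration - the features, labels, and sample sizes stand in for "original context" and "new context" data and are not anyone's actual method or data set. The point is only that a model can look excellent under within-dataset cross-validation and still collapse on data generated under different rules.

```python
# Sketch with made-up data: within-dataset cross-validation can look fine
# even when the model fails on data from a different (hypothetical) context.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# "Original context": labels driven by feature 0 (purely synthetic).
X_train = rng.normal(size=(500, 20))
y_train = (X_train[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

# "New context": labels driven by a different feature, standing in for data
# from another biological setting where the learned signal no longer applies.
X_new = rng.normal(size=(500, 20))
y_new = (X_new[:, 5] + 0.1 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression(max_iter=1000)

# Held-out performance within the original data set: looks very good.
cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()

# Performance on data from the different context: near chance.
model.fit(X_train, y_train)
new_acc = model.score(X_new, y_new)

print(f"cross-validated accuracy (original context): {cv_acc:.2f}")
print(f"accuracy on new-context data:                {new_acc:.2f}")
```

The cross-validated number is honest as far as it goes; it just answers a narrower question than "does this method capture real biology?"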
In this case, you haven't advanced biology or addressed a deep question in computer science, but maybe you have developed a new tool that some other computational biologist can use to genuinely learn something about biology. This is a narrow niche, and not my idea of great science, but I can see how this could be useful. Unfortunately, a big chunk of the literature falls into this category.
While I grudgingly accept that there may be a niche for computational biology that doesn't produce new results about biology, I've had a hard time understanding how some computational biologists can be so passive about the issue. Don't they care whether their methods for aligning non-coding sequence/finding cis-regulatory elements/predicting protein-protein interactions/modeling gene regulatory networks are right? To know whether you're right, you must test your model on something new. And if you have two models that explain the data equally well, the next step is to devise some experiment that will distinguish between them.
The philosophy I'm advocating here is captured in a quote by Richard Feynman that I've used before:
There is also a more subtle problem. When you have put a lot of ideas together to make an elaborate theory, you want to make sure, when explaining what it fits, that those things it fits are not just the things that gave you the idea for the theory; but that the finished theory makes something else come out right, in addition.
I believe strongly in this, but I've recently had an epiphany about computational biology. I understand now, I think, why some computational biologists don't agree with this philosophy of science. Their outlook is dramatically different, and it explains why the field works the way it does. What is this outlook? It's this: the goal of computational scientists is to explain the existing data with models that produce a good fit using the fewest parameters. If they can do so, even if it's only on the data that was used to build the model, then, they argue, their model represents a better understanding of the biological system.
And so, if you have two models that explain the data equally well, instead of devising some experimental test that will distinguish between the two, you simply go with the model that has fewer parameters and assume, as a matter of philosophy, that this model is better.
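For what it's worth, here is a rough sketch of that philosophy in action, using AIC as one common formalization of "fit well with the fewest parameters." The data and the two candidate models are invented for illustration; the point is that the criterion is computed entirely from the data the models were fit to, so it can rank the models without ever consulting a new measurement.

```python
# Sketch of parsimony-based model selection (AIC as a stand-in for "fewest
# parameters"): both models are judged only on the data they were fit to.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.size)   # made-up "existing data"

def aic(y_obs, y_fit, n_params):
    """Akaike information criterion assuming Gaussian residuals."""
    n = y_obs.size
    rss = np.sum((y_obs - y_fit) ** 2)
    return n * np.log(rss / n) + 2 * n_params

# Model A: straight line, 2 parameters. Model B: quartic, 5 parameters.
fit_a = np.polyval(np.polyfit(x, y, 1), x)
fit_b = np.polyval(np.polyfit(x, y, 4), x)

print("AIC, 2-parameter model:", round(aic(y, fit_a, 2), 1))
print("AIC, 5-parameter model:", round(aic(y, fit_b, 5), 1))
# AIC typically favors the 2-parameter model here, but nothing in this
# comparison says either model captures the underlying biology.
```

The calculation tells you which model is more parsimonious; only new data could tell you which one is right.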
I can't agree with this approach. If your goal is pure prediction, without understanding the underlying phenomena, then this approach is OK. But computational biologists don't limit themselves to pure predictions - they love to make claims about how their model shows what makes the cell cycle robust, or how their results disprove some prevailing idea about the evolution of transcription factor binding sites. They talk like they've gained "mechanistic insight" into gene regulation or some other biological phenomenon.
To make claims like that about some biological phenomenon, it's simply not enough to have a model with fewer parameters. You have to go out and compare your claims with reality. Without that, you have no clue whether your ideas are right.