Complex Grammar Of The Genomic Language

A new study from Sweden's Karolinska Institutet shows that the 'grammar' of the human genetic code is more complex than that of even the most intricately constructed spoken languages in the world. The findings, published in the journal Nature, explain why the human genome is so difficult to decipher -- and contribute to the further understanding of how genetic differences affect the risk of developing diseases on an individual level.

"The genome contains all the information needed to build and maintain an organism, but it also holds the details of an individual's risk of developing common diseases such as diabetes, heart disease and cancer", says the study's lead author Arttu Jolma, a doctoral student at the Department of Biosciences and Nutrition. "If we can improve our ability to read and understand the human genome, we will also be able to make better use of the rapidly accumulating genomic information on a large number of diseases for medical benefits."

The sequencing of the human genome in the year 2000 revealed the order of the 3 billion letters (A, C, G and T) that make up the human genome. However, knowing just the order of the letters is not sufficient for translating genomic discoveries into medical benefits; one also needs to understand what the sequences of letters mean. In other words, it is necessary to identify the 'words' and the 'grammar' of the language of the genome.

Photo: Researchers Arttu Jolma and Jussi Taipale in the lab at the Department of Biosciences and Nutrition, Karolinska Institutet in Sweden. Credit: Ulf Sirborn

The cells in our body have almost identical genomes, but differ from each other because different genes are active (expressed) in different types of cells. Each gene has a regulatory region that contains the instructions controlling when and where the gene is expressed. This gene regulatory code is read by proteins called transcription factors that bind to specific 'DNA words' and either increase or decrease the expression of the associated gene.

Under the supervision of Professor Jussi Taipale, researchers at Karolinska Institutet have previously identified most of the DNA words recognised by individual transcription factors. Much like in a natural human language, however, DNA words can be joined to form compound words that are read by multiple transcription factors, and the mechanism by which such compound words are read had not previously been examined. In their recent study in Nature, the Taipale team therefore examined the binding preferences of pairs of transcription factors and systematically mapped the compound DNA words they bind to.

Their analysis reveals that the grammar of the genetic code is much more complex than that of even the most complex human languages. In a compound DNA word, the two words are not simply joined by deleting the space between them. Instead, the individual words are themselves altered when they are joined, leading to a large number of completely new words.

"Our study identified many such words, increasing the understanding of how genes are regulated in both normal development and cancer", says Arttu Jolma. "The results pave the way for cracking the genetic code that controls the expression of genes."

source: Karolinska Institutet