Tuesday, 28 October 2014

RNA is now a first-class bioinformatics molecule.

RNA research is expanding very quickly, and a public resource for these extremely valuable datasets has been long overdue.

Some 30 years ago, scientists realised that RNA was not just an intermediary between DNA and protein (with a couple of functions on the side), but a polymer that could fold into complex shapes and catalyse countless reactions. The importance of RNA was cemented when the structure of the ribosome was determined (something for which Venki Ramakrishnan, Ada E. Yonath and Tom Steitz won the Nobel Prize; see, for example, Venki's Nobel lecture) and it was confirmed that the core function of ribosomes – making a peptide bond between two amino acids – was catalysed by ribosomal RNA and not by proteins. It’s also likely that RNA – not protein, not DNA – was the first active biomolecule in the primordial soup that gave rise to Life. Indeed, one could easily see DNA as an efficient storage scheme for RNA information, and proteins as an extension of single-stranded RNA’s catalytic capabilities, enabled by the monstrous enzyme, ribosomal RNA.

Even focusing on RNA’s established role as the cell’s information carrier, the textbook mRNA, RNA-based interactions are widely recognised as being important. A real insight was the discovery of microRNA (miRNA): small RNAs whose actions lead to the down-regulation of transcripts by suppressing translation efficiency and cleaving mRNAs. MicroRNA has brought to life a whole new world of other small RNAs, many of which are involved in suppressing “genome parasites” – repeat sequences that every organism needs to manage.

And then there are the long RNAs in mammalian genomes that do not encode proteins (i.e. long non-coding RNA, or lincRNA). These have long been recognised as having some significance – but what do they do? Some are clearly important, like the non-coding RNA poster child Xist, which inactivates one of the X chromosomes in female mammals to ensure the correct dosage of gene products. Others are involved in imprinting/epigenetic processes, for example the curiously named HOTAIR, which influences transcription on a neighbouring chromosome.

RNA: something missing

Discoveries in RNA biology have expanded the molecular biologist’s toolkit considerably in recent years. For instance, the cleavage systems from small RNAs can be used (as siRNA or shRNA) to knock down genes at the transcript level. The current “wow” technology, CRISPR/Cas9, is a bacterial phage defence system that uses an RNA-based component to adapt to new phages easily. This system has been repurposed for gene editing in (seemingly) all species – every genetics grant written these days probably has a CRISPR/Cas9 component.

And yet in terms of bioinformatics, RNA data was – until this past September – rather uncoordinated. There wasn’t a good way to talk consistently about well-known RNAs across all types, although this was sometimes coordinated within sub-fields, for example by Sam Griffiths-Jones’ excellent miRBase for miRNAs, or Todd Lowe’s GtRNAdb resource for tRNAs. But because RNA data was mostly handled in one-off schemes, researchers working in this area were hindered. Computational research couldn’t progress to the next stages, for example capturing molecular function and process terms with GO or collecting protein–RNA interactions in a consistent way.

RNAcentral in the bioinformatics toolkit

So I’m delighted to see the RNAcentral project emerge (http://rnacentral.org/). RNAcentral is coordinating the excellent individual developments emerging in different RNA subdisciplines: miRNAs, piRNAs, lincRNAs, rRNAs, tRNAs and many more besides. It provides a common way to talk about RNA, which in turn allows other resources – such as the Gene Ontology or drug interaction databases – to slot in, usually in precisely the same “place” as the protein identifier.

Alex Bateman, who leads the RNAcentral project, has been exploring a more federated approach, quite deliberately gathering the hard-earned, community-driven expertise of member databases in specific, specialised areas of RNA biology.

RNAs were, potentially, the first things on our planet that could be considered “alive”. They are critical components in biology, not just volatile intermediaries. In terms of bioinformatics, giving RNA the same love, care and attention as proteins is long overdue, and I look forward to seeing RNAcentral provide the cohesion and stability this area of science so richly deserves. 

Monday, 29 September 2014

A cheat's guide to histone modifications

I was recently having lunch with Sandro, a charming Neapolitan computer science graduate doing a postdoc in my research group, who has a passion for great food and clean C code. We were discussing some recent aggregation results of histone modifications, and Sandro was bemoaning (verbally and non-verbally) the fact that all the histone modifications sounded "just the same". I could relate to the sentiment, recalling my own journey into this world some seven years ago during the start of the ENCODE project when I first faced this bamboozling list of modifications.

Here's my take on histone modifications. It's probably a reasonably accurate snapshot of what we knew by the end of 2013. (There is a lot more to cover, and this view will surely go out of date fairly quickly. If you are reading this post in 2016, you might want to look for your cheat sheet somewhere else!)

So - this summary is mainly for Sandro, but I am pretty sure there are others who might like to use it.


Histones are proteins that package up DNA. The combination of histones and DNA is called "chromatin", and this is the natural way one finds DNA in eukaryotic cells. Histones come in units of two groups of four proteins, with the DNA wrapped around them; each such unit is called a nucleosome.

There are different types of histone protein. And just to be extra confusing, the same type of histone protein is sometimes made by more than one gene. (A lot of histone protein needs to be made during each cell cycle.)

Histone proteins are mainly compact, globular structures, but each one has a floppy peptide region at the start of the protein, which is described as the histone "tail" (somewhat confusingly in my view, as it's at the start of the protein! Isn't it more like a "trunk"?). These histone tails have many lysines (single-letter amino acid code, K) which can often be modified chemically with the addition of a methyl group (CH3) or an acetyl group (CH3CO; the business end of acetic acid, i.e. vinegar, basically).

Modification themes

There are some strong themes that emerge in the sets of modifications, and the most important rule for recognising them is that a particular lysine can have either 1, 2 or 3 methyl groups, or one acetyl group. There are other combinations and modifications, but this rule (lysine in one of four states) is the main one.

There are many different histone proteins, and each protein has many different lysines. Because the histone modifications were first described biochemically, they had a standard naming scheme, for instance H3K4me3, or H3K27ac. The naming scheme is:
  1. H is for histone (the "H" in H3K4me3)
  2. The next number is the type of histone protein. Much of the action happens on histone 3 (the "3" in H3K4me3)
  3. The next letter is the amino acid that is modified. It is very often lysine: K (the "K" in H3K4me3)
  4. The next number is the position of that amino acid in the protein. Remember, the histone tail is at the N-terminus, so it is often a small number (the "4" in H3K4me3)
  5. The last group is the modification. It might be me1 (mono-methylation), me2 (di-methylation), me3 (tri-methylation, as in the example) or ac (acetylation). (N.B. 'me' by itself is ambiguous, but 'ac' by itself is not.)
This is great if you are interested in the precise chemical makeup of the modifications. However, most people are interested in the implications for the cell. To be honest, I think we would be better off giving each one of these modifications a distinctive name, as they do in Drosophila gene naming schemes, but... this is what we've got. 
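
As a rough illustration of the naming scheme above, here is a minimal Python sketch that splits a mark name into its parts. The names HistoneMark and parse_mark are made up for this example, and the pattern only covers the simple cases described in this post (a numbered histone, one residue, and me1/me2/me3/me/ac); histone variants and other modifications are deliberately out of scope.

```python
import re
from typing import NamedTuple, Optional


class HistoneMark(NamedTuple):
    """One parsed histone modification name (hypothetical helper type)."""
    histone_type: int             # e.g. 3 for histone H3
    residue: str                  # single-letter amino acid code, e.g. "K"
    position: int                 # position of the residue in the protein
    kind: str                     # "me" (methylation) or "ac" (acetylation)
    methyl_groups: Optional[int]  # 1, 2 or 3; None for "ac" or a bare "me"


# H, histone number, residue letter, position, then "ac" or "me" + optional count.
_MARK_PATTERN = re.compile(r"^H(\d+)([A-Z])(\d+)(?:(ac)|me([123])?)$")


def parse_mark(name: str) -> HistoneMark:
    """Parse names such as 'H3K4me3' or 'H3K27ac'; raise ValueError otherwise."""
    match = _MARK_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised histone mark name: {name!r}")
    histone, residue, position, acetyl, methyl = match.groups()
    kind = "ac" if acetyl else "me"
    # A bare 'me' is ambiguous, so the methyl count is left as None.
    methyl_groups = int(methyl) if methyl else None
    return HistoneMark(int(histone), residue, int(position), kind, methyl_groups)
```

For example, parse_mark("H3K4me3") gives histone type 3, residue K, position 4, and a methylation with three methyl groups, while parse_mark("H3K4me") keeps the methyl count as None to reflect the ambiguity noted in the list.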

Histone modifications and chromatin behaviour

Histone modifications are observed across the genome, but are very different in different parts of the genome and in different cell types. There is a raging debate about whether histone modifications can be considered to drive chromatin behaviour, or whether chromatin behaviour is simply a consequence of things which are happening nearby on the chromatin, in particular transcription factor binding. (I am mainly a "consequence" person here.) Either way, these modifications are extremely informative about what is going on in any given cell type.

Another thing to keep in mind is that "pairs" of histone modifications can be found on the same residue. Very often, one of the pair is more about activation and the other about repression, or put another way, one marks promoters and the other enhancers. Having a pair forces a sort of duality in which any particular histone has to be in one state or the other.

Well-known modifications: some notes

The H3K4me3 vs H3K4me1 pair: promoters vs intergenic

H3K4me3 is the classic histone modification, and an active mark. It was discovered early on, and is usually present on active promoters in a tight, localised area (and elsewhere, though perhaps these are mainly unannotated promoters?). H3K4me3 is your go-to histone modification to help define a promoter. 

H3K4me1 is kind of its opposite, as it is present far more in intergenic regions, including many "enhancers", though both active and inactive enhancers. H3K4me1 is more enigmatic than H3K4me3, and is a slightly less localised mark.

H3K4me2 is (normally) tightly correlated with H3K4me3, and I think of it as really being on its way to getting its third methylation to become H3K4me3.

Although H3K4me3 is activating, just to make life confusing, the other me3 modifications are, for the most part, repressive. (Biology is shameless about lack of consistency sometimes!)

The H3K27me3 vs H3K27ac pair: repressed vs active

H3K27me3 is the classic repressive histone modification. You find it in regions of the genome that are deliberately switched off. One of its most famous associations is with the polycomb complex, which is a chromatin-organising system that provides cellular memory, particularly during development.

H3K27me3 is also present on the inactive X chromosome. In common with most other repressive marks, H3K27me3 is far more diffuse, and there are mechanisms that take an initial H3K27me3 region and expand it "automatically".

In contrast, H3K27ac is a strong, "active" mark, and shows activity over both active promoters and active enhancers. As this is happening in the same residue, you can see that this system is being set up nicely to be either active or repressed. 

The H3K36me3 and H3K79me2 pair: transcriptional repression

These two marks are not an opposing pair (notice they are not on the same residue), but they do similar jobs: they essentially provide extra repression of transcription initiation in gene bodies.

When a gene is transcribed, there is this huge, hulking protein complex (RNA polymerase) that is marching through the chromatin, making it far easier for cryptic promoters in the DNA sequence to be activated. This would lead to a mess (in particular if they were going anti-sense), except that the RNA polymerase deliberately comes to the rescue with a histone modification scheme that puts down a "don't start transcription here, because I am transcribing through this region" message: H3K36me3. This means this mark is indicative of polymerase activity, and in theory should be relatively flat across a gene body. 

H3K79me2 is the same thing, but only at the start of the gene, whereas H3K36me3 picks up after the first bit.

The H3K9me3 vs H3K9ac pair: repeat repression vs active

The H3K9me3 vs H3K9ac pairing is another repressed vs active pair, though you mostly hear about the repressive form because it is very often used on repeats. Repeats in the genome are genomic parasites which, when active, are happily copying themselves around the genome. The host genomes have some pretty ingenious ways to detect and repress these repeats. The final "effector" of repression is very often H3K9me3 deposition. 

There is also something weird going on with H3K9me3 and zinc finger genes. 
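
For anyone who prefers their cheat sheet in code, the notes above can be condensed into a small lookup table. This is only a sketch: the one-line descriptions are my shorthand for the sections above, and CHEAT_SHEET and group_by_residue are names invented for this example. Grouping the marks by residue makes the "pairs" easy to spot.

```python
import re
from collections import defaultdict

# Shorthand summaries of the notes in this post; not an authoritative catalogue.
CHEAT_SHEET = {
    "H3K4me3": "active; tight, localised mark on active promoters",
    "H3K4me1": "intergenic regions, including active and inactive enhancers",
    "H3K4me2": "usually tightly correlated with H3K4me3",
    "H3K27me3": "repressive and diffuse; polycomb, inactive X",
    "H3K27ac": "active; active promoters and active enhancers",
    "H3K36me3": "transcribed gene bodies, after the first part of the gene",
    "H3K79me2": "transcribed gene bodies, at the start of the gene",
    "H3K9me3": "repressive; very often used on repeats",
    "H3K9ac": "active counterpart of H3K9me3",
}


def group_by_residue(marks):
    """Group mark names by the residue they sit on, e.g. 'H3K27'."""
    groups = defaultdict(list)
    for mark in marks:
        # Strip the trailing modification (me, me1, me2, me3 or ac).
        residue = re.sub(r"(me[123]?|ac)$", "", mark)
        groups[residue].append(mark)
    return dict(groups)


if __name__ == "__main__":
    for residue, marks in group_by_residue(CHEAT_SHEET).items():
        print(residue)
        for mark in marks:
            print(f"  {mark}: {CHEAT_SHEET[mark]}")
```

Marks that land in the same group (for example H3K27me3 and H3K27ac) compete for the same lysine, which is the duality described earlier; H3K36me3 and H3K79me2 each end up alone in their group because they sit on different residues.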

... and many more

There are many other modifications, many of which people don't actually know much about - for example the common but mysterious H4K20me1. But I'll keep this post to the classics - you can delve into the tens to hundreds of more mysterious ones through many papers.

Sandro - I hope this was useful :)