Thursday, 27 March 2014

‘Big Data’, genetics and translation

Today sees the announcement of a new public–private partnership in science – the Centre for Therapeutic Target Validation – between EMBL-EBI, the Sanger Institute and GlaxoSmithKline (GSK). The collaboration is dedicated to developing a framework for biological target validation so that we can reduce the amount of time it takes to discover new therapies.
This is a really exciting initiative for me personally, both because the science is challenging and because I have been appointed as interim Head of the CTTV over the next year whilst we look for a long-term Head to steer the collaboration. It has already been a fascinating journey for me to understand the pharmaceutical industry in more depth, and really get to grips with an important scientific problem.

What’s the problem?

In very broad terms, there are three phases in drug discovery: validating biological targets, making small molecules and testing them in clinical trials.
The purpose of target validation is to figure out, for a particular disease, which molecule (usually a protein) you need to change in order to change the disease. This molecule is called the target.
Drugs act on these targets. Drugs are often small molecules and sometimes proteins, like antibodies, that will change the activity of a biological target. A drug has to be able to enter the body, do what it was designed to do and not mess around doing other things. Pharmaceutical companies are really good at making these.
Clinical trials involve giving a new (or repurposed) drug to a group of consenting people. They are jaw-droppingly expensive. Unfortunately, the vast majority (90%) of drugs ultimately fail to make it through clinical trials. A large proportion of those fail because the information on which they were based, from right at the start – the target protein – was not quite right.
What’s the problem? In short, it is that a billion-dollar phase III clinical trial is a very expensive way to discover that your drug wasn't changing the right target.

How do you make it better?

Clearly, validating those biological targets is extremely important and it can be done a lot better. What the CTTV is aiming to do is change the landscape for the initial phase of drug discovery by pooling our knowledge and resources to improve target validation.
GSK realised that this is not something that they (or any other commercial organisation) can do easily in house. Wisely, they decided that this work is best carried out pre-competitively, in the public domain. Around a year ago, members of GSK’s senior leadership came to visit the Genome Campus to explore a way forward, and the CTTV concept was born.

It isn’t going to be easy

So this sounds great! What’s not to like? A large company is funding public domain work for the greater good. But... it's not a walk in the park - this problem is actually quite hard. At what point can one say, definitively, that this protein is a good target for a drug to act on to change the course of a disease?
To resolve this you would ideally create a specific perturbation that changes a specific molecule, verify its safety in humans and give it to people with the disease. In short, develop a drug. But the CTTV aims to get a good handle on target validation without actually making drugs, so what information are we working with, and what are we hoping to deliver?

Genetics is powerful...

For this task, genetics provides powerful tools. For example, some people carry a genetic change and, as a result, have a natural knockout of a protein. Ideally, you would study a large number of people with this profile and determine whether, because of this protein, they are either protected from or more vulnerable to the disease than others. The Broad Institute and deCode recently published an excellent case study of this, finding a natural knockout of a zinc transporter that protects against type II diabetes.

This is just one way to use genetics. I am also excited about using genetics to get negative information - that a particular protein is not a good drug target for a disease. How to do this? Well, if you can convince yourself that a variant is definitely changing the activity of a protein, even only a little, and this change has no impact on disease risk, then the protein is unlikely to be a good target. This area of work will borrow statistical techniques from epidemiology, in particular the rather impressive-sounding "Mendelian Randomisation" approach.
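To make the "negative information" idea a little more concrete, here is a minimal sketch of the simplest Mendelian Randomisation calculation I know of, the single-variant ratio (Wald) estimate: divide the variant's effect on disease by its effect on the protein. This is only an illustration, not the CTTV's method. The class and function names are hypothetical, the numbers are made up, and the standard error uses a first-order approximation that ignores uncertainty in the variant-protein effect.

```python
import math
from dataclasses import dataclass


@dataclass
class VariantEffect:
    """Per-allele effect of one genetic variant, with its standard error."""
    beta: float
    se: float


def wald_ratio(on_protein: VariantEffect, on_disease: VariantEffect):
    """Ratio estimate of the protein's effect on disease, via one variant.

    Assumes the variant affects disease only through the protein.
    The standard error is a first-order approximation.
    """
    if on_protein.beta == 0:
        raise ValueError("the variant must change the protein's activity")
    estimate = on_disease.beta / on_protein.beta
    se = on_disease.se / abs(on_protein.beta)
    return estimate, se


def confidence_interval(estimate: float, se: float, z: float = 1.96):
    """Approximate 95% interval under a normal approximation."""
    return estimate - z * se, estimate + z * se


if __name__ == "__main__":
    # Made-up numbers: the variant clearly lowers protein activity,
    # but its effect on disease (log-odds) is indistinguishable from zero.
    on_protein = VariantEffect(beta=-0.5, se=0.05)
    on_disease = VariantEffect(beta=0.01, se=0.02)

    estimate, se = wald_ratio(on_protein, on_disease)
    low, high = confidence_interval(estimate, se)
    print(f"estimate {estimate:.3f}, interval ({low:.3f}, {high:.3f})")
    if low < 0 < high:
        print("No detectable effect on disease via this protein.")
```

An interval that sits tightly around zero is the interesting case here: the variant demonstrably changes the protein, yet disease risk does not budge, which argues against the protein as a target. A wide interval around zero says much less, because the study may simply be too small.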

... but there's more to it.

Genetics is just the beginning. Most ‘good’ drug targets, from which people have made successful medicines, are not co-incident with the strongest genetic signals, even though they may be in the same pathway. Much of the genetic signal (i.e. for common diseases) is in regulatory regions, with uncertain links to the proteins of interest.
But hindsight is 20/20: it’s far easier to do this analysis post-hoc, knowing the right answer. The scale and diversity of the data - sequence, expression, interactions, reactions - make this solidly a ‘big data’ problem that requires both engineering and statistical sophistication.
Molecular biology moves very quickly; often, a new kind of experiment will unlock a problem or narrow the possibilities from thousands to tens of targets. 
Not to get too geeky, but (like most experimentalists) I’m particularly excited about the CRISPR/Cas9 technology because it makes it possible to introduce specific mutations into cell lines. This should be particularly powerful in oncology, where systematic cancer sequencing efforts are giving rise to a number of robust targets. Cancer research has been transformed by the ability to more systematically find oncogenes and tumour suppressors via sequencing, leading to targeted therapies such as BRAF inhibitors. The question is: which ones in a now established cancer cell do you need to target to slow (or stop) its growth?

Open data, of course

Like every other project we do at the EBI and Sanger, this collaboration is solidly committed to open data, open methods, open web sites, peer-reviewed publication and public discussion. The results will be distributed in accordance with the data-sharing policies of both institutes. Full stop.
But why would a big company find that attractive?
GSK, in its ‘enlightened self interest’, has set an ambitious goal of fundamentally shifting the way drugs are developed, and has backed the effort with substantial funds. But this isn’t just self interest – the results of work done at the CTTV will benefit all other drug companies and anyone working in drug development. And while GSK has put up the funding, the institutes on the Genome Campus have staked considerable in-kind contributions of people and resources.

A full-bodied blend

The skill sets of the three institutions are complementary: GSK brings a deep understanding of finding and verifying drug targets, albeit in a bespoke manner. The Sanger Institute brings to the table substantial expertise in genetics, genomics, cancer genomics and cellular genetics, in terms of both experimentation and analysis. The EBI provides large-scale reference datasets, engineering at scale and innovative analysis.
The CTTV is kicking off with a motivated team of people who understand and respect these skill sets. I’m really lucky to be starting with a strong science team, and we’re already learning from each other, scientifically and organisationally. What is most exciting right now is grappling with some of these key problems in human disease biology. It's going to be a really interesting year for me, and I hope a great start for the CTTV.
Bring it on!

Tuesday, 7 January 2014

New media for science: 3 years in.

Around three years ago, I decided to use social media in order to engage - as a scientist - with a broader group of people. Since then, I've come to see platforms like Twitter and blogs as a way to reshape public scientific discourse - mainly for the better but with considerable adjustments. Recently, the 5000th person started following me on Twitter, which I think represents a kind of turning point and perhaps a good opportunity to reflect on the pros and cons of using these new channels to communicate with a wider audience.

First, some numbers

Just in terms of 'follower' stats, those 5000 Twitter users put me in the top 1%. This doesn't come anywhere near the followings of celebrities like Stephen Fry (the Crown Prince of Intelligent Twitter, with 6 million followers) or science communicators like Ed Yong (20,000 followers), but I still take it as good news for genomics and bioinformatics generally. (There are lots of different ways that people have of tracking 'influence' using multiple factors, but for now I'll stick with the simplest number.) Browsing the list of my Twitter followers is interesting - about 100 or so I know well and perhaps another 200 I know vaguely, with a small smattering of people with a broader audience than me. But the vast majority are practising scientists (from quite a broad range of fields), with a fair proportion of what I assume are interested non-scientists (or scientists hiding it well on their Twitter profile). The reach is global - from South Africa to Norway, though with far more European and US people.

This blog is viewed around 150 times per day, with spikes of up to 1000 views per day when a popular blog post is first released. Anecdotally, this seems to reach a large number of other scientists but also an unexpected number of interested non-scientists. It provides a very different channel of communication than traditional editorials in peer-reviewed publications, and lends a refreshingly informal dimension to the usual scientific discourse - one that is very much in keeping with the global nature of scientific debate. When I started out blogging, my blog was read almost exclusively by people from the US and UK; over the last couple of months there has been a noticeable uptick in people viewing it from Germany and China (though US and UK still dominate). My most read posts are the scientific, more generic ones (Five Statistical things I wish I had been taught; 10 rules of thumb for genomics; and Human genetics, a story of allele frequencies) - these must be linked from a variety of teaching places (there is normally a spike at the start of term) and elsewhere. The single most read post is the one I made for the ENCODE publication (ENCODE; my own thoughts), though "Five Statistical things" is a pretty close second.

No Editors, no gatekeepers

One of the most powerful aspects of media like Twitter and blogs is the direct line between writer and reader: there are no editors or gatekeepers limiting what you can say. Whenever I get into an argument with an editor of a 'traditional journal' about whether my idea is "really what their readers want to hear," I appreciate how liberating it can be for writers to make these decisions directly. This blog lets me focus on the things I find important, interesting or sometimes amusing without having to go through this weird discussion about what a loosely described group of anonymous people might be willing to read. This "no gatekeeper" aspect of new media gives individuals an unprecedented amount of freedom that I don't believe we have fully realised.

But... No Editors

The downside to having no editor is that sometimes... editors are useful. There is the simple aspect of copyediting (not my strong point), but more importantly a good editor can condense or recast text to make it easier to understand or digest. There is also the meta-aspect of whether your thoughts are actually worth reading, or how open they may be to misinterpretation.

I've developed an informal network of people at the EBI who help me with these things. Foremost is Mary Todd Bergman, who is the EBI's communications officer and, thankfully, copyedits nearly every post (and edits some extensively). I've found it very useful to see my original words processed and reflected back by Mary - the readability is always improved, and if she's changed the sense that is invariably because my thoughts were not clear. After two or three rounds of editing, we've normally settled on text that is both clear and gets my points across. [Note from Mary: It feels very weird to copyedit this paragraph!]

For more scientifically focused posts (such as The EBI as a Data Refinery or CERN for Molecular Biologists), I normally start by asking scientists at the EBI who are closer to the work to feed back on the original draft. For example, for the "Refinery" post, Claire O'Donovan helped me explain the precise inter-relationship between UniProt, GO annotation and InterPro. In the case of the "CERN" post, I asked for input on the draft from Ian Bird, who is Head of Scientific Computing at CERN and had some crucial feedback. Then it goes to Mary for smoothing out and back to the expert group for their blessing. For the (rarer) contentious blogs and tweets, I often talk things through with Mary and long-time colleagues Rolf Apweiler (Joint AD at the EBI) and Paul Flicek, who offer quite different - and very valuable - perspectives on these complex topics.

So, this blog is far more of a joint effort than the traditional concept of a lone person keeping an online diary. I've come to see it as a partly personal and partly 'institutional' e-journal.

Lessons learned


For any medium, really, there is no point in producing text that cannot be easily read by the majority of your audience - and this audience is bound to be much broader in new media like Twitter and blogs than it would be in a traditional journal. Scientific text in journals prizes precision in terminology, and also encourages a writing style that sometimes strays into a sort of tribal signalling process that bonds writer and (niche) reader. Precision in scientific terminology is important, but I do think this often goes too far. I have found it very liberating, when writing for my blog, to put myself in the shoes of an 'interested lay person' so that I can convey what I find so important about seemingly complex concepts.

Blogging is a new journal

I see blogging as basically a lightweight e-publication. In fact, I'd quite like to see some of the tracking features of more traditional journals (e.g., a citation scheme and a DOI) applied to blogs, but only if it did not compromise the lightweight, 'no gatekeeper' aspect of the medium. When you think about it, the difference between a blog and a pre-print server is not so great: you try to put down one or two ideas per blog post, in a way that will stand relatively independent of any other context (and one should link out to provide any necessary context).

Twitter is an overheard conversation

Twitter, on the other hand, is all about context. It is an overheard conversation - the sort of thing you get all the time in the lunch queue in conferences or in the bar - and provides a peek into the peculiar interactions between scientists. Any Twitter stream almost by definition has to be seen in context - a particular blog post perhaps, or paper, or news item. An entire Twitter conversation is rarely sparked by a topic that originates solely on Twitter, and even in those rare cases when it does, one needs a veritable swarm of tweets to understand it.

Some people don't get Twitter (which is fine - not everyone likes to natter in the lunch queue, either); but some people overrate the importance of their statements. Tweets, like conversations, are ephemeral things. The amazing thing (a bit gobsmacking, really) is that a fair proportion of the world *could* dip into any given conversation, if they wanted to.

It's worth repeating that tweeting and blogging are - seriously - fully public activities and can be stored forever. Basic rule: If you are not happy to hear your tweet or post broadcast over loudspeakers on Oxford Street / Champs-Élysées / Kurfürstendamm / Times Square / wherever, then simply refrain.

Humour doesn't work on Twitter.

Honestly, it doesn't. If you are a 'Judo Tweeter' you might get away with it, but crafting a funny, cross-cultural, informative message in 140 characters is no simple task. And no, smiley faces don't make you funny. I forget this every 6 months or so, and kick myself as my oh-so-funny tweets cause (at best) confusion.

Who am I, here?

It is pretty important to know "who you are" on Twitter or on your blog - and to have a pretty consistent set of themes for what you want to say. This is fundamentally why people want to follow you. Unless you are super-famous (or writing on Facebook) the theme has to be more defined than "yourself, uncut". On Twitter my themes are EMBL-EBI stuff, life science (with a focus on genetics) and computing in the life sciences. I rarely stray away from these themes - perhaps under 5% of my tweets are off these themes.

Early on I made the mistake of tweeting about stuff I like that seems to be well received in social media. For example, I find the annual Eurovision song contest Twitter feed just tear-jerkingly funny. Tweeting about it is not On Theme and is clearly Not Funny to a good number of people: the first time I did this I didn't have enough followers to realise what was going on. The second time, I lost about 20% of my followers, one or two with a rather snarky sign-off to me. I will not repeat it (though I might set up a ewanbirneyeurovision account just for this annual event). My rare 'cricket ashes' tweets might cross over this line, but I hope they are few enough to be tolerable, and even be considered to add a tiny bit of personal spice to the mix.


There are downsides. Twitter seems to latch on to a rather deep bit of human psychology: wanting to know what is going on, which leads to obsessively checking your Twitter feed. This is not good. The 140-character format is inadequate for explaining complex ideas, or for having a real to-and-fro discourse where both sides understand each other's position (there is really no replacement for meeting face to face). Finally, similar to email, it is surprisingly easy to be nasty in ASCII text - the speed of delivery coupled with the lack of an in-person human response can be dangerous.

Come join the fun

Overall, my experience of using both Twitter and blogs has been positive. I feel as if I have new forums on which to say things that I find interesting, I am able to connect to both scientists and non-scientists in an easy and global manner and I regularly learn about articles and ideas I might not otherwise have heard about. I think many other scientists would benefit from engaging more using these media. It's not for everyone, and certainly there are some downsides, but I do encourage people to give it a go.