John Reynders, VP, R&D Bioinformatics, Alexion Pharmaceuticals
Over the next few pages, I would like to cover perhaps two of the buzziest buzzwords in the entire IT lexicon: cloud and big data. After reading these pages, I hope you will find some of the practices, anti-patterns, and hard-won lessons from the life sciences industry to be of some use as you consider the role of cloud and big data in your organization. Starting with the cloud, let's explore value, architecture, and security. One of the main justifications I have heard for moving to the cloud is to save money: to manage the peaks and valleys of use and turn fixed cost into operational cost. All fine goals; however, they miss perhaps the greatest value the cloud provides: speed and agility. The cloud allows organizations from Fortune 500 pharma companies to start-up biotechs to spin up a supercomputer or terabytes of data analytics in an hour to address an emerging opportunity or challenge. Likewise, if a business needs to pivot efforts from a compute-intensive in-silico chemistry screening campaign to a data-intensive genomic mining capability, it can delete one virtual private cloud and create another. The cloud's ability to confer speed and agility to an organization is the primary value, and yes, you save some money too.
Architecture takes a different shape in the cloud. It is still critically important to consider architectural pillars such as data models, service-oriented architecture, integration, and application layers; these are constants whether you are on-premises or in a cloud. However, something new to consider in cloud architecture is the ecosystem of applications, utilities, and solutions associated with any cloud selection. Whether you choose Azure, Force.com, or Amazon, you are also choosing a constellation of cloud-based partners and an extended cloud application stack. So when assessing the suitability of a given cloud solution and architecture, also consider the network effect of its cloud partners.
Finally, the perennial question around any cloud solution: security. This is of particular concern in pharmaceutical and biotech companies where intellectual property is of paramount importance. Let me address this point with a story. I was once briefing the senior leadership of a pharmaceutical company on cloud-based solutions and fielding more than a few questions on security. I shared with them a thought experiment. "Let's suppose we build a data center and secure it with all the cyber security experts we are able to hire. Note: we are a pharmaceutical company. At the same time we challenge our friends at Amazon to build a data center in one of their virtual private clouds and secure it with their cyber security experts. Now let's have our friends at Amazon hide some data in their data center and we'll hide some data in our data center and we will play a game of capture-the-flag. It proceeds with our selecting a mutually agreed external white-hat 'friendly hacker' team that is challenged to find security vulnerabilities and breach our data centers to find the flag. Given this friendly game, which organization would be able to defend its flag most effectively from the white hats?" Invariably, and rather quickly, the answers were uniformly that the team at Amazon would have the upper hand over a pharmaceutical company. This of course leads to the question, "So where then is the safest place to store our data?" After the chorus of "hey-you-tricked-us!" died down, rather productive conversations ensued.
The last cloudy thought I would like to leave with you is a bit provocative. I received my black belt in Tae Kwon Do about eight years ago, and I remember when learning to break boards my instructor always encouraged us to focus our punches six inches past the board. This is how a board can be broken: focus your energy at a point beyond the surface. In approaching the cloud, I have always asked, "What can't I put in the cloud?" Rather than focus on the incremental step of what to move from a data center to the cloud, focus on the provocative concept of putting everything in the cloud and consider only the exceptions that have to stay behind. This leads to the critical thinking of how to operate and integrate fully within a cloud context rather than finding yourself caught in a compromise hybrid mode that is neither cost-effective nor agile. Focus on what seems six inches past the impossible and smash through the wall of your data center into the cloud.
Now if clouds weren't buzzy enough, how about big data? Let me cover three concepts here: anti-patterns, the biggest of the three V’s, and talent. Some of the most useful lessons I've learned in the big data space are what not to do. A simple first anti-pattern is the habit of bringing data to an analytics engine. We are familiar with data warehouses and building data cubes that we can dice, slice, and mine to our heart’s content. However, with the scale of data involved in today's analytics challenges, the appropriate pattern is to bring our analytics to the data and run in place. Even given a distributed data model, many analytic tasks are loosely coupled or embarrassingly parallel, enabling distributed analysis. So to correct the anti-pattern of bringing your data to the analysis, apply the pattern of bringing your analysis to the data.
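To make the "bring your analysis to the data" pattern concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a production design: the partitions are plain in-memory lists standing in for data held at separate sites, local threads stand in for workers running next to each partition, and the names `Summary`, `summarize`, `combine`, and `analyze_in_place` are hypothetical. The point is that only a small, fixed-size summary leaves each partition, never the raw data.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Small, fixed-size result that travels instead of the raw data."""
    count: int
    total: float
    total_sq: float


def summarize(partition):
    """Runs 'next to' one partition and returns only a compact summary."""
    return Summary(
        count=len(partition),
        total=sum(partition),
        total_sq=sum(x * x for x in partition),
    )


def combine(summaries):
    """Merge per-partition summaries into a global mean and variance."""
    n = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance


def analyze_in_place(partitions):
    # Each partition is summarized independently (embarrassingly parallel);
    # threads here are a stand-in for remote workers co-located with the data.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(summarize, partitions))
    return combine(summaries)


# Hypothetical partitions, e.g. measurements held at three separate sites.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = analyze_in_place(partitions)
```

Because each summary is three numbers regardless of partition size, the amount of data moved stays constant as the partitions grow, which is the property the pattern relies on. The sum-of-squares formula is chosen for brevity; it can lose precision on large or ill-conditioned data, so a real system would likely use a more numerically stable merge.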
The second anti-pattern I've found is that of data myopia: the strong belief that simply more of a given class of data will provide more analytic depth or predictive value. In fact, introducing different classes of data allows far greater insight and predictive value with a fraction of the data volume. A great example comes from drug discovery, where scientists seek to find biomarkers that predict a health outcome. By combining multiple data types such as imaging, genomics, proteomics, and clinical data, scientists have been able to define health outcomes with greater fidelity and data economy as compared to biomarkers based upon only a single class of data.
The third anti-pattern, and this one is a doozy, is "build it and they will come," sometimes also referred to as the "can't you just build me a search engine that will find what I want?" syndrome. There is a temptation to think that if I just bring enough data together, I'll be ready to answer any question. This concept is that of hypothesis-free big data, or emergent analytics. I am a big fan of hypothesis-driven big data solutions. Or put a different way: start first with the questions you seek to ask of any big data solution; these will then inform the architecture, data sets, algorithms, and analytics. In particular, one can then proceed iteratively: a question yields an answer, which leads to another question, which leads to another answer. Each of these iterations informs an increasing and organic set of data and methods, bootstrapping between successful demonstrations of value. The waterfall, Big Bang approach to building the Delphic oracle is an unfortunate big data anti-pattern that seems to rear its head far too often. Instead, start with a clear set of focused questions, and chart your big data journey with small and iterative steps.
Now, onto the Big V. Big data is sometimes described by the three V's: volume, velocity, and variety. In the life sciences we mostly see volume and variety. Although the volume is challenging, with imaging, genomic, and real-world evidence data that are very large, one can usually address these challenges with a bigger piece of iron or a larger cloud. What is super challenging in the life sciences is the Big V of variety: complexity and heterogeneity of data. Returning to my biomarker example above, if we look at Alzheimer's research, the data types here span multiple modalities of imaging, genomics, circulating markers, clinical data, and time courses of each. Integrating these high-dimensional, heterogeneous, and complex data with traditional database methods has met with limited success, and has led many in the industry into the vanguard of semantic web technologies. Integrating triple stores, ontologies, and graph-based methods into linked-data platforms has enabled the generation and testing of novel hypotheses across complex data graphs with millions of nodes and tens of millions of edges. And the technology curve is accelerating to the point where the interrogation of graphs with trillions of edges will be commonplace in only a few years, and essential to capturing the exponentially growing scale, complexity, and heterogeneity of life sciences data.
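For readers unfamiliar with triple stores, the following Python sketch shows the basic idea of linking heterogeneous data as subject-predicate-object statements and then querying across them. It is a toy, in-memory illustration only: the `TripleStore` class, the predicate names, and identifiers such as `patient:001` or `variant:V1` are hypothetical placeholders, not real data or any particular product's API. Production linked-data platforms would typically use a dedicated triple store with a query language such as SPARQL.

```python
from collections import defaultdict


class TripleStore:
    """Tiny in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)  # subject -> {(predicate, object)}

    def add(self, s, p, o):
        self.triples.add((s, p, o))
        self.by_subject[s].add((p, o))

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching a pattern; None is a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def reachable(self, start):
        """All nodes reachable from start by following edges forward."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for _, nxt in self.by_subject.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Hypothetical placeholder data mixing genomic and imaging modalities.
store = TripleStore()
store.add("patient:001", "hasGenotype", "variant:V1")
store.add("patient:001", "hasScan", "scan:S1")
store.add("patient:002", "hasGenotype", "variant:V1")
store.add("variant:V1", "locatedIn", "gene:G1")
store.add("scan:S1", "shows", "finding:F1")

# Which patients carry variant V1?
carriers = [s for s, _, _ in store.match(p="hasGenotype", o="variant:V1")]

# Everything linked to patient 001, across data types.
linked = store.reachable("patient:001")
```

Because every modality is reduced to the same three-part statement, a new data type adds new predicates rather than a new schema, which is what makes the graph approach attractive for highly heterogeneous data.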
As challenging as everything I have described above might be, the most challenging part of any big data effort is finding the right talent. The war for talent in technology is fierce, but it is bare-knuckle brutal when it comes to fighting over a talented data scientist who knows how to fly the cloud. Four or five years ago, when hiring an informatician into a pharmaceutical or biotech company, one typically focused on recruiting an individual with a strong scientific background and a toolbox of analytics and modeling that the scientist had learned over their career. It was unusual for an informatician to join a pharmaceutical or biotech company from another industry. This has greatly changed in our big data era. Today's data scientist has familiarity with artificial intelligence, statistics, machine learning, natural language processing, scripting, data mining, visualization, data processing, mathematics, and a staggering toolbox of other capabilities. It is now the case in the pharmaceutical and biotech industry that it is easier to hire a data scientist with a full toolbox who then learns the necessary biology than to attempt to teach a biologist what has become a very deep, complex, and unique set of data science skills. In fact, it is not uncommon for pharmaceutical and biotech companies to search for data scientists in the finance, oil and gas, and high-tech industries. As is the case with any technology, not the least of which would be big data analytics, success or failure will rest on the caliber of talent tackling your big data challenge.
So, as you assess your next big data challenge: avoid the anti-patterns of pulling the data to your analysis, data myopia, and, above all, the “build it and they will come” syndrome; consider which V is most critical to your challenges; and before addressing any of the above, make sure you have the right talent asking the right questions of big data to ensure your program’s success. And as you consider your next adventure in the cloud: be sure to consider the value of speed and agility, not just the cost; pull back the architectural lens to assess the capabilities of a given cloud platform’s ecosystem of partners; and if you find yourself in a security discussion, play a game of capture the flag.