When it comes to the science of genomics, Amazon’s AWS is by far the platform of choice for most organizations. But, just like in the musical Hamilton, Microsoft Azure is “young, scrappy, and hungry.”
It is driven to provide the tools and managed services that are needed to run genomics at scale, with HPC (high-performance computing) and storage being among the hardest facets of the field to get right.
This blog post will first briefly explain genomics. Then, we’ll dive into what Microsoft Azure has to offer in this field that solves common genomic analysis challenges.
What is Genomics?
You and everyone else you know (unless they are your identical twin) are born with a unique genome. Your genome contains the DNA specific to you, which is used as a blueprint for every kind of cell in your body. A person’s genome does change over time, based on age, environment, diet, and other factors.
DNA is analogous to the assembly code of a program. It’s the lowest level you can go, outside of binary, where instructions are somewhat intelligible. Knowing a person’s DNA allows for targeted medical treatments, tailored specifically to what works best for each specific person.
The hard part is (1) knowing a person’s genome, and (2) doing something meaningful with it.
Why is Knowing the Genome Hard?
Although your DNA can be found all over your body, extracting and sequencing it takes expensive, specialized machines and a lot of time. Prior to NGS (next-generation sequencing), the process of isolating a person's genome could take days.
NGS was a game-changer. It parallelizes the work: a sample's DNA is broken into many short fragments that are sequenced simultaneously, with each read covering a different region of the genome.

In that sense, NGS is very much like MapReduce, though the parallelism comes at a cost. Each region is read multiple times, and every read carries a quality score in the results. Low-quality reads can be discarded, so the final result is the most probable genome.
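To make the "multiple reads plus quality scores" idea concrete, here is a minimal sketch in plain Python (the reads, scores, and quality threshold are made up for illustration) of how overlapping reads at a single position can be reduced to a consensus base, discarding low-quality calls first:

```python
from collections import Counter

def consensus_base(calls, min_quality=20):
    """Pick the most probable base at one genome position.

    `calls` is a list of (base, phred_quality) pairs from independent
    reads covering the same position; calls below `min_quality` are
    discarded before the majority vote.
    """
    kept = [base for base, quality in calls if quality >= min_quality]
    if not kept:
        return "N"  # no reliable evidence at this position
    return Counter(kept).most_common(1)[0][0]

# Three reads agree on "A"; the lone "G" has a poor quality score.
print(consensus_base([("A", 38), ("A", 35), ("G", 12), ("A", 40)]))  # A
```

Real pipelines use far more sophisticated statistical models, but the shape of the computation, many noisy reads reduced to one probable answer, is the same.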
How is a Sequenced Genome Represented?
The three most popular formats for representing a genome are FASTA, FASTQ, and BAM. FASTA is less standardized than the other two (and stores no per-base quality scores), so FASTQ and BAM are the norms in the field.
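To show what FASTQ adds over FASTA, here is a minimal sketch in plain Python (the record is made up) that parses one four-line FASTQ record and decodes its Phred+33 quality string; FASTA, which stores only the sequence, carries no such quality information:

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (id, sequence, qualities).

    Quality characters are Phred+33 encoded: score = ord(char) - 33.
    """
    header, sequence, plus, quality = lines
    assert header.startswith("@") and plus.startswith("+")
    scores = [ord(c) - 33 for c in quality]
    return header[1:], sequence, scores

record = ["@read_001", "GATTACA", "+", "IIIIHH#"]
read_id, seq, quals = parse_fastq_record(record)
print(read_id, seq, quals)  # read_001 GATTACA [40, 40, 40, 40, 39, 39, 2]
```

Note the trailing base scored 2: exactly the kind of low-confidence call a downstream pipeline would discard.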
There are also different levels of specificity for a sequenced genome. There can be a WGS (whole genome sequence), low coverage WGS, an exome (~1.5% of the genome that codes to proteins), and others.
Using a Sequenced Genome
Up to this point, everything discussed has happened on-premises and has been independent of how to make use of the sequenced genome.
Meaningful Sequenced Genome Challenge #1: File Size
The first difficult part of using a sequenced genome is its size. A FASTQ file can be between 50 and 125 GB. A BAM file can range from 5 GB for an exome up to 140 GB for a WGS. Transferring that data from on-premises up to the cloud is a non-trivial cost.
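To put the transfer cost in perspective, a quick back-of-the-envelope calculation (the link speed is an assumption; the file size is the WGS BAM figure above):

```python
def upload_hours(size_gb, link_mbps):
    """Hours to move `size_gb` gigabytes over a `link_mbps` megabit/s link."""
    size_megabits = size_gb * 1024 * 8  # GB -> megabits
    return size_megabits / link_mbps / 3600

# A 140 GB WGS BAM over a dedicated 100 Mbit/s uplink:
print(round(upload_hours(140, 100), 1))  # ~3.2 hours
```

And that is per sample, at full line rate, before any retries; multiply by a hospital's worth of patients and the cost becomes very real.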
Once the sequenced genome gets to where it can be used, the next question is how to make sense of the gigabytes of data it represents. The most common way is to process it against a reference genome. Identifying the deltas between the sample and the reference isolates the signal from the noise, and can reduce hundreds of gigabytes of data down to a few megabytes.
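A toy version of that reference comparison (plain Python, made-up sequences) shows why storing only the deltas is so much smaller than storing the whole genome:

```python
def call_snps(reference, sample):
    """Return (position, ref_base, sample_base) for each single-base difference.

    Assumes the sequences are already aligned and equal length; real pipelines
    also handle insertions, deletions, and the alignment itself, which this
    sketch ignores.
    """
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]

print(call_snps("GATTACA", "GATCACA"))  # [(3, 'T', 'C')]
```

Two seven-base sequences collapse to a single record; scale that up and a multi-gigabyte genome collapses to the few megabytes of positions where it actually differs from the reference.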
Meaningful Sequenced Genome Challenge #2: Generating Variants
The second difficult part is generating these variants as a VCF (Variant Call Format) file. There are a few open-source options, which require an investment to install and operate. Microsoft Azure, though, has made this process very simple with the Genomics Account.
Given a pair of gzipped FASTQ files, you can upload them to a Storage Account and queue a job in a Genomics Account to produce the VCF. Depending upon the day and time, the Genomics Account can process the job in anywhere from 10 minutes to 2 hours.
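The upload step might look something like the sketch below, using the azure-storage-blob SDK (`pip install azure-storage-blob`). The container name, sample naming convention, and connection string are placeholders of my own, not anything the Genomics Account prescribes; the job itself is then queued separately through the Microsoft Genomics client.

```python
def fastq_pair(sample_id):
    """Conventional names for a gzipped paired-end FASTQ upload (assumed layout)."""
    return [f"{sample_id}_1.fastq.gz", f"{sample_id}_2.fastq.gz"]

def upload_pair(connection_string, container, sample_id):
    """Upload both halves of a paired-end sample to a blob container."""
    # Imported here so the naming helper above works without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    for name in fastq_pair(sample_id):
        with open(name, "rb") as data:
            service.get_blob_client(container=container, blob=name).upload_blob(
                data, overwrite=True
            )

# upload_pair("<storage-connection-string>", "inputs", "sample42")
print(fastq_pair("sample42"))  # ['sample42_1.fastq.gz', 'sample42_2.fastq.gz']
```

Keeping the inputs in a plain Storage Account also means the same blobs can later feed other tools without re-uploading hundreds of gigabytes.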
Having Microsoft manage the underlying infrastructure saves time and operational cost, and is a great option for both prototyping as well as a long-term solution.
Meaningful Sequenced Genome Challenge #3: Making Sense of The Data
The third hard part is making sense of the VCF. Although the data has been narrowed down to a handful of megabytes, the variants alone can only tell you so much.
There are also some biases implicit in the generation of the VCF. For instance, the genome of a female Native Pacific Islander will be inherently different from that of a male Scandinavian. Those inherent differences can increase the noise and confuse the signals.
With those warnings in mind, a common way to make sense of a VCF is to load it into software built for HPC. Hadoop, originally the most popular way of doing this, has given way to other offerings such as Spark.

The primary reason is architectural. Hadoop's MapReduce minimizes network transfer by moving the analysis job close to the "disk" holding the data, but it also writes intermediate results back to disk between stages. That makes it inherently slower than engines like Spark, which keep the working data in memory.
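To give a flavor of that processing, here is a minimal parser for one VCF data line in plain Python (the record is made up, and only the fixed columns are handled); in a Spark job, a function like this would run over every data line of the file in parallel:

```python
def parse_vcf_line(line):
    """Split one VCF data line into its fixed fields.

    VCF data lines are tab-separated: CHROM, POS, ID, REF, ALT, QUAL,
    FILTER, INFO. Header lines start with '#' and are skipped upstream.
    """
    chrom, pos, vid, ref, alt, qual, filt, info = line.split("\t")[:8]
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt, "qual": float(qual), "filter": filt}

line = "chr1\t10177\trs367896724\tA\tAC\t100.0\tPASS\tAF=0.425"
print(parse_vcf_line(line)["alt"])  # AC
```

Once each line is structured like this, the cluster can filter, join against annotation databases, and aggregate across thousands of samples.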
Setting up and running a Hadoop or Spark cluster at a scale needed by a large health system is the stuff of nightmares. Of course, there are specialists that can be paid a pretty penny to do this, but for most organizations, the cost is too prohibitive.
This is where Microsoft Azure's HDInsight comes into play. With a few clicks, Microsoft will set up and operate a Hadoop, Spark, HBase, or one of many other cluster offerings. In my experience, the time between provisioning and usefulness is less than an hour.
Tip: the cost of an HPC cluster is non-trivial. Even at a small scale, its usefulness comes at a price. If you choose to play around with this software, be mindful of the billing involved.
Storing a Sequenced Genome
Both the Genomics Account and HDInsight rely on the Microsoft Azure Storage Account offering.
In Azure, Storage Accounts offer different access tiers: Hot, Cool, and Archive:
- The Hot tier costs the most to store data in, but is optimized for frequent access.
- The Cool tier is cheaper to store data in, but charges more per access; it suits data that is touched infrequently.
- The Archive tier offers no immediate access to data (blobs must first be rehydrated, which can take hours) but is by far the cheapest.
With the amount of genomic data that accumulates over time, keeping everything in the Hot tier is prohibitively expensive. I suggest moving data to the Cool tier when it is not immediately needed. Later, when deemed fit, the data can be moved to Archive.
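One way to automate that movement is sketched below with the azure-storage-blob SDK; the age thresholds and the policy itself are assumptions of mine for illustration, not Azure defaults.

```python
def choose_tier(days_since_access):
    """Pick a storage tier from how recently the blob was touched (assumed policy)."""
    if days_since_access < 30:
        return "Hot"
    if days_since_access < 180:
        return "Cool"
    return "Archive"

def retier_blob(connection_string, container, blob_name, days_since_access):
    """Move one blob to the tier the policy above selects."""
    # Imported here so the policy function stays usable without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    tier = choose_tier(days_since_access)
    service = BlobServiceClient.from_connection_string(connection_string)
    service.get_blob_client(container=container, blob=blob_name).set_standard_blob_tier(tier)

print(choose_tier(7), choose_tier(90), choose_tier(400))  # Hot Cool Archive
```

Azure Blob storage also supports declarative lifecycle management rules that accomplish the same thing without custom code; the script form is handy when the policy depends on domain knowledge (say, a sample's place in a study) rather than age alone.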
There are many methods for doing this from manually scripting it to using tools like Microsoft Azure Data Factory. However you decide to do it, it is a necessary part of doing genomic analysis at scale.
Differences from Amazon AWS
The Microsoft Azure offerings discussed in this blog are the cornerstone of genomic analysis in the cloud. Additionally, strong parallels exist with different services provided by AWS. However, there are subtle differences.
For example, AWS offers more storage tiers. On the flip side, AWS does not have as many managed services as Azure, and getting up and running on it has a slightly higher upfront cost. Even if what gets billed against cost centers is comparable, in the long run AWS can still end up being more expensive in terms of human capital.
In total, Microsoft Azure has some compelling offerings to use with genomic analysis. The managed services allow the team in charge of performing the analysis to spend less time on upkeep and more time on enabling clinicians to make use of the analysis.
With a similar cost to Amazon AWS and less time needed to operate, it is a serious contender for genomic analysis at scale.