UC Santa Cruz Taps Microsoft’s Azure to Handle Large-Scale Genomics Analysis Projects

NEW YORK (GenomeWeb) – The Genomics Institute at the University of California, Santa Cruz is collaborating with Microsoft’s research division to use Azure, the company’s cloud computing infrastructure, to analyze data from a number of ongoing genomics projects aimed at effectively diagnosing and treating cancer and other diseases.

The list of projects that will benefit from the partnership includes the recently funded California Kids Cancer Comparison, which aims to help clinicians identify more effective therapies for pediatric cancer patients whose tumors relapse, resist standard therapies, or have few or no treatment options. As one of the so-called demonstration projects funded by the California Initiative to Advance Precision Medicine, the project received half of the nearly $2.4 million in funding earmarked for the initiative.

UCSC researchers will also use Azure to support their participation in the International Cancer Genome Consortium (ICGC), which aims to create a comprehensive catalog of genetic abnormalities in 50 different tumor types and sub-types. It will also support the institute’s ongoing efforts to develop a comprehensive map of human genetic variation. This particular project is supported by $3 million in total funding from two separate awards from the Simons Foundation and the WM Keck Foundation.

UCSC chose to work with Microsoft Research because of the scale of Azure compute resources that the company offered as part of the terms of the partnership, Benedict Paten, assistant director of the Center for Big Data in Translational Genomics at UCSC, told GenomeWeb. Researchers at the university regularly use infrastructure from competing providers Google and Amazon and collaborate with both companies on various projects, including the Global Alliance for Genomics and Health’s Data Working Group, which is co-chaired and co-founded by David Haussler, a UCSC professor of biomolecular engineering. Paten co-chairs the reference variation task team, one of the subgroups of the Data Working Group.

However, Microsoft offered a much larger set of resources than other cloud providers have offered for a reduced cost, Paten said. The partners are not disclosing exact numbers in either case. Access to these resources not only frees UCSC to take on more ambitious projects, it also enables researchers to complete computations at a much faster clip than current resources allow, he said.

UCSC does own a 4,000-core cluster that comes with several petabytes of storage, but even that is not sufficient to complete, in a reasonable timeframe, the kinds of computational analyses that a single large-scale genomics project can require. As an example, Paten told GenomeWeb that it took researchers the better part of 18 months to analyze roughly three petabytes of data produced by the ICGC’s pan-cancer whole-genome analysis project using the UCSC cluster along with clusters housed at other institutions.

“That’s way too slow,” Paten said. Commercial clouds provide “the scale [we need] to get things done really quickly without having to own or manage resources at that scale,” he added. “We get the jobs that we need done, done very quickly and then we just relinquish those resources when we are finished.”

The partnership is also an opportunity for Microsoft to broaden its reach in the genomics domain, where firms like Amazon and Google have already staked their claims, according to Paten. He noted that besides providing large-scale compute resources, Azure also offers features — such as the Azure Data Lake — that are not available in competing clouds.

“They recognize that genomics is one of the key growth areas for them,” he said. “Because [UCSC] has big data, it makes sense for them to want to partner with us in doing these kinds of problems… and to get genomic workflows ported to their systems.”

Researchers at the Genomics Institute have begun moving their workflows for sequence alignment, variant calling, imputation, and other tasks from the UCSC cluster onto Azure. They’re also adapting their tools — which use a distributed file system — to work with the object store-based file system used in the Azure environment, he said. They also plan to leverage machine learning technologies and other Microsoft-developed tools and features that are available within Azure.

“I think it will take us some time to scale all of our compute across [Azure], but I’m pretty confident that over the next couple of years we’ll be in a position to take any of our workflows or web-services or databases and pretty much at a flick of a switch use Azure instead of our existing resources,” Paten said.

Neil Jordan, Microsoft’s general manager of health, told GenomeWeb in an emailed statement that Microsoft is contributing technology development and bioinformatics expertise to the partnership. He also said that the company plans to offer a broad set of solutions that will support research activities in academic, government, and health organizations moving forward. In the past, Azure has been used for drug discovery and development-related activities in firms like Molplex and TeraDiscoveries.

Originally published by GenomeWeb