Genetics and metabolic syndrome
“My research has involved investigating signatures of selection and their association with metabolic disease,” says University of Otago PhD student Murray Cadzow. “By using NeSI, the length of time to complete this analysis was reduced by weeks. Initially it seemed daunting thinking about using a national compute resource with a limited technical knowledge but there was plenty of training and resources available to guide me through.”
Metabolic disease, also known as metabolic syndrome, is a significant feature of many chronic diseases. It is a dysfunction of the body’s ability to process and store energy, characterised by obesity, high blood pressure, diabetes and other conditions.
“The impact of this work on metabolic syndrome would give more information as to what variants in genes may have been beneficial or detrimental to a population along its history. It should give us a hint about which variants may have been selected for or against. These days, with the change in environment, these (previously beneficial) variants may now contribute to disease.”
Murray is based in Dunedin, but has been able to access high performance computing platforms in Auckland via NeSI and the REANNZ Network.
The research programme seeks to provide greater insight into genetic predisposition to disease. Murray and his colleagues have been hunting for “signatures of selection”, that is, areas of the genome that exhibit features of having been under selective pressure. These signatures are associated with phenotypes, which can be population specific. Phenotypes are the physical characteristics of an organism, and can be thought of as the end result of genetics plus environment.
“As part of this, I was using publicly available population data from the 1000 Genomes Project in conjunction with data from Pacific populations. For this project we created a bioinformatics workflow to output different statistics of selection.”
The availability of open science data, compute resources of national scale and excellent technical support were essential to supplement the domain knowledge that Murray and his collaborators had developed through their studies. Without this synergy, the project would have progressed far more slowly, if it could have progressed at all.
New Zealand possesses a high level of bioinformatics expertise that is supported by the collaborative and complementary efforts of its research institutions, the Virtual Institute of Statistical Genetics, NeSI, NZGL and REANNZ. Bioinformatics will be key to New Zealand’s continued progress in primary industries. As Murray and his collaborators demonstrate, it may also be a strong factor in enabling our society to combat chronic disease.
The focus on increasing efficiency and research productivity is demonstrated by the first peer-reviewed paper that has emerged from the work. The workflow the team has developed has been made available to others via an article in Frontiers in Genetics (Cadzow et al 2014) and as a software repository.
Murray and his teammates have invested a lot of time in making the workflow that they’ve developed available to others as an easy-to-use toolkit. Previous work from Pybus et al (2014) enabled the community to examine signatures of selection in a small, selected segment of the human genome. “For researchers wishing to investigate selection in other human cohorts or populations – or other organisms – a non-trivial amount of data manipulation and subsequent computation is required in order to extract this type of information from the available data.”
The toolkit integrates many tools, each written in its own programming language and using its own file formats, into an easy-to-use pipeline. Each tool also makes its own trade-off between accuracy and speed. The pipeline makes it much simpler for researchers to detect signatures of selection across a whole genome. This computationally intensive work can be distributed across many cores.
Part of the research involves calculating the iHS (integrated haplotype score) values for all areas containing Single Nucleotide Polymorphisms (SNPs) within a genome. From Wikipedia: “A Single Nucleotide Polymorphism (SNP, pronounced snip; plural snips) is a DNA sequence variation occurring commonly within a population (e.g. 1%) in which a single nucleotide — A, T, C or G — in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes. For example, two sequenced DNA fragments from different individuals, AAGCCTA to AAGCTTA, contain a difference in a single nucleotide.”
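The two fragments in the quoted example can be compared directly. The short Python sketch below is purely illustrative (the function name is hypothetical and it is not part of the team's workflow); it reports the positions at which two aligned fragments of equal length differ.

```python
def differing_positions(seq_a, seq_b):
    """Return the 0-based positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two fragments from the quoted example.
fragment_1 = "AAGCCTA"
fragment_2 = "AAGCTTA"

for pos in differing_positions(fragment_1, fragment_2):
    # Report each single-nucleotide difference as position: base -> base.
    print(f"position {pos}: {fragment_1[pos]} -> {fragment_2[pos]}")
```

Real SNP calling works on aligned reads from many individuals and takes allele frequency into account, so this comparison only shows what a single-nucleotide difference looks like.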
Murray explains the significance of the iHS test and the technical challenges of running the analysis across the whole genome. “iHS is a statistical test that is based on extended haplotype homozygosity. It helps identify regions surrounding a SNP that are conserved more than we would expect, giving us a signature of natural selection. The program is trivially parallelisable, but we have crudely estimated that genome-wide iHS for multiple populations will take approximately 320,000 CPU hours.” This workload would have been extremely difficult to handle locally.
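To give a feel for the quantity that iHS builds on, the Python sketch below computes extended haplotype homozygosity (EHH) for a toy set of haplotypes. It follows the commonly used definition of EHH as the probability that two randomly chosen haplotypes carrying the same core allele are identical over the interval from the core SNP to a given marker. The function name and the toy data are assumptions made for illustration; this is not the team's implementation, and a full iHS calculation additionally integrates EHH over distance for each allele and standardises the result.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, end):
    """EHH between marker indices `core` and `end` (inclusive).

    `haplotypes` is a list of equal-length strings of '0'/'1' alleles,
    all assumed to carry the same allele at index `core`.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    lo, hi = sorted((core, end))
    # Group haplotypes that are identical across the whole interval.
    groups = Counter(h[lo:hi + 1] for h in haplotypes)
    # Count pairs of identical haplotypes, out of all possible pairs.
    identical_pairs = sum(comb(count, 2) for count in groups.values())
    return identical_pairs / comb(n, 2)


# Toy data: four haplotypes that all carry allele '1' at the core (index 0).
toy = ["1100", "1100", "1101", "1011"]
for marker in range(len(toy[0])):
    print(marker, round(ehh(toy, core=0, end=marker), 3))
```

EHH starts at 1 at the core SNP and decays as the interval widens. A region where it decays unusually slowly is the kind of conserved stretch Murray describes.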
“At the start of the project, we also expected to develop and test improvements using MPI and C, and possibly GPU to perform the calculations.” The NeSI HPC platforms give researchers the flexibility to experiment and find the combination of algorithm and hardware that performs best.
Access to NeSI also enabled Murray to iterate quickly on new approaches described in the literature. In his case, he was interested in exploring the feasibility of incorporating a recently released package into his analysis. “CMS is an extended derivative of iHS combining other tests into a single test statistic. The software has only recently been released and may require some work to get running, but we again estimate roughly that it will require a further 320,000 CPU hours.” NeSI was able to accommodate this request.
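To put the two quoted estimates in perspective, the Python sketch below converts CPU hours into ideal wall-clock time. It assumes perfect scaling across cores, which is plausible only because the iHS workload is described as trivially parallelisable; whether CMS scales the same way is an assumption. The core counts are illustrative and are not NeSI allocation figures.

```python
CPU_HOURS_IHS = 320_000  # genome-wide iHS estimate quoted in the article
CPU_HOURS_CMS = 320_000  # further estimate quoted for CMS


def ideal_wall_clock_days(cpu_hours, cores):
    """Wall-clock days if the work divides perfectly across `cores`."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cpu_hours / cores / 24


total_cpu_hours = CPU_HOURS_IHS + CPU_HOURS_CMS

# Hypothetical core counts, chosen only to show the scale of the problem.
for cores in (1, 100, 1000):
    days = ideal_wall_clock_days(total_cpu_hours, cores)
    print(f"{cores:>5} cores: {days:,.1f} days")
```

Scheduling overheads, input and output, and queue time would all add to these figures in practice.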
As Murray’s PhD continues, we are sure that we will continue to see the results of this hard work. NeSI’s high performance computing platforms, data services and computational support team will continue to be at his disposal as he progresses.
References
Murray Cadzow, James Boocock, Hoang T Nguyen, Phillip Wilcox, Tony R Merriman and Mik Black (2014) “A bioinformatics workflow for detecting signatures of selection in genomic data” in Frontiers in Genetics, doi:10.3389/fgene.2014.00293
Marc Pybus, GM Dall'Olio, P Luisi, M Uzkudun, A Carreño-Torres, P Pavlidis, et al (2014) “1000 Genomes Selection Browser 1.0: a genome browser dedicated to signatures of natural selection in modern humans” in Nucleic Acids Research, doi:10.1093/nar/gkt1188