Gathering insights to better understand native species
The case study below shares some of the technical details and outcomes of the scientific and HPC-focused programming support provided to a research project through NeSI’s Consultancy Service.
This service supports projects across a range of domains, with the aim of lifting researchers’ productivity, efficiency, and skills in research computing. If you would like to learn more or apply for Consultancy support, visit our Consultancy Service page.
Research background
Jessie Prebble is a plant systematist working at Manaaki Whenua - Landcare Research and studying the spatial delimitation of New Zealand native forget-me-nots (Myosotis). As noted in her 2014 publication 'Native New Zealand forget-me-nots (Myosotis, Boraginaceae) comprise a Pleistocene species radiation with very low genetic divergence': "The New Zealand forget-me-nots comprise a lineage of over 40 closely related but morphologically and ecologically diverse species whose evolutionary history and taxonomy are unclear. Myosotis is a high priority for systematic research in New Zealand because a high proportion of these species are threatened, and many have restricted geographic ranges and occupy very specific habitats."
Jessie uses the ENMTools R package (https://github.com/danlwarren/ENMTools) to compare modeled niches for different forget-me-not species and to assess how different the flowers are from each other. In the process, a square matrix comparing the five species’ niches is computed.
Project challenges
Depending on the number of replications (~100 or more), each one-to-one comparison takes 3+ hours to complete, and memory limits are often reached. With up to 50 tests to run one after another, the full computation can take 150+ hours.
What was done
The code has been split into an application programming interface (API) and a driver for maximum flexibility. The driver takes command line arguments that select the two species for which pair statistics are computed. This allows multiple pair statistics to be computed in parallel using, e.g., job arrays.
The code now reads a configuration file, confi.yml, which makes it easy to set and store the parameters of a run. The parameters are the number of replications, the input directory containing the data, and the output directory where data and graphics files will be stored. Additional input parameters can easily be added.
The code now uses logging, which allows one to follow the progress of the computation and identify issues if they arise.
Jobs can run concurrently on Mahuika, which reduces the time to solution.
Workflow management has been added in the form of a Snakefile. With this approach, missing output files trigger new computational tasks on demand. This saves computational resources and improves the researcher’s productivity.
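The driver and job-array idea described above can be sketched as follows. This is only an illustration, written in Python rather than the project's R-based code. The species labels, the function names and the `--task-id` and `--replications` options are assumptions made for the example, not the project's actual interface. Each task of a job array would pass its own index, and the driver maps that index to one species pair.

```python
import argparse
import itertools
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pair_driver")

# Hypothetical species labels; the real project uses its own Myosotis data.
SPECIES = ["sp_a", "sp_b", "sp_c", "sp_d", "sp_e"]


def all_pairs(species):
    """Return every unordered pair of distinct species, in a stable order."""
    return list(itertools.combinations(species, 2))


def pair_for_task(species, task_id):
    """Map a zero-based job-array index to one species pair."""
    pairs = all_pairs(species)
    if not 0 <= task_id < len(pairs):
        raise IndexError(f"task_id {task_id} outside 0..{len(pairs) - 1}")
    return pairs[task_id]


def parse_args(argv=None):
    """Parse the (hypothetical) command line options of the driver."""
    parser = argparse.ArgumentParser(
        description="Compute statistics for one species pair")
    parser.add_argument("--task-id", type=int, required=True,
                        help="zero-based job-array index")
    parser.add_argument("--replications", type=int, default=100,
                        help="number of replications for this comparison")
    return parser.parse_args(argv)


def main(argv=None):
    """Select the pair for this task and log what would be computed."""
    args = parse_args(argv)
    first, second = pair_for_task(SPECIES, args.task_id)
    log.info("Comparing %s and %s with %d replications",
             first, second, args.replications)
    # Placeholder: a real driver would call the niche-comparison API here.
    return first, second
```

Calling `main(["--task-id", "0"])` selects the first pair. Because every task works on a different pair and shares no state with the others, the tasks can run concurrently, and the log lines show which comparison each task is working on.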
Main outcomes
Up to 20x wall clock speedup. A full suite that takes 100 hours to run can now complete in 5 hours.
More flexibility: adding a new species does not require a complete rerun of the code. This saves time and computing resources.
Introduced good software engineering practices into the project (logging, a configuration file, pull requests and workflow management).
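The reason a new species does not force a complete rerun can be illustrated with a small sketch. It assumes, hypothetically, one result file per species pair and an invented file naming scheme. The workflow manager used in the project does considerably more than this; the Python below only shows the idea that work is scheduled for missing output files and nothing else.

```python
import itertools
import tempfile
from pathlib import Path


def expected_outputs(species, output_dir):
    """Map each unordered species pair to the result file it should produce."""
    out = Path(output_dir)
    return {
        (a, b): out / f"{a}__{b}.csv"  # hypothetical naming scheme
        for a, b in itertools.combinations(species, 2)
    }


def pending_pairs(species, output_dir):
    """Return the pairs whose result file does not exist yet."""
    return [pair
            for pair, path in expected_outputs(species, output_dir).items()
            if not path.exists()]


# Demonstration with placeholder labels: results exist for three species,
# then a fourth species is added.
with tempfile.TemporaryDirectory() as tmp:
    for path in expected_outputs(["a", "b", "c"], tmp).values():
        path.touch()  # pretend these comparisons have already been computed
    todo = pending_pairs(["a", "b", "c", "d"], tmp)

print(todo)  # [('a', 'd'), ('b', 'd'), ('c', 'd')]
```

Only the three comparisons involving the new species are reported as pending; the three existing results are left untouched.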
Do you have a research project that could benefit from working with NeSI Research Software Engineers or our Data Science Engineer? Learn more about what kind of support they can offer and get in touch by emailing support@nesi.org.nz.