A Computational Biology Research Workshop was hosted at Stellenbosch Computer Science from 3 to 18 June. The objective was to introduce the participating undergraduate students to the research process by tackling an unsolved problem in computational biology. Taking part were five Stellenbosch and three UCT second- and third-year students from a variety of academic backgrounds, including computer science, engineering, mathematics, financial mathematics, pure and applied statistics, and economics.

The Group (from left to right) – Seated: Lise du Buisson, Gerdus Benade and Sasha Moola. Not pictured: Dan Kaliski. Standing: Tristan Hands, Thomas Weighill, Jan Buys, Robert Ketteringham and – the organizers – Ben Murrell (Stellenbosch, MRC) and Konrad Scheffler (Stellenbosch)
The first day consisted of a series of intensive crash courses introducing the students to biology and statistics, with the focus quickly narrowing to computational models of molecular evolution, the framework in which the problem at hand was posed. Day two was a hands-on introduction to the HyPhy (“Hypothesis testing using Phylogenies”) software tool and its scripting language, HBL, which is used to code models of molecular evolution. At the end of day two, the problem was explained, along with the planned approach to solving it.
The problem:
To model the evolution of amino acid sequences, a large number of amino acid exchange rates need to be estimated from genetic data. Such parameter-rich models require an extremely large amount of data to estimate reliably. Because of this, biologists resort to using models estimated from large datasets, which may not fit their particular dataset very well. The task for this workshop was to find a way of appropriately constraining the models so that fewer parameters were needed. These constrained models could then be trained on smaller datasets.
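To give a sense of scale, the short Python sketch below counts the exchange rates in a full model. It assumes the standard 20 amino acids and a time-reversible model in which the exchange rate between two amino acids is the same in both directions. Neither assumption is stated in this report, and the number of parts used for the constrained model is a purely hypothetical value chosen for illustration.

```python
def exchangeability_count(n_states: int) -> int:
    """Number of unordered pairs of states, i.e. symmetric exchange rates."""
    return n_states * (n_states - 1) // 2


N_AMINO_ACIDS = 20       # assumption: the standard amino acid alphabet
HYPOTHETICAL_PARTS = 5   # illustrative only; not a figure from the workshop

# Full model: one rate per unordered pair of amino acids.
full_model_rates = exchangeability_count(N_AMINO_ACIDS)

# Constrained model: one mixing weight per part instead of one rate per pair.
constrained_model_weights = HYPOTHETICAL_PARTS

print(f"full model: {full_model_rates} exchange rates per dataset")
print(f"constrained model: {constrained_model_weights} mixing weights per dataset")
```

Under these assumptions a full model has 190 exchange rates, which is why a small alignment rarely carries enough information to estimate one from scratch.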
The proposed solution:
Phase 1 of the solution was to take a large number of datasets from a database called Pandit and learn the full model for each. This would serve to characterize the space of useful models. Phase 2 involved using Matlab to implement a dimensionality reduction technique called Non-Negative Matrix Factorisation, which is usually applied to decompose a dataset into a collection of parts from which the original dataset can be approximately reconstructed. In this project, however, the innovation was to apply dimensionality reduction to the large number of models that had been estimated in phase 1, decomposing the models themselves into parts. Phase 3 involved reconstructing new models from the parts obtained in phase 2, using HyPhy to find the optimal reconstruction for any given dataset. This involves far fewer parameters than estimating the full model from scratch.
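The idea behind phase 2 can be sketched in a few lines of Python. This is not the group's Matlab code. It is a minimal illustration using the well-known multiplicative update rules for Non-Negative Matrix Factorisation, written with the standard library only. The toy matrix `models` (one row per estimated model, one column per rate), the two `parts`, and all function names are made up for the example.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Sum of squared differences between v and its reconstruction w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=500, seed=0, eps=1e-12):
    """Factorise non-negative v (n x m) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so that no entry is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update the parts h while holding the weights w fixed.
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update the weights w while holding the parts h fixed.
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Toy data: six "models" of five rates each, all built from two hidden parts.
parts = [[1.0, 0.2, 0.0, 0.5, 0.1],
         [0.0, 0.8, 1.0, 0.1, 0.4]]
mixing = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5],
          [0.2, 0.8], [0.9, 0.3], [0.4, 0.6]]
models = matmul(mixing, parts)

w, h = nmf(models, k=2)
print("reconstruction error:", frobenius_error(models, w, h))
```

Each row of `h` plays the role of a part, and each row of `w` holds the non-negative weights that rebuild one model from those parts. In phase 3 terms, fitting a new dataset then means estimating only the handful of weights for that dataset, which the workshop did with HyPhy rather than with code like this.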
The next two weeks were spent on the various phases of this solution, with small groups of participants working on subproblems within the project. Besides the three phases described above, students were allocated to reading the related literature, and to submitting computationally intensive processes to a Beowulf computing cluster, kindly offered to the group by the HyPhy team based at the University of California, San Diego.
Various refinements were added as the project progressed, such as the introduction of a recently invented algorithm for Weighted Non-Negative Matrix Factorisation (a more sophisticated version of the aforementioned method, which allowed the incorporation of uncertainty in the parameter estimates from phase 1). When compared to existing models, the preliminary results looked good, but the group had to wait until an hour before the end of the last day – when the last processing job finished on the cluster – to confirm that the approach had worked. The organizers (Computer Science MSc student Ben Murrell and Prof Konrad Scheffler) are currently drafting a manuscript for journal publication that they plan to co-author with the students.
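The report does not say which weighted algorithm the group used, so the following Python sketch should be read only as an illustration of the general idea. It uses one common weighted variant of the multiplicative updates, in which each entry of the data matrix carries a non-negative confidence weight and poorly estimated entries contribute less to the fit. The toy `estimates`, the `confidence` matrix, and the choice of a small weight for the unreliable entry are all assumptions made for the example.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def hadamard(a, b):
    """Element-wise product of two matrices of the same shape."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def weighted_error(v, weights, w, h):
    """Sum over entries of weight * (v - w*h)^2."""
    wh = matmul(w, h)
    return sum(weights[i][j] * (v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def weighted_nmf(v, weights, k, iterations=500, seed=0, eps=1e-12):
    """Factorise v into w*h, trusting each entry according to its weight."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    weighted_v = hadamard(weights, v)
    for _ in range(iterations):
        # Update the parts h; weights mask both the data and the reconstruction.
        wt = transpose(w)
        num = matmul(wt, weighted_v)
        den = matmul(wt, hadamard(weights, matmul(w, h)))
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update the per-model weights w in the same way.
        ht = transpose(h)
        num = matmul(weighted_v, ht)
        den = matmul(hadamard(weights, matmul(w, h)), ht)
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Toy data: five "models" of four rates, built from two hidden parts.
parts = [[1.0, 0.2, 0.1, 0.5],
         [0.1, 0.8, 1.0, 0.2]]
mixing = [[1.0, 0.1], [0.2, 1.0], [0.5, 0.5], [0.3, 0.9], [0.8, 0.4]]
estimates = matmul(mixing, parts)

# Pretend one rate was estimated very poorly, and give it little weight.
estimates[2][1] = 5.0
confidence = [[1.0] * 4 for _ in range(5)]
confidence[2][1] = 0.01

w, h = weighted_nmf(estimates, confidence, k=2)
print("weighted reconstruction error:", weighted_error(estimates, confidence, w, h))
```

With all weights set to 1 this reduces to the ordinary factorisation. Lowering the weight of an uncertain estimate lets the parts be shaped mainly by the rates that the phase 1 data actually pinned down.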
Many thanks to Sergei Kosakovsky Pond and Wayne Delport at the University of California, San Diego, for the use of the computing cluster, to the Medical Research Council for funding Ben Murrell, and to Sasha Moola (2nd year student in Mathematics and Statistics at UCT) for drafting this report.
Undergrads produce results in Computational Biology Research Workshop