Computer Science 771: Honours Project

The year project

The Honours Project in Computer Science is a 32-credit year module (RW771), and is a critical part of our honours degree. The project consists of a substantial piece of independent research or software engineering (programming). The student must demonstrate skills appropriate to the type of project, including literature survey, the software engineering process, testing and evaluation, and documenting and presenting results. A nontrivial or significant amount of code is expected to be written.

Each project will be supervised by at least one CS academic staff member. Each project proposal below has an alphanumeric identifier that starts with a sequence of letters indicating the supervisor.

Identifier Supervisor
AE Andries Engelbrecht
BF Bernd Fischer
BT Bill Tucker
BVDM Brink van der Merwe
CPI Cornelia Inggs
LVZ Lynette van Zijl
MD Marcel Dunaiski
MN Mkhuseli Ngxande
SK Steve Kroon
TG Trienko Grobler
WHK Willem Bester

Proposed projects 2022

AE1: Multi-objective Control Parameter Tuning

Just about all optimization and machine learning algorithms have control parameters. The drawback of these control parameters is that the performance of the algorithm is very sensitive to the control parameter values, and the best values are usually highly problem dependent. The consequence is that, if one wants to apply an optimized instance of an algorithm, tuning of these control parameters is required. A number of different approaches to control parameter tuning can be found in the literature, including factorial design, F-Race, and iterated F-Race, amongst others. However, these approaches tune control parameters with respect to one performance criterion, usually either accuracy or computational effort.

This study will develop a multi-objective approach to control parameter tuning, considering a number of performance criteria, including accuracy, efficiency, robustness, and stability. A very simple approach will be to evaluate performance for a range of parameter values with respect to each of the performance criteria, then to rank performance for the different control parameter values per performance criterion, and finally to compute an average rank over all of the performance criteria.
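
As a rough illustration of this averaged-ranking step, the following minimal Python sketch ranks a few hypothetical parameter settings per criterion and averages the ranks; all numbers and criteria names are invented for illustration.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical results: rows = candidate parameter settings, columns =
# performance criteria (all oriented so that lower values are better).
scores = np.array([
    [0.12, 340, 0.05],   # setting A: error, evaluations, std. deviation
    [0.10, 520, 0.09],   # setting B
    [0.15, 300, 0.04],   # setting C
])

# Rank the settings per criterion (1 = best), then average over criteria.
ranks = np.apply_along_axis(rankdata, 0, scores)
print(ranks.mean(axis=1))  # lowest average rank = most balanced setting
```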

A second approach will use non-dominance-based ranking of algorithms over the performance measures under evaluation.

An alternative will be to adapt an approach such as F-Race to drill down to the best regions of the control parameter space, using the non-dominance relation with respect to all of the performance criteria.
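
A minimal sketch of the non-dominance relation underlying the last two approaches, assuming all criteria are to be minimised (the candidate settings and scores are invented):

```python
def dominates(a, b):
    """True if candidate a Pareto-dominates candidate b:
    no worse on every criterion, strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Hypothetical (accuracy error, evaluations) pairs per parameter setting.
candidates = {"A": (0.12, 340), "B": (0.10, 520), "C": (0.15, 600)}

# Non-dominated settings form the first Pareto front.
front = [n for n, c in candidates.items()
         if not any(dominates(o, c) for m, o in candidates.items() if m != n)]
print(front)  # ['A', 'B']: C is dominated by A on both criteria
```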

In addition, it is also required to visualize how the performance of an algorithm changes for different values of the control parameters, over different axes, with one axis per performance measure.

This is a research project, with a significant amount of empirical analysis and coding. The project has very good potential to result in a publication. The control parameter tuning approach needs to interface with CIlib, an open-source library of Computational Intelligence algorithms developed in Scala. However, it should be general enough to also interface with any other algorithm or algorithm framework.

Prerequisite: Statistics, and students are advised to do Artificial Intelligence 791.

AE2: Particle Swarm Optimization: Which Variations Are State-of-the-art?

Many particle swarm optimization (PSO) variants have been developed, each with varying degrees of performance on different problem landscapes. With so many variants, the question arises as to which PSO algorithm(s) can be considered the state of the art. This question becomes increasingly important when a new variant is introduced, or when other optimization algorithms are compared with PSO. In such cases, it is necessary for these new variants and algorithms to be compared with the state of the art. Various opinions are raised in the literature as to which PSO algorithms are best. However, these opinions are mostly based on very limited empirical analyses on small sets of benchmark functions, and on comparisons with very few other PSO variants.

This study will initiate a process to rank PSO algorithms with reference to the performance of a base (control) algorithm, on a benchmark suite of problems that provides good coverage of all fitness landscape characteristics. As a first step, a statistically sound ranking approach will be proposed, and a set of performance measures will be defined. Note that multiple performance measures will be considered. This will require ranking on individual measures, and a multi-objective ranking over all of the measures. The second step will identify benchmark problems: the focus will be on single-objective, static, boundary-constrained, continuous-valued optimization problems, and the outcomes of a recent fitness landscape analysis study of such benchmark problems will be used to identify problems ranging in levels of difficulty. The third step will then be to start with the most frequently referenced PSO algorithms, to empirically analyze their performance, and to provide a ranking of these algorithms. As an outcome, a results repository will be developed and populated with performance results.
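
Purely as an illustration of what a statistically sound ranking step could involve, the sketch below applies a Friedman test over hypothetical per-benchmark results and reports mean ranks; the actual choice of tests and post-hoc analysis is part of the project.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical best-fitness results: rows = benchmark problems,
# columns = PSO variants (lower is better).
results = np.array([
    [1.2, 0.9, 1.5],
    [3.4, 3.1, 2.8],
    [0.7, 0.8, 1.1],
    [2.2, 1.9, 2.5],
])

# Friedman test: do the variants differ significantly across problems?
stat, p = friedmanchisquare(*results.T)
print(f"Friedman statistic = {stat:.3f}, p = {p:.3f}")

# Mean rank per variant over all problems (1 = best); pairwise conclusions
# would additionally need a post-hoc test such as Nemenyi.
ranks = np.apply_along_axis(rankdata, 1, results)
print(ranks.mean(axis=0))
```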

This is a research project, with some software engineering requirements in the development of the ranking system and the results repository. Code will be developed as part of a Python computational intelligence library.

Prerequisite: Statistics, and students are advised to do Artificial Intelligence 791.

AE3: Python Computational Intelligence library

There exists a variety of computational intelligence libraries. Some are extremely popular, such as TensorFlow and PyTorch. However, the ecosystem for nature-inspired optimization algorithms is fragmented and underdeveloped. For example, a library may support only multi-objective optimization but not constrained optimization, while another library supports the opposite. Some libraries only support a specific computational intelligence paradigm, such as genetic algorithms or particle swarm optimization. Some libraries are not easily extendable, which makes integration with other code difficult. Libraries may also lack documentation or rigorous testing, or may not have received updates in months or years. For these reasons, using these libraries for research can be difficult.

The objective of this Honours project is to develop a new computational intelligence library guided by high-level but practical software design principles. The library will be developed in Python with a focus on nature-inspired optimization algorithms and the tools related to those algorithms. The library will be very generic to allow for single-solution problems, multi-solution problems, single objectives, multi- and many-objectives, static objective functions, dynamically changing search landscapes, boundary constrained and constrained problems, single population and multi-population algorithms, and hyper-heuristic algorithms. Ideas and code for the library can be adapted from the existing Scala-based CIlib project.
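
By way of illustration, the kind of generic core abstraction this implies might look something like the following hypothetical sketch; the names and structure are placeholders, not a committed design.

```python
from abc import ABC, abstractmethod
from typing import Sequence

class Problem(ABC):
    """A problem exposes evaluation; multi-objective problems return
    one value per objective, so single objectives are a special case."""
    @abstractmethod
    def evaluate(self, solution: Sequence[float]) -> Sequence[float]: ...

class Algorithm(ABC):
    """An algorithm advances a population one iteration at a time, so that
    dynamic landscapes and multi-population schemes can be composed."""
    @abstractmethod
    def step(self, problem: Problem) -> None: ...

    def run(self, problem: Problem, iterations: int) -> None:
        for _ in range(iterations):
            self.step(problem)
```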

A very important consideration in the design of the library will be interaction with future planned empirical analysis libraries, results repositories, and fitness landscape analysis libraries.

This work will also include an analysis of existing libraries, to identify shortcomings that the proposed library aims to address. The success of this project will be determined by the usability of the proposed library and feedback from fellow students and supervisors.

Prerequisite: Though this is a software engineering project, students are advised to do Artificial Intelligence 791.

AE4: Large-Scale Multi-Modal Particle Swarm Optimization

Multi-modal optimization (MMO) refers to approaches that find multiple solutions within multi-modal search landscapes. Traditionally, algorithms with the ability to find multiple solutions are referred to as niching algorithms. Thus far, MMO algorithms have been developed and evaluated mostly on small-scale optimization problems. A number of MMO particle swarm optimisation (PSO) algorithms have been developed. This study will analyze the scalability of these algorithms to large-scale optimization problems (LSOPs), and will develop approaches to improve their scalability. These approaches will be informed by recent studies by our research group into the reasons why PSO does not scale well, and algorithms will be developed to incorporate recent decomposition-based approaches. These decomposition-based approaches follow a divide-and-conquer strategy towards solving LSOPs, by dividing a large-dimensional optimization problem into several smaller-dimensional problems. These smaller-dimensional problems are then solved, and a fusion approach is implemented to combine optimal solutions from the smaller-dimensional problems. The scalability study will include scalability with reference to both problem dimensionality and the number of optima in the search landscape.
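
A hedged sketch of the divide-and-conquer idea behind such decomposition approaches: dimensions are randomly grouped into lower-dimensional subproblems, each is (notionally) solved against a shared context vector, and improved sub-solutions are fused back in. The grouping scheme, sub-optimizer, and objective below are toy placeholders.

```python
import numpy as np

def random_grouping(dim, group_size, rng):
    """Split `dim` decision variables into disjoint index groups."""
    idx = rng.permutation(dim)
    return [idx[i:i + group_size] for i in range(0, dim, group_size)]

def sphere(x):
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
dim = 1000
context = rng.uniform(-5, 5, dim)    # shared full-dimensional solution

for group in random_grouping(dim, 50, rng):
    # Placeholder for a sub-optimizer (e.g., a PSO run on just these
    # dimensions, evaluated by plugging them into the context vector).
    trial = context.copy()
    trial[group] = 0.0               # pretend the subproblem was solved
    if sphere(trial) < sphere(context):
        context = trial              # fuse the improved sub-solution

print(sphere(context))               # 0.0 for this toy objective
```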

This is a research project, with development to be done in a Python computational intelligence library.

Prerequisite: Students are advised to do Artificial Intelligence 791.

AE5: Design and Development of an Automated System for Medical Diagnosis from X-Rays

Diagnosis of dental anomalies and diseases is experiencing severe strain across Africa, due to a lack of sufficient infrastructure and dental radiologist expertise. Public hospitals across Africa are understaffed. People therefore do not have access to timely diagnosis of their X-rays, which can lead to severe discomfort, pain, and even fatalities. Recently, we have developed computer vision and deep learning approaches to automate the diagnosis of dental diseases and anomalies from panoramic X-rays. This work has led to the design of a more generic system to automate disease diagnoses from X-rays in general. The next stage of the research includes developing the software to maintain and deploy the models for various disease diagnoses from different types of X-rays. The purpose of this project is to refine the current architecture design to be more generic, and then to implement and deploy this system in a cloud-based environment. This project involves the design and development of a system that allows a medical professional to automatically train and deploy computer vision models for diagnostics from X-rays into a software application.

The system consists of two main components: a training component, where different convolutional neural networks are trained to diagnose different medical conditions, and a prediction component, where a new X-ray is provided as input and routed via the appropriate classifiers to make a diagnosis. For both components, images are routed through a pre-processing component, and then on to a convolutional neural network used to predict whether the X-ray image is workable or not. The latter determines if the image is of sufficient quality for automated analysis. For the training component, an annotation tool needs to be developed so that features of interest can be labelled and used for training. The system must be deployed on a scalable cloud-based platform.

Already developed and available are the architecture design of the system, with a single X-ray dataset type in mind (for example, panoramic X-rays), convolutional neural network code for a few dental anomalies and diseases, and a basic ReactJS frontend. Python code to facilitate model deployment has also been developed. This project will require implementation of the full architecture for X-rays in general, and not for specific types of X-rays; databases for storage of images; Docker images for a scalable architecture; and the image annotation tool. The training pipeline has to be implemented, which should also include transfer learning and one-shot learning approaches. A user interface has to be developed, and a reporting component needs to be implemented.

The project will largely be based on AWS services. Different services are glued together using Python code that sits either on an EC2 instance or in a Lambda function. Research will need to go into how the AWS services can be scaled to account for many different applications for different users.

The technologies that you will work with include Python for the backend, JavaScript for the frontend, HTML and CSS, and YAML for configuration files. In addition, the following AWS services will be used for cloud deployment: SQS, S3, Lambda functions, ECS, ECR, DynamoDB, Amplify, and EC2 (with autoscaling), as well as ReactJS and Docker/AMI (for containerization of images).

Requirements: This is a software engineering project, with some research components, specifically in the development and testing of the convolutional neural networks, transfer learning and one-shot learning. You should have been exposed to a number of the above technologies.

AE6: Competitive Coevolutionary Particle Swarm Optimization in Dynamic Environments

Competitive coevolution (CCE) models the arms race that is observed between competing species in nature, e.g. predator-prey relationships. Competitive coevolutionary particle swarm optimization (CCEPSO) algorithms have been used successfully to solve a range of optimization problems where the environment (search landscape) is static, including the training of neural networks (NNs). Since coevolution in nature is a dynamic process, it is intuitively believed that CCE models should be efficiently applicable to dynamic optimization problems (DOPs). This study will investigate to what extent CCEPSO algorithms can track optima in dynamic environments of various types. We have recently identified 27 different classes of DOPs, and this study will investigate performance on these classes of optimization problems. As a first step, a review of CCE will be done, focusing on its application in optimization algorithms and to solve optimization problems. Then, a CCEPSO algorithm will be developed to first solve static optimization problems. As part of this study, different relative fitness measures and competition pool selection strategies will be investigated and analyzed. The study will then proceed to apply the CCEPSO to the different classes of dynamic optimization problems. The performance of the CCEPSO will be analyzed in depth. If the CCEPSO struggles, or prematurely converges as was observed for NN training, the reasons for this will be investigated and solutions developed and further analyzed. The performance of the CCEPSO will then be compared with the performance of other state-of-the-art PSO algorithms for DOPs.

This is a research project.

Prerequisite: Students are advised to do Artificial Intelligence 791.

AE7: Incremental Feature Learning

Incremental feature learning refers to problems where the number of features increases over time. More generally, it may also refer to incrementally adding features, from the current set of available features, to the predictive model during the training process. As new features are added to the predictive model, in this case a neural network, focus is given to training the newly added weights, while still refining the older weights. Incremental feature learning is ultimately a dynamic optimization problem, in the sense that the search landscape dimensionality changes over time with the addition of new input units, hidden units, and weights. A training algorithm should therefore be able to cope with changing search landscapes.

This project will develop a dynamic particle swarm optimization (dPSO) neural network training algorithm for neural network incremental feature learning. A feature ranking approach will be used to rank features based on feature importance, and features will be added based on feature rank. Aspects to consider are different feature ranking approaches, when to add a new feature, when to grow the neural network architecture, how to prevent overfitting, when to stop adding incremental features, and lastly how to expand the dPSO particle position and velocity vectors to cope with growing neural network architectures.
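
A minimal sketch of the last aspect, expanding a particle's position and velocity vectors when new weights are added; the initialisation choices are invented placeholders:

```python
import numpy as np

def grow_particle(position, velocity, n_new_weights, rng, scale=0.1):
    """Append dimensions for newly added network weights; existing
    dimensions (the already-refined weights) are left untouched."""
    new_pos = rng.normal(0.0, scale, n_new_weights)  # small random init
    new_vel = np.zeros(n_new_weights)                # new dimensions at rest
    return (np.concatenate([position, new_pos]),
            np.concatenate([velocity, new_vel]))

rng = np.random.default_rng(42)
pos, vel = rng.normal(size=20), np.zeros(20)  # particle for 20 weights

# A new input feature adds, say, 5 hidden-layer weights.
pos, vel = grow_particle(pos, vel, 5, rng)
print(pos.shape, vel.shape)  # (25,) (25,)
```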

This is a research project.

Prerequisite: Students are advised to do Artificial Intelligence 791 and Machine Learning 741.

AE8: Research Manager

When doing research on a specific topic, for example particle swarm optimization (PSO), one of the first tasks the researcher faces is to find as many research papers on PSO as possible, to organize these in some way, and then to read and possibly annotate the papers with keywords/key phrases, or to summarize the papers. This project will develop a web-based tool to help the researcher optimize these processes. The research manager must include the following functionalities:

The developed research manager can therefore be used to support the researcher, to provide bibliometric information, and to obtain an overview of the broad research topic.

This is a software engineering project. The use case will be particle swarm optimization, and it will therefore be of benefit if you do Artificial Intelligence 791.

AE9: Deep Learning Approaches for Anomaly Detection in Dental Panoramic X-Rays

The current approach followed by orthodontic radiologists to identify teeth abnormalities is manual inspection of teeth X-rays. This is a time-consuming process, and often leads to incorrect diagnoses, or even to subtle abnormalities being missed. A recent project has developed a prototype to predict teeth anomalies using convolutional neural networks (CNNs). The approach followed was to develop a prediction model for each separate anomaly. This project will develop and evaluate alternative approaches to improve the performance of the current prototype. The following approaches will be implemented and evaluated:

This is a research project.

AE10: Optimization Algorithm Behavior Comparison

In the nature-inspired optimization research field, there is a proliferation of optimization algorithms inspired by some natural phenomenon. Examples of such algorithms include particle swarm optimization, evolutionary algorithms, artificial bee colony optimization, ant colony optimization, firefly optimization, fish school optimization, salp swarm optimization, krill herd optimization, grey wolf optimization, and Harris hawk optimization, to name a few. A recent survey has identified just over 150 swarm-based optimization algorithms. When the mathematical and algorithmic logic of these algorithms is analyzed, it becomes clear that many of these “new” nature-inspired algorithms are basically the same, with the main difference lying in the nature-inspired analogy. The research field of nature-inspired optimization can very much be likened to a box full of Smarties: different Smarties have different colours, but in the end each is simply a Smartie, the only difference being the colour.

This project will investigate and develop an approach to evaluate the search behavior of a collection of these nature-inspired optimization algorithms. All these algorithms are population-based: a number of search positions are randomly initialized throughout the search space. Over time, these positions are adjusted, converging towards one point in the search space – hopefully a good optimum. Looking at the search trajectories of each of these solutions, collectively the trajectories can be seen as a graph, with the root node being the point (or small region) to which all the solutions converge. The idea is then to construct search trajectory graphs for the different algorithms, and to determine similarity among these trajectory graphs. This approach will only work if all of the algorithms use the same random seed to initialize the population of individuals. In addition to the comparison of search behavior through analysis of the trajectory graphs, a formal comparison of the position update equations will also be done to find the common components of these algorithms, and the aspects in which they differ. The objective here is to see how the different optimization algorithms considered in this study can be factorized into the same general form.
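
As a hypothetical illustration, search trajectory graphs could be constructed along the following lines with networkx, discretising positions into cells and linking consecutive steps; the cell size, the toy trajectories, and the edge-overlap similarity hinted at in the comments are all assumptions to be refined in the project.

```python
import numpy as np
import networkx as nx

def trajectory_graph(trajectories, cell=0.5):
    """Nodes are discretised search-space cells; edges link each position
    to the next one on the same individual's trajectory."""
    g = nx.DiGraph()
    for traj in trajectories:
        cells = [tuple(np.floor(p / cell).astype(int)) for p in traj]
        g.add_edges_from(zip(cells[:-1], cells[1:]))  # steps toward the root
    return g

rng = np.random.default_rng(1)                 # same seed for every algorithm
swarm = rng.uniform(-5, 5, (10, 2))            # 10 individuals in 2D
trajs = [[p * (0.8 ** t) for t in range(20)] for p in swarm]  # toy contraction

g = trajectory_graph(trajs)
print(g.number_of_nodes(), g.number_of_edges())
# Similarity between two algorithms' graphs could then use, e.g., edge
# overlap: |E1 & E2| / |E1 | E2| (a Jaccard index over edge sets).
```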

This is a research project with strong publication potential.

Prerequisite: Students are advised to do Artificial Intelligence 791

BF1: Sequentialization of MPI programs

Sequentialization is a program verification technology where a concurrent program is translated into an equivalent sequential but non-deterministic program, so that existing sequential verification tools can be re-used. This approach has primarily been used for shared-memory concurrency. The goal of this project is to develop and prototype a sequentialization for distributed programs that use the MPI message passing interface for communication. The project can build on an existing sequentialization framework for C programs.

This is a research-oriented software development project which could feed into a related MSc project, and yield a publication. Candidates should be interested in programming languages and program verification, and have a good understanding of distributed programming in C and the MPI (i.e., RW314). This project can build on the CSeq framework (https://github.com/CSeq/Overview).

BF2: De-Sequencing of C programs

In C programs, the order of evaluation of sub-expressions (e.g., operands or function arguments) is guided by a set of arcane rules (see https://en.cppreference.com/w/cpp/language/eval_order). These rules become important when the sub-expressions can contain side effects, or when the program is multi-threaded. The goal of this project is to develop and implement a source-to-source translation that makes this implicit control non-determinism explicit, and rewrites affected constructs into a form that is easier to analyze (e.g., through sequentialization).

This is a research-oriented software development project which could feed into a related MSc project, and yield a publication. Candidates should be interested in programming languages and program verification, and have a good understanding of programming in C. This project can build on the CSeq framework (https://github.com/CSeq/Overview).

BF3: Smell Mining, Detection, and Visualization

Code smells are program structures that indicate some underlying problem with the code, even if there is no actual bug. Their definition is typically fuzzy - you know something is wrong with the code when you see it, even if you can’t quite tell why…

The goal of this project is to explore class-level code smells for Java and to build an IDE plugin that warns programmers about code smells in their project. The project should investigate different smell detection mechanisms (e.g., static analysis vs deep learning).

This project can focus on either research or software development aspects, and a research-oriented approach could feed into a related MSc project, and yield a publication. Candidates should be interested in programming languages. This project involves a collaboration with ByteFuse (https://bytefuse.ai/) and includes the option of a paid internship with the company.

BF4: Extension and Gamification of gtutr

gtutr [1] is an interactive feedback system that helps students develop ANTLR grammars. The goal of this project is to “gamify” the system; the main idea is to let lecturers define different stages, and to show the students’ progress after each stage on a leader board. The project includes the design of different kinds of stages, and the extension of the system to enable their implementation.

The project does not require any prior knowledge of compiler implementation or formal grammars beyond the level of RW244. The project should build on top of the existing gtutr tech stack, which uses Java Spring and JSP for the web pages, with the databases hosted on AWS.

This is primarily a software engineering project.

[1] Chelsea Barraball, Moeketsi Raselimo, Bernd Fischer: An interactive feedback system for grammar development (tool paper). SLE 2020: 101-107

BF5: Mining dictionaries for fuzz testing

Fuzz testing incrementally constructs test cases by randomly changing or adding input bytes. This simple approach is surprisingly effective in many cases, but becomes ineffective when the system under test expects more verbose and structured inputs, such as web pages, programs, or JSON or XML object descriptions. Fuzz testing tools therefore rely on keyword dictionaries to increase the granularity at which the inputs are constructed, but these dictionaries must be specified manually. The goal of this project is to develop and evaluate a system that automatically extracts keyword dictionaries for a black-box system under test from an existing input corpus; more specifically, the system should learn the sets of keywords and operators, the lexeme classes (e.g., identifiers and numbers) and the formatting and commenting rules with minimal feedback (i.e., input is syntactically correct or not) from the system under test.
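
A hedged sketch of the corpus-mining starting point, collecting frequent word-like and operator-like tokens as candidate dictionary entries; the regular expressions and file paths are illustrative, and the feedback loop with the system under test is the actual research problem.

```python
import re
from collections import Counter

def mine_dictionary(corpus_files, min_count=3):
    """Collect frequent word-like and operator-like tokens from a corpus
    as candidate fuzzing-dictionary entries."""
    counts = Counter()
    for path in corpus_files:
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        counts.update(re.findall(r"[A-Za-z_]\w*", text))   # keyword candidates
        counts.update(re.findall(r"[^\w\s]{1,3}", text))   # operator candidates
    return [tok for tok, n in counts.most_common() if n >= min_count]

# Hypothetical usage: the candidate list would then be pruned by checking
# which tokens the black-box system under test actually accepts.
# print(mine_dictionary(["samples/a.json", "samples/b.json"])[:20])
```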

This is a research-oriented software development project which could feed into a related MSc project, and yield a publication. Candidates should be interested in programming languages.

BF6: Stress-testing compilers by automated innocuous changes

The goal of this project is to design, implement, and evaluate a tool that applies behaviour-preserving transformations to test programs, resulting in modified programs that should behave as the original. The mutants can be generated by tree transformations. The project should find a grammar for a real language, implement some transformations, and evaluate the approach using a real compiler.

This is a research-oriented software development project which could feed into a related MSc project, and yield a publication. Candidates should be interested in programming languages.

BF7: Metamorphic library fuzzing

Metamorphic testing is a differential testing method that exploits equivalences and other properties of the test input domain (e.g., a+b = b+a) to construct new test inputs and, more importantly, to derive the expected test outputs (and so circumvent the oracle problem). Metamorphic fuzzing uses fuzzing methods to construct large numbers of test inputs, and so helps scale metamorphic testing.
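
For intuition, a metamorphic relation can be checked without any hand-written oracle, as in this small Python sketch using permutation invariance of sorting (standing in for the a+b = b+a example; the real framework targets Java classes):

```python
import random

def metamorphic_fuzz_sort(sort_fn, trials=500):
    """Permutation invariance: sorting any shuffle of a list must give
    the same output, so no expected-output oracle is needed."""
    rng = random.Random(0)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        ys = xs[:]
        rng.shuffle(ys)                      # metamorphic transformation
        assert sort_fn(xs) == sort_fn(ys)    # relation: outputs must agree

metamorphic_fuzz_sort(sorted)  # passes; a buggy sort would trip the assert
```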

The goal of this project is to build a metamorphic fuzzing framework for Java classes. It can build on a previously developed grammar-based fuzzing framework for Java classes.

This is a research-oriented software development project which could feed into a related MSc project, and yield a publication. Candidates should be interested in programming languages and software testing.

BT1: Network platforms

Please visit this page for more info, and get a feel for how this project fits into a Bigger Picture.

BT2: Application platforms

Please visit this page for more info, and get a feel for how this project fits into a Bigger Picture.

BT3: Analysis platforms

Please visit this page for more info, and get a feel for how this project fits into a Bigger Picture.

BT4: Localised no/low-code environments

Please visit this page for more info, and get a feel for how this project fits into a Bigger Picture.

BT5: e-Health

Please visit this page for more info, and get a feel for how this project fits into a Bigger Picture.

BvdM1: Angluin Learning

This project is only for students who took CS345 in 2021 and would like to extend their term 4 2021 project on Angluin learning for symbolic automata. The starting point should be to implement (or improve on the 2021 implementation of) the challenging aspects listed in [1], and then to benchmark the implementation more extensively than was required in 2021. Amongst the various additional aspects to consider are the gains of using a classification tree approach (see [2]) compared to making use of an observation table.

Resources

[1] Computer Science 345: 2021 - Project for Term 4

[2] The TTT Algorithm: A Redundancy-Free Approach to Active Automata Learning

BvdM2: Comparison of Parser Generators

Co-supervisor: Willem Bester

For this project, parser generators, and thus indirectly parsing algorithms, should be compared in terms of parsing performance (time and memory usage), and also in terms of ease of use in specifying a correct grammar (on a set of grammars used for benchmarking). Particular focus should be placed on whether and how grammars with (indirect) left recursion are handled, and also on ambiguity resolution. The equivalence of the parsing performed by each of these tools (on sample grammars) should be established by comparing the parse trees or abstract syntax trees generated by each tool. The grammar/EBNF of your CS244 project should be included in the benchmark set of grammars. Focus should be placed on parsing based on parsing expression grammars, and especially on implementations supporting left recursion. A comparison should be made between how (indirect) left recursion is supported by parsing tools based on parsing expression grammars and by tools based on other formalisms.

The following is a list of some sample tools/formalisms to consider as a starting point, but parser generators based on parsing expression grammars must be considered.

Resources

[1] Comparison of parser generators

[2] Writing a PEG parser for fun and profit

[3] PEG Parsers

[4] pegen

[5] The Pika Parser reference implementation

[6] Antlr

[7] Bison

[8] Earley parser

[9] Introducing Glush: a robust, human readable, top-down parser compiler

[10] GoGLL: General Context-free Parsing Made Easy

BvdM3: Static Analysis as a Service

Co-supervised by David Baker Effendi with involvement from Fabian Yamaguchi, chief scientist at ShiftLeft.

Static analysis is the analysis of a program or section of code without executing it. This is done by examining the program's structure and semantics, which is often made possible by first converting the code into an intermediate representation (IR).

This project aims at providing static analysis as a service via a self-made server, accessed via either a web application interface or an IDE plugin, e.g. VS Code or JetBrains IDEs. The target for analysis will be either Kotlin Android applications or Solidity smart contracts - this is up to the student.

A language front-end for Joern will be written to convert the target language to an IR called a code property graph, to allow Joern to perform analysis on the target program. The server will accept the program (or procedures from it) to analyse, and the results will then be communicated to the client, which will be either a web application or an IDE.

Tools such as Joern, ANTLR, and Soot will most likely be used along with tools associated with developing the client. Expected languages to develop in will be Kotlin, Scala, or JavaScript. The server may end up being deployed on AWS or a Stellenbosch server such as Hydra.

Resources

[1] Static Analysis Using the Cloud

[2] Using Static Analysis for Automatic Assessment and Mitigation of Unwanted and Malicious Activities Within Android Applications

[3] Testing Smart Contracts

[4] Modeling and Discovering Vulnerabilities with Code Property Graphs

[5] (Video) Elegant and Scalable Code Querying with Code Property Graphs

[6] (Video) AWS CodeGuru - This is an example of static analysis as a service but this is only limited to Java programs

BvdM4: Smart Contract Security

Co-supervised by David Baker Effendi.

In this project, a library should be developed to convert Solidity smart contracts to the code property graph intermediate representation, making use of [3]. The benefits of making use of the code property graph intermediate representation when looking for security vulnerabilities are well established, as illustrated by the company ShiftLeft. Solidity vulnerabilities can often be checked either directly on the abstract syntax tree (type inference or underflow/overflow) or using taint analysis, hence the reason for investigating the benefits of using the code property graph intermediate representation for security analysis of Solidity smart contracts. The viability of searching for Solidity vulnerabilities by running queries on the code property graph should be illustrated by making use of smart contracts with known vulnerabilities.

Resources

[1] Testing Smart Contracts

[2] Elegant and Scalable Code Querying with Code Property Graphs. Fabian Yamaguchi

[3] Sūrya, The Sun God: A Solidity Inspector

[4] Solidity Top 10 Common Issues

[5] ShiftLeft

BvdM5: Exploration of Bibliometric Datasets

This project is co-supervised by Marcel Dunaiski.

This project should develop an application to structure and analyze bibliometric data in real time. It should also investigate the relevance, trends, and commonalities shown by different similarity algorithms used to structure bibliometric data. You might also want to measure and compare the performance of similarity algorithms and general queries, with regard to response times, for the graph databases used in this project.

For this project, the work done in [1], using natural language processing techniques and graph databases, partly explained in [2], should be extended. For example, [1] did not incorporate citations into similarity computations, which is extremely important in bibliometric applications. Citations can be used sparingly, for example as tiebreakers when two or more papers are equally similar, or more fundamentally as a basis for finding relevant similar papers, augmented by community detection and clustering based on full texts. As [1] points out, TF-IDF is not ideal, and word2vec should rather be used. The similarity measures should also be adapted to be more appropriate for bibliometrics. Connected Papers can be regarded as a competing product.
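
For reference, the TF-IDF similarity baseline amounts to something like the following sketch (with invented abstracts); the project would replace or augment it with word2vec embeddings and citation information.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "particle swarm optimization for dynamic environments",
    "swarm intelligence methods in dynamic optimization",
    "graph databases for bibliometric citation analysis",
]

vectors = TfidfVectorizer().fit_transform(abstracts)
sim = cosine_similarity(vectors)
print(sim.round(2))  # papers 0 and 1 score far closer than either does to 2
```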

Since this project deals with building a search tool that helps researchers find (academic) papers, or papers similar to a given paper, it is worth exploring community detection algorithms, which are already implemented in graph database libraries such as the one for TigerGraph.

See [3] for a competing solution.

Resources

[1] Simon du Toit’s 2020 honours project

[2] (Video) Text-Based Graph Analytics with Simon Du Toit

[3] Connected papers

BvdM6: The Capitec graph database hackathon challenge

This is a list of three projects, co-supervised by Capitec; a student selecting this project should then select one of the three choices.

CHOICE 1:

Co-supervisor: Davis Gouvias from Capitec

Use the dataset from the Capitec Hackathon and make use of a graph database to explore at least one of the challenges described in [1].

Resources

[1] Hackathon Challenge

CHOICE 2:

A Graph Based Behaviour Recommender system for banking

The success of online retail can largely be attributed to recommender systems. Financial institutions have in recent years moved to ML approaches to supplement lead generation. However, algorithms such as XGBoost use a conventional ML approach based on a tabular data format. These approaches aren’t optimized for graph data structures, and they require complex feature engineering to cope with these data patterns.

A behaviour recommendation engine will optimise how customers interact with the bank and its products by recommending behaviour changes. The expectation would be to do research and build a viable approach/recommender looking at behavioural components of a client rather than product/economic factors. The average number of products bank customers have is between one and three; focusing also on behavioural changes, such as increasing bank interaction, can therefore lead to improved customer experience and additional revenue streams.

The graph interaction information is critical to improving recommendation accuracy. A graph neural network (GNN) is a deep learning (DL) framework that can be applied to graph data to perform edge-level, node-level, or graph-level prediction tasks. GNNs can leverage individual node characteristics as well as graph structure information when learning the graph representation and underlying patterns. Therefore, in recent years, GNN-based methods have set new standards on many recommender system benchmarks.

The coding aspects of this project need to be worked on at Capitec in Technopark. A laptop, access to the AWS Neptune graph database, and a development environment will be provided.

Resources

[1] Graph-based recommendation system with Neptune ML: An illustration on social network link prediction challenges

[2] A Blitz Introduction to DGL

CHOICE 3:

The application of graph databases to banking

Choose from several interesting problems Capitec has identified that can be uniquely solved through the application of Graph Databases.

The coding aspects of this project need to be worked on at Capitec in Technopark. A laptop, access to the AWS Neptune graph database, and a development environment will be provided.

Resources

[1] Amazon Neptune

[2] A Blitz Introduction to DGL

[3] Neo4j Graph Data Science library

BvdM7: Phase Equilibrium Modelling

Co-supervised by David Baker Effendi and Matt Mayne from Geology.

Phase equilibrium modelling has revolutionised the way earth scientists understand earth materials. In this computational method, the thermodynamic stability of mineral phases is calculated from databases of experimental and natural data. Stellenbosch University has been a forerunner in this field, creating an open-source software tool, Rcrust [1], which allows chemical changes to also be considered in these thermodynamic calculations. With the calculation potential made possible through cluster processing, we aim to produce a server-based application that maximises the potential of this software, as processing times are reduced and much larger simulations can be run. This would allow Rcrust to expand its applications within the earth sciences, as well as to be used within the planetary sciences, including recent research performed on the evolution of the crust of Mars.

In summary, the aim of this project is to add parallel processing to the current version of Rcrust. Instead of, or in addition to, the focus on parallel processing, you can also (i) rewrite the current implementation using better software engineering practices, such as adding extensive test cases, (ii) make it easier for other developers to contribute to the current codebase, and (iii) possibly also rewrite the current implementation to use Python instead of R.

A new dedicated high-performance computer laboratory has been set up to provide a server for distributing Rcrust and providing remote host access for Rcrust users to run long calculations making use of parallel processing. You will have access to this laboratory and will assist in the design of protocols and workflows to optimise the distribution of workload to the cluster.

Resources

[1] Rcrust homepage

[2] PhD thesis on Rcrust - Matt Mayne

[3] Main publication explaining Rcrust functionality

[4] Daily Maverick Rcrust article

CPI1: Robust backend for tutorial submissions and testing

Co-supervised by Willem Bester.

Summary: Develop a robust backend for code submissions
Focus: A software engineering project, but could lead to a publication in collaboration with WHK1

A tool for the automatic testing of, and generation of feedback on, code submissions was developed last year. However, the implementation served as a proof of concept. Our goal is to use the tool for our large first-year classes from 2023. Your task would be to implement a backend that can be hosted in the cloud, accessed via a web interface, and that is robust and scalable enough to be used by 500 students simultaneously.

Requirements of the tool:

CPI2: Scalable symbolic linearisability checking

Summary: Implement a parallel linearisability checker.
Focus: Software engineering, but with scope for research
Reading: JPF

A shared data structure is linearisable if it appears to behave as a sequential data structure; i.e., every operation on the data structure appears to take effect instantaneously. We’ve implemented a tool that checks linearisability of Java implementations by comparing all possible execution traces (generated by a tool called JPF) of a parallel implementation with the execution trace of a sequential implementation (called the oracle). The checker is more automatic if symbolic input instead of concrete input is provided to JPF, but it scales worse. One obvious way to parallelise the linearisability checker is to check each symbolic input in a different thread. However, since symbolic execution in itself does not scale well, there is also scope for parallelising JPF’s underlying symbolic execution engine.
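
To make the idea concrete, a brute-force linearisability check for one tiny hand-written history might look like the sketch below; the real tool enumerates traces with JPF, and real-time ordering constraints between non-overlapping operations are ignored here for brevity.

```python
from itertools import permutations

# A tiny concurrent history on a counter: (operation, argument, result).
history = [("inc", None, None), ("get", None, 1), ("inc", None, None)]

def sequential_counter(ops):
    """Oracle: replay ops sequentially and check the observed results."""
    value = 0
    for op, _, result in ops:
        if op == "inc":
            value += 1
        elif op == "get" and value != result:
            return False
    return True

# Linearisable if SOME sequential order of the operations explains the
# results observed in the concurrent execution.
print(any(sequential_counter(p) for p in permutations(history)))  # True
```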

Objectives

CPI3: Parallel MCTS Enhancement Strategies

Summary: Implement and analyse the performance of enhancement strategies for MCTS
Focus: Software engineering and research which could lead to a publication.
Further reading: A Survey of MCTS Methods, sections 1 & 3

The success of Monte Carlo Tree Search (MCTS) in computer Go resulted in much interest in MCTS research, and a number of enhancement strategies have been proposed to improve the playing strength of MCTS agents. While various articles have proposed MCTS enhancements in the context of certain domains and implementations, only in some cases are these incorporated into a parallel MCTS algorithm. This makes it challenging to compare the performance of different MCTS enhancements, both for serial and parallel MCTS. However, we’ve developed a framework and tournament engine at Stellenbosch which can be used to run experiments with different algorithms for the same domain on the same hardware setup.

Objectives

CPI4: Race condition checker

Summary: Develop a tool that can test/illustrate race conditions
Focus: Software development with scope for research

The nondeterministic behaviour of parallel programs makes it tricky for anyone new to parallelism to understand what a race condition is and how to fix it properly. Furthermore, tutorials that test whether a student can correctly solve the critical section problem (without introducing deadlocks) are currently marked by hand.
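
For example, the kind of race such a tool needs to expose can be demonstrated in a few lines (a hedged Python sketch; a real checker would control thread scheduling rather than rely on lucky interleavings):

```python
import threading

counter = 0

def worker(n):
    global counter
    for _ in range(n):
        counter += 1   # read-modify-write: not atomic, so updates can be lost

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # may be less than 400000: a lost-update race condition
```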

Objectives

CPI5: Checking program correctness for relaxed memory models

Summary: Develop a tool that can test/illustrate the effect of a memory model
Focus: Software development with plenty of scope for research
Further reading: Bytecode Verification, Software Validation, JPF

The nondeterministic behaviour of parallel programs already makes it tricky to write correct code, because there are so many possible behaviours (execution sequences) to think of. Writing correct parallel programs becomes even more tricky if the program might be run on hardware with different memory models. In reality, a memory model is simply a set of rules. As such, it would be easy to implement a program that can illustrate the different execution orders allowed by a certain set of rules.
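
To illustrate the point that a memory model is just a set of rules, a toy checker could enumerate instruction orders and filter them by such a rule, as in this hypothetical sketch of a store-buffering litmus test:

```python
from itertools import permutations

# Classic store-buffering litmus test: two threads, two shared variables.
thread1 = [("store", "x", 1), ("load", "y")]
thread2 = [("store", "y", 1), ("load", "x")]

def program_order_kept(order, thread):
    """One example rule: each thread's own operations must appear in
    program order; a relaxed model might drop or weaken this rule."""
    mine = [op for op in order if op in thread]
    return mine == thread

# All interleavings of the two threads' operations allowed by the rule.
orders = [o for o in permutations(thread1 + thread2)
          if program_order_kept(o, thread1) and program_order_kept(o, thread2)]
print(len(orders))  # 6 interleavings under this sequential-consistency rule
```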

Objectives

LvZ1: Automata toolkit for large automata

When implementing automata, it is essential to make the implementations as efficient as possible in terms of space and time complexity. This project requires the implementation of the standard algorithms for finite automata, and in particular for symmetric difference nondeterministic finite automata (XNFA). It should be implemented in C, and be easily extendable to new algorithms. It is intended as a command line tool, with an accompanying grammar to define input automata in a textual language. There is no need for a graphical user interface.

Research/software engineering project.

Prerequisite: You must have an interest in low-level optimisation of algorithms.

Difficulty meter: 7.5/10

LvZ2: Optical character recognition of Braille documents

This project requires that a double-sided Braille document be scanned/photographed and the dot patterns recognized. Given a digital image of dots, thresholding would need to be applied to highlight the dots and depressions. Segmentation can then be used to identify the positions of the dots which form one character group, and the lines in which the groups of dots occur. The dot groups must then be translated to characters. Given an existing translation system from Braille to Afrikaans, the correctness of the identified dot groups must be improved using a digital Afrikaans dictionary.
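
A hedged OpenCV sketch of the first two stages, thresholding and dot segmentation; the file name and area thresholds are invented, and the grid alignment, front/back disambiguation, and dictionary-based correction are where the real work lies.

```python
import cv2

img = cv2.imread("braille_scan.png", cv2.IMREAD_GRAYSCALE)  # hypothetical scan

# Threshold to separate dot highlights/shadows from the paper background.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Connected components give candidate dot positions (centroids).
n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
dots = [tuple(c) for c, s in zip(centroids[1:], stats[1:])
        if 5 < s[cv2.CC_STAT_AREA] < 200]   # filter noise by blob area

# `dots` would next be clustered into 2x3 cells and lines, mapped to
# Braille characters, and corrected against an Afrikaans dictionary.
print(len(dots), "candidate dots")
```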

[1] A. Antonacopoulos and D. Bridson, A robust Braille recognition system. In 6th International Workshop on Document Analysis Systems, S. Marinai and A. Dengel (eds.), Lecture Notes in Computer Science 3163, pp. 533-545.

[2] N. Falcón et al, Image processing techniques for Braille writing recognition. In EUROCAST 2005, Lecture Notes in Computer Science 3643, pp 379-385.

This is a research/software engineering project.

Difficulty meter: 8.5/10

LvZ3: Implement a suite of game apps that are accessible for the blind

You have to investigate design methods to make digital board and other game apps accessible to the blind. Based on your findings, you have to implement multiple Android game apps, including dominoes, battleships, Cluedo, snakes and ladders, and a Tamagotchi-type animal care app. All apps must be sound enabled. The crux of the project, however, is that these games must be designed to be easy to play for the blind, which means that you cannot simply add sound to a visual approach to the game. Refer to [1] for details on design principles.

[1] Filho, F. Board Game Accessibility for Persons with Visual Impairment. MSc thesis, University of Ontario, 2018.

This is a pure Software Engineering project.

Difficulty meter: 6/10

LvZ4: Picture generation with parameterized shape grammars

You must develop a parameterized shape grammar, where the terminals are lines of one unit. These lines will represent stitches in a blackwork embroidery pattern. Different grammar rules are needed for borders, motifs, and space-filling patterns. You must develop a software tool in which the effects of different rules can be investigated through visual output, new rules can be executed, and the effect of random versus density-based expansion can be analysed. Once a pattern is generated, manual manipulation of the pattern should be allowed. This would include actions like duplication, reflection, or rotation.

[1] Grow, A., Mateas, M. and Wardrip-Fruin, N., 2017. Blackwork Embroidery Pattern Generation Using a Parametric Shape Grammar. In Proceedings of the 8th International Conference on Computational Creativity.

Prerequisite: An interest in layout algorithms, and CS345.

This is a software engineering/research project.

Difficulty meter: 7.5/10

LvZ5: Animation of LEGO comic strips

The aim of the project is to make LEGO animated comic strips. You need to use either OpenGL or Unity to produce the comics. You need to animate movement and facial expressions of LEGO minifigures and models. For facial expressions, you would have to redefine existing facial expression models (such as FACS for human faces) and present smooth changes in emotions on any LEGO face. Secondly, smooth movements have to be animated (such as walking, running, or head movements). Lastly, once these mathematical models have been defined and implemented, you will be expected to combine these into a software tool that will allow a user to easily produce such movies. That is, there must be at least one supplied template with a certain theme (such as pirates or space). The user must then choose expressions and movements to produce a short movie clip.

Note that this project involves solid mathematical modelling, and not just drawing pretty pictures.

Prerequisite: strong applied mathematics background, and preferably experience with OpenGL or Unity.

This is a software engineering project.

Difficulty meter: 8.0/10, and it is a substantial amount of graphics coding

LvZ6: JFLAP for XNFA

A package needs to be developed to implement symmetric difference nondeterministic finite automata (XNFA), with functionality similar to that for NFA in JFLAP. The important issue here is the interface, but a basic implementation of XNFA is required for teaching purposes. Hence, the system must allow an XNFA to be input both from a text file and interactively via a GUI. Full editing of the machine on the display is required. You should compare different layout algorithms to display the automaton on the screen. You would need to implement comparative interfaces, with, for example, the input XNFA on the left of the screen and the corresponding DFA on the right of the screen. You should visually show the execution of an XNFA on a given input word by means of an execution tree. You must provide the facility to print the output to a picture file (such as JPG or PNG).

Prerequisite: CS345

This is a pure software engineering project.

Difficulty meter: 6/10, but it is a lot of coding.

LvZ7: Clustered cellular automata for three-dimensional shape generation

Cellular automata with clustering (CCA) allow for varying cell size during cellular evolution. It has been shown that such clustering improves the efficiency of ant sorting algorithms. However, the manner in which ants drop their colour cells can differ widely, based on different neighbourhood definitions and different strategies. Shape generation has been applied in the two-dimensional case. The aim of this project is to investigate the extension of CCA to the 3D case, and to develop dropping strategies which can be used to generate 3D shapes, such as balls, cubes, and pyramids.

Prerequisite: A mathematical inclination and an interest in patterns.

This is a pure research project, and can be used as a stepping stone towards an MSc.

[1] R. Adams, L. van Zijl, Ant sorting with clustering cellular automata. Proceedings of SAICSIT 2018, South Africa.

[2] C. Zeeman, Shape generation based on cellular automata with clustering. Report, Dept. of Computer Science, Stellenbosch University.

Difficulty meter: 9.0/10

MD1: Supervised author disambiguation algorithms

The ability to automatically map journal papers to the researchers who wrote them, and vice versa to map researchers to their collections of published papers, is a fundamental issue. It would seem to be a simple and straightforward problem to solve, yet it remains a major, unsolved problem in information science.

Reliable author name disambiguation is difficult because:

In this project, you will research current state-of-the-art supervised author name disambiguation methods. You will implement the most promising methods for the South African context, and the performance of your implementation will be compared to that of state-of-the-art unsupervised methods.

The project is research heavy.

Publishable: Likely. This research can be extended and can lead into an M.Sc. degree.

Skills you will acquire:

MD2: Information extraction from South African research output

South African universities are obligated to make all master's theses and doctoral dissertations publicly available and to upload them in full-text document formats. The goal of this project is to extract meaningful metadata from these documents. This also includes using Optical Character Recognition (OCR) techniques to parse older submissions that were uploaded as scanned documents.

For this project, you will develop software that parses PDF files and extracts useful information such as the thesis title, student and supervisor names, abstracts, and references. The technique used to extract the desired information will have to be determined. Lastly, you will design a web app that allows users to upload a PDF document and retrieve all the desired metadata.
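
As a starting point, title extraction could be prototyped roughly as follows (a hedged sketch using pdfminer.six; the first-line heuristic and the file name are illustrative only):

```python
from pdfminer.high_level import extract_text

def guess_title(pdf_path):
    """Heuristic: the title is often the first non-empty line of page 1."""
    first_page = extract_text(pdf_path, page_numbers=[0])
    lines = [ln.strip() for ln in first_page.splitlines() if ln.strip()]
    return lines[0] if lines else None

# print(guess_title("thesis.pdf"))  # hypothetical input document
```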

This project is software engineering heavy but contains a bit of research depending on the information extraction technique used.
Publishable: unlikely.

Skills you will acquire:

MD3: Classroom Moderator

The aim of this project is to design and develop an online lecturing question and participation tool. This tool should allow students to ask questions, upvote their peers’ questions, and start discussions on a topic. Questions should be ranked based on the number of upvotes or participation/controversy. This allows lecturers or moderators to focus on the important questions and answer them.

This project is inspired by Google Moderator (https://en.wikipedia.org/wiki/Google_Moderator), a service that was discontinued by Google in 2015.

Features that the service should offer:

This project consists almost exclusively of software engineering.
Publishable: No.
Skills you will acquire:

Further information:

MD4: Crawler for African Research Output: Adaptive Information Extraction

Adaptive information extraction is the task of finding information and automatically identifying patterns in data that is not well structured and is almost free text. The goal of this project is to design a web crawler that can download information about theses and dissertations from African universities’ library websites. Adaptive information extraction techniques will be used to find the desired information and extraction rules from the webpages, by leveraging the typical structure of HTML templates and shallow natural language knowledge for free text. The induced rules will also be used to guide the crawler to the correct web pages for further crawling.

This project features both research and software engineering.

Publishable: unlikely.

Skills you will acquire:

MD5: Text categorization of South African research outputs

The Department of Education maintains a hierarchical structure of categories that represent the various fields of study in South African higher education. The higher levels correspond to broad disciplines (e.g., Engineering and Law), while lower levels cover smaller, finer-grained fields of study (e.g., Chemical Engineering and Mechanical Engineering).

The goal of this project is to use machine learning techniques to develop a multi-label classifier that uses the titles and abstracts of master's theses and PhD dissertations and classifies them into one or more categories. More specifically, you will combine cutting-edge transformer models (such as BERT or SciBERT) to extract local semantic information with latent Dirichlet allocation (LDA) to capture global semantic information. For the multi-label classifier, you will look at convolutional neural networks (other options are possible). This is cutting-edge research, and we might consider other techniques and models.
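
Before the transformer/LDA combination, a simple multi-label baseline could be set up roughly as follows; TF-IDF with one-vs-rest logistic regression stands in for the BERT/SciBERT features, and the titles and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

titles = [
    "Catalytic cracking of heavy crude fractions",
    "Constitutional review of mineral rights legislation",
    "Finite element analysis of turbine blade fatigue",
]
labels = [["Engineering", "Chemical Engineering"],
          ["Law"],
          ["Engineering", "Mechanical Engineering"]]

# Binarise the label sets: one indicator column per category.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(titles, y)
pred = clf.predict(["Fatigue testing of welded steel joints"])
print(mlb.inverse_transform(pred))
```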

The project is research heavy.
Publishable: Likely. This research can be extended and can lead into an M.Sc. degree.
Skills you will acquire:

MD6: Mapping and analysing global socio-political company data

ORBIS is a proprietary database that provides comprehensive lists of both private and public companies and organizations. It can be filtered by name, geographical area, industry, financials, and many other criteria. The database includes information on firms’ locations, as well as other relevant information, such as revenues, number of employees, and industry sectors. GDELT is a “global database of society” that includes events and mentions, which are used to construct a global knowledge graph by scanning websites for company, organization, or entity mentions.

For this project, you will translate ORBIS data into a spatial map, i.e., convert latitude/longitude points to align with each firm’s headquarters. In addition, you will match the ORBIS firm names to the named entities in GDELT-GKG. The names do not match well between the databases (i.e., a company may have one name in the ORBIS database, while the same company’s name is modified in GDELT-GKG). Subsidiaries with very similar names, and the scale of these databases, make this a challenging problem.
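
As an illustration of the name-matching difficulty, a naive baseline could score string similarity as below (the names are invented); real work would need blocking, normalisation of legal suffixes, and stronger similarity measures.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

orbis = "Acme Holdings Ltd"
gkg_candidates = ["acme holdings limited", "acme mining subsidiary", "zenith corp"]

best = max(gkg_candidates, key=lambda name: similarity(orbis, name))
print(best, round(similarity(orbis, best), 2))
```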

Publishable: No.

Skills you will acquire:

MD7: Network analysis of socio-political online data to understand conflicts at the dyadic and group level

GDELT (Global Database of Events, Language, and Tone) is a global socio-political news media database which is updated every 15 minutes. It contains information about events such as civil unrest and strikes, while capturing information about the general tone and sentiment of the actors involved in these events.

You will construct a network of ties from the GDELT organizations data, where the organizations are nodes and the edges represent ties compiled from mentions. The actors are media-reported organizations, and if two organizations are mentioned together in an article, then there will be a tie between them. You will implement various graph algorithms (network analysis measures) that are used as indicators to gain knowledge about the actors.
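
A hedged networkx sketch of this construction: organizations become nodes, co-mentions become weighted edges, and standard centrality measures serve as first indicators (the articles are invented):

```python
import networkx as nx
from itertools import combinations

# Hypothetical GDELT-style data: organizations mentioned per article.
articles = [
    ["OrgA", "OrgB", "OrgC"],
    ["OrgA", "OrgB"],
    ["OrgC", "OrgD"],
]

g = nx.Graph()
for orgs in articles:
    for u, v in combinations(set(orgs), 2):      # one tie per co-mention pair
        w = g.get_edge_data(u, v, {"weight": 0})["weight"]
        g.add_edge(u, v, weight=w + 1)           # accumulate co-mention counts

print(nx.degree_centrality(g))
print(nx.betweenness_centrality(g, weight="weight"))
```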

A few examples of measures you will investigate:

Ideally, we would like to be able to use these measures to develop predictions such as: What actions or decisions can an actor (or other stakeholder) in the network take to make the network more favourable to them (i.e., to have a higher cooperation score or gain more unique connections)?

The project is research heavy.
Publishable: Possible.

Skills you will acquire:

MN1: Deep learning for leafroll disease detection

Leafroll disease is a serious problem that affects many South African vineyards. The disease can be observed in the shape and coloration of the leaves, but many farmers only notice it at a later stage of production. There is a need to assist humans in detecting leafroll disease in the vineyard at an early stage. The challenging part of this project is the limited dataset available for training a convolutional neural network, so the project will rely on augmentation techniques to populate the training dataset with high-quality images.
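
As one possible starting point, here is a sketch of an augmentation pipeline using torchvision (an assumption; Keras or albumentations would serve equally well):

    # Illustrative augmentation pipeline for a small leaf-image dataset.
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(15),                     # vine rows are not axis-aligned
        transforms.ColorJitter(brightness=0.2, hue=0.05),  # field lighting varies
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
        transforms.ToTensor(),
    ])
    # augmented = augment(leaf_image)  # leaf_image: a PIL image from the dataset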

MN2: Face recognition using Siamese neural network

Facial recognition is the task of making a positive identification of a face in a photo or video image against a pre-existing database of faces. It begins with detection - distinguishing human faces from other objects in the image - and then works on the identification of those detected faces. The student is required to research the Siamese neural network and the use of different similarity functions, such as Euclidean distance, to compare the similarity of images.
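
A minimal sketch of the Siamese idea, assuming PyTorch (the architecture and sizes are placeholders): one shared encoder embeds both images, and a distance function on the embeddings scores their similarity.

    import torch
    import torch.nn as nn

    # Toy shared encoder; a real system would use a deep CNN.
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64),
                            nn.ReLU(), nn.Linear(64, 32))

    def similarity(img_a, img_b):
        emb_a, emb_b = encoder(img_a), encoder(img_b)  # same weights for both inputs
        return torch.norm(emb_a - emb_b, dim=1)        # Euclidean distance; try cosine too

    a, b = torch.rand(1, 1, 28, 28), torch.rand(1, 1, 28, 28)
    print(similarity(a, b))  # after training: small distance = same identity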

MN3: Activity recognition

Human Activity Recognition is the problem of identifying events performed by humans given video input. Activity Recognition is an important problem with many societal applications, including smart surveillance. Surveillance videos can capture a variety of realistic anomalies, such as fighting and hijacking. The student is required to perform anomaly detection in surveillance videos. For this task, there are many publicly available datasets for training a model to recognize anomalies in videos. Such detections could serve as evidence in criminal investigations.

MN4: Real-Time Multi-Object Tracking

Multi-Object Tracking (MOT) is a longstanding goal in computer vision which aims to estimate trajectories for objects of interest in videos. Object detection and re-identification (re-ID) are important problems in surveillance monitoring systems. Most methods treat this problem with two separate models: a detection model first localizes the objects of interest with bounding boxes in each frame, and an association model then extracts re-identification (re-ID) features for each bounding box and links it to one of the existing tracks according to metrics defined on those features. The student has to investigate methods that combine these two tasks and improve the detection and tracking of multiple objects in videos.
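
To illustrate the association step, here is a toy sketch (not any particular published method): intersection-over-union (IoU) between a new detection and existing track boxes, with a greedy link to the best-overlapping track.

    # Boxes are (x1, y1, x2, y2); tracks and detection are invented.
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
        return inter / (area(a) + area(b) - inter)

    tracks = {1: (10, 10, 50, 50), 2: (200, 200, 240, 260)}
    detection = (12, 14, 52, 55)
    best = max(tracks, key=lambda t: iou(tracks[t], detection))
    print(best, iou(tracks[best], detection))  # track 1, high overlap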

MN5: Text to image generation using Generative Adversarial Networks (GANs)

GANs have shown significant improvements in image synthesis over the years. In this project, you have to explore the capabilities of different variations of GANs for text-to-image generation. The student also has to improve the training mechanism of GANs.

MN6: License Plate Recognition via Deep Neural Networks

Automatic License Plate Recognition is a challenging and important task: it is used in traffic management, digital security surveillance, vehicle recognition, and parking management in big cities, and it can be used to detect stolen cars through security surveillance. The student will have to investigate how the plate number recognition system in South Africa works and what can be done to improve it through deep learning techniques.

MN7: Bias Remediation in datasets using Variational Autoencoders

Datasets are crucial when training a deep neural network. When datasets are unrepresentative, trained models are prone to bias because they are unable to generalise to real-world settings. This is particularly problematic for models trained on a dataset where some classes are better represented than others, and which thus fail to generalise. The student will have to investigate how variational autoencoders perform in place of Generative Adversarial Networks for this task, and the use of the learned latent space instead of Principal Component Analysis (PCA) space.

SK1: Ingenious Counterfactual Regret Minimization

Counterfactual regret minimization (CFR) algorithms are used in computational game theory to find optimal behaviour for agents involved in interactions. CFR algorithms also form the basis of the top poker AI bots at present. The Ingenious framework is a system for investigating search and planning algorithms in development at the Computer Science division. This project will add suitable games (including some poker variants) to the framework, and implement a number of CFR variants in the framework using these games as testbeds.
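
For a flavour of CFR's core update, here is a sketch of regret matching (illustrative only, not the framework's API): the strategy plays each action in proportion to its positive cumulative regret.

    import numpy as np

    def strategy_from_regrets(cum_regret):
        """Regret matching: probabilities proportional to positive regret."""
        positive = np.maximum(cum_regret, 0.0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.full_like(cum_regret, 1.0 / len(cum_regret))  # uniform fallback

    print(strategy_from_regrets(np.array([3.0, -1.0, 1.0])))  # [0.75, 0.0, 0.25]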

References/sources:

SK2: AI for Skat

Skat is a three-player trick-taking card game. This project involves implementing a new computer player for Skat, or further developing an existing one. The implementation should allow the developed computer player to play on Michael Buro’s International Skat Server. Key approaches to consider for the core of your artificial intelligence are perfect information Monte Carlo and information set Monte Carlo tree search, while some aspects might be tuned on game records using machine learning. Optionally, the techniques and the Skat domain can be incorporated into the Ingenious Framework.
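
A sketch of the perfect information Monte Carlo idea (the helpers deal_consistent_world and evaluate below are hypothetical stubs for game-specific code): sample hidden-card worlds consistent with the player's knowledge, evaluate each legal move in each world, and pick the move with the best average.

    import random

    def deal_consistent_world(my_info):
        # Hypothetical stub: a real version deals the unseen cards at random,
        # consistent with everything the player has observed.
        return random.random()

    def evaluate(world, move):
        # Hypothetical stub: a real version solves or heuristically scores the
        # resulting perfect-information game.
        return random.random()

    def pimc_choose(legal_moves, my_info, samples=100):
        totals = {m: 0.0 for m in legal_moves}
        for _ in range(samples):
            world = deal_consistent_world(my_info)
            for move in legal_moves:
                totals[move] += evaluate(world, move)
        return max(legal_moves, key=lambda m: totals[m])  # best average value

    print(pimc_choose(["play_ace", "play_ten"], my_info=None))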

Potentially relevant sources:

SK3: Item Response Theory toolkit

Item response theory (IRT) refers to a class of statistical models where responses to various items (usually explicit questions) are used to try to determine or quantify an unknown value of interest. Typical examples arise in education and health, where the aim may be to quantify abilities, attitudes, or susceptibility to various risks. Item response theory extends classical test theory (CTT), primarily by modelling the variation in the items, which CTT considers identical. Put another way, IRT models explicitly consider the varying difficulty of questions.

Determining these difficulties and the values of the unknown attribute from the responses of a population can be viewed as a statistical estimation or inference problem. In this project, you will build a Python library for specifying IRT models, and performing automated estimation and/or inference on these models. The tool should include frequentist and Bayesian approaches, with the Bayesian approaches including methods based on conjugacy where appropriate, as well as various approximate inference methods. Methods based on variational inference would be desirable, since they may facilitate the use of automatic differentiation variational inference to more easily investigate and fit custom IRT models.

This project can be developed from scratch, or one can fork the Python py-irt library, develop it further, and contribute pull requests.
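
For a flavour of the estimation problem, here is a minimal sketch of fitting a Rasch (1PL) model by maximum likelihood, assuming numpy/scipy (the response matrix is invented):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit  # logistic sigmoid

    # responses[i, j] = 1 if person i answered item j correctly (invented data)
    responses = np.array([[1, 0, 1],
                          [1, 1, 0],
                          [0, 0, 1]])
    n_persons, n_items = responses.shape

    def objective(params):
        theta = params[:n_persons]               # person abilities
        b = params[n_persons:]                   # item difficulties
        p = expit(theta[:, None] - b[None, :])   # P(person i answers item j correctly)
        ll = responses * np.log(p) + (1 - responses) * np.log(1 - p)
        return -ll.sum() + 0.01 * np.dot(params, params)  # tiny ridge keeps estimates finite

    fit = minimize(objective, np.zeros(n_persons + n_items), method="BFGS")
    print("abilities:", fit.x[:n_persons], "difficulties:", fit.x[n_persons:])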

Potential resource:

SK4: Probabilistic Graphical Models toolkit

Probabilistic graphical models (PGMs) allow one to visually specify (classes of) probability distributions by making use of graph structures. Important operations one may wish to perform on these distributions can then be implemented as algorithms which make use of this graph structure, with the complexity of the graph typically impacting the efficiency of the operations. There are a number of types of PGMs with corresponding algorithms for various tasks. This project involves implementing a toolkit for specifying and manipulating classes of distributions using their graphical representations, as well as the underlying algorithms to enable efficient probabilistic reasoning with respect to the distributions specified.

There is a variety of other software available for reasoning with PGMs (see the paper by Kevin Murphy below for a review of tools available in 2007). In this project, the aim is to provide a convenient GUI for specifying and querying PGMs in an interactive manner, but also to allow programmatic access through an API. The core of the toolkit should be implemented in Julia to enable good performance while maintaining convenient use from Python.
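
Although the core is meant to be written in Julia, the following Python/numpy fragment illustrates the kind of primitive such a toolkit builds on: a factor product followed by marginalization over discrete variables.

    import numpy as np

    # P(A) and P(B|A) as tables over binary variables A and B (invented numbers)
    p_a = np.array([0.6, 0.4])           # indexed by a
    p_b_given_a = np.array([[0.9, 0.1],  # row a, column b
                            [0.2, 0.8]])

    joint = p_a[:, None] * p_b_given_a   # factor product: P(A, B)
    p_b = joint.sum(axis=0)              # marginalise out A
    print(p_b)                           # [0.62, 0.38]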

References:

SK5: Best practice software engineering for the Ingenious Framework

The Ingenious framework is a system under development at our Computer Science division for supporting research in search and planning. A number of previous students have contributed to it - the Java portion of the codebase is currently approximately 65k lines of code - and over time an extensive list of issues and shortcomings w.r.t. design and software engineering has been identified. In this project, you will focus on streamlining the design of the existing codebase to reduce redundant code, improve flexibility, and establish software engineering standards for the project going forward. Your changes should make it simpler to understand the framework design and to add new domains and algorithms to it.

SK6: Improved testing for the Ingenious Framework

Recommended modules (2022): Software Testing and Analysis; Vulnerability Discovery and Exploitation

The Ingenious framework is a system under development at our Computer Science division for supporting research in search and planning. A significant weakness of the framework is the lack of tests for existing code, as well as of suitable testing infrastructure for future development. In this project, you will focus on building that infrastructure: developing and incorporating tests for existing code, and setting up testing as part of a continuous integration pipeline that provides suitable fine-grained testing metrics and integrates with software verification tools.

SK7: OpenPeakViewer

Recommended: Previous experience with computer graphics, spatial data processing, and/or Android app development.

This project involves building an open-source clone of applications like PeakVisor, PeakView, PeakFinder, and PeakLens. You will need to acquire height information and map names from open-source data, and then build a tool for storing, manipulating, and efficiently querying this data, and rendering the view from any location in real time. The system should ideally allow navigation on the map in real time, and include an Android app which can overlay the camera view with names of visible peaks.
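
One core subproblem is deciding which peaks are visible; a toy line-of-sight check over a height grid might look as follows (illustrative only; real elevation data would come from an open DEM):

    import numpy as np

    def visible(dem, obs, peak, samples=100):
        """True if no sampled terrain point rises above the obs-to-peak sight line."""
        (r0, c0), (r1, c1) = obs, peak
        h0, h1 = dem[r0, c0], dem[r1, c1]
        for t in np.linspace(0, 1, samples)[1:-1]:
            r = int(round(r0 + t * (r1 - r0)))
            c = int(round(c0 + t * (c1 - c0)))
            if dem[r, c] > h0 + t * (h1 - h0):   # terrain blocks the sight line
                return False
        return True

    dem = np.array([[10, 20, 15],
                    [12, 40, 18],
                    [11, 22, 35]])
    print(visible(dem, (0, 0), (2, 2)))  # False: the 40 m cell blocks the view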

TG1: Remote Sensing Data Extractor and Synthesizer

Remote Sensing is the science of converting data about the Earth which was collected via remote sensors (like satellites) into information. There are different applications of remote sensing data, such as the detection and monitoring of settlement expansion, the monitoring of the polar ice caps, and the like. All of these applications require data. In this project a student will develop a web application that can extract data from a variety of sensors, for different locations on Earth, via Google Earth Engine. The extracted data will then be pre-processed and provided to the end user. The application should also be able to synthesize data from an existing labeled dataset if one is provided to the application by an end user.
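
As a sketch of what extraction might look like, assuming the Earth Engine Python API (the collection, location, and dates are illustrative, and authentication is assumed to have been set up):

    import ee

    ee.Initialize()
    point = ee.Geometry.Point([18.86, -33.93])            # approx. Stellenbosch
    collection = (ee.ImageCollection("COPERNICUS/S2")     # Sentinel-2 imagery
                  .filterBounds(point)
                  .filterDate("2021-01-01", "2021-12-31"))
    print(collection.size().getInfo())                    # number of matching scenes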

TG2: Morphological classification of radio galaxies using CNNs

The morphological classification of radio sources (celestial sources emitting radio waves) is important to gain a full understanding of galaxy evolution processes and their relation with local environmental properties. Furthermore, the complex nature of the problem, its appeal for citizen scientists, and the large data rates generated by existing and upcoming radio telescopes combine to make the morphological classification of radio sources an ideal test case for the application of machine learning techniques. One approach that has shown great promise recently is Convolutional Neural Networks (CNNs). A CNN is a type of neural network that learns to extract its own features. It then uses these extracted features to perform classification. Until recently, only simple CNN architectures like AlexNet, VGG, ResNet, and GoogLeNet (or variants thereof) have been considered for this task. In this project a student will investigate how well more complicated architectures, like gated, group equivariant, and multi-branch CNNs, perform in comparison with the simpler architectures. In addition to comparing the recognition performance of the various architectures, their computational requirements will also have to be compared.

Reference:

TG3: Geospatial Maritime Visualisation Mobile Application / Tool

There is a lack of applications that allow managers, security officers, and owners of assets at sea to easily keep track of those assets, for example via a mobile application. We therefore propose the development of a free and open-source application to aid in this regard. The application should enable users to subscribe to a particular vessel (asset), or a set of assets, and receive status updates from the vessels in the form of notifications. There are different vessel events a user would like to be notified of, all with different priorities: emergencies, assets in distress, berthed vessels, vessels starting to sail, and vessels going off course. Such an application will improve the monitoring of maritime assets and thereby aid Maritime Situational Awareness (MSA).

Phase two of the project is the proactive monitoring of such assets: if, for example, a vessel stops unexpectedly (such as getting stuck in the Suez Canal), the relevant authorities should be notified to attend to the situation as fast as possible, to reduce any risk of danger to the crew and to minimize monetary loss and environmental effects. There are several types of vessel distress, such as being attacked by pirates, going off course, oil spills, collisions, sinking, grounding, and medical emergencies.

What type of mobile application platform? Web, iOS, and Android using Flutter. What will you learn? Mobile app development, notification routing, database skills, data streaming, and big data analytics.

TG4: Climate and Phenology change analysis for wine producing regions in the Western Cape

This data science project falls within a currently funded flagship project (TerraClim), funded by the South African wine industry (Winetech). TerraClim is founded on extensive, ongoing research and development that aims to improve the understanding of climate change in the complex terrain of the Western Cape and of how the grapevine responds to these changes. This project extends previous analysis from T. O. Southey's PhD thesis, highlighting climate variables and the tempo of change across the six wine producing regions of the Western Cape.

The research objectives are:

  1. data integration of climate variables over the six wine producing regions of the Western Cape.
  2. data integration of plant response in the context of climate change:
    • For one cultivar (Cabernet Sauvignon) over three sites
    • For multiple cultivars over two sites

Data provided:

  1. Climate data (hourly) for the six wine producing regions for 5 years.
  2. Grapevine phenology data for 9 years for one cultivar (Cabernet Sauvignon).
  3. Climate data for the study sites: Elgin (cooler site), Somerset West (moderate to warm site), and Stellenbosch (warm site).

TG5: Customer Complaint Classification

Reacting in a timely fashion to client feedback and complaints is a crucial aspect of any business. In particular, routing a complaint to the right person or department that is able to deal with the matter in the shortest amount of time is of utmost importance. The aim of the project is to implement a series of classifiers that take as input a customer complaint text narrative and automatically predict the most probable topic, thus enabling the matter to be routed to the correct department. The student is expected to implement a series of classifiers making use of NLP methods; a comparison can then be made between the methods in terms of their accuracy as well as their computational complexity.
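
A natural baseline against which to compare more advanced models is a classical pipeline; here is a sketch assuming scikit-learn (the complaints and departments are invented):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    complaints = ["my card was charged twice", "the app crashes on login",
                  "billed after cancellation", "cannot reset my password"]
    departments = ["billing", "technical", "billing", "technical"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(complaints, departments)
    print(model.predict(["charged twice for one order"]))  # likely ['billing']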

TG6: Boxes

This project will be co-supervised by Prof Lynette van Zijl. A box is a word over a given alphabet in which the first symbol is the same as the last symbol. For example, the word abaca is a box. A box can contain other boxes: abaca contains the boxes aba and aca. Boxes can also overlap, such as aba and aca in abaca. The main purpose of the project is to investigate the generation of the longest possible sequence of symbols, over a given alphabet, which does not contain any repeating boxes. These strings are known as longest optimal box-repetition free (LOBRF) words. In a previous study we found that one can generate these longest strings by finding Hamiltonian paths in a so-called innerbox graph. An innerbox is a box that contains no other boxes, and an innerbox graph is a graph whose nodes are innerboxes (with all non-optimal edges removed). Two main properties of innerbox graphs will be investigated (a small illustrative sketch follows the list):

  1. The Hamiltonian paths in innerbox graphs will be analyzed to identify any common sub-paths that are traversed when LOBRF words are generated. The main goal of this analysis is to find a deterministic generation algorithm by which LOBRF words can be constructed.
  2. The structure and theoretical properties of innerbox graphs will be investigated to further help reach the goal outlined in point one.
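
To make the definitions concrete, here is a small sketch that enumerates the boxes of a word and checks box-repetition freeness (illustrative only):

    def boxes(word):
        """Yield all boxes: substrings of length >= 2 whose first and last symbols match."""
        n = len(word)
        for i in range(n):
            for j in range(i + 1, n):
                if word[i] == word[j]:
                    yield word[i:j + 1]

    def is_box_repetition_free(word):
        """True if no box occurs at two different positions in the word."""
        seen = set()
        for b in boxes(word):
            if b in seen:
                return False
            seen.add(b)
        return True

    print(list(boxes("abaca")))             # ['aba', 'abaca', 'aca']
    print(is_box_repetition_free("abaca"))  # True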

WHK1: Automatic student feedback generator for tutorial submissions

Cosupervised by Cornelia Inggs.

TLDR: Implement automatic feedback for programming assignments submitted by students in CodeTester, based on code correctness and quality.

Focus: Research and software engineering
Broad / Deep: Broad and Deep
Publication: Possible to some extent
Scope for continued study: Yes

Last year, Cornelia and I supervised a proof-of-concept implementation of CodeTester, a course-management system that allows students to follow unique trails through a curated set of programming exercises, for use in introductory programming courses. The ultimate goal is to help strong students quickly get to more challenging exercises, whereas those who need more help get more exposure to simpler exercises, providing a thorough grounding in basic programming skills.

CodeTester currently handles Java, but support for Python must now be added. The feedback is also pretty basic – compiler error messages and the results of unit tests – and we want the kind of feedback a demi (student demonstrator) would give, for example, pointing out probable causes for failing test cases and giving a nudge or two in the right direction.

Requirements:
The new version of CodeTester

WHK2: Regular Expression Matching via Virtual Machine

TLDR: Implement regular expression matching using virtual machine (VM) instructions (instead of automata)

Focus: Research and Software Engineering
Broad or deep: Broad and deep
Publication: Possible
Scope for continued study: Yes

The classical approach to regular expressions, followed by standard implementations for languages such as Java, Perl, and Python, is to write a matcher based on building and processing Nondeterministic Finite Automata (NFAs). A different approach is to compile a regular expression to virtual machine (VM) instructions, as explained by Cox; in the literature, VM approaches have mostly not been explored in detail, leaving a lot of virgin territory.

In this project, you will investigate implementing regular expression matching using a VM, with a particular emphasis on performance. There is a lot of scope for intricate, low-level programming, and also a number of theoretical issues to explore and address. If you are interested in this project, this talk is a nice introduction to the VM approach, as used by Grammarly.
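
To give a flavour of the VM approach, here is a toy backtracking interpreter for a tiny instruction set, in Python for brevity (after Cox; a real matcher would compile regexes to such programs automatically and would likely avoid naive backtracking):

    # Instructions: ('char', c), ('split', x, y), ('jmp', x), ('match',)
    def run(prog, s):
        """Does prog match a prefix of s?"""
        def step(pc, sp):
            while True:
                op = prog[pc]
                if op[0] == 'char':
                    if sp < len(s) and s[sp] == op[1]:
                        pc, sp = pc + 1, sp + 1
                    else:
                        return False
                elif op[0] == 'jmp':
                    pc = op[1]
                elif op[0] == 'split':              # try both branches (backtracking)
                    return step(op[1], sp) or step(op[2], sp)
                elif op[0] == 'match':
                    return True
        return step(0, 0)

    # Program for a+b:  0: char a;  1: split 0, 2;  2: char b;  3: match
    prog = [('char', 'a'), ('split', 0, 2), ('char', 'b'), ('match',)]
    print(run(prog, "aaab"))  # True
    print(run(prog, "b"))     # False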

WHK3: A Quant-Qual Narrative-Based Research Tool for Doing Transformative Transdisciplinary Research

Cosupervised by John van Breda, Centre for Sustainability Transitions.

TLDR: Write a Python tool for encoding and visualising quantitative–qualitative narrative data, to support decision-making processes.

Focus: Research and Software Engineering
Broad or deep: Deep
Publication: Possible
Scope for continued study: Yes

There is a great need in the research community for integrated research approaches for both sense-making and decision-making purposes; in other words, research approaches capable of handling large context-generated quantitative–qualitative data sets. This project is essentially about developing a Python program capable of capturing and visualizing multiple human (qualitative) and non-human (quantitative) stories, and making these available to researchers and decision-makers, both as individual stories and as narrative patterns, for collaborative sense-making and decision-making purposes.

You must be willing to gain some knowledge in a research area unfamiliar to someone with a typical Computer Science background, and then to negotiate deliverables with a “client”. Methods from Natural Language Processing (NLP) might be relevant, and particular attention must be paid to encoding fuzzy narrative data.

Requirements:

WHK4: Music “Take” Recognition and Similarity Scoring

Cosupervised by Gerhard Roux, Department of Music.

TLDR: Build a fuzzy Shazam for classical music.

Focus: Research
Broad or deep: Deep
Publication: Possible
Scope for continued study: Yes

Western art music recordings are produced by recording multiple “takes” in sequential order and editing them together. Unlike popular music productions, these takes have slight tempo deviations because they are not recorded to a click track, i.e., a metronome played via headphones to the performers (King, 2017). There can also be slight variations in pitch between different takes. Pitch deviations might be caused by an unaccompanied vocal ensemble drifting in pitch, or by temperature changes that cause a piano to go out of tune.

Recording technicians would benefit from a music information retrieval tool that can be fed a few seconds of music and find similar phrases in other takes. Unlike an app like Shazam, which recognizes a specific “audio fingerprint” by exactly matching FFT data of incoming audio to a database (Wang, 2006), this tool will need to be fuzzy to compensate for slight deviations in pitch and tempo. It might require melody extraction (Bosch et al., 2016) and tempo extraction (McKinney et al., 2007). A database of different recordings, each comprising multiple takes in WAV format, will be provided.
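
One plausible direction, sketched here with librosa (an assumption; the file names are placeholders): chroma features are more tolerant of small deviations than raw FFT fingerprints, and subsequence dynamic time warping can align phrases played at slightly different tempi.

    import librosa

    query, _ = librosa.load("query_phrase.wav", sr=22050)  # a few seconds of music
    take, _ = librosa.load("take_03.wav", sr=22050)        # a full candidate take

    chroma_q = librosa.feature.chroma_cqt(y=query, sr=22050)
    chroma_t = librosa.feature.chroma_cqt(y=take, sr=22050)

    # Subsequence DTW: find where the query best aligns inside the take.
    D, wp = librosa.sequence.dtw(X=chroma_q, Y=chroma_t, subseq=True)
    print(D[-1, :].min())  # low cost = a similar phrase occurs somewhere in the take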

Further reading:

WHK5: A System for Managing CS Project Allocations

TLDR: Implement a web-based system that manages the allocation of projects to students and of examiners to projects, based on adjustable constraints.

Focus: Software Engineering and Research
Broad or deep: Broad and deep
Publication: Unlikely
Scope for continued study: No

The current approach to allocating honours projects to students uses a genetic algorithm that requires lots of hand tuning to keep both supervisors and students more or less not unhappy (sic). It is also quite a chore to manage the allocation of examiners for the different assessment opportunities.

For this project, you will have to investigate, evaluate, and compare different approaches to project and examiner allocation, for example, constraint-based approaches, genetic algorithms, and ranking problems. Then, your findings have to be integrated into a web-based system that can be used by students, supervisors, and the project coordinator alike to manage project and examiner choice and allocation. In particular, I envision a system where, in the future, examiners can bid on assessing certain projects, similar to what happens with paper reviews.
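
As an illustration of one candidate approach, allocation without side constraints can be cast as an assignment problem and solved with the Hungarian algorithm via scipy (the preference ranks are invented; real capacities and constraints require a richer model):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # cost[i, j] = rank student i gave project j (1 = first choice)
    cost = np.array([[1, 2, 3],
                     [2, 1, 3],
                     [1, 3, 2]])
    students, projects = linear_sum_assignment(cost)  # minimises total rank
    print(list(zip(students, projects)),
          "total rank:", cost[students, projects].sum())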

Requirements:

WHK6: An Automata Layout Engine for LaTeX

TLDR: Extend the existing graph layout capabilities of TikZ to handle the state–transition diagrams for finite automata, with automatic algorithm selection and scaling in bounding boxes.

Focus: Software engineering and Research
Broad or deep: Deep
Publication: Possible to some extent
Scope for continued study: Possible, but unlikely

As part of my research, I draw a lot of finite state machines, in particular variants of DFAs and NFAs used for regular expression processing. Setting them by hand in LaTeX gets very tedious, especially if I have to update my work. Some tools, like Graphviz and JFLAP, can do basic layouts, but they do not even come close to the variety of layouts supported by the yEd Graph Editor. Unfortunately, the latter is a commercial product and GUI-driven.

For this project, you must create a tool for automatically laying out automata graphs, ideally reusing existing functionality for graph drawing in TikZ. This will involve creating a small input language à la Graphviz DOT for specifying automata graphs, evaluating and implementing existing graph layout algorithms that might be suitable for automata, and putting these ideas together in a tool that produces output suitable for direct use in LaTeX with TikZ. An important requirement is that the implementation must be able to handle bounding boxes automatically, changing layout parameters and algorithms on the fly, which is to say, your work will have to interface with (Lua)LaTeX and call on the layout engine directly.
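
As a toy illustration of the intended pipeline (the project proper would drive TikZ's own graph-drawing backend from Lua rather than use networkx): compute positions with an off-the-shelf layout algorithm and emit TikZ code for a small automaton.

    import networkx as nx

    G = nx.DiGraph()
    G.add_edges_from([("q0", "q1"), ("q1", "q1"), ("q1", "q2")])
    pos = nx.spring_layout(G, seed=1)   # any layout algorithm could be swapped in

    # Emits TikZ; assumes \usetikzlibrary{automata} styles on the LaTeX side.
    for state, (x, y) in pos.items():
        print(f"\\node[state] ({state}) at ({2*x:.2f}, {2*y:.2f}) {{${state}$}};")
    for u, v in G.edges:
        print(f"\\draw[->] ({u}) to ({v});")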

WHK7: Type Inference for Python Functions

TLDR: Implement type inference for Python functions using static analysis.

Focus: Research and Software Engineering
Broad or deep: Deep
Publication: Possible
Scope for continued study: Yes

Python is both strongly and dynamically typed. PEP484 added type hints to the language, which, although not enforced, can be used by third-party tools for static analysis; for background, see PEP482 and PEP483.

In this project, we will attempt to do type inference for the parameters and return values of Python functions. To do so, we employ techniques of path-sensitive static analysis to follow possible type information through a function’s control flow. Python’s extensive introspection and reflection facilities will be useful, and we might borrow ideas from the Hindley–Milner type system, an extension of which is used in the Haskell language. Once we have collected type information, we have to decide how to use it, for example, by adding type hints to the source code, or by using symbolic execution to find problematic code blocks.
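
As a toy illustration of the flavour of analysis involved (a real path-sensitive analysis is substantially more involved), the following uses Python's ast module to collect the types of constant return expressions:

    import ast

    source = """
    def f(flag):
        if flag:
            return 1
        return "no"
    """

    tree = ast.parse(source)
    func = tree.body[0]                 # the FunctionDef node for f
    return_types = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Return) and isinstance(node.value, ast.Constant):
            return_types.add(type(node.value.value).__name__)
    print(return_types)                 # {'int', 'str'}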