# The year project

The project is a critical part of the honours degree. The project is about the independent development of a large application. The student must demonstrate his/her skill with respect to all aspects of software engineering including formal specification, production of a prototype, testing, and documentation.

The precise definition of a large application is somewhat nebulous and we shall interpret this a little more broadly. The project can be seen as a pure software engineering task, an exercise in non-trivial problem-solving, or a research project. In all three cases students are expected to

• write a significant amount of code,

• produce a clear and substantial report about his/her work, and

• make a public presentation of the work at the end of the year.

Many projects are intended to be done by a single student, and most students prefer to work on their project on their own, but it is possible for two or more students to work on a large project together.

The project development will be done following the Agile (Scrum) process. Two of the sprint demos will be given special significance (the so-called do-or-die demos) and will be evaluated for marks.

## Proposed projects 2021 - the list will be final by the end of January 2021

### LvZ1: Electronic classroom for Pioneer School

An electronic classroom, where a teacher can see all the computer screens of students in the classroom, is a well known concept. But this project comes with a twist: you have to implement an electronic classroom where the teacher has a Windows PC, but the students work on BrailleNote Apex machines. You have to assume an in-classroom wifi connection between approximately 20 Apex machine, and one PC. Secure connections must be catered for, and no interference between two classrooms over wifi must be disallowed. The teacher, at any time, must be presented with a split screen, where all the screens from the students are visible. No feedback is necessary from the teacher to the students. As physical BrailleNote Apex machines will not be available, you will have to simulate the Apex outputs.

Pure software engineering project.

Prerequisite: CS3

Difficulty meter: 7/10

### LvZ2: Automata toolkit for large automata

When implementing automata, it is essential to make the implementations as efficient as possible in terms of space and time complexity. This project requires the implementation of the standard algorithms for finite automata, and in particular for symmetric difference nondeterministic finite automata (XNFA). It should be implemented in C, and be easily extendable to new algorithms. It is intended as a command line tool, with an accompanying grammar to define input automata in a textual language. There is no need for a graphical user interface.

Research/software engineering project.

Prerequisite: You must have an interest in low-level optimisation of algorithms.

Difficulty meter: 7.5/10

### LvZ3: Optical music recognition

Given a physical sheet of music, develop a system that can scan the page, and translate the music into MusicXML. This is a large project, and the aim is for you to either develop as complete a solution as possible, or to focus on known difficult issues in the problem, and see if you can improve accuracy. You may initially assume a perfect scan (that is, the project is not about the OCR so much as about the semantics of the music symbols). Your output must be in MusicXML format. You must make use of existing open data sets to test your results.

[1] Bainbridge, D and Bell, T. The challenge of optical music recognition. Computers and the Humanities 35, pp 95–121, 2001.

[2] Rebelo, A et al. Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Retrieval 1, 137-190, 2012.

Research/software engineering project.

Prerequisite: You must have some background in music theory.

Difficulty meter: 8/10

### LvZ4: Cellular automata for automated pattern layout

In the textile industry, an active research area is the automated layout of pattern pieces so that the amount of wasted space is minimal. Kargar and Payvandi[1] proposed an optimised pattern layout technique based on the Imperialist Competitive algorithm (ICA) and the Bottom Left-Fill algorithm (BLF). This project has a two-fold goal: first, Kargar and Payvandi’s approach must be implemented as a baseline. Secondly, clustering cellular automata must be used to do optimised pattern layout, and the results compared with that of Kargar and Payvandi.

This is a pure research project.

[1] Kargar, M. and Payvandi, P. Optimization of Fabric Layout by Using Imperialist Competitive Algorithm. Journal of Textiles and Polymers 3(2), June 2015, pp 55–63.

Difficulty meter: 7.0/10

### LvZ5: Translations to Braille music

This project requires that different input formats should be translated correctly into Braille music notation. As the basis, MusicXML must be translated into Braille music notation. Secondly, MIDI files must be reproduced in Braille music notation. The translations must implement all notation up to at least UNISA grade 5, but preferable more. It must be possible to play notes on an electronic keyboard which are then saved in MIDI format and translated into Braille.

[1] Goto, D. et al, A Transcription System from MusicXML Format to Braille Music Notation, EURASIP Journal on Advances in Signal Processing, 2007.

This is a pure software engineering project.

Prerequisite: You must have some background in music theory.

Difficulty meter: 6/10.

### LvZ6: Optical character recognition of Braille documents

This project requires that a double-sided Braille document be scanned/photographed and the dot patterns recognized. Given a digital image of dots, thresholding would need to be applied to highlight the dots and depressions. Segmentation can then be used to identify the positions of the dots which form one character group, and the lines in which the groups of dots occur. The dot groups must then be translated to characters. Given an existing translation system from Braille to Afrikaans, the correctness of the identified dot groups must be improved using a digital Afrikaans dictionary.

[1] A. Antonacopoulos and D. Bridson, A robust Braille recognition system. In 6th International Workshop on Document Analysis Systems, S. Marinai and A. Dengel (eds.), Lecture Notes in Computer Science 3163, pp. 533-545.

[2] N. Falcón et al, Image processing techniques for Braille writing recognition. In EUROCAST 2005, Lecture Notes in Computer Science 3643, pp 379-385.

This is a research/Software Engineering project.

Difficulty meter: 8/10

### LvZ7: Implement a suite of game apps that are accessible for the blind

You have to investigate design methods to make digital board and other game apps accessible to the blind. Based on your findings, you have to implement multiple Android game apps, including dominoes, battleships, cluedo, snakes and ladders, and a Tamagotchi type animal care app. All apps must be sound enabled. The crux of the project, however, is that these games must be designed to be easy to play for the blind, which means that you cannot simply add sound to a visual approach to the game. Refer to [1] for details on design principles.

[1] Filho, F. Board Game Accessibility for Persons with Visual Impairment. MSc thesis, University of Ontario, 2018.

This is a pure Software Engineering project.

Difficulty meter: 6/10

### LvZ8: Picture generation for non-linear images

You must design a grammar where the terminals are picture symbols (such as hieroglyphs). The grammar must allow for multiple sets of different terminal symbols. Using this grammar, you must be able to generate kaleidoscopic images that are rectangular, circular, hexagonal or octoganal (that is, concentric). The grammar rules will define the form of the final picture. Note that the grammar assumes a multilayered and hierarchical output, which makes it different from the sequential and linear grammars that one uses for programming languages, for example. Given a grammar to generate picture elements, you must randomly generate sentences from the grammar. These textual sentences must then be rendered geometrically to produce pleasing visual outputs. The main work will fall into the layout and visual output of the sentences. As an additional bonus, you can add animations to the outputs.

See examples of the types of images that you would be expected to create.

Prerequisite: An interest in layout algorithms, and CS345.

This is a software engineering/research project. Difficulty meter: 7.5/10

### LvZ9: Cellular automata with clustering for form generation

Cellular automata with clustering (CCA) allow for varying cell size during cellular evolutions. It has been shown that such clustering improves the efficiency of ant sorting algorithms. However, the manner in which ants drop their colour cells can differ widely, based on different neighbourhood definitions and different strategies. The aim of this project is to investigate the different possible neighbourhoods and dropping strategies, and find different strategies that will generate different geometrical forms such as triangles, circles and rectangles.

Prerequisite: A mathematical inclination and an interest in patterns.

This is a pure research project.

[1] R. Adams, L. van Zijl, Ant sorting with clustering cellular automata. Proceedings of SAICSIT 2018, South Africa.

Difficulty meter: 9.0/10

### LvZ10: Animation project

This project focuses on applications of animation. You can choose one of the two animation choices – choice (a) is a research project, and choice (b) is a software engineering project. You may put up only (a) or (b) on your choice list – not both.

(a) Animation of leopard skin deformation for identification Using photographs, identification of individual animals against a database of known individuals can be time-consuming, as photographs can be taken from different angles, and an animal’s skin shows a large variation in identification features based on their position and movement. This project specifically concerns leopard, and requires you to implement known computer vision techniques to model possible deformations. In short, you will be expected to implement 1) a rapid, coarse key-view detection based on spot appearance, 2) pose estimation and 3D model fitting using a (pre-computed) dynamic Feature Prediction Tree (FPT) followed by bundle adjustment and 3) texture back-projection, extraction of unique phase singularities and final encoding using an extended variant of Shape Contexts.

Prerequisite: strong applied mathematics background

This is a pure research project.

[1] T. Burghardt and N. Campbell, Individual Animal Identification using Visual Biometrics on Deformable Coat-Patterns. Proceedings of the 5th International Conference on Computer Vision Systems, 2007.

Difficulty meter: 9.0/10

(b) Animation of LEGO figures To make stop motion LEGO movies, you need to animate movement and facial expressions of LEGO minifigures and models. For facial expressions, you would have to redefine existing facial expression models (such as FACS for human faces) and present smooth changes in emotions on any LEGO face. Secondly, smooth movements have to be animated (such as walking, running, or head movements). Lastly, once these mathematical models have been defined and implemented, you will be expected to combine these into a software tool that will allow a user to easily produce such movies. That is, there must be at least one supplied template with a certain theme (such as pirates or space). The user must then choose expressions and movements to produce a short movie clip.

Note that this project involves solid mathematical modelling, and not just drawing pretty pictures.

Prerequisite: strong applied mathematics background

This is a software engineering project.

Difficulty meter: 8.0/10, but it is a substantial amount of coding

### SK1: Building a phylogeny

Recommended: Mathematical Statistics 214 or Machine Learning A 742

A phylogeny, or evolutionary tree, represents plausible relationships between sequences (typically protein or nucleotide sequences correspond to various individuals/species) under some assumed evolutionary model. Building a phylogeny given an evolutionary model and sequence data is essentially an optimization problem, where one searches for the most likely tree to yield that data under the assumed model. A common intermediate step is summarizing the differences between sequences as a pairwise distance matrix. In this project, you will implement a tool for constructing a phylogeny from a pairwise distance matrix, as well as implement algorithms for extracting such a distance matrix from sequences under various evolutionary models, which typically involves sequence alignment followed by an analysis of the differences between the aligned sequences. More refined approaches which do not use a distance matrix may also be considered. Such a tool could be applied to reconstructing portions of the tree of life, or studying the mutation history of viruses such as HIV or COVID.

Some resources:

### SK2: Probabilistic Graphical Models toolkit

Pre-/co-requisites: Mathematical Statistics 214 and Computer Science 315/Machine Learning A 742

Probabilistic graphical models (PGMs) allow one to visually specify (classes of) probability distributions by making use of graph structures. Important operations one may wish to perform on these distributions can then be implemented as algorithms which make use of this graph structure, with the complexity of the graph typically impacting the efficiency of the operations. There are a number of types of PGMs with corresponding algorithms for various tasks. This project involves implementing a toolkit for specifying and manipulating classes of distributions using their graphical representations, as well as the underlying algorithms to enable efficient probabilistic reasoning with respect to the distributions specified.

There is a variety of other software available for reasoning with PGMs (see the paper by Kevin Murphy below for a review of tools available in 2007). In this project, the aim is to allow a convenient GUI for specifying and querying PGMs in an interactive manner, but also to allow programmatic access to an API. The core of the toolkit should be implemented in Julia to enable good performance while maintaining convenient use from Python.

References:

• K. Murphy. Software for Graphical Models: A Review. ISBA Bulletin, Dec. 2007. Available at http://www.ar-tiste.com/kmurphyDec2007.pdf.
• D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques (2009). MIT Press.

Various machine learning textbooks (e.g. those by David Barber and Kevin Murphy) also present information on PGMs.

### SK3: Ingenious Counterfactual Regret Minimization

Counterfactual regret minimization (CFR) algorithms are used in computational game theory to find optimal behaviour for agents involved in interactions. CFR algorithms also form the basis of the top poker AI bots at present. The Ingenious framework is a system for investigating search and planning algorithms in development at the Computer Science division. This project will add suitable games (including some poker variants) to the framework, and implement a number of CFR variants in the framework using these games as testbeds.

References/sources:

### SK4: Modern board games for the Ingenious framework

This project involves extending the Ingenious Framework by adding at least two modern board games, such as Ingenious, Carcassone, or Ticket to Ride as domains to the framework. The aim should be to use a modular approach allowing components to be developed for easy reuse by other domains. The project should also extend the current primitive visualization approaches in the framework to enable humans to play these domains against each other and computer players using a GUI. An ambitious student can consider a wider range of domains, or developing computer players for the domains identified.

### SK5: More Ingenious reinforcement learning

The Ingenious framework is a system under development at our Computer Science division for supporting research in search and planning. Reinforcement learning is a branch of artificial intelligence which considers how agents can learn to behave well based on interaction with their environment. The framework has support for reinforcement learning problems, added in a previous project. This project involves extending this support in a number of directions. This might include:

• Developing an approach to testing code making use of the Python pipeline for neural network training.
• Developing an RL agent using the existing algorithms for a domain already in the framework.
• Adding further enhancements from the Rainbow algorithm to the double deep Q-learning (DDQN) algorithm currently in the framework.
• Improving the efficiency of prioritized experience replay by developing a solution using a sum-tree.
• Add a new domain with a continuous action space to the framework.
• Implement relevant policy gradient reinforcement learning algorithms in the framework, and apply them to this domain.

Beyond these directions, there is plenty of scope to incorporate more recent developments in reinforcement learning by agreement with your supervisor.

References:

### SK6: Ingenious Planning Module

The Ingenious framework is a system under development at our Computer Science division for supporting research in search and planning. Automated planning is a branch of artificial intelligence focused on optimizing multi-step decision-making to achieve a goal. This project involves adding a suite of planning algorithms and selected planning tasks to the Ingenious framework, and using these tasks to benchmark and compare these implementations.

Reference:

• S. LaValle (2006). Planning Algorithms. Cambridge University Press.

### SK7: Causal calculator

Recommended: Mathematical Statistics 214

The causal calculus is a set of rules describing how one can compute probabilities in the presence of causal interventions rather than observations. These rules are based on representing probability distributions by Bayesian networks, one of a number of types of probabilistic graphical models. The aim of this project is to build an environment allowing one to specify such models, specify observations and causal interventions, and calculate relevant probabilities. This in principle allows one to establish causal dependence between factors in a model, which is essential in many fields of science.

In addition, the tool should be able to suggest appropriate experiments/interventions to determine whether causal dependence is present between specified variables given constraints on which variables may be controlled for (for ethical/financial/temporal/other reasons), or report that no suitable experiments can be performed in the given model under the given constraints.

Other sources:

References:

• J. Pearl (2000). Causality: models, reasoning, and inference. Cambridge University Press New York, NY, USA. ISBN:0-521-77362-8.
• A. Hyttinen, F. Eberhardt and M. J{"a}rvisalo. Do-calculus when the True Graph is Unknown. UAI 2015.
• J. Pearl and D. Mackenzie (2018). The Book of Why. Basic Books, New York, NY, USA.

### SK8: AI for Skat

Skat is a three-player trick-taking card game. This project involves implementing a computer player for Skat. The implementation should allow the developed computer player to play on Michael Buro’s International Skat Server. Key approaches to consider for the core of your artificial intelligence is perfect information Monte Carlo, and information set Monte Carlo tree search, while some aspects might be tuned on game records using machine learning. Optionally, the techniques and the Skat domain should be incorporated into the Ingenious Framework.

Potentially relevant sources:

• P. Cowling, E. Powley, and D. Whitehouse (2012). Information Set Monte Carlo Tree Search. IEEE Transactions on Computational Intelligence and AI in Games, 4(2), pp. 120-143. DOI: 10.1109/TCIAIG.2012.2200894.
• T. Furtak and M. Buro (2013). Recursive Monte Carlo search for imperfect information games. 2013 IEEE Conference on Computational Inteligence in Games (CIG). DOI: 10.1109/CIG.2013.6633646
• M. Buro, J. Long, T. Furtak and N. Sturtevant (2009). Improving State Evaluation, Inference, and Search in Trick-Based Card Games. Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09)
• S. Edelkamp (2020). Representing and Reducing Uncertainty for Enumerating the Belief Space to Improve Endgame Play in Skat. 24th European Conference on Artificial Intelligence - ECAI 2020.
• T. Furtak (2013). Symmetries and Search in Trick-Taking Card Games. PhD thesis, University of Alberta.
• J. Long (2011). Search, Inference and Opponent Modelling in an Expert-Caliber Skat Player. PhD thesis, University of Alberta.
• C. Solinas, D. Rebstock and M. Buro (2019). Improving Search with Supervised Learning in Trick-Based Card Games.
• D. Rebstock, C. Solinas and M. Buro (2019). Learning Policies from Human Data for Skat - see also the PapersWithCode entry.

### TG1: Predicting fruit ripening

Fruit growers spend millions each year to manage and control pests and still major losses are incurred every year, placing pressure on the economy and peoples livelihoods. Population dynamics of major fruit pests like Mediterranean fruit flies, Codling moth, and False codling moth are closely linked to fruit tree phenology, which in turn, is dependent on climatic factors. Knowing when and where fruit ripen on a regional scale is important information to improve and support area-wide pest management (Sterile Insect Technique – SIT). Remote sensing and climate data could be used to predict fruit tree phenology, thus supplying the fruit industry with valuable pest and crop management information. This project will aim to build on work conducted by a previous student in creating a predictive fruit phenological model using farm-collected phenological data, Sentinel 2 satellite data and climatic data.

### TG2: Identifying and tracking vessels in radar images

This project requires the development of an algorithm that ingests a series of radar images acquired in a harbour environment and tracks vessels within the image. A requirement exists to preporocess the images to mitigate anomalies over land/ocean, identify potential vessels (using image processing) within the harbour boundary and track the location and speed of these vessels. Real examples of vessels traversing the area will be provided to evaluate the accuracy of the algorithm.

### TG3: Urban change detection

This project requires the development of a system that is able to do urban change detection (i.e the development of new or destruction of existing buildings) making use of a series of medium resolution (10m resolution) satellite images. Given a series of images over a selected area, a change heatmap output highlighting changes within urban areas is expected. Processing steps will include, image co-registration, urban landcover classification, change detection using an image time series approach and filtering change areas that are not applicable.

### TG4: Sales data analysis

This project requires the development of a system able to analyze sales data for a series of products in a series of stores for multiple months. The analysis will include, for example, sales anomaly detection, product sales correlations, geospatial analysis of product sales, sales predictions as well as the development of optimization strategies based on the data.

### TG5: Morphological classification of radio galaxies using CNNs

The morphological classification of radio sources (celestial sources emitting radio waves) is important to gain a full understanding of galaxy evolution processes and their relation with local environmental properties. Furthermore, the complex nature of the problem, its appeal for citizen scientists and the large data rates generated by existing and upcoming radio telescopes combine to make the morphological classification of radio sources an ideal test case for the application of machine learning techniques. One approach that has shown great promise recently is Convolutional Neural Networks (CNNs). A CNN is a type of neural network that learns to extract its own features. It then uses these extracted features to perform classification. Up until recently, only simple CNN architectures like AlexNet and VGG (or variants thereof) have been considered for this task. In this project a student will investigate how well more complicated architectures like ResNet and GoogLeNet perform in comparison to the simpler architectures. In addition to comparing the recognition performance of the various architectures with one another, the computational requirements of the various architectures will also have to be compared with one another. This project will build on the work of an MSc student.

Reference:

### TG6: Detecting RFI in interferometric data using machine learning

Interferometers record images of the radio sky. Unfortunately the images so recored are degraded by Radio Frequency Interference (RFI). Luckily there are techniques that we can apply to detect and remove the RFI and in doing so mitigate the effect it has on image quality. In this project we will investigate how different machine learning strategies fare against excepted statistical approaches (such as those implemented in AOFLAGGER). The machine learning approaches we will consider are: conventional supervised and unsupervised classification algorithms, Convolutional Neural Network (CNN) architectures and one-class classifiers. We will apply these techniques on simulated and real-world datasets.

### TG7: Segmentation of satellite imagery into urban and non-urban areas

Being able to track the growth of urban areas are paramount if governments are to be able to perform proper resource planning. In this project a student will compare the efficacy and the amount of computational resources that are required by current Convolutional Neural Network (CNN) architectures when they are employed to segment satellite imagery into urban and non-urban areas. The different architectures we will compare with one another will be trained using the dataset given below.

Reference:

### TG8: Autoencoders and interferometry

An interferometer consists of multiple antennas that together form a single instrument that observes the radio sky. An Autoencoder is a type of neural network that can learn low dimensional codings of data. It can also be used for denoising. In this project we will investigate whether Autoencoders can be used to compress interferometric data. If time permits we will also investigate whether it can be used for deconvolution (filling in the missing information that is not observable by a finite element instrument such as an interferometer).

### TG9: Converting MODIS time-series into images for classification

Satellite imagery can be broadly classified into high resolution and low resolution data. Low resolution data usually has a high temporal cadence, while high resolution data does not. Each low resolution pixel will therefore have a time-series associated with it. In this project we will use data from the Moderate Resolution Imaging Spectroradiometer (MODIS). MODIS has a resolution of 500 m and a revisit time of 1 to 2 days. In this project we will investigate whether it is possible (or useful) to convert the aforementioned time-series into an image and use the newly created image to discern between vegetation and settlement MODIS pixels. To encode a time-series into an image we will make use of a Gram matrix. We will first investigate simple supervised classifiers as well as deep learning.

References:

In this project we will investigate if we can use the idea of tiered training to improve the accuracy of a Neural Network. We will alternate between training particular hidden layers of a network and the output layer to respectively recognize characteristics of objects and class labels. We will determine if this kind of training strategy has any benefit over a more traditional approach.

### TG11: First order optimization methods for calibration

Calibration is the process by which we remove atmospheric and instrumental errors from interferometric data (radio telescope data). This is accomplished by solving the radio inteferometric measurment equation via second order methods like Levenberg-Marquardt. Some work has been done to lower the order complexity of calibration by developing new algorithms like StefCal (this algorithm exploits the inherent symmetry of the calibration problem). In this project we will investigate whether simple firts order methods like ADAM and Momentum can also achieve similar or better results than that which can be obtained by employing StefCal.

### BvdM1: Memoization to defeat regex denial of service

In this project you should extend the term 2 project of rw345 of 2020. See [1] for details.

Suggested extensions

(A) Employing the ambiguity methods, described in Chapter 2 of the thesis by Nicolaas Weideman (see [3]), where he uses these methods to discover bad matching time, to design a more selective, selective memoization scheme than Davis in [2]. Weideman also refers to to his analysis as simple analysis. His techniques come from the papers by Seidl and Mohri ([6] and [7]). When using these algorithms from Mohri, epsilon loops need to be removed (see [3] for details). Mohri generalises the algorithms of Seidl by allowing epsilon transitions. Some form of this code is in the github repo of Weideman (see [8]). Davis claims in his PhD thesis (see [1], [4] and [5]) that these kinds of ideas will be too expensive from a computational point of view, i.e. determining where to do the memoization will be too expensive. Is this indeed the case, especially given that it is a once off computation for a given regex? The complexity bound is given in Th2 of the ambiguity paper by Mohri (see [6]). Also see the discussion just before section VIII on p7 of [4].

(B) Figure out what was done to improve the Java matcher in terms of memoization. Much more complicated regexes are required to force ReDoS compared to a few years back. The million dollar contribution will be to show that additional memoization strategies can be added to the Java matcher.

(C) Investigate how to, just as in .NET 5, let a matcher analyze regular expressions as part of the node tree optimization phase, adding atomic groups where it sees that they won’t make a semantic difference, but could help to avoid backtracking (see [9]).

If you are interested in exploring ReDoS, but prefer not to work on memoization, you can also do a more software engineering type of project where you do a proper rewrite of the code in [8] using good software engineering practices. When doing this rewrite, you should consider possible performance improvements, testing and also extend the regex features supported. You should compare your implementation with other similar ones for identifying ReDoS based on performance and regex features supported. You should also investigate new improvements in the Java regex matcher that reduce the number of regexes suffering from ReDoS when updating/improving the code from [8].

If one regex project to select from is not enough, and you are also interested in functional programming, let me know, and I will also post an extension of the following 2020 honours functional programming course project as a 2021 honours project, especially given that really high quality solutions were submitted by some students in the functional programming course in 2020.

References

### BvdM2: Program Similarity with Code Property Graphs

Co-supervised by David Baker Effendi with involvement from Fabian Yamaguchi, chief scientist at ShiftLeft.

Program similarity (or code similarity) has applications in plagiarism detection, clone identification, code refactoring, duplicate code detection, etc. Programs require being transformed into intermediate representations (IRs) which capture their structure, semantics, and flow in order to accurately estimate the similarity between two programs.

Programs can be similar in various ways but we are interested in the behaviour and semantics [1]. Existing methods such as hashing/fingerprinting documents [2] (as is the case with MOSS used by SU for code plagiarism detection), examining the parse tree and call graphs [3], analyse control-flow graphs (CFG) with graph similarity measures e.g. graph edit distance (GED) or graph neural networks [4].

The downside to methods such as hashing is that this is often limited to syntactic similarity whereas performing graph edit distances are NP-hard problems. The advantage of neural network based approaches are that they perform the task much more efficiently and have found to estimate with low error margins but they are limited to the quality of the input graph embedding and training data (garbage in, garbage out).

Your task is to explore how using code property graphs (an IR combining a program’s AST, CFG, and PDG) could enhance existing and new approaches to code similarity. Tools such as Joern, ANTLR, OPAL or Soot will most likely be used. Expected languages to develop in will be Kotlin or Scala.

Resources

[7] (Video) funcGNN: a Graph Neural Network approach to Program Similarity - This brief video explains how [4] works

### BvdM3: Static Analysis as a Service

Co-supervised by David Baker Effendi with involvement from Fabian Yamaguchi, chief scientist at ShiftLeft.

Static analysis is the analysis of a program or section of code without executing it. This is done by examining the program structure and semantics which is often possible by first converting the code into an intermediate representation (IR).

This project aims at providing static analysis as a service via a self-made server accessed via either a web application interface or IDE plugin e.g. VS Code or Jetbrains IDEs. The target for analysis will either be Kotlin Android applications or Solidity smart contracts - this is up to the student.

A language front-end for Joern will be written to convert the target language to an IR called a code property graph to allow for Joern to perform analysis on the target program. The server will accept the program (or procedures from it) to analyse and results will then be communicated to the client which will either be a web application or IDE.

Tools such as Joern, ANTLR, and Soot will most likely be used along with tools associated with developing the client. Expected languages to develop in will be Kotlin, Scala, or JavaScript. The server may end up being deployed on AWS or a Stellenbosch server such as Hydra.

Resources

[6] (Video) AWS CodeGuru - This is an example of static analysis as a service but this is only limited to Java programs

### BvdM4: Blockchain degree validation

Design, implement and compare an (enterprise and/or public) Ethereum and Hyperledger Fabric blockchain based solution for academic degree validation. Consider making use of off-chain storage options if using a public Ethereum blockchain. For these, IPFS and Swarm are two popular options. If using a public Ethereum blockchain, a gas usage estimation should be done - see for example the subsection on Gas Considerations, Chapter 7, in [6]. This is a well studied blockchain application area, thus before designing your own solution, some of the multitude of current blockchain solutions should be considered.

I am also open to the possibility of developing a blockchain solution for another application area, (see for example the following suggestion) and will also allow other blockchain platforms.

This project forms part of an Erasmus+ EU collaboration between Stellenbosch University and universities in Russia, Sweden and Finland.

References

### BvdM5: Exploration of Bibliometric Datasets

This project is co-supervised by Marcel Dunaiski.

This project should develop an application to structure and analyze bibliometric data in real time. It should also investigate the relevancy, trends and commonalities shown by different similarity algorithms used to structure bibliometric data. You might also want to measure and compare the performance of similarity algorithms and general queries with regards to the response times, for the graph databases used in this project.

For this project the work done in [1], using natural language processing techniques and graph databases, partly explained in [2], should be extended. For example, [1] did not incorporate citations into similarity computations, which is extremely important in bibliometric applications. Citations can be used sparingly, such as tiebreakers when two or more papers are equally similar. Or more fundamentally, as a basis for finding relevant similar papers, augmented by community detection and clustering based on full-texts. As [1] points out, TF-IDF is not ideal and word2vec should rather be used. The similarity measures should also be adapted to be more appropriate for bibliometrics. Connected Papers can be regarded as a competing product.

Since this project deals with building a search tool that helps researchers to find (academic) papers, or papers similar to a given paper, it is worth exploring community detection algorithms, which is already impliemented in graph database libraries such the one for Tigergraph.

References

### BvdM6: Extraction of Microservices from Monoliths

This project is co-supervised by Andrew Collett.

Reimplement and extend the work done in [1]. More details on the problem under consideration can be found in [2].

Below a summary of the project.

Driven by developments such as mobile computing, cloud computing infrastructure, DevOps and elastic computing, the microservice architectural style has emerged as a new alternative to the monolithic style for designing large software systems. Monolithic legacy applications in industry undergo a migration to microservice-oriented architectures. A key challenge in this context is the extraction of microservices from existing monolithic code bases. You should present a formal microservice extraction model to allow algorithmic recommendation of microservice candidates in a refactoring and migration scenario. A graph-based clustering algorithm should be designed to extract microservice candidates from the model. The formal model should be implemented in a web-based prototype. The recommendation quality should be evaluated quantitatively by custom microservice-specific metrics.

References

### BvdM7: Phase Equilibrium Modelling

This project is co-supervised by Matt Mayne from Geology.

Phase equilibrium modelling has revolutionised the way earth scientists understand earth materials. In this computational method the thermodynamic stability of mineral phases is calculated from databases of experimental and natural data. Stellenbosch University has been a forerunner in this in creating an open source software tool Rcrust [1], which allows chemical changes to also be considered in these thermodynamic calculations. With the calculation potential made possible through cluster processing we aim to produce a server based application that maximises the potential of this software as processing times are reduced and much larger simulations could be run. This would allow Rcrust to expand its applications within the earth sciences as well as to be used within planetary sciences, including recent research performed on the evolution of the crust of Mars.

In summary, the aim of this project is to add parallel processing to the current version of Rcrust. Instead of or in addition to the focus on parallel processing, you can also (i) rewrite the current implementation using better software engineering practices such as adding extensive test cases and (ii) making it easier for other developers to contribute to the current codebase, and (iii) possibly also rewrite the current implementation to rather use Python instead of R.

A new dedicated high processing computer laboratory has been setup to provide a server for distributing Rcrust and providing remote host access for Rcrust users to run long calculations making use of parallel processing. You will have access to this laboratory and would assist in the design of protocols and workflows to optimise the distribution of workload to the cluster.

References

### BvdM8: Exploring frontiers of learning regexes or automata

In this project you can choose between the exploring learning of regular expressions or learning of symbolic automata.

Automata learning is an active research field with many publications, research projects and application areas. In contrast, not much work has be done in terms of learning regular expressions. In [1], a tool called RFixer is presented, with the following aims: RFixer takes as input a regular expression and a set of positive and negative examples - i.e., strings the regular expression should respectively accept and reject. When the regular expression is incorrect on the examples, RFixer automatically synthesizes the syntactically smallest repair of the original regular expression that is correct on the given examples. Implement, extend and benchmark your own version of RFixer, the bRinkFixer.

Finite automata are used in applications in software engineering, including software verification, text processing, and computational linguistics. Despite their many useful applications, automata models suffer from the major drawback that in most common forms they can only handle finite and small alphabets. To overcome this limitation, symbolic automata are used which allow transitions to carry predicates. Watch the first 20 minutes of [4] to get a basic understanding and intuition for symbolic automata. Also read parts of [5] to get an understanding for how symbolic automata are used in software verification. The purpose of this project is to understand the symbolic automata model and to experiment with symbolic automata learning algorithms. The efficiency of these algorithms should be investigated, at least from an experimental point of view. Start by implementing the algorithm presented in [3] and [6] and then the algorithm presented in [2]. Compare the efficiency of the algorithms from an experimental point of view.

References

### BvdM9: Using Knowledge Graphs to Improve Conversational Agents

When we have conversations with computers, we want it to mimic human interactions as much as possible. We need the computers to speak human and communicate the way we do. Conversational AI is the technology that makes that possible. It allows artificial intelligence (AI) technologies like chatbots to interact with people in a humanlike way. By bridging the gap between human and computer language, it makes communication between the two easy and natural. But conversational AI isn’t just one thing. It’s a set of technologies that allow computers to recognize human language and decipher different languages, comprehend what is being said, determine the right response, and respond in a way that mimics human conversation.

Build a coversational agent, which can be regarded contextual assistant (see [1] for the defintion of a contextual assistant), in an application area of your choice (one idea is that the conversational agent should be able to answer prospective computer science students in terms of subject choices and career possibilities), making use of Rasa or Dialogflow. You should incorporate a knowledge graph (or knowledge base) constructed by using text mining. See [6] for why you want to make use of a knowledge graph (or base) when building a conversational agent. It might be convenient to make use of Grakn, GraphAware or Lucidworks for the construction of the knowledge graph (or base). You also have to come up with a strategy to evaluate the quality of the developed conversational agent - lots of ideas and tools to do this is already available online.

This project form part of larger project which is being developed with the help of a developer at Praekelt.

References

[3] Dialogflow

[4] GraphAware

### BvdM10: Chessvizz 2

In this project you have to explore how the take the content of a PGN (portable game notation) chess database and store these chess games in a graph database of your choice. The next step is to use the graph database to develop an application that provides novel ways of exploring the chess data in a graphical way, buidling upon the work started in [1].

Taken from the introduction in [1]. Many of the existing tools used for the analysis of past games are limited by bland interfaces and unintuitive layouts, typically presenting data in text format. ChessX is a popular chess database providing analysis and exploration of PGN databases. Its opening tree implementation counts occurrences of moves from a board position and presents this data in a basic table, showing move popularity. However, due to the database format, this data needs to be processed in real-time from the database, leading to large delays between moves. In contrast to this, Chessvizz contains a powerful visual interface for exploring moves from a high quality pre-processed database, allowing for very quick navigation through move sequences. Chesstree.net is another existing solution that offers visual exploration through a database using an opening tree. However, the data for many move sequences is limited and one will often explore a sequence of just a few moves before no data is available for that position. There is also no intuitive way to compare move win rates, as each move must be selected to see the statistics for that move. Chessvizz is a chess database explorer that makes use of intuitive visualisations and a database of over 1 million games to offer a unique way to explore past games, analyze board states and determine optimal moves from these states. This allows users to rapidly improve their opening play and develop effective responses to the play of their opponents.

If you are interested in a chess related project, but not the one I proposed above, I’m also happy to supervise a continuation of the work started in [2]. I also do not mind discusiing other chess related projects as options if you decide to select this project.

References

[1] Chessvizz 1

### WHK1: Types Inference for Python Functions

TLDR: Implement type inference for Python functions using static analysis.

Focus: Research and Software Engineering
Publication: Possible
Scope for continued study: Yes

Python is both strongly and dynamically typed. PEP484 added type hints to the language, which, although not enforced, can be used by third-party tools for static analysis; for background, see PEP482 and PEP483.

In this project, we will attempt to do type inference for the parameter and return values of Python functions. To do so, we employ techniques of path-sensitive static analysis to follow possible type information through a function’s control flow. Python’s extensive introspection and reflection facilities will be useful, and we might borrow ideas from the Hindley–Milner type system—an extension of which is used in the Haskell language. Once we have collected type information, we have to decide how to use it, for example, by adding type hints to the source code, or by using symbolic execution to find problematic code blocks.

### WHK2: Automata Layout Engine for LaTeX

TLDR: Extend the existing graph layout capabilities of TikZ to handle the state–transition diagrams for finite automata, with automatic algorithm selection and scaling in bounding boxes.

Focus: Software engineering and Research
Publication: Possible to some extent
Scope for continued study: Possible, but unlikely

As part of my research, I draw a lot of finite state machines, in particular variants of DFAs and NFAs used for regular expression processing. Setting them by hand in LaTeX gets very tedious, especially if I have to update my work. Some tools, like Graphviz and JFLAP can do basic layouts, but they do not even come close to the variety of layouts supported by the yEd Graph Editor. Unfortunately, the latter is a commercial product and GUI-driven.

For this project, you must create a tool for automatically laying out automata graphs, ideally reusing existing functionality for graph drawing in TikZ. This will involve creating a small input language à la Graphviz DOT for specifying automata graphs, evaluating and implementing existing graph layout algorithms that might be suitable for automata, and putting these ideas together in a tool that produces output suitable for direct use in LaTeX with TikZ. An important requirement is that the implementation must be able to handle bounding boxes automatically, changing layout parameters and algorithms on-the-fly, which is to say, your work will have to interface with the (Lua)LaTeX and call on the layout engine directly.

### WHK3: Regular Expression Matching using the GraalVM

TLDR: Implement regular expression matching using the GraalVM, optimising for performance.

Focus: Research and Software Engineering
Publication: Possible
Scope for continued study: Yes

The classical approach to regular expressions, followed by standard implementations for languages such as Java, PERL, and Python, is write a matcher based on building and processing Nondeterministic Finite Automata (NFAs). A different approach is to compile a regular expression to virtual machine (VM) instructions, as explained by Cox; in the literature, VM approaches have mostly not been explored in detail, leaving a lot of virgin territory.

In this project, you will investigate implementing regular expression matching using the GraalVM, with a particular emphasis on performance. There is a lot of scope for intricate, low-level programming, and also a bunch of theoretical issues to explore and address. If you are interested in this project, this talk is a nice introduction to the VM approach, as used by Grammerly.

### WHK4: Scala decompiler

TLDR: Write a decompiler for the Scala language.

Focus: Research and Software Engineering
Publication: Possible, but unlikely given the time constraints
Scope for continued study: No

Currently, there is no Scala decompiler, that is, a program that converts a JVM class file generated by the Scala compiler back to Scala source code, like JD and Fernflower do for Java. Some blog posts discuss using javap to inspect Scala class files, but at best, this yields JVM bytecode assembly.

In this project, you will investigate Scala decompilation. The nature of Scala makes decompilation inherently complex, and therefore, the project is rather open-ended: Depending on your preferences and background knowledge, it can go more towards hacking around with BCEL, more towards a theoretical angle (considering the new stuff in Scala 3), or somewhere down the middle.

### WHK5: In-browser ELM REPL

TLDR: Write an in-browser REPL for the Elm language.

Focus: Software Engineering and Research
Publication: Unlikely
Scope for continued study: Unlikely

This.

### WHK6: Beamer Slide Designer

TLDR: Create a GUI for designing Beamer slides.

Focus: Software Engineering
Publication: No
Scope for continued study: No

Note: This project has been offered before, but has not yet delivered exactly what I have in mind.

Beamer has become the de facto standard for LaTeX presentations. Unfortunately, there are few aesthetically pleasing slide designs, and creating new themes are quite intensive when done by hand. For this project, you will create a cross-platform GUI—for example, using Electron—for designing Beamer slides. The application must generate overall, inner, outer, and font themes, with gallery functionality and a catalogue of widgets like progress bars, page counter roundels, etc. The expectation is that the user will download a theme, and still create the presentation itself with a normal text editor.

### WHK7: Python testing and marking script servive

TLDR: Write a Python testing and marking script service for use in modules like Scientific Computing.

Focus: Software Engineering
Publication: No
Scope for continued study: No

Modules like Scientific Computing use a script-based approach to marking student Python-programming submissions, also allowing students some test cases while they work on time-constrained assessments. There is also a master marking script that can take the marking script for a particular question, and then extract and mark all submissions by an entire class group. Currently, each script (for marking a particular question) is coded by hand, and this is no longer a sustainable effort.

In this project, you will create One Marking Script System To Rule Them All, based on the following requirements:

• Everything necessary for students to test their submission MUST be contained in one Python script, which is to say, the test case functionality MUST NOT be distributed over several files from the student’s point of view.
• There MUST be one command-line interface to create such scripts for particular questions, which MUST include a template-based approach to storing question text, answers, and example code solutions.
• The implementation MUST handle results over different data type domains (some of which may not be exactly comparable on equality) and exceptions expected to be raised.
• The implementation MUST allow limitations on statement types and modules allowed for import, as well as timeouts.
• The implementation SHOULD allow reporting on insights gained from processing the abstract syntax tree (AST) of a submission.
• The master script MUST be able to assign marks according to a specified rubric.
• The master script MUST either interface with existing plagiarism detection software, or preferably, employ techniques of plagiarism detection in its implementation.
• The implementation MAY have a web-based interface, allowing markers to generate test and marking scripts, and also, to allow direct upload of student submissions.

### CPIWHK1: CodeTester

This project is co-supervised by Cornelia Inggs and Willem Bester.

CodeTester is a tool for automatic testing and generation of feedback on code submissions. The ultimate goal is to help the stronger students quickly get to more challenging problems and the weaker students get more practice when needed so that they can learn the programming skills they need to successfully complete the programming task.

Requirements

• Each submission to a handin-topic should be immediately checked and feedback generated
• Facilitators should be able to create/edit a handin-topic and its accompanying script
• Statistics should be kept for each student and handin-topic
• Handin-topic behaviour should be controlled by rules, e.g.,
• number of submissions allowed and
• topic(s) to unlock for student when this topic is successfully completed, …
For example, when the rule: successful completion is set to: pass all tests, a student can keep retrying until the submission passes all the tests for that handin-topic.
• CodeTester should have both a web interface and a command-line interface
• Various technologies with support that could enhance the capabilities of CodeTester exist. Github also has good support for automatic running of tests.
• The tool should be so easy to use that it can be used for all formal tutorial and project handins for any module from 1st-year to third-year as well as informal extra tutorials that students can complete to master a particular programming skill.

### CPI2: Checking linearizability of non-blocking shared data structures

A shared data structure is linearizabile if it appears as a sequential data structure; every operation on the data structure appears to take effect instantaneously at the so called linearization point, between the invocation and response of the operation. The behaviour exposed by the sequence of operations serialized at their linearization points must conform to the sequential specification of the shared object.

We have developed a tool that uses JPF to check linearizability of concurrent data structures. Current functionality include linearizability checking for both the concrete and symbolic mode of JPF.

Objectives

• Implement functionality that can automatically identify shared data structures in a small Java program to be analysed by JPF
• Implement functionality that can automatically extract an oracle for a concurrent data structure, a sequential version of the target data structure
• Extend the linearizability checker for the symbolic mode, so that it will work for more parameter and return types than what is currently supported

### CPI3: Parallel enhancements to shared-memory MCTS algorithms for Go

Several parallel search algorithms for AI-game agents (players) for domains such as Othello, GO, Lines of Action, and Bomberman have been developed at Stellenbosch and is available as part of the Ingenious Framework. In recent work we have incorporated a number of enhancements, such as RAVE (Rapid action value estimation) and FPU (First play urgency), into our tree-parallel algorithm, which is one of our three parallel Monte Carlo Tree Search (MCTS) algorithms.

Problem statement We have found that our framework is ideal for comparing new and existing techniques on the same domain and the same homogeneous hardware setup, something that has not been done for the MCTS enhancements. We have also found that our Go engine, which executes the rules (such as capturing stones) when a move is made, has some inefficiencies which influences the effectiveness of some of the enhancements. So here is what you’ll do.

Objectives

• Incorporate the MCTS enchancements into our other two algorithms: the root-parallel and leaf-parallel algorithms algorithms
• Improve the Go engine by implementing a few known techniques that will improve its efficiency
• Analyse and compare the effectiveness of the different enhancements to each of three parallel implementations, on different domains

### CPI4: DF-UCT for scalable distributed MCTS

Monte Carlo tree search (MCTS) is useful in domains such as Go, where the traditional Minimax algorithm fail due to large branching factors, deep trees, and the absence of reliable evaluation functions for nonterminal board positions. MCTS approximates the value of a nonterminal board position by running random simulations and using the values at the terminal states of those simulations. Parallelism plays an important role in the development of stronger programs, because more simulations can be performed within the same time-limit.

Problem statement We have implemented the most prominent techniques that exist for parallel MCTS: leaf-, root-, and tree-parallelisation for shared-memory environments in the Ingenious Framework, a system which have been developed at Stellenbosch; but it does not currently have any parallel MCTS implementation for distributed-memory environments.

Objectives

• Implement distributed MCTS using the Transposition Table Driven Work Scheduling (TDS) appraoch, which distributes the game tree across multiple compute nodes, and using the depth-first UCT algorithm of Yoshizoe et al..
• Analyse the scalability of your algorithm for the game LOA (available in the Ingenious Framework). You’ll have login access to compute resources in the department with enough resources for the scalability experiments.

### CPI5: UCT-treesplit for scalable distributed MCTS

Monte Carlo tree search (MCTS) is useful in domains such as Go, where the traditional Minimax algorithm fail due to large branching factors, deep trees, and the absence of reliable evaluation functions for nonterminal board positions. MCTS approximates the value of a nonterminal board position by running random simulations and using the values at the terminal states of those simulations. Parallelism plays an important role in the development of stronger programs, because more simulations can be performed within the same time-limit. The results of more simulations are useful, because the statistical significance of the information gathered by the Monte Carlo simulations increases with an increase in the number of simulations performed.

Problem statement We have implemented the most prominent techniques that exist for parallel MCTS: leaf-, root-, and tree-parallelisation for shared-memory environments in the Ingenious Framework, a system which have been developed at Stellenbosch; but it does not currently have any parallel MCTS implementation for distributed-memory environments. Leaf and tree parallelisation lend themselves to shared-memory environments, but when implemented using TDS, a technique that distributes a game tree across multiple compute nodes, tree parallelism can be used effectively in distributed environments.

Objectives

• Implement distributed MCTS using the Transposition Table Driven Work Scheduling (TDS) appraoch, which distributes the game tree across multiple compute nodes and implementing the UCT-treesplit algorithm of Schaefers and Platzner.
• Analyse the scalability of your algorithm for the game LOA (available in the Ingenious Framework). You’ll have login access to compute resources in the department with enough resources for the scalability experiments.

### CPI6: Checking program correctness for relaxed memory models

Modern architectures typically support weaker memory models than the sequential consistency model, which means that memory operations may be reordered. To ensure correctness of a program, a programmer has to add barriers/locks.

Model checking tools have been developed to automatically check correctness of concurrent programs with respect to certain properties. JPF is such a tool that checks properties of a Java program by executing Java bytecode on its own VM. During model checking JPF’s VM controls program execution to explore all possible bytecode interleavings, but only supports sequential consistency memory models.

Objective Some work has been done to develop new or modify existing checkers to support more relaxed memory models and your task is to do the same for JPF; i.e., influence its exploration to support more relaxed memory models. You’ll do this by implementing a listener. The JPF listener interface is provided as an extension mechanism that can be used to observe and influence the order in which JPF executes instructions. Your listener will be called for every instruction and the listener will reorder instructions that are allowed to be reorded.

### BF1: Java API fuzzing

A number of tools can generate test suites from Java class libraries (e.g., JCrasher or Randoop). However, they typically construct randomized tests and cannot guarantee that all specific sequences of method calls are tested, or that all constructors are used in all argument positions. The goal of this project is to convert the method declarations in a class into rules of a context-free grammar and to use an existing grammar-based fuzzing tool (i.e., gtestr) to systematically fuzz the API. Further extensions can take advantage of annotations (e.g., checkerframework.org/) or the results of static analysis tools to reduce the number of test cases that are constructed.

This is a research-oriented project that can lead to a publication and can lay the foundations for an MSc project in the general areas of compiler engineering and software testing.

### BF2: Incremental Computation of Grammar Predicates

Many applications in compiler and software language engineering (e.g., parser construction, grammar-based test case generation, or grammar learning) require a variety of predicates and functions on context-free grammars (e.g., nullable, first, or follows). Their computations are typically based on fix-point algorithms iterating over the rules of the grammar.

The goal of this project is to develop, implement, and benchmark efficient incremental versions of these fix-point algorithms that “reuse” as far as possible the information stored in tables by the base algorithms whenever the grammar is modified (e.g., in a genetic grammar learning algorithm).

The project does not require any prior knowledge of compiler implementation or formal grammars beyond the level of RW244. Some experience with efficient programming in C would be helpful.

This is a research-oriented project that can lead to a publication and can lay the foundations for an MSc project in the general area of compiler engineering.

### BF3: Rust Fuzzer

Rust (www.rust-lang.org/) is a comprehensive programming language with many advanced features, in particular in its type system. The goal of this project is to develop a tool that automatically generates suites of Rust test programs that are guaranteed to either conform to a given range of well-formedness properties (e.g., syntactically correct or type correct) or to violate them in a specific way and at a single location (e.g., syntax error, out-of-scope reference, or type-incompatible assignment). The tool should systematically cover different syntactic and semantic (e.g., all possible combinations of declarations shadowing each other) patterns. The main deliverable of the project is a series of explicit declarative models, in form of a context-free grammars (CFG) that are augmented with “semantic mark-up tokens” [1] and used together with the gtestr system to generate the test cases.

This is a research-oriented project that can lead to a publication and can be extended into an MSc project.

[1] Phillip van Heerden, Moeketsi Raselimo, Konstantinos Sagonas, Bernd Fischer: Grammar-based testing for little languages: an experience report with student compilers. SLE 2020: 253-269

### BF4: Extension and Gamification of gtutr

gtutr [1] is an interactive feedback system that helps students developing ANTLR grammars. The goal of this project is to “gamify” the system; the main idea is to let lecturers define different stages, and to show the students’ progress after each stage on a leader board. The project includes the design of different kinds of stages, and the extension of the system to enable their implementation.

The project does not require any prior knowledge of compiler implementation or formal grammars beyond the level of RW244. The project should build on top of the existing gtutr tech stack, which uses Java Spring and, JSP for the web pages, with the databases hosted on AWS.

Ideally, a preliminary version of some functionality should be trialled in the RW713 module towards end of March.

This is primarily a software engineering project.

[1] Chelsea Barraball, Moeketsi Raselimo, Bernd Fischer: An interactive feedback system for grammar development (tool paper). SLE 2020: 101-107

### BF5: Inferring context-free grammars using genetic programming

Context-free grammars (CFGs) can be used to automatically generate test cases for software systems that process structurally complex inputs, such as compilers, web browsers, email readers, and many others. Unfortunately, such CFGs are not always available. The goal of this project is to develop and evaluate a system that uses genetic programming techniques to infer appropriate CFGs from an initial set of test cases and feedback (i.e., syntax errors) from the targeted software system. The main aspects of the project are the definition and implementation of mutation and crossover operations for CFGs, and the definition and evaluation of different scoring mechanisms.

This is a research-oriented project that can lead to a publication and can be extended into an MSc project. It can build on existing prior work.

References:

Matej Crepinsek, Marjan Mernik, Barrett R. Bryant, Faizan Javed, Alan P. Sprague: Inferring Context-Free Grammars for Domain-Specific Languages. Electr. Notes Theor. Comput. Sci. 141(4): 99-116 (2005)

### BF6: Mining dictionaries for fuzz testing

Fuzz testing incrementally constructs test cases by randomly changing or adding input bytes. This simple approach is surprisingly effective in many cases, but becomes ineffective when the system under test expects more verbose and structured inputs, such as web pages, programs, or JSON or XML object descriptions. Fuzz testing tools therefore rely on keyword dictionaries to increase the granularity at which the inputs are constructed, but these dictionaries must be specified manually. The goal of this project is to develop and evaluate a system that automatically extracts keyword dictionaries for a black-box system under test from an existing input corpus; more specifically, the system should learn the sets of keywords and operators, the lexeme classes (e.g., identifiers and numbers) and the formatting and commenting rules with minimal feedback (i.e., input is syntactically correct or not) from the system under test.

This is a research-oriented project that can lead to a publication and can lay the foundations for an MSc project.

### BF7: Testsuite Generation for Probabilistic Models

Many data science applications need realistic looking but artificial sets of training data. Such training sets can be produced from models that encode structural, probabilistic, and logical assumptions of the application domain. The goal of this project is to define a domain-specific language to formulate probabilistic models and to generate programs from such models that then produce the testsuites. Ideally this could also be used to generate large SQL data bases with data that exhibits given statistical properties.

This is a research-oriented software development project which could feed into a related MSc project. Candidates should be interested in machine learning and programming languages; some knowledge about probability distributions would be helpful.

References: Bernd Fischer, Johann Schumann: AutoBayes: a system for generating data analysis programs from statistical models. J. Funct. Program. 13(3): 483-508 (2003)

### BF8: Checker for counter-examples

Software model checking has become a powerful and popular software verification technique. Part of its popularity comes from the fact that software model checking tools typically not only return a “yes/no” answer, but can in the case of “no” also provide an counter example that allows the developer (in principle) to quickly confirm and identify the error. However, counter examples can get very large, so this is tedious. In this project you will develop an automatic checker for counter examples from different C verifiers. This should ideally;- support different counter-example formats (at minimum CBMC, CSeq, ESBMC, and SV-COMP witness);- support different checking modes: program execution, debugging, symbolic execution, proof checking;- support heap-allocated memory and concurrency;- visualize single- and multi-threaded error traces (e.g., in the form of message sequence charts).

This is a research-oriented software development project which could feed into a related MSc project. Candidates should be interested in programming languages and program verification.

### BF9: Stress-testing compilers by automated innocuous changes

The goal of this project is to design, implement, and evaluate a tool that applies behaviour-preserving transformations to test programs, resulting in modified programs that should behave as the original. The mutants can be generated by tree transformations. The project should find a grammar for a real language, implement some transformations, and evaluate the approach using a real compiler.

This is a research-oriented project that can lead to a publication and can be extended into an MSc project.

### BF10: SQL Fuzzing and differential testing

Different relational databases implement slighly different variants of SQL. The goal of this project is to use grammar-based fuzzing techniques (as implemented in the gtestr tool) to uncover possible implementation differences and errors. The project entails adapting existing SQL grammars to the gtestr notation, adding semantic mark-up to ensure that the generated tests are valid SQL, and mapping corresponding constructs in the different grammars. It also requires the development of a testing infrastructure.

This is a research-oriented project that can lead to a publication and can be extended into an MSc project.

### AE1: Multi-objective Control Parameter Tuning

Just about all optimization and machine learning algorithms have control parameters. The bad thing about these problem parameters is that performance of the algorithm is very sensitive to different control parameter values, and best values are usually very problem dependent. The consequence is that, if one wants to apply an optimized instance of an algorithm, tuning of these control parameters is required. A number of different approaches to control parameter tuning can be found, including factorial design, f-race, iterated f-race, amongst others. However, these approaches tune control parameters with respect to one performance criterion, usually either accuracy or computational effort.

This study will develop a multi-objective approach to control parameter tuning, considering a number of performance criteria, including accuracy, efficiency, robustness, and stability. A very simple approach will be to evaluate performance for a range of parameter values with respect to each of the perfromance criteria. Then to rank performance for the different control parameter values per performance criterion, and then compute an average rank over all of the performance criteria.

A second approach will be developed to use non-dominance based ranking of algorithms based on the performance measures under evaluation.

An alternative will be to adapt and approach such as f-race to drill down to the most optimal regions of the control parameters using the non-dominance relation with respect to all of the performance criteria.

In addition, it is also required to visualize how performance of an algorithm changes for different values of the control parameters, over different axes, one axis per performance measure.

This is a research project, with a significant amount of empirical analysis and coding. The project has very good potential to result in a publication. The control parameter tuning approach need to interface with CIlib, an opensource library of Computational Intelligence algorithms developed in scala. However, it should be general enough to also interface with any other algorithm or algorithm framework.

Prerequisite: Statistics, and students are advised to do Artificial Intelligence 791.

### AE2: Self-Adaptive Meta-Heuristics using Cultural Algorithms

Cultural algorithms (CAs) are a form of evolutionary algorithm which maintain a belief space in parallel with a population space. The population space represents a set of candidate solutions to the optimization problem, and the belief space maintains a set of beliefs about where in the search landscape and optimum resides. Any population-based meta-heuristic can be used in the population space to adapt candidate solutions. This meta-heuristic has as its objective to find an optimal solution to the optimization problem under consideration. The belief space is a collection of beliefs’’ as formed by a selected few individuals in the population as to where in the search space these individuals believe the optimal solution can be found.

Just about all meta-heuristics have control parameters, with different control parameter configurations resulting in different levels of performance. The best control parameter configuration is then also very problem dependent. In order to find the best control parameter configuration, a computationally expensive parameter tuning can be done prior to solving the problem. For each new problem, the tuning has to be redone. Instead, self-adaptive algorithms have been developed where control parameter values are adjusted during the optimization process.

Considering particle swarm optimization (PSO) specifically, a number of such self-adaptive approaches have been developed. However, a recent study has shown these approaches to be very inefficient.

This research will develop a CA approach to search for the optimal control parameter values of the meta-heuristic (PSO in this case) used in the population space, by defining a belief space to represent th econtrol parameter space. The belief space will effectively indicate parts of control parameter space where the influential (or the best performing) individuals belief the best values for the control parameters can be found. Each individual will then sample values for its control parameters from this belief space. Different staregies to update the belief space, and to utilize the belief space will be developed to prevent premature convergence in both the belief (control parameter) space and the population space. The end result will hopefully be a more efficient self-adaptive PSO algorithm.

The approach will be extensively empirically analyzed.

This is a research project, with a significant amount of empirical analysis and some coding. The project has very good potential to result in a publication. You are strongly advised to make use of CIlib, an opensource library of Computational Intelligence algorithms developed in scala.

Prerequisite: Students are advised to do Artificial Intelligence 791.

### AE3: A Concise Domain Specific Language for the Specification of Evolutionary and Swarm Intelligence Algorithms using CIlib

Evolutionary and swarm intelligence algorithms are a subset of nature-inspired optimization algorithms (inspiration comes from birds, ants, the survival of the fittest, bacteria etc). A software library, called CIlib, focuses on the definition, execution and evaluation of these kinds of algorithms.

Within this project, the student will be exposed to functional programming and will learn core abstractions and tools to enable this style of programming. The goal of this project is to create a domain specific language (DSL) to allow for a subset of the algorithms, problems and measurement to be used by individuals that do not have the needed skills to define such algorithms themselves. It will be a challenging problem to solve, requiring Scala code to combine different parts of the CIlib library in order to produce an executable experiment definition for the DSL user. A domain-specific language (DSL) will allow users without the adequate programming skills to define and execute algorithms in order to obtain measurements for analytical work, whilst adding to an ever growing repository of experimental data.

The DSL should be easily extendible, because additional algorithms and functionality are continuously added to CIlib.

The DSL produced within this project will become part of the main interface to a larger data processing pipeline. This pipeline will produce data that will add to an ever-increasing repository of algorithm results – these results will be referenced in subsequent academic publications and will be used as verification data to help prove or disprove hypotheses on algorithm performance and behaviour.

The project will have the following deliverables: 1.The design of a DSL to allow for simplfied experiment definitions – either kind of DSL (external or internal) is acceptable. 2.Implementation of an interpreter (a program that reads the DSL and performs actions) to convert the defined DSL into the required program such that the experiment can execute. 3.A simplified user manual to describe and explain the functionality of the DSL, and to explain how to expand the DSL as new functionality is added to the library. 4.A web-based interface to allow for the processing of the DSL language – the web based interface will not execute the programs, but instead validate the DSL syntax and submit the experiments as a job for processing by an external processing engine.

Important to note: *The student will receive regular mentoring from the main CIlib developers to aid with the process *Knowledge of how CIlib works and operates will be provided to the student *This project will form part of a larger project aimed at simplifying the computaional intelligence research field *Implementations will require knowledge of the Scala programming language for the DSL implementation. The web interface will not require any upfront knowledge.

Some research will be required in order to implement the DSL. The student will need to motivate for a specific type of DSL implementation, either being an internal or external DSL. Some inspiration may be taken from existing projects, but understanding the shortcomings of these projects is even more important.

### AE4: An Automated Empirical Analysis and Analytics Tool for Stochastic Optimization Algorithms

When a new optimization algorithm is developed, the performance of that algorithm has to be compared with other algorithms against different performance criteria on an extensive benchmark set. This firstly requires implementation of a number of optimization algorithms, benchmark problems, performance criteria, and then executing tuned versions of these for a number of independent runs on the provided benchmark problems. A task that can be extremely time consuming.

It happens that many researchers will implement the same algorithms (possibly with bugs), the same problems, and will run these algorithms on the same problems. After the research has been done, and a research paper written, the captured performance data is lost. No central repository is kept, and there is usually no archiving of raw performance data and analyzed data, with links to the corresponding algorithms, algorithm configuration, problems, and problem configurations. Should such a results repository exist, much effort and time can be saved. One can simply perform a query and extract algorithm performance data, and use that in the analysis and comparison with a new algorithm.

Empirical analysis and comparisons of algorithms also require that formal statistical tests be done to determine if one algorithm perform significantly better than another, and to rank algorithms based on different performance criteria.

In addition, benchmark problems have certain characteristics, and fitness landscape characteristics, which can be used to predict which algorithm will be best to solve that problem. In order to get to such predictions, performance results need to be correlated with fitness landscape characteristics.

This project will design such a data repository for optimization algorithms. A database has to be designed to characterize benchmark problems, to populate the landscape characteristics of problems, to store performance results and to tag these rsults with the algorithm and problem configurations. In addition, a dynamic approach to ranking algorithms on specific problems or classes of problems will be developed. This will be a dynamic system that will be able to update the ranking when new algorithms are added, and that will be able to rank algorithms based on different performance criteria.

In addition, some basic queries for exploring the raw and analyzed data will be included, to extract specific data and results, and to visualize results will be developed.

The final tool will need a web-based client to interface with the repository and analysis tools. Note that the tool will link to CIlib, a scala library of optimization algorithms and benchmark functions. It will also link to a fitness landscape analysis tool to populate the landscape characteristics of the problems.

Prerequisite: Students are advised to do Artificial Intelligence 791, and need strong experience in database designe and development, and client-server architectures. You will also have to learn scala.

### AE5: Dynamic Multi-Guide Particle Swarm Optimization

We have developed a sub-swarm based PSO, referred to as the multi-guide PSO (MGPSO) to solve static multi-objective optimization problems. The algorithm uses multiple swarms, one swarm per objective, and adds to the velocity update equation an archive term to guide the search towards the Pareto front. Results have shown that this MGPSO performs very competitively to other multi-objective optimization algorithms. The goal of this research is to develop approaches to adapt the \gls{mgpso} to solve dynamic multi-objective optimization problems. As a very first attempt, archive update mechanisms will be developed, to ensure that the solutions in the archive remains non-dominated after environment changes. As a second approach, mechanisms will be implemented to maintain diversity in the sub-swarms, to ensure efficient tracking of changing optimum positions.

This is a research project, with a significant amount of empirical analysis and some coding. The project has very good potential to result in a publication. You are strongly advised to make use of CIlib, an opensource library of Computational Intelligence algorithms developed in scala.

Prerequisite: Students are advised to do Artificial Intelligence 791.

### AE6: Diversity Measure for Genetic Programming

Diversity is an indication of the extent to which a population is still exploring the search space, or converging to a single solution. Genetic programming populations contain candidate solutions represented as tree structures, which makes it difficult to define diversity measures. This research will develop and validate diversity measures for genetic programming populations, that is to develop measures to quantify the degree of similarity of tree structures.

This is a research project, with the potential of publications.

Prerequisite: Students are advised to do Artificial Intelligence 791, and to have some knowledge of graph theory and tree structures.

### AE7: Active Learning Learning as A Dynamic Optimization Problem

Active learning in neural networks (NNs) refers to the ability of a NN to identify the most informative training patterns, and to focus training only on these informative patterns. Due to the training set changing over time, active learning is in essence a dynamic optimization problem (DOP), which will result in concept drift. Existing research on the development of active learning approaches does not consider these approaches as DOPs. This research will formally define active learning as a DOP. Dynamic meta-heuristic algorithms, such as particle swarm optimization or differential evolution, will then be applied to train these neural networks to see if performance can be improved compared to gradient based training.

This is a research project, with the potential of publications.

Prerequisite: Students are advised to do Artificial Intelligence 791 and Machine Learning 741.

### AE8: Set-based Particle Swarm Optimization for Data Clustering

Data clustering concerns the grouping of patterns into clusters such that patterns in the same cluster are similar to one another, and such that clusters differ from one another. Clustering has three objectives: (1) to produce compact clusters, i.e. distances between patterns and their cluster centroid should be as small as possible; (2) clusters should be well-separated, i.e. distances between cluster centroids should be as large as possible; (3) the number of clusters should be optimal. Clustering is therefore, in essence, a multi-objective optimization problem, where the optimal set of centroids need to be found.

The set-based particle swarm optimization (SBPSO) algorithm is a variation of particle swarm optimization (PSO) where candidate solutions are represented as sets, and not as vectors as for the standard PSO. Solutions are found by finding optimal sets as combinations of elements from the defined universe.

Because the solution to the clustering optimization problem is a set of centroids, and the universe is a finite set of elements, the centroids will be actual patterns from the provided dataset. That is, the set universe is the actual dataset. The SBPSO’s task is then to select the smallest number of centroids such that the resulting clusters are compact and well-separated. This is then very similar to the k-mediods clustering algorithm.

This project will require to develop approaches to cope with multiple objectives in the SBPSO. One such approach is to make use of a weighted linear aggregation of the objectives. An alternative is to make use of a recently developed universal AND-OR fuzzy operator for multi-objective optimization problems. Lastly, a non-dominance (Pareto-optimal) based approach can be used.

It will be expected that the code be implemented in CIlib, an opensource library implemented in scala.

This is a research project, that will very likely lead to publications.

Prerequisite: Students are advised to do Artificial Intelligence 791.

### AE9: A Set-based Particle Swarm Optimization Algorithm for Rule Induction

In data analytics, the extraction of if-then rules from a dataset is a very important task in gaining an understanding of the patterns embedded in dataset. A number of rule induction (rule extraction) algorithms have been developed, e.g. the class of covering algorithms. These algorithms consider each class individually, and then extracts a set of rules from the data to predict that class. The set of rules then also serve as an explanation of the correlations and patterns between descriptive features and the target (class) features. Each rule in the rule set is in the form of IF antecedant THEN consequent, where the antecedant is a conjunction of selectors (or conditions on descriptive features).

Ultimately, rule induction is a multi-objective optimization problem, where the goals are to extract as few and as simple as possible rules that are as accurate as possible. This then will result in rules with the best possible generalization performance (i.e. rules do not overfit the training data) and that are easily interpretable.

This project will see the development of a set-based particle swarm optimization (SBPSO) algorithm to extract rules from datasets. The universe will be the set of possible selectors, and the candidate solutions (i.e. particles) will antecedants of the rules. As for covering algorithms, rules will be extracted per class. Since each particle will represent only one rule, a challenge will be to develop an approach to produce a set of such rules, i.e. a disjunction of antecedants per class.

It will be expected that the code be implemented in CIlib, an opensource library implemented in scala.

This is a research project, that will very likely lead to publications.

Prerequisite: Students are advised to do Artificial Intelligence 791.

### AE10: Fitness Landscape Analysis of Higher-Order Neural Networks

Higher-order neural networks (NNs) have higher-order transformations of input signals. One type of higher-order NNs is a product unit NN (PUNN) or pi-sigma NN. A PUNN has product units in the hidden layer, where the net input signal is a product of highly non-linear transformations of the input signal. Hidden units use linear activation functions in both the input and the output layers. The output layer makes use of summation units, where the net input signal is calculated as a weighted sum over the input signals.

Product units have the advantage that non-linear relationships are easier to approaximate, with more accurate results and less product units are necessary than summation units. The consequence is then less complex NN architectures. Product units do, however, have a significant impact on the NN loss landscape (error surface), resulting in steeper gradients, and possibly more local minima and plateaus.

The objective of this study is to analyze the impact of product units by conducting a fitness landscape and loss-gradient cloud analysis of the NN fitness landscape, in comparison of the fitness landscape produced by summation units. The study will then proceed to evaluate the performance of a number of NN training algorithms, and to map performance of these training algorithms to the observed fitness landscape characteristics.

Prerequisite: Students are advised to do Artificial Intelligence 791 and Machine Learning 741.

### AE11: Multi-Guide Particle Swarm Optimization Algorithm for Many-Objective Optimization

We have recently developed a sub-swarm based approach to particle swarm to solve multi-objective optimization problems, called the multi-guide particle swarm optimization (MGPSO) algorithm. This algorithm was shown to be very competitive against state-of-the-art multi-objective optimization algorithms. Many-objective optimization problems refer to problems where there are more than three conflicting sub-objectives that have to be simultaneously optimized. Standard multi-objective optimization algorithms fail to scale to many objectives, due to the fact that the non-dominance relation (used to determine when one solution is better than another), fails for many-objective problems. Therefore, existing multi-objective optimization algorithms have to be adapted to allow these algorithms to scale to many objectives. The scalability of the MGPSO was recently evaluated and it was shown to scale very well to many-objectives, and very competitive to algorithms specifically designed with many objectives in mind. This scalability of the MGPSO was achieved without having to adapt the algorithm. Scalability is simply achieved by adding a new sub-swarm for each sub-objective. Though the scalability was shown to be competitive, this is mainly with respect to the hypervolume (a measure of how close the found solutions are to the true Pareto front of optimal solutions). With reference to the inverted generational distance (a measure of how spread the found solutions are along the Pareto front), the MGPSO does perform better than most algorithms, but not all. This is due to the fact that the approach to guide the MGPSO towards the Pareto front does not bias towards diverse solutions. This project will implement mechanisms to improve the diversity of the non-dominated solutions found by the MGPSO, and will then compare the performance of these new strategies to the current state-of-the-art algorithms.

The approach will be extensively empirically analyzed.

This is a research project, with a significant amount of empirical analysis and some coding. The project has very good potential to result in a publication. You are strongly advised to make use of CIlib, an opensource library of Computational Intelligence algorithms developed in scala.

Prerequisite: Students are advised to do Artificial Intelligence 791.

### AEK1: Blockchain Consensus

A blockchain is a distributed data structure which stores blocks of data. The blocks form a tree rooted at the first (genesis) block, and each block is linked by a pointer to its predecessor block in the blockchain. The tree is almost linear with a few short side branches. The blockchain is maintained by a community of participants which each execute a distributed consensus algorithm. This algorithm, if executed correctly by a majority of the participants, ensures that the blockchains at each participant are identical (in consensus) for most of the time.

Consensus cannot be observed since such measurements would require an observer with instantaneous access to the state of the blockchain at each of the participants. The absence of instantaneous global state access is precisely the reason for the existence of distributed consensus.

Part 1. Public blockchains place no restriction on which participants (nodes) take part in the distributed consensus algorithm. These participants compete with each other to add the next block to the blockchain. This competition consumes resources and the winner of the competition is rewarded for the resources it consumed in winning the competition. The amount of resources consumed by the many participants in the competition is very large.

Permissioned blockchains allow a relatively small number of approved participants to take part in the distributed consensus algorithm. These participants compete with each other to elect a leader who adds a block to the blockchain. After a delay the participants elect a new leader to add the next block to the blockchain. The leader is elected by a so-called Byzantine Fault Tolerant process. Sometimes the BFT process does not identify a leader and, after a delay, the BFT election process restarts.

We will study the properties of the distributed consensus process as used in the per- missioned Hyperledger blockchain. Hyperledger currently has several variants each using a different BFT algorithm. We will develop a discrete event simulator to evaluate the performance of the Hyperledger variants. We will develop a simulation model of a custom BFT algorithm designed with improved scalability properties.

Part 2. Consensus occurs when the blockchains at all the participants are identical: each blockchain at each participant consists of a single main branch and the leaf blocks of all the blockchains are identical.

When blocks are discovered frequently and/or when the block communication delays are lengthy, the blockchain can split and multiple branches are formed. The distributed consensus algorithm will preferentially append blocks to one of these branches and delete (prune) the other branches, allowing consensus to emerge.

When many homogeneous participants compete and when the block discovery interval is of the same order as the block communication interval, the blockchain splits into many branches. The blockchain is out of consensus for long periods of time, although it will with certainty return to consensus, but only for a very short time.

Another form of consensus, so-called weak consensus (joint work with colleagues from the Department of Mathematics and Statistics, University of Melbourne, Australia) is observed in simulations of public blockchains where the blockchain operates under conditions where consensus is rare.

Under these circumstances, simulations show that the blockchain remains in consensus up to the Last Common Ancestor block. A tree of branches descends from the LCA block. The consensus algorithm will eventually prune the tree to a single branch, consensus occurs, and the LCA advances to the single leaf block which is identical among all the blockchains. This consensus is short-lived and a tree quickly emerges from the leaf block. However, the blockchain from the genesis block to the LCA is (tentatively) immutable. The blocks in this prefix of the blockchain can be tentatively confirmed and this is what we call weak consensus.

We propose to simulate the occurrence of weak consensus in public blockchains and in permissioned blockchains. We will study the dynamics of weak consensus and establish how distant a block must be from the LCA before the block can be regarded as confirmed.

Weak consensus may allow blockchains to function reliably under conditions where they were previously regarded as ineffective because no long-lived main branch emerged.

This project forms part of an Erasmus+ EU collaboration between Stellenbosch University and universities in Russia, Sweden and Finland.

Reference

### AEK2: Bitcoin with Just-In-Time Block Propagation

Bitcoin miners receive copies of all transactions as they are generated. Each miner gathers several validated transactions into a candidate block. Miners compete to append their candidate blocks to their blockchain. The process of appending a block to the blockchain requires the miners to solve a computationally difficult cryptographic problem.

If a miner computes a valid solution then the newly mined block is appended to the blockchain at the miner. The block is copied to every miner in the Bitcoin network and the block is appended to the blockchains at these miners.

BLOCKCHAIN SPLITS

Each miner maintains its own blockchain. The blockchains are locally updated by the miners in such a way that the blockchains at each miner are either identical, or if they differ then the differences will soon be resolved and the blockchains will become identical.

Mined blocks are subject to communication delays as they are broadcast through the Bitcoin network. If two or more miners compute a valid solution more or less simultaneously, then the blockchain splits. The split will be resolved when the next block is mined.

Split avoidance schemes already exist in Bitcoin to rapidly communicate the contents of a mined block among the miners. Instead of broadcasting blocks of transactions, thin blocks are broadcast. Thin blocks contain hashes of transactions rather than transactions. The idea is that most of the miners have already received most of the transactions in the currently mined block. Sending them again in a broadcast block is unnecessary.

JUST-IN-TIME PROTOCOL

There is a simple way to avoid blockchain splits, namely to communicate the transaction contents of a block before the block is mined. When a block is mined, the miner broadcasts only the block header. The header is small and will rapidly be communicated to all the miners. When the header arrives, it will most likely find that the corresponding body, which was broadcast at the end of the previous mining interval (which occurred on average 10 minutes ago), will have already arrived at the destination miner.

THE PROJECT

We will extend our blockchain simulator to model the just-in-time protocol. We will investigate the relationship between the block mining interval and the size of the block body. If the body is too large, the body may not arrive before the corresponding header arrives. We will investigate the percentage of the time that the blockchains are in consensus.

This project forms part of an Erasmus+ EU collaboration between Stellenbosch University and universities in Russia, Sweden and Finland.

Reference

### MN1: Deep learning for leafroll disease detection

Leafroll disease is a serious problem that affects many of South African vineyards. This disease can be observed in the shape and coloration of the leaves. Many farmers notice this disease in a later stage of the production. There is a need to assist humans in detecting leafroll disease in the vineyard at the early stages. The challenging part of this project is based on the limited dataset for training a convolutional neural network. This will rely on augmentation techniques to populate training datasets with high-quality images.

### MN2: Automated hyperparameter finder

Training a neural network, there are many factors to be taken into consideration for better performance. The learning rate is one of the important parameters in the training process of the network. This project aims to implement a framework that can use different learning rates and adjust itself for greater performance. This project will be able to provide standardized learning rates that are not manually tuned.

### MN3: Model Interpretability

Artificial intelligence algorithms have been seen as being black boxes, providing no way to understand their inner processes and making it difficult to explain how they reach their high accuracies. There is a debate around these models to be applied to real-world problems. Model decisions are based on the representation of the dataset on a particular problem, which sometimes leads to biased models. There is a need to explain the model’s decision and this can be done through the model interpretability concept. The student is required to research the availability of interpretability tools for Convolutional Neural Networks such as Grad-CAM and SHAP (Shapley Additive exPlanation).

### MN4: Video enhancement

Video enhancement is one of the most important and difficult tasks in video research. Video enhancement aims to improve the visual appearance of the video or to provide a “better” transform representation for future automated video processing, such as analysis, detection, segmentation, recognition, surveillance, traffic, criminal justice systems. The student is required to research video restoration tasks including super-resolution, deblurring on the publicly available dataset to improve the quality of the dataset.

### MN5: Face recognition using Siamese neural network

Facial recognition is the task of making a positive identification of a face in a photo or video image against a pre-existing database of faces. It begins with detection - distinguishing human faces from other objects in the image - and then works on the identification of those detected faces. The student is required to research the Siamese neural network and the use of different similarity functions similar to Euclidean distance to compare the similarity of the images.

### MN6: Activity recognition

Human Activity Recognition is the problem of identifying events performed by humans given video input. Activity Recognition is an important problem with many societal applications, including smart surveillance. Surveillance videos can capture a variety of realistic anomalies such as fighting, highjacking activities. The student is required to perform anomaly detection in surveillance videos. For this task, there are many publicly available datasets for training a model to recognize anomalies in videos. This can be used as evidence in crime operations.

### MN7: Real-Time Multi-Object Tracking

Multi-Object Tracking (MOT) has been a longstanding goal in computer vision which aims to estimate trajectories for objects of interest in videos. Object detection and re-identification (re-ID) is an important problem in surveillance monitoring systems. Most methods treat this problem by two separate models which the first one is the detection model firstly localizes the objects of interest by bounding boxes in each frame, then the association model extracts re-identification (re-ID) features for each bounding box and links it to one of the existing tracks according to certain metrics defined on features. The student has to investigate methods to combine these two tasks and improve the detection and tracking of multiple objects in videos.

### MN8: Describe What to Change: A Text-guided Unsupervised image-to-image Translation Approach

Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. The student has to evaluate the model for text-guided image-to-to-image translation. The quality of generated images will have to be evaluated for high-quality images that can be used for dataset creation.

### MN9: License Plate Recognition via Deep Neural Networks

Automatic License Plate Recognition is a challenging and important task, where it is used in traffic management, digital security surveillance, vehicle recognition, parking management of big cities and can be used to detect stolen cars through security surveillance. The student will have to investigate how the plate number recognition system in South Africa works and what can be done to improve it through deep learning techniques.

### MN10: Bias Remediation in datasets using Variational Autoencoders

Datasets are crucial when training a deep neural network. When datasets are unrepresentative, trained models are prone to bias because they are unable to generalise to real-world settings. This is particularly problematic for models trained in a dataset where there some other classes are represented than others and thus fail to generalise. The student will have to investigate how will the autoencoders perform instead of using Generative Adversarial Networks and also using latent space instead of Principal Component Analysis (PCA) space.

### MD1: Author name disambiguation in publication datasets

The ability to automatically map journal and conference papers to the researchers that wrote them and vice versa map researchers to their collections of published papers is a fundamental issue. It would seem to be a simple and straightforward process yet it remains a major, unsolved problem in information science.

Reliable author name disambiguation is difficult since:

• A single person may publish under different names
• Many different persons have identical names
• Sparse information (e.g., only initials and the surname are available)
• A single person may publish in different fields

In this project, you will research current state-of-the-art author name disambiguation methods and implement the most promising methods for the South African context. You will showcase your results by developing a search engine for students to help them find suitable supervisors across South African universities.

The project is research heavy.
Publishable: Possible.
Skills you will acquire:

• Working with big data, unsupervised learning and possible parallelisation
• Some web development

### MD2: Sentiment analysis of in-text citations

In academia citations are used as a method to acknowledge the use of someone else’s idea, add credibility and verifiability to your own work, and to avoid plagiarism. However, most of the time all citations in a paper are treated exactly the same and a researcher has to read the entire paper to find the important papers for further reading. Even though different classes of citations exist, distinctions are rarely made between citations that point to related work, citations that refute prior work, or citations to papers which form the pivotal basis of the current paper.

In this project, the aim is to extract the citation contexts (sentence(s) surrounding the citation) and annotate them with additional information:

• Type: related to existing work, comparison to existing work, usage of existing work, advancing existing work?
• Sentiment: Is there agreement or disagreement of the results or methods?
• Relatedness: How related are the two papers?

This information can then be used to give a more detailed look at the composition of citations for a paper or journal to, for example, recommend the best paper to read next.

Lastly, you will design and implement a proof of concept website to visually display papers and their citations networks, with the semantic information obtained from the classified citations.

The work comprises software engineering and research. This project is suited for continued studies.
Publishable: Yes.
Skills you will acquire:

• Natural language processing: text pre-processing and parsing
• Machine learning: sentiment analysis and classification of the citations
• Some web development and aspects of big data

### MD3: Mapping GitHub users to Stackoverflow contributers

Public data dumps of both Stackoverflow and GitHub are publicly available. However, in recent years uniquely identifying information (e.g., e-mail addresses) have been removed due to privacy concerns. Being able to map Stackoverflow users to GitHub users can give valuable insights into the development processes of software projects and crowed-sourced knowledge.

For this project you will have to research effective methods to access the two datasets and leverage the huge amount of (often noisy) data to link these two dataset together.

The work consists of software engineering tasks and research.
Work can be used as groundwork for continued studies. Skills you will acquire:

• Natural language processing: Latent Semantic Analysis
• Designing experiments and developing datasets for robust evaluations

### MD4: Data enrichment of South African thesis and dissertation repositories

South African universities are obligated to make all masters theses and doctoral dissertations publicly available. Most universities make use of the DSpace repository system that allows them to add metadata such as the title, the student’s name, or the date of graduation. However, a lot of issues still exist:

• A lot of valuable metadata is missing from most repositories (e.g., supervisor names, graduation date vs. upload date, etc.)
• Keywords and topics are sparsely used

Luckily, since theses and dissertations have to be made publicly available, the PDFs or Word documents are readily available for data mining. For this project, you will develop software that parses PDF files and extracts/annotates them with useful information. Furthermore, you will apply natural language techniques to generate keywords in order to cluster the theses and dissertations into meaningful topics. Lastly, you will use a backend database to store the extracted metadata and implement a browsable and searchable website to display the information that links back to the actual university repositories.

This project is software engineering heavy but contains a bit of research.
Publishable: unlikely.
Skills you will acquire:

• Text processing and parsing techniques
• Data mining and topic modelling
• Web development: data summarisation and presentation

### MD5: Crawling data from Google Scholar

Google Scholar is probably the most comprehensive, albeit not the most accurate, database and search engine for academic papers and technology patents. Google does not make this data easily available. In contrast, Google tries to make it as difficult as possible to scrape data through automatic scripts by implementing Recaptcha puzzles and IP blocking.

For this project you will design and engineer a distributed crawler to scrape data from Google Scholar. The idea is that a team of people, on multiple computers, are able to solve puzzles. This involves using the Scrapy framework, SQLalchemy for database access, and PyQt5 for GUI development (a alternative web GUI can also be considered). Furthermore, you will use the Selenium library to identify and automatically try and solve the Recaptcha puzzles. This does not involve machine learning to solve the puzzles, but instead creating a database of images in order to help humans in the process of solving the puzzles.

This project comprises almost exhaustively of software engineering. Publishable: No. Skills you will acquire:

• Some image processing
• Distributed systems design
• Database design
• PyQT5 GUI development