Master project ideas

The WIS group is open for students who want to do their thesis on subjects in the wider area of web information systems and information architecture.

At the WIS group we encourage students to contact prof. Geert-Jan Houben to discuss possible topics for a thesis project (and literature survey). In such discussions, students are free to suggest their own topic, and a concrete thesis (or literature survey) topic is then defined together.

Master students who would like to specialise in WIS are encouraged to select Web Science & Engineering, Information Retrieval, Crowd Computing, and Seminar Web Information Systems (possibly in the second MSc year). In any case, students can always contact prof. Houben for advice on composing a program.

As a rough indication of possible subjects for thesis projects, below we give some subjects of projects that have been running, that are running, or that are open for new students. Many of these topics can be approached in projects within the research lab, in industry, or in a collaboration between academia and industry. Industry here includes organisations that are public or private, large or small, national or international. Examples of organisations that have hosted our thesis students: Adyen, Crowdsense, CGI (Prorail), Exact Software, Capgemini (Unilever), Gemeente Amsterdam, Greetinq, KPMG, IBM, ICTU, IDS Scheer, ING, Innopay, Isaac, ISD, Logica, Sanoma, Tamtam, TNO, and Truvo.

This page also includes a section with thesis topics that are relevant for the LDE-CEL team led by professor Marcus Specht on the topics around Education & Learning.

Thesis topics at Sigma-Lab

The mission of the Sigma-Lab is to understand, design, and build social computing systems that process social data to improve Web-based applications and systems. The Sigma-Lab team is supervised by Alessandro Bozzon, and it includes the post-docs Achilleas Psyllidis, Pavel Kucherbaev, and Andrea Mauri; the PhD students Jasper Oosterman, Jie Yang, Sepideh Mesbah, Vincent Gong, and Shahin Sharifi; and the research engineers Carlo van der Valk and Ioannis Protonotarios.

The research conducted in Sigma-Lab aims at answering questions such as: How can humans and machines better collaborate in the creation, analysis, and sense making of social data? How to control and accelerate the knowledge creation process at scale? How to systematically and reliably exploit social data in urban analytics? How can social data be effectively and efficiently injected in existing WIS to achieve pre-defined data-driven business goals?

In the last five years, the Sigma-Lab supervised more than 20 MSc theses, and is looking for students interested in one of the following topics.

  • Urban Data Analytics for Smart Cities: in the context of the SocialGlass project, we are looking for master students with a passion for data science and an interest in improving the quality of life in our cities. In SocialGlass we develop new urban data science methods that can help address issues in domains such as transportation, crowdedness in the city, responsible energy consumption, urban planning, and business attractiveness. Examples of available MSc thesis topics include:
    • Developing (Deep) Machine Learning models to quantify and predict safety / crime rates in urban neighborhoods, using combinations of StreetView, social media, and socio-economic data
    • Developing (Deep) Machine Learning models to quantify and predict quality-of-life aspects (e.g. segregation, deprivation etc.) in cities, using satellite imagery and social media data
    • Developing models and implementing systems to recommend new POI locations in cities
  • Analysing Individual Energy Consumption Behaviour using Social Media Data: Currently, energy consumption data are primarily gathered by (smart) energy meters at the household level. While such data is highly reliable and temporally complete, its acquisition requires access to the energy infrastructure; moreover, such data is semantically poor. The aim of this project is to explore the potential usefulness of social media as an alternative source of data about individuals' energy consumption behaviour. We focus on four components of energy lifestyle, namely: Dwelling, Mobility, Food consumption, and Leisure. The output of this project will be a social media analysis pipeline for collecting and classifying energy-related social media posts (e.g. tweets) and finally generating an energy consumption profile for social media users. Relevant MSc courses: Information Retrieval, Web Science and Engineering.
  • Chatbots Able to Learn New Skills: Many chatbots exist to retrieve information (e.g. “when is the next train to Amsterdam?”) or to perform a transaction (e.g. “Purchase one ticket to Escher museum in The Hague”). Such chatbots are usually designed for a specific, narrow use case, and their functionality is hardcoded. Extending the functionality of such a chatbot requires a software developer to change its codebase. We envision a chatbot system that can extend its functionality by learning from users, crowd workers, experts, or even automatically. Think about Wikipedia. Years ago, it lacked articles on many topics. Now, with contributions from thousands of people all around the world, it is hard to find a topic that is not covered there. Similarly, if thousands of people teach new skills to such a chatbot, it will be able to effectively serve millions of users in a wide range of domains.
  • Generating Chatbots based on APIs and DB schemas: Currently it is possible to develop a chatbot semi-automatically based on a Q/A dataset or on an API. We believe that a logical next step is to construct a chatbot automatically from a database schema or REST API. This research will help to understand how to map a database schema or a tree of API endpoints to a conversation tree, and will allow fast creation of chatbots.
  • Human Aided Bots - Dialogue Management: When we purchase a coffee, the conversation we have with the barista is quite standard. In contrast, at work the conversation with a colleague about solving a unique, complex problem is not predefined, and we adapt along the way. Similarly, chatbots usually follow a dialogue quite well in the predefined domain they were designed for, but fail to do so in more complex and less predictable conversation scenarios. We aim to address this issue by designing methods and tools for modeling both fixed and open dialogues. A special interest is understanding dialogues on the go, even if the chatbot was not initially designed for such dialogues.
  • Enterprise Crowdsourcing: While machine learning and artificial intelligence applications are gaining popularity, enterprises are devoting more and more attention to enterprise crowdsourcing as an effective technique to capitalize on their available human resources and include in-house human-generated data. The aim of this thesis project, to be performed in collaboration with IBM Netherlands, is to advance the state of the art in enterprise crowdsourcing by studying how different task designs and participation incentives affect the quality and reliability of the employees' work.
  • Music Recommendation Based on User Context: It is known that people listen to different music at work and at home. People listen to different music when having breakfast alone on a working day and when having dinner with friends on Saturday. Across all these contexts, music preferences also vary from person to person. Going to the app and manually choosing a different playlist every time is so 20th century. We envision a system that learns from the user and automatically plays the music the user wants to listen to now, depending on activity, location, weather, mood, physical status, and other context features. In this project we need to model the user’s context relevant for music preferences, develop methods to detect this context, and develop a recommender system mapping contexts to listening preferences.
  • Extracting Domain Specific Entities and Relations from text (Web pages): Extracting entities of interest (e.g. dataset, method, evaluation metrics) and their relations (e.g. isUsedBy, ComparedWith, ...) from massive text corpora (e.g. Clueweb) is important for enhancing semantic search, linking information across different sources, etc. The aim of this project is to devise methods to automatically extract the entities of interest and the relations between them. Relevant MSc courses: Information Retrieval, Pattern Recognition.
  • Long-Tail Named Entity Extraction: This engineering-heavy MSc thesis focuses on implementing a framework for named entity recognition and extraction from natural text, with a focus on rare entities. In collaboration with our team, novel NER and NEE methods are developed, implemented and evaluated on scientific publication corpora. The final result is released as a well-documented open source project.


Thesis topics at Lambda-Lab

The Lambda-Lab has two broad research lines: information retrieval (conversational search, deep learning approaches to ad-hoc retrieval, reproducibility in IR, search as learning) and data science in the context of Massive Open Online Courses. Get in touch with Claudia Hauff to discuss possible options.

Thesis projects at Epsilon-lab

The E(psilon)-lab is a lab within the Web Information Systems group and is concerned with human interaction with artificial advice givers, and specifically explanations to support decision making. The lab is supervised by Nava Tintarev, and it includes the post-docs Oana Inel and Emily Sullivan; and the PhD students Shabnam Najafian and Yucheng Jin (KU Leuven).

To have a concrete idea of what this means, take a look at the introduction to the special issue on human interaction with artificial advice givers.

The E-lab takes a user-centered approach to research, and evaluates the quality of human decision making to drive both interface and algorithm design. The research is currently driven by two applied challenges: 1) Explainable algorithms; and 2) Interactive interfaces for explanations. Below are some examples of possible topics in the lab. Beyond what is mentioned here as examples, you are also welcome to use your own ideas as long as they fit into the research lines.

  • Explaining news recommendations on disputed topics
  • Algorithms to help users discover unexplored news articles
  • Fair and explainable news summarization
  • Crowdsourcing explainable annotations for diversity of perspectives
  • Explaining recommendations to groups with different preferences
  • Decision support for a train control system (with CGI and Prorail)
  • Novel interfaces and interactions for explanations

Recommended courses for this research line: Information Retrieval, Multimedia Search & Recommendation, Fundamentals of Data Analytics, Crowd Computing, and Data Visualization.

Thesis Projects at Delta-lab

Supervised by Christoph Lofi, the Delta-lab offers topics related to the structural representation of information, with a specific focus on querying that information from a user's perspective. As such, it bridges database research and areas such as knowledge engineering, natural language processing, and user modelling.
Currently, the following thesis topics are available:

  • Smart Access to Open Educational Resources: This graduation project is hosted at TU Delft library. More information can be found here.
  • Long-Tail Named Entity Extraction: This engineering-heavy MSc thesis focuses on implementing a framework for named entity recognition and extraction from natural text, with a focus on rare entities. In collaboration with our team, novel NER and NEE methods are developed, implemented and evaluated. The final result is released as a well-documented open source project.
  • Extracting Domain Specific Entities and Relations from text (Web pages): Extracting entities of interest (e.g. dataset, method, evaluation metrics) and their relations (e.g. isUsedBy, ComparedWith, ...) from massive text corpora (e.g. Clueweb) is important for enhancing semantic search, linking information across different sources, etc. The aim of this project is to devise methods to automatically extract the entities of interest and the relations between them. Relevant MSc courses: Information Retrieval, Pattern Recognition.


The thesis subjects below are to be advised by Dr. Asterios Katsifodimos, Assistant Professor with the Web Information Systems Group. Asterios works in the broad area of data management, with a focus on scalable batch and streaming analytics. The thesis subjects below are defined in a high-level fashion so that students can steer the subject to their liking (more on systems, or theory), level (Bachelor or Master), and skill-set. If you are interested in any of the subjects below, or want to propose one that would match Asterios' style of research, please get in contact with him!

Internship opportunities: Many of the theses below are relevant to real-life problems and (depending on the motivation of the student and the quality of their results) have the potential of an internship opportunity with companies like KPMG in Amsterdam, SAP Innovation Center in Berlin, the KTH university in Stockholm, TU Berlin, or the - under development - Delft Data Science Platform.

  • Bridging Linear and Relational Algebra for Scalable Data Science
    Linear algebra operations are at the core of many Machine Learning (ML) pipelines. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. The goal of this thesis would be to optimize Data Science pipelines by applying ideas from database optimizers to large Big Data pipelines which include both linear and relational algebra operations. The thesis can take on work on the theory side (how can we represent “query plans” so that we can optimize them?) and/or on the practical side (how can we design novel physical operators to use in scalable ML pipelines?).
  • Executing Transactions in Modern Stream Processors
    Stream processors such as Apache Flink, Storm or Apex are emerging in the industry as a tool to perform both analytical workloads (e.g., monitoring log files, sensors, and micro-services) and mission-critical services such as fraud detection in credit-card transactions. Modern streaming systems are now in an arms race to provide first-class support for application state with strong state consistency guarantees in the presence of failures. At the same time, there is a growing need for executing high-throughput transactions directly on the stream processor, rather than on a traditional database system. The goal of this thesis is to investigate ways of executing transactions on the application state of modern stream processing systems (e.g., Apache Flink).
  • Dataset Versioning For Social Data Science
    Version control is a very important part of every development process. Developers typically branch from a version of a software system, apply their own changes and then merge their changes to a master branch. Various tools and systems exist, the most famous and successful being Git (and the GitHub website). Git, however, is designed for the development process of software, not data. This thesis should create tools and a platform, very similar to the ideas behind GitHub, but for very large datasets. There are a lot of challenges associated with dataset versioning. Most of them stem from the sheer volume of datasets, which can be in the order of terabytes. Retrieving, comparing (and creating deltas of), and storing data of such a volume is a non-trivial task. This thesis will investigate current techniques for version management of massive datasets, and propose changes to those techniques, in order to tackle the challenges mentioned above.
  • Data Lakes
    The aim of this thesis work is to understand the state of the art in technology for data lakes. More specifically, the student will work on implementing novel data processing functionality and services into an existing data-lake platform.
  • Scalable Inference with Deep-Learning Models in the GPU Clouds
    Nowadays we witness the proliferation of solutions for scalable Machine Learning inference (e.g., Google Cloud Machine Learning, SAP Leonardo ML foundations, Amazon’s AI on AWS). In these platforms, a specialized model is first trained and then used to respond to users’ requests, such as image recognition, where the user sends an image to a running service and receives a set of objects that are found in that image. TensorFlow and the Inception deep-learning model are typical examples of technologies used for such ML inference. However, such inference is very slow on CPUs, and cloud companies typically use GPUs to perform inference at scale. In Software-as-a-Service (SaaS) offerings, the objective of a service provider is to allocate and de-allocate resources (CPU, GPU, memory and network) to satisfy its SLAs, while minimizing its operational costs. Since costs are directly associated with the amount of resources a provider is utilizing, service providers typically achieve higher utilization and profits by multiplexing workloads from different users. The goal of this thesis is to design a solution for multiplexing ML inference workloads on GPUs, in order to increase resource utilization and adhere to SLAs of users.
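
As a rough illustration of the row-partitioned versus block-partitioned mismatch mentioned in the "Bridging Linear and Relational Algebra" topic, the sketch below spreads the rows of a small matrix over partitions and then regroups them into square blocks. It is a toy, single-machine sketch in plain Python; the function names `row_partition` and `to_blocks` and the round-robin placement are assumptions made for illustration and do not correspond to the API of Hadoop, Spark, or Flink.

```python
from typing import Dict, List, Tuple

Row = List[float]
Block = List[List[float]]


def row_partition(matrix: List[Row], num_parts: int) -> List[List[Tuple[int, Row]]]:
    """Assign each (row_index, row) record to a partition round-robin.

    Hypothetical stand-in for how a dataflow engine might spread
    records of a relational dataset over workers.
    """
    parts: List[List[Tuple[int, Row]]] = [[] for _ in range(num_parts)]
    for i, row in enumerate(matrix):
        parts[i % num_parts].append((i, row))
    return parts


def to_blocks(
    parts: List[List[Tuple[int, Row]]], block_size: int
) -> Dict[Tuple[int, int], Block]:
    """Regroup row-partitioned records into block_size x block_size tiles.

    Tiles are keyed by (block_row, block_col); tiles at the matrix edge
    may be smaller than block_size.
    """
    # Gather all rows by index; in a distributed setting this regrouping
    # would require moving data between workers (a shuffle).
    rows: Dict[int, Row] = {}
    for part in parts:
        for i, row in part:
            rows[i] = row

    blocks: Dict[Tuple[int, int], Block] = {}
    for i in sorted(rows):
        row = rows[i]
        for j0 in range(0, len(row), block_size):
            key = (i // block_size, j0 // block_size)
            blocks.setdefault(key, []).append(row[j0:j0 + block_size])
    return blocks


if __name__ == "__main__":
    matrix = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    parts = row_partition(matrix, num_parts=2)
    for key, block in sorted(to_blocks(parts, block_size=2).items()):
        print(key, block)
```

The regrouping step is where a real system would pay for data movement, which is one reason a joint optimizer over relational and linear algebra operators could be attractive.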

Thesis topics at LDE-CEL group

Learning Analytics & Analytics Infrastructure

  • Visual analytics dashboard for emotion tracking.
    Description: Using visual metaphors, the visual analytics system enables teachers to monitor learners' emotional states (positiveness, arousal, dominance) in real time, for individual or group work, in the classroom or in an online learning environment. In this way, teachers can manage the learning process and gauge the effectiveness of the learning activities and teaching strategies.


  • Power to the learner: Effects of selecting personal data sources in dashboard building
    Description: Learner agency supports the idea that learners should own their learning. For example, learners should create and follow their own learning goals and select the learning strategies and learning resources they wish to study. This project would look at how learning analytics can be used to support and develop learner agency by developing tools that allow learners to set their own goals, monitor their progress towards these goals and choose personal data sources that they wish to monitor.
  • Transparency in learning analytics interfaces
    Description: Learning analytics algorithms and systems that personalise interfaces or make decisions automatically for their users are often seen as black boxes, hiding the intricacies of decision making. This often leaves teachers and learners in the dark and erodes their trust in the learning analytics system. How can transparency be implemented in a way that is not trivial, but at the same time does not overwhelm its users?


  • Writing Analytics and Feedback
    Description: The project should focus on the analysis of students' text-based assignments and submissions. The work should explore different ways to develop an expert model and give feedback to learners based on their submitted text. The project can also be jointly developed with a company in the field of educational technologies. Feedback can be given to students or educators on formal text criteria, but also on specified assignments in one domain.

Augmented and Virtual Reality

  • Building Virtual Reality Escape Rooms for joint problem-solving in collaborative virtual environments
    Description: Virtual Reality (VR) shows great promise for educational processes such as experiential and collaborative learning. To explore VR’s possibilities in these areas, this project will focus on creating Virtual Reality Escape Rooms, where users will have to work together in virtual space and solve puzzles in order to succeed. By designing, developing and testing such an application, the goal is to study different aspects of collaborative learning and how these relate to the virtual world of VR.


  • Exploring collaborative affordances of handheld augmented reality devices.
    Description: Augmenting collaboration is an affordance of Augmented Reality (AR) media. Compared to traditional platforms, AR provides not only a physical shared space but also a virtual shared space, opening up new methods of collaboration. However, because AR is still evolving, these affordances still need to be explored.
  • A Path to success: Exploring AI/ML algorithms for intelligent training planners.
    Description: A key task of a coach or mentor is to plan practice routines that focus on the performance and weaknesses of the trainee. The mentor balances challenge and competence during practice, and enables deliberate practice for efficient attainment of mastery while also maintaining the motivation of the trainee. This takes up a lot of the expert's time, which is costly. AI/ML algorithms look promising for orchestrating such practice: fed with trainee data, they could automatically suggest optimal practice paths.


  • Back to the past: Augmented storytelling for immersive history learning.
    Description: The first wave of AR was heavily focused on technological advances, answering the question “Can we do this?”. AR has now matured technologically, yet much of the work still focuses on technical aspects and lacks user studies and applications of the technology in the real world. Therefore, much user-experience exploration is needed to define and objectify the benefits of AR as a medium. One such benefit is the potential of AR to redefine storytelling. Using AR, the story can revolve around the user, who is no longer a static consumer of information.

Mobile and Seamless Learning

  • Mobile app for a community trail, e.g., a Delft heritage/architectural trail.
    Description: Within an inquiry-based learning framework that fosters user-generated enquiries, the app should enable HE students to interact with the physical environment; draw on prior knowledge resources, conceptual resources and contextual resources to create micro-sites for learning; and create learning artifacts individually and/or collectively.
  • Creating an experience sampling tool for personal data collection in quantified self.
    Description: Quantified Self is a methodology for exploring hypotheses about personal behaviour and for behaviour change. The experience sampling tool should enable learners to specify contextual parameters (location, time, social contact) for which they would like to collect data points. Driven by the specified trigger, data can be collected from sensor systems, web-based services, and personal log-book entries. These data entries can later be analysed to evaluate the given hypotheses.

Serious Games

  • Gamifying deliberate practice: Implications of computer games design principles for vocational skills training
    Description: Gamification has been hailed as the tool for increasing engagement and, to a certain extent, the motivation of the user. On the other hand, maintaining motivation and engagement in deliberate practice is a challenge. Thus, this project will explore the effects of gamification and the design implications specific to the requirements of deliberate practice.