Dick Epema

Full Professor (Chair) in Distributed Systems of the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS) of Delft University of Technology

Dick H.J. Epema obtained an MSc in mathematics with a minor in Spanish in 1979, and a PhD in mathematics (algebraic geometry) in 1983, both from Leiden University in the Netherlands. His PhD thesis is entitled Surfaces with Canonical Hyperplane Sections. In 1988 he also obtained an MSc in computer science from Delft University of Technology.

Currently, he is a full professor of Distributed Systems at Delft University of Technology. From September 2011 until 2016, he was a part-time full professor of Decentralized Distributed Systems in the System Architecture and Networking group of Eindhoven University of Technology.

His research interests are in the areas of scheduling in distributed computing systems (grids, clusters, clouds, datacenters) and cooperative systems (peer-to-peer systems, online social networks, blockchain).

In the area of scheduling, his current focus is on resource allocation to data-processing frameworks such as MapReduce and Spark. An important topic has been processor co-allocation, that is, the distribution of single (parallel) applications across multiple clusters. His scheduling research centers around the KOALA grid scheduler, which has been deployed on the DAS system and which had processor co-allocation as one of its initial main features. KOALA was later extended to deal with many application types, such as workflows, bags-of-tasks, and data-processing frameworks.

In the area of cooperative systems, his research was on measurements and modeling of the BitTorrent P2P system, on all aspects of video distribution (recorded, live, and VoD) in swarm-based P2P systems, and on reputation mechanisms and resilience against sybil attacks, as part of the research and development of the Tribler P2P system. His research in this area recently moved to trust and the blockchain.

Previously, he did research in performance analysis, and he has investigated many different types of priority and fair queuing systems, ranging from theoretical single-server queuing models to the decay-usage scheduling policy in UNIX multiprocessors.

Dick Epema has obtained many research grants from NWO, the EU, and the Dutch government (BSIK, COMMIT). He was involved in the VL-e (grid computing) and I-Share/Freeband (virtual communities in the Internet and P2P computing) BSIK projects, and he participates in the Infrastructure Virtualization for e-Science project of the Dutch national COMMIT program. He has authored over 140 scientific papers, and has been on numerous program committees in grids, clouds, and P2P computing. He is an associate editor of the IEEE Trans. on Parallel and Distributed Systems and the IEEE Trans. on Cloud Computing. He was General Co-Chair of the Euro-Par 2009 and the IEEE P2P 2010 conferences, and he was General Chair of the 21st ACM Symp. on High-Performance Parallel and Distributed Computing in 2012 and of the 13th IEEE/ACM Symp. on Cluster, Cloud and Grid Computing in 2013. He was Program Committee Co-Chair of the 22nd ACM Symp. on High-Performance Parallel and Distributed Computing in 2013.

Main research interests

  • Distributed systems: design, operation and performance analysis
  • Resource management and scheduling in distributed computing systems: grids, clusters, clouds, datacenters
  • Cooperative Systems: modeling and analysis, trust and reputation mechanisms, blockchain

Current PhD students

  • Vincent van Beek (scheduling business-critical workloads in clouds, with Alexandru Iosup)
  • Alexey Ilyushkin (workflow scheduling in clusters, with Alexandru Iosup)
  • Aleksandra Kuzmanovska (scheduling frameworks in clusters)
  • Sobhan Omranian Khorasani (optimizing data-processing frameworks, with Jan Rellermeyer)
  • Quinten Stokkink (with Johan Pouwelse)
  • Martijn de Vos (with Johan Pouwelse)

Previous PhD students

Research highlights

  • Condor Flocking
  • Decay-usage scheduling in multiprocessors
  • Processor co-allocation in multicluster systems
  • Measuring and analyzing grid and cloud workloads
  • Balancing resources among frameworks in datacenters
  • 2Fast: Collaborative downloading in BitTorrent
  • Measuring and modeling swarm-based P2P systems
  • Reputation systems in online social networks

Editorships

Chairmanships

Program Committee member for

2015

Resource Management and Scheduling in Distributed Processing Systems

The KOALA Multicluster Scheduler

KOALA is a scheduler that we have designed and implemented in the PDS group, and that has been deployed on the DAS system. KOALA is our vehicle for research in scheduling and resource management in multicluster systems, grids, and clouds. Its main original feature was processor co-allocation, but it now supports many more application types, such as Bags-of-Tasks, workflows, and MapReduce applications. KOALA development has been an ongoing effort in several research projects.

The Distributed ASCI Supercomputer (DAS)

The DAS is a six-cluster computer-science infrastructure funded by NWO (the Dutch National Science Foundation) and installed and maintained by the ASCI Research School. One of the clusters is located at TU Delft. The DAS is very important for the research of the PDS group. The KOALA scheduler has been developed for and installed on the DAS.

Infrastructure Virtualization for e-Science (IV-e, part of the national Dutch COMMIT programme, 2011-2017).

This project is a sequel to the VL-e project (see below) on resource management, e-Science applications, workflows and data management in large-scale distributed computing systems such as clouds. The two research topics of the PDS group in this project are further development of the KOALA scheduler and application-specific scheduling. In particular, we currently focus on scheduling data-intensive frameworks such as MapReduce and workflow scheduling.

PhD students: Bogdan Ghit and Alexey Ilyushkin

GUARD-G: Guaranteed Delivery in Grids (2007-2012)

The goal of this project on grid computing is to design and analyze techniques for delivering guaranteed service to applications in grids. The GUARD-G project is part of the GLANCE programme funded by NWO, and is performed jointly with Leiden University.

PhD student: Nezih Yigitbasi
Postdoc: Hashim Mohamed

ALEAE: Handling Uncertainties in Large-Scale Distributed Systems (2009-2010)

The goal of ALEAE is to provide models and algorithmic solutions in the field of resource management that cope with uncertainties in large-scale distributed systems. ALEAE is a joint project of Delft University of Technology, INRIA in France, Osaka University in Japan, and the Zuse Institute in Berlin, Germany. One of the main achievements of the ALEAE project is the Failure Trace Archive (FTA), which is a centralized public repository of availability traces of parallel and distributed systems, and tools for their analysis. The purpose of this archive is to facilitate the design, validation, and comparison of fault-tolerant models and algorithms.

Virtual Laboratory for e-Science (2004-2010)

In the Dutch national project Virtual Lab for e-Science (VL-e), we focus on resource management, scheduling, and performance analysis in grids. In particular, we study the management and scheduling of jobs that require co-allocation, that is, the simultaneous allocation of resources (processors, data, etc.) in multiple subsystems making up a grid. For this purpose, we have designed and implemented the KOALA grid scheduler. 

PhD students: Alexandru Iosup and Ozan Sonmez
Postdocs: Alexandru Iosup, Ozan Sonmez and Hashim Mohamed

CoreGRID (2004-2008)

CoreGRID is a Network of Excellence of the European Union in grid computing, with 42 participating universities and public research institutes in Europe. CoreGRID is divided into six work packages or so-called virtual institutes. One of these is the virtual institute on Resource Management and Scheduling, in which the PDS group participates.

Condor (1992-1996)

In this project on grid computing, we focused on resource management across multiple sites. In particular, we designed and implemented the flocking mechanism in Condor for load sharing and job migration across different Condor pools, in cooperation with the main designer of the Condor system, Miron Livny of the University of Wisconsin at Madison.

Peer-to-Peer Systems and Online Social Networks

P2P-Fusion (2006-2009)

P2P-Fusion is an EU project on peer-to-peer systems for creative reuse of multimedia content in virtual communities. The project has seven partners in Finland, Hungary, and the Netherlands.

PhD students: Michel Meulpolder and Rahim Delaviz

I-SHARE (2004-2010)

I-SHARE is a project on sharing technology at different levels in wired and wireless P2P systems. It is part of the BSIK programme Freeband. As a guiding example, we are defining an architecture for P2P-TV, a P2P system for the dissemination of both live and recorded programs of 10,000+ TV channels. Research issues are how to recommend TV programs to users, how to design the user interface, how to build application-level multicast trees for distributing live video, and in general, how to share the contents of individual video recordings on users' hard disks.

PhD student: Jan David Mol
Postdoc: Johan Pouwelse

Two-level peer-to-peer systems (TLP2PS, 2003-2008)

The research topic in this NWO-funded project is to exploit the heterogeneity of P2P systems, and in particular, to assess the performance impact of the presence of superpeers, which are peers that have more capabilities than other peers.

PhD student: Pawel Garbacki

Main teaching interests

  • Distributed systems
  • Distributed algorithms
  • Cloud Computing

Master's courses

Seminar Cloud Computing (IN4392)

This course combines a practical introduction to many aspects of cloud computing with training in academic skills such as presenting, paper reviewing, and report writing. It consists of 7 lectures on different aspects of cloud computing (such as datacenters and energy efficiency, resource management, and programming models), presentations and paper reviewing, small lab exercises, and a large lab exercise in which a cloud application has to be designed, implemented, and tested.

For TU Delft students, more information is available on Blackboard.

Distributed Algorithms (IN4150)

In this course, basic distributed algorithms are treated for such problems as synchronization, causal message ordering, deadlock, mutual exclusion, election, minimum-weight spanning trees, fault tolerance, consensus, and stabilization.

For TU Delft students, more information is available on Blackboard.

PhD course

Advanced Blockchain Engineering (ASCI course A27)

with Johan Pouwelse, Quinten Stokkink, Martijn de Vos (all TU Delft), and Marc Makkes (Vrije Universiteit Amsterdam)

April 23-25, 2018

Slides:

  1. Consensus in Distributed Systems
  2. Impossibility of Consensus in Asynchronous Distributed Systems
  3. State Machine Replication
  4. Stabilization

 

Some previous MSc students and MSc theses

Prof.dr.ir. D.H.J. Epema

Visiting Address
Building 28

Room: 340 East 3rd floor
Van Mourik Broekmanweg 6
2628 XE Delft
The Netherlands

Mailing Address
EEMCS, Distributed Systems
P.O. Box 5031, 2600 GA Delft
The Netherlands