Full Professor (Chair) in Distributed Systems of the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS) of Delft University of Technology
Dick H.J. Epema obtained an MSc in mathematics with a minor in Spanish in 1979, and a PhD in mathematics (algebraic geometry) in 1983, both from Leiden University in the Netherlands. His PhD thesis is entitled Surfaces with Canonical Hyperplane Sections. In 1988 he also obtained an MSc in computer science from Delft University of Technology.
He is currently a full professor of Distributed Systems at Delft University of Technology. From September 2011 until 2016, he was a part-time full professor of Decentralized Distributed Systems in the System Architecture and Networking group of Eindhoven University of Technology.
His research interests are in the areas of scheduling in distributed computing systems (grids, clusters, clouds, datacenters) and cooperative systems (peer-to-peer systems, online social networks, blockchain).
In the area of scheduling, his current focus is on resource allocation to data-processing frameworks such as MapReduce and Spark. An important topic has been processor co-allocation, that is, the distribution of single (parallel) applications across multiple clusters. His scheduling research centers around the KOALA grid scheduler, which has been deployed on the DAS system and which had processor co-allocation as one of its initial main features. KOALA was later extended to deal with many application types, such as workflows, bags-of-tasks, and data-processing frameworks.
In the area of cooperative systems, his research was on measurements and modeling of the BitTorrent P2P system, on all aspects of video distribution (recorded, live, and VoD) in swarm-based P2P systems, and on reputation mechanisms and resilience against sybil attacks, as part of the research and development of the Tribler P2P system. His research in this area recently moved to trust and the blockchain.
Previously, he did research in performance analysis, and he has investigated many different types of priority and fair queuing systems, ranging from theoretical single-server queuing models to the decay-usage scheduling policy in UNIX multiprocessors.
Dick Epema has obtained many research grants from NWO, the EU, and the Dutch government (BSIK, COMMIT). He was involved in the VL-e (grid computing) and I-Share/Freeband (virtual communities in the Internet and P2P computing) BSIK projects, and he participates in the Infrastructure Virtualization for e-Science project of the Dutch national COMMIT program. He has authored over 140 scientific papers, and has been on numerous program committees in grids, clouds, and P2P computing. He is an associate editor of the IEEE Trans. on Parallel and Distributed Systems and the IEEE Trans. on Cloud Computing. He was General Co-Chair of the Euro-Par 2009 and the IEEE P2P 2010 conferences, and he was General Chair of the 21st ACM Symp. on High-Performance Parallel and Distributed Computing in 2012 and of the 13th IEEE/ACM Symp. on Cluster, Cloud and Grid Computing in 2013. He was Program Committee Co-Chair of the 22nd ACM Symp. on High-Performance Parallel and Distributed Computing in 2013.
Main research interests
- Distributed systems: design, operation and performance analysis
- Resource management and scheduling in distributed computing systems: grids, clusters, clouds, datacenters
- Cooperative Systems: modeling and analysis, trust and reputation mechanisms, blockchain
Current PhD students
- Vincent van Beek (scheduling business-critical workloads in clouds, with Alexandru Iosup)
- Alexey Ilyushkin (workflow scheduling in clusters, with Alexandru Iosup)
- Aleksandra Kuzmanovska (scheduling frameworks in clusters)
- Sobhan Omranian Khorasani (optimizing data-processing frameworks, with Jan Rellermeyer)
- Quinten Stokkink (with Johan Pouwelse)
- Martijn de Vos (with Johan Pouwelse)
Previous PhD students
- Jan de Jongh, Share Scheduling in Distributed Systems, February 2002
- Anca Bucur, Performance Analysis of Processor Co-Allocation Policies in Multicluster Systems, March 2004
- Hashim Mohamed, The Design and Implementation of the KOALA Grid Resource Management System, November 2007
- Pawel Garbacki, Improving P2P Applications by Breaking the Architecture Symmetry, December 2008
- Alexandru Iosup, A Framework for the Study of Grid Inter-operation Mechanisms, January 2009
- Jan David Mol, Free-riding Resilient Video Streaming in Peer-to-Peer Networks, January 2010
- Ozan Sonmez, Application-Oriented Scheduling in Multicluster Grids, June 2010
- Michel Meulpolder, Managing Supply and Demand of Bandwidth in Peer-to-Peer Communities, March 2011
- Nezih Yigitbasi, Understanding and Improving the Performance Consistency of Distributed Computing Systems, December 2012
- Rahim Delaviz Aghbolagh, A Robust Reputation Mechanism for Peer-to-Peer Systems, October 2013 (with Johan Pouwelse)
- Adele Lu Jia, Online Networks as Societies: User Behaviors and Contribution Incentives, October 2013 (with Johan Pouwelse)
- Dimitra Gkorou, Exploiting Graph Properties for Decentralized Reputation Systems, November 2014 (with Johan Pouwelse)
- Siqi Shen, Massivizing Networked Virtual Environments on Clouds, April 2015 (with Alex Iosup)
- Mihai Capota, User Contribution in Peer-to-Peer Communities, July 2015 (with Johan Pouwelse)
- Riccardo Petrocco, Improving Peer-to-Peer Video Streaming, April 2016 (with Johan Pouwelse)
- Yong Guo, Distributed Heterogeneous Systems for Large-Scale Graph Processing, May 2016 (with Alex Iosup)
- Bogdan Ghit, Optimizing the Performance of Data Analytics Frameworks, May 2017
- Condor Flocking
- Decay-usage scheduling in multiprocessors
- Processor co-allocation in multicluster systems
- Measuring and analyzing grid and cloud workloads
- Balancing resources among frameworks in datacenters
- 2Fast: Collaborative downloading in BitTorrent
- Measuring and modeling swarm-based P2P systems
- Reputation systems in online social networks
- Associate editor of IEEE Trans. on Parallel and Distributed Systems (2009-2014)
- Associate editor of IEEE Trans. on Cloud Computing
- General and Program Committee Co-Chair of LSAP 2009 in Munich
- General and Program Committee Co-Chair of Euro-Par 2009 in Delft
- Vice Program Committee Chair 10th IEEE/ACM Int'l Symp. on Cluster, Cloud and Grid Computing (CCGrid) 2010 in Melbourne
- General Co-Chair of the 10th IEEE Conference on Peer-to-Peer Computing in Delft
- General and Program Committee Co-Chair of LSAP 2010 in Chicago
- General and Program Committee Co-Chair of LSAP 2011 in San Jose
- General Chair of the 21st Int'l ACM Symp. on High-Performance Parallel and Distributed Computing (HPDC) 2012 in Delft
- General Chair of the 13th IEEE/ACM Int'l Symp. on Cluster, Cloud and Grid Computing (CCGrid) 2013 in Delft
- Program Committee Co-Chair of the 22nd Int'l ACM Symp. on High-Performance Parallel and Distributed Computing (HPDC) 2013 in New York City
- Area Chair of Clouds and Distributed Computing at Supercomputing 2016 in Salt Lake City
Program Committee member for
- 28th ACM Symp. on High-Performance Parallel and Distributed Computing (HPDC'19), Phoenix, AZ, USA, June 2019
- CCGrid 2019, Cyprus, May 2019
- 27th ACM Symp. on High-Performance Parallel and Distributed Computing (HPDC'18), Tempe, AZ, USA, June 2018
- ICDCS 2018, Vienna, July 2018
- CCGrid 2018, Washington, DC, May 2018
- IEEE Int'l Conference on Big Data, December 2017
- Supercomputing, Denver, USA, November 2017
- IEEE 25th Int'l Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Banff, Canada, September 2017
- 21st Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), Orlando, USA, June 2017
- 26th ACM Symp. on High-Performance Parallel and Distributed Computing (HPDC'17), Washington DC, USA, June 2017
- IEEE 24th Int'l Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), London, September 2016
- 1st Workshop on Edge Computing (WEC'16), in conjunction with ICDCS, Nara, Japan, June 2016
- 25th ACM Symp. on High-Performance Parallel and Distributed Computing (HPDC'16), Kyoto, Japan, June 2016
- CCGrid 2016, Cartagena, Colombia, May 2016
- 20th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), Chicago, USA, May 2016
- IEEE 23rd Int'l Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Atlanta, October 2015
- 24th ACM Symp. on High-Performance Parallel and Distributed Computing (HPDC'15), Portland, June 2015
- Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models (SCREAM'15, in conjunction with HPDC'15), Portland, June 2015
- 8th Workshop on Virtualization Technologies in Distributed Computing (VTDC, in conjunction with HPDC'15), Portland, June 2015
- 19th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), Hyderabad, India, May 2015
- IEEE Cluster 2014, Madrid, September 2014
- 23rd ACM Symp. on High-Performance Parallel and Distributed Computing (HPDC'14), Vancouver, June 2014
- CCGrid 2014, Chicago, May 2014
- JSSPP 2014, Phoenix, May 2014
- 17th Int'l Conf. on Principles of Distributed Systems (OPODIS), December 2013
- SuperComputing 2013, Denver, November 2013
- 2013 IEEE Int'l Conference on Big Data (IEEE BigData 2013), Silicon Valley, October 2013
- 1st Int'l Workshop on Optimization Techniques for Resource Management in Clouds (ORMaCloud, with HPDC-13), New York, June 2013
- JSSPP 2013, Boston, May 2013
- 4th ACM/SPEC Int'l Conference on Performance Engineering, Prague, April 21-24, 2013
- 5th IEEE/ACM Int'l Conference on Utility and Cloud Computing (UCC 2012), Chicago, November 2012
- IEEE P2P Computing 2012, Tarragona, Spain, September 2012
- JSSPP 2012, Shanghai, May 2012
- ParCo 2011, Gent, Belgium, Aug-Sept 2011
- 4th Annual Int'l Systems and Storage Conference (SYSTOR), Haifa, Israel, May-June 2011
- HPDC 2011, San Jose, CA, USA, June 2011
- CCGrid 2011, Newport Beach, CA, USA, May 2011
- Grid2010, Brussels, Belgium, October 2010
- JSSPP 2010, Atlanta, USA, April 2010
- CCGrid 2010, Melbourne, Australia, May 2010 (vice chair Performance Modeling and Evaluation)
- IEEE P2P Computing 2009, Seattle, USA, September 2009
- Euro-Par 2009, Delft, the Netherlands, August 2009 (co-chair)
- HPDC 2009, Garching, Germany, June 2009
- CCGrid 2009, Shanghai, China, May 2009
- IEEE P2P Computing 2008, Aachen, Germany, September 2008
- Grid2008, Tsukuba, Japan, September 2008
- Euro-Par 2008, Canary Islands, Spain, August 2008 (global chair of the topic peer-to-peer systems)
- HPDC 2008, Boston, USA, June 2008
- IPDPS 2008, Miami, Florida, USA, April 2008
- IPTPS 2008, Tampa Bay, Florida, USA, February 2008
- IEEE P2P Computing 2007, Galway, Ireland, September 2007
- Euro-Par 2007, Rennes, France, August 2007
- ICDCS 2007, Toronto, Canada, 25-29 June 2007
- CCGrid 2007, Rio de Janeiro, Brazil, May 2007
- HPDC-15, Paris, France, 19-23 June 2006
- The Sixth International Workshop on Global and Peer-to-Peer Computing organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2006 (IEEE/ACM CCGRID 2006), Singapore, May 2006
- Second Workshop on System Management Tools for Large-Scale Parallel Systems, in conjunction with the 2006 Int'l Parallel and Distributed Processing Symp., April 29, 2006, Rhodos, Greece
- Grid 2005 - 6th IEEE/ACM Int'l Workshop on Grid Computing, November 12, 2005, in conjunction with SuperComputing 2005, Seattle, Washington, USA
- The Second Grid Resource Management Workshop (GRMW-2005), in conjunction with the Sixth Int'l Conference on Parallel Processing and Applied Mathematics, September 11-14, 2005, Poznan, Poland
- The Fifth International Workshop on Global and Peer-to-Peer Computing organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2005 (IEEE/ACM CCGRID 2005), Cardiff, UK, May 2005
- The European Grid Conference 2005, Amsterdam, the Netherlands, 14-16 February 2005
- ICPP 2003, Kaohsiung, Taiwan, October 2003
- Performance 2002, The IFIP WG 7.3 Int'l Symposium on Computer Performance Modeling, Measurement and Evaluation, Rome, September 23-27, 2002
- CCGrid 2002, Berlin, Germany, May 2002
- The First Euroglobus Workshop in Lecce, Italy, 16-23 June 2001
- The 2nd Workshop on MAthematical (performance) Modeling and Analysis (MAMA2000) in conjunction with Sigmetrics 2000, June 17-18, 2000, in Santa Clara, CA, USA
- The Distributed Computing and Metacomputing Workshop as part of HPCN'99
Resource Management and Scheduling in Distributed Processing Systems
The KOALA Multicluster Scheduler
KOALA is a scheduler that we have designed and implemented in the PDS group, and that has been deployed on the DAS system. KOALA is our vehicle for research in scheduling and resource management in multicluster systems, grids, and clouds. Its main original feature was processor co-allocation, but it now supports many more application types, such as Bags-of-Tasks, workflows, and MapReduce applications. KOALA development has been an ongoing effort in several research projects.
The Distributed ASCI Supercomputer (DAS)
The DAS is a six-cluster computer-science infrastructure funded by NWO (the Dutch National Science Foundation) and installed and maintained by the ASCI Research School. One of the clusters is located at TU Delft. The DAS is very important for the research of the PDS group. The KOALA scheduler has been developed for and installed on the DAS.
Infrastructure Virtualization for e-Science (IV-e, part of the national Dutch COMMIT programme, 2011-2017)
This project is a sequel to the VL-e project (see below) on resource management, e-Science applications, workflows and data management in large-scale distributed computing systems such as clouds. The two research topics of the PDS group in this project are further development of the KOALA scheduler and application-specific scheduling. In particular, we currently focus on scheduling data-intensive frameworks such as MapReduce and on workflow scheduling.
PhD students: Bogdan Ghit and Alexey Ilyushkin
GUARD-G: Guaranteed Delivery in Grids (2007-2012)
The goal of this project on grid computing is to design and analyze techniques for delivering guaranteed service to applications in grids. The GUARD-G project is part of the GLANCE programme funded by NWO, and is performed jointly with Leiden University.
PhD student: Nezih Yigitbasi
Postdoc: Hashim Mohamed
ALEAE: Handling Uncertainties in Large-Scale Distributed Systems (2009-2010)
The goal of ALEAE is to provide models and algorithmic solutions in the field of resource management that cope with uncertainties in large-scale distributed systems. ALEAE is a joint project of Delft University of Technology, INRIA in France, Osaka University in Japan, and the Zuse Institute in Berlin, Germany. One of the main achievements of the ALEAE project is the Failure Trace Archive (FTA), which is a centralized public repository of availability traces of parallel and distributed systems, and tools for their analysis. The purpose of this archive is to facilitate the design, validation, and comparison of fault-tolerant models and algorithms.
Virtual Laboratory for e-Science (2004-2010)
In the Dutch national project Virtual Lab for e-Science (VL-e), we focus on resource management, scheduling, and performance analysis in grids. In particular, we study the management and scheduling of jobs that require co-allocation, that is, the simultaneous allocation of resources (processors, data, etc.) in multiple subsystems making up a grid. For this purpose, we have designed and implemented the KOALA grid scheduler.
PhD students: Alexandru Iosup and Ozan Sonmez
Postdocs: Alexandru Iosup, Ozan Sonmez and Hashim Mohamed
CoreGRID is a Network of Excellence of the European Union in grid computing, with 42 participating universities and public research institutes in Europe. CoreGRID is divided into six work packages or so-called virtual institutes. One of these is the virtual institute on Resource Management and Scheduling, in which the PDS group participates.
In this project on grid computing, we focused on resource management across multiple sites. In particular, we designed and implemented the flocking mechanism in Condor for load sharing and job migration across different Condor pools, in cooperation with the main designer of the Condor system, Miron Livny of the University of Wisconsin at Madison.
Peer-to-Peer Systems and Online Social Networks
PhD students: Michel Meulpolder and Rahim Delaviz
I-SHARE is a project on sharing technology at different levels in wired and wireless P2P systems. It is part of the BSIK programme Freeband. As a guiding example, we are defining an architecture for P2P-TV, a P2P system for the dissemination of both live and recorded programs of 10,000+ TV channels. Research issues are how to recommend TV programs to users, how to design the user interface, how to build application-level multicast trees for distributing live video, and in general, how to share the contents of individual video recordings on users' hard disks.
PhD student: Jan David Mol
Postdoc: Johan Pouwelse
Two-level peer-to-peer systems (TLP2PS, 2003-2008)
The research topic in this NWO-funded project is to exploit the heterogeneity of P2P systems, and in particular, to assess the performance impact of the presence of superpeers, which are peers that have more capabilities than other peers.
PhD student: Pawel Garbacki
- On May 27, 2016, I held my inaugural lecture entitled Gedistribueerde systemen: van efficientie tot vertrouwen at Delft University of Technology. Here are all the materials (all are in Dutch):
- the video recording of the lecture
- the PowerPoint slides of the presentation: 1-2, 3, 4-6, 7-9, 10-18, 19-27
- the text of the lecture (closely matches the slides)
- A high-level overview of the Distributed Systems group, 11 December 2015.
- Dynamic Resource Provisioning for Application Frameworks in Datacenters, presentation at Google, Mountain View, 10 March 2015.
- Decentraliseer--en Beheers?, Inaugural lecture at Eindhoven University of Technology, 23 November 2012 (in Dutch, full text of the lecture).
- Twenty Years of Grid Scheduling Research and Beyond, Keynote at the 12th IEEE/ACM Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), 16 May 2012.
- Peer-to-Peer File Sharing: Past!-Present-Future? A Delft View, Keynote at the 11th IEEE Int'l Conference on Peer-to-Peer Computing (P2P'11), 31 August 2011.
- Exploiting Heterogeneity in Parallel and Distributed Systems, Keynote at the 7th Workshop on Algorithms, Models, and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar'2009), 25 August 2009.
Main teaching interests
- Distributed systems
- Distributed algorithms
- Cloud Computing
Seminar Cloud Computing (IN4392)
This course combines a rather practical introduction to many aspects of cloud computing with training in academic skills such as presenting, paper reviewing and report writing. It consists of seven lectures on different aspects of cloud computing, such as datacenters and energy efficiency, resource management, and programming models; presentations and paper reviewing; small lab exercises; and a large lab exercise in which a cloud application has to be designed, implemented, and tested.
For TU Delft students, more information is available on Blackboard.
Distributed Algorithms (IN4150)
This course treats basic distributed algorithms for problems such as synchronization, causal message ordering, deadlock, mutual exclusion, election, minimum-weight spanning trees, fault tolerance, consensus, and stabilization.
For TU Delft students, more information is available on Blackboard.
Advanced Blockchain Engineering (ASCI course A27)
with Johan Pouwelse, Quinten Stokkink, Martijn de Vos (all TU Delft), and Marc Makkes (Vrije Universiteit Amsterdam)
April 23-25, 2018
- Consensus in Distributed Systems
- Impossibility of Consensus in Asynchronous Distributed Systems
- State Machine Replication
Some previous MSc students and MSc theses
- Xander Evers, Condor Flocking: Load Sharing between Pools of Workstations (1993)
- Peter van Sebille, Design and Implementation of Support for Pipes in Condor (1994)
- Richard Boer, Resource Management in the Condor Systems (1996)
- Denis Koelewijn, Flexible Collection and Exploration of Condor Monitoring Data (1998)
- Joris van Rantwijk, Data Transmission in the Antares Data Acquisition System (2002)
- Jan David Mol, Resource Allocation for Streaming Applications in Multiprocessors (2004)
- Wouter Lammers, Adding Support for New Application Types to the KOALA Grid Scheduler (2005)
- Michel Meulpolder, TriblerCampus: An Integrated Peer-to-Peer Platform for File Distribution in Course Management Systems (2006)
- Jelle Roozenburg, Secure Decentralized Swarm Discovery in Tribler (2006)
- Bart Grundeken, Adding Cycle Scavenging Support to the KOALA Grid Resource Manager (2009)
- Egbert Bouman, Tribler-G: A Decentralized Social Network for Playing Chess Online (2012)
- Lipu Fei, KOALA-C: A Scheduler for Integrated Multi-Cluster and Multi-Cloud Environments (2013)
- Stefan van Wouw, Performance Evaluation of Distributed SQL Query Engines and Query Time Predictors (2014)
- Jonathan Heiss (EIT Digital Cloud Computing and Services), Improving the Performance of the Variant Calling Workflow for DNA Sequencing (2017)