2017
-
ANANKE: a Q-Learning-Based Portfolio Scheduler for Complex Industrial Workflows, a Technical Report, DS Technical Report DS-2017-001
Abstract DS-2017-001
Complex workflows that process sensor data are useful for industrial infrastructure management and diagnosis. Although running such workflows in clouds promises reduced operational costs, there are still numerous scheduling challenges to overcome. Such complex workflows are dynamic, exhibit periodic patterns, and combine diverse task groupings and requirements. In this work, we propose ANANKE, a scheduling system addressing these challenges. Our approach extends the state of the art in portfolio scheduling for datacenters with a reinforcement-learning technique, and proposes various scheduling policies for managing complex workflows. Portfolio scheduling addresses the dynamic aspect of the workload. Reinforcement learning, based in this work on Q-learning, allows our approach to adapt to the periodic patterns of the workload, and to tune the other configuration parameters. The proposed policies are heuristics that guide the provisioning process, and map workflow tasks to the provisioned cloud resources. Through real-world experiments based on real and synthetic industrial workloads, we analyze and compare our prototype implementation of ANANKE with a system without portfolio scheduling (the baseline) and with a system equipped with a standard portfolio scheduler. Overall, our experimental results give evidence that a learning-based portfolio scheduler can perform better (by 5--20%) and cost less (by 20--35%) than the considered alternatives.
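To make the learning component concrete: tabular Q-learning maintains a value Q(s, a) per (state, action) pair and nudges it toward the observed reward plus the discounted best future value. The sketch below is a generic illustration of this update rule; the state names, reward values, and the mapping of actions to scheduling policies are illustrative assumptions, not ANANKE's actual design.

```python
# Generic tabular Q-learning step; states/actions/rewards are hypothetical.

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[next_state].values()) if Q.get(next_state) else 0.0
    old = Q.setdefault(state, {}).get(action, 0.0)
    Q[state][action] = old + alpha * (reward + gamma * best_next - old)
    return Q[state][action]

# Example: a scheduler learns which policy ("backfill" vs "greedy") to select
# in a "peak-load" state from observed rewards (e.g., based on job slowdown).
Q = {}
for _ in range(200):
    q_update(Q, "peak-load", "backfill", reward=1.0, next_state="peak-load")
    q_update(Q, "peak-load", "greedy", reward=0.2, next_state="peak-load")
```

After enough observations, the higher-reward action dominates the Q-table, which is the mechanism a learning-based portfolio scheduler can use to adapt its policy choice to periodic workload patterns.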
2016
-
Datacenter-wide Portfolio Scheduling for Managing Operational and Disaster-Recovery Risks in Virtualized Datacenters Hosting Business-Critical Workloads, DS Technical Report DS-2016-002
Abstract DS-2016-002
Attracted by the dual promise of infrastructure efficiency and widespread uptake, large enterprises and governmental organizations are increasingly using public- and/or private-cloud resources to run their large-scale business-critical workloads. Although the promises are enticing, hosting business-critical workloads is relatively new, and raises many resource management and scheduling challenges. Particularly challenging is enforcing, for such workloads, risk-aware service-level agreements (SLAs) that express not only strict customer demands for high performance and reliability, but also the aversion of enterprises and governments to the risk of their performance and reliability demands not being met. Traditional datacenters focus on best-effort execution of workloads, which serves customer demands well in that setting, but may not be suitable for risk-aware SLAs and may thus lead to high financial penalties for cloud operators. In contrast to traditional work in resource management and scheduling in datacenters, in this work we address the research question: How to manage the risk of not meeting SLAs for datacenters hosting business-critical workloads?
-
LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on Parallel and Distributed Platforms, a Technical Report, DS Technical Report DS-2016-001
Abstract DS-2016-001
In this paper we introduce LDBC Graphalytics, a new industrial-grade benchmark for graph analysis platforms. It consists of six deterministic algorithms, standard datasets, synthetic dataset generators, and reference output, that enable the objective comparison of graph analysis platforms. Its test harness produces deep metrics that quantify multiple kinds of system scalability, such as horizontal/vertical and weak/strong, and of robustness, such as failures and performance variability. The benchmark comes with open-source software for generating data and monitoring performance. We describe and analyze six implementations of the benchmark (three from the community, three from the industry), providing insights into the strengths and weaknesses of the platforms. Key to our contribution, vendors perform the tuning and benchmarking of their platforms.
2015
-
Modeling, Analysis, and Experimental Comparison of Streaming Graph-Partitioning Policies: A Technical Report, PDS Technical Report PDS-2015-002
Abstract PDS-2015-002
In recent years, many distributed graph-processing systems have been designed and developed to analyze large-scale graphs. For all distributed graph-processing systems, partitioning graphs is a key part of processing and an important aspect of achieving good processing performance. To keep the cost of partitioning graphs low, even when processing the ever-increasing modern graphs, many previous studies use lightweight streaming graph-partitioning policies. Although many such policies exist, there is currently no comprehensive study of their impact on load balancing and communication overheads, or on the overall performance of graph-processing systems. This relative lack of understanding hampers the development and tuning of new streaming policies, and could limit the entire research community to the existing classes of policies. We address these issues in this work. We begin by modeling the execution time of distributed graph-processing systems. By analyzing this model under realistic graph-data characteristics, we propose a method to identify important performance issues, and then design new streaming graph-partitioning policies to address them. Using three typical large-scale graphs and three popular graph-processing algorithms, we conduct comprehensive experiments to study the performance of our policies and of many alternative streaming policies on a real distributed graph-processing system. We also explore the impact on performance of using different real-world networks and of other real-world technical details. We further discuss the coverage of our model and method, and the design of future partitioning policies.
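As a concrete illustration of the class of policies studied here, the sketch below implements one well-known lightweight streaming policy, linear deterministic greedy (LDG): each arriving vertex goes to the partition that already holds the most of its neighbors, discounted by how full that partition is. This is a generic textbook formulation of LDG, not necessarily one of the policies proposed in the report.

```python
# Linear deterministic greedy (LDG) streaming partitioner, a minimal sketch.

def ldg_partition(stream, num_parts, capacity):
    """Assign each (vertex, neighbors) pair to the partition already holding
    the most of its neighbors, discounted by partition fullness; ties go to
    the least-loaded partition."""
    parts = [set() for _ in range(num_parts)]
    assignment = {}
    for vertex, neighbors in stream:
        nbrs = set(neighbors)
        def gain(p):
            return len(parts[p] & nbrs) * (1.0 - len(parts[p]) / capacity)
        best = max(range(num_parts), key=lambda p: (gain(p), -len(parts[p])))
        parts[best].add(vertex)
        assignment[vertex] = best
    return assignment

# Two disjoint triangles, streamed vertex by vertex, land in separate
# partitions, keeping cut edges (and thus communication) at zero.
stream = [(1, [2, 3]), (2, [1, 3]), (3, [1, 2]),
          (4, [5, 6]), (5, [4, 6]), (6, [4, 5])]
assignment = ldg_partition(stream, num_parts=2, capacity=4)
```

The one-pass, neighbor-counting structure is what makes such policies "lightweight": no global view of the graph is ever needed.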
-
A Study of Application Kernel Structure for Data Parallel Applications, PDS Technical Report PDS-2015-001
Abstract PDS-2015-001
In this paper, we study the application kernel structure for data parallel applications. The application kernel structure includes two aspects, the number of kernels in the application and their execution flow. Based on the analysis of application kernel structure, we classify data parallel applications into five classes. To examine the coverage of the classification, we check five benchmark suites with a total of 86 applications, and show that the five classes cover all 86 applications. The classification makes it possible to design efficient partitioning strategies for data parallel applications, and to propose an application-driven method that selects the best performing partitioning strategy for a given workload.
2014
-
Investment strategies for credit-based P2P communities, PDS Technical Report PDS-2014-005
Abstract PDS-2014-005
P2P communities that use credits to incentivize their members to contribute have emerged over the last few years. In particular, private BitTorrent communities keep track of the total upload and download of each member and impose a minimum threshold for their upload/download ratio, which is known as their sharing ratio. It has been shown that these private communities have significantly better download performance than public communities. However, this performance is based on oversupply, and it has also been shown that it is hard for users to maintain a good sharing ratio to avoid being expelled from the community. In this paper, we address this problem by introducing a speculative download mechanism to automatically manage user contribution in private BitTorrent communities. This mechanism, when integrated in a BitTorrent client, identifies the swarms that have the biggest upload potential, and automatically downloads and seeds them. In other words, it tries to invest the bandwidth of the user in a profitable way. In order to accurately assess the upload potential of swarms, we analyze a private BitTorrent community and derive through multiple regression a predictor for the upload potential based on simple parameters accessible to each peer. The speculative download mechanism uses the predictor to build a cache of profitable swarms to which the peer can contribute. Our results show that 75% of investment decisions result in an increase in upload bandwidth utilization, with a median 207% return on investment.
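To illustrate the quantities involved, the sketch below computes a member's sharing ratio and the return on investment of a speculative download. Both formulas are the standard definitions; the numbers are illustrative, not the report's measurements.

```python
# Sharing-ratio bookkeeping as enforced by private BitTorrent communities,
# and the ROI of investing bandwidth in a speculative download (sketch).

def sharing_ratio(uploaded_bytes, downloaded_bytes):
    """Lifetime upload/download ratio; communities expel members whose ratio
    stays below a community-defined threshold."""
    if downloaded_bytes == 0:
        return float("inf")
    return uploaded_bytes / downloaded_bytes

def roi_percent(upload_gained, bandwidth_invested):
    """Return on investment of one speculative download, as a percentage."""
    return 100.0 * (upload_gained - bandwidth_invested) / bandwidth_invested

# A member with 50 GB uploaded against 100 GB downloaded sits at ratio 0.5;
# investing 100 GB of bandwidth that yields 307 GB of upload is a 207% return.
at_risk = sharing_ratio(50, 100)
gain = roi_percent(307, 100)
```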
-
Libswift: the PPSPP Reference Implementation, PDS Technical Report PDS-2014-004
Abstract PDS-2014-004
The Peer-to-Peer Streaming Peer Protocol (PPSPP) is a transmission protocol for disseminating the same content to a group of interested parties in a streaming fashion. PPSPP is currently proposed as a standard and waiting for approval at the IETF. PPSPP supports streaming of both pre-recorded (on-demand) and live audio/video content. It is based on the peer-to-peer (P2P) paradigm, where clients consuming the content are put on equal footing with the servers initially providing the content, to create a system where everyone can potentially provide upload bandwidth.
Like any standard protocol description, PPSPP defines in great detail how the protocol must behave and how information is exchanged between communicating nodes. While PPSPP, including its behaviour and message structure, is clearly described in the draft, the details of the implementation are left to the developer.
To work at its best, a transmission protocol needs to provide several features, such as a well-defined and extensible communication mechanism, flow and congestion control, and a good interface to external programs, and it has to make efficient use of the available resources without being too intrusive. While the official PPSPP draft clearly describes many of those features, such as the communication pattern, the message structure, and the congestion control, some details of the implementation are not presented, as they are out of scope for the protocol description.
In addition, the PPSPP draft presents several novel ideas and data structures for increasing its efficiency, of which the implementation details have not been discussed before. In this technical report, we describe in detail four aspects of the PPSPP reference implementation Libswift: the binmap, the state machine, the pull mechanism, and experimental results of Libswift in challenging environments.
-
A Survey of Parallel Graph Processing Frameworks, PDS Technical Report PDS-2014-003
Abstract PDS-2014-003
As graph analysis tasks see a significant growth in complexity - as exposed by recent advances in complex-network analysis, information retrieval and data mining, and even logistics - the productivity of deploying such complex graph-processing applications becomes a significant bottleneck. Therefore, many programming paradigms, models, and frameworks - collectively, graph-processing systems - have been proposed to tackle this challenge. At the same time, many data collections have exploded in size, posing huge performance problems. Modern graph-processing systems strive to find the best balance between simple, user-friendly, productivity-enhancing front-ends and high-performance back-ends for the analyses they enable.
Since 2004, more than 80 systems have been proposed by both academia and industry. However, a clear overview of these systems is lacking. Therefore, in this work, we survey scalable frameworks aimed at efficiently processing large-scale graphs and present a taxonomy of over 80 systems. Useful for both users and researchers, we provide an overview of state-of-the-art techniques and of the remaining challenges related to graph-processing frameworks.
-
An Empirical Performance Evaluation of Distributed SQL Query Engines, PDS Technical Report PDS-2014-002
Abstract PDS-2014-002
Distributed SQL Query Engines (DSQEs) are increasingly used in a variety of domains, but users - especially those in small companies with little expertise - face the challenge of selecting an appropriate engine for their specific applications. Although both industry and academia are attempting to come up with high-level benchmarks, the performance of DSQEs has never been explored or compared in depth. We propose an empirical method for evaluating the performance of DSQEs with representative metrics, datasets, and system configurations. We implement a micro-benchmarking suite of three classes of SQL queries for both a synthetic and a real-world dataset, and we report response time, resource utilization, and scalability. We use our micro-benchmarking suite to analyze and compare three state-of-the-art engines, viz. Shark, Impala, and Hive. We gain valuable insights for each engine and we present a comprehensive comparison of these DSQEs. We find that different query engines have widely varying performance: Hive is consistently outperformed by the other engines, but whether Impala or Shark is the best performer depends highly on the query type.
-
Statistical Characterization of Business-Critical Workloads Hosted in Cloud Datacenters, PDS Technical Report PDS-2014-001
Abstract PDS-2014-001
Business-critical workloads---web servers, mail servers, app servers, etc.---are increasingly hosted in virtualized datacenters acting as Infrastructure-as-a-Service clouds (cloud datacenters). Understanding how business-critical workloads demand and use resources is key in capacity sizing, in infrastructure operation and testing, and in application performance management. However, relatively little is currently known about these workloads, because the information is complex---large-scale, heterogeneous, shared-clusters---and because datacenter operators remain reluctant to share such information. Moreover, the few operators that did share data (Google, Cloudera, many supercomputing centers, etc.) have enabled studies in business intelligence (MapReduce), search, and scientific computing (HPC), but not in business-critical workloads.
To alleviate this situation, in this work we conduct the first comprehensive study of business-critical workloads hosted in cloud datacenters. We collect 2 large-scale and long-term workload traces corresponding to requested and actually used resources in a distributed datacenter servicing business-critical workloads. Our comprehensive datasets focus on four key types of resources, all of which can become bottlenecks for business-critical applications: CPU, disk I/O, and, rare among studies of datacenter workloads, memory and network I/O. We characterize the demand and use of these resources and conduct an analysis of basic statistics, including correlations, and of basic time-patterns.
Our study gives evidence that, for business-critical workloads, the workload is more dynamic than for other classes of hosted workloads: most of the virtual machines require fewer than 4 CPU cores and less than 8 GB of memory, the requested number of CPUs and the requested memory size are correlated, the resource usage is low (10%) relative to the request, and the resource usage is bursty but otherwise predictable even over the short term. Lastly, we release the traces for public use via the open-access Grid Workloads Archive.
2013
-
The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation, Technical Report PDS-2013-008
Abstract PDS-2013-008
The commoditization of big data analytics, that is, the deployment, tuning, and future development of big data processing platforms such as MapReduce, relies on a thorough understanding of relevant use cases and workloads. In this work we propose BTWorld, a use case for time-based big data analytics that is representative for processing data collected periodically from a global-scale distributed system. BTWorld enables a data-driven approach to understanding the evolution of BitTorrent, a global file-sharing network that has over 100 million users and accounts for a third of today's upstream traffic. We describe for this use case the analyst questions and the structure of a multi-terabyte data set. We design a MapReduce-based logical workflow, which includes three levels of data dependency - inter-query, inter-job, and intra-job - and a query diversity that make the BTWorld use case challenging for today's big data processing tools; the workflow can be instantiated in various ways in the MapReduce stack. Last, we instantiate this complex workflow using Pig-Hadoop-HDFS and evaluate the use case empirically. Our MapReduce use case has challenging features: small (kilobytes) to large (250 MB) data sizes per observed item, excellent (10^-6) and very poor (10^2) selectivity, and short (seconds) to long (hours) job durations.
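Selectivity here can be read as the ratio of output size to input size of a query; the helper below illustrates the two extremes named in the abstract with toy numbers (the 10^-6 and 10^2 figures themselves come from the report's workloads, not from this data).

```python
# Query selectivity as output/input ratio (sketch with illustrative numbers).

def selectivity(rows_out, rows_in):
    """Values far below 1 mean a highly selective query; values above 1 mean
    the query expands its input."""
    return rows_out / rows_in

excellent = selectivity(1, 1_000_000)  # 1e-6: one row survives per million
very_poor = selectivity(100, 1)        # 1e2: each input row fans out 100-fold
```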
-
Estimating user interaction strength in online networks, PDS Technical Report PDS-2013-007
Abstract PDS-2013-007
Online networks like Facebook and BitTorrent are based on user interactions such as wall posts and content exchange. In such systems, user relationships can be used to enhance security and promote cooperation, but in order to be meaningful, these relationships should be based on user interaction strength instead of "binary" friendships. To date, several theoretical, centralized schemes for estimating user interaction strength have been proposed. Here we present the design, deployment, and analysis of the UISE scheme for User Interaction Strength Estimation for both centralized and decentralized online networks.
Among the strong points of UISE is that it captures direct and indirect user interactions, that it scales with only partial information dissemination in decentralized systems, and that it provides disincentives for malicious user behaviors. We apply UISE to detect user interaction patterns based on wall posts in Facebook, which resemble those observed in offline human society. We further apply UISE to online-time estimation based on rendezvous as user interactions in Tribler, an online network for media and social applications like file sharing, streaming, and voting. We demonstrate the accuracy and scalability of UISE with different information dissemination protocols and user behaviors using simulations, emulations, and a real-world deployment.
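The idea of combining direct and indirect interactions can be sketched as follows. The one-hop propagation through common neighbors, the min-based aggregation, and the decay factor are illustrative assumptions for this sketch, not the UISE formula from the report.

```python
# Illustrative direct + indirect interaction strength (not the UISE formula).

def interaction_strength(direct, decay=0.5):
    """direct: dict mapping (u, v) -> count of direct interactions.
    Indirect strength of (u, w) is accumulated through common neighbors v,
    bounded by the weaker of the two hops and scaled by `decay`."""
    strength = {pair: float(c) for pair, c in direct.items()}
    users = {u for pair in direct for u in pair}
    for u in users:
        for w in users:
            if u == w or (u, w) in strength:
                continue
            via = sum(min(direct.get((u, v), 0), direct.get((v, w), 0))
                      for v in users if v not in (u, w))
            if via:
                strength[(u, w)] = decay * via
    return strength

# a interacted with b 4 times, b with c twice: a and c get an indirect score.
s = interaction_strength({("a", "b"): 4, ("b", "c"): 2})
```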
-
Identifying the Key Features of Intel Xeon Phi: A Comparative Approach, PDS Technical Report PDS-2013-006
Abstract PDS-2013-006
With the increasing diversity of many-core processors, it becomes more and more difficult to guarantee performance portability with a unified programming model. The main reason lies in architecture disparities: e.g., CPUs and GPUs have different architectural features, which leads to differences in performance optimization techniques. Thus, it is necessary to abstract the performance-critical key features of many-core processors.
In this paper, taking Intel's Xeon Phi as a case study, we present a two-stage comparative approach to abstracting key features. Our approach needs a reference processor and is executed at both the application level and the microbenchmark level. We select multiple benchmarks from the Parboil benchmark suite and measure the performance differences to identify performance factors. Further, we perform an in-depth analysis to identify the key features with microbenchmarks. Finally, we briefly discuss a use case in our optimizing framework.
-
Benchmarking Intel Xeon Phi to Guide Kernel Design, PDS Technical Report PDS-2013-005
Abstract PDS-2013-005
With a minimum of 50 cores, Intel's Xeon Phi is a true many-core architecture. Featuring fairly powerful cores, two levels of caches, and a very fast interconnect, the Xeon Phi is able to achieve a theoretical peak of 1000 GFLOPS of compute and over 240 GB/s of memory bandwidth. These numbers, as well as its flexibility - it can be used as either a coprocessor or a stand-alone processor - are very tempting for parallel applications looking for new performance records.
In this paper, we present four hardware-centric guidelines and a machine model for Xeon Phi programmers in search of performance. Specifically, we have benchmarked the main hardware components of the processor - the cores, the memory hierarchy, and the ring interconnect. We show that, in ideal microbenchmarking conditions, the achieved performance is very close to the theoretical one given in the official programmer's guide. Furthermore, we have identified and quantified several causes of significant performance penalties, which are not covered by the official documentation. Based on this information, we synthesized four optimization guidelines and applied them to a set of kernels, aiming to systematically optimize their performance. The optimization process is guided by performance roofs, derived from the same benchmarks. Our experimental results show that, using this strategy, we can achieve impressive performance gains and, more importantly, high utilization of the processor.
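The "performance roofs" mentioned above follow the roofline idea: a kernel's attainable throughput is bounded either by the compute peak or by memory bandwidth times the kernel's arithmetic intensity. The sketch below uses the abstract's ballpark Xeon Phi figures purely as example parameters.

```python
# Roofline bound: attainable GFLOPS for a given arithmetic intensity
# (FLOP per byte moved). Peak numbers are the abstract's ballpark figures.

def roofline(intensity_flops_per_byte, peak_gflops=1000.0, peak_gbps=240.0):
    """Attainable GFLOPS: min of the compute roof and the bandwidth roof."""
    return min(peak_gflops, peak_gbps * intensity_flops_per_byte)

memory_bound = roofline(1.0)   # limited by bandwidth: 240 GFLOPS
compute_bound = roofline(8.0)  # limited by the cores: 1000 GFLOPS
```

Comparing a kernel's measured throughput against its roof tells the programmer which of the four guidelines (compute-side or memory-side) is worth applying next.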
-
How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis: Extended Report, PDS Technical Report PDS-2013-004
Abstract PDS-2013-004
Giraph, GraphLab, and other graph-processing platforms are increasingly used in a variety of domains, including social networking and gaming, targeted advertisements, and bioinformatics. Although both industry and academia are developing and tuning graph-processing algorithms and platforms, the performance of graph-processing platforms has never been explored or compared in-depth. Thus, users face the daunting challenge of selecting an appropriate platform for their specific application and even dataset. To alleviate this challenge, in this work we propose and apply an empirical method for evaluating and comparing graph-processing platforms. We define a benchmarking suite for graph-processing platforms, which includes a comprehensive process and a selection of representative metrics, datasets, and algorithmic classes. In our process, we focus on evaluating for each system the basic performance, the resource utilization, the scalability, and various performance overheads. Our selection includes five classes of graph-processing algorithms and seven graphs of up to 1.8 billion edges each. We report on job execution time and various normalized metrics. Finally, we use our benchmarking suite on six different platforms and, besides the valuable insights gained for each platform, we also present the first comprehensive comparison of graph-processing platforms.
-
Towards a workload model for online social applications: Extended report, PDS Technical Report PDS-2013-003
Abstract PDS-2013-003
Popular online social applications hosted by social platforms each serve millions of interconnected users. Understanding the workloads of these applications is key to improving the management of their performance and costs. In this work, we analyse traces gathered over a period of thirty-one months for hundreds of Facebook applications. We characterize the popularity of applications, which describes how applications attract users, and their evolution pattern, which describes how the number of users changes over the lifetime of an application. We further model both application popularity and evolution, and validate our model statistically by fitting five probability distributions to empirical data for each of the model variables. Among the results, we find that most applications reach their maximum number of users within a third of their lifetime, and that the lognormal distribution provides the best fit for the popularity distribution.
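A lognormal fit like the one reported can be obtained by maximum likelihood without any fitting library: if X is lognormal, ln(X) is normal, so mu and sigma are simply the mean and standard deviation of the log-data. The sample values below are made up for illustration; the study fits five candidate distributions, of which this is only one.

```python
# Maximum-likelihood lognormal fit via the log-transform (sketch).
import math
from statistics import mean, pstdev

def fit_lognormal(samples):
    """Return (mu, sigma) of the normal distribution underlying the data."""
    logs = [math.log(x) for x in samples]
    return mean(logs), pstdev(logs)

# Hypothetical application-popularity counts (users per application).
mu, sigma = fit_lognormal([10, 100, 1000])
```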
-
Dispersy Bundle Synchronization, PDS Technical Report PDS-2013-002
Abstract PDS-2013-002
In this technical report we present Dispersy, a fully decentralized system for data bundle synchronization, capable of running in challenged network environments. Key features of Dispersy are stateless synchronization using Bloom filters, decentralized NAT traversal, and data bundle selection algorithms that allow the system to scale to over 100,000 bundles in the presence of high-churn and high-load scenarios. The versatility and good performance of Dispersy are shown by comparing it to Cassandra using the Yahoo! Cloud Serving Benchmark (YCSB). We have modified YCSB to include nodes randomly joining and leaving, a kind of behaviour that is very typical of challenged network environments. Furthermore, we have integrated Dispersy in the BitTorrent client Tribler and show that it performs very well in various real-time challenged network scenarios (3G and WiFi).
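A minimal sketch of Bloom-filter-based synchronization in the spirit of Dispersy: a node advertises a Bloom filter of the bundle ids it holds, and a peer replies with every bundle absent from the filter. The filter size, the number of hash functions, and their derivation from SHA-256 are illustrative choices for this sketch, not Dispersy's actual parameters.

```python
# Stateless bundle synchronization with a Bloom filter (illustrative sketch).
import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=4):
        self.bits, self.hashes, self.bitmap = bits, hashes, 0

    def _positions(self, item):
        # Derive k bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits
                for i in range(self.hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bitmap |= 1 << pos

    def __contains__(self, item):
        return all(self.bitmap >> pos & 1 for pos in self._positions(item))

def missing_bundles(local_bundles, remote_filter):
    """Bundles the remote node (probably) lacks; a Bloom filter never misses
    an item it contains, so nothing already synchronized is re-sent."""
    return [b for b in local_bundles if b not in remote_filter]

remote = BloomFilter()
for bundle_id in ["bundle-1", "bundle-2"]:
    remote.add(bundle_id)
to_send = missing_bundles(["bundle-1", "bundle-2", "bundle-3"], remote)
```

This exchange is stateless: neither side needs to remember what the other has seen, which is what lets the scheme survive high churn.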
-
Understanding User Behavior in Spotify, PDS Technical Report PDS-2013-001
Abstract PDS-2013-001
Spotify is a peer-assisted music streaming service that offers instant access to a vast music catalogue, and it has gained worldwide popularity in the past few years. Until now, little has been known about user behavior in such services. In this paper, we study user behavior in Spotify by analyzing a massive dataset collected between 2010 and 2011. Firstly, we investigate the system dynamics, including session arrival patterns, daily variations of session length, and playback arrival patterns. Secondly, we analyze the behavioral patterns of individual users on multiple devices and on single devices, respectively. Our analysis reveals the patterns of users switching between multiple devices in successive sessions and the favorite times of day for Spotify users. We also show the correlations between both the length and the downtime of successive user sessions on single devices. Our findings not only deepen our understanding of user behavior in Spotify, but also shed light on user behavior in other music streaming services. We believe the results may also be used for building other types of services such as P2P social networks. In particular, our analysis of mobile user behavior provides valuable insights for developing systems on mobile platforms.
2012
-
The Game Trace Archive: A Technical Report, PDS Technical Report PDS-2012-005
Abstract PDS-2012-005
Spurred by the rapid development of the gaming industry and the expansion of Online Meta-Gaming Networks (OMGNs), many gaming studies and measurements have been conducted in recent years. However, few or no traces of games and OMGNs are publicly available to game researchers and practitioners. Moreover, the few traces that are available are shared using diverse formats. This situation is an obstacle in exchanging, studying, and using game traces. To address this problem, we design the Game Trace Archive (GTA) to be a virtual meeting space for the game community. We identify five main requirements to build an archive for game traces, and address them with the GTA. We propose a unified format for game traces, and introduce a number of tools associated with the format. With these tools, we collect, process, and analyze 9 traces of both games and OMGNs. We collect in the GTA traces corresponding to more than 8 million real players and more than 200 million information items, spanning over 14 operational years. We also show that the GTA can be extended to include a variety of real-game trace types. Finally, we discuss possible applications of the GTA in the gaming area, such as game resource management, Quality of Experience for players, and advertisement.
-
Memory Access Patterns on Architectures with Local Memory: A Performance Database, PDS Technical Report PDS-2012-002
Abstract PDS-2012-002
Nowadays, architectures and their programming-model implementations are becoming increasingly complex and diverse, making the performance benefits of using local memory unpredictable via simplistic modeling alone. In this paper, we present a benchmark-based approach to tackle this issue. We first present a two-part approach to describe memory access patterns (MAPs) for many-thread applications. For each MAP, we design benchmarks in native versions (without local memory) and optimized versions (using local memory). Then we evaluate them on typically used platforms (NVIDIA GTX280, NVIDIA GTX580, AMD HD6970, and Intel E5620), compare the performance of the native versions versus the optimized versions, and obtain a performance database. This database can provide essential information for the automated usage of local memory.
2011
-
CLVectorizer: A Source-to-Source Vectorizer for OpenCL Kernels, PDS Technical Report PDS-2011-012
Abstract PDS-2011-012
While many-core processors offer multiple layers of hardware parallelism to boost performance, applications are lagging behind in exploiting them effectively. A typical example is vector parallelism (SIMD), offered by many processors, but used by too few applications.
In this paper we discuss two different strategies to enable the vectorization of naive OpenCL kernels. Further, we show how these solutions are successfully applied to four different applications on three different many-core platforms. Our results demonstrate significant improvements in both achieved bandwidth and execution time for most (application, platform) pairs. We therefore conclude that vectorization is not a computation-only optimization for OpenCL kernels, but one that enables applications to better utilize the hardware.
Using our experience with the vectorization, we present a refinement of the process into a two-module framework to assist programmers in optimizing OpenCL code by considering the specifics of the target architectures. We argue that such a framework can further speed up applications based on the current work, and we also show the requirements for making such an extension.
-
A Detailed Performance Analysis of the OpenMP Rodinia Benchmark, PDS Technical Report PDS-2011-011
Abstract PDS-2011-011
The development of multi-core processors has aroused extensive interest in parallel programming model research. OpenMP, one of the most widely used shared-memory programming models, has been proposed and improved for more than ten years. In our work, we carry out a set of intensive performance experiments using the OpenMP Rodinia benchmark. We cover eleven different types of applications and conduct the experiments on three multi-core CPUs. In order to investigate the OpenMP scalability and the best performance that OpenMP can achieve, we vary the scale of the datasets in each application and deploy different numbers of OpenMP threads for each experiment run. We use the IQR method to guarantee a more reliable analysis. According to our results, we find that OpenMP shows diversified performance behaviors among different applications, platforms, and datasets. For most applications, it performs reasonably and scales well, reaching the maximum performance around the number of hardware cores/threads of the underlying hardware platforms. The results will be used for further research on comparing programming models on multi-cores.
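The IQR method mentioned above is the conventional interquartile-range filter for discarding outlier runs before analysis; a compact sketch (the 1.5x cut-off is the standard choice, assumed here, and `statistics.quantiles` requires Python 3.8+).

```python
# Interquartile-range (IQR) outlier filter for repeated measurement runs.
from statistics import quantiles

def iqr_filter(times):
    """Keep only runs within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(times, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [t for t in times if lo <= t <= hi]

# Five stable runs and one disturbed run (e.g., hit by OS noise).
runs = [10.1, 10.3, 9.9, 10.2, 10.0, 25.7]
stable = iqr_filter(runs)
```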
-
An Analysis of Provisioning and Allocation Policies for Infrastructure-as-a-Service Clouds: Extended Results, PDS Technical Report PDS-2011-009
Abstract PDS-2011-009
Today, many commercial and private cloud computing providers offer resources for leasing under the infrastructure as a service (IaaS) paradigm. Although an abundance of mechanisms already facilitate the lease and use of single infrastructure resources, to complete multi-job workloads IaaS users still need to select adequate provisioning and allocation policies to instantiate resources and map computational jobs to them. While such policies have been studied in the past, no experimental investigation in the context of clouds currently exists that considers them jointly. In this paper we present a comprehensive and empirical performance-cost analysis of provisioning and allocation policies in IaaS clouds. We first introduce a taxonomy of both types of policies, based on the type of information used in the decision process, and map to this taxonomy eight provisioning and four allocation policies. Then, we analyze the performance and cost of these policies through experimentation in three clouds, including Amazon EC2. We show that policies that dynamically provision and/or allocate resources can achieve better performance and cost, but that the lack of adaptation to the specific pricing model and technical capabilities of the used cloud can significantly reduce cost-effectiveness. We also look at the interplay between provisioning and allocation, for which we show preliminary results.
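One of the simplest dynamic provisioning policies in the spirit of this taxonomy can be sketched as below: size the VM pool to the job queue, within fixed bounds. The per-VM capacity and the pool bounds are illustrative assumptions, not the policies evaluated in the report.

```python
# Queue-driven dynamic VM provisioning (illustrative sketch).

def target_vms(queued_jobs, jobs_per_vm=2, min_vms=1, max_vms=10):
    """Target pool size for the next provisioning step: enough VMs to drain
    the queue at jobs_per_vm each, clamped to [min_vms, max_vms]."""
    needed = -(-queued_jobs // jobs_per_vm)  # ceiling division
    return min(max_vms, max(min_vms, needed))

# A burst of 5 queued jobs asks for 3 VMs; an idle system shrinks back to 1.
burst = target_vms(5)
idle = target_vms(0)
```

Such a policy adapts pool size to demand, but, as the experiments above show, its cost-effectiveness still depends on the pricing model of the specific cloud (e.g., per-hour billing penalizes aggressive scale-down).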
-
It's not that simple: an empirical study on the influence of locality on download speed in BitTorrent, PDS Technical Report PDS-2011-008
Abstract PDS-2011-008
Traffic localization is currently considered a promising approach to decrease the load of P2P systems on the Internet backbone. A number of solutions that promote P2P locality have therefore been proposed. However, their evaluation has been largely focused on simulations with limited validation against real-world systems.
In this paper we empirically measure the influence of locality on end-user download speed in BitTorrent. For that, we design Proximity Toolbox, a set of tools that uses cached round-trip-time measurements to estimate latency and feed close peers to a regular BitTorrent client. We run experiments from five vantage points in North America, Europe and Asia downloading real content using real-world connectivity.
We find that locality influences end-user speed in different ways in different locations: it has no effect in one location, generates a significant increase (up to 57.65% on average) in three out of five, but also leads to a speed decrease (up to 37%) in one location. Furthermore, we show that this diversity of effects is masked if one overlooks the diversity of vantage points: analyzing all locations together, one can observe a significant positive influence of locality on download speed.
Our measurements reveal a more complex picture of P2P-locality than the one considered in the design and evaluation of previous locality-based solutions. Even with few measurement points, we are able to contradict the claim that P2P-locality does not lower end-user performance. This, in turn, points to the need for further research on the factors determining the effectiveness of locality.
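The peer-selection step described in the abstract above, ranking candidate peers by cached round-trip-time measurements, can be sketched as follows. The function name and the treatment of peers with no cached RTT (sorted last) are assumptions for illustration, not details of the Proximity Toolbox itself:

```python
def closest_peers(candidates, rtt_cache, k=10):
    """Rank candidate peers by cached RTT (milliseconds); peers with no
    cached measurement sort last. Return the k closest peers."""
    return sorted(candidates, key=lambda p: rtt_cache.get(p, float("inf")))[:k]
```

A locality-promoting client would then feed the returned peers to the regular BitTorrent peer-exchange machinery in place of (or ahead of) tracker-supplied random peers.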
-
Human Mobility in Virtual and Real Worlds: Characterization, Modeling, and Implications, PDS Technical Report PDS-2011-007
Abstract PDS-2011-007
The performance of networked systems that support common human activities, such as mobile communication and participation in online games, depends on the characteristics of human mobility. Human mobility models are crucial in the design and tuning of these systems, but depend on detailed and long-term human mobility traces, which, in turn, are difficult to obtain from real environments. Thus, although many human mobility models have already been developed, few have been validated against adequate traces. In contrast, in this work we investigate the use of human mobility traces collected from realistic, human-like virtual worlds for the purpose of understanding human mobility and developing new mobility models. We collect traces for over 30,000 virtual citizens of a popular virtual world, and compare these traces with other real- and virtual-world traces. Our analysis reveals that the mobility traces of humans in virtual and real worlds have many similar characteristics. We further propose SAMOVAR, a model for human mobility in virtual and real worlds that takes into account mobility characteristics such as population-wide and individual area popularity. Based on synthetic traces generated from our and three other models, we conduct a simulation-based analysis of the impact of human mobility on the performance of virtual and real networked environments. We find that SAMOVAR leads to useful insights, and discriminates better than the other models among the performance of different networked environments.
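The core idea named in the abstract above, mixing population-wide area popularity with individual preferences, can be sketched as a weighted next-area choice. The mixing weight `w` and the function shape are assumptions for illustration; they are not the actual SAMOVAR formulation:

```python
import random

def next_area(areas, popularity, personal_pref, w=0.5, rng=random):
    """Choose the next area to visit by mixing population-wide area
    popularity with an individual's own area preferences.
    'w' (an assumed parameter) balances the two distributions."""
    weights = [w * popularity[a] + (1 - w) * personal_pref.get(a, 0.0)
               for a in areas]
    return rng.choices(areas, weights=weights, k=1)[0]
```

With `w=0` the walker follows only its personal preferences; with `w=1` it follows the crowd, which is the kind of knob a trace-driven study would calibrate.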
-
On the Design of a Practical Information-Centric Transport, PDS Technical Report PDS-2011-006
Abstract PDS-2011-006
A recent chain of exploratory publications proposed information-centric architectures for the Internet. The common pitfall of such proposals is the imbalance of upfront costs and immediate benefits. To address this concern, we focus on the prospects of piecemeal adoption. We start with the necessary basic primitives of any infocentric architecture, primarily self-certifying names, to determine what they are sufficient for. We define natural interfaces to other parts of the architecture, primarily naming and routing, for which no special assumptions are made. We define a natural separation of infocentric transport and internetworking layers and their message vocabulary, which allows our infocentric transport to run over IP, UDP, TCP, HTTP, or entirely IP-free. As a proof of concept, we have implemented a UDP-based transport protocol named swift with per-datagram data integrity checks. Our architecture highly prioritizes modularity and sketches a path for piecemeal adoption, which we consider a critical enabler of any progress in the field.
-
Binmaps: Hybridizing Bitmaps and Binary Trees, PDS Technical Report PDS-2011-005
Abstract PDS-2011-005
This report addresses the classical problem of keeping huge bitmaps predominantly consisting of long ranges of zeros and ones. The problem is most often encountered in filesystems (free space tracking) and network protocols (transmission progress tracking).
Three classical solutions to the problem are plain bitmaps (NTFS), extent lists (TCP SACK), and extent binary trees (XFS, Btrfs). Bitmaps are simple but have high fixed space requirements. Lists are able to aggregate solid ranges, but they don't scale well with regard to search. Extent binary trees are capable of aggregation and allow scalable search, but have high overhead and extremely bad worst-case behavior, potentially exploding to sizes a couple of orders of magnitude larger than plain bitmaps. The latter problem is sometimes resolved by ad-hoc means, e.g. by converting parts of an extent tree to bitmaps (Btrfs). Another possible workaround is to impose a divide-and-conquer multilayered unit system (BitTorrent).
We introduce a new data structure named "binmap'', a hybrid of bitmap and binary tree, which resolves the shortcomings of the extent binary tree approach. Namely, (a) it has lower average-case overhead, and (b) because it is tolerant of patchy bitmaps, its worst-case behavior is dramatically better.
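The hybrid idea described above can be sketched as a binary tree over a power-of-two range in which a node is either an all-zeros leaf, an all-ones leaf, or a split into two halves, with solid ranges coalescing back into single leaves. This is a minimal illustrative sketch of the coalescing principle, not the report's actual binmap layout (which packs such a tree into bitmap-like cells):

```python
class Binmap:
    """Sketch of a bitmap/binary-tree hybrid over [0, size), size a power of
    two. A node is False (all zeros), True (all ones), or a (left, right)
    pair. Solid ranges collapse into single leaves, keeping the tree small."""

    def __init__(self, size):
        self.size = size
        self.root = False  # initially an all-zeros leaf

    def set(self, index):
        self.root = self._set(self.root, 0, self.size, index)

    def _set(self, node, lo, hi, index):
        if node is True:
            return True               # range already filled
        if hi - lo == 1:
            return True               # single-bit leaf becomes filled
        mid = (lo + hi) // 2
        left, right = node if isinstance(node, tuple) else (node, node)
        if index < mid:
            left = self._set(left, lo, mid, index)
        else:
            right = self._set(right, mid, hi, index)
        if left is True and right is True:
            return True               # coalesce a solid range into one leaf
        return (left, right)

    def get(self, index):
        node, lo, hi = self.root, 0, self.size
        while isinstance(node, tuple):
            mid = (lo + hi) // 2
            if index < mid:
                node, hi = node[0], mid
            else:
                node, lo = node[1], mid
        return node
```

Setting every bit in a subrange collapses that subtree to a single `True` leaf, which is exactly why long runs of ones or zeros cost O(1) space instead of one bit each.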
-
From user experience to strategies: how to survive in a private BitTorrent community, PDS Technical Report PDS-2011-004
Abstract PDS-2011-004
Many private BitTorrent communities employ Sharing Ratio Enforcement (SRE) schemes to incentivize users to contribute their upload resources. It has been demonstrated that communities that use SRE are greatly oversupplied, i.e., they have much higher seeder-to-leecher ratios than communities in which SRE is not employed. The first-order effect of oversupply under SRE is an increase in the average download speed. However, in this paper we show that the oversupply induces severe side-effects, and that under SRE 1) users are forced to seed for extremely long times to maintain sharing ratios adequate for starting new downloads, and 2) many users have seeded for very long times but still have low sharing ratios, due to the counter-intuitive phenomenon that a long seeding time does not necessarily lead to a large upload amount. We propose strategies for peers to gain sharing ratios more efficiently. We further analyze the existing community strategies beyond SRE, which were initially designed to further incentivize contribution. We show that some strategies have limited or even negative effects on system performance, and we propose our remedies.
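The SRE admission rule discussed above can be sketched as a simple ratio check. The threshold value `min_ratio=0.7` and the grace allowance for brand-new users are assumptions for illustration; actual thresholds vary per community:

```python
def can_start_download(uploaded_bytes, downloaded_bytes, min_ratio=0.7):
    """Under SRE, a user may start new downloads only while their sharing
    ratio (total upload / total download) stays above a community threshold.
    New users with no download history are assumed to get a grace allowance."""
    if downloaded_bytes == 0:
        return True
    return uploaded_bytes / downloaded_bytes >= min_ratio
```

The second side-effect noted in the abstract follows directly from this rule: in an oversupplied swarm a seeder finds few leechers to upload to, so `uploaded_bytes` grows slowly no matter how long the user seeds.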
-
Bandwidth Allocation in BitTorrent-like VoD Systems under Flashcrowds, PDS Technical Report PDS-2011-003
Abstract PDS-2011-003
The efficiency of BitTorrent in content distribution has inspired a number of peer-to-peer (P2P) protocols for on-demand video (VoD) streaming systems (henceforth BitTorrent-like VoD systems). However, the fundamental quality-of-service (QoS) requirements of VoD (i.e., providing peers with smooth playback continuity and a short startup delay) make the design of these systems more challenging than that of normal file-sharing systems. In particular, the bandwidth allocation strategy is an important aspect in the design of BitTorrent-like VoD systems, which becomes even more crucial in a scenario where a large number of peers join in a short period of time, a phenomenon known as a flashcrowd. In fact, the newly joining peers all demand content while having few or no pieces of content to offer in return yet. An unwise allocation of the limited bandwidth actually available during this phase may cause peers to experience poor QoS.
In this work, we analyze the effects of a flashcrowd on the scalability of a BitTorrent-like VoD system and propose a number of mechanisms to make the bandwidth allocation in this phase more effective. In particular, we derive an upper bound on the number of peers that can be admitted to the system over time, and we find a trade-off between having the seeders minimize the upload of recently injected pieces and achieving high peer QoS. Based on the insights gained from our analysis, we devise flashcrowd-handling algorithms for the allocation of peer bandwidth that improve peer QoS during flashcrowds. We validate the effectiveness of our proposals by means of extensive simulations.
-
RTSenv: An Experimental Environment for Real-Time Strategy Games on Multi-Clusters, PDS Technical Report PDS-2011-002
Abstract PDS-2011-002
Today, Real-Time Strategy (RTS) games entertain tens of millions of players world-wide. This growing population expects new game designs and more scalable games every year. However, few tools and environments exist for game designers and implementers; of these, even fewer are available to researchers and game communities. In this work, we introduce RTSenv, an environment and associated set of tools for RTS games. Our environment can configure and manage the main aspects of RTS games, such as maps, computer-controlled units, and game scenarios. RTSenv leverages multi-cluster systems and reactive fault tolerance mechanisms to perform robust, multi-machine, and multi-instance game experiments. Using our reference implementation of RTSenv on DAS-4, a real multi-cluster system, we show that our approach can be used in a variety of scenarios for game performance evaluation and for comparing game design choices. Our results give evidence that several common assumptions made by researchers about game workloads do not hold in general for RTS games and thus warrant a more detailed investigation.
-
Cost-driven Scheduling of Grid Workflows Using Partial Critical Paths, PDS Technical Report PDS-2011-001
Abstract PDS-2011-001
Recently, utility Grids have emerged as a new model of service provisioning in heterogeneous distributed systems. In this model, users negotiate with service providers on their required Quality of Service and on the corresponding price to reach a Service Level Agreement. One of the most challenging problems in utility Grids is workflow scheduling, i.e., the problem of satisfying the QoS requirements of the users while minimizing the cost of workflow execution. In this paper, we propose a new QoS-based workflow scheduling algorithm based on a novel concept called Partial Critical Paths (PCP), which tries to minimize the cost of workflow execution while meeting a user-defined deadline. The PCP algorithm has two phases: in the deadline distribution phase it recursively assigns sub-deadlines to the tasks on the partial critical paths ending at previously assigned tasks, and in the planning phase it assigns the cheapest service to each task while meeting its sub-deadline. The simulation results show that the performance of the PCP algorithm is very promising.
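The planning phase described above, choosing the cheapest service that still meets each task's sub-deadline, can be sketched as follows. The data shapes (task name to sub-deadline, task name to a list of `(cost, runtime)` service options) are assumptions for illustration; the report's deadline distribution phase, which produces the sub-deadlines, is not reproduced here:

```python
def assign_cheapest_services(sub_deadlines, services):
    """Planning-phase sketch: for each task, pick the cheapest service whose
    estimated runtime meets the task's sub-deadline.
    'sub_deadlines' maps task -> sub-deadline; 'services' maps task -> list
    of (cost, runtime) options. Raises if some sub-deadline is infeasible."""
    plan = {}
    for task, deadline in sub_deadlines.items():
        feasible = [(c, r) for (c, r) in services[task] if r <= deadline]
        if not feasible:
            raise ValueError("no service meets sub-deadline for " + task)
        plan[task] = min(feasible)  # tuples sort by cost first
    return plan
```

Because each task only has to respect its own sub-deadline, this greedy per-task choice is cheap to compute, which is the point of distributing the overall deadline first.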