Tight Bounds for Clairvoyant Dynamic Bin Packing
Pagoda: A GPU Runtime System For Narrow Tasks
Tapir: Embedding Recursive Fork-Join Parallelism into LLVM's Intermediate Representation
On Energy Conservation in Data Centers
Near Optimal Parallel Algorithms for Dynamic DFS in Undirected Graphs
Processor-Oblivious Record and Replay
TOPC Introduction to the Special Issue for SPAA'17
Distributed Graph Clustering and Sparsification
The hyperqueue is a programming abstraction for queues that results in deterministic and scale-free parallel programs. Hyperqueues extend the concept of Cilk++ hyperobjects: whereas hyperobjects are organized around private, thread-local views, hyperqueues provide a shared view on a queue data structure. In this way, hyperqueues guarantee determinism for programs that use concurrent queues. We define the programming API and semantics of two instances of the hyperqueue concept, which differ in their API and in the degree of concurrency they extract. We describe the implementation of hyperqueues in a work-stealing scheduler and demonstrate scalable performance on pipeline-parallel benchmarks from PARSEC and StreamIt.
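As a rough illustration of the intended usage, the following minimal C++ sketch shows a producer/consumer pipeline over a FIFO queue. The class name hyperqueue and its push/pop methods are hypothetical stand-ins, not the paper's API, and the stub serializes access with a mutex rather than using the paper's thread-local views and work-stealing integration; it only shows why single-producer FIFO consumption yields a deterministic result.

```cpp
// Hypothetical sketch: `hyperqueue`, `push`, and `pop` are illustrative
// stand-ins, NOT the paper's API. A mutex-guarded deque replaces the
// paper's thread-local views and work-stealing machinery.
#include <cstdio>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>

template <typename T>
class hyperqueue {
  std::deque<T> q_;
  std::mutex m_;
public:
  void push(T v) {
    std::lock_guard<std::mutex> g(m_);
    q_.push_back(std::move(v));
  }
  std::optional<T> pop() {
    std::lock_guard<std::mutex> g(m_);
    if (q_.empty()) return std::nullopt;
    T v = std::move(q_.front());
    q_.pop_front();
    return v;
  }
};

int main() {
  hyperqueue<int> q;
  const int n = 8;
  // Producer task: pushes values in program order.
  std::thread producer([&] {
    for (int i = 0; i < n; ++i) q.push(i);
  });
  // Consumer runs concurrently; with a single producer, FIFO order makes
  // the consumed sequence deterministic: always 0, 1, ..., n-1.
  int seen = 0;
  while (seen < n) {
    if (auto v = q.pop()) {
      std::printf("%d\n", *v);
      ++seen;
    }
  }
  producer.join();
  return 0;
}
```

In the actual design, the abstract's thread-local views on the shared queue would presumably let producer and consumer proceed without a central lock while still preserving this serial ordering.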
Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions
New Cover Time Bounds for the Coalescing-Branching Random Walk on Graphs
The Mobile Server Problem
The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to execute data-parallel computations efficiently in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, such algorithms seem ill suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector-resource utilization and substantial speedup on chips with Intel's SSE4.2 vector units, as well as on accelerators with Intel's AVX-512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
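To make the idea concrete, here is a small, self-contained C++ sketch showing one way latent data parallelism can be exposed; it is an assumption-laden illustration, not the paper's transformation. A recursive divide-and-conquer reduction is expanded breadth-first into a frontier of leaf tasks, whose work then becomes a flat inner loop that a compiler can auto-vectorize with SSE or AVX. The names Range and frontier_sum and the grain parameter are illustrative, and the scheduling policies from the paper are not modeled.

```cpp
// Illustrative sketch only: restructures a recursive sum over [lo, hi)
// into breadth-first frontier expansion so that the leaf-level work
// becomes a dense loop amenable to compiler auto-vectorization.
#include <cstdio>
#include <vector>

struct Range { int lo, hi; };            // one task: sum data[lo, hi)

long long frontier_sum(const std::vector<int>& data, int grain) {
  std::vector<Range> frontier{{0, (int)data.size()}};
  std::vector<Range> leaves;
  // Expand tasks breadth-first until every task is at most `grain` wide,
  // mirroring the divide step of the original recursion.
  while (!frontier.empty()) {
    std::vector<Range> next;
    for (Range r : frontier) {
      if (r.hi - r.lo <= grain) { leaves.push_back(r); continue; }
      int mid = r.lo + (r.hi - r.lo) / 2;
      next.push_back({r.lo, mid});
      next.push_back({mid, r.hi});
    }
    frontier.swap(next);
  }
  // Leaf work is now a flat loop over contiguous elements: the inner loop
  // is the data-parallel kernel a vectorizing compiler can target.
  long long total = 0;
  for (Range r : leaves)
    for (int i = r.lo; i < r.hi; ++i) total += data[i];
  return total;
}

int main() {
  std::vector<int> data(1000);
  for (int i = 0; i < (int)data.size(); ++i) data[i] = i;
  std::printf("%lld\n", frontier_sum(data, 64));   // expect 499500
}
```

The design point this sketch captures is the trade-off named in the abstract: expanding the frontier exposes wide, vectorizable leaf work, while the grain size bounds how much task state is materialized at once, which is the space concern the paper's scheduling policies address.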