
ACM Transactions on Parallel Computing (TOPC)

Latest Articles

Pagoda: A GPU Runtime System for Narrow Tasks

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their... (more)
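The bulk-launch model this abstract contrasts against can be made concrete with a minimal CUDA sketch; the kernel name, sizes, and data below are illustrative assumptions only, not Pagoda's API, which the full paper describes.

    // Minimal sketch of conventional bulk launching (not Pagoda's API):
    // one large kernel whose threads cover the whole GPU.
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) data[i] *= factor;                   // one element per thread
    }

    int main() {
        const int n = 1 << 20;                  // a "large task": ~1M elements
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        // Thousands of threads launched in bulk; a "narrow task" would use
        // far fewer threads and leave most of the GPU idle.
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }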

Using Butterfly-patterned Partial Sums to Draw from Discrete Distributions

We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model,... (more)
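The butterfly-patterned scheme itself is in the paper; as a hedged sketch of the underlying idea only, the CUDA kernel below draws one sample per warp by computing partial sums of the outcome probabilities (here a conventional warp-level inclusive scan, not the butterfly pattern) and finding which interval a uniform draw falls into. All names and sizes are illustrative assumptions.

    // Inverse-CDF sampling via warp-level partial sums (illustrative only;
    // the paper's butterfly-patterned scheme is more sophisticated).
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void draw(const float *prob, const float *u, int *out, int k) {
        int lane = threadIdx.x & 31;            // up to 32 outcomes per warp
        float p = (lane < k) ? prob[lane] : 0.0f;
        // Inclusive prefix sum across the warp (Hillis-Steele style).
        for (int d = 1; d < 32; d <<= 1) {
            float v = __shfl_up_sync(0xffffffff, p, d);
            if (lane >= d) p += v;
        }
        // Each lane owns the interval (prev, p]; exactly one lane matches.
        float r = u[blockIdx.x];
        float prev = __shfl_up_sync(0xffffffff, p, 1);
        if (lane == 0) prev = 0.0f;
        if (r > prev && r <= p) out[blockIdx.x] = lane;
    }

    int main() {
        const int k = 8;
        float hp[k] = {.05f, .1f, .2f, .15f, .1f, .1f, .2f, .1f};  // sums to 1
        float hu[1] = {0.42f};                  // one uniform draw
        float *dp, *du; int *dout;
        cudaMalloc(&dp, sizeof hp); cudaMalloc(&du, sizeof hu);
        cudaMalloc(&dout, sizeof(int));
        cudaMemcpy(dp, hp, sizeof hp, cudaMemcpyHostToDevice);
        cudaMemcpy(du, hu, sizeof hu, cudaMemcpyHostToDevice);
        draw<<<1, 32>>>(dp, du, dout, k);
        int s; cudaMemcpy(&s, dout, sizeof s, cudaMemcpyDeviceToHost);
        std::printf("sample = %d\n", s);        // 0.42 lands in outcome 3
        return 0;
    }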

Hyperqueues: Design and Implementation of Deterministic Concurrent Queues

The hyperqueue is a programming abstraction for queues that results in deterministic and scale-free parallel programs. Hyperqueues extend the concept of Cilk++ hyperobjects to provide thread-local views on a shared data structure. While hyperobjects are organized around private local views, hyperqueues provide a shared view on a queue data... (more)
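A minimal host-side C++ sketch of the determinism idea, assuming nothing about the paper's actual API (the types and names below are hypothetical): each task pushes into a private view without locks, and concatenating the views in serial program order reproduces the serial output regardless of how the threads interleave.

    // Determinism via private views (hypothetical sketch, not the
    // hyperqueue API): pushes go to per-task views, merged in task order.
    #include <cstdio>
    #include <deque>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<std::deque<int>> views(4);  // one private view per task
        std::vector<std::thread> tasks;
        for (int t = 0; t < 4; ++t)
            tasks.emplace_back([&views, t] {
                for (int i = 0; i < 3; ++i)
                    views[t].push_back(t * 10 + i);  // no locks: private view
            });
        for (auto &th : tasks) th.join();
        // Concatenating views in serial task order yields the same sequence
        // on every run, independent of thread scheduling.
        for (auto &v : views)
            for (int x : v) std::printf("%d ", x);
        std::printf("\n");
        return 0;
    }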


About TOPC

ACM Transactions on Parallel Computing (TOPC) is a forum for novel and innovative work on all aspects of parallel computing, including foundational and theoretical aspects, systems, languages, architectures, tools, and applications. It will address all classes of parallel-processing platforms including concurrent, multithreaded, multicore, accelerated, multiprocessor, clusters, and supercomputers.

Forthcoming Articles

Tapir: Embedding Recursive Fork-Join Parallelism into LLVM's Intermediate Representation

Processor-Oblivious Record and Replay

Guest Editor Introduction: PPoPP 2017 Special Issue 1 of 2

Extracting SIMD Parallelism from Recursive Task-Parallel Programs

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel's SSE4.2 vector units, as well as accelerators using Intel's AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
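As a rough, hedged illustration of the kind of transformation the abstract describes (not the paper's actual technique or its scheduling policies), the host-side C++ sketch below rewrites a recursive tree sum so that same-depth tasks form an explicit frontier whose base cases run as one flat loop a vectorizing compiler can target. All names and cutoffs are illustrative.

    // Exposing latent data parallelism in a recursive, task-parallel sum
    // (illustrative sketch only): expand tasks breadth-first, then run all
    // uniform base cases as a single flat, SIMD-friendly loop.
    #include <cstdio>
    #include <vector>

    struct Range { int lo, hi; };

    // Original divide-and-conquer form: little data parallelism visible.
    long sum_rec(const int *a, int lo, int hi) {
        if (hi - lo <= 4) {                     // small base case
            long s = 0;
            for (int i = lo; i < hi; ++i) s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;
        return sum_rec(a, lo, mid) + sum_rec(a, mid, hi);  // fork-join
    }

    // Transformed form: breadth-first frontier expansion, flat leaf loop.
    long sum_frontier(const int *a, int n) {
        std::vector<Range> frontier{{0, n}}, next, leaves;
        while (!frontier.empty()) {
            next.clear();
            for (Range r : frontier) {
                if (r.hi - r.lo <= 4) { leaves.push_back(r); continue; }
                int mid = r.lo + (r.hi - r.lo) / 2;
                next.push_back({r.lo, mid});
                next.push_back({mid, r.hi});
            }
            frontier.swap(next);
        }
        long s = 0;
        for (Range r : leaves)                  // uniform leaf work: SIMD-friendly
            for (int i = r.lo; i < r.hi; ++i) s += a[i];
        return s;
    }

    int main() {
        std::vector<int> a(1000, 1);
        std::printf("%ld %ld\n", sum_rec(a.data(), 0, 1000),
                    sum_frontier(a.data(), (int)a.size()));
        return 0;
    }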

