ACM Transactions on Parallel Computing (TOPC)

Latest Articles

A Library for Portable and Composable Data Locality Optimizations for NUMA Systems

Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time...

Automatic Scalable Atomicity via Semantic Locking

In this article, we consider concurrent programs in which the shared state consists of instances of linearizable abstract data types (ADTs). We...

Generality and Speed in Nonblocking Dual Containers

Nonblocking dual data structures extend traditional notions of nonblocking progress to accommodate partial methods, both by bounding the number of...

Resource Oblivious Sorting on Multicores

We present a deterministic sorting algorithm, Sample, Partition, and Merge Sort (SPMS), that interleaves the partitioning of a sample sort with merging. Sequentially, it sorts n elements in O(n log n) time cache-obliviously with an optimal number of cache misses. The parallel complexity (or critical path length) of the algorithm is O(log n log log...


About TOPC

ACM Transactions on Parallel Computing (TOPC) is a forum for novel and innovative work on all aspects of parallel computing, including foundational and theoretical aspects, systems, languages, architectures, tools, and applications. It will address all classes of parallel-processing platforms including concurrent, multithreaded, multicore, accelerated, multiprocessor, clusters, and supercomputers. 

Forthcoming Articles
Gunrock: GPU Graph Analytics

For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.

Guest Editor Introduction (1 of 2)

Guest editor introduction for first special issue from PPoPP 2016

SciPAL: Expression Templates and Composition Closure Objects for High Performance Computational Physics with CUDA and OpenMP

We present SciPAL (scientific parallel algorithms library), a C++-based, hardware-independent open-source library.
Its core is a domain-specific embedded language for numerical linear algebra.
The main fields of application are finite element simulations, coherent optics and the solution of inverse problems.
Using SciPAL, algorithms can be stated in a mathematically intuitive way in terms of matrix and vector operations.
Existing algorithms can easily be adapted to GPU-based computing by proper template specialization.
Our library is compatible with the finite element library deal.II and provides a port of deal.II's most frequently used linear algebra classes to CUDA (NVIDIA's extension of the programming languages C and C++ for programming their GPUs).
SciPAL's operator-based API for BLAS operations particularly aims at simplifying the usage of NVIDIA's cuBLAS.
For non-BLAS array arithmetic, SciPAL's expression templates are able to generate CUDA kernels at compile time.
We demonstrate the benefits of SciPAL using iterative principal component analysis as an example; it is the core algorithm for the spike-sorting problem in neuroscience.

GPU Multisplit: an extended study of a parallel algorithm

Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer.
Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort.
One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it.
When lower-indexed buckets hold smaller-valued keys, another way to implement multisplit is to sort the input data directly.
Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket.
In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets.
We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage.
We also hierarchically reorder input elements to achieve better coalescing of global memory accesses.
On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit.
Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
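
To make the multisplit primitive described above concrete, the following is a minimal sequential sketch in Python. It is a reference model only, not the warp-synchronous GPU implementation the abstract refers to: it counts elements per bucket, takes an exclusive prefix sum to get each bucket's start offset, and scatters elements in input order. The names multisplit, bucket_of, and num_buckets are assumptions chosen for illustration.

```python
from itertools import accumulate


def multisplit(keys, bucket_of, num_buckets):
    """Permute keys into contiguous buckets, preserving input order per bucket.

    bucket_of is the programmer-supplied function mapping a key to a bucket
    ID in range(num_buckets). Returns (permuted_keys, bucket_start_offsets).
    """
    ids = [bucket_of(k) for k in keys]

    # Histogram: number of elements falling into each bucket.
    counts = [0] * num_buckets
    for b in ids:
        counts[b] += 1

    # Exclusive prefix sum of the histogram: start offset of each bucket.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Scatter: place each key at the next free slot of its bucket.
    out = [None] * len(keys)
    cursor = list(offsets)
    for k, b in zip(keys, ids):
        out[cursor[b]] = k
        cursor[b] += 1
    return out, offsets


if __name__ == "__main__":
    permuted, starts = multisplit([7, 2, 9, 4, 1, 8], lambda k: k % 3, 3)
    print(permuted, starts)
```

Unlike a full sort, this does no work ordering elements within a bucket, which is the inefficiency of the sort-based approaches that the abstract points out.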

Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy

Dense matrix factorizations, such as LU, Cholesky and QR, are widely
used for scientific applications that require solving systems of
linear equations, eigenvalues and linear least squares problems.
Such computations are normally carried out on supercomputers, whose
ever-growing scale induces a fast decline of the Mean Time To
Failure (MTTF). This paper proposes a new hybrid approach, based on
Algorithm-Based Fault Tolerance (ABFT), to help matrix
factorization algorithms survive fail-stop failures. We consider
extreme conditions, such as the absence of any reliable component
and the possibility of losing both data and checksum from a single
failure. We present a generic solution for protecting the
right factor, where the updates are applied, of all the above-mentioned
factorizations. For the left factor, where the panel has been applied,
we propose a scalable checkpointing algorithm. This algorithm features
a high degree of checkpointing parallelism and cooperatively utilizes
the checksum storage left over from the right factor protection. The
fault-tolerant algorithms derived from this hybrid solution are
applicable to a wide range of dense matrix factorizations, with
minor modifications. Theoretical analysis shows that the fault
tolerance overhead sharply decreases with the scaling in the number
of computing units and the problem size. Experimental results of LU
and QR factorization on the Kraken (Cray XT5) supercomputer validate
the theoretical evaluation and confirm negligible overhead, with
and without errors. Applicability to tolerating multiple failures
and accuracy after multiple recoveries are also considered.
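
As a rough illustration of the checksum idea that ABFT builds on, the sketch below appends a row of column sums to a small matrix and uses it to rebuild one lost row. This is a generic textbook-style example under simplifying assumptions (a single lost data row, exact integer arithmetic, an intact checksum), not the hybrid algorithm of the paper; the function names are hypothetical.

```python
def add_checksum_row(rows):
    """Return a copy of rows with one extra row holding the column-wise sums."""
    checksum = [sum(col) for col in zip(*rows)]
    return [list(r) for r in rows] + [checksum]


def recover_row(encoded, lost):
    """Rebuild data row `lost` from the checksum row and the surviving rows.

    encoded is the output of add_checksum_row; its last row is the checksum.
    """
    checksum = encoded[-1]
    survivors = [r for i, r in enumerate(encoded[:-1]) if i != lost]
    # Each lost entry is the column checksum minus the surviving column entries.
    return [checksum[j] - sum(r[j] for r in survivors)
            for j in range(len(checksum))]


if __name__ == "__main__":
    encoded = add_checksum_row([[1, 2], [3, 4], [5, 6]])
    print(recover_row(encoded, lost=1))
```

With floating-point data the recovered row is only approximate, which is one reason the abstract treats accuracy after recovery as a separate concern.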

Collective algorithms for multi-ported torus networks

Modern supercomputers with torus networks allow each node to simultaneously pass messages on all of its links. However, most collective algorithms are designed to only use one link at a time. In this work, we present novel multi-ported algorithms for the scatter, gather, allgather, and reduce-scatter operations. Our algorithms can be combined to create multi-ported reduce, all-reduce, and broadcast algorithms. Several of these algorithms involve a new technique where we relax the MPI message-ordering constraints to achieve high performance and restore the correct ordering using an additional stage of redundant communication.

According to our models, on an n-dimensional torus, our algorithms should allow for nearly a 2n-fold improvement in communication performance compared to known, single-ported torus algorithms. In practice, we have achieved nearly 6x better performance on a 32k-node 3-dimensional torus.

Hybridizing and Relaxing Dependence Tracking for Efficient Parallel Runtime Support

It is notoriously challenging to develop parallel software systems that are both scalable and correct. Run-time support for parallelism (such as multithreaded record & replay, data race detectors, transactional memory, and enforcement of stronger memory models) helps achieve these goals, but existing commodity solutions slow programs substantially in order to track (i.e., detect or control) an execution's cross-thread dependences accurately. Prior work tracks cross-thread dependences either "pessimistically," slowing every program access, or "optimistically," allowing for lightweight instrumentation of most accesses but dramatically slowing accesses that are conflicting (i.e., involved in cross-thread dependences).

This paper presents two novel approaches that seek to improve the performance of dependence tracking. Hybrid tracking (HT) hybridizes pessimistic and optimistic tracking by overcoming a fundamental mismatch between pessimistic and optimistic tracking. HT uses an adaptive, profile-based policy to make run-time decisions about switching between pessimistic and optimistic tracking. Relaxed tracking (RT) attempts to reduce optimistic tracking's overhead on conflicting accesses by tracking dependences in a "relaxed" way, meaning that not all dependences are tracked accurately. Instead, RT requires extra caution to preserve both program semantics and runtime support's correctness. To demonstrate the usefulness and potential of HT and RT, we build runtime support based on the two approaches. Our evaluation shows that both approaches offer performance advantages over existing approaches, although challenges for further improvement exist.

HT and RT are distinct solutions to the same problem. Our experience shows that runtime support based on HT is easier to build than that based on RT, while RT and its runtime support do not incur the overhead of online profiling. This paper presents the two approaches together in order to inspire future designs for efficient parallel runtime support.

Automatic Parallelization of a Class of Irregular Loops for Distributed Memory Systems

Many scientific applications spend significant time within loops that are parallel, except for dependencies from associative reduction operations. However, these loops often contain data-dependent control-flow and array-access patterns. Traditional optimizations that rely on purely static analysis fail to generate parallel code.

This paper proposes an approach for automatic parallelization for distributed memory environments, using both static and run-time analysis. We formalize the computations that are targeted by this approach and develop algorithms to detect such computations. We describe in detail algorithms to generate a parallel inspector that performs the run-time analysis, and a parallel executor. The effectiveness of the approach is demonstrated on several benchmarks and a real-world application. We measure the inspector overhead and also evaluate the benefit of optimizations applied during the transformation.

Near-Optimal Scheduling Mechanisms for Deadline-Sensitive Jobs in Large Computing Clusters

We consider a market-based resource allocation model for batch jobs in cloud computing clusters. In our model, we incorporate the importance of the due date of a job by which it needs to be completed rather than the number of servers allocated to it at any given time. Each batch job is characterized by the work volume of total computing units (e.g., CPU hours) along with a bound on maximum degree of parallelism. Users specify, along with these job characteristics, their desired due date and a value for finishing the job by its deadline. Given this specification, the primary goal is to determine the scheduling of cloud computing instances under capacity constraints in order to maximize the social welfare (i.e., sum of values gained by allocated users). Our main result is a new $\frac{C}{C-k}\cdot\frac{s}{s-1}$-approximation algorithm for this objective, where $C$ denotes cloud capacity, $k$ is the maximal bound on parallelized execution (in practical settings, $k \ll C$), and $s$ is the slackness on the job completion time, i.e., the minimal ratio between a specified deadline and the earliest finish time of a job. Our algorithm is based on dual fitting arguments over a strengthened linear program for the problem.
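
To give a feel for the approximation factor, here is a worked evaluation for illustrative values that are assumptions, not figures from the article: $C = 1000$, $k = 10$, and $s = 2$.

```latex
\[
\frac{C}{C-k}\cdot\frac{s}{s-1}
  = \frac{1000}{990}\cdot\frac{2}{1}
  \approx 1.0101 \times 2
  \approx 2.02
\]
```

Since $k \ll C$ pushes the first term toward 1, the factor in this setting is governed mostly by the slackness term; with $s = 5$ and the same $C$ and $k$, it drops to roughly $1.0101 \times 1.25 \approx 1.26$.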

Based on the new approximation algorithm, we construct truthful allocation and pricing mechanisms, in which reporting the true value and other properties of the job (deadline, work volume and the parallelism bound) is a dominant strategy for all users. To that end, we extend known results for single-value settings to provide a general framework for transforming allocation algorithms into truthful mechanisms in domains of single-value and multi-properties. We then show that the basic mechanism can be extended under proper Bayesian assumptions to the objective of maximizing revenues, which is important for public clouds. We empirically evaluate the benefits of our approach through simulations on datacenter job traces, and show that the revenues obtained under our mechanism are comparable with an ideal fixed-price mechanism, which sets an on-demand price using oracle knowledge of users' valuations. Finally, we discuss how our model can be extended to accommodate uncertainties in job work volumes, which is a practical challenge in cloud settings.

Avoiding Communication in Successive Band Reduction

The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time.
In this work, we present sequential and distributed-memory parallel algorithms for tridiagonalizing full symmetric and symmetric band matrices that asymptotically reduce communication compared to previous approaches.

The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices.
In order to preserve structure, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality and high parallel latency cost.
We improve both by reorganizing the computation and obtain asymptotic improvements.
We also propose new algorithms for reducing a full symmetric matrix to band form in a communication-efficient manner.
In this paper, we consider the cases of computing eigenvalues only and of computing eigenvalues and all eigenvectors.

A Methodology for Automatic Generation of Executable Communication Specifications from Parallel MPI Applications

Portable parallel benchmarks are widely used for performance
evaluation of HPC systems. However, because these are manually
produced, they generally represent a greatly simplified view of
application behavior, missing the subtle but important-to-performance
nuances that may exist in a complete application.
This work contributes novel methods to
automatically generate highly portable and customizable communication
benchmarks from HPC applications. We utilize ScalaTrace, a lossless
yet scalable parallel-application tracing framework to collect
selected aspects of the run-time behavior of HPC applications,
including communication operations and execution time, while
abstracting away the details of the computation proper. We
subsequently generate benchmarks with identical run-time behavior from
the collected traces.
Results demonstrate that the generated
benchmarks are in fact able to preserve the run-time behavior (including both
the communication pattern and the execution time) of the original
applications. Such automated benchmark generation is without
precedent and particularly valuable for proprietary,
export-controlled, or classified application codes.

Power Management of Extreme-scale Networks with On/Off Links in Runtime Systems

Networks are among the major power consumers in large-scale parallel systems. During execution of common
parallel applications, a sizeable fraction of the links in high-radix interconnects are either never used
or are underutilized. We propose a runtime-system-based adaptive approach to turn off unused links, which
has various advantages over the previously proposed hardware- and compiler-based approaches. We discuss
why the runtime system is the best system component to accomplish this task, and test the effectiveness
of our approach using real applications (including NAMD, MILC), and application benchmarks (including
NAS Parallel Benchmarks, Stencil). These codes are simulated on representative topologies such as a 6-D
torus and a multilevel directly-connected network (similar to IBM PERCS in Power 775 and Dragonfly in
Cray Aries). For common applications with near-neighbor communication patterns, our approach can save
up to 20% of the total machine's power and energy, without any performance penalty.

Lock Cohorting: A General Technique for Designing NUMA Locks

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machines' non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful.
Lock cohorting allows one to transform any spin-lock algorithm, with minimal non-intrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability.
We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real-world key-value store application (memcached), and the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.

ESTIMA: Extrapolating ScalabiliTy of In-Memory Applications

This article presents ESTIMA, an easy-to-use tool for extrapolating the scalability of in-memory applications. ESTIMA is designed to perform a simple, yet important task: given the performance of an application on a small machine with a handful of cores, ESTIMA extrapolates its scalability to a larger machine with more cores, while requiring minimum input from the user. The key idea underlying ESTIMA is the use of stalled cycles (e.g., cycles that the processor spends waiting for missed cache line fetches or busy locks). ESTIMA measures stalled cycles on a few cores and extrapolates them to more cores, estimating the amount of waiting in the system. ESTIMA can be effectively used to predict the scalability of in-memory applications for bigger execution machines. For instance, using measurements of memcached and SQLite on a desktop machine, we obtain accurate predictions of their scalability on a server. Our extensive evaluation shows the effectiveness of ESTIMA on a large number of in-memory benchmarks.


Publication Years 2014-2017
Publication Count 68
Citation Count 39
Available for Download 68
Downloads (6 weeks) 477
Downloads (12 Months) 4728
Downloads (cumulative) 11760
Average downloads per article 173
Average citations per article 1
First Name Last Name Award
Grey Ballard ACM Doctoral Dissertation Award Honorable Mention (2013)
James Demmel ACM Paris Kanellakis Theory and Practice Award (2014)
Jack Dongarra ACM-IEEE CS Ken Kennedy Award (2013)
William D Gropp ACM-IEEE CS Ken Kennedy Award (2016)
SIAM/ACM Prize in Computational Science and Engineering (2014)
David Paul Grove ACM Distinguished Member (2010)
ACM Senior Member (2006)
Charles E Leiserson ACM-IEEE CS Ken Kennedy Award (2014)
ACM Paris Kanellakis Theory and Practice Award (2013)
ACM Doctoral Dissertation Award (1982)
Vijay Saraswat ACM Doctoral Dissertation Award (1989)
Julian Shun ACM Doctoral Dissertation Award (2015)

First Name Last Name Paper Counts
Nicholas Knight 3
Joseph Naor 2
Guy Blelloch 2
Benjamin Moseley 2
James Demmel 2
Charles Leiserson 2
Tao Schardl 2
Grey Ballard 2
Peter Kling 2
Chinmoy Dutta 1
Gopal Pandurangan 1
Andrea Vattani 1
Christian Scheideler 1
Thomas Groß 1
Hafiz Sheikh 1
Ishfaq Ahmad 1
Yves Robert 1
Gokcen Kestor 1
Walther Maldonado 1
Torsten Hoefler 1
William Gropp 1
Sungjin Im 1
Davide Bilò 1
Luciano Gualà 1
Maurice Herlihy 1
Xavier Martorell 1
Olivier Tardieu 1
Paul Thomson 1
Dave Dice 1
Aurélien Bouteiller 1
Thomas Hérault 1
William Gropp 1
Andrew Grimshaw 1
Jianjia Chen 1
Stephen Siegel 1
Ishai Menache 1
Bo Zhao 1
Mahesh Ravishankar 1
Ponnuswamy Sadayappan 1
Xing Wu 1
Matthieu Dorier 1
Gabriel Antoniu 1
Yu Wang 1
Sergei Vassilvitskii 1
Shenchen Xu 1
Paolo Romano 1
Oliver Sinnen 1
James Dinan 1
Ioana Bercea 1
David Harris 1
Kirk Pruhs 1
Eric Torng 1
Tim Kaler 1
Wickus Nienaber 1
Darko Petrović 1
George Teodoro 1
Adam Betts 1
Youtao Zhang 1
Jagannathan Ramanujam 1
Francis O'Connell 1
Bruce Mealey 1
Robert Sisneros 1
Rajeev Barua 1
Johannes Hagemann 1
Raoul Steffen 1
Scott Roche 1
Vijaya Ramachandran 1
Mooly Sagiv 1
Jeffrey Blanchard 1
Erik Opavsky 1
Lukas Arnold 1
Aurélien Cavelan 1
Lionel Eyraud-Dubois 1
Frédéric Vivien 1
Pascal Felber 1
Étienne Rivière 1
Erin Carson 1
Phillip Gibbons 1
Aapo Kyrola 1
Zhiyu Liu 1
Santosh Mahapatra 1
Vijay Saraswat 1
Mandana Vaziri 1
Paul Sack 1
Santiago Pagani 1
Moran Feldman 1
Liane Lewin-Eytan 1
Yi Xu 1
Jun Yang 1
Louis Pouchet 1
Scott Pakin 1
Pradip Bose 1
Chaodong Zheng 1
I Lee 1
Jim Sukha 1
Joseph Izraelevitz 1
Zoltan Majo 1
Anne Benoit 1
Ioannis Koutis 1
Nuno Diegues 1
Osman Ünsal 1
Rajeev Thakur 1
Navin Goyal 1
Aravind Srinivasan 1
William Hasenplaugh 1
Guido Proietti 1
Eduard Ayguadé 1
Alba De Melo 1
Avraham Shinnar 1
Mikio Takeuchi 1
Virendra Marathe 1
Nir Shavit 1
Michael Garland 1
Laxmikant Kale 1
Jörg Henkel 1
Andrew Siegel 1
Atanas Rountev 1
Frank Mueller 1
Timothy Heil 1
Anil Krishna 1
Roberto Gioiosa 1
Marc Snir 1
Shadi Ibrahim 1
Leigh Orf 1
Ronghua Liang 1
Stephan Kramer 1
Rajmohan Rajaraman 1
Michael Scott 1
Uday Bondhugula 1
Felix Wolf 1
Julia Lawall 1
Jeremy Fineman 1
Thomas Ropars 1
Guillermo Miranda 1
Duane Merrill 1
Ehsan Totoni 1
Nikhil Jain 1
Adam Hammouda 1
John Eisenlohr 1
Ashay Rane 1
Farnaz Toussi 1
Francisco Cazorla 1
Franck Cappello 1
Jun Wang 1
Seth Gilbert 1
Peter Sanders 1
Jochen Speck 1
Ravi Kumar 1
Guy Golan-Gueta 1
Ganesan Ramalingam 1
Felix Voigtlaender 1
Loris Marchal 1
Patrick Marlier 1
Harsha Simhadri 1
Jiayang Jiang 1
Michael Mitzenmacher 1
George Bosilca 1
Peng Du 1
Jack Dongarra 1
Sebastian Kobbe 1
Ciaran McCreesh 1
Jonathan Yaniv 1
Bastian Degener 1
Friedhelm Heide 1
Julian Shun 1
Steven Vanderwiel 1
Alex Druinsky 1
Peter Pietrzyk 1
Richard Cole 1
Eran Yahav 1
Emircan Uysaler 1
David Böhme 1
Markus Geimer 1
Pavan Balaji 1
Keith Underwood 1
Stefano Leucci 1
Xin Yuan 1
Edans De O. Sandes 1
Justin Thaler 1
Benjamin Herta 1
David Grove 1
Prabhanjan Kambadur 1
Alastair Donaldson 1
Saeed Maleki 1
Madanlal Musuvathi 1
Todd Mytkowicz 1
Janmartin Jahn 1
Patrick Prosser 1
Navendu Jain 1
James Browne 1
Orcun Yildiz 1
Tom Peterka 1
Timothy Creech 1
Zhunping Zhang 1
Martina Eikel 1
Roshan Dathathri 1
Ravi Mullapudi 1
Hongyang Sun 1
Adrián Cristal 1
Serdar Taşiran 1
Gilles Muller 1
Brian Barrett 1
Edgar Solomonik 1
André Schiper 1
Wei Zhang 1
David Cunningham 1
Barbara Kempkes 1
Nicholas Lindberg 1
Víctor Jiménez 1
Alper Buyuktosunoglu 1
Jiaquan Gao 1
Oded Schwartz 1

Affiliation Paper Counts
Tel Aviv University 1
University of Auckland 1
University of Houston 1
Los Alamos National Laboratory 1
Koc University 1
Spanish National Research Council 1
Louisiana State University 1
Hebrew University of Jerusalem 1
University of California, Merced 1
University of Sassari 1
Technical University of Darmstadt 1
RWTH Aachen University 1
Nanjing Normal University 1
University of Virginia 1
Massachusetts Institute of Technology 1
University of Delaware 1
Georgetown University 1
Lawrence Livermore National Laboratory 1
University of Roma Tor Vergata 1
University of California, Los Angeles 1
University of California, San Diego 1
Michigan State University 1
University of Wisconsin Madison 1
University of Puerto Rico 1
Yahoo Research Labs 1
Huawei Technologies Co., Ltd., USA 1
Universite de Bordeaux 1
IBM, Japan 1
University of Glasgow 2
University of Texas at Arlington 2
North Carolina State University 2
Instituto Superior Tecnico 2
Google Inc. 2
Lawrence Berkeley National Laboratory 2
Sandia National Laboratories, New Mexico 2
Washington University in St. Louis 2
Brown University 2
National University of Singapore 2
University of L'Aquila 2
New York University 2
Pacific Northwest National Laboratory 2
Zhejiang University of Technology 2
University of Rochester 2
Northeastern University 2
University of Gottingen 2
Universite de Lyon 2
Florida State University 3
Universitat Politecnica de Catalunya 3
Harvard University 3
University of Texas at Austin 3
INRIA Institut National de Recherche en Informatique et en Automatique 3
Indian Institute of Science 3
Imperial College London 3
University of Brasilia 3
Swiss Federal Institute of Technology, Zurich 3
Swiss Federal Institute of Technology, Lausanne 3
Grinnell College 3
Barcelona Supercomputing Center 3
University of Neuchatel 4
Ohio State University 4
Ecole Normale Superieure de Lyon 4
Intel Corporation 4
University of Pittsburgh 5
University of Tennessee, Knoxville 5
University of Maryland 5
Technion - Israel Institute of Technology 5
Carnegie Mellon University 5
Microsoft Research 6
University of California, Berkeley 7
University of Illinois at Urbana-Champaign 8
MIT Computer Science and Artificial Intelligence Laboratory 8
University of Paderborn 8
Argonne National Laboratory 8
Karlsruhe Institute of Technology 8
IBM Thomas J. Watson Research Center 11

ACM Transactions on Parallel Computing (TOPC) - Special Issue on PPoPP 2015 and Regular Papers

Volume 3 Issue 4, March 2017 Special Issue on PPoPP 2015 and Regular Papers

Volume 3 Issue 3, December 2016
Volume 3 Issue 2, August 2016
Volume 3 Issue 1, June 2016 Special Issue for SPAA 2014
Volume 2 Issue 4, March 2016 Special Issue on PPoPP 2014

Volume 2 Issue 3, October 2015 Special Issue for SPAA 2013
Volume 2 Issue 2, July 2015
Volume 2 Issue 1, May 2015 Special Issue on SPAA 2012
Volume 1 Issue 2, January 2015 Special Issue on PPoPP 2012

Volume 1 Issue 1, September 2014 Inaugural Issue and Special Section on Top Papers from PACT-21, and Regular Papers