ACM Transactions on Parallel Computing (TOPC)

Latest Articles

A Library for Portable and Composable Data Locality Optimizations for NUMA Systems

Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time...

Automatic Scalable Atomicity via Semantic Locking

In this article, we consider concurrent programs in which the shared state consists of instances of linearizable abstract data types (ADTs). We...

Generality and Speed in Nonblocking Dual Containers

Nonblocking dual data structures extend traditional notions of nonblocking progress to accommodate partial methods, both by bounding the number of...

Resource Oblivious Sorting on Multicores

We present a deterministic sorting algorithm, Sample, Partition, and Merge Sort (SPMS), that interleaves the partitioning of a sample sort with merging. Sequentially, it sorts n elements in O(n log n) time cache-obliviously with an optimal number of cache misses. The parallel complexity (or critical path length) of the algorithm is O(log n log log...

About TOPC

ACM Transactions on Parallel Computing (TOPC) is a forum for novel and innovative work on all aspects of parallel computing, including foundational and theoretical aspects, systems, languages, architectures, tools, and applications. It will address all classes of parallel-processing platforms including concurrent, multithreaded, multicore, accelerated, multiprocessor, clusters, and supercomputers. 

Forthcoming Articles
Guest Editor Introduction (1 of 2)

Guest editor introduction for first special issue from PPoPP 2016

SciPAL: Expression Templates and Composition Closure Objects for High Performance Computational Physics with CUDA and OpenMP

We present SciPAL (scientific parallel algorithms library), a C++-based, hardware-independent open-source library. Its core is a domain-specific embedded language for numerical linear algebra. The main fields of application are finite element simulations, coherent optics, and the solution of inverse problems. Using SciPAL, algorithms can be stated in a mathematically intuitive way in terms of matrix and vector operations. Existing algorithms can easily be adapted to GPU-based computing by proper template specialization. Our library is compatible with the finite element library deal.II and provides a port of deal.II's most frequently used linear algebra classes to CUDA (NVIDIA's extension of the programming languages C and C++ for programming their GPUs). SciPAL's operator-based API for BLAS operations particularly aims at simplifying the usage of NVIDIA's CUBLAS. For non-BLAS array arithmetic, SciPAL's expression templates are able to generate CUDA kernels at compile time. We demonstrate the benefits of SciPAL using iterative principal component analysis as an example, which is the core algorithm for the spike-sorting problem in neuroscience.

DomLock: A New Multi-Granularity Locking Technique for Hierarchies

AutoGen: Automatic Discovery of Efficient Recursive Divide-&-Conquer Algorithms for Solving Dynamic Programming Problems

We present AUTOGEN, an algorithm that, for a wide class of dynamic programming (DP) problems, automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size, and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use AUTOGEN to autodiscover efficient algorithms for several well-known problems. Our experimental results show that several autodiscovered algorithms significantly outperform parallel looping and tiled loop-based algorithms. These algorithms are also less sensitive to fluctuations of memory and bandwidth than their looping counterparts, and their running times and energy profiles remain more stable. To the best of our knowledge, AUTOGEN is the first algorithm that can automatically discover new nontrivial divide-and-conquer algorithms.

Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy

Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all of the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead sharply decreases with the scaling in the number of computing units and the problem size. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors. Applicability to tolerating multiple failures and accuracy after multiple recoveries are also considered.

Collective algorithms for multi-ported torus networks

Modern supercomputers with torus networks allow each node to simultaneously pass messages on all of its links. However, most collective algorithms are designed to only use one link at a time. In this work, we present novel multi-ported algorithms for the scatter, gather, allgather, and reduce-scatter operations. Our algorithms can be combined to create multi-ported reduce, all-reduce, and broadcast algorithms. Several of these algorithms involve a new technique where we relax the MPI message-ordering constraints to achieve high performance and restore the correct ordering using an additional stage of redundant communication.

According to our models, on an n-dimensional torus, our algorithms should allow for nearly a 2n-fold improvement in communication performance compared to known, single-ported torus algorithms. In practice, we have achieved nearly 6x better performance on a 32k-node 3-dimensional torus.
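
As a quick consistency check on the two figures above (a sketch based only on the numbers quoted in this abstract, using the fact that each node of an $n$-dimensional torus has two links per dimension):

$$
\text{links per node} = 2n, \qquad n = 3 \;\Rightarrow\; 2n = 6
$$

Under that reading, the measured speedup of nearly 6x on the 3-dimensional torus sits close to the modeled $2n$-fold bound.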

Automatic Parallelization of a Class of Irregular Loops for Distributed Memory Systems

Many scientific applications spend significant time within loops that are parallel, except for dependencies from associative reduction operations. However, these loops often contain data-dependent control-flow and array-access patterns. Traditional optimizations that rely on purely static analysis fail to generate parallel code.

This paper proposes an approach for automatic parallelization for distributed memory environments, using both static and run-time analysis. We formalize the computations that are targeted by this approach and develop algorithms to detect such computations. We describe in detail algorithms to generate a parallel inspector that performs the run-time analysis, and a parallel executor. The effectiveness of the approach is demonstrated on several benchmarks and a real-world application. We measure the inspector overhead and also evaluate the benefit of optimizations applied during the transformation.

Near-Optimal Scheduling Mechanisms for Deadline-Sensitive Jobs in Large Computing Clusters

We consider a market-based resource allocation model for batch jobs in cloud computing clusters. In our model, we incorporate the importance of the due date of a job by which it needs to be completed rather than the number of servers allocated to it at any given time. Each batch job is characterized by the work volume of total computing units (e.g., CPU hours) along with a bound on the maximum degree of parallelism. Users specify, along with these job characteristics, their desired due date and a value for finishing the job by its deadline. Given this specification, the primary goal is to determine the scheduling of cloud computing instances under capacity constraints in order to maximize the social welfare (i.e., the sum of values gained by allocated users). Our main result is a new $\frac{C}{C-k}\cdot\frac{s}{s-1}$-approximation algorithm for this objective, where $C$ denotes cloud capacity, $k$ is the maximal bound on parallelized execution (in practical settings, $k \ll C$) and $s$ is the slackness on the job completion time, i.e., the minimal ratio between a specified deadline and the earliest finish time of a job. Our algorithm is based on utilizing dual fitting arguments over a strengthened linear program for the problem.
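
To get a feel for the approximation ratio stated above, here is a worked instance with purely illustrative values that do not come from the paper: $C = 1000$, $k = 10$ and $s = 2$.

$$
\frac{C}{C-k}\cdot\frac{s}{s-1} = \frac{1000}{990}\cdot\frac{2}{1} \approx 2.02
$$

As $k/C \to 0$ the first factor tends to 1, and as the slackness $s$ grows the second factor tends to 1 as well, so under these assumptions the ratio approaches 1 when jobs are small relative to capacity and deadlines are loose.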

Based on the new approximation algorithm, we construct truthful allocation and pricing mechanisms, in which reporting the true value and other properties of the job (deadline, work volume and the parallelism bound) is a dominant strategy for all users. To that end, we extend known results for single-value settings to provide a general framework for transforming allocation algorithms into truthful mechanisms in domains of single-value and multi-properties. We then show that the basic mechanism can be extended under proper Bayesian assumptions to the objective of maximizing revenues, which is important for public clouds. We empirically evaluate the benefits of our approach through simulations on datacenter job traces, and show that the revenues obtained under our mechanism are comparable with an ideal fixed-price mechanism, which sets an on-demand price using oracle knowledge of users' valuations. Finally, we discuss how our model can be extended to accommodate uncertainties in job work volumes, which is a practical challenge in cloud settings.

Avoiding Communication in Successive Band Reduction

The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present sequential and distributed-memory parallel algorithms for tridiagonalizing full symmetric and symmetric band matrices that asymptotically reduce communication compared to previous approaches.

The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve structure, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality and high parallel latency cost. We improve both by reorganizing the computation and obtain asymptotic improvements. We also propose new algorithms for reducing a full symmetric matrix to band form in a communication-efficient manner. In this paper, we consider the cases of computing eigenvalues only and of computing eigenvalues and all eigenvectors.

A Methodology for Automatic Generation of Executable Communication Specifications from Parallel MPI Applications

Portable parallel benchmarks are widely used for performance evaluation of HPC systems. However, because these are manually produced, they generally represent a greatly simplified view of application behavior, missing the subtle but important-to-performance nuances that may exist in a complete application. This work contributes novel methods to automatically generate highly portable and customizable communication benchmarks from HPC applications. We utilize ScalaTrace, a lossless yet scalable parallel-application tracing framework, to collect selected aspects of the run-time behavior of HPC applications, including communication operations and execution time, while abstracting away the details of the computation proper. We subsequently generate benchmarks with identical run-time behavior from the collected traces. Results demonstrate that the generated benchmarks are in fact able to preserve the run-time behavior (including both the communication pattern and the execution time) of the original applications. Such automated benchmark generation is without precedent and particularly valuable for proprietary, export-controlled, or classified application codes.

Power Management of Extreme-scale Networks with On/Off Links in Runtime Systems

Networks are among the major power consumers in large-scale parallel systems. During execution of common parallel applications, a sizeable fraction of the links in the high-radix interconnects are either never used or are underutilized. We propose a runtime-system-based adaptive approach to turn off unused links, which has various advantages over the previously proposed hardware- and compiler-based approaches. We discuss why the runtime system is the best system component to accomplish this task, and test the effectiveness of our approach using real applications (including NAMD, MILC), and application benchmarks (including NAS Parallel Benchmarks, Stencil). These codes are simulated on representative topologies such as 6-D Torus and multilevel directly-connected network (similar to IBM PERCS in Power 775 and Dragonfly in Cray Aries). For common applications with near-neighbor communication patterns, our approach can save up to 20% of the total machine's power and energy, without any performance penalty.

Lock Cohorting: A General Technique for Designing NUMA Locks

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machines' non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful. Lock cohorting allows one to transform any spin-lock algorithm, with minimal non-intrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability. We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real-world key-value store application (memcached), as well as the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.

ESTIMA: Extrapolating ScalabiliTy of In-Memory Applications

This article presents ESTIMA, an easy-to-use tool for extrapolating the scalability of in-memory applications. ESTIMA is designed to perform a simple, yet important task: given the performance of an application on a small machine with a handful of cores, ESTIMA extrapolates its scalability to a larger machine with more cores, while requiring minimum input from the user. The key idea underlying ESTIMA is the use of stalled cycles (e.g., cycles that the processor spends waiting for missed cache line fetches or busy locks). ESTIMA measures stalled cycles on a few cores and extrapolates them to more cores, estimating the amount of waiting in the system. ESTIMA can be effectively used to predict the scalability of in-memory applications for bigger execution machines. For instance, using measurements of memcached and SQLite on a desktop machine, we obtain accurate predictions of their scalability on a server. Our extensive evaluation shows the effectiveness of ESTIMA on a large number of in-memory benchmarks.

Bibliometrics

Publication Years 2014-2017
Publication Count 68
Citation Count 47
Available for Download 68
Downloads (6 weeks) 420
Downloads (12 Months) 4766
Downloads (cumulative) 12066
Average downloads per article 177
Average citations per article 1
First Name Last Name Award
Grey Ballard ACM Doctoral Dissertation Award Honorable Mention (2013)
Guy Blelloch ACM Fellows (2011)
James C Browne ACM Fellows (1998)
James Demmel ACM Paris Kanellakis Theory and Practice Award (2014); ACM Fellows (1999)
Jack Dongarra ACM-IEEE CS Ken Kennedy Award (2013); ACM Fellows (2001)
Phillip B Gibbons ACM Fellows (2006)
William D Gropp ACM-IEEE CS Ken Kennedy Award (2016); SIAM/ACM Prize in Computational Science and Engineering (2014); ACM Fellows (2006)
David Paul Grove ACM Fellows (2012); ACM Distinguished Member (2010); ACM Senior Member (2006)
Maurice Herlihy ACM Fellows (2005)
Charles E Leiserson ACM-IEEE CS Ken Kennedy Award (2014); ACM Paris Kanellakis Theory and Practice Award (2013); ACM Fellows (2006); ACM Doctoral Dissertation Award (1982)
Michael Mitzenmacher ACM Fellows (2014)
Mooly Sagiv ACM Fellows (2015)
Vijay Saraswat ACM Doctoral Dissertation Award (1989)
Michael Scott ACM Fellows (2006)
Nir N Shavit ACM Fellows (2013)
Julian Shun ACM Doctoral Dissertation Award (2015)
Aravind Srinivasan ACM Fellows (2014)

First Name Last Name Paper Counts
Nicholas Knight 3
Joseph Naor 2
Guy Blelloch 2
Benjamin Moseley 2
James Demmel 2
Charles Leiserson 2
Tao Schardl 2
Grey Ballard 2
Peter Kling 2
Uday Bondhugula 1
Felix Wolf 1
Julia Lawall 1
Thomas Ropars 1
Guillermo Miranda 1
Duane Merrill 1
Ehsan Totoni 1
Nikhil Jain 1
Adam Hammouda 1
John Eisenlohr 1
Ashay Rane 1
Farnaz Toussi 1
Francisco Cazorla 1
Franck Cappello 1
Jun Wang 1
Seth Gilbert 1
Peter Sanders 1
Jochen Speck 1
Ravi Kumar 1
Guy Golan-Gueta 1
Ganesan Ramalingam 1
Harsha Simhadri 1
Jiayang Jiang 1
Michael Mitzenmacher 1
Felix Voigtlaender 1
Loris Marchal 1
Patrick Marlier 1
George Bosilca 1
Peng Du 1
Jack Dongarra 1
Jonathan Yaniv 1
Sebastian Kobbe 1
Bastian Degener 1
Friedhelm Heide 1
Steven Vanderwiel 1
Alex Druinsky 1
Ciaran McCreesh 1
Julian Shun 1
Peter Pietrzyk 1
Richard Cole 1
Eran Yahav 1
Justin Thaler 1
Stefano Leucci 1
Emircan Uysaler 1
David Böhme 1
Markus Geimer 1
Pavan Balaji 1
Keith Underwood 1
Xin Yuan 1
Edans De O. Sandes 1
Benjamin Herta 1
David Grove 1
Prabhanjan Kambadur 1
Alastair Donaldson 1
Saeed Maleki 1
Madanlal Musuvathi 1
Todd Mytkowicz 1
Navendu Jain 1
Janmartin Jahn 1
James Browne 1
Orcun Yildiz 1
Tom Peterka 1
Timothy Creech 1
Patrick Prosser 1
Zhunping Zhang 1
Martina Eikel 1
Edgar Solomonik 1
Roshan Dathathri 1
Ravi Mullapudi 1
Hongyang Sun 1
Adrián Cristal 1
Serdar Taşiran 1
Gilles Muller 1
Brian Barrett 1
André Schiper 1
Wei Zhang 1
David Cunningham 1
Barbara Kempkes 1
Nicholas Lindberg 1
Víctor Jiménez 1
Alper Buyuktosunoglu 1
Oded Schwartz 1
Jiaquan Gao 1
Chinmoy Dutta 1
Gopal Pandurangan 1
Andrea Vattani 1
Christian Scheideler 1
Thomas Groß 1
Sungjin Im 1
Davide Bilò 1
Luciano Gualà 1
Hafiz Sheikh 1
Ishfaq Ahmad 1
Yves Robert 1
Gokcen Kestor 1
Walther Maldonado 1
Torsten Hoefler 1
William Gropp 1
Maurice Herlihy 1
Xavier Martorell 1
Olivier Tardieu 1
Paul Thomson 1
Dave Dice 1
Aurélien Bouteiller 1
Thomas Hérault 1
William Gropp 1
Andrew Grimshaw 1
Ishai Menache 1
Jianjia Chen 1
Stephen Siegel 1
Bo Zhao 1
Mahesh Ravishankar 1
Ponnuswamy Sadayappan 1
Matthieu Dorier 1
Gabriel Antoniu 1
Yu Wang 1
Xing Wu 1
Sergei Vassilvitskii 1
Ioana Bercea 1
David Harris 1
Kirk Pruhs 1
Eric Torng 1
Tim Kaler 1
Shenchen Xu 1
Paolo Romano 1
Oliver Sinnen 1
James Dinan 1
Wickus Nienaber 1
Darko Petrović 1
George Teodoro 1
Adam Betts 1
Johannes Hagemann 1
Youtao Zhang 1
Jagannathan Ramanujam 1
Francis O'Connell 1
Bruce Mealey 1
Robert Sisneros 1
Rajeev Barua 1
Raoul Steffen 1
Scott Roche 1
Vijaya Ramachandran 1
Mooly Sagiv 1
Phillip Gibbons 1
Aapo Kyrola 1
Erin Carson 1
Jeffrey Blanchard 1
Erik Opavsky 1
Lukas Arnold 1
Aurélien Cavelan 1
Lionel Eyraud-Dubois 1
Frédéric Vivien 1
Pascal Felber 1
Étienne Rivière 1
Zhiyu Liu 1
Santosh Mahapatra 1
Vijay Saraswat 1
Mandana Vaziri 1
Paul Sack 1
Santiago Pagani 1
Yi Xu 1
Jun Yang 1
Louis Pouchet 1
Scott Pakin 1
Pradip Bose 1
Moran Feldman 1
Liane Lewin-Eytan 1
Chaodong Zheng 1
I Lee 1
Jim Sukha 1
Joseph Izraelevitz 1
Zoltan Majo 1
Navin Goyal 1
Aravind Srinivasan 1
William Hasenplaugh 1
Guido Proietti 1
Anne Benoit 1
Ioannis Koutis 1
Nuno Diegues 1
Osman Ünsal 1
Rajeev Thakur 1
Eduard Ayguadé 1
Alba De Melo 1
Avraham Shinnar 1
Mikio Takeuchi 1
Virendra Marathe 1
Nir Shavit 1
Michael Garland 1
Laxmikant Kale 1
Stephan Kramer 1
Jörg Henkel 1
Andrew Siegel 1
Atanas Rountev 1
Frank Mueller 1
Timothy Heil 1
Anil Krishna 1
Roberto Gioiosa 1
Marc Snir 1
Shadi Ibrahim 1
Leigh Orf 1
Ronghua Liang 1
Rajmohan Rajaraman 1
Michael Scott 1
Jeremy Fineman 1

Affiliation Paper Counts
Tel Aviv University 1
University of Auckland 1
University of Houston 1
Los Alamos National Laboratory 1
Koc University 1
Spanish National Research Council 1
Louisiana State University 1
Hebrew University of Jerusalem 1
University of California, Merced 1
University of Sassari 1
Technical University of Darmstadt 1
RWTH Aachen University 1
Nanjing Normal University 1
University of Virginia 1
Massachusetts Institute of Technology 1
University of Delaware 1
Georgetown University 1
Lawrence Livermore National Laboratory 1
University of Roma Tor Vergata 1
University of California, Los Angeles 1
University of California, San Diego 1
Michigan State University 1
University of Wisconsin Madison 1
University of Puerto Rico 1
Yahoo Research Labs 1
Huawei Technologies Co., Ltd., USA 1
Universite de Bordeaux 1
IBM, Japan 1
University of Glasgow 2
University of Texas at Arlington 2
North Carolina State University 2
Instituto Superior Tecnico 2
Google Inc. 2
Lawrence Berkeley National Laboratory 2
Sandia National Laboratories, New Mexico 2
Washington University in St. Louis 2
Brown University 2
National University of Singapore 2
University of L'Aquila 2
New York University 2
Pacific Northwest National Laboratory 2
Zhejiang University of Technology 2
University of Rochester 2
Northeastern University 2
University of Gottingen 2
NVIDIA 2
Universite de Lyon 2
Florida State University 3
Universitat Politecnica de Catalunya 3
Harvard University 3
University of Texas at Austin 3
INRIA Institut National de Recherche en Informatique et en Automatique 3
Indian Institute of Science 3
Imperial College London 3
University of Brasilia 3
Swiss Federal Institute of Technology, Zurich 3
Swiss Federal Institute of Technology, Lausanne 3
Grinnell College 3
Barcelona Supercomputing Center 3
University of Neuchatel 4
Ohio State University 4
Ecole Normale Superieure de Lyon 4
Intel Corporation 4
University of Pittsburgh 5
University of Tennessee, Knoxville 5
University of Maryland 5
Technion - Israel Institute of Technology 5
Carnegie Mellon University 5
Microsoft Research 6
IBM, USA 7
University of California, Berkeley 7
University of Illinois at Urbana-Champaign 8
MIT Computer Science and Artificial Intelligence Laboratory 8
University of Paderborn 8
Argonne National Laboratory 8
Karlsruhe Institute of Technology 8
IBM Thomas J. Watson Research Center 11

ACM Transactions on Parallel Computing (TOPC) - Special Issue on PPoPP 2015 and Regular Papers
Archive


2017
Volume 3 Issue 4, March 2017 Special Issue on PPoPP 2015 and Regular Papers

2016
Volume 3 Issue 3, December 2016
Volume 3 Issue 2, August 2016
Volume 3 Issue 1, June 2016 Special Issue for SPAA 2014
Volume 2 Issue 4, March 2016 Special Issue on PPoPP 2014

2015
Volume 2 Issue 3, October 2015 Special Issue for SPAA 2013
Volume 2 Issue 2, July 2015
Volume 2 Issue 1, May 2015 Special Issue on SPAA 2012
Volume 1 Issue 2, January 2015 Special Issue on PPoPP 2012

2014
Volume 1 Issue 1, September 2014 Inaugural Issue and Special Section on Top Papers from PACT-21, and Regular Papers
 