ACM Transactions on Parallel Computing (TOPC)

About TOPC

ACM Transactions on Parallel Computing (TOPC) is a forum for novel and innovative work on all aspects of parallel computing, including foundational and theoretical aspects, systems, languages, architectures, tools, and applications. It addresses all classes of parallel-processing platforms, including concurrent, multithreaded, multicore, accelerated, multiprocessor, cluster, and supercomputer platforms.

Forthcoming Articles
Guest Editor Introduction (1 of 2)

Guest editor introduction for the first special issue from PPoPP 2016.

SciPAL: Expression Templates and Composition Closure Objects for High Performance Computational Physics with CUDA and OpenMP

We present SciPAL (scientific parallel algorithms library), a C++-based, hardware-independent open-source library. Its core is a domain-specific embedded language for numerical linear algebra. The main fields of application are finite element simulations, coherent optics, and the solution of inverse problems. Using SciPAL, algorithms can be stated in a mathematically intuitive way in terms of matrix and vector operations. Existing algorithms can easily be adapted to GPU-based computing by proper template specialization. Our library is compatible with the finite element library deal.II and provides a port of deal.II's most frequently used linear algebra classes to CUDA (NVIDIA's extension of the C and C++ programming languages for programming its GPUs). SciPAL's operator-based API for BLAS operations particularly aims at simplifying the use of NVIDIA's CUBLAS. For non-BLAS array arithmetic, SciPAL's expression templates generate CUDA kernels at compile time. We demonstrate the benefits of SciPAL using iterative principal component analysis, the core algorithm for the spike-sorting problem in neuroscience, as an example.
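
To make the expression-template idea concrete, here is a minimal, self-contained C++ sketch (not SciPAL's actual API; all class and operator names are invented here) of how an expression such as z = a*x + y can be captured as a type and evaluated in one fused loop without temporaries. SciPAL applies the same mechanism to emit CUDA kernels at compile time.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // CRTP base so the operators below only match vector expressions.
    template <class E>
    struct Expr {
        double operator[](std::size_t i) const {
            return static_cast<const E&>(*this)[i];
        }
    };

    template <class L, class R>
    struct Add : Expr<Add<L, R>> {                 // node for l + r
        const L& l; const R& r;
        Add(const L& l_, const R& r_) : l(l_), r(r_) {}
        double operator[](std::size_t i) const { return l[i] + r[i]; }
    };

    template <class E>
    struct Scale : Expr<Scale<E>> {                // node for a * e
        double a; const E& e;
        Scale(double a_, const E& e_) : a(a_), e(e_) {}
        double operator[](std::size_t i) const { return a * e[i]; }
    };

    struct Vec : Expr<Vec> {
        std::vector<double> data;
        explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
        double operator[](std::size_t i) const { return data[i]; }
        // Assignment from any expression runs one fused loop; no temporaries.
        template <class E>
        Vec& operator=(const Expr<E>& expr) {
            for (std::size_t i = 0; i < data.size(); ++i) data[i] = expr[i];
            return *this;
        }
    };

    template <class L, class R>
    Add<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
        return Add<L, R>(static_cast<const L&>(l), static_cast<const R&>(r));
    }

    template <class E>
    Scale<E> operator*(double a, const Expr<E>& e) {
        return Scale<E>(a, static_cast<const E&>(e));
    }

    int main() {
        Vec x(4, 1.0), y(4, 2.0), z(4);
        z = 3.0 * x + y;   // type Add<Scale<Vec>, Vec>, evaluated elementwise
        std::printf("z[0] = %g\n", z[0]);  // prints 5
        return 0;
    }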

DomLock: A New Multi-Granularity Locking Technique for Hierarchies

AutoGen: Automatic Discovery of Efficient Recursive Divide-&-Conquer Algorithms for Solving Dynamic Programming Problems

We present AUTOGEN, an algorithm that, for a wide class of dynamic programming (DP) problems, automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size, and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use AUTOGEN to autodiscover efficient algorithms for several well-known problems. Our experimental results show that several autodiscovered algorithms significantly outperform parallel looping and tiled loop-based algorithms. These algorithms are also less sensitive to fluctuations of memory and bandwidth than their looping counterparts, and their running times and energy profiles remain more stable. To the best of our knowledge, AUTOGEN is the first algorithm that can automatically discover new nontrivial divide-and-conquer algorithms.
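
For a feel of the kind of transformation AUTOGEN automates, the sketch below hand-derives it for the classic LCS recurrence (chosen here only as a familiar DP example, not necessarily the paper's formulation): the iterative sweep and the quadrant-recursive divide-and-conquer compute the same table, but the recursive order is cache-oblivious, and the two middle quadrants are independent and could run in parallel.

    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>

    static std::string A = "GATTACA", B = "TACGATCA";
    static std::vector<std::vector<int>> c;  // DP table, 1-based in i and j

    // One cell of the standard LCS recurrence.
    static void cell(int i, int j) {
        c[i][j] = (A[i - 1] == B[j - 1]) ? c[i - 1][j - 1] + 1
                                         : std::max(c[i - 1][j], c[i][j - 1]);
    }

    // Iterative description: row-major sweep over the whole table.
    static void lcs_iterative() {
        for (int i = 1; i <= (int)A.size(); ++i)
            for (int j = 1; j <= (int)B.size(); ++j) cell(i, j);
    }

    // Recursive divide-and-conquer over ranges [ilo,ihi] x [jlo,jhi].
    static void lcs_rec(int ilo, int ihi, int jlo, int jhi) {
        if (ihi < ilo || jhi < jlo) return;
        if (ihi - ilo < 2 && jhi - jlo < 2) {        // small base case
            for (int i = ilo; i <= ihi; ++i)
                for (int j = jlo; j <= jhi; ++j) cell(i, j);
            return;
        }
        int im = (ilo + ihi) / 2, jm = (jlo + jhi) / 2;
        lcs_rec(ilo, im, jlo, jm);          // top-left quadrant first
        lcs_rec(ilo, im, jm + 1, jhi);      // top-right: independent of
        lcs_rec(im + 1, ihi, jlo, jm);      // bottom-left, so parallelizable
        lcs_rec(im + 1, ihi, jm + 1, jhi);  // bottom-right last
    }

    int main() {
        int n = A.size(), m = B.size();
        c.assign(n + 1, std::vector<int>(m + 1, 0));
        lcs_iterative();
        int expect = c[n][m];
        c.assign(n + 1, std::vector<int>(m + 1, 0));
        lcs_rec(1, n, 1, m);
        assert(c[n][m] == expect);  // both orders compute the same table
        return 0;
    }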

Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy

Dense matrix factorizations, such as LU, Cholesky, and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right-factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault-tolerance overhead decreases sharply as the number of computing units and the problem size scale up. Experimental results for LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors. Applicability to tolerating multiple failures and accuracy after multiple recoveries are also considered.
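
As a toy illustration of the checksum idea underpinning ABFT (far simpler than the paper's scheme, which keeps checksums valid under the factorization's updates), the sketch below recovers a matrix entry lost to a simulated fail-stop failure from a row-checksum column.

    #include <cassert>
    #include <vector>

    int main() {
        // 3x3 matrix with an extra checksum column: a[i][3] = sum of row i.
        std::vector<std::vector<double>> a = {
            {2, 1, 4, 0}, {3, 5, 7, 0}, {1, 8, 6, 0}};
        for (auto& row : a) row[3] = row[0] + row[1] + row[2];

        double lost = a[1][2];
        a[1][2] = 0;  // simulate a failure wiping this entry

        // Recovery: the missing value is the row checksum minus the
        // surviving entries.
        a[1][2] = a[1][3] - a[1][0] - a[1][1];
        assert(a[1][2] == lost);
        return 0;
    }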

Collective algorithms for multi-ported torus networks

Modern supercomputers with torus networks allow each node to simultaneously pass messages on all of its links. However, most collective algorithms are designed to only use one link at a time. In this work, we present novel multi-ported algorithms for the scatter, gather, allgather, and reduce-scatter operations. Our algorithms can be combined to create multi-ported reduce, all-reduce, and broadcast algorithms. Several of these algorithms involve a new technique where we relax the MPI message-ordering constraints to achieve high performance and restore the correct ordering using an additional stage of redundant communication.

According to our models, on an n-dimensional torus, our algorithms should allow for nearly a 2n-fold improvement in communication performance compared to known, single-ported torus algorithms. In practice, we have achieved nearly 6x better performance on a 32k-node 3-dimensional torus.
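
The quoted model is easy to restate: an n-dimensional torus gives each node 2n links, so a collective that keeps every link busy can approach a 2n-fold speedup over a single-ported algorithm. The trivial sketch below just tabulates this ideal bound (6x on a 3-D torus, in line with the reported result).

    #include <cstdio>

    int main() {
        for (int n = 1; n <= 4; ++n)
            std::printf("%d-D torus: %d links/node, ideal speedup %dx\n",
                        n, 2 * n, 2 * n);
        return 0;
    }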

Automatic Parallelization of a Class of Irregular Loops for Distributed Memory Systems

Many scientific applications spend significant time within loops that are parallel, except for dependencies from associative reduction operations. However, these loops often contain data-dependent control flow and array-access patterns. Traditional optimizations that rely on purely static analysis fail to generate parallel code for them.

This paper proposes an approach to automatic parallelization for distributed-memory environments that uses both static and run-time analysis. We formalize the computations targeted by this approach and develop algorithms to detect such computations. We describe in detail algorithms to generate a parallel inspector that performs the run-time analysis, and a parallel executor. The effectiveness of the approach is demonstrated on several benchmarks and a real-world application. We measure the inspector overhead and also evaluate the benefit of optimizations applied during the transformation.
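
A minimal sequential sketch of the inspector/executor pattern the approach builds on (the paper's contribution is running both phases in parallel on distributed memory; the names and the owner function here are illustrative): the inspector runs the data-dependent access pattern once to build a partitioned plan whose partitions can then execute without conflicting updates.

    #include <cstdio>
    #include <vector>

    int main() {
        // Irregular loop: for each i, x[col[i]] += val[i]; col is only
        // known at run time.
        std::vector<int> col = {0, 3, 1, 3, 0, 2};
        std::vector<double> val = {1, 2, 3, 4, 5, 6};
        std::vector<double> x(4, 0.0);
        const int nparts = 2;

        // Inspector: bucket iterations by the owner of the element they
        // touch, so partitions update disjoint elements (no races).
        std::vector<std::vector<int>> plan(nparts);
        for (int i = 0; i < (int)col.size(); ++i)
            plan[col[i] % nparts].push_back(i);

        // Executor: each partition could now run on its own process.
        for (int p = 0; p < nparts; ++p)
            for (int i : plan[p]) x[col[i]] += val[i];

        for (double v : x) std::printf("%g ", v);
        std::printf("\n");
        return 0;
    }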

Near-Optimal Scheduling Mechanisms for Deadline-Sensitive Jobs in Large Computing Clusters

We consider a market-based resource allocation model for batch jobs in cloud computing clusters. In our model, what matters for a job is the due date by which it needs to be completed, rather than the number of servers allocated to it at any given time. Each batch job is characterized by its total work volume in computing units (e.g., CPU hours) along with a bound on its maximum degree of parallelism. Users specify, along with these job characteristics, their desired due date and a value for finishing the job by its deadline. Given this specification, the primary goal is to schedule cloud computing instances under capacity constraints so as to maximize the social welfare (i.e., the sum of values gained by allocated users). Our main result is a new $\frac{C}{C-k}\cdot\frac{s}{s-1}$-approximation algorithm for this objective, where $C$ denotes the cloud capacity, $k$ is the maximal bound on parallelized execution (in practical settings, $k \ll C$), and $s$ is the slackness on the job completion time, i.e., the minimal ratio between a specified deadline and the earliest finish time of a job. Our algorithm is based on dual fitting arguments over a strengthened linear programming relaxation of the problem.

Based on the new approximation algorithm, we construct truthful allocation and pricing mechanisms, in which reporting the true value and other properties of the job (deadline, work volume, and the parallelism bound) is a dominant strategy for all users. To that end, we extend known results for single-value settings to provide a general framework for transforming allocation algorithms into truthful mechanisms in domains with a single value and multiple properties. We then show that the basic mechanism can be extended, under proper Bayesian assumptions, to the objective of maximizing revenue, which is important for public clouds. We empirically evaluate the benefits of our approach through simulations on datacenter job traces, and show that the revenues obtained under our mechanism are comparable with an ideal fixed-price mechanism, which sets an on-demand price using oracle knowledge of users' valuations. Finally, we discuss how our model can be extended to accommodate uncertainties in job work volumes, which is a practical challenge in cloud settings.
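
The guarantee is easy to get a numerical feel for: the factor $(C/(C-k))\cdot(s/(s-1))$ tends to 1 as capacity dwarfs the parallelism bound and as slackness grows, as this small computation (with made-up parameter values) shows.

    #include <cstdio>

    // Approximation factor from the abstract: (C/(C-k)) * (s/(s-1)).
    double factor(double C, double k, double s) {
        return (C / (C - k)) * (s / (s - 1.0));
    }

    int main() {
        std::printf("C=1000,  k=10, s=2  -> %.3f\n", factor(1000, 10, 2));
        std::printf("C=1000,  k=10, s=4  -> %.3f\n", factor(1000, 10, 4));
        std::printf("C=10000, k=10, s=10 -> %.3f\n", factor(10000, 10, 10));
        return 0;
    }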

Adding Approximate Counters

We describe a general framework for adding the values of two approximate counters to produce a new approximate counter value whose expected estimated value is equal to the sum of the expected estimated values of the given approximate counters. (To the best of our knowledge, this is the first published description of any algorithm for adding two approximate counters.) We then work out implementation details for five different kinds of approximate counter and provide optimized pseudocode. For three of them, we present proofs that the variance of a counter value produced by adding two counter values in this way is bounded, and in fact is no worse, or not much worse, than the variance of the value of a single counter to which the same total number of increment operations have been applied. Addition of approximate counters is useful in massively parallel divide-and-conquer algorithms that use a distributed representation for large arrays of counters. We describe two machine-learning algorithms for topic modeling that use millions of integer counters, and confirm that replacing the integer counters with approximate counters is effective, speeding up a GPU-based implementation by over 65% and a CPU-based implementation by nearly 50%, as well as reducing memory requirements, without degrading their statistical effectiveness.
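
As a flavor of the problem, the sketch below uses a Morris-style counter with estimate est(c) = 2^c - 1 and one simple addition rule that randomizes between the two nearest counter values so that the expected estimate of the result equals the sum of the inputs' estimates. This illustrates the general idea only; it is not the paper's optimized algorithms.

    #include <cmath>
    #include <cstdio>
    #include <random>

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    double est(int c) { return std::pow(2.0, c) - 1.0; }

    // Standard Morris increment: bump with probability 2^-c.
    void inc(int& c) { if (uni(rng) < std::pow(2.0, -c)) ++c; }

    // Add two counters so that E[est(result)] = est(c1) + est(c2).
    int add(int c1, int c2) {
        double s = est(c1) + est(c2);
        int lo = (int)std::floor(std::log2(s + 1.0));
        double p = (s - est(lo)) / (est(lo + 1) - est(lo));
        return lo + (uni(rng) < p ? 1 : 0);
    }

    int main() {
        int a = 0, b = 0;
        for (int i = 0; i < 1000; ++i) inc(a);  // ~1000 increments
        for (int i = 0; i < 500; ++i) inc(b);   // ~500 increments
        int c = add(a, b);
        std::printf("est(a)=%.0f est(b)=%.0f est(a+b)=%.0f\n",
                    est(a), est(b), est(c));
        return 0;
    }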

Power Management of Extreme-scale Networks with On/Off Links in Runtime Systems

Networks are among the major power consumers in large-scale parallel systems. During execution of common parallel applications, a sizeable fraction of the links in high-radix interconnects are either never used or are underutilized. We propose a runtime-system-based adaptive approach to turn off unused links, which has various advantages over previously proposed hardware- and compiler-based approaches. We discuss why the runtime system is the best system component to accomplish this task, and test the effectiveness of our approach using real applications (including NAMD and MILC) and application benchmarks (including the NAS Parallel Benchmarks and Stencil). These codes are simulated on representative topologies such as a 6-D torus and a multilevel directly connected network (similar to IBM PERCS in Power 775 and Dragonfly in Cray Aries). For common applications with near-neighbor communication patterns, our approach can save up to 20% of the machine's total power and energy, without any performance penalty.
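
A highly simplified sketch of such a policy (the struct, interval logic, and threshold are illustrative, not the paper's implementation): because the runtime system routes the application's messages, it can observe per-link traffic over an interval and power down links that carried none.

    #include <cstdio>
    #include <vector>

    struct Link { long bytes_this_interval = 0; bool on = true; };

    // At the end of each observation interval, turn off links that saw
    // no traffic and reset the counters for the next interval.
    void end_of_interval(std::vector<Link>& links) {
        for (auto& l : links) {
            if (l.on && l.bytes_this_interval == 0) l.on = false;
            l.bytes_this_interval = 0;
        }
    }

    int main() {
        std::vector<Link> links(6);           // e.g., a 3-D torus node
        links[0].bytes_this_interval = 4096;  // near-neighbor traffic only
        end_of_interval(links);
        int off = 0;
        for (auto& l : links) off += !l.on;
        std::printf("%d of %zu links powered off\n", off, links.size());
        return 0;
    }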

Lock Cohorting: A General Technique for Designing NUMA Locks

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a new general technique for designing NUMA-aware locks that is as simple as it is powerful.

Lock cohorting allows one to transform any spin-lock algorithm, with minimal non-intrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability.

We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real-world key-value store application (memcached), and the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.
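
A compact spin-lock rendition of the cohorting idea, assuming test-and-set locks at both levels (the paper generalizes to arbitrary lock types; this sketch is illustrative, not the authors' code): a releasing thread hands the lock to a waiter on its own NUMA node when one exists, keeping the global lock, and hence the protected data's cache lines, within the node, with a handoff bound for fairness.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct CohortLock {
        static constexpr int kMaxHandoffs = 64;  // fairness bound
        struct Node {
            std::atomic<int> waiters{0};
            std::atomic<bool> local{false};
            bool cohort_owns_global = false;  // touched with local lock held
            int handoffs = 0;
        };
        std::atomic<bool> global{false};
        std::vector<Node> nodes;
        explicit CohortLock(int numa_nodes) : nodes(numa_nodes) {}

        void lock(int node) {
            Node& nd = nodes[node];
            nd.waiters.fetch_add(1);
            while (nd.local.exchange(true)) std::this_thread::yield();
            nd.waiters.fetch_sub(1);
            if (!nd.cohort_owns_global) {
                while (global.exchange(true)) std::this_thread::yield();
                nd.cohort_owns_global = true;
            }
        }
        void unlock(int node) {
            Node& nd = nodes[node];
            if (nd.waiters.load() > 0 && nd.handoffs < kMaxHandoffs) {
                ++nd.handoffs;          // hand off within the cohort:
                nd.local.store(false);  // global lock stays with this node
                return;
            }
            nd.handoffs = 0;
            nd.cohort_owns_global = false;
            global.store(false);
            nd.local.store(false);
        }
    };

    long counter = 0;

    int main() {
        CohortLock lk(2);
        std::vector<std::thread> ts;
        for (int t = 0; t < 4; ++t)
            ts.emplace_back([&, t] {
                for (int i = 0; i < 10000; ++i) {
                    lk.lock(t % 2);  // thread's NUMA node (faked by id here)
                    ++counter;
                    lk.unlock(t % 2);
                }
            });
        for (auto& th : ts) th.join();
        std::printf("counter=%ld (expected 40000)\n", counter);
        return 0;
    }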

Lease/Release: Architectural Support for Scaling Contended Data Structures

High memory contention is generally agreed to be a worst-case scenario for concurrent data structures. There has been a significant amount of research investigating designs that minimize contention, and several programming techniques have been proposed to mitigate its effects. However, there are currently few architectural mechanisms that allow scaling contended data structures to high thread counts.

In this paper, we investigate hardware support for scalable contended data structures. We propose Lease/Release, a simple addition to standard directory-based MESI cache coherence protocols, allowing participants to lease memory, at the granularity of cache lines, by delaying coherence messages for a short, bounded period of time. Our analysis shows that Lease/Release can significantly reduce the overheads of contention for both non-blocking (lock-free) and lock-based data structure implementations, while ensuring that no deadlocks are introduced. We validate Lease/Release empirically on the Graphite multiprocessor simulator, on a range of data structures, including queue, stack, and priority queue implementations, as well as on transactional applications. Results show that Lease/Release consistently improves both throughput and energy usage, by up to 5x, for both lock-free and lock-based data structure designs.
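
Lease/Release is a hardware coherence mechanism with no direct software equivalent; the sketch below only conveys the amortization intuition in software terms: once a thread has paid to acquire the contended state, it holds a short, bounded "lease" (here, a small batch of operations) before releasing, so the per-operation contention cost drops. The lock and lease length are stand-ins, not the proposed protocol.

    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex line;       // stands in for ownership of the contended line
    long counter = 0;
    const int kLease = 8;  // bounded lease length, so no thread starves

    void worker(int ops) {
        int done = 0;
        while (done < ops) {
            std::lock_guard<std::mutex> hold(line);
            // Work on the "leased" line for at most kLease operations.
            for (int i = 0; i < kLease && done < ops; ++i, ++done) ++counter;
        }
    }

    int main() {
        std::vector<std::thread> ts;
        for (int t = 0; t < 4; ++t) ts.emplace_back(worker, 10000);
        for (auto& th : ts) th.join();
        std::printf("counter=%ld (expected 40000)\n", counter);
        return 0;
    }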

ESTIMA: Extrapolating ScalabiliTy of In-Memory Applications

This article presents ESTIMA, an easy-to-use tool for extrapolating the scalability of in-memory applications. ESTIMA is designed to perform a simple yet important task: given the performance of an application on a small machine with a handful of cores, ESTIMA extrapolates its scalability to a larger machine with more cores, while requiring minimum input from the user. The key idea underlying ESTIMA is the use of stalled cycles (e.g., cycles that the processor spends waiting for missed cache line fetches or busy locks). ESTIMA measures stalled cycles on a few cores and extrapolates them to more cores, estimating the amount of waiting in the system. ESTIMA can be effectively used to predict the scalability of in-memory applications on bigger machines. For instance, using measurements of memcached and SQLite on a desktop machine, we obtain accurate predictions of their scalability on a server. Our extensive evaluation shows the effectiveness of ESTIMA on a large number of in-memory benchmarks.
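
A minimal numeric illustration of the underlying idea (ESTIMA's actual model is richer, and the measurements below are invented): fit the growth of the stalled-cycle fraction measured on a few core counts, extrapolate it to larger core counts, and convert it to a throughput estimate by crediting each core only its non-stalled cycles.

    #include <cstdio>

    int main() {
        // (cores, measured stalled-cycle fraction) on the small machine.
        double n[] = {2, 4, 6, 8};
        double stall[] = {0.05, 0.07, 0.09, 0.11};
        const int m = 4;

        // Least-squares fit: stall(n) ~= a + b*n.
        double sn = 0, ss = 0, snn = 0, sns = 0;
        for (int i = 0; i < m; ++i) {
            sn += n[i]; ss += stall[i];
            snn += n[i] * n[i]; sns += n[i] * stall[i];
        }
        double b = (m * sns - sn * ss) / (m * snn - sn * sn);
        double a = (ss - b * sn) / m;

        for (double cores : {16.0, 32.0, 64.0}) {
            double s = a + b * cores;          // extrapolated stall fraction
            if (s > 1) s = 1;
            double speedup = cores * (1 - s);  // useful cycles across cores
            std::printf("%2.0f cores: stall=%.2f, predicted speedup=%.1f\n",
                        cores, s, speedup);
        }
        return 0;
    }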

Efficient data streaming multiway aggregation through concurrent algorithmic designs and new abstract data types

Data streaming relies on continuous queries to process unbounded streams of data in a real-time fashion. Such processing is commonly demanding in computational capacity, given that the relevant applications involve very large volumes of data. Data structures act as articulation points and maintain the state of data streaming operators, potentially supporting high parallelism and balancing the work between them. Prompted by this fact, in this work we study and analyze the parallelization needs of these articulation points, an issue that has been overlooked in the literature. In particular, we lay several cornerstones for this line of study by focusing on the problem of streaming multiway aggregation, where large data volumes are received from multiple input streams. The analysis of the parallelization needs, as well as of the use and limitations of existing aggregate designs and their data structures, leads us to identify the need for new concurrent abstract data types for achieving low-latency and high-throughput multiway aggregation. We present efficient lock-free linearizable implementations of these concurrent abstract data types and new multiway aggregate designs that leverage them, supporting both deterministic order-sensitive and order-insensitive aggregate functions. Furthermore, we point out future directions that open up through these contributions. The paper includes an extensive experimental study, based on a variety of aggregation continuous queries on two large datasets extracted from SoundCloud, a music social network, and from a Smart Grid network. In all the experiments, the proposed data structures and the enhanced aggregate operators improved processing performance significantly, by up to an order of magnitude in both throughput and latency, over commonly used techniques based on queues.
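
For reference, here is a sequential sketch of what streaming multiway aggregation computes (the paper's subject is the concurrent, lock-free machinery that does this at high throughput and low latency; this sketch has none of that): tuples from several input streams are folded into per-window aggregates, shown here as a tumbling-window sum.

    #include <cstdio>
    #include <map>
    #include <vector>

    struct Tuple { long ts; double value; };

    int main() {
        // Three input streams, tuples carrying (timestamp, value).
        std::vector<std::vector<Tuple>> streams = {
            {{1, 1.0}, {12, 2.0}}, {{3, 4.0}, {11, 1.0}}, {{7, 2.5}}};
        const long window = 10;  // tumbling windows [0,10), [10,20), ...

        std::map<long, double> agg;  // window start -> running sum
        for (const auto& s : streams)
            for (const auto& t : s)
                agg[(t.ts / window) * window] += t.value;

        for (const auto& kv : agg)
            std::printf("window [%ld,%ld): sum=%g\n",
                        kv.first, kv.first + window, kv.second);
        return 0;
    }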

Bibliometrics

Publication Years: 2014-2017
Publication Count: 71
Citation Count: 51
Available for Download: 71
Downloads (6 weeks): 321
Downloads (12 months): 4652
Downloads (cumulative): 12267
Average downloads per article: 173
Average citations per article: 1
First Name Last Name Award
Grey Ballard ACM Doctoral Dissertation Award Honorable Mention (2013)
Guy Blelloch ACM Fellows (2011)
James C Browne ACM Fellows (1998)
James Demmel ACM Paris Kanellakis Theory and Practice Award (2014)
ACM Fellows (1999)
Jack Dongarra ACM-IEEE CS Ken Kennedy Award (2013)
ACM Fellows (2001)
Phillip B Gibbons ACM Fellows (2006)
William D Gropp ACM-IEEE CS Ken Kennedy Award (2016)
SIAM/ACM Prize in Computational Science and Engineering (2014)
ACM Fellows (2006)
David Paul Grove ACM Fellows (2012)
ACM Distinguished Member (2010)
ACM Senior Member (2006)
Maurice Herlihy ACM Fellows (2005)
Charles E Leiserson ACM-IEEE CS Ken Kennedy Award (2014)
ACM Paris Kanellakis Theory and Practice Award (2013)
ACM Fellows (2006)
ACM Doctoral Dissertation Award (1982)
Michael Mitzenmacher ACM Fellows (2014)
Mooly Sagiv ACM Fellows (2015)
Vijay Saraswat ACM Doctoral Dissertation Award (1989)
Michael Scott ACM Fellows (2006)
Nir N Shavit ACM Fellows (2013)
Julian Shun ACM Doctoral Dissertation Award (2015)
Aravind Srinivasan ACM Fellows (2014)

First Name Last Name Paper Counts
Nicholas Knight 3
Joseph Naor 2
Guy Blelloch 2
Benjamin Moseley 2
James Demmel 2
Charles Leiserson 2
Tao Schardl 2
Grey Ballard 2
Peter Kling 2
Chinmoy Dutta 1
Gopal Pandurangan 1
Andrea Vattani 1
Christian Scheideler 1
Thomas Groß 1
Sungjin Im 1
Davide Bilò 1
Luciano Gualà 1
Hafiz Sheikh 1
Ishfaq Ahmad 1
Yves Robert 1
Gokcen Kestor 1
Walther Maldonado 1
Torsten Hoefler 1
William Gropp 1
Maurice Herlihy 1
Xavier Martorell 1
Olivier Tardieu 1
Paul Thomson 1
Dave Dice 1
Andrew Grimshaw 1
Jianjia Chen 1
Stephen Siegel 1
Ishai Menache 1
Bo Zhao 1
Mahesh Ravishankar 1
Ponnuswamy Sadayappan 1
Matthieu Dorier 1
Gabriel Antoniu 1
Yu Wang 1
Xing Wu 1
Aurélien Bouteiller 1
Thomas Hérault 1
Sergei Vassilvitskii 1
Ioana Bercea 1
David Harris 1
Kirk Pruhs 1
Eric Torng 1
Tim Kaler 1
Shenchen Xu 1
Paolo Romano 1
Oliver Sinnen 1
James Dinan 1
Wickus Nienaber 1
Darko Petrović 1
George Teodoro 1
Adam Betts 1
Johannes Hagemann 1
Youtao Zhang 1
Jagannathan Ramanujam 1
Francis O'Connell 1
Bruce Mealey 1
Robert Sisneros 1
Rajeev Barua 1
Raoul Steffen 1
Scott Roche 1
Vijaya Ramachandran 1
Mooly Sagiv 1
Phillip Gibbons 1
Aapo Kyrola 1
Erin Carson 1
Jeffrey Blanchard 1
Erik Opavsky 1
Lukas Arnold 1
Aurélien Cavelan 1
Lionel Eyraud-Dubois 1
Frédéric Vivien 1
Pascal Felber 1
Étienne Rivière 1
Zhiyu Liu 1
Santosh Mahapatra 1
Vijay Saraswat 1
Mandana Vaziri 1
Paul Sack 1
Santiago Pagani 1
Moran Feldman 1
Liane Lewin-Eytan 1
Yi Xu 1
Jun Yang 1
Louis Pouchet 1
Scott Pakin 1
Pradip Bose 1
Chaodong Zheng 1
I Lee 1
Jim Sukha 1
Joseph Izraelevitz 1
Zoltan Majo 1
Navin Goyal 1
Aravind Srinivasan 1
William Hasenplaugh 1
Guido Proietti 1
Anne Benoit 1
Ioannis Koutis 1
Nuno Diegues 1
Osman Ünsal 1
Rajeev Thakur 1
Eduard Ayguadé 1
Alba De Melo 1
Avraham Shinnar 1
Mikio Takeuchi 1
Virendra Marathe 1
Nir Shavit 1
Michael Garland 1
Stephan Kramer 1
Laxmikant Kale 1
Jörg Henkel 1
Andrew Siegel 1
Atanas Rountev 1
Frank Mueller 1
Timothy Heil 1
Anil Krishna 1
Roberto Gioiosa 1
Marc Snir 1
Shadi Ibrahim 1
Leigh Orf 1
Ronghua Liang 1
Rajmohan Rajaraman 1
Michael Scott 1
Jeremy Fineman 1
Uday Bondhugula 1
Felix Wolf 1
Julia Lawall 1
Thomas Ropars 1
Guillermo Miranda 1
Duane Merrill 1
Ehsan Totoni 1
Nikhil Jain 1
Adam Hammouda 1
John Eisenlohr 1
Ashay Rane 1
Farnaz Toussi 1
Francisco Cazorla 1
Franck Cappello 1
Jun Wang 1
Seth Gilbert 1
Peter Sanders 1
Jochen Speck 1
Ravi Kumar 1
Guy Golan-Gueta 1
Ganesan Ramalingam 1
Harsha Simhadri 1
Jiayang Jiang 1
Michael Mitzenmacher 1
Felix Voigtlaender 1
Loris Marchal 1
Patrick Marlier 1
George Bosilca 1
Peng Du 1
Jack Dongarra 1
Sebastian Kobbe 1
Jonathan Yaniv 1
Bastian Degener 1
Friedhelm Heide 1
Julian Shun 1
Steven Vanderwiel 1
Alex Druinsky 1
Ciaran McCreesh 1
Peter Pietrzyk 1
Richard Cole 1
Eran Yahav 1
Justin Thaler 1
Stefano Leucci 1
Emircan Uysaler 1
David Böhme 1
Markus Geimer 1
Pavan Balaji 1
Keith Underwood 1
Xin Yuan 1
Edans De O. Sandes 1
Benjamin Herta 1
David Grove 1
Prabhanjan Kambadur 1
Alastair Donaldson 1
Saeed Maleki 1
Madanlal Musuvathi 1
Todd Mytkowicz 1
Janmartin Jahn 1
Navendu Jain 1
James Browne 1
Orcun Yildiz 1
Tom Peterka 1
Timothy Creech 1
Patrick Prosser 1
Zhunping Zhang 1
Martina Eikel 1
Edgar Solomonik 1
Roshan Dathathri 1
Ravi Mullapudi 1
Hongyang Sun 1
Adrián Cristal 1
Serdar Taşiran 1
Gilles Muller 1
Brian Barrett 1
André Schiper 1
Wei Zhang 1
David Cunningham 1
Barbara Kempkes 1
Nicholas Lindberg 1
Víctor Jiménez 1
Alper Buyuktosunoglu 1
Oded Schwartz 1
Jiaquan Gao 1

Affiliation Paper Counts
Tel Aviv University 1
University of Auckland 1
University of Houston 1
Los Alamos National Laboratory 1
Koc University 1
Spanish National Research Council 1
Louisiana State University 1
Hebrew University of Jerusalem 1
University of California , Merced 1
University of Sassari 1
Technical University of Darmstadt 1
RWTH Aachen University 1
Nanjing Normal University 1
University of Virginia 1
Massachusetts Institute of Technology 1
University of Delaware 1
Georgetown University 1
Lawrence Livermore National Laboratory 1
University of Roma Tor Vergata 1
University of California, Los Angeles 1
University of California, San Diego 1
Michigan State University 1
University of Wisconsin Madison 1
University of Puerto Rico 1
Yahoo Research Labs 1
Huawei Technologies Co., Ltd., USA 1
Universite de Bordeaux 1
IBM, Japan 1
University of Glasgow 2
University of Texas at Arlington 2
North Carolina State University 2
Instituto Superior Tecnico 2
Google Inc. 2
Lawrence Berkeley National Laboratory 2
Sandia National Laboratories, New Mexico 2
Washington University in St. Louis 2
Brown University 2
National University of Singapore 2
University of L'Aquila 2
New York University 2
Pacific Northwest National Laboratory 2
Zhejiang University of Technology 2
University of Rochester 2
Northeastern University 2
University of Gottingen 2
NVIDIA 2
Universite de Lyon 2
Florida State University 3
Universitat Politecnica de Catalunya 3
Harvard University 3
University of Texas at Austin 3
INRIA Institut National de Recherche en Informatique et en Automatique 3
Indian Institute of Science 3
Imperial College London 3
University of Brasilia 3
Swiss Federal Institute of Technology, Zurich 3
Swiss Federal Institute of Technology, Lausanne 3
Grinnell College 3
Barcelona Supercomputing Center 3
University of Neuchatel 4
Ohio State University 4
Ecole Normale Superieure de Lyon 4
Intel Corporation 4
University of Pittsburgh 5
University of Tennessee, Knoxville 5
University of Maryland 5
Technion - Israel Institute of Technology 5
Carnegie Mellon University 5
Microsoft Research 6
IBM, USA 7
University of California, Berkeley 7
University of Illinois at Urbana-Champaign 8
MIT Computer Science and Artificial Intelligence Laboratory 8
University of Paderborn 8
Argonne National Laboratory 8
Karlsruhe Institute of Technology 8
IBM Thomas J. Watson Research Center 11

Archive


2017
Volume 4 Issue 2, August 2017 Issue-in-Progress
Volume 4 Issue 1, August 2017 Issue-in-Progress
Volume 3 Issue 4, March 2017 Special Issue on PPoPP 2015 and Regular Papers

2016
Volume 3 Issue 3, December 2016
Volume 3 Issue 2, August 2016
Volume 3 Issue 1, June 2016 Special Issue for SPAA 2014
Volume 2 Issue 4, March 2016 Special Issue on PPOPP 2014

2015
Volume 2 Issue 3, October 2015 Special Issue for SPAA 2013
Volume 2 Issue 2, July 2015
Volume 2 Issue 1, May 2015 Special Issue on SPAA 2012
Volume 1 Issue 2, January 2015 Special Issue on PPOPP 2012

2014
Volume 1 Issue 1, September 2014 Inaugural Issue and Special Section on Top Papers from PACT-21, and Regular Papers
 