Concurrent hash tables are among the most important concurrent data structures and are used in numerous applications. Since hash table accesses can dominate the execution time of whole applications, we need implementations that achieve good speedup even in these cases. Unfortunately, currently available concurrent hashing libraries fall far short of this requirement, in particular when adaptively sized tables are necessary or contention on some elements occurs.
Our starting point for better-performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is, however, limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables.
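To make the starting point concrete, here is a minimal sketch (not the paper's implementation) of a fixed-capacity, lock-free hash table with linear probing. Keys and values are single machine words, so an empty slot can be claimed with one compare-and-swap; the hash constant, the reserved empty key, and the fixed capacity are illustrative assumptions:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: keys are nonzero, so 0 can mark a free slot.
constexpr uint64_t EMPTY = 0;
constexpr size_t   CAPACITY = 1024;   // fixed size: no dynamic growing here

struct Slot {
    std::atomic<uint64_t> key{EMPTY};
    std::atomic<uint64_t> value{0};
};

Slot table[CAPACITY];

static size_t slot_of(uint64_t key) {
    return (key * 0x9E3779B97F4A7C15ULL) % CAPACITY;  // simple multiplicative hash
}

bool insert(uint64_t key, uint64_t value) {
    size_t h = slot_of(key);
    for (size_t i = 0; i < CAPACITY; ++i) {
        Slot& s = table[(h + i) % CAPACITY];          // linear probing
        uint64_t k = s.key.load();
        if (k == key) { s.value.store(value); return true; }  // overwrite
        if (k == EMPTY) {
            uint64_t expected = EMPTY;
            // Claim the empty slot with one CAS; if another thread won the
            // race with the same key, fall through to the overwrite case.
            if (s.key.compare_exchange_strong(expected, key)) {
                s.value.store(value);
                return true;
            }
            if (expected == key) { s.value.store(value); return true; }
        }
    }
    return false;  // table full -- exactly the limitation growing removes
}

bool find(uint64_t key, uint64_t& out) {
    size_t h = slot_of(key);
    for (size_t i = 0; i < CAPACITY; ++i) {
        Slot& s = table[(h + i) % CAPACITY];
        uint64_t k = s.key.load();
        if (k == key) { out = s.value.load(); return true; }
        if (k == EMPTY) return false;   // key absent
    }
    return false;
}
```

The one-word-per-field layout is what makes the single-CAS claim possible; supporting larger keys or values, deletions, and resizing is precisely where the paper's generalizations come in.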
We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude.
Optimizing I/O Performance of HPC Applications with Autotuning
BARAN: Bimodal Adaptive Reconfigurable-Allocator Network-on-Chip
Natural graphs with skewed degree distributions raise unique challenges for distributed graph computation and partitioning. Existing graph-parallel systems usually adopt a "one size fits all" design that processes all vertices uniformly, and consequently either suffer from notable load imbalance and high contention for high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX).
In this paper, we argue that the skewed distribution in natural graphs also calls for differentiated processing of high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems by dynamically applying different computation and partitioning strategies to different vertices. PowerLyra further provides an efficient hybrid graph partitioning algorithm (hybrid-cut) that combines edge-cut and vertex-cut with heuristics. On top of PowerLyra, we design a locality-conscious data layout optimization to improve the cache locality of graph accesses during communication. PowerLyra is implemented on the latest GraphLab and can seamlessly support various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (machine learning and data mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53X (from 1.24X) and 3.26X (from 1.49X) for real-world and synthetic graphs, respectively, and is much faster than other systems like GraphX and Giraph, yet with much less memory consumption. Porting hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.
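The core of the hybrid-cut idea can be sketched as a per-edge placement rule that differentiates by the degree of the edge's target vertex. The degree threshold and hash choices below are illustrative assumptions, not PowerLyra's exact heuristics:

```cpp
#include <cstddef>
#include <functional>

struct Edge { int src, dst; };

// Hypothetical placement rule sketching the hybrid-cut heuristic:
// in_degree[] gives each vertex's in-degree; threshold separates
// "low-degree" from "high-degree" target vertices.
size_t place_edge(const Edge& e, const int* in_degree,
                  int threshold, size_t num_partitions) {
    if (in_degree[e.dst] <= threshold)
        // Low-degree target: edge-cut style -- co-locate all of the
        // target's in-edges with the target vertex, avoiding the
        // communication and replication overhead of vertex-cut.
        return std::hash<int>{}(e.dst) % num_partitions;
    else
        // High-degree target: vertex-cut style -- spread the target's
        // in-edges by source vertex to balance load and reduce contention.
        return std::hash<int>{}(e.src) % num_partitions;
}
```

The point of the differentiation is that all in-edges of a low-degree vertex land on one machine, while the many in-edges of a high-degree vertex are distributed, matching each vertex class to the strategy that suits it.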
The group mutual exclusion (GME) problem (also called the room synchronization problem) arises in various practical applications that require concurrent data sharing. Group mutual exclusion aims to achieve exclusive access to shared data (a shared room) while facilitating concurrency among non-conflicting requests: threads with distinct interests are not allowed to access the shared resource concurrently, but multiple threads with the same interest can. Blelloch et al. presented an elegant solution to the room synchronization problem using fetch-and-add and test-and-set atomic operations. Their algorithm incurs $O(m)$ remote memory references (RMRs) in the cache-coherent (CC) model, where $m$ is the number of forums. Bhatt et al. posed an open problem: ``Is it possible to design a GME algorithm with constant RMR for the CC model using fetch-and-add instructions?'' This paper answers the question affirmatively by presenting a group mutual exclusion algorithm using fetch-and-increment instructions. The algorithm is simple and scalable.