# Efficient Parallel Scan Algorithms for GPUs

@inproceedings{Sengupta2011EfficientPS, title={Efficient Parallel Scan Algorithms for GPUs}, author={Shubhabrata Sengupta and Mark J. Harris and Michael Garland}, year={2011} }

Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. [...] Key Method Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly ecient, and free of irregular access patterns that lead to memory bank conicts. These algorithms form the basis for current and upcoming releases of the… Expand

#### 158 Citations

Efficient Scan Operator Methods on a GPU

- Computer Science
- 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing
- 2014

A number of parallel scan methods based on the traditional cyclic reduction tridiagonal solver and the Ladner-Fischer parallel prefix adder are presented and a set of new features introduced in the Kepler Nvidia architecture such as read-only data cache and shuffle instructions are analyzed. Expand

Automatic Scan Parallelization in OpenMP

- Computer Science
- 2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)
- 2017

Experiments running a set of typical scan based algorithms on NVIDIA, Intel, and ARM GPUs reveal that the performance of the proposed OpenMP clause is equivalent to that achieved when using OpenCL library calls, with the advantage of a simpler programming complexity. Expand

StreamScan: fast scan algorithms for GPUs without global barrier synchronization

- Computer Science
- PPoPP '13
- 2013

StreamScan is a novel approach to implement scan on GPUs with only one computation phase, and the main idea is to restrict synchronization to only adjacent workgroups, and thereby eliminating global barrier synchronization completely. Expand

Parallel Scan for Stream Architectures

- Computer Science
- 2012

This work presents three implementations of parallel scan that are memory-bound, utilize 100% of achievable memory bandwidth, and only require the use of a constant amount of global device memory for the storage of intermediate results. Expand

Optimizing Parallel Prefix Operations for the Fermi Architecture

- Computer Science
- 2012

This chapter demonstrates the application of new instructions designed to facilitate basic, but important, parallel primitives on per-thread predicates as well as instructions for manipulating and querying bits within a word in the construction of efficient parallel algorithm primitives such as reductions, scans, and segmented scans of binary or Boolean data. Expand

Efficient Data-Parallel Primitives on Heterogeneous Systems

- Computer Science
- ICPP
- 2019

The goal is to identify the key performance factors in the implementations of data-parallel primitive operations on different architectures and develop general strategies for implementing these primitives efficiently on various platforms and deliver common primitive optimization strategies for heterogeneous systems. Expand

Single-pass Parallel Prefix Scan with Decoupled Lookback

- 2016

We describe a work-efficient, communication-avoiding, singlepass method for the parallel computation of prefix scan. When consuming input from memory, our algorithm requires only ~2n data movement: n… Expand

Efficient Solving of Scan Primitive on Multi-GPU Systems

- Computer Science
- 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2018

This article presents a GPU-suitable algorithm and a tuning strategy for performing the scan primitive over large problem sizes in CUDA, and describes an optimal proposal analyzed over different multiple GPU environments, the first multiple-GPU batch scan proposal to the best of the authors' knowledge. Expand

A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment

- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2013

This study presents a parallel scan method that is derived from the Han-Carlson parallel prefix graph and is both a work-efficient and a depth-optimal process and employs a novel cascaded thread-block execution method to exploit the single-program-multiple- data (SPMD) nature of the compute unified device architecture (CUDA) environment developed by NVIDIA. Expand

A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication

- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2012

A novel approach to sparse matrix-vector multiplication (SpMV) based on the proposed scan, unlike the existing ones that make use of backward segmented operations, uses forward ones for more efficient caching. Expand

#### References

SHOWING 1-10 OF 22 REFERENCES

Fast scan algorithms on graphics processors

- Computer Science
- ICS '08
- 2008

This work uses novel data representations that map well to the GPU architecture to exploit shared memory to improve memory performance and improves the performance of algorithms for scan and segmented scan by eliminating shared-memory bank conflicts and reducing the overheads. Expand

Scan primitives for GPU computing

- Computer Science, Materials Science
- GH '07
- 2007

Using the scan primitives, this work shows novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyzes the performance of the scanPrimitives, several sort algorithms that use the scan Primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver. Expand

Scans as Primitive Parallel Operations

- Computer Science
- ICPP
- 1987

A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic… Expand

Scan primitives for vector computers

- Computer Science
- Proceedings SUPERCOMPUTING '90
- 1990

The authors describe an optimized implementation of a set of scan primitives on a single processor of a CRAY Y-MP, and demonstrate that their use leads to greatly improved performance for several applications that cannot be vectorized with existing computer technology. Expand

Vector Models for Data-Parallel Computing

- Computer Science
- 1990

A model of parallelism that extends and formalizes the Data-Parallel model on which the Connection Machine and other supercomputers are based is described, and it is argued that data-parallel models are not only practical and can be applied to a surprisingly wide variety of problems, they are also well suited for very-high-level languages and lead to a concise and clear description of algorithms and their complexity. Expand

Implementation of a Portable Nested Data-Parallel Language

- Computer Science
- J. Parallel Distributed Comput.
- 1994

Initial benchmark results of NESL show that NESL′s performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data. Expand

A Work-Efficient Step-Efficient Prefix Sum Algorithm

- Computer Science
- 2006

The Prefix-sum algorithm is one of the most important building blocks for data-parallel computation and its applications include parallel implementations of deleting marked elements from an array, radix-sort, solving recurrence equations, solving tri-diagonal linear systems, and quicksort. Expand

On Parallel Prefix Computation

- Computer Science
- Parallel Process. Lett.
- 1994

We prove that prefix sums of n integers of at most b bits can be found on a COMMON CRCW PRAM in time with a linear time-processor product. The algorithm is optimally fast, for any polynomial number… Expand

Fast Summed‐Area Table Generation and its Applications

- Computer Science
- Comput. Graph. Forum
- 2005

The algorithm for generating summed-area tables is introduced, similar to a technique used in scientific computing called recursive doubling, that allows the generation of a summed- area table in O(log n) time. Expand

A programming language

- Computer Science
- AIEE-IRE '62 (Spring)
- 1962

The paper describes a succinct problem-oriented programming language that relies heavily on a systematic extension of a small set of basic operations to vectors, matrices, and trees, and on a family of flexible selection operations controlled by logical vectors. Expand