See reference paper for description.

Notes on implementation

The sort function is named bin_sort() instead of upc_all_sort() so it can coexist with the original sorting function in the upc collectives. A compiled version of the upc collectives library is included with the source code.




Implementing Sort in UPC: Performance and Optimization

Technical Report 05-06, Michigan Technological University, Department of Computer Science (2005)

Author(s): K. Begum and S. Seidel
Abstract: Unified Parallel C (UPC) is a parallel extension of ANSI C that is based on a partitioned shared address space. It supports development of parallel applications over diverse hardware platforms and does not require a true shared memory system. It simplifies programming by providing the same syntax for local references as for remote references. This helps the user focus on performance, and allows for the exploitation of locality. Our project is to implement a generic sort function in UPC. It will help programmers focus on design, implementation and performance of the entire application rather than work on details of sorting.
upc_all_sort is an implementation of a generic sort function in UPC using a bin sort algorithm. Techniques applied to improve the performance of the sort function include: sampling of data to ensure more nearly uniform bucket sizes among threads, an all-to-all exchange of bucket sizes to minimize the amount of dynamically allocated memory, rebalancing data at the end of the sort to ensure that each thread finishes with the same number of elements as it started with. In this paper, we analyze and compare the performance of sequential and parallel sort, the potential of performance improvements over sequential sort and also the bottlenecks limiting these improvements. This sort function runs 3.9 times faster than the sequential sort for problem sizes of at least 128K when run on 8 threads. The performance scales up with the increasing number of threads and bigger problem sizes. It runs 6.8 times faster than sequential sort for problem size of 512K when run on 16 threads.
( pdf | ps )


Oystein Thorsen
Dr. Steven Seidel Last modified 3/3/06