62
Views
20
CrossRef citations to date
0
Altmetric
Original Articles

Generic support of algorithmic and structural recursion for scientific computingFootnote1

, &
Pages 479-503 | Received 19 Jan 2009, Accepted 19 Jan 2009, Published online: 30 Nov 2009
 

Abstract

Recursive algorithms, like quick-sort and recursive data structures, like trees, play a central role in programming. In the context of scientific computing, recursive algorithms and memory layouts are shown here to provide excellent cache and Translation Lookaside Buffer (TLB) locality independently of the platform. We show how, for the first time, generic programming and object-oriented programming allow us to abstract a multitude of dense-matrix memory layouts: from conventional row-major and column-major layouts over Z- and И-Morton orders to block-wise combinations of them. All are provided by a single class that is based on our new matrix abstraction.

The algorithmic recursion is supported in generic fashion by classes modelling the new recursator, an analogue of the Standard Template Library iterator. Although this concept supports recursion in general, we focus again on matrix operations. Results are presented for matrix multiplication, on both conventional and tiled representations using both homogeneous and heterogeneous matrix representations. Reaching about 60% peak performance in portable C++ code establishes competitive performance without explicit prefetching and other platform-specific tuning. Comparisons with the manufacturers' libraries show superior locality. These new techniques are embedded in the Matrix Template Library, Version 4 (MTL4).

Acknowledgements

We thank Christopher Cole for his help rerunning data and final editing, and Laurent Plagne for his help with BTL and the libraries.

Notes

2. Supported, in part, by NSF under grants numbered ACI–0219884 and CCF-0541364. The US Government retains a license to exercise, or to have exercised on its behalf, almost all of the rights of copyright. Permission for others to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

1. Supported, in part, by NSF under a grant numbered EIA-0202048. An earlier version of this paper was presented at the 7th Workshop on Parallel/High-Performance Object-Oriented Scientific Computing (POOSC’08) in Paphos, Cyprus, on 2008 July 8.

3. Classic Latin suggests Recursor from recurere, but Cursor already exists in MTL4. We root our word from later Latin recursare.

4. The real version in MTL4 is implemented as a functor for better composition. It also contains dispatching to non-recursive functors when one of the matrix types does not support sub-divisibility. Another template argument parameterises the element update to subsume C+ = A*B, C − = A*B, and C = A*B. Nevertheless, the recursive technique is the same and we omit other issues here for the sake of simplicity.

5. Current compilers are not able to verify whether the performed operation is a semantically correct matrix multiplication, only whether the signatures match. Research is in progress to incorporate compilers into the semantic verification of generic and non-generic libraries [Citation14,Citation15].

6. In addition, we compared our locality with GotoBLAS for the Opteron [Citation13]. For brevity, we omit the plots here and only state that our block-recursive revealed better locality regarding L1, L2 and TLB misses. However, the differences are much smaller than for ACML.

7. There is an exception to this statement. Some compilers have predefined, hand-tuned target code for certain source code patterns. Then, rather simplistic implementation might lead to excellent performance, while sophisticated source code might not match the corresponding pattern and result in lower performance.

8. Currently, this dispatching is only available for dense matrix products. However, after the technology for it is developed, further utilisations will be straight-forward.

Additional information

Notes on contributors

David S. Wise

2. 2. Supported, in part, by NSF under grants numbered ACI–0219884 and CCF-0541364. The US Government retains a license to exercise, or to have exercised on its behalf, almost all of the rights of copyright. Permission for others to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Reprints and Corporate Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

To request a reprint or corporate permissions for this article, please click on the relevant link below:

Academic Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

Obtain permissions instantly via Rightslink by clicking on the button below:

If you are unable to obtain permissions via Rightslink, please complete and submit this Permissions form. For more information, please visit our Permissions help page.