Section B

Exploring emerging manycore architectures for uncertainty quantification through embedded stochastic Galerkin methods

Pages 707-729 | Received 30 Jan 2013, Accepted 29 Aug 2013, Published online: 05 Dec 2013

Abstract

We explore approaches for improving the performance of intrusive or embedded stochastic Galerkin uncertainty quantification methods on emerging computational architectures. Our work is motivated by the trend of increasing disparity between floating-point throughput and memory access speed on emerging architectures, which requires the design of new algorithms with memory access patterns more commensurate with computer architecture capabilities. We first compare the traditional approach for implementing stochastic Galerkin methods to non-intrusive spectral projection methods employing high-dimensional sparse quadratures on relevant problems from computational mechanics, and demonstrate that the performance of stochastic Galerkin is reasonable. Several reorganizations of the algorithm with improved memory access patterns are described, and their performance is measured on contemporary manycore architectures. We demonstrate that these reorganizations can lead to improved performance for the matrix–vector products needed by iterative linear system solvers, and highlight further algorithm research that might lead to even greater performance.

2010 AMS Subject Classifications:

View correction statement:
Erratum

Acknowledgements

This work was supported by the LDRD program of Sandia National Laboratories.

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the US Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Notes

1. In some cases, better performance can be obtained through nested Clenshaw–Curtis or Gauss–Patterson abscissas; however, we found these methods to be more expensive for a similar level of accuracy when applied to these problems.

2. For each algorithm, we compute the floating-point throughput rate by dividing the total number of floating-point operations by the total run-time. For the two flat CRS approaches, the total number of floating-point operations is exactly twice the total number of matrix non-zeros. For the dense-block multiply it is 2(P+1)²·nz (where nz is the number of non-zeros arising from the spatial discretization), and for the polynomial multiply algorithm it is (5·nc + P + 1)·nz, where nc is the number of non-zero triple-product values.
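The flop-count formulas in the note above can be sketched in code. The following is a minimal illustration, not from the paper; the function names and the helper for converting a measured run-time into GFLOP/s are our own:

```python
# Sketch of the throughput accounting described in Note 2.
# Notation follows the note: P+1 polynomial terms per block,
# nz spatial non-zeros, nc non-zero triple-product values.

def flat_crs_flops(total_nonzeros):
    # Flat CRS matrix-vector product: one multiply and one add
    # per stored non-zero of the fully assembled matrix.
    return 2 * total_nonzeros

def dense_block_flops(P, nz):
    # Dense (P+1) x (P+1) block at each spatial non-zero.
    return 2 * (P + 1) ** 2 * nz

def polynomial_multiply_flops(P, nz, nc):
    # Stochastic polynomial multiply applied at each spatial non-zero.
    return (5 * nc + P + 1) * nz

def throughput_gflops(flops, runtime_seconds):
    # Throughput = total flops divided by total run-time, in GFLOP/s.
    return flops / runtime_seconds / 1.0e9
```

For example, a dense-block multiply with P = 3 and nz = 10 performs 2·16·10 = 320 floating-point operations.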
