Abstract
We explore approaches for improving the performance of intrusive or embedded stochastic Galerkin uncertainty quantification methods on emerging computational architectures. Our work is motivated by the trend of increasing disparity between floating-point throughput and memory access speed on emerging architectures, thus requiring the design of new algorithms with memory access patterns more commensurate with computer architecture capabilities. We first compare the traditional approach for implementing stochastic Galerkin methods to non-intrusive spectral projection methods employing high-dimensional sparse quadratures on relevant problems from computational mechanics, and demonstrate the performance of stochastic Galerkin is reasonable. Several reorganizations of the algorithm with improved memory access patterns are described and their performance measured on contemporary manycore architectures. We demonstrate these reorganizations can lead to improved performance for matrix–vector products needed by iterative linear system solvers, and highlight further algorithm research that might lead to even greater performance.
Acknowledgements
This work was supported by the LDRD program of Sandia National Laboratories.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the US Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Notes
1. In some cases better performance can be obtained through nested Clenshaw–Curtis or Gauss–Patterson abscissas, however we found these methods to be more expensive for a similar level of accuracy when applied to these problems.
2. For each algorithm, we compute the floating-point throughput rate by dividing the total number of floating-point operations by the total run-time. For the two flat CRS approaches, the total number of floating-point operations is exactly twice the total number of matrix non-zeros. For the dense-block multiply, it is 2(P+1)2nz (where nz is the number of non-zeros arising from the spatial discretization) and for the polynomial multiply algorithm it is (5 nc+P+1)nz where nc is the number of non-zero triple-product values.