Accelerating Dense Matrix Computations with Effective Workload Partitioning on Heterogeneous Architectures
