IETE Journal of Research Volume 65, 2019 - Issue 5

Accelerating Dense Matrix Computations with Effective Workload Partitioning on Heterogeneous Architectures

Mohsin Khan, Department of Computer Science and Engineering, HKBK College of Engineering, Visvesvaraya Technological University, Bangalore, Karnataka 560045, India (corresponding author)

ORCID: https://orcid.org/0000-0002-4966-070X

Waseem Ahmed, Department of Computer Science and Engineering, HKBK College of Engineering, Visvesvaraya Technological University, Bangalore, Karnataka 560045, India

Touseef M. Golandaz, Department of Computer Science and Engineering, HKBK College of Engineering, Visvesvaraya Technological University, Bangalore, Karnataka 560045, India

Pages 613-626 | Published online: 11 Jun 2018

https://doi.org/10.1080/03772063.2018.1436476

References

J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, “A survey of general-purpose computation on graphics hardware,” in Computer Graphics Forum, Vol. 26. Wiley Online Library, 2007, pp. 80–113.
“General-purpose computation on graphics hardware.” Available: http://gpgpu.org/
“CUDA toolkit documentation.” Available: http://docs.nvidia.com/cuda
J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming standard for heterogeneous computing systems,” Comput. Sci. Eng., Vol. 12, pp. 66–73, 2010. doi: 10.1109/MCSE.2010.69
S. Mittal and J. S. Vetter, “A survey of CPU-GPU heterogeneous computing techniques,” ACM Comput. Surv., Vol. 47, p. 69, 2015. doi: 10.1145/2788396
“Using the cublasXt API.” Available: http://docs.nvidia.com/cuda/cublas/index.html#using-the-cublasXt-api
K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming. New York, NY: McGraw-Hill, Inc., 1998.
M. Garcia, J. Corbalan, and J. Labarta, “LeWI: A runtime balancing algorithm for nested parallelism,” in International Conference on Parallel Processing, 2009. ICPP'09, IEEE, 2009, pp. 526–33.
M. Garcia, J. Labarta, and J. Corbalan, “Hints to improve automatic load balancing with LeWI for hybrid applications,” J. Parallel Distrib. Comput., Vol. 74, pp. 2781–94, 2014. doi: 10.1016/j.jpdc.2014.05.004
J. M. Perez, R. M. Badia, and J. Labarta, “A dependency-aware task-based programming environment for multi-core architectures,” in 2008 IEEE International Conference on Cluster Computing, IEEE, 2008, pp. 142–51.
P. Sao, R. Vuduc, and X. S. Li, “A distributed CPU-GPU sparse direct solver,” in Euro-Par 2014 Parallel Processing, Springer, 2014, pp. 487–98.
P. Valero-Lara and F. L. Pelayo, “Full-overlapped concurrent kernels,” in ARCS 2015 - The 28th International Conference on Architecture of Computing Systems. Proceedings, VDE, 2015, pp. 1–8.
J.-F. Dollinger and V. Loechner, “CPU+GPU load balance guided by execution time prediction,” in Fifth International Workshop on Polyhedral Compilation Techniques (IMPACT 2015), Amsterdam, Netherlands, 2015. Available: http://impact.gforge.inria.fr/impact2015/
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, “A practical automatic polyhedral parallelizer and locality optimizer,” ACM SIGPLAN Notices, 2008, pp. 101–13.
S. Verdoolaege, J. Carlos Juega, A. Cohen, J. Ignacio Gomez, C. Tenllado, and F. Catthoor, “Polyhedral parallel code generation for CUDA,” ACM Trans. Archit. Code Optim., Vol. 9, p. 54, 2013. doi: 10.1145/2400682.2400713
B. Pradelle, P. Clauss, and V. Loechner, “Adaptive runtime selection of parallel schedules in the polytope model,” in Proceedings of the 19th High Performance Computing Symposia, Society for Computer Simulation International, 2011, pp. 81–8.
J.-F. Dollinger and V. Loechner, “Adaptive runtime selection for GPU,” in 2013 42nd International Conference on Parallel Processing (ICPP), IEEE, 2013, pp. 70–9.
Z. Zhong, V. Rychkov, and A. Lastovetsky, “Data partitioning on heterogeneous multicore and multi-GPU systems using functional performance models of data-parallel applications,” in 2012 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 2012, pp. 191–9.
D. Shulga, A. Kapustin, A. Kozlov, A. Kozyrev, and M. Rovnyagin, “The scheduling based on machine learning for heterogeneous CPU/GPU systems,” in NW Russia Young Researchers in Electrical and Electronic Engineering Conference (EIConRusNW), 2016 IEEE, IEEE, 2016, pp. 345–8.
C.-K. Luk, S. Hong, and H. Kim, “Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping,” in 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009. MICRO-42, IEEE, 2009, pp. 45–55.
A. Nere, A. Hashmi, and M. Lipasti, “Profiling heterogeneous multi-GPU systems to accelerate cortically inspired learning algorithms,” in Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, IEEE, 2011, pp. 906–20.
C.-Y. Shei, P. Ratnalikar, and A. Chauhan, “Automating GPU computing in MATLAB,” in Proceedings of the International Conference on Supercomputing, ACM, 2011, pp. 245–54.
S. Tomov, J. Dongarra, V. Volkov, and J. Demmel, “MAGMA library,” Univ. of Tennessee and Univ. of California, Knoxville, TN, and Berkeley, CA, 2009. Available: http://icl.cs.utk.edu/magma/software/
C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU: A unified platform for task scheduling on heterogeneous multicore architectures,” Concurr. Comput.: Pract. Exp., Vol. 23, pp. 187–98, 2011. doi: 10.1002/cpe.1631
“MAGMA-2.2.0 matrix algebra for GPU and multicore architectures.” Available: http://icl.cs.utk.edu/projectsfiles/magma/doxygen/routines.html#blas
E. Sun, D. Schaa, R. Bagley, N. Rubin, and D. Kaeli, “Enabling task-level scheduling on heterogeneous platforms,” in Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, ACM, 2012, pp. 84–93.
M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng, “Merge: A programming model for heterogeneous multi-core systems,” in ACM SIGOPS Operating Systems Review, Vol. 42, ACM, 2008, pp. 287–96.
“OpenBLAS library.” Available: http://www.openblas.net/
“OpenMP application programming interface examples.” Available: http://www.openmp.org/wp-content/uploads/openmp-examples-4.5.0.pdf/
“ATLAS library.” Available: http://math-atlas.sourceforge.net/
“Intel Math Kernel Library.” Available: https://software.intel.com/en-us/mkl-reference-manual-for-c
“NVIDIA CUBLAS documentation.” Available: http://docs.nvidia.com/cuda/cublas/
“AMD Core Math Library.” Available: http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/
V. W. Lee et al., “Debunking the 100× GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU,” in ACM SIGARCH Computer Architecture News, Vol. 38, ACM, 2010, pp. 451–60.
C. Gregg and K. Hazelwood, “Where is the data? Why you cannot debate CPU vs. GPU performance without the answer,” in 2011 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, 2011, pp. 134–44.
