Abstract
Graphics processing units (GPUs) offer parallel computing power that would otherwise require a cluster of networked computers or a supercomputer. While writing kernel code is fairly straightforward, achieving efficiency and performance requires careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble: a single-CPU-core implementation and a parallel GPU implementation. Using the Compute Unified Device Architecture (CUDA), the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, allowing researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load on the device resources, and experimental results demonstrate that the efficiency of this algorithm is bounded by the available GPU resources. Our parallel implementation achieves a speedup of up to 15 times on a commodity GPU over our efficient single-core implementation for a system of 256k particles, with the speedup increasing with problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.
Acknowledgements
The authors thank the anonymous reviewers for their insightful comments and invaluable feedback. This work has been supported by Wayne State University's Research Enhancement Program (REP) and by National Science Foundation (NSF) grants CBET-0730768 and OCI-1148168. The authors also thank NVIDIA® for donating some of the graphics cards used in this study.
Notes
1. Threads in the same warp do not need to be synchronised.
2. Although hundreds of millions of simulation steps are required to obtain scientifically accurate results, one million steps is sufficient to demonstrate the relative speedup of the GPU code.