Abstract
Accurate, reproducible and comparable measurement of the overheads, communication times and progression behaviour of blocking and nonblocking collective operations is a complicated task. Although well-known benchmarks implement several measurement schemes for blocking collective operations, many of these schemes introduce systematic errors into their measurements. We characterise these errors and select a window-based approach as the most accurate method. However, this approach complicates measurements significantly and introduces clock synchronisation as a new source of errors. We analyse approaches to avoid or correct those errors and develop a scalable synchronisation scheme for conducting benchmarks on massively parallel systems. Compared with the window-based scheme implemented in the SKaMPI benchmarks, our scheme reduces the synchronisation overhead by a factor of 16 on 128 processes. We also describe two different measurement schemes for the overhead and asynchronous progress of nonblocking collective communications, and present an implementation and results for both.
Acknowledgements
This research was funded by a gift from the Silicon Valley Community Foundation, on behalf of the Cisco Collaborative Research Initiative of Cisco Systems.
Notes
3 We used the 50,000 RTTs gathered as described in Section 2.