124
Views
2
CrossRef citations to date
0
Altmetric
Original Articles

Concentric layout, a new scientific data layout for matrix data-set in Hadoop file system

, &
Pages 407-433 | Received 15 Mar 2012, Accepted 04 Aug 2012, Published online: 13 Sep 2012
 

Abstract

Due to the explosive growth in the size of scientific data-sets, data-intensive computing and analysing are an emerging trend in computational science. In these applications, data pre-processing is widely adopted because it can optimise the data layout or format beforehand to facilitate the future data access. On the other hand, current research shows an increasing popularity of MapReduce framework for large-scale data processing. However, the data access patterns which are generally applied to scientific data-set are not supported by current MapReduce framework directly. This gap motivates us to provide support for these scientific data access patterns in MapReduce framework. In our work, we study the data access patterns in matrix files and propose a new concentric data layout solution to facilitate matrix data access and analysis in MapReduce framework. Concentric data layout is a data layout which maintains the dimensional property in chunk level. Contrary to the continuous data layout adopted in the current Hadoop framework, concentric data layout stores the data from the same sub-matrix into one chunk. This layout can guarantee that the average performance of data access is optimal regardless of the various access patterns. The concentric data layout requires reorganising the data before it is being analysed or processed. Our experiments are launched on a real-world halo-finding application; the results indicate that the concentric data layout improves the overall performance by up to 38%.

Acknowledgements

This work is supported, in part, by the US National Science Foundation under grants CCF-0811413, CNS 1115665, National Science Foundation Early Career Award 0953946 and ‘One-hundred Talents’ Programme of Chinese Academy of Sciences.

Notes

Additional information

Notes on contributors

Lu Cheng

1 1. [email protected].

Lizhe Wang

2 2. [email protected].

Reprints and Corporate Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

To request a reprint or corporate permissions for this article, please click on the relevant link below:

Academic Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

Obtain permissions instantly via Rightslink by clicking on the button below:

If you are unable to obtain permissions via Rightslink, please complete and submit this Permissions form. For more information, please visit our Permissions help page.