Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud

Pages 363-378 | Received 21 Oct 2012, Accepted 04 May 2013, Published online: 22 Jan 2014

REFERENCES

  • Amazon. [Online]. Available at: http://aws.amazon.com/ec2/.
  • Baremetalcloud. [Online]. Available at: http://baremetalcloud.com/index.php/en/.
  • B.P. Rimal, E. Choi, and I. Lumb, A taxonomy and survey of cloud computing systems, in NCM ’09: Proceedings of the 2009 Fifth International Joint Conference on INC, IMS and IDC, Washington, DC, USA, IEEE Computer Society, 2009, pp. 44–51.
  • Nicholas Carr. [Online]. Available at: http://www.roughtype.com/?p=279.
  • CFDR. Available at: http://cfdr.usenix.org (2012).
  • B. Schroeder and G.A. Gibson, A Large-Scale Study of Failures in High Performance Computing Systems, IEEE Transactions on Dependable and Secure Computing 7(4) (2010), pp. 337–351.
  • Al Geist and Christian Engelmann, Development of naturally fault tolerant algorithms for computing on 100,000 processors, J. Parallel Distributed Comput. (2002). Available at: www.csm.ornl.gov/~geist.
  • R. Riesen, K. Ferreira, and J. Stearley, See Applications Run and Throughput Jump: The Case for Redundant Computing in HPC, in Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), DSNW ’10, Washington, DC, USA, IEEE Computer Society, 2010, pp. 29–34.
  • Huajun Hu, Suchang Guo, and Bo Yang, Cost-oriented task allocation and hardware redundancy policies in heterogeneous distributed computing systems considering software reliability, Comput. Ind. Eng. 56(4) (2009), pp. 1687–1696.
  • C.-C. Hsieh, Optimal task allocation and hardware redundancy policies in distributed computing systems, Eur. J. Operational Res. 147(2) (2003), pp. 430–447.
  • F. Cappello, Fault Tolerance in Petascale/exascale systems: current knowledge, challenges and research opportunities, Int. J. High Perform. Comput. Appl. 23(3) (2009), pp. 212–226.
  • The MPI Forum, The MPI message-passing interface standard, 1995. Available at: http://www.mcs.anl.gov/mpi/standard.html.
  • C. Evangelinos and C.N. Hill, Cloud computing for parallel scientific HPC Applications: feasibility of running coupled Atmosphere-Ocean climate models on Amazon's EC2, in Cloud Computing and Its Applications 2008 (CCA-08), October 2008, Chicago, IL, ACM.
  • I.P. Egwutuoha, S. Chen, D. Levy, and B. Selic, A fault tolerance framework for high performance computing in cloud, in Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium, Ottawa, Canada, IEEE, 2012, pp. 709–710.
  • A. Kumar, L. Shang, L. Peh, and N. Jha, System-level dynamic thermal management for high-performance microprocessors, IEEE Trans. Computer-Aided Design Integr. Circuits Syst. 27(1) (2008), pp. 96–108.
  • J. Stearley and A. Oliner, What Supercomputers Say: A Study of Five System Logs, in DSN 07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Washington, DC, USA, 2007, pp. 575–584.
  • Arun Babu Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott, Proactive Fault Tolerance for HPC with Xen Virtualization, in Proceedings of the 21st Annual International Conference on Supercomputing, Seattle, Washington, 2007, pp. 23–32.
  • Lm-sensors. [Online]. Available at: http://lm-sensors.org/wiki/Documentation.
  • E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, The Cost of Doing Science on the Cloud: The Montage Example, in SC 08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Piscataway, NJ, USA, 2008, pp. 1–12.
  • M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, A view of cloud computing, Commun. ACM 53(4) (2010), pp. 50–58.
  • Open-iscsi, 2013. Available at: http://www.open-iscsi.org.
  • Xen, Xen hypervisor. [Online]. Available at: http://www.xen.org/products/xenhyp.html.
  • A. Petitet, R.C. Whaley, J. Dongarra, and A. Cleary, HPL. [Online]. Available at: http://www.netlib.org/benchmark/hpl/ (2008).
  • E.N.M. Elnozahy, L. Alvisi, Y.M. Wang, and D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Comput. Surv. (CSUR) 34(3) (2002), pp. 375–408.
  • Checkpointing.org. [Online]. Available at: http://checkpointing.org/.
  • I.P. Egwutuoha, D. Schragl, and R. Calvo, A Brief Review of Cloud Computing, Challenges and Potential Solutions, J. Parallel Cloud Comput. 2(1) (2013).
  • H.J. Berendsen, D. van der Spoel, and R. van Drunen, GROMACS: a message-passing parallel molecular dynamics implementation, Comput. Phys. Commun. 91(1) (1995), pp. 43–56.
  • I.P. Egwutuoha, S. Chen, D. Levy, B. Selic, and R. Calvo, A Proactive Fault Tolerance Approach to High Performance Computing (HPC) in the Cloud, in 2012 Second International Conference on Cloud and Green Computing (CGC), Xiangtan, Hunan, China, IEEE, 2012, pp. 268–273.
  • L. Youseff, M. Butrico, and D.D. Silva, Toward a unified ontology of cloud computing, in Proc. of the Grid Computing Environments Workshop (GCE08), Nov 2008, pp. 1–10.
  • I.P. Egwutuoha, D. Levy, B. Selic, and S. Chen, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, J. Supercomput. (2013). https://doi.org/10.1007/s11227-013-0884-0.
  • J.T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comput. Syst. 22 (2006), pp. 303–312.
  • C. Clark, K. Fraser, S. Hand, et al., Live migration of virtual machines, in Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation-Volume 2, USENIX Association, 2005, pp. 273–286.
  • K. Li, J.F. Naughton, and J.S. Plank, Low-latency, concurrent checkpointing for parallel programs, IEEE Transactions on Parallel and Distributed Systems 5(8) (1994), pp. 874–879.
