Fault tolerance for a scientific workflow system in a Cloud computing environment

Pages 705-714 | Received 31 Mar 2019, Accepted 19 Jul 2019, Published online: 30 Jul 2019

References

  • Taylor IJ, Deelman E, et al. Workflows for e-Science. In: Taylor IJ, Deelman E, Gannon DB, Shields M, editors. Scientific workflows for grids. London: Springer-Verlag; 2007. p. 526.
  • Katz DS, Anagnostou N, Berriman GB, et al. Astronomical image mosaicking on a grid: initial experiences. In: Di Martino B, Dongarra J, Hoisie A, Yang LT, Zima H, editors. Engineering the Grid: Status and Perspective. American Scientific Publishers; 2006.
  • Maechling P, Deelman E, et al. SCEC CyberShake workflows: automating probabilistic seismic hazard analysis calculations. In: Workflows for e-Science: scientific workflows for grids. London, UK: Springer; 2007.
  • Deelman E, Ellisman MH. Enabling parallel scientific applications with workflow tools. Presented at: Challenges of Large Applications in Distributed Environments, 2006 IEEE; 2006.
  • Juve G, Chervenak A, Deelman E, et al. Characterizing and profiling scientific workflows. Future Gener Comput Syst. 2013;29(3):682–692.
  • Singh G, Su M, Vahi K, et al. Workflow task clustering for best effort systems with Pegasus. Proceedings of the 15th ACM Mardi Gras Conference; 2008. p. 9.
  • Chen W, Deelman E. Fault tolerant clustering in scientific workflows. Proceedings of the IEEE 8th World Congr. Services; 2012. pp. 9–16.
  • Ferreira da Silva R, Glatard T, Desprez F. On-line, non-clairvoyant optimization of workflow activity granularity on grids. Proc Euro-Par Parallel Process. 2013;8097:255–266.
  • Zhang Y, Squillante MS. Performance implications of failures in large-scale cluster scheduling. Proceedings of the 10th Workshop Job Scheduling Strategies Parallel Process; Jun. 2004. pp. 233–252.
  • Schroeder B, Gibson GA. A large-scale study of failures in high-performance computing systems. International Conference on Dependable Systems and Networks; 2006. pp. 249–258.
  • Deelman E, Blythe J, Gil Y, et al. Pegasus: Mapping scientific workflows onto the grid. Proceedings of the 2nd Eur. AcrossGrid Conference; 2004. pp. 11–20.
  • Duan R, Prodan R, Fahringer T. Run-time optimisation of grid workflow applications. Proceedings of the 7th IEEE/ACM International Conference on Grid Computing; 2006. pp. 33–40.
  • Ferreira da Silva R, Glatard T. A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task substeps, and workflow executions. Proceedings of the 18th International Conference on Parallel Processing Workshops; 2013. pp. 79–88.
  • Bresnahan J, Freeman T, La Bissoniere D, et al. Managing appliance launches in infrastructure clouds. Proceedings of the Teragrid Conference; 2011. p. 12.
  • Deelman E, Singh G, Livny M, et al. The cost of doing science on the cloud: The montage example. Proceedings of the ACM/IEEE Conference Supercomput; 2008. p. 50.
  • Berriman GB, Juve G, Deelman E, et al. The application of cloud computing to astronomy: A study of cost and performance. Proceedings of the Workshop e-Science Challenges Astron. Astrophys; 2010. pp. 1–7.
  • Vinay K, Dilip Kumar SM, Raghavendra S, et al. Multimed Tools Appl. 2018;77:10171. https://doi.org/10.1007/s11042-017-5304-7.
  • Maheshwari K, Espinosa A, Wilde M, et al. Job and data clustering for aggregate use of multiple production cyberinfrastructures. Proceedings of the 5th International Workshop Data-Intensive Distributed Computing; 2012. pp. 3–12.
  • Ferreira da Silva R, Glatard T, Desprez F. Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions. Concurrency Comput, Practice Exp. 2014;26(14):2347–2366.
  • Chen W, Deelman E. Integration of workflow partitioning and resource provisioning. 12th IEEE/ACM International Symposium on Cluster Cloud and Grid Computing (CCGrid) 2012; May 2012. pp. 764–768.
  • Chen W, da Silva RF, Deelman E, et al. Dynamic and fault-tolerant clustering for scientific workflows. IEEE Trans Cloud Comput. 2016;4(1):49–62.
  • Zhang Y, Mandal A, Koelbel C, et al. Combined fault tolerance and scheduling techniques for workflow applications on computational grids. Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing Grid; 2009. pp. 244–251.
  • Montagnat J, Glatard T, Reimert D, et al. Workflow-based comparison of two distributed computing infrastructures. Proceedings of the 5th Workshop Workflows Support Large-Scale Sci.; 2010. pp. 1–10.
  • Kandaswamy G, Mandal A, Reed D. Fault tolerance and recovery of scientific workflows on computational grids. Proceedings of the 8th IEEE International Symposium on Cluster Computing Grid; 2008. pp. 777–782.
  • Plankensteiner K, Prodan R, Fahringer T. A new fault tolerance heuristic for scientific workflows in highly distributed environments based on resubmission impact. Proceedings of the 5th IEEE International Conference e-Sci.; 2009. pp. 313–320.
  • Juve G, Deelman E, Vahi K, et al. Experiences with resource provisioning for scientific workflows using Corral. Sci Program. April 2010;18(2):77–92.
  • Amoon M, El-Bahnasawy N, Sadi S, et al. On the design of reactive approach with flexible checkpoint interval to tolerate faults in cloud computing systems. J Ambient Intell Human Comput. 2018. doi:10.1007/s12652-018-1139-y.
  • Ying C, Yu J, He J. Towards fault tolerance optimization based on checkpoints of in-memory framework spark. J Ambient Intell Human Comput. 2018. doi:10.1007/s12652-018-1018-6.
  • Zhao W, Melliar-Smith PM, Moser LE. Fault tolerance middleware for cloud computing. Proceedings 2010 IEEE 3rd International Conference on Cloud Computing CLOUD; 2010. pp. 67–74. https://doi.org/10.1109/CLOUD.2010.26.
  • Jhawar R, Piuri V, Santambrogio M. A comprehensive conceptual system-level approach to fault tolerance in cloud computing. SysCon 2012: 2012 IEEE International Systems Conference, Proceedings; 2012. pp. 601–605. https://doi.org/10.1109/SysCon.2012.6189503.
  • Poola D, Ramamohanarao K, Buyya R. Fault-tolerant workflow scheduling using spot instances on clouds. Procedia Comput Sci. 2014;29:523–533. https://doi.org/10.1016/j.procs.2014.05.047.
  • Padmakumari P, Umamakeswari A. J Ambient Intell Human Comput. 2019. doi:10.1007/s12652-019-01174-9.
  • Deelman E, Vahi K, Juve G, et al. Pegasus, a workflow management system for science automation. Future Gener Comput Syst. 2015;46:17–35. doi:10.1016/j.future.2015.10.008.
  • Samak T, Gunter D, Goode M, et al. Failure prediction and localization in large scientific workflows. Proceedings of the 6th Workshop Workflows Supporting Large-Scale Sci.; Nov. 2011. pp. 107–116.
  • Plankensteiner K, Prodan R, Fahringer T, et al. Fault detection, prevention and recovery in current grid workflow systems. In: Grid and Services Evolution. New York, NY: Springer; 2009. p. 1–13.
  • Muthuvelu N, Liu J, Soe NL, et al. A dynamic job grouping-based scheduling for deploying applications with fine grained tasks on global grids. Proceedings of the Australasian Workshop Grid Comput. e-res; 2005. pp. 41–48.
  • Muthuvelu N, Chai I, Eswaran C. An adaptive and parameterized job grouping algorithm for scheduling grid jobs. Proceedings of the 10th International Conference on Advanced Communication Technology; 2008. pp. 975–980.
  • Muthuvelu N, Chai I, Chikkannan E, et al. On-line task granularity adaptation for dynamic grid applications. Proceedings of the 10th International Conference on Algorithms Archit. Parallel Process.; 2010. pp. 266–277.
  • Ng WK, Ang T, Ling T, et al. Scheduling framework for bandwidth-aware job grouping-based scheduling in grid computing. Malaysian J Comput Sci. 2006;19(2):117–126.
  • Ang T, Ng W, Ling T, et al. A bandwidth-aware job grouping-based scheduling on grid environment. Inf Technol J. 2009;8:372–377.
  • Liu Q, Liao Y. Grouping-based fine-grained job scheduling in grid computing. Proceedings of the 1st Int. Workshop Educ. Technol. Comput. Sci.; Mar. 2009. pp. 556–559.
  • Stergiou C, Psannis KE, Kim B-G, et al. Secure integration of IoT and cloud computing. Future Gener Comput Syst. 2018;78:964–975. doi:10.1016/j.future.2016.11.031.
  • Dharwadkar NV, Poojara SR, Kadam PM. Fault tolerant and optimal task clustering for scientific workflow in Cloud. Int J Cloud Appl Comput. 2018;8(3):1–19. https://doi.org/10.4018/ijcac.2018070101.
  • Shojafar M, Javanmardi S, Abolfazli S, Cordeschi N. FUGE: A joint meta-heuristic approach to cloud job scheduling algorithm using fuzzy theory and a genetic method. Cluster Comput. Jun. 2015;18(2):829–844.
  • Topcuoglu H, Hariri S, Wu M-Y. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans Parallel Distrib Syst. Mar. 2002;13(3):260–274.
  • Blythe J, Jain S, Deelman E, et al. Task scheduling strategies for workflow-based applications in grids. Proceedings of the 5th IEEE International Symposium on Cluster Computing Grid; 2005. pp. 759–767.
  • Kalayci S, Dasgupta G, Fong L, et al. Distributed and adaptive execution of condor DAGMan workflows. Proceedings of the 22nd International Conference on Software Engineering and Knowledge Engineering; 2010. pp. 587–590.
  • Fahringer T, Prodan R, Duan R, et al. ASKALON: A development and grid computing environment for scientific workflows. Proceedings of the Workflows e-Sci.; 2007. pp. 450–471.
  • Oinn T, Addis M, Ferris J, et al. Taverna: A tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004;20(17):3045–3054.
  • Topcuoglu H, Hariri S, Wu M-Y. Task scheduling algorithms for heterogeneous processors. Proceedings of the 8th Heterogeneous Computing Workshop; 1999.
  • Amazon EC2. 2019. http://aws.amazon.com/ec2.
  • Rodriguez MA, Buyya R. Deadline based resource provisioning and scheduling algorithm for scientific workflows on Clouds. IEEE Trans Cloud Comput. 2014;2(2):222–235.
  • Li Z, Ge J, Hu HY, et al. Cost and energy aware scheduling algorithm for scientific workflows with deadline constraint in Clouds. IEEE Trans Serv Comput. 2018;11(4):713–726.
  • Li Z, Ge J, Yang H, et al. A security and cost aware scheduling algorithm for heterogeneous tasks of scientific workflow in Clouds. Future Gener Comput Syst. 2016;65:140–152.
  • Yao G, Ding Y, Hao K. Using imbalance characteristic for fault-tolerant workflow scheduling in Cloud systems. IEEE Trans Parallel Distrib Syst. 2017;28(12):3671–3683.
  • Manimaran G, Murthy CSR. A fault-tolerant dynamic scheduling algorithm for multiprocessor real-time systems and its analysis. IEEE Trans Parallel Distrib Syst. 1998;9(11):1137–1152.
  • Zhu X, Wang J, Guo H, et al. Fault-Tolerant scheduling for real-time scientific workflows with elastic resource provisioning in virtualized Clouds. IEEE Trans Parallel Distrib Syst. 2016;27(12):3501–3517.
  • Bharathi S, Chervenak A, Deelman E, et al. Characterization of scientific workflows. 2008 Third Workshop on Workflows in Support of Large-Scale Science; 2008. pp. 1–10.
  • Chen W, Deelman E. Workflowsim: A toolkit for simulating scientific workflows in distributed environments. IEEE 8th International Conference on Escience (e-science), 2012, IEEE; 2012. pp. 1–8.
  • Calheiros RN, Ranjan R, Beloglazov A, et al. Cloudsim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw: Practice Experience. 2011;41(1):23–50.
  • Ferreira da Silva R, Chen W, Juve G, et al. Community resources for enabling and evaluating research on scientific workflows. Proceedings of the 10th IEEE International Conference on e-Science; 2014. pp. 177–184.
  • Workflow Archive. [Online]. Available: http://workflowarchive.org; 2019.
  • Baccarelli E, Naranjo PGV, Scarpiniti M, et al. Fog of everything: energy-efficient networked computing architectures research challenges and a case study. IEEE Access. 2017;5:9882–9910.
  • Baccarelli E, Naranjo PGV, Shojafar M, et al. Q*: energy and delay-efficient dynamic queue management in TCP/IP virtualized data centers. Comput Commun. Apr. 2017;102:89–106.