Data as of 17 April 2024

Updated on 17 April 2024

ISBN 9783843951685

€84.00 incl. VAT, plus shipping


978-3-8439-5168-5, series: Informatik (Computer Science)

Philipp Johannes Samfaß
Reactive Load Balancing and Resilience Techniques in Simulation Applications on Supercomputers

237 pages, dissertation, Technische Universität München (2022), hardcover, B5

Abstract

Simulation applications running on supercomputers enable important scientific breakthroughs. Achieving optimal resource utilization and minimal power consumption on such systems requires effective load balancing methods. However, with growing performance variability, load balancing becomes increasingly challenging as execution times for work can no longer be predicted. Further, modern numerical algorithms are highly dynamic with respect to their computational work. Besides, the sheer scale of supercomputers makes them vulnerable to errors, which can result in process failures and silent data corruptions. Simulation applications require new techniques that render them more resilient against the increasing unpredictability and unreliability of modern hardware and software.

In this thesis, I design, implement and evaluate such techniques. They are reactive in the sense that, in contrast to many predictive state-of-the-art load balancing or fault resilience approaches, they do not only predict the future behavior of hardware and software but also detect unexpected events (e.g., imbalances or errors) at runtime and react to them. All methods employ migration or even replication of tasks, together with sharing of task outcomes between processes and nodes, to achieve reactive resilience.

Their benefits are shown for two task-based parallel simulation applications that solve systems of hyperbolic partial differential equations on dynamically adaptive meshes. I demonstrate that the reactive methods can tackle the increasing variability of execution times on the hardware level and that they can balance unpredictable workload imbalances in modern numerics on the software level, improving time-to-solution by up to a factor of 3.3. Findings in the context of replication-based fault resilience indicate that reactive resilience against process failures and silent data corruptions can be achieved without paying the full performance price of replicated computations.