PACO 2019: 3rd Workshop on Power-Aware Computing
from Tuesday 5 November 2019 (08:00) to Wednesday 6 November 2019 (13:00)
Tuesday 5 November 2019
08:00
Registration
08:00 - 08:30
Room: Main/groundfloor-none - Magistrale
08:30
Opening - Peter Benner (Max Planck Institute for Dynamics of Complex Technical Systems)
08:30 - 08:45
Room: Main/groundfloor-V0.05/2+3 - Prigogine
08:45
Parallel solution of large sparse systems by direct and hybrid methods - Iain S. Duff (STFC RAL, UK and Cerfacs, France)
08:45 - 09:30
Room: Main/groundfloor-V0.05/2+3 - Prigogine
We discuss a range of algorithms and codes for the solution of sparse systems that we have developed in the EU Horizon 2020 project NLAFET, which finished on 30 April 2019. We used two approaches to obtain good single-node performance. For symmetric systems we used task-based algorithms based on an assembly-tree representation of the factorization, with runtime systems scheduling the computation on both multicore CPU nodes and GPU nodes [4]. The second approach was to design a new parallel threshold Markowitz algorithm [2] based on Luby's method [5] for obtaining a maximal independent set in an undirected graph; this was a significant extension, since our graph model is a directed graph. We then extended the scope of these two approaches to exploit distributed-memory parallelism. In the symmetric case, we base our work on the block Cimmino algorithm [3], using the ABCD software package coded by Zenadi in Toulouse [6]; the kernel of this algorithm is the direct factorization of a symmetric indefinite submatrix, for which we use the above symmetric code. To extend the unsymmetric code to distributed memory, we use the Zoltan code from Sandia [1] to partition the matrix into singly bordered block diagonal form and then apply the above unsymmetric code to the blocks on the diagonal. We show the performance of our codes on industrial-strength large test problems on a heterogeneous platform. Our codes, which are available on GitHub, are shown to outperform other state-of-the-art codes.
[1] E. Boman, K. Devine, L. A. Fisk, R. Heaphy, B. Hendrickson, C. Vaughan, U. Catalyurek, D. Bozdag, W. Mitchell, and J. Teresco, Zoltan 3.0: Parallel Partitioning, Load-balancing, and Data Management Services; User's Guide, Sandia National Laboratories, Albuquerque, NM, Tech. Report SAND2007-4748W, 2007. http://www.cs.sandia.gov/Zoltan/ug_html/ug.html.
[2] T. A. Davis, I. S. Duff, and S. Nakov, Design and implementation of a parallel Markowitz threshold algorithm, Technical Report RAL-TR-2019-003, Rutherford Appleton Laboratory, Oxfordshire, England, 2019. NLAFET Working Note 22. Submitted to SIMAX.
[3] I. S. Duff, R. Guivarch, D. Ruiz, and M. Zenadi, The augmented block Cimmino distributed method, SIAM J. Scientific Computing, 37 (2015), pp. A1248–A1269.
[4] I. S. Duff, J. Hogg, and F. Lopez, A new sparse symmetric indefinite solver using a posteriori threshold pivoting, Tech. Rep. RAL-TR-2018-012, Rutherford Appleton Laboratory, Oxfordshire, England, 2018. NLAFET Working Note 21. Submitted to SISC.
[5] M. Luby, A simple parallel algorithm for the maximal independent set problem, SIAM J. Computing, 15 (1986), pp. 1036–1053.
[6] M. Zenadi, The solution of large sparse linear systems on parallel computers using a hybrid implementation of the block Cimmino method, PhD thesis (Thèse de Doctorat), Institut National Polytechnique de Toulouse, Toulouse, France, December 2013.
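As background for the abstract above, here is a minimal sketch of Luby's randomized algorithm [5] for finding a maximal independent set in an undirected graph, the primitive that the parallel Markowitz algorithm [2] extends to directed graphs. The dense adjacency-matrix representation and the name luby_mis are illustrative choices, not the NLAFET code:

    import numpy as np

    def luby_mis(adj):
        """Luby's algorithm: adj is a symmetric boolean adjacency matrix
        with a zero diagonal; returns indices of a maximal independent set."""
        n = adj.shape[0]
        active = np.ones(n, dtype=bool)   # vertices not yet decided
        in_set = np.zeros(n, dtype=bool)  # vertices selected for the set
        rng = np.random.default_rng(0)
        while active.any():
            # Every active vertex draws a random priority; inactive ones sit out.
            pri = np.where(active, rng.random(n), -np.inf)
            # A vertex wins if its priority beats those of all its neighbours.
            neigh_best = np.where(adj, pri[None, :], -np.inf).max(axis=1)
            winners = active & (pri > neigh_best)
            in_set |= winners
            # Winners and their neighbours leave the active set; repeat.
            active &= ~(winners | adj[:, winners].any(axis=1))
        return np.flatnonzero(in_set)

Each round decides many vertices independently and in expectation only O(log n) rounds are needed, which is what makes the method attractive as a parallel building block.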
09:30
Coffee break
09:30 - 09:50
Room: Main/groundfloor-none - Magistrale
09:50
Session I
09:50 - 11:50
Room: Main/groundfloor-V0.05/2+3 - Prigogine
Contributions
09:50
Exploiting Nested Task-Parallelism in the LU Factorization of Hierarchical Matrices - Rocío Carratalá-Sáez (Universitat Jaume I)
10:20
Unleashing the sptrsv method in FPGAs - Federico Favaro (Facultad de Ingeniería, Universidad de la República)
10:40
Towards an efficient many-core implementation of the IRKA - Matías Valdés (Universidad de la República)
11:00
Automatic selection of GPU sparse triangular solvers based on energy consumption - Raúl Marichal (Universidad de la República)
11:20
Ginkgo's load-balancing COO SpMV on NVIDIA and AMD GPU architectures - Yu-Hsiang Tsai (Karlsruhe Institute of Technology) and Terry Cojean (Karlsruhe Institute of Technology)
11:50
Lunch
11:50 - 13:45
13:45
Iterative Refinement in Three Precisions - Erin Carson (Charles University)
13:45 - 14:30
Room: Main/groundfloor-V0.05/2+3 - Prigogine
Support for floating point arithmetic in multiple precisions is becoming increasingly common in emerging architectures. For example, half precision is now available on the NVIDIA V100 GPUs, on which it runs twice as fast as single precision with a proportional savings in energy consumption. Further, the NVIDIA V100's half-precision tensor cores can provide up to a 16x speedup over double precision. We present a general algorithm for solving an n-by-n nonsingular linear system Ax = b based on iterative refinement in three precisions. The working precision is combined with possibly different precisions for solving for the correction term and for computing the residuals. Our rounding error analysis of the algorithm provides sufficient conditions for convergence and bounds for the attainable normwise forward error and normwise and componentwise backward errors, generalizing and unifying many existing rounding error analyses for iterative refinement. We show further that by solving the correction equations by GMRES preconditioned by the LU factors, the restriction on the condition number can be weakened to allow for the solution of systems which are extremely ill-conditioned with respect to the working precision. Compared with a standard Ax = b solver that uses LU factorization in single precision, these results suggest that on architectures for which half precision is efficiently implemented it will be possible to solve certain linear systems Ax = b in less time and with greater accuracy. We present recent performance results on the latest GPU architectures which show that this approach can result in practical speedups, and also discuss recent work in extending this approach to iterative refinement for least squares problems.
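To make the three-precision setup concrete, the following sketch instantiates the algorithm with single-precision LU factors, a double working precision, and extended-precision residuals, standing in for the half/single/double combination discussed in the abstract (NumPy and LAPACK offer no half-precision LU). The name ir3 and the stopping tolerance are illustrative:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def ir3(A, b, max_iter=10):
        """Iterative refinement in three precisions: factorize in single,
        work in double, and compute residuals in extended precision."""
        # Factorize once in the low precision; all solves reuse these factors.
        lu, piv = lu_factor(A.astype(np.float32))
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iter):
            # Residual in the high precision, rounded back to working precision.
            r_hi = b.astype(np.longdouble) - A.astype(np.longdouble) @ x
            r = r_hi.astype(np.float64)
            # Correction solve in the low precision; GMRES preconditioned by
            # the LU factors could be used here instead, as the abstract notes.
            d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
            x += d
            if np.linalg.norm(d) <= 1e-14 * np.linalg.norm(x):  # illustrative tolerance
                break
        return x

The key point of the structure is that the expensive O(n^3) factorization happens entirely in the cheap precision, while the O(n^2) residual and update steps restore accuracy in the higher precisions.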
14:30
Coffee break
14:30 - 14:50
Room: Main/groundfloor-none - Magistrale
14:50
Session II
14:50 - 17:10
Room: Main/groundfloor-V0.05/2+3 - Prigogine
Contributions
14:50
Parallel multiprecision iterative Krylov subspace solver - Xenia Rosa Volk (Karlsruhe Institute of Technology)
15:20
Energy-Time Analysis of Heterogeneous Clusters for EEG Classification - Julio Ortega (University of Granada, Granada, Spain)
15:50
Convolutional Neural Nets for the Run Time and Energy Consumption Estimation of the Sparse Matrix–Vector Product - Manuel F. Dolz (Universitat Jaume I)
16:20
Revisiting the idea of multiprecision block-Jacobi preconditioning - where do we stand 2 years later? - Hartwig Anzt (Karlsruhe Institute of Technology)
16:50
In-application energy measurement on Megware SlideSX systems - Martin Köhler (Max Planck Institute for Dynamics of Complex Technical Systems)
17:10
Break
17:10 - 17:25
Room: Main/groundfloor-none - Magistrale
17:25
Massively Parallel & Low Precision Accelerator Hardware as Trends in HPC - How to use it for large scale simulations allowing high computational, numerical and energy efficiency with application to CFD - Stefan Turek (TU Dortmund)
17:25 - 18:10
Room: Main/groundfloor-V0.05/2+3 - Prigogine
The aim of this talk is to present and discuss how modern and future High Performance Computing (HPC) facilities, combining massively parallel hardware with millions of cores and very fast but low-precision accelerator hardware, can be exploited in numerical simulations to achieve very high computational, numerical, and hence energy efficiency. As prototypical extreme-scale PDE-based applications, we concentrate on nonstationary flow simulations with hundreds of millions or even billions of spatial unknowns in long-time computations with many thousands up to millions of time steps. For the huge computational resources expected in the coming exascale era, such spatially discretized problems, which are typically treated sequentially, one time step after the other, are still too small to adequately exploit the vast number of compute nodes and cores, so that further parallelism, for instance with respect to time, may become necessary. In this context, we discuss how "parallel-in-space simultaneous-in-time" Newton-Multigrid approaches can be designed to allow a much higher degree of parallelism. Moreover, to exploit current low-precision accelerator hardware (for instance, GPUs or TPUs), which mainly means working in single or even half precision, we discuss the concept of "prehandling" (in contrast to "preconditioning") of the corresponding ill-conditioned systems of equations, for instance those arising from Poisson-like problems. Here, the system is transformed into an equivalent linear system with similar sparsity but much lower condition numbers, so that the use of low-precision hardware may become feasible. For both aspects we provide preliminary numerical results as proof of concept and discuss the open problems and challenges, particularly for incompressible flow problems.
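The abstract does not spell out a specific prehandling transform, but the effect it targets can be illustrated with a classical example: in 1D, moving the Poisson stiffness matrix from the nodal to the hierarchical hat-function basis is an explicit, sparsity-friendly congruence transform that lowers the condition number from O(n^2) to O(n). The sketch below is that textbook illustration under our own assumptions, not the method of the talk:

    import numpy as np

    L = 7
    n = 2**L - 1                      # interior grid points on (0, 1)
    h = 1.0 / (n + 1)
    x = np.arange(1, n + 1) * h

    # 1D Poisson stiffness matrix in the nodal hat-function basis.
    A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

    # Node i first appears on level L - (trailing zero bits of i); the
    # hierarchical hat centred there has support radius 2**(-level).
    lev = np.array([L - ((i & -i).bit_length() - 1) for i in range(1, n + 1)])
    radius = 2.0 ** (-lev)

    # S maps hierarchical coefficients to nodal values: column j holds the
    # hierarchical hat of node j evaluated at all nodes.
    S = np.maximum(0.0, 1.0 - np.abs(x[:, None] - x[None, :]) / radius[None, :])

    B = S.T @ A @ S                   # stiffness in the hierarchical basis
    print(np.linalg.cond(A))          # grows like n**2: about 6.6e3 for n = 127
    print(np.linalg.cond(B))          # grows like n: about 64 for n = 127

In 1D the transformed matrix is in fact diagonal, so a further diagonal scaling makes it perfectly conditioned; in two and three dimensions the gain from such basis changes is smaller, but transforms of this flavour can bring the condition number into a range where single or half precision arithmetic becomes usable, which is the point the abstract makes.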
19:30
Conference dinner
19:30 - 22:00
Wednesday 6 November 2019
09:00
Parallel Algorithms for CP, Tucker, and Tensor Train Decompositions - Grey Ballard (Wake Forest University)
09:00 - 09:45
Room: Main/groundfloor-V0.05/2+3 - Prigogine
Multidimensional data, coming from scientific applications such as numerical simulation, can often overwhelm the memory or computational resources of a single workstation. In this talk, we will describe parallel algorithms and available software implementations for computing CP, Tucker, and Tensor Train decompositions of large tensors. The open-source software is designed for clusters of computers and has been benchmarked on various supercomputers. The algorithms are scalable, processing terabyte-sized tensors while maintaining high computational efficiency on hundreds to thousands of processing nodes. We will detail the data distribution and parallelization strategies for the key computational kernels within the algorithms, which include the matricized-tensor times Khatri-Rao product, computing (structured) Gram matrices, and tall-skinny QR decompositions.
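As an illustration of the first kernel named above, here is a minimal single-node sketch of the matricized-tensor times Khatri-Rao product (MTTKRP) for a 3-way tensor, together with the Gram-matrix solve that completes one CP-ALS factor update. The names and sizes are illustrative; the parallel implementations discussed in the talk distribute X and the factor matrices across nodes:

    import numpy as np

    def mttkrp_mode0(X, B, C):
        """Mode-0 MTTKRP: M[i, r] = sum over j, k of X[i, j, k] * B[j, r] * C[k, r]."""
        return np.einsum('ijk,jr,kr->ir', X, B, C)

    # One CP-ALS update of the first factor combines the MTTKRP with the
    # Hadamard product of the small R-by-R Gram matrices of the other factors.
    rng = np.random.default_rng(0)
    I, J, K, R = 30, 20, 10, 5
    X = rng.random((I, J, K))
    B, C = rng.random((J, R)), rng.random((K, R))
    G = (B.T @ B) * (C.T @ C)                       # Gram matrices, Hadamard product
    A = np.linalg.solve(G, mttkrp_mode0(X, B, C).T).T

Note that the MTTKRP touches every tensor entry while the Gram matrices involve only R-by-R data, which is why the data distribution for the MTTKRP dominates the communication cost of the parallel algorithms.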
09:45
Coffee break
09:45 - 10:15
Room: Main/groundfloor-none - Magistrale
10:15
Session III
10:15 - 12:15
Room: Main/groundfloor-V0.05/2+3 - Prigogine
Contributions
10:15
Energy Efficiency of Nonlinear Domain Decomposition Methods - Axel Klawonn (Universität zu Köln)
10:45
Domain decomposition methods in FreeFEM with ffddm - Pierre-Henri Tournier (Sorbonne Université, CNRS, Université de Paris, Inria, Laboratoire Jacques-Louis Lions, F-75005 Paris, France)
11:15
Evaluating asynchronous Schwarz solvers for Exascale - Pratik Nayak (Karlsruhe Institute of Technology)
11:45
S-Step Enlarged Conjugate Gradient Methods - Sophie Moufawad (American University of Beirut (AUB))