16–19 Feb 2025
Ringberg castle
Europe/Berlin timezone

Solving Dense Sylvester-like Matrix Equations - Recent Advances

17 Feb 2025, 11:00
30m
Ringberg castle

Ringberg castle

Schloss Ringberg Schlossstraße 20 83708 Kreuth Coordinates: 47° 40' 43'' N 11° 44' 56'' E

Speaker

Martin Köhler (Max Planck Institute for Dynamics of Complex Technical Systems)

Description

The solution of Sylvester-like matrix equations still constitutes a core task in systems and control theory, with their solution also being required in the extensive field of eigenvalue analysis. In addition to numerous iterative algorithms, Bartels and Stewart presented an algorithm for dense matrices in the 1970s. This algorithm has been improved over the last two decades by the introduction of techniques such as recursive blocking, level-3 block partitioning, and task-based scheduling.

In the 2010s, studies demonstrated that the utilisation of GPUs yielded only a marginal enhancement in execution speed. However, contemporary computing infrastructures, characterised by the transition from multi-core to many-core systems and the integration of accelerators, have emerged as a potential solution. These advancements offer a substantial boost in floating-point operations, thereby enhancing the overall performance of these systems. These include higher floating-point performance, increased memory bandwidth, and enhanced interconnection.
The landscape has undergone a fundamental shift in terms of the relative performance of CPUs and GPUs. In addition to these hardware advancements, the author proposes two distinct methodologies for enhancing the efficiency of the Bartels-Stewart algorithm. In the first instance, an approach is adopted that considers the differing styles of parallelism developed over the last decade. On the one hand, the available parallelism on the CPU side requires hundreds of (small) tasks, as in task-based scheduling algorithms, to ensure the Bartels-Stewart algorithm runs as fast as possible. On the other hand, accelerator devices such as GPUs require a critical problem size to become efficient. The necessity to address this discrepancy has led to the extension of existing algorithms to incorporate the ideas of recursive blocking, level-3 partitioning, and
task-based scheduling using the StarPU library as a hybrid scheduling
library. The utilisation of multiple levels of block partitioning enables the effective management of the algorithm's parallel execution, ensuring its efficient operation on both CPU and GPU platforms.
This approach allows us to simultaneously satisfy the block size requirements for both CPUs and GPUs. The StarPU library then schedules the individual tasks to the computing devices that are most likely to result in the minimum execution time. Secondly, the 'macroscopic' aspect of the solvers is optimised. Given that current CPUs only achieve optimal performance once a substantial amount of work is performed on top of the vector registers, techniques employed in the optimization of the GEMM operation are integrated into the level-2 BLAS inner solvers. These optimisations are expected to reduce the critical path in a parallel execution environment, thereby enhancing the overall performance.

Author

Martin Köhler (Max Planck Institute for Dynamics of Complex Technical Systems)

Presentation materials

There are no materials yet.