BiGmax Summer School 2021

Name: BiGmax Summer School 2021
Start: 2021-09-13T14:00:00+02:00
End: 2021-09-17T18:00:00+02:00
Location: Virtual event

13 Sept 2021, 14:00 → 17 Sept 2021, 18:00 Europe/Berlin


Christian Liebscher, Christoph Freysoldt, Ralph Ernstorfer

Description

The BiGmax Summer School 2021 "Harnessing big data in materials science from theory to experiment" will take place from September 13 to 17, 2021, as an online-only event.

Scope

Our abilities to produce, store, and process huge amounts of information have exploded in the past decades. In parallel, progress in advanced statistical analysis, machine learning, and artificial intelligence is revolutionizing our ways of thinking about data in almost every field. In particular, these new methods aim at discovering and extracting quantitative relations from data directly, without resorting to specific theoretical models or human insight. In materials science, however, novel data-centered approaches are still less established than the traditional theoretical framework, which aims at “explaining” experimental observations by a variety of models at different length and time scales and allows for quantitative predictions from these models, either directly or via computer simulations.

To meet the challenges of the ever-growing amount of data in materials science, and to use the opportunities that come with it, future materials research will need to integrate data-oriented approaches with state-of-the-art domain knowledge. Yet neither the current materials-science education nor the numerous available tutorials on data methods alone prepare the next generation of materials scientists to achieve this goal.

The aim of this school is to address recent advancements in structuring, analyzing, and harvesting big data in materials science. The school focuses on FAIR data representation of computational and experimental data, the development, implementation and application of machine-learning tools, and the deployment of novel mathematical approaches for data mining and diagnostics. An additional emphasis of the school will be laid on unified approaches in representing big data sets and machine-learning algorithms, spanning across the different disciplines from theory to experiment and within the diverse experimental and theoretical approaches.

The school focuses on combining lectures of renowned experts with hands-on tutorials predominantly targeted towards PhD students and early career researchers.

Invited speakers

Contact

bigmax_summerschool2021@mpi-magdeburg.mpg.de

Participants

157

Monday 13 September
- 14:00 → 19:00
  Machine Learning - Theory
  - 14:00
    
    Welcome 15m
  - 14:15
    
    Lecture "Four Generations of Neural Network Potentials" 1h
    
    A lot of progress has been made in recent years in the development of machine learning (ML) potentials for atomistic simulations [1]. Neural network potentials (NNPs), which were introduced more than two decades ago [2], are an important class of ML potentials. While the first generation of NNPs was restricted to small molecules with only a few degrees of freedom, the second generation extended the applicability of ML potentials to high-dimensional systems containing thousands of atoms by constructing the total energy as a sum of environment-dependent atomic energies [3]. Long-range electrostatic interactions can be included in third-generation NNPs employing environment-dependent charges [4], but the limitations of this locality approximation could only recently be overcome by the introduction of fourth-generation ML potentials [5], which are able to describe non-local charge transfer using a global charge equilibration step. In this talk, an overview of the evolution of NNPs will be given along with typical applications in large-scale atomistic simulations.
    
    [1] J. Behler, J. Chem. Phys. 145 (2016) 170901.
    [2] T. B. Blank, S. D. Brown, A. W. Calhoun, and D. J. Doren, J. Chem. Phys. 103 (1995) 4129.
    [3] J. Behler and M. Parrinello, Phys. Rev. Lett. 98 (2007) 146401.
    [4] N. Artrith, T. Morawietz, J. Behler, Phys. Rev. B 83 (2011) 153101.
    [5] T. W. Ko, J. A. Finkler, S. Goedecker, J. Behler, Nature Comm. 12 (2021) 398.
    
    Speaker: Jörg Behler (University of Göttingen)
  - 15:15
    
    Lecture "Science in the Age of Machine Learning" 1h
    
    Traditionally the “best” observations are those with the largest signal from the most tightly controlled systems. In a wide range of phenomena – the dance of proteins in function, femtosecond breaking of molecular bonds, the gestation of fetuses – tight control is neither feasible, nor desirable. Modern machine-learning techniques extract far more information from sparse random sightings than usually obtained from set-piece experiments. I will describe on-going efforts to extract structural and dynamical information from noisy, random snapshots recorded with very poor, or non-existent timing information. Examples will include functional motions of molecular machines, and ultrafast dynamics of photo-induced reactions.
    
    Speaker: Abbas Ourmazd (University of Wisconsin Milwaukee)
  - 16:15
    
    Coffee Break 45m
  - 17:00
    
    Tutorial "Active Learning with Bayesian Optimization" 1h
    
    Gaussian process regression (GPR) is a kernel-based regression tool with intrinsic uncertainty estimation, which makes it well-suited to natural science datasets. In Bayesian optimization, GPR is coupled with acquisition functions for an active learning approach, where models are iteratively refined by the addition of new data points with high information content. This tutorial will use the BOSS code to demonstrate the basic principles of Bayesian optimization and how it can be applied to N-dimensional atomistic structure search.
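
    The loop described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the BOSS code or its API: the squared-exponential kernel, the lower-confidence-bound acquisition function, the 1D test function `objective`, and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel (unit prior variance) between 1D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_tq.T @ alpha
    v = np.linalg.solve(chol, k_tq)
    # Prior variance is 1; clip tiny negative values caused by round-off.
    var = np.clip(1.0 - np.sum(v**2, axis=0), 0.0, None)
    return mean, np.sqrt(var)


def objective(x):
    """Hypothetical stand-in for an expensive energy evaluation."""
    return np.sin(3.0 * x) + 0.5 * x**2


def bayesian_optimization(n_iter=15, kappa=2.0):
    """Minimize `objective` on [-2, 2] with a lower-confidence-bound rule."""
    grid = np.linspace(-2.0, 2.0, 401)
    x_train = np.array([-1.5, 0.0, 1.5])  # initial design (assumed)
    y_train = objective(x_train)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_train, y_train, grid)
        # Acquisition: low predicted value or high uncertainty is attractive.
        x_next = grid[np.argmin(mean - kappa * std)]
        x_train = np.append(x_train, x_next)
        y_train = np.append(y_train, objective(x_next))
    best = np.argmin(y_train)
    return x_train[best], y_train[best], x_train


best_x, best_y, history = bayesian_optimization()
print(f"best x = {best_x:.2f}, best y = {best_y:.3f}")
```

    The parameter `kappa` trades exploration against exploitation: larger values favour regions where the model is uncertain, smaller values favour regions where the predicted value is already low.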
    
    Speaker: Milica Todorović (Turku University)
  - 18:00
    
    Tutorial "Compressed sensing meets symbolic regression: learning interpretable models" 1h
    
    In this tutorial, we introduce the AI technique of symbolic regression, combined with compressed sensing for the identification of compact, interpretable models.
    Specifically, we introduce the Sure-Independence Screening and Sparsifying Operator (SISSO), together with its recent variants.
    The methodology starts from a set of candidate features, provided by the user, and builds a tree of possible mathematical expressions, involving linear and nonlinear operators, up to a given complexity. A compressed sensing solver then finds, among billions or trillions of candidate expressions, those that best explain the training data.
    We will show demonstrative applications to materials science, including prediction of perovskite-materials stability and topological-insulators identification.
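
    The two stages described above can be illustrated on synthetic data. The sketch below is a strongly simplified toy version and not the SISSO implementation: the operator set, the single screening pass by absolute correlation, the exhaustive search over pairs, and the synthetic target are all assumptions made for the example.

```python
import itertools

import numpy as np


def build_candidates(primary):
    """Expand primary features with a few unary and binary operators."""
    candidates = dict(primary)
    for name, values in primary.items():
        candidates[f"({name})^2"] = values**2
        candidates[f"sqrt({name})"] = np.sqrt(values)
    for (n1, v1), (n2, v2) in itertools.permutations(primary.items(), 2):
        candidates[f"({n1}/{n2})"] = v1 / v2
    for (n1, v1), (n2, v2) in itertools.combinations(primary.items(), 2):
        candidates[f"({n1}*{n2})"] = v1 * v2
    return candidates


def screen(candidates, target, top_k):
    """Screening step: keep the features most correlated with the target."""
    scores = {
        name: abs(np.corrcoef(values, target)[0, 1])
        for name, values in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def sparsify(candidates, names, target, n_terms=2):
    """Sparsifying step: exhaustive least-squares search over small subsets."""
    best = None
    for combo in itertools.combinations(names, n_terms):
        columns = [candidates[n] for n in combo] + [np.ones_like(target)]
        design = np.column_stack(columns)
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        rmse = np.sqrt(np.mean((design @ coef - target) ** 2))
        if best is None or rmse < best[0]:
            best = (rmse, combo, coef)
    return best


# Synthetic data: positive primary features and a hidden two-term law.
rng = np.random.default_rng(0)
primary = {name: rng.uniform(0.5, 2.0, size=200) for name in "abc"}
target = 3.0 * primary["a"] / primary["b"] - 2.0 * primary["c"] ** 2

candidates = build_candidates(primary)
screened = screen(candidates, target, top_k=10)
rmse, terms, coef = sparsify(candidates, screened, target)
print(terms, np.round(coef, 3), rmse)
```

    In this toy setting the candidate space has only 18 expressions, so the exhaustive pair search is cheap; the screening step matters when the space grows to the sizes mentioned in the abstract.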
    
    Speaker: Luca M. Ghiringhelli (Fritz Haber Institute of the Max Planck Society)
Tuesday 14 September
- 14:00 → 19:00
  Machine Learning - Theory & Application
  - 14:00
    
    Tutorial "Neural Networks" 1h
    
    In this tutorial, we discuss a neural network application in materials discovery. More specifically, we will showcase how to accelerate the discovery of functional high-entropy alloys using a neural-network-based generative model and an ensemble model for the regression task. The tutorial therefore consists of two parts. First, we discuss in detail how to construct an alloy generation scheme based on a generative neural network model, an attribute classifier, stochastic sampling, and a density estimation model. Second, we demonstrate a systematic approach that combines Bayesian optimization, neural networks, and gradient boosting decision trees to achieve a two-stage Ensemble Regression Model (TERM). The first stage concerns composition-based regression models, aiming at fast and large-scale composition inference. The top results from the first-stage model enter a more refined model, where density functional calculations and thermodynamic calculations are included as part of the model input. Finally, the TERM outputs are evaluated based on a rank-based policy.
    
    Speaker: Ye Wei (Max-Planck-Institut für Eisenforschung GmbH)
  - 15:00
    
    Lecture "Machine-learning aided atom probe tomography: status and (possible) directions" 1h
    
    Atom probe tomography (APT) is a materials analysis technique that provides compositional mapping with sub-nanometer resolution. The data takes the form of a point cloud, often containing millions of atoms, in which each point is associated with an elemental identity. By interrogating the point cloud, the local composition of a material, a phase, or a specific microstructural feature can be reported. APT is often referred to as a "data-intensive" technique, and has long made use of many clustering-type techniques (DBSCAN, NN etc.) to facilitate data extraction, all of which are now often classified as machine learning.
    
    In this presentation, I will review some of the recent developments from MPIE in the application of machine-learning techniques to atom probe analysis workflows – i.e. beyond just extraction of data from the point cloud – targeting faster and more efficient, reliable and reproducible data analysis.
    
    Speaker: Baptiste Gault (Max-Planck-Institut für Eisenforschung GmbH)
  - 16:00
    
    Coffee break 45m
  - 16:45
    
    Lecture "Machine Learning for Electron Microscopy: from Imaging to Atomic Fabrication" 1h
    
    I will discuss recent progress in automated experiments in electron microscopy, ranging from feature discovery to physics discovery via active learning. The applications of classical deep learning methods in streaming image analysis are strongly affected by out-of-distribution drift effects, and approaches to minimize these are discussed. We further present invariant variational autoencoders as a method to disentangle affine distortions and rotational degrees of freedom from other latent variables in imaging and spectral data and to decode physical mechanisms. The extension of the encoder approach towards establishing structure-property relationships will be illustrated using the example of plasmonic structures. Finally, I illustrate the transition from post-experiment data analysis to an active learning process. Here, strategies based on simple Gaussian processes often tend to produce sub-optimal results due to the lack of prior knowledge and the very simplified (via learned kernel function) representation of the system. In comparison, deep kernel learning (DKL) methods enable both the exploration of complex systems towards the discovery of structure-property relationships and automated experiment workflows targeting physics (rather than simple spatial feature) discovery. The latter is illustrated via the experimental discovery of edge plasmons in STEM/EELS in MnPS3, a lesser-known 2D material.
    
    This research is supported by the U.S. Department of Energy, Basic Energy Sciences, Materials Sciences and Engineering Division and the Center for Nanophase Materials Sciences, which is sponsored at Oak Ridge National Laboratory by the Scientific User Facilities Division, BES DOE.
    
    Speaker: Sergei Kalinin (Oak Ridge National Laboratory)
  - 17:45
    
    Break 1h 15m
- 19:00 → 23:00
  Poster Session: Machine Learning - Theory & Applications / Open Data (meet us in gather.town)
  - 19:00
    
    A materials informatics framework to discover patterns in atom probe tomography data. 2h
    
    To quantify chemical segregation at multiple length scales in APT in a semi-automatic way, we propose a multi-stage strategy. First, we collect composition statistics from APT datasets for 2x2x2 nm voxels. These voxel compositions are then clustered in compositional space using Gaussian mixture models to automatically identify key phases. Next, based on this compositional classification, we employ DBSCAN in physical space at voxel resolution to detect individual precipitates at a small fraction of the effort needed for single-atom-based algorithms. This framework was used to identify and disentangle plate-like Zr-rich and topologically complex Cu-rich precipitates in two APT datasets from Fe-doped Sm-Co magnets, each containing approximately 500 million ions. After each precipitate has been segmented using DBSCAN, a new approach based on principal component analysis (PCA) is applied to 2D slices of the complex Cu-rich precipitates to further decompose them into approximately planar regions along with their junctions. For each precipitate, the actual distribution of atomic fractions is compared to the expected distribution of a random alloy. This step helps to quantitatively assess clustering within a given precipitate for each atomic species. Finally, for each quasi-planar precipitate, a triangular grid is superimposed to investigate in-plane compositional fluctuations, thickness, and 1D composition profiles.
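
    The first stage of this workflow, collecting composition statistics per voxel, can be sketched with NumPy alone. This is an illustrative assumption about how such binning might be done, not the authors' code; the function name `voxel_compositions`, the integer species encoding, and the four-atom toy point cloud are hypothetical. The resulting composition vectors are the kind of input that the clustering stages (Gaussian mixture models, DBSCAN) would then operate on.

```python
import numpy as np


def voxel_compositions(positions, species, n_species, voxel_size=2.0):
    """Bin atoms into cubic voxels and return per-voxel atomic fractions.

    positions : (N, 3) array of coordinates in nm
    species   : (N,) integer array, species index of each atom
    """
    indices = np.floor(positions / voxel_size).astype(int)
    voxels, inverse = np.unique(indices, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # shape differs between NumPy versions
    counts = np.zeros((len(voxels), n_species))
    # Unbuffered accumulation: one count per (voxel, species) pair.
    np.add.at(counts, (inverse, species), 1.0)
    fractions = counts / counts.sum(axis=1, keepdims=True)
    return voxels, fractions


# Toy point cloud with two species (0 and 1) falling into two voxels.
positions = np.array(
    [[0.5, 0.5, 0.5], [1.5, 0.2, 0.3], [2.5, 0.1, 0.1], [3.9, 1.9, 1.9]]
)
species = np.array([0, 1, 1, 1])
voxels, fractions = voxel_compositions(positions, species, n_species=2)
print(voxels)
print(fractions)
```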
    
    Speaker: Alaukik Saxena (Max-Planck-Institut für Eisenforschung GmbH; Helmholtz School for Data Science in Life, Earth and Energy (HDS-LEE))
  - 19:00
    
    Deep generative models for the design of Dual phase steel microstructures 2h
    
    Dual Phase (DP) steels are an important family of steel grades used widely in the automotive industry because of their beneficial properties such as high ultimate tensile strength (GPa range), low initial yield stress, and high early-stage strain hardening. The DP steel microstructure consists of soft ferritic grains, which are mainly responsible for ductility, and hard martensitic zones, which give these steels their strength. However, due to the huge parameter space, establishing the microstructure-property relationship in DP steels is a combinatorial challenge. To tackle this challenge, we use data from crystal plasticity simulations and train machine learning models to automatically extract patterns from DP steel microstructures without any need for hand-designed features. In particular, I will present my work on using Variational Auto Encoders (VAEs, a generative machine learning model) to learn low-dimensional latent representations and to generate synthetic EBSD data for DP steels. I will also show that the learned latent space can serve as a design space for microstructure design.
    
    Speaker: Navyanth Kusampudi (Max-Planck-Institut für Eisenforschung GmbH)
  - 19:00
    
    Dynamic structure investigation of biomolecules with pattern recognition algorithms and X-ray experiments 2h
    
    Speaker: Amir Kotobi (Deutsches Elektronen-Synchrotron DESY)
  - 19:00
    
    Learning Dynamics of STEM by Enforcing Physical Consistency with Phase-Field Models 2h
    
    In this poster, we present the research goals of a project recently funded by BiGmax on learning the dynamics of scanning transmission electron microscopy (STEM) by incorporating physical consistency with phase-field models. The primary idea of this project is to develop machine learning (ML)-based modeling of an interpretable coarse-grained dynamic model, utilizing in situ STEM video sequences that fulfill a suitable dynamical phase-field equation. The modeling approach aims to discover governing equations by utilizing the video sequence data and prior physics knowledge, in a form that is directly compatible with analytic theories or subsequent ML-based analysis.
    
    Speakers: Lekshmi Sreekala (Max-Planck-Institut für Eisenforschung GmbH), Pawan Goyal (Max Planck Institute for Dynamics of Complex Technical Systems)
  - 19:00
    
    Multidimensional Photoemission Spectroscopy: proposal & demonstration of data (infra)structure 2h
    
    The complexity of photoemission data is rapidly increasing, as new technological breakthroughs have enabled multidimensional parallel acquisition of multiple observables. Most of the community is currently using heterogeneous data formats and workflows.
    
    We propose a new data format based on NeXus, a hierarchically organized HDF5 structure. The aim is to immediately enable the sharing of preprocessed data and metadata according to FAIR principles, employing existing storage and archiving infrastructure such as Zenodo, OpenAIRE and NOMAD/FAIRmat. Ultimately, the multidimensional photoemission spectroscopy (MPES) format is designed to allow high-performance automated access, providing experimental databases for high-throughput material search.
    
    The MPES format is based on the creation of a standardized set of classes that unambiguously identify experimental observables and metadata. This dictionary is complemented by “application definitions”, i.e. ontologies that constrain the existence of specific elements in a file.
    
    Our approach involves reaching out to the community using a website with a wiki structure. By this medium, we wish to favour acceptance, to avoid conflict with different low-level preprocessing workflows, and to create continuously updated documentation.
    
    As a demonstrator of the potential of our approach, we present the workflow we developed for our data pipeline, originating from time- and angle-resolved photoemission spectroscopy.
    
    Speakers: Tommaso Pincelli (Fritz Haber Institute of the Max Planck Society), Steinn Ymir Agustsson (JGU Mainz)
  - 19:00
    
    Robust recognition and exploratory analysis of crystal structures via Bayesian deep learning 2h
    
    Due to their ability to recognize complex patterns, neural networks can drive a paradigm shift in the analysis of materials science data. Here, we introduce ARISE, a crystal-structure identification method based on Bayesian deep learning. As a major step forward, ARISE is robust to structural noise and can treat more than 100 crystal structures, a number that can be extended on demand. While being trained on ideal structures only, ARISE correctly characterizes strongly perturbed single- and polycrystalline systems, from both synthetic and experimental sources. The probabilistic nature of the Bayesian-deep-learning model yields principled uncertainty estimates, which are found to be correlated with the crystalline order of metallic nanoparticles in electron-tomography experiments. Applying unsupervised learning to the internal neural-network representations reveals grain boundaries and (unapparent) structural regions sharing easily interpretable geometrical properties. This work enables the hitherto hindered analysis of noisy atomic structural data from computations or experiments.
    
    Speaker: Andreas Leitherer (Fritz Haber Institute of the Max Planck Society)
  - 19:00
    
    Teaching solid mechanics to artificial intelligence 2h
    
    I will present our latest progress in using machine learning for solving non-linear solid mechanics. The presentation will be based on our published work here: https://www.nature.com/articles/s41524-021-00571-z
    
    Speaker: Jaber Mianroodi (Max-Planck-Institut für Eisenforschung GmbH)
Wednesday 15 September
- 14:00 → 18:45
  Learning from Complex Data
  - 14:00
    Tutorial "Tips, tricks, and tools for reproducible materials science" 1h
    
    The ability to replicate results is a key characteristic of quality science, and is growing ever more important in light of the replication crisis [1, 2].
    A study can rarely be repeated using only the minimalistic descriptions provided in the "Materials and methods" section in a paper.
    It is therefore important to properly document the entire knowledge generation pipeline in such a way that it could be repeated by anyone with minimal effort.
    In this tutorial we will look at the tools and techniques by which we can improve the reproducibility of an experimental study through a single comprehensive example from electron microscopy.
    
    The session will cover:
    
    eLabFTW, an electronic lab notebook application with a python API, which can be used to document samples and experiments.
    
    Various strategies of working with experimental metadata, including the HDF5 file format and JSON.
    
    Strategies and best practices for reproducible data analysis pipelines in jupyter notebooks, using git, conda-forge, mybinder and docker.
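
    As a small illustration of the metadata point in the list above, the following sketch stores experimental metadata in a JSON "sidecar" file next to a data file and reads it back. It uses only the Python standard library; the helper names, the file name, and every metadata field are hypothetical and do not reflect eLabFTW or any particular microscope software.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(data_path, metadata):
    """Write metadata to a JSON file that shares the data file's stem."""
    sidecar = Path(data_path).with_suffix(".json")
    # Sorted keys and indentation keep the file diff-friendly under git.
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return sidecar


def read_metadata(data_path):
    """Load the JSON sidecar belonging to a data file."""
    return json.loads(Path(data_path).with_suffix(".json").read_text())


# Hypothetical metadata for a single acquisition.
metadata = {
    "sample_id": "example-sample-001",
    "operator": "A. Researcher",
    "acceleration_voltage_kV": 300,
    "detectors": ["HAADF", "EELS"],
}

with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "scan_0001.raw"
    sidecar = write_metadata(data_file, metadata)
    sidecar_name = sidecar.name
    loaded = read_metadata(data_file)

print(sidecar_name, loaded["detectors"])
```

    Keeping metadata in a plain-text file alongside the raw data is only one of several possible strategies; embedding it as attributes in an HDF5 file keeps data and metadata in a single container instead.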
    
    [1] Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
    [2] Schooler, J. Metascience could rescue the 'replication crisis'. Nature 515, 9 (2014).
    
    Speaker: Niels Cautaerts (Max-Planck-Institut für Eisenforschung GmbH)
  - 15:00
    
    Lecture "Neural Networks with Euclidean Symmetry for Physical Sciences" 1h
    
    Atomic systems (molecules, crystals, proteins, etc.) are naturally represented by a set of coordinates in 3D space labeled by atom type. This is a challenging representation to use for machine learning because the coordinates are sensitive to 3D rotations and translations and there is no canonical orientation or position for these systems. One motivation for incorporating symmetry into machine learning models on 3D data is to eliminate the need for data augmentation — the 500-fold increase in brute-force training necessary for a model to learn 3D patterns in arbitrary orientations.
    
    Most symmetry-aware machine learning models in the physical sciences avoid augmentation through invariance, throwing away coordinate systems altogether. But this comes at a price; many of the rich consequences of Euclidean symmetry are lost: geometric tensors, point and space groups, degeneracy, atomic orbitals, etc.
    
    We present a general neural network architecture that faithfully treats the equivariance of physical systems, naturally handles 3D geometry, and operates on the scalar, vector, and tensor fields that characterize them. We describe how the network achieves equivariance, demonstrate the capabilities of our network using simple tasks, and provide coding examples to build these models using e3nn: a modular framework for Euclidean Neural Networks (https://e3nn.org).
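
    The difference between invariance and equivariance discussed in this abstract can be checked numerically. The sketch below uses plain NumPy and does not use the e3nn API; the function `toy_vector_output` is a hypothetical hand-written map, chosen only because its equivariance is easy to verify.

```python
import numpy as np


def random_rotation(rng):
    """Draw a proper rotation matrix (determinant +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix the sign ambiguity of the columns
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


def pairwise_distances(coords):
    """Invariant features: unchanged by rotations and translations."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def toy_vector_output(coords):
    """Equivariant per-atom vectors built from weighted relative positions."""
    diff = coords[None, :, :] - coords[:, None, :]  # diff[i, j] = r_j - r_i
    weights = np.exp(-np.sum(diff**2, axis=-1))  # depends on distances only
    return np.sum(weights[..., None] * diff, axis=1)


rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))
rotation = random_rotation(rng)
shift = rng.normal(size=3)
moved = coords @ rotation.T + shift

# Invariance: scalar features do not change when the system is moved.
invariant_ok = np.allclose(pairwise_distances(moved), pairwise_distances(coords))
# Equivariance: vector outputs rotate together with the input.
equivariant_ok = np.allclose(
    toy_vector_output(moved), toy_vector_output(coords) @ rotation.T
)
print(invariant_ok, equivariant_ok)
```

    A model built only from the distance matrix cannot output a direction, which is the price of invariance mentioned in the abstract; the vector-valued map keeps directional information while still behaving predictably under rotation.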
    
    Speaker: Tess Smidt (Berkeley Lab)
  - 16:00
    
    Coffee Break 45m
  - 16:45
    
    Tutorial "Neural Networks with Euclidean Symmetry for Physical Sciences" 1h
    
    Atomic systems (molecules, crystals, proteins, etc.) are naturally represented by a set of coordinates in 3D space labeled by atom type. This is a challenging representation to use for machine learning because the coordinates are sensitive to 3D rotations and translations and there is no canonical orientation or position for these systems. One motivation for incorporating symmetry into machine learning models on 3D data is to eliminate the need for data augmentation — the 500-fold increase in brute-force training necessary for a model to learn 3D patterns in arbitrary orientations.
    
    Most symmetry-aware machine learning models in the physical sciences avoid augmentation through invariance, throwing away coordinate systems altogether. But this comes at a price; many of the rich consequences of Euclidean symmetry are lost: geometric tensors, point and space groups, degeneracy, atomic orbitals, etc.
    
    We present a general neural network architecture that faithfully treats the equivariance of physical systems, naturally handles 3D geometry, and operates on the scalar, vector, and tensor fields that characterize them. We describe how the network achieves equivariance, demonstrate the capabilities of our network using simple tasks, and provide coding examples to build these models using e3nn: a modular framework for Euclidean Neural Networks (https://e3nn.org).
    
    Speaker: Tess Smidt (Berkeley Lab)
  - 17:45
    
    Lecture "Artificial Intelligence and High-Performance Data Mining for Accelerating Scientific Discovery" 1h
    
    The increasing availability of data from the first three paradigms of science (experiments, theory, and simulations), along with advances in artificial intelligence and machine learning (AI/ML) techniques has offered unprecedented opportunities for data-driven science and discovery, which is the fourth paradigm of science. Within the arena of AI/ML, deep learning (DL) has emerged as a game-changing technique in recent years with its ability to effectively work on raw big data, bypassing the (otherwise crucial) manual feature engineering step traditionally required for building accurate ML models, thus enabling numerous real-world applications, such as autonomous driving. In this talk, I will present our ongoing research in AI and high performance data mining, along with illustrative real-world scientific applications. In particular, we will discuss approaches to gainfully apply DL on big data (by accelerating DL and enabling deeper learning) as well as on small data (deep transfer learning) in the context of materials science. I will also demonstrate some of the software tools developed in our group.
    
    Speaker: Ankit Agrawal (Northwestern University)
Thursday 16 September
- 14:00 → 17:45
  Multidimensional Data Analysis
  - 14:00
    
    Lecture "Variational methods in material sciences" 1h
    
    Variational methods are powerful tools in image processing. Basically, we are searching for a suitable mathematical model (function), consisting of a data term and a prior, whose minimizer provides a solution of the task at hand and can be computed in an efficient, reliable way. Typically this leads to non-smooth, high-dimensional optimization problems.
    This talk deals with recent results obtained by applying variational methods to different tasks in material sciences, such as
    - crack detection using optical flow models in image sequences,
    - determination of deformation fields in electron backscatter diffraction image sequences,
    - superresolution of material images by learned patch-based priors, and
    - denoising of FIB images with directional total variation priors.
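
    The structure "data term plus prior" can be made concrete with a one-dimensional denoising toy problem. The sketch below is an assumption made for illustration and not a method from the talk: it replaces the non-smooth total variation prior with a smoothed surrogate so that plain gradient descent applies, and the signal, the weight `LAM`, and the smoothing parameter `EPS` are arbitrary choices.

```python
import numpy as np

LAM = 0.2   # weight of the prior (assumed value)
EPS = 0.05  # smoothing of the absolute value (assumed value)


def smoothed_tv(u, eps=EPS):
    """Smoothed total variation: sum of sqrt(difference^2 + eps^2)."""
    return np.sum(np.sqrt(np.diff(u) ** 2 + eps**2))


def energy(u, f, lam=LAM, eps=EPS):
    """Data term (squared distance to the observation) plus weighted prior."""
    return 0.5 * np.sum((u - f) ** 2) + lam * smoothed_tv(u, eps)


def energy_gradient(u, f, lam=LAM, eps=EPS):
    """Gradient of the energy with respect to the signal u."""
    d = np.diff(u)
    w = d / np.sqrt(d**2 + eps**2)
    grad_tv = np.zeros_like(u)
    grad_tv[:-1] -= w  # each difference depends on its left sample ...
    grad_tv[1:] += w   # ... and on its right sample
    return (u - f) + lam * grad_tv


def denoise(f, lam=LAM, eps=EPS, n_iter=500):
    """Gradient descent with a step size below the inverse Lipschitz bound."""
    step = 1.0 / (1.0 + 4.0 * lam / eps)
    u = f.copy()
    for _ in range(n_iter):
        u = u - step * energy_gradient(u, f, lam, eps)
    return u


# Piecewise-constant signal corrupted by Gaussian noise.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), np.ones(50)])
noisy = clean + 0.1 * rng.normal(size=clean.size)
denoised = denoise(noisy)
print(energy(noisy, noisy), energy(denoised, noisy))
```

    The smoothing makes the problem differentiable at the cost of slightly blurring sharp edges; the non-smooth problems mentioned in the abstract are usually tackled with dedicated solvers rather than plain gradient descent.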
    
    Speaker: Gabriele Steidl (TU Berlin)
  - 15:00
    
    Lecture "HyperSpy: theory and applications" 1h
    
    HyperSpy (https://hyperspy.org/) is an open-source Python package for the analysis of multi-dimensional datasets. In its fourteen years of existence, its community of developers has taken it from being a simple collection of scripts for electron energy-loss data analysis to become the core of a multidisciplinary software ecosystem. In this talk, I will present its evolution, ecosystem, main features, and structure, all illustrated with a bit of theory and applications. I will also briefly describe how to contribute to HyperSpy and how to create your own HyperSpy extension.
    
    Speaker: Francisco De La Peña (Université de Lille)
  - 16:00
    
    Coffee Break 45m
  - 16:45
    
    Tutorial "Introduction to EELS analysis with HyperSpy" 1h
    
    HyperSpy (https://hyperspy.org/) is an open-source Python package for the analysis of multi-dimensional datasets. In its fourteen years of existence, its community of developers has taken it from being a simple collection of scripts for electron energy-loss data analysis to become the core of a multidisciplinary software ecosystem. In this talk, I will present its evolution, ecosystem, main features, and structure, all illustrated with a bit of theory and applications. I will also briefly describe how to contribute to HyperSpy and how to create your own HyperSpy extension.
    
    Speaker: Francisco De La Peña (Université de Lille)
Friday 17 September
- 14:00 → 17:30
  Open Data
  - 14:00
    
    Lecture "NFDI4Chem" 1h
    
    tba
    
    Speaker: Nicole Jung (Karlsruhe Institute of Technology)
  - 15:00
    
    Lecture "Structuring, analyzing, and harvesting big data in materials science electron microscopy" 1h
    
    Every day, experimental materials science data is being collected in thousands of laboratories around the world. However, the diversity of instruments, vendor software packages and (proprietary) data formats, differing lab cultures, and the focus mostly on new discoveries cause most of this data to end up in a black hole in terms of accessibility to the scientific community (including, in many cases, the lab in which the data was acquired). Using the example of transmission electron microscopy, we will present our attempts to encourage scientists to annotate and contribute more of their data to the scientific community, benefiting themselves in the process. We are striving to achieve this by offering relevant online data processing and analysis capabilities and tools to annotate data in increasingly automated ways. As more and more data is accumulated, big data techniques can be applied to benefit from the added value that the sharing of experimental data sets produces.
    
    Speaker: Christoph Koch (Humboldt-Universität zu Berlin)
  - 16:00
    
    Coffee Break 30m
  - 16:30
    
    Tutorial "Sharing, publishing, and managing computational materials science data with NOMAD" 1h
    
    The main focus of this tutorial is the FAIR sharing of materials science data and how to do it with NOMAD. We will cover the publication of new data as well as the exploration and download of NOMAD’s existing data, both through our browser-based interface and through our APIs.
    
    Speaker: Markus Scheidgen (Humboldt Universität zu Berlin / Fritz Haber Institut der Max Planck Gesellschaft)
- 17:30 → 17:45
  
  Concluding remarks
