NHERI Computational Symposium

February 5-7, 2025

Making data, benchmarks and testbeds reusable and AI/ML-ready

Session 10B Chairs: Maria Esteva & Scott Brandenberg


Henry Burton

Associate Professor
UCLA

Multi-Hazard Non-Relational Database for Disaster Impact Predictive Modeling

Co-Authors: Eric Choi (University of California, Los Angeles), Sebastian Galicia Madero (University of California, Los Angeles), Sravan Jayati (University of California, Merced), Shawn Newsam (University of California, Merced), and Pedro Fernandez-Caban (Florida State University)

Abstract: Organizations such as the Structural and Geotechnical Extreme Event Reconnaissance networks have made strides toward collecting the data needed to support natural hazards predictive modeling. These efforts have been complemented by the NHERI RAPID and DesignSafe centers: the RAPID facility provides equipment and resources for collecting post-event data, and DesignSafe provides a platform for compiling and publishing individual datasets. While numerous reconnaissance datasets have been placed on DesignSafe, they exist primarily in the form of “raw” data from individual events. There is currently no integration and modeling framework to support multihazard resilience-based assessments.

This presentation will describe a Multi-Hazard Resilience Assessment dataBase (MHRABase) that supports predictive disaster impact models through the systematic aggregation and curation of multimodal data, models, and heuristics from different sources over time. The non-relational database structure will provide the benefit of being highly adaptable and capable of integrating continuously curated multihazard data, predictive models, and related publications into a single environment. MHRABase will support a diversity of queries, ranging from those that can be answered by the embedded data and heuristics to more complex questions that require the integration of data and models. The current version of the database includes building damage and recovery datasets and models for four hurricanes and one earthquake. However, the intent is to create an adaptable knowledge system that can be expanded to incorporate other types of hazards (e.g., flooding), impact metrics (e.g., economic losses), infrastructure systems (e.g., lifelines), and interactions.
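As a hedged illustration of what a document-style (non-relational) record in such a database might look like, the Python sketch below uses hypothetical field names, values, and a toy query helper; none of it reflects the actual MHRABase schema.

```python
# Hypothetical sketch of a document-style (non-relational) record for a
# multi-hazard database such as MHRABase. Field names and values are
# illustrative assumptions, not the actual MHRABase design.
from typing import Any

record: dict[str, Any] = {
    "event": {"name": "Hurricane Example", "type": "hurricane", "year": 2022},
    "asset": {"class": "building", "occupancy": "residential", "stories": 2},
    "impact": {"damage_state": "moderate", "recovery_days": 180},
    "sources": ["doi:placeholder-example"],  # links to published datasets (placeholder)
    "models": [{"name": "fragility_v1", "type": "lognormal_fragility"}],
}

def query(records: list[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Return records whose top-level 'event' fields match all criteria."""
    return [r for r in records
            if all(r["event"].get(k) == v for k, v in criteria.items())]

# Example: retrieve all curated records for hurricane events.
hurricane_records = query([record], type="hurricane")
```

Because each record is a self-describing document, new fields (e.g., for a new hazard type or impact metric) can be added without restructuring existing entries, which is the adaptability the abstract emphasizes.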

Scott Brandenberg

Professor
UCLA

Relational Databases for Accessible, AI-ready Data Dissemination

Abstract: Natural hazards researchers often use the word "database" to refer to a file repository. File repositories often use human-readable file structures, which may differ significantly among research groups even for the same type of data, and often do not follow consistent metadata standards. For example, metadata often appears in the upper rows of a spreadsheet with data in lower rows, or important metadata is embedded in the filename. By contrast, a relational database organizes data into rows and columns in interconnected tables that can be queried using Structured Query Language (SQL). Relational databases follow an organizational structure called a schema, are uniformly formatted with well-defined metadata fields, and are therefore computer readable. An important benefit of relational databases is that they minimize the need for data wrangling, which currently occupies a significant fraction of data scientists' time on tasks such as coding parsers to organize data from different projects and deciding which fields are equivalent across projects. Relational databases enable data scientists to draw conclusions from the data more quickly. This presentation will discuss relational databases for the Next Generation Liquefaction project, the shear wave velocity database, the ground motion database for the NGA-West3 project, and a database of geotechnical data in coastal areas of California. These databases may be useful for audience members seeking datasets for their data science applications, including AI/ML. I will also focus some attention on defining best practices for creating relational databases for dissemination to the research community.
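As a minimal sketch of the contrast described above, the Python snippet below builds two interconnected SQLite tables and queries them with SQL; the table and column names are invented for illustration and do not correspond to the schema of any of the databases listed.

```python
# Minimal illustration of a relational schema with well-defined fields,
# queryable via SQL. Table and column names are hypothetical examples,
# not the schema of NGL, NGA-West3, or the other databases mentioned.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Metadata lives in defined columns, not in filenames or spreadsheet headers.
cur.execute("""CREATE TABLE site (
    site_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    latitude REAL,
    longitude REAL)""")
cur.execute("""CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    site_id INTEGER REFERENCES site(site_id),
    depth_m REAL,
    vs_mps REAL)""")

cur.execute("INSERT INTO site VALUES (1, 'Example Site', 34.07, -118.44)")
cur.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)",
                [(1, 1, 2.0, 180.0), (2, 1, 10.0, 320.0)])

# A single SQL query joins the tables; no per-project parser is needed.
rows = cur.execute("""
    SELECT s.name, m.depth_m, m.vs_mps
    FROM measurement AS m
    JOIN site AS s ON s.site_id = m.site_id
    WHERE m.vs_mps > 200
""").fetchall()
print(rows)  # [('Example Site', 10.0, 320.0)]
```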

Piyush Vyas

Graduate Student Researcher
University of Southern California

Relational Database for Analyzing Biocemented Sand in Seismic Mitigation

Co-Author: Chukwuebuka C. Nweke (University of Southern California)

Abstract: Biocementation offers tangible benefits for seismic mitigation, including increased soil stiffness, enhanced shear strength, and reduced liquefaction susceptibility. However, the development of reliable forward models and predictive analyses for biocemented soils is hindered by the lack of a standardized dataset, a resource that has proven invaluable in fields like concrete design. To address this gap, we present a relational database designed for efficient storage and analysis of biocemented sand data across varied experimental conditions.

Our database organizes and links datasets related to material properties (e.g., grain size distribution, mineralogy), treatment methods (e.g., microbial strains, cementation solutions), and seismic performance metrics (e.g., shear wave velocity, cyclic stress ratio, and post-liquefaction volumetric strain). This structured approach facilitates comprehensive cross-comparative analyses, enabling researchers to identify key factors influencing seismic resistance and optimize biocementation protocols.
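A hedged sketch of the kind of linked, cross-comparative query described above is shown below in Python with pandas; the column names, specimen identifiers, and values are illustrative assumptions rather than the database's actual schema or contents.

```python
# Toy example of linking treatment-method and seismic-performance tables
# and comparing metrics across treatments. Columns and values are invented
# for illustration, not drawn from the biocemented-sand database.
import pandas as pd

treatments = pd.DataFrame({
    "specimen_id": [1, 2, 3],
    "microbial_strain": ["S. pasteurii", "S. pasteurii", "B. megaterium"],
    "cementation_level": ["light", "moderate", "moderate"],
})
performance = pd.DataFrame({
    "specimen_id": [1, 2, 3],
    "shear_wave_velocity_mps": [180.0, 260.0, 240.0],
    "cyclic_stress_ratio": [0.12, 0.21, 0.19],
})

# Link the tables on specimen_id and compare seismic metrics across treatments.
joined = treatments.merge(performance, on="specimen_id")
summary = joined.groupby(["microbial_strain", "cementation_level"])[
    ["shear_wave_velocity_mps", "cyclic_stress_ratio"]
].mean()
print(summary)
```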

By providing a queryable repository of experimental results, the database supports the development and validation of predictive models. Researchers can use this tool to extrapolate performance under various seismic scenarios, refine treatment strategies, and make data-driven decisions. Ultimately, this database aims to accelerate the practical implementation of biocementation in seismic mitigation projects by establishing a robust foundation for predictive modeling and performance optimization.

Tracy Kijewski-Correa

Professor
University of Notre Dame

Promoting Data Reuse in Computational Simulation Workflows in Natural Hazards Engineering

Co-Authors: David Roueche (Auburn University), Mohammad S. Alam (University of Hawai’i at Manoa), Khalid Mosalam (University of California, Berkeley), and David O. Prevatt (University of Florida)

Abstract: The Structural Engineering Extreme Events Reconnaissance (StEER) network was founded with the mission to build societal resilience by generating new knowledge on the performance of the built environment through impactful post-event reconnaissance disseminated to affected communities. Since its founding, StEER has focused on shifting the paradigm from isolated researchers using paper forms to explore specific hypotheses by generating proprietary data, toward community collaboratives using digital platforms to acquire highly structured, standardized, and openly curated data. This has required not only building the community's capacity to efficiently collect perishable data, but also ensuring that the data are suitable for reuse. With more than five years of curated data, we see growing evidence of data reuse.

The presentation will first examine how assuring the quality of the data, achieved by creating an objective and consistent approach to structural assessment, was a precondition for reuse. We further explore how rigorous Data Enrichment and Quality Control (DEQC) processes deliver the reliability necessary for reuse. We then discuss the role of clear documentation and structuring of data, as well as discoverability, in promoting reusability, enhanced through the creation of policies and protocols for curation/publication within DesignSafe using its Field Research Data Model. These lessons will be contextualized within case studies from different major earthquakes and windstorms, demonstrating how different subsets of data were reused in support of evolving computational simulation workflows. This presentation will close by outlining the remaining challenges in engaging field observations in computational simulation workflows for natural hazards.

Carlos del-Castillo-Negrete

Researcher
University of Texas at Austin

Data-Driven Frameworks for Global Storm Surge Prediction

Co-Authors: Jinpai Zhao (The University of Texas at Austin), Benjamin Pachev (The University of Texas at Austin), Prateek Arora (New York University), Eirik Valseth (The University of Texas at Austin), and Clint Dawson (The University of Texas at Austin)

Abstract: Storm surges, abnormal rises in sea level caused by hurricanes and typhoons, are among the most significant societal risks posed by coastal hazards. While high-fidelity models like the ADvanced CIRCulation (ADCIRC) model predict storm surges accurately, they are computationally expensive. To address this, our work develops data-driven surrogate models for storm surge prediction.

Our initial models consisted of simple feed-forward networks trained on a dataset of 446 synthetic hurricanes in the Texas coast region, predicting storm surge levels to within 30 centimeters. We have now expanded our approach with a global dataset of over 48,000 synthetic hurricanes, including more than 13,000 in the North Atlantic basin. This expanded dataset enables our models to predict storm surges worldwide. Preliminary results with the expanded dataset for the North Atlantic region indicate a reduced error of 10 centimeters. We are actively exploring novel neural-operator frameworks and incorporating our surrogate models into a forecasting pipeline.
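As a rough sketch of what a simple feed-forward surrogate of this kind could look like, the PyTorch snippet below uses assumed input features, layer sizes, and toy data; it is not the authors' published architecture or training setup.

```python
# Illustrative feed-forward surrogate mapping storm parameters to peak surge
# at fixed coastal stations. Input features, layer sizes, and training data
# are assumptions for illustration, not the published model configuration.
import torch
from torch import nn

n_features = 8      # e.g., central pressure, forward speed, landfall location, ...
n_stations = 100    # number of coastal points where surge is predicted

model = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, n_stations),   # peak surge (m) at each station
)

# Toy training loop on random data, standing in for a synthetic-storm dataset.
x = torch.randn(446, n_features)
y = torch.randn(446, n_stations)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```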

All datasets and results are hosted on DesignSafe, emphasizing the creation of feature-rich, ML-ready datasets. We provide guidelines on training models using Jupyter Notebooks accessible on DesignSafe's compute resources, supporting both CPU and GPU parallelization for training and inference. This allows researchers to access trained models for quick inference in hindcasting or to integrate with hurricane best-track forecast pipelines for forecasting.

Our research not only advances the field of storm surge prediction but also contributes to the broader scientific community by providing accessible, ML-ready datasets and tools through the DesignSafe platform.

Maria Esteva

Research Scientist
University of Texas at Austin

What Makes Natural Hazards Data AI-Ready?

Co-Author: Scott J. Brandenberg (UCLA)

Abstract: AI/ML-ready data is defined as accurate, complete, unbiased, and adequately described datasets that can be confidently used, with minimal data wrangling, to train reproducible AI/ML models. Researchers in natural hazards are increasingly training surrogate models on simulation-based datasets, many of which are testbeds and databases of numerical results specifically designed for AI applications. Although guiding principles are emerging for curation best practices that make data Findable, Accessible, Interoperable, and Reusable (FAIR), there is no consensus on standards for what makes data AI/ML-ready in the natural hazards space. With a focus on wind and storm surge simulation data, this presentation will: a) introduce and discuss datasets that have been designed for and successfully used in AI applications, including feedback from the data producers and users; b) discuss data formats and annotation practices that make data AI-ready; c) identify challenges and requirements for finding and documenting AI/ML-ready data and related models; and d) discuss pre- and post-processing requirements, tools, and applications. The conclusions will serve to initiate a movement toward establishing best practices for the creation, curation, and use of AI-ready natural hazards data.