Semantic Enrichment and LLM-Driven Analytics to the Mindat Open Data Portal

SPEAKER

Dr. Xiaogang Ma, Associate Professor of Computer Science, University of Idaho

DATE

August 18, 2025

Abstract

Mindat is the world’s largest open mineral database, featuring over 6,600 mineral species and 430,000 localities. Recently, with support from NSF and NAIRR, we have enhanced the Mindat open data service by integrating semantic technologies and large language models (LLMs) to improve data quality, usability, and accessibility. We have aligned records with community standards (e.g., rock and mineral nomenclature and classification) and persistent identifiers, which enables FAIR-compliant, machine-readable data access via a new open data API. On top of this infrastructure, LLMs assist with natural-language querying, record cleansing, synonym resolution, and geospatial disambiguation. Embedding-based search and interactive visualizations, such as mineral–element networks and locality heatmaps, allow users to explore data with minimal coding. These innovations reduce barriers to data use, enhance transparency, and support reproducible geoscience research. This work demonstrates how semantic enrichment and LLM tools can transform domain-specific data services into intelligent, user-friendly platforms for data science in the Earth sciences.

Date and Time

August 18, 2025, 10:00 - 11:00 AM PDT

Speaker Biography

Xiaogang (Marshall) Ma is an Associate Professor of Computer Science at the University of Idaho. He received his Ph.D. in Earth Systems Science and GIScience from the University of Twente, Netherlands, in 2011, and then completed postdoctoral training in Data Science at Rensselaer Polytechnic Institute. His research focuses on deploying data science in the Semantic Web to support cross-disciplinary collaboration and scientific discovery, with broad interests in complex systems in Earth and environmental sciences, data interoperability and provenance, and visualized exploratory analysis of Big and Small Data.

Recording

To be uploaded.