Physical Sciences Data Infrastructure (PSDI) Materials Community Workshop

16 - 17 June 2025

Royce Hub Building

Manchester, UK


Joshua Berry

Harnessing Large Language Models to Extract Complex Concentrated Alloy Properties from Scientific Literature for Materials Science and Engineering

The University of Sheffield

2025-06-17 · 11:35 BST

J. Berry¹, A. Thomas²,³, X. Liu²,³, H. Lu²,³, N. Morley¹, K.A. Christofidou¹

Materials informatics tools depend on large, robust databases, which are not yet available for complex concentrated alloys (CCAs). This talk outlines a methodology for leveraging large language models (LLMs) to automatically extract key information, such as alloy compositions, processing parameters, and material properties, from scientific publications and convert it into structured, machine-readable databases.
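As an illustration of what one entry in such a structured, machine-readable database could look like, the following minimal Python sketch defines a record holding composition, processing, and property fields and serialises it to JSON. The class name `AlloyRecord`, its field names, and the example entry are assumptions made for illustration only; they are not the schema used in the presented work, and the property value is deliberately left empty.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class AlloyRecord:
    """One extracted entry. All field names are hypothetical."""
    composition: dict                                  # element symbol -> atomic fraction
    processing: list = field(default_factory=list)     # free-text processing steps
    property_name: Optional[str] = None                # e.g. "hardness"
    value: Optional[float] = None                      # numeric value, if one was reported
    unit: Optional[str] = None                         # unit as given in the source
    source_doi: Optional[str] = None                   # provenance of the entry


def to_json_line(record: AlloyRecord) -> str:
    """Serialise a record as one JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative entry only: the value is left unset rather than invented.
example = AlloyRecord(
    composition={"Co": 0.25, "Cr": 0.25, "Fe": 0.25, "Ni": 0.25},
    processing=["arc melted", "homogenised"],
    property_name="hardness",
    unit="HV",
)

print(to_json_line(example))
```

Keeping a provenance field alongside each value is one way to let downstream users trace an entry back to the publication it came from.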

The presented workflow combines PDF parsing tools like Nougat with instruction-tuned LLMs to identify and extract relevant data directly from unstructured text. Unlike domains such as chemistry, where standardised notations facilitate data extraction, alloy-related data is often expressed inconsistently. Key information is also frequently embedded in figures or tables, or described only contextually, which makes reliable extraction more difficult.
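To make the notation problem concrete, the sketch below normalises two common ways of writing the same alloy (unit subscripts versus atomic percentages) into one dictionary of atomic fractions. It is a simplified assumption of how such normalisation might be done, not the method used in the talk: the function name `parse_composition` is hypothetical, and the parser handles only flat formulas, not bracketed forms such as (CoCrFeNi)90Al10 or compositions given in weight percent.

```python
import re

# An element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_composition(formula: str) -> dict:
    """Convert a flat alloy formula into normalised atomic fractions.

    A missing amount is treated as 1, so "CoCrFeNi" is read as equiatomic.
    """
    amounts = {}
    for element, number in _TOKEN.findall(formula):
        amount = float(number) if number else 1.0
        amounts[element] = amounts.get(element, 0.0) + amount
    total = sum(amounts.values())
    if total == 0:
        raise ValueError(f"no elements found in {formula!r}")
    return {element: amount / total for element, amount in amounts.items()}


# Two notations for the same equiatomic alloy give the same fractions.
print(parse_composition("CoCrFeNi"))
print(parse_composition("Co25Cr25Fe25Ni25"))
```

Mapping every notation onto a single representation before storage means that records extracted from different papers can be compared directly.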

Comparative results will be shown for multiple LLMs tested against a manually curated CCA dataset, illustrating the current capabilities and limitations of automated extraction in this context. Such databases have the potential to significantly enhance the accessibility and usability of published data, enabling broader reuse and integration. Ultimately, this approach supports the development of more robust and accurate machine learning models by providing higher-quality and more comprehensive datasets, and helps reduce the data siloing that hinders progress in materials science and engineering.
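One way a comparison against a manually curated dataset could be scored is with precision, recall, and F1 over exactly matching records, as in the minimal sketch below. The abstract does not state which metrics or matching rules were used, so exact-match scoring, the function name `score_extraction`, and the placeholder records are all assumptions for illustration.

```python
def score_extraction(predicted, reference):
    """Score extracted records against a curated reference by exact match.

    Both arguments are iterables of hashable records, for example
    (alloy, property, unit) tuples.
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder records, not real extraction results.
reference = [
    ("alloy_A", "hardness", "HV"),
    ("alloy_A", "yield_strength", "MPa"),
    ("alloy_B", "hardness", "HV"),
    ("alloy_B", "density", "g/cm3"),
]
predicted = [
    ("alloy_A", "hardness", "HV"),
    ("alloy_B", "hardness", "HV"),
    ("alloy_C", "hardness", "HV"),  # not in the reference: a false positive
]

print(score_extraction(predicted, reference))
```

Exact matching is strict; a practical evaluation would likely need tolerance for numeric rounding and for equivalent composition notations.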

Affiliations

¹ Department of Materials Science & Engineering, University of Sheffield, Mappin Street, Sheffield, S1 3JD, UK
² Department of Computer Science, University of Sheffield, Regent Court, Sheffield, S1 4DP, UK
³ Centre of Machine Intelligence, University of Sheffield, Regent Court, Sheffield, S1 4DA, UK