Harnessing Large Language Models to Extract Complex Concentrated Alloy Properties from Scientific Literature for Materials Science and Engineering
Joshua Berry
The University of Sheffield
2025-06-17 · 11:35 BST
J. Berry1, A. Thomas2,3, X. Liu2,3, H. Lu2,3, N. Morley1, K.A. Christofidou1
Materials informatics tools depend on large, robust databases, which are not yet available for complex concentrated alloys (CCAs). This talk outlines a methodology for leveraging large language models (LLMs) to automatically extract key information, such as alloy compositions, processing parameters, and material properties, from scientific publications and convert it into structured, machine-readable databases.
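As an illustration of what "structured, machine-readable" could mean here, the following is a minimal sketch that validates a JSON array returned by an LLM and turns it into typed records. The `AlloyRecord` schema, its field names, and the `parse_llm_output` helper are assumptions made for this example and are not taken from the presented workflow.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AlloyRecord:
    """Hypothetical schema for one extracted alloy entry (illustrative only)."""
    composition: dict                                # element symbol -> amount
    processing: list = field(default_factory=list)   # e.g. ["arc melted"]
    properties: dict = field(default_factory=dict)   # e.g. {"hardness": {...}}
    source: str = ""                                 # identifier of the paper


def parse_llm_output(raw: str) -> list:
    """Convert a JSON array emitted by an LLM into AlloyRecord objects.

    Entries that are malformed (not an object, no composition mapping,
    non-numeric amounts) are skipped instead of raising.
    """
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    records = []
    for item in items:
        if not isinstance(item, dict):
            continue
        comp = item.get("composition")
        if not isinstance(comp, dict) or not comp:
            continue
        try:
            composition = {str(k): float(v) for k, v in comp.items()}
        except (TypeError, ValueError):
            continue
        processing = item.get("processing", [])
        properties = item.get("properties", {})
        records.append(AlloyRecord(
            composition=composition,
            processing=list(processing) if isinstance(processing, list) else [],
            properties=dict(properties) if isinstance(properties, dict) else {},
            source=str(item.get("source", "")),
        ))
    return records
```

Skipping malformed entries rather than failing keeps a single bad response from discarding the rest of a paper's data; whether that trade-off is appropriate depends on how the database is curated afterwards.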
The presented workflow combines PDF parsing tools like Nougat with instruction-tuned LLMs to identify and extract relevant data directly from unstructured text. Unlike domains such as chemistry, where standardised notations facilitate data extraction, alloy-related data is often expressed inconsistently. It is also frequently embedded in figures or tables, or described only contextually, which makes reliable extraction challenging.
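To make the notation problem concrete, the sketch below maps a few common ways of writing the same alloy (for example `Al0.5CoCrFeNi`, `Co25Cr25Fe25Ni25`, `Co-Cr-Fe-Ni`) onto normalised atomic fractions. The `normalise_composition` function is a hypothetical helper written for this example; it assumes the amounts are molar ratios or atomic percentages, does not check that symbols are real elements, and does not handle weight percentages or bracketed groups.

```python
import re

# Map Unicode subscript digits to ASCII so that "Al₀.₅" reads as "Al0.5".
_SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

# One element symbol followed by an optional numeric amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?|\.\d+)?")


def normalise_composition(formula: str) -> dict:
    """Return element -> atomic fraction (summing to 1) for a formula string.

    Assumes amounts are molar ratios or at.%; a missing amount means 1.
    Raises ValueError on empty input or on text that is not a symbol/amount.
    """
    cleaned = formula.translate(_SUBSCRIPTS).replace(" ", "").replace("-", "")
    if not cleaned:
        raise ValueError("empty composition string")

    amounts = {}
    pos = 0
    while pos < len(cleaned):
        match = _TOKEN.match(cleaned, pos)
        if match is None:
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        element, amount = match.group(1), match.group(2)
        amounts[element] = amounts.get(element, 0.0) + (float(amount) if amount else 1.0)
        pos = match.end()

    total = sum(amounts.values())
    if total <= 0:
        raise ValueError(f"composition {formula!r} has no positive amounts")
    return {element: value / total for element, value in amounts.items()}
```

Under these assumptions, `normalise_composition("CoCrFeNi")` and `normalise_composition("Co25Cr25Fe25Ni25")` yield the same fractions, which is the kind of equivalence a database needs in order to merge entries reported in different notations.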
Comparative results will be shown for multiple LLMs tested against a manually curated CCA dataset, illustrating the current capabilities and limitations of automated extraction in this context. The creation of such databases has the potential to significantly enhance the accessibility and usability of published data, enabling broader reuse and integration. Ultimately, this approach supports the development of more robust and accurate machine learning models by providing higher-quality and more comprehensive datasets, and contributes to the reduction of data siloing that hinders progress in materials science and engineering.
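One common way to compare extracted records against a manually curated reference is precision, recall, and F1 over exact matches. The abstract does not state which metrics are used, so the sketch below is an assumption: `score_extraction` is a hypothetical helper, and each record is represented as a hashable tuple such as `(alloy, property, value)`.

```python
def score_extraction(extracted: set, reference: set) -> dict:
    """Exact-match precision, recall and F1 of extracted vs. curated records."""
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative values only; these are not results from the talk.
curated = {
    ("CoCrFeNi", "hardness_HV", 150),
    ("CoCrFeNi", "yield_MPa", 200),
    ("AlCoCrFeNi", "hardness_HV", 500),
}
model_output = {
    ("CoCrFeNi", "hardness_HV", 150),
    ("CoCrFeNi", "yield_MPa", 210),  # wrong value, so counted as a miss
}
scores = score_extraction(model_output, curated)
```

Exact matching is strict: a correct property with a slightly different numeric value or unit counts as an error, so a real comparison would likely need tolerance and unit handling on top of this.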
Affiliations
1Department of Materials Science & Engineering, University of Sheffield, Mappin Street, Sheffield, S1 3JD, UK
2Department of Computer Science, University of Sheffield, Regent Court, Sheffield, S1 4DP, UK
3Centre of Machine Intelligence, University of Sheffield, Regent Court, Sheffield, S1 4DA, UK