
Evaluating generative AI for research using an open-ended benchmark

APR 11, 2025
A series of open-ended questions at a graduate level or higher showcases AI’s potential and reveals its shortcomings.

Large language models (LLMs) are a recent iteration of generative neural networks capable of producing human-sounding text, including answers to submitted questions. Despite the appearance of usefulness, whether the information provided by LLMs is accurate and helpful, especially in highly technical contexts, is still an open question.

Yanguas-Gil et al. developed an open-ended benchmark to evaluate LLMs in the context of atomic layer deposition (ALD). Their work can provide insight into the usefulness of LLMs for assisting in ALD research and can act as a template for researchers in other fields looking to evaluate LLMs for use in their own work.

Most LLM benchmarks have a single clear correct answer, typically consisting of multiple-choice questions, text completion, or narrowly scoped factual questions.

“These evaluate aspects of LLMs related to accuracy, but they miss other aspects such as the relevance or usefulness of the response that are important in the context of AI assistants,” said author Angel Yanguas-Gil. “Open-ended benchmarks help to bridge the gap between conventional benchmarks and potential real-life uses of LLMs for scientific applications.”

For their benchmark, the authors assembled a list of 70 questions about ALD, ranging from early graduate level to advanced expert. They posed each question to an LLM, ChatGPT 4o, and rated each answer on criteria of quality, specificity, relevance, and accuracy. They found that while the model gave mostly correct answers, it struggled with the more difficult and specific questions, highlighting shortcomings that would not have been apparent with a conventional benchmark.
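The evaluation loop described above can be pictured as a small script: pose each free-form question to a chat model, store the response, and record expert ratings on the four criteria. The sketch below is illustrative only and is not the authors' code; the data structure, rating scale, and placeholder callables are assumptions.

```python
# Minimal sketch of an open-ended benchmark loop (not the authors' implementation).
from dataclasses import dataclass, field

RUBRIC = ("quality", "specificity", "relevance", "accuracy")  # criteria from the study

@dataclass
class BenchmarkItem:
    question: str
    level: str                      # e.g., "graduate" or "expert" (assumed labels)
    answer: str = ""
    scores: dict = field(default_factory=dict)

def run_benchmark(items, ask_model, score_answer):
    """ask_model(question) -> model's free-text answer;
    score_answer(item, criterion) -> expert rating (scale is an assumption)."""
    for item in items:
        item.answer = ask_model(item.question)
        item.scores = {c: score_answer(item, c) for c in RUBRIC}
    return items

if __name__ == "__main__":
    items = [BenchmarkItem("Explain the role of purge steps in an ALD cycle.", "graduate")]
    results = run_benchmark(
        items,
        ask_model=lambda q: "(model response)",   # replace with a call to an LLM API
        score_answer=lambda it, c: 3,             # replace with human expert grading
    )
    for r in results:
        print(r.level, r.scores)
```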

“Open response questions helped us test different facets of the response generation beyond accuracy that are really important for scientific applications,” said Yanguas-Gil.

Source: “Benchmarking large language models for materials synthesis: the case of atomic layer deposition,” by Angel Yanguas-Gil, Matthew T. Dearing, Jeffrey W. Elam, Jessica C. Jones, Sungjoon Kim, Adnan Mohammad, Chi Thang Nguyen, and Bratin Sengupta, Journal of Vacuum Science & Technology A (2025). The article can be accessed at https://doi.org/10.1116/6.0004319.

This paper is part of the Artificial Intelligence and Machine Learning for Materials Discovery, Synthesis and Characterization Collection.
