
Mining experimental data from materials science literature with large language models: an evaluation study

Article: 2356506 | Received 19 Jan 2024, Accepted 06 May 2024, Published online: 08 Jul 2024

References

  • Xu P, Ji X, Li M, et al. Small data machine learning in materials science. Npj Comput Mater. 2023 Mar;9(1):42. doi: 10.1038/s41524-023-01000-z
  • Boyd PG, Chidambaram A, García-Díez E, et al. Data-driven design of metal–organic frameworks for wet flue gas CO2 capture. Nature. 2019;576(7786):253–256. doi: 10.1038/s41586-019-1798-7
  • Rao Z, Tung P-Y, Xie R, et al. Machine learning–enabled high-entropy alloy discovery. Science. 2022;378(6615):78–85. doi: 10.1126/science.abo4940
  • Zakutayev A, Wunder N, Schwarting M, et al. An open experimental database for exploring inorganic materials. Sci Data. 2018;5(1):1–12. doi: 10.1038/sdata.2018.53
  • Doan Huan T, Mannodi-Kanakkithodi A, Kim C, et al. A polymer dataset for accelerated property prediction and design. Sci Data. 2016;3(1):1–10. doi: 10.1038/sdata.2016.12
  • Pyzer-Knapp EO, Pitera JW, Staar PW, et al. Accelerating materials discovery using artificial intelligence, high performance computing and robotics. Npj Comput Mater. 2022;8(1):84. doi: 10.1038/s41524-022-00765-z
  • Huber N, Kalidindi SR, Klusemann B, et al. Machine learning and data mining in materials science. Front Mater. 2020;7:51. doi: 10.3389/fmats.2020.00051
  • Park G, Pouchard L. Advances in scientific literature mining for interpreting materials characterization. Mach Learn Sci Technol. 2021;2(4):045007. doi: 10.1088/2632-2153/abf751
  • Chittam S, Gokaraju B, Xu Z, et al. Big data mining and classification of intelligent material science data using machine learning. Appl Sci. 2021;11(18):8596. doi: 10.3390/app11188596
  • Ma B, Wei X, Liu C, et al. Data augmentation in microscopic images for material data mining. Npj Comput Mater. 2020;6(1):125. doi: 10.1038/s41524-020-00392-6
  • Parinov IA. Microstructure and properties of high-temperature superconductors. Springer Science & Business Media; 2013.
  • Hosono H, Tanabe K, Takayama-Muromachi E, et al. Exploration of new superconductors and functional materials, and fabrication of superconducting tapes and wires of iron pnictides. Sci Technol Adv Mater. 2015;16(3):033503. doi: 10.1088/1468-6996/16/3/033503
  • Mydeen K, Jesche A, Meier-Kirchner K, et al. Electron doping of the iron-arsenide superconductor CeFeAsO controlled by hydrostatic pressure. Phys Rev Lett. 2020;125(20):207001. doi: 10.1103/PhysRevLett.125.207001
  • Bardeen J, Cooper LN, Robert Schrieffer J. Theory of superconductivity. Phys Rev. 1957;108(5):1175. doi: 10.1103/PhysRev.108.1175
  • Zhang C, Zhang C, Li C, et al. One small step for generative AI, one giant leap for AGI: a complete survey on ChatGPT in AIGC era. arXiv preprint arXiv:2304.06488. 2023.
  • Yao S, Yu D, Zhao J, et al. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601. 2023.
  • Valmeekam K, Marquez M, Sreedharan S, et al. On the planning abilities of large language models–a critical investigation. arXiv preprint arXiv:2305.15771. 2023.
  • Sun S, Liu Y, Wang S, et al. PEARL: Prompting large language models to plan and execute actions over long documents. arXiv preprint arXiv:2305.14564. 2023.
  • OpenAI. Models. 2024 [cited 2024 Jan 4]. Available from: https://platform.openai.com/docs/models
  • Kocoń J, Cichecki I, Kaszyca O, et al. ChatGPT: Jack of all trades, master of none. Inf Fusion. 2023 Nov;99:101861. doi: 10.1016/j.inffus.2023.101861
  • Ma Y, Cao Y, Hong Y, et al. Large language model is not a good few-shot information extractor, but a good reranker for hard samples! arXiv preprint arXiv:2303.08559. 2023.
  • González-Gallardo C-E, Boros E, Girdhar N, et al. Yes but.. can ChatGPT identify entities in historical documents? arXiv preprint arXiv:2303.17322. 2023.
  • Moradi M, Blagec K, Haberl F, et al. GPT-3 models are poor few-shot learners in the biomedical domain. arXiv preprint arXiv:2109.02555. 2021.
  • Hatakeyama-Sato K, Yamane N, Igarashi Y, et al. Prompt engineering of GPT-4 for chemical research: what can/cannot be done? Sci Technol Adv Mater. 2023;3(1):2260300. doi: 10.1080/27660400.2023.2260300
  • Hatakeyama-Sato K, Watanabe S, Yamane N, et al. Using GPT-4 in parameter selection of polymer informatics: improving predictive accuracy amidst data scarcity and ‘ugly duckling’ dilemma. Digital Discov. 2023;2(5):1548–1557. doi: 10.1039/D3DD00138E
  • Nadeau D, Sekine S. A survey of named entity recognition and classification. Lingvisticae Investigationes. 2007;30(1):3–26. doi: 10.1075/li.30.1.03nad
  • Foppiano L, Castro P, Suarez P, et al. Automatic extraction of materials and properties from superconductors scientific literature. Sci Technol Adv Mater. 2023;3(1). doi: 10.1080/27660400.2022.2153633
  • Foppiano L, Romary L, Ishii M, et al. Automatic identification and normalisation of physical measurements in scientific literature. In: Proceedings of the ACM Symposium on Document Engineering 2019, DocEng ’19. New York, NY, USA: Association for Computing Machinery; 2019.
  • Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Inui K, Jiang J, Ng V, Wan X, editors. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019 Nov. p. 3982–3992.
  • Harper C, Cox J, Kohler C, et al. SemEval-2021 task 8: MeasEval – extracting counts and measurements and their related contexts. In: Palmer A, Schneider N, Schluter N, Emerson G, Herbelot A, Zhu X, editors. Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). Online: Association for Computational Linguistics; 2021 Aug. p. 306–316.
  • Foppiano L, Dieb T, Suzuki A, et al. SuperMat: construction of a linked annotated dataset from superconductors-related publications. Sci Technol Adv Mater. 2021;1(1):34–44. doi: 10.1080/27660400.2021.1918396
  • Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019 Nov. p. 3615–3620.
  • Ratcliff JW. Pattern matching: the gestalt approach. 1988 [cited 2024 Jan 4]. Available from: https://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970?pgno=5
  • Taylor R, Kardas M, Cucurull G, et al. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085. 2022.
  • Mullick A, Ghosh A, Chaitanya GS, et al. MatSciRE: Leveraging pointer networks to automate entity and relation extraction for material science knowledge-base construction. Comput Mater Sci. 2024;233:112659. doi: 10.1016/j.commatsci.2023.112659