Research Article

A large-scale Chinese patent dataset for information extraction

Article: 2365328 | Received 17 Jan 2024, Accepted 30 May 2024, Published online: 13 Jun 2024
