A large-scale Chinese patent dataset for information extraction

Qian Zheng, Zhengzhou University of Light Industry, Zhengzhou, People’s Republic of China

Kefu Guo, Zhengzhou University of Light Industry, Zhengzhou, People’s Republic of China

Lin Xu, Chengdu University of Traditional Chinese Medicine, Chengdu, People’s Republic of China. Correspondence: [email protected]

Article: 2365328 | Received 17 Jan 2024, Accepted 30 May 2024, Published online: 13 Jun 2024

https://doi.org/10.1080/21642583.2024.2365328


References

Akhondi, S. A., Hettne, K. M., van der Horst, E., van Mulligen, E. M., & Kors, J. A. (2015). Recognition of chemical entities: Combining dictionary-based and grammar-based approaches. Journal of Cheminformatics, 7(Suppl 1). https://doi.org/10.1186/1758-2946-7-S1-S10
Akhondi, S. A., Klenner, A. G., Tyrchan, C., Manchala, A. K., Boppana, K., Lowe, D., Zimmermann, M., Jagarlapudi, S. A., Sayle, R., & Kors, J. A. (2014). Annotated chemical patent corpus: A gold standard for text mining. PLoS One, 9(9), e107477. https://doi.org/10.1371/journal.pone.0107477
Akhondi, S. A., Pons, E., Afzal, Z., van Haagen, H., Becker, B. F., Hettne, K. M., van Mulligen, E. M., & Kors, J. A. (2016). Chemical entity recognition in patents by combining dictionary-based and statistical approaches. Database, 2016, baw061. https://doi.org/10.1093/database/baw061
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., ... Amodei, D. (2020). Language models are few-shot learners. In NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems (pp. 1877–1901). ACM.
Chen, L., Xu, S., Shang, W., Wang, Z., Wei, C., & Xu, H. (2020a). What is special about patent information extraction? In Proceedings of the EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (pp. 63–72).
Chen, L., Xu, S., Zhu, L., Zhang, J., Lei, X., & Yang, G. (2020b). A deep learning based method for extracting semantic information from patent documents. Scientometrics, 125(1), 289–312. https://doi.org/10.1007/s11192-020-03634-y
Cho, H. P., Lim, H., Lee, D., Cho, H., & Kang, K.-I. (2018). Patent analysis for forecasting promising technology in high-rise building construction. Technological Forecasting and Social Change, 128, 144–153. https://doi.org/10.1016/j.techfore.2017.11.012
Choi, S., Kim, H., Yoon, J., Kim, K., & Lee, J. Y. (2013). An SAO-based text-mining approach for technology roadmapping using patent information. R&D Management, 43(1), 52–74. https://doi.org/10.1111/j.1467-9310.2012.00702.x
Choi, S.-J., Lee, H., Park, E., & Choi, S. (2019). Deep patent landscaping model using transformer and graph embedding. ArXiv 2019, https://arxiv.org/abs/1903.05823
Cui, Y., Che, W., Liu, T., Qin, B., & Yang, Z. (2021). Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3504–3514. https://doi.org/10.1109/TASLP.2021.3124365
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv 2018, https://arxiv.org/abs/1810.04805
Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.
Grishman, R. (2015). Information extraction. IEEE Intelligent Systems, 30(5), 8–15. https://doi.org/10.1109/MIS.2015.68
Guo, J., Wang, X., Li, Q., & Zhu, D. (2016). Subject–action–object-based morphology analysis for determining the direction of technological change. Technological Forecasting and Social Change, 105, 27–40. https://doi.org/10.1016/j.techfore.2016.01.028
He, J., Nguyen, D. Q., Akhondi, S. A., Druckenbrodt, C., Thorne, C., Hoessel, R., Afzal, Z., Zhai, Z., Fang, B., Yoshikawa, H., Albahem, A., Cavedon, L., Cohn, T., Baldwin, T., & Verspoor, K. (2021). ChEMU 2020: Natural language processing methods are effective for information extraction from chemical patents. Frontiers in Research Metrics and Analytics, 6, 654438. https://doi.org/10.3389/frma.2021.654438
Hemati, W., & Mehler, A. (2019). LSTMVoter: Chemical named entity recognition using a conglomerate of sequence labeling tools. Journal of Cheminformatics, 11(1), 3. https://doi.org/10.1186/s13321-018-0327-2
Hettne, K. M., Stierum, R. H., Schuemie, M. J., Hendriksen, P. J., Schijvenaars, B. J., Mulligen, E. M., Kleinjans, J., & Kors, J. A. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics (Oxford, England), 25(22), 2983–2991. https://doi.org/10.1093/bioinformatics/btp535
Hong, Z., Ward, L., Chard, K., Blaiszik, B., & Foster, I. (2021). Challenges and advances in information extraction from scientific literature: A review. JOM Journal of the Minerals Metals and Materials Society, 73(11), 3383–3400. https://doi.org/10.1007/s11837-021-04902-9
Jiang, S., Sarica, S., Song, B., Hu, J., & Luo, J. (2022). Patent data for engineering design: A critical review and future directions. Journal of Computing and Information Science in Engineering, 22, 1–48. https://doi.org/10.1115/1.4054802
Klinger, R., Kolářik, C., Fluck, J., Hofmann-Apitius, M., & Friedrich, C. M. (2008). Detection of IUPAC and IUPAC-like chemical names. Bioinformatics (Oxford, England), 24(13), i268–i276. https://doi.org/10.1093/bioinformatics/btn181
Kruiper, R., Vincent, J. F., Chen-Burger, J., Desmulliez, M. P., & Konstas, I. (2020). A scientific information extraction dataset for nature inspired engineering. ArXiv 2020, https://arxiv.org/abs/2005.07753
Leaman, R., Wei, C.-H., Zou, C., & Lu, Z. (2016). Mining chemical patents with an ensemble of open systems. Database, 2016, baw065. https://doi.org/10.1093/database/baw065
Lee, J.-S. (2019). PatentTransformer: A framework for personalized patent claim generation. In Proceedings of the JURIX (Doctoral Consortium).
Lee, J.-S., & Hsiang, J. (2020a). Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information, 62, 101983. https://doi.org/10.1016/j.wpi.2020.101983
Lee, J.-S., & Hsiang, J. (2020b). Patent classification by fine-tuning BERT language model. World Patent Information, 61, 101965. https://doi.org/10.1016/j.wpi.2020.101965
Li, Y., Bontcheva, K., & Cunningham, H. (2004). SVM based learning system for information extraction. In Proceedings of the International Conference on Deterministic & Statistical Methods in Machine Learning.
Li, S., He, W., Shi, Y., Jiang, W., Liang, H., Jiang, Y., Zhang, Y., Lyu, Y., & Zhu, Y. (2019). DuIE: A large-scale Chinese dataset for information extraction. In Proceedings of the CCF International Conference on Natural Language Processing and Chinese Computing (pp. 791–800).
Liu, H., Christiansen, T., Baumgartner, W. A., Jr., & Verspoor, K. (2012). BioLemmatizer: A lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics, 3(1), 3. https://doi.org/10.1186/2041-1480-3-3
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. ArXiv 2019, https://arxiv.org/abs/1907.11692
Liu, C., Sun, W., Chao, W., & Che, W. (2013). Convolution neural network for relation extraction. In Proceedings of the International Conference on Advanced Data Mining and Applications (pp. 231–242).
McCallum, A., Freitag, D., & Pereira, F. (2001). Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the International Conference on Machine Learning (ICML).
Park, H., Yoon, J., & Kim, K. (2012). Identifying patent infringement using SAO based semantic technological similarities. Scientometrics, 90(2), 515–529. https://doi.org/10.1007/s11192-011-0522-7
Rebholz-Schuhmann, D., Kirsch, H., Arregui, M., Gaudan, S., Riethoven, M., & Stoehr, P. (2007). EBIMed–text crunching to gather facts for proteins from Medline. Bioinformatics (Oxford, England), 23(2), e237–e244. https://doi.org/10.1093/bioinformatics/btl302
Risch, J., Alder, N., Hewel, C., & Krestel, R. (2020). PatentMatch: A dataset for matching patent claims & prior art. ArXiv 2020, https://arxiv.org/abs/2012.13919v1
Saad, F., Aras, H., & Hackl-Sommer, R. (2020). Improving named entity recognition for biomedical and patent data using Bi-LSTM deep neural network models. In E. Métais, F. Meziane, H. Horacek, & P. Cimiano (Eds.), Natural Language Processing and Information Systems. NLDB 2020. Lecture Notes in Computer Science (Vol. 12089, pp. 25–36). Springer. https://doi.org/10.1007/978-3-030-51310-8_3
Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. ArXiv 2003, https://arxiv.org/abs/cs/0306050
Seymore, K., McCallum, A., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction (pp. 37–42).
Shi, P., & Lin, J. (2019). Simple BERT models for relation extraction. ArXiv 2019, https://arxiv.org/abs/1904.05255
Singh, S. (2018). Natural language processing for information extraction.
Son, J., Moon, H., Lee, J., Lee, S., Park, C., Jung, W., & Lim, H. (2022). AI for patents: A novel yet effective and efficient framework for patent analysis. IEEE Access, 10, 59205–59218. https://doi.org/10.1109/ACCESS.2022.3176877
Tseng, Y.-H., Lin, C.-J., & Lin, Y.-I. (2007). Text mining techniques for patent analysis. Information Processing & Management, 43(5), 1216–1247. https://doi.org/10.1016/j.ipm.2006.11.011
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA (pp. 5998–6008). http://arxiv.org/abs/1706.03762
Wu, H. (2019). Report of 2019 language & intelligence technique evaluation. Baidu Corporation.
Yoon, J., & Kim, K. (2012). Trendperceptor: A property-function based technology intelligence system for identifying technology trends from patents. Expert Systems with Applications, 39(3), 2927–2938. https://doi.org/10.1016/j.eswa.2011.08.154
Zhai, Z., Nguyen, D. Q., Akhondi, S., Thorne, C., Druckenbrodt, C., Cohn, T., Gregory, M., & Verspoor, K. (2019). Improving chemical named entity recognition in patents with contextualized word embeddings, Florence, Italy, August (pp. 328–338).
Zhang, L., Li, L., & Li, T. (2015). Patent mining: A survey. ACM SIGKDD Explorations Newsletter, 16(2), 1–19. https://doi.org/10.1145/2783702.2783704
Zhang, Y., Xu, J., Chen, H., Wang, J., Wu, Y., Prakasam, M., & Xu, H. (2016). Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning. Database, 2016, baw049. https://doi.org/10.1093/database/baw049
