A novel framework for Chinese personal sensitive information detection

Chenglong Ren (a), Xiao Lan (b, corresponding author), Xingshu Chen (a, b), Yonggang Luo (b), Shuhua Ruan (a, b, corresponding author)

(a) School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China
(b) Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China

Article: 2298310 | Received 23 May 2023, Accepted 19 Dec 2023, Published online: 03 Jan 2024

https://doi.org/10.1080/09540091.2023.2298310

References

Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919.
Anand, S., Shukla, M., & Lodha, S. (2023). Detecting sensitive information from unstructured text in a data-constrained environment. 15th International Conference on COMmunication Systems & NETworkS (COMSNETS). IEEE, 159–164.
Beckwith, B. A., Mahaadevan, R., Balis, U. J., & Kuo, F. (2006). Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Medical Informatics and Decision Making, 6(1), 1–9. https://doi.org/10.1186/1472-6947-6-12
Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. Advances in Neural Information Processing Systems, 13.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. https://doi.org/10.1109/72.279181
Cheng, L., Liu, F., & Yao, D. (2017). Enterprise data breach: Causes, challenges, prevention, and future directions. WIRES Data Mining and Knowledge Discovery, 7(5), e1211. https://doi.org/10.1002/widm.1211
Clark, K., Luong, M. T., Le, Q. V., & Manning, C. D. (2020). Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
Dai, X., & Adel, H. (2020). An analysis of simple data augmentation for named entity recognition. arXiv preprint arXiv:2010.11683.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Ding, M., Wang, X., Wu, C., Wang, K., & Yang, X. (2021). Research on automated detection of sensitive information based on BERT. Journal of Physics: Conference Series, 1757(1), 012088.
Friedlin, F. J., & McDonald, C. J. (2008). A software tool for removing patient identifying information from clinical documents. Journal of the American Medical Informatics Association, 15(5), 601–610. https://doi.org/10.1197/jamia.M2702
Gambarelli, G., Gangemi, A., & Tripodi, R. (2022). Is your model sensitive? SPeDaC: A new benchmark for detecting and classifying sensitive personal data. arXiv preprint arXiv:2208.06216.
Gambarelli, G., Gangemi, A., & Tripodi, R. (2023). Is your model sensitive? SPEDAC: A new resource for the automatic classification of sensitive personal data. IEEE Access, 11, 10864–10880. https://doi.org/10.1109/ACCESS.2023.3240089
Gan, C., Feng, Q., & Zhang, Z. (2021). Scalable multi-channel dilated CNN–BiLSTM model with attention mechanism for Chinese textual sentiment analysis. Future Generation Computer Systems, 118, 297–309. https://doi.org/10.1016/j.future.2021.01.024
García-Pablos, A., Perez, N., & Cuadros, M. (2020). Sensitive data detection and classification in Spanish clinical text: Experiments with BERT. arXiv preprint arXiv:2003.03106.
Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Guo, Y., Liu, J., Tang, W., & Huang, C. (2021). Exsense: Extract sensitive information from unstructured data. Computers & Security, 102, 102156. https://doi.org/10.1016/j.cose.2020.102156
He, J., & Wang, H. (2008). Chinese named entity recognition and word segmentation based on character. Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing.
Huang, C., & Zhao, Q. (2022). Sensitive information detection method based on Attention-based ELMo. Computer Applications, 42(7), 2009.
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Jiang, L., & Wang, D. (2016). Research on automatic extraction of domain terms using continuous word bag model (CBOW). New Technology of Library and Information Service, 2, 9–15.
Kužina, V., Petric, A. M., Barišić, M., Barišić, M., & Jović, A. (2023). Cassed: Context-based approach for structured sensitive data detection. Expert Systems With Applications, 223, 119924. https://doi.org/10.1016/j.eswa.2023.119924
Kužina, V., Vušak, E., & Jović, A. (2021). Methods for automatic sensitive data detection in large datasets: A review. 44th international convention on information, communication and electronic technology (MIPRO). IEEE, 187–192.
Li, B., Hou, Y., & Che, W. (2022). Data augmentation approaches in natural language processing: A survey. AI Open, 3, 71–90. https://doi.org/10.1016/j.aiopen.2022.03.001
Li, H., Luo, T., Huang, J., & Zhang, X. (2018). Jieba Chinese Word Segmentation Library. https://github.com/fxsjy/jieba
Lin, Y., Xu, G., Xu, G., Chen, Y., & Sun, D. (2020). Sensitive information detection based on convolution neural network and bi-directional lstm. IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). IEEE, 1614–1621.
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.
Liu, P., Guo, Y., Wang, F., & Li, G. (2022). Chinese named entity recognition: The state of the art. Neurocomputing, 473, 37–53. https://doi.org/10.1016/j.neucom.2021.10.101
Madan, A., George, A. M., Singh, A., & Bhatia, M. P. S. (2018). Redaction of protected health information in EHRs using CRFs and bi-directional LSTMs. 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO). IEEE, 513–517.
Mayer, P., Zou, Y., Schaub, F., & Aviv, A. J. (2021). “Now I'm a bit angry:” Individuals’ awareness, perception, and responses to data breaches that affected them. USENIX Security Symposium, 393–410.
Nadeau, D., & Sekine, S. (2007). Named entities: Recognition, classification and use. Lingvisticae Investigationes, 30(1), 3–26. https://doi.org/10.1075/li.30.1.03nad
Neamatullah, I., Douglass, M. M., Lehman, L. W. H., Reisner, A., Villarroel, M., Long, W. J., Szolovits, P., Moody, G. B., Mark, R. G., & Clifford, G. D. (2008). Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 8(1), 1–17. https://doi.org/10.1186/1472-6947-8-32
Neerbek, J., Assent, I., & Dolog, P. (2018). Detecting complex sensitive information via phrase structure in recursive neural networks. Advances in Knowledge Discovery and Data Mining: 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3–6, 2018, Proceedings, Part III. Springer International Publishing, 373–385.
Neerbek, J., Assent, I., & Dolog, P. (2017). TABOO: Detecting unstructured sensitive information using recursive neural networks. IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 1399–1400.
Pedersen, J. S., Laursen, M. S., Soguero-Ruiz, C., Savarimuthu, T. R., Hansen, R. S., & Vinholt, P. J. (2022). Domain over size: Clinical ELECTRA surpasses general BERT for bleeding site classification in the free text of electronic health records. IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 1–14.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. North American Chapter of the Association for Computational Linguistics.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, Y., Shen, X., & Yang, Y. (2020). The classification of Chinese sensitive information based on BERT-CNN. Artificial Intelligence Algorithms and Applications: 11th International Symposium, ISICA 2019, Guangzhou, People’s Republic of China, November 16–17, 2019, Revised Selected Papers. Springer Singapore, 269–280.
Xu, G., Qi, C., Yu, H., Xu, S., Zhao, C., & Yuan, J. (2019). Detecting sensitive information of unstructured text using convolutional neural network. International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). IEEE, 474–479.
Yang, Y., Shen, X., & Wang, Y. (2020). BERT-BiLSTM-CRF for Chinese sensitive vocabulary recognition. Artificial Intelligence Algorithms and Applications: 11th International Symposium, ISICA 2019, Guangzhou, People’s Republic of China, November 16–17, 2019, Revised Selected Papers. Springer Singapore, 257–268.
Zhang, S., Yu, H., & Zhu, G. (2022). An emotional classification method of Chinese short comment text based on ELECTRA. Connection Science, 34(1), 254–273. https://doi.org/10.1080/09540091.2021.1985968
Zhang, Y., & Yang, J. (2018). Chinese NER using lattice LSTM. arXiv preprint arXiv:1805.02023.
