A novel framework for Chinese personal sensitive information detection

Article: 2298310 | Received 23 May 2023, Accepted 19 Dec 2023, Published online: 03 Jan 2024

References

  • Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919.
  • Anand, S., Shukla, M., & Lodha, S. (2023). Detecting sensitive information from unstructured text in a data-constrained environment. 15th International Conference on COMmunication Systems & NETworkS (COMSNETS). IEEE, 159–164.
  • Beckwith, B. A., Mahaadevan, R., Balis, U. J., & Kuo, F. (2006). Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Medical Informatics and Decision Making, 6(1), 1–9. https://doi.org/10.1186/1472-6947-6-12
  • Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. Advances in Neural Information Processing Systems, 13.
  • Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. https://doi.org/10.1109/72.279181
  • Cheng, L., Liu, F., & Yao, D. (2017). Enterprise data breach: Causes, challenges, prevention, and future directions. WIREs Data Mining and Knowledge Discovery, 7(5), e1211. https://doi.org/10.1002/widm.1211
  • Clark, K., Luong, M. T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
  • Dai, X., & Adel, H. (2020). An analysis of simple data augmentation for named entity recognition. arXiv preprint arXiv:2010.11683.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Ding, M., Wang, X., Wu, C., Wang, K., & Yang, X. (2021). Research on automated detection of sensitive information based on BERT. Journal of Physics: Conference Series, 1757(1), 012088.
  • Friedlin, F. J., & McDonald, C. J. (2008). A software tool for removing patient identifying information from clinical documents. Journal of the American Medical Informatics Association, 15(5), 601–610. https://doi.org/10.1197/jamia.M2702
  • Gambarelli, G., Gangemi, A., & Tripodi, R. (2022). Is your model sensitive? SPeDaC: A new benchmark for detecting and classifying sensitive personal data. arXiv preprint arXiv:2208.06216.
  • Gambarelli, G., Gangemi, A., & Tripodi, R. (2023). Is your model sensitive? SPeDaC: A new resource for the automatic classification of sensitive personal data. IEEE Access, 11, 10864–10880. https://doi.org/10.1109/ACCESS.2023.3240089
  • Gan, C., Feng, Q., & Zhang, Z. (2021). Scalable multi-channel dilated CNN–BiLSTM model with attention mechanism for Chinese textual sentiment analysis. Future Generation Computer Systems, 118, 297–309. https://doi.org/10.1016/j.future.2021.01.024
  • García-Pablos, A., Perez, N., & Cuadros, M. (2020). Sensitive data detection and classification in Spanish clinical text: Experiments with BERT. arXiv preprint arXiv:2003.03106.
  • Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
  • Guo, Y., Liu, J., Tang, W., & Huang, C. (2021). Exsense: Extract sensitive information from unstructured data. Computers & Security, 102, 102156. https://doi.org/10.1016/j.cose.2020.102156
  • He, J., & Wang, H. (2008). Chinese named entity recognition and word segmentation based on character. Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing.
  • Huang, C., & Zhao, Q. (2022). Sensitive information detection method based on Attention-based ELMo. Computer Applications, 42(7), 2009.
  • Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Jiang, L., & Wang, D. (2016). Research on automatic extraction of domain terms using continuous word bag model (CBOW). New Technology of Library and Information Service, 2, 9–15.
  • Kužina, V., Petric, A. M., Barišić, M., & Jović, A. (2023). CASSED: Context-based approach for structured sensitive data detection. Expert Systems With Applications, 223, 119924. https://doi.org/10.1016/j.eswa.2023.119924
  • Kužina, V., Vušak, E., & Jović, A. (2021). Methods for automatic sensitive data detection in large datasets: A review. 44th international convention on information, communication and electronic technology (MIPRO). IEEE, 187–192.
  • Li, B., Hou, Y., & Che, W. (2022). Data augmentation approaches in natural language processing: A survey. AI Open, 3, 71–90. https://doi.org/10.1016/j.aiopen.2022.03.001
  • Li, H., Luo, T., Huang, J., & Zhang, X. (2018). Jieba Chinese word segmentation library. https://github.com/fxsjy/jieba
  • Lin, Y., Xu, G., Xu, G., Chen, Y., & Sun, D. (2020). Sensitive information detection based on convolution neural network and bi-directional LSTM. IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). IEEE, 1614–1621.
  • Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.
  • Liu, P., Guo, Y., Wang, F., & Li, G. (2022). Chinese named entity recognition: The state of the art. Neurocomputing, 473, 37–53. https://doi.org/10.1016/j.neucom.2021.10.101
  • Madan, A., George, A. M., Singh, A., & Bhatia, M. P. S. (2018). Redaction of protected health information in EHRs using CRFs and bi-directional LSTMs. 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO). IEEE, 513–517.
  • Mayer, P., Zou, Y., Schaub, F., & Aviv, A. J. (2021). “Now I'm a bit angry:” Individuals’ awareness, perception, and responses to data breaches that affected them. USENIX Security Symposium, 393–410.
  • Nadeau, D., & Sekine, S. (2007). Named entities: Recognition, classification and use. Lingvisticae Investigationes, 30(1), 3–26. https://doi.org/10.1075/li.30.1.03nad
  • Neamatullah, I., Douglass, M. M., Lehman, L. W. H., Reisner, A., Villarroel, M., Long, W. J., Szolovits, P., Moody, G. B., Mark, R. G., & Clifford, G. D. (2008). Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 8(1), 1–17. https://doi.org/10.1186/1472-6947-8-32
  • Neerbek, J., Assent, I., & Dolog, P. (2018). Detecting complex sensitive information via phrase structure in recursive neural networks. Advances in Knowledge Discovery and Data Mining: 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3–6, 2018, Proceedings, Part III 22. Springer International Publishing, 373–385.
  • Neerbek, J., Assent, I., & Dolog, P. (2017). TABOO: Detecting unstructured sensitive information using recursive neural networks. IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 1399–1400.
  • Pedersen, J. S., Laursen, M. S., Soguero-Ruiz, C., Savarimuthu, T. R., Hansen, R. S., & Vinholt, P. J. (2022). Domain over size: Clinical ELECTRA surpasses general BERT for bleeding site classification in the free text of electronic health records. IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 1–14.
  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. North American Chapter of the Association for Computational Linguistics.
  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  • Wang, Y., Shen, X., & Yang, Y. (2020). The classification of Chinese sensitive information based on BERT-CNN. Artificial Intelligence Algorithms and Applications: 11th International Symposium, ISICA 2019, Guangzhou, People’s Republic of China, November 16–17, 2019, Revised Selected Papers 11. Springer Singapore, 269–280.
  • Xu, G., Qi, C., Yu, H., Xu, S., Zhao, C., & Yuan, J. (2019). Detecting sensitive information of unstructured text using convolutional neural network. International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). IEEE, 474–479.
  • Yang, Y., Shen, X., & Wang, Y. (2020). BERT-BiLSTM-CRF for Chinese sensitive vocabulary recognition. Artificial Intelligence Algorithms and Applications: 11th International Symposium, ISICA 2019, Guangzhou, People’s Republic of China, November 16–17, 2019, Revised Selected Papers 11. Springer Singapore, 257–268.
  • Zhang, S., Yu, H., & Zhu, G. (2022). An emotional classification method of Chinese short comment text based on ELECTRA. Connection Science, 34(1), 254–273. https://doi.org/10.1080/09540091.2021.1985968
  • Zhang, Y., & Yang, J. (2018). Chinese NER using lattice LSTM. arXiv preprint arXiv:1805.02023.