Special Issue: Software Quality, Reliability and Security

KG4Py: A toolkit for generating Python knowledge graph and code semantic search

Pages 1384-1400 | Received 14 Feb 2022, Accepted 26 Apr 2022, Published online: 11 May 2022
 

Abstract

In the era of big data, the Internet holds numerous duplicate code snippets, and it is especially necessary to make use of them when building new software projects. In this paper, we present KG4Py, a toolkit for generating a knowledge graph of Python files in GitHub repositories and conducting semantic search over that knowledge graph. In KG4Py, we remove all duplicate files from a set of 317K Python files and perform static code analysis of these files using a concrete syntax tree (CST) to build a code knowledge graph of Python functions. We integrate a pre-trained model with an unsupervised model to generate a new model, and combine this new model with the code knowledge graph to search for code snippets using natural language descriptions. The experimental results show that KG4Py achieves good performance in both the construction of the code knowledge graph and the semantic search of code snippets.
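
To illustrate the first two steps described above (removing duplicate files, then statically analysing functions to obtain graph facts), here is a minimal sketch. It is not the KG4Py implementation. It assumes that duplicates are exact copies detectable by a content hash, and it uses Python's standard-library `ast` module as a stand-in for the concrete syntax tree used in the paper. The helper names `deduplicate` and `function_triples` and the relation labels `defines`, `has_parameter`, and `has_docstring` are hypothetical.

```python
import ast
import hashlib


def deduplicate(sources):
    """Keep the first file seen for each distinct content hash.

    Assumption: a "duplicate" is a byte-for-byte identical file.
    """
    seen = set()
    unique = {}
    for path, text in sources.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[path] = text
    return unique


def function_triples(path, text):
    """Return (subject, relation, object) triples for each function in a file.

    The relation labels are illustrative, not taken from the paper.
    """
    triples = []
    for node in ast.walk(ast.parse(text)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            name = f"{path}::{node.name}"
            triples.append((path, "defines", name))
            for arg in node.args.args:
                triples.append((name, "has_parameter", arg.arg))
            doc = ast.get_docstring(node)
            if doc:
                triples.append((name, "has_docstring", doc))
    return triples


# Two files with identical content: only the first should survive.
files = {
    "a.py": 'def add(x, y):\n    """Add two numbers."""\n    return x + y\n',
    "b.py": 'def add(x, y):\n    """Add two numbers."""\n    return x + y\n',
}
unique = deduplicate(files)
graph = [t for path, text in unique.items() for t in function_triples(path, text)]
```

In a sketch like this, the docstring triples are the natural place to attach natural language descriptions, which a semantic search model could then compare against a user's query.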

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work is supported by the Xinjiang Tianshan Youth Project of China [grant number 2020Q019], the National Natural Science Foundation of China [grant number 61562087], and the Doctoral Scientific Research Foundation of Xinjiang Normal University [grant number XJNUBS1905].