| CPC G06F 8/35 (2013.01) | 12 Claims |

|
1. A repository-level semantic graph system stored in a computer-readable memory for repository-level code completion, the repository-level semantic graph system comprising:
a source code initialization module, which is used to generate a query;
a source code repository module that is a systematically organized and stored collection of source code and that generates a context;
a Large Language Model (LLM) encoding module configured to receive the query, and embed the context for the query to generate a corresponding node embedding;
a repository-level semantic graph that is the center of the repository-level semantic graph system and is constructed to encapsulate the broad context of a code repository based on the source code repository module;
wherein construction of the repository-level semantic graph is performed according to the following steps:
i) initializing the models;
wherein the models are initialized by importing all necessary libraries and the models, ensuring that necessary infrastructure such as an Abstract Syntax Tree (AST), a call graph generator, or a class hierarchy parser is in place;
ii) loading and parsing code files after step i), which comprises:
receiving as input a directory path containing the code files;
parsing each code file in the directory path with tree-sitter to extract the abstract syntax tree (AST); identifying and extracting functions, classes, and methods from the AST; and storing each element with its name, context (body), parameters, and location;
generating extracted code elements (functions, classes, methods, scripts);
iii) generating nodes for code elements;
wherein the extracted code elements (functions, classes, methods, scripts) from step ii) are received as input and an empty graph structure is initialized;
wherein a node is generated in the graph for each extracted code element, and metadata such as name, parameters, and context is attached to the node;
iv) defining relationships between the nodes, which is done after generating the nodes in step iii);
wherein defining the relationships between the nodes comprises initializing dictionaries for the different types of relationships: import, invoke, ownership, encapsulate, class hierarchy;
wherein, for each code element: imports are identified and edges are created for imported elements; function/method calls are identified and call edges are created via the call graph generator; parent-child relationships between classes and methods are identified and ownership edges are created; file-level script scopes are identified and encapsulation edges are created; and class inheritance is identified and class hierarchy edges are created;
generating a code context of nodes;
v) generating embeddings for the code elements;
receiving as input the code context of nodes from step iv);
processing each node in the graph using an encoder-only embedding model such as CodeT5 or UniXcoder to generate an embedding representing the semantic meaning of the code element, used later for comparison and for retrieval of relevant contexts, and storing the embeddings in a dictionary mapped to the node;
generating an embedding for each node; and
vi) constructing the graph in the empty graph structure, receiving as input the embedding of each node from step v);
wherein the graph is constructed by adding all nodes and edges to the graph based on the defined relationships, ensuring that all nodes are connected according to the defined relationships (import, call, own, encapsulate, class hierarchy);
wherein, the repository-level semantic graph comprises:
a search module configured to receive as input the node embedding, and search for nodes that are similar to the node embedding to generate a set of anchor nodes in the repository-level semantic graph;
an expansion module configured to receive as input the set of anchor nodes, and expand nodes in the set of anchor nodes to generate a set of expanded nodes;
an update module configured to receive as input the set of expanded nodes, and update new context information for each node in the set of expanded nodes to generate a set of updated node embeddings;
a re-ranking module configured to:
receive as input the set of updated node embeddings and the node embedding;
re-update the new context information when integrating the node embedding into the repository-level semantic graph before re-ranking;
score links for the updated nodes using a message-passing network and a link prediction for each node in the set of updated node embeddings against the node embedding for re-ranking, and generate a ranked list of nodes;
a node selection module configured to receive as input the list of ranked nodes, and select a number of the most relevant top nodes in the list of ranked nodes to generate a set of top nodes; and
a Large Language Model (LLM) decoding module configured to receive as input the set of top nodes and the query; extract the new contexts in the set of top nodes for decoding, complete a specific code for the query, and generate a final code completion in response to the update module and the source code repository module.
|
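
For illustration only, and not as the claimed implementation, the graph construction steps and the search and expansion modules recited above might be sketched as follows. Several substitutions are assumptions made to keep the sketch self-contained: Python's built-in `ast` module stands in for tree-sitter, a bag-of-tokens count vector stands in for an encoder embedding model such as CodeT5 or UniXcoder, and only three of the five relationship types (ownership, class hierarchy, invoke) are built. The names `SOURCE`, `build_graph`, `embed`, `cosine`, and `retrieve` are hypothetical. The update module, the message-passing re-ranking, and the LLM decoding module are not reproduced.

```python
import ast
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for one code file in the repository.
SOURCE = '''
class Shape:
    def area(self):
        return 0

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return square(self.side)

def square(x):
    return x * x
'''

def build_graph(source):
    """Steps ii) to iv): parse, create one node per element, define edges."""
    tree = ast.parse(source)
    nodes, defs = {}, {}
    edges = defaultdict(set)  # relationship type -> set of (source, target)

    def add_node(name, kind, node, params):
        defs[name] = node
        nodes[name] = {"kind": kind, "params": params, "line": node.lineno,
                       "context": ast.get_source_segment(source, node)}

    for item in tree.body:
        if isinstance(item, ast.FunctionDef):
            add_node(item.name, "function", item, [a.arg for a in item.args.args])
        elif isinstance(item, ast.ClassDef):
            add_node(item.name, "class", item, [])
            for base in item.bases:  # class hierarchy edges
                if isinstance(base, ast.Name):
                    edges["class_hierarchy"].add((item.name, base.id))
            for member in item.body:  # ownership edges (class -> method)
                if isinstance(member, ast.FunctionDef):
                    qual = f"{item.name}.{member.name}"
                    add_node(qual, "method", member,
                             [a.arg for a in member.args.args])
                    edges["ownership"].add((item.name, qual))

    # Invoke edges: calls by simple name that resolve to a known node.
    for name, node in defs.items():
        if nodes[name]["kind"] == "class":
            continue
        for sub in ast.walk(node):
            if (isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name)
                    and sub.func.id in nodes):
                edges["invoke"].add((name, sub.func.id))
    return nodes, edges

def embed(text):
    """Toy embedding (assumption): counts of identifier-like tokens."""
    return Counter(re.findall(r"[A-Za-z_]+", text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, nodes, edges, embeddings, k=1):
    """Search for the k most similar anchor nodes, then expand by one hop."""
    q = embed(query)
    anchors = sorted(nodes, key=lambda n: cosine(q, embeddings[n]),
                     reverse=True)[:k]
    expanded = set(anchors)
    for pairs in edges.values():
        for src, dst in pairs:
            if src in anchors:
                expanded.add(dst)
            if dst in anchors:
                expanded.add(src)
    return anchors, expanded

nodes, edges = build_graph(SOURCE)
# Step v): one embedding per node, stored in a dictionary keyed by node name.
embeddings = {name: embed(meta["context"]) for name, meta in nodes.items()}
anchors, expanded = retrieve("return x * x", nodes, edges, embeddings)
print(anchors, sorted(expanded))
```

In this sketch the relationship dictionaries are keyed by edge type, mirroring the per-type dictionaries of step iv), and expansion follows edges in both directions so that callers as well as callees of an anchor node are pulled into the candidate set.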