The authors have developed an approach for finding structurally similar software projects. Teachers can use the proposed approach to detect borrowings in student work. The idea behind this proposal is that the approach can locate projects that students have reused as parts of a submitted project.
The authors propose a new algorithm for determining the similarity between the structures of software projects. The proposed algorithm finds similar structural elements in the source code of programs by analyzing their abstract syntax trees.
The authors developed a software system to evaluate the proposed algorithm. The current version of the system supports only Java programs. However, the system operates on its own representation of the abstract syntax tree, which makes it possible to add support for new programming languages.
Currently, most practical work of information technology students consists of laboratory assignments and coursework. These classes help students consolidate the theoretical material from their lectures.
Typically, a student work is a small program that solves a typical problem. In most cases, such works contain only a few files and relatively few lines of code. The architecture and algorithms of such programs are also simple.
The teacher needs to spend a lot of time to check all the works. The teacher usually notices when a student has borrowed program source code. In such cases, students do not change the structure of the borrowed source code but rename variables, change loop types (from \textit{for} to \textit{while}), etc.
The software system proposed in this article analyzes the structure of projects and reports their structural similarity. An indicator of the uniqueness of the project structure is used to evaluate the analyzed project in comparison with the other stored projects.
There are currently no universal methods for analyzing the source code of software systems. Different analysis methods are used to solve different problems.
We can analyze projects using call graph generation tools, such as \textit{CodeViz} or \textit{Egypt}, or using reverse engineering tools, such as IDA Pro. Call-graph-based approaches allow developers to address the program comprehension task for better program maintenance or to reduce security issues \cite{ghavamnia2020temporal,soares2021integrating,tang2022assessing,vinayaka2021android}.
Another group of methods is based on obtaining and analyzing abstract syntax trees (ASTs). An AST is an abstract representation of the grammatical structure of source code. It expresses the structure of a program as a tree and depends only weakly on the programming language. Each AST node corresponds to an operator or a set of operators of the analyzed source code. The compiler generates an AST during parsing. Unlike a parse tree, an AST does not contain nodes or edges that do not affect the semantics of the program (for example, grouping brackets).
AST-based approaches allow us to find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches analyze more parameters than are necessary for the problem of this study \cite{beniwal2021npmrec,aleksey2020approach,nguyen2018crosssim,nguyen2020automated,nadezhda2019approach}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.
The paper \cite{ali2011overview} presents a review of approaches and software tools for detecting borrowings in text and source code. However, it does not mention existing software tools for detecting borrowings in source code.
In the article \cite{chae2013software}, the authors detect borrowings in source code based on the sequences of calls to external programming interfaces (external dependencies) and the frequency of such calls. This method is not suitable for the problem of this study because of its educational context: some student projects do not use external dependencies at all.
Thus, it is necessary to develop an approach to the search for structurally similar projects that is focused on simple software systems and high analysis speed.
We represent a software system project as a set of source code files. The source code of the software system is the main data source for identifying structural features in the proposed algorithm.
We form an AST to analyze the source code. Various libraries and tools for AST construction exist for all mainstream programming languages. We use our own representation of the AST so that support for new programming languages can be added without changing the analysis algorithms. Therefore, we need to develop a converter that transforms the AST generated by a language-specific parser into our AST representation.
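The shape of such a converter can be sketched with a hypothetical parser-specific node type mapped onto a simplified internal representation. All class and field names below are illustrative assumptions, not the system's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical node produced by some language-specific parser.
class ParserNode {
    final String kind;                  // parser-specific node kind
    final List<ParserNode> children;

    ParserNode(String kind, List<ParserNode> children) {
        this.kind = kind;
        this.children = children;
    }
}

// Simplified node of our own language-independent AST representation.
class OwnNode {
    final String type;
    final List<OwnNode> children = new ArrayList<>();

    OwnNode(String type) { this.type = type; }
}

public class AstConverter {
    // Maps parser-specific node kinds onto our unified node types
    // (the kind names here mimic JavaParser's, for illustration only).
    static final Map<String, String> KIND_MAP = Map.of(
            "ClassOrInterfaceDeclaration", "class",
            "MethodDeclaration", "method",
            "FieldDeclaration", "field");

    // Recursively converts a parser AST into our representation;
    // unmapped kinds are treated as plain statements in this sketch.
    static OwnNode convert(ParserNode node) {
        OwnNode own = new OwnNode(KIND_MAP.getOrDefault(node.kind, "statement"));
        for (ParserNode child : node.children) {
            own.children.add(convert(child));
        }
        return own;
    }
}
```

Because the analysis algorithms see only `OwnNode`-like structures, adding a new language reduces to writing one such mapping.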
For example, for Java-based software systems we analyze Java files and their hierarchy at the package level. We use the JavaParser library to build the AST for Java projects. The algorithm considered below transforms the AST generated by the JavaParser library into our AST representation.
In this algorithm, $F$ is a search function that finds nested nodes. Its parameter is a node or subtree, and its output is the set of nodes of the desired type: class, class field, method, method argument, or statement (operator).
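The function $F$ can be sketched over a minimal, hypothetical node type (the `AstNode` class and its fields are illustrative assumptions, not the system's actual data structures):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal node of the language-independent AST representation.
class AstNode {
    final String type;   // e.g. "class", "field", "method", "statement"
    final String name;
    final List<AstNode> children = new ArrayList<>();

    AstNode(String type, String name) {
        this.type = type;
        this.name = name;
    }

    AstNode add(AstNode child) {
        children.add(child);
        return this;
    }
}

public class AstSearch {
    // Sketch of the search function F: given a node or subtree,
    // collect all nested nodes of the desired type.
    static List<AstNode> f(AstNode root, String desiredType) {
        List<AstNode> result = new ArrayList<>();
        collect(root, desiredType, result);
        return result;
    }

    private static void collect(AstNode node, String type, List<AstNode> out) {
        if (node.type.equals(type)) {
            out.add(node);
        }
        for (AstNode child : node.children) {
            collect(child, type, out);
        }
    }

    public static void main(String[] args) {
        AstNode tree = new AstNode("class", "Main")
                .add(new AstNode("method", "run")
                        .add(new AstNode("statement", "for")))
                .add(new AstNode("field", "counter"));
        System.out.println(f(tree, "method").size());    // 1
        System.out.println(f(tree, "statement").size()); // 1
    }
}
```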
Figure \ref{DeploymentDiagram} shows the deployment diagram of the developed software system in UML notation. The developed system has a three-tier architecture.
Users interact with the web client on the \textit{Frontend} node. The \textit{Backend} node performs the main business logic of searching for structurally similar projects. The \textit{Backend} and \textit{Frontend} nodes communicate through an API. The \textit{Database} node provides data storage.
The web client is a JavaScript application built with the Vue.js framework. Vue.js is a framework for developing single-page applications and web interfaces. The main advantages of this framework are the small size of the library, performance, flexibility, and excellent documentation.
We implemented the server part of the application in Java with the Spring Boot framework. Spring is an ecosystem for developing applications in the Java language. Spring Boot includes a large number of ready-to-use modules. The main advantages of this framework are the speed and convenience of development, auto-configuration of components, and easy access to databases and network capabilities.
The current version of the software system supports only Java-based software projects. The JavaParser library is used to build the AST during Java source code analysis. The resulting AST is then transformed by the previously discussed algorithm.
We use Neo4j \cite{ref_neo4j} as the data storage. Neo4j is a graph database management system (GDB). Neo4j stores nodes and the edges connecting them, and we can add additional attributes to both nodes and edges. Neo4j operates at high speed even with a large amount of stored data.
A GDB is a non-relational type of database based on the topology of a network (graph) structure. GDBs are more flexible than relational databases and allow us to quickly obtain data of various types, considering numerous relations between them.
Cypher \cite{ref_cypher} provides a convenient way to express queries and other Neo4j actions. Although Cypher is particularly useful for exploratory work, it is fast enough to be used in production.
Also, we use the apoc.util.md5 plugin \cite{ref_md5}. This plugin computes the md5 hash of the concatenation of the string representations of a list of Neo4j entities.
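The effect of this hashing can be sketched in plain Java as a simplified stand-in for the apoc.util.md5 call (the helper name and its input format are illustrative assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class PathHash {
    // Computes an md5 hash of a graph path by concatenating the string
    // representations of its elements, mimicking apoc.util.md5.
    static String md5OfPath(List<String> pathElements) {
        try {
            String concatenated = String.join("", pathElements);
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(concatenated.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Two structurally identical paths hash to the same value, so
        // matching hashes indicate matching structural patterns.
        List<String> path1 = List.of("package", "class", "method", "for");
        List<String> path2 = List.of("package", "class", "method", "for");
        System.out.println(md5OfPath(path1).equals(md5OfPath(path2))); // true
    }
}
```

Comparing fixed-length hashes instead of whole paths is what keeps the similarity search fast.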
We arrange the nodes in the GDB hierarchically. For example, a class node is a part of a package node, and a method node is a part of a class node. The data model allows the following relations between nodes:
The proposed algorithm for searching for structurally similar projects uses hashing of graph paths based on the md5 function. We described the hashing algorithm in the previous section. The search algorithm can be represented as the following Cypher query:
\item two projects with identifiers 7872 and 7977 contain these structural patterns (paths). The length of the collection in the \textit{ids} column shows how many projects contain the $i$-th structural element, and the length of each element of this collection gives the length of the chain of structural elements used to calculate the project originality degree.
Thus, we can calculate the number of matching and non-matching paths (see eq. \ref{eq:calc-orig}) in the analyzed project compared with the other projects in the data storage. Figure \ref{fig:ExampleSystem} shows the main form of the developed system.
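One plausible form of such a calculation can be sketched as follows. Note that this is only an assumed formula for illustration, treating originality as the share of paths not found in other stored projects; the actual formula is the one given by eq. \ref{eq:calc-orig}:

```java
public class Originality {
    // Assumed sketch: originality degree as the fraction of structural
    // paths in the analyzed project whose hashes do NOT occur in other
    // projects in the data storage. Illustrative only.
    static double originality(int matchingPaths, int totalPaths) {
        if (totalPaths == 0) {
            return 1.0; // an empty project trivially matches nothing
        }
        return (double) (totalPaths - matchingPaths) / totalPaths;
    }

    public static void main(String[] args) {
        // Example: 40 of 200 structural paths also occur in other projects.
        System.out.println(originality(40, 200)); // 0.8
    }
}
```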
We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and the number of files in the analyzed project. The main aim of the experiment is to determine the speed of the algorithm as the average number of lines of code processed per minute. We used the IntelliJ IDEA Statistic plugin \cite{ref_Statistic} to collect the data for the experiment. The plugin calculates the number of files, their size, the number of lines, average values, and other information for each file in the project. It also reports the total number of lines, the number and proportion of code lines, the number and proportion of comment lines, etc.
We selected 10 random Java projects for this experiment. Table \ref{tab:speed} presents the results of experiments for analyzing the speed of the proposed algorithm.
Table \ref{tab:time-size} presents the results of experiments measuring the total analysis time of the projects and the number of nodes in the resulting graphs.
The experiment revealed that we processed an average of 2 750 lines of code per minute. Student projects contain on average 500--3000 lines of code. Thus, the analysis of one project usually takes less than one minute.
\subsubsection{Acknowledgements} This work was supported within the framework of state task No. 075-03-2023-143 "Research of intelligent predictive analytics based on the integration of methods for constructing features of heterogeneous dynamic data for machine learning and methods of predictive multimodal data analysis".