% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.21 of 2022/01/12
%
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encodings may result in incorrect characters.
%
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{listings}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
%\usepackage{color}
%\renewcommand\UrlFont{\color{blue}\rmfamily}
%
\begin{document}
%
\title{Search for Structurally Similar Projects of Software Systems}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Aleksey Filippov\inst{1}\orcidID{0000-0003-0008-5035} \and
Julia Stroeva\inst{1}\orcidID{0009-0003-8026-235X}}
%
\authorrunning{A. Filippov, J. Stroeva}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{Department of Information Systems, Ulyanovsk State
Technical University, 32 Severny Venetz Street, 432027 Ulyanovsk, Russia}
%
\maketitle % typeset the header of the contribution
%
\begin{abstract}
The authors have developed an approach to searching for structurally similar software projects. Teachers can use the proposed approach to detect borrowings in student work. The idea behind the approach is to locate existing projects that students have reused as parts of a current project.
The authors propose a new algorithm for determining the similarity between the structures of software projects. The algorithm finds similar structural elements in the source code of a program by analyzing its abstract syntax tree.
The authors developed a software system to evaluate the proposed algorithm. The current version of the system supports only Java programs. However, the system operates on its own representation of the abstract syntax tree, which makes it possible to add support for new programming languages.
\keywords{source code \and structure analysis \and structurally similar projects \and hashing.}
\end{abstract}
%
%
%
\section{Introduction}
Currently, most of the practical work of students in information technology consists of laboratory work and coursework. These classes help students understand the theoretical material from their lectures.
Typically, student work is a small program that solves a typical problem. In most cases, such a program contains only a few files and a small amount of code. The architecture and algorithms of such programs are also simple.
The teacher needs to spend much time checking all the work. The teacher usually notices when a student has borrowed program source code. In such cases, students usually do not change the structure of the borrowed source code but rename variables, change loop types (from \textit{for} to \textit{while}), etc.
The software system proposed in this article analyzes the structure of projects and reports their structural similarity. An indicator of the uniqueness of the project structure is used to evaluate the uniqueness of the current project in comparison with other projects.
\section{State of the art}
At the moment, there are no universal methods for analyzing the source code of software systems. Different methods of analysis are used to solve different problems.
One group of methods for analyzing source code is based on obtaining and analyzing abstract syntax trees (AST). An AST is an abstract representation of the grammatical structure of source code. It expresses the structure of a program in some programming language as a tree. Each AST node is an operator or a set of operators of the analyzed source code. The compiler generates an AST as a result of the parsing step. Unlike a parse tree, an AST does not have nodes or edges for syntax rules that do not affect the semantics of the program (for example, grouping brackets).
Projects can also be analyzed using call graph generation tools, such as \textit{CodeViz} or \textit{Egypt}. It is also possible to use some functions of reverse engineering tools, such as IDA Pro.
AST-based approaches can find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches analyze a larger number of parameters than is necessary to solve the problem of this study \cite{aleksey2020approach,beniwal2021npmrec,nadezhda2019approach,nguyen2018crosssim,nguyen2020automated}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.
The paper \cite{ali2011overview} presents an analysis of approaches and software tools for searching for borrowings in the text and source code. However, there is no mention of software tools for searching for borrowings in the source code.
In the article \cite{chae2013software}, the authors analyze borrowings in the source code according to the sequences of using external programming interfaces and the frequency of such calls. This method is not suitable for solving the problem of this study because of the educational orientation of the analyzed source code.
Thus, it is necessary to develop an approach to the search for structurally similar projects that is focused on simple software systems and provides a high speed of analysis.
\section{The Proposed Algorithm for Analyzing the Structure of the Source Code}
The source code of the software system in the proposed algorithm is the main source of data for identifying structural features.
We form an AST to analyze the source code. Libraries and tools for building an AST exist for most programming languages, but each of them produces its own tree format. Using our own representation of the AST therefore allows support for new programming languages to be added to the system without changing the analysis algorithms.
We will use the following AST model:
\begin{equation*}
AST = \langle N,R \rangle,
\end{equation*}
where $N = \lbrace N_{1}, N_{2},\ldots, N_{n}\rbrace$ is the set of AST nodes;
$N_{i} = \langle name, data \rangle$ is the $i$-th AST node, which contains the node name and the node data;
$R$ is the set of relations between AST nodes.
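The model can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the class names \texttt{Node} and \texttt{Ast}, the dictionary layout of the node data, and the representation of $R$ as (parent, child) index pairs are hypothetical and are not taken from the developed system.
\begin{lstlisting}[language=Python]
from dataclasses import dataclass, field


@dataclass
class Node:
    """One AST node N_i = <name, data>."""
    name: str
    data: dict  # assumed layout: {"type": "Class"}


@dataclass
class Ast:
    """AST = <N, R>: nodes plus (parent, child) index pairs."""
    nodes: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def add(self, name, node_type, parent=None):
        """Append a node; relate it to a parent index if given."""
        self.nodes.append(Node(name, {"type": node_type}))
        index = len(self.nodes) - 1
        if parent is not None:
            self.relations.append((parent, index))
        return index


# A class "Point" with one field and one method.
tree = Ast()
point = tree.add("Point", "Class")
tree.add("x", "Field", parent=point)
tree.add("norm", "Method", parent=point)
\end{lstlisting}
Because the sketch stores only names, types, and relations, a parser for any language could fill the same structure.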
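We developed an algorithm that extracts the structure of a project from its source code. In the expressions below, $F$ denotes a function that returns the type of a node from the node data. The algorithm comprises the following steps: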
\begin{enumerate}
\item Form an AST for the project.
\item Select nodes with type “Class”:
\begin{equation*}
N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{'Class'} \rbrace,
\end{equation*}
\item Find nodes with the “Class field” type in the found classes:
\begin{equation*}
N^{Vars} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{'Field'} \rbrace,
\end{equation*}
\item Find nodes with the “Methods” type in the found classes:
\begin{equation*}
N^{Methods} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{'Method'} \rbrace,
\end{equation*}
\item Find nodes with type “Method Argument” in the found methods:
\begin{equation*}
N^{MethodsArgs} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{'Arg'} \rbrace,
\end{equation*}
\item Find nodes with the type “Operator” in the found methods:
\begin{equation*}
N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{'Operator'} \rbrace,
\end{equation*}
\item Build the AST for the analyzed source code from the previously obtained sets, taking into account the set of relations $R$.
\item Save the resulting AST in a graph database (GDB) to facilitate data handling.
\end{enumerate}
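The selection steps above can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in the system. It assumes that node data is a dictionary with a \texttt{type} key and that relations are (parent, child) index pairs, and it reads each selection step as choosing the direct children of the previously found nodes that have the required type. The helper names \texttt{node\_type} and \texttt{children\_of\_type} are hypothetical.
\begin{lstlisting}[language=Python]
def node_type(data):
    """Stand-in for F: return the node type stored in node data."""
    return data["type"]


def children_of_type(nodes, relations, parents, wanted):
    """Indices of direct children of `parents` with type `wanted`."""
    parent_set = set(parents)
    return [child for parent, child in relations
            if parent in parent_set
            and node_type(nodes[child]["data"]) == wanted]


nodes = [
    {"name": "Point", "data": {"type": "Class"}},
    {"name": "x", "data": {"type": "Field"}},
    {"name": "norm", "data": {"type": "Method"}},
    {"name": "scale", "data": {"type": "Arg"}},
    {"name": "return", "data": {"type": "Operator"}},
]
relations = [(0, 1), (0, 2), (2, 3), (2, 4)]

classes = [i for i, n in enumerate(nodes)
           if node_type(n["data"]) == "Class"]
fields = children_of_type(nodes, relations, classes, "Field")
methods = children_of_type(nodes, relations, classes, "Method")
args = children_of_type(nodes, relations, methods, "Arg")
ops = children_of_type(nodes, relations, methods, "Operator")
\end{lstlisting}
The sketch inspects only direct children, so operators nested inside other operators would require a recursive traversal.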
Figure \ref{SourceCodeAndAST} shows an example of the source code and the AST obtained for it.
\begin{figure}
\includegraphics[width=\textwidth]{SourceCodeAndAST.png}
\caption{Sample source code and its AST.} \label{SourceCodeAndAST}
\end{figure}
A GDB is a non-relational database based on the topological structure of a network. A graph represents a data set as nodes, edges, and properties. Relational databases provide a structured approach to data, whereas GDBs are more flexible and focus on fast data retrieval that takes into account the various types of links between data items.
We used Neo4j as the GDB because it operates quickly even with a large amount of stored data.
\section{The Proposed Algorithm for Determining the Structural Similarity of Software Projects}
The determination of the structural similarity of projects is based on a hashing algorithm. A hash function collapses an input array of any size into a string of fixed length.
The hash values for the AST of the analyzed project are obtained and compared as follows:
\begin{enumerate}
\item Get the fragments (paths) of the AST graph from the root node to each node of the graph.
\item Extract the distinguishing property (the type) of each node in the current fragment.
\item Pass the type of the node to the hash function. The result is an output string of fixed length.
\item If fragment \textit{A} contains hash \textit{H} and fragment \textit{B} also contains hash \textit{H}, then fragments \textit{A} and \textit{B} are similar in one node.
\item If fragment \textit{A} contains hash \textit{H} and fragment \textit{B} does not contain hash \textit{H}, then fragments \textit{A} and \textit{B} are not similar in this node.
\end{enumerate}
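The steps above can be sketched in Python with the standard \texttt{hashlib} module. The sketch makes several assumptions that are not taken from the developed system: the tree is given as two dictionaries (node types and child lists), a path is serialized by joining node types with a slash before hashing, and the function name \texttt{path\_hashes} is hypothetical. The resulting digests are therefore not expected to coincide with those computed inside the GDB.
\begin{lstlisting}[language=Python]
import hashlib


def path_hashes(types, children, root=0):
    """Return {path of node types: MD5 hex digest}."""
    hashes = {}
    stack = [(root, (types[root],))]
    while stack:
        node, path = stack.pop()
        text = "/".join(path)  # assumed serialization of a path
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        hashes[path] = digest
        for child in children.get(node, []):
            stack.append((child, path + (types[child],)))
    return hashes


# Two tiny hypothetical projects that differ in the last node.
types_a = {0: "Package", 1: "Class", 2: "Method"}
types_b = {0: "Package", 1: "Class", 2: "Field"}
children = {0: [1], 1: [2]}

hashes_a = path_hashes(types_a, children)
hashes_b = path_hashes(types_b, children)
shared = set(hashes_a.values()) & set(hashes_b.values())
print(len(shared))  # 2: the paths Package and Package/Class
\end{lstlisting}
Comparing sets of digests instead of the trees themselves reduces the similarity check to a set intersection.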
We calculate the hash values in the Neo4j GDB using the MD5 algorithm.
Thus, the number of matching hash values determines the uniqueness and borrowing rates of the code in a project. The following expression is used to calculate the originality of a project:
\begin{equation*}
O = \frac{\left| H^{C} \setminus H \right|}{\left| H^{C} \right|},
\end{equation*}
where $H^{C}$ is the set of values of the hash function of the current project;
$H$ is the set of hash values of other projects in the system;
$\left| \cdot \right|$ is the number of elements in a set.
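A small Python sketch may help to illustrate the calculation. It reads the expression as the share of hash values of the current project that do not occur in any other project. The function name \texttt{originality} and the sample hash labels are hypothetical.
\begin{lstlisting}[language=Python]
def originality(current, others):
    """Share of current-project hashes absent from other projects."""
    current = set(current)
    if not current:
        raise ValueError("the current project has no hashes")
    return len(current - set(others)) / len(current)


# Two of the four hashes of the current project occur elsewhere.
current_hashes = {"h1", "h2", "h3", "h4"}
other_hashes = {"h2", "h4", "h9"}
print(originality(current_hashes, other_hashes))  # 0.5
\end{lstlisting}
Under this reading, a value of 1 means that no path of the current project occurs in other projects, and a value of 0 means that every path has been seen before.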
\section{Description of the Developed Software System}
Figure \ref{DeploymentDiagram} shows the deployment of the developed software system. The system comprises three components, which are located on different nodes of the computer network.
\begin{figure}
\includegraphics[width=\textwidth]{DeploymentDiagram.png}
\caption{Deployment diagram.} \label{DeploymentDiagram}
\end{figure}
Users interact with the web client on the \textit{Frontend} node. The \textit{Backend} node performs the main business logic that implements the proposed approach to searching for structurally similar projects. The \textit{Backend} and \textit{Frontend} nodes communicate through an API. The GDB is located on the \textit{Database} node and stores project data.
The developed software system is a web application with a client-server architecture. The web client sends requests to the server; the server processes the requests and returns responses to the web client.
The web client is an application written in JavaScript using the Vue.js framework. Vue.js is a framework for developing single-page applications and web interfaces. The main advantages of this framework are its lightness (the library is small), performance, flexibility, and excellent documentation.
We implemented the server part of the application in Java using the Spring Boot framework. The Spring framework is a whole ecosystem for developing applications in the Java language, which includes a large number of ready-made modules. Spring Boot extends the Spring framework. The main advantages of this framework include speed and ease of development, auto-configuration of all components, and easy access to databases and network capabilities.
The current version of the software system supports only the analysis of Java source code. The JavaParser library is used to form an AST in the process of Java code analysis. This library builds its own internal representation of the AST, which is then translated into the proposed AST model using the previously discussed algorithm.
Figure \ref{JavaParserLibrary} shows an example of the internal representation of the AST in the JavaParser library.
\begin{figure}
\includegraphics[width=\textwidth]{JavaParserLibrary.png}
\caption{An example of the internal representation of the JavaParser library AST.} \label{JavaParserLibrary}
\end{figure}
We used the Neo4j GDB for data storage. An advantage of GDBs over relational databases is the ability to change the data model.
The GDB data model allows you to store the following nodes:
\begin{itemize}
\item nodes with type “Package” (Java-specific);
\item nodes with type “Class”;
\item nodes with the “Class field” type;
\item nodes with type “Method”;
\item nodes with type “Method argument”;
\item nodes with type “Operator”.
\end{itemize}
We arrange the nodes in the GDB hierarchically. For example, a class is in a package, and a method is in a class. The data model allows the following relationships between graph nodes:
\begin{itemize}
\item \texttt{HAS\_CLASS} is a relationship between a package and a class;
\item \texttt{HAS\_FIELD} is a relationship between a class and a class field;
\item \texttt{HAS\_METHOD} is a relationship between a class and a method;
\item \texttt{HAS\_ARG} is a relationship between a method and a method argument;
\item \texttt{HAS\_BLOCK} is a relationship between a method and a statement.
\end{itemize}
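The hierarchy can be summarized as a lookup table from a pair of node types to a relationship name. The following Python sketch is only an illustration: the dictionary \texttt{RELATIONSHIPS} and the function \texttt{relationship} are hypothetical names, and the type labels are taken from the lists above.
\begin{lstlisting}[language=Python]
RELATIONSHIPS = {
    ("Package", "Class"): "HAS_CLASS",
    ("Class", "Class field"): "HAS_FIELD",
    ("Class", "Method"): "HAS_METHOD",
    ("Method", "Method argument"): "HAS_ARG",
    ("Method", "Operator"): "HAS_BLOCK",
}


def relationship(parent_type, child_type):
    """Return the relationship name for a parent/child type pair."""
    try:
        return RELATIONSHIPS[(parent_type, child_type)]
    except KeyError:
        raise ValueError(
            f"{child_type!r} cannot be nested in {parent_type!r}"
        ) from None


print(relationship("Class", "Method"))  # HAS_METHOD
\end{lstlisting}
Rejecting pairs that are missing from the table keeps the stored graph consistent with the nesting order of the model.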
The main idea of searching for structurally similar projects is to hash graph fragments (paths) using the MD5 function. We form a path from the root node to each node of the graph. Nesting is taken into account when forming the path (package $\rightarrow$ class $\rightarrow$ field / method $\rightarrow$ method argument / operator). The hash function takes the string representation of the node type as input.
An example of a query that returns the paths and the hash corresponding to each path:
\begin{lstlisting}
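// Match every path that starts at the root node of one graph;
// {0} is a query parameter holding the internal ID of that node.
MATCH p = (o{name:"root"})-[r*]- ()
WHERE ID(o)={0}
// Label each node on the path by its name or, if absent, its type.
WITH [x in nodes(p) | CASE WHEN EXISTS(x.name)
       THEN x.name ELSE x.type END] as names,
     [x in nodes(p) | ID(x)] as ids
// Hash the list of labels with the APOC MD5 function.
WITH names, apoc.util.md5(names) as hash, ids
RETURN names, hash, ids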
\end{lstlisting}
Figure \ref{ResultOfRequest} shows the result of the query execution.
\begin{figure}
\includegraphics[width=\textwidth]{ResultOfRequest.png}
\caption{An example of the result of a request to Neo4j.} \label{ResultOfRequest}
\end{figure}
Figure \ref{ResultOfRequest} shows that the hash \textit{"b199ef8568f72c43f6fd50860e228c51"} matches the path \textit{[“root”, “cont”]}. Two graphs contain this hash. The primary keys of the root nodes of these graphs are 7872 and 7977.
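Finding the graphs that share a hash amounts to grouping root node identifiers by hash value. The Python sketch below illustrates this step under the assumption that the query results are available as (root identifier, hash) pairs. The function name \texttt{roots\_by\_hash} and the third row are hypothetical, while the first two rows reuse the values discussed above.
\begin{lstlisting}[language=Python]
from collections import defaultdict


def roots_by_hash(rows):
    """Group root IDs by hash; keep hashes found in 2+ graphs."""
    groups = defaultdict(set)
    for root_id, path_hash in rows:
        groups[path_hash].add(root_id)
    return {h: sorted(ids) for h, ids in groups.items()
            if len(ids) > 1}


rows = [
    (7872, "b199ef8568f72c43f6fd50860e228c51"),
    (7977, "b199ef8568f72c43f6fd50860e228c51"),
    (7872, "hypothetical-hash-of-another-path"),
]
shared = roots_by_hash(rows)
\end{lstlisting}
Here \texttt{shared} maps the hash from the figure to the two root identifiers and omits the hash that occurs in only one graph.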
Thus, by obtaining hashes for all AST fragments, we can determine the number of matching and differing paths in the analyzed projects. Figure \ref{ExampleSystem} shows an example of the operation of the developed system.
\begin{figure}
\includegraphics[width=\textwidth]{ExampleSystem.png}
\caption{System operation example.} \label{ExampleSystem}
\end{figure}
\section{Experiments}
We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and files in the project. The main aim of the experiment is to calculate the average number of lines of code analyzed in one minute, which characterizes the speed of the algorithm. We used the Statistic plugin for IntelliJ IDEA to obtain the data for the experiment.
Table \ref{tab1} presents the initial data for determining the speed of the proposed algorithm. We selected 10 random Java projects as data.
\begin{table}[]
\caption{Initial data for analyzing the speed of the proposed algorithm.}\label{tab1}
\begin{tabular}{|llll|l|}
\hline
\multicolumn{1}{|l|}{} & \multicolumn{1}{l|}{Project name} & \multicolumn{1}{l|}{Number of lines of code} & Number of Java files & Lines of code processed per minute \\ \hline
\multicolumn{1}{|l|}{1} & \multicolumn{1}{l|}{BaseRecycler} & \multicolumn{1}{l|}{3 896} & 92 & 2 491 \\ \hline
\multicolumn{1}{|l|}{2} & \multicolumn{1}{l|}{AlamazDev} & \multicolumn{1}{l|}{15 776} & 103 & 2 658 \\ \hline
\multicolumn{1}{|l|}{3} & \multicolumn{1}{l|}{SnakeBoom} & \multicolumn{1}{l|}{20 534} & 158 & 3 255 \\ \hline
\multicolumn{1}{|l|}{4} & \multicolumn{1}{l|}{retrofit} & \multicolumn{1}{l|}{32 119} & 227 & 2 718 \\ \hline
\multicolumn{1}{|l|}{5} & \multicolumn{1}{l|}{Glide} & \multicolumn{1}{l|}{37 508} & 203 & 2 576 \\ \hline
\multicolumn{1}{|l|}{6} & \multicolumn{1}{l|}{ZXing} & \multicolumn{1}{l|}{51 857} & 310 & 2 533 \\ \hline
\multicolumn{1}{|l|}{7} & \multicolumn{1}{l|}{RxJava} & \multicolumn{1}{l|}{64 101} & 339 & 2 814 \\ \hline
\multicolumn{1}{|l|}{8} & \multicolumn{1}{l|}{VisualProjectCore} & \multicolumn{1}{l|}{71 303} & 450 & 2 969 \\ \hline
\multicolumn{1}{|l|}{9} & \multicolumn{1}{l|}{mc-dev} & \multicolumn{1}{l|}{85 267} & 877 & 2 746 \\ \hline
\multicolumn{1}{|l|}{10} & \multicolumn{1}{l|}{xRayJavaTool} & \multicolumn{1}{l|}{97 249} & 937 & 2 730 \\ \hline
\multicolumn{4}{|l|}{Average value} & 2 750 \\ \hline
\end{tabular}
\end{table}
Table \ref{tab2} presents the results of experiments to determine the speed of the proposed algorithm.
\begin{table}[]
\caption{Results of experiments performed to evaluate the speed of the proposed algorithm.}\label{tab2}
\begin{tabular}{|llll|l|}
\hline
\multicolumn{1}{|l|}{} & \multicolumn{1}{l|}{Project name} & \multicolumn{1}{l|}{Parsing time (min)} & Number of graph nodes & Lines of code processed per minute \\ \hline
\multicolumn{1}{|l|}{1} & \multicolumn{1}{l|}{BaseRecycler} & \multicolumn{1}{l|}{1.564} & 844 & 2 491 \\ \hline
\multicolumn{1}{|l|}{2} & \multicolumn{1}{l|}{AlamazDev} & \multicolumn{1}{l|}{5.935} & 1 837 & 2 658 \\ \hline
\multicolumn{1}{|l|}{3} & \multicolumn{1}{l|}{SnakeBoom} & \multicolumn{1}{l|}{6.308} & 2 197 & 3 255 \\ \hline
\multicolumn{1}{|l|}{4} & \multicolumn{1}{l|}{retrofit} & \multicolumn{1}{l|}{11.817} & 7 118 & 2 718 \\ \hline
\multicolumn{1}{|l|}{5} & \multicolumn{1}{l|}{Glide} & \multicolumn{1}{l|}{14.556} & 8 496 & 2 576 \\ \hline
\multicolumn{1}{|l|}{6} & \multicolumn{1}{l|}{ZXing} & \multicolumn{1}{l|}{20.468} & 10 560 & 2 533 \\ \hline
\multicolumn{1}{|l|}{7} & \multicolumn{1}{l|}{RxJava} & \multicolumn{1}{l|}{22.777} & 11 972 & 2 814 \\ \hline
\multicolumn{1}{|l|}{8} & \multicolumn{1}{l|}{VisualProjectCore} & \multicolumn{1}{l|}{24.009} & 13 334 & 2 969 \\ \hline
\multicolumn{1}{|l|}{9} & \multicolumn{1}{l|}{mc-dev} & \multicolumn{1}{l|}{31.048} & 14 444 & 2 746 \\ \hline
\multicolumn{1}{|l|}{10} & \multicolumn{1}{l|}{xRayJavaTool} & \multicolumn{1}{l|}{35.613} & 23 946 & 2 730 \\ \hline
\multicolumn{4}{|l|}{Average value} & 2 750 \\ \hline
\end{tabular}
\end{table}
The experiment showed that the system processes an average of 2,750 lines of code per minute. Laboratory work and coursework comprise 500--3000 lines of code on average. Thus, processing a single laboratory work takes less than one minute on average.
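As a rough check of this estimate, the analysis time $t$ of a project with $L$ lines of code can be approximated as $t(L) = L / v$, where $v = 2750$ lines per minute is the average value from Table \ref{tab2}. The linear model and the choice of the midpoint of the range ($L = 1750$) are illustrative assumptions, not results of the experiments:
\begin{align*}
t(500) &= \frac{500}{2750} \approx 0.18~\text{min},\\
t(1750) &= \frac{1750}{2750} \approx 0.64~\text{min},\\
t(3000) &= \frac{3000}{2750} \approx 1.09~\text{min}.
\end{align*}
Under these assumptions, a project of typical size is analyzed in well under one minute, while a project at the upper end of the range takes slightly longer.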
\section{Conclusion}
This article presents the results of developing an approach and a system for searching for structurally similar projects.
We completed the following tasks:
\begin{itemize}
\item we analyzed existing methods of source code analysis, including methods for determining the originality of a project;
\item we developed an algorithm for constructing an AST from the source code of a project;
\item we developed an algorithm for determining the originality of a project based on the analysis of the AST structure;
\item we implemented a software system that determines the originality of a project based on the analysis of its structure;
\item we conducted experiments to determine the speed of the proposed algorithm.
\end{itemize}
Thus, the developed system makes it possible to find borrowings in student projects in less than a minute on average.
%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{samplepaper}
\end{document}