\@writefile{toc}{\contentsline{section}{\numberline{2}State of the art}{2}{}\protected@file@percent }
\@writefile{toc}{\contentsline{section}{\numberline{3}The Proposed Algorithm for Analyzing the Structure of the Source Code}{2}{}\protected@file@percent }
\@writefile{toc}{\contentsline{section}{\numberline{4}The Proposed Algorithm for Determining the Structural Similarity of Software Projects}{3}{}\protected@file@percent }
\@writefile{toc}{\contentsline{section}{\numberline{5}State of the art}{3}{}\protected@file@percent }
\@writefile{lot}{\contentsline{table}{\numberline{1}{\ignorespaces Table captions should be placed above the tables.}}{4}{}\protected@file@percent }
\newlabel{DeploymentDiagram}{{2}{5}}
\newlabel{tab1}{{1}{4}}
\@writefile{lof}{\contentsline{figure}{\numberline{3}{\ignorespaces An example of the internal representation of the JavaParser library AST.}}{6}{}\protected@file@percent }
\newlabel{JavaParserLibrary}{{3}{6}}
\@writefile{lof}{\contentsline{figure}{\numberline{4}{\ignorespaces An example of the result of a request to Neo4j.}}{7}{}\protected@file@percent }
\newlabel{ResultOfRequest}{{4}{7}}
\@writefile{lof}{\contentsline{figure}{\numberline{5}{\ignorespaces System operation example.}}{7}{}\protected@file@percent }
\caption{Sample source code and its AST.}\label{SourceCodeAndAST}
\end{figure}
\end{figure}
A GDB is a non-relational database based on the topological structure of a network. Graphs represent data as nodes, edges, and properties. Whereas relational databases provide a structured approach to data, GDBs are more flexible and focus on fast data retrieval that takes into account the various types of links between data items.
\section{The Proposed Algorithm for Determining the Structural Similarity of Software Projects}
The determination of the structural similarity of projects is based on a hashing algorithm. We use a hash function to collapse an input array of any size into a fixed-length string.
To obtain the hash values for the AST of the analyzed project, we proceed as follows:
\begin{enumerate}
\item Get the fragments (paths) of the AST graph from the root node to each node of the graph.
\item Extract the distinguishing property (the type) of each node in the current fragment.
\item Pass the node type to the hash function. The result is an output string of fixed length.
\item If fragment \textit{A} contains hash \textit{H} and fragment \textit{B} also contains hash \textit{H}, then fragments \textit{A} and \textit{B} are similar in one node.
\item If fragment \textit{A} contains hash \textit{H} and fragment \textit{B} does not contain hash \textit{H}, then fragments \textit{A} and \textit{B} do not have a similar node for \textit{H}.
\end{enumerate}
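The steps above can be illustrated with a minimal Python sketch. It is not the implementation used in the system: the tuple-based tree, the function name \texttt{path\_hashes}, and the choice to hash the node types of a path joined with \texttt{/} are all assumptions made for illustration.

```python
import hashlib

# Hypothetical toy AST: every node is a pair (node_type, list_of_children).
def path_hashes(node, prefix=()):
    """Yield (path, md5 hex digest) for the path from the root to every node."""
    node_type, children = node
    path = prefix + (node_type,)
    # Assumption: a path is hashed as its node types joined with "/".
    digest = hashlib.md5("/".join(path).encode("utf-8")).hexdigest()
    yield path, digest
    for child in children:
        yield from path_hashes(child, path)

# Two small hypothetical projects.
ast_a = ("Package", [("Class", [("Method", [("Operator", [])]),
                                ("Class field", [])])])
ast_b = ("Package", [("Class", [("Method", [])])])

hashes_a = {digest for _, digest in path_hashes(ast_a)}
hashes_b = {digest for _, digest in path_hashes(ast_b)}

# Hashes present in both sets mark paths that the two projects share.
shared = hashes_a & hashes_b
```

In this sketch, two fragments are considered similar exactly when their hash values coincide, which mirrors the last two items of the list.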
We compute the hash values inside the Neo4j GDB using the MD5 algorithm.
Thus, the number of matching hash values determines the uniqueness and borrowing rates of the code in a project. We calculate the originality of a project as the share of its hash values that do not occur in other projects:
\begin{equation*}
O = \frac{|H^{C} \setminus H|}{|H^{C}|},
\end{equation*}
where $H^{C}$ is the set of hash values of the current project,
$H$ is the set of hash values of the other projects in the system, and $|\cdot|$ denotes the number of elements in a set.
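As a worked illustration, the following Python sketch reads the expression above as the share of the current project's hash values that do not occur in any other project. The function name \texttt{originality} and the sample hash labels are hypothetical.

```python
def originality(current_hashes, other_hashes):
    """Share of the current project's hashes that no other project contains."""
    current = set(current_hashes)
    if not current:
        raise ValueError("the current project has no hash values")
    unique = current - set(other_hashes)
    return len(unique) / len(current)

# Hypothetical hash labels: two of the four values also occur elsewhere.
h_current = {"h1", "h2", "h3", "h4"}
h_others = {"h2", "h4", "h9"}

print(originality(h_current, h_others))  # prints 0.5
```

A value of 1 means that none of the project's paths occur elsewhere in the system, and a value of 0 means that all of them do.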
\section{Description of the Developed Software System}
Figure \ref{DeploymentDiagram} shows the developed software system. The system comprises three components, each deployed on a separate node of the computer network.
Users interact with the web client on the \textit{Frontend} node. The \textit{Backend} node performs the main business logic that implements the proposed approach to searching for structurally similar projects. The \textit{Backend} and \textit{Frontend} nodes communicate through an API. The GDB resides on the \textit{Database} node and stores the project data.
The developed software system is a web application with a client-server architecture. The web client sends requests to the server; the server processes the requests and returns responses to the web client.
The web client is an application written in JavaScript using the Vue.js framework. Vue.js is a framework for developing single-page applications and web interfaces. The main advantages of this framework are its lightness (small size of the library in lines of code), performance, flexibility, and excellent documentation.
We implement the server part of the application in Java using the Spring Boot framework. The Spring framework is an ecosystem for developing applications in Java that includes a large number of ready-made modules. Spring Boot extends the Spring framework. Its main advantages include speed and ease of development, auto-configuration of all components, and easy access to databases and network capabilities.
The current version of the software system supports source code analysis only for Java. The JavaParser library is used to build an AST during Java code analysis. This library builds an internal representation of the AST, which is then translated into the proposed AST model using the previously discussed algorithm.
Figure \ref{JavaParserLibrary} shows an example of the internal representation of the AST in the JavaParser library.
\caption{An example of the internal representation of the JavaParser library AST.}\label{JavaParserLibrary}
\end{figure}
We used the Neo4j GDB for data storage. An advantage of GDBs over relational databases is the ability to change the data model.
The GDB data model allows you to store the following nodes:
\begin{itemize}
\item nodes with type ``Package'' (Java-specific);
\item nodes with type ``Class'';
\item nodes with type ``Class field'';
\item nodes with type ``Method'';
\item nodes with type ``Method argument'';
\item nodes with type ``Operator''.
\end{itemize}
We arrange the nodes in the GDB hierarchically. For example, a class is in a package, and a method is in a class. The data model allows the following relationships between graph nodes:
\begin{itemize}
\item \texttt{HAS\_CLASS} is a relationship between a package and a class;
\item \texttt{HAS\_FIELD} is a relationship between a class and a class field;
\item \texttt{HAS\_METHOD} is a relationship between a class and a method;
\item \texttt{HAS\_ARG} is a relationship between a method and a method argument;
\item \texttt{HAS\_BLOCK} is a relationship between a method and a statement.
\end{itemize}
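The node types and relationships listed above can be summarized as a small schema. The Python sketch below is only an illustration of the data model, not part of the described system. The names \texttt{ALLOWED\_RELATIONSHIPS} and \texttt{is\_valid\_edge} are hypothetical, and we assume that the statement linked by \texttt{HAS\_BLOCK} corresponds to the ``Operator'' node type.

```python
# Relationship type -> (source node type, target node type).
ALLOWED_RELATIONSHIPS = {
    "HAS_CLASS": ("Package", "Class"),
    "HAS_FIELD": ("Class", "Class field"),
    "HAS_METHOD": ("Class", "Method"),
    "HAS_ARG": ("Method", "Method argument"),
    # Assumption: a statement is stored as a node of type "Operator".
    "HAS_BLOCK": ("Method", "Operator"),
}

def is_valid_edge(source_type, relationship, target_type):
    """Check whether an edge fits the hierarchical data model."""
    return ALLOWED_RELATIONSHIPS.get(relationship) == (source_type, target_type)

print(is_valid_edge("Package", "HAS_CLASS", "Class"))    # prints True
print(is_valid_edge("Package", "HAS_METHOD", "Method"))  # prints False
```

The second call is rejected because, in this model, a method can be attached only to a class.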
The main idea of searching for structurally similar projects is to hash graph fragments (paths) with the MD5 function. We form a path from the root node to each node of the graph, taking nesting into account (package $\rightarrow$ class $\rightarrow$ field / method $\rightarrow$ method argument / operator). The hash function takes the string representation of the node type as input.
The following query returns the paths and the hash that corresponds to each path:
\begin{lstlisting}
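// Match every path that starts at the root node of one AST graph
MATCH p = (o {name: "root"})-[r*]-()
// {0} is a placeholder for the ID of the root node
WHERE ID(o) = {0}
// Label each node on the path by its name, or by its type if it has no name
WITH [x IN nodes(p) | CASE WHEN EXISTS(x.name)
     THEN x.name ELSE x.type END] AS names,
     [x IN nodes(p) | ID(x)] AS ids
// Hash the list of labels with MD5
WITH names, apoc.util.md5(names) AS hash, ids
RETURN names, hash, ids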
\end{lstlisting}
Figure \ref{ResultOfRequest} shows the result of the query execution.
\caption{An example of the result of a request to Neo4j.}\label{ResultOfRequest}
\end{figure}
Figure \ref{ResultOfRequest} shows that the hash \texttt{b199ef8568f72c43f6fd50860e228c51} matches the path \textit{[``root'', ``cont'']}. Two graphs contain this hash. The primary keys of the root nodes of these graphs are 7872 and 7977.
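To show how such query results can be used to find graphs that share a path, the following Python sketch groups root node keys by hash. The function name \texttt{graphs\_by\_hash} and the row layout are assumptions. The first two rows reuse the values reported above, and the third row is hypothetical.

```python
from collections import defaultdict

def graphs_by_hash(rows):
    """Map every hash to the set of root node keys whose graphs contain it."""
    index = defaultdict(set)
    for root_id, digest in rows:
        index[digest].add(root_id)
    return index

# Each row is (root node key, hash of one path in that graph).
rows = [
    (7872, "b199ef8568f72c43f6fd50860e228c51"),
    (7977, "b199ef8568f72c43f6fd50860e228c51"),
    (7872, "hypothetical-hash-only-in-7872"),  # made-up value for illustration
]

index = graphs_by_hash(rows)

# Hashes that occur in more than one graph mark structurally matching paths.
shared = {digest: ids for digest, ids in index.items() if len(ids) > 1}
```

Here \texttt{shared} contains a single entry, the hash of the path \textit{[``root'', ``cont'']} with the keys 7872 and 7977.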
Thus, by obtaining hashes for all AST fragments, we can determine the number of matching and differing paths in the analyzed projects. Figure \ref{ExampleSystem} shows an example of the developed system.