Add some fixes

master
Aleksey Filippov 1 year ago
parent 914a8edaf1
commit a3573bb85d

Binary file not shown (before: 33 KiB; after: 130 KiB).

@ -17,7 +17,7 @@
}
@inproceedings{nadezhda2019approach,
title={An approach to similar software projects searching and architecture analysis based on artificial intelligence methods},
author={Nadezhda, Yarushkina and Gleb, Guskov and Pavel, Dudarin and Vladimir, Stuchebnikov},
author={Yarushkina, Nadezhda and Guskov, Gleb and Dudarin, Pavel and Stuchebnikov, Vladimir},
booktitle={Proceedings of the Third International Scientific Conference “Intelligent Information Technologies for Industry”(IITI18) Volume 1 3},
pages={341--352},
year={2019},
@ -33,15 +33,15 @@
}
@inproceedings{aleksey2020approach,
title={Approach to the Search for Software Projects Similar in Structure and Semantics Based on the Knowledge Extracted from Existed Projects},
author={Aleksey Alekundrovich, Filippov and Yurevich, Guskov Gleb and Aleksey Michailovich, Namestnikov and Nudezhda Glebovna, Yarushkina},
author={Filippov, Aleksey and Guskov, Gleb and Namestnikov, Aleksey and Yarushkina, Nadezhda},
booktitle={Computational Science and Its Applications--ICCSA 2020: 20th International Conference, Cagliari, Italy, July 1--4, 2020, Proceedings, Part I},
pages={718--733},
year={2020},
organization={Springer}
}
@inproceedings{ali2011overview,
title={Overview and comparison of plagiarism detection tools.},
author={Ali, Asim M El Tahir and Abdulla, Hussam M Dahwa and Snasel, Vaclav},
title={Overview and comparison of plagiarism detection tools},
author={Ali, Asim M El Tahir and Abdulla, Hussam M Dahwa and Snasel, Vaclav},
booktitle={Dateso},
pages={161--172},
year={2011}

@ -12,6 +12,14 @@
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{listings}
\lstset{
numbers=none,
language=SQL,
basicstyle=\footnotesize\ttfamily,
framexleftmargin=-4pt,
linewidth=13.75cm
}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
@ -93,7 +101,6 @@ $R$ is the set of relations between AST nodes.
We developed an algorithm that extracts the structure of a project during source code analysis. The proposed algorithm consists of the following steps:
\begin{enumerate}
\item Extract the AST from the project.
\item Select nodes with the `Class' type as the $N^{Class}$ set:
\begin{equation*}
N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{`Class'} \rbrace,
@ -114,100 +121,91 @@ We developed an algorithm to extract the structure of the project in the source
\begin{equation*}
N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Operator'} \rbrace,
\end{equation*}
\item Create based on previously got sets of AST for the analyzed source code, considering the set of relations R.
\item Save the resulting AST in a graph database (GDB) to facilitate data handling.
\item Create the set of relations $R$ between the nodes of the sets obtained in the previous steps.
\item Save the resulting AST in a graph database (GDB).
\end{enumerate}
Figure \ref{SourceCodeAndAST} shows an example of the source code and the ASD got for it.
Figure \ref{fig:SourceCodeAndAST} shows a fragment of the source code and the resulting AST.
\begin{figure}
\includegraphics[width=\textwidth]{images/SourceCodeAndAST.png}
\caption{Sample source code and its AST.} \label{SourceCodeAndAST}
\includegraphics[width=0.71\textwidth]{images/SourceCodeAndASTVert.png}
\caption{Sample source code and its AST.} \label{fig:SourceCodeAndAST}
\end{figure}
GDB is a non-relational type of database based on the topographic structure of the network. Graphs represent sets of data as nodes, edges, and properties. Relational databases provide a structured approach to data. GDBs are more flexible and focused on fast data acquisition, considering various types of links between them.
We used Neo4j as the GDB, since this system has a high speed of operation even with a large amount of stored data.
A GDB is a non-relational database based on the topological structure of a network. Graphs represent data as nodes, edges, and properties. GDBs are more flexible than relational databases and allow fast retrieval of data of various types while taking numerous relations into account.
\section{The Proposed Algorithm for Determining the Structural Similarity of Software Projects}
We use the Neo4j GDB for data storage, since Neo4j remains fast even with a large amount of stored data.
The determination of the structural similarity of projects is based on the use of a hashing algorithm. We use a hash function to collapse an input array of any size into string.
\section{The Proposed Algorithm for Detecting the Structural Similarity of Software Projects}
It is necessary to get the values of the AST hash function of the analyzed project:
The detection of structural similarity between projects is based on a hashing algorithm. We use a hash function to reduce input data of arbitrary size to a string of fixed length.
The proposed AST hashing algorithm consists of the following steps:
\begin{enumerate}
\item Get fragments (paths) of the AST graph from the root node of the graph tree to each node of the graph.
\item We extract the distinguishing property of each node (type) in the current fragment.
\item We passed the type of the node to the hash function. The result is an output string of a certain length.
\item If fragment \textit{A} contains hash \textit{H} and fragment \textit{B} contains hash H, then we can say that fragments \textit{A} and \textit{B} have similarity in one node.
\item If fragment \textit{A} contains hash \textit{H}, and fragment \textit{B} does not contain hash \textit{H}, then fragments \textit{A} and \textit{B} do not have similar nodes.
\item Select all paths of the AST graph from the root node to every other node.
\item Get the value of the `type' property for each node of the current path.
\item Calculate the MD5 hash of the current path. We calculate the hashes using the \texttt{apoc.util.md5} function of the APOC plugin for Neo4j. This step produces a set of tuples, each containing the following values:
\begin{itemize}
\item a path,
\item a path md5 hash.
\end{itemize}
For example:
\begin{itemize}
\item $\langle$`root$\rightarrow$class$\rightarrow$method$\rightarrow$if', `820b...9c4b'$\rangle$,
\item $\langle$`root$\rightarrow$class$\rightarrow$field', `6161...eab3'$\rangle$.
\end{itemize}
\end{enumerate}
We calculate the hash function values using the Neo4j GDB using the md5 algorithm.
Thus, the number of matching hash values affects the uniqueness and borrowing rates of code in a project. The following expression is used to calculate project originality:
The following expression is used to calculate project originality:
\begin{equation*}
O = \frac{\left| H^{C} \setminus H \right|}{\left| H^{C} \right|},
\end{equation*}
where $H^{C}$ is a set of hash functions of the analyzed project;\\
$H$ is a set of hash values of other projects in the system.
where $H^{C}$ is the set of values of the hash function of the current project;
$H$ is the set of hash values of other projects in the system.
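As a small worked illustration, assume that the expression is read as the share of path hashes of the current project that do not occur in any other project. The hash labels $h_{1}, \dots, h_{5}$ below are hypothetical placeholders and are not taken from the experiments:
\begin{equation*}
H^{C} = \lbrace h_{1}, h_{2}, h_{3}, h_{4} \rbrace, \quad
H = \lbrace h_{2}, h_{4}, h_{5} \rbrace, \quad
O = \frac{\left| \lbrace h_{1}, h_{3} \rbrace \right|}{\left| H^{C} \right|} = \frac{2}{4} = 0.5 .
\end{equation*}
Under this reading, $O = 1$ means that no path of the current project occurs in other projects, and $O = 0$ means that every path has a match.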
\section{Results}
\section{Description of the Developed Software System}
\subsection{Architecture of the developed system}
Figure \ref{DeploymentDiagram} shows the developed software system. The system comprises three nodes, which are on different nodes of the computer network.
Figure \ref{DeploymentDiagram} shows the UML deployment diagram of the developed software system. The system has a three-tier architecture.
\begin{figure}
\includegraphics[width=\textwidth]{images/DeploymentDiagram.png}
\caption{Deployment diagram.} \label{DeploymentDiagram}
\end{figure}
Users interact with the web client on the \textit{Frontend} node. The \textit{Backend} node performs the main business logic for implementing the proposed approach in searching for structurally similar projects. \textit{Backend} and \textit{Frontend} nodes communicate through an API. The GDB is on the \textit{Database} node and provides a saving of project data.
Users interact with the web client on the \textit{Frontend} node. The \textit{Backend} node performs the main business logic for searching for structurally similar projects. The \textit{Backend} and \textit{Frontend} nodes communicate through an API. The \textit{Database} node provides data storage.
The developed software system is a web application with a client-server architecture. The web client sends requests to the server; the server processes them and returns responses to the web client.
The web client is an application written in JavaScript with the Vue.js framework. Vue.js is a framework for developing single page applications and web interfaces. The main advantages of this framework are the small size of the library in lines of code, performance, flexibility, and excellent documentation.
The web client is an application written in JavaScript using the Vue.js framework. Vue.js is a framework for developing single-page applications and web interfaces. The main advantages of this framework are its light weight (small size of the library in lines of code), performance, flexibility, and excellent documentation.
We implement the server part of the application in Java with the Spring Boot framework. The Spring framework is a ecosystem for developing applications in the Java language. The Spring Boot includes a huge number of ready-to-use modules. The main advantages of this framework include speed and convenience of development, auto-configuration of all components, easy access to databases and network capabilities.
We implement the server part of the application in Java using the Spring Boot framework. The Spring framework is an ecosystem for developing applications in Java that includes a large number of ready-made modules. Spring Boot extends the Spring framework. Its main advantages include speed and ease of development, auto-configuration of components, and easy access to databases and network capabilities.
The current version of the software system supports only Java-based software projects. The JavaParser library is used to form an AST in the Java source code analysis. This library allows you to extract the AST using the previously discussed algorithm.
The current version of the software system supports source code analysis only for Java. The JavaParser library is used to form an AST during Java code analysis. This library builds an internal representation of the AST, which is then translated into the proposed AST model using the previously discussed algorithm.
\subsection{Data model for representing AST as a GDB fragment}
Figure \ref{JavaParserLibrary} shows an example of the internal representation of the AST in the JavaParser library.
\begin{figure}
\includegraphics[width=\textwidth]{images/JavaParserLibrary.png}
\caption{An example of the internal representation of the JavaParser library AST.} \label{JavaParserLibrary}
\end{figure}
We used neo4j GDB for data storage. Possibly redundant expression GDBs over relational ones is the ability to change the data model.
The GDB data model allows you to store the following nodes:
In this subsection, we discuss the proposed data model for representing an AST as a GDB fragment.
The GDB data model contains nodes of the following types:
\begin{itemize}
\item nodes with type `Package' (Java-specific);
\item nodes with type `Class';
\item nodes with the `Class field' type;
\item nodes with type `Method';
\item nodes with type `Method argument';
\item nodes with type `Operator'.
\item `Package' (Java-specific),
\item `Class',
\item `Class field',
\item `Method',
\item `Method argument',
\item `Statement' (declaration, expression and control statements).
\end{itemize}
We arrange the nodes in the GDB hierarchically. For example, a class is in a package, but a method is in a class. The data model allows you to form the following relationships between graph nodes:
We arrange the nodes in the GDB hierarchically. For example, a class node is part of a package node, and a method node is part of a class node. The data model allows the following relationships between nodes:
\begin{itemize}
\item $HAS_CLASS$ is a relationship between a package and a class;
\item $HAS_FIELD$ is a relationship between a class and a class field;
\item $HAS_METHOD$ is a is a link between a class and a method;
\item $HAS_ARG$ is a relationship between method and method argument;
\item $HAS_BLOCK$ is a link between a method and a statement.
\item `HAS\_CLASS' is a relationship between `Package' and `Class' nodes,
\item `HAS\_FIELD' is a relationship between `Class' and `Class field' nodes,
\item `HAS\_METHOD' is a relationship between `Class' and `Method' nodes,
\item `HAS\_ARG' is a relationship between `Method' and `Method argument' nodes,
\item `HAS\_BLOCK' is a relationship between `Method' and `Statement' nodes.
\end{itemize}
The main idea of searching for structurally similar projects is to use hashing of graph fragments (paths) based on the md5 function. We formed the path from the root node to each node of the graph. We take nesting into account when forming the path (package -> class -> field / method -> method argument / operator). The hash function takes the string representation of the node type as input.
Example of a request to get paths and a hash that matches this path:
The proposed algorithm for searching for structurally similar projects hashes graph paths with the MD5 function. The hashing algorithm is described in the previous section. The search algorithm can be expressed as the following Cypher query:
\begin{lstlisting}
MATCH p = (o{name:"root"})-[r*]- ()
WHERE ID(o)={0}
@ -218,31 +216,33 @@ Example of a request to get paths and a hash that matches this path:
RETURN names, hash, ids
\end{lstlisting}
Figure \ref{ResultOfRequest} shows the result of the query execution.
Figure \ref{fig:ResultOfRequest} shows the result of the Cypher query.
\begin{figure}
\includegraphics[width=\textwidth]{images/ResultOfRequest.png}
\caption{An example of the result of a request to Neo4j.} \label{ResultOfRequest}
\caption{An example of the result of the search Cypher query.}
\label{fig:ResultOfRequest}
\end{figure}
Figure \ref{ResultOfRequest} shows that the hash \textit{"b199ef8568f72c43f6fd50860e228c51"} matches the path \textit{[`root', `cont']}. Two graphs contain this hash. The primary keys of the root nodes of these graphs are 7872 and 7977.
Figure \ref{fig:ResultOfRequest} shows that the hash `b199ef8568f72c43f6fd50860e228c51' corresponds to the path `root$\rightarrow$cont', and that two projects, with identifiers 7872 and 7977, contain this structural pattern (path).
Thus, we can discover the number of matching and different paths in the analyzed projects by obtaining hashes for all AST fragments. Figure \ref{ExampleSystem} shows an example of the developed system.
Thus, we can calculate the number of matching and non-matching paths in the analyzed project compared with the other projects in the data storage. Figure \ref{fig:ExampleSystem} shows the main form of the developed system.
\begin{figure}
\includegraphics[width=\textwidth]{images/ExampleSystem.png}
\caption{System operation example.} \label{ExampleSystem}
\caption{The main form of the developed system.}
\label{fig:ExampleSystem}
\end{figure}
\section{Experiments}
We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and files in the project. The main aim of the experiment is to calculate the average number of lines of code and the time to complete the analysis in one minute. This allows us to determine the speed of the algorithm. We used the IntelliJ IDEA Statistic plugin to get the data for the experiment.
We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and files in the project. The main aim of the experiment is to calculate the average number of lines of code processed in one minute. This allows us to determine the speed of the algorithm. We used the IntelliJ IDEA Statistic plugin to get the data for the experiment.
Table \ref{tab1} presents the initial data for determining the speed of the proposed algorithm. We selected 10 random Java projects as data.
We selected 10 random Java projects for this experiment. Table \ref{tab:speed} presents the initial data for determining the speed of the proposed algorithm.
\begin{table}[]
\caption{Initial data for analyzing the speed of the proposed algorithm.}\label{tab1}
\caption{Initial data for analyzing the speed of the proposed algorithm.}
\label{tab:speed}
\begin{tabular}{|llll|l|}
\hline
\multicolumn{1}{|l|}{} & \multicolumn{1}{l|}{Project name} & \multicolumn{1}{l|}{Number of lines of code} & Number of Java files & Number of lines processed per minute \\ \hline
@ -291,7 +291,7 @@ Table \ref{tab2} presents the results of experiments to determine the speed of t
\begin{itemize}
\item we analyzed existing methods of source code analysis, including methods for determining the originality of a project;
\item we developed an algorithm for constructing ASD in analyzing the source code of the project;
\item we developed an algorithm for constructing an AST during the analysis of a project's source code;
\item we developed an algorithm for determining the originality of a project based on the analysis of the AST structure;
\item we implemented a software system that determines the originality of a project based on the analysis of its structure;
\item we conducted experiments to determine the speed of the proposed algorithm.
