Complete fixes

Aleksey Filippov 2023-04-09 00:36:15 +04:00
parent a3573bb85d
commit a1655e0397

169
paper.tex

@ -19,6 +19,11 @@ basicstyle=\footnotesize\ttfamily,
framexleftmargin=-4pt,
linewidth=13.75cm
}
\usepackage{array}
\newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
@ -77,9 +82,9 @@ We can analyze projects using call graph generation tools, such as \textit{CodeV
Another group of methods is based on obtaining and analyzing abstract syntax trees (AST). An AST is an abstract representation of the grammatical structure of source code. It expresses the structure of a program as a tree and depends little on the programming language. Each AST node is an operator or a set of operators of the analyzed source code. The compiler generates an AST during the parsing step. Unlike a parse tree, an AST does not contain nodes or edges that do not define the semantics of the program (for example, grouping brackets).
AST-based approaches allow us to find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches analyze a larger number of parameters than is necessary to solve the problem of this study \cite{aleksey2020approach,beniwal2021npmrec,nadezhda2019approach,nguyen2018crosssim,nguyen2020automated}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.
AST-based approaches allow us to find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches analyze a larger number of parameters than is necessary to solve the problem of this study \cite{beniwal2021npmrec,aleksey2020approach,nguyen2018crosssim,nguyen2020automated,nadezhda2019approach}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.
The paper \cite{ali2011overview} presents a review of approaches and software tools for searching for borrowings in the text and source code. However, there is no mention of existing software tools for searching for borrowings in the source code.
The paper \cite{ali2011overview} presents a review of approaches and software tools for detecting borrowings in text and source code. However, it does not mention existing software tools for detecting borrowings in source code.
In the article \cite{chae2013software}, the authors analyze borrowings in the source code according to the sequences of calls to external programming interfaces (external dependencies) and the frequency of such calls. This method is not suitable for solving the problem of this study because the study targets educational projects: some student projects do not use external dependencies at all.
@ -128,6 +133,7 @@ We developed an algorithm to extract the structure of the project in the source
Figure \ref{fig:SourceCodeAndAST} shows a fragment of the source code and the resulting AST.
\begin{figure}
\centering
\includegraphics[width=0.71\textwidth]{images/SourceCodeAndASTVert.png}
\caption{Sample source code and its AST.} \label{fig:SourceCodeAndAST}
\end{figure}
@ -157,9 +163,10 @@ The proposed AST hashing algorithm consists of the following steps:
\end{enumerate}
The following expression is used to calculate project originality:
\begin{equation*}
\begin{equation}
\label{eq:calc-orig}
O = \frac{\left| H^{C} \setminus H \right|}{\left| H^{C} \right|},
\end{equation*}
\end{equation}
where $H^{C}$ is the set of hash values of the analyzed project;\\
$H$ is the set of hash values of other projects in the system.
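The originality expression can be illustrated with a short sketch. It reads the expression as the share of hash values of the analyzed project ($H^{C}$) that do not occur among the hash values of other projects ($H$). The function name \texttt{originality} and the sample hash strings are hypothetical and serve only as an illustration; this is not the implementation used in the developed system.

```python
def originality(project_hashes, other_hashes):
    """Share of the project's path hashes that are absent from other projects.

    project_hashes -- hash values of the analyzed project (H^C)
    other_hashes   -- hash values of all other projects in the system (H)
    """
    project = set(project_hashes)
    if not project:
        raise ValueError("the analyzed project has no hashed paths")
    unique = project - set(other_hashes)  # hashes found only in this project
    return len(unique) / len(project)


# Hypothetical data: five paths in the analyzed project, three of them
# also occur in other projects.
h_c = {"h1", "h2", "h3", "h4", "h5"}
h = {"h1", "h2", "h3", "h9"}
print(originality(h_c, h))  # 0.4
```

Under this reading, a value of 0 means that every structural path of the project also occurs elsewhere, and a value of 1 means that no path is shared.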
@ -170,6 +177,7 @@ $H$ is a set of hash values of other projects in the system.
Figure \ref{DeploymentDiagram} shows the UML deployment diagram of the developed software system. The system has a three-tier architecture.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{images/DeploymentDiagram.png}
\caption{Deployment diagram.} \label{DeploymentDiagram}
\end{figure}
@ -216,19 +224,45 @@ The proposed algorithm for searching for structurally similar projects is to use
RETURN names, hash, ids
\end{lstlisting}
Figure \ref{fig:ResultOfRequest} shows the result of the Cypher-query.
Table \ref{tab:query-results} shows the result of the Cypher query.
\begin{figure}
\includegraphics[width=\textwidth]{images/ResultOfRequest.png}
\begin{table}
\centering
\caption{An example result of the search Cypher query.}
\label{fig:ResultOfRequest}
\end{figure}
\label{tab:query-results}
\begin{tabular}{L{4cm}L{3cm}L{4cm}}
\hline
\noalign{\vskip 3pt}
names & hash & ids \\
\noalign{\vskip 2pt}
\hline
\noalign{\vskip 3pt}
\textbf{root-\textgreater{}package} & \textbf{"346f...a463"} & {[}{[}7872, 7873{]}, {[}7977, 7978{]}{]} \\
\textbf{root-\textgreater{}package-\textgreater{}class} & \textbf{"840b...7f9a"} & {[}{[}7872, 7873, 7874{]},\newline {[}7977, 7978, 7979{]}{]} \\
\textbf{root-\textgreater{}package-\textgreater{}class}\newline \hspace{3mm}\textbf{-\textgreater{}method} & \textbf{"7151...0f3d"} & {[}{[}7872, 7873, 7874, 7875{]}, {[}7977, 7978, 7979, 7980{]}{]} \\
root-\textgreater{}package-\textgreater{}class\newline \hspace{3mm}-\textgreater{}method\newline \hspace{3mm}-\textgreater{}statement.control & "5810...f0c9" & {[}{[}7872, 7873, 7874, 7875, 7879{]}{]} \\
root-\textgreater{}package-\textgreater{}class\newline \hspace{3mm}-\textgreater{}method\newline \hspace{3mm}-\textgreater{}statement.control\newline \hspace{3mm}-\textgreater{}statement.expression & "fd3c...5a3c" & {[}{[}7872, 7873, 7874, 7875, 7879, 7880{]}{]} \\
\multicolumn{3}{c}{...} \\
\textbf{root-\textgreater{}package} & \textbf{"346f...a463"} & {[}{[}7872, 7873{]}, {[}7977, 7978{]}{]} \\
\textbf{root-\textgreater{}package-\textgreater{}class} & \textbf{"840b...7f9a"} & {[}{[}7872, 7873, 7874{]}, \newline {[}7977, 7978, 7979{]}{]} \\
\textbf{root-\textgreater{}package-\textgreater{}class}\newline \textbf{\hspace{3mm}-\textgreater{}method} & \textbf{"7151...0f3d"} & {[}{[}7872, 7873, 7874, 7875{]}, {[}7977, 7978, 7979, 7980{]}{]} \\
\noalign{\vskip 2pt}
\hline
\end{tabular}
\end{table}
Figure \ref{fig:ResultOfRequest} shows that the hash `b199ef8568f72c43f6fd50860e228c51' matches the path `root->cont', and two projects with identifiers 7872 and 7977 contain this structural pattern (path).
Table \ref{tab:query-results} shows that:
\begin{itemize}
\item the hash `346f...a463' matches the path `root->package',
\item the hash `840b...7f9a' matches the path `root->package->class',
\item the hash `7151...0f3d' matches the path `root->package->class->method',
\item two projects with identifiers 7872 and 7977 contain these structural patterns (paths).
\end{itemize}
Thus, we can calculate the number of matching and non-matching paths in the analyzed project compared with other projects in the data storage. Figure \ref{fig:ExampleSystem} shows the main form of the developed system.
Thus, we can calculate the number of matching and non-matching paths (see Eq.~\ref{eq:calc-orig}) in the analyzed project compared with other projects in the data storage. Figure \ref{fig:ExampleSystem} shows the main form of the developed system.
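The counting step can be sketched as follows. The rows reproduce the excerpt of the query result in Table \ref{tab:query-results}, with the abbreviated hashes kept as printed. The sketch assumes that the first identifier in each list of \texttt{ids} is the root node of a project (7872 or 7977 in the excerpt); the function name \texttt{count\_paths} is hypothetical.

```python
# Rows of the query result: (path name, hash, lists of node ids per project).
rows = [
    ("root->package", "346f...a463",
     [[7872, 7873], [7977, 7978]]),
    ("root->package->class", "840b...7f9a",
     [[7872, 7873, 7874], [7977, 7978, 7979]]),
    ("root->package->class->method", "7151...0f3d",
     [[7872, 7873, 7874, 7875], [7977, 7978, 7979, 7980]]),
    ("root->package->class->method->statement.control", "5810...f0c9",
     [[7872, 7873, 7874, 7875, 7879]]),
    ("root->package->class->method->statement.control->statement.expression",
     "fd3c...5a3c",
     [[7872, 7873, 7874, 7875, 7879, 7880]]),
]


def count_paths(rows, project_id):
    """Count paths of one project that do / do not occur in other projects.

    Assumption: the first id of every id list is the project's root node.
    """
    matching = non_matching = 0
    for _name, _hash, id_paths in rows:
        roots = {path[0] for path in id_paths}
        if project_id not in roots:
            continue  # this path does not belong to the analyzed project
        if roots - {project_id}:
            matching += 1      # the same hash also occurs in another project
        else:
            non_matching += 1  # the hash is unique to the analyzed project
    return matching, non_matching


matching, non_matching = count_paths(rows, 7872)
print(matching, non_matching)                    # 3 2
print(non_matching / (matching + non_matching))  # 0.4
```

For this excerpt only, project 7872 has three paths shared with project 7977 and two paths of its own. The full query result would contain more rows, so the value for the whole project may differ.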
\begin{figure}
\centering
\includegraphics[width=\textwidth]{images/ExampleSystem.png}
\caption{The main form of the developed system.}
\label{fig:ExampleSystem}
@ -236,68 +270,85 @@ Thus, we can calculate the number of matching and not matching paths in the anal
\section{Experiments}
We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and files in the project. The main aim of the experiment is to calculate the average number of lines of code processed in one minute. This allows us to determine the speed of the algorithm. We used the IntelliJ IDEA Statistic plugin to get the data for the experiment.
We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and the number of files in the analyzed project. The main aim of the experiment is to determine the speed of the algorithm, measured as the average number of lines of code processed per minute. We used the IntelliJ IDEA Statistic plugin to get the data for the experiment.
We selected 10 random Java projects for this experiment. Table \ref{tab:speed} presents the initial data for determining the speed of the proposed algorithm.
We selected 10 random Java projects for this experiment. Table \ref{tab:speed} presents the results of experiments for analyzing the speed of the proposed algorithm.
\begin{table}[]
\caption{Initial data for analyzing the speed of the proposed algorithm.}
\begin{table}[ht]
\centering
\caption{Results of experiments for analyzing the speed of the proposed algorithm.}
\label{tab:speed}
\begin{tabular}{|llll|l|}
\hline
\multicolumn{1}{|l|}{} & \multicolumn{1}{l|}{Project name} & \multicolumn{1}{l|}{Number of lines of code} & Number of java files & Number of rows processed per 1 minute \\ \hline
\multicolumn{1}{|l|}{1} & \multicolumn{1}{l|}{BaseRecycler} & \multicolumn{1}{l|}{3 896} & 92 & 2 491 \\ \hline
\multicolumn{1}{|l|}{2} & \multicolumn{1}{l|}{AlamazDev} & \multicolumn{1}{l|}{15 776} & 103 & 2 658 \\ \hline
\multicolumn{1}{|l|}{3} & \multicolumn{1}{l|}{SnakeBoom} & \multicolumn{1}{l|}{20 534} & 158 & 3 255 \\ \hline
\multicolumn{1}{|l|}{4} & \multicolumn{1}{l|}{retrofit} & \multicolumn{1}{l|}{32 119} & 227 & 2 718 \\ \hline
\multicolumn{1}{|l|}{5} & \multicolumn{1}{l|}{Glide} & \multicolumn{1}{l|}{37 508} & 203 & 2 576 \\ \hline
\multicolumn{1}{|l|}{6} & \multicolumn{1}{l|}{ZXing} & \multicolumn{1}{l|}{51 857} & 310 & 2 533 \\ \hline
\multicolumn{1}{|l|}{7} & \multicolumn{1}{l|}{RxJava} & \multicolumn{1}{l|}{64 101} & 339 & 2 814 \\ \hline
\multicolumn{1}{|l|}{8} & \multicolumn{1}{l|}{VisualProjectCore} & \multicolumn{1}{l|}{71 303} & 450 & 2 969 \\ \hline
\multicolumn{1}{|l|}{9} & \multicolumn{1}{l|}{mc-dev} & \multicolumn{1}{l|}{85 267} & 877 & 2 746 \\ \hline
\multicolumn{1}{|l|}{10} & \multicolumn{1}{l|}{xRayJavaTool} & \multicolumn{1}{l|}{97 249} & 937 & 2 730 \\ \hline
\multicolumn{4}{|l|}{Average value} & 2 750 \\ \hline
\begin{tabular}{L{0.5cm}L{3.5cm}L{2cm}L{2cm}L{2cm}}
\hline
\noalign{\vskip 3pt}
\# & Project name & Lines of code & Java files & Lines of code per minute \\
\noalign{\vskip 2pt}
\hline
\noalign{\vskip 3pt}
1 & BaseRecycler & 3 896 & 92 & 2 491 \\
2 & AlamazDev & 15 776 & 103 & 2 658 \\
3 & SnakeBoom & 20 534 & 158 & 3 255 \\
4 & retrofit & 32 119 & 227 & 2 718 \\
5 & Glide & 37 508 & 203 & 2 576 \\
6 & ZXing & 51 857 & 310 & 2 533 \\
7 & RxJava & 64 101 & 339 & 2 814 \\
8 & VisualProjectCore & 71 303 & 450 & 2 969 \\
9 & mc-dev & 85 267 & 877 & 2 746 \\
10 & xRayJavaTool & 97 249 & 937 & 2 730 \\
\noalign{\vskip 2pt}
\hline
\noalign{\vskip 3pt}
\multicolumn{4}{l}{Average value} & 2 749 \\
\noalign{\vskip 2pt}
\hline
\end{tabular}
\end{table}
Table \ref{tab2} presents the results of experiments to determine the speed of the proposed algorithm.
Table \ref{tab:time-size} presents the results of experiments to determine the total analysis time of the projects and the number of nodes in the resulting graphs.
\begin{table}[]
\caption{Results of experiments performed to evaluate the speed of the proposed algorithm.}\label{tab2}
\begin{tabular}{|llll|l|}
\hline
\multicolumn{1}{|l|}{} & \multicolumn{1}{l|}{Project name} & \multicolumn{1}{l|}{Parsing speed (min)} & Number of graph nodes & Number of rows processed per 1 minute \\ \hline
\multicolumn{1}{|l|}{1} & \multicolumn{1}{l|}{BaseRecycler} & \multicolumn{1}{l|}{1.564} & 844 & 2 491 \\ \hline
\multicolumn{1}{|l|}{2} & \multicolumn{1}{l|}{AlamazDev} & \multicolumn{1}{l|}{5.935} & 1 837 & 2 658 \\ \hline
\multicolumn{1}{|l|}{3} & \multicolumn{1}{l|}{SnakeBoom} & \multicolumn{1}{l|}{6.308} & 2 197 & 3 255 \\ \hline
\multicolumn{1}{|l|}{4} & \multicolumn{1}{l|}{retrofit} & \multicolumn{1}{l|}{11.817} & 7 118 & 2 718 \\ \hline
\multicolumn{1}{|l|}{5} & \multicolumn{1}{l|}{Glide} & \multicolumn{1}{l|}{14.556} & 8 496 & 2 576 \\ \hline
\multicolumn{1}{|l|}{6} & \multicolumn{1}{l|}{ZXing} & \multicolumn{1}{l|}{20.468} & 10 560 & 2 533 \\ \hline
\multicolumn{1}{|l|}{7} & \multicolumn{1}{l|}{RxJava} & \multicolumn{1}{l|}{22.777} & 11 972 & 2 814 \\ \hline
\multicolumn{1}{|l|}{8} & \multicolumn{1}{l|}{VisualProjectCore} & \multicolumn{1}{l|}{24.009} & 13 334 & 2 969 \\ \hline
\multicolumn{1}{|l|}{9} & \multicolumn{1}{l|}{mc-dev} & \multicolumn{1}{l|}{31.048} & 14 444 & 2 746 \\ \hline
\multicolumn{1}{|l|}{10} & \multicolumn{1}{l|}{xRayJavaTool} & \multicolumn{1}{l|}{35.613} & 23 946 & 2 730 \\ \hline
\multicolumn{4}{|l|}{Average value} & 2 750 \\ \hline
\centering
\caption{Results of experiments to determine the total analysis time of the projects and the number of nodes in the resulting graphs.}
\label{tab:time-size}
\begin{tabular}{L{0.5cm}L{3.5cm}L{2cm}L{2cm}}
\hline
\noalign{\vskip 3pt}
\# & Project name & Total time (min) & Number of graph nodes \\
\noalign{\vskip 2pt}
\hline
\noalign{\vskip 3pt}
1 & BaseRecycler & 1.6 & 844 \\
2 & AlamazDev & 6.0 & 1 837 \\
3 & SnakeBoom & 6.3 & 2 197 \\
4 & retrofit & 11.8 & 7 118 \\
5 & Glide & 14.6 & 8 496 \\
6 & ZXing & 20.5 & 10 560 \\
7 & RxJava & 22.7 & 11 972 \\
8 & VisualProjectCore & 24.1 & 13 334 \\
9 & mc-dev & 31.1 & 14 444 \\
10 & xRayJavaTool & 35.6 & 23 946 \\
\noalign{\vskip 2pt}
\hline
\end{tabular}
\end{table}
\end{table}
The experiment revealed that we processed an average of 2,750 lines of code per minute. Laboratory and coursework are on average 500-3000 lines of code. Thus, the processing speed of one laboratory on average will take less than one minute.
The experiment revealed that we processed an average of 2 749 lines of code per minute. Student projects contain 500--3000 lines of code on average. Thus, the analysis of one project takes less than one minute on average.
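The arithmetic behind this estimate can be checked with a short sketch. The rates are the per-project values from Table \ref{tab:speed}. The helper \texttt{estimated\_minutes} is hypothetical and assumes that analysis time grows linearly with the number of lines of code, which the experiments suggest but do not prove.

```python
from statistics import mean

# Lines of code processed per minute for the ten projects in the speed table.
rates = [2491, 2658, 3255, 2718, 2576, 2533, 2814, 2969, 2746, 2730]
avg_rate = mean(rates)
print(avg_rate)  # 2749


def estimated_minutes(lines_of_code, rate=avg_rate):
    """Estimated analysis time, assuming time scales linearly with size."""
    return lines_of_code / rate


# Lower bound, midpoint and upper bound of the typical student project size.
for size in (500, 1750, 3000):
    print(size, round(estimated_minutes(size), 2))
# 500 0.18
# 1750 0.64
# 3000 1.09
```

Under the linearity assumption, projects of up to about 2 749 lines are analyzed within one minute, and a 3000-line project takes slightly longer.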
\section{Conclusion}
This article presents the results of developing an approach and a system for searching for structurally similar projects.
We completed the following tasks:
We solved the following tasks:
\begin{itemize}
\item we analyzed existing methods of source code analysis, including methods for detecting borrowings in text and source code;
\item we developed the algorithm for extracting the AST from a project's source code;
\item we developed the algorithm for determining the originality of a project based on hashing the AST structure;
\item we implemented the software system to determine the originality of a project;
\item we conducted experiments to determine the speed of the proposed algorithm.
\end{itemize}
\begin{itemize}
\item we analyzed existing methods of source code analysis, including for determining originality of the project;
\item we developed an algorithm for constructing AST in analyzing the source code of the project;
\item we developed an algorithm for determining originality of the project based on the analysis of the AST structure;
\item we implemented a software system to determine originality based on the analysis of its structure;
\item we conducted experiments to determine the speed of the proposed algorithm.
\end{itemize}
Thus, the developed system makes it possible to find borrowings in student projects in less than a minute on average.
%
% ---- Bibliography ----