Fix reviewer issues

This commit is contained in:
Aleksey Filippov 2023-06-17 01:04:07 +04:00
parent 90f29a6188
commit adc1f8b9c3
14 changed files with 174 additions and 104 deletions

BIN
images/ASTMethod.png Normal file


@ -93,6 +93,18 @@
note = "[Online; accessed 11-June-2023]"
}
@misc{ref_cypher,
title = {Cypher Query Language - Developer Guides},
url = {https://neo4j.com/developer/cypher/},
note = {[Online; accessed 11-June-2023]}
}
@misc{ref_md5,
title = {Text Functions - APOC Extended Documentation},
url = {https://neo4j.com/labs/apoc/4.4/misc/text-functions/\#text-functions-hashing},
note = {[Online; accessed 11-June-2023]}
}
@misc{ref_statistic,
title = {Statistic - IntelliJ IDEs plugin},
url = {https://plugins.jetbrains.com/plugin/4509-statistic},

BIN
paper.pdf

Binary file not shown.

260
paper.tex
View File

@ -11,10 +11,13 @@
%
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{xcolor}
\definecolor{eminence}{RGB}{108,48,130}
\usepackage{listings}
\lstset{
numbers=none,
language=SQL,
language=Java,
keywordstyle=\color{eminence}\bf,
basicstyle=\footnotesize\ttfamily,
framexleftmargin=-4pt,
linewidth=13.75cm
@ -23,6 +26,7 @@ linewidth=13.75cm
\newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\usepackage{longtable}
% Used for displaying a sample figure. If possible, figure files should
@ -43,6 +47,7 @@ linewidth=13.75cm
%
\author{Aleksey Filippov\inst{1}\orcidID{0000-0003-0008-5035} \and
Anton Romanov\inst{1}\orcidID{0000-0001-5275-7628} \and
Anton Skalkin\inst{1}\orcidID{0000-0003-0044-8027} \and
Julia Stroeva\inst{1}\orcidID{0009-0003-8026-235X}}
%
\authorrunning{A. Filippov et al.}
@ -90,13 +95,15 @@ The paper \cite{ali2011overview} presents an review of approaches and software t
In the article \cite{chae2013software}, the authors analyze borrowings in the source code according to the sequences of calls to external programming interfaces (external dependencies) and the frequency of such calls. This method is not suitable for solving the problem of this study because of its educational orientation: some student projects may not use any external dependencies.
Thus, it is necessary to develop an approach to the search for structurally similar projects, which are focused on simple software systems and a high speed of analysis. A set of source code files will be considered as projects.
Thus, it is necessary to develop an approach to searching for structurally similar projects that is focused on simple software systems and a high speed of analysis.
\section{The Proposed Algorithm for Analyzing the Structure of the Source Code}
The source code of the software system is the main data source for structural features identifying in the proposed algorithm.
We represent each software system project as a set of source code files. The source code of the software system is the main data source for identifying structural features in the proposed algorithm.
We formed an AST to analyze the source code. There are various libraries and tools for all existing programming languages for the formation of AST. We use own representation of the AST to add support for new programming languages without changing the analysis algorithms.
We build an AST to analyze the source code. Libraries and tools for AST construction exist for widely used programming languages. We use our own AST representation so that support for new programming languages can be added without changing the analysis algorithms. Therefore, we need a converter that transforms the AST generated by a language-specific parser into our AST representation.
For example, for Java-based software systems we analyze Java files and their hierarchy at the package level. We use the JavaParser library to form an AST for Java projects. The algorithm considered below allows us to transform the AST generated by the JavaParser library into our AST representation.
We define the proposed AST model as follows:
\begin{equation*}
@ -112,124 +119,131 @@ We developed an algorithm to extract the structure of the project in the source
\begin{equation*}
N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{`Class'} \rbrace,
\end{equation*}
\begin{table}[]
\begin{tabular}{|l|c|}
\hline
\multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
public class Main \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTClass.png}} \\ \hline
\end{tabular}
\end{table}
\item Select nodes with the `Class field' as the $N^{Vars}$ set from the $N^{Class}$ set:
\begin{equation*}
N^{Vars} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{`Field'} \rbrace,
\end{equation*}
\begin{table}[]
\begin{tabular}{|l|c|}
\hline
\multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
private String a; & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTField.png}} \\ \hline
\end{tabular}
\end{table}
\item Select nodes with the `Methods' type as the $N^{Methods}$ set from the $N^{Class}$ set:
\begin{equation*}
N^{Methods} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{`Method'} \rbrace,
\end{equation*}
\begin{table}[]
\begin{tabular}{|l|c|}
\hline
\multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
void run() \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMethod1.png}} \\ \hline
void show(String text) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMethod2.png}} \\ \hline
\end{tabular}
\end{table}
\item Select nodes with the `Method Argument' type as the $N^{MethodsArgs}$ set from the $N^{Methods}$ set:
\begin{equation*}
N^{MethodsArgs} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Arg'} \rbrace,
\end{equation*}
\begin{table}[]
\begin{tabular}{|l|c|}
\hline
\multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
void show(String text) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMetArgs.png}} \\ \hline
\end{tabular}
\end{table}
\item Select nodes with the `Operator' type as the $N^{MethodsOps}$ from the $N^{Methods}$ set:
\begin{equation*}
N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Operator'} \rbrace,
\end{equation*}
\begin{table}[]
\begin{tabular}{|l|c|}
\hline
\multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
while(true) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTLoop.png}} \\ \hline
int a = 1; & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTLoopField.png}} \\ \hline
if (a == 1) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTCondition.png}} \\ \hline
this.show("Hello") & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTConditionEx.png}} \\ \hline
int c = "Foo"; & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTConditionField.png}} \\ \hline
System.out.println(text); & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTFinish.png}} \\ \hline
\end{tabular}
\end{table}
\item Create the set of ties $R$ between the nodes from sets obtained in previous steps.
\item Save the resulting AST in a graph database (GDB).
\item Save the resulting AST.
\end{enumerate}
In this algorithm, F (*) is a search function that finds nested nodes. The function input is a node or subtree, and the output is a node of the desired type.
Figure 1\ref{fig:ExampleAST} shows the resulting ACT for the following source code:
In this algorithm, $F$ is a search function that finds nested nodes. The function parameter is a node or subtree, and the output is a set of nodes of the desired type: class, class field, method, method argument, or statement (operator).
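The role of the search function can be illustrated with a minimal Java sketch. It is not the implementation used in the paper: the `AstNode` and `AstSearch` classes, their fields, and the type labels are hypothetical names chosen for illustration, and the traversal is a plain recursive depth-first search over a hand-built tree for a small class with one field and two methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node of a language-independent AST (names are assumptions).
final class AstNode {
    final String type; // e.g. "Class", "Field", "Method", "Arg", "Operator"
    final String name;
    final List<AstNode> children = new ArrayList<>();

    AstNode(String type, String name) {
        this.type = type;
        this.name = name;
    }

    // Appends a child and returns this node to allow chained construction.
    AstNode add(AstNode child) {
        children.add(child);
        return this;
    }
}

final class AstSearch {
    // Sketch of the search function: collect all nested nodes of a given type.
    static List<AstNode> find(AstNode root, String type) {
        List<AstNode> found = new ArrayList<>();
        collect(root, type, found);
        return found;
    }

    private static void collect(AstNode node, String type, List<AstNode> found) {
        for (AstNode child : node.children) {
            if (child.type.equals(type)) {
                found.add(child);
            }
            collect(child, type, found); // descend into the subtree
        }
    }

    // Hand-built tree: one class with one field and two methods.
    static AstNode sample() {
        AstNode run = new AstNode("Method", "run")
            .add(new AstNode("Operator", "while"));
        AstNode show = new AstNode("Method", "show")
            .add(new AstNode("Arg", "text"))
            .add(new AstNode("Operator", "System.out.println"));
        return new AstNode("Root", "root")
            .add(new AstNode("Class", "Main")
                .add(new AstNode("Field", "a"))
                .add(run)
                .add(show));
    }

    public static void main(String[] args) {
        AstNode root = sample();
        for (AstNode cls : find(root, "Class")) {
            System.out.println("class " + cls.name + ": "
                + find(cls, "Field").size() + " field(s), "
                + find(cls, "Method").size() + " method(s)");
        }
    }
}
```

Each step of the algorithm then corresponds to one call of `find` on the set produced by the previous step, for example `find(cls, "Method")` followed by `find(method, "Arg")`.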
Let us consider the proposed algorithm in the example of the following Java code:
\begin{lstlisting}
package laba1;
public class Scheduler {
private final int time = 5;
private ArrayList<Thread> threads = new ArrayList<Thread>();
void scheduler() {
for (int i=0; i<threads.size(); i++) {
threads.get(i).run(time);
System.out.println(threads.get(i).toString());
package com.example.demo.simple;
public class Main {
private String a;
void run() {
while(true) {
int a = 1;
if (a == 1)
this.show("Hello");
}
String c = "Foo";
}
void show(String text) {
System.out.println(text);
}
}
\end{lstlisting}
This source code looks different, but the AST is the same:
Table \ref{tab:alg} illustrates the proposed algorithm on this example. Each row of the table shows the result of one step of the algorithm.
\begin{lstlisting}
package os-lab-1;
public class Planing {
private final int QUANT = 10;
private List<Stream> streams = new ArrayList<>();
\begin{center}
\begin{longtable}{
>{\centering\arraybackslash}p{1cm}
>{\centering\arraybackslash}p{4cm}
>{\centering\arraybackslash}p{7cm}}
\caption{AST generation example.} \label{tab:alg} \\
void plan() {
for (Stream stream: streams) {
stream.run(QUANT);
System.out.println(stream.toString());
}
}
}
\end{lstlisting}
\hline
\noalign{\vskip 2pt}
Step & Source code & AST Nodes \\
\noalign{\vskip 1pt}
\hline
\endfirsthead
\begin{figure}
\centering
\includegraphics[width=0.71\textwidth]{images/ExampleAST.png}
\caption{Sample AST.} \label{fig:ExampleAST}
\end{figure}
\hline
\noalign{\vskip 2pt}
Step & Source code & AST Nodes \\
\noalign{\vskip 1pt}
\hline
\endhead
\hline
\noalign{\vskip 2pt}
\multicolumn{3}{c}{{Continued on next page}} \\
\noalign{\vskip 1pt}
\hline
\endfoot
GDB is a non-relational type of database based on the topographic structure of the network. Graphs represent sets of data as nodes, edges, and properties. GDBs are more flexible than relational databases. GDBs are more flexible than relational databases and allow you to fast obtain data of various types, considering numerous relations.
We use the Neo4j \cite{ref_neo4j} GDB as the data storage. Neo4j is a graph database management system. Neo4j stores nodes, edges connecting them, and attributes of nodes and edges. Neo4j has a high speed of operation even with a large amount of stored data.
\hline
\endlastfoot
1
&
public class Main \{\}
&
\begin{tabular}{l}
\vspace{-8pt} \\
\includegraphics[width=4cm]{images/ASTClass.png}
\end{tabular} \\
\hline
2
&
private String a;
&
\begin{tabular}{l}
\vspace{-8pt} \\
\includegraphics[width=4cm]{images/ASTField.png}
\end{tabular} \\
\hline
3
&
void run() \{\} \linebreak
void show(String text) \{\}
&
\begin{tabular}{l}
\vspace{-8pt} \\
\includegraphics[width=4.7cm]{images/ASTMethod.png}
\end{tabular} \\
\hline
4
&
void show(String text) \{\}
&
\begin{tabular}{l}
\vspace{-8pt} \\
\includegraphics[width=4.7cm]{images/ASTMetArgs.png}
\end{tabular} \\
\hline
5
&
while(true) \{\} \linebreak
int a = 1; \linebreak
if (a == 1) \{\} \linebreak
this.show("Hello"); \linebreak
String c = "Foo"; \linebreak
System.out.println(text);
&
\begin{tabular}{l}
\vspace{-8pt} \\
\includegraphics[width=6.9cm]{images/ASTOperators.png}
\end{tabular} \\
\end{longtable}
\end{center}
\section{The Proposed Algorithm for Detecting the Structural Similarity of Software Projects}
@ -239,7 +253,7 @@ The proposed AST hashing algorithm is contains from the following steps:
\begin{enumerate}
\item Select all paths of the AST graph from the root node to each other node.
\item Get a value of the `type' property for each node of the current path.
\item Calculate an MD5 hash function for the current path. We calculate the hash functions using the apoc.util.md5 plugin for the Neo4j. As a result of this step, formed a set that contains a tuple of the following values:
\item Calculate the MD5 hash of the current path. This step produces a set of tuples, each containing the following values:
\begin{itemize}
\item a path,
\item a path md5 hash.
@ -279,6 +293,14 @@ We implement the server part of the application in Java with the Spring Boot fra
The current version of the software system supports only Java-based software projects. The JavaParser library is used to form an AST in the Java source code analysis. This library allows you to extract the AST using the previously discussed algorithm.
We use Neo4j \cite{ref_neo4j} as the data storage. Neo4j is a graph database (GDB) management system. Neo4j allows us to store nodes and the edges connecting them, and to attach additional attributes to both nodes and edges. Neo4j operates at high speed even with a large amount of stored data.
A GDB is a non-relational database based on the topological structure of a network. GDBs are more flexible than relational databases and allow us to quickly obtain data of various types while taking numerous relations into account.
Cypher \cite{ref_cypher} provides a convenient way to express queries and other Neo4j actions. Although Cypher is particularly useful for exploratory work, it is fast enough to be used in production.
We also use the apoc.util.md5 plugin \cite{ref_md5}. This plugin allows us to compute the MD5 hash of the concatenation of the string representations of all entities in a Neo4j list.
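The idea of hashing a path can be sketched outside the database with the standard Java `MessageDigest` API. This is an illustration only: the `PathHash` class is a hypothetical name, the node types are assumed to be plain strings concatenated without a separator, and the resulting values are not claimed to coincide with those returned by apoc.util.md5 in Neo4j.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

final class PathHash {
    // Returns the hexadecimal MD5 digest of the concatenated node types of a path.
    static String md5(List<String> pathTypes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            for (String type : pathTypes) {
                // Feeding the parts one by one is equivalent to hashing their concatenation.
                digest.update(type.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be available on every Java platform implementation.
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        List<String> first = List.of("root", "package", "class", "method");
        List<String> second = List.of("root", "package", "class", "method");
        // Paths with the same sequence of node types produce the same hash.
        System.out.println(md5(first).equals(md5(second)));
    }
}
```

Because the parts are concatenated without a separator, two different splits of the same character sequence give the same hash; a separator that cannot occur in a type name would avoid this in a standalone implementation.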
\subsection{Data model for representing AST as a GDB fragment}
In this subsection, we discuss the proposed data model for representing AST as a GDB fragment.
@ -310,10 +332,10 @@ The proposed algorithm for searching for structurally similar projects is to use
THEN x.name ELSE x.type END] as names,
[x in nodes(p) | ID(x)] as ids
WITH names, apoc.util.md5(names) as hash, ids
RETURN names, hash, ids
RETURN DISTINCT names, hash, ids
\end{lstlisting}
Table \ref{tab:query-results} shows the result of the Cypher-query.
Table \ref{tab:query-results} shows a sample result of the Cypher query.
\begin{table}
\centering
@ -330,11 +352,6 @@ Table \ref{tab:query-results} shows the result of the Cypher-query.
\textbf{root-\textgreater{}package-\textgreater{}class} & \textbf{"840b...7f9a"} & {[}{[}7872, 7873, 7874{]},\newline {[}7977, 7978, 7979{]}{]} \\
\textbf{root-\textgreater{}package-\textgreater{}class}\newline \hspace{3mm}\textbf{-\textgreater{}method} & \textbf{"7151...0f3d"} & {[}{[}7872, 7873, 7874, 7875{]}, {[}7977, 7978, 7979, 7980{]}{]} \\
root-\textgreater{}package-\textgreater{}class\newline \hspace{3mm}-\textgreater{}method\newline \hspace{3mm}-\textgreater{}statement.control & "5810...f0c9" & {[}{[}7872, 7873, 7874, 7875, 7879{]}{]} \\
root-\textgreater{}package-\textgreater{}class\newline \hspace{3mm}-\textgreater{}method\newline \hspace{3mm}-\textgreater{}statement.control\newline \hspace{3mm}-\textgreater{}statement.expression & "fd3c...5a3c" & {[}{[}7872, 7873, 7874, 7875, 7879, 7880{]}{]} \\
\multicolumn{3}{c}{...} \\
\textbf{root-\textgreater{}package} & \textbf{"346f...a463"} & {[}{[}7872, 7873{]}, {[}7977, 7978{]}{]} \\
\textbf{root-\textgreater{}package-\textgreater{}class} & \textbf{"840b...7f9a"} & {[}{[}7872, 7873, 7874{]}, \newline {[}7977, 7978, 7979{]}{]} \\
\textbf{root-\textgreater{}package-\textgreater{}class}\newline \textbf{\hspace{3mm}-\textgreater{}method} & \textbf{"7151...0f3d"} & {[}{[}7872, 7873, 7874, 7875{]}, {[}7977, 7978, 7979, 7980{]}{]} \\
\noalign{\vskip 2pt}
\hline
\end{tabular}
@ -345,18 +362,59 @@ Table \ref{tab:query-results} shows that:
\item the hash `346f...a463' matches the path `root->package',
\item the hash `840b...7f9a' matches the path `root->package->class',
\item the hash `7151...0f3d' matches the path `root->package->class->method',
\item two projects with identifiers 7872 and 7977 contain this structural patterns (paths).
\item two projects with identifiers 7872 and 7977 contain these structural patterns (paths). The length of the collection in the \textit{ids} column shows how many projects contain the $i$-th structural element, and the length of each element of this collection gives the length of the chain of structural elements, which is used to calculate the degree of project originality.
\end{itemize}
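The way the \textit{ids} column can be read may be sketched in Java as follows. This sketch only counts the paths of one project that do or do not occur in other projects; it does not reproduce the originality formula of the paper. The `PathMatches` class is a hypothetical name, the sample data repeat the rows of the table above, and the first identifier of each chain is assumed to be the root node of a project.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PathMatches {
    // Returns {matching, notMatching}: the number of paths of the given project
    // that also occur in another project, and the number that occur only in it.
    static int[] count(Map<String, List<List<Long>>> idsByHash, long projectRootId) {
        int matching = 0;
        int notMatching = 0;
        for (List<List<Long>> chains : idsByHash.values()) {
            boolean inProject = false;
            boolean inOther = false;
            for (List<Long> chain : chains) {
                long rootId = chain.get(0); // assumption: a chain starts at the project root
                if (rootId == projectRootId) {
                    inProject = true;
                } else {
                    inOther = true;
                }
            }
            if (inProject && inOther) {
                matching++;
            } else if (inProject) {
                notMatching++;
            }
        }
        return new int[] {matching, notMatching};
    }

    // Sample rows: path hash -> chains of node identifiers.
    static Map<String, List<List<Long>>> sample() {
        Map<String, List<List<Long>>> rows = new LinkedHashMap<>();
        rows.put("346f...a463", List.of(List.of(7872L, 7873L), List.of(7977L, 7978L)));
        rows.put("840b...7f9a", List.of(List.of(7872L, 7873L, 7874L),
                                        List.of(7977L, 7978L, 7979L)));
        rows.put("7151...0f3d", List.of(List.of(7872L, 7873L, 7874L, 7875L),
                                        List.of(7977L, 7978L, 7979L, 7980L)));
        rows.put("5810...f0c9", List.of(List.of(7872L, 7873L, 7874L, 7875L, 7879L)));
        return rows;
    }

    public static void main(String[] args) {
        int[] result = count(sample(), 7872L);
        System.out.println("matching=" + result[0] + ", not matching=" + result[1]);
    }
}
```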
Thus, we can calculate the number of matching and non-matching paths (see eq. \ref{eq:calc-orig}) in the analyzed project compared with the other projects in the data storage. Figure \ref{fig:ExampleSystem} shows the main form of the developed system.
\begin{figure}
\newpage
\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{images/ExampleSystemEng.png}
\caption{The main form of the developed system.}
\label{fig:ExampleSystem}
\end{figure}
Figure \ref{fig:ExampleAST} shows the resulting AST for the following source code:
\begin{lstlisting}
package laba1;
public class Scheduler {
private final int time = 5;
private ArrayList<Thread> threads = new ArrayList<Thread>();
void scheduler() {
for (int i=0; i<threads.size(); i++) {
threads.get(i).run(time);
System.out.println(threads.get(i).toString());
}
}
}
\end{lstlisting}
This source code looks different, but the resulting AST is the same:
\begin{lstlisting}
package os_lab_1;
public class Planing {
private final int QUANT = 10;
private List<Stream> streams = new ArrayList<>();
void plan() {
for (Stream stream: streams) {
stream.run(QUANT);
System.out.println(stream.toString());
}
}
}
\end{lstlisting}
\begin{figure}[h]
\centering
\includegraphics[width=6cm]{images/ExampleAST.png}
\caption{Sample AST.} \label{fig:ExampleAST}
\end{figure}
\section{Experiments}
We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and the number of files in the analyzed project. The main aim of the experiment is to determine the speed of the algorithm in terms of the average number of lines of code processed per minute. We used the IntelliJ IDEA Statistic plugin \cite{ref_Statistic} to get the data for the experiment. The plugin calculates the count, size, number of lines, average values, and other information for the files in the project. It also reports the total number of lines, the number of lines of code, the proportion of lines of code, the number of comment lines, the proportion of comment lines, etc.