diff --git a/images/ASTMetArgs.png b/images/ASTMetArgs.png index cabe6ef..318eeb2 100644 Binary files a/images/ASTMetArgs.png and b/images/ASTMetArgs.png differ diff --git a/images/ASTMethod.png b/images/ASTMethod.png new file mode 100644 index 0000000..c4a6611 Binary files /dev/null and b/images/ASTMethod.png differ diff --git a/images/ASTMethod1.png b/images/ASTMethod1.png deleted file mode 100644 index 72b346a..0000000 Binary files a/images/ASTMethod1.png and /dev/null differ diff --git a/images/ASTMethod2.png b/images/ASTMethod2.png deleted file mode 100644 index 7a41e22..0000000 Binary files a/images/ASTMethod2.png and /dev/null differ diff --git a/images/ASTFinish.png b/images/ASTOperators.png similarity index 100% rename from images/ASTFinish.png rename to images/ASTOperators.png diff --git a/images/ExampleSystemEng.png b/images/ExampleSystemEng.png index 6f5f38f..e1603f5 100644 Binary files a/images/ExampleSystemEng.png and b/images/ExampleSystemEng.png differ diff --git a/images/ASTCondition.png b/old/ASTCondition.png similarity index 100% rename from images/ASTCondition.png rename to old/ASTCondition.png diff --git a/images/ASTConditionEx.png b/old/ASTConditionEx.png similarity index 100% rename from images/ASTConditionEx.png rename to old/ASTConditionEx.png diff --git a/images/ASTConditionField.png b/old/ASTConditionField.png similarity index 100% rename from images/ASTConditionField.png rename to old/ASTConditionField.png diff --git a/images/ASTLoop.png b/old/ASTLoop.png similarity index 100% rename from images/ASTLoop.png rename to old/ASTLoop.png diff --git a/images/ASTLoopField.png b/old/ASTLoopField.png similarity index 100% rename from images/ASTLoopField.png rename to old/ASTLoopField.png diff --git a/paper.bib b/paper.bib index 7a3ca46..45e2204 100644 --- a/paper.bib +++ b/paper.bib @@ -93,6 +93,18 @@ note = "[Online; accessed 11-June-2023]" } +@misc{ref_cypher, + title = {Cypher Query Language - Developer Guides}, + url = 
{https://neo4j.com/developer/cypher/}, + note = {[Online; accessed 11-June-2023]} +} + +@misc{ref_md5, + title = {Text Functions - APOC Extended Documentation}, + url = {https://neo4j.com/labs/apoc/4.4/misc/text-functions/\#text-functions-hashing}, + note = {[Online; accessed 11-June-2023]} +} + @misc{ref_statistic, title = {Statistic - IntelliJ IDEs plugin}, url = {https://plugins.jetbrains.com/plugin/4509-statistic}, diff --git a/paper.pdf b/paper.pdf index b00f2b5..bda3a73 100644 Binary files a/paper.pdf and b/paper.pdf differ diff --git a/paper.tex b/paper.tex index a1a9a6e..a8af8e6 100644 --- a/paper.tex +++ b/paper.tex @@ -11,10 +11,13 @@ % \usepackage{amsmath} \usepackage{graphicx} +\usepackage{xcolor} +\definecolor{eminence}{RGB}{108,48,130} \usepackage{listings} \lstset{ numbers=none, -language=SQL, +language=Java, +keywordstyle=\color{eminence}\bf, basicstyle=\footnotesize\ttfamily, framexleftmargin=-4pt, linewidth=13.75cm @@ -23,6 +26,7 @@ linewidth=13.75cm \newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}} \newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}} \newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}} +\usepackage{longtable} % Used for displaying a sample figure. If possible, figure files should @@ -43,6 +47,7 @@ linewidth=13.75cm % \author{Aleksey Filippov\inst{1}\orcidID{0000-0003-0008-5035} \and Anton Romanov\inst{1}\orcidID{0000-0001-5275-7628} \and +Anton Skalkin\inst{1}\orcidID{0000-0003-0044-8027} \and Julia Stroeva\inst{1}\orcidID{0009-0003-8026-235X}} % \authorrunning{A. Filippov et al.} @@ -55,7 +60,7 @@ Technical University, 32 Severny Venetz Street, 432027 Ulyanovsk, Russia} \maketitle % typeset the header of the contribution % \begin{abstract} - The authors have developed an approach to the search for structurally similar projects of software systems. Teachers can use the proposed approach to search for borrowings in the works of students. 
The concept behind this proposal is that it can to locate projects that students have used as parts of a current project. + The authors have developed an approach to the search for structurally similar projects of software systems. Teachers can use the proposed approach to search for borrowings in the works of students. The concept behind this proposal is that it can locate projects that students have used as parts of a current project. The authors propose a new algorithm for determining the similarity between the structures of software projects. The proposed algorithm is based on finding similar structural elements in the source code of the program in an abstract syntax trees analyzing. @@ -90,13 +95,15 @@ The paper \cite{ali2011overview} presents an review of approaches and software t In the article \cite{chae2013software}, the authors analyze borrowings in the source code according to the sequences of using external programming interfaces (external dependencies) and the frequency of such calls. This method is not suitable for solving the problem of this study because of the educational orientation. Some student projects can not use external dependencies. -Thus, it is necessary to develop an approach to the search for structurally similar projects, which are focused on simple software systems and a high speed of analysis. A set of source code files will be considered as projects. +Thus, it is necessary to develop an approach to the search for structurally similar projects that is focused on simple software systems and a high speed of analysis. \section{The Proposed Algorithm for Analyzing the Structure of the Source Code} -The source code of the software system is the main data source for structural features identifying in the proposed algorithm. +We represent a software system project as a set of source code files. The source code of the software system is the main data source for identifying structural features in the proposed algorithm. 
-We formed an AST to analyze the source code. There are various libraries and tools for all existing programming languages for the formation of AST. We use own representation of the AST to add support for new programming languages without changing the analysis algorithms. +We formed an AST to analyze the source code. There are various libraries and tools for AST formation for existing programming languages. We use our own representation of the AST so that support for new programming languages can be added without changing the analysis algorithms. Therefore, we need to develop a converter that transforms the AST generated by the parser for some programming language into our AST representation. + +For example, for Java-based software systems we analyze Java files and their hierarchy at the package level. We use the JavaParser library to form an AST for Java projects. The algorithm considered below allows us to transform the AST generated by the JavaParser library into our AST representation. 
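A language-independent AST representation of this kind can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the JavaParser API: the names `AstSketch`, `Node`, `find`, and `sample`, as well as the `Root` type label, are hypothetical and introduced only for illustration. The type labels `Class`, `Field`, `Method`, and `Arg` follow the labels used in the paper, and the sample tree mirrors the `Main` class with the field `a` and the methods `run` and `show`.

```java
import java.util.ArrayList;
import java.util.List;

class AstSketch {

    /** Hypothetical language-independent AST node: a type label, a name, and child nodes. */
    static final class Node {
        final String type;
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String type, String name) {
            this.type = type;
            this.name = name;
        }

        /** Appends a child and returns this node so that calls can be chained. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Assumed form of a search function: collects all nodes of the given type nested under 'from'. */
    static List<Node> find(Node from, String type) {
        List<Node> result = new ArrayList<>();
        for (Node child : from.children) {
            if (child.type.equals(type)) {
                result.add(child);
            }
            result.addAll(find(child, type)); // descend into the subtree
        }
        return result;
    }

    /** Builds the tree for: class Main { String a; void run() {} void show(String text) {} } */
    static Node sample() {
        Node show = new Node("Method", "show").add(new Node("Arg", "text"));
        Node main = new Node("Class", "Main")
                .add(new Node("Field", "a"))
                .add(new Node("Method", "run"))
                .add(show);
        return new Node("Root", "project").add(main); // 'Root' is an assumed label
    }

    public static void main(String[] args) {
        Node root = sample();
        for (Node method : find(root, "Method")) {
            System.out.println(method.type + " " + method.name);
        }
    }
}
```

A converter for another programming language would only need to produce `Node` trees with the same type labels, so the search step itself would stay unchanged.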
We define the proposed AST model as follows: \begin{equation*} @@ -111,125 +118,132 @@ We developed an algorithm to extract the structure of the project in the source \item Select nodes with the `Class' type as the $N^{Class}$ set: \begin{equation*} N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{`Class'} \rbrace, - \end{equation*} - - \begin{table}[] - \begin{tabular}{|l|c|} - \hline - \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline - public class Main \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTClass.png}} \\ \hline - \end{tabular} - \end{table} - + \end{equation*} \item Select nodes with the `Class field' as the $N^{Vars}$ set from the $N^{Class}$ set: \begin{equation*} N^{Vars} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{`Field'} \rbrace, \end{equation*} - - \begin{table}[] - \begin{tabular}{|l|c|} - \hline - \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline - private String a; & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTField.png}} \\ \hline - \end{tabular} - \end{table} - \item Select nodes with the `Methods' type as the $N^{Methods}$ set from the $N^{Class}$ set: \begin{equation*} N^{Methods} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{`Method'} \rbrace, \end{equation*} - - \begin{table}[] - \begin{tabular}{|l|c|} - \hline - \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline - void run() \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMethod1.png}} \\ \hline - void show(String text) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMethod2.png}} \\ \hline - \end{tabular} - \end{table} - \item Select nodes with the `Method Argument' type as the $N^{MethodsArgs}$ set from the $N^{Methods}$ set: \begin{equation*} N^{MethodsArgs} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Arg'} \rbrace, \end{equation*} 
- - \begin{table}[] - \begin{tabular}{|l|c|} - \hline - \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline - void show(String text) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMetArgs.png}} \\ \hline - \end{tabular} - \end{table} - \item Select nodes with the `Operator' type as the $N^{MethodsOps}$ from the $N^{Methods}$ set: \begin{equation*} N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Operator'} \rbrace, \end{equation*} - - \begin{table}[] - \begin{tabular}{|l|c|} - \hline - \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline - while(true) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTLoop.png}} \\ \hline - int a = 1; & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTLoopField.png}} \\ \hline - if (a == 1) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTCondition.png}} \\ \hline - this.show("Hello") & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTConditionEx.png}} \\ \hline - int c = "Foo"; & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTConditionField.png}} \\ \hline - System.out.println(text); & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTFinish.png}} \\ \hline - \end{tabular} - \end{table} - \item Create the set of ties $R$ between the nodes from sets obtained in previous steps. - \item Save the resulting AST in a graph database (GDB). + \item Save the resulting AST. \end{enumerate} -In this algorithm, F (*) is a search function that finds nested nodes. The function input is a node or subtree, and the output is a node of the desired type. - -Figure 1\ref{fig:ExampleAST} shows the resulting ACT for the following source code: +In this algorithm, $F$ is a search function that finds nested nodes. 
The function parameter is a node or subtree, and the output is the set of nodes of the desired type: class, class field, method, method argument, or statement (operator). +Let us consider the proposed algorithm using the following Java code as an example: \begin{lstlisting} - package laba1; - public class Scheduler { - private final int time = 5; - private ArrayList threads = new ArrayList(); - - void scheduler() { - for (int i=0; i streams = new ArrayList<>(); +\begin{center} + \begin{longtable}{ + >{\centering\arraybackslash}p{1cm} + >{\centering\arraybackslash}p{4cm} + >{\centering\arraybackslash}p{7cm}} + \caption{AST generation example.} \label{tab:alg} \\ + + \hline + \noalign{\vskip 2pt} + Step & Source code & AST Nodes \\ + \noalign{\vskip 1pt} + \hline + \endfirsthead - void plan() { - for (Stream stream: streams) { - stream.run(QUANT); - System.out.println(stream.toString()); - } - } - } -\end{lstlisting} + \hline + \noalign{\vskip 2pt} + Step & Source code & AST Nodes \\ + \noalign{\vskip 1pt} + \hline + \endhead -\begin{figure} - \centering - \includegraphics[width=0.71\textwidth]{images/ExampleAST.png} - \caption{Sample AST.} \label{fig:ExampleAST} -\end{figure} + \hline + \noalign{\vskip 2pt} + \multicolumn{3}{c}{{Continued on next page}} \\ + \noalign{\vskip 1pt} + \hline + \endfoot - - -GDB is a non-relational type of database based on the topographic structure of the network. Graphs represent sets of data as nodes, edges, and properties. GDBs are more flexible than relational databases. GDBs are more flexible than relational databases and allow you to fast obtain data of various types, considering numerous relations. - -We use the Neo4j \cite{ref_neo4j} GDB as the data storage. Neo4j is a graph database management system. Neo4j stores nodes, edges connecting them, and attributes of nodes and edges. Neo4j has a high speed of operation even with a large amount of stored data. 
+ \hline + \endlastfoot + 1 + & + public class Main \{\} + & + \begin{tabular}{l} + \vspace{-8pt} \\ + \includegraphics[width=4cm]{images/ASTClass.png} + \end{tabular} \\ + \hline + 2 + & + private String a; + & + \begin{tabular}{l} + \vspace{-8pt} \\ + \includegraphics[width=4cm]{images/ASTField.png} + \end{tabular} \\ + \hline + 3 + & + void run() \{\} \linebreak + void show(String text) \{\} + & + \begin{tabular}{l} + \vspace{-8pt} \\ + \includegraphics[width=4.7cm]{images/ASTMethod.png} + \end{tabular} \\ + \hline + 4 + & + void show(String text) \{\} + & + \begin{tabular}{l} + \vspace{-8pt} \\ + \includegraphics[width=4.7cm]{images/ASTMetArgs.png} + \end{tabular} \\ + \hline + 5 + & + while(true) \{\} \linebreak + int a = 1; \linebreak + if (a == 1) \{\} \linebreak + this.show("Hello"); \linebreak + String c = "Foo"; \linebreak + System.out.println(text); + & + \begin{tabular}{l} + \vspace{-8pt} \\ + \includegraphics[width=6.9cm]{images/ASTOperators.png} + \end{tabular} \\ + \end{longtable} +\end{center} \section{The Proposed Algorithm for Detecting the Structural Similarity of Software Projects} @@ -239,7 +253,7 @@ The proposed AST hashing algorithm is contains from the following steps: \begin{enumerate} \item Select all paths of the AST graph from the root node to each other node. \item Get a value of the `type' property for each node of the current path. - \item Calculate an MD5 hash function for the current path. We calculate the hash functions using the apoc.util.md5 plugin for the Neo4j. As a result of this step, formed a set that contains a tuple of the following values: + \item Calculate the MD5 hash of the current path. This step forms a set of tuples, each containing the following values: \begin{itemize} \item a path, \item a path md5 hash. @@ -279,6 +293,14 @@ We implement the server part of the application in Java with the Spring Boot fra The current version of the software system supports only Java-based software projects. 
The JavaParser library is used to form an AST in the Java source code analysis. This library allows you to extract the AST using the previously discussed algorithm. +We use Neo4j \cite{ref_neo4j} as the data storage. Neo4j is a graph database (GDB) management system. Neo4j stores nodes and the edges connecting them, and we can add attributes to both nodes and edges. Neo4j has a high speed of operation even with a large amount of stored data. + +A GDB is a non-relational type of database based on the topological structure of a network. GDBs are more flexible than relational databases and allow us to quickly obtain data of various types, taking numerous relations into account. + +Cypher \cite{ref_cypher} provides a convenient way to express queries and other Neo4j actions. Although Cypher is particularly useful for exploratory work, it is fast enough to be used in production. + +We also use the apoc.util.md5 function \cite{ref_md5} of the APOC plugin. This function computes the MD5 hash of the concatenation of the string representations of all values in a list. + \subsection{Data model for representing AST as a GDB fragment} In this subsection, we discussed the proposed data model for representing AST as a GDB fragment. @@ -310,10 +332,10 @@ The proposed algorithm for searching for structurally similar projects is to use THEN x.name ELSE x.type END] as names, [x in nodes(p) | ID(x)] as ids WITH names, apoc.util.md5(names) as hash, ids - RETURN names, hash, ids + RETURN DISTINCT names, hash, ids \end{lstlisting} -Table \ref{tab:query-results} shows the result of the Cypher-query. +Table \ref{tab:query-results} shows a sample result of the Cypher query. \begin{table} \centering @@ -330,11 +352,6 @@ Table \ref{tab:query-results} shows the result of the Cypher-query. 
\textbf{root-\textgreater{}package-\textgreater{}class} & \textbf{"840b...7f9a"} & {[}{[}7872, 7873, 7874{]},\newline {[}7977, 7978, 7979{]}{]} \\ \textbf{root-\textgreater{}package-\textgreater{}class}\newline \hspace{3mm}\textbf{-\textgreater{}method} & \textbf{"7151...0f3d"} & {[}{[}7872, 7873, 7874, 7875{]}, {[}7977, 7978, 7979, 7980{]}{]} \\ root-\textgreater{}package-\textgreater{}class\newline \hspace{3mm}-\textgreater{}method\newline \hspace{3mm}-\textgreater{}statement.control & "5810...f0c9" & {[}{[}7872, 7873, 7874, 7875, 7879{]}{]} \\ - root-\textgreater{}package-\textgreater{}class\newline \hspace{3mm}-\textgreater{}method\newline \hspace{3mm}-\textgreater{}statement.control\newline \hspace{3mm}-\textgreater{}statement.expression & "fd3c...5a3c" & {[}{[}7872, 7873, 7874, 7875, 7879, 7880{]}{]} \\ - \multicolumn{3}{c}{...} \\ - \textbf{root-\textgreater{}package} & \textbf{"346f...a463"} & {[}{[}7872, 7873{]}, {[}7977, 7978{]}{]} \\ - \textbf{root-\textgreater{}package-\textgreater{}class} & \textbf{"840b...7f9a"} & {[}{[}7872, 7873, 7874{]}, \newline {[}7977, 7978, 7979{]}{]} \\ - \textbf{root-\textgreater{}package-\textgreater{}class}\newline \textbf{\hspace{3mm}-\textgreater{}method} & \textbf{"7151...0f3d"} & {[}{[}7872, 7873, 7874, 7875{]}, {[}7977, 7978, 7979, 7980{]}{]} \\ \noalign{\vskip 2pt} \hline \end{tabular} @@ -345,18 +362,59 @@ Table \ref{tab:query-results} shows that: \item the hash `346f...a463' matches the path `root->package', \item the hash `840b...7f9a' matches the path `root->package->class', \item the hash `7151...0f3d' matches the path `root->package->class->method', - \item two projects with identifiers 7872 and 7977 contain this structural patterns (paths). + \item two projects with identifiers 7872 and 7977 contain these structural patterns (paths). The length of the collection in the \textit{ids} column shows how many projects contain the $i$-th structural element. 
The length of each element of this collection gives the length of the chain of structural elements, which is used to calculate the degree of project originality. \end{itemize} Thus, we can calculate the number of matching and not matching paths (see eq. \ref{eq:calc-orig}) in the analyzed project compare with other projects in data storage. Figure \ref{fig:ExampleSystem} shows the main form of the developed system. -\begin{figure} +\newpage + +\begin{figure}[h] \centering \includegraphics[width=\textwidth]{images/ExampleSystemEng.png} \caption{The main form of the developed system.} \label{fig:ExampleSystem} \end{figure} +Figure \ref{fig:ExampleAST} shows the resulting AST for the following source code: +\begin{lstlisting} + package laba1; + public class Scheduler { + private final int time = 5; + private ArrayList threads = new ArrayList(); + + void scheduler() { + for (int i=0; i streams = new ArrayList<>(); + + void plan() { + for (Stream stream: streams) { + stream.run(QUANT); + System.out.println(stream.toString()); + } + } + } +\end{lstlisting} + + +\begin{figure}[h] + \centering + \includegraphics[width=6cm]{images/ExampleAST.png} + \caption{Sample AST.} \label{fig:ExampleAST} +\end{figure} + \section{Experiments} We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and the number of files in the analyzing project. The main aim of the experiment is to determine the speed of the algorithm, considering the average number of lines of code processed per minute. We used the IntelliJ IDEA Statistic plugin \cite{ref_Statistic} to get the data for the experiment. The plugin allows you to calculate the number, size, number of lines, average value and other information for each file in the project. You can also find out the total number of rows, the number of lines of code, the proportion of lines of code, the number of comment lines, the proportion of comment lines, etc.