
% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.21 of 2022/01/12
%
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encodings may result in incorrect characters.
%
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{listings}
\lstset{
numbers=none,
language=SQL,
basicstyle=\footnotesize\ttfamily,
framexleftmargin=-4pt,
linewidth=13.75cm
}
\usepackage{array}
\newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}


% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
%\usepackage{color}
%\renewcommand\UrlFont{\color{blue}\rmfamily}
%
\begin{document}
%
\title{Search for Structurally Similar Projects of Software Systems}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Aleksey Filippov\inst{1}\orcidID{0000-0003-0008-5035} \and
Anton Romanov\inst{1}\orcidID{0000-0001-5275-7628} \and
Julia Stroeva\inst{1}\orcidID{0009-0003-8026-235X}}
%
\authorrunning{A. Filippov et al.}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{Department of Information Systems, Ulyanovsk State
Technical University, 32 Severny Venetz Street, 432027 Ulyanovsk, Russia}
%
\maketitle              % typeset the header of the contribution
%
\begin{abstract}
    The authors have developed an approach to searching for structurally similar software projects. Teachers can use the proposed approach to detect borrowings in students' work. The idea behind the approach is to locate existing projects that students have reused as parts of a current project.

    The authors propose a new algorithm for determining the similarity between the structures of software projects. The proposed algorithm finds similar structural elements in the program source code by analyzing abstract syntax trees.

    The authors developed a software system to evaluate the proposed algorithm. The current version of the system supports only Java programs. However, the system operates on its own representation of the abstract syntax tree, which makes it possible to add support for new programming languages.

\keywords{source code \and structure analysis \and structurally similar projects \and hashing.}
\end{abstract}

%
%
%
\section{Introduction}
Currently, most practical work of students in information technology consists of laboratory work and coursework. These assignments help students understand the theoretical material from their lectures.

Typically, a student project is a small program that solves a typical problem. In most cases, such a project contains only a few files and a small amount of code. The architecture and algorithms of such programs are also simple.

The teacher needs to spend much time checking all the projects. The teacher usually notices when a student has borrowed program source code. In such cases, students do not change the structure of the borrowed source code but rename variables, change the types of loops (from \textit{for} to \textit{while}), etc.

The software system proposed in this article analyzes the structure of projects and reports their structural similarity. An indicator of structural uniqueness is used to evaluate the originality of a project in comparison with the other projects.

\section{State of the art}

At present, there are no universal methods for analyzing the source code of software systems. Different analysis methods are used to solve different problems.

We can analyze projects using call graph generation tools, such as \textit{CodeViz} or \textit{Egypt}, or reverse engineering tools, such as IDA Pro. Call-graph-based approaches help developers with program comprehension for better program maintenance or for reducing security issues \cite{ghavamnia2020temporal,soares2021integrating,tang2022assessing,vinayaka2021android}.

Another group of methods is based on obtaining and analyzing abstract syntax trees (AST). An AST is an abstract representation of the grammatical structure of source code. It expresses the structure of a program as a tree and depends little on the programming language. Each AST node corresponds to an operator or a set of operators of the analyzed source code. The compiler generates an AST during the parsing step. Unlike a parse tree, an AST does not contain nodes or edges that do not affect the semantics of the program (for example, grouping brackets).

AST-based approaches allow us to find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches also analyze more parameters than are necessary for the problem addressed in this study \cite{beniwal2021npmrec,aleksey2020approach,nguyen2018crosssim,nguyen2020automated,nadezhda2019approach}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.

The paper \cite{ali2011overview} presents a review of approaches and software tools for detecting borrowings in text and source code. However, the review does not mention any existing software tools for detecting borrowings specifically in source code.

In \cite{chae2013software}, the authors detect borrowings in source code by analyzing the sequences of calls to external programming interfaces (external dependencies) and the frequency of such calls. This method is not suitable for the problem of this study because of its educational setting: some student projects do not use external dependencies at all.

Thus, it is necessary to develop an approach to searching for structurally similar projects that is focused on simple software systems and a high speed of analysis. In this study, a project is a set of source code files.

\section{The Proposed Algorithm for Analyzing the Structure of the Source Code}

The source code of the software system is the main data source for identifying structural features in the proposed algorithm.

We build an AST to analyze the source code. Libraries and tools for building an AST exist for most widely used programming languages. We use our own representation of the AST so that support for new programming languages can be added without changing the analysis algorithms.

We define the proposed AST model as follows:
\begin{equation*}
    AST = \langle N,R \rangle,
\end{equation*}
where $N = \lbrace N_{1}, N_{2},\ldots, N_{n}\rbrace$ is the set of AST nodes;\\
$N_{i} = \langle name, data \rangle$ is an $i$-th AST node containing the node name and data;\\
$R$ is the set of relations between AST nodes.
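
A minimal Java sketch of this model is given below. It is illustrative only: the names \texttt{AstModelSketch}, \texttt{Node}, and \texttt{Relation} are assumptions rather than identifiers from the developed system, and $data$ is simplified to a single string holding the node type.

\begin{lstlisting}[language=Java]
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of AST = <N, R>; all names are assumptions.
class AstModelSketch {
    // N_i = <name, data>; data is simplified to a node type string.
    static final class Node {
        final String name;
        final String data;
        Node(String name, String data) {
            this.name = name;
            this.data = data;
        }
    }

    // One element of R: a directed parent -> child relation.
    static final class Relation {
        final Node parent;
        final Node child;
        Relation(Node parent, Node child) {
            this.parent = parent;
            this.child = child;
        }
    }

    final List<Node> nodes = new ArrayList<>();         // the set N
    final List<Relation> relations = new ArrayList<>(); // the set R

    Node addNode(String name, String data) {
        Node n = new Node(name, data);
        nodes.add(n);
        return n;
    }

    void relate(Node parent, Node child) {
        relations.add(new Relation(parent, child));
    }

    // Children of a node, read from R.
    List<Node> childrenOf(Node parent) {
        List<Node> out = new ArrayList<>();
        for (Relation r : relations) {
            if (r.parent == parent) out.add(r.child);
        }
        return out;
    }

    public static void main(String[] args) {
        AstModelSketch ast = new AstModelSketch();
        Node cls = ast.addNode("Main", "Class");
        Node field = ast.addNode("a", "Field");
        ast.relate(cls, field);
        System.out.println(ast.childrenOf(cls).get(0).name); // prints "a"
    }
}
\end{lstlisting}

Keeping $R$ as an explicit list of relations, instead of nesting children inside nodes, mirrors the definition above and maps directly onto edges of a graph database.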

We developed an algorithm that extracts the structure of a project during source code analysis. The proposed algorithm contains the following steps:
\begin{enumerate}
    \item Select nodes with the `Class' type as the $N^{Class}$ set:
    \begin{equation*}
        N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{`Class'} \rbrace,
    \end{equation*}

    \begin{table}[]
        \begin{tabular}{|l|c|}
        \hline
        \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
        public class Main \{                     &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTClass.png}}     \\ \hline
        \end{tabular}
    \end{table}

    \item Select nodes with the `Class field' type as the $N^{Vars}$ set from the $N^{Class}$ set:
    \begin{equation*}
        N^{Vars} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{`Field'} \rbrace,
    \end{equation*}

    \begin{table}[]
        \begin{tabular}{|l|c|}
        \hline
        \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
        private String a;  &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTField.png}}     \\ \hline
        \end{tabular}
    \end{table}

    \item Select nodes with the `Method' type as the $N^{Methods}$ set from the $N^{Class}$ set:
    \begin{equation*}
        N^{Methods} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{`Method'} \rbrace,
    \end{equation*}

    \begin{table}[]
        \begin{tabular}{|l|c|}
        \hline
        \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
        void run() \{                     &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMethod1.png}}     \\ \hline
        void show(String text) \{ & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMethod2.png}} \\ \hline
        \end{tabular}
    \end{table}

    \item Select nodes with the `Method Argument' type as the $N^{MethodsArgs}$ set from the $N^{Methods}$ set:
    \begin{equation*}
        N^{MethodsArgs} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Arg'} \rbrace,
    \end{equation*}

    \begin{table}[]
        \begin{tabular}{|l|c|}
        \hline
        \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
        void show(String text) \{                     &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTMetArgs.png}}     \\ \hline
        \end{tabular}
    \end{table}

    \item Select nodes with the `Operator' type as the $N^{MethodsOps}$ set from the $N^{Methods}$ set:
    \begin{equation*}
        N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Operator'} \rbrace,
    \end{equation*}

    \begin{table}[]
        \begin{tabular}{|l|c|}
        \hline
        \multicolumn{1}{|c|}{Source code string} & Nodes \\ \hline
        while(true) \{  &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTLoop.png}}     \\ \hline
        int a = 1; & \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTLoopField.png}} \\ \hline
        if (a == 1) \{  &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTCondition.png}}     \\ \hline
        this.show("Hello")  &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTConditionEx.png}}     \\ \hline
        int c = "Foo";  &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTConditionField.png}}     \\ \hline
        System.out.println(text);  &  \raisebox{-\totalheight}{\includegraphics[width=10cm]{images/ASTFinish.png}}     \\ \hline
        \end{tabular}
    \end{table}

    \item Create the set of relations $R$ between the nodes from the sets obtained in the previous steps.
    \item Save the resulting AST in a graph database (GDB).
\end{enumerate}

In this algorithm, $F(\cdot)$ is a search function that finds nested nodes. Its input is a node or a subtree, and its output is a node of the desired type.
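
The selection steps can be sketched in Java as follows. This is a simplified illustration, not the implementation of the developed system: the \texttt{Node} class, the type strings, and the helper \texttt{nestedOfType}, which plays the role of $F$, are assumptions. The helper searches the whole subtree, so the members of a nested class would also be attributed to the enclosing class.

\begin{lstlisting}[language=Java]
import java.util.ArrayList;
import java.util.List;

// Sketch of the selection steps; Node and nestedOfType are assumed names.
class StructureExtractionSketch {
    static final class Node {
        final String type;
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String type, String name) {
            this.type = type;
            this.name = name;
        }
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Plays the role of F: collects nodes of the wanted type
    // that are nested anywhere under the start node.
    static List<Node> nestedOfType(Node start, String type) {
        List<Node> found = new ArrayList<>();
        for (Node child : start.children) {
            if (child.type.equals(type)) found.add(child);
            found.addAll(nestedOfType(child, type));
        }
        return found;
    }

    public static void main(String[] args) {
        Node root = new Node("Root", "root").add(
            new Node("Class", "Main")
                .add(new Node("Field", "a"))
                .add(new Node("Method", "show")
                    .add(new Node("Arg", "text"))
                    .add(new Node("Operator", "println"))));

        for (Node cls : nestedOfType(root, "Class")) {       // step 1
            System.out.println("Class " + cls.name);
            for (Node f : nestedOfType(cls, "Field"))        // step 2
                System.out.println("  Field " + f.name);
            for (Node m : nestedOfType(cls, "Method")) {     // step 3
                System.out.println("  Method " + m.name);
                for (Node a : nestedOfType(m, "Arg"))        // step 4
                    System.out.println("    Arg " + a.name);
                for (Node o : nestedOfType(m, "Operator"))   // step 5
                    System.out.println("    Operator " + o.name);
            }
        }
    }
}
\end{lstlisting}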

Figure \ref{fig:ExampleAST} shows the resulting AST for the following source code:

\begin{lstlisting}
    package laba1;
    public class Scheduler {
        private final int time = 5;
        private ArrayList<Thread> threads = new ArrayList<Thread>();

        void scheduler() {
            for (int i=0; i<threads.size(); i++) {
                threads.get(i).run(time);
                System.out.println(threads.get(i).toString());
            }
        }
    }
\end{lstlisting}

The following source code looks different, but it yields the same AST:

\begin{lstlisting}
    package os_lab_1;
    public class Planing {
        private final int QUANT = 10;
        private List<Stream> streams = new ArrayList<>();

        void plan() {
            for (Stream stream: streams) {
                stream.run(QUANT);
                System.out.println(stream.toString());
            }
        }
    }
\end{lstlisting}

\begin{figure}
    \centering
    \includegraphics[width=0.71\textwidth]{images/ExampleAST.png}
    \caption{Sample AST.} \label{fig:ExampleAST}
\end{figure}


A GDB is a non-relational database based on the topological structure of a network. Graphs represent data as nodes, edges, and properties. GDBs are more flexible than relational databases and make it possible to quickly retrieve data of various types while taking numerous relations into account.

We use the Neo4j \cite{ref_neo4j} GDB as the data storage. Neo4j is a graph database management system that stores nodes, the edges connecting them, and the attributes of nodes and edges. Neo4j remains fast even with a large amount of stored data.

\section{The Proposed Algorithm for Detecting the Structural Similarity of Software Projects}

The detection of the structural similarity of projects is based on a hashing algorithm. We use a hash function to reduce the size of the input data.

The proposed AST hashing algorithm consists of the following steps:
\begin{enumerate}
    \item Select all paths of the AST graph from the root node to every other node.
    \item Get the value of the `type' property for each node of the current path.
    \item Calculate the MD5 hash of the current path. We calculate the hashes using the \texttt{apoc.util.md5} function of the APOC plugin for Neo4j. This step produces a set of tuples, each of which contains the following values:
    \begin{itemize}
        \item a path,
        \item a path md5 hash.
    \end{itemize}
    For example:
    \begin{itemize}
        \item <`root->class->method->if', `820b...9c4b'>,
        \item <`root->class->field', `6161...eab3'>.
    \end{itemize}
\end{enumerate}
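
Outside Neo4j, the same idea can be sketched in plain Java with the standard \texttt{MessageDigest} class. The sketch is illustrative: the class names and the `->' separator used to join node types are assumptions. Because the exact input that \texttt{apoc.util.md5} hashes may differ from the joined string assumed here, the digests are not expected to coincide with the values shown in this paper.

\begin{lstlisting}[language=Java]
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of path hashing outside Neo4j.
// Class names and the "->" separator are assumptions.
class PathHashingSketch {
    static final class Node {
        final String type;
        final List<Node> children = new ArrayList<>();
        Node(String type) { this.type = type; }
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Hex-encoded MD5 digest of a string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is unavailable", e);
        }
    }

    // Steps 1-3: walk down from the root and map every
    // root-to-node path of types to its hash.
    static void collect(Node node, String prefix, Map<String, String> out) {
        for (Node child : node.children) {
            String path = prefix + "->" + child.type;
            out.put(path, md5Hex(path));
            collect(child, path, out);
        }
    }

    static Map<String, String> hashPaths(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(root, root.type, out);
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node("root").add(new Node("class")
            .add(new Node("field"))
            .add(new Node("method").add(new Node("if"))));
        hashPaths(root).forEach((p, h) -> System.out.println(p + " " + h));
    }
}
\end{lstlisting}

Storing the result in a map keyed by path means that repeated paths of the same types collapse into one entry, which matches the set semantics of the algorithm.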

The following expression is used to calculate the originality of a project:
\begin{equation}
    \label{eq:calc-orig}
    O = \frac{\left| H^{C} \setminus H \right|}{\left| H^{C} \right|},
\end{equation}
where $H^{C}$ is the set of path hash values of the analyzed project;\\
$H$ is the set of hash values of the other projects in the system.
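
Reading the expression as the share of path hashes of the analyzed project that do not occur in any other project, the computation can be sketched in Java as follows. The class and method names are hypothetical, and the short strings stand in for real MD5 values.

\begin{lstlisting}[language=Java]
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the originality indicator; names are assumptions.
class OriginalitySketch {
    // Share of the project's path hashes found in no other project.
    static double originality(Set<String> current, Set<String> others) {
        if (current.isEmpty()) {
            throw new IllegalArgumentException("no hashes to compare");
        }
        Set<String> unique = new HashSet<>(current);
        unique.removeAll(others);
        return (double) unique.size() / current.size();
    }

    public static void main(String[] args) {
        Set<String> hc = new HashSet<>(Arrays.asList("h1", "h2", "h3", "h4"));
        Set<String> h = new HashSet<>(Arrays.asList("h1", "h2", "h3", "h9"));
        // Only "h4" is absent from the other projects: 1 / 4.
        System.out.println(originality(hc, h)); // prints 0.25
    }
}
\end{lstlisting}

Under this reading, a value of 0 means that every structural path of the project also occurs elsewhere, and a value of 1 means that none does.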

\section{Results}

\subsection{Architecture of the developed system}

Figure \ref{DeploymentDiagram} shows the UML deployment diagram of the developed software system. The system has a three-tier architecture.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{images/DeploymentDiagram.png}
    \caption{Deployment diagram.} \label{DeploymentDiagram}
\end{figure}

Users interact with the web client on the \textit{Frontend} node. The \textit{Backend} node implements the main business logic for searching for structurally similar projects. The \textit{Backend} and \textit{Frontend} nodes communicate through an API. The \textit{Database} node provides data storage functions.

The web client is an application written in JavaScript with the Vue.js framework. Vue.js is a framework for developing single-page applications and web interfaces. The main advantages of this framework are the small size of the library, performance, flexibility, and excellent documentation.

We implemented the server part of the application in Java with the Spring Boot framework. The Spring framework is an ecosystem for developing applications in Java. Spring Boot includes a large number of ready-to-use modules. The main advantages of this framework include speed and convenience of development, auto-configuration of all components, and easy access to databases and network capabilities.

The current version of the software system supports only Java-based software projects. The JavaParser library is used to build an AST when analyzing Java source code. This library allows the AST to be extracted using the algorithm discussed above.

\subsection{Data model for representing AST as a GDB fragment}

In this subsection, we discuss the proposed data model for representing an AST as a GDB fragment.

The GDB data model contains nodes of the following types:
\begin{itemize}
    \item `Package' (Java-specific),
    \item `Class',
    \item `Class field',
    \item `Method',
    \item `Method argument',
    \item `Statement' (declaration, expression and control statements).
\end{itemize}

We arrange the nodes in the GDB hierarchically. For example, a class node is a part of a package node, and a method node is a part of a class node. The data model allows the following relationships between nodes:
\begin{itemize}
    \item `HAS\_CLASS' is a relationship between a `Package' node and a `Class' node,
    \item `HAS\_FIELD' is a relationship between a `Class' node and a `Class field' node,
    \item `HAS\_METHOD' is a relationship between a `Class' node and a `Method' node,
    \item `HAS\_ARG' is a relationship between a `Method' node and a `Method argument' node,
    \item `HAS\_BLOCK' is a relationship between a `Method' node and a `Statement' node.
\end{itemize}
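
The allowed relationships can also be written down as a small lookup structure. The Java sketch below is illustrative only: the enum \texttt{Relation} and the method \texttt{between} are assumed names, and relationships between nested statements, which the list above does not name, are left out.

\begin{lstlisting}[language=Java]
// Sketch of the relationship types; the enum is an assumption.
class GraphSchemaSketch {
    enum Relation {
        HAS_CLASS("Package", "Class"),
        HAS_FIELD("Class", "Class field"),
        HAS_METHOD("Class", "Method"),
        HAS_ARG("Method", "Method argument"),
        HAS_BLOCK("Method", "Statement");

        final String from; // type of the parent node
        final String to;   // type of the child node

        Relation(String from, String to) {
            this.from = from;
            this.to = to;
        }

        // Returns the relationship that may connect the two
        // node types, or null if the model defines none.
        static Relation between(String from, String to) {
            for (Relation r : values()) {
                if (r.from.equals(from) && r.to.equals(to)) return r;
            }
            return null;
        }
    }

    public static void main(String[] args) {
        // prints HAS_METHOD
        System.out.println(Relation.between("Class", "Method"));
        // prints null: a package does not contain methods directly
        System.out.println(Relation.between("Package", "Method"));
    }
}
\end{lstlisting}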

The proposed algorithm for searching for structurally similar projects hashes graph paths using the MD5 function. The hashing algorithm is described in the previous section. The search can be expressed as the following Cypher query:
\begin{lstlisting}
    MATCH p = (o{name:"root"})-[r*]- ()
    WHERE ID(o)={0}
    WITH [x in nodes(p) | CASE WHEN EXISTS(x.name)
    THEN x.name ELSE x.type END] as names,
        [x in nodes(p) | ID(x)] as ids
    WITH names, apoc.util.md5(names) as hash, ids
    RETURN names, hash, ids
\end{lstlisting}

Table \ref{tab:query-results} shows the result of the Cypher-query.

\begin{table}
    \centering
    \caption{An example of the result of the searching Cypher-query.}
    \label{tab:query-results}
    \begin{tabular}{L{4cm}L{3cm}L{4cm}}
        \hline
        \noalign{\vskip 3pt}
        names & hash & ids \\
        \noalign{\vskip 2pt}
        \hline
        \noalign{\vskip 3pt}
        \textbf{root-\textgreater{}package} & \textbf{"346f...a463"} & {[}{[}7872, 7873{]}, {[}7977, 7978{]}{]} \\
        \textbf{root-\textgreater{}package-\textgreater{}class} & \textbf{"840b...7f9a"} & {[}{[}7872, 7873, 7874{]},\newline {[}7977, 7978, 7979{]}{]} \\
        \textbf{root-\textgreater{}package-\textgreater{}class}\newline \hspace{3mm}\textbf{-\textgreater{}method} & \textbf{"7151...0f3d"} & {[}{[}7872, 7873, 7874, 7875{]}, {[}7977, 7978, 7979, 7980{]}{]} \\
        root-\textgreater{}package-\textgreater{}class\newline \hspace{3mm}-\textgreater{}method\newline \hspace{3mm}-\textgreater{}statement.control & "5810...f0c9" & {[}{[}7872, 7873, 7874, 7875, 7879{]}{]} \\
        root-\textgreater{}package-\textgreater{}class\newline \hspace{3mm}-\textgreater{}method\newline \hspace{3mm}-\textgreater{}statement.control\newline \hspace{3mm}-\textgreater{}statement.expression & "fd3c...5a3c" & {[}{[}7872, 7873, 7874, 7875, 7879, 7880{]}{]} \\
        \multicolumn{3}{c}{...} \\
        \textbf{root-\textgreater{}package} & \textbf{"346f...a463"} & {[}{[}7872, 7873{]}, {[}7977, 7978{]}{]} \\
        \textbf{root-\textgreater{}package-\textgreater{}class} & \textbf{"840b...7f9a"} & {[}{[}7872, 7873, 7874{]}, \newline {[}7977, 7978, 7979{]}{]} \\
        \textbf{root-\textgreater{}package-\textgreater{}class}\newline \textbf{\hspace{3mm}-\textgreater{}method} & \textbf{"7151...0f3d"} & {[}{[}7872, 7873, 7874, 7875{]}, {[}7977, 7978, 7979, 7980{]}{]} \\
        \noalign{\vskip 2pt}
        \hline
    \end{tabular}
\end{table}

Table \ref{tab:query-results} shows that:
\begin{itemize}
    \item the hash `346f...a463' matches the path `root->package',
    \item the hash `840b...7f9a' matches the path `root->package->class',
    \item the hash `7151...0f3d' matches the path `root->package->class->method',
    \item two projects with identifiers 7872 and 7977 contain these structural patterns (paths).
\end{itemize}

Thus, we can count the matching and non-matching paths of the analyzed project in comparison with the other projects in the data storage (see Eq.~\ref{eq:calc-orig}). Figure \ref{fig:ExampleSystem} shows the main form of the developed system.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{images/ExampleSystemEng.png}
    \caption{The main form of the developed system.}
    \label{fig:ExampleSystem}
\end{figure}

\section{Experiments}

We conducted experiments to evaluate the speed of source code analysis. We calculated the results relative to the number of lines of code and the number of files in the analyzed project. The main aim of the experiment is to determine the speed of the algorithm as the average number of lines of code processed per minute. We used the Statistic plugin for IntelliJ IDEA \cite{ref_Statistic} to get the data for the experiment. The plugin reports the number, size, number of lines, average values, and other information for each file in the project, as well as the total number of lines, the number of lines of code, the proportion of lines of code, the number of comment lines, the proportion of comment lines, etc.

We selected 10 random Java projects for this experiment. Table \ref{tab:speed} presents the results of experiments for analyzing the speed of the proposed algorithm.

\begin{table}
    \centering
    \caption{Results of experiments for analyzing the speed of the proposed algorithm.}
    \label{tab:speed}
    \begin{tabular}{L{0.5cm}L{3.5cm}L{2cm}L{2cm}L{2cm}}
        \hline
        \noalign{\vskip 3pt}
        \# & Project name      & Lines of code & Java files & Lines of code per minute \\
        \noalign{\vskip 2pt}
        \hline
        \noalign{\vskip 3pt}
        1  & BaseRecycler      & 3 896         & 92         & 2 491                    \\
        2  & AlamazDev         & 15 776        & 103        & 2 658                    \\
        3  & SnakeBoom         & 20 534        & 158        & 3 255                    \\
        4  & retrofit          & 32 119        & 227        & 2 718                    \\
        5  & Glide             & 37 508        & 203        & 2 576                    \\
        6  & ZXing             & 51 857        & 310        & 2 533                    \\
        7  & RxJava            & 64 101        & 339        & 2 814                    \\
        8  & VisualProjectCore & 71 303        & 450        & 2 969                    \\
        9  & mc-dev            & 85 267        & 877        & 2 746                    \\
        10 & xRayJavaTool      & 97 249        & 937        & 2 730                    \\
        \noalign{\vskip 2pt}
        \hline
        \noalign{\vskip 3pt}
        \multicolumn{4}{l}{Average value}                   & 2 749                   \\
        \noalign{\vskip 2pt}
        \hline
    \end{tabular}
\end{table}

Table \ref{tab:time-size} presents the results of experiments to determine the total project analysis time and the number of nodes in the resulting graphs.

\begin{table}
    \centering
    \caption{Results of experiments to determine the total project analysis time and the number of nodes in the resulting graphs.}
    \label{tab:time-size}
    \begin{tabular}{L{0.5cm}L{3.5cm}L{2cm}L{2cm}}
        \hline
        \noalign{\vskip 3pt}
        \#  & Project name  & Total time (min) & Number of graph nodes \\
        \noalign{\vskip 2pt}
        \hline
        \noalign{\vskip 3pt}
        1 & BaseRecycler & 1.6 & 844 \\
        2 & AlamazDev & 6.0 & 1 837 \\
        3 & SnakeBoom & 6.3 & 2 197 \\
        4 & retrofit & 11.8 & 7 118 \\
        5 & Glide &  14.6 & 8 496 \\
        6 & ZXing & 20.5 & 10 560 \\
        7 & RxJava &  22.7 & 11 972 \\
        8 & VisualProjectCore & 24.1 & 13 334 \\
        9 & mc-dev & 31.1 & 14 444 \\
        10 & xRayJavaTool & 35.6 & 23 946 \\
        \noalign{\vskip 2pt}
        \hline
    \end{tabular}
\end{table}

The experiments show that the system processes about 2 750 lines of code per minute on average. Student projects contain 500--3000 lines of code on average. Thus, the analysis of one student project takes less than one minute on average.
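
As a rough check, the expected analysis time $t$ can be estimated by dividing the project size by the average throughput of 2 749 lines per minute from Table \ref{tab:speed}. The estimate assumes that throughput does not depend on project size, and the midpoint of 1 750 lines is an illustrative value rather than a measured one:
\begin{equation*}
    t_{500} = \frac{500}{2\,749} \approx 0.18~\text{min}, \quad
    t_{1750} = \frac{1\,750}{2\,749} \approx 0.64~\text{min}, \quad
    t_{3000} = \frac{3\,000}{2\,749} \approx 1.09~\text{min}.
\end{equation*}
Under this assumption, only projects close to the upper end of the range take slightly more than one minute.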

\section{Conclusion}

This article presents the results of developing an approach and a system for searching for structurally similar projects.

We solved the following tasks:
\begin{itemize}
    \item we analyzed existing methods of source code analysis, including methods for detecting borrowings in text and source code;
    \item we developed an algorithm for extracting the AST when analyzing the source code of a project;
    \item we developed an algorithm for determining the originality of a project based on hashing the AST structure;
    \item we implemented a software system to determine the originality of a project;
    \item we conducted experiments to determine the speed of the proposed algorithm.
\end{itemize}

Thus, the developed system makes it possible to find borrowings in student projects in less than a minute on average.

\subsubsection{Acknowledgements} The authors acknowledge that the work was supported within the framework of the state task No. 075-03-2023-143 ``Research of intelligent predictive analytics based on the integration of methods for constructing features of heterogeneous dynamic data for machine learning and methods of predictive multimodal data analysis''.

%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{paper}

\end{document}