You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

371 lines
20 KiB
TeX

% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.21 of 2022/01/12
%
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encondings may result in incorrect characters.
%
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{listings}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
%\usepackage{color}
%\renewcommand\UrlFont{\color{blue}\rmfamily}
%
\begin{document}
%
\title{Search for Structurally Similar Projects of Software Systems}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Aleksey Filippov\inst{1}\orcidID{0000-0003-0008-5035} \and
Julia Stroeva\inst{1}\orcidID{0009-0003-8026-235X}}
%
\authorrunning{A. Filippov, J. Stroeva}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{Department of Information Systems, Ulyanovsk State
Technical University, 32 Severny Venetz Street, 432027 Ulyanovsk, Russia}
%
\maketitle % typeset the header of the contribution
%
\begin{abstract}
The authors have developed an approach to the search for structurally similar projects of software systems. Teachers can use the proposed approach to search for borrowings in the works of students. The concept behind this proposal is that it can to locate projects that students have used as parts of a current project.
The authors propose a new algorithm for determining the similarity between the structures of software projects. The proposed algorithm is based on finding similar structural elements in the source code of the program in an abstract syntax trees analyzing.
The authors developed a software system to evaluate the proposed algorithm. The current version of the system only supports Java programs. However, the system operates with its own representation of the abstract syntax tree, which allows you to add support for new programming languages.
\keywords{source code \and structure analysis \and structurally similar projects \and hashing.}
\end{abstract}
%
%
%
\section{Introduction}
Currently, the most practical work of students in information technology includes laboratory and coursework. These classes help students understand the theoretical information from their lectures.
Typically, student work is a small program that solves typical problems. In most cases, these works contain few files or a few lines of code. The architecture and algorithms of such programs are also simple.
The teacher needs to spend many time to check all the work. The teacher usually notices when the student has borrowed a program source code. Students in such cases do not change the structure of the borrowed source code, but rename variables or change types of loops (from \textit{for} to \textit{while}), etc.
The software system proposed in this article allows you to analyze the structure of projects and provide information about their structural similarity. The indicator of the uniqueness of the current project structure is used to evaluate the uniqueness of the project in comparison with other projects.
\section{State of the art}
There are no universal methods for analyzing the source code of software systems at the moment. Certain methods of analysis are used to solve various problems.
There is a group of methods for analyzing source code, which is based on obtaining and analyzing abstract syntax trees (AST). An AST is an abstract representation of the grammatical structure of a source code. It expresses the structure of a program in some programming language as a tree structure. Each AST node is an operator or a set of operators of the analyzed source code. The compiler generates an AST because of the parsing step. Unlike a parse tree, an AST does not have nodes or edges for syntax rules that do not affect the semantics of the program (for example, grouping brackets).
It can also analyze projects using call graph generation tools, such as \textit{CodeViz} or \textit{Egypt}. It is possible to use some functions of reverse engineering tools, such as IDA Pro.
AST-based approaches can find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches analyze a larger number of parameters than is necessary to solve the problem of this study \cite{aleksey2020approach,beniwal2021npmrec,nadezhda2019approach,nguyen2018crosssim,nguyen2020automated}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.
The paper \cite{ali2011overview} presents an analysis of approaches and software tools for searching for borrowings in the text and source code. However, there is no mention of software tools for searching for borrowings in the source code.
In the article \cite{chae2013software}, the authors analyze borrowings in the source code according to the sequences of using external programming interfaces and the frequency of such calls. This method is not suitable for solving the problem of this study because of the educational orientation of the analyzed source code.
Thus, it is necessary to develop an approach to the search for structurally similar projects, which are focused on working with simple software systems and with a high speed of analysis.
\section{The Proposed Algorithm for Analyzing the Structure of the Source Code}
The source code of the software system in the proposed algorithm is the main source of data for identifying structural features.
We formed an AST to analyze the source code. There are various libraries and tools for all existing programming languages for the formation of AST. Thus, using your own representation of the AST allows you to add support for new programming languages to the system without changing the analysis algorithms.
We will use the following AST model:
\begin{equation*}
AST = \langle N,R \rangle,
\end{equation*}
where $N = \lbrace N_{1}, N_{2},\ldots, N_{n}\rbrace$ is the set of nodes AST;
$N_{i} = \langle name, data \rangle$ is an i-th AST node containing the node name, node data;
$R$ is the set of relations between AST nodes.
We developed an algorithm to highlight the structure of the project in analyzing the source code, which comprises the following steps:
\begin{enumerate}
\item Form an ASD for the project.
\item Select nodes with type “Class”:
\begin{equation*}
N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{'Class'} \rbrace,
\end{equation*}
\item Find nodes with the “Class field” type in the found classes:
\begin{equation*}
N^{Vars} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{'Field'} \rbrace,
\end{equation*}
\item Find nodes with the “Methods” type in the found classes:
\begin{equation*}
N^{Methods} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{'Method'} \rbrace,
\end{equation*}
\item Find nodes with type “Method Argument” in the found methods:
\begin{equation*}
N^{MethodsArgs} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{'Arg'} \rbrace,
\end{equation*}
\item Find nodes with the type “Operator” in the found methods:
\begin{equation*}
N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{'Operator'} \rbrace,
\end{equation*}
\item Create based on previously got sets of AST for the analyzed source code, considering the set of relations R.
\item Save the resulting AST in a graph database (GDB) to facilitate data handling.
\end{enumerate}
Figure \ref{SourceCodeAndAST} shows an example of the source code and the ASD got for it.
\begin{figure}
\includegraphics[width=\textwidth]{SourceCodeAndAST.png}
\caption{Sample source code and its AST.} \label{SourceCodeAndAST}
\end{figure}
GDB is a non-relational type of database based on the topographic structure of the network. Graphs represent sets of data as nodes, edges, and properties. Relational databases provide a structured approach to data. GDBs are more flexible and focused on fast data acquisition, considering various types of links between them.
We used Neo4j as the GDB, since this system has a high speed of operation even with a large amount of stored data.
\section{The Proposed Algorithm for Determining the Structural Similarity of Software Projects}
The determination of the structural similarity of projects is based on the use of a hashing algorithm. We use a hash function to collapse an input array of any size into string.
It is necessary to get the values of the AST hash function of the analyzed project:
\begin{enumerate}
\item Get fragments (paths) of the AST graph from the root node of the graph tree to each node of the graph.
\item We extract the distinguishing property of each node (type) in the current fragment.
\item We passed the type of the node to the hash function. The result is an output string of a certain length.
\item If fragment \textit{A} contains hash \textit{H} and fragment \textit{B} contains hash H, then we can say that fragments \textit{A} and \textit{B} have similarity in one node.
\item If fragment \textit{A} contains hash \textit{H}, and fragment \textit{B} does not contain hash \textit{H}, then fragments \textit{A} and \textit{B} do not have similar nodes.
\end{enumerate}
We calculate the hash function values using the Neo4j GDB using the md5 algorithm.
Thus, the number of matching hash values affects the uniqueness and borrowing rates of code in a project. The following expression is used to calculate project originality:
\begin{equation*}
O = \frac {H^{C} \notin H}{H^{C}},
\end{equation*}
where $H^{C}$ is the set of values of the hash function of the current project;
$H$ is the set of hash values of other projects in the system.
\section{Description of the Developed Software System}
Figure \ref{DeploymentDiagram} shows the developed software system. The system comprises three nodes, which are on different nodes of the computer network.
\begin{figure}
\includegraphics[width=\textwidth]{DeploymentDiagram.png}
\caption{Deployment diagram.} \label{DeploymentDiagram}
\end{figure}
Users interact with the web client on the \textit{Frontend} node. The \textit{Backend} node performs the main business logic for implementing the proposed approach in searching for structurally similar projects. \textit{Backend} and \textit{Frontend} nodes communicate through an API. The GDB is on the \textit{Database} node and provides a saving of project data.
The developed software system represents a web application using a client-server architecture. The web client sends requests to the server, the server processes the requests and returns responses to the web client.
The web client is an application written in JavaScript using the Vue js framework. Vue.js is a framework for developing single page applications and web interfaces. The main advantages of this framework are its lightness (small size of the library in lines of code), performance, flexibility, and excellent documentation.
We implement the server part of the application in Java using the Spring Boot framework. The Spring framework is a whole ecosystem for developing applications in the Java language, which includes a huge number of ready-made modules. Spring Boot extends the Spring framework. The main advantages of this framework include speed and ease of development, auto-configuration of all components, easy access to databases and network capabilities.
The current version of the software system only supports source code analysis in the Java. The JavaParser library is used to form an AST in the process of Java code analysis. This library allows you to build an internal representation of the AST, which is then translated into the proposed AST model using the previously discussed algorithm.
Figure \ref{JavaParserLibrary} shows an example of the internal representation of the AST in the JavaParser library.
\begin{figure}
\includegraphics[width=\textwidth]{JavaParserLibrary.png}
\caption{An example of the internal representation of the JavaParser library AST.} \label{JavaParserLibrary}
\end{figure}
We used neo4j GDB for data storage. Possibly redundant expression GDBs over relational ones is the ability to change the data model.
The GDB data model allows you to store the following nodes:
\begin{itemize}
\item nodes with type “Package” (Java-specific);
\item nodes with type “Class”;
\item nodes with the “Class field” type;
\item nodes with type “Method”;
\item nodes with type “Method argument”;
\item nodes with type “Operator”.
\end{itemize}
We arrange the nodes in the GDB hierarchically. For example, a class is in a package, but a method is in a class. The data model allows you to form the following relationships between graph nodes:
\begin{itemize}
\item $HAS_CLASS$ is a relationship between a package and a class;
\item $HAS_FIELD$ is a relationship between a class and a class field;
\item $HAS_METHOD$ is a is a link between a class and a method;
\item $HAS_ARG$ is a relationship between method and method argument;
\item $HAS_BLOCK$ is a link between a method and a statement.
\end{itemize}
The main idea of searching for structurally similar projects is to use hashing of graph fragments (paths) based on the md5 function. We formed the path from the root node to each node of the graph. We take nesting into account when forming the path (package -> class -> field / method -> method argument / operator). The hash function takes the string representation of the node type as input.
Example of a request to get paths and a hash that matches this path:
\begin{lstlisting}
MATCH p = (o{name:"root"})-[r*]- ()
WHERE ID(o)={0}
WITH [x in nodes(p) | CASE WHEN EXISTS(x.name)
THEN x.name ELSE x.type END] as names,
[x in nodes(p) | ID(x)] as ids
WITH names, apoc.util.md5(names) as hash, ids
RETURN names, hash, ids
\end{lstlisting}
Figure \ref{ResultOfRequest} shows the result of the query execution.
\begin{figure}
\includegraphics[width=\textwidth]{ResultOfRequest.png}
\caption{An example of the result of a request to Neo4j.} \label{ResultOfRequest}
\end{figure}
Figure \ref{ResultOfRequest} shows that the hash \textit{"b199ef8568f72c43f6fd50860e228c51"} matches the path \textit{[“root”, “cont”]}. Two graphs contain this hash. The primary keys of the root nodes of these graphs are 7872 and 7977.
Thus, we can discover the number of matching and different paths in the analyzed projects by obtaining hashes for all AST fragments. Figure \ref{ExampleSystem} shows an example of the developed system.
\begin{figure}
\includegraphics[width=\textwidth]{ExampleSystem.png}
\caption{System operation example.} \label{ExampleSystem}
\end{figure}
\section{Experiments}
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
\section{State of the art}
\subsection{A Subsection Sample}
Please note that the first paragraph of a section or subsection is
not indented. The first paragraph that follows a table, figure,
equation etc. does not need an indent, either.
Subsequent paragraphs, however, are indented.
\subsubsection{Sample Heading (Third Level)} Only two levels of
headings should be numbered. Lower level headings remain unnumbered;
they are formatted as run-in headings.
\paragraph{Sample Heading (Fourth Level)}
The contribution should contain no more than four levels of
headings. Table~\ref{tab1} gives a summary of all heading levels.
% \begin{figure}
% \end{figure}
\begin{table}
\caption{Table captions should be placed above the
tables.}\label{tab1}
\begin{tabular}{|l|l|l|}
\hline
Heading level & Example & Font size and style\\
\hline
Title (centered) & {\Large\bfseries Lecture Notes} & 14 point, bold\\
1st-level heading & {\large\bfseries 1 Introduction} & 12 point, bold\\
2nd-level heading & {\bfseries 2.1 Printing Area} & 10 point, bold\\
3rd-level heading & {\bfseries Run-in Heading in Bold.} Text follows & 10 point, bold\\
4th-level heading & {\itshape Lowest Level Heading.} Text follows & 10 point, italic\\
\hline
\end{tabular}
\end{table}
\begin{itemize}
\item
\end{itemize}
\begin{enumerate}
\item
\end{enumerate}
\noindent Displayed equations are centered and set on a separate
line.
\begin{equation*}
N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{'Class'} \rbrace,
\end{equation*}
Please try to avoid rasterized images for line-art diagrams and
schemas. Whenever possible, use vector graphics instead (see
Fig.~\ref{fig1}).
% \begin{figure}
% \includegraphics[width=\textwidth]{fig1.eps}
% \caption{A figure caption is always placed below the illustration.
% Please note that short captions are centered, while long ones are
% justified by the macro package automatically.} \label{fig1}
% \end{figure}
\begin{theorem}
This is a sample theorem. The run-in heading is set in bold, while
the following text appears in italics. Definitions, lemmas,
propositions, and corollaries are styled the same way.
\end{theorem}
%
% the environments 'definition', 'lemma', 'proposition', 'corollary',
% 'remark', and 'example' are defined in the LLNCS documentclass as well.
%
\begin{proof}
Proofs, examples, and remarks have the initial word in italics,
while the following text appears in normal font.
\end{proof}
For citations of references, we prefer the use of square brackets and consecutive numbers. Citations using labels or the author/year convention are also acceptable. The following bibliography provides a sample reference list with entries for journal
articles~\cite{Reviewer-Recommender}, an LNCS chapter~\cite{ref_lncs1}, a
book~\cite{ref_book1}, proceedings without editors~\cite{ref_proc1},
and a homepage~\cite{ref_url1}. Multiple citations are grouped
\cite{ref_article1,ref_lncs1,ref_book1},
\cite{ref_article1,ref_book1,ref_proc1,ref_url1}.
\subsubsection{Acknowledgements} Please place your acknowledgments at
the end of the paper, preceded by an unnumbered run-in heading (i.e.
3rd-level heading).
%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{samplepaper}
%
% \begin{thebibliography}{8}
% \bibitem{ref_article1}
% Author, F.: Article title. Journal \textbf{2}(5), 99--110 (2016)
% \bibitem{ref_lncs1}
% Author, F., Author, S.: Title of a proceedings paper. In: Editor,
% F., Editor, S. (eds.) CONFERENCE 2016, LNCS, vol. 9999, pp. 1--13.
% Springer, Heidelberg (2016). \doi{10.10007/1234567890}
% \bibitem{ref_book1}
% Author, F., Author, S., Author, T.: Book title. 2nd edn. Publisher,
% Location (1999)
% \bibitem{ref_proc1}
% Author, A.-B.: Contribution title. In: 9th International Proceedings
% on Proceedings, pp. 1--2. Publisher, Location (2010)
% \bibitem{ref_url1}
% LNCS Homepage, \url{http://www.springer.com/lncs}. Last accessed 4
% Oct 2017
% \end{thebibliography}
\end{document}