IITI-2023/samplepaper.tex

265 lines
13 KiB
TeX
Raw Normal View History

2023-04-06 13:43:26 +04:00
% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.21 of 2022/01/12
%
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encondings may result in incorrect characters.
%
\usepackage{amsmath}
\usepackage{graphicx}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
%\usepackage{color}
%\renewcommand\UrlFont{\color{blue}\rmfamily}
%
\begin{document}
%
\title{Search for Structurally Similar Projects of Software Systems}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Aleksey Filippov\inst{1}\orcidID{0000-0003-0008-5035} \and
Julia Stroeva\inst{1}\orcidID{0009-0003-8026-235X}}
%
\authorrunning{A. Filippov, J. Stroeva}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{Department of Information Systems, Ulyanovsk State
Technical University, 32 Severny Venetz Street, 432027 Ulyanovsk, Russia}
%
\maketitle % typeset the header of the contribution
%
\begin{abstract}
The authors have developed an approach to the search for structurally similar projects of software systems. Teachers can use the proposed approach to search for borrowings in the works of students. The concept behind this proposal is that it can to locate projects that students have used as parts of a current project.
The authors propose a new algorithm for determining the similarity between the structures of software projects. The proposed algorithm is based on finding similar structural elements in the source code of the program in an abstract syntax trees analyzing.
The authors developed a software system to evaluate the proposed algorithm. The current version of the system only supports Java programs. However, the system operates with its own representation of the abstract syntax tree, which allows you to add support for new programming languages.
\keywords{source code \and structure analysis \and structurally similar projects \and hashing.}
\end{abstract}
%
%
%
\section{Introduction}
Currently, the most practical work of students in information technology includes laboratory and coursework. These classes help students understand the theoretical information from their lectures.
Typically, student work is a small program that solves typical problems. In most cases, these works contain few files or a few lines of code. The architecture and algorithms of such programs are also simple.
The teacher needs to spend many time to check all the work. The teacher usually notices when the student has borrowed a program source code. Students in such cases do not change the structure of the borrowed source code, but rename variables or change types of loops (from \textit{for} to \textit{while}), etc.
The software system proposed in this article allows you to analyze the structure of projects and provide information about their structural similarity. The indicator of the uniqueness of the current project structure is used to evaluate the uniqueness of the project in comparison with other projects.
\section{State of the art}
There are no universal methods for analyzing the source code of software systems at the moment. Certain methods of analysis are used to solve various problems.
There is a group of methods for analyzing source code, which is based on obtaining and analyzing abstract syntax trees (AST). An AST is an abstract representation of the grammatical structure of a source code. It expresses the structure of a program in some programming language as a tree structure. Each AST node is an operator or a set of operators of the analyzed source code. The compiler generates an AST because of the parsing step. Unlike a parse tree, an AST does not have nodes or edges for syntax rules that do not affect the semantics of the program (for example, grouping brackets).
It can also analyze projects using call graph generation tools, such as \textit{CodeViz} or \textit{Egypt}. It is possible to use some functions of reverse engineering tools, such as IDA Pro.
AST-based approaches can find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches analyze a larger number of parameters than is necessary to solve the problem of this study \cite{aleksey2020approach,beniwal2021npmrec,nadezhda2019approach,nguyen2018crosssim,nguyen2020automated}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.
The paper \cite{ali2011overview} presents an analysis of approaches and software tools for searching for borrowings in the text and source code. However, there is no mention of software tools for searching for borrowings in the source code.
In the article \cite{chae2013software}, the authors analyze borrowings in the source code according to the sequences of using external programming interfaces and the frequency of such calls. This method is not suitable for solving the problem of this study because of the educational orientation of the analyzed source code.
Thus, it is necessary to develop an approach to the search for structurally similar projects, which are focused on working with simple software systems and with a high speed of analysis.
\section{The Proposed Algorithm for Analyzing the Structure of the Source Code}
The source code of the software system in the proposed algorithm is the main source of data for identifying structural features.
We formed an AST to analyze the source code. There are various libraries and tools for all existing programming languages for the formation of AST. Thus, using your own representation of the AST allows you to add support for new programming languages to the system without changing the analysis algorithms.
We will use the following AST model:
\begin{equation*}
AST = \langle N,R \rangle,
\end{equation*}
where $N = \lbrace N_{1}, N_{2},\ldots, N_{n}\rbrace$ is the set of nodes AST;
$N_{i} = \langle name, data \rangle$ is an i-th AST node containing the node name, node data;
R is the set of relations between AST nodes.
We developed an algorithm to highlight the structure of the project in analyzing the source code, which comprises the following steps:
\begin{enumerate}
\item Form an ASD for the project.
\item Select nodes with type “Class”:
\begin{equation*}
N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{'Class'} \rbrace,
\end{equation*}
\item Find nodes with the “Class field” type in the found classes:
\begin{equation*}
N^{Vars} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{'Field'} \rbrace,
\end{equation*}
\item Find nodes with the “Methods” type in the found classes:
\begin{equation*}
N^{Methods} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{'Method'} \rbrace,
\end{equation*}
\item Find nodes with type “Method Argument” in the found methods:
\begin{equation*}
N^{MethodsArgs} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{'Arg'} \rbrace,
\end{equation*}
\item Find nodes with the type “Operator” in the found methods:
\begin{equation*}
N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{'Operator'} \rbrace,
\end{equation*}
\item Create based on previously got sets of AST for the analyzed source code, considering the set of relations R.
\item Save the resulting AST in a graph database (GDB) to facilitate data handling.
\end{enumerate}
Figure \ref{fig1} shows an example of the source code and the ASD got for it.
\begin{figure}
\includegraphics[width=\textwidth]{SourceCodeAndAST.png}
\caption{Sample source code and its AST.} \label{fig1}
\end{figure}
GDB is a non-relational type of database based on the topographic structure of the network. Graphs represent sets of data as nodes, edges, and properties. Relational databases provide a structured approach to data. GDBs are more flexible and focused on fast data acquisition, considering various types of links between them.
We used Neo4j as the GDB, since this system has a high speed of operation even with a large amount of stored data.
\section{The Proposed Algorithm for Determining the Structural Similarity of Software Projects}
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
% -----------------------------------------
\section{State of the art}
\subsection{A Subsection Sample}
Please note that the first paragraph of a section or subsection is
not indented. The first paragraph that follows a table, figure,
equation etc. does not need an indent, either.
Subsequent paragraphs, however, are indented.
\subsubsection{Sample Heading (Third Level)} Only two levels of
headings should be numbered. Lower level headings remain unnumbered;
they are formatted as run-in headings.
\paragraph{Sample Heading (Fourth Level)}
The contribution should contain no more than four levels of
headings. Table~\ref{tab1} gives a summary of all heading levels.
% \begin{figure}
% \end{figure}
\begin{table}
\caption{Table captions should be placed above the
tables.}\label{tab1}
\begin{tabular}{|l|l|l|}
\hline
Heading level & Example & Font size and style\\
\hline
Title (centered) & {\Large\bfseries Lecture Notes} & 14 point, bold\\
1st-level heading & {\large\bfseries 1 Introduction} & 12 point, bold\\
2nd-level heading & {\bfseries 2.1 Printing Area} & 10 point, bold\\
3rd-level heading & {\bfseries Run-in Heading in Bold.} Text follows & 10 point, bold\\
4th-level heading & {\itshape Lowest Level Heading.} Text follows & 10 point, italic\\
\hline
\end{tabular}
\end{table}
\begin{itemize}
\item
\end{itemize}
\begin{enumerate}
\item
\end{enumerate}
\noindent Displayed equations are centered and set on a separate
line.
\begin{equation*}
N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{'Class'} \rbrace,
\end{equation*}
Please try to avoid rasterized images for line-art diagrams and
schemas. Whenever possible, use vector graphics instead (see
Fig.~\ref{fig1}).
% \begin{figure}
% \includegraphics[width=\textwidth]{fig1.eps}
% \caption{A figure caption is always placed below the illustration.
% Please note that short captions are centered, while long ones are
% justified by the macro package automatically.} \label{fig1}
% \end{figure}
\begin{theorem}
This is a sample theorem. The run-in heading is set in bold, while
the following text appears in italics. Definitions, lemmas,
propositions, and corollaries are styled the same way.
\end{theorem}
%
% the environments 'definition', 'lemma', 'proposition', 'corollary',
% 'remark', and 'example' are defined in the LLNCS documentclass as well.
%
\begin{proof}
Proofs, examples, and remarks have the initial word in italics,
while the following text appears in normal font.
\end{proof}
For citations of references, we prefer the use of square brackets and consecutive numbers. Citations using labels or the author/year convention are also acceptable. The following bibliography provides a sample reference list with entries for journal
articles~\cite{Reviewer-Recommender}, an LNCS chapter~\cite{ref_lncs1}, a
book~\cite{ref_book1}, proceedings without editors~\cite{ref_proc1},
and a homepage~\cite{ref_url1}. Multiple citations are grouped
\cite{ref_article1,ref_lncs1,ref_book1},
\cite{ref_article1,ref_book1,ref_proc1,ref_url1}.
\subsubsection{Acknowledgements} Please place your acknowledgments at
the end of the paper, preceded by an unnumbered run-in heading (i.e.
3rd-level heading).
%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{samplepaper}
%
% \begin{thebibliography}{8}
% \bibitem{ref_article1}
% Author, F.: Article title. Journal \textbf{2}(5), 99--110 (2016)
% \bibitem{ref_lncs1}
% Author, F., Author, S.: Title of a proceedings paper. In: Editor,
% F., Editor, S. (eds.) CONFERENCE 2016, LNCS, vol. 9999, pp. 1--13.
% Springer, Heidelberg (2016). \doi{10.10007/1234567890}
% \bibitem{ref_book1}
% Author, F., Author, S., Author, T.: Book title. 2nd edn. Publisher,
% Location (1999)
% \bibitem{ref_proc1}
% Author, A.-B.: Contribution title. In: 9th International Proceedings
% on Proceedings, pp. 1--2. Publisher, Location (2010)
% \bibitem{ref_url1}
% LNCS Homepage, \url{http://www.springer.com/lncs}. Last accessed 4
% Oct 2017
% \end{thebibliography}
\end{document}