Add some review and fixes

2023-04-07 13:16:15 +04:00 · 2023-04-07 13:16:15 +04:00 · 914a8edaf1
commit 914a8edaf1
parent fd15620da0
2 changed files with 65 additions and 37 deletions
--- a/paper.bib
+++ b/paper.bib
@@ -52,4 +52,37 @@
  booktitle={Proceedings of the 22nd ACM international conference on Information \& Knowledge Management},
  pages={1577--1580},
  year={2013}
+}
+
+@inproceedings{tang2022assessing,
+  title     = {Assessing software privacy using the privacy flow-graph},
+  author    = {Tang, Feiyang and {\O}stvold, Bjarte M},
+  booktitle = {Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security},
+  pages     = {7--15},
+  year      = {2022}
+}
+
+@inproceedings{vinayaka2021android,
+  title        = {Android malware detection using function call graph with graph convolutional networks},
+  author       = {Vinayaka, KV and Jaidhar, CD},
+  booktitle    = {2021 2nd International Conference on Secure Cyber Computing and Communications (ICSCCC)},
+  pages        = {279--287},
+  year         = {2021},
+  organization = {IEEE}
+}
+
+@inproceedings{soares2021integrating,
+  title        = {Integrating a Graph Builder into Python Tutor},
+  author       = {Soares, Diogo and Pereira, Maria Jo{\~a}o Varanda and Henriques, Pedro Rangel},
+  booktitle    = {Second International Computer Programming Education Conference (ICPEC 2021)},
+  year         = {2021},
+  organization = {Schloss Dagstuhl-Leibniz-Zentrum f{\"u}r Informatik}
+}
+
+@inproceedings{ghavamnia2020temporal,
+  title     = {Temporal system call specialization for attack surface reduction},
+  author    = {Ghavamnia, Seyedhamed and Palit, Tapti and Mishra, Shachee and Polychronakis, Michalis},
+  booktitle = {Proceedings of the 29th USENIX Conference on Security Symposium},
+  pages     = {1749--1766},
+  year      = {2020}
 }
--- a/paper.tex
+++ b/paper.tex
@@ -57,67 +57,62 @@ Currently, the most practical work of students in information technology include

 Typically, student work is a small program that solves typical problems. In most cases, these works contain few files or a few lines of code. The architecture and algorithms of such programs are also simple.

-The teacher needs to spend many time to check all the work. The teacher usually notices when the student has borrowed a program source code. Students in such cases do not change the structure of the borrowed source code, but rename variables or change types of loops (from \textit{for} to \textit{while}), etc.
+The teacher needs to spend much time checking all the works. The teacher usually notices when the student has borrowed program source code. Students in such cases do not change the structure of the borrowed source code, but rename variables or change types of loops (from \textit{for} to \textit{while}), etc.

-The software system proposed in this article allows you to analyze the structure of projects and provide information about their structural similarity. The indicator of the uniqueness of the current project structure is used to evaluate the uniqueness of the project in comparison with other projects.
+The software system proposed in this article allows you to analyze the structure of projects and provide information about their structural similarity. The indicator of the uniqueness of the current project structure is used to evaluate the uniqueness of each project in comparison with the others.

 \section{State of the art}

 There are no universal methods for analyzing the source code of software systems at the moment. Certain methods of analysis are used to solve various problems.

-There is a group of methods for analyzing source code, which is based on obtaining and analyzing abstract syntax trees (AST). An AST is an abstract representation of the grammatical structure of a source code. It expresses the structure of a program in some programming language as a tree structure. Each AST node is an operator or a set of operators of the analyzed source code. The compiler generates an AST because of the parsing step. Unlike a parse tree, an AST does not have nodes or edges for syntax rules that do not affect the semantics of the program (for example, grouping brackets).
+We can analyze projects using call graph generation tools, such as \textit{CodeViz} or \textit{Egypt}, or reverse engineering tools, such as IDA Pro. Call-graph-based approaches help developers comprehend programs for better maintenance or to reduce security issues \cite{ghavamnia2020temporal,soares2021integrating,tang2022assessing,vinayaka2021android}.

-It can also analyze projects using call graph generation tools, such as \textit{CodeViz} or \textit{Egypt}. It is possible to use some functions of reverse engineering tools, such as IDA Pro.
+Another group of methods is based on obtaining and analyzing abstract syntax trees (AST). An AST is an abstract representation of the grammatical structure of source code. It expresses the structure of a program as a tree and is largely independent of the programming language. Each AST node is an operator or a set of operators of the analyzed source code. The compiler generates an AST during the parsing step. Unlike a parse tree, an AST does not contain nodes or edges that do not affect the semantics of the program (for example, grouping brackets).

-AST-based approaches can find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches analyze a larger number of parameters than is necessary to solve the problem of this study \cite{aleksey2020approach,beniwal2021npmrec,nadezhda2019approach,nguyen2018crosssim,nguyen2020automated}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.
+AST-based approaches allow us to find structurally similar projects. However, such approaches have high computational complexity \cite{nguyen2018crosssim}. Many existing approaches analyze a larger number of parameters than is necessary to solve the problem of this study \cite{aleksey2020approach,beniwal2021npmrec,nadezhda2019approach,nguyen2018crosssim,nguyen2020automated}: project dependencies, the number of stars in the repository, the contents of the documentation, etc.

-The paper \cite{ali2011overview} presents an analysis of approaches and software tools for searching for borrowings in the text and source code. However, there is no mention of software tools for searching for borrowings in the source code.
+The paper \cite{ali2011overview} presents a review of approaches and software tools for searching for borrowings in the text and source code. However, there is no mention of existing software tools for searching for borrowings in the source code.

-In the article \cite{chae2013software}, the authors analyze borrowings in the source code according to the sequences of using external programming interfaces and the frequency of such calls. This method is not suitable for solving the problem of this study because of the educational orientation of the analyzed source code.
+In the article \cite{chae2013software}, the authors analyze borrowings in the source code according to the sequences of using external programming interfaces (external dependencies) and the frequency of such calls. This method is not suitable for solving the problem of this study because of the educational orientation of the analyzed source code: some student projects may not use external dependencies.

-Thus, it is necessary to develop an approach to the search for structurally similar projects, which are focused on working with simple software systems and with a high speed of analysis.
+Thus, it is necessary to develop an approach to searching for structurally similar projects that is focused on simple software systems and a high speed of analysis.

 \section{The Proposed Algorithm for Analyzing the Structure of the Source Code}

-The source code of the software system in the proposed algorithm is the main source of data for identifying structural features. 
+In the proposed algorithm, the source code of the software system is the main data source for identifying structural features.

-We formed an AST to analyze the source code. There are various libraries and tools for all existing programming languages for the formation of AST. Thus, using your own representation of the AST allows you to add support for new programming languages to the system without changing the analysis algorithms.
-
-We will use the following AST model:
+We formed an AST to analyze the source code. There are various libraries and tools for all existing programming languages for the formation of an AST. We use our own representation of the AST so that support for new programming languages can be added without changing the analysis algorithms.

+We define the proposed AST model as follows:
 \begin{equation*}    
    AST = \langle N,R \rangle,
 \end{equation*}
-
-where $N = \lbrace N_{1}, N_{2},\ldots, N_{n}\rbrace$ is the set of nodes AST;
-
-$N_{i} = \langle name, data \rangle$ is an i-th AST node containing the node name, node data;
-
+where $N = \lbrace N_{1}, N_{2},\ldots, N_{n}\rbrace$ is the set of AST nodes;\\
+$N_{i} = \langle name, data \rangle$ is the $i$-th AST node containing the node name and data;\\
 $R$ is the set of relations between AST nodes.

-We developed an algorithm to highlight the structure of the project in analyzing the source code, which comprises the following steps:
-
+We developed an algorithm to extract the structure of the project during source code analysis. The proposed algorithm contains the following steps:
 \begin{enumerate}
-    \item Form an ASD for the project.
-    \item Select nodes with type “Class”: 
+    \item Extract the AST from the project.
+    \item Select nodes with the `Class' type as the $N^{Class}$ set: 
    \begin{equation*}    
-        N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{'Class'} \rbrace,
+        N^{Class} = \lbrace N_{i} \in N | F \left( N_{i}.data \right) = \text{`Class'} \rbrace,
    \end{equation*}
-    \item Find nodes with the “Class field” type in the found classes: 
+    \item Select nodes with the `Class field' type as the $N^{Vars}$ set from the $N^{Class}$ set: 
    \begin{equation*}    
-        N^{Vars} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{'Field'} \rbrace,
+        N^{Vars} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{`Field'} \rbrace,
    \end{equation*}
-    \item Find nodes with the “Methods” type in the found classes: 
+    \item Select nodes with the `Methods' type as the $N^{Methods}$ set from the $N^{Class}$ set: 
    \begin{equation*}    
-        N^{Methods} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{'Method'} \rbrace,
+        N^{Methods} = \lbrace N_{i}^{Class} \in N^{Class} | F \left( N_{i}^{Class}.data \right) = \text{`Method'} \rbrace,
    \end{equation*}
-    \item Find nodes with type “Method Argument” in the found methods: 
+    \item Select nodes with the `Method Argument' type as the $N^{MethodsArgs}$ set from the $N^{Methods}$ set: 
    \begin{equation*}    
-        N^{MethodsArgs} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{'Arg'} \rbrace,
+        N^{MethodsArgs} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Arg'} \rbrace,
    \end{equation*}
-    \item Find nodes with the type “Operator” in the found methods: 
+    \item Select nodes with the `Operator' type as the $N^{MethodsOps}$ set from the $N^{Methods}$ set: 
    \begin{equation*}    
-        N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{'Operator'} \rbrace,
+        N^{MethodsOps} = \lbrace N_{i}^{Methods} \in N^{Methods} | F \left( N_{i}^{Methods}.data \right) = \text{`Operator'} \rbrace,
    \end{equation*}
    \item Build the AST for the analyzed source code from the previously obtained sets, considering the set of relations $R$.
    \item Save the resulting AST in a graph database (GDB) to facilitate data handling. 
@@ -191,12 +186,12 @@ We used neo4j GDB for data storage. Possibly redundant expression GDBs over rela
 The GDB data model allows you to store the following nodes:

 \begin{itemize}
-    \item nodes with type “Package” (Java-specific);
-    \item nodes with type “Class”;
-    \item nodes with the “Class field” type;
-    \item nodes with type “Method”;
-    \item nodes with type “Method argument”;
-    \item nodes with type “Operator”.
+    \item nodes with type `Package' (Java-specific);
+    \item nodes with type `Class';
+    \item nodes with the `Class field' type;
+    \item nodes with type `Method';
+    \item nodes with type `Method argument';
+    \item nodes with type `Operator'.
 \end{itemize}

 We arrange the nodes in the GDB hierarchically. For example, a class is in a package, but a method is in a class. The data model allows you to form the following relationships between graph nodes:
@@ -230,7 +225,7 @@ Figure \ref{ResultOfRequest} shows the result of the query execution.
    \caption{An example of the result of a request to Neo4j.} \label{ResultOfRequest}
 \end{figure}

-Figure \ref{ResultOfRequest} shows that the hash \textit{"b199ef8568f72c43f6fd50860e228c51"} matches the path \textit{[“root”, “cont”]}. Two graphs contain this hash. The primary keys of the root nodes of these graphs are 7872 and 7977.
+Figure \ref{ResultOfRequest} shows that the hash \textit{"b199ef8568f72c43f6fd50860e228c51"} matches the path \textit{[`root', `cont']}. Two graphs contain this hash. The primary keys of the root nodes of these graphs are 7872 and 7977.


 Thus, we can discover the number of matching and different paths in the analyzed projects by obtaining hashes for all AST fragments. Figure \ref{ExampleSystem} shows an example of the developed system.
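
The idea of counting matching and different paths through hashes can be illustrated with a minimal Python sketch. It is not the system's implementation: the tuple-based tree, the helper names `path_hashes` and `compare`, the use of MD5, and joining node names with `/` are all assumptions made for illustration, and the sketch does not query Neo4j.

```python
import hashlib


def path_hashes(tree, prefix=()):
    """Yield a hex digest for every root-to-node path of node names.

    `tree` is a hypothetical (name, children) tuple, not the paper's data model.
    """
    name, children = tree
    path = prefix + (name,)
    # Assumption: a path is serialized by joining names with "/" and hashed with MD5.
    yield hashlib.md5("/".join(path).encode("utf-8")).hexdigest()
    for child in children:
        yield from path_hashes(child, path)


def compare(tree_a, tree_b):
    """Return (number of matching path hashes, number of differing path hashes)."""
    a = set(path_hashes(tree_a))
    b = set(path_hashes(tree_b))
    return len(a & b), len(a ^ b)


# Two toy projects that share the paths [root] and [root, cont].
project_a = ("root", [("cont", []), ("main", [])])
project_b = ("root", [("cont", []), ("util", [])])

matching, different = compare(project_a, project_b)
print(matching, different)
```

Here the two toy trees share two paths and differ in two, so a similarity indicator could be derived from the ratio of matching hashes to all hashes; how the actual system weighs these counts is not specified in this section.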