NJU Static Program Analysis 01: Introduction
Key questions in this lecture are:
- What are the differences between static analysis and (dynamic) testing?
- Understand soundness, completeness, false negatives, and false positives.
- Why is soundness (usually) required by static analysis?
- How to understand abstraction and over-approximation?
Notes
Programming language theory is a branch of computer science concerning the design, implementation, and analysis of formal languages. In this course, we will start learning about a specific domain of PL: static program analysis.
Static program analysis is a method that examines a computer program to reason about its behaviors and determine whether it satisfies certain properties before actually running it. This method is gaining importance as the complexity and scale of modern software keep growing.
Given a static analysis algorithm, we can evaluate it by checking whether it satisfies two properties: soundness and completeness.
- An algorithm is sound if, anytime it returns an answer, the answer is true;
- An algorithm is complete if it is guaranteed to return a true answer.
However, as Rice's Theorem indicates, there is no general approach that can determine whether a computer program satisfies a given non-trivial property. In other words, it is impossible to find a program analysis method that is both sound and complete; we must compromise one of the two properties.
If we compromise soundness, the result can contain false negatives, meaning the algorithm may miss answers we need. On the other hand, if we compromise completeness, the result can contain false positives, meaning the algorithm may report things we do not need.
For example, suppose we have two algorithms to find bugs in a program, and the program in fact has exactly 10 bugs. The algorithm that compromises soundness may report only 7 of them (missing 3 real bugs), while the algorithm that compromises completeness may report 12 (including 2 spurious ones).
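In set terms (my own notation, not the lecture's): let \(T\) be the set of real bugs and \(R\) the set of reported ones. Then

\[ {\rm false~negatives} = T \setminus R, \qquad {\rm false~positives} = R \setminus T. \]

In the example above, the first algorithm misses \(|T \setminus R| = 3\) real bugs, while the second makes \(|R \setminus T| = 2\) spurious reports (assuming its 12 reports cover all 10 real bugs).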
Therefore, we usually want a sound but not fully-precise (complete) algorithm, especially for important static-analysis applications such as compiler optimization and program verification.
Next, let's look at a concrete example of static analysis, in which we will meet two significant concepts: abstraction and over-approximation.
Suppose we are statically analyzing a program to determine the sign of all its variables. Every possible value of a variable in the program can be abstracted into one of a few categories. For example, in:
```java
int a = 10, b = 1, c = -1, d = 0, e = (something ? 1 : -1), f = 1 / 0;
```
we can use five abstract categories to describe all the possible values:
Symbol | Corresponding Values
---|---
\(\oplus\) | positive values, such as a and b
\(\ominus\) | negative values, such as c
\(0\) | zero, such as d
\(\top\) | unknown, or unable to determine statically, such as e
\(\bot\) | undefined or error, such as f
And that is the so-called abstraction, which is what we have always been doing when dealing with mathematics or programs.
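As a concrete illustration, here is a minimal Java sketch of this abstraction (the `Sign` enum and `abstractOf` function are my own hypothetical names, not from the course):

```java
/** The five abstract values of the sign analysis. */
enum Sign {
    POS,    // ⊕ : positive, e.g. a and b
    NEG,    // ⊖ : negative, e.g. c
    ZERO,   // 0 : zero, e.g. d
    TOP,    // ⊤ : unknown, e.g. e
    BOTTOM; // ⊥ : undefined or error, e.g. f

    /** Abstract a single concrete int into its sign category. */
    static Sign abstractOf(int v) {
        if (v > 0) return POS;
        if (v < 0) return NEG;
        return ZERO;
    }
}
```

Note that \(\top\) and \(\bot\) never come from a single concrete value; they only arise when the analysis merges different possibilities or meets an erroneous computation.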
Given all these abstract symbols, we also need to define some abstract operations:
Abstract Operation |
---|
\(\oplus + \oplus = \oplus\) |
\(\ominus + \ominus = \ominus\) |
\(\oplus + \ominus = \top\) |
\(\top \div 0 = \bot\) |
\(\dots\) |
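Continuing the hypothetical `Sign` sketch above, the abstract addition rules could be coded like this (one reasonable choice, not the course's implementation; in particular, letting \(\bot\) propagate through addition is my own assumption):

```java
/** Abstract addition over Sign values, following the rules above. */
static Sign plus(Sign x, Sign y) {
    if (x == Sign.BOTTOM || y == Sign.BOTTOM) return Sign.BOTTOM; // assume errors propagate
    if (x == Sign.ZERO) return y;  // 0 + y = y
    if (y == Sign.ZERO) return x;  // x + 0 = x
    if (x == y) return x;          // ⊕ + ⊕ = ⊕ and ⊖ + ⊖ = ⊖
    return Sign.TOP;               // ⊕ + ⊖, or any ⊤ operand, gives ⊤
}
```

For example, `plus(Sign.TOP, Sign.POS)` gives `TOP`: once imprecision enters, it tends to spread.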
You can imagine that when we analyze a real, complex program in this way, many variables will end up with the abstract value \(\top\), the unknown. That is part of over-approximation: we obtain a comprehensive but less precise result, which preserves the soundness of our algorithm.
What's more
In this part, I'll talk about a few points that were mentioned but not expanded on in the course and that interest me.
The proof of Rice's Theorem
I have a strong intuition that this problem is similar to Turing's halting problem, so I managed to construct a similar proof.
\(P.f.\)
Assume we have an algorithm \(H\) that can decide whether a given program has a non-trivial property \(\mathcal{P}\). Formally we have:
\[H(x) = \begin{cases} 1, & {\rm program}~x~{\rm has~the~property~} \mathcal{P} \\ 0, & {\rm program}~x~{\rm doesn't~have~the~property~} \mathcal{P} \\ \end{cases} \] And since \(\mathcal{P}\) is non-trivial, we can always find two programs \(Y(x)\) and \(N(x)\) satisfying \(H(Y) = 1\) and \(H(N) = 0\).
But then we can construct a self-referential program \(C\), sketched below, and ask what \(H(C)\) returns; either answer leads to a contradiction.
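The construction of \(C\) was omitted above; here is a minimal sketch of the diagonal program I believe is intended, assuming \(\mathcal{P}\) is a property of a program's behavior (so two programs that behave identically either both have \(\mathcal{P}\) or both lack it), and that \(C\) can obtain its own source code to feed to \(H\) by the usual quining trick:

\[C(x) = \begin{cases} N(x), & H(C) = 1 \\ Y(x), & H(C) = 0 \\ \end{cases}\]

If \(H(C) = 1\), then \(C\) behaves exactly like \(N\), which does not have \(\mathcal{P}\), so \(H(C)\) should be \(0\); if \(H(C) = 0\), then \(C\) behaves exactly like \(Y\), which has \(\mathcal{P}\), so \(H(C)\) should be \(1\). Either way, we reach a contradiction.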
\(Q.E.D.\)