Interprocedural analysis
------------------------
We assume an abstract lattice of the form D = P(A) with A a finite set
and that we have some basic commands c with given transfer functions
[[c]] : D->D. In complexity estimations we will use the notation a =
|A| for the size of the underlying set.
Some concrete examples:
- Constant propagation: A consists of pairs (x,v) where x is a
global variable and v is a possible value for that variable.
For example, v:{0,1,2,T}. The transfer function of, say, x=y+z is
f(d) = d \ (x,_) u {(x,v) | v = vy +' vz, (y,vy),(z,vz):d} with +' the
abstract addition function on {0,1,2,T}. Here (x,_) stands for all
pairs (x,v) with v:{0,1,2,T}.
- Reaching definitions: here A would be the set of code lines
containing assignments.
- Taintedness: A consists of pairs (x,T), (x,U) where x is a string variable.
T represents tainted values and U represents untainted values. The transfer
function of a command x=userInput is f(d)=d \ (x,_) u {(x,T)}.
The transfer function of x=y+z would be similar to the example
under constant propagation with the understanding that U+T=T+U=T,
U+U=U. The transfer function of x=sanitize(y)
would be f(d) = d \ (x,_) u {(x,U)}.
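The transfer functions above can be sketched directly as operations on Python sets of pairs. This is a minimal illustration, not library code; the helper names (kill, user_input, etc.) are ours, and the x=y+z case is simplified to "tainted if any operand may be tainted", which coincides with the pairwise +' definition when each variable carries a single tag.

```python
# Abstract state d : a set of (variable, tag) pairs, tag in {"T", "U"}.

def kill(d, x):
    """d \ (x,_): remove all pairs for variable x."""
    return {(y, t) for (y, t) in d if y != x}

def user_input(d, x):        # x = userInput()
    return kill(d, x) | {(x, "T")}

def sanitize(d, x):          # x = sanitize(y)
    return kill(d, x) | {(x, "U")}

def plus(d, x, y, z):        # x = y + z, with U+U=U and T absorbing
    tags = {t for (v, t) in d if v in (y, z)}
    result = "T" if "T" in tags else "U"
    return kill(d, x) | {(x, result)}

d = user_input(set(), "x")            # {("x","T")}
d = plus(d, "w", "x", "x")            # w becomes tainted since x is
assert d == {("x", "T"), ("w", "T")}
d = sanitize(d, "x")
assert ("x", "U") in d and ("x", "T") not in d
```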
Intraprocedural program analysis associates with each edge e in the
control flow graph a variable X_e of type D. One then computes the
least solution of an equation system whose left hand sides are these
variables and whose right hand sides are expressions involving the
variables themselves, constants, the transfer functions, and union
(representing joins).
Consider, for example, the following abstract program:
1: if(...)
2: c;
3: else while (...) {
4: if (...)
5: d;
6: else e;
7: }
8:
X2 = X1
X3 = X1 + X7
X8 = [[c]](X2) + X3
X4 = X3
X5 = X4
X6 = X4
X7 = [[d]](X5) + [[e]](X6)
Here Xi represents the sum of the edge(s) going into i.
In the course we have seen various ways of solving such equation
systems, all based on iteratively updating the values of the unknowns,
starting from 0 (the least element), by replacing them with the current
values of the respective right-hand sides until everything stabilises.
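Naive round-robin iteration for the example system can be sketched as follows. The domain and transfer functions are placeholders of our choosing: D is a set of statement labels, and each command simply adds its own label (a toy "reaching statements" analysis), so that the loop-carried equation for X3 genuinely needs more than one round.

```python
# Round-robin fixpoint iteration: re-evaluate every right-hand side
# until no variable changes. Each right-hand side is a function of the
# current valuation V.

def solve(equations):
    val = {v: frozenset() for v in equations}
    changed = True
    while changed:
        changed = False
        for v, rhs in equations.items():
            new = rhs(val)
            if new != val[v]:
                val[v] = new
                changed = True
    return val

gen = lambda label: lambda s: s | {label}   # toy [[c]], [[d]], [[e]]
c, d, e = gen("c"), gen("d"), gen("e")

equations = {
    "X1": lambda V: frozenset(),            # empty entry state
    "X2": lambda V: V["X1"],
    "X3": lambda V: V["X1"] | V["X7"],
    "X8": lambda V: c(V["X2"]) | V["X3"],
    "X4": lambda V: V["X3"],
    "X5": lambda V: V["X4"],
    "X6": lambda V: V["X4"],
    "X7": lambda V: d(V["X5"]) | e(V["X6"]),
}

result = solve(equations)
assert result["X8"] == {"c", "d", "e"}      # both branches reach node 8
```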
Interprocedural analysis: motivation & problem statement
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Interprocedural analysis is concerned with the case where (possibly
recursive) procedures are present. For simplicity, we assume at this
point that all variables are global and that procedures are
parameterless and do not return values.
Consider, for example, the following abstract program.
0: main(){
1: if (...) {
2: c;
3: P();
4: }
5: else {
6: d;
7: P();
8: }}
9: P(){
10: if(...) {
11: P();
12: a;
13: P();
14: } else
15: b;}
The problem is that when we try to analyse the body of P using the
intraprocedural approach, we do not know what to take for the initial
state, i.e. what right-hand side to use for X9, so to speak.
One possibility is to simply use for this purpose the sum of the
states of all "call-sites" for P. In our example, this would result in
the equation
X9 = X7+X3+X11+X13
It is not hard to see that this may result in rather gross
overapproximations. Consider, for instance, the following concrete
example from taintedness analysis:
0: main(){
1: x=userInput();
2: P();
3: x=sanitize(x);
4: P();
5: }
6: P(){
7: {}
8: }
Here, we would falsely conclude that x is tainted at line 5.
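The loss of precision can be traced by hand. The sketch below uses the taintedness transfer functions from above (KT for x=userInput(), KU for x=sanitize(x), both our own naming) and unrolls the call-site-join analysis: since P's body is empty, its exit state equals its entry state, which is the union over both call sites.

```python
# Call-site join: P is analysed once, from the union of both call states.

KT = lambda d: {p for p in d if p[0] != "x"} | {("x", "T")}   # x = userInput()
KU = lambda d: {p for p in d if p[0] != "x"} | {("x", "U")}   # x = sanitize(x)

after_line_1 = KT(set())             # {("x","T")}: state at the first call of P
after_line_3 = KU(after_line_1)      # {("x","U")}: state at the second call

p_entry = after_line_1 | after_line_3      # joined entry state of P
after_line_4 = p_entry               # empty body: exit state = entry state

assert ("x", "T") in after_line_4    # spurious taint reported at line 5
```

The genuinely reachable state at line 5 is {("x","U")}; the join at P's entry is what smuggles the stale ("x","T") fact past the sanitization.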
Context-sensitive analysis
- - - - - - - - - - - - -
One can improve the situation somewhat by analysing the procedure
several times, once per "call site" or indeed once per calling
context, which may also take the calling procedure or some other
abstraction of the call stack into account ("context-sensitive
analysis"). In the present case, we would then get two versions of
P: one to be used in line 2 and returning {(x,T)}, and the other one
to be used in line 4 and returning {(x,U)}.
Precise interprocedural analysis
- - - - - - - - - - - - - - - -
A more accurate approach consists of
introducing unknowns of type D->D for the procedures, i.e. making
their "return state" depend on the respective initial state. Of
course, the intermediate abstract states in the bodies of the
procedures will then also depend on the respective initial states so
that all unknowns will be raised to the type D->D. In the example, we
would then get the following system of equations:
X1 = Id
X2 = KT o X1
X3 = XP o X2
X4 = KU o X3
X5 = XP o X4
XP = Id
where KT(d) = d \ (x,_) u {(x,T)}, KU(d) = d \ (x,_) u {(x,U)}, and
Id(d) = d. Also, "o" stands for function composition.
The least solution of this accurately maps XP to the identity function
and hence X5({}) = {(x,U)}.
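The least solution of this system can be written out directly, since there is no recursion through XP here: every unknown is a function D -> D built by composition. A minimal sketch (KT, KU, Id are the functions defined above, encoded on Python frozensets):

```python
# Functional (Sharir-Pnueli) approach on the sanitize example:
# unknowns are functions D -> D, equations become compositions.

KT = lambda d: frozenset(p for p in d if p[0] != "x") | {("x", "T")}
KU = lambda d: frozenset(p for p in d if p[0] != "x") | {("x", "U")}
Id = lambda d: d

XP = Id                          # least solution of XP = Id
X1 = Id
X2 = lambda d: KT(X1(d))         # X2 = KT o X1
X3 = lambda d: XP(X2(d))         # X3 = XP o X2
X4 = lambda d: KU(X3(d))         # X4 = KU o X3
X5 = lambda d: XP(X4(d))         # X5 = XP o X4

assert X5(frozenset()) == {("x", "U")}   # x is untainted at line 5
```

Because XP really is the identity here, the precise answer {(x,U)} falls out, in contrast to the call-site-join analysis.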
The disadvantage of this approach is that variables now range over an
exponentially larger domain, to wit |D->D| = |D|^|D| = 2^{a*2^a}.
(Recall that D=2^A and a=|A|.) In particular, if we use naive fixpoint
iteration to solve an equation system with n variables ranging over
D->D then each variable can go up a*2^a many times so that O(n*a*2^a)
rounds are needed. Each single round also incurs an exponential effort
since the computation of the functional compositions requires filling
in an exponentially (in the parameter a) large table.
Historically, this idea seems to go back to Sharir and Pnueli (1978)
where it is called the *functional approach* to interprocedural
analysis.
The IFDS Framework
- - - - - - - - -
Horwitz, Reps, and Sagiv noticed that in those cases where all
transfer functions of basic commands are distributive, i.e. f(d u d')
= f(d) u f(d'), the same holds true for the transfer functions of
blocks and procedures, i.e. for the intermediate and final values of
the variables of type D->D, such as the Xi and XP introduced above.
Now, a distributive function f:D->D is uniquely determined by its
effect on singletons, i.e. by the function x |-> f({x}). This is
because f({x1,x2,...,xk}) = f({x1}) u ... u f({xk}). As a result, one
can use variables of type A->D rather than D->D which results in an
exponential improvement since |A->D| = 2^{a^2} which is not so far
away from the 2^a in the intraprocedural case. For technical reasons,
it is more convenient to work with binary relations over A rather than
functions A->D:
Definition: a distributive function f : D->D is induced by a relation
f0:2^{AxA} if y:f({x}) <==> x f0 y.
Notice that each distributive f is induced by a unique binary relation.
Lemma: If f,g:D->D are induced by relations f0, g0 then the functional
composition h = g o f is induced by the relational composition f0;g0
given by
(f0;g0) = {(x,y) | exists x'. (x,x'):f0, (x',y):g0}
Thus, assuming distributivity, we can obtain an equation system for
interprocedural analysis by replacing each variable of type D->D in
the Sharir-Pnueli approach with a variable ranging over binary
relations over A and by replacing functional composition with
relational composition.
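The definition and the lemma are easy to check mechanically. The sketch below (all names ours) builds the induced relation of a distributive function, the function induced by a relation, and relational composition, then verifies g o f = "function induced by f0;g0" on every subset of a small example universe A = {1,2,3} with two arbitrarily chosen distributive functions.

```python
from itertools import combinations

A = {1, 2, 3}

def powerset(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def induced(f):
    """The relation f0 with x f0 y iff y : f({x})."""
    return {(x, y) for x in A for y in f(frozenset({x}))}

def apply_rel(r, d):
    """The distributive function induced by relation r."""
    return {y for (x, y) in r if x in d}

def compose_rel(f0, g0):
    """Relational composition f0;g0."""
    return {(x, y) for (x, z) in f0 for (z2, y) in g0 if z == z2}

# two distributive example functions, chosen arbitrarily
f = lambda d: {x + 1 for x in d if x + 1 in A}
g = lambda d: {x for x in d if x != 2} | ({3} if 1 in d else set())

f0, g0 = induced(f), induced(g)

# the lemma: g o f is induced by f0;g0, on all 2^|A| inputs
for d in powerset(A):
    assert apply_rel(compose_rel(f0, g0), d) == set(g(f(d)))
```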
Horwitz et al. introduced the acronym IFDS (interprocedural, finite,
distributive, subset) for this situation. Notice that the fact that
our abstract domain is a finite powerset (a set of *subsets*) was also
important for the trick with the binary relations to work. It is
possible to slightly generalise it to lattices with the property that
each element can be uniquely expressed as a finite join of "atoms",
and in particular to sublattices of powerset lattices.
Complexity of naive iteration in the IFDS setting
- - - - - - - - - - - - - - - - - - - - - - - - -
Now, a program of size n will result in an equation system with O(n)
variables ranging over binary relations over A and with constant size
right hand sides whose evaluation will thus incur an effort of
O(a^3). This is because, to fill in a single one of the a^2 many slots,
one needs to go through O(a) middle values resulting from the
existential quantifier in the definition of relational composition.
Thus, the complete re-evaluation of all the right-hand sides incurs an
effort of O(n*a^3). Every variable can be increased at most a^2 times
so that at most n*a^2 rounds are necessary, resulting in an overall
upper bound of O(n^2 * a^5).
We notice that this bound, while still being rather large, is
polynomial in n and a thus demonstrating that in the IFDS situation
interprocedural analysis is possible in polynomial time.
IFDS in cubic time
- - - - - - - - - -
We will now show that by applying a series of rather generic
optimizations to the naive fixpoint iteration we can bring the
asymptotic runtime of IFDS interprocedural analysis down to O(n *
a^3).
In the original paper on IFDS the same bound is achieved using a
graph-theoretic encoding of relations which results in a rather
complicated description (in the author's opinion). The reader is
invited to look up the original IFDS algorithm and also more recent
formulations in current lecture notes and compare to the following
more declarative presentation.
Recall that we have to solve an equation system of n equations of the
form X = RHS where X is a variable ranging over binary relations over
A and where the constant-size right-hand side RHS is built up from
other variables and fixed relations by relational composition and union.
Now, we specialize the arguments for each relational variable, i.e. we
replace each relational variable X with a^2 many boolean variables
X(x,y) representing the truth value of xXy for each pair
x,y:A. Expanding the relational compositions into boolean expressions
of size O(a) then yields an equation system with n*a^2 boolean
variables and right-hand sides of size O(a) which are boolean
expressions in the variables and in fixed boolean values that can be
looked up from the value tables for the fixed, builtin relations.
For example, an equation X = Y;Z + W with A = {u,v,w} becomes
X(u,v) = (Y(u,u) & Z(u,v) v Y(u,v) & Z(v,v) v Y(u,w) & Z(w,v)) v
W(u,v)
X(u,u) = ...
X(u,w) = ...
...
X(w,w) = ...
By introducing O(a) further boolean variables for intermediate results
we can transform this into a system with O(n*a^3) many boolean
variables and constant-size right-hand sides. Applying this further
transformation to the first equation above results in
X(u,v) = H1(u,v) v W(u,v)
H1(u,v) = Y(u,u) & Z(u,v) v H2(u,v)
H2(u,v) = Y(u,v) & Z(v,v) v H3(u,v)
H3(u,v) = Y(u,w) & Z(w,v)
Now, we solve this system using fixpoint iteration with the following
simple optimization: whenever the value of a variable changes, only
re-evaluate those right-hand sides in which this variable actually
occurs.
Now suppose that the i-th boolean variable occurs in M_i many right hand sides.
Since the right-hand sides have constant size we have
sum_i M_i = O(n*a^3)
Being boolean, each one of the O(n*a^3) variables can go up at most
once during the entire iteration process, and its doing so will result
in M_i right-hand sides having to be re-evaluated, thus a work effort
of O(M_i). The total effort for the iteration is therefore O(sum_i
M_i) = O(n*a^3), as desired.
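The optimized iteration can be sketched as a worklist solver over boolean equations. All names and the exact equation encoding are ours: every right-hand side is binary (op, y, z) with op in {"and", "or"}, as produced by the intermediate-variable transformation (we split the three-literal H equations with extra helpers t1, t2), and an occurrence index maps each variable to the equations mentioning it.

```python
from collections import defaultdict, deque

def solve(equations, constants=()):
    """equations: dict var -> (op, y, z) with op in {"and", "or"};
    constants: variables that hold the fixed value True."""
    val = defaultdict(bool)              # everything starts at False (= 0)
    occurs = defaultdict(list)           # var -> equations it occurs in
    for x, (op, y, z) in equations.items():
        occurs[y].append(x)
        occurs[z].append(x)
    for c in constants:
        val[c] = True
    work = deque(constants)
    while work:
        v = work.popleft()
        for x in occurs[v]:              # only affected right-hand sides
            op, y, z = equations[x]
            new = (val[y] and val[z]) if op == "and" else (val[y] or val[z])
            if new and not val[x]:       # boolean values only ever go up
                val[x] = True
                work.append(x)
    return val

# The worked example X(u,v) = (Y;Z)(u,v) v W(u,v), with helpers t1, t2
# splitting each three-literal right-hand side into binary ones:
eqs = {
    "H3uv": ("and", "Yuw", "Zwv"),
    "t1":   ("and", "Yuv", "Zvv"),
    "H2uv": ("or",  "t1",  "H3uv"),
    "t2":   ("and", "Yuu", "Zuv"),
    "H1uv": ("or",  "t2",  "H2uv"),
    "Xuv":  ("or",  "H1uv", "Wuv"),
}
val = solve(eqs, constants=["Yuw", "Zwv"])
assert val["Xuv"] and not val["Wuv"]     # X(u,v) holds via Y(u,w) & Z(w,v)
```

Each variable enters the worklist at most once, and popping it touches only the equations where it occurs, which is exactly the O(sum_i M_i) accounting above.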