Interprocedural analysis
------------------------

We assume an abstract lattice of the form D = P(A) with A a finite set, and that we have basic commands c with given transfer functions [[c]] : D->D. In complexity estimations we will use the notation a = |A| for the size of the underlying set. Some concrete examples:

- Constant propagation: A consists of pairs (x,v) where x is a global variable and v is a possible value for that variable, for example v:{0,1,2,T}. The transfer function of, say, x=y+z is

      f(d) = d \ (x,_) u {(x,v) | v = vy +' vz, (y,vy),(z,vz):d}

  with +' the abstract addition function on {0,1,2,T}. Here (x,_) stands for all pairs (x,v) with v:{0,1,2,T}.

- Reaching definitions: here A would be the set of code lines containing assignments.

- Taintedness: A consists of pairs (x,T), (x,U) where x is a string variable; T represents tainted values and U represents untainted values. The transfer function of a command x=userInput is

      f(d) = d \ (x,_) u {(x,T)}

  The transfer function of x=y+z is similar to the example under constant propagation, with the understanding that T+T = U+T = T+U = T and U+U = U. The transfer function of x=sanitize(y) is

      f(d) = d \ (x,_) u {(x,U)}

Intraprocedural program analysis associates with each edge e in the control flow graph a variable X_e of type D. One then computes the least solution of an equation system whose left-hand sides are these variables and whose right-hand sides are expressions built from the variables themselves, constants, the transfer functions, and union (representing joins). Consider, for example, the following abstract program:

1: if(...)
2:   c;
3: else while (...) {
4:   if (...)
5:     d;
6:   else e;
7: }
8:

X2 = X1
X3 = X1 + X7
X8 = [[c]](X2) + X3
X4 = X3
X5 = X4
X6 = X4
X7 = [[d]](X5) + [[e]](X6)

Here Xi represents the sum (join) of the values on the edge(s) going into line i.
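To make the set-based transfer functions concrete, here is a small Python sketch of the taintedness example, representing D as Python sets of pairs. The helper names (kill, user_input, concat, sanitize) are illustrative choices, not fixed by the notes.

```python
def kill(d, x):
    """d \ (x,_): remove all facts about variable x."""
    return {p for p in d if p[0] != x}

def user_input(x):
    """Transfer function of x = userInput(): d \ (x,_) u {(x,T)}."""
    return lambda d: kill(d, x) | {(x, 'T')}

def sanitize(x):
    """Transfer function of x = sanitize(...): d \ (x,_) u {(x,U)}."""
    return lambda d: kill(d, x) | {(x, 'U')}

def concat(x, y, z):
    """Transfer function of x = y+z with U+U = U and T absorbing."""
    def f(d):
        vals = {('T' if ty == 'T' or tz == 'T' else 'U')
                for (vy, ty) in d if vy == y
                for (vz, tz) in d if vz == z}
        return kill(d, x) | {(x, t) for t in vals}
    return f
```

For instance, `concat('z', 'x', 'y')` applied to `{('x','T'), ('y','U')}` adds the fact `('z','T')`, since one operand is tainted.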
In the course we have seen various ways of solving such equation systems, all based on iteratively updating the values of the unknowns, starting from 0, by replacing them with the current values of the respective right-hand sides until everything stabilises.

Interprocedural analysis: motivation & problem statement
- - - - - - - - - - - - - - - - - - - - - - - - - - - -

Interprocedural analysis is concerned with the case where (possibly recursive) procedures are present. For simplicity, we assume at this point that all variables are global and that procedures are parameterless and do not return values. Consider, for example, the following abstract program.

0: main(){
1:   if (...) {
2:     c;
3:     P();
4:   }
5:   else {
6:     d;
7:     P();
8: }}

9: P(){
10:   if(...) {
11:     P();
12:     a;
13:     P();
14:   } else
15:     b;}

The problem is that when we try to analyse the body of P using the intraprocedural approach we do not know what to take as the initial state, i.e. what right-hand side to use for X9, so to speak. One possibility is simply to use the sum of the states at all "call sites" of P. In our example, this would result in the equation

X9 = X7 + X3 + X11 + X13

It is not hard to see that this may result in rather gross overapproximations. Consider, for instance, the following concrete example from taintedness analysis:

0: main(){
1:   x=userInput();
2:   P();
3:   x=sanitize(x);
4:   P();
5: }

6: P(){
7:   {}
8: }

Here we would falsely conclude that x is tainted at line 5, because the merged entry state of P contains (x,T) from the first call site.

Context-sensitive analysis
- - - - - - - - - - - - -

One can improve the situation somewhat by analysing the procedure several times, once per call site, or indeed once per calling context, which may also take the calling procedure or some other abstraction of the call stack into account ("context-sensitive analysis"). In the present case, we would then get two versions of P: one to be used at line 2, returning {(x,T)}, and the other to be used at line 4, returning {(x,U)}.
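The imprecision of the call-site join can be replayed in a few lines of Python. This is a hand-evaluated sketch of the taintedness example above, not an implementation of an analyser; the variable names are illustrative.

```python
# Abstract states (sets of facts) just before the two call sites of P:
entry_site1 = {('x', 'T')}   # before P() at line 2: x is tainted
entry_site2 = {('x', 'U')}   # before P() at line 4: x was sanitized

# The call-site-join approach uses one merged entry state for P ...
p_entry = entry_site1 | entry_site2

# ... and since P's body is empty, its exit state equals its entry state.
p_exit = p_entry

# The state after the second call therefore still contains (x,T):
state_after_second_call = p_exit
```

The fact `('x','T')` survives into `state_after_second_call`, which is exactly the false taintedness report at line 5.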
Precise interprocedural analysis
- - - - - - - - - - - - - - - -

A more accurate approach consists of introducing unknowns of type D->D for the procedures, i.e. making their "return state" depend on the respective initial state. Of course, the intermediate abstract states in the bodies of the procedures will then also depend on the respective initial states, so that all unknowns are raised to the type D->D. In the example, we get the following system of equations:

X1 = Id
X2 = KT o X1
X3 = XP o X2
X4 = KU o X3
X5 = XP o X4
XP = Id

where

KT(d) = d \ (x,_) u {(x,T)}
KU(d) = d \ (x,_) u {(x,U)}
Id(d) = d

and "o" stands for function composition. The least solution of this accurately maps XP to the identity function and hence X5({}) = {(x,U)}.

The disadvantage of this approach is that the variables now range over an exponentially larger domain, to wit |D->D| = |D|^|D| = 2^{a*2^a}. (Recall that D = 2^A and a = |A|.) In particular, if we use naive fixpoint iteration to solve an equation system with n variables ranging over D->D then each variable can go up a*2^a many times, so that O(n*a*2^a) rounds are needed. Each single round also incurs an exponential effort, since the computation of the functional compositions requires filling in an exponentially (in the parameter a) large table.

Historically, this idea seems to go back to Sharir and Pnueli (1978), where it is called the *functional approach* to interprocedural analysis.

The IFDS Framework
- - - - - - - - -

Horwitz, Reps, and Sagiv noticed that in those cases where all transfer functions of basic commands are distributive, i.e.

f(d u d') = f(d) u f(d')

the same holds true for the transfer functions of blocks and procedures, i.e. for the intermediate and final values of the variables of type D->D, such as the Xi and XP introduced above. Now, a distributive function f : D->D is uniquely determined by its effect on singletons, i.e. by the function x |-> f({x}). This is because

f({x1,x2,...,xk}) = f({x1}) u ... u f({xk}).
As a result, one can use variables of type A->D rather than D->D, which results in an exponential improvement since |A->D| = 2^{a^2}, not so far away from the 2^a of the intraprocedural case. For technical reasons, it is more convenient to work with binary relations over A rather than functions A->D:

Definition: a distributive function f : D->D is induced by a relation f0 : 2^{AxA} if

y : f({x})  <==>  x f0 y.

Notice that each distributive f is induced by a unique binary relation. (Strictly speaking, the effect on singletons does not record f({}); the original IFDS formulation accounts for this by enriching A with an artificial extra element 0. We gloss over this point here.)

Lemma: If f,g : D->D are induced by relations f0, g0 then the functional composition h = g o f is induced by the relational composition f0;g0 given by

f0;g0 = {(x,y) | exists x'. (x,x'):f0, (x',y):g0}

Thus, assuming distributivity, we can obtain an equation system for interprocedural analysis by replacing each variable of type D->D in the Sharir-Pnueli approach with a variable ranging over binary relations over A, and by replacing functional composition with relational composition. Horwitz et al introduced the acronym IFDS (interprocedural, finite, distributive, subset) for this situation. Notice that the fact that our abstract domain is a finite powerset (a set of *subsets*) was also important for the trick with the binary relations to work. It is possible to generalise it slightly to lattices with the property that each element can be uniquely expressed as a finite join of "atoms", in particular to sublattices of powerset lattices.

Complexity of naive iteration in the IFDS setting
- - - - - - - - - - - - - - - - - - - - - - - - -

Now, a program of size n will result in an equation system with O(n) variables ranging over binary relations over A and with constant-size right-hand sides whose evaluation thus incurs an effort of O(a^3). This is because, to fill in a single one of the a^2 many slots, one needs to go through the O(a) middle values arising from the existential quantifier in the definition of relational composition.
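The definition and the lemma can be checked mechanically on small relations. Below is a hedged Python sketch: `apply_rel` is the distributive function induced by a relation, `compose_rel` is relational composition, and the loop verifies the lemma on singleton inputs for one illustrative pair of relations.

```python
def apply_rel(r, d):
    """The distributive function induced by relation r: union of the
    images of the singletons in d."""
    return {y for x in d for (x2, y) in r if x2 == x}

def compose_rel(f0, g0):
    """f0;g0 = {(x,y) | exists x'. (x,x') in f0 and (x',y) in g0}."""
    return {(x, y) for (x, m) in f0 for (m2, y) in g0 if m2 == m}

# Check the lemma on singletons for one small example (A = {a,b,c}):
A = {'a', 'b', 'c'}
f0 = {('a', 'b'), ('b', 'b')}
g0 = {('b', 'c'), ('b', 'a')}
for x in A:
    assert apply_rel(g0, apply_rel(f0, {x})) == \
           apply_rel(compose_rel(f0, g0), {x})
```

Since both sides are distributive, agreement on singletons extends to all inputs, which is exactly why the relational encoding is sound.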
Thus, the complete re-evaluation of all the right-hand sides incurs an effort of O(n*a^3). Every variable can be increased at most a^2 times, so at most n*a^2 rounds are necessary, resulting in an overall upper bound of O(n^2 * a^5). We notice that this bound, while still rather large, is polynomial in n and a, thus demonstrating that in the IFDS situation interprocedural analysis is possible in polynomial time.

IFDS in cubic time
- - - - - - - - - -

We will now show that by applying a series of rather generic optimizations to the naive fixpoint iteration we can bring the asymptotic runtime of IFDS interprocedural analysis down to O(n * a^3). In the original paper on IFDS the same bound is achieved using a graph-theoretic encoding of relations, which results in a rather complicated description (in the author's opinion). The reader is invited to look up the original IFDS algorithm, and also more recent formulations in current lecture notes, and to compare them with the following more declarative presentation.

Recall that we have to solve an equation system of n equations of the form X = RHS, where X is a variable ranging over binary relations over A and where the constant-size right-hand side RHS is built up from other variables and fixed relations by relational composition and union.

Now, we specialize the arguments for each relational variable, i.e. we replace each relational variable X with a^2 many boolean variables X(x,y) representing the truth value of x X y for each pair x,y:A. Expanding the relational compositions into boolean expressions of size O(a) then yields an equation system with n*a^2 boolean variables and right-hand sides of size O(a), which are boolean expressions in these variables and in fixed boolean values that can be looked up from the value tables of the fixed, builtin relations. For example, an equation X = Y;Z + W with A = {u,v,w} becomes

X(u,v) = (Y(u,u) & Z(u,v)  v  Y(u,v) & Z(v,v)  v  Y(u,w) & Z(w,v))  v  W(u,v)
X(u,u) = ...
X(u,w) = ...
...
X(w,w) = ...
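The shape of these specialized boolean right-hand sides is easy to express in Python. The sketch below evaluates the right-hand side of X(x,y) for the equation X = Y;Z + W, with the relational variables given as truth tables indexed by pairs; the tables shown are illustrative.

```python
A = ['u', 'v', 'w']

def rhs(Y, Z, W, x, y):
    """Boolean right-hand side of X(x,y) for X = Y;Z + W: a
    disjunction over the middle element, plus the union with W."""
    return any(Y[(x, m)] and Z[(m, y)] for m in A) or W[(x, y)]

# Illustrative truth tables: Y relates u to w, Z relates w to v.
Y = {(x, y): (x, y) == ('u', 'w') for x in A for y in A}
Z = {(x, y): (x, y) == ('w', 'v') for x in A for y in A}
W = {(x, y): False for x in A for y in A}
```

Here `rhs(Y, Z, W, 'u', 'v')` is true via the middle element w, matching the third disjunct Y(u,w) & Z(w,v) in the expansion above, while all other entries of X stay false.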
By introducing O(a) further boolean variables for intermediate results we can transform this into a system with O(n*a^3) many boolean variables and constant-size right-hand sides. Applying this further transformation to the first equation above results in

X(u,v)  = H1(u,v) v W(u,v)
H1(u,v) = Y(u,u) & Z(u,v) v H2(u,v)
H2(u,v) = Y(u,v) & Z(v,v) v H3(u,v)
H3(u,v) = Y(u,w) & Z(w,v)

Now, we solve this system using fixpoint iteration with the following simple optimization: whenever the value of a variable changes, we only re-evaluate those right-hand sides in which this variable actually occurs. Suppose that the i-th boolean variable occurs in M_i many right-hand sides. Since the right-hand sides have constant size, we have

sum_i M_i = O(n*a^3)

Being boolean, each one of the O(n*a^3) variables can go up at most once during the entire iteration process, and its doing so results in M_i right-hand sides having to be re-evaluated, i.e. in a work effort of O(M_i). The total effort for the iteration is therefore O(sum_i M_i) = O(n*a^3), as desired.
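The optimized iteration can be sketched as a small worklist solver for monotone boolean equation systems. This is a generic illustration, not the IFDS algorithm itself: equations map each variable to a list of clauses (each clause a conjunction of variables, the empty clause being the constant true), and a variable's right-hand side is only re-evaluated when a variable occurring in it flips to true.

```python
from collections import defaultdict

def solve(eqs):
    """Least solution of a boolean equation system eqs: var -> list of
    clauses, each clause a list of vars ([] = constant true)."""
    val = {v: False for v in eqs}
    # occurs[y] = variables whose right-hand side mentions y
    occurs = defaultdict(list)
    for x, clauses in eqs.items():
        for clause in clauses:
            for y in clause:
                occurs[y].append(x)

    def eval_rhs(x):
        return any(all(val[y] for y in clause) for clause in eqs[x])

    # Initial round; afterwards only affected right-hand sides are redone.
    work = [x for x in eqs if eval_rhs(x)]
    while work:
        x = work.pop()
        if not val[x]:
            val[x] = True          # a boolean variable flips at most once
            work.extend(z for z in occurs[x]
                        if not val[z] and eval_rhs(z))
    return val

# The (u,v) slice of the example above, with Y, Z true and W false:
eqs = {'X': [['H1'], ['W']], 'H1': [['Y', 'Z']],
       'Y': [[]], 'Z': [[]], 'W': []}
sol = solve(eqs)
```

Each variable flips to true at most once, and each flip triggers only the M_i dependent re-evaluations, which is precisely the counting argument behind the O(n*a^3) bound.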