Previous Up Next

3  Advanced Programming Techniques

This section describes advanced programming techniques, in particular for improving efficiency and circumventing language limitations. The first subsection explains how to control the memory usage of CrocoPat. The second and third subsection describe how relational expressions are evaluated in CrocoPat, and how to assess and improve the efficiency of their evaluation. The fourth subsection explains why the universe is immutable during the execution of an RML program and how to work around this limitation.

3.1  Controlling the Memory Usage

CrocoPat represents relations using the data structure binary decision diagram (BDD, [Bry86]). When CrocoPat is started, it reserves a fixed amount of memory for BDDs, which is not changed during the execution of the RML program. If the available memory is insufficient, CrocoPat exits with the error message
Error: BDD package out of memory.
The BDD memory can be controlled with the command line option -m, followed by an integer number giving the amount of memory in MByte. The default value is 50. The actual amount of memory reserved for BDDs is not infinitely variable, so the specified value is only a rough upper bound of the amount of memory used.

It can also be beneficial to reserve less memory, because the time used for allocating memory increases with the amount of memory. When the manipulated relations are small or the algorithms are computationally inexpensive, memory allocation can dominate the overall runtime.

3.2  Speeding up the Evaluation of Relational Expressions

This subsection explains how CrocoPat evaluates relational expressions. Based on this information, hints for performance improvement are given. Understanding the subsection requires basic knowledge about BDDs and the impacts of the variable order on the size of BDDs. An introduction to BDDs is beyond the scope of this manual, we refer the reader to [Bry92].

The attributes in an RML program are called user attributes in the following. For example, the expression R(x,y) contains the user attributes x and y. For the internal representation of relations, CrocoPat uses a sequence of internal attributes, which are distinct from the user attributes. We call these internal attributes i1, i2, i3, ... For example, the binary relation R is internally represented as a set of assignments to the internal attributes i1 and i2.

When the expression R(x,y) is evaluated, the internal attributes i1 and i2 are renamed to the user attributes x and y. Therefore all BDD nodes of the representation of R have to be traversed. Thus, the time for evaluating the expression R(x,y) is at least linear in the number of BDD nodes of R's representation.

The order of the internal attributes in the BDD is always i1, i2, ... The order of the user attributes in the BDD may be different in the evaluation of each statement, because the scope of user attributes is restricted to one statement. The order of the user attributes in the BDD in the evaluation of a statement is the order in which CrocoPat encounters the user attributes in the execution of the statement. In the example statement
R(x,z) := EX(y, R(x,y) & R(y,z));
CrocoPat evaluates first R(x,y), then R(y,z), then the conjunction, then the existential quantification, and finally the assignment. Therefore, the order of the user attributes in the BDD is x, y, z.

Avoid renaming large relations. The time for the evaluation of the expression R(x,y) is at least linear in the number of BDD nodes in the representation of R, because all BDD nodes have to be renamed from internal attributes to user attributes. Usually this effort for renaming does not dominate the overall runtime, but in the following we give an example where it does.

Let R(x,y) be a directed graph with n nodes. Let the BDD representation of R have Θ(n2) BDD nodes (which is the worst case). The assignment
Outneighbor(y) := EX(x, R(x,y) & x="node1");
assigns the outneighbors of the graph node node1 to the set Outneighbor. The evaluation of R(x,y) costs Θ(n2) time in this example, because of the renaming of all nodes. The “real computation”, namely the conjunction and the existential quantification, can be done in O(logn) time. So the renaming dominates the overall time.

The equivalent statement
Outneighbor(y) := R("node1",y);
is executed in only O(n) time, because the set R("node1",y) has O(n) elements, and its BDD representation has O(n) nodes.

Avoid swapping attributes. Renaming the nodes of a BDD costs at least linear time, but can be much more expensive when attributes have to be swapped. In the statement
S(x,y,z) := R(y,z) & R(x,y);
the BDD attribute order on the right hand side of the assignment is y, z, x, while the BDD attribute order on the left hand side is x, y, z. Because the two orders are different, attributes have to be swapped to execute the assignment. This can be easily avoided by using the equivalent statement
S(x,y,z) := R(x,y) & R(y,z);
Of course, swapping attributes can not always be avoided. However, developers of RML programs should know that swapping attributes can be expensive, and should minimize it when performance is critical.

Ensure good attribute orders. A detailed discussion of BDD attribute orders is beyond the scope of this manual (see e.g. [Bry92, Section 1.3] for details), but the basic rule is that related attributes should be grouped together. In the two assignment statements
S1(v,w,x,y) := R(v,w) & R(x,y);
S2(v,x,w,y) := R(v,w) & R(x,y);
the attributes v and w are related, and the attributes x and y are related, while v and w are unrelated to x and y. In S1, related attributes are grouped together, but not in S2. For many relations R, the BDD representation of S1 will be drastically smaller than the BDD representation of S2.

Profile. Information about the number of BDD nodes and the BDD attribute order of an expression can be printed with PRINT RELINFO. For example,
PRINT RELINFO(R(y,z) & R(x,y));
may output
Number of tuples in the relation: 461705
Number of values (universe): 6218
Number of BDD nodes: 246986
Percentage of free nodes in BDD package: 1614430 / 1966102 = 82 %
Attribute order: y z x
The first line gives the cardinality of the relation, the second line the cardinality of the universe, the third line the size of the BDD that represents the result of the expression, and the fifth line the attribute order in this BDD.

3.3  Estimating the Evaluation Time of Relational Expressions

Knowledge of the computational complexity of RML's operators is useful to optimize the performance of RML programs. This subsection gives theoretical complexity results, but also discusses the limits of their practical application.

Table 1 shows the asymptotic worst case time complexity for the evaluation of RML's relational operators. The times do not include the renaming of internal attributes discussed in the previous subsection, and the evaluation of subexpressions. It is assumed that the caches of the BDD package are sufficiently large. This assumption is closely approximated in practice when the manipulated BDDs only occupy a small fraction of the available nodes in the BDD package. Otherwise, performance may be improved by increasing the BDD memory (see Subsection 3.1).

When the operands of an expression are relations, the computation time is given as function of the sizes of their BDD representation. (The only exception are the transitive closure operators, where a function of the size of the universe gives a more useful bound.) This raises the problem of how to estimate these BDD sizes. Many practical relations have regularities that enable an (often dramatically) compressed BDD representation, but the analytical derivation of the typical compression rate for relations from a particular application domain is generally difficult. Our advice is to choose some representative examples and measure the BDD sizes with the PRINT RELINFO statement.

It is important to note that Table 1 gives worst-case computation times. In many cases, the typical practical performance is much better than the worst case. For example, the relational comparison operators (<=, <, >=, >) and the binary logic operators (&, |, ->, <->) are very common in RML programs. Their worst-case complexity is the product of the sizes of their operand BDDs, which is alarmingly high. However, in practice the performance is often much closer to the sum of the operand BDD sizes. Similarly, the quantification operators are often efficient despite their prohibitive worst case runtime (which is difficult to derive because quantification is implemented as a series of several bit-level operations).

Another practically important example for the gap between average-case and worst-case runtime are the transitive closure operators. The worst case complexity of their BDD-based implementation is the same as for implementations with conventional data structures. However, the BDD-based implementations are much more efficient for many practical graphs [BNL03]. Even in the comparison of different BDD-based implementations, a better worst-case complexity does not imply a better performance in practice. We conclude from our experience that knowledge of the theoretical complexity complements but cannot replace experimentation in the development of highly optimized RML programs.


! re      O(bddsize(re))
re1 & re2,  re1 | re2      O(bddsize(re1) ⋅ bddsize(re2))
re1 -> re2,  re1 <-> re2      O(bddsize(re1) ⋅ bddsize(re2))
EX(x, re),  FA(x, re)      O(bddsize(re)2)
TC(re)      O(n3)
TCFAST(re)      O(n3 logn)
FALSE(x1, x2, ...)      O(1)
TRUE(x1, x2, ...)      O(1)
@s(x)      O(n logn)
~(x1, x2)      O(n logn) (~ can be =, !=, <=, <, >=, >)
~(ne1, ne2)      O(1) (~ can be =, !=, <=, <, >=, >)
re1 = re2,  re1 != re2      O(1)
re1 < re2,  re1 <= re2      O(bddsize(re1) ⋅ bddsize(re2))
re1 > re2,  re1 >= re2      O(bddsize(re1) ⋅ bddsize(re2))


Table 1: Worst case time complexity of the evaluation of relational expressions. re, re1, re2 are relational expressions, x, x1, x2 are attributes, and ne1, ne2 are numerical expressions. bddsize(re) is the number of BDD nodes of the result of the expression re, and n is the cardinality of the universe.



3.4  Extending the Universe

The set of all strings that may be tuple elements of relations in an RML program is called the universe. The universe contains all tuple elements of the input relations (from the input RSF data), and all string literals that appear on the left hand side of assignments in the RML program. The universe is immutable in the sense that it can be determined before the interpretation of the RML program starts, and is not changed during the interpretation of the RML program.

Sometimes the immutability of the universe is inconvenient for the developer of RML programs. Consider, for example, a program that takes as input the nodes and arcs of a graph, and computes the binary relation OutneighborCnt which contains for each node the number of outneighbors:
FOR n IN Node(x) {
    OutneighborCnt(n, #(Arc(n,x))) := TRUE();
}
This is not a syntactically correct RML program, because #(Arc(n,x)) is not a string, but a number. However, RML has an operator STRING that converts a number into a string. But
OutneighborCnt(n, STRING( #(Arc(n,x)) )) := TRUE();
is still not syntactically correct, because such a conversion is not allowed at the left hand side of assignment statements. The reason is that the string that results from such a conversion is generally not known before the execution of the RML program, can therefore not be added to the universe before the execution, and is thus not allowed as tuple element of a relation.

The immutability of the universe during the execution of an RML program is necessary because constant relations like TRUE(x) (the universe) and =(x,y) (string equality for all strings in the universe) are only defined for a given universe. Also, the complement of a relation depends on the universe: The complement of a set contains all strings of the universe that are not in the given set, and thus clearly changes when the universe changes.

However, there is a way to work around this limitation: Writing an RSF file, and restarting CrocoPat with this RSF file as input, which adds all tuple elements in the RSF file to the universe. For example, the above incorrect program can be replaced by the following correct program:
FOR n IN Node(x) {
    PRINT "OutneighborCnt ", n, " ", #(Arc(n,x)) TO "OutneighborCnt.rsf";
}
When CrocoPat is restarted with the resulting RSF file OutneighborCnt.rsf as input, the binary relation OutneighborCnt is available for further processing.


Previous Up Next