Software Verification: Testing vs. Model Checking

A Comparative Evaluation of the State of the Art

— Supplemental Material —

Dirk Beyer and Thomas Lemberger

Download Article

Abstract

In practice, software testing has been the established method for finding bugs in programs for a long time. But in the last 15 years, software model checking has received a lot of attention, and many successful tools for software model checking exist today. We believe it is time for a careful comparative evaluation of automatic software testing against automatic software model checking. We chose six existing tools for automatic test-case generation, namely AFL-fuzz, CPATiger, Crest-ppc, FShell, Klee, and PRtest, and four tools for software model checking, namely Cbmc, CPA-Seq, Esbmc-incr, and Esbmc-kInd, for the task of finding specification violations in a large benchmark suite consisting of 5 693 C programs. In order to perform such an evaluation, we have implemented a framework for test-based falsification (TBF) that executes and validates test cases produced by test-case generation tools in order to find errors in programs. The conclusion of our experiments is that software model checkers can (i) find a substantially larger number of bugs, (ii) in less time, and (iii) require less adjustment to the input programs.

Supplementary Data

We performed an extensive experimental evaluation to compare test generation to model checking with respect to bug finding. We used our framework TBF with the test-case generators AFL-fuzz, CPATiger, CREST-ppc, FShell, KLEE, and PRTest (part of TBF). We used the software model checkers CBMC, ESBMC-kInd, ESBMC-incr, and CPA-seq for comparison. The 10 tools were used to find specification violations in programs from the sv-benchmarks collection (we used version 423cf8cb).

A table listing all experimental results can be found here (CSV). Please note that the table is big and may slow down your browser.

Tool
AFL-fuzz HTML Table CSV Table Log Files Harnesses
CPATiger HTML Table CSV Table Log Files Harnesses
CREST-ppc HTML Table CSV Table Log Files Harnesses
FShell HTML Table CSV Table Log Files Harnesses
KLEE HTML Table CSV Table Log Files Harnesses
PRTest HTML Table CSV Table Log Files Harnesses
CBMC HTML Table CSV Table Log Files Witnesses
CPA-seq HTML Table CSV Table Log Files Witnesses
ESBMC-incr HTML Table CSV Table Log Files Witnesses
ESBMC-kInd HTML Table CSV Table Log Files Witnesses

We also performed a witness validation of the model-checking results.

Model Checker (validated results)
CBMC HTML Table CSV Table Log Files
CPA-seq HTML Table CSV Table Log Files
ESBMC-incr HTML Table CSV Table Log Files
ESBMC-kInd HTML Table CSV Table Log Files

To show the validity of TBF, we compared our own harness-generation and execution technique to KLEE-replay, a test executor that is bundled with KLEE. An illustrative sketch of such a test harness is shown below the table.

KLEE-replay HTML Table CSV Table Log Files Harnesses
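
The following sketch illustrates the idea behind such a harness; it is a simplified, hand-written example and not the code that TBF actually generates. A harness provides definitions for the error function and the input functions of the program under test, so that the program can be compiled together with the harness and executed with the input values of one test case; the marker output and exit code used here are purely hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    /* Reaching this function means that the test case exposes the bug. */
    void __VERIFIER_error(void) {
        fprintf(stderr, "__VERIFIER_error called\n");  /* marker checked after execution */
        exit(1);                                       /* hypothetical exit code */
    }

    /* Replay the input values recorded in one test case. */
    static const int test_vector[] = { 0, 5, -3 };     /* hypothetical test values */
    static unsigned int next_input = 0;

    int __VERIFIER_nondet_int(void) {
        if (next_input < sizeof(test_vector) / sizeof(test_vector[0]))
            return test_vector[next_input++];
        return 0;  /* default value once the recorded inputs are exhausted */
    }

Compiling such a harness together with the (pre-processed) program under test yields an executable whose run either reaches __VERIFIER_error() or terminates normally.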

Since we used a modified version of CREST, we performed experiments on a subset of tasks to make sure that the new search heuristics actually outperform the original CREST. The experiments show that the new heuristic 'hybrid' performs best, so we used it in our work.

A table listing all results of these experiments can be found here (CSV). The log files can be found here.

CREST-ppc with heuristic …
CFG HTML Table CSV Table
DFS HTML Table CSV Table
Hybrid HTML Table CSV Table
PPC HTML Table CSV Table
PPC-cached HTML Table CSV Table

Benchmark Set

For our experiments, we used all tasks with a reach-safety property from the SV-COMP benchmark set. The benchmark set is categorized according to the features of the verification tasks; the table below gives a quick overview of the categories used. The category LDV is special: its verification tasks are created from industrial-scale software, namely Linux kernel modules, by the Linux Driver Verification team.
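
For illustration, a reach-safety task has roughly the following shape (this is a hand-written example, not a program from the benchmark set): the property is violated if and only if a call to __VERIFIER_error() is reachable, and __VERIFIER_nondet_int() models an arbitrary input value.

    extern int __VERIFIER_nondet_int(void);  /* arbitrary input value */
    extern void __VERIFIER_error(void);      /* reaching this call violates the property */

    int main(void) {
        int x = __VERIFIER_nondet_int();
        if (x > 10 && x < 20) {
            __VERIFIER_error();  /* reachable, e.g., for x = 15 */
        }
        return 0;
    }

A test-case generator tries to find concrete input values that drive the execution into the call to __VERIFIER_error(), while a model checker tries to prove or refute its reachability without executing the program.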

Generated Tests

The following table gives an overview of the number of generated tests and the number of input values per test.

Examples

Pre-processed Programs


Executable Test Harnesses


Violation Witness


Reproducing Results

If you encounter any problems with the instructions below, please contact Thomas Lemberger for support.

Replication Package & Virtual Machine

We provide a replication package for reproducing our experimental results.

The easiest way to use it is with our virtual machine.

Prerequisites

We assume that your machine meets the following or similar requirements:

We also assume that you use bash as your command-line shell.

In the following, we will refer to the directory test-study of the extracted replication package as $BASE.

We use BenchExec for our experiments, which allows isolated, reproducible execution of benchmarks. It is already installed on the virtual machine provided. If you don't have BenchExec installed, our replication package contains a Debian package for easy installation. From $BASE, run the following on the command line:

./setup.sh

After installation, you have to manually add your user to the group benchexec:
sudo gpasswd -a $USER benchexec

For more information and support, see the BenchExec documentation.

Benchmark Execution

  1. To execute our experiments with TBF, use the benchmark definition iuv_benchmark_ex.xml (iuv was the working name of TBF). It contains one run definition per test-case generator (plus klee-replay). To execute all run definitions, run the following from the directory of the extracted replication package:

    cd $BASE/tbf
    PYTHONPATH=./lib/ benchexec iuv_benchmark_ex.xml

    All results will be in $BASE/tbf/results.

  2. To execute our experiments with the model checkers, run:
    1. For CBMC:
      cd $BASE/cbmc
      benchexec --container cbmc.xml

      The results for cbmc will be in $BASE/cbmc/results

    2. For CPA-Seq:
      cd $BASE/cpa-seq
      benchexec --container cpa-seq.xml

      The results for CPA-Seq will be in $BASE/cpa-seq/results

    3. For the two configurations of ESBMC:
      cd $BASE/esbmc
      benchexec --container esbmc-incr.xml
      benchexec --container esbmc-kind.xml

      The results for ESBMC-incr and ESBMC-kInd will be in $BASE/esbmc/results

    4. To visualize the created results, you can use the tool table-generator, which is part of BenchExec.

      We provide a table definition to create one big table with all results:

      cd $BASE
      table-generator -x teststudy_all.xml

      This only works if the directory structure matches the steps specified above.

      table-generator will produce one HTML file and one CSV file. To open the HTML file, run

      firefox teststudy_all.table.html

    5. To validate the witnesses produced by the model checkers, run:

      1. To validate witnesses with CPAchecker:

        cd $BASE/cpa-seq
        benchexec --container cpa-seq-validate-cbmc.xml
        benchexec --container cpa-seq-validate-cpa-seq.xml
        benchexec --container cpa-seq-validate-esbmc-incr.xml
        benchexec --container cpa-seq-validate-esbmc-kind.xml

      2. To validate witnesses with CPA-witness2test:

        cd $BASE/cpa-seq
        benchexec --container cpa-w2t-validate-cbmc.xml
        benchexec --container cpa-w2t-validate-cpa-seq.xml
        benchexec --container cpa-w2t-validate-esbmc-incr.xml
        benchexec --container cpa-w2t-validate-esbmc-kind.xml

      3. To validate witnesses with FShell-witness2test:

        cd $BASE/cprover-sv-comp/witness2test
        benchexec --container fshell-w2t-validate-cbmc.xml
        benchexec --container fshell-w2t-validate-cpa-seq.xml
        benchexec --container fshell-w2t-validate-esbmc-incr.xml
        benchexec --container fshell-w2t-validate-esbmc-kind.xml

      4. To validate witnesses with Ultimate Automizer:

        cd $BASE/uautomizer
        benchexec --container uautomizer-validate-cbmc.xml
        benchexec --container uautomizer-validate-cpa-seq.xml
        benchexec --container uautomizer-validate-esbmc-incr.xml
        benchexec --container uautomizer-validate-esbmc-kind.xml

      You will find the results of the validation runs in the corresponding results directories.

    Finished!

    After executing all these steps, you should have reproduced the data provided above. Feel free to play around with the tools some more. If you have any questions regarding our work, the virtual machine, or the results, you can write to us at thomas.lemberger@sosy.ifi.lmu.de. And if you are curious about our other publications, visit sosy-lab.org.