CoVeriTest: Cooperative, Verifier-Based Testing

Dirk Beyer, Marie-Christine Jakobs

Author generated PDF

Abstract

Testing is a widely used method to assess software quality. Coverage criteria and coverage measurements are used to ensure that the constructed test suites adequately test the given software. Since manually developing such test suites is too expensive in practice, various automatic test-generation approaches have been proposed. Since the approaches have different strengths, combinations are necessary to obtain stronger tools. We study cooperative combinations of verification approaches for test generation, with high-level information exchange.
We present CoVeriTest, a hybrid approach for test-case generation, which iteratively applies different conditional model checkers. It allows one to adjust the level of cooperation and to assign individual time budgets per verifier. In our experiments, we combine explicit-state model checking and predicate abstraction (from CPAchecker) to systematically study different CoVeriTest configurations. Moreover, CoVeriTest achieves higher coverage than state-of-the-art test-generation tools for some programs.

Results

We performed three types of experiments:

  1. The first type of experiments systematically studies different CoVeriTest configurations.
  2. The second type of experiments compares the best CoVeriTest configuration against its components, more concretely, against each of its component analyses run alone and against the parallel execution of all component analyses.
  3. The third type of experiments compares CoVeriTest against existing test-case generation approaches.

Comparing CoVeriTest Configurations

CoVeriTest is implemented as part of CPAchecker. Currently, only CPAchecker's value and predicate analysis support all kinds of reuse offered by CoVeriTest. Thus, we decided to fix the analysis components used for the comparison of CoVeriTest configurations. Investigating how CoVeriTest performs when using different component analyses is left for future work.

Next to the analysis components, which we fixed, CoVeriTest allows one to configure the time limit of each component (per iteration) and how information is reused between analysis executions of the same CoVeriTest run. We used six different time limit pairs to investigate whether CoVeriTest should switch often, sometimes, or rarely between analyses, and whether it should distribute the time equally among the component analyses or not. Furthermore, we used nine reuse types, ranging from no exchange, over reusing only one's own results, to full cooperation. For the details on the different reuse types, we refer to our paper.

For the comparison of these 54 configurations, we used all 6703 C programs from the largest publicly available benchmark collection of C verification tasks. Since all configurations are run with the same tool, CPAchecker, we decided to disable test-case generation (thus reducing the amount of data) and only compare the number of covered test goals reported by CPAchecker.

The following table presents the experimental data of our study of the different CoVeriTest configurations. For each combination of reuse type and time limit pair, the table provides the raw data (compressed XML file). Additionally, the table contains one HTML and one CSV file for each reuse type. These files show for each reuse type and test task (program) the experimental data for the different time limits, i.e., for the configurations using the particular reuse type with varying time limits. Similarly, the table contains one HTML and one CSV file for each time limit pair. These files provide for each time limit pair and test task (program) the experimental data for the different reuse types. The last row presents the log files (zip-compressed). There exists one log file per configuration and test task. Log files are grouped by the time limit used in the CoVeriTest configuration (following the way the experiments were organized).

Time limit pairs (columns) vs. reuse types (rows):

Reuse Type | 10×10 | 50×50 | 100×100 | 250×250 | 80×20 | 20×80 | All Limits
plain      | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
condv      | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
condp      | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
condv,p    | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
reuse-prec | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
reuse-arg  | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
condv+r    | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
condp+r    | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
condv,p+r  | Result XML | Result XML | Result XML | Result XML | Result XML | Result XML | HTML Table, CSV Table
All Types  | HTML Table, CSV Table | HTML Table, CSV Table | HTML Table, CSV Table | HTML Table, CSV Table | HTML Table, CSV Table | HTML Table, CSV Table |
Log Files  | Log Files | Log Files | Log Files | Log Files | Log Files | Log Files |

The above table lets one compare the coverage results individually per test task (program). To compare the different configurations, we need to aggregate the results over all test tasks. We decided to look at the distribution of the achieved coverage.

Since a program may have a large number of unreachable test goals, we decided not to compare the total coverage, i.e., the number of covered test goals divided by the total number of test goals, but a relative measure (called relative coverage) that compares against the best value achieved. Hence, when comparing a set of configurations, we first compute for each program (test task) the largest number of covered test goals reported among all configurations in the set. To compute the relative coverage of a configuration, we then divide the number of test goals covered by that configuration by this largest number.
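To make the definition concrete, the following Python sketch computes the relative coverage for a small, made-up set of configurations and test tasks; the configuration names and numbers are hypothetical and only illustrate the computation.

    # Number of covered test goals reported per configuration and test task
    # (hypothetical example data, not taken from our experiments).
    covered = {
        "reuse-arg, 20x80": {"task1.c": 42, "task2.c": 10},
        "condv+r, 20x80":   {"task1.c": 40, "task2.c": 12},
    }

    # Largest number of covered goals per task among all configurations in the set.
    tasks = {t for per_task in covered.values() for t in per_task}
    best = {t: max(per_task.get(t, 0) for per_task in covered.values()) for t in tasks}

    # Relative coverage: covered goals of a configuration divided by the best value.
    relative = {
        config: {t: per_task.get(t, 0) / best[t] for t in tasks if best[t] > 0}
        for config, per_task in covered.items()
    }
    print(relative)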

As explained above, there are two parameters of CoVeriTest that we can control in our experiments, namely the time limits and the reuse type. In our paper, we studied which time limits to use for a particular reuse type and, based on this, which CoVeriTest configuration is best. In our paper, we also study which reuse type to use for a particular time limit pair.

Comparison of Time Limits per Reuse Type

The following nine figures show six box plots for each of the nine reuse types. Each box plot shows the distribution of the relative coverage. For the per-task comparison of the coverage, we refer to the HTML and CSV tables above. Looking at the figures, we observe that the first four reuse types perform best with few switches, while the last five reuse types perform best when preferring the predicate analysis. Since the first four reuse types let an analysis forget all information it found in a previous run, switching seems best avoided for them. If information is kept (as for the last five reuse types), the more powerful analysis (the predicate analysis) is preferred.

[Figure: box plots comparing relative coverage of configurations with reuse type plain]
[Figure: box plots comparing relative coverage of configurations with reuse type condv]
[Figure: box plots comparing relative coverage of configurations with reuse type condp]
[Figure: box plots comparing relative coverage of configurations with reuse type condv,p]
[Figure: box plots comparing relative coverage of configurations with reuse type reuse-prec]
[Figure: box plots comparing relative coverage of configurations with reuse type reuse-arg]
[Figure: box plots comparing relative coverage of configurations with reuse type condv+r]
[Figure: box plots comparing relative coverage of configurations with reuse type condp+r]
[Figure: box plots comparing relative coverage of configurations with reuse type condv,p+r]

Comparison of Best Configurations per Reuse Type

Based on the previous results, we chose for each reuse type the best time limit pair and compared the different reuse types. Below, we provide the box-plot comparison as above as well as the CSV and HTML table of this comparison. The result of the comparison is that the best configuration uses reuse type reuse-arg and time limit pair 20×80.

[Figure: box plots comparing relative coverage of the best configuration of each reuse type] HTML Table, CSV Table

Comparison of Reuse Types per Time Limit

The following six figures show nine box plots for each of the six time limit pairs. Each box plot shows the distribution of the relative coverage. For the per-task comparison of the coverage, we refer to the HTML and CSV tables above. Looking at the figures, we observe that for each time limit pair there always exists a configuration that is better than the first four reuse types. Thus, CoVeriTest should be configured to reuse information that an analysis generated in a previous iteration of the same CoVeriTest run. Furthermore, either reuse type reuse-arg or condv performed best.

[Figure: box plots comparing relative coverage of configurations with iteration time limit 10×10]
[Figure: box plots comparing relative coverage of configurations with iteration time limit 50×50]
[Figure: box plots comparing relative coverage of configurations with iteration time limit 100×100]
[Figure: box plots comparing relative coverage of configurations with iteration time limit 250×250]
[Figure: box plots comparing relative coverage of configurations with iteration time limit 80×20]
[Figure: box plots comparing relative coverage of configurations with iteration time limit 20×80]

Comparison of Best Configurations per Time Limit

Based on the previous results, we chose for each time limit pair the best reuse type and compared the different time limit pairs. Below, we provide the box-plot comparison as above as well as the CSV and HTML table of this comparison. The result of the comparison is that the best configuration uses reuse type condv+r and time limit pair 20×80.

[Figure: box plots comparing relative coverage of the best configuration of each time limit pair] HTML Table, CSV Table

Depending on which parameter is fixed first, the best CoVeriTest configuration differs slightly. While the time limit pair is the same, the reuse type differs. Already in our paper, we mentioned that these two configurations are close. The reason is that we use relative coverage instead of total coverage. Since the set of compared configurations differs, the maximal coverage value, and thus the relative coverage value, can be different for some test tasks. A direct comparison of only the two configurations revealed that the configuration (condv+r, 20×80) achieves higher coverage for more tasks, while the configuration (reuse-arg, 20×80) achieves higher coverage in total. Thus, we follow our paper and refer to the latter configuration as the best CoVeriTest configuration.

Comparing CoVeriTest against its Components

As already mentioned, CoVeriTest interleaves two analyses: the predicate analysis and the value analysis. To investigate whether CoVeriTest's interleaving is beneficial, we compare the best CoVeriTest configuration (reuse-arg, 20×80) with the two component analyses and with a parallel execution of both analyses.

For the comparison, we again used all 6703 C programs from the largest publicly available benchmark collection of C verification tasks. Since all configurations are still run with the same tool, CPAchecker, we again disabled test-case generation.

The following table presents the experimental data (XML and log files) of the three non-CoVeriTest test-case generation experiments. The experimental data for the best CoVeriTest configuration is provided in the table for the comparison of CoVeriTest configurations. Additionally, the table contains HTML and CSV tables comparing the best CoVeriTest configuration with one of the other three configurations.

Analysis Type                       | Raw Data   | CoVeriTest vs. Analysis
Parallel Analysis (Value+Predicate) | Result XML | HTML Table, CSV Table
Predicate Analysis                  | Result XML | HTML Table, CSV Table
Value Analysis                      | Result XML | HTML Table, CSV Table
Log Files                           |            |

The above table lets one compare the coverage results individually per test task (program). To compare the different configurations, we need to aggregate the results over all test tasks. We decided to use scatter plots that contain one point per test task. Each point (x, y) represents the coverage x achieved by CoVeriTest and the coverage y achieved by the other analysis for the same task.
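As an illustration, the following Python sketch produces such a scatter plot with matplotlib; the file names and column labels are hypothetical and only indicate the shape of the data, not the actual layout of the CSV tables above.

    import csv
    import matplotlib.pyplot as plt

    def read_coverage(path, column="coverage"):
        """Read a per-task coverage column from a CSV table (hypothetical layout)."""
        coverage = {}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                coverage[row["task"]] = float(row[column])
        return coverage

    # Hypothetical file names; the real data are linked in the table above.
    coveritest = read_coverage("coveritest.csv")
    other = read_coverage("predicate-analysis.csv")

    tasks = sorted(coveritest.keys() & other.keys())
    xs = [coveritest[t] for t in tasks]   # coverage achieved by CoVeriTest
    ys = [other[t] for t in tasks]        # coverage achieved by the other analysis

    plt.scatter(xs, ys, s=5)
    m = max(xs + ys)
    plt.plot([0, m], [0, m])              # diagonal: points below it favor CoVeriTest
    plt.xlabel("coverage achieved by CoVeriTest")
    plt.ylabel("coverage achieved by the other analysis")
    plt.savefig("scatter.png")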

Looking at the scatter plots, we observe that CoVeriTest outperforms the value analysis. In the other two cases, the picture is less clear: CoVeriTest is beneficial for certain classes of programs, but not always best.

[Figure: scatter plot comparing coverage achieved by the best CoVeriTest configuration with coverage achieved when running CoVeriTest's analyses in parallel]
[Figure: scatter plot comparing coverage achieved by the best CoVeriTest configuration with coverage achieved by the predicate analysis]
[Figure: scatter plot comparing coverage achieved by the best CoVeriTest configuration with coverage achieved by the value analysis]

Comparing CoVeriTest against Test-Comp Participants

To study when and whether CoVeriTest improves over state-of-the-art test-case generation tools, we compare the branch coverage achieved by CoVeriTest's best configuration (reuse-arg,20×80) with the branch coverage achieved by the participants of the International Competition on Software Testing (Test-Comp'19). In our paper, we only compared against the best two tools.

For our comparison, we used all 1720 programs used by Test-Comp'19 in the category Code Coverage. Furthermore, we could no longer use the reported number of covered goals for the comparison, but had to (1) generate test cases with CoVeriTest and (2) let the Test-Comp validator measure the branch coverage. In the following, we show six scatter plots. Each scatter plot compares the branch coverage achieved by CoVeriTest against the branch coverage achieved by one Test-Comp participant. We ordered the scatter plots by the success of the tools in Test-Comp'19 (best first). Note that we did not include the two ESBMC participants because they do not output a single test case in the branch-coverage category.

Looking at the scatter plots, we observe that CoVeriTest neither dominates nor is dominated by any of the approaches. Thus, it complements existing approaches.

[Figure: scatter plot comparing CoVeriTest and VeriFuzz]
[Figure: scatter plot comparing CoVeriTest and Klee]
[Figure: scatter plot comparing CoVeriTest and CPA-Tiger-MGP]
[Figure: scatter plot comparing CoVeriTest and Symbiotic]
[Figure: scatter plot comparing CoVeriTest and FairFuzz]
[Figure: scatter plot comparing CoVeriTest and PRTest]

The following table provides the raw data for the above comparison. The generated tests are available in our artifact. Furthermore, note that we did not produce the raw data for Test-Comp'19 ourselves, but link to the raw data of Test-Comp'19, which we used in the comparison.

Analysis Type        | Raw Data
Test-case generation | Result XML, Log Files
Test-case validation | Result XML, Log Files
Test-Comp'19 results | All Raw

Reproducing Results

Prerequisites

We assume that your machine meets the following or similar requirements.

Environmental Setup

  1. Download our replication artifact (~3.2 GB compressed, ~11.5 GB uncompressed) and unzip it.
    After unzipping, your folder structure should look like this (BASEDIR being the base directory of the artifact).

      BASEDIR
      ├── benchmark_defs
      ├── benchmarks
      │   ├── sv-benchmarks
      │   └── sv-benchmarks-testcomp19
      ├── CoVeriTest.pdf
      ├── CPAchecker
      ├── debs
      ├── experiments
      │   ├── data
      │   ├── figures
      │   └── table
      ├── README.md
      └── validation
          ├── coverage-val.sh
          ├── tbf-test-suite-validator
          ├── tool-info
          ├── unzip_testcases.py
          └── zip_testcases.py

    Install the required software dependencies from the folder debs and set up BenchExec for <USER> using the following commands.

    cd $BASEDIR/debs
    sudo dpkg -i *.deb
    sudo adduser <USER> benchexec
    reboot
    		

    Tutorial Example on CoVeriTest

    Before we explain how to repeat our results, we show on the example program test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c how to run the different CoVeriTest configurations. We assume that all run commands are executed from $BASEDIR/CPAchecker. For our tutorial, we structured all run commands in the same way. First, each command sets up the tool environment and the total time limit of the run via scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats. The parameter -spec tells CoVeriTest which functions cause program termination. The next two parameters differ between the configurations and define the reuse type as well as the time limit. Finally, the program is given.

    In the following, we list for each of the nine reuse types the command line required to execute the CoVeriTest configuration that uses this reuse type and time limit 10×10 (i.e., each of the two analyses is granted 10 s per iteration). To use a different time limit pair (x,y), one must adapt the command line as follows: replace the first occurrence of _10 by _x and the second one by _y.

    Note that the configurations target programs with a 32-bit architecture. To generate tests for programs with a 64-bit architecture, add the parameter -64. Furthermore, we disabled the output of test cases. To enable test-case generation, add the parameter -setprop testcase.values=%d.test.txt. The test cases are written to $BASEDIR/CPAchecker/output. Each test case contains one test input value per line.

    • plain
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-noreuse \
          -setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::noreuse_10,\
      config/testCaseGeneration-predicateAnalysis.properties::noreuse_10 \
          ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c 
      			
    • condv
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-cmc-v2p \
          -setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
          -setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-cmc-condition.properties::noreuse_10,\
      config/components/testCaseGeneration-predicate-use-cmc-condition.properties::noreuse_10 \
          ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c		
      			
    • condp
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-cmc-p2v \
          -setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
          -setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-use-cmc-condition.properties::noreuse_10,\
      config/components/testCaseGeneration-predicate-generate-cmc-condition.properties::noreuse_10 \
          ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
      			
    • condv,p
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-cmc-bidirectional \
          -setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
          -setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-and-use-cmc-condition.properties::noreuse_10,\
      config/components/testCaseGeneration-predicate-generate-and-use-cmc-condition.properties::noreuse_10 \
          ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
      			
    • reuse-prec
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-precision-reuse \
          -setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::reuse-precision_10,\
      config/testCaseGeneration-predicateAnalysis.properties::reuse-precision_10 \
          ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
      			
    • reuse-arg
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-continue \
          -setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::continue_10,\
      config/testCaseGeneration-predicateAnalysis.properties::continue_10 \
          ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
      			
    • condv+r
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-cmc-v2p \
          -setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
          -setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-cmc-condition.properties::continue_10,\
      config/components/testCaseGeneration-predicate-use-cmc-condition-precision.properties::noreuse_10 \
          ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
      			
    • condp+r
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-cmc-p2v \
          -setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
          -setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-use-cmc-condition-precision.properties::noreuse_10,\
      config/components/testCaseGeneration-predicate-generate-cmc-condition.properties::continue_10 \
             ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
      			
    • condv,p+r
      scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
          -spec config/specification/sv-comp-terminatingfunctions.spc \
          -testCaseGeneration-interleaved-value+predicate-cmc-bidirectional \
          -setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
          -setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-and-use-cmc-condition-precision.properties::noreuse_10,\
      config/components/testCaseGeneration-predicate-generate-and-use-cmc-condition-precision.properties::noreuse_10 \
          ../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
      			

    Repeating Our Experiments

    As already explained above, we performed three types of experiments: an internal comparison of CoVeriTest configurations, a comparison of CoVeriTest against its components, and a comparison of CoVeriTest against Test-Comp participants. In the following, we explain how to repeat the experiments of each of the three types. If your system has swap memory, you need to turn it off.

    sudo swapoff -a
    You can later turn it on again by executing the following command.
    sudo swapon -a

    To replicate our results, you need 8 CPU cores and 15 GB of memory.

    The results of the experiments can be found in the sub-folder results, which is created in the location from which you started the respective benchexec command, i.e., either $BASE/CPAchecker or $BASE/validation.

    1. To compare the CoVeriTest configurations, you need to run the following six experiments, which together execute all 54 CoVeriTest configurations. Each experiment runs nine CoVeriTest configurations. While the time limit pair is fixed within each experiment, each of the nine experimental runs uses one of the nine reuse types offered by CoVeriTest.
      • To run the CoVeriTest configurations with time limit pair 10×10, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-generation_10_10.xml
        		
      • To run the CoVeriTest configurations with time limit pair 50×50, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-generation_50_50.xml
        		
      • To run the CoVeriTest configurations with time limit pair 100×100, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-generation_100_100.xml
        		
      • To run the CoVeriTest configurations with time limit pair 250×250, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-generation_250_250.xml
        		
      • To run the CoVeriTest configurations with time limit pair 80×20, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-generation_80_20.xml
        		
      • To run the CoVeriTest configurations with time limit pair 20×80, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-generation_20_80.xml
        		
    2. To compare the best CoVeriTest configuration with the single analyses used by CoVeriTest as well as with the parallel execution of these analyses, we need to run three analyses: the value analysis, the predicate analysis, and the parallel execution of value and predicate analysis. To regenerate the experimental data for these three analyses, execute the following commands. For CoVeriTest, we reuse the experimental data from the first experiment.

        cd $BASE/CPAchecker
        benchexec --container ../benchmark_defs/test-generation-single+parallel.xml
      		
    3. To compare our best CoVeriTest configuration with the tools participating in the International Competition on Software Testing (Test-Comp'19), we first let the best CoVeriTest configuration (reuse-arg, 20×80) generate test cases for those test tasks of Test-Comp'19 that use the property coverage-branches.prp. Thereafter, we use the Test-Comp validator to measure for each task the branch coverage achieved by CoVeriTest on that task. Since the CoVeriTest version used in the evaluation does not support the test-case format of the validator, we use a wrapper script (coverage-val.sh) that converts the test cases and then calls the validator; a sketch of such a conversion is shown at the end of this section.

      1. To rerun our test-case generation, execute the following commands.
          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-case-generation-compare.xml
        		
      2. To repeat the coverage measurement with our test cases, use the following commands.
          cd $BASE/validation
          export PYTHONPATH=$PYTHONPATH:.
          benchexec --container ../benchmark_defs/test-case-coverage-val.xml
        		
        If you want to use the test cases that you generated with the commands in step one, you need to adjust the paths to these test cases accordingly. To this end, adapt the text ending in test.zip in the option and requiredfiles tags of the benchmark file test-case-coverage-val.xml. Thereafter, you can use the above command.

      We compared the produced coverage values with the results of the test-case generation tools participating in Test-Comp'19. Note that this comparison only makes sense because we used the same resource limits and infrastructure as Test-Comp'19.
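      To illustrate the kind of conversion the wrapper performs, the following Python sketch turns test cases in the plain format described in the tutorial above (one input value per line) into XML test cases following the Test-Comp test-case format. It is only a sketch under these assumptions, not the actual implementation of coverage-val.sh, and the file locations are hypothetical.

        import glob
        import os

        # Header of a Test-Comp test-case file (test-format version 1.0).
        HEADER = (
            '<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n'
            '<!DOCTYPE testcase PUBLIC "+//IDN sosy-lab.org//DTD test-format testcase 1.0//EN" '
            '"https://sosy-lab.org/test-format/testcase-1.0.dtd">\n'
        )

        def convert(txt_file, xml_file):
            """Convert one plain test case (one input value per line) into an XML test case."""
            with open(txt_file) as f:
                values = [line.strip() for line in f if line.strip()]
            with open(xml_file, "w") as f:
                f.write(HEADER)
                f.write("<testcase>\n")
                for value in values:
                    f.write("  <input>%s</input>\n" % value)
                f.write("</testcase>\n")

        # Hypothetical location of the generated test cases.
        for txt in glob.glob("output/*.test.txt"):
            convert(txt, os.path.splitext(txt)[0] + ".xml")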