Cooperative Verifier-Based Testing with CoVeriTest

Dirk Beyer, Marie-Christine Jakobs

Abstract

Testing is a widely applied technique to evaluate software quality, and coverage criteria are often used to assess the adequacy of a generated test suite. However, manually constructing an adequate test suite is typically too expensive, and numerous techniques for automatic test-suite generation have been proposed. Each of them comes with different strengths. To build stronger test-generation tools, different techniques should be combined. In this paper, we study cooperative combinations of verification approaches for test generation, which exchange high-level information.
We present CoVeriTest, a hybrid technique for test-suite generation. CoVeriTest iteratively applies different conditional model checkers and allows users to adjust the level of cooperation and to configure individual time limits for each conditional model checker. In our experiments, we systematically study different CoVeriTest configurations, which either use combinations of explicit-state model checking and predicate abstraction or bounded model checking and symbolic execution. A comparison with state-of-the-art test-generation tools reveals that CoVeriTest achieves higher coverage for some programs.

Results

We performed three types of experiments:

  1. The first type of experiments systematically studies different CoVeriTest configurations.
  2. The second type of experiments compares for each combination of analyses the best CoVeriTest configuration against its component analyses and against the parallel execution of all component analyses.
  3. The third type of experiments compares CoVeriTest against existing test-case generation approaches.

In the following, we provide the raw data, HTML tables, and figures for our experiments. Note that we provide neither the logfiles and CSV files nor the tests generated in the third experiment. Even compressed, this data amounts to several gigabytes and can therefore only be found in our artifact.

1. Comparison of CoVeriTest Configurations

A CoVeriTest configuration consists of

  1. a list of analyses, which are combined in an interleaved fashion,
  2. the configuration of how to reuse information between analysis executions of the same CoVeriTest run, and
  3. the individual time limit granted to each analysis in each iteration.

In our experiments, we consider two combinations of analyses. On the one hand, we combine a value analysis (similar to explicit-state model checking) and a predicate analysis. On the other hand, we combine bounded model checking (BMC) with symbolic execution. Additionally, we used six different time-limit configurations. Four of them are uniform, granting both analyses the same time per iteration, and differ in how often they trigger analysis switches. The other two are non-uniform, each of them preferring one of the two combined analyses. Furthermore, we used up to twelve different reuse types, ranging from no information exchange, over reuse of an analysis's own results only, to full cooperation. For the details on the different reuse types, we refer to our paper. All configurations are realized in the software analysis framework CPAchecker.
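To make the interleaving scheme concrete, the following minimal Python sketch illustrates how CoVeriTest rotates through the configured analyses. It is an illustration only, not the CPAchecker implementation; the analysis interface (the run method, the shared reuse information, and all names) is an assumption made for this sketch.

    # Minimal illustration of CoVeriTest's interleaving scheme (not the CPAchecker code).
    # Each analysis is assumed to expose run(goals, time_limit, reuse_info), which returns
    # the generated tests, the goals it covered, and the information to be reused later.
    def coveritest(analyses, time_limits, goals, total_time):
        reuse_info = {analysis: None for analysis in analyses}  # information kept between rounds
        tests, spent = [], 0
        while goals and spent < total_time:
            for analysis, limit in zip(analyses, time_limits):
                # each analysis gets its individual per-iteration time limit
                new_tests, covered, info = analysis.run(goals, limit, reuse_info)
                tests.extend(new_tests)
                goals -= covered                 # goals is a set of still uncovered test goals
                reuse_info[analysis] = info      # what is stored and shared depends on the reuse type
                spent += limit
                if not goals:
                    break
        return tests

In this sketch, the reuse type would determine which parts of the shared information an analysis stores and which parts it is allowed to read in the next iteration.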

For the comparison of the different CoVeriTest configurations, we used all 7644 programs from the largest publicly available benchmark collection of C verification tasks, and all configurations aimed at covering the branches of the verification tasks. Since all configurations are run by the same tool CPAchecker, we decided to disable test-case generation for the comparison of the different configurations (thus reducing the amount of data) and only compare the number of covered test goals reported by CPAchecker.

In the following, we first present the results of the experiments comparing the configurations of the value and predicate analysis combinations. Thereafter, we come to the results of the experiments studying combinations of BMC and symbolic execution.

CoVeriTest with Value and Predicate Analysis

The following table presents the raw data and the HTML tables for the combinations using value and predicate analysis. CSV files and logfiles are provided with our artifact. The table provides for each combination of reuse type and time limit pair the respective raw data (compressed XML file). Additionally, it contains one HTML table for each reuse type and one for each time limit pair. The HTML tables for the reuse types show for each reuse type and test task (program) the experimental data for the different time limits, i.e., configurations using the particular reuse type with varying time limits. Similarly, the HTML tables for the time limit pairs provide for each time limit pair and test task (program) the experimental data for the different reuse types.

Time Limits
Reuse Type    10×10    50×50    100×100    250×250    80×20    20×80    All Limits
plain Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
reuse-prec Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
reuse-arg Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond-v Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond-p Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond-v+r Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond-p+r Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond+r Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
trans-prec Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
trans-prec+r Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
trans-prec-cond+r Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
All Types HTML Table HTML Table HTML Table HTML Table HTML Table HTML Table

The above table lets one compare the coverage results individually per test task (program). To better compare the different configurations, we also provide aggregated results for the different configurations. More concretely, we show the distribution of the achieved coverage. Instead of looking at total coverage, which is less robust against a large number of unreachable test goals, we decided to consider a relative measure (called relative coverage) that compares against the best value achieved. To compare a set of configurations, we first compute for each program (test task) the largest number of covered test goals reported among all configurations in the set, and then compute the relative coverage for each test task and configuration by dividing the number of covered test goals by this largest number.
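As an illustration of this computation, here is a minimal Python sketch, assuming the covered-goal counts have already been extracted from the result XML files into a nested dictionary (the data layout and all names are hypothetical).

    # Relative coverage as described above: for every test task, divide the number of covered
    # test goals of each configuration by the best (largest) number achieved by any
    # configuration in the compared set. Assumed layout: covered[config][task] -> int.
    def relative_coverage(covered, configs, tasks):
        relative = {config: {} for config in configs}
        for task in tasks:
            best = max(covered[config][task] for config in configs)
            for config in configs:
                # tasks for which no configuration covers any goal get relative coverage 0
                relative[config][task] = covered[config][task] / best if best > 0 else 0.0
        return relative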

The following twelve figures show for each of the twelve reuse types six box plots. Each box plot shows the distribution of the relative coverage per time-limit configuration. Looking at the figures, we observe that for each reuse type the configuration using time limit pair (20,80), i.e., preferring the predicate analysis, performs best.

plain reuse-prec reuse-arg
Box plots comparing relative coverage of value+predicate analysis configurations with reuse type plain Box plots comparing relative coverage of value+predicate analysis configurations with reuse type reuse-prec Box plots comparing relative coverage of value+predicate analysis configurations with reuse type reuse-arg
cond-v cond-p cond
Box plots comparing relative coverage of value+predicate analysis configurations with reuse type cond-v Box plots comparing relative coverage of value+predicate analysis configurations with reuse type cond-p Box plots comparing relative coverage of value+predicate analysis configurations with reuse type cond
cond-v+r cond-p+r cond+r
Box plots comparing relative coverage of value+predicate analysis configurations with reuse type cond-v+r Box plots comparing relative coverage of value+predicate analysis configurations with reuse type cond-p+r Box plots comparing relative coverage of value+predicate analysis configurations with reuse type cond+r
trans-prec trans-prec+r trans-prec-cond+r
Box plots comparing relative coverage of value+predicate analysis configurations with reuse type trans-prec Box plots comparing relative coverage of value+predicate analysis configurations with reuse type trans-prec+r Box plots comparing relative coverage of value+predicate analysis configurations with reuse type trans-prec-cond+r

The following six figures show for each of the six time limit pairs twelve box plots. Each box plot shows the distribution of the relative coverage of one reuse type. Looking at the figures, we observe that for each time limit pair the reuse types reuse-prec, reuse-arg, cond-p+r, and trans-prec+r always perform better than the remaining ones. Hence, the analyses should reuse their own information, and the predicate analysis should not use information from the value analysis.

10×10 50×50
Box plots comparing relative coverage of value+predicate analysis configurations with iteration time limit 10×10 Box plots comparing relative coverage of value+predicate analysis configurations with iteration time limit 50×50
100×100 250×250
Box plots comparing relative coverage of value+predicate analysis configurations with iteration time limit 100×100 Box plots comparing relative coverage of value+predicate analysis configurations with iteration time limit 250×250
80×20 20×80
Box plots comparing relative coverage of value+predicate analysis configurations with iteration time limit 80×20 Box plots comparing relative coverage of value+predicate analysis configurations with iteration time limit 20×80

Next, we want to find out which configuration is best for CoVeriTest combining value and predicate analysis. Based on the previous results, we therefore chose for each reuse type the best time limit pair and compared the different reuse types. Additionally, we chose for each time limit pair the best reuse type and compared the different time limit pairs. We again consider the distribution of the relative coverage for the comparison. The following two figures show the box plots for the comparison. Below each figure, we also add the HTML tables allowing a per-task comparison of the configurations. The raw data is contained in the table above.

Reuse Types Time Limits
Box plots comparing relative coverage of value+predicate analysis configurations per reuse type using best time limit for each reuse type Box plots comparing relative coverage of value+predicate analysis configurations per time limit using best reuse type for each time limit
HTML Table HTML Table

From the figures above, it is only clear that the best CoVeriTest configuration using a combination of value and predicate analysis uses the time limit pair 20×80. However, it is not obvious whether reuse type reuse-arg or cond-p+r is better. Looking at our raw data, we found that the best configuration uses reuse type reuse-arg.

CoVeriTest with BMC and Symbolic Execution

The following table presents the raw data and the HTML tables for the combinations using BMC and symbolic execution. CSV files and logfiles are provided with our artifact. The table provides for each combination of reuse type and time limit pair the respective raw data (compressed XML file). Additionally, the table contains one HTML table for each reuse type and one for each time limit pair. The HTML tables for the reuse types show for each reuse type and test task (program) the experimental data for the different time limits, i.e., configurations using the particular reuse type with varying time limits. Similarly, the HTML tables for the time limit pairs provide for each time limit pair and test task (program) the experimental data for the different reuse types.
Note that the combination of BMC and symbolic execution only supports nine of the twelve reuse types.

Time Limits
Reuse Type    10×10    50×50    100×100    250×250    80×20    20×80    All Limits
plain Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
reuse-prec Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
reuse-arg Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond-b Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond-s Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond-b+r Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond-s+r Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
cond+r Result XML Result XML Result XML Result XML Result XML Result XML HTML Table
All Types HTML Table HTML Table HTML Table HTML Table HTML Table HTML Table

The following nine figures show for each of the nine compatible reuse types six box plots. Each box plot shows the distribution of the relative coverage per time-limit configuration. Looking at the figures, we observe that for each reuse type the configuration using time limit pair (250,250), i.e., a configuration that rarely switches between the analyses, performs well, but not always best.

plain reuse-prec reuse-arg
Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type plain Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type reuse-prec Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type reuse-arg
cond-b cond-s cond
Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type cond-b Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type cond-s Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type cond
cond-b+r cond-s+r cond+r
Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type cond-b+r Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type cond-s+r Box plots comparing relative coverage of BMC+symbolic execution configurations with reuse type cond+r

The following six figures show for each of the six time limit pairs nine box plots. Each box plot shows the distribution of the relative coverage of one reuse type.

10×10 50×50
Box plots comparing relative coverage of BMC+symbolic execution configurations with iteration time limit 10×10 Box plots comparing relative coverage of BMC+symbolic execution configurations with iteration time limit 50×50
100×100 250×250
Box plots comparing relative coverage of BMC+symbolic execution configurations with iteration time limit 100×100 Box plots comparing relative coverage of BMC+symbolic execution configurations with iteration time limit 250×250
80×20 20×80
Box plots comparing relative coverage of BMC+symbolic execution configurations with iteration time limit 80×20 Box plots comparing relative coverage of BMC+symbolic execution configurations with iteration time limit 20×80

Next, we want to find out which configuration is best for CoVeriTest combining BMC and symbolic execution. Based on the previous results, we therefore chose for each reuse type the best time limit pair and compared the different reuse types. Additionally, we chose for each time limit pair the best reuse type and compared the different time limit pairs. We again consider the distribution of the relative coverage for the comparison. The following two figures show the box plots for the comparison. Below each figure, we also add the HTML tables allowing a per-task comparison of the configurations. The raw data is contained in the table above.

Reuse Types Time Limits
Box plots comparing relative coverage of BMC+symbolic execution configurations per reuse type using best time limit for each reuse type Box plots comparing relative coverage of BMC+symbolic execution configurations per time limit using best reuse type for each time limit
HTML Table HTML Table

The result of the comparison is that the best CoVeriTest configuration for a combination of BMC and symbolic execution uses reuse type cond-s+r and time limit pair 80×20.

Best CoVeriTest Configuration

Finally, we want to find out which of our CoVeriTest configurations is best. To this end, we compare the best configurations of our two analysis combinations. For the comparison, we use a scatter plot that contains one point per test task. Each point (x,y) describes the coverage x achieved by the best CoVeriTest configuration using BMC and symbolic execution and the coverage y achieved by the best CoVeriTest configuration using value and predicate analysis. We also provide an HTML table comparing the two configurations on a per-task basis. The raw data of the two configurations is not referenced again here; it can be looked up in the two tables above.
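The following minimal Python sketch shows how such a per-task coverage scatter plot can be produced, assuming the per-task coverage values of the two compared configurations have been exported to a CSV file (the file name and column names are hypothetical, not part of the artifact).

    # Sketch of the per-task coverage scatter plot described above (assumed CSV layout):
    # one row per test task with the coverage of the two compared configurations.
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv("coverage_per_task.csv")   # hypothetical export of the per-task results
    x = data["coverage_bmc_symex"]                # best CoVeriTest configuration with BMC+symbolic execution
    y = data["coverage_value_predicate"]          # best CoVeriTest configuration with value+predicate analysis

    plt.scatter(x, y, s=5, alpha=0.5)
    limit = max(x.max(), y.max())
    plt.plot([0, limit], [0, limit], linewidth=0.8)  # diagonal: both configurations perform equally well
    plt.xlabel("Coverage of best BMC + symbolic execution configuration")
    plt.ylabel("Coverage of best value + predicate analysis configuration")
    plt.savefig("scatter_best_configurations.png")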

Scatter plot comparing coverage of best BMC+symbolic execution configurations with best value+predicate analysis configuration
HTML Table

Looking at the scatter plot, we observe that CoVeriTest with value and predicate analysis often outperforms the combination with BMC and symbolic execution.


2. Comparison of CoVeriTest against Single Analysis and Parallel Configuration

To investigate whether CoVeriTest's interleaving is beneficial, we compare the best CoVeriTest configuration per analysis combination (reuse-arg with 20×80 for value plus predicate analysis and cond-s+r with 80×20 for BMC plus symbolic execution) with the two component analyses and a parallel execution of both analyses. For the comparison, we use the same set of test tasks as in the comparison of the different configurations and again do not generate tests, but only compare the number of covered test goals. As before, we first present the results for the combination of value and predicate analysis.

The following table presents the raw data (compressed XML files) of the four experiments that belong to the combination of value and predicate analysis. Additionally, the table contains an HTML table file for the experiment.

Analysis Type    Data
Parallel Analysis (Value+Predicate) Result XML
Value Analysis Result XML
Predicate Analysis Result XML
CoVeriTest (Value+Predicate) Result XML
All HTML

The above table lets one compare the coverage results individually per test task (program). To compare the different configurations, we need to aggregate the results over all test tasks. We decided to use scatter plots that contain one point per test task. Each point (x,y) describes the coverage x achieved by CoVeriTest and the coverage y achieved by the other analysis. Looking at the scatter plots, we observe that CoVeriTest outperforms the value analysis. In the other two cases, the picture is less clear: CoVeriTest is beneficial for certain program classes, but not always best.

Comparison of value analysis and CoVeriTest Comparison of predicate analysis and CoVeriTest Comparing parallel combination of value and predicate analysis with CoVeriTest
Scatter plot comparing coverage of value analysis with coverage of CoVeriTest configuration using a combination of value and predicate analysis Scatter plot comparing coverage of predicate analysis with coverage of CoVeriTest configuration using a combination of value and predicate analysis Scatter plot comparing coverage of parallel combination of value and predicate analysis with coverage of CoVeriTest configuration using a combination of value and predicate analysis

The following table presents the raw data (compressed XML files) of the four experiments that belong to the combination of BMC and symbolic execution. Additionally, the table contains an HTML table file for the experiment.

Analysis Type    Data
BMC Result XML
Symbolic Execution Result XML
Parallel Analysis (BMC+Symbolic Execution) Result XML
CoVeriTest (BMC+Symbolic Execution) Result XML
All HTML

To better compare the different configurations, we also provide scatter plots for the BMC plus symbolic execution part of the experiment. Again, each point (x,y) in the scatter plot describes the coverage x achieved by CoVeriTest and the coverage y achieved by the other analysis. Looking at the scatter plots, we observe that CoVeriTest often outperforms BMC, symbolic execution, and their parallel combination.

Comparison of BMC and CoVeriTest Comparison of symbolic execution and CoVeriTest Comparing parallel combination of BMC and symbolic execution with CoVeriTest
Scatter plot comparing coverage of BMC with coverage of CoVeriTest configuration using a combination of BMC and symbolic execution Scatter plot comparing coverage of symbolic execution with coverage of CoVeriTest configuration using a combination of BMC and symbolic execution Scatter plot comparing coverage of parallel combination of BMC and symbolic execution with coverage of CoVeriTest configuration using a combination of BMC and symbolic execution

3. Comparison of CoVeriTest against State-of-the-Art Test Generation Tools

To study whether and when CoVeriTest improves over state-of-the-art test-case generation tools, we compare the branch coverage achieved by CoVeriTest's best configuration (reuse-arg, 20×80) with the branch coverage achieved by the two best participants of the International Competition on Software Testing (Test-Comp'19). For our comparison, we used all 1720 programs from the Test-Comp'19 category Code Coverage. Furthermore, we can no longer use the reported number of covered test goals for comparison, but have to (1) generate test cases with CoVeriTest and (2) let the Test-Comp validator measure branch coverage. The following table links to the raw data of our test-case generation and subsequent validation as well as the raw data from Test-Comp'19.

Experiment    Raw Data
CoVeriTest Test Generation Result XML
CoVeriTest Test Validation Result XML
Test-Comp Raw Data

To better compare the actual results, we show two scatter plots. Each scatter plot compares the branch coverage achieved by CoVeriTest against the branch coverage achieved by one Test-Comp participant. Looking at the scatter plots, we observe that CoVeriTest neither dominates nor is dominated by any of the approaches. Thus, it complements existing approaches.

Comparison of CoVeriTest and VeriFuzz Comparison of CoVeriTest and KLEE
Scatter plot comparing coverage of best CoVeriTest configuration with VeriFuzz Scatter plot comparing coverage of best CoVeriTest configuration with KLEE

Reproducing Results

Prerequisites

We assume that your machine meets the following or similar requirements.

Environmental Setup

  1. Download our replication artifact (~4.7 GB compressed, ~9.5 GB uncompressed) and unzip it.
    The artifact's sha256 hash is 1fe3afc4a8b58e78cbcc3cc654d5556e1f0983b7c2c9c34d26a0a2a5249b951d (a command to verify the download is sketched after this list).
    After unzipping, your folder structure should look like this (BASEDIR being the base directory of the artifact).

      BASEDIR
      ├── benchmark_defs
      ├── CPAchecker
      ├── debs
      ├── experiments
      │   ├── data
      │   ├── figures
      │   ├── scripts
      │   ├── tables
      ├── README
      ├── sv-benchmarks
      ├── tbf-testsuite-validator
    			
  2. Install the required software dependencies from the folder debs and set up BenchExec for <USER> (replace <USER> with your user name) using the following commands.

    cd $BASEDIR/debs
    sudo dpkg -i *.deb
    sudo adduser <USER> benchexec
    reboot
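
Before unzipping the archive in step 1, you can check the integrity of the download against the hash given above. A minimal sketch, assuming the downloaded archive is named coveritest-artifact.zip (adapt the name to your actual download):

    # Verify the downloaded artifact against the published sha256 hash;
    # coveritest-artifact.zip is a placeholder for the actual file name of your download.
    echo "1fe3afc4a8b58e78cbcc3cc654d5556e1f0983b7c2c9c34d26a0a2a5249b951d  coveritest-artifact.zip" | sha256sum -c -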
    		

Tutorial Example on CoVeriTest

Before we explain how to repeat our results, we provide a short tutorial explaining how to run the different CoVeriTest configurations used in our experiments. In our tutorial, we execute all CoVeriTest configurations on the example program test_locks_5.c. Furthermore, the tutorial assumes that the run commands are executed from $BASEDIR/CPAchecker.

In the following, we show the run commands to execute the different CoVeriTest configurations on the example program test_locks_5.c. All presented run commands are structured in the same way (elements 1, 2, and 5 are even identical in all commands):

  1. Set up the tool environment and the total time limit of the run via scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats.
  2. The parameter -spec tells CoVeriTest which functions cause program termination. We use -spec config/specification/sv-comp-terminatingfunctions.spc.
  3. A parameter selecting the analysis configuration.
  4. Parameters setting the reuse type and the iteration time limits.
  5. The program for which tests should be generated.

Note that program test_locks_5.c assumes a 32-bit architecture. To generate tests for programs with a 64-bit architecture, add the parameter -64 to the command line. Furthermore, we disabled the output of test cases. To enable test-case generation, add the parameter -setprop testcase.xml=testcase%d.xml. The test cases are then written in the Test-Comp XML exchange format to the folder $BASEDIR/CPAchecker/output.
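To illustrate this structure, here is a skeleton of such a run command. The two shell variables stand for elements 3 and 4 and are placeholders only; the concrete options for each reuse type and analysis combination are listed below.

    # Skeleton of a CoVeriTest run command; ANALYSIS_CONFIG (element 3) and
    # REUSE_AND_LIMIT (element 4) are placeholders, the concrete options are listed below.
    ANALYSIS_CONFIG="..."
    REUSE_AND_LIMIT="..."
    scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s \
      -setprop log.consoleLevel=SEVERE -stats \
      -spec config/specification/sv-comp-terminatingfunctions.spc \
      $ANALYSIS_CONFIG $REUSE_AND_LIMIT \
      test_locks_5.c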

Next, we list for each of the twelve reuse types the command lines required to execute the CoVeriTest configurations that use this reuse type, a time limit pair of 10×10 (i.e., each of the two analyses is granted 10 s per iteration), and either a combination of value and predicate analysis or of BMC and symbolic execution. To use a different time limit pair (x,y), one must adapt the command line as follows: replace the first occurrence of _10 by _x and the second by _y.
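As a small illustration of this substitution rule (the stored command string is a placeholder, not one of the actual run commands), replacing the first _10 by _20 and the second by _80 turns a 10×10 command into its 20×80 variant:

    # Illustration of the time-limit substitution: CMD stands for one of the 10x10 run commands.
    CMD="... _10 ... _10 ..."
    CMD=$(printf '%s' "$CMD" | sed 's/_10/_20/')   # replace the first occurrence
    CMD=$(printf '%s' "$CMD" | sed 's/_10/_80/')   # replace the remaining (second) occurrence
    echo "$CMD"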

Repeating Our Experiments

As already explained above, we performed three types of experiments: internal comparison of CoVeriTest configurations, comparison of CoVeriTest against its components, and comparison of CoVeriTest against Test-Comp participants. In the following, we explain how to repeat the experiments of each of the three types. If your system has swap memory, you need to turn it off.

sudo swapoff -a
You can later turn it on again executing the following command.
sudo swapon -a

To replicate our results, you need 8 CPU cores and 15 GB of memory.

The results of the experiments can be found in the sub-folder results, which is created below the location from where you started the respective benchexec command, i.e., either $BASE/CPAchecker or $BASE/tbf-testsuite-validator ($BASE denotes the base directory of the artifact, called BASEDIR above).

  1. To compare the CoVeriTest configurations, you need to run the experiments for CoVeriTest configurations using a combination of value and predicate as well as the experiments for CoVeriTest configurations using a combination of BMC and symbolic execution.
    • To execute the experiments for CoVeriTest with value and predicate analysis, you need to run the following six experiments. Each experiment runs twelve CoVeriTest configurations. While the time limit pair is fixed in each experiment, the twelve experimental runs use one of the twelve reuse types offered by CoVeriTest. Note that the reuse types trans-prec, trans-prec+r, and trans-prec-cond+r are named vars, vars+r, and vars-cond+r, respectively.
      • To run the configurations with time limit pair 10×10, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-value+predicate_10_10.xml
        		
      • To run the configurations with time limit pair 50×50, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-value+predicate_50_50.xml
        		
      • To run the configurations with time limit pair 100×100, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-value+predicate_100_100.xml
        		
      • To run the configurations with time limit pair 250×250, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-value+predicate_250_250.xml
        		
      • To run the configurations with time limit pair 80×20, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-value+predicate_80_20.xml
        		
      • To run the configurations with time limit pair 20×80, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-value+predicate_20_80.xml
        		
    • To execute the experiments for CoVeriTest with BMC and symbolic execution, you need to run the following six experiments. Each experiment runs nine CoVeriTest configurations. While the time limit pair is fixed in each experiment, the nine experimental runs use one of the nine reuse types offered by CoVeriTest for the combination of BMC and symbolic execution.
      • To run the configurations with time limit pair 10×10, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-bmc+symbolic_10_10.xml
        		
      • To run the configurations with time limit pair 50×50, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-bmc+symbolic_50_50.xml
        		
      • To run the configurations with time limit pair 100×100, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-bmc+symbolic_100_100.xml
        		
      • To run the configurations with time limit pair 250×250, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-bmc+symbolic_250_250.xml
        		
      • To run the configurations with time limit pair 80×20, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-bmc+symbolic_80_20.xml
        		
      • To run the configurations with time limit pair 20×80, use the following commands.

          cd $BASE/CPAchecker
          benchexec --container ../benchmark_defs/test-gen-bmc+symbolic_20_80.xml
        		
  2. To compare the best CoVeriTest configuration of each analysis combination with the single analyses used by CoVeriTest as well as the parallel execution of these analyses, we need to run six analyses: value analysis, predicate analysis, parallel execution of value and predicate analysis, BMC, symbolic execution, and parallel execution of BMC and symbolic execution. To regenerate the experimental data for these six analyses, execute the following commands. For CoVeriTest, we reuse the experimental data from the first experiment.

      cd $BASE/CPAchecker
      benchexec --container ../benchmark_defs/test-gen-single+parallel.xml
    		
  3. To compare our best CoVeriTest configuration with the tools participating in the International Competition on Software Testing (Test-Comp'19), we first let the best CoVeriTest configuration (reuse-arg, 20×80) generate test cases for the test tasks of Test-Comp'19 that use the property coverage-branches.prp. Thereafter, we use the Test-Comp validator to measure for each task the branch coverage achieved by CoVeriTest on that task.

    1. To rerun our test-case generation, execute the following commands.
        cd $BASE/CPAchecker
        benchexec --container ../benchmark_defs/CoVeriTest.xml
      		
    2. To repeat the coverage measurement with our test cases, use the following commands.
        cd $BASE/tbf-testsuite-validator
        benchexec --container ../benchmark_defs/CoVeriTest-val.xml
      		
      If you want to use the test cases you generated with the commands in step one, you need to adjust the paths to the test cases in the benchmark definition accordingly. To this end, adapt the text ending in test.zip in the option and requiredfiles tags of the benchmark file CoVeriTest-val.xml. Thereafter, you can use the above command.

    We compared the produced coverage values with the results of the test-case generation tools participating in Test-Comp'19. Note that this comparison only makes sense because we used the same resource limits and infrastructure as Test-Comp'19.