Posted to commits@cassandra.apache.org by "Alex Petrov (Jira)" <ji...@apache.org> on 2019/10/08 16:03:00 UTC

[jira] [Created] (CASSANDRA-15348) Harry: generator library and extensible framework for fuzz testing Apache Cassandra

Alex Petrov created CASSANDRA-15348:
---------------------------------------

             Summary: Harry: generator library and extensible framework for fuzz testing Apache Cassandra
                 Key: CASSANDRA-15348
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15348
             Project: Cassandra
          Issue Type: New Feature
            Reporter: Alex Petrov
            Assignee: Alex Petrov


h2. Description:

This ticket introduces Harry, a component for fuzz testing and verification of Apache Cassandra clusters at scale.

h2. Motivation: 

Current testing tooling largely covers common and edge cases, and most tests use predefined datasets. Property-based tests can help explore a broader range of states, but often require either a complex model or a large state to test against.

h2. What problems Harry solves:

Harry makes it possible to run tests that validate the state of both dense nodes (to test the local read-write path) and large clusters (to test the distributed read-write path), and to do so efficiently. Its main goals, which also set it apart from other testing tools, are:

 * The state required for verification should remain as compact as possible.
 * The verification process itself should be as performant as possible.
 * Ideally, we'd want a way to verify database state while _continuing_ to run state-changing queries against it.

h2. What Harry does: 

To achieve this, Harry defines a model that holds the state of the database, generators that produce reproducible, pseudo-random schemas, mutations, and queries, and a validator that asserts the correctness of the model following execution of generated traffic.

h2. Harry consists of multiple reusable components:

 * Generator library: how to create a library of invertible, order-preserving generators for simple and composite data types.
 * Model and checker: how to use the properties of generators to validate the output of an eventually-consistent database in linear time.
 * Runner library: how to create a scheme for reproducible runs, despite the concurrent nature of the database and the fuzzer itself.
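
One way to picture reproducible runs under concurrency is to make every operation a pure function of a seed and a logical timestamp, so that no worker depends on shared random-number state. The following is only a sketch under that assumption and is not taken from Harry itself: {{mix64}} (a SplitMix64-style mixing function), {{operation_at}}, and the {{n_keys}} parameter are hypothetical names chosen for illustration.

```python
MASK = (1 << 64) - 1


def mix64(x):
    """SplitMix64-style finalizer: a bijection on 64-bit unsigned integers."""
    x = (x + 0x9E3779B97F4A7C15) & MASK
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK
    return x ^ (x >> 31)


def operation_at(seed, lts, n_keys=8):
    """Return (key descriptor, row descriptor) for logical timestamp `lts`.

    Hypothetical helper: the result depends only on (seed, lts), so any
    worker can recompute any operation without coordination.
    """
    h = mix64(seed ^ mix64(lts))
    return h % n_keys, mix64(h)


# Operations computed in a different order still agree one by one.
forward = [operation_at(42, lts) for lts in range(100)]
backward = [operation_at(42, lts) for lts in reversed(range(100))]
```

Because {{operation_at}} keeps no state, replaying a failed run only requires the seed and the range of logical timestamps, regardless of how the work was split across threads.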

h2. Short and somewhat approximate description of how Harry achieves this:

Generation and validation define strict mathematical relations between the generated values and pseudorandom numbers they were generated from. Using these properties, we can store minimal state and check if these properties hold during validation.

Since Cassandra stores data in rows, we should be able to "inflate" data to insert a row into the database from a single number we call _descriptor_. Each value in the row read from the database can be "deflated" back to the descriptor it was generated from. This way, to precisely verify the state of the row, we only need to know the descriptor it was generated from and a timestamp at which it was inserted.
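
To make the "inflate"/"deflate" idea concrete, here is a minimal sketch of an invertible mapping between a row descriptor and its column values. It is an illustration under stated assumptions, not Harry's actual generators: the function names, the hex-string column type, and the multiply-by-odd-constant mixing step are all hypothetical choices.

```python
MASK = (1 << 64) - 1
MULT = 0x9E3779B97F4A7C15           # odd, hence invertible modulo 2**64
MULT_INV = pow(MULT, -1, 1 << 64)   # modular inverse (Python 3.8+)


def inflate_value(descriptor, col):
    """Turn a 64-bit row descriptor into the value for column `col`."""
    mixed = ((descriptor ^ col) * MULT) & MASK
    return format(mixed, "016x")


def deflate_value(value, col):
    """Recover the row descriptor from a value read back for column `col`."""
    mixed = int(value, 16)
    return ((mixed * MULT_INV) & MASK) ^ col


def inflate_row(descriptor, n_columns):
    """Produce all column values of a row from a single descriptor."""
    return [inflate_value(descriptor, col) for col in range(n_columns)]


row = inflate_row(123456789, 3)
recovered = [deflate_value(v, col) for col, v in enumerate(row)]
```

In this sketch every entry of {{recovered}} equals the original descriptor, so the checker would only need to remember one number and a timestamp per written row.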

Similarly, keys for the inserted row can be "inflated" from a single 64-bit integer, and then "deflated" back to it. To efficiently search for keys, while allowing range scans, our generation scheme preserves the order of the original 64-bit integer. Every pair of keys generated from two 64-bit integers would sort the same way as these integers.
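
The order-preserving property can be sketched with a composite key built from the high and low halves of the integer. This is an assumed, simplified encoding (unsigned 64-bit descriptors, a two-part key of an integer and a fixed-width hex string); {{inflate_key}} and {{deflate_key}} are hypothetical names, not Harry's API.

```python
import random


def inflate_key(kd):
    """Split an unsigned 64-bit key descriptor into a composite key.

    The high 32 bits become an integer part and the low 32 bits a
    fixed-width lowercase hex string, so tuple comparison matches
    comparison of the original integers.
    """
    high = kd >> 32
    low = kd & 0xFFFFFFFF
    return (high, format(low, "08x"))


def deflate_key(key):
    """Recover the 64-bit key descriptor from a composite key."""
    high, low_hex = key
    return (high << 32) | int(low_hex, 16)


rng = random.Random(42)
descriptors = [rng.getrandbits(64) for _ in range(1000)]
by_descriptor = [inflate_key(kd) for kd in sorted(descriptors)]
by_key = sorted(inflate_key(kd) for kd in descriptors)
```

Sorting by descriptor and sorting by inflated key give the same sequence here, which is what lets a range scan over keys be translated into a range over 64-bit integers.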

This way, in order to validate the state of a range of rows queried from the database, it is sufficient to "deflate" their key and data values, use the deflated 64-bit key representation to find all descriptors these rows were generated from, and ensure that the given sequence of descriptors could have resulted in the state that the database has responded with.
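
The validation step might be sketched as follows. This is a deliberately simplified illustration, not Harry's checker: it assumes whole-row writes with last-write-wins semantics and distinct timestamps, and ignores deletes, partial updates, and in-flight writes. All names ({{validate_range}}, the {{writes}} log, the hex encodings) are hypothetical.

```python
MASK = (1 << 64) - 1
MULT = 0x9E3779B97F4A7C15           # odd, hence invertible modulo 2**64
MULT_INV = pow(MULT, -1, 1 << 64)


def inflate_value(descriptor, col):
    return format(((descriptor ^ col) * MULT) & MASK, "016x")


def deflate_value(value, col):
    return ((int(value, 16) * MULT_INV) & MASK) ^ col


def inflate_key(kd):
    return format(kd, "016x")       # fixed width keeps integer order


def deflate_key(key):
    return int(key, 16)


def validate_range(rows, writes, lo, hi):
    """Check rows returned for key descriptors in [lo, hi] against the model.

    rows:   list of (key, [column values]) in the order the database returned
    writes: dict of key descriptor -> list of (timestamp, row descriptor)
    """
    seen = []
    for key, values in rows:
        kd = deflate_key(key)
        history = writes.get(kd)
        if not history:
            return False            # a row the model never wrote
        _, expected = max(history, key=lambda w: w[0])  # last write wins
        if any(deflate_value(v, c) != expected for c, v in enumerate(values)):
            return False            # stale or corrupted value
        seen.append(kd)
    # Every written key in range must appear exactly once, in key order.
    return seen == sorted(kd for kd in writes if lo <= kd <= hi)


writes = {5: [(1, 100), (2, 200)], 9: [(1, 300)]}
good = [(inflate_key(5), [inflate_value(200, c) for c in range(2)]),
        (inflate_key(9), [inflate_value(300, c) for c in range(2)])]
stale = [(inflate_key(5), [inflate_value(100, c) for c in range(2)]), good[1]]
```

With these inputs, {{validate_range(good, writes, 0, 10)}} accepts the result, while the {{stale}} response (which returns the overwritten descriptor 100) and a response missing a row are rejected. The check touches each returned row once and stores only integers in the model.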

Using this scheme, we keep the minimum possible amount of data per row, can generate the data efficiently, and can trace values back to the numbers they were generated from. Most of the time, we operate on 64-bit integer values and only use "inflated" objects when running queries against the database, minimizing the amount of required memory.

h2. Name: 

Harry (verb). 

According to Merriam-Webster: 
  * to torment by or as if by constant attack
  * to make a pillaging or destructive raid on
  * persistently carry out attacks on (an enemy or an enemy's territory)


