You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Shawn Walker (JIRA)" <ji...@apache.org> on 2014/12/28 23:13:13 UTC

[jira] [Issue Comment Deleted] (CASSANDRA-8542) Make get_range_slices and related succeed most of the time on tombstone heavy column families

     [ https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shawn Walker updated CASSANDRA-8542:
------------------------------------
    Comment: was deleted

(was: As written, the current {{InvertibleBloomReconciler}} uses finite field arithmetic over the prime field {{GF[2^30 - 35]}}.  This is a trade off:

 * (pro) Addition, multiplication, subtraction are fairly fast, and comprehensible.  
 * (pro) During reconciliation, we can actually determine what sources produced each non-common record, rather than just discovering what the differences were.
 * (pro) We can reconcile up to 28 sources at once.
 * (con) The way I've encoded from bytes into {{GF[2^30 - 35]}} is not space efficient, creating 4+ bytes of output for each 3 bytes of input.
 * (con) Division is a bit slow.

Similar techniques could be used with a different finite field.  {{GF[2^16]}} is a possible choice. For it:

* (pro) Addition and subtraction become just XOR, and so should be blazingly fast.
* (pro) Encoding is trivial, and has nearly perfect space efficiency: 1-2 bytes of padding per record should suffice.
* (con) We can only reconcile 16 sources at once.
* (con) Performing multiplication/division in {{GF[2^16]}} can be a bit arcane.  That said, it can be done quickly with a few lookup tables.* (con) During reconciliation we wouldn't be able to determine what sources produced each non-common record, only which records are the non-common records
)

> Make get_range_slices and related succeed most of the time on tombstone heavy column families
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8542
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Shawn Walker
>            Priority: Minor
>         Attachments: trunk-8542-InvertibleBloomReconciler.txt
>
>
> I've got a ColumnFamily in which some rows end up being used in a queue-like fashion, and where many tombstones tend to pile up at the left end of the row.  As a result, I run into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) when trying to list the columns of said rows.
> Please don't yell at me too loudly. I know the issue I'm describing will generally be considered as being due to poor use of Cassandra.  I understand the rationale of the current behavior, and am hesitant to play with fire by increasing {{tombstone_fail_threshold}} to a high value.  Instead, what I propose is an alternate method of resolving such queries on the read path.
> ----
> This is based on the following observation: on the coordinator node, when {{RangeSliceResponseResolver}} is resolving a range slice query, it needn't be aware of all tombstones that all responding nodes have within the specified range.  Instead, it would suffice if it could determine only those tombstones which aren't shared by all responding nodes. E.g. if node #1 responds with tombstones (A, B, D), node #2 responds with tombstones (A, B, C), and node #3 responds with tombstones (A, B, C, D), {{RangeSliceResponseResolver}} need only actually know about the tombstones (C, D): All of the responders will already have removed any relevant data for the tombstones (A, B) from their individual responses.
> Arranging for {{RangeSliceResponseResolver}} to discover only the non-common tombstones can be accomplished by using a variant of the "invertible bloom filter" data structure described in e.g. http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf.  Using an invertible Bloom filter, each responding node can build a (roughly) fixed size data structure holding a representation of all the tombstones that node encounters.  The coordinator node can then combine the invertible Bloom filters.  If there aren't too many non-common tombstones, the coordinator node will be able to enumerate all of them, and so resolve the range slice query.
> I see accomplishing this in a few discrete steps:
> 1. Implement an appropriate variant of the "invertible bloom filter".  I've started this already, and will shortly upload a patch including my {{InvertibleBloomReconciler}} implementation.  From a black box perspective, {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}} demonstrates how this data structure and algorithm could be used.
> 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into {{TombstoneReconciler}}, a structure for spilling tombstones into.  Refactor large swaths of {{org.apache.cassandra.db}}'s read path to accomodate the possibility of placing tombstones discovered during a read into a {{TombstoneReconciler}} instead of returning them within a {{Row}}, {{List<Row>}}, {{ColumnFamily}}, etc.  I've attempted a start on this, though this means carrying the {{TombstoneReconciler}} around through most of {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other places I haven't yet discovered.  I'm wondering if I wouldn't be better off waiting for CASSANDRA-8099 before starting this step -- a fully iterator flow through {{org.apache.cassandra.db}} might make this easier.
> 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters for the IBR (table size, hash count, seed), possibly by making these part of {{CFMetaData}}.  Allow a {{RangeSliceResponse}} to optionally return a {{TombstoneReconciler}} in addition to its {{List<Row>}}.  Refactor {{RangeSliceResponseResolver}} to be capable of handling {{TombstoneReconciler}} s.  Possibly refactor {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses contained a {{TombstoneReconciler}}.
> Since the invertible bloom filter is a probabilistic data structure, it is possible that resolving a query in this manner could fail.  As such, I'm proposing that the current read path not make use of the {{TombstoneReconciler}} idea unless it would otherwise encounter a {{TombstoneOverwhelming}} situation.
> Also, there is an astronomically minute possibility (on the order if 1 / 2^120 ) that tombstones reconciled via a {{TombstoneReconciler}} could be bad data.  Possibly less astronomical if someone were attempting to maliciously abuse this functionality, given the fact that I'm using the Murmur3 hash instead of a cryptographic hash within {{InvertibleBloomReconciler}}.  As such, I'm suggesting that read repair not be enabled on this path, just in case.
> ----
> Any thoughts?  Would there be any interest in an implementation as I've described?  Any suggestions on who I might ask for help navigating {{org.apache.cassandra.db}} if/when I run into trouble there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)