You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@cassandra.apache.org by "Yuki Morishita (JIRA)" <ji...@apache.org> on 2013/12/11 06:41:07 UTC

[jira] [Created] (CASSANDRA-6474) Compaction strategy based on MinHash

Yuki Morishita created CASSANDRA-6474:
-----------------------------------------

Summary: Compaction strategy based on MinHash
Key: CASSANDRA-6474
URL: https://issues.apache.org/jira/browse/CASSANDRA-6474
Project: Cassandra
Issue Type: New Feature
Reporter: Yuki Morishita
Priority: Minor

We can consider an SSTable as a set of partition keys, and 'compaction' as de-duplication of those partition keys.
We want to find compaction candidates from SSTables that have as many same keys as possible. If we can group similar SSTables based on some measurement, we can achieve more efficient compaction.
One such measurement is [Jaccard Distance|http://en.wikipedia.org/wiki/Jaccard_index],

!http://upload.wikimedia.org/math/1/8/6/186c7f4e83da32e889d606140fae25a0.png!

which we can estimate using technique called [MinHash|http://en.wikipedia.org/wiki/MinHash].

In Cassandra, we can calculate and store MinHash signature when writing SSTable. New compaction strategy uses the signature to find the group of similar SSTable for compaction candidates. We can always fall back to STCS when such candidates are not exists.

This is just an idea floating around my head, but before I forget, I dump it here. For introduction to this technique, [Chapter 3 of 'Mining of Massive Datasets'|http://infolab.stanford.edu/~ullman/mmds/ch3.pdf] is a good start.

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)