Posted to commits@cassandra.apache.org by "Oleg Anastasyev (JIRA)" <ji...@apache.org> on 2014/01/16 18:34:19 UTC
[jira] [Comment Edited] (CASSANDRA-6446) Faster range tombstones on wide partitions

    [ https://issues.apache.org/jira/browse/CASSANDRA-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873641#comment-13873641 ] 

Oleg Anastasyev edited comment on CASSANDRA-6446 at 1/16/14 5:32 PM:
---------------------------------------------------------------------

I am not comfortable with the 2.1 branch, so I cannot be of much help here.

Meanwhile, we found a bug in the v2 patch (and, I believe, in v3 as well): SliceQueryFilter behaves incorrectly when reversed=true.

The bug can be reproduced in cqlsh:
{code}
cqlsh:test> create table testppd ( a int, b int, c text, d bigint, primary key ( a,b,c ) ) with clustering order by ( b desc, c asc);
cqlsh:test> INSERT INTO testppd (a, b, c,d) VALUES ( 100,12,'Malvina',111);
cqlsh:test> INSERT INTO testppd (a, b, c,d) VALUES ( 100,13,'Karabas-Barabas',111);
cqlsh:test> select * from testppd where a=100 and b>11 and b <=13;

 a   | b  | c               | d
-----+----+-----------------+-----
 100 | 13 | Karabas-Barabas | 111
 100 | 12 |         Malvina | 111

(2 rows)

cqlsh:test> select * from testppd where a=100 and b>11 and b <=13 order by b desc;

 a   | b  | c               | d
-----+----+-----------------+-----
 100 | 13 | Karabas-Barabas | 111
 100 | 12 |         Malvina | 111

(2 rows)

cqlsh:test> select * from testppd where a=100 and b>11 and b <=13 order by b asc;

 a   | b  | c               | d
-----+----+-----------------+-----
 100 | 12 |         Malvina | 111
 100 | 13 | Karabas-Barabas | 111

(2 rows)

cqlsh:test> delete from testppd where a=100 and b = 13 and c='Karabas-Barabas';
cqlsh:test> 
cqlsh:test> select * from testppd where a=100 and b>11 and b <=13 order by b desc;

 a   | b  | c       | d
-----+----+---------+-----
 100 | 12 | Malvina | 111

(1 rows)

cqlsh:test> select * from testppd where a=100 and b>11 and b <=13 order by b asc;

 a   | b  | c               | d
-----+----+-----------------+-----
 100 | 12 |         Malvina | 111
 100 | 13 | Karabas-Barabas | 111

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
the record that was just deleted is resurrected
{code}

I fixed it by patching your v2 patch (god plz forgive me) with:
{code}
diff --git a/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java b/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java
index f6d2b17..d7fe875 100644
--- a/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java
+++ b/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java
@@ -379,7 +379,8 @@ public class SliceQueryFilter implements IDiskAtomFilter
         return new AbstractIterator<RangeTombstone>()
         {
             private int sliceIdx = 0;
-            private Iterator<RangeTombstone> sliceIter = delInfo.rangeIterator(slices[0].start, slices[0].finish);
+
+            private Iterator<RangeTombstone> sliceIter = reversed ? delInfo.rangeIterator(slices[0].finish, slices[0].start) : delInfo.rangeIterator(slices[0].start, slices[0].finish);

             protected RangeTombstone computeNext()
             {

{code}
i.e. by swapping the start and finish of the slice when the slice filter is reversed.
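
To illustrate why swapping the bounds matters, here is a minimal, self-contained sketch. It does not use Cassandra's classes: `Tombstone`, `rangeLookup` and `tombstonesForSlice` are hypothetical stand-ins with plain integer bounds, and the assumption that the range lookup expects its arguments in (low, high) order is mine, not something confirmed by the patch. Under that assumption, a reversed slice (whose start is the larger bound) matches nothing unless the bounds are swapped, so the tombstone is never applied and the deleted row reappears.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReversedSliceSketch {
    /** Hypothetical stand-in for a range tombstone covering [min, max]. */
    static final class Tombstone {
        final int min;
        final int max;

        Tombstone(int min, int max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Hypothetical range lookup: returns the tombstones overlapping [low, high].
     * Like many range lookups, it silently matches nothing when low > high.
     */
    static List<Tombstone> rangeLookup(List<Tombstone> all, int low, int high) {
        List<Tombstone> out = new ArrayList<>();
        for (Tombstone t : all) {
            if (t.max >= low && t.min <= high) {
                out.add(t);
            }
        }
        return out;
    }

    /** Mirrors the idea of the fix: swap the slice bounds when the filter is reversed. */
    static List<Tombstone> tombstonesForSlice(List<Tombstone> all, int start, int finish, boolean reversed) {
        return reversed ? rangeLookup(all, finish, start) : rangeLookup(all, start, finish);
    }

    public static void main(String[] args) {
        List<Tombstone> all = Arrays.asList(new Tombstone(13, 13));
        // A reversed slice runs from the larger bound down to the smaller one.
        int start = 13;
        int finish = 12;
        System.out.println("unswapped bounds find: " + rangeLookup(all, start, finish).size());
        System.out.println("swapped bounds find:   " + tombstonesForSlice(all, start, finish, true).size());
    }
}
```

In this toy model the unswapped call finds no tombstones while the swapped call finds the one covering 13, which matches the symptom shown in the cqlsh session.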



> Faster range tombstones on wide partitions
> ------------------------------------------
>
>                 Key: CASSANDRA-6446
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6446
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Oleg Anastasyev
>            Assignee: Oleg Anastasyev
>             Fix For: 2.1
>
>         Attachments: 0001-6446-write-path-v2.txt, 0002-6446-Read-patch-v2.txt, 6446-Read-patch-v3.txt, 6446-write-path-v3.txt, RangeTombstonesReadOptimization.diff, RangeTombstonesWriteOptimization.diff
>
>
> With wide CQL rows (~1M in a single partition), after deleting some of them we found inefficiencies in the handling of range tombstones on both the write and read paths.
> I attached 2 patches here, one for the write path (RangeTombstonesWriteOptimization.diff) and another for the read path (RangeTombstonesReadOptimization.diff).
> On the write path, when CQL rows are deleted by primary key, each deletion is represented by a range tombstone. When this tombstone is put into the memtable, the original code takes all of the partition's columns from the memtable and checks DeletionInfo.isDeleted in a brute-force loop to decide whether each column should stay in the memtable or was deleted by the new tombstone. Needless to say, the more columns a partition has, the slower deletions become, burning CPU on brute-force range tombstone checks.
> For partitions with more than 10000 columns, the RangeTombstonesWriteOptimization.diff patch loops over the tombstones instead and checks for the existence of columns covered by each of them. It also copies the whole memtable range tombstone list only if there are changes to be made to it (the original code copies the range tombstone list on every write).
> On the read path, the original code scans the whole range tombstone list of a partition to match sstable columns to their range tombstones. The RangeTombstonesReadOptimization.diff patch scans only the necessary range of tombstones, according to the filter used for the read.
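
The write-path idea in the description above can be sketched roughly as follows. This is a simplified illustration, not the code from the attached patches: `Tombstone`, `purgeByColumns`, `purgeByTombstones` and `purge` are hypothetical names, columns are modelled as a sorted map from an integer name to a timestamp, and a column is assumed to be shadowed when its timestamp is not newer than the tombstone's. Only the 10000-column threshold is taken from the description.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TombstoneLoopSketch {
    /** Threshold from the issue description; everything else here is illustrative. */
    static final int THRESHOLD = 10000;

    /** Hypothetical range tombstone covering column names in [min, max], with min <= max. */
    static final class Tombstone {
        final int min;
        final int max;
        final long markedForDeleteAt;

        Tombstone(int min, int max, long markedForDeleteAt) {
            this.min = min;
            this.max = max;
            this.markedForDeleteAt = markedForDeleteAt;
        }
    }

    /** Brute force: visit every column and test it against the tombstones. */
    static int purgeByColumns(NavigableMap<Integer, Long> columns, List<Tombstone> tombstones) {
        int removed = 0;
        Iterator<Map.Entry<Integer, Long>> it = columns.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> e = it.next();
            for (Tombstone t : tombstones) {
                boolean covered = e.getKey() >= t.min && e.getKey() <= t.max;
                if (covered && e.getValue() <= t.markedForDeleteAt) {
                    it.remove();
                    removed++;
                    break;
                }
            }
        }
        return removed;
    }

    /** Tombstone-driven: for each tombstone, visit only the columns inside its range. */
    static int purgeByTombstones(NavigableMap<Integer, Long> columns, List<Tombstone> tombstones) {
        int removed = 0;
        for (Tombstone t : tombstones) {
            // subMap is a live view, so removing through its iterator removes from 'columns'.
            Iterator<Long> it = columns.subMap(t.min, true, t.max, true).values().iterator();
            while (it.hasNext()) {
                if (it.next() <= t.markedForDeleteAt) {
                    it.remove();
                    removed++;
                }
            }
        }
        return removed;
    }

    /** Picks the strategy by partition size, as the description suggests. */
    static int purge(NavigableMap<Integer, Long> columns, List<Tombstone> tombstones) {
        return columns.size() > THRESHOLD
                ? purgeByTombstones(columns, tombstones)
                : purgeByColumns(columns, tombstones);
    }

    public static void main(String[] args) {
        NavigableMap<Integer, Long> columns = new TreeMap<>();
        for (int i = 0; i < 20000; i++) {
            columns.put(i, 1L);
        }
        List<Tombstone> tombstones = Arrays.asList(new Tombstone(5, 9, 10L));
        int removed = purge(columns, tombstones);
        System.out.println("removed " + removed + ", remaining " + columns.size());
    }
}
```

With a sorted map, the tombstone-driven loop costs roughly one lookup per tombstone plus the columns actually covered, instead of touching every column in the partition on each delete.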



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)