Posted to commits@cassandra.apache.org by "Daniel Norberg (JIRA)" <ji...@apache.org> on 2012/09/24 16:32:07 UTC

[jira] [Created] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Daniel Norberg created CASSANDRA-4710:
-----------------------------------------

             Summary: High key hashing overhead for index scans when using RandomPartitioner
                 Key: CASSANDRA-4710
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4710
             Project: Cassandra
          Issue Type: Bug
            Reporter: Daniel Norberg


For a workload where the dataset fits completely in RAM, profiling shows that the MD5 hashing of keys during index scans becomes a bottleneck for reads when using RandomPartitioner.

Performing a raw key equality check in SSTableReader.getPosition() for EQ operations instead improves throughput by some 30% for my workload (moving the bottleneck elsewhere).
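
To illustrate the difference between the two approaches, here is a minimal Java sketch. It is not the attached patch: the class RawKeyScanSketch, its methods, and the md5Token helper are hypothetical names, and md5Token is only a simplified stand-in for an MD5-derived token. The token-based lookup hashes every index key it visits, while the raw-key lookup only compares bytes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class RawKeyScanSketch {
    // Simplified stand-in for an MD5-based token (assumption, not Cassandra's code).
    static BigInteger md5Token(byte[] key) {
        try {
            return new BigInteger(MessageDigest.getInstance("MD5").digest(key)).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    // Token-based EQ lookup: computes one MD5 digest per index entry visited.
    static int findByToken(List<byte[]> indexKeys, byte[] needle) {
        BigInteger needleToken = md5Token(needle);
        for (int i = 0; i < indexKeys.size(); i++) {
            byte[] candidate = indexKeys.get(i);
            if (md5Token(candidate).equals(needleToken) && Arrays.equals(candidate, needle)) {
                return i;
            }
        }
        return -1;
    }

    // Raw-key EQ lookup: compares bytes directly, with no hashing at all.
    static int findByRawKey(List<byte[]> indexKeys, byte[] needle) {
        for (int i = 0; i < indexKeys.size(); i++) {
            if (Arrays.equals(indexKeys.get(i), needle)) {
                return i;
            }
        }
        return -1;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(bytes("alice"), bytes("bob"), bytes("carol"));
        // Both lookups agree on the result; only the per-entry cost differs.
        System.out.println(findByToken(keys, bytes("bob")));
        System.out.println(findByRawKey(keys, bytes("bob")));
    }
}
```

Both methods return the same index for an equality lookup, which is why the cheaper byte comparison can replace the hashed comparison for EQ operations.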

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Daniel Norberg (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Norberg updated CASSANDRA-4710:
--------------------------------------

    Attachment: 0001-SSTableReader-compare-raw-key-when-scanning-index.patch
    
> High key hashing overhead for index scans when using RandomPartitioner
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-4710
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4710
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Daniel Norberg
>         Attachments: 0001-SSTableReader-compare-raw-key-when-scanning-index.patch
>
>
> For a workload where the dataset fits completely in RAM, profiling shows that the MD5 hashing of keys during index scans becomes a bottleneck for reads when using RandomPartitioner.
> Performing a raw key equality check in SSTableReader.getPosition() for EQ operations instead improves throughput by some 30% for my workload (moving the bottleneck elsewhere).


[jira] [Commented] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470100#comment-13470100 ] 

Sylvain Lebresne commented on CASSANDRA-4710:
---------------------------------------------

I agree, good catch. I went ahead and committed the fix (with some comments) in commit 2a91a48. Thanks Yuki.
                
> High key hashing overhead for index scans when using RandomPartitioner
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-4710
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4710
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Daniel Norberg
>            Assignee: Daniel Norberg
>            Priority: Minor
>             Fix For: 1.2.0 beta 2
>
>         Attachments: 0001-SSTableReader-compare-raw-key-when-scanning-index.patch
>
>
> For a workload where the dataset fits completely in RAM, profiling shows that the MD5 hashing of keys during index scans becomes a bottleneck for reads when using RandomPartitioner.
> Performing a raw key equality check in SSTableReader.getPosition() for EQ operations instead improves throughput by some 30% for my workload (moving the bottleneck elsewhere).


[jira] [Commented] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Daniel Norberg (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470105#comment-13470105 ] 

Daniel Norberg commented on CASSANDRA-4710:
-------------------------------------------

Good catch.
                


[jira] [Updated] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Daniel Norberg (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Norberg updated CASSANDRA-4710:
--------------------------------------

    Attachment:     (was: 0001-SSTableReader-compare-raw-key-when-scanning-index.patch)
    


[jira] [Updated] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-4710:
--------------------------------------

      Priority: Minor  (was: Major)
    Issue Type: Improvement  (was: Bug)
    
> High key hashing overhead for index scans when using RandomPartitioner
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-4710
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4710
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Daniel Norberg
>            Priority: Minor
>         Attachments: 0001-SSTableReader-compare-raw-key-when-scanning-index.patch
>
>
> For a workload where the dataset fits completely in RAM, profiling shows that the MD5 hashing of keys during index scans becomes a bottleneck for reads when using RandomPartitioner.
> Performing a raw key equality check in SSTableReader.getPosition() for EQ operations instead improves throughput by some 30% for my workload (moving the bottleneck elsewhere).


[jira] [Resolved] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne resolved CASSANDRA-4710.
-----------------------------------------

    Resolution: Fixed
    


[jira] [Commented] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461864#comment-13461864 ] 

Jonathan Ellis commented on CASSANDRA-4710:
-------------------------------------------

Why is DatabaseDescriptor.getIndexInterval introduced?
                


[jira] [Commented] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Daniel Norberg (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462360#comment-13462360 ] 

Daniel Norberg commented on CASSANDRA-4710:
-------------------------------------------

The check against DatabaseDescriptor.getIndexInterval makes it possible to exit the loop when the key being looked up is not present in the index.

With token comparison, the loop can exit as soon as it encounters an index entry whose token is greater than the needle, because the index is sorted by token; that is the "if (v < 0) return null" check. With raw key comparison, however, we have to look at every entry in the section of the index that the sampled index pointed us to before we can know that a key is not present. Fortunately this should be rare, as key presence is checked with the bloom filter for EQ lookups before the index is read.
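
The trade-off described above can be sketched in Java as follows. This is an illustration under simplifying assumptions, not the patch itself: the class BoundedEqScanSketch and its methods are hypothetical, tokens are plain long values, and each method returns how many index entries it had to examine. The token scan can stop at the first entry at or beyond the needle, whereas the raw-key scan for an absent key has to read the whole interval before giving up.

```java
final class BoundedEqScanSketch {
    // Token scan over a token-sorted index section: stops at the first token
    // that equals the needle (found) or exceeds it (cannot be present).
    static int entriesReadByTokenScan(long[] sortedTokens, long needle) {
        int read = 0;
        for (long token : sortedTokens) {
            read++;
            if (token >= needle) {
                break;
            }
        }
        return read;
    }

    // Raw-key EQ scan: raw keys carry no usable order here, so the only exits
    // are an exact match or reaching the index interval bound.
    static int entriesReadByRawKeyScan(String[] keysInTokenOrder, String needle, int indexInterval) {
        int read = 0;
        while (read < keysInTokenOrder.length && read < indexInterval) {
            if (keysInTokenOrder[read++].equals(needle)) {
                break;
            }
        }
        return read;
    }

    public static void main(String[] args) {
        long[] tokens = {10, 20, 30, 40};
        String[] keys = {"k10", "k20", "k30", "k40"};
        System.out.println(entriesReadByTokenScan(tokens, 15));          // 2: stops once it passes 15
        System.out.println(entriesReadByRawKeyScan(keys, "missing", 4)); // 4: reads the whole interval
        System.out.println(entriesReadByRawKeyScan(keys, "k20", 4));     // 2: exact match found
    }
}
```

The worst case only applies to keys that are absent, which the bloom filter check is expected to filter out in most cases.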
                


[jira] [Updated] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Daniel Norberg (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Norberg updated CASSANDRA-4710:
--------------------------------------

    Attachment: 0001-SSTableReader-compare-raw-key-when-scanning-index.patch

Attached patch with suggested fix; performs a raw key comparison for EQ index lookups.
                


[jira] [Reopened] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Yuki Morishita (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuki Morishita reopened CASSANDRA-4710:
---------------------------------------


When I was trying to reproduce CASSANDRA-4733, I stumbled upon the following error.

{code}
ERROR [ValidationExecutor:2] 2012-10-04 15:24:43,440 CassandraDaemon.java (line 132) Exception in thread Thread[ValidationExecutor:2,1,main]
java.lang.AssertionError: 113427529603963934725865253558964126270 is not contained in (56713727820156410577229101238628035242,113427455640312821154458202477256070484]
        at org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:345)
        at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:727)
        at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:66)
        at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:451)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
{code}

It turned out that the cause was SSTR#getPositionsForRanges returning an unrelated section of the file, due to a bug in SSTR#getPosition: getPosition was returning null when it should have returned a position.

getPosition starts searching for the key at the nearest sampled index entry and examines at most index-interval entries.
The following check inside getPosition:

{code}
 while (!input.isEOF() && i < DatabaseDescriptor.getIndexInterval())
{code}

stops the search once all index entries between two sampled positions have been examined, and the method then returns null.
But with the check above, when searching for a key that is greater than the last key inside the index interval but less than the next sampled key, the method returns null instead of the position of the first entry of the next interval.

I think the fix for this is changing < to <=, so that the scan also reads that one extra entry.
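
The off-by-one can be reproduced with a small Java sketch. It is a simplified model rather than the real getPosition: the class IndexIntervalBoundSketch and the method firstAtOrAfter are hypothetical, tokens are plain long values, and the lookup is a "first entry at or after the needle" search of the kind used when resolving ranges. The inclusive flag switches between the i < interval and i <= interval loop conditions.

```java
final class IndexIntervalBoundSketch {
    // Scans forward from a sampled position and returns the index of the first
    // token >= needle, or -1 if the loop bound is hit first.
    static int firstAtOrAfter(long[] sortedTokens, int sampledStart, long needle,
                              int interval, boolean inclusive) {
        int i = 0;
        int pos = sampledStart;
        while (pos < sortedTokens.length && (inclusive ? i <= interval : i < interval)) {
            i++;
            if (sortedTokens[pos] >= needle) {
                return pos;
            }
            pos++;
        }
        return -1;
    }

    public static void main(String[] args) {
        // With interval 3, entries 0 and 3 (tokens 10 and 40) are the sampled ones.
        long[] tokens = {10, 20, 30, 40, 50, 60};
        // 35 is greater than the last token of the first interval (30)
        // but less than the next sampled token (40).
        System.out.println(firstAtOrAfter(tokens, 0, 35, 3, false)); // -1: gives up too early
        System.out.println(firstAtOrAfter(tokens, 0, 35, 3, true));  // 3: reaches token 40
    }
}
```

With the exclusive bound the scan stops after three entries and reports nothing, even though the answer is the very next entry; the inclusive bound lets it read one entry further.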
                


[jira] [Updated] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Daniel Norberg (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Norberg updated CASSANDRA-4710:
--------------------------------------

    Attachment: 0001-SSTableReader-compare-raw-key-when-scanning-index.patch
    


[jira] [Updated] (CASSANDRA-4710) High key hashing overhead for index scans when using RandomPartitioner

Posted by "Daniel Norberg (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Norberg updated CASSANDRA-4710:
--------------------------------------

    Attachment:     (was: 0001-SSTableReader-compare-raw-key-when-scanning-index.patch)
    
