
Posted to commits@cassandra.apache.org by "Jean-Francois Im (JIRA)" <ji...@apache.org> on 2011/07/16 17:11:59 UTC

[jira] [Created] (CASSANDRA-2904) get_range_slices with no columns could be made faster by scanning the index file

get_range_slices with no columns could be made faster by scanning the index file
--------------------------------------------------------------------------------

                 Key: CASSANDRA-2904
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2904
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.7.6
            Reporter: Jean-Francois Im


When scanning a column family using get_range_slices() with a predicate that contains no columns, the scan operates on the actual data, not the index file.

Our use case is a column family with relatively wide rows (varying from 10 KB to over 100 KB of data per row) in which we need to iterate through all the keys to figure out which rows we are interested in. Going through the index file rather than the data is much faster in this case (on the order of minutes versus hours).
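
To make the size difference concrete, here is a rough back-of-the-envelope model in Java. It is not Cassandra code: the class KeyScanCost, its methods, and the index entry layout (a 2-byte key length, the key bytes, and an 8-byte data-file offset) are assumptions made for illustration, and real on-disk formats differ between versions.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model, not Cassandra code: compares bytes touched by a key-only scan. */
final class KeyScanCost {
    /** Assumed index entry layout: 2-byte key length, key bytes, 8-byte data offset. */
    static long indexEntryBytes(String key) {
        return 2 + key.getBytes(StandardCharsets.UTF_8).length + 8;
    }

    /** Bytes read if every row is pulled from the data file just to learn its key. */
    static long dataScanBytes(Map<String, Integer> rowSizesByKey) {
        long total = 0;
        for (int rowSize : rowSizesByKey.values()) {
            total += rowSize;
        }
        return total;
    }

    /** Bytes read if only the (assumed) index entries are walked. */
    static long indexScanBytes(Map<String, Integer> rowSizesByKey) {
        long total = 0;
        for (String key : rowSizesByKey.keySet()) {
            total += indexEntryBytes(key);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical rows at the two ends of the 10 KB to 100 KB range.
        Map<String, Integer> rows = new LinkedHashMap<>();
        rows.put("user:0001", 10 * 1024);
        rows.put("user:0002", 100 * 1024);
        System.out.println("data scan bytes:  " + dataScanBytes(rows));
        System.out.println("index scan bytes: " + indexScanBytes(rows));
    }
}
```

Under these assumptions the index walk touches a few dozen bytes for the two sample rows while the data walk touches more than 100 KB, which is the kind of gap that could plausibly turn hours into minutes on a large column family.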

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2904) get_range_slices with no columns could be made faster by scanning the index file

Posted by "Tupshin Harper (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tupshin Harper updated CASSANDRA-2904:
--------------------------------------

    Attachment: CASSANDRA-2904-v1.diff



[jira] [Issue Comment Edited] (CASSANDRA-2904) get_range_slices with no columns could be made faster by scanning the index file

Posted by "Tupshin Harper (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099296#comment-13099296 ] 

Tupshin Harper edited comment on CASSANDRA-2904 at 9/7/11 8:46 PM:
-------------------------------------------------------------------

Added a patch that adds an SSTableIndexScanner and related changes, per Jonathan's suggestion.

      was (Author: tupshin):
    Adds SSTableIndexScanner per Jonathan's suggestion
  


[jira] [Commented] (CASSANDRA-2904) get_range_slices with no columns could be made faster by scanning the index file

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066653#comment-13066653 ] 

Jonathan Ellis commented on CASSANDRA-2904:
-------------------------------------------

> would probably need some pointers for some things (how to handle a compaction, query cursors and a consistency level other than ONE, mostly)

The best approach would probably be to add logic to RowIteratorFactory.getIterator that recognizes the empty predicate, and to write an SSTableIndexScanner along the lines of SSTableScanner to use in that situation.

You don't need to worry about compaction (the existing mechanisms that keep in-use sstables from being purged continue to work) or consistency level (handled at the coordinator, not the replica owner).
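
The shape of that suggestion can be sketched roughly as follows. This is an illustration only: every type here (Scanner, DataScanner, IndexScanner, ScannerDispatchSketch) is a hypothetical stand-in rather than the Cassandra class of a similar name, the real RowIteratorFactory.getIterator signature is not reproduced, and an "empty predicate" is assumed to mean an empty list of requested column names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: hypothetical types, not the Cassandra classes. */
final class ScannerDispatchSketch {
    /** Minimal scanner contract: yields row keys and reports which file it reads. */
    interface Scanner extends Iterator<String> {
        String source();
    }

    /** Stand-in for a data-file scanner, which would deserialize rows to reach keys. */
    static final class DataScanner implements Scanner {
        private final Iterator<String> keys;
        DataScanner(List<String> keys) { this.keys = keys.iterator(); }
        @Override public boolean hasNext() { return keys.hasNext(); }
        @Override public String next() { return keys.next(); }
        @Override public String source() { return "data"; }
    }

    /** Stand-in for an index-file scanner, which would read keys without row bodies. */
    static final class IndexScanner implements Scanner {
        private final Iterator<String> keys;
        IndexScanner(List<String> keys) { this.keys = keys.iterator(); }
        @Override public boolean hasNext() { return keys.hasNext(); }
        @Override public String next() { return keys.next(); }
        @Override public String source() { return "index"; }
    }

    /** The dispatch step: a predicate with no columns only needs keys. */
    static Scanner getIterator(List<String> requestedColumns, List<String> sstableKeys) {
        return requestedColumns.isEmpty()
                ? new IndexScanner(sstableKeys)
                : new DataScanner(sstableKeys);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c");
        Scanner keysOnly = getIterator(Collections.<String>emptyList(), keys);
        Scanner withColumns = getIterator(Arrays.asList("name"), keys);
        System.out.println(keysOnly.source() + " / " + withColumns.source());
    }
}
```

Both scanners expose the same iterator contract, so in this sketch the caller does not need to know which file was read; only the factory decides.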



[jira] [Commented] (CASSANDRA-2904) get_range_slices with no columns could be made faster by scanning the index file

Posted by "Jean-Francois Im (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066513#comment-13066513 ] 

Jean-Francois Im commented on CASSANDRA-2904:
---------------------------------------------

I forgot to mention that I am interested in writing a patch for this. I implemented something quick and dirty on my end to get an idea of the performance improvement, but it assumes that nothing else is going on at the same time (i.e., nobody else is writing, the consistency level is always ONE, no compaction is running, only one client is issuing this kind of query, etc.).

Writing something more general-purpose would be trickier, and I would probably need some pointers for some things (mostly how to handle a compaction, query cursors and a consistency level other than ONE), but it sounds really fun. Is there any interest in this?



[jira] [Updated] (CASSANDRA-2904) get_range_slices with no columns could be made faster by scanning the index file

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2904:
--------------------------------------

             Priority: Minor  (was: Major)
    Affects Version/s:     (was: 0.7.6)



[jira] [Commented] (CASSANDRA-2904) get_range_slices with no columns could be made faster by scanning the index file

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497629#comment-13497629 ] 

Jonathan Ellis commented on CASSANDRA-2904:
-------------------------------------------

Superseded by CASSANDRA-4536
                


[jira] [Commented] (CASSANDRA-2904) get_range_slices with no columns could be made faster by scanning the index file

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066534#comment-13066534 ] 

Stu Hood commented on CASSANDRA-2904:
-------------------------------------

CASSANDRA-2319 implements something like this.

