You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Jan Algermissen <ja...@nordsc.com> on 2013/08/13 22:46:42 UTC

token(), limit and wide rows

HI,

ok, so I found token() [1], and that it is an option for paging through randomly partitioned data. 

I take it that combining token() and LIMIT is the CQL3 idiom for paging (set aside the fact that one shouldn't raelly want to page and use C*)

Now, when I page through a CF with wide rows, limitting each 'page' to, for example, 100 I end up in situations where not all 'sub'rows that have the same result for token() are returned because LIMIT chops off the result after 100 'sub'rows, not neccessarily at the boundary to the next wide row.

Obvious ... but inconvenient.

The solution would be to throw away the last token returned (because it's wide row could have been chopped off) and do the next query with the token before.

So instead of doing

     SELECT * FROM users WHERE token(name) > token(last-name-of-prev-result) LIMIT 100;

I'd be doing

    SELECT * FROM users WHERE token(name) > token(one-befoe-the-last-name-of-prev-result) LIMIT 100;


Question: Is that what I have to do or is there a way to make token() and limit work together to return complete wide rows?


Jan



[1] token() and how it relates to paging is actually quite hard to grasp from the docs.

Re: token(), limit and wide rows

Posted by Richard Low <ri...@wentnet.com>.

You can do it by using two types of query.  One using token as you suggest,
the other by fixing the partition key and walking through the other parts
of the composite primary key.

For example, consider the table:

create table paging (a text, b text, c text primary key (a, b));

I inserted ('1', '1', 'x'), ('1', '2', 'x'), ..., ('1', '5', 'x') and then
again for a='2.  Suppose the paging size is 3, then start with

> select * from paging limit 3;

 a | b | c
---+---+---
 2 | 1 | x
 2 | 2 | x
 2 | 3 | x

Now you don't know if there are more items with a='2', so run:

> select * from paging where a = '2' and b > '3' limit 3;

 a | b | c
---+---+---
 2 | 4 | x
 2 | 5 | x

You know there aren't any more because only two results were obtained, but
you can continue with greater values of b if required.

Now move on to the next a value (in token order):

> select * from paging where token(a) > token('2') limit 3;

 a | b | c
---+---+---
 1 | 1 | x
 1 | 2 | x
 1 | 3 | x

and so on.

I don't know if there is any client library support for this, but it would
be useful.  But I think in Cassandra 2.0, CASSANDRA-4415 and CASSANDRA-4536
will solve this.

Richard.

On 16 August 2013 17:16, Jonathan Rhone <jo...@shareablee.com> wrote:

> Read
>
> http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive
>
> And look at
>
>
> http://fossies.org/dox/apache-cassandra-1.2.8-src/CqlPagingRecordReader_8java_source.html
>
> - Jon
>
>
> On Fri, Aug 16, 2013 at 12:08 PM, Keith Freeman <8f...@gmail.com> wrote:
>
>> I've run into the same problem, surprised nobody's responded to you.  Any
>> time someone asks "how do I page through all the rows of a table in CQL3?",
>> the standard answer is token() and limit.  But as you point out, this
>> method will often miss some data from wide rows.
>>
>> Maybe a Cassandra expert will chime in if we're wrong.
>>
>> Your suggestion is possible if you know how to find the previous value of
>> 'name' field (and are willing to filter out repeated rows), but wouldn't
>> that be difficult/impossible with some keys?  So then, is there a way to do
>> paging queries that get ALL of the rows, even in wide rows?
>>
>>
>>
>> On 08/13/2013 02:46 PM, Jan Algermissen wrote:
>>
>>> HI,
>>>
>>> ok, so I found token() [1], and that it is an option for paging through
>>> randomly partitioned data.
>>>
>>> I take it that combining token() and LIMIT is the CQL3 idiom for paging
>>> (set aside the fact that one shouldn't raelly want to page and use C*)
>>>
>>> Now, when I page through a CF with wide rows, limitting each 'page' to,
>>> for example, 100 I end up in situations where not all 'sub'rows that have
>>> the same result for token() are returned because LIMIT chops off the result
>>> after 100 'sub'rows, not neccessarily at the boundary to the next wide row.
>>>
>>> Obvious ... but inconvenient.
>>>
>>> The solution would be to throw away the last token returned (because
>>> it's wide row could have been chopped off) and do the next query with the
>>> token before.
>>>
>>> So instead of doing
>>>
>>>       SELECT * FROM users WHERE token(name) > token(last-name-of-prev-**result)
>>> LIMIT 100;
>>>
>>> I'd be doing
>>>
>>>      SELECT * FROM users WHERE token(name) >
>>> token(one-befoe-the-last-name-**of-prev-result) LIMIT 100;
>>>
>>>
>>> Question: Is that what I have to do or is there a way to make token()
>>> and limit work together to return complete wide rows?
>>>
>>>
>>> Jan
>>>
>>>
>>>
>>> [1] token() and how it relates to paging is actually quite hard to grasp
>>> from the docs.
>>>
>>
>>
>

Re: token(), limit and wide rows

Posted by Jonathan Rhone <jo...@shareablee.com>.

Read

http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive

And look at

http://fossies.org/dox/apache-cassandra-1.2.8-src/CqlPagingRecordReader_8java_source.html

- Jon

On Fri, Aug 16, 2013 at 12:08 PM, Keith Freeman <8f...@gmail.com> wrote:

> I've run into the same problem, surprised nobody's responded to you.  Any
> time someone asks "how do I page through all the rows of a table in CQL3?",
> the standard answer is token() and limit.  But as you point out, this
> method will often miss some data from wide rows.
>
> Maybe a Cassandra expert will chime in if we're wrong.
>
> Your suggestion is possible if you know how to find the previous value of
> 'name' field (and are willing to filter out repeated rows), but wouldn't
> that be difficult/impossible with some keys?  So then, is there a way to do
> paging queries that get ALL of the rows, even in wide rows?
>
>
>
> On 08/13/2013 02:46 PM, Jan Algermissen wrote:
>
>> HI,
>>
>> ok, so I found token() [1], and that it is an option for paging through
>> randomly partitioned data.
>>
>> I take it that combining token() and LIMIT is the CQL3 idiom for paging
>> (set aside the fact that one shouldn't raelly want to page and use C*)
>>
>> Now, when I page through a CF with wide rows, limitting each 'page' to,
>> for example, 100 I end up in situations where not all 'sub'rows that have
>> the same result for token() are returned because LIMIT chops off the result
>> after 100 'sub'rows, not neccessarily at the boundary to the next wide row.
>>
>> Obvious ... but inconvenient.
>>
>> The solution would be to throw away the last token returned (because it's
>> wide row could have been chopped off) and do the next query with the token
>> before.
>>
>> So instead of doing
>>
>>       SELECT * FROM users WHERE token(name) > token(last-name-of-prev-**result)
>> LIMIT 100;
>>
>> I'd be doing
>>
>>      SELECT * FROM users WHERE token(name) >
>> token(one-befoe-the-last-name-**of-prev-result) LIMIT 100;
>>
>>
>> Question: Is that what I have to do or is there a way to make token() and
>> limit work together to return complete wide rows?
>>
>>
>> Jan
>>
>>
>>
>> [1] token() and how it relates to paging is actually quite hard to grasp
>> from the docs.
>>
>
>

Re: token(), limit and wide rows

Posted by Keith Freeman <8f...@gmail.com>.

I've run into the same problem, surprised nobody's responded to you.  
Any time someone asks "how do I page through all the rows of a table in 
CQL3?", the standard answer is token() and limit.  But as you point out, 
this method will often miss some data from wide rows.

Maybe a Cassandra expert will chime in if we're wrong.

Your suggestion is possible if you know how to find the previous value 
of 'name' field (and are willing to filter out repeated rows), but 
wouldn't that be difficult/impossible with some keys?  So then, is there 
a way to do paging queries that get ALL of the rows, even in wide rows?

On 08/13/2013 02:46 PM, Jan Algermissen wrote:
> HI,
>
> ok, so I found token() [1], and that it is an option for paging through randomly partitioned data.
>
> I take it that combining token() and LIMIT is the CQL3 idiom for paging (set aside the fact that one shouldn't raelly want to page and use C*)
>
> Now, when I page through a CF with wide rows, limitting each 'page' to, for example, 100 I end up in situations where not all 'sub'rows that have the same result for token() are returned because LIMIT chops off the result after 100 'sub'rows, not neccessarily at the boundary to the next wide row.
>
> Obvious ... but inconvenient.
>
> The solution would be to throw away the last token returned (because it's wide row could have been chopped off) and do the next query with the token before.
>
> So instead of doing
>
>       SELECT * FROM users WHERE token(name) > token(last-name-of-prev-result) LIMIT 100;
>
> I'd be doing
>
>      SELECT * FROM users WHERE token(name) > token(one-befoe-the-last-name-of-prev-result) LIMIT 100;
>
>
> Question: Is that what I have to do or is there a way to make token() and limit work together to return complete wide rows?
>
>
> Jan
>
>
>
> [1] token() and how it relates to paging is actually quite hard to grasp from the docs.