Posted to user@cassandra.apache.org by Rakesh Kumar <dc...@aim.com> on 2016/03/10 22:44:02 UTC

What is wrong in this token function

C*  3.0.3


I have a table, table1, whose primary key is ((customer_id, event_id)).



I loaded 1.03 million rows from a csv file.


Business case: Show me all events for a given customer in a given time frame


In an RDBMS this would be:


(Query1)

where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+0000' and event_time <= '2016-03-12 19:05:00+0000'   ;



But C* does not allow >= / <= on primary key columns; the error message suggested using the token function.


So I did this:


(Query2)

where token(customer_id,event_time) >= token('289','2016-03-01 18:45:00+0000')
and token(customer_id,event_time) <= token('289','2016-03-12 19:05:00+0000')  ;
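For reference, a minimal sketch of the table being queried (the column types and the payload column are assumptions, since the actual DDL was not posted; event_time in the key follows the typo correction later in the thread):

CREATE TABLE table1 (
    customer_id text,
    event_time  timestamp,
    event_data  text,                         -- hypothetical payload column
    PRIMARY KEY ((customer_id, event_time))   -- composite partition key, no clustering column
);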



I am seeing roughly 65% more rows than expected: it should be 99K rows, but it shows 163K.


I checked the output against the csv file itself.  To double-check, I loaded the csv into another table
with a modified primary key so that the first query (Query1) could be executed. It also showed 99K rows.


Am I using the token function incorrectly?







Re: What is wrong in this token function

Posted by Matt Kennedy <mk...@datastax.com>.
The conversation around the partitioner sidetracks a bit from your original
question.

You originally asked:
>>
Business case: Show me all events for a given customer in a given time frame

In an RDBMS this would be:

(Query1)
where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+0000' and event_time <= '2016-03-12
19:05:00+0000'   ;

But C* does not allow >= / <= on primary key columns
<<

Actually, C* does allow range queries on _some_ primary key columns, just
not on the partition key portion of the primary key. The primary key that
you are looking for is probably ((customer_id), event_time). Structuring
the key like this uses customer_id as the partition key, where you can use
an equality clause as shown above (customer_id = '289'), followed by range
clauses on event_time, which is now treated as a _clustering column_.
Clustering columns are a concept you probably want to look into further to
wrap your head around this kind of query pattern.
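A minimal sketch of that layout (the table name, column types, and payload
column here are assumptions, not the poster's actual DDL):

CREATE TABLE events_by_customer (
    customer_id text,
    event_time  timestamp,
    event_data  text,                        -- hypothetical payload column
    PRIMARY KEY ((customer_id), event_time)  -- partition key, then clustering column
);

-- Equality on the partition key, range on the clustering column:
SELECT *
  FROM events_by_customer
 WHERE customer_id = '289'
   AND event_time >= '2016-03-01 18:45:00+0000'
   AND event_time <= '2016-03-12 19:05:00+0000';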



On Thu, Mar 10, 2016 at 5:02 PM, Rakesh Kumar <dc...@aim.com> wrote:

> thanks. that explains it.

Re: What is wrong in this token function

Posted by Rakesh Kumar <dc...@aim.com>.
thanks. that explains it.



-----Original Message-----
From: Jack Krupansky <ja...@gmail.com>
To: user <us...@cassandra.apache.org>
Sent: Thu, Mar 10, 2016 5:28 pm
Subject: Re: What is wrong in this token function




Re: What is wrong in this token function

Posted by Jack Krupansky <ja...@gmail.com>.
From the doc: "When using the RandomPartitioner or Murmur3Partitioner,
Cassandra rows are ordered by the hash of their value and hence the order
of rows is not meaningful... The ByteOrdered partitioner arranges tokens
the same way as key values, but the RandomPartitioner and
Murmur3Partitioner distribute tokens in a completely unordered manner. The
token function makes it possible to page through these unordered
partitioner results."

See:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html (for 2.1)
https://docs.datastax.com/en/cql/3.3/cql/cql_using/usePaging.html (for 2.2
and 3.x)
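A minimal sketch of that paging pattern, assuming the table1 schema from the
original post (the LIMIT and the resume-from values are hypothetical):

SELECT customer_id, event_time
  FROM table1
 LIMIT 100;

-- Resume from the token of the last row the previous page returned:
SELECT customer_id, event_time
  FROM table1
 WHERE token(customer_id, event_time) > token('289', '2016-03-10 18:45:00+0000')
 LIMIT 100;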


-- Jack Krupansky

On Thu, Mar 10, 2016 at 5:14 PM, Rakesh Kumar <dc...@aim.com> wrote:


Re: What is wrong in this token function

Posted by Rakesh Kumar <dc...@aim.com>.
I am using the default Murmur3. So are you saying that, in the case of Murmur3, the following two queries

select count(*)

where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+0000' and event_time <= '2016-03-12 19:05:00+0000'   ;
and
select count(*)

where token(customer_id,event_time) >= token('289','2016-03-01 18:45:00+0000')
and token(customer_id,event_time) <= token('289','2016-03-12 19:05:00+0000')  ;


are not the same?


And yes, I am aware of how to change the clustering key to get the first query. This question is more of an academic exercise for me.
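For reference, the two statements written out in full against the original
table1 (the FROM clause and select list are filled in as assumed from the
thread):

-- Query1 style: rejected on table1, because range restrictions are not
-- allowed on partition key columns (hence the token() suggestion):
SELECT count(*)
  FROM table1
 WHERE customer_id = '289'
   AND event_time >= '2016-03-01 18:45:00+0000'
   AND event_time <= '2016-03-12 19:05:00+0000';

-- Query2 style: a range over Murmur3 token values, not over time:
SELECT count(*)
  FROM table1
 WHERE token(customer_id, event_time) >= token('289', '2016-03-01 18:45:00+0000')
   AND token(customer_id, event_time) <= token('289', '2016-03-12 19:05:00+0000');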




-----Original Message-----
From: Jack Krupansky <ja...@gmail.com>
To: user <us...@cassandra.apache.org>
Sent: Thu, Mar 10, 2016 4:55 pm
Subject: Re: What is wrong in this token function




Re: What is wrong in this token function

Posted by Jack Krupansky <ja...@gmail.com>.
What partitioner are you using? The default partitioner is not "ordered" -
it hashes the partition key, so tokens will not be ordered even if your PKs
are. You probably want to use customer_id as your partition key and
event_time as a clustering column - then you can use RDBMS-like WHERE
conditions to select a slice of the partition.
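As an illustration, a sketch against the table1 schema from the original post
(not output from the poster's cluster):

SELECT customer_id, event_time, token(customer_id, event_time)
  FROM table1
 LIMIT 10;

-- With Murmur3Partitioner the token is a hash of the whole partition key, so
-- rows seconds apart for the same customer can land at opposite ends of the
-- token range; a token range therefore does not correspond to a time range
-- (or even to a single customer), which is why Query2 over-counts.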

-- Jack Krupansky

On Thu, Mar 10, 2016 at 4:45 PM, Rakesh Kumar <dc...@aim.com> wrote:


Re: What is wrong in this token function

Posted by Rakesh Kumar <dc...@aim.com>.
typo: the primary key was (customer_id, event_time).


-----Original Message-----
From: Rakesh Kumar <dc...@aim.com>
To: user <us...@cassandra.apache.org>
Sent: Thu, Mar 10, 2016 4:44 pm
Subject: What is wrong in this token function


