You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Carlos Alonso <in...@mrcalonso.com> on 2016/04/21 17:58:50 UTC

Unable to reliably count keys on a thrift CF

Hi guys.

I've been struggling for the last days to find a reliable and stable way to
count keys in a thrift column family.

My idea is to basically iterate the whole ring using the token function, as
documented here:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html in batches
of 10000 records

The only corner case is that if there were more than 10000 records in a
single partition (not the case, but the program should still handle it) it
explores the partition in depth by getting all records for that particular
token (see below). In the end, all keys are saved into a hash to guarantee
uniqueness. The count of unique keys is always different (and random,
sometimes more keys, sometimes less are retrieved) and, of course, I'm sure
no activity is going on in that cf.

I'm running Cassandra 2.1.11 with MurMur3 partitioner. RF=3 and CL=QUORUM

the column family structure is

CREATE TABLE tbl (
    key blob,
    column1 ascii,
    value blob,
    PRIMARY KEY(key, column1)
)

and I'm running the following script

connection = open_cql_connection
results = connection.execute("SELECT token(key), key FROM tbl LIMIT 10000")

keys_hash = {} // Hash to save the keys to guarantee uniqueness
last_token = nil
token = nil

while results != nil
  results.each do |row|
    keys_hash[row['key']] = true
    token = row['token(key)']
  end
  if token == last_token
    results = connection.execute("SELECT token(key), key FROM tbl WHERE
token(key) = #{token}")
  else
    results = connection.execute("SELECT token(key), key FROM tbl WHERE
token(key) >= #{token} LIMIT 10000")
  end
  last_token = token
end

puts keys.keys.count

What am I missing?

Thanks!

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

Re: Unable to reliably count keys on a thrift CF

Posted by Anuj Wadehra <an...@yahoo.co.in>.
Hi Carlos,
Please check if the JIRA : https://issues.apache.org/jira/browse/CASSANDRA-11467 fixes your problem.
We had been facing row count issue with thrift cf / compact storage and this fixed it.
Above is fixed in latest 2.1.14. Its a two line fix. So, you can also prepare a custom jar and check if that works.
ThanksAnuj
Sent from Yahoo Mail on Android 
 
  On Thu, 21 Apr, 2016 at 9:29 PM, Carlos Alonso<in...@mrcalonso.com> wrote:   Hi guys.
I've been struggling for the last days to find a reliable and stable way to count keys in a thrift column family.
My idea is to basically iterate the whole ring using the token function, as documented here: https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html in batches of 10000 records
The only corner case is that if there were more than 10000 records in a single partition (not the case, but the program should still handle it) it explores the partition in depth by getting all records for that particular token (see below). In the end, all keys are saved into a hash to guarantee uniqueness. The count of unique keys is always different (and random, sometimes more keys, sometimes less are retrieved) and, of course, I'm sure no activity is going on in that cf.
I'm running Cassandra 2.1.11 with MurMur3 partitioner. RF=3 and CL=QUORUM
the column family structure is
CREATE TABLE tbl (    key blob,    column1 ascii,    value blob,    PRIMARY KEY(key, column1))
and I'm running the following script
connection = open_cql_connectionresults = connection.execute("SELECT token(key), key FROM tbl LIMIT 10000")
keys_hash = {} // Hash to save the keys to guarantee uniquenesslast_token = niltoken = nil
while results != nil  results.each do |row|    keys_hash[row['key']] = true    token = row['token(key)']  end  if token == last_token    results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) = #{token}")  else    results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) >= #{token} LIMIT 10000")  end  last_token = tokenend

puts keys.keys.count
What am I missing?
Thanks!
Carlos Alonso | Software Engineer | @calonso
  

Re: Unable to reliably count keys on a thrift CF

Posted by Carlos Alonso <in...@mrcalonso.com>.
Hi Jens.

Thanks for your response but my idea is to count different keys, so, if I
understood correctly selecting WHERE key = #{key} won't give me any new
key, right?

Thanks!

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 25 April 2016 at 09:22, Jens Rantil <je...@tink.se> wrote:

> Hi Carlos,
>
> In CQL, for the cornercase you describe, you could simply do
>
>     SELECT * FROM tbl WHERE key=#{key} LIMIT 1000;
>
> and if it returns 1000 items, you'd iteratively do
>
>     SELECT * FROM tbl WHERE key=#{key} AND column1 >
> #{last_col1_in_prev_query} LIMIT 1000;
>
> Also, have a look at fetchSize here:
> https://docs.datastax.com/en/developer/java-driver/2.0/java-driver/reference/queryBuilderOverview.html?scroll=queryBuilderOverview__setting-query-options-querybuilder-api
>
> Hope this helps.
>
> Cheers,
> Jens
>
> On Thu, Apr 21, 2016 at 5:59 PM Carlos Alonso <in...@mrcalonso.com> wrote:
>
>> Hi guys.
>>
>> I've been struggling for the last days to find a reliable and stable way
>> to count keys in a thrift column family.
>>
>> My idea is to basically iterate the whole ring using the token function,
>> as documented here:
>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html in
>> batches of 10000 records
>>
>> The only corner case is that if there were more than 10000 records in a
>> single partition (not the case, but the program should still handle it) it
>> explores the partition in depth by getting all records for that particular
>> token (see below). In the end, all keys are saved into a hash to guarantee
>> uniqueness. The count of unique keys is always different (and random,
>> sometimes more keys, sometimes less are retrieved) and, of course, I'm sure
>> no activity is going on in that cf.
>>
>> I'm running Cassandra 2.1.11 with MurMur3 partitioner. RF=3 and CL=QUORUM
>>
>> the column family structure is
>>
>> CREATE TABLE tbl (
>>     key blob,
>>     column1 ascii,
>>     value blob,
>>     PRIMARY KEY(key, column1)
>> )
>>
>> and I'm running the following script
>>
>> connection = open_cql_connection
>> results = connection.execute("SELECT token(key), key FROM tbl LIMIT
>> 10000")
>>
>> keys_hash = {} // Hash to save the keys to guarantee uniqueness
>> last_token = nil
>> token = nil
>>
>> while results != nil
>>   results.each do |row|
>>     keys_hash[row['key']] = true
>>     token = row['token(key)']
>>   end
>>   if token == last_token
>>     results = connection.execute("SELECT token(key), key FROM tbl WHERE
>> token(key) = #{token}")
>>   else
>>     results = connection.execute("SELECT token(key), key FROM tbl WHERE
>> token(key) >= #{token} LIMIT 10000")
>>   end
>>   last_token = token
>> end
>>
>> puts keys.keys.count
>>
>> What am I missing?
>>
>> Thanks!
>>
>> Carlos Alonso | Software Engineer | @calonso
>> <https://twitter.com/calonso>
>>
> --
>
> Jens Rantil
> Backend Developer @ Tink
>
> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
> For urgent matters you can reach me at +46-708-84 18 32.
>

Re: Unable to reliably count keys on a thrift CF

Posted by Jens Rantil <je...@tink.se>.
Hi Carlos,

In CQL, for the cornercase you describe, you could simply do

    SELECT * FROM tbl WHERE key=#{key} LIMIT 1000;

and if it returns 1000 items, you'd iteratively do

    SELECT * FROM tbl WHERE key=#{key} AND column1 >
#{last_col1_in_prev_query} LIMIT 1000;

Also, have a look at fetchSize here:
https://docs.datastax.com/en/developer/java-driver/2.0/java-driver/reference/queryBuilderOverview.html?scroll=queryBuilderOverview__setting-query-options-querybuilder-api

Hope this helps.

Cheers,
Jens

On Thu, Apr 21, 2016 at 5:59 PM Carlos Alonso <in...@mrcalonso.com> wrote:

> Hi guys.
>
> I've been struggling for the last days to find a reliable and stable way
> to count keys in a thrift column family.
>
> My idea is to basically iterate the whole ring using the token function,
> as documented here:
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html in
> batches of 10000 records
>
> The only corner case is that if there were more than 10000 records in a
> single partition (not the case, but the program should still handle it) it
> explores the partition in depth by getting all records for that particular
> token (see below). In the end, all keys are saved into a hash to guarantee
> uniqueness. The count of unique keys is always different (and random,
> sometimes more keys, sometimes less are retrieved) and, of course, I'm sure
> no activity is going on in that cf.
>
> I'm running Cassandra 2.1.11 with MurMur3 partitioner. RF=3 and CL=QUORUM
>
> the column family structure is
>
> CREATE TABLE tbl (
>     key blob,
>     column1 ascii,
>     value blob,
>     PRIMARY KEY(key, column1)
> )
>
> and I'm running the following script
>
> connection = open_cql_connection
> results = connection.execute("SELECT token(key), key FROM tbl LIMIT 10000")
>
> keys_hash = {} // Hash to save the keys to guarantee uniqueness
> last_token = nil
> token = nil
>
> while results != nil
>   results.each do |row|
>     keys_hash[row['key']] = true
>     token = row['token(key)']
>   end
>   if token == last_token
>     results = connection.execute("SELECT token(key), key FROM tbl WHERE
> token(key) = #{token}")
>   else
>     results = connection.execute("SELECT token(key), key FROM tbl WHERE
> token(key) >= #{token} LIMIT 10000")
>   end
>   last_token = token
> end
>
> puts keys.keys.count
>
> What am I missing?
>
> Thanks!
>
> Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
>
-- 

Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.