
Posted to user@cassandra.apache.org by Dave Viner <da...@vinertech.com> on 2010/07/28 23:43:15 UTC

iterating over all rows keys gets duplicate key returns

Hi all,

I'm having a strange result in trying to iterate over all row keys for a
particular column family.  The iteration works, but I see the same row key
returned multiple times during the iteration.

I'm using Cassandra 0.6.3, and I've put the code I'm using at
http://pastebin.com/zz5xJQ8f

Using get_range_slices() and a KeyRange with an incrementing start_key,
shouldn't I get an enumeration of the keys such that each key appears only
once?


Over 1000 iterations, I was given rows I had already seen 8322 times.  It
seems like something is amiss in how I'm performing the iteration over the
keys.  Any suggestions on how I can properly iterate?
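
For readers following along, the paging pattern described above can be sketched roughly as follows. This is a minimal simulation, not the code from the pastebin and not a real Cassandra client: `ROW_KEYS`, `fetch_page` and `iterate_all_keys` are hypothetical names, and `fetch_page` is an in-memory stand-in for a range query whose start key is assumed to be inclusive. Under that assumption each page after the first begins with the previous page's last key, so the sketch drops that one row.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for a column family's row keys, already in
# the order the server would return them.
ROW_KEYS = sorted(f"key{i:03d}" for i in range(25))


def fetch_page(start_key, count):
    """Stand-in for one range query: up to `count` keys, starting at
    `start_key` inclusive. An empty start_key means 'from the beginning'."""
    start = bisect_left(ROW_KEYS, start_key) if start_key else 0
    return ROW_KEYS[start:start + count]


def iterate_all_keys(page_size=10):
    """Yield every row key exactly once by paging with an inclusive start."""
    if page_size < 2:
        # With one row per page, dropping the repeated row leaves nothing.
        raise ValueError("page_size must be at least 2")
    start_key = ""
    first_page = True
    while True:
        page = fetch_page(start_key, page_size)
        # After the first page, the first row repeats the previous page's
        # last key (the start is inclusive), so skip it.
        fresh = page if first_page else page[1:]
        if not fresh:
            return
        yield from fresh
        start_key = fresh[-1]
        first_page = False
        if len(page) < page_size:
            return  # a short page means there is nothing left to fetch


all_keys = list(iterate_all_keys())
```

On a server that behaves as assumed, `all_keys` contains each key once. The sketch relies on the server returning keys in a stable order, which is exactly what the duplicates reported here call into question.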

Thanks
Dave Viner

Re: iterating over all rows keys gets duplicate key returns

Posted by Dave Viner <da...@pobox.com>.

Just as a follow-up, here's what seems to be the resolution:

1. Upgrading to 0.6.4 should fix this problem.
2. Using OPP (OrderPreservingPartitioner) as the DHT should solve it as well.
3. Prior to 0.6.4, when using RandomPartitioner as the DHT, there's no good
way to guarantee that you see *all* row keys for a column family.

Strategies tried:

A. Iterate over the returned keys until the "start_key" is identical to the
"last key returned".  When start_key == last key returned, exit.
-> This fails because duplicate keys can appear anywhere, even as the last key
returned.

B. Iterate over the returned keys, adding each key to a hash table.  When an
iteration returns no new keys, assume that all keys have been seen and exit.
-> This also fails, because a particular result set can consist entirely of
duplicates even though the iteration has not yet traversed the entire row-key
spectrum.
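
To make the two failure modes concrete, here is a small sketch of both stopping rules run against made-up pages. The page contents are purely illustrative and are not actual Cassandra output, the pages are replayed in a fixed order rather than fetched by start_key, and `PAGES`, `strategy_a` and `strategy_b` are hypothetical names.

```python
# Made-up pages standing in for successive range-query results that contain
# duplicates. Illustrative only: this is not real server output.
PAGES = [
    ["a", "b", "c"],
    ["c", "a", "c"],  # all duplicates, and the last key repeats start_key
    ["c", "d", "e"],
    ["e", "f"],
]
ALL_KEYS = {"a", "b", "c", "d", "e", "f"}


def strategy_a(pages):
    """Stop when the last key returned equals the start_key just used."""
    seen, start_key = set(), ""
    for page in pages:
        seen.update(page)
        if page[-1] == start_key:
            break  # looks like the end, but may only be a duplicate
        start_key = page[-1]
    return seen


def strategy_b(pages):
    """Stop when a page contributes no keys that were not already seen."""
    seen = set()
    for page in pages:
        new_keys = set(page) - seen
        if not new_keys:
            break  # a page full of duplicates is mistaken for the end
        seen |= new_keys
    return seen


missed_a = ALL_KEYS - strategy_a(PAGES)
missed_b = ALL_KEYS - strategy_b(PAGES)
```

With these pages both rules stop after the second page, so `missed_a` and `missed_b` each contain the keys that only appear later.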

Dave Viner

On Wed, Jul 28, 2010 at 3:48 PM, Rob Coli <rc...@digg.com> wrote:

> On 7/28/10 2:43 PM, Dave Viner wrote:
>
>> Hi all,
>>
>> I'm having a strange result in trying to iterate over all row keys for a
>> particular column family.  The iteration works, but I see the same row
>> key returned multiple times during the iteration.
>>
>> I'm using cassandra 0.6.3, and I've put the code in use at
>>
>
> For those not playing along on IRC, this was determined to be caused by:
>
> http://issues.apache.org/jira/browse/CASSANDRA-1042
>
> Which is fixed in 0.6.4.
>
> =Rob
>

Re: iterating over all rows keys gets duplicate key returns

Posted by Rob Coli <rc...@digg.com>.

On 7/28/10 2:43 PM, Dave Viner wrote:
> Hi all,
>
> I'm having a strange result in trying to iterate over all row keys for a
> particular column family.  The iteration works, but I see the same row
> key returned multiple times during the iteration.
>
> I'm using cassandra 0.6.3, and I've put the code in use at

For those not playing along on IRC, this was determined to be caused by:

http://issues.apache.org/jira/browse/CASSANDRA-1042

Which is fixed in 0.6.4.

=Rob

Re: iterating over all rows keys gets duplicate key returns

Posted by Jeremy Hanna <je...@gmail.com>.

Yes, didn't know if you saw the reply in the channel.

This bug has been fixed in the forthcoming 0.6.4 release.  It was bug CASSANDRA-1042 - https://issues.apache.org/jira/browse/CASSANDRA-1042

(0.6.4 will be out really soon)
