Posted to user@cassandra.apache.org by Adam Crain <ad...@greenenergycorp.com> on 2010/08/05 16:06:37 UTC

error using get_range_slice with random partitioner

Hi,

I'm on 0.6.4. Previous JIRA tickets and web searches indicated that iterating over the keys in a keyspace is possible, even with the random partitioner. I mostly want this for testing purposes.

I get the following error:

[junit] Internal error processing get_range_slices
[junit] org.apache.thrift.TApplicationException: Internal error processing get_range_slices

and the following server traceback:

java.lang.NumberFormatException: Zero length BigInteger
	at java.math.BigInteger.<init>(BigInteger.java:295)
	at java.math.BigInteger.<init>(BigInteger.java:467)
	at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
	at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)

I am using the scala cascal client, but am sure that get_range_slice is being called with start and stop set to "".

1) Is batch iteration possible with the random partitioner?

This isn't clear from the FAQ entry on the subject:

http://wiki.apache.org/cassandra/FAQ#iter_world

2) The FAQ states that start argument should be "". What should the end argument be?

thanks!
Adam

Re: error using get_range_slice with random partitioner

Posted by Thomas Heller <in...@zilence.net>.

Wild guess here, but are you using start_token/end_token here when you
should be using start_key? Looks to me like you are trying end_token
= ''.

HTH,
/thomas
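
A rough Python analogy for the stack trace in the original post, assuming (as the trace suggests) that RandomPartitioner parses a token string as a decimal big integer. parse_token is a hypothetical stand-in, not Cassandra code. It only illustrates why an empty token string cannot be parsed, whereas an empty start_key is the documented way to say "start from the beginning".

```python
def parse_token(text: str) -> int:
    """Hypothetical stand-in for token parsing: a token is a decimal integer string."""
    return int(text)


# A well-formed token parses fine.
good = parse_token("42")

# An empty token string has no digits to parse, so it is rejected.
try:
    parse_token("")
    outcome = "parsed"
except ValueError as exc:
    outcome = f"rejected: {exc}"

print(good, outcome)
```

If this reading is right, the fix on the client side is to fill in start_key/end_key and leave the token fields unset.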

Re: error using get_range_slice with random partitioner

Posted by Jonathan Ellis <jb...@gmail.com>.

Yes, you should be able to use get_range_slices with RP.

This stack trace looks like you changed your partitioner after the
node already had data in it.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: error using get_range_slice with random partitioner

Posted by Dave Viner <da...@pobox.com>.

Funny you should ask... I just went through the same exercise.

You must use Cassandra 0.6.4; otherwise you will get duplicate keys.
With that in place, here is a snippet of Perl that you can use.

use Data::Dumper;    # provides Dumper(), used in the error handler below

our $WANTED_COLUMN_NAME = 'mycol';
get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM, \%map);

sub get_key_to_one_column_map
{
    my ($keyspace, $column_family_name, $super_column_name,
        $consistency_level, $returned_keys) = @_;


    my ($socket, $transport, $protocol, $client, $result, $predicate,
        $column_parent, $keyrange);

    $column_parent = new Cassandra::ColumnParent();
    $column_parent->{'column_family'} = $column_family_name;
    $column_parent->{'super_column'} = $super_column_name;

    $keyrange = new Cassandra::KeyRange({
            'start_key' => '', 'end_key' => '', 'count' => 10
    });


    $predicate = new Cassandra::SlicePredicate();
    $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];

    eval
    {
        $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
        $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
        $protocol = new Thrift::BinaryProtocol($transport);
        $client = new Cassandra::CassandraClient($protocol);
        $transport->open();


        my ($next_start_key, $one_res, $iteration, $have_more, $value,
            $local_count, $previous_start_key);

        $iteration = 0;
        $have_more = 1;
        while ($have_more == 1)
        {
            $iteration++;
            $result = undef;

            $result = $client->get_range_slices($keyspace, $column_parent,
                $predicate, $keyrange, $consistency_level);

            # on success, results is an array of objects.

            if (scalar(@$result) == 1)
            {
                # we only got 1 result... check to see if it's the
                # same key as the start key... if so, we're done.
                if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
                {
                    $have_more = 0;
                    last;
                }
            }

            # check to see if we are starting with some value
            # if so, we throw away the first result.
            if ($keyrange->{'start_key'})
            {
                shift(@$result);
            }
            if (scalar(@$result) == 0)
            {
                $have_more = 0;
                last;
            }

            $previous_start_key = $keyrange->{'start_key'};
            $local_count = 0;

            for (my $r = 0; $r < scalar(@$result); $r++)
            {
                $one_res = $result->[$r];
                $next_start_key = $one_res->{'key'};

                $keyrange->{'start_key'} = $next_start_key;

                if (!exists($returned_keys->{$next_start_key}))
                {
                    $have_more = 1;
                    $local_count++;
                }


                next if (scalar(@{ $one_res->{'columns'} }) == 0);

                $value = undef;

                for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} }); $i++)
                {
                    if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq $WANTED_COLUMN_NAME)
                    {
                        $value = $one_res->{'columns'}->[$i]->{'column'}->{'value'};
                        if (!exists($returned_keys->{$next_start_key}))
                        {
                            $returned_keys->{$next_start_key} = $value;
                        }
                        else
                        {
                            # NOTE: prior to Cassandra 0.6.4, get_range_slices
                            # sometimes returns duplicates.
                            #warn "Found second value for key [$next_start_key]  was [" . $returned_keys->{$next_start_key} . "] now [$value]!";
                        }
                    }
                }
                $have_more = 1;
            } # end results loop

            if ($keyrange->{'start_key'} eq $previous_start_key)
            {
                $have_more = 0;
            }

        } # end while() loop

        $transport->close();
    };
    if ($@)
    {
        warn "Problem with Cassandra: " . Dumper($@);
    }

    # cleanup
    undef $client;
    undef $protocol;
    undef $transport;
    undef $socket;
}


HTH
Dave Viner

On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain
<ad...@greenenergycorp.com> wrote:

> Thomas,
>
> That was indeed the source of the problem. I naively assumed that the token
> range would help me avoid retrieving duplicate rows.
>
> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
> tried this morning and I seem to get odd results. Maybe this is just a
> consequence of the random partitioner. I really don't care about the order
> of the iteration; what matters is that I see every key, and each key only
> once.
>
> -Adam
>
>

Re: error using get_range_slice with random partitioner

Posted by Jeremy Hanna <je...@gmail.com>.

Sounds like what you're seeing is in the client, but there was another duplicate bug with get_range_slice that was recently fixed on the cassandra-0.6 branch.  It's slated for 0.6.5, which will probably be out sometime this month, based on previous minor releases.

https://issues.apache.org/jira/browse/CASSANDRA-1145

On Aug 6, 2010, at 10:29 AM, Adam Crain wrote:

> Thanks Dave. I'm using 0.6.4 since I saw this issue in the JIRA, but I just discovered that the client I'm using mutates the order of keys after retrieving the result with the thrift API... pretty much making key iteration impossible. So time to fork and see if they'll fix it :(.
> 
> I'll review yours as soon as I get the client fixed that I'm using.
> 
> Adam
> 
> 

Re: error using get_range_slice with random partitioner

Posted by Jeremy Hanna <je...@gmail.com>.

If you're willing to try it out, the easiest way to check whether it is resolved by the patch for CASSANDRA-1145 is to check out the 0.6 branch:

svn checkout http://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6/ cassandra-0.6

Then run `ant` to build the binaries.

On Aug 6, 2010, at 2:57 PM, Adam Crain wrote:

> Hi Jeremy,
> 
> So, I fixed my client so it preserves the ordering and I get results that may be related to the bug.
> 
> If I insert 30 keys into the random partitioner with names [key1, key2, ... key30] and then start the iteration (with a batch size of 10) I get the following debug output during the iteration:
> 
> [junit] Query w/ Range(,,10) result size: 10
> [junit] key18
> [junit] key23
> [junit] key26
> [junit] key27
> [junit] key12
> [junit] key28
> [junit] key4
> [junit] key3
> [junit] key1
> [junit] key24
> [junit] Query w/ Range(key24,,10) result size: 10
> [junit] key24
> [junit] key5
> [junit] key17
> [junit] key29
> [junit] key19
> [junit] key8
> [junit] key15
> [junit] key22
> [junit] key6
> [junit] key25
> [junit] Query w/ Range(key25,,10) result size: 3
> [junit] key25
> [junit] key14
> [junit] key2
> [junit] Query w/ Range(key2,,10), result size: 1
> [junit] key2
> 
> I never make it back around to key 18 as expected, and I never see all of the keys.
> 
> -Adam
> 

Re: error using get_range_slice with random partitioner

Posted by Peter Schuller <pe...@infidyne.com>.

>> Another way to do it is to filter results to exclude columns received
>> twice due to being on iteration end points.
>
> Well, depends on the size of your rows, keeping lists of 1mil+ column
> names will eventually become reeeeally slow (at least in ruby).

You only have to keep track of a single column since you're iterating in order.

> You only ever need to decrement/increment by one and that should be
> pretty simple in almost all cases. Granted it was a little tricky for
> TimeUUID, but we are talking bytes here, so there really is only 0-255
> +/- 1. If you are talking ASCII just trim that range down a little.

It's not about incrementing/decrementing a single byte value, it's
about decrementing the value of a byte string. Say you limit yourself
to ASCII a-z, but put no bound on the length. What's the entry
lexicographically previous to b? Is it aaaaaaa? Is it
aaaaaaaaaaaaaaaaaaaaaaaaaa? Whatever you pick, there will always be a
column with one additional a.

In a less general case where you impose a length limit on the column
name, you're fine. But not in the general case.

-- 
/ Peter Schuller
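
The argument above can be checked with a small Python sketch. It assumes column names compare as raw byte strings in lexicographic order, which is how Python compares bytes. successor and predecessor_bounded are hypothetical helper names, not part of any Cassandra client. The sketch shows that a "next" name always exists, that a "previous" name does not exist without a length limit, and that one does exist once a limit is imposed.

```python
def successor(name: bytes) -> bytes:
    """Smallest byte string strictly greater than name: append a zero byte."""
    return name + b"\x00"


def predecessor_bounded(name: bytes, max_len: int) -> bytes:
    """Largest byte string strictly less than name, among strings of at most
    max_len bytes. Only defined for non-empty names no longer than max_len."""
    if not name or len(name) > max_len:
        raise ValueError("name must be non-empty and at most max_len bytes")
    if name[-1] == 0:
        # Nothing fits between b"x" and b"x\x00", so drop the trailing zero.
        return name[:-1]
    # Decrement the last byte, then pad with the largest byte up to the limit.
    head = name[:-1] + bytes([name[-1] - 1])
    return head + b"\xff" * (max_len - len(head))


# Without a length limit, every candidate "just before b" can be beaten
# by appending one more a.
unbounded_has_no_predecessor = all(
    b"a" * n < b"a" * (n + 1) < b"b" for n in range(1, 50)
)

print(successor(b"key24"), predecessor_bounded(b"b", 3), unbounded_has_no_predecessor)
```

This is why incrementing works for paging forward over arbitrary names, while decrementing is only safe for fixed-length or length-limited names such as UUIDs.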

Re: error using get_range_slice with random partitioner

Posted by Thomas Heller <in...@zilence.net>.

>
> Another way to do it is to filter results to exclude columns received
> twice due to being on iteration end points.

Well, depends on the size of your rows, keeping lists of 1mil+ column
names will eventually become reeeeally slow (at least in ruby).

>
> This is useful because it is not always possible to increment or
> decrement (depending on iteration order) a column name (for example,
> in the case of byte strings, because there is no defined maximum
> possible length so the lexicographically "previous" column name might
> be infinitely long).

You only ever need to decrement/increment by one and that should be
pretty simple in almost all cases. Granted it was a little tricky for
TimeUUID, but we are talking bytes here, so there really is only 0-255
+/- 1. If you are talking ASCII just trim that range down a little.

/thomas

Re: error using get_range_slice with random partitioner

Posted by Peter Schuller <pe...@infidyne.com>.

> I think this is actually the expected result, whenever you are using
> range_slices with start_key/end_key you must increment the last key
> you received and then use that in the next slice start_key. I also
> tried to use token because of exactly that behaviour and the doc
> talking about inclusive/exclusive.

Another way to do it is to filter results to exclude columns received
twice due to being on iteration end points.

This is useful because it is not always possible to increment or
decrement (depending on iteration order) a column name (for example,
in the case of byte strings, because there is no defined maximum
possible length so the lexicographically "previous" column name might
be infinitely long).
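
A minimal Python sketch of the filtering approach, using an in-memory sorted list as a stand-in for a column slice (get_slice and iterate here are hypothetical helpers, not the thrift calls). Only the last name seen has to be remembered, not the whole list.

```python
from bisect import bisect_left


def get_slice(columns, start, count):
    """Stand-in for a column slice: up to `count` names >= start, in order."""
    i = bisect_left(columns, start)
    return columns[i:i + count]


def iterate(columns, page_size):
    """Yield every column name once, paging with an inclusive start."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    start, last = b"", None
    while True:
        page = get_slice(columns, start, page_size)
        # The start is inclusive, so drop the repeated end point.
        if last is not None and page and page[0] == last:
            page = page[1:]
        if not page:
            return
        yield from page
        start = last = page[-1]


names = sorted(b"col%02d" % i for i in range(25))
assert list(iterate(names, 10)) == names
```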

-- 
/ Peter Schuller

Re: error using get_range_slice with random partitioner

Posted by Thomas Heller <in...@zilence.net>.

On Sat, Aug 7, 2010 at 11:41 AM, Peter Schuller
<pe...@infidyne.com> wrote:
>> Remember the returned results are NOT sorted, so whenever you are
>> dropping the first by default, you might be dropping a good one. At
>> least that would be my guess here.
>
> Sorry I may be forgetting something about this thread, but AFAIK the
> results from cassandra (the thrift API) are sorted. Maybe there was a
> client in between that did not preserve the sorting (I forget which
> thread that was).

Column slices are always sorted, yes. We were talking about
get_RANGE_slices, and the range of rows is not sorted by key (for RPP).

re incrementing/decrementing: you're right, I was only using my
inc/dec for UUID which are fixed length.

/thomas

Re: error using get_range_slice with random partitioner

Posted by Peter Schuller <pe...@infidyne.com>.

> Remember the returned results are NOT sorted, so whenever you are
> dropping the first by default, you might be dropping a good one. At
> least that would be my guess here.

Sorry I may be forgetting something about this thread, but AFAIK the
results from cassandra (the thrift API) are sorted. Maybe there was a
client in between that did not preserve the sorting (I forget which
thread that was).

(I'm pretty sure my unit tests would have blown up by now if they're
not, but you never know...)

-- 
/ Peter Schuller

Re: error using get_range_slice with random partitioner

Posted by Thomas Heller <in...@zilence.net>.

On Sat, Aug 7, 2010 at 1:05 AM, Adam Crain
<ad...@greenenergycorp.com> wrote:
> I took this approach... reject the first result of subsequent get_range_slice requests. If you look back at output I posted (below) you'll notice that not all of the 30 keys [key1...key30] get listed! The iteration dies and can't proceed past key2.
>
> 1) 1st batch gets 10 unique keys.
> 2) 2nd batch only gets 9 unique keys with the 1st being a repeat
> 3) 3rd batch only gets 2 unique keys, with the 1st being a repeat
>
> That means the iteration didn't see 9 keys in the CF. Key7 and Key30 are missing for example.
>

Remember the returned results are NOT sorted, so whenever you are
dropping the first by default, you might be dropping a good one. At
least that would be my guess here.

I have iteration implemented in my client and everything is working as
expected and so far I never had duplicates (running 0.6.3). I'm using
tokens for range_slices tho, increment/decrement for get_slice only.

/thomas

Re: error using get_range_slice with random partitioner

Posted by Thomas Heller <in...@zilence.net>.

Sure, but it's in my Ruby client, which currently has close to no
documentation. ;)

Client is here:
http://github.com/thheller/greek_architect

Relevant Row Spec:
http://bit.ly/9uS6Ba

Row-based iteration:
http://bit.ly/cRVSTc #each_slice

Currently uses a "hack" since I wasn't able to produce Cassandra
BigInteger tokens in Ruby. I'm a math noob and couldn't figure out why
some of the tokens would differ. I just spawn a Java process and use
that to generate the tokens, insanely slow but I don't use that feature
anymore anyways. ;)

CF-Iteration:
http://bit.ly/bNgsRG #each

It's all a little abstracted away I guess, but I hope you can follow the
relevant thrift calls.

HTH,
/thomas


On Mon, Aug 9, 2010 at 3:55 PM, Adam Crain
<ad...@greenenergycorp.com> wrote:
> Hi Thomas,
>
> Can you share your client code for the iteration? It would probably help me catch my problem. Anyone know where in the cassandra source the integration tests are for this functionality on the random partitioner?

RE: error using get_range_slice with random partitioner

Posted by Adam Crain <ad...@greenenergycorp.com>.

Hi Thomas,

Can you share your client code for the iteration? It would probably help me catch my problem. Anyone know where in the cassandra source the integration tests are for this functionality on the random partitioner?

Note that I posted a specific example where the iteration failed and I was not throwing out good keys, only duplicate ones. That means 1 of 2 things:

1) I'm somehow using the API incorrectly
2) I am the only one encountering a bug

My money is on 1) of course.  I can check the thrift API against what my Scala client is calling under the hood.

-Adam

-----Original Message-----
From: th.heller@gmail.com on behalf of Thomas Heller
Sent: Fri 8/6/2010 7:17 PM
To: user@cassandra.apache.org
Subject: Re: error using get_range_slice with random partitioner

On Sat, Aug 7, 2010 at 1:05 AM, Adam Crain
<ad...@greenenergycorp.com> wrote:
> I took this approach... reject the first result of subsequent get_range_slice requests. If you look back at output I posted (below) you'll notice that not all of the 30 keys [key1...key30] get listed! The iteration dies and can't proceed past key2.
>
> 1) 1st batch gets 10 unique keys.
> 2) 2nd batch only gets 9 unique keys with the 1st being a repeat
> 3) 3rd batch only gets 2 unique keys, with the 1st being a repeat
>
> That means the iteration didn't see 9 keys in the CF. Key7 and Key30 are missing for example.
>

Remember the returned results are NOT sorted, so whenever you are
dropping the first by default, you might be dropping a good one. At
least that would be my guess here.

I have iteration implemented in my client and everything is working as
expected and so far I never had duplicates (running 0.6.3). I'm using
tokens for range_slices tho, increment/decrement for get_slice only.

/thomas

RE: error using get_range_slice with random partitioner

Posted by Adam Crain <ad...@greenenergycorp.com>.

I took this approach... reject the first result of subsequent get_range_slice requests. If you look back at output I posted (below) you'll notice that not all of the 30 keys [key1...key30] get listed! The iteration dies and can't proceed past key2.

1) 1st batch gets 10 unique keys.
2) 2nd batch only gets 9 unique keys with the 1st being a repeat
3) 3rd batch only gets 2 unique keys, with the 1st being a repeat

That means the iteration didn't see 9 keys in the CF. Key7 and Key30 are missing for example.

[junit] Query w/ Range(,,10) result size: 10 
[junit] key18 
[junit] key23 
[junit] key26 
[junit] key27 
[junit] key12 
[junit] key28 
[junit] key4 
[junit] key3 
[junit] key1 
[junit] key24 
[junit] Query w/ Range(key24,,10) result size: 10 
[junit] key24 
[junit] key5 
[junit] key17 
[junit] key29 
[junit] key19 
[junit] key8 
[junit] key15 
[junit] key22 
[junit] key6 
[junit] key25 
[junit] Query w/ Range(key25,,10) result size: 3 
[junit] key25 
[junit] key14 
[junit] key2 
[junit] Query w/ Range(key2,,10), result size: 1 
[junit] key2
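
The arithmetic can be double-checked with a few lines of Python. This is just a sketch that copies the four batches from the output above; nothing in it talks to Cassandra.

```python
batches = [
    ["key18", "key23", "key26", "key27", "key12",
     "key28", "key4", "key3", "key1", "key24"],
    ["key24", "key5", "key17", "key29", "key19",
     "key8", "key15", "key22", "key6", "key25"],
    ["key25", "key14", "key2"],
    ["key2"],
]

expected = {f"key{i}" for i in range(1, 31)}
seen = set().union(*batches)
# Sort the missing keys by their numeric suffix for readability.
missing = sorted(expected - seen, key=lambda k: int(k[3:]))

print(len(seen), "seen;", len(missing), "missing:", missing)
# 21 seen; 9 missing: ['key7', 'key9', 'key10', 'key11', 'key13',
#                      'key16', 'key20', 'key21', 'key30']
```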

-Adam

-----Original Message-----
From: scode@scode.org on behalf of Peter Schuller
Sent: Fri 8/6/2010 6:43 PM
To: user@cassandra.apache.org
Subject: Re: error using get_range_slice with random partitioner
 
> I think this is actually the expected result, whenever you are using
> range_slices with start_key/end_key you must increment the last key
> you received and then use that in the next slice start_key. I also
> tried to use token because of exactly that behaviour and the doc
> talking about inclusive/exclusive.

Another way to do it is to filter results to exclude columns received
twice due to being on iteration end points.

This is useful because it is not always possible to increment or
decrement (depending on iteration order) a column name (for example,
in the case of byte strings, because there is no defined maximum
possible length so the lexicographically "previous" column name might
be infinitely long).

-- 
/ Peter Schuller

Re: error using get_range_slice with random partitioner

Posted by Thomas Heller <in...@zilence.net>.

Hey,

[junit] key24
[junit] Query w/ Range(key24,,10) result size: 10
[junit] key24

I think this is actually the expected result, whenever you are using
range_slices with start_key/end_key you must increment the last key
you received and then use that in the next slice start_key. I also
tried to use token because of exactly that behaviour and the doc
talking about inclusive/exclusive.

Tokens are actually what the Partitioner uses to decide which nodes
your data goes to, so in case of RPP it is the MD5 hash of your
actual key as a 128-bit BigInteger (just try nodetool ring to see some
Tokens ;). get_range_slices with start/end_token is best used together
with describe_ring/describe_splits so you can talk to the nodes
directly. The Hadoop/Pig stuff uses tokens for example.
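
As a rough illustration in Python (not the real client or server code): the sketch below assumes the token is the absolute value of the MD5 digest read as a signed big-endian integer, which is how I understand the RandomPartitioner but is not confirmed in this thread. get_range_slice and all_keys are hypothetical stand-ins that return rows in token order with an inclusive start_key.

```python
import hashlib


def token(key: bytes) -> int:
    """Assumed RandomPartitioner token: abs(signed 128-bit MD5 digest)."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def get_range_slice(keys, start_key, count):
    """Stand-in for the server: rows in token order, start_key inclusive."""
    ring = sorted(keys, key=token)
    if start_key:
        start = token(start_key)
        ring = [k for k in ring if token(k) >= start]
    return ring[:count]


def all_keys(keys, page_size=10):
    """Page through every row key once, skipping the repeated first row."""
    seen, start = [], b""
    while True:
        page = get_range_slice(keys, start, page_size)
        if start and page and page[0] == start:
            page = page[1:]
        if not page:
            return seen
        seen.extend(page)
        start = page[-1]


keys = [b"key%d" % i for i in range(1, 31)]
result = all_keys(keys)
assert sorted(result) == sorted(keys)
```

The rows come back in token order, which looks random when you read the keys. The sketch stops at the end of the ring rather than wrapping around to the first key.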


HTH,
/thomas

On Sat, Aug 7, 2010 at 12:06 AM, Adam Crain
<ad...@greenenergycorp.com> wrote:
> I ran against the 0.6 branch and I still see similarly odd results. My test cases prove that the set of keys has been successfully inserted, but usually I either never see the first key again or I reach the first key before having seen all of the keys.
>
> -Adam

RE: error using get_range_slice with random partitioner

Posted by Adam Crain <ad...@greenenergycorp.com>.

I ran against the 0.6 branch and I still see similarly odd results. My test cases prove that the set of keys has been successfully inserted, but usually I either never see the first key again or I reach the first key before having seen all of the keys.

-Adam



-----Original Message-----
From: Jeremy Hanna [mailto:jeremy.hanna1234@gmail.com]
Sent: Fri 8/6/2010 4:25 PM
To: user@cassandra.apache.org
Subject: Re: error using get_range_slice with random partitioner
 
If you're willing to try it out, the easiest way to check whether it is resolved by the patch for CASSANDRA-1145 is to check out the 0.6 branch:

svn checkout http://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6/ cassandra-0.6

Then run `ant` to build the binaries.

On Aug 6, 2010, at 2:57 PM, Adam Crain wrote:

> Hi Jeremy,
> 
> So, I fixed my client so it preserves the ordering and I get results that may be related to the bug.
> 
> If I insert 30 keys into the random partitioner with names [key1, key2, ... key30] and then start the iteration (with a batch size of 10) I get the following debug output during the iteration:
> 
> [junit] Query w/ Range(,,10) result size: 10
> [junit] key18
> [junit] key23
> [junit] key26
> [junit] key27
> [junit] key12
> [junit] key28
> [junit] key4
> [junit] key3
> [junit] key1
> [junit] key24
> [junit] Query w/ Range(key24,,10) result size: 10
> [junit] key24
> [junit] key5
> [junit] key17
> [junit] key29
> [junit] key19
> [junit] key8
> [junit] key15
> [junit] key22
> [junit] key6
> [junit] key25
> [junit] Query w/ Range(key25,,10) result size: 3
> [junit] key25
> [junit] key14
> [junit] key2
> [junit] Query w/ Range(key2,,10), result size: 1
> [junit] key2
> 
> I never make it back around to key 18 as expected, and I never see all of the keys.
> 
> -Adam
> 
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1234@gmail.com]
> Sent: Fri 8/6/2010 11:45 AM
> To: user@cassandra.apache.org
> Subject: Re: error using get_range_slice with random partitioner
> 
> Sounds like what you're seeing is in the client, but there was another duplicate bug with get_range_slice that was recently fixed on cassandra-0.6 branch.  It's slated for 0.6.5 which will probably be out sometime this month, based on previous minor releases.
> 
> https://issues.apache.org/jira/browse/CASSANDRA-1145
> 
> On Aug 6, 2010, at 10:29 AM, Adam Crain wrote:
> 
>> Thanks Dave. I'm using 0.6.4 since I saw this issue in the JIRA, but I just discovered that the client I'm using mutates the order of keys after retrieving the result with the thrift API... pretty much making key iteration impossible. So time to fork and see if they'll fix it :(.
>> 
>> I'll review yours as soon as I get the client fixed that I'm using.
>> 
>> Adam
>> 
>> 
>> -----Original Message-----
>> From: daveviner@gmail.com on behalf of Dave Viner
>> Sent: Fri 8/6/2010 11:28 AM
>> To: user@cassandra.apache.org
>> Subject: Re: error using get_range_slice with random partitioner
>> 
>> Funny you should ask... I just went through the same exercise.
>> 
>> You must use Cassandra 0.6.4.  Otherwise you will get duplicate keys.
>> However, here is a snippet of perl that you can use.
>> 
>> our $WANTED_COLUMN_NAME = 'mycol';
>> get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM,
>> \%map);
>> 
>> sub get_key_to_one_column_map
>> {
>>   my ($keyspace, $column_family_name, $super_column_name,
>> $consistency_level, $returned_keys) = @_;
>> 
>> 
>>   my($socket, $transport, $protocol, $client, $result, $predicate,
>> $column_parent, $keyrange);
>> 
>>   $column_parent = new Cassandra::ColumnParent();
>>   $column_parent->{'column_family'} = $column_family_name;
>>   $column_parent->{'super_column'} = $super_column_name;
>> 
>>   $keyrange = new Cassandra::KeyRange({
>>           'start_key' => '', 'end_key' => '', 'count' => 10
>>   });
>> 
>> 
>>   $predicate = new Cassandra::SlicePredicate();
>>   $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];
>> 
>>   eval
>>   {
>>       $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
>>       $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
>>       $protocol = new Thrift::BinaryProtocol($transport);
>>       $client = new Cassandra::CassandraClient($protocol);
>>       $transport->open();
>> 
>> 
>>       my($next_start_key, $one_res, $iteration, $have_more, $value,
>> $local_count, $previous_start_key);
>> 
>>       $iteration = 0;
>>       $have_more = 1;
>>       while ($have_more == 1)
>>       {
>>           $iteration++;
>>           $result = undef;
>> 
>>           $result = $client->get_range_slices($keyspace, $column_parent,
>> $predicate, $keyrange, $consistency_level);
>> 
>>           # on success, results is an array of objects.
>> 
>>           if (scalar(@$result) == 1)
>>           {
>>               # we only got 1 result... check to see if it's the
>>               # same key as the start key... if so, we're done.
>>               if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
>>               {
>>                   $have_more = 0;
>>                   last;
>>               }
>>           }
>> 
>>           # check to see if we are starting with some value
>>           # if so, we throw away the first result.
>>           if ($keyrange->{'start_key'})
>>           {
>>               shift(@$result);
>>           }
>>           if (scalar(@$result) == 0)
>>           {
>>               $have_more = 0;
>>               last;
>>           }
>> 
>>           $previous_start_key = $keyrange->{'start_key'};
>>           $local_count = 0;
>> 
>>           for (my $r = 0; $r < scalar(@$result); $r++)
>>           {
>>               $one_res = $result->[$r];
>>               $next_start_key = $one_res->{'key'};
>> 
>>               $keyrange->{'start_key'} = $next_start_key;
>> 
>>               if (!exists($returned_keys->{$next_start_key}))
>>               {
>>                   $have_more = 1;
>>                   $local_count++;
>>               }
>> 
>> 
>>               next if (scalar(@{ $one_res->{'columns'} }) == 0);
>> 
>>               $value = undef;
>> 
>>               for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} });
>> $i++)
>>               {
>>                   if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq
>> $WANTED_COLUMN_NAME)
>>                   {
>>                       $value =
>> $one_res->{'columns'}->[$i]->{'column'}->{'value'};
>>                       if (!exists($returned_keys->{$next_start_key}))
>>                       {
>>                           $returned_keys->{$next_start_key} = $value;
>>                       }
>>                       else
>>                       {
>>                           # NOTE: prior to Cassandra 0.6.4, get_range_slices
>>                           # sometimes returns duplicates.
>>                           #warn "Found second value for key [$next_start_key]  was [" . $returned_keys->{$next_start_key} . "] now [$value]!";
>>                       }
>>                   }
>>               }
>>               $have_more = 1;
>>           } # end results loop
>> 
>>           if ($keyrange->{'start_key'} eq $previous_start_key)
>>           {
>>               $have_more = 0;
>>           }
>> 
>>       } # end while() loop
>> 
>>       $transport->close();
>>   };
>>   if ($@)
>>   {
>>       warn "Problem with Cassandra: " . Dumper($@);
>>   }
>> 
>>   # cleanup
>>   undef $client;
>>   undef $protocol;
>>   undef $transport;
>>   undef $socket;
>> }
>> 
>> 
>> HTH
>> Dave Viner
>> 
>> On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain
>> <ad...@greenenergycorp.com>wrote:
>> 
>>> Thomas,
>>> 
>>> That was indeed the source of the problem. I naively assumed that the token
>>> range would help me avoid retrieving duplicate rows.
>>> 
>>> If you iterate over the keys, how do you avoid retrieving duplicate keys? I
>>> tried this morning and I seem to get odd results. Maybe this is just a
>>> consequence of the random partitioner. I really don't care about the order
>>> of the iteration; what matters is that I see every key, and each key only
>>> once.
>>> 
>>> -Adam
>>> 
>>> 
>>> -----Original Message-----
>>> From: th.heller@gmail.com on behalf of Thomas Heller
>>> Sent: Fri 8/6/2010 7:27 AM
>>> To: user@cassandra.apache.org
>>> Subject: Re: error using get_range_slice with random partitioner
>>> 
>>> Wild guess here, but are you using start_token/end_token here when you
>>> should be using start_key? Looks to me like you are trying end_token
>>> = ''.
>>> 
>>> HTH,
>>> /thomas
>>> 
>>> On Thursday, August 5, 2010, Adam Crain <ad...@greenenergycorp.com>
>>> wrote:
>>>> Hi,
>>>>
>>>> I'm on 0.6.4. Previous tickets in the JIRA and searching the web indicated that iterating over the keys in a keyspace is possible, even with the random partitioner. This is mostly desirable in my case for testing purposes only.
>>>>
>>>> I get the following error:
>>>>
>>>> [junit] Internal error processing get_range_slices
>>>> [junit] org.apache.thrift.TApplicationException: Internal error processing get_range_slices
>>>>
>>>> and the following server traceback:
>>>>
>>>> java.lang.NumberFormatException: Zero length BigInteger
>>>>       at java.math.BigInteger.<init>(BigInteger.java:295)
>>>>       at java.math.BigInteger.<init>(BigInteger.java:467)
>>>>       at org.apache.cassandra.dht.RandomPartitioner$1.fromString(RandomPartitioner.java:100)
>>>>       at org.apache.cassandra.thrift.CassandraServer.getRangeSlicesInternal(CassandraServer.java:575)
>>>>
>>>> I am using the scala cascal client, but am sure that get_range_slice is being called with start and stop set to "".
>>>>
>>>> 1) Is batch iteration possible with random partitioner?
>>>>
>>>> This isn't clear from the FAQ entry on the subject:
>>>>
>>>> http://wiki.apache.org/cassandra/FAQ#iter_world
>>>>
>>>> 2) The FAQ states that start argument should be "". What should the end argument be?
>>>>
>>>> thanks!
>>>> Adam

RE: error using get_range_slice with random partitioner

Posted by Adam Crain <ad...@greenenergycorp.com>.

Hi Jeremy,

So, I fixed my client so it preserves the ordering and I get results that may be related to the bug.

If I insert 30 keys into the random partitioner with names [key1, key2, ... key30] and then start the iteration (with a batch size of 10) I get the following debug output during the iteration:

[junit] Query w/ Range(,,10) result size: 10
[junit] key18
[junit] key23
[junit] key26
[junit] key27
[junit] key12
[junit] key28
[junit] key4
[junit] key3
[junit] key1
[junit] key24
[junit] Query w/ Range(key24,,10) result size: 10
[junit] key24
[junit] key5
[junit] key17
[junit] key29
[junit] key19
[junit] key8
[junit] key15
[junit] key22
[junit] key6
[junit] key25
[junit] Query w/ Range(key25,,10) result size: 3
[junit] key25
[junit] key14
[junit] key2
[junit] Query w/ Range(key2,,10), result size: 1
[junit] key2

I never make it back around to key 18 as expected, and I never see all of the keys.

-Adam
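
The ordering in the output above is what the random partitioner is expected to produce: rows come back in token order, not key order. The sketch below illustrates that idea in Python under two assumptions that are not confirmed in this thread: that the token is the absolute value of the key's MD5 digest read as a signed integer, and that a range with an empty end key stops at the end of the ring instead of wrapping around. The names random_partitioner_token and range_from are made up for this illustration and are not part of any client API.

```python
import hashlib

def random_partitioner_token(key: str) -> int:
    """Approximate a RandomPartitioner token (assumption: |MD5(key)| as a signed big-endian integer)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))

# The same 30 keys as in the test above, sorted the way the ring would store them.
keys = [f"key{i}" for i in range(1, 31)]
ring_order = sorted(keys, key=random_partitioner_token)

def range_from(start_key: str, count: int) -> list:
    """Return up to `count` keys in token order, starting at start_key (inclusive).

    An empty start_key means "from the beginning of the ring". The scan ends at
    the last token and does not wrap around (assumption, see the text above).
    """
    if start_key == "":
        return ring_order[:count]
    start = random_partitioner_token(start_key)
    return [k for k in ring_order if random_partitioner_token(k) >= start][:count]

# Print the first few keys with their tokens to show the order is unrelated to the key names.
for k in ring_order[:5]:
    print(k, random_partitioner_token(k))
```

If those assumptions hold, the last page of a complete scan contains only the start key, so "never making it back around" is the normal end condition; the missing keys are the part that needs explaining.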

-----Original Message-----
From: Jeremy Hanna [mailto:jeremy.hanna1234@gmail.com]
Sent: Fri 8/6/2010 11:45 AM
To: user@cassandra.apache.org
Subject: Re: error using get_range_slice with random partitioner
 
Sounds like what you're seeing is in the client, but there was another duplicate-key bug in get_range_slices that was recently fixed on the cassandra-0.6 branch. The fix is slated for 0.6.5, which will probably be out sometime this month, based on previous minor releases.

https://issues.apache.org/jira/browse/CASSANDRA-1145


RE: error using get_range_slice with random partitioner

Posted by Adam Crain <ad...@greenenergycorp.com>.

Thanks Dave. I'm using 0.6.4 since I saw this issue in JIRA, but I just discovered that the client I'm using changes the order of the keys after retrieving the result through the Thrift API, which pretty much makes key iteration impossible. So it's time to fork it and see if they'll fix it :(.

I'll review yours as soon as I've fixed the client I'm using.

Adam


-----Original Message-----
From: daveviner@gmail.com on behalf of Dave Viner
Sent: Fri 8/6/2010 11:28 AM
To: user@cassandra.apache.org
Subject: Re: error using get_range_slice with random partitioner
 
Funny you should ask... I just went through the same exercise.

You must use Cassandra 0.6.4; otherwise you will get duplicate keys.
With that caveat, here is a snippet of Perl that you can use.

# Assumes the Thrift-generated Cassandra bindings for Perl are on @INC.
use Data::Dumper;
use Thrift;
use Thrift::Socket;
use Thrift::BufferedTransport;
use Thrift::BinaryProtocol;
use Cassandra::Cassandra;
use Cassandra::Types;

# Connection settings are placeholders (9160 is the default Thrift port).
our $CASSANDRA_HOST = 'localhost';
our $CASSANDRA_PORT = 9160;
our $WANTED_COLUMN_NAME = 'mycol';

my %map;
get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol',
    Cassandra::ConsistencyLevel::QUORUM(), \%map);

# Walks every row of the column family in pages of 10 and fills
# %$returned_keys with key => value of $WANTED_COLUMN_NAME.
sub get_key_to_one_column_map
{
    my ($keyspace, $column_family_name, $super_column_name, $consistency_level, $returned_keys) = @_;

    my ($socket, $transport, $protocol, $client, $result, $predicate, $column_parent, $keyrange);

    $column_parent = new Cassandra::ColumnParent();
    $column_parent->{'column_family'} = $column_family_name;
    $column_parent->{'super_column'} = $super_column_name;

    # Empty start and end keys mean "the whole ring".
    $keyrange = new Cassandra::KeyRange({
            'start_key' => '', 'end_key' => '', 'count' => 10
    });

    $predicate = new Cassandra::SlicePredicate();
    $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];

    eval
    {
        $socket = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
        $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
        $protocol = new Thrift::BinaryProtocol($transport);
        $client = new Cassandra::CassandraClient($protocol);
        $transport->open();

        my ($next_start_key, $one_res, $iteration, $have_more, $value, $local_count, $previous_start_key);

        $iteration = 0;
        $have_more = 1;
        while ($have_more == 1)
        {
            $iteration++;
            $result = undef;

            $result = $client->get_range_slices($keyspace, $column_parent, $predicate, $keyrange, $consistency_level);

            # on success, $result is a reference to an array of objects.

            if (scalar(@$result) == 1)
            {
                # we only got 1 result... check to see if it's the
                # same key as the start key... if so, we're done.
                if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
                {
                    $have_more = 0;
                    last;
                }
            }

            # check to see if we are starting with some value
            # if so, we throw away the first result, which was the
            # last row of the previous page. (Compare against '' so
            # that a key such as "0" is not treated as empty.)
            if ($keyrange->{'start_key'} ne '')
            {
                shift(@$result);
            }
            if (scalar(@$result) == 0)
            {
                $have_more = 0;
                last;
            }

            $previous_start_key = $keyrange->{'start_key'};
            $local_count = 0;

            for (my $r = 0; $r < scalar(@$result); $r++)
            {
                $one_res = $result->[$r];
                $next_start_key = $one_res->{'key'};

                # the last key seen becomes the start of the next page.
                $keyrange->{'start_key'} = $next_start_key;

                if (!exists($returned_keys->{$next_start_key}))
                {
                    $have_more = 1;
                    $local_count++;
                }

                next if (scalar(@{ $one_res->{'columns'} }) == 0);

                $value = undef;

                for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} }); $i++)
                {
                    if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq $WANTED_COLUMN_NAME)
                    {
                        $value = $one_res->{'columns'}->[$i]->{'column'}->{'value'};
                        if (!exists($returned_keys->{$next_start_key}))
                        {
                            $returned_keys->{$next_start_key} = $value;
                        }
                        else
                        {
                            # NOTE: prior to Cassandra 0.6.4, the get_range_slices returns duplicates sometimes.
                            #warn "Found second value for key [$next_start_key]  was [" . $returned_keys->{$next_start_key} . "] now [$value]!";
                        }
                    }
                }
                $have_more = 1;
            } # end results loop

            # no progress since the last page: stop.
            if ($keyrange->{'start_key'} eq $previous_start_key)
            {
                $have_more = 0;
            }

        } # end while() loop

        $transport->close();
    };
    if ($@)
    {
        warn "Problem with Cassandra: " . Dumper($@);
    }

    # cleanup
    undef $client;
    undef $protocol;
    undef $transport;
    undef $socket;
}


HTH
Dave Viner
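
For readers who do not use Perl, the paging pattern in the snippet above can be sketched in a few lines of Python. This is only an illustration, not a client implementation: make_fake_fetch is a hypothetical in-memory stand-in for a get_range_slices call (it returns keys in an assumed MD5 token order, start-inclusive), and iterate_keys is a made-up helper name. The loop reuses the last key of each page as the next start key, drops that repeated first row, and stops on a short page.

```python
import hashlib
from typing import Callable, Iterator, List

def token(key: str) -> int:
    """Assumed RandomPartitioner-style token: |MD5(key)| as a signed big-endian integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

def make_fake_fetch(keys: List[str]) -> Callable[[str, int], List[str]]:
    """Build a hypothetical stand-in for a range query over `keys` (no real cluster involved)."""
    ordered = sorted(keys, key=token)

    def fetch(start_key: str, count: int) -> List[str]:
        # Start-inclusive, token-ordered, no wrap-around; "" means start of ring.
        if start_key == "":
            return ordered[:count]
        start = token(start_key)
        return [k for k in ordered if token(k) >= start][:count]

    return fetch

def iterate_keys(fetch: Callable[[str, int], List[str]], page_size: int = 10) -> Iterator[str]:
    """Yield every key once by paging with the last key of each page as the next start key."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2, or the scan cannot advance")
    start_key = ""
    while True:
        page = fetch(start_key, page_size)
        # After the first page, the first row repeats the previous page's last key.
        rows = page[1:] if start_key and page and page[0] == start_key else page
        for key in rows:
            yield key
        if len(page) < page_size:
            return  # short page: the end of the ring was reached
        start_key = page[-1]

all_keys = [f"key{i}" for i in range(1, 31)]
seen = list(iterate_keys(make_fake_fetch(all_keys), page_size=10))
print(len(seen), "keys seen,", len(set(seen)), "distinct")
```

A client that reorders the rows of a page before returning them breaks this loop, because the last element of the page is then no longer the row with the highest token.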


RE: error using get_range_slice with random partitioner

Posted by Adam Crain <ad...@greenenergycorp.com>.

Thomas,

That was indeed the source of the problem: I was setting start_token/end_token instead of start_key/end_key. I naively assumed that the token range would help me avoid retrieving duplicate rows.

If you iterate over the keys, how do you avoid retrieving duplicate keys? I tried this morning and I seem to get odd results. Maybe this is just a consequence of the random partitioner. I don't care about the order of the iteration; what matters is that I see every key exactly once.

-Adam
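
The "Zero length BigInteger" error fits this explanation: with the random partitioner a token string is parsed as a number, and an empty string is not a number. The Python sketch below mimics that failure mode. It is an analogy only: parse_token and describe_range are hypothetical helpers, not Cassandra or Thrift functions, and the rule that a range uses either the key pair or the token pair is an assumption about the intended usage.

```python
from typing import Optional

def parse_token(token_str: str) -> int:
    """Parse a decimal token string; int("") raises ValueError, much like new BigInteger("") in Java."""
    return int(token_str)

def describe_range(start_key: Optional[str] = None, end_key: Optional[str] = None,
                   start_token: Optional[str] = None, end_token: Optional[str] = None) -> str:
    """Validate a key-or-token range and describe it (hypothetical helper for illustration)."""
    using_keys = start_key is not None or end_key is not None
    using_tokens = start_token is not None or end_token is not None
    if using_keys == using_tokens:
        raise ValueError("set either the key pair or the token pair, not both or neither")
    if using_keys:
        # Keys are opaque strings, so "" is a valid "no bound" marker.
        return "keys from {!r} to {!r}".format(start_key or "", end_key or "")
    # Tokens must parse as integers, so "" is rejected here.
    low = parse_token(start_token or "")
    high = parse_token(end_token or "")
    return "tokens from {} to {}".format(low, high)

print(describe_range(start_key="", end_key=""))
try:
    describe_range(start_token="", end_token="")
except ValueError as err:
    print("rejected:", err)
```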



Re: error using get_range_slice with random partitioner

Posted by Jonathan Ellis <jb...@gmail.com>.

Can you reproduce this starting with a fresh install, with no existing data?




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: error using get_range_slice with random partitioner

Posted by Jonathan Ellis <jb...@gmail.com>.

That's puzzling, because we have a bunch of system tests that do range
scans with randompartitioner.  If you can open a ticket with the code
to reproduce, I'll have a look.  Thanks!




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

RE: error using get_range_slice with random partitioner

Posted by Adam Crain <ad...@greenenergycorp.com>.

I can. I'm using the Debian distro. I assume that all that is required is wiping the data and commitlog directories.

If I do that, I still get the same result.

Here's my CF:

<ColumnFamily Name="Meas" CompareWith="LongType" />

I'm using this to store time-series measurement data, where the keys are measurement names and the column names are Unix epoch timestamps in milliseconds, stored as Longs. My use case is a get_range_slices call that asks for the first X rows but only the most recent measurement in each, using a descending-order column predicate with a limit of 1.

I have no trouble using this predicate to retrieve columns within a specified row, but the get_range_slices call fails.

-Adam
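
To make the intended query concrete, here is a small in-memory Python sketch of "first X rows, newest column only". It does not call Cassandra: the nested dict stands in for the Meas column family, dict insertion order stands in for the partitioner's row order, and latest_per_row is a hypothetical name. Taking the largest timestamp is meant to emulate a reversed slice with a count of 1 on a LongType comparator.

```python
from typing import Dict, List, Tuple

# One row per measurement name; column name is epoch milliseconds, column value is the reading.
Row = Dict[int, float]

def latest_per_row(rows: Dict[str, Row], max_rows: int) -> List[Tuple[str, int, float]]:
    """Return (name, timestamp, value) for the newest column of the first `max_rows` rows."""
    result = []
    for name in list(rows)[:max_rows]:
        columns = rows[name]
        if not columns:
            continue  # a row with no columns has no latest measurement
        newest = max(columns)  # highest timestamp == first column of a reversed slice
        result.append((name, newest, columns[newest]))
    return result

meas = {
    "pump1.flow": {1281020000000: 3.1, 1281020060000: 3.4},
    "pump1.temp": {1281020000000: 71.0},
    "pump2.flow": {},
    "pump2.temp": {1281020030000: 68.5, 1281020090000: 69.0},
}
for name, ts, value in latest_per_row(meas, max_rows=3):
    print(name, ts, value)
```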


RE: error using get_range_slice with random partitioner

Posted by Adam Crain <ad...@greenenergycorp.com>.

I've never changed the partitioner from the default (random). Other ideas?

I can insert and run column queries against a single key, but range queries on the CF fail.

-Adam

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Thursday, August 05, 2010 11:33 AM
To: user@cassandra.apache.org
Subject: Re: error using get_range_slice with random partitioner

Yes, you should be able to use get_range_slices with RP.

This stack trace looks like you changed your partitioner after the
node already had data in it.




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com