
Posted to user@cassandra.apache.org by Shotaro Kamio <ka...@gmail.com> on 2011/02/17 16:35:54 UTC

Inconsistent result in super range slice query (reversed order)

Hi,

We are having trouble with a strange behavior in Cassandra 0.7.2 (it also
happened in 0.7.0). Could someone help us?

The problem happens on a super column family named "Order".
The data structure is something like:
  Order[ a_key ][ date + "/" + order_id + "/" (+ suffix) ][attribute] = value

For example,
 Order[ "100" ][ "20031210022059/190209-20031210-4476885-s/" ]
 is a super column.
Because we want to scan them in latest-first order, we use a range slice
query with reversed order. (The partitioner is
ByteOrderedPartitioner.)
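
To illustrate the naming scheme, here is a minimal Python 3 sketch. It assumes plain byte-wise (ASCII) ordering of super column names; the helper `make_name` is hypothetical and exists only for this illustration. Because the date prefix has a fixed width, lexical order matches chronological order, so reversing the sorted list gives the latest-first order we want.

```python
def make_name(date, order_id, suffix=""):
    """Build a super column name as date + "/" + order_id + "/" (+ suffix)."""
    return date + "/" + order_id + "/" + suffix

# Names taken from the example results in this message, deliberately unsorted.
names = [
    make_name("20031210022059", "190209-20031210-4476885-s", "0"),
    make_name("20031210014347", "190209-20031210-4476668-s"),
    make_name("20031210022059", "190209-20031210-4476885-s"),
    make_name("20031210014347", "190209-20031210-4476668-s", "0"),
]

forward = sorted(names)        # lexical order, oldest first
latest_first = forward[::-1]   # the order a reversed scan should yield

for name in latest_first:
    print(name)
```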

For some super columns in my Cassandra instance, the reversed query returns
no results when it should return some.
For instance,

* Range slice in normal (lexical)-order ( Order[ "100" ] [ from
"20031210" to "20031210022059/190209-20031210-4476885-s/z" ] ) will
return results correctly.

col='20031210014347/190209-20031210-4476668-s/'
col='20031210014347/190209-20031210-4476668-s/0'
col='20031210022059/190209-20031210-4476885-s/'
col='20031210022059/190209-20031210-4476885-s/0'

* Range slice in reversed (latest-first)-order ( Order[ "100" ] [ from
"20031210022059/190209-20031210-4476885-s/z" to  "20031210" ] ) will
return NO result!

Note that the super column name
"20031210022059/190209-20031210-4476885-s/z" doesn't exist. The query
should still work, and it does succeed for other super columns.

* Range slice in reversed (latest-first) order starting from an existing
column name ( Order[ "100" ] [ from
"20031210022059/190209-20031210-4476885-s/0" to "20031210" ] )
returns the expected results.

Both pycassa and Hector show the same behavior for the same column
name. I suspect a logic error in Cassandra.


I'd appreciate any help.


Best regards,
Shotaro

Re: Inconsistent result in super range slice query (reversed order)

Posted by Tyler Hobbs <ty...@datastax.com>.

I checked out #2212 and was able to reproduce the problem.

Thanks for investigating this and putting together a good script to
reproduce!

- Tyler

Re: Inconsistent result in super range slice query (reversed order)

Posted by Shotaro Kamio <ka...@gmail.com>.

Hi Tyler,

Your script doesn't cause the problem, but the problem does occur under
certain conditions.
My colleague analyzed it and found a way to reproduce it.
Please look at the JIRA ticket: https://issues.apache.org/jira/browse/CASSANDRA-2212

Best regards,
Shotaro


On Fri, Feb 18, 2011 at 3:59 PM, Tyler Hobbs <ty...@datastax.com> wrote:
> [...]



-- 
Shotaro Kamio

Re: Inconsistent result in super range slice query (reversed order)

Posted by Tyler Hobbs <ty...@datastax.com>.

I'm unable to reproduce this in pycassa starting with a clean database.  Are
you doing anything else to these rows besides inserting them?

Here's the complete script I'm using below.  Could you confirm that this
causes problems for you?

- Tyler

=========

import sys
import pycassa

pool = pycassa.ConnectionPool('Keyspace1')
cf = pycassa.ColumnFamily(pool, 'Super1')

KEY = 'key'

columns = [
    "20031210020333/190209-20031210-4476807-s/"  , #0
    "20031210020333/190209-20031210-4476807-s/0" , #1
    "20031210021940/190209-20031210-4476883-s/"  , #2
    "20031210021940/190209-20031210-4476883-s/0" , #3
    "20031210022059/190209-20031210-4476885-s/"  , #4
    "20031210022059/190209-20031210-4476885-s/0" , #5
    # <--Problem_around_here.
    "20031210022154/190209-20031210-4476888-s/"  , #6
    "20031210022154/190209-20031210-4476888-s/0"   #7
]

for supercolumn in columns:
    cf.insert(KEY, {supercolumn: {'subcol': 'subval', 'subcol2': 'subval'}})

def get_cols(start_date, end_date, reversed):
    for key, cols in cf.get_range(start = KEY,
                                  finish = KEY,
                                  column_reversed=reversed,
                                  column_count=10000,
                                  column_start=start_date,
                                  column_finish=end_date):
        for supercol, subcols in cols.iteritems():
            print "col='%s' \tlen = %d" % (supercol, len(subcols))

start = 0
for end in [0,3,5,7]:
    print "\nstart %d, end %d + 'z'" % (start, end)
    get_cols(columns[start], columns[end] + 'z', False)

end = 0
for start in [0, 3, 5, 7]:
    print "\nstart %d + 'z', end %d (reversed)" % (start, end)
    # Reversed slice: start from the higher bound and pass reversed=True.
    get_cols(columns[start] + 'z', columns[end], True)


On Thu, Feb 17, 2011 at 11:09 PM, Shotaro Kamio <ka...@gmail.com> wrote:

> [...]



-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library

Re: Inconsistent result in super range slice query (reversed order)

Posted by Shotaro Kamio <ka...@gmail.com>.

Hi Aaron,

By range slice I mean get_range_slices() in the Thrift API,
createSuperSliceQuery in Hector, and get_range() in pycassa. Example
pycassa code is included below.

The problem is a little complicated to explain, so I'll try to
describe it with examples.
Here are the 8 super column names that exist under the specific key,
listed in forward order.

#0: "20031210020333/190209-20031210-4476807-s/"
#1: "20031210020333/190209-20031210-4476807-s/0"
#2: "20031210021940/190209-20031210-4476883-s/"
#3: "20031210021940/190209-20031210-4476883-s/0"
#4: "20031210022059/190209-20031210-4476885-s/"
#5: "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
#6: "20031210022154/190209-20031210-4476888-s/"
#7: "20031210022154/190209-20031210-4476888-s/0"

There is no problem if I use super column names that exist on the key.

* Range from #0 to #3 in forward order -> OK
* Range from #0 to #5 in forward order -> OK
* Range from #0 to #7 in forward order -> OK

* Range from #7 to #0 in reverse order -> OK
* Range from #5 to #0 in reverse order -> OK
* Range from #3 to #0 in reverse order -> OK


However, because I want to scan orders in a certain range, I use
column names with the character "z" appended (it sorts higher than anything in
order_id). Those column names are listed below as #1z, #3z, #5z and
#7z. Note that these super column names don't actually exist on the key.
(#4+ is a column name that sorts between #4 and #5.)

#0 : "20031210020333/190209-20031210-4476807-s/"
#1 : "20031210020333/190209-20031210-4476807-s/0"
#1z: "20031210020333/190209-20031210-4476807-s/z" (doesn't exist)
#2 : "20031210021940/190209-20031210-4476883-s/"
#3 : "20031210021940/190209-20031210-4476883-s/0"
#3z: "20031210021940/190209-20031210-4476883-s/z" (doesn't exist)
#4 : "20031210022059/190209-20031210-4476885-s/"
#4+: "20031210022059/190209-20031210-4476885-s/+" (doesn't exist)
#5 : "20031210022059/190209-20031210-4476885-s/0"  <-- Problem around here.
#5z: "20031210022059/190209-20031210-4476885-s/z" (doesn't exist)
#6 : "20031210022154/190209-20031210-4476888-s/"
#7 : "20031210022154/190209-20031210-4476888-s/0"
#7z: "20031210022154/190209-20031210-4476888-s/z" (doesn't exist)

Then I tried range slices using these names.

* Range from #0 to #3z in forward order -> OK
* Range from #0 to #4+ in forward order -> OK
* Range from #0 to #5z in forward order -> OK
* Range from #0 to #7z in forward order -> OK

* Range from #7z to #0 in reverse order -> OK
* Range from #5z to #0 in reverse order -> FAIL (no result)
* Range from #4+ to #0 in reverse order -> OK
* Range from #3z to #0 in reverse order -> OK

The problem happens in the #5z reverse-order case. No error or warning appears in the Cassandra log.
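
As a cross-check of what these slices should return, here is a minimal in-memory sketch in Python 3. It does not talk to Cassandra: `slice_names` is a hypothetical helper that assumes byte-wise name ordering and inclusive bounds, and it models only the expected result, not the server's code path.

```python
import bisect

# The 8 existing super column names, already in forward (sorted) order.
COLUMNS = [
    "20031210020333/190209-20031210-4476807-s/",   # 0
    "20031210020333/190209-20031210-4476807-s/0",  # 1
    "20031210021940/190209-20031210-4476883-s/",   # 2
    "20031210021940/190209-20031210-4476883-s/0",  # 3
    "20031210022059/190209-20031210-4476885-s/",   # 4
    "20031210022059/190209-20031210-4476885-s/0",  # 5
    "20031210022154/190209-20031210-4476888-s/",   # 6
    "20031210022154/190209-20031210-4476888-s/0",  # 7
]

def slice_names(names, start, finish, reverse=False, count=10000):
    """Model a column slice over sorted names (assumed inclusive bounds).

    Forward:  start <= name <= finish, ascending.
    Reversed: finish <= name <= start, descending.
    An empty bound means "unbounded" on that side.
    """
    low, high = (finish, start) if reverse else (start, finish)
    i = bisect.bisect_left(names, low) if low else 0
    j = bisect.bisect_right(names, high) if high else len(names)
    selected = names[i:j]
    if reverse:
        selected = selected[::-1]
    return selected[:count]

# The failing case: from #5z down to #0 in reverse order.
result = slice_names(COLUMNS, COLUMNS[5] + "z", COLUMNS[0], reverse=True)
print(len(result))  # 6 names expected, from #5 down to #0
for name in result:
    print(name)
```

Under these assumptions, the bounds do not need to exist as columns, so the #5z query should return six super columns rather than none.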

Also, I tried dumping the data to JSON with sstable2json and restoring it
with json2sstable, but the same problem occurs.


The code I used for the test is something like this.
----------------------
import sys
import pycassa

# KEYSPACE, CASSANDRA_HOST, COLUMN_FAMILY and A_KEY are placeholders
# (not defined in this snippet).
client = pycassa.connect(KEYSPACE, [ CASSANDRA_HOST ])
cf = pycassa.ColumnFamily(client, COLUMN_FAMILY)

columns = [
"20031210020333/190209-20031210-4476807-s/"  , #0
"20031210020333/190209-20031210-4476807-s/0" , #1
"20031210021940/190209-20031210-4476883-s/"  , #2
"20031210021940/190209-20031210-4476883-s/0" , #3
"20031210022059/190209-20031210-4476885-s/"  , #4
"20031210022059/190209-20031210-4476885-s/0" , #5
# <--Problem_around_here.
"20031210022154/190209-20031210-4476888-s/"  , #6
"20031210022154/190209-20031210-4476888-s/0"   #7
]

reversed = False
if len(sys.argv) > 1:
    # Use reversed order if the "-r" option is given; "-f" or anything else
    # means forward order. With no option, all column names are listed.
    reversed = (sys.argv[1] == '-r')

    start_date = columns[0]
    end_date  = columns[7] + "z" # add "z" to make problem.

    if reversed:
        # A reversed slice starts from the higher bound.
        start_date, end_date = end_date, start_date
else:
    start_date = end_date = ''

print "start_date =", start_date, "end_date =", end_date, "reversed = ", reversed

for it in cf.get_range(start = A_KEY, finish = A_KEY,
                       column_reversed=reversed, column_count=10000,
                       column_start=start_date, column_finish=end_date):
    for d in it[1].iteritems():
        print "col='%s', len = %d" % (d[0], len(d[0]))

-------------------------


Regards,
Shotaro




On Fri, Feb 18, 2011 at 5:19 AM, Aaron Morton <aa...@thelastpickle.com> wrote:
> [...]



-- 
Shotaro Kamio

Re: Inconsistent result in super range slice query (reversed order)

Posted by Aaron Morton <aa...@thelastpickle.com>.

First, some terminology: when you say range slice, do you mean getting multiple rows? Or do you mean get_slice, where you return multiple super columns from one row?

Your examples look like you want to get multiple super columns from one row, in which case the choice of partitioner is not important. The comparator and sub-comparator specified in the CF definition control the ordering of columns. If possible, I would suggest using the random partitioner.

Could you provide examples of how you are doing the queries using pycassa? We may be able to help.

My initial guess is that the ranges you specify for the query are not correct when using ASCII ordering for column names, e.g.:

20031210 < 20031210022059/190209-20031210-4476885-s/z is true

20031210022059/190209-20031210-4476885-s/z < 20031210 is not true

Try appending the highest-value ASCII character to the end of 20031210.
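
A quick way to check these comparisons is with Python 3 byte strings, which compare byte by byte. This is a sketch that assumes the column family uses a byte-wise or ASCII comparator, which has not been confirmed here. The tilde stands in for "the highest-value ASCII character" because it is the highest printable one.

```python
low = b"20031210"
high = b"20031210022059/190209-20031210-4476885-s/z"

print(low < high)   # True: a proper prefix sorts before the longer name
print(high < low)   # False

# Appending a high-valued ASCII character to the short bound moves it past
# every name that starts with "20031210" followed by a digit.
padded = low + b"~"   # "~" is 0x7e, the highest printable ASCII character
print(high < padded)  # True: "0" (0x30) sorts below "~" (0x7e)
```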

Cheers
Aaron

On 18/02/2011, at 4:35 AM, Shotaro Kamio <ka...@gmail.com> wrote:

> [...]