You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by "John R. Frank" <jr...@mit.edu> on 2013/06/06 04:21:21 UTC

smallest/largest UUIDs for LexicalUUIDType

C*,

I'm trying to use composite column names to organize 10**8 records.  Each 
record has a unique pair of UUIDs.  The first UUID is often repeated, so I 
want to use column_start and column_finish to find all the records that 
have a given UUID as the first UUID in the pair.

I thought a simple way to get *all* of the columns would be to use

  start  = uuid.UUID(int=0)        -> 00000000-0000-0000-0000-000000000000
  finish = uuid.UUID(int=2**128-1) -> ffffffff-ffff-ffff-ffff-ffffffffffff

But strangely, this fails to find *any* of the columns, and it requires 
that column_reversed=True -- otherwise it raises an error about range 
finish not coming after start.  If I use ints that are much larger/smaller 
than these extremes, then reversed is not required!

Can anyone explain why LexicalUUIDType() does not treat these extremal 
UUIDs like other UUIDs?


Using pycassa v1.9:

     sm = SystemManager(chosen_server)
     sm.create_keyspace(namespace, SIMPLE_STRATEGY, {'replication_factor': '1'})

     family = 'test'
     sm.create_column_family(
         namespace, family, super=False,
         key_validation_class = ASCII_TYPE,
         default_validation_class = BYTES_TYPE,
         comparator_type=CompositeType(LexicalUUIDType(), LexicalUUIDType()),
         #column_name_class = LEXICAL_UUID_TYPE
         )

     pool = ConnectionPool(namespace, config['storage_addresses'],
                           max_retries=1000, pool_timeout=10, pool_size=2, timeout=120)

     cf = pycassa.ColumnFamily(pool, family)
     u1, u2, u3, u4 = uuid.uuid1(), uuid.uuid1(), uuid.uuid1(), uuid.uuid1()

     cf.insert('inbound', {(u3, u3): b'43'})

     start  = uuid.UUID(int=0)
     finish = uuid.UUID(int=2**128-1)
     print start
     print finish
     assert start < u3 < finish
     print cf.get('inbound', column_start=(start,), column_finish=(finish,), column_reversed=True).items()

     sm.close()



Raises:
             if len(list_col_or_super) == 0:
>               raise NotFoundException()
E               NotFoundException: NotFoundException(_message=None)

../../ve/local/lib/python2.7/site-packages/pycassa/columnfamily.py:663: NotFoundException



jrf


Re: smallest/largest UUIDs for LexicalUUIDType

Posted by "John R. Frank" <jr...@mit.edu>.
> Follow-up question:  it seems that range queries on the *second* field 
> of a CompositeType(UUIDType(), UUIDType()) do not work.

If I concatenate the two UUID.hex values into a 32-character string 
instead of a CompositeType of two UUIDs, then range queries work 
correctly.

This is illustrated below... so the question is:  what is the point of a 
CompositeType if range queries only work on the first field?  Is it just a 
convenience class for keeping things strongly typed and cleanly organized, 
or did I break something in the way I setup CompositeType in the example 
earlier in this thread?


def join_uuids(*uuids):
     return ''.join(map(attrgetter('hex'), uuids))

def split_uuids(uuid_str):
     return map(lambda s: uuid.UUID(hex=''.join(s)), grouper(uuid_str, 32))

def grouper(iterable, n, fillvalue=None):
     "Collect data into fixed-length chunks or blocks"
     # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
     args = [iter(iterable)] * n
     return itertools.izip_longest(fillvalue=fillvalue, *args)

def test_composite_column_names_second_level_range_query_with_decomposited_keys():
     '''
     check that we can execute range queries on the second part of a
     CompositeType column name after we unpack the composite key into a
     long string of concatenated hex forms of the UUIDs
     '''
     sm = SystemManager(chosen_server)
     sm.create_keyspace(namespace, SIMPLE_STRATEGY, {'replication_factor': '1'})

     family = 'test'
     sm.create_column_family(
         namespace, family, super=False,
         key_validation_class = ASCII_TYPE,
         default_validation_class = BYTES_TYPE,
         comparator_type=UTF8Type(),
         )
     pool = ConnectionPool(namespace, config['storage_addresses'],
                           max_retries=1000, pool_timeout=10, pool_size=2, timeout=120)

     cf = pycassa.ColumnFamily(pool, family)
     u1, u2, u3, u4 = uuid.uuid1(), uuid.uuid1(), uuid.uuid1(), uuid.uuid1()

     cf.insert('inbound', {join_uuids(u1, u2): b''})
     cf.insert('inbound', {join_uuids(u1, u3): b''})
     cf.insert('inbound', {join_uuids(u1, u4): b''})

     ## test range searching
     start  = uuid.UUID(int=u3.int - 1)
     finish = uuid.UUID(int=u3.int + 1)
     assert start.int < u3.int < finish.int
     rec3 = cf.get('inbound',
                   column_start =join_uuids(u1, start),
                   column_finish=join_uuids(u1, finish)).items()
     assert len(rec3) == 1
     assert split_uuids(rec3[0][0])[1] == u3
     ####  This assert above passes!

     ####  This next part fails :-/
     ## now insert many rows -- enough that some should fall in each
     ## subrange below
     for i in xrange(1000):
         cf.insert('inbound', {join_uuids(u1, uuid.uuid4()): b''})

     ## do four ranges, and expect more than zero in each
     step_size = 2**(128 - 2)
     for i in range(2**2, 0, -1):
         start =  uuid.UUID(int=(i-1) * step_size)
         finish = uuid.UUID(int=min(i * step_size, 2**128 - 1))
         recs = cf.get('inbound',
                       column_start =join_uuids(u1, start),
                       column_finish=join_uuids(u1, finish)).items()
         for key, val in recs:
             key = split_uuids(key)
             assert val == b''
             assert key[0] == u1
             assert key[1] < finish
             assert start < key[1]   ## this passes!! (fails with CompositeType...)

         assert len(recs) > 0
         print len(recs), ' for ', start, finish

     sm.close()

Re: smallest/largest UUIDs for LexicalUUIDType

Posted by "John R. Frank" <jr...@mit.edu>.
> I'll note that if you have the choice, you can use UUIDType rather than 
> LexicalUUIDType. UUIDType fixes that behavior and use a proper lexical 
> comparison for non-type-1 uuids (the other behavior of UUIDType is that 
> for type 1 uuid, it compares them by time first, i.e. it is equivalent 
> to TimeUUIDType for type 1 uuid).

That's super helpful, thanks!

Follow-up question:  it seems that range queries on the *second* field of 
a CompositeType(UUIDType(), UUIDType()) do not work.

What am I missing?

I thought that creating 1000 column names with the *same* UUID in the 
first field of the CompositeType and a random UUID in the second field 
would make it so I can get out four subranges of the second UUID by 
constructing four evenly spaced start/finish keys -- see below.

Instead, it appears that all of the rows end up in the last subrange. 
See assert below marked "this fails"

Advice on this much appreciated -- jrf


     sm = SystemManager(chosen_server)
     sm.create_keyspace(namespace, SIMPLE_STRATEGY, {'replication_factor': '1'})

     family = 'test'
     sm.create_column_family(
         namespace, family, super=False,
         key_validation_class = ASCII_TYPE,
         default_validation_class = BYTES_TYPE,
         comparator_type=CompositeType(UUIDType(), UUIDType()),
         )
     pool = ConnectionPool(namespace, config['storage_addresses'],
                           max_retries=1000, pool_timeout=10, pool_size=2,
                           timeout=120)

     cf = pycassa.ColumnFamily(pool, family)
     u1, u2, u3, u4 = uuid.uuid1(), uuid.uuid1(), uuid.uuid1(), uuid.uuid1()

     cf.insert('inbound', {(u1, u2): b''})
     cf.insert('inbound', {(u1, u3): b''})
     cf.insert('inbound', {(u1, u4): b''})

     ## test range searching
     start  = uuid.UUID(int=u3.int - 1)
     finish = uuid.UUID(int=u3.int + 1)
     assert start.int < u3.int < finish.int
     rec3 = cf.get('inbound',
                   column_start =(u1, start),
                   column_finish=(u1, finish)).items()
     assert len(rec3) == 1
     assert rec3[0][0][1] == u3
     ####  This assert above passes!

     ####  This next part fails :-/
     ## now insert many rows -- enough that some should fall in each
     ## subrange below
     for i in xrange(1000):
         cf.insert('inbound', {(u1, uuid.uuid4()): b''})

     ## do four ranges, and expect more than zero in each
     step_size = 2**(128 - 2)

     ## go in reverse order to illustrate that they are all at the end!!!?
     for i in range(2**2, 0, -1):
         start =  uuid.UUID(int=(i-1) * step_size)
         finish = uuid.UUID(int=min(i * step_size, 2**128 - 1))
         recs = cf.get('inbound',
                       column_start =(u1, start),
                       column_finish=(u1, finish)).items()
         for key, val in recs:
             assert val == b''
             assert start < key[1] < finish              ## this fails!!

         assert len(recs) > 0
         print len(recs), ' for ', start, finish

     sm.close()

Re: smallest/largest UUIDs for LexicalUUIDType

Posted by Sylvain Lebresne <sy...@datastax.com>.
> I'm trying to use composite column names to organize 10**8 records.  Each
> record has a unique pair of UUIDs.  The first UUID is often repeated, so I
> want to use column_start and column_finish to find all the records that
> have a given UUID as the first UUID in the pair.
>
> I thought a simple way to get *all* of the columns would be to use
>
>  start  = uuid.UUID(int=0)        -> 00000000-0000-0000-0000-**
> 000000000000
>  finish = uuid.UUID(int=2**128-1) -> ffffffff-ffff-ffff-ffff-**
> ffffffffffff
>
> But strangely, this fails to find *any* of the columns, and it requires
> that column_reversed=True -- otherwise it raises an error about range
> finish not coming after start.  If I use ints that are much larger/smaller
> than these extremes, then reversed is not required!
>
> Can anyone explain why LexicalUUIDType() does not treat these extremal
> UUIDs like other UUIDs?
>

LexicalUUIDType compares the uuid using Java UUID compare method (
http://docs.oracle.com/javase/6/docs/api/java/util/UUID.html#compareTo(java.util.UUID)
).

As it happens this method consider a UUID as 2 longs and when comparing 2
uuids, it compares those longs lexicographically.
But java longs are signed. So for that
method, 00000000-0000-0000-0000-000000000000
> ffffffff-ffff-ffff-ffff-ffffffffffff (but for
instance, 00000000-0000-0000-0000-000000000000 <
7fffffff-ffff-ffff-ffff-ffffffffffff (because the first "long" of that 2nd
uuid is now positive)).

That's an historical accident, LexicalUUIDType should probably not have use
that comparison as it's arguably not very intuitive. However it's too late
to change it (as changing it now would basically corrupt all data for
people using LexicalUUIDType today).

I'll note that if you have the choice, you can use UUIDType rather than
LexicalUUIDType. UUIDType fixes that behavior and use a proper lexical
comparison for non-type-1 uuids (the other behavior of UUIDType is that for
type 1 uuid, it compares them by time first, i.e. it is equivalent to
TimeUUIDType for type 1 uuid).

--
Sylvain