You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by "John R. Frank" <jr...@mit.edu> on 2013/06/06 04:21:21 UTC
smallest/largest UUIDs for LexicalUUIDType
C*,
I'm trying to use composite column names to organize 10**8 records. Each
record has a unique pair of UUIDs. The first UUID is often repeated, so I
want to use column_start and column_finish to find all the records that
have a given UUID as the first UUID in the pair.
I thought a simple way to get *all* of the columns would be to use
start = uuid.UUID(int=0) -> 00000000-0000-0000-0000-000000000000
finish = uuid.UUID(int=2**128-1) -> ffffffff-ffff-ffff-ffff-ffffffffffff
But strangely, this fails to find *any* of the columns, and it requires
that column_reversed=True -- otherwise it raises an error about range
finish not coming after start. If I use ints that are much larger/smaller
than these extremes, then reversed is not required!
Can anyone explain why LexicalUUIDType() does not treat these extremal
UUIDs like other UUIDs?
Using pycassa v1.9:
sm = SystemManager(chosen_server)
sm.create_keyspace(namespace, SIMPLE_STRATEGY, {'replication_factor': '1'})
family = 'test'
sm.create_column_family(
namespace, family, super=False,
key_validation_class = ASCII_TYPE,
default_validation_class = BYTES_TYPE,
comparator_type=CompositeType(LexicalUUIDType(), LexicalUUIDType()),
#column_name_class = LEXICAL_UUID_TYPE
)
pool = ConnectionPool(namespace, config['storage_addresses'],
max_retries=1000, pool_timeout=10, pool_size=2, timeout=120)
cf = pycassa.ColumnFamily(pool, family)
u1, u2, u3, u4 = uuid.uuid1(), uuid.uuid1(), uuid.uuid1(), uuid.uuid1()
cf.insert('inbound', {(u3, u3): b'43'})
start = uuid.UUID(int=0)
finish = uuid.UUID(int=2**128-1)
print start
print finish
assert start < u3 < finish
print cf.get('inbound', column_start=(start,), column_finish=(finish,), column_reversed=True).items()
sm.close()
Raises:
if len(list_col_or_super) == 0:
> raise NotFoundException()
E NotFoundException: NotFoundException(_message=None)
../../ve/local/lib/python2.7/site-packages/pycassa/columnfamily.py:663: NotFoundException
jrf
Re: smallest/largest UUIDs for LexicalUUIDType
Posted by "John R. Frank" <jr...@mit.edu>.
> Follow-up question: it seems that range queries on the *second* field
> of a CompositeType(UUIDType(), UUIDType()) do not work.
If I concatenate the two UUID.hex values into a 32-character string
instead of a CompositeType of two UUIDs, then range queries work
correctly.
This is illustrated below... so the question is: what is the point of a
CompositeType if range queries only work on the first field? Is it just a
convenience class for keeping things strongly typed and cleanly organized,
or did I break something in the way I setup CompositeType in the example
earlier in this thread?
def join_uuids(*uuids):
return ''.join(map(attrgetter('hex'), uuids))
def split_uuids(uuid_str):
return map(lambda s: uuid.UUID(hex=''.join(s)), grouper(uuid_str, 32))
def grouper(iterable, n, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
args = [iter(iterable)] * n
return itertools.izip_longest(fillvalue=fillvalue, *args)
def test_composite_column_names_second_level_range_query_with_decomposited_keys():
'''
check that we can execute range queries on the second part of a
CompositeType column name after we unpack the composite key into a
long string of concatenated hex forms of the UUIDs
'''
sm = SystemManager(chosen_server)
sm.create_keyspace(namespace, SIMPLE_STRATEGY, {'replication_factor': '1'})
family = 'test'
sm.create_column_family(
namespace, family, super=False,
key_validation_class = ASCII_TYPE,
default_validation_class = BYTES_TYPE,
comparator_type=UTF8Type(),
)
pool = ConnectionPool(namespace, config['storage_addresses'],
max_retries=1000, pool_timeout=10, pool_size=2, timeout=120)
cf = pycassa.ColumnFamily(pool, family)
u1, u2, u3, u4 = uuid.uuid1(), uuid.uuid1(), uuid.uuid1(), uuid.uuid1()
cf.insert('inbound', {join_uuids(u1, u2): b''})
cf.insert('inbound', {join_uuids(u1, u3): b''})
cf.insert('inbound', {join_uuids(u1, u4): b''})
## test range searching
start = uuid.UUID(int=u3.int - 1)
finish = uuid.UUID(int=u3.int + 1)
assert start.int < u3.int < finish.int
rec3 = cf.get('inbound',
column_start =join_uuids(u1, start),
column_finish=join_uuids(u1, finish)).items()
assert len(rec3) == 1
assert split_uuids(rec3[0][0])[1] == u3
#### This assert above passes!
#### This next part fails :-/
## now insert many rows -- enough that some should fall in each
## subrange below
for i in xrange(1000):
cf.insert('inbound', {join_uuids(u1, uuid.uuid4()): b''})
## do four ranges, and expect more than zero in each
step_size = 2**(128 - 2)
for i in range(2**2, 0, -1):
start = uuid.UUID(int=(i-1) * step_size)
finish = uuid.UUID(int=min(i * step_size, 2**128 - 1))
recs = cf.get('inbound',
column_start =join_uuids(u1, start),
column_finish=join_uuids(u1, finish)).items()
for key, val in recs:
key = split_uuids(key)
assert val == b''
assert key[0] == u1
assert key[1] < finish
assert start < key[1] ## this passes!! (fails with CompositeType...)
assert len(recs) > 0
print len(recs), ' for ', start, finish
sm.close()
Re: smallest/largest UUIDs for LexicalUUIDType
Posted by "John R. Frank" <jr...@mit.edu>.
> I'll note that if you have the choice, you can use UUIDType rather than
> LexicalUUIDType. UUIDType fixes that behavior and use a proper lexical
> comparison for non-type-1 uuids (the other behavior of UUIDType is that
> for type 1 uuid, it compares them by time first, i.e. it is equivalent
> to TimeUUIDType for type 1 uuid).
That's super helpful, thanks!
Follow-up question: it seems that range queries on the *second* field of
a CompositeType(UUIDType(), UUIDType()) do not work.
What am I missing?
I thought that creating 1000 column names with the *same* UUID in the
first field of the CompositeType and a random UUID in the second field
would make it so I can get out four subranges of the second UUID by
constructing four evenly spaced start/finish keys -- see below.
Instead, it appears that all of the rows end up in the last subrange.
See assert below marked "this fails"
Advice on this much appreciated -- jrf
sm = SystemManager(chosen_server)
sm.create_keyspace(namespace, SIMPLE_STRATEGY, {'replication_factor': '1'})
family = 'test'
sm.create_column_family(
namespace, family, super=False,
key_validation_class = ASCII_TYPE,
default_validation_class = BYTES_TYPE,
comparator_type=CompositeType(UUIDType(), UUIDType()),
)
pool = ConnectionPool(namespace, config['storage_addresses'],
max_retries=1000, pool_timeout=10, pool_size=2,
timeout=120)
cf = pycassa.ColumnFamily(pool, family)
u1, u2, u3, u4 = uuid.uuid1(), uuid.uuid1(), uuid.uuid1(), uuid.uuid1()
cf.insert('inbound', {(u1, u2): b''})
cf.insert('inbound', {(u1, u3): b''})
cf.insert('inbound', {(u1, u4): b''})
## test range searching
start = uuid.UUID(int=u3.int - 1)
finish = uuid.UUID(int=u3.int + 1)
assert start.int < u3.int < finish.int
rec3 = cf.get('inbound',
column_start =(u1, start),
column_finish=(u1, finish)).items()
assert len(rec3) == 1
assert rec3[0][0][1] == u3
#### This assert above passes!
#### This next part fails :-/
## now insert many rows -- enough that some should fall in each
## subrange below
for i in xrange(1000):
cf.insert('inbound', {(u1, uuid.uuid4()): b''})
## do four ranges, and expect more than zero in each
step_size = 2**(128 - 2)
## go in reverse order to illustrate that they are all at the end!!!?
for i in range(2**2, 0, -1):
start = uuid.UUID(int=(i-1) * step_size)
finish = uuid.UUID(int=min(i * step_size, 2**128 - 1))
recs = cf.get('inbound',
column_start =(u1, start),
column_finish=(u1, finish)).items()
for key, val in recs:
assert val == b''
assert start < key[1] < finish ## this fails!!
assert len(recs) > 0
print len(recs), ' for ', start, finish
sm.close()
Re: smallest/largest UUIDs for LexicalUUIDType
Posted by Sylvain Lebresne <sy...@datastax.com>.
> I'm trying to use composite column names to organize 10**8 records. Each
> record has a unique pair of UUIDs. The first UUID is often repeated, so I
> want to use column_start and column_finish to find all the records that
> have a given UUID as the first UUID in the pair.
>
> I thought a simple way to get *all* of the columns would be to use
>
> start = uuid.UUID(int=0) -> 00000000-0000-0000-0000-**
> 000000000000
> finish = uuid.UUID(int=2**128-1) -> ffffffff-ffff-ffff-ffff-**
> ffffffffffff
>
> But strangely, this fails to find *any* of the columns, and it requires
> that column_reversed=True -- otherwise it raises an error about range
> finish not coming after start. If I use ints that are much larger/smaller
> than these extremes, then reversed is not required!
>
> Can anyone explain why LexicalUUIDType() does not treat these extremal
> UUIDs like other UUIDs?
>
LexicalUUIDType compares the uuid using Java UUID compare method (
http://docs.oracle.com/javase/6/docs/api/java/util/UUID.html#compareTo(java.util.UUID)
).
As it happens this method consider a UUID as 2 longs and when comparing 2
uuids, it compares those longs lexicographically.
But java longs are signed. So for that
method, 00000000-0000-0000-0000-000000000000
> ffffffff-ffff-ffff-ffff-ffffffffffff (but for
instance, 00000000-0000-0000-0000-000000000000 <
7fffffff-ffff-ffff-ffff-ffffffffffff (because the first "long" of that 2nd
uuid is now positive)).
That's an historical accident, LexicalUUIDType should probably not have use
that comparison as it's arguably not very intuitive. However it's too late
to change it (as changing it now would basically corrupt all data for
people using LexicalUUIDType today).
I'll note that if you have the choice, you can use UUIDType rather than
LexicalUUIDType. UUIDType fixes that behavior and use a proper lexical
comparison for non-type-1 uuids (the other behavior of UUIDType is that for
type 1 uuid, it compares them by time first, i.e. it is equivalent to
TimeUUIDType for type 1 uuid).
--
Sylvain