Posted to user@pig.apache.org by Dan Feldman <hr...@gmail.com> on 2012/03/24 02:24:41 UTC

Slice columns by TimeUUID while loading to Pig

Hi everyone,

I have a Cassandra super column family (SCF) in which each super column's
name is a TimeUUID, assigned dynamically at the time the super column is
inserted into the database:

create column family CF
  with key_validation_class = UTF8Type
  and comparator = TimeUUIDType
  and subcomparator = UTF8Type
  and column_type = 'Super';

Now, I'm trying to write a Pig script that would automatically calculate
the number of new super columns added to the database during a specified
period of time (let's say, in the last hour). For that, I thought it would
be nice to be able to do something along the lines of:

last_hour_data = LOAD
'cassandra://Keyspace/ColumnFamily&slice_start=Time(one hour
ago)&slice_end=Time(now)' USING CassandraStorage()...

However,
1) I'm not sure what that "Time(one hour ago)" and "Time(now)" syntax is
(so that it would translate those times into TimeUUIDs that cassandra
understands) and
2) The LOAD line above, which I took from the bottom of
http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/README.txt,
produces an error because it treats 'CF&slice_start...' as one gigantic
column family name (which of course does not exist).
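
For point 1, the time bounds would first have to be turned into version 1
(time-based) UUIDs. A minimal Python sketch of that conversion, run outside
Pig, follows. The function name time_uuid_from_datetime and the zeroed clock
sequence and node are illustrative assumptions, and how CassandraStorage
expects such a value to be encoded in the LOAD URL is not covered here.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def time_uuid_from_datetime(dt, clock_seq=0, node=0):
    """Build a version 1 UUID whose timestamp field encodes `dt` (timezone-aware).

    clock_seq and node default to 0 (an assumption made for this sketch) so the
    result can act as a lower bound for the given instant.
    """
    delta = dt - datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Integer arithmetic avoids floating-point rounding of the timestamp.
    unix_100ns = (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    ts = unix_100ns + UUID_EPOCH_OFFSET

    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000          # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80   # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("slice_start:", time_uuid_from_datetime(now - timedelta(hours=1)))
    print("slice_end:  ", time_uuid_from_datetime(now))
```

The two printed values are the candidate bounds for "one hour ago" and "now".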


Alternatively, I could try generating my specified range of columns in Pig
after loading the whole database. But looking at the data, the super column
names look like 'S.?,uF?    ?B#q'    or    '    ??VuI??-gFd?' instead of
"normal-looking" UUIDs like '275564bc4f52f81573b4cfe0ea615ae0', even when I
try to load the super column names as chararrays. I'm thinking it's because
the stored binary representation of a UUID differs from its string
representation, but is there a way to load it into Pig the "normal-looking" way?
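
The unreadable names are consistent with the raw 16-byte form of a UUID being
displayed as text. As a sketch of how the two forms relate (plain Python, not
a Pig UDF; readable_uuid and uuid_timestamp are hypothetical helper names):

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def readable_uuid(raw):
    """Turn the 16 raw bytes of a UUID into its 32-character hex form."""
    return uuid.UUID(bytes=raw).hex


def uuid_timestamp(raw):
    """Return the UTC time embedded in a time-based (version 1) UUID."""
    u = uuid.UUID(bytes=raw)
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    unix_100ns = u.time - UUID_EPOCH_OFFSET
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=unix_100ns // 10)


if __name__ == "__main__":
    raw = uuid.uuid1().bytes          # stand-in for a super column name read as bytes
    print(readable_uuid(raw), uuid_timestamp(raw))
```

Doing the same inside Pig would need a UDF that applies this kind of
conversion to the column name bytes.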


Thank you in advance for your time!
Dan F.

Re: Slice columns by TimeUUID while loading to Pig

Posted by Dan Feldman <hr...@gmail.com>.
Well, I suppose I should say how we "solved" the problem (but not really)
in case someone in the future runs into a similar issue...

Because writing a UDF wasn't really an option and there wasn't too much
data in Cassandra anyway, we decided to generate our own time-based ids and
use them as (super) column names stored in UTF8Type format: before each
store, our server running Ruby takes the current timestamp in YYMMDDHHMMSS
format and appends a hyphen and an 8-character randomly generated string,
so that each super column name looks something like this

120326134516-DLFPASDF

and is pretty much guaranteed to be unique.
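
The naming scheme can be sketched as follows. This is a Python illustration
rather than the Ruby code actually running on the server; make_column_name is
a hypothetical name, and drawing the suffix from uppercase letters only is an
assumption based on the example above.

```python
import random
import string
from datetime import datetime


def make_column_name(now=None, rng=random):
    """Return 'YYMMDDHHMMSS-XXXXXXXX': a timestamp plus 8 random uppercase letters."""
    now = now or datetime.now()
    suffix = "".join(rng.choice(string.ascii_uppercase) for _ in range(8))
    return now.strftime("%y%m%d%H%M%S") + "-" + suffix


if __name__ == "__main__":
    print(make_column_name())
```

Under that assumption there are 26^8 (about 2 * 10^11) possible suffixes per
second, which is why a collision is unlikely, though not impossible.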

Since this is a string, I can then slice the columns when importing last
hour's data into pig for analysis every hour or so:

%DECLARE now `date`;
%DECLARE S `date -d "$now - 1 hour" "+%y%m%d%H%M%S-00000000"`;

data = LOAD
'cassandra://Keyspace/ColumnFamily?slice_start=$S&comparator=AsciiType'
USING CassandraStorage() AS (key, columns: bag{(t:chararray, subcolumns:
bag{(name, value)})});
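
The slice works because fixed-width timestamp strings sort in the same order
as the times they encode. A small Python sketch of that property, assuming the
two-digit-year YYMMDDHHMMSS format of the example name above (slice_start_for
and in_last_window are hypothetical names):

```python
from datetime import datetime, timedelta


def slice_start_for(now, window=timedelta(hours=1)):
    """Smallest column name that can belong to the window ending at `now`.

    The '-00000000' suffix sorts before any suffix made of letters, so every
    name generated at or after the cutoff second compares >= this bound.
    """
    return (now - window).strftime("%y%m%d%H%M%S") + "-00000000"


def in_last_window(name, now):
    """Plain string comparison is enough: equal-width digits sort chronologically."""
    return name >= slice_start_for(now)


if __name__ == "__main__":
    now = datetime(2012, 3, 26, 14, 0, 0)
    print(slice_start_for(now))
    print(in_last_window("120326134516-DLFPASDF", now))
```

The ordering breaks down if the timestamp width changes, for example by
mixing two-digit and four-digit years.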

Clearly, this is an ugly "fix": as traffic on our website grows, I'm
pretty sure our Rails server won't be able to handle concurrent calls to
`date` as well as Cassandra could. BUT, as it turns out, in a production
environment an ugly fix now, as opposed to a better fix later, is the only
way to go.


Dan F.
