Posted to user@cassandra.apache.org by William Oberman <ob...@civicscience.com> on 2011/06/15 20:17:02 UTC

prep for cassandra storage from pig

I think I'm stuck on typing issues trying to store data in cassandra.  To
verify, cassandra wants (key, {tuples})

My pig script is fairly brief:
raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});
--columns == timeUUID -> JSON
rows = FOREACH raw GENERATE key, FLATTEN(columns);
alias_target_day = FOREACH rows {
    --I wrote a specialized parser that does exactly what I need
    observation_map = com.civicscience.pig.ParseObservation($2);
    GENERATE $0 as alias, observation_map#'_fqt' as target,
observation_map#'_day' as day;
};
grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
COUNT($1)) as day_count;

This gets me:
(targetA, (day1, count))
(targetA, (day2, count))
(targetB, (day1, count))
....

But, cassandra wants the 2nd item to be a bag.  So, I tried:
X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1,
COUNT($1))) as day_count;

But this results in:
(targetA, {((day1, count))})
(targetA, {((day2, count))})
(targetB, {((day1, count))})
It's hard to see, but the bag in the 2nd item now holds a tuple nested inside
another tuple, ((day1, count)) rather than (day1, count), which is still bad.
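
To make the two shapes easier to compare, here is a minimal Python model (not
Pig): tuples are Python tuples and a bag is a list of tuples. The function
wrap_each_argument is a hypothetical stand-in that assumes TOBAG places each
argument inside a fresh one-field tuple, which would account for the output
above; the real behavior may differ between Pig versions.

```python
# Minimal model of the shapes in this thread; this is not Pig code.
# A Pig tuple is modeled as a Python tuple and a bag as a list of tuples.

def wrap_each_argument(*args):
    """Hypothetical stand-in for the observed TOBAG behavior: every
    argument becomes the only field of a new tuple inside the bag."""
    return [(arg,) for arg in args]

def nesting_depth(value):
    """Count how many tuple layers surround the first field."""
    depth = 0
    while isinstance(value, tuple) and value:
        depth += 1
        value = value[0]
    return depth

day_count = ("day1", 3)                    # (day1, count)
observed = wrap_each_argument(day_count)   # models {((day1, count))}
wanted = [day_count]                       # models {(day1, count)}

print(observed)
print(wanted)
print(nesting_depth(observed[0]), nesting_depth(wanted[0]))
```

The extra layer shows up as a nesting depth of 2 for the observed bag element
versus 1 for the wanted one.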

How do I get (key, {tuple})???  I wasn't sure where to post this (pig or
cassandra), so I'm posting to the pig list too.

will

Re: prep for cassandra storage from pig

Posted by William Oberman <ob...@civicscience.com>.
I'll do a reply all, to keep this more consistent (sorry!).

Rather than staying stuck, I wrote a custom function: TupleToBagOfTuple. I'm
curious if I could have avoided it with proper pig scripting though.
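
The source of TupleToBagOfTuple is not shown in this thread. As a rough sketch
of what a function with that name presumably does, the Python model below
(tuples as Python tuples, bags as lists; the snake_case name and the type
check are assumptions) puts the incoming tuple into a bag as-is instead of
wrapping it in another tuple.

```python
# Model only: the real TupleToBagOfTuple is a Pig UDF whose code is not shown.

def tuple_to_bag_of_tuple(value):
    """Return a one-element bag (list) holding the tuple unchanged."""
    if not isinstance(value, tuple):
        raise TypeError("expected a tuple, got %s" % type(value).__name__)
    return [value]

row = ("targetA", ("day1", 3))    # models (targetA, (day1, count))
key, day_count = row
stored = (key, tuple_to_bag_of_tuple(day_count))
print(stored)  # ('targetA', [('day1', 3)])
```

The result models (targetA, {(day1, count)}), the (key, {tuple}) shape asked
for in the original message.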

On Wed, Jun 15, 2011 at 3:08 PM, William Oberman
<ob...@civicscience.com> wrote:

> My problem is the column names are dynamic (a date), and pygmalion seems to
> want the column names to be fixed at "compile time" (the script).
>
>
> On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna <je...@gmail.com>wrote:
>
>> Hi Will,
>>
>> That's partly why I like to use FromCassandraBag and ToCassandraBag from
>> pygmalion - it does the work for you to get it back into a form that
>> cassandra understands.
>>
>> Others may know better how to massage the data into that form using just
>> pig, but if all else fails, you could write a udf to do that.
>>
>> Jeremy
>>
>> On Jun 15, 2011, at 1:17 PM, William Oberman wrote:
>>
>> > I think I'm stuck on typing issues trying to store data in cassandra.
>>  To verify, cassandra wants (key, {tuples})
>> >
>> > My pig script is fairly brief:
>> > raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
>> (key:chararray, columns:bag {column:tuple (name, value)});
>> > --colums == timeUUID -> JSON
>> > rows = FOREACH raw GENERATE key, FLATTEN(columns);
>> > alias_target_day = FOREACH rows {
>> >     --I wrote a specialized parser that does exactly what I need
>> >     observation_map = com.civicscience.pig.ParseObservation($2);
>> >     GENERATE $0 as alias, observation_map#'_fqt' as target,
>> observation_map#'_day' as day;
>> > };
>> > grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
>> > X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
>> COUNT($1)) as day_count;
>> >
>> > This gets me:
>> > (targetA, (day1, count))
>> > (targetA, (day2, count))
>> > (targetB, (day1, count))
>> > ....
>> >
>> > But, cassandra wants the 2nd item to be a bag.  So, I tried:
>> > X = FOREACH grouping GENERATE group.$0 as target,
>> TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count;
>> >
>> > But this results in:
>> > (targetA, {((day1, count))})
>> > (targetA, {((day2, count))})
>> > (targetB, {((day1, count))})
>> > It's hard to see, but the 2nd item now has a nested tuple as the first
>> value, which is still bad.
>> >
>> > How to I get (key, {tuple})???  I wasn't sure where to post this (pig or
>> cassandra), so I'm posting to the pig list too.
>> >
>> > will
>>
>>
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) oberman@civicscience.com
>



-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: prep for cassandra storage from pig

Posted by Jeremy Hanna <je...@gmail.com>.
Yeah, FromCassandraBag and ToCassandraBag don't handle completely dynamic column names.  They do handle prefixed names though: link* will get a bag of all the columns that start with link.  But it sounds like you are doing what I would have to do if I got into a nested data conundrum.  Like I said, others may have better advice for getting the data the way you want it.
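
As an illustration of the prefix idea only (this is not pygmalion's code, and
columns_with_prefix and the sample column names are made up), selecting every
column whose name starts with link from a bag of (name, value) tuples could
look like this in Python:

```python
# Illustration only; pygmalion's actual implementation is not shown here.

def columns_with_prefix(columns, pattern):
    """Select (name, value) tuples whose name matches a 'prefix*' pattern.
    A pattern without a trailing '*' must match the name exactly."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [(n, v) for (n, v) in columns if n.startswith(prefix)]
    return [(n, v) for (n, v) in columns if n == pattern]

# Hypothetical row: a bag of (name, value) columns.
columns = [("link1", "a"), ("link2", "b"), ("title", "t")]
print(columns_with_prefix(columns, "link*"))
```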

On Jun 15, 2011, at 2:08 PM, William Oberman wrote:

> My problem is the column names are dynamic (a date), and pygmalion seems to
> want the column names to be fixed at "compile time" (the script).
> 
> On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna <je...@gmail.com>wrote:
> 
>> Hi Will,
>> 
>> That's partly why I like to use FromCassandraBag and ToCassandraBag from
>> pygmalion - it does the work for you to get it back into a form that
>> cassandra understands.
>> 
>> Others may know better how to massage the data into that form using just
>> pig, but if all else fails, you could write a udf to do that.
>> 
>> Jeremy
>> 
>> On Jun 15, 2011, at 1:17 PM, William Oberman wrote:
>> 
>>> I think I'm stuck on typing issues trying to store data in cassandra.  To
>> verify, cassandra wants (key, {tuples})
>>> 
>>> My pig script is fairly brief:
>>> raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
>> (key:chararray, columns:bag {column:tuple (name, value)});
>>> --colums == timeUUID -> JSON
>>> rows = FOREACH raw GENERATE key, FLATTEN(columns);
>>> alias_target_day = FOREACH rows {
>>>    --I wrote a specialized parser that does exactly what I need
>>>    observation_map = com.civicscience.pig.ParseObservation($2);
>>>    GENERATE $0 as alias, observation_map#'_fqt' as target,
>> observation_map#'_day' as day;
>>> };
>>> grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
>>> X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
>> COUNT($1)) as day_count;
>>> 
>>> This gets me:
>>> (targetA, (day1, count))
>>> (targetA, (day2, count))
>>> (targetB, (day1, count))
>>> ....
>>> 
>>> But, cassandra wants the 2nd item to be a bag.  So, I tried:
>>> X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1,
>> COUNT($1))) as day_count;
>>> 
>>> But this results in:
>>> (targetA, {((day1, count))})
>>> (targetA, {((day2, count))})
>>> (targetB, {((day1, count))})
>>> It's hard to see, but the 2nd item now has a nested tuple as the first
>> value, which is still bad.
>>> 
>>> How to I get (key, {tuple})???  I wasn't sure where to post this (pig or
>> cassandra), so I'm posting to the pig list too.
>>> 
>>> will
>> 
>> 
> 
> 
> -- 
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) oberman@civicscience.com


Re: prep for cassandra storage from pig

Posted by William Oberman <ob...@civicscience.com>.
My problem is the column names are dynamic (a date), and pygmalion seems to
want the column names to be fixed at "compile time" (the script).
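
To illustrate what dynamic column names mean for the output shape, here is a
small Python model (not Pig; the sample rows and the name bag_per_target are
invented). Each day value found in the data becomes a column name, so the set
of columns per key is only known at run time. In Pig the same regrouping
would presumably be a second GROUP on target followed by a projection of the
(day, count) fields, though that is untested here.

```python
from collections import OrderedDict

def bag_per_target(rows):
    """Collect (target, day, count) rows into (target, [(day, count), ...])."""
    bags = OrderedDict()
    for target, day, count in rows:
        bags.setdefault(target, []).append((day, count))
    return list(bags.items())

# Invented sample data: the day strings end up as column names.
rows = [
    ("targetA", "2011-06-14", 5),
    ("targetA", "2011-06-15", 2),
    ("targetB", "2011-06-14", 7),
]
for key, columns in bag_per_target(rows):
    print(key, columns)
```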

On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna <je...@gmail.com> wrote:

> Hi Will,
>
> That's partly why I like to use FromCassandraBag and ToCassandraBag from
> pygmalion - it does the work for you to get it back into a form that
> cassandra understands.
>
> Others may know better how to massage the data into that form using just
> pig, but if all else fails, you could write a udf to do that.
>
> Jeremy
>
> On Jun 15, 2011, at 1:17 PM, William Oberman wrote:
>
> > I think I'm stuck on typing issues trying to store data in cassandra.  To
> verify, cassandra wants (key, {tuples})
> >
> > My pig script is fairly brief:
> > raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
> (key:chararray, columns:bag {column:tuple (name, value)});
> > --colums == timeUUID -> JSON
> > rows = FOREACH raw GENERATE key, FLATTEN(columns);
> > alias_target_day = FOREACH rows {
> >     --I wrote a specialized parser that does exactly what I need
> >     observation_map = com.civicscience.pig.ParseObservation($2);
> >     GENERATE $0 as alias, observation_map#'_fqt' as target,
> observation_map#'_day' as day;
> > };
> > grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
> > X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
> COUNT($1)) as day_count;
> >
> > This gets me:
> > (targetA, (day1, count))
> > (targetA, (day2, count))
> > (targetB, (day1, count))
> > ....
> >
> > But, cassandra wants the 2nd item to be a bag.  So, I tried:
> > X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1,
> COUNT($1))) as day_count;
> >
> > But this results in:
> > (targetA, {((day1, count))})
> > (targetA, {((day2, count))})
> > (targetB, {((day1, count))})
> > It's hard to see, but the 2nd item now has a nested tuple as the first
> value, which is still bad.
> >
> > How to I get (key, {tuple})???  I wasn't sure where to post this (pig or
> cassandra), so I'm posting to the pig list too.
> >
> > will
>
>


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: prep for cassandra storage from pig

Posted by Jeremy Hanna <je...@gmail.com>.
Hi Will,

That's partly why I like to use FromCassandraBag and ToCassandraBag from pygmalion - it does the work for you to get it back into a form that cassandra understands.

Others may know better how to massage the data into that form using just pig, but if all else fails, you could write a udf to do that.

Jeremy

On Jun 15, 2011, at 1:17 PM, William Oberman wrote:

> I think I'm stuck on typing issues trying to store data in cassandra.  To verify, cassandra wants (key, {tuples})
> 
> My pig script is fairly brief:
> raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
> --colums == timeUUID -> JSON
> rows = FOREACH raw GENERATE key, FLATTEN(columns);
> alias_target_day = FOREACH rows {
>     --I wrote a specialized parser that does exactly what I need
>     observation_map = com.civicscience.pig.ParseObservation($2);
>     GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as day;
> };
> grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
> X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count;
> 
> This gets me:
> (targetA, (day1, count))
> (targetA, (day2, count))
> (targetB, (day1, count))
> ....
> 
> But, cassandra wants the 2nd item to be a bag.  So, I tried:
> X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count;
> 
> But this results in:
> (targetA, {((day1, count))})
> (targetA, {((day2, count))})
> (targetB, {((day1, count))})
> It's hard to see, but the 2nd item now has a nested tuple as the first value, which is still bad.
> 
> How to I get (key, {tuple})???  I wasn't sure where to post this (pig or cassandra), so I'm posting to the pig list too.
> 
> will

