Posted to user@pig.apache.org by William Oberman <ob...@civicscience.com> on 2011/06/15 20:17:33 UTC
prep for cassandra storage from pig
I think I'm stuck on typing issues trying to store data in cassandra. To
verify, cassandra wants (key, {tuples})
My pig script is fairly brief:
raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});
--columns == timeUUID -> JSON
rows = FOREACH raw GENERATE key, FLATTEN(columns);
alias_target_day = FOREACH rows {
--I wrote a specialized parser that does exactly what I need
observation_map = com.civicscience.pig.ParseObservation($2);
GENERATE $0 as alias, observation_map#'_fqt' as target,
observation_map#'_day' as day;
};
grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
COUNT($1)) as day_count;
This gets me:
(targetA, (day1, count))
(targetA, (day2, count))
(targetB, (day1, count))
....
But, cassandra wants the 2nd item to be a bag. So, I tried:
X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1,
COUNT($1))) as day_count;
But this results in:
(targetA, {((day1, count))})
(targetA, {((day2, count))})
(targetB, {((day1, count))})
It's hard to see, but the bag in the 2nd item now holds a tuple wrapped
inside another tuple, which is still bad.
How do I get (key, {tuple})? I wasn't sure where to post this (pig or
cassandra), so I'm posting to the pig list too.
will
Re: prep for cassandra storage from pig
Posted by William Oberman <ob...@civicscience.com>.
Rather than stay stuck, I wrote a custom function: TupleToBagOfTuple. I'm
curious whether I could have avoided this, though.
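One possible way to avoid the custom function, sketched here and not tested against CassandraStorage: keep day and count as flat fields, group a second time by target, and project the two fields out of the grouped bag. The alias names counts, by_target, and out are made up for this illustration; grouping is the relation from the original script.

```pig
-- Sketch only: counts, by_target, and out are hypothetical alias names.
-- Starts from the 'grouping' relation defined in the original script.

-- Keep day and count as separate flat fields instead of wrapping them in TOTUPLE.
counts = FOREACH grouping GENERATE
    group.$0 AS target,
    group.$1 AS day,
    COUNT($1) AS cnt;

-- Collect every (target, day, cnt) row that shares a target.
by_target = GROUP counts BY target;

-- Project only (day, cnt) out of the grouped bag, giving a bag of 2-field tuples.
out = FOREACH by_target GENERATE group AS target, counts.(day, cnt) AS day_counts;
```

This should give rows shaped like (targetA, {(day1, count), (day2, count)}), so there is one row per target rather than one per (target, day) pair. Whether that shape suits the store step depends on how the row key and columns are meant to map onto the column family.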
--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com