Posted to user@pig.apache.org by bob <bo...@combshouse.com> on 2011/04/07 00:40:00 UTC

help flattening data from cassandra loader

No matter what I try, I end up losing the tuples after the initial flatten. I'm using some auto-generated test data with first, last, and a concatenation of the two for the key. The script and outputs follow.

rows = LOAD 'cassandra://Keyspace2/Standard1' USING CassandraStorage() as (key:chararray, cols:bag{T:tuple(name:chararray, value:chararray) } );
dump rows;

(faaaaaaaaazzzzzzeaaa,{(first,faaaaaaaaa),(last,zzzzzzeaaa)})
(jaaaaaaaaazzzlaaaaaa,{(first,jaaaaaaaaa),(last,zzzlaaaaaa)})
(naaaaaaaaazzzzzpaaaa,{(first,naaaaaaaaa),(last,zzzzzpaaaa)})
(uaaaaaaaaazzzzzsaaaa,{(first,uaaaaaaaaa),(last,zzzzzsaaaa)})
(vaaaaaaaaafaaaaaaaaa,{(first,vaaaaaaaaa),(last,faaaaaaaaa)})
(zuaaaaaaaazpaaaaaaaa,{(first,zuaaaaaaaa),(last,zpaaaaaaaa)})
(zuaaaaaaaazzzzhaaaaa,{(first,zuaaaaaaaa),(last,zzzzhaaaaa)})
(zwaaaaaaaaznaaaaaaaa,{(first,zwaaaaaaaa),(last,znaaaaaaaa)})
(zziaaaaaaazfaaaaaaaa,{(first,zziaaaaaaa),(last,zfaaaaaaaa)})
(zzkaaaaaaazzzdaaaaaa,{(first,zzkaaaaaaa),(last,zzzdaaaaaa)})

So far, so good.


columns = foreach rows generate flatten(cols) as (name, value);        
dump columns;

()
()
()
()
()
()
()
()
()
()


Not so good.
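
For comparison, flattening a bag would normally be expected to emit one record per inner tuple. Here is a minimal Python model of that expectation, not Pig code; it assumes each row is a (key, list of (name, value) pairs) pair shaped like the dump above, and the function name flatten_bag is hypothetical.

```python
def flatten_bag(rows):
    """Yield one (name, value) record per inner tuple, as flatten(cols) would."""
    for _key, cols in rows:
        for name, value in cols:
            yield (name, value)

# Two rows copied from the dump of `rows` above.
rows = [
    ("faaaaaaaaazzzzzzeaaa", [("first", "faaaaaaaaa"), ("last", "zzzzzzeaaa")]),
    ("jaaaaaaaaazzzlaaaaaa", [("first", "jaaaaaaaaa"), ("last", "zzzlaaaaaa")]),
]

flattened = list(flatten_bag(rows))
for record in flattened:
    print(record)
```

The key is dropped because the foreach generates only the flattened bag; generating key alongside it would repeat the key on every record.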



I've tried multiple combinations with no success. If I leave the bag schema empty in the initial load, i.e. cols:bag{}, and then leave off the "as" clause in the flatten, I get something that looks like a list of tuples. But trying to access $1 in that result gives me an Error 1000 index out of range, so that's not the ticket either.

What I'd really like is to flatten the bag into a map, but I've had about as much success there.
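
The shape being asked for, a key paired with a name-to-value map, can be sketched in Python. This only illustrates the target data structure and is not a Pig solution; the function name bag_to_map is hypothetical, and it assumes column names are unique within a row.

```python
def bag_to_map(row):
    """Turn (key, [(name, value), ...]) into (key, {name: value, ...})."""
    key, cols = row
    # dict() keeps the last value if a name repeats; names are assumed unique.
    return (key, dict(cols))

row = ("faaaaaaaaazzzzzzeaaa", [("first", "faaaaaaaaa"), ("last", "zzzzzzeaaa")])
keyed = bag_to_map(row)
print(keyed)
```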

Any help is most welcome. (Cassandra 0.7.4 and Pig 0.8.0)

Re: help flattening data from cassandra loader

Posted by Jeremy Hanna <je...@gmail.com>.

On Apr 6, 2011, at 6:16 PM, bob wrote:

> Honestly, I'd rather have a keyed bag of maps on the initial load, but that'd work too. Is it really that hard to get cassandra data out that you need a UDF to do anything besides an initial dump?

That's what we're doing because it just makes it easier to deal with tabular-like data - we don't have to munge through it quite as much.  I'm still pretty low on my pig-fu but others on the list might have better answers on how to deal with that data structure.


Re: help flattening data from cassandra loader

Posted by bob <bo...@combshouse.com>.

Honestly, I'd rather have a keyed bag of maps on the initial load, but that'd work too. Is it really that hard to get cassandra data out that you need a UDF to do anything besides an initial dump?


Re: help flattening data from cassandra loader

Posted by Jeremy Hanna <je...@gmail.com>.

I'm going to put a UDF up on the pygmalion project hopefully today that will convert that into something more usable.  Props to Jacob from infochimps - he and I have been creating UDFs like that lately for use with Cassandra.  There's an associated UDF for getting it back into the key, cols form to output to cassandra as well.  I'll try to get that pushed tonight but take a look at:
https://github.com/jeromatron/pygmalion/
That's where I'll push the code - hopefully that will help.

What it does is take the data structure returned from Cassandra and let you say, "give me the key and the values for these column names in a bag", so for your example it would return:
{(faaaaaaaaazzzzzzeaaa,faaaaaaaaa,zzzzzzeaaa)}
and you could assign variable names to each, like key, first, last, within Pig.
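
The transformation described above can be modelled roughly in Python. This is only a sketch of the idea and not the pygmalion UDF or its API: the function name pick_columns is hypothetical, rows are assumed to be (key, list of (name, value) pairs), and returning None for a missing column is an assumption of this sketch.

```python
def pick_columns(row, names):
    """Return (key, value of names[0], value of names[1], ...) for one row."""
    key, cols = row
    lookup = dict(cols)  # column name -> column value
    # A requested column that is absent yields None (assumption of this sketch).
    return (key,) + tuple(lookup.get(name) for name in names)

row = ("faaaaaaaaazzzzzzeaaa", [("first", "faaaaaaaaa"), ("last", "zzzzzzeaaa")])
result = pick_columns(row, ["first", "last"])
print(result)
```

Each position in the result can then be given a name (key, first, last), which is what makes the output usable as tabular data.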

Anyway, if that helps, look for that soon.  It's helping us use the output as tabular data.
