Posted to user@pig.apache.org by Tamil Selvan <ta...@gmail.com> on 2011/09/28 18:18:53 UTC

Re: Pig & Cassandra integration

Hi,
 I'm trying to integrate Pig with Cassandra.
 My column family in Cassandra is:
 name -> xxx
 Age -> yyy
 class -> zzz
This is how I load the data:
 rows = LOAD 'cassandra://TestKeySpace/TestPig' USING CassandraStorage()
 as (key,columns:bag{column:tuple(name,value)});

Now I want to group by the value of the class column. I tried:

 col_values = FOREACH rows GENERATE (columns.value) as list:bag{};

This gave me a result with the following schema: :bag(:tuple(chararray))
Ex: on DUMP col_values I got {(xxx),(yyy),(zzz)}

Now if I try to access the individual values with

 list = FOREACH col_values GENERATE (list.$0, list.$1);

I get an undefined index access error, along the lines of:
list.$1 doesn't exist :bag[:tuple(chararray)] has only one column
(but the bag holds 3 values).

How can I access tuple-wise data in such cases?
Because of this, I can't group by a single column.

I tried TOTUPLE, but the problem is that it converts the entire bag to a
tuple and applies the group by to that.

Help me out

Regards,
Tamil


Re: Pig & Cassandra integration

Posted by Jeremy Hanna <je...@gmail.com>.
It's been mentioned in this thread, but if you're using tabular (static column names) data, you might consider using Pygmalion.  It will extract the values from Cassandra to simplify grouping by values and other operations.
https://github.com/jeromatron/pygmalion
What you'll want to look at is the FromCassandraBag UDF, which has an example here:
https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig
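
Applied to the column family from the question, that could look roughly like the sketch below. Several things here are assumptions rather than anything confirmed in this thread: the jar path is hypothetical, the UDF call (a comma-separated list of column names, then the columns bag) follows my reading of the linked example script, and the chararray types are guesses about the data. Check the example script before relying on it.

```pig
-- Hypothetical jar location; adjust to wherever Pygmalion is built or installed.
REGISTER 'pygmalion.jar';

-- Assumed package name, as used in the linked example script.
DEFINE FromCassandraBag org.pygmalion.udf.FromCassandraBag();

raw = LOAD 'cassandra://TestKeySpace/TestPig' USING CassandraStorage()
      AS (key, columns: bag{column: tuple(name, value)});

-- Turn the bag of (name, value) pairs into one flat record per row.
-- Column names are taken from the question; the types are assumed.
people = FOREACH raw GENERATE key,
         FLATTEN(FromCassandraBag('name, Age, class', columns))
         AS (name: chararray, age: chararray, class_value: chararray);

-- Each column is now an ordinary field, so grouping is straightforward.
by_class = GROUP people BY class_value;
```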

Hope that helps - we use pygmalion 1.0.0 for all our scripts in production.
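
For what it's worth, the error in the question comes from how Pig projects out of a bag: list.$1 asks for the second field of every tuple in the bag, and each tuple here has only one field. The three values are three separate tuples, not three fields of one tuple. If you'd rather stay in plain Pig Latin, a minimal sketch (assuming the column names and values are UTF-8 strings that can be cast to chararray) is to flatten the bag, keep only the class column, and group on its value:

```pig
rows = LOAD 'cassandra://TestKeySpace/TestPig' USING CassandraStorage()
       AS (key, columns: bag{column: tuple(name, value)});

-- One output record per (row key, column) pair.
cols = FOREACH rows GENERATE key, FLATTEN(columns) AS (name, value);

-- Keep only the 'class' column of each row.
class_cols = FILTER cols BY (chararray) name == 'class';

-- Group row keys by the value stored in their 'class' column.
by_class = GROUP class_cols BY (chararray) value;

-- Example aggregate: how many rows fall in each class.
counts = FOREACH by_class GENERATE group AS class_value, COUNT(class_cols) AS n;
DUMP counts;
```

This costs an extra FILTER pass per column you care about, which is the bookkeeping the FromCassandraBag UDF is meant to save when the column names are static.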
