Posted to user@cassandra.apache.org by Benoit Mathieu <be...@yakaz.com> on 2012/03/01 11:45:09 UTC

hadoop map join with ColumnFamilyInputFormat

Hi all,

I want to write a MapReduce job with a Map task taking its data from 2
CFs. Those 2 CFs have the same row keys and are in the same keyspace, so
they are partitioned the same way across my cluster, and it would be
nice if the Map task could read both column families locally.
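For context, my starting point is the usual single-CF job setup (roughly the
word_count example pattern); the host, keyspace, column family and output path
below are just placeholders:

import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleCfJob {

    // The mapper sees one Cassandra row at a time: row key + a slice of its columns.
    public static class RowMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                           Context context) throws java.io.IOException, InterruptedException {
            // ... emit whatever is needed, keyed by the row key ...
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "single-cf-example");
        job.setJarByClass(SingleCfJob.class);
        job.setMapperClass(RowMapper.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/single_cf_out")); // placeholder

        // Cassandra input settings (placeholder host / keyspace / CF).
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "cf_one");

        // Read all columns of each row.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(
                ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER,
                false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}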

In the Hadoop package org.apache.hadoop.mapred.join, there is a
CompositeInputFormat class, which seems to do what I want, but it
appears tied to HDFS files, as the "compose" method takes "Path" args.
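To illustrate, my understanding is that CompositeInputFormat is wired up
roughly like this (the HDFS paths are made up), which is why it does not seem
to apply directly to column families:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class ComposeExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // The join expression is built from HDFS Paths, one per input;
        // the mapper then receives the key plus a TupleWritable of the joined values.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path("/data/left"), new Path("/data/right")));
        conf.setInputFormat(CompositeInputFormat.class);
    }
}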

Has anyone ever written a CompositeColumnFamilyInputFormat, or does
anyone have any insight about it?

Cheers,

Benoit

Re: hadoop map join with ColumnFamilyInputFormat

Posted by Jeremy Hanna <je...@gmail.com>.
I haven't used that in particular, but it's pretty trivial to do with Pig, and I would imagine it would just do the right thing under the covers. It's a simple join in Pig. We use pygmalion to get the data out of the Cassandra bag. A simple example would be:
DEFINE FromCassandraBag org.pygmalion.udf.FromCassandraBag();

raw_billing_account = LOAD 'cassandra://voltron/billing_account' USING org.apache.cassandra.hadoop.pig.CassandraStorage() AS (id:chararray, columns:bag {column:tuple (name, value)});
billing_account = FOREACH raw_billing_account GENERATE
        id,
        FLATTEN(FromCassandraBag('name, age, address, city, state, zip',columns)) AS (
		name:		chararray,
		age: 		chararray,
		address: 	chararray,
		city: 		chararray,
		state:		chararray,
		zip:			chararray
        );

raw_game_account =  LOAD 'cassandra://voltron/game_account' USING org.apache.cassandra.hadoop.pig.CassandraStorage() AS (id:chararray, columns:bag {column:tuple (name, value)});
game_account = FOREACH raw_game_account GENERATE
        id,
        FLATTEN(FromCassandraBag('username, level, experience_points, super_powers, vehicles',columns)) AS (
		username:			chararray,
		level: 				chararray,
		experience_points: 	chararray,
		super_powers: 		chararray,
		vehicles:			chararray
        );

composite_relation = FOREACH
	(JOIN billing_account BY id, game_account BY id)
		GENERATE
		billing_account::id AS id,
		name,
		username,
		level,
		super_powers;

Anyway - not sure if that's what you're looking for, but that's what we do a lot of with Pig: joins on any attribute, group bys, things like that.


On Mar 1, 2012, at 4:45 AM, Benoit Mathieu wrote:

> Hi all,
> 
> I want to write a MapReduce job with a Map task taking its data from 2
> CFs. Those 2 CFs have the same row keys and are in the same keyspace, so
> they are partitioned the same way across my cluster, and it would be
> nice if the Map task could read both column families locally.
> 
> In the Hadoop package org.apache.hadoop.mapred.join, there is a
> CompositeInputFormat class, which seems to do what I want, but it
> appears tied to HDFS files, as the "compose" method takes "Path" args.
> 
> Has anyone ever written a CompositeColumnFamilyInputFormat, or does
> anyone have any insight about it?
> 
> Cheers,
> 
> Benoit