Posted to user@pig.apache.org by AD <st...@gmail.com> on 2011/11/04 13:51:22 UTC

JOIN not printing properly

Hello,

I am pulling data from Cassandra into Pig, so each row ends up as a key plus
a bag of columns: key, bag { (name,value),(name,value) }.  The data is log
files, so each column is one field of a server log line (like Apache's).  I
have the following Pig script to combine two fields and count them, but the
GENERATE after the JOIN is not printing the right field.  Is there an easier
way to solve this, and does anyone know why the join output is broken?

rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage() AS (key,
columns: bag {T: tuple(name, value)});

A = FOREACH rows GENERATE $0, FLATTEN($1);  -- one (key,name,value) tuple per column
-- (key1,url,http://www.google.com)
-- (key1,cache_hit,hit)
-- (key2,url,http://www.google.com)
-- (key2,cache_hit,miss)

B = GROUP A BY key;  -- combine url and cache_hit into one record
-- (key1,{(key1,url,http://www.google.com),(key1,cache_hit,hit)})
-- (key2,{(key2,url,http://www.google.com),(key2,cache_hit,miss)})

-- Create 2 lists and then JOIN them

C = FOREACH B {
    u = FILTER A BY name == 'url';
    GENERATE FLATTEN(u.(key,value));
}
-- (key1,http://www.google.com)
-- (key2,http://www.google.com)

D = FOREACH B {
    u2 = FILTER A BY name == 'cache_hit';
    GENERATE FLATTEN(u2.(key,value));
}
-- (key1,hit)
-- (key2,miss)

E = JOIN C BY key, D BY key;
-- (key1,http://www.google.com,key1,hit)
-- (key2,http://www.google.com,key2,miss)

DESCRIBE E;
-- E: {C::u::key: chararray,C::u::value: chararray,D::u2::key:
-- chararray,D::u2::value: chararray}

F = FOREACH E GENERATE C::u::value, D::u2::value;

DUMP F;
-- (http://www.google.com,http://www.google.com)   ?? Why not http://www.google.com,hit ??
-- (http://www.google.com,http://www.google.com)
Any help appreciated.
AD

Re: JOIN not printing properly

Posted by AD <st...@gmail.com>.
Yep, I just did and it worked, thanks.

I do still find it odd that the JOIN output below is not printing
correctly, though, no?

On Fri, Nov 4, 2011 at 10:57 AM, Jacob Perkins <ja...@gmail.com> wrote:

> Have you taken a look at Pygmalion
> (http://github.com/jeromatron/pygmalion) which makes it MUCH easier to
> work with tabular data from Cassandra like you're trying to do?
>
> For example:
>
> what_cassandrastorage_should_really_produce = FOREACH rows GENERATE key
> AS key, FromCassandraBag('url,cache_hit', columns) AS (url:chararray,
> cache_hit:chararray);
>
> DUMP what_cassandrastorage_should_really_produce;
>
> (key1, http://www.google.com, hit)
> (key2, http://www.google.com, miss)
>
> Does that work for your use case?
>
> --jacob
> @thedatachef
>

Re: JOIN not printing properly

Posted by Jacob Perkins <ja...@gmail.com>.
Have you taken a look at Pygmalion
(http://github.com/jeromatron/pygmalion) which makes it MUCH easier to
work with tabular data from Cassandra like you're trying to do?

For example:

what_cassandrastorage_should_really_produce = FOREACH rows GENERATE key
AS key, FromCassandraBag('url,cache_hit', columns) AS (url:chararray,
cache_hit:chararray);

DUMP what_cassandrastorage_should_really_produce;

(key1, http://www.google.com, hit)
(key2, http://www.google.com, miss)

Does that work for your use case?

--jacob
@thedatachef
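
On the "is there an easier way" part of the question: if adding a UDF
library is not an option, the url and cache_hit values can be picked out in
a single nested FOREACH over the grouped relation, which avoids the JOIN
(and the C::u::value / D::u2::value aliases) altogether. What follows is a
minimal sketch only, not run against this keyspace. The alias names pairs,
by_pair, counts and n are made up for illustration, and it assumes every key
has exactly one url column and one cache_hit column. FLATTEN of an empty bag
emits no row, so a key that is missing either column drops out of the result.

```pig
-- Sketch only: same LOAD as in the original post.
rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});

-- One (key, name, value) tuple per Cassandra column.
A = FOREACH rows GENERATE key, FLATTEN(columns) AS (name, value);
B = GROUP A BY key;

-- Pick both fields out of the same group; no JOIN needed.
-- (pairs is a hypothetical alias, not from the original post.)
pairs = FOREACH B {
    u = FILTER A BY name == 'url';
    h = FILTER A BY name == 'cache_hit';
    GENERATE group AS key,
             FLATTEN(u.value) AS url,
             FLATTEN(h.value) AS cache_hit;
};

-- Count how often each (url, cache_hit) combination occurs.
by_pair = GROUP pairs BY (url, cache_hit);
counts  = FOREACH by_pair GENERATE FLATTEN(group) AS (url, cache_hit),
                                   COUNT(pairs) AS n;

DUMP counts;
```

Because every field gets an explicit name in the GENERATE, the later
statements never need the relation::alias prefixes that the JOIN version
relies on.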

