You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Malcolm Tye <ma...@btinternet.com> on 2012/04/05 14:58:55 UTC

"Exploding" a Hive array in Pig from an RCFile

Hi,
    I'm storing data into a partitioned table using Hive in RCFile format,
but I want to use Pig to do the aggregation of that data.

In my array <string> in Hive, I have colon delimited data, E.g.

:0:12:21:99:

With the lateral view and explode functions in Hive, I can output each value
as a separate row.

In Pig, I think I need to use flatten, but it just outputs the array as a
single field, and I can't see where to specify that the delimiter is the
delimiter/value separator

register /opt/pig/trunk/bin/piggybank.jar
mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids
array<string>');
opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
dump opt;



Thanks

Malc

RE: "Exploding" a Hive array in Pig from an RCFile

Posted by Malcolm Tye <ma...@btinternet.com>.

Hi Norbert,
	  Thanks for your answer. I'm just documenting the problems I
experienced and will reply to the list soon with a detailed answer


Thanks for your help


Malc


-----Original Message-----
From: Norbert Burger [mailto:norbert.burger@gmail.com] 
Sent: 12 April 2012 04:14
To: user@pig.apache.org
Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile

A little wonky, but try wrapping the flattened tuple elements in a bag, and
then re-flattening that:

A = LOAD 'test.txt' USING PigStorage(',') AS
(C_SUB_ID:chararray,seg_ids:chararray);
B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));

Only flattened bags generate the cols -> rows transformation that you're
trying to make.  Flattened tuples, on the other hand, simply explode the
tuple into its composite elements, but without creating the multiple rows
("cross product') in your relation.  A custom UDF would be another option
here.

Norbert

On Wed, Apr 11, 2012 at 6:59 PM, Malcolm Tye
<ma...@btinternet.com>wrote:

> Hi Norbert,
>            I don't seem to be getting what I'm after. If my data looks 
> like this
>
> 1133957209,61:0:1
> 4524524233,21:0
>
> I want to produce
>
> 1133957209,61
> 1133957209,0
> 1133957209,1
> 4524524233,21
> 4524524233,0
>
> I changed the LOAD statement to
>
> mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING 
> org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> string,seg_ids
> array');
> opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as 
> s_seg_id;
>
> I don't seem to be getting the cross product, just something like the 
> following
>
> 1133957209,61,0,1
> 4524524233,21,0
>
> Any ideas ?
>
>
> Thanks
>
> Malc
>
>
> -----Original Message-----
> From: Norbert Burger [mailto:norbert.burger@gmail.com]
> Sent: 06 April 2012 16:01
> To: user@pig.apache.org
> Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile
>
> Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to 
> tokenize a chararray on some delimeter.  So the following should work:
>
> opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as 
> s_seg_id;
>
> Norbert
>
> On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye
> <ma...@btinternet.com>wrote:
>
> > Hi,
> >    I'm storing data into a partitioned table using Hive in RCFile 
> > format, but I want to use Pig to do the aggregation of that data.
> >
> > In my array <string> in Hive, I have colon delimited data, E.g.
> >
> > :0:12:21:99:
> >
> > With the lateral view and explode functions in Hive, I can output 
> > each value as a separate row.
> >
> > In Pig, I think I need to use flatten, but it just outputs the array 
> > as a single field, and I can't see where to specify that the 
> > delimiter is the delimiter/value separator
> >
> > register /opt/pig/trunk/bin/piggybank.jar mt = LOAD 
> > '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING 
> > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> > string,seg_ids
> > array<string>');
> > opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id; 
> > dump opt;
> >
> >
> >
> > Thanks
> >
> > Malc
> >
> >
> >
>
>

Re: "Exploding" a Hive array in Pig from an RCFile

Posted by Aniket Mokashi <an...@gmail.com>.

Hi Malcolm,

arrays are converted to tuples and flatten should directly work on it. I
think you need not worry about the delimiter (assuming hive knows how to
deserialize it). Btw, does RCFile require delimiter to store arrays? I am
not sure about that.

Thanks,
Aniket


On Wed, Apr 11, 2012 at 8:14 PM, Norbert Burger <no...@gmail.com>wrote:

> A little wonky, but try wrapping the flattened tuple elements in a bag, and
> then re-flattening that:
>
> A = LOAD 'test.txt' USING PigStorage(',') AS
> (C_SUB_ID:chararray,seg_ids:chararray);
> B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
> C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));
>
> Only flattened bags generate the cols -> rows transformation that you're
> trying to make.  Flattened tuples, on the other hand, simply explode the
> tuple into its composite elements, but without creating the multiple rows
> ("cross product') in your relation.  A custom UDF would be another option
> here.
>
> Norbert
>
> On Wed, Apr 11, 2012 at 6:59 PM, Malcolm Tye <malcolm.tye@btinternet.com
> >wrote:
>
> > Hi Norbert,
> >            I don't seem to be getting what I'm after. If my data looks
> like
> > this
> >
> > 1133957209,61:0:1
> > 4524524233,21:0
> >
> > I want to produce
> >
> > 1133957209,61
> > 1133957209,0
> > 1133957209,1
> > 4524524233,21
> > 4524524233,0
> >
> > I changed the LOAD statement to
> >
> > mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> > string,seg_ids
> > array');
> > opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as
> > s_seg_id;
> >
> > I don't seem to be getting the cross product, just something like the
> > following
> >
> > 1133957209,61,0,1
> > 4524524233,21,0
> >
> > Any ideas ?
> >
> >
> > Thanks
> >
> > Malc
> >
> >
> > -----Original Message-----
> > From: Norbert Burger [mailto:norbert.burger@gmail.com]
> > Sent: 06 April 2012 16:01
> > To: user@pig.apache.org
> > Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile
> >
> > Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to
> tokenize
> > a chararray on some delimeter.  So the following should work:
> >
> > opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as
> > s_seg_id;
> >
> > Norbert
> >
> > On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye
> > <ma...@btinternet.com>wrote:
> >
> > > Hi,
> > >    I'm storing data into a partitioned table using Hive in RCFile
> > > format, but I want to use Pig to do the aggregation of that data.
> > >
> > > In my array <string> in Hive, I have colon delimited data, E.g.
> > >
> > > :0:12:21:99:
> > >
> > > With the lateral view and explode functions in Hive, I can output each
> > > value as a separate row.
> > >
> > > In Pig, I think I need to use flatten, but it just outputs the array
> > > as a single field, and I can't see where to specify that the delimiter
> > > is the delimiter/value separator
> > >
> > > register /opt/pig/trunk/bin/piggybank.jar mt = LOAD
> > > '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> > > string,seg_ids
> > > array<string>');
> > > opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id; dump
> > > opt;
> > >
> > >
> > >
> > > Thanks
> > >
> > > Malc
> > >
> > >
> > >
> >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"

Re: "Exploding" a Hive array in Pig from an RCFile

Posted by Norbert Burger <no...@gmail.com>.

A little wonky, but try wrapping the flattened tuple elements in a bag, and
then re-flattening that:

A = LOAD 'test.txt' USING PigStorage(',') AS
(C_SUB_ID:chararray,seg_ids:chararray);
B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));

Only flattened bags generate the cols -> rows transformation that you're
trying to make.  Flattened tuples, on the other hand, simply explode the
tuple into its composite elements, but without creating the multiple rows
("cross product') in your relation.  A custom UDF would be another option
here.

Norbert

On Wed, Apr 11, 2012 at 6:59 PM, Malcolm Tye <ma...@btinternet.com>wrote:

> Hi Norbert,
>            I don't seem to be getting what I'm after. If my data looks like
> this
>
> 1133957209,61:0:1
> 4524524233,21:0
>
> I want to produce
>
> 1133957209,61
> 1133957209,0
> 1133957209,1
> 4524524233,21
> 4524524233,0
>
> I changed the LOAD statement to
>
> mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> string,seg_ids
> array');
> opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as
> s_seg_id;
>
> I don't seem to be getting the cross product, just something like the
> following
>
> 1133957209,61,0,1
> 4524524233,21,0
>
> Any ideas ?
>
>
> Thanks
>
> Malc
>
>
> -----Original Message-----
> From: Norbert Burger [mailto:norbert.burger@gmail.com]
> Sent: 06 April 2012 16:01
> To: user@pig.apache.org
> Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile
>
> Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to tokenize
> a chararray on some delimeter.  So the following should work:
>
> opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as
> s_seg_id;
>
> Norbert
>
> On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye
> <ma...@btinternet.com>wrote:
>
> > Hi,
> >    I'm storing data into a partitioned table using Hive in RCFile
> > format, but I want to use Pig to do the aggregation of that data.
> >
> > In my array <string> in Hive, I have colon delimited data, E.g.
> >
> > :0:12:21:99:
> >
> > With the lateral view and explode functions in Hive, I can output each
> > value as a separate row.
> >
> > In Pig, I think I need to use flatten, but it just outputs the array
> > as a single field, and I can't see where to specify that the delimiter
> > is the delimiter/value separator
> >
> > register /opt/pig/trunk/bin/piggybank.jar mt = LOAD
> > '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> > string,seg_ids
> > array<string>');
> > opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id; dump
> > opt;
> >
> >
> >
> > Thanks
> >
> > Malc
> >
> >
> >
>
>

RE: "Exploding" a Hive array in Pig from an RCFile

Posted by Malcolm Tye <ma...@btinternet.com>.

Hi Norbert,
	    I don't seem to be getting what I'm after. If my data looks like
this

1133957209,61:0:1
4524524233,21:0

I want to produce

1133957209,61
1133957209,0
1133957209,1
4524524233,21
4524524233,0

I changed the LOAD statement to

mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids
array');
opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as
s_seg_id;

I don't seem to be getting the cross product, just something like the
following

1133957209,61,0,1
4524524233,21,0

Any ideas ?


Thanks

Malc


-----Original Message-----
From: Norbert Burger [mailto:norbert.burger@gmail.com] 
Sent: 06 April 2012 16:01
To: user@pig.apache.org
Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile

Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to tokenize
a chararray on some delimeter.  So the following should work:

opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as
s_seg_id;

Norbert

On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye
<ma...@btinternet.com>wrote:

> Hi,
>    I'm storing data into a partitioned table using Hive in RCFile 
> format, but I want to use Pig to do the aggregation of that data.
>
> In my array <string> in Hive, I have colon delimited data, E.g.
>
> :0:12:21:99:
>
> With the lateral view and explode functions in Hive, I can output each 
> value as a separate row.
>
> In Pig, I think I need to use flatten, but it just outputs the array 
> as a single field, and I can't see where to specify that the delimiter 
> is the delimiter/value separator
>
> register /opt/pig/trunk/bin/piggybank.jar mt = LOAD 
> '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING 
> org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> string,seg_ids
> array<string>');
> opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id; dump 
> opt;
>
>
>
> Thanks
>
> Malc
>
>
>

Re: "Exploding" a Hive array in Pig from an RCFile

Posted by Norbert Burger <no...@gmail.com>.

Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to tokenize
a chararray on some delimeter.  So the following should work:

opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as
s_seg_id;

Norbert

On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye <ma...@btinternet.com>wrote:

> Hi,
>    I'm storing data into a partitioned table using Hive in RCFile format,
> but I want to use Pig to do the aggregation of that data.
>
> In my array <string> in Hive, I have colon delimited data, E.g.
>
> :0:12:21:99:
>
> With the lateral view and explode functions in Hive, I can output each
> value
> as a separate row.
>
> In Pig, I think I need to use flatten, but it just outputs the array as a
> single field, and I can't see where to specify that the delimiter is the
> delimiter/value separator
>
> register /opt/pig/trunk/bin/piggybank.jar
> mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> string,seg_ids
> array<string>');
> opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
> dump opt;
>
>
>
> Thanks
>
> Malc
>
>
>