Posted to user@pig.apache.org by Malcolm Tye <ma...@btinternet.com> on 2012/05/03 14:29:44 UTC

RE: "Exploding" a Hive array<string> in Pig from an RCFile

Hi Norbert,
Thanks for your answer. I'm just documenting the problems I
experienced and will reply to the list soon with a detailed answer.


Thanks for your help


Malc


-----Original Message-----
From: Norbert Burger [mailto:norbert.burger@gmail.com] 
Sent: 12 April 2012 04:14
To: user@pig.apache.org
Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile

A little wonky, but try wrapping the flattened tuple elements in a bag, and
then re-flattening that:

A = LOAD 'test.txt' USING PigStorage(',') AS
(C_SUB_ID:chararray,seg_ids:chararray);
B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));

Only flattened bags produce the columns-to-rows transformation that you're
trying to make.  Flattened tuples, on the other hand, simply explode the
tuple into its component elements as extra columns in the same row, without
creating the multiple rows ("cross product") in your relation.  A custom UDF
would be another option here.
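
To make the tuple-versus-bag distinction concrete, here is a minimal sketch
in Python rather than Pig. It is only a toy model of the FLATTEN behaviour
described above, not Pig's implementation: the helper names (strsplit,
flatten_tuple, flatten_bag) are hypothetical, and a bag is modelled as a
plain list of values instead of a bag of single-field tuples.

```python
# Toy model of Pig's FLATTEN semantics (an illustration, not Pig itself).
# All function names here are hypothetical helpers for this sketch.

def strsplit(value, delimiter):
    """Model of STRSPLIT: returns ONE tuple holding all the tokens."""
    return tuple(value.split(delimiter))

def flatten_tuple(key, tup):
    """FLATTEN on a tuple: fields are appended to the same row (wider row)."""
    return [(key,) + tuple(tup)]

def flatten_bag(key, bag):
    """FLATTEN on a bag: one output row per bag element (more rows)."""
    return [(key, item) for item in bag]

# Sample data taken from the thread.
rows = [("1133957209", "61:0:1"), ("4524524233", "21:0")]

# Like FLATTEN(STRSPLIT(seg_ids, ':')): the tuple is flattened into columns.
wide_rows = [
    out
    for key, seg_ids in rows
    for out in flatten_tuple(key, strsplit(seg_ids, ":"))
]

# Like FLATTEN(TOBAG($1..)): the fields are wrapped in a bag first,
# so flattening yields one row per segment id.
long_rows = [
    out
    for key, seg_ids in rows
    for out in flatten_bag(key, list(strsplit(seg_ids, ":")))
]

for row in long_rows:
    print(",".join(row))
```

In this model, wide_rows reproduces the unwanted one-row-per-subscriber
shape, while long_rows gives one row per (C_SUB_ID, segment id) pair.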

Norbert

On Wed, Apr 11, 2012 at 6:59 PM, Malcolm Tye
<ma...@btinternet.com> wrote:

> Hi Norbert,
> I don't seem to be getting what I'm after. If my data looks
> like this:
>
> 1133957209,61:0:1
> 4524524233,21:0
>
> I want to produce
>
> 1133957209,61
> 1133957209,0
> 1133957209,1
> 4524524233,21
> 4524524233,0
>
> I changed the LOAD statement to
>
> mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array');
> opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as s_seg_id;
>
> I don't seem to be getting the cross product, just something like the 
> following
>
> 1133957209,61,0,1
> 4524524233,21,0
>
> Any ideas ?
>
>
> Thanks
>
> Malc
>
>
> -----Original Message-----
> From: Norbert Burger [mailto:norbert.burger@gmail.com]
> Sent: 06 April 2012 16:01
> To: user@pig.apache.org
> Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile
>
> Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to
> tokenize a chararray on some delimiter.  So the following should work:
>
> opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as 
> s_seg_id;
>
> Norbert
>
> On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye
> <ma...@btinternet.com> wrote:
>
> > Hi,
> >    I'm storing data into a partitioned table using Hive in RCFile 
> > format, but I want to use Pig to do the aggregation of that data.
> >
> > In my array<string> in Hive, I have colon-delimited data, e.g.
> >
> > :0:12:21:99:
> >
> > With the lateral view and explode functions in Hive, I can output 
> > each value as a separate row.
> >
> > In Pig, I think I need to use flatten, but it just outputs the array
> > as a single field, and I can't see where to specify that the colon
> > is the value separator.
> >
> > register /opt/pig/trunk/bin/piggybank.jar
> > mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array<string>');
> > opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
> > dump opt;
> >
> >
> >
> > Thanks
> >
> > Malc
> >
> >
> >
>
>