Posted to user@pig.apache.org by Stan Rosenberg <sr...@proclivitysystems.com> on 2012/01/26 04:11:02 UTC

explode operation

Hi Guys,

I came across a use case that seems to require an 'explode' operation,
which to my knowledge is not currently available in Pig.
That is, given a tuple (x,y,z), 'explode' would generate the tuples
(x), (y), (z).

E.g., consider a relation that contains an arbitrary number of
different identifier columns, say, social security id, student id, etc.
We want to compute the set of all distinct identifiers.  Assume that the
identifier columns are numerous and intermingled with other columns that
should be projected out; this is meant to rule out solutions such as
one using 'SPLIT'.

To be concrete, if X = {(..., 2, 4, ..., 3), (..., 2,,...,5)} is such
a relation, then the answer we want is
Y={2,3,4,5}.

Any suggestions?

Thanks,

stan
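
For reference, the desired result can be modeled outside Pig. The
following Python sketch only illustrates the semantics asked for above;
it is not Pig code. The column names (ssn, student_id, name, other_id)
and the sample rows are hypothetical, chosen so that the identifier
values match the X example.

```python
# Hypothetical rows: three identifier columns mixed with a non-identifier column.
rows = [
    {"ssn": 2, "student_id": 4, "name": "a", "other_id": 3},
    {"ssn": 2, "student_id": None, "name": "b", "other_id": 5},
]

# Columns to keep; every other column is projected out.
ID_COLUMNS = ["ssn", "student_id", "other_id"]


def distinct_ids(rows, id_columns):
    """Return the set of non-null values found in the identifier columns."""
    return {
        row[col]
        for row in rows
        for col in id_columns
        if row[col] is not None
    }


Y = distinct_ids(rows, ID_COLUMNS)
print(sorted(Y))  # [2, 3, 4, 5]
```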

Re: explode operation

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.

On Mon, Jan 30, 2012 at 2:25 AM, Aniket Mokashi <an...@gmail.com> wrote:
> Isn't FLATTEN similar to explode?

Not quite. EXPLODE would take a record with n fields and generate n records.
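
As a rough illustration of that definition, the sketch below models the
proposed operation in Python. The names explode and explode_relation are
hypothetical; no such operator is claimed to exist in Pig.

```python
from typing import Any, Iterable, Iterator, Tuple


def explode(record: Tuple[Any, ...]) -> Iterator[Tuple[Any]]:
    """Yield one single-field record per field of the input record."""
    for field in record:
        yield (field,)


def explode_relation(relation: Iterable[Tuple[Any, ...]]) -> Iterator[Tuple[Any]]:
    """Apply explode to every record: each n-field record becomes n records."""
    for record in relation:
        yield from explode(record)


print(list(explode(("x", "y", "z"))))  # [('x',), ('y',), ('z',)]
```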

Re: explode operation

Posted by Aniket Mokashi <an...@gmail.com>.

Isn't FLATTEN similar to explode?

On Sun, Jan 29, 2012 at 5:46 PM, Stan Rosenberg <
srosenberg@proclivitysystems.com> wrote:

> Hi Jonathan,
>
> What you recommended below is not quite right.  The right solution
> would need to do something similar to 'explode'.
>
> Thanks,
>
> stan



-- 
"...:::Aniket:::... Quetzalco@tl"

Re: explode operation

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.

Hi Jonathan,

What you recommended below is not quite right.  The right solution
would need to do something similar to 'explode'.

Thanks,

stan

On Thu, Jan 26, 2012 at 3:04 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> I think this might give you what you want
>
> X = LOAD 'input.txt' using PigStorage(',') AS (id1:chararray,
> id2:chararray, id3:chararray, id4:chararray, id5:chararray);
> Y_0 = foreach X generate FLATTEN(TOBAG(*));
> Y = filter Y_0 by $0 is not null;

Re: explode operation

Posted by Jonathan Coveney <jc...@gmail.com>.

I think this might give you what you want

X = LOAD 'input.txt' using PigStorage(',') AS (id1:chararray,
id2:chararray, id3:chararray, id4:chararray, id5:chararray);
Y_0 = foreach X generate FLATTEN(TOBAG(*));
Y = filter Y_0 by $0 is not null;
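
To make the intent of each statement explicit, here is a rough Python
model of the script. It assumes that TOBAG(*) wraps every field of a row
in its own single-field tuple, that FLATTEN emits one output row per
tuple in the bag, and that PigStorage loads empty fields as null. The
sample lines are hypothetical, and the final distinct step is an addition
that is not in the script; it is included only because the original
question asks for the set of distinct ids.

```python
# Hypothetical contents of 'input.txt': five comma-separated id fields per line.
lines = ["2,4,,,3", "2,,,,5"]


def load(lines, sep=","):
    """Model of LOAD ... PigStorage(','): empty fields become None (null)."""
    return [tuple(f if f != "" else None for f in line.split(sep)) for line in lines]


def to_bag(row):
    """Model of TOBAG(*): each field becomes its own single-field tuple."""
    return [(field,) for field in row]


def flatten(bags):
    """Model of FLATTEN over a bag: one output row per tuple in each bag."""
    return [t for bag in bags for t in bag]


X = load(lines)
Y_0 = flatten(to_bag(row) for row in X)
Y = [t for t in Y_0 if t[0] is not None]  # filter Y_0 by $0 is not null
distinct_Y = sorted(set(Y))               # extra step, not in the script above

print(distinct_Y)  # [('2',), ('3',), ('4',), ('5',)]
```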

2012/1/25 Prashant Kommireddi <pr...@gmail.com>

> Sorry, I misunderstood your initial question. You would have to write a
> custom UDF to do this.
>
> Thanks,
> Prashant

Re: explode operation

Posted by Prashant Kommireddi <pr...@gmail.com>.

Sorry, I misunderstood your initial question. You would have to write a
custom UDF to do this.

Thanks,
Prashant

On Jan 25, 2012, at 7:32 PM, Stan Rosenberg
<sr...@proclivitysystems.com> wrote:

> To clarify, here is our input:
>
> X = LOAD 'input.txt' AS (id1:chararray, id2:chararray,
> id3:chararray, id4:chararray, id5:chararray);
>
> We want to compute Y that consists of a single column denoting the set
> of all (non-null) ids coming from X.
>
> stan

Re: explode operation

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.

To clarify, here is our input:

X = LOAD 'input.txt' AS (id1:chararray, id2:chararray,
id3:chararray, id4:chararray, id5:chararray);

We want to compute Y, a single-column relation containing the set
of all (non-null) ids from X.

stan


On Wed, Jan 25, 2012 at 10:26 PM, Stan Rosenberg
<sr...@proclivitysystems.com> wrote:
> I don't see how flatten would help in this case.

Re: explode operation

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.

I don't see how flatten would help in this case.

On Wed, Jan 25, 2012 at 10:19 PM, Prashant Kommireddi
<pr...@gmail.com> wrote:
> Hi Stan,
>
> Would using FLATTEN and then DISTINCT work?
>
> Thanks,
> Prashant

Re: explode operation

Posted by Prashant Kommireddi <pr...@gmail.com>.

Hi Stan,

Would using FLATTEN and then DISTINCT work?

Thanks,
Prashant
