Posted to user@pig.apache.org by hc busy <hc...@gmail.com> on 2010/04/02 20:29:47 UTC

What should FLATTEN do?

Guys, I have a row containing a bag

'id','data', {((1,2)), ((2,3)), ((4,5))}

What is the expected behavior when I flatten on that bag? I had expected it
to result in

'id','data', (1,2)
'id','data', (2,3)
'id','data', (4,5)


But it appears to me that the result of applying FLATTEN to that bag is this
instead:

'id','data', 1,2
'id','data', 2,3
'id','data', 4,5


The latter is what Cloudera's current CDH2 returns, and I've seen the
former behavior on other versions of Pig.

Which is the correct behavior by design?

What will pig 0.6 do when it is released?

thanks!
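
To make the two candidate behaviors concrete, here is a rough, Pig-free
sketch in plain Java. It is only a toy model: tuples are modeled as Lists and
a bag as a List of tuples, and the names FlattenSketch, flattenOneLevel and
flattenAllLevels are made up for illustration, not Pig classes or APIs. The
one-level variant assumes FLATTEN strips exactly one layer of nesting, which
is my reading of the documentation; the recursive variant reproduces what I
am seeing on CDH2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: a tuple is a List, a bag is a List of tuples.
// None of this is Pig's own code.
class FlattenSketch {

    // One-level flatten: emit one row per tuple in the bag and splice that
    // tuple's fields after the prefix fields. Nested tuples stay intact.
    static List<List<Object>> flattenOneLevel(List<Object> prefix,
                                              List<List<Object>> bag) {
        List<List<Object>> rows = new ArrayList<>();
        for (List<Object> tuple : bag) {
            List<Object> row = new ArrayList<>(prefix);
            row.addAll(tuple);
            rows.add(row);
        }
        return rows;
    }

    // Recursive flatten: keep unwrapping any field that is itself a tuple.
    static List<List<Object>> flattenAllLevels(List<Object> prefix,
                                               List<List<Object>> bag) {
        List<List<Object>> rows = new ArrayList<>();
        for (List<Object> tuple : bag) {
            List<Object> row = new ArrayList<>(prefix);
            spliceDeep(tuple, row);
            rows.add(row);
        }
        return rows;
    }

    private static void spliceDeep(List<?> tuple, List<Object> out) {
        for (Object field : tuple) {
            if (field instanceof List) {
                spliceDeep((List<?>) field, out);
            } else {
                out.add(field);
            }
        }
    }

    public static void main(String[] args) {
        List<Object> prefix = Arrays.<Object>asList("id", "data");

        // {((1,2)), ((2,3)), ((4,5))}: each bag element is a one-field
        // tuple whose single field is itself a pair.
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.<Object>asList(Arrays.<Object>asList(1, 2)));
        bag.add(Arrays.<Object>asList(Arrays.<Object>asList(2, 3)));
        bag.add(Arrays.<Object>asList(Arrays.<Object>asList(4, 5)));

        // prints [[id, data, [1, 2]], [id, data, [2, 3]], [id, data, [4, 5]]]
        System.out.println(flattenOneLevel(prefix, bag));
        // prints [[id, data, 1, 2], [id, data, 2, 3], [id, data, 4, 5]]
        System.out.println(flattenAllLevels(prefix, bag));
    }
}
```

The first printout corresponds to what I expected, the second to what CDH2
gives me. Which of the two Pig intends is exactly the question.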

Re: What should FLATTEN do?

Posted by Russell Jurney <ru...@gmail.com>.

Thanks.  I did so, but I probably did it wrong.  Couldn't make it work.

On Fri, Apr 2, 2010 at 1:49 PM, hc busy <hc...@gmail.com> wrote:

> Yeah, you have to implement the outputSchema() method on the UDF in order
> to make the contents of the tuple visible. There's a nice example in the
> UDF Manual:
>
> http://hadoop.apache.org/pig/docs/r0.6.0/udf.html
>
> Search for 'package myudf' until you find it.
>
>
>
> On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney <russell.jurney@gmail.com> wrote:
>
> > Not sure if this is exactly the same, but when I've created tuples within
> > tuples in UDFs (to preserve order of pairs), from bag input, Pig has
> > allowed it - but I can't work with that data in subsequent steps.
> >
> > On Fri, Apr 2, 2010 at 12:37 PM, hc busy <hc...@gmail.com> wrote:
> >
> > > Yeah, I'm sure it has nested tuples. Pig doesn't natively support
> > > introducing tuples:
> > >
> > > h = foreach g generate ((x,y,z)), (x), ((((x))))
> > >
> > > doesn't work, but I have a UDF that does that (don't ask why), and
> > > I've seen it print a double pair of parens when I took a dump.
> > >
> > > Our Hadoop guys here say it's CDH2 and that the "upgrade" was just a
> > > re-installation of CDH2 ("same jars"). But certainly my script suddenly
> > > started doing weird things when it flattened that all the way through.
> > >
> > > I'd support the prior behavior as well, because that seems to match my
> > > reading of the documentation on the behavior of FLATTEN.
> > >
> > >
> > >
> > > Has anybody else had this problem with recent cloudera/pig versions?
> > >
> > >
> > > thnx!!
> > >
> > >
> > > On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman <zaki.rahaman@gmail.com> wrote:
> > >
> > > > Stupid question, but are you sure your bag has the dual sets of
> > > > parentheses? (And if I may ask, why is that the case?)
> > > >
> > > > On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman <zaki.rahaman@gmail.com> wrote:
> > > >
> > > > > If I'm not mistaken, the output is the expected behavior. Flatten should
> > > > > unnest bags. I'm assuming your statement is something like FOREACH ...
> > > > > GENERATE field1, field2, FLATTEN(bag1), which would 'duplicate' the first
> > > > > two fields of a tuple for every tuple in the nested bag.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Apr 2, 2010 at 2:02 PM, hc busy <hc...@gmail.com> wrote:
> > > > >
> > > > >> doh!!!! s/map/bag/g
> > > > >>
> > > > > >> I seem to get maps and bags mixed up for some reason...
> > > > >>
> > > > >> Guys, I have a row containing a *bag*
> > > > >>
> > > > >> 'id','data', {((1,2)), ((2,3)), ((4,5))}
> > > > >>
> > > > > >> What is the expected behavior when I flatten on that bag? I had expected
> > > > > >> it to result in
> > > > >>
> > > > >> 'id','data', (1,2)
> > > > >> 'id','data', (2,3)
> > > > >> 'id','data', (4,5)
> > > > >>
> > > > >>
> > > > > >> But it appears to me that the result of applying FLATTEN to that bag is
> > > > > >> this instead:
> > > > >>
> > > > >> 'id','data', 1,2
> > > > >> 'id','data', 2,3
> > > > >> 'id','data', 4,5
> > > > >>
> > > > >>
> > > > > >> The latter is returned by the current cloudera's CDH2 and I've seen the
> > > > > >> prior behavior on other versions of pig.
> > > > >>
> > > > >> Which is the correct behavior by design?
> > > > >>
> > > > >> What will pig 0.6 do when it is released?
> > > > >>
> > > > >> thanks!
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Zaki Rahaman
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Zaki Rahaman
> > > >
> > >
> >
>

Re: What should FLATTEN do?

Posted by hc busy <hc...@gmail.com>.

The hadoop version:

hadoop-0.20-0.20.1+169.68-1

On Fri, Apr 2, 2010 at 2:33 PM, hc busy <hc...@gmail.com> wrote:

> Okay guys, some details after some digging. We've got this version of Pig
> from CDH2 installed:
>
> hadoop-pig-0.5.0+11.1-1
>
> The list of patches that they applied on top of 0.5.0 is here:
>
> http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txt
>
> The patches listed there don't seem to deal with FLATTEN in any way.
>
> Any suggestions?

Re: What should FLATTEN do?

Posted by hc busy <hc...@gmail.com>.

Okay guys, some details after some digging. We've got this version of Pig
from CDH2 installed:

hadoop-pig-0.5.0+11.1-1

The list of patches that they applied on top of 0.5.0 is here:

http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txt

The patches listed there don't seem to deal with FLATTEN in any way.

Any suggestions?




On Fri, Apr 2, 2010 at 1:49 PM, hc busy <hc...@gmail.com> wrote:

>
> Yeah, you have to implement the outputSchema() method on the UDF in order
> to make the contents of the tuple visible. There's a nice example in the
> UDF Manual:
>
> http://hadoop.apache.org/pig/docs/r0.6.0/udf.html
>
> Search for 'package myudf' until you find it.

Re: What should FLATTEN do?

Posted by hc busy <hc...@gmail.com>.

Yeah, you have to implement the outputSchema() method on the UDF in order
to make the contents of the tuple visible. There's a nice example in the
UDF Manual:

http://hadoop.apache.org/pig/docs/r0.6.0/udf.html

Search for 'package myudf' until you find it.



On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney <ru...@gmail.com> wrote:

> Not sure if this is exactly the same, but when I've created tuples within
> tuples in UDFs (to preserve order of pairs), from bag input, Pig has
> allowed it - but I can't work with that data in subsequent steps.

> > > >> >
> > > >> > 'id','data', {((1,2)), ((2,3)), ((4,5))}
> > > >> >
> > > >> > What is the expected behavior when I flatten on that bag? I had
> > > expected
> > > >> it
> > > >> > to result in
> > > >> >
> > > >> > 'id','data', (1,2)
> > > >> > 'id','data', (2,3)
> > > >> > 'id','data', (4,5)
> > > >> >
> > > >> >
> > > >> > But it appears to me that the result of applying FLATTEN to that
> bag
> > > is
> > > >> > this instead:
> > > >> >
> > > >> > 'id','data', 1,2
> > > >> > 'id','data', 2,3
> > > >> > 'id','data', 4,5
> > > >> >
> > > >> >
> > > >> > The latter is returned by the current cloudera's CDH2 and I've
> seen
> > > the
> > > >> > prior behavior on other versions of pig.
> > > >> >
> > > >> > Which is the correct behavior by design?
> > > >> >
> > > >> > What will pig 0.6 do when it is released?
> > > >> >
> > > >> > thanks!
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Zaki Rahaman
> > > >
> > > >
> > >
> > >
> > > --
> > > Zaki Rahaman
> > >
> >
>

Re: What should FLATTEN do?

Posted by Russell Jurney <ru...@gmail.com>.

Not sure if this is exactly the same, but when I've created tuples within
tuples in UDFs (to preserve order of pairs), from bag input, Pig has allowed
it - but I can't work with that data in subsequent steps.

On Fri, Apr 2, 2010 at 12:37 PM, hc busy <hc...@gmail.com> wrote:

> Yeah, I'm sure it has nested tuples. Pig doesn't natively support
> introduction of tuples
>
> h = foreach g generate ((x,y,z)), (x), ((((x))))
>
> doesn't work, but i have a udf that does that.... don't ask why...., and
> I've seen it print double pair of paren's when I took a dump.
>
> Our hadoop guys here says it's CDH2 and that the "upgrade" was just
> re-installation of CDH2... ("same jars") But certainly my script suddenly
> started doing weird things when it flattened that all the way through.
>
> I'd support the prior behavior as well, because that seems to match my
> reading of documentation on behavior of FLATTEN.
>
>
>
> Has anybody else had this problem with recent cloudera/pig versions?
>
>
> thnx!!
>
>
> On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman <zaki.rahaman@gmail.com
> >wrote:
>
> > Stupid question but are you sure your bag has the dual sets of
> parentheses?
> > (And if I may ask, why is that the case?)
> >
> > On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman <za...@gmail.com>
> > wrote:
> >
> > > If I'm not mistaken, the output is the expected behavior. Flatten
> should
> > > unnest bags. I'm assuming your statement is something like FOREACH ...
> > > GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the
> first
> > two
> > > fields of a tuple for every tuple in the nested bag.
> > >
> > >
> > >
> > >
> > > On Fri, Apr 2, 2010 at 2:02 PM, hc busy <hc...@gmail.com> wrote:
> > >
> > >> doh!!!! s/map/bag/g
> > >>
> > >> I seem to get maps and bags mixed up or some reason...
> > >>
> > >> Guys, I have a row containing a *bag*
> > >>
> > >> 'id','data', {((1,2)), ((2,3)), ((4,5))}
> > >>
> > >> What is the expected behavior when I flatten on that bag? I had
> expected
> > >> it
> > >> to result in
> > >>
> > >> 'id','data', (1,2)
> > >> 'id','data', (2,3)
> > >> 'id','data', (4,5)
> > >>
> > >>
> > >> But it appears to me that the result of applying FLATTEN to that bag
> is
> > >> this
> > >> instead:
> > >>
> > >> 'id','data', 1,2
> > >> 'id','data', 2,3
> > >> 'id','data', 4,5
> > >>
> > >>
> > >> The latter is returned by the current cloudera's CDH2 and I've seen
> the
> > >> prior behavior on other versions of pig.
> > >>
> > >> Which is the correct behavior by design?
> > >>
> > >> What will pig 0.6 do when it is released?
> > >>
> > >> thanks!
> > >> On Fri, Apr 2, 2010 at 11:29 AM, hc busy <hc...@gmail.com> wrote:
> > >>
> > >> > Guys, I have a row containing a map
> > >> >
> > >> > 'id','data', {((1,2)), ((2,3)), ((4,5))}
> > >> >
> > >> > What is the expected behavior when I flatten on that bag? I had
> > expected
> > >> it
> > >> > to result in
> > >> >
> > >> > 'id','data', (1,2)
> > >> > 'id','data', (2,3)
> > >> > 'id','data', (4,5)
> > >> >
> > >> >
> > >> > But it appears to me that the result of applying FLATTEN to that bag
> > is
> > >> > this instead:
> > >> >
> > >> > 'id','data', 1,2
> > >> > 'id','data', 2,3
> > >> > 'id','data', 4,5
> > >> >
> > >> >
> > >> > The latter is returned by the current cloudera's CDH2 and I've seen
> > the
> > >> > prior behavior on other versions of pig.
> > >> >
> > >> > Which is the correct behavior by design?
> > >> >
> > >> > What will pig 0.6 do when it is released?
> > >> >
> > >> > thanks!
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Zaki Rahaman
> > >
> > >
> >
> >
> > --
> > Zaki Rahaman
> >
>

Re: What should FLATTEN do?

Posted by hc busy <hc...@gmail.com>.

Yeah, I'm sure it has nested tuples. Pig doesn't natively support
constructing tuples, so

h = foreach g generate ((x,y,z)), (x), ((((x))))

doesn't work, but I have a UDF that does that (don't ask why), and
I've seen it print a double pair of parens when I did a dump.

Our Hadoop guys here say it's CDH2 and that the "upgrade" was just a
re-installation of CDH2 ("same jars"). But my script certainly started
doing weird things once it flattened that all the way through.

I'd support the prior behavior as well, because that seems to match my
reading of the documentation on the behavior of FLATTEN.

Has anybody else had this problem with recent Cloudera/Pig versions?

Thanks!


On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman <za...@gmail.com>wrote:

> Stupid question but are you sure your bag has the dual sets of parentheses?
> (And if I may ask, why is that the case?)
>
> On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman <za...@gmail.com>
> wrote:
>
> > If I'm not mistaken, the output is the expected behavior. Flatten should
> > unnest bags. I'm assuming your statement is something like FOREACH ...
> > GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the first
> two
> > fields of a tuple for every tuple in the nested bag.
> >
> >
> >
> >
> > On Fri, Apr 2, 2010 at 2:02 PM, hc busy <hc...@gmail.com> wrote:
> >
> >> doh!!!! s/map/bag/g
> >>
> >> I seem to get maps and bags mixed up or some reason...
> >>
> >> Guys, I have a row containing a *bag*
> >>
> >> 'id','data', {((1,2)), ((2,3)), ((4,5))}
> >>
> >> What is the expected behavior when I flatten on that bag? I had expected
> >> it
> >> to result in
> >>
> >> 'id','data', (1,2)
> >> 'id','data', (2,3)
> >> 'id','data', (4,5)
> >>
> >>
> >> But it appears to me that the result of applying FLATTEN to that bag is
> >> this
> >> instead:
> >>
> >> 'id','data', 1,2
> >> 'id','data', 2,3
> >> 'id','data', 4,5
> >>
> >>
> >> The latter is returned by the current cloudera's CDH2 and I've seen the
> >> prior behavior on other versions of pig.
> >>
> >> Which is the correct behavior by design?
> >>
> >> What will pig 0.6 do when it is released?
> >>
> >> thanks!
> >> On Fri, Apr 2, 2010 at 11:29 AM, hc busy <hc...@gmail.com> wrote:
> >>
> >> > Guys, I have a row containing a map
> >> >
> >> > 'id','data', {((1,2)), ((2,3)), ((4,5))}
> >> >
> >> > What is the expected behavior when I flatten on that bag? I had
> expected
> >> it
> >> > to result in
> >> >
> >> > 'id','data', (1,2)
> >> > 'id','data', (2,3)
> >> > 'id','data', (4,5)
> >> >
> >> >
> >> > But it appears to me that the result of applying FLATTEN to that bag
> is
> >> > this instead:
> >> >
> >> > 'id','data', 1,2
> >> > 'id','data', 2,3
> >> > 'id','data', 4,5
> >> >
> >> >
> >> > The latter is returned by the current cloudera's CDH2 and I've seen
> the
> >> > prior behavior on other versions of pig.
> >> >
> >> > Which is the correct behavior by design?
> >> >
> >> > What will pig 0.6 do when it is released?
> >> >
> >> > thanks!
> >> >
> >>
> >
> >
> >
> > --
> > Zaki Rahaman
> >
> >
>
>
> --
> Zaki Rahaman
>

Re: What should FLATTEN do?

Posted by zaki rahaman <za...@gmail.com>.

Stupid question but are you sure your bag has the dual sets of parentheses?
(And if I may ask, why is that the case?)

On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman <za...@gmail.com> wrote:

> If I'm not mistaken, the output is the expected behavior. Flatten should
> unnest bags. I'm assuming your statement is something like FOREACH ...
> GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the first two
> fields of a tuple for every tuple in the nested bag.
>
>
>
>
> On Fri, Apr 2, 2010 at 2:02 PM, hc busy <hc...@gmail.com> wrote:
>
>> doh!!!! s/map/bag/g
>>
>> I seem to get maps and bags mixed up or some reason...
>>
>> Guys, I have a row containing a *bag*
>>
>> 'id','data', {((1,2)), ((2,3)), ((4,5))}
>>
>> What is the expected behavior when I flatten on that bag? I had expected
>> it
>> to result in
>>
>> 'id','data', (1,2)
>> 'id','data', (2,3)
>> 'id','data', (4,5)
>>
>>
>> But it appears to me that the result of applying FLATTEN to that bag is
>> this
>> instead:
>>
>> 'id','data', 1,2
>> 'id','data', 2,3
>> 'id','data', 4,5
>>
>>
>> The latter is returned by the current cloudera's CDH2 and I've seen the
>> prior behavior on other versions of pig.
>>
>> Which is the correct behavior by design?
>>
>> What will pig 0.6 do when it is released?
>>
>> thanks!
>> On Fri, Apr 2, 2010 at 11:29 AM, hc busy <hc...@gmail.com> wrote:
>>
>> > Guys, I have a row containing a map
>> >
>> > 'id','data', {((1,2)), ((2,3)), ((4,5))}
>> >
>> > What is the expected behavior when I flatten on that bag? I had expected
>> it
>> > to result in
>> >
>> > 'id','data', (1,2)
>> > 'id','data', (2,3)
>> > 'id','data', (4,5)
>> >
>> >
>> > But it appears to me that the result of applying FLATTEN to that bag is
>> > this instead:
>> >
>> > 'id','data', 1,2
>> > 'id','data', 2,3
>> > 'id','data', 4,5
>> >
>> >
>> > The latter is returned by the current cloudera's CDH2 and I've seen the
>> > prior behavior on other versions of pig.
>> >
>> > Which is the correct behavior by design?
>> >
>> > What will pig 0.6 do when it is released?
>> >
>> > thanks!
>> >
>>
>
>
>
> --
> Zaki Rahaman
>
>


-- 
Zaki Rahaman

Re: What should FLATTEN do?

Posted by zaki rahaman <za...@gmail.com>.

If I'm not mistaken, the output is the expected behavior. Flatten should
unnest bags. I'm assuming your statement is something like FOREACH ...
GENERATE field1, field2, FLATTEN(bag1) which would 'duplicate' the first two
fields of a tuple for every tuple in the nested bag.
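
To make the "one row per tuple in the bag" idea concrete, here is a rough
Python model of it. This is not Pig code: flatten_bag is a made-up helper,
a Python list stands in for a bag, and a Python tuple stands in for a Pig
tuple. It assumes the bag holds plain (singly nested) tuples.

```python
def flatten_bag(row, bag_index):
    """Model FOREACH ... GENERATE f1, f2, FLATTEN(bag): one row per bag tuple.

    Hypothetical helper for illustration only; a list stands in for a Pig
    bag and a tuple stands in for a Pig tuple.
    """
    prefix = row[:bag_index]
    suffix = row[bag_index + 1:]
    # The non-bag fields are repeated for every tuple in the bag, and the
    # fields of that tuple are spliced into the row in the bag's place.
    return [prefix + element + suffix for element in row[bag_index]]


row = ("id", "data", [(1, 2), (2, 3), (4, 5)])
for out in flatten_bag(row, 2):
    print(out)
```

In this model the first two fields are repeated once per element of the
bag, which is the 'duplication' described above.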



On Fri, Apr 2, 2010 at 2:02 PM, hc busy <hc...@gmail.com> wrote:

> doh!!!! s/map/bag/g
>
> I seem to get maps and bags mixed up or some reason...
>
> Guys, I have a row containing a *bag*
>
> 'id','data', {((1,2)), ((2,3)), ((4,5))}
>
> What is the expected behavior when I flatten on that bag? I had expected it
> to result in
>
> 'id','data', (1,2)
> 'id','data', (2,3)
> 'id','data', (4,5)
>
>
> But it appears to me that the result of applying FLATTEN to that bag is
> this
> instead:
>
> 'id','data', 1,2
> 'id','data', 2,3
> 'id','data', 4,5
>
>
> The latter is returned by the current cloudera's CDH2 and I've seen the
> prior behavior on other versions of pig.
>
> Which is the correct behavior by design?
>
> What will pig 0.6 do when it is released?
>
> thanks!
> On Fri, Apr 2, 2010 at 11:29 AM, hc busy <hc...@gmail.com> wrote:
>
> > Guys, I have a row containing a map
> >
> > 'id','data', {((1,2)), ((2,3)), ((4,5))}
> >
> > What is the expected behavior when I flatten on that bag? I had expected
> it
> > to result in
> >
> > 'id','data', (1,2)
> > 'id','data', (2,3)
> > 'id','data', (4,5)
> >
> >
> > But it appears to me that the result of applying FLATTEN to that bag is
> > this instead:
> >
> > 'id','data', 1,2
> > 'id','data', 2,3
> > 'id','data', 4,5
> >
> >
> > The latter is returned by the current cloudera's CDH2 and I've seen the
> > prior behavior on other versions of pig.
> >
> > Which is the correct behavior by design?
> >
> > What will pig 0.6 do when it is released?
> >
> > thanks!
> >
>



-- 
Zaki Rahaman

Re: What should FLATTEN do?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

CDH2 or CDH3?

CDH2 is basically Pig 0.{4,5}. CDH3 is somewhere between 0.5 and 0.6.

I expect the first result -- a flattened bag of tuples results in multiple
rows, each containing the (not-flattened) tuple.
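
The difference between the two results in the question can be sketched in
Python. This is only a model of the semantics, not Pig itself: the function
names are invented for illustration, a list stands in for a bag, and a
tuple stands in for a Pig tuple. flatten_one_level removes a single level
of nesting (the result I expect), while flatten_all_levels also unpacks
the inner tuple (the result reported from CDH2).

```python
def flatten_one_level(row, bag_index):
    """Remove one level of nesting: one output row per tuple in the bag."""
    prefix, bag, suffix = row[:bag_index], row[bag_index], row[bag_index + 1:]
    return [prefix + element + suffix for element in bag]


def unnest(value):
    """Recursively expand nested tuples into a flat tuple of scalars."""
    if isinstance(value, tuple):
        flat = ()
        for item in value:
            flat += unnest(item)
        return flat
    return (value,)


def flatten_all_levels(row, bag_index):
    """Like flatten_one_level, but also unpacks tuples nested inside tuples."""
    prefix, bag, suffix = row[:bag_index], row[bag_index], row[bag_index + 1:]
    return [prefix + unnest(element) + suffix for element in bag]


# The bag from the thread: each element is a tuple wrapping another tuple.
row = ("id", "data", [((1, 2),), ((2, 3),), ((4, 5),)])

print(flatten_one_level(row, 2)[0])   # inner tuple kept intact
print(flatten_all_levels(row, 2)[0])  # inner tuple unpacked as well
```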

Btw, Pig 0.6 is out.

-D

On Fri, Apr 2, 2010 at 11:32 AM, hc busy <hc...@gmail.com> wrote:

> doh!!!! s/map/bag/g
>
> I seem to get maps and bags mixed up or some reason...
>
> Guys, I have a row containing a *bag*
>
> 'id','data', {((1,2)), ((2,3)), ((4,5))}
>
> What is the expected behavior when I flatten on that bag? I had expected it
> to result in
>
> 'id','data', (1,2)
> 'id','data', (2,3)
> 'id','data', (4,5)
>
>
> But it appears to me that the result of applying FLATTEN to that bag is
> this
> instead:
>
> 'id','data', 1,2
> 'id','data', 2,3
> 'id','data', 4,5
>
>
> The latter is returned by the current cloudera's CDH2 and I've seen the
> prior behavior on other versions of pig.
>
> Which is the correct behavior by design?
>
> What will pig 0.6 do when it is released?
>
> thanks!
> On Fri, Apr 2, 2010 at 11:29 AM, hc busy <hc...@gmail.com> wrote:
>
> > Guys, I have a row containing a map
> >
> > 'id','data', {((1,2)), ((2,3)), ((4,5))}
> >
> > What is the expected behavior when I flatten on that bag? I had expected
> it
> > to result in
> >
> > 'id','data', (1,2)
> > 'id','data', (2,3)
> > 'id','data', (4,5)
> >
> >
> > But it appears to me that the result of applying FLATTEN to that bag is
> > this instead:
> >
> > 'id','data', 1,2
> > 'id','data', 2,3
> > 'id','data', 4,5
> >
> >
> > The latter is returned by the current cloudera's CDH2 and I've seen the
> > prior behavior on other versions of pig.
> >
> > Which is the correct behavior by design?
> >
> > What will pig 0.6 do when it is released?
> >
> > thanks!
> >
>

Re: What should FLATTEN do?

Posted by hc busy <hc...@gmail.com>.

doh!!!! s/map/bag/g

I seem to get maps and bags mixed up for some reason...

Guys, I have a row containing a *bag*

'id','data', {((1,2)), ((2,3)), ((4,5))}

What is the expected behavior when I flatten on that bag? I had expected it
to result in

'id','data', (1,2)
'id','data', (2,3)
'id','data', (4,5)

But it appears to me that the result of applying FLATTEN to that bag is this
instead:

'id','data', 1,2
'id','data', 2,3
'id','data', 4,5

The latter is what Cloudera's current CDH2 returns, and I've seen the
former behavior on other versions of Pig.

Which is the correct behavior by design?

What will Pig 0.6 do when it is released?

Thanks!

On Fri, Apr 2, 2010 at 11:29 AM, hc busy <hc...@gmail.com> wrote:

> Guys, I have a row containing a map
>
> 'id','data', {((1,2)), ((2,3)), ((4,5))}
>
> What is the expected behavior when I flatten on that bag? I had expected it
> to result in
>
> 'id','data', (1,2)
> 'id','data', (2,3)
> 'id','data', (4,5)
>
>
> But it appears to me that the result of applying FLATTEN to that bag is
> this instead:
>
> 'id','data', 1,2
> 'id','data', 2,3
> 'id','data', 4,5
>
>
> The latter is returned by the current cloudera's CDH2 and I've seen the
> prior behavior on other versions of pig.
>
> Which is the correct behavior by design?
>
> What will pig 0.6 do when it is released?
>
> thanks!
>
