Posted to user@pig.apache.org by Mark <st...@gmail.com> on 2011/04/01 16:30:13 UTC

Re: Conversion

I created the following:

http://pastie.org/1743857

And I'm using it in the following way:

register 'target/pig-1.0-SNAPSHOT.jar';
rows = LOAD 'foo' AS (user:chararray, item:long);
grouped = GROUP rows BY user;
final = FOREACH grouped GENERATE FLATTEN(com.mycompany.pig.udf.Foo(rows.item));

Does that look about right? Is there any particular reason why I need to 
flatten at the end? When I try to output a simple tuple from the 
EvalFunc it is always a tuple inside a tuple.

Thanks


On 3/31/11 10:10 AM, Jonathan Coveney wrote:
> You definitely can do this with a UDF. You simply take the Tuples as input
> and then begin concatenating them together. Be wary of memory limitations
> for the intermediate as it gets large. It may be more practical to let the
> output be a tuple whose elements are the rows.
>
> (199027860,199027860,149167529,203508790,198488630)
>
> then the input to your UDF will be a tuple whose first element is a bag, and
> then the output will be a tuple of all the elements. It is quite easy to
> write something that does this; take a look at the UDF documentation and ask
> if you need any help.
>
> 2011/3/31 Mark<st...@gmail.com>
>
>> I have these "rows"
>>
>> ({(155495400)})
>> ({(199027860),(199027860),(149167529),(203508790),(198488630)})
>> ({(174255619),(201077556),(199051606),(198778302)})
>>
>> I believe the correct way to explain them would be each row/tuple is a bag
>> that contains tuples of size 1? Is that right?
>>
>> Anyway, is there something native or UDF I can use to convert them to this
>> format?
>>
>> (155495400)
>> (199027860 199027860 149167529 203508790 198488630)
>> (174255619 201077556 199051606 198778302)
>>
>> Maybe if I explain what we are trying to do it would help.
>>
>> We have logs of users to product views in a tab delimited format.
>>
>> foo\t1234
>> bar\t1234
>> foo\t4423
>> baz\t5563
>>
>> We simply want product views grouped by user and output on one line.
>>
>> 1234 4423
>> 1234
>> 5563
>>
>> The above first line would be from the user foo, second bar and third baz.
>>
>> Thanks
>>
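
For readers following along, here is a minimal sketch of the concatenation step Jonathan describes above: walk the bag and append each tuple's fields to one output tuple. It is a plain-Java model, not the code behind the pastie link. java.util.List stands in for Pig's bag and tuple types so the example runs without Pig on the classpath, and the names BagToTupleSketch, concatenate and joinWithSpaces are hypothetical. A real UDF would presumably extend org.apache.pig.EvalFunc<Tuple> and do the same loop inside exec(Tuple).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-JDK model (assumption: Lists stand in for Pig's DataBag and Tuple).
class BagToTupleSketch {

    // Outer list models the bag, each inner list models one tuple,
    // e.g. {(1234),(4423)}. Returns all fields in encounter order.
    public static List<Long> concatenate(List<List<Long>> bag) {
        List<Long> out = new ArrayList<>();
        for (List<Long> tuple : bag) {
            out.addAll(tuple); // append every field of this tuple
        }
        return out;
    }

    // Renders the concatenated fields on one line, separated by spaces.
    public static String joinWithSpaces(List<Long> fields) {
        return fields.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Product views for user "foo" from the sample log in the thread.
        List<List<Long>> fooViews =
                Arrays.asList(Arrays.asList(1234L), Arrays.asList(4423L));
        System.out.println(joinWithSpaces(concatenate(fooViews))); // prints "1234 4423"
    }
}
```

The output list grows with the size of the bag, which is the memory concern Jonathan raises for users with very many views.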

Re: Conversion

Posted by Mark <st...@gmail.com>.

How would I return a list of values?

(val1, val2, val3...)

I tried returning a List<Object>; however, I get a tuple that contains 
a tuple with a list of values, and I have to flatten it to get the 
desired behavior.

((val1, val2, val3...))

Thanks

On 4/1/11 10:09 AM, Dmitriy Ryaboy wrote:
> Right, Pig always returns a Tuple that contains whatever your UDF returns --
> so if you return a string, it returns a Tuple with a String in it.
> Unfortunately that also means that if you return a Tuple, you get a Tuple in
> a Tuple.
>
> We probably shouldn't do that, but at this point changing the behavior can
> break a lot of people's existing pig code :(.
>
> D

Re: Conversion

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Right, Pig always returns a Tuple that contains whatever your UDF returns --
so if you return a string, it returns a Tuple with a String in it.
Unfortunately that also means that if you return a Tuple, you get a Tuple in
a Tuple.

We probably shouldn't do that, but at this point changing the behavior can
break a lot of people's existing pig code :(.

D
