Posted to user@pig.apache.org by Mark <st...@gmail.com> on 2011/04/03 16:44:35 UTC
Distinct question
If I have a tuple of values, is there a way to eliminate duplicate
values per tuple?
Example:
(5,5,4,7,2,3,4,9) = (5,4,7,2,3,9)
Thanks
Re: Distinct question
Posted by Bill Graham <bi...@gmail.com>.
I was suggesting using the Set only as a means to de-duplicate your input
data. You'll want to iterate over the set and add each item to the
response tuple. That way you return a tuple of N unique objects
instead of a tuple containing a single Set of N objects.
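A minimal sketch of the de-duplication step described above, in plain Java so it runs without Pig on the classpath. The class and method names (DistinctValues, distinct) are hypothetical and not part of Pig. It uses a LinkedHashSet rather than the HashSet from the original UDF, on the assumption that keeping first-seen order is wanted, as the example in the question suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Pig: de-duplicates the fields of one tuple.
final class DistinctValues {

    // LinkedHashSet drops repeated values and keeps first-seen order.
    // A plain HashSet would also remove duplicates, but with no ordering guarantee.
    static List<Object> distinct(List<Object> fields) {
        return new ArrayList<Object>(new LinkedHashSet<Object>(fields));
    }

    public static void main(String[] args) {
        List<Object> in = Arrays.<Object>asList(5, 5, 4, 7, 2, 3, 4, 9);
        System.out.println(distinct(in)); // prints [5, 4, 7, 2, 3, 9]
    }
}
```

Inside the EvalFunc, the same idea would presumably be to pass tuple.getAll() to a helper like this and hand the resulting List to TupleFactory.getInstance().newTuple(...). Given a List, newTuple builds one field per element, whereas a Set argument ends up as a single field holding the whole Set.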
On Sun, Apr 3, 2011 at 9:57 AM, Jonathan Coveney <jc...@gmail.com> wrote:
> I do not know if this is it, but I am not sure that pig likes it when you
> use the result variable in its own declaration. That is to say, try doing
> rows2 = Foreach rows generate etc.
Re: Distinct question
Posted by Jonathan Coveney <jc...@gmail.com>.
I do not know if this is it, but I am not sure that Pig likes it when you
reuse the result alias in its own definition. That is to say, try doing
rows2 = FOREACH rows GENERATE etc.
Re: Distinct question
Posted by Mark <st...@gmail.com>.
I have a simple EvalFunc like so:
public class Set extends EvalFunc<Tuple> {
    public Tuple exec(Tuple tuple) throws IOException {
        Set<Object> unique = new HashSet<Object>();
        unique.addAll(tuple.getAll());
        return TupleFactory.getInstance().newTuple(unique);
    }
}
How can I apply this to a result set, though? When I try:
    rows = LOAD 'foo';
    rows = FOREACH rows GENERATE com.mycompany.piggybank.Set(rows);
    2011-04-03 09:16:25,423 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Scalars can be only used with projections
I get the above error. Should I be using something other than an EvalFunc?
Thanks
On 4/3/11 8:53 AM, Bill Graham wrote:
> You could add all the values to a set in a UDF and then return its contents.
Re: Distinct question
Posted by Bill Graham <bi...@gmail.com>.
You could add all the values to a set in a UDF and then return its contents.