Posted to user@pig.apache.org by Mark <st...@gmail.com> on 2011/04/03 16:44:35 UTC

Distinct question

If I have a tuple of values, is there a way to eliminate duplicate 
values per tuple?

Example:
(5,5,4,7,2,3,4,9) = (5,4,7,2,3,9)

Thanks

Re: Distinct question

Posted by Bill Graham <bi...@gmail.com>.

I suggested the Set only as a way to de-duplicate your input data.
You'll want to iterate over the set and add each item to the response
tuple, so that you return a tuple of N unique objects instead of a
tuple containing a single Set of N.
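
A minimal sketch of that idea in plain Java, so it runs without Pig on the
classpath. The class name DistinctValues and the method distinct are
hypothetical names for illustration. It uses a LinkedHashSet rather than a
HashSet so the first-seen order from the example is kept. Inside an EvalFunc
you would presumably call it on tuple.getAll() and hand the resulting list to
TupleFactory.getInstance().newTuple(List), which yields one field per value
rather than a single field holding the whole Set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Pig: de-duplicates a list of values.
public class DistinctValues {

    // Returns the distinct values in the order they were first seen.
    public static List<Object> distinct(List<?> values) {
        // LinkedHashSet drops repeats and remembers insertion order.
        return new ArrayList<Object>(new LinkedHashSet<Object>(values));
    }

    public static void main(String[] args) {
        List<Object> result = distinct(Arrays.asList(5, 5, 4, 7, 2, 3, 4, 9));
        System.out.println(result); // prints [5, 4, 7, 2, 3, 9]
    }
}
```

Giving the UDF class a name other than Set would also avoid a clash with
java.util.Set inside the class body.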

On Sun, Apr 3, 2011 at 9:57 AM, Jonathan Coveney <jc...@gmail.com> wrote:
> I do not know if this is it, but I am not sure that pig likes it when you
> use the result variable in its own declaration. That is to say, try doing
> rows2 = Foreach rows generate etc.
>
> 2011/4/3 Mark <st...@gmail.com>
>
>> I have a simple EvalFunc like so:
>>
>> public class Set extends EvalFunc<Tuple> {
>>  public Tuple exec(Tuple tuple) throws IOException {
>>    Set<Object> unique = new HashSet<Object>();
>>    unique.addAll(tuple.getAll());
>>    return TupleFactory.getInstance().newTuple(unique);
>>  }
>> }
>>
>> How can I apply this to a result set though?  When I try:
>>
>> rows = LOAD 'foo';
>> rows = FOREACH rows GENERATE com.mycompany.piggybank.Set(rows);
>> 2011-04-03 09:16:25,423 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1000: Error during parsing. Scalars can be only used with projections
>>
>> I get the above error. Should I be using something other than an EvalFunc?
>>
>> Thanks
>>
>>
>>
>> On 4/3/11 8:53 AM, Bill Graham wrote:
>>
>>> You could add all the values to a set in a UDF and then return its
>>> contents.
>>>
>>> On Sunday, April 3, 2011, Mark<st...@gmail.com>  wrote:
>>>
>>>> If I have a tuple of values, is there a way to eliminate duplicate values
>>>> per tuple?
>>>>
>>>> Example:
>>>> (5,5,4,7,2,3,4,9) = (5,4,7,2,3,9)
>>>>
>>>> Thanks

Re: Distinct question

Posted by Jonathan Coveney <jc...@gmail.com>.

I am not sure this is the cause, but Pig may not like it when you
reuse the result alias in its own definition. That is to say, try
rows2 = FOREACH rows GENERATE etc.

2011/4/3 Mark <st...@gmail.com>

> I have a simple EvalFunc like so:
>
> public class Set extends EvalFunc<Tuple> {
>  public Tuple exec(Tuple tuple) throws IOException {
>    Set<Object> unique = new HashSet<Object>();
>    unique.addAll(tuple.getAll());
>    return TupleFactory.getInstance().newTuple(unique);
>  }
> }
>
> How can I apply this to a result set though?  When I try:
>
> rows = LOAD 'foo';
> rows = FOREACH rows GENERATE com.mycompany.piggybank.Set(rows);
> 2011-04-03 09:16:25,423 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000: Error during parsing. Scalars can be only used with projections
>
> I get the above error. Should I be using something other than an EvalFunc?
>
> Thanks
>
>
>
> On 4/3/11 8:53 AM, Bill Graham wrote:
>
>> You could add all the values to a set in a UDF and then return its
>> contents.
>>
>> On Sunday, April 3, 2011, Mark<st...@gmail.com>  wrote:
>>
>>> If I have a tuple of values, is there a way to eliminate duplicate values
>>> per tuple?
>>>
>>> Example:
>>> (5,5,4,7,2,3,4,9) = (5,4,7,2,3,9)
>>>
>>> Thanks

Re: Distinct question

Posted by Mark <st...@gmail.com>.

I have a simple EvalFunc like so:

public class Set extends EvalFunc<Tuple> {
   public Tuple exec(Tuple tuple) throws IOException {
     Set<Object> unique = new HashSet<Object>();
     unique.addAll(tuple.getAll());
     return TupleFactory.getInstance().newTuple(unique);
   }
}

How can I apply this to a result set though?  When I try:

rows = LOAD 'foo';
rows = FOREACH rows GENERATE com.mycompany.piggybank.Set(rows);
2011-04-03 09:16:25,423 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
ERROR 1000: Error during parsing. Scalars can be only used with projections

I get the above error. Should I be using something other than an EvalFunc?

Thanks

On 4/3/11 8:53 AM, Bill Graham wrote:
> You could add all the values to a set in a UDF and then return its contents.
>
> On Sunday, April 3, 2011, Mark<st...@gmail.com>  wrote:
>> If I have a tuple of values, is there a way to eliminate duplicate values per tuple?
>>
>> Example:
>> (5,5,4,7,2,3,4,9) = (5,4,7,2,3,9)
>>
>> Thanks

Re: Distinct question

Posted by Bill Graham <bi...@gmail.com>.

You could add all the values to a set in a UDF and then return its contents.

On Sunday, April 3, 2011, Mark <st...@gmail.com> wrote:
> If I have a tuple of values, is there a way to eliminate duplicate values per tuple?
>
> Example:
> (5,5,4,7,2,3,4,9) = (5,4,7,2,3,9)
>
> Thanks