Posted to user@pig.apache.org by Prashanth Pappu <pr...@conviva.com> on 2008/08/14 20:53:59 UTC

DataAtom Types

I think we discussed a similar issue some time ago about problems with PIG's
implicit comparisons and DataAtoms.

Here's an example -

> describe A;
A:(X,Y)

> A={(1,2),(2,3)};
> B = foreach A generate X+1 as U, Y as V;
> cogroup_AB = cogroup A by X, B by U;

What do we expect cogroup_AB to contain? In the current implementation, it
is null. This is because B is {(2.0,2), (3.0,3)}, and cogroup uses the default
comparator in DataAtom, which is essentially a String comparator (according
to which "2.0" ne "2").

>
return (type == Type.STRING ) ? stringVal.compareTo(dOther.stringVal) :
            WritableComparator.compareBytes(binaryVal, 0, binaryVal.length,
                                            dOther.binaryVal, 0,
                                            dOther.binaryVal.length);
>
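
To make the mismatch concrete, here is a small standalone Java sketch. It is not Pig code: the class name KeyCompareDemo and both helper methods are made up for illustration, and it works on plain Strings rather than DataAtoms. It compares the same two keys once as strings and once as numbers.

```java
public class KeyCompareDemo {
    // Compares two keys the way a plain String comparator would.
    static int compareAsStrings(String a, String b) {
        return a.compareTo(b);
    }

    // Compares two keys after parsing both of them as doubles.
    static int compareAsNumbers(String a, String b) {
        return Double.compare(Double.parseDouble(a), Double.parseDouble(b));
    }

    public static void main(String[] args) {
        // The same pair of keys, compared two ways.
        System.out.println(compareAsStrings("2.0", "2") == 0); // false
        System.out.println(compareAsNumbers("2.0", "2") == 0); // true
    }
}
```

With a String comparison the keys never line up, so the two relations end up in separate groups.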

I would like to hear comments on how others are dealing with these issues.
-- Is there a simple PIG script level method of dealing with this issue?
-- Has anyone tried hacking the DataAtom comparator to detect 'numbers' in
strings, like

>
        // temporary fix for comparison of ints
        if (type == Type.STRING) {
                if (stringVal.matches("[0-9]*") ||
                    stringVal.matches("[0-9]*\\x2e[0-9]*")) {
                        // A number
                        if (dOther.numval() == numval()) {
                                return 0;
                        } else if (numval() < dOther.numval()) {
                                return -1;
                        } else {
                                return 1;
                        }
                } else {
                        return stringVal.compareTo(dOther.stringVal);
                }
        } else {
                return WritableComparator.compareBytes(binaryVal, 0,
                                            binaryVal.length,
                                            dOther.binaryVal, 0,
                                            dOther.binaryVal.length);
        }
>
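
For experimenting with this idea outside of Pig, a self-contained sketch might look like the following. The class NumericAwareCompare and its methods are hypothetical and operate on plain Strings, not on DataAtom. It differs from the snippet above in two ways that are assumptions on my part about what is wanted: it requires both sides to look numeric before comparing numerically, and its pattern rejects the empty string (which "[0-9]*" accepts).

```java
import java.util.regex.Pattern;

public class NumericAwareCompare {
    // Digits with an optional fractional part, e.g. "2", "2.0", ".5".
    // Unlike "[0-9]*", this pattern does not match the empty string.
    private static final Pattern NUMBER =
        Pattern.compile("[0-9]+(\\.[0-9]*)?|\\.[0-9]+");

    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }

    // Numeric comparison when both sides look numeric,
    // plain String comparison otherwise.
    static int compare(String a, String b) {
        if (isNumber(a) && isNumber(b)) {
            return Double.compare(Double.parseDouble(a), Double.parseDouble(b));
        }
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        System.out.println(compare("2.0", "2"));     // numeric: equal
        System.out.println(compare("10", "9"));      // numeric: 10 sorts after 9
        System.out.println(compare("abc", "abd"));   // falls back to String order
    }
}
```

One caveat with any comparator of this shape: mixing the two comparisons can give an ordering that is not transitive. For example "2" < "10" numerically, but "10" < "1a" and "1a" < "2" as strings, which may matter anywhere the comparator is used for sorting.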

-- Also, I know typed schemas are on PIG's roadmap. Is that going to
solve this issue? Do we have any details about which release might have
type support?

Thanks!
Prashanth

RE: DataAtom Types

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Try the types branch. At this point it is reasonably stable.

Olga 

> -----Original Message-----
> From: prashanth.rinera@gmail.com 
> [mailto:prashanth.rinera@gmail.com] On Behalf Of Prashanth Pappu
> Sent: Thursday, August 14, 2008 12:07 PM
> To: pig-user@incubator.apache.org
> Subject: Re: DataAtom Types
> 
> Thanks Alan.
> 
> Do you have any recommendations for dealing with this issue 
> in the interim?
> 
> Prashanth

Re: DataAtom Types

Posted by Prashanth Pappu <pr...@conviva.com>.
Thanks Alan.

Do you have any recommendations for dealing with this issue in the interim?

Prashanth

On Thu, Aug 14, 2008 at 12:02 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> There is code now in the types branch that will handle this case.  The only
> difference is that when you load your data, if you want this behavior, you
> must declare your types.  So:
>
> A = load 'firstfile'; -- contains (1, 2)
> B = load 'secondfile'; -- contains (1.0, 2.0)
> C = cogroup A by $0, B by $0;
> D = foreach C generate flatten(A), flatten(B);
> dump D;
>
> will return the same as before.
>
> But if you change it to:
> A = load 'firstfile' as (a: int, b:int);
> B = load 'secondfile' as (a: int, b:int);
>
> (or double if you prefer them to come out as doubles) then those two will
> be cast to the same type and the cogroup will go through.
>
> As for when this will be released, I don't know yet.  We're still doing a
> lot of bug fixing in this code.  So we're not ready to merge it into the
> trunk yet.
>
> Alan.

Re: DataAtom Types

Posted by Alan Gates <ga...@yahoo-inc.com>.
There is code now in the types branch that will handle this case.  The 
only difference is that when you load your data, if you want this 
behavior, you must declare your types.  So:

A = load 'firstfile'; -- contains (1, 2)
B = load 'secondfile'; -- contains (1.0, 2.0)
C = cogroup A by $0, B by $0;
D = foreach C generate flatten(A), flatten(B);
dump D;

will return the same as before.

But if you change it to:
A = load 'firstfile' as (a: int, b:int);
B = load 'secondfile' as (a: int, b:int);

(or double if you prefer them to come out as doubles) then those two 
will be cast to the same type and the cogroup will go through.
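
As a rough picture of why declaring the types helps, here is a standalone Java sketch. It is not how the types branch is implemented: the class TypedKeyDemo and the helpers asInt and asDouble are invented for illustration, and truncating "1.0" to the int 1 is an assumption about what the cast would do.

```java
public class TypedKeyDemo {
    // Hypothetical helper: interprets a raw field as an int,
    // truncating any fractional part (assumed behavior).
    static int asInt(String raw) {
        return (int) Double.parseDouble(raw);
    }

    // Hypothetical helper: interprets a raw field as a double.
    static double asDouble(String raw) {
        return Double.parseDouble(raw);
    }

    public static void main(String[] args) {
        String keyFromA = "1";   // as read from 'firstfile'
        String keyFromB = "1.0"; // as read from 'secondfile'

        // Untyped: the raw text differs, so the keys do not match.
        System.out.println(keyFromA.equals(keyFromB));                // false
        // Typed: both keys are cast to the same type before comparing.
        System.out.println(asInt(keyFromA) == asInt(keyFromB));       // true
        System.out.println(asDouble(keyFromA) == asDouble(keyFromB)); // true
    }
}
```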

As for when this will be released, I don't know yet.  We're still doing 
a lot of bug fixing in this code.  So we're not ready to merge it into 
the trunk yet.

Alan.

