Posted to user@pig.apache.org by Lauren Blau <la...@digitalreasoning.com> on 2012/08/30 23:59:55 UTC

wrong sort order (lexical vs numeric) in a nested foreach

I have the following foreach:

foo := foreach bar {

Re: wrong sort order (lexical vs numeric) in a nested foreach

Posted by Lauren Blau <la...@digitalreasoning.com>.
I think I finally found the culprit. There is a load like this:

a = load '/foobar' using CustomJsonLoader('baz') as (m:map[]);  -- loading an untyped map

then there is a flatten:

a1 = foreach a generate m#'id' as id: chararray, flatten(m#'listvals') as (listvals: map[]);  -- another untyped map

and then:

a2 = foreach a1 generate id as id: chararray, listvals#'intval1' as intval1: int, listvals#'intval2' as intval2: int;

By adding explicit casts, like this:

a2 = foreach a1 generate id as id: chararray, (int) listvals#'intval1' as intval1: int, (int) listvals#'intval2' as intval2: int;

I finally get the results I was looking for, without having to store and
reload.

Thanks

On Tue, Sep 4, 2012 at 2:05 PM, Lauren Blau <
lauren.blau@digitalreasoning.com> wrote:

> Unfortunately, I can't put together an example without sharing the custom
> JSON loader and data. But I've worked around this by explicitly storing and
> reloading the data.
> But it sounds like you have it backwards in your attempt to be sneaky. The
> data actually is an int and should be sorted numerically.
> I just read something in another email that leads me to believe I need to
> cast the left-hand side of values in a foreach, like:
> foreach x generate (int)field as fieldname: int;
>
> So I'm going to try explicit casts like that.
>
>
> Thanks,
> lauren

Re: wrong sort order (lexical vs numeric) in a nested foreach

Posted by Lauren Blau <la...@digitalreasoning.com>.
Unfortunately, I can't put together an example without sharing the custom
JSON loader and data. But I've worked around this by explicitly storing and
reloading the data.
But it sounds like you have it backwards in your attempt to be sneaky. The
data actually is an int and should be sorted numerically.
I just read something in another email that leads me to believe I need to
cast the left-hand side of values in a foreach, like:
foreach x generate (int)field as fieldname: int;

So I'm going to try explicit casts like that.


Thanks,
lauren

On Sat, Sep 1, 2012 at 12:42 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I tried to reproduce this and haven't been able to -- all my devious
> attempts to get something that is actually a string to show up as an
> int in "describe" wind up in class cast exceptions and blown up jobs
> (not devious enough, clearly).
>
> Can you put together an example that reproduces the issue, and
> let us know which version of Pig you are running?
>
> Thanks,
> Dmitriy

Re: wrong sort order (lexical vs numeric) in a nested foreach

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I tried to reproduce this and haven't been able to -- all my devious
attempts to get something that is actually a string to show up as an
int in "describe" wind up in class cast exceptions and blown up jobs
(not devious enough, clearly).

Can you put together an example that reproduces the issue, and
let us know which version of Pig you are running?

Thanks,
Dmitriy

On Fri, Aug 31, 2012 at 2:42 AM, Lauren Blau
<la...@digitalreasoning.com> wrote:
> Could this be a problem with the original read of the data? It is stored in
> JSON format and read with a custom JSON loader.
> If I save the results of the loader to a file using PigStorage and then run
> the same script reading from that file, the sort is done numerically.
>
> I've had other Pig script problems which have been solved by explicitly
> storing and re-reading using PigStorage.
> I'm not sure what I can check in the loader (I didn't write it) to see what
> might be causing this.
> Any hints on how to debug this?
>
> Thanks,
> Lauren

Re: wrong sort order (lexical vs numeric) in a nested foreach

Posted by Віталій Тимчишин <ti...@gmail.com>.
I'd try declaring the original schema as chararray and then casting during the
order by, e.g.
order relation by (int)orderkey1;
If Pig does not accept a cast in order, try adding an additional foreach with
the cast.
The last resort could be a UDF that does the cast.
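
For illustration, the "additional foreach with cast" option might look roughly
like the sketch below. The input path 'input.tsv' and the aliases raw, typed and
result are assumptions made up for this example; the field names come from the
schema quoted in this thread. The sorted bag is emitted directly here, where the
original script would pass it to MyUDF.

```pig
-- Hypothetical tab-separated input; orderkey1 is declared as chararray on purpose.
raw = load 'input.tsv' as (key1:chararray, key2:int, orderkey1:chararray, val:chararray);

-- Extra foreach whose only job is to cast the sort key to int.
typed = foreach raw generate key1, key2, (int)orderkey1 as orderkey1, val;

groupbykey = group typed by (key1, key2);
result = foreach groupbykey {
    -- The nested order now compares int values.
    sorted = order typed by orderkey1;
    generate flatten(group), sorted;  -- the original script calls MyUDF(sorted) here
};
dump result;
```

Doing the cast in its own foreach keeps the nested order by working on a plain
field, which avoids relying on a cast expression being accepted inside order.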

2012/8/31 Lauren Blau <la...@digitalreasoning.com>

> Could this be a problem with the original read of the data? It is stored in
> JSON format and read with a custom JSON loader.
> If I save the results of the loader to a file using PigStorage and then run
> the same script reading from that file, the sort is done numerically.
>
> I've had other Pig script problems which have been solved by explicitly
> storing and re-reading using PigStorage.
> I'm not sure what I can check in the loader (I didn't write it) to see what
> might be causing this.
> Any hints on how to debug this?
>
> Thanks,
> Lauren



-- 
Best regards,
 Vitalii Tymchyshyn

Re: wrong sort order (lexical vs numeric) in a nested foreach

Posted by Lauren Blau <la...@digitalreasoning.com>.
Could this be a problem with the original read of the data? It is stored in
JSON format and read with a custom JSON loader.
If I save the results of the loader to a file using PigStorage and then run
the same script reading from that file, the sort is done numerically.

I've had other Pig script problems which have been solved by explicitly
storing and re-reading using PigStorage.
I'm not sure what I can check in the loader (I didn't write it) to see what
might be causing this.
Any hints on how to debug this?

Thanks,
Lauren

On Thu, Aug 30, 2012 at 6:10 PM, Lauren Blau <
lauren.blau@digitalreasoning.com> wrote:

> Sorry, premature email :-).
>
> relation = key1, key2, orderkey1, val;  -- schema is (chararray, int, int, chararray)
>
> groupbykey = group relation by (key1, key2);
> foreach groupbykey {
>     sorted = order relation by orderkey1;
>     generate flatten($0), MyUDF(sorted);
> }
>
> I notice that when the 'sorted' values arrive in my UDF, they are sorted
> lexically, not numerically. I checked the schema on the way in and
> orderkey1 is definitely an int.
>
> Is there any way to force the order by into a numeric sort?
>
> Thanks,
> Lauren

Re: wrong sort order (lexical vs numeric) in a nested foreach

Posted by Lauren Blau <la...@digitalreasoning.com>.
Sorry, premature email :-).

relation = key1, key2, orderkey1, val;  -- schema is (chararray, int, int, chararray)

groupbykey = group relation by (key1, key2);
foreach groupbykey {
    sorted = order relation by orderkey1;
    generate flatten($0), MyUDF(sorted);
}

I notice that when the 'sorted' values arrive in my UDF, they are sorted
lexically, not numerically. I checked the schema on the way in and
orderkey1 is definitely an int.

Is there any way to force the order by into a numeric sort?

Thanks,
Lauren
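
To make the symptom concrete, here is a minimal sketch of the difference between
a lexical and a numeric sort in Pig. It assumes a hypothetical one-column file
'nums.txt' containing the values 9, 10 and 100, one per line; the aliases are
made up for this example and are not part of the original script.

```pig
-- Hypothetical input file with one value per line: 9, 10, 100
s = load 'nums.txt' as (v:chararray);

-- Values compared as strings: 10, 100, 9
lexical = order s by v;

-- Same values cast to int first, then compared as numbers: 9, 10, 100
n = foreach s generate (int)v as v;
numeric = order n by v;

dump lexical;
dump numeric;
```

If the values reaching the order by are really strings at runtime, the first
ordering is the one to expect, whatever the declared schema says.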

On Thu, Aug 30, 2012 at 5:59 PM, Lauren Blau <
lauren.blau@digitalreasoning.com> wrote:

> I have the following foreach:
>
> foo := foreach bar {
>
>