Posted to user@pig.apache.org by Saggi Neumann <sa...@xplenty.com> on 2015/04/01 10:52:59 UTC

Re: Weird behaviour with string min/max

We've figured it out and opened a JIRA issue with a patch. Apparently there's
a bug in the string MIN/MAX accumulator function. See
https://issues.apache.org/jira/browse/PIG-4490

Nice catch!

Saggi

On Sun, Mar 29, 2015 at 2:46 PM, Ronald Green <gr...@gmail.com>
wrote:

> I can share demo data to go with the script. Does anyone have a clue?
>
> On 24 March 2015 at 14:04, Ronald Green <gr...@gmail.com> wrote:
>
> > Hi,
> >
> > I stumbled upon a case where MIN/MAX on strings returns values that
> > are definitely not the minimum or the maximum.
> >
> > When executed on 1 million records, the following script produces wrong
> > values for MIN/MAX:
> >
> > ```
> > -- Load tab-separated input with an explicit schema.
> > src = LOAD 's3n://.../' USING PigStorage('\t','-noschema') AS
> > (field1:int, field2:int, field3:int, field4:chararray, field5:chararray,
> > field6:chararray, field7:chararray, field8:chararray);
> > -- After GROUP, each group's bag is named after the grouped alias (src).
> > agg = GROUP src BY (field3);
> > proj = FOREACH agg GENERATE group AS field3, COUNT_STAR(src) AS
> > countme, datafu.pig.stats.HyperLogLogPlusPlus(src.field5) AS HLL1,
> > MIN(src.field8) AS Minval, MAX(src.field8) AS Maxval;
> > STORE proj INTO 's3n://...' USING PigStorage('\t');
> > ```
> >
> > If I make the following changes, the results for MIN and MAX are as
> > expected:
> >
> > 1. Remove the use of HyperLogLogPlusPlus
> > 2. Treat field8 as a datetime field instead of a chararray
> > 3. Execute the script on only 1/100 of the data
> >
> > Note that the job consists of a single map/reduce job with a single
> > map task and a single reduce task.
> >
> > Any idea?
> >
> > Thanks,
> > Ron
> >
>
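
For readers unfamiliar with accumulator-style evaluation, here is a minimal
sketch of what a string minimum is expected to do when a large group reaches
it in batches rather than as one bag. It is not Pig's StringMin code and says
nothing about what the actual bug in PIG-4490 was; the class name
StringMinSketch and the use of a plain List<String> for each batch are
assumptions made for illustration. The method names only mirror the
accumulate/getValue/cleanup shape of Pig's Accumulator interface.

```java
import java.util.List;

/** Hypothetical accumulator-style string minimum; not Pig's actual code. */
public class StringMinSketch {
    // Minimum over every batch seen so far; null until a non-null value arrives.
    private String runningMin = null;

    /** Folds one batch into the running minimum, skipping null values. */
    public void accumulate(List<String> batch) {
        for (String value : batch) {
            if (value == null) {
                continue;
            }
            // Compare against the minimum carried over from earlier batches,
            // not only against values inside the current batch.
            if (runningMin == null || value.compareTo(runningMin) < 0) {
                runningMin = value;
            }
        }
    }

    /** Returns the minimum across all batches, or null if none was seen. */
    public String getValue() {
        return runningMin;
    }

    /** Resets the state so the instance can be reused for the next group. */
    public void cleanup() {
        runningMin = null;
    }
}
```

The point of the sketch is that the result must depend on all batches
together: feeding it ["2015-03-24", "2015-03-29"] and then ["2015-01-05"]
should yield "2015-01-05", whatever the batch boundaries are.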