Posted to dev@hive.apache.org by Pala M Muthaia <mc...@rocketfuelinc.com> on 2013/08/24 02:35:35 UTC

DISTRIBUTE BY works incorrectly in Hive 0.11 in some cases

Hi,

We are using DISTRIBUTE BY with custom reducer scripts in our query
workload.

After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY and
custom reducer scripts produced incorrect results. In particular, rows with
the same value in the DISTRIBUTE BY column end up in multiple reducers and
thus produce multiple rows in the final result, where we expect only one.

I investigated a little bit and discovered the following behavior for Hive
0.11:

- For the queries with incorrect results, Hive 0.11 produces a different
plan: the extra stage for the DISTRIBUTE BY + Transform is missing, and the
Transform operator for the custom reducer script is pushed into the reduce
operator tree that contains the GROUP BY itself.

- However, *if the SORT BY in the query has a DESC order in it*, the right
plan is produced, and the results look correct too.

Hive 0.10 produces the expected plan with right results in all cases.


To illustrate, here is a simplified repro setup:

Table:

CREATE TABLE test_cluster (
  grp STRING, val1 STRING, val2 INT, val3 STRING, val4 INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Query:

ADD FILE reducer.py;

FROM (
  SELECT grp, val2
  FROM test_cluster
  GROUP BY grp, val2
  DISTRIBUTE BY grp
  SORT BY grp, val2  -- add DESC here to get correct results
) a
REDUCE a.*
USING 'reducer.py'
AS grp, reducedValue;
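
The thread does not include reducer.py, so the following is only a hypothetical sketch of the kind of script the query assumes. It relies on Hive's default behavior of passing rows to a REDUCE/TRANSFORM script as tab-separated lines on stdin, and it assumes (purely for illustration) that reducedValue is the sum of val2 per grp. The point is that such a script emits one row per run of consecutive rows with the same grp, so it depends on every row for a given grp reaching the same reducer.

```python
import sys
from itertools import groupby


def reduce_rows(lines):
    """Yield one 'grp<TAB>reducedValue' line per run of consecutive rows sharing grp.

    Assumes each input line is 'grp<TAB>val2', sorted by grp within a reducer.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for grp, group in groupby(rows, key=lambda row: row[0]):
        # Hypothetical reduction: sum val2 over the group (not from the thread).
        total = sum(int(row[1]) for row in group)
        yield f"{grp}\t{total}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write the reduced rows to stdout."""
    for out_line in reduce_rows(stdin):
        stdout.write(out_line + "\n")
```

Used as an actual reducer script, the file would end with a call to main(). If rows for one grp are split across reducers, each reducer runs its own copy of the script and emits its own row for that grp, which matches the duplicate rows described above.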


If I understand correctly, this is a bug. Is this a known issue? Any other
insights? We have reverted to Hive 0.10 to avoid the incorrect results
while we investigate this.

I have the repro sample, with test data and scripts, if anybody is
interested.



Thanks,
pala

Re: DISTRIBUTE BY works incorrectly in Hive 0.11 in some cases

Posted by Pala M Muthaia <mc...@rocketfuelinc.com>.

Thanks for following up, Yin.

We realized later that this was due to the reduce deduplication
optimization, and found that turning off the
hive.optimize.reducededuplication flag avoids the issue.

-pala
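
A toy sketch of why wrong partitioning columns would produce the reported symptom. It assumes the merged reduce stage partitions on the GROUP BY keys (grp, val2) instead of the DISTRIBUTE BY key grp; the thread only says the partitioning columns were wrong, so that detail is an assumption. The hash function, the sample rows, and the reducer count are made up for illustration and are not Hive's.

```python
def toy_hash(values):
    """Deterministic toy hash: sum of character codes (NOT Hive's real hash)."""
    return sum(ord(ch) for value in values for ch in str(value))


def reducers_per_group(rows, partition_columns, num_reducers):
    """Map each grp value to the set of reducer ids that receive its rows."""
    placement = {}
    for row in rows:
        key = [row[col] for col in partition_columns]
        reducer_id = toy_hash(key) % num_reducers
        placement.setdefault(row["grp"], set()).add(reducer_id)
    return placement


# Made-up rows as they would look after GROUP BY grp, val2.
ROWS = [
    {"grp": "a", "val2": 1},
    {"grp": "a", "val2": 2},
    {"grp": "a", "val2": 3},
    {"grp": "b", "val2": 1},
]

# Partitioning requested by DISTRIBUTE BY grp.
by_grp = reducers_per_group(ROWS, ["grp"], num_reducers=2)
# Partitioning on the GROUP BY keys (assumed to be what the merged stage used).
by_grp_val2 = reducers_per_group(ROWS, ["grp", "val2"], num_reducers=2)

print(by_grp)
print(by_grp_val2)
```

With these rows, partitioning on grp alone keeps each grp on a single reducer, while partitioning on (grp, val2) spreads grp "a" across both reducers, so a per-grp reducer script would emit more than one row for it.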


On Mon, Aug 26, 2013 at 4:40 AM, Yin Huai <hu...@gmail.com> wrote:

> Forgot to add this in my last reply: to generate correct results, you can
> set hive.optimize.reducededuplication to false to turn off
> ReduceSinkDeDuplication.
>
>
> On Sun, Aug 25, 2013 at 9:35 PM, Yin Huai <hu...@gmail.com> wrote:
>
> > Created a jira https://issues.apache.org/jira/browse/HIVE-5149
> >
> >
> > On Sun, Aug 25, 2013 at 9:11 PM, Yin Huai <hu...@gmail.com> wrote:
> >
> >> Seems ReduceSinkDeDuplication picked the wrong partitioning columns.
> >>
> >>
> >>> On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <sk...@rocketfuel.com> wrote:
> >>
> >>> I think the problem lies within the GROUP BY operation. For this
> >>> optimization to work, the GROUP BY's partitioning should be on column 1
> >>> only.
> >>>
> >>> That won't affect the correctness of the GROUP BY. It can make the
> >>> GROUP BY slower, but in this case it will speed up the overall query.
> >>>
> >>>
> >>> On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <
> >>> mchettiar@rocketfuelinc.com> wrote:
> >>>
> >>>> I have attached the Hive 0.10 and 0.11 query plans for the sample query
> >>>> from my original message, for illustration.