Posted to user@hive.apache.org by Adrian Popescu <ad...@epfl.ch> on 2013/12/13 20:17:58 UTC
Re: handling joins in Hive 0.11.0
Hello,
I found that the dependency graph among task stages is incorrect in
the plan produced by the skew join optimization.
In particular, the conditional task in the optimized plan keeps no
dependency on the child tasks of the common join task in the original
plan. The conditional task wraps the map join task, which does carry
all of these dependencies, but when the map join task is filtered out
at runtime, the dependencies are lost along with it.
Hence, all the other task stages of the query are skipped.
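
To make the failure mode concrete, here is a minimal toy model of the
stage graph described above. It is only a sketch: Stage,
ConditionalStage, and the stage names are hypothetical stand-ins
invented for illustration, not Hive's actual Task or ConditionalTask
classes, and the traversal is a simplification of how stages are
scheduled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SkewJoinSkipDemo {

    /** Hypothetical stand-in for a task stage; this is not Hive's Task class. */
    static class Stage {
        final String name;
        final List<Stage> children = new ArrayList<>();

        Stage(String name) { this.name = name; }

        /** Stages to start once this stage has finished. */
        List<Stage> next(boolean skewTriggered) { return children; }
    }

    /** Hypothetical conditional stage: its wrapped stages run only if skew was detected. */
    static class ConditionalStage extends Stage {
        final List<Stage> wrapped = new ArrayList<>();

        ConditionalStage(String name) { super(name); }

        @Override
        List<Stage> next(boolean skewTriggered) {
            List<Stage> result = new ArrayList<>(children);
            if (skewTriggered) {
                result.addAll(wrapped);
            }
            return result;
        }
    }

    /** Depth-first walk that records each stage the first time it is reached. */
    static void execute(Stage stage, boolean skewTriggered, Set<String> executed) {
        if (!executed.add(stage.name)) {
            return;
        }
        for (Stage s : stage.next(skewTriggered)) {
            execute(s, skewTriggered, executed);
        }
    }

    /** Builds the optimized plan as described in the text and returns the executed stages. */
    static List<String> runPlan(boolean skewTriggered) {
        Stage commonJoin = new Stage("commonJoin");
        Stage mapJoin = new Stage("skewMapJoin");
        Stage downstream = new Stage("downstream");
        ConditionalStage conditional = new ConditionalStage("conditional");

        // The common join now points only at the conditional stage.
        commonJoin.children.add(conditional);
        // The downstream dependency hangs off the wrapped map join only.
        conditional.wrapped.add(mapJoin);
        mapJoin.children.add(downstream);

        Set<String> executed = new LinkedHashSet<>();
        execute(commonJoin, skewTriggered, executed);
        return new ArrayList<>(executed);
    }

    public static void main(String[] args) {
        System.out.println("skew triggered:     " + runPlan(true));
        System.out.println("skew not triggered: " + runPlan(false));
    }
}
```

In this model the "downstream" stage is reachable only through the map
join, so it never runs when the map join is filtered out.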
The bug resides in "ql/optimizer/physical/GenMRSkewJoinProcessor.java",
in the processSkewJoin() function, immediately after the ConditionalTask
is created and its dependencies are set.
For now I have fixed the issue by adding dependencies between the
ConditionalTask and all the child tasks of the common join task of the
original plan.
From the original design I see that only the tasks included in the
ConditionalTask are allowed to have dependencies, so I am wondering
what the correct alternative implementation would be.
Maybe adding a "nop" task inside the ConditionalTask (in addition to
the map join task), so that the dependencies are maintained when the
map join task is filtered out?
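
The "nop" idea can be sketched with a small toy model. This is only an
illustration under stated assumptions: Stage, ConditionalStage, and the
"noOp" stage are hypothetical names invented here, not Hive classes,
and the model assumes the conditional stage selects exactly one branch
at runtime.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class NoOpBranchDemo {

    /** Hypothetical stand-in for a task stage; this is not Hive's Task class. */
    static class Stage {
        final String name;
        final List<Stage> children = new ArrayList<>();

        Stage(String name) { this.name = name; }

        /** Stages to start once this stage has finished. */
        List<Stage> next(boolean skewTriggered) { return children; }
    }

    /** Hypothetical conditional stage that selects exactly one branch at runtime. */
    static class ConditionalStage extends Stage {
        final Stage skewBranch;
        final Stage noOpBranch;

        ConditionalStage(String name, Stage skewBranch, Stage noOpBranch) {
            super(name);
            this.skewBranch = skewBranch;
            this.noOpBranch = noOpBranch;
        }

        @Override
        List<Stage> next(boolean skewTriggered) {
            List<Stage> result = new ArrayList<>();
            result.add(skewTriggered ? skewBranch : noOpBranch);
            return result;
        }
    }

    /** Depth-first walk that records each stage the first time it is reached. */
    static void execute(Stage stage, boolean skewTriggered, Set<String> executed) {
        if (!executed.add(stage.name)) {
            return;
        }
        for (Stage s : stage.next(skewTriggered)) {
            execute(s, skewTriggered, executed);
        }
    }

    /** Builds a plan in which both branches carry the original child stage. */
    static List<String> runPlan(boolean skewTriggered) {
        Stage commonJoin = new Stage("commonJoin");
        Stage mapJoin = new Stage("skewMapJoin");
        Stage noOp = new Stage("noOp"); // does no work, only preserves dependencies
        Stage downstream = new Stage("downstream");

        // Whichever branch is selected, the downstream stage stays reachable.
        mapJoin.children.add(downstream);
        noOp.children.add(downstream);

        ConditionalStage conditional = new ConditionalStage("conditional", mapJoin, noOp);
        commonJoin.children.add(conditional);

        Set<String> executed = new LinkedHashSet<>();
        execute(commonJoin, skewTriggered, executed);
        return new ArrayList<>(executed);
    }

    public static void main(String[] args) {
        System.out.println("skew triggered:     " + runPlan(true));
        System.out.println("skew not triggered: " + runPlan(false));
    }
}
```

Because the no-op branch holds the same children as the map join, the
rule that only tasks inside the conditional carry dependencies is kept,
at the cost of one extra stage that does no work.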
Thanks,
Adrian
On 11/15/2013 10:20 PM, Adrian Popescu wrote:
>
> 2. In my experiments I also evaluate skewed joins. I enable skew joins
> through "hive.optimize.skewjoin" and I run the same
> tpch query 5. The skew join is not actually triggered as the number of
> rows with the same key is less than "hive.skewjoin.key".
> Hence, the map join corresponding to the skewed join is filtered out
> at runtime, but unfortunately all the other stages
> are also filtered out. Thus, no result is actually generated. If I
> disable the skew join optimization, the query running only with
> common joins returns the result correctly.
>
> I believe this is a bug that occurs when the skew join operator is
> enabled but not triggered. Has anyone experienced the same problem with
> skew joins on queries with multiple map reduce joins? I attach the
> explain plan. Essentially only stages 6 and 22 are executed.
> Everything else is skipped silently: no output result is generated,
> and no error appears in "hive.log". Similar behaviour is observed
> for other TPCH queries.
>
> Many thanks,
> Adrian
>
--
Adrian