You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Shantian Purkad <sh...@yahoo.com> on 2011/06/07 21:31:17 UTC

Skew Join Optimization in hive

Hi,

I have a query which joins 12 different tables (most of them left outer joins) and the query takes almost 3 hours. 90% of the time is taken by a single reducer. One reducer is getting bulk of the data to process.

How can I get around this and have fair distribution of data across all reducers? I tried to enable the skewjoin optimization but getting below NPE after first step of the job is executed.

Any suggestions/ideas will be or great help.

Thanks,
Shantian

2011-06-07 19:22:28,923 Stage-11 map = 100%,  reduce = 85%
2011-06-07 19:22:30,932 Stage-11 map = 100%,  reduce = 100%
Ended Job = job_201106071542_0010
java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.plan.ConditionalResolverSkewJoin.getTasks(ConditionalResolverSkewJoin.java:97)
    at org.apache.hadoop.hive.ql.exec.ConditionalTask.execute(ConditionalTask.java:81)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:164)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.ConditionalTask
hive> 

Re: Skew Join Optimization in hive

Posted by Shantian Purkad <sh...@yahoo.com>.
We have given hints to use mapside joins on small tables.


We are planning to break this query into multiple, but would prefer options that help us keep the queries as is (with few modifications and tuning instead of breaking the queries into multiple steps as there is quite bit of complicate logic and dependencies in the joins)



________________________________
From: Igor Tatarinov <ig...@decide.com>
To: user@hive.apache.org; Shantian Purkad <sh...@yahoo.com>
Sent: Tuesday, June 7, 2011 12:58 PM
Subject: Re: Skew Join Optimization in hive


Have you tried splitting the query into 2 or 3 steps and/or enabling map jons (SET hive.auto.convert.join = true;) if some of the tables are smallish?



On Tue, Jun 7, 2011 at 12:31 PM, Shantian Purkad <sh...@yahoo.com> wrote:

Hi,
>
>I have a query which joins 12 different tables (most of them left outer joins) and the query takes almost 3 hours. 90% of the time is taken by a single reducer. One reducer is getting bulk of the data to process.
>
>How can I get around this and have fair distribution of data across all reducers? I tried to enable the skewjoin optimization but getting below NPE after first step of the job is executed.
>
>Any suggestions/ideas will be or great help.
>
>Thanks,
>Shantian
>
>2011-06-07 19:22:28,923 Stage-11 map = 100%,  reduce = 85%
>2011-06-07 19:22:30,932 Stage-11 map = 100%,  reduce = 100%
>Ended Job = job_201106071542_0010
>java.lang.NullPointerException
>    at
 org.apache.hadoop.hive.ql.plan.ConditionalResolverSkewJoin.getTasks(ConditionalResolverSkewJoin.java:97)
>    at org.apache.hadoop.hive.ql.exec.ConditionalTask.execute(ConditionalTask.java:81)
>    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:164)
>    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
>    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
>    at
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>    at java.lang.reflect.Method.invoke(Method.java:597)
>    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.ConditionalTask
>hive> 
>
>

Re: Skew Join Optimization in hive

Posted by Igor Tatarinov <ig...@decide.com>.
Have you tried splitting the query into 2 or 3 steps and/or enabling map
jons (SET hive.auto.convert.join = true;) if some of the tables are
smallish?


On Tue, Jun 7, 2011 at 12:31 PM, Shantian Purkad
<sh...@yahoo.com>wrote:

> Hi,
>
> I have a query which joins 12 different tables (most of them left outer
> joins) and the query takes almost 3 hours. 90% of the time is taken by a
> single reducer. One reducer is getting bulk of the data to process.
>
> How can I get around this and have fair distribution of data across all
> reducers? I tried to enable the skewjoin optimization but getting below NPE
> after first step of the job is executed.
>
> Any suggestions/ideas will be or great help.
>
> Thanks,
> Shantian
>
> 2011-06-07 19:22:28,923 Stage-11 map = 100%,  reduce = 85%
> 2011-06-07 19:22:30,932 Stage-11 map = 100%,  reduce = 100%
> Ended Job = job_201106071542_0010
> java.lang.NullPointerException
>     at
> org.apache.hadoop.hive.ql.plan.ConditionalResolverSkewJoin.getTasks(ConditionalResolverSkewJoin.java:97)
>     at
> org.apache.hadoop.hive.ql.exec.ConditionalTask.execute(ConditionalTask.java:81)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>     at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:164)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> FAILED: Execution Error, return code -101 from
> org.apache.hadoop.hive.ql.exec.ConditionalTask
> hive>
>
>