Posted to user@hive.apache.org by Gautam <ga...@gmail.com> on 2016/02/19 04:37:04 UTC

Hive query on Tez slower than on MR (fails in some cases) ..

Good Evening,

It's an ETL query that writes Parquet data. The data itself has some skew,
which affects the execution of the query. The query uses a Common Table
Expression to blow out data for a select list of apps and writes to 12
different Parquet tables.

On MR, this gets divided into 1 MR job followed by 12 parallel MR jobs and
finishes in approximately 1 hour. I have attached the MR query plan and the
query itself.

On Tez, this runs as a single DAG of the form M-R+. On the first attempt, with
the same options, the query fails in the 2nd mapper (because of
set hive.auto.convert.join=true). When I turn that off, the query passes but
takes 50% longer than the MR version. It gets stuck on some large apps in the
1st reducer phase, which holds up all 12 subsequent reducer phases until the
final reducer in the 2nd phase is finished.
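A toy Python model of the skew described above, using made-up row counts: every row for an app goes to one reducer, so adding reducers lowers the average load but cannot push the slowest reducer below the size of the largest app. The app names and the character-sum partitioner are hypothetical stand-ins, not Hive's actual hash function.

```python
# Toy model of key skew under hash partitioning (all data is made up).
# Every row for a given app key goes to the same reducer, so a reducer's
# load is the sum of the rows of the keys assigned to it.


def partition_of(key: str, num_reducers: int) -> int:
    # Deterministic stand-in for a hash partitioner; NOT Hive's real hash.
    return sum(ord(c) for c in key) % num_reducers


def reducer_loads(rows_per_app: dict[str, int], num_reducers: int) -> list[int]:
    # Total rows landing on each reducer.
    loads = [0] * num_reducers
    for app, rows in rows_per_app.items():
        loads[partition_of(app, num_reducers)] += rows
    return loads


if __name__ == "__main__":
    # One hypothetical large app plus 99 small ones.
    rows = {"big_app": 1_000_000}
    rows.update({f"app_{i}": 1_000 for i in range(99)})
    for n in (10, 100):
        loads = reducer_loads(rows, n)
        print(f"{n:3d} reducers: slowest={max(loads):,} rows, "
              f"mean={sum(loads) / n:,.0f} rows")
```

Going from 10 to 100 reducers cuts the mean load tenfold, while the slowest reducer stays above 1,000,000 rows because it always holds the large app.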

I'm using Tez 0.7.1. I've attached the simple and full Tez explain images.
Are there things in Tez I can leverage, or changes I can make to my query, so
that Tez deals with skew better?


-Gautam.

Re: Hive query on Tez slower than on MR (fails in some cases) ..

Posted by Gopal Vijayaraghavan <go...@apache.org>.

Hi,

> Here's the Tez DAG swimlane. I haven't gotten vertex.py to work yet; I'll
> send that output soon as well.

It's pretty clear that the map side is fine: splitting the sort buffers isn't
bothering this at all.

We want to over-partition Reducer 7, and possibly have every reducer vertex
pick its total # of reducers dynamically:

-- hive.exec.parallel is a bad idea on Tez
set hive.exec.parallel=false;

-- decide on the total # of reducers dynamically
set hive.tez.auto.reducer.parallelism=true;
set hive.tez.min.partition.factor=0.1;
set hive.tez.max.partition.factor=10;

-- slow start min (reducer counts are picked at this point)
set tez.shuffle-vertex-manager.min-src-fraction=0.9;
set tez.shuffle-vertex-manager.max-src-fraction=0.99;

set tez.runtime.report.partition.stats=true;

-- experimental!! I'm still testing this for machine failure tolerance
set tez.runtime.pipelined-shuffle.enabled=true;
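A simplified Python model of what these settings ask for, under stated assumptions: Hive estimates a reducer count, auto reducer parallelism widens it into a [min, max] range using the two partition factors (capped by hive.exec.reducers.max), the reducer vertex starts at the high end, slow start holds reducers back until min-src-fraction of the source tasks finish, and at that point the observed output is extrapolated and the reducer count may only shrink. The function names, byte figures, and exact rounding are assumptions for illustration; the real Hive and Tez ShuffleVertexManager code differs in detail (for example, Tez shrinks by merging adjacent partitions into ranges).

```python
import math

# Simplified model of Hive auto reducer parallelism on Tez. These are
# assumptions for illustration, not the real Hive/Tez code; all function
# names here are hypothetical.


def reducer_bounds(estimated: int, min_factor: float, max_factor: float,
                   max_reducers: int) -> tuple[int, int]:
    """Range Tez may pick from: the estimate scaled by the two partition
    factors, capped at max_reducers (hive.exec.reducers.max)."""
    low = min(max(1, int(estimated * min_factor)), max_reducers)
    high = min(int(estimated * max_factor), max_reducers)
    return low, high


def tasks_to_schedule(num_tasks: int, done_fraction: float,
                      min_src: float, max_src: float) -> int:
    """Slow start: launch no reducers before min_src of the sources finish,
    then ramp linearly so that all are launched by max_src."""
    if done_fraction < min_src:
        return 0
    if done_fraction >= max_src or max_src <= min_src:
        return num_tasks
    ramp = (done_fraction - min_src) / (max_src - min_src)
    return int(ramp * num_tasks)


def runtime_parallelism(current: int, min_tasks: int, done_tasks: int,
                        total_tasks: int, done_output_bytes: int,
                        bytes_per_reducer: int) -> int:
    """Extrapolate total shuffle bytes from the finished source tasks and
    shrink (never grow) the reducer count to fit, but not below min_tasks."""
    expected_total = done_output_bytes * total_tasks / done_tasks
    desired = max(min_tasks, math.ceil(expected_total / bytes_per_reducer))
    return min(desired, current)


if __name__ == "__main__":
    low, high = reducer_bounds(100, 0.1, 10, 1009)  # assumed estimate: 100
    print("bounds:", low, high)
    chosen = runtime_parallelism(high, low, 90, 100, 9 * 2**30, 256 * 2**20)
    print("chosen:", chosen)
    for frac in (0.5, 0.95, 0.99):
        print(f"launched at {frac:.0%} of sources:",
              tasks_to_schedule(chosen, frac, 0.9, 0.99))
```

With the factors above (0.1 and 10), an assumed estimate of 100 reducers, and an assumed cap of 1009, the bounds come out as 10 and 1000; if the first 90 of 100 source tasks have written 9 GiB and each reducer should take 256 MiB, the model settles on 40 reducers. Over-partitioning matters because the count can only go down at runtime.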


Cheers,
Gopal

Re: Hive query on Tez slower than on MR (fails in some cases) ..

Posted by Gautam <ga...@gmail.com>.

Here's the Tez DAG swimlane. I haven't gotten vertex.py to work yet; I'll
send that output soon as well.

On Thu, Feb 18, 2016 at 10:34 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote: (quoted text trimmed; the full message is posted below)



Re: Hive query on Tez slower than on MR (fails in some cases) ..

Posted by Gopal Vijayaraghavan <go...@apache.org>.

> On Tez, this is run as a single DAG of M-R+ ...

Can't tell which vertex is the slow one in this.

More tooling for isolating which vertex is taking up time (and which task)

https://github.com/apache/tez/tree/master/tez-tools/swimlanes


or alternatively run

https://github.com/t3rmin4t0r/tez-swimlanes/blob/master/vertex.py


The first one should get you a graph which looks a lot like

http://people.apache.org/~gopalv/query4.svg


and the 2nd one should get you something which looks like

http://people.apache.org/~gopalv/q21_suppliers_who_kept_orders_waiting.svg
(note skewed tail in Reducer 3)


> It gets stuck due to some large apps in the 1st Reducer Phase while
>holding all subsequent 12 Reducer phases until the final Reducer in the
>2nd phase is finished.

You're splitting the sort buffers 12-way.

> Are there things in Tez I can leverage or change my query to make it
>conducive for Tez to deal with skew better?

Usually, if left unconfigured, Tez runs all containers using the mapper Xmx
values. Most of the time a perf difference is reported, it's due to the use
of 1.5GB containers (compared with 6GB reducers in MRv2).

Assuming that isn't the case, get the other SVGs produced; they should tell
me exactly what's wrong.

Tez doesn't introduce skew in general, but dividing io.sort.mb into 12
chunks might be a problem.
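A back-of-the-envelope Python sketch of why a 12-way split of io.sort.mb can hurt, assuming the buffer is divided evenly across the 12 outputs and each spill writes one full per-output buffer (real Tez also applies a spill threshold and distributes memory between outputs in its own way). The 1200 MB buffer and the 2000 MB written per output are hypothetical numbers.

```python
import math

# Back-of-the-envelope spill estimate (simplifying assumptions, not Tez code):
# the sort buffer is split evenly across outputs, and each spill writes one
# full per-output buffer (no spill-percent threshold, no record overhead).


def per_output_buffer_mb(sort_mb: float, num_outputs: int) -> float:
    # Share of the sort buffer each output gets under an even split.
    return sort_mb / num_outputs


def estimated_spills(output_mb: float, buffer_mb: float) -> int:
    # Number of buffer-fulls needed to hold output_mb; at least one.
    return max(1, math.ceil(output_mb / buffer_mb))


if __name__ == "__main__":
    sort_mb = 1200    # hypothetical io.sort.mb
    output_mb = 2000  # hypothetical data written to each output
    for outputs in (1, 12):
        buf = per_output_buffer_mb(sort_mb, outputs)
        print(f"{outputs:2d} output(s): {buf:.0f} MB each, "
              f"~{estimated_spills(output_mb, buf)} spills per output")
```

With these numbers a single output spills about twice, while each of 12 outputs gets 100 MB and spills about 20 times, and every extra spill adds merge work later.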

Cheers,
Gopal
PS: In 0.8.2, the tooling actually gets you something like this:
https://issues.apache.org/jira/secure/attachment/12751186/criticalPath.jpg
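As a rough illustration of the critical-path idea (not the Tez tooling's actual code), the critical path of a DAG is its longest duration-weighted path. The vertex names and durations below are invented to resemble the shape described at the top of the thread (one skewed reducer feeding 12 fan-out reducers), and the model ignores slow start: each vertex is assumed to wait for its upstream vertices to finish.

```python
# Critical path of a DAG = the longest duration-weighted path through it.
# Simplification: a vertex starts only after its upstream vertices finish.


def critical_path(durations: dict[str, float],
                  edges: dict[str, list[str]]) -> tuple[float, list[str]]:
    memo: dict[str, tuple[float, list[str]]] = {}

    def longest_from(vertex: str) -> tuple[float, list[str]]:
        # Longest path starting at `vertex`, memoized so each vertex is
        # solved once.
        if vertex not in memo:
            best_time, best_path = 0.0, []
            for child in edges.get(vertex, []):
                path_time, path = longest_from(child)
                if path_time > best_time:
                    best_time, best_path = path_time, path
            memo[vertex] = (durations[vertex] + best_time, [vertex] + best_path)
        return memo[vertex]

    # The overall critical path is the longest path from any starting vertex.
    return max((longest_from(v) for v in durations), key=lambda r: r[0])


if __name__ == "__main__":
    # Invented durations (minutes): one skewed reducer feeding 12 reducers.
    durations = {"Map 1": 10.0, "Reducer 2": 40.0}
    durations.update({f"Reducer {i}": 5.0 for i in range(3, 15)})
    edges = {"Map 1": ["Reducer 2"],
             "Reducer 2": [f"Reducer {i}" for i in range(3, 15)]}
    total, path = critical_path(durations, edges)
    print(f"{total:.0f} min via {' -> '.join(path)}")
```

Here the 40-minute reducer accounts for most of the 55-minute critical path, so speeding up the 12 fan-out reducers barely moves the total.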