Posted to user@tez.apache.org by Piyush Narang <pn...@twitter.com> on 2017/03/17 00:01:03 UTC

Understanding resource usage overheads in a Tez job

hi folks,

I'm trying to compare the performance of a Scalding job on Tez vs Hadoop
and to understand the resource usage overheads on Tez. This job reads
around 6TB of input data and sets up a Scalding flow with 13 Hadoop jobs
(20 Tez vertices).

The runtime of the job on Tez is around 2-3x better than on Hadoop. The
Tez run takes 15 mins (container reuse on) or 25 mins (reuse off), while
the Hadoop job takes around 48 mins.

Looking at the resource usage in megabyte-millis (total container uptime *
allocated memory), Tez seems to be on par with Hadoop (0.59% worse). This
seems strange given that Tez runs in a third to half the time. I'm trying
to understand why this is the case, and I could use any ideas /
suggestions that folks might have.
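As a toy illustration of why footprint and wall-clock time can diverge like
this (the task count loosely matches this job, but the runtimes and
concurrency figures below are invented for illustration):

```python
# Illustrative numbers only: average task runtime and concurrency are
# assumptions, not measurements from the runs described above.

MEM_MB = 4096            # container size used in both runs
NUM_TASKS = 32_000       # roughly the Tez task count from this job
AVG_TASK_MS = 60_000     # assumed average task runtime

# Footprint (MB-millis) is the sum of memory * uptime over all containers;
# if both frameworks do roughly the same work, the footprint is similar:
footprint = NUM_TASKS * AVG_TASK_MS * MEM_MB

# Wall-clock time instead depends on how densely that work is packed,
# i.e. the average number of tasks in flight at once:
wall_clock_hadoop_ms = NUM_TASKS * AVG_TASK_MS // 700    # ~700 tasks in flight
wall_clock_tez_ms = NUM_TASKS * AVG_TASK_MS // 2100      # ~3x the concurrency

# Tez finishes ~3x sooner, yet the MB-millis footprint is unchanged.
```

In other words, a framework can be 3x faster simply by keeping more tasks in
flight, without doing any less work.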

Some things I have checked out / confirmed:

   - For the purpose of these tests, my container sizes in Hadoop and Tez
   are identical (4G).
   - Total number of containers:
      - Hadoop seems to be spinning up around 46K containers over the 13
      jobs.
      - Tez spins up 32K tasks (and around 35K containers when reuse=false,
      and 5-10K containers when reuse=true).
   - While looking at the run with reuse=false and comparing container
   runtime durations with the corresponding task's runtime, I noticed that
   a decent number of containers ran for around 1s or more longer than the
   task they hosted. Is the Tez AM holding on to containers for a while
   before starting up tasks on them? This contributes around 5% of the
   MB_MILLIS in the Tez run.
   - Tez seems to be spinning up around 3K extra containers that are very
   short-lived. This doesn't contribute a lot of overhead but seemed
   interesting.
   - I tried a couple of minor slowstart setting tweaks
   (tez.shuffle-vertex-manager.max-src-fraction=0.95, min-src-fraction=0.9);
   this made MB_MILLIS a little worse (~10%).
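A rough back-of-the-envelope check on the idle-container observation above
(the container count matches the reuse=false run; the idle time and total
footprint are assumptions for illustration):

```python
# Back-of-the-envelope: ~1s of extra container uptime per container,
# against an assumed total footprint, lands in the few-percent range
# reported above. Total footprint here is a guess, not a measured value.

containers = 35_000               # containers in the reuse=false run
idle_ms_per_container = 1_000     # ~1s of container uptime beyond the task
mem_mb = 4096

idle_mb_millis = containers * idle_ms_per_container * mem_mb

total_mb_millis = 3.0e12          # assumed total footprint of the job
overhead_fraction = idle_mb_millis / total_mb_millis   # ~0.048, i.e. ~5%
```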

I looked at some of the resource-usage-related JIRAs like TEZ-3274 and
TEZ-3535. We're not hitting TEZ-3274. Based on reading TEZ-3535, I wasn't
sure if it might be a problem here.

Does anyone have suggestions on what we could look into / explore? I'm not
sure if others have run into such scenarios when comparing resource usage
numbers on Hadoop / Tez, where the runtimes are much better on Tez but
usage isn't much better.

Thanks,

-- 
- Piyush

Re: Understanding resource usage overheads in a Tez job

Posted by Piyush Narang <pn...@twitter.com>.
Thanks Gopal, we did run into a lot of overhead in the comparator in one of
our other jobs, and turning on OrderedSerialization in Scalding seemed to
have helped there. While we were comparing Hadoop and Tez on that job, we
were seeing Tez's reducers taking substantially more time (and the overhead
was in the comparison methods). I tried a few runs after forcing Cascading
to use the rawBytes comparator -
https://github.com/cwensel/cascading/blob/wip-3.2/cascading-hadoop/src/main/shared/cascading/tuple/hadoop/util/DeserializerComparator.java#L59
(returning true there), and that helped as well. From what I understand,
though, we need to do this from the Scalding side, as a lot of our jobs use
complex objects (e.g. thrift structs / Scala case classes). If we don't
have ordered serialization enabled from Scalding, I'm not sure the raw
comparators will make sense (I think some of the work there was to ensure
the byte representations of these objects can be compared sanely).

Thanks for the two links, they look really useful!

I was able to test a few variants of our job with slowstart = 0.999 to see
if pipelining would explain the resource usage. It turns out it was
contributing a good deal. With this setting, Tez ends up using around 20%
(container reuse=false) to 27% (container reuse=true) lower MB_MILLIS than
MR. Runtime-wise, Tez is still better, taking around half the time of
Hadoop.
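For anyone reproducing this, a sketch of the settings involved, expressed
as a plain dict (these are the standard Hadoop/Tez property names; the
values mirror the slowstart = 0.999 experiment described above):

```python
# Slowstart-related knobs for the MR vs Tez comparison. Expressed as a
# plain dict for illustration; in practice these go into the job conf.
conf = {
    # MapReduce: don't launch reducers until (almost) all maps finish.
    "mapreduce.job.reduce.slowstart.completedmaps": "0.999",
    # Tez equivalents: delay downstream vertex tasks until the source
    # vertex is (almost) complete.
    "tez.shuffle-vertex-manager.min-src-fraction": "0.999",
    "tez.shuffle-vertex-manager.max-src-fraction": "0.999",
    # Container reuse was toggled between the two Tez runs.
    "tez.am.container.reuse.enabled": "true",
}
```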

Thanks,


-- 
- Piyush

Re: Understanding resource usage overheads in a Tez job

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> We are using OrderedSerialization in a bunch of our jobs. In this job we're not using it on either the Hadoop side or the Tez side. The datasets both jobs are reading are identical.

That single comparator call was the biggest fraction of slow-down when I ran profiles with Tez.

I profiled through that codepath for TEZ-2505, of course YMMV.

My estimate was that a raw byte OrderedSerialization + TezRawComparator could save ~50% of the total CPU of some jobs.

> Our suspicion internally was also around pipelining and speculative execution across steps which doesn't happen in Hadoop between jobs

https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh
+
https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/CriticalPathAnalyzer.java

Those help a lot in locating issues with Tez scheduling and targeting optimizations.

Cheers,
Gopal




Re: Understanding resource usage overheads in a Tez job

Posted by Piyush Narang <pn...@twitter.com>.
Thanks for getting back Gopal & Jason.
We are using OrderedSerialization in a bunch of our jobs; in this job we're
not using it on either the Hadoop side or the Tez side. The datasets both
jobs are reading are identical.

Our suspicion internally was also around pipelining and speculative
execution across steps, which doesn't happen between jobs in Hadoop. I've
been trying to gather numbers to compare how things look when we set the
slowstart settings to 1.0 in both Tez and MapReduce. Let me try that and
get back with what I find.

Thanks,



-- 
- Piyush

Re: Understanding resource usage overheads in a Tez job

Posted by Jason Lowe <jl...@yahoo-inc.com>.
Having the runtime be a lot shorter but the MB-seconds be similar can definitely happen.  For example, consider the legacy MapReduce pipeline where it runs a chain of jobs.  Each job in the chain cannot start until the previous job finishes.  Most jobs have at least some skew where the whole job cannot complete until that last, longer-running task completes.  Finally it completes and the next job starts.  The next job has a long tail which keeps the job after it from starting, and so on.  Note that while the last task of a job is running, the incremental MB-seconds is pretty small since only one task of the job is running at that time.  The overall latency is large due to all these last-task stalls, but the footprint profile isn't a rectangle of total tasks * total runtime.

Since Tez executes the DAG directly, there are no artificial "job boundaries" between vertices in the DAG, so it can do a better job of pipelining the work.  In some cases approximately the same amount of work is getting done with legacy vs. Tez, so the overall MB-seconds value is going to be similar.  However, since Tez was able to pipeline the work more efficiently, the overall job latency significantly improved.  Therefore job latency is not a very good proxy for job footprint, because the width of the job at any point in time can be drastically different.
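The chained-jobs effect above can be sketched with a toy model (the job
count matches this thread's Scalding flow; task counts and runtimes are
illustrative assumptions):

```python
# Toy model of the long-tail effect: a chain of jobs where each job has a
# single straggler task that runs 2x the typical task time. All numbers
# besides the job count are invented for illustration.

JOBS = 13                 # job count from the Scalding flow in this thread
TASKS_PER_JOB = 1_000
TYPICAL_MS = 120_000
STRAGGLER_MS = 240_000
MEM_MB = 4096

# Each job must fully finish (straggler included) before the next starts:
chained_latency_ms = JOBS * STRAGGLER_MS          # ~52 min end to end

# The footprint is dominated by the many typical tasks, not the stragglers:
footprint_mb_millis = (
    JOBS * ((TASKS_PER_JOB - 1) * TYPICAL_MS + STRAGGLER_MS) * MEM_MB
)

# A DAG that starts downstream work during each job's tail can cut the
# latency substantially while leaving the footprint (work done) unchanged.
```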
Container reuse in Tez will always result in at least some containers being allocated and then quickly released.  It's a natural race inherent with container reuse.  When a new task needs to be allocated, a request is sent out to the RM to allocate a container for that task.  If the task gets reassigned to a reused container then the RM may grant the allocation before the Tez AM can cancel the allocation request.  Now the Tez AM has an extra container, and if it cannot find another task to run with it then it will end up releasing it.  Note that Tez also will hold onto containers for a small amount of time as it tries to match them first for node locality, then a little later for rack locality, then a little later for any task.

Without having access to the Tez AM logs, it's going to be hard to know for sure what is happening with the job.  Given that the unused container footprint was only 5% of the total job footprint, I suspect this is simply a case of pipelining the work for better latency but approximately the same amount of work is getting done.  Calculating the MB-second footprint from the tasks-only perspective (i.e.: sum of task_size * task_runtime for the entire job) would help verify.  I'd also look for Tez tasks spending a long period of time waiting for shuffle data to become ready.  You could try setting the task slowstart setting to 1.0 so that it only launches downstream tasks when all the upstream tasks are done, so the tasks avoid sitting on the cluster for a long time waiting for their final shuffle input to become available.
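The tasks-only footprint check suggested above might look like this (the
task data here is made up; in practice it would come from the job history
or the timeline server):

```python
# Tasks-only MB-millis: sum of task memory * task runtime, ignoring any
# container uptime outside of task execution. Hypothetical task data.

tasks = [
    # (allocated_mb, runtime_ms)
    (4096, 90_000),
    (4096, 120_000),
    (4096, 75_000),
]

task_only_mb_millis = sum(mem_mb * runtime_ms for mem_mb, runtime_ms in tasks)

# The gap between this and the container-level MB-millis counter is the
# container overhead: idle time before/after tasks, setup, teardown.
```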
Jason
 


Re: Understanding resource usage overheads in a Tez job

Posted by Gopal Vijayaraghavan <go...@hortonworks.com>.
Hi,

> Looking at the resource usage Megabyte Millis(total container uptime * allocatedMemory), it seems like Tez is at-par with Hadoop (0.59% worse). This seems strange given that Tez is running for 2-3x lesser time. I'm trying to understand why this is the case and I could use any ideas / suggestions that folks might have. 

The throughput of a workload is often not a framework thing - sometimes it is entirely IO driven.

Once the workload is IO limited (or even CPU limited), any runtime benefit you get is actually from better scheduling or faster scheduling (i.e. pipelining between stages).

The framework code only matters when it is the bottleneck - with other bottlenecks, the expectation is that Tez will run at the same throughput but at better latency.

> Is the Tez AM holding on to containers for a duration of time before starting up tasks on them? This contributes around 5% of the MB_MILLIS in case of the Tez run. 

Yes, it is waiting for confirmation to exit - I think that is 1 heartbeat.

> Does anyone have suggestions what we could look into / explore? 

This is from my last visit to Twitter: I think Twitter was building some sort of ordered serializer for .aggregateBy instead of using the Cascading defaults.

The use of a RawComparator that does an allocation was 3-5x slower for the sort operation - TEZ-1288 is the API which Hive takes advantage of.

Scala somewhat locks you into allocating immutable objects in the inner loop, but for entirely byte-ordered keys, this works great.
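As a small illustration of the raw-comparator idea (in Python rather than
the JVM sort path, and with an encoding choice of my own, not the
Scalding/Cascading one): if keys are serialized in an order-preserving
encoding, the sort can compare byte slices directly instead of
deserializing two objects per comparison.

```python
import struct

# Big-endian unsigned ints are an order-preserving encoding: comparing
# the raw bytes lexicographically gives the same order as comparing the
# numbers themselves.
def encode(n: int) -> bytes:
    return struct.pack(">I", n)

keys = [42, 7, 1000, 0]
raw_sorted = sorted(encode(k) for k in keys)   # byte compares, no decode
decoded = [struct.unpack(">I", b)[0] for b in raw_sorted]
# decoded comes out in numeric order without any per-comparison
# deserialization (or allocation) during the sort.
```

This only works when the byte representation sorts the same way as the
logical value, which is exactly what the ordered-serialization work in
Scalding is about for complex keys.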

Cheers,
Gopal