Posted to user@tez.apache.org by Rajesh Balamohan <rb...@apache.org> on 2015/12/01 01:11:52 UTC

Re: Running tez jobs with data in memory

1. Is it possible to determine from the tez history logs, what the
bottleneck for a task/vertex is? Whether it is compute, disk or network?

- Vertex counters and task counters for the vertex can be examined to
determine this. If you have enabled ATS, they are available in TEZ-UI
itself; otherwise they should be available in the job logs. However, the
cause is not always directly compute/disk/network.  Sometimes the vertex
is delayed because it has to wait for data from the source vertex (think
of it more as a data dependency), sometimes because tasks in the source
vertex were re-executed after failures such as bad disks, sometimes
because of cluster slot unavailability, and so on.  You can also look at
using CriticalPathAnalyzer (an early version is available in 0.8.x),
which can help determine the critical path of the DAG and whether a
vertex was slow due to one of these different conditions. E.g.:

HADOOP_CLASSPATH=$TEZ_HOME/*:$TEZ_HOME/lib/*:$HADOOP_CLASSPATH \
  yarn jar $TEZ_HOME/tez-job-analyzer-0.8.2-SNAPSHOT.jar CriticalPath \
  --outputDir=/tmp/ --dagId=dag_1443665985063_58064_1

2. What are the common ways to get Tez work on data in memory, as opposed
to reading from HDFS. This is to minimize the duration mappers spend in
reading from HDFS or disk.

- Not sure if you are trying to compare with Spark's way of loading data
into memory and working on it.  Tez does not have a direct equivalent for
this, but it does have an ObjectRegistry (see BroadcastAndOneToOneExample
<https://github.com/apache/tez/blob/b153035b076d4603eb6bc771d675d64181eb02e9/tez-tests/src/main/java/org/apache/tez/mapreduce/examples/BroadcastAndOneToOneExample.java>
in the Tez codebase) through which data can be stored in memory and shared
between tasks.
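To make the pattern concrete: Tez's real ObjectRegistry is obtained inside a
processor from the processor context, and exposes methods in the style of
cacheForVertex/cacheForDAG and get. A runnable Tez job won't fit in a mail, so
the sketch below models the same caching pattern with a plain map; the class
and names (SimpleObjectRegistry, loadSideTable, the "side-table" key) are
illustrative stand-ins, not Tez API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for Tez's ObjectRegistry: tasks that land in the same
// container/JVM can share loaded objects instead of re-reading them
// from HDFS for every task.
class SimpleObjectRegistry {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Mirrors the cacheForVertex-style publish; returns prior value, if any.
    Object cacheForVertex(String key, Object value) {
        return cache.put(key, value);
    }

    Object get(String key) {
        return cache.get(key);
    }
}

public class ObjectRegistrySketch {
    // Simulates an expensive load, e.g. a broadcast side-table from HDFS.
    static String loadSideTable() {
        return "side-table-contents";
    }

    public static void main(String[] args) {
        SimpleObjectRegistry registry = new SimpleObjectRegistry();

        // First task on this JVM: cache miss, so load and publish.
        String table = (String) registry.get("side-table");
        if (table == null) {
            table = loadSideTable();
            registry.cacheForVertex("side-table", table);
        }

        // A later task of the same vertex: cache hit, no second load.
        String cached = (String) registry.get("side-table");
        System.out.println(cached);
    }
}
```

Note the scope choice matters in the real API: data cached for a vertex is
released earlier than data cached for the DAG or session.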

~Rajesh.B

On Tue, Dec 1, 2015 at 12:33 AM, Raajay <ra...@gmail.com> wrote:

> Hello,
>
> Two questions
>
> 1. Is it possible to determine from the tez history logs, what the
> bottleneck for a task/vertex is? Whether it is compute, disk or network?
>
> 2. What are the common ways to get Tez work on data in memory, as opposed
> to reading from HDFS. This is to minimize the duration mappers spend in
> reading from HDFS or disk.
>
> Thanks
> Raajay
>

RE: Running tez jobs with data in memory

Posted by Bikas Saha <bi...@apache.org>.
In the HDFS in-memory tier, caching is best effort. Data is written to RAM
and asynchronously persisted to disk. This keeps the data reliably
available despite memory pressure or a machine reboot, so the application
remains functional.

 

The data continues to reside in memory until memory pressure forces it to
be released, so the read path also benefits from the performance gains.
There were some performance bottlenecks in the HDFS read path, and they
were fixed as part of the in-memory tier changes.

 

Bikas

 

From: Raajay [mailto:raajay.v@gmail.com] 
Sent: Monday, November 30, 2015 4:41 PM
To: user@tez.apache.org
Subject: Re: Running tez jobs with data in memory

 

Great, thanks! Am I right in inferring that the HDFS in-memory tier helps
speed up writes but not reads? Reads might still happen from disk as there
is no caching in RAM.

 

One of the alternatives I was exploring was running Tez atop Tachyon, but
I have not been able to get that working so far :(

 

Raajay

 



Re: Running tez jobs with data in memory

Posted by Rajesh Balamohan <ra...@gmail.com>.
Adding more to #2. Alternatively, you may want to consider adding paths to
the HDFS in-memory tier
(https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/MemoryStorage.html).
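Roughly, per the linked MemoryStorage page, each DataNode is given a
RAM-backed directory (a tmpfs/ramfs mount) tagged RAM_DISK in hdfs-site.xml;
the mount point below is a made-up example, not a required path:

```xml
<!-- Hypothetical hdfs-site.xml fragment: /mnt/dfs/ram must be a tmpfs or
     ramfs mount on the DataNode; [RAM_DISK] marks it as memory storage. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/dfs/data,[RAM_DISK]/mnt/dfs/ram</value>
</property>
<property>
  <!-- Storage policies must be enabled for LAZY_PERSIST to take effect. -->
  <name>dfs.storage.policy.enabled</name>
  <value>true</value>
</property>
```

Applications then opt in per path, e.g. with
hdfs storagepolicies -setStoragePolicy -path /tmp/scratch -policy LAZY_PERSIST
(the /tmp/scratch path is illustrative), or by passing the LAZY_PERSIST flag
when creating files.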

~Rajesh.B



-- 
~Rajesh.B