Posted to user@tez.apache.org by Raajay <ra...@gmail.com> on 2015/12/01 19:36:10 UTC

Shared object registry

How do I use the shared object registry effectively?

I created a Tez client as a session and submitted a DAG twice sequentially.

However, I did not see a noticeable difference in their run times. The query was TPC-DS query #3.

I had enabled container reuse in tez-site.xml. Are there other configs I need to set correctly to use shared objects?

- Raajay


RE: Shared object registry

Posted by Bikas Saha <bi...@apache.org>.
Yes. For your custom optimizations (beyond what stock Hive gives you), you will have to write custom code.

 


Re: Shared object registry

Posted by Raajay <ra...@gmail.com>.
I did not write my own processor. I just reuse the Tez work created by Hive,
so the processors are classes like HiveMap and HiveJoin, defined by Hive.

So if I understand the setup correctly, I can take advantage of the shared
object registry only by modifying these processors.

Thanks a lot!

Raajay


RE: Shared object registry

Posted by Bikas Saha <bi...@apache.org>.
To be clear, you have written your own processor that runs in your DAG vertices? Your processor runs your custom code for processing input data.

If yes, then the following applies.

You will get access to the registry from your context object.

You can use cacheForVertex() to cache for the lifetime of the vertex, cacheForDAG() to cache for the lifetime of the DAG, and cacheForSession() to cache for the lifetime of a session (which runs multiple DAGs). As for the key and value parameters: the key is any unique string used to look up the value, and the value is any Java object (say a map or a list). For performance, you would want to cache the object in a form that can be used immediately without any conversion.
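
As a rough illustration of the three lifetimes described above, here is a minimal Java sketch. ScopedRegistrySketch, its Scope enum, and the vertexFinished()/dagFinished() hooks are hypothetical stand-ins written for this example; they are not the Tez implementation. Only the three cacheFor*() method names and the string-key/object-value shape come from the description above. In a real processor the registry would come from the context object rather than being constructed directly.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a scoped object registry (not Tez code). */
final class ScopedRegistrySketch {
    /** Lifetime attached to each cached entry. */
    enum Scope { VERTEX, DAG, SESSION }

    private static final class Entry {
        final Scope scope;
        final Object value;
        Entry(Scope scope, Object value) {
            this.scope = scope;
            this.value = value;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Each method stores the value under the key and returns the previous value, if any.
    Object cacheForVertex(String key, Object value) { return put(Scope.VERTEX, key, value); }
    Object cacheForDAG(String key, Object value) { return put(Scope.DAG, key, value); }
    Object cacheForSession(String key, Object value) { return put(Scope.SESSION, key, value); }

    /** Returns the cached value, or null if nothing is cached under the key. */
    Object get(String key) {
        Entry e = entries.get(key);
        return e == null ? null : e.value;
    }

    /** Assumed lifecycle hook: a vertex finished, so vertex-scoped entries are dropped. */
    void vertexFinished() {
        entries.values().removeIf(e -> e.scope == Scope.VERTEX);
    }

    /** Assumed lifecycle hook: a DAG finished, so only session-scoped entries survive. */
    void dagFinished() {
        entries.values().removeIf(e -> e.scope != Scope.SESSION);
    }

    private Object put(Scope scope, String key, Object value) {
        Entry old = entries.put(key, new Entry(scope, value));
        return old == null ? null : old.value;
    }

    public static void main(String[] args) {
        ScopedRegistrySketch registry = new ScopedRegistrySketch();
        registry.cacheForVertex("vertex.scratch", "per-vertex state");
        registry.cacheForDAG("dag.dimension", "per-DAG lookup table");
        registry.cacheForSession("session.dictionary", "cross-DAG dictionary");

        registry.vertexFinished();
        System.out.println("after vertex: " + registry.get("vertex.scratch"));

        registry.dagFinished();
        System.out.println("after DAG, DAG entry: " + registry.get("dag.dimension"));
        System.out.println("after DAG, session entry: " + registry.get("session.dictionary"));
    }
}
```

For the scenario in this thread (the same DAG submitted twice in one session), only an entry cached with session lifetime would still be available to the second DAG.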

 

There is a toy example of the usage in the Tez source code in BroadcastAndOneToOneExample.java

 

The Javadoc for the object registry has more details. Please open a JIRA if the Javadoc is not clear enough.

 


Re: Shared object registry

Posted by Raajay <ra...@gmail.com>.
I am running a custom application; however, the DAG is created similarly to
the DAG that Hive would have created for the TPC-DS query. I use "TezClient"
to submit these DAGs.

How can I use shared objects explicitly?

I understand that the object registry provides a key-value interface. But if
I want to dump intermediate data (say, the output of mappers for small jobs)
into the shared object registry, how should I do that?

Raajay


RE: Shared object registry

Posted by Bikas Saha <bi...@apache.org>.
The object registry is a user-enabled feature provided by Tez to the
application (e.g. Hive and Pig). If the application chooses to use it, it can
do some userland caching across tasks/vertices/DAGs. For example, Hive caches
the smaller broadcast side of a broadcast join in the shared object
registry.

The object registry is not an automatic data caching or input caching mechanism.

What application/job are you running? Hive/Pig/custom? Unless the
application (like Hive) has used object caching for a cross-DAG scenario
(which AFAIK it does not), you will not see any difference. If it's custom,
then you will have to explicitly use the object registry in a manner that
makes sense for your app.
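
To make the "explicit, not automatic" point concrete, here is a small Java sketch of the check-then-build pattern an application would have to write itself. BroadcastCacheSketch, smallSideTable(), buildCount(), and the plain HashMap standing in for the registry are all assumptions made for illustration; none of this is Tez or Hive code. The idea is that a task first looks up a key and only builds and stores the small side of the join when nothing is cached yet.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of application-side caching; the HashMap stands in for the registry. */
final class BroadcastCacheSketch {
    private final Map<String, Object> registry = new HashMap<>();
    private int builds = 0;

    /**
     * Returns the cached lookup table for the key, building and caching it on a miss.
     * The loader represents the expensive step (reading and hashing the broadcast input).
     */
    @SuppressWarnings("unchecked")
    Map<String, String> smallSideTable(String key, Supplier<Map<String, String>> loader) {
        Object cached = registry.get(key);
        if (cached != null) {
            // Hit: reuse the object as-is, with no conversion or rebuild.
            return (Map<String, String>) cached;
        }
        Map<String, String> table = loader.get();
        builds++;
        registry.put(key, table);
        return table;
    }

    /** Number of times the loader actually ran. */
    int buildCount() {
        return builds;
    }

    public static void main(String[] args) {
        BroadcastCacheSketch cache = new BroadcastCacheSketch();
        Supplier<Map<String, String>> loader = () -> {
            Map<String, String> table = new HashMap<>();
            table.put("1", "first dimension row");
            table.put("2", "second dimension row");
            return table;
        };

        // Two "tasks" asking for the same table: only the first one pays for the build.
        Map<String, String> firstTask = cache.smallSideTable("join.smallSide", loader);
        Map<String, String> secondTask = cache.smallSideTable("join.smallSide", loader);

        System.out.println("same instance: " + (firstTask == secondTask));
        System.out.println("builds: " + cache.buildCount());
    }
}
```

Because the cached value is an in-memory Java object, a later task presumably benefits only when it runs in a container that already holds that object, which is where container reuse comes in. If the application never stores anything, there is nothing to hit, which matches the unchanged run times reported in the original question.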

