Posted to dev@tez.apache.org by Achal Soni <as...@twitter.com> on 2013/07/26 22:59:47 UTC

Distributed Cache in Tez

Hey all,

Has any thought been given to a distributed cache in Tez? It seems that it is
almost as simple as adding local files to vertices via YARN.

Is there any insight into how DistributedCache differs from adding
LocalResources? Should I be looking into MRApps for helper methods?

Thanks!

Achal

Re: Distributed Cache in Tez

Posted by Hitesh Shah <hi...@apache.org>.

Sorry - looks like I was wrong. MR supports an explicit API to add an archive to the classpath, but does nothing implicitly for files or archives added to the distributed cache. I am still not sure that we need to add this helper API, as the archive structure is not known to Tez. A user can supply a simple jar, a fat jar (a jar of jars), or a tarball of jars, and in each case the user is likely the person best placed to understand the unpacked layout in the container's working dir and to modify the classpath as needed. As long as the user knows the structure, using relative paths based on the working dir will result in a valid classpath.

Given that you believe there is a need for a helper API for Pig, and one that could potentially be useful to other users, please go ahead and file a JIRA with the helper API proposal.

-- Hitesh

On Jul 26, 2013, at 5:23 PM, Achal Soni wrote:

> Yes, I agree with both things. If Dist-Cache truly does that, then we should
> not mimic that in Tez. However, I don't think it is fair to put the onus
> on the user to handle the jars. We use it extensively in Pig with the
> "register" command, and while Pig could attempt to modify the classpath
> correctly, I think some helper functions/support are needed from Tez.

Re: Distributed Cache in Tez

Posted by Achal Soni <as...@twitter.com>.

Yes, I agree with both things. If Dist-Cache truly does that, then we should
not mimic that in Tez. However, I don't think it is fair to put the onus
on the user to handle the jars. We use it extensively in Pig with the
"register" command, and while Pig could attempt to modify the classpath
correctly, I think some helper functions/support are needed from Tez.


On Fri, Jul 26, 2013 at 3:58 PM, Bikas Saha <bi...@hortonworks.com> wrote:

> With local resources used in the Tez API, we have the choice of being
> efficient and using specific resources for specific vertices. So if an
> enormous lookup table is used by vertex1, then it can be added to only
> that vertex. I think dist-cache ends up making all resources available to
> all tasks even if they don't need them.
>
> Hitesh, the archive jars are an interesting point. Without
> YARN/LocalResource-NM/Tez support to help edit the classpath (or even to
> notify users that they need to do something to make that archive available
> via the classpath), it's hard for users to do the right thing.
>
> Bikas

RE: Distributed Cache in Tez

Posted by Bikas Saha <bi...@hortonworks.com>.

With local resources used in the Tez API, we have the choice of being
efficient and using specific resources for specific vertices. So if an
enormous lookup table is used by vertex1, then it can be added to only
that vertex. I think dist-cache ends up making all resources available to
all tasks even if they don't need them.
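
To make the per-vertex point concrete, here is a minimal sketch in plain Java. It does not use the Tez or YARN APIs; the class VertexResourcesSketch, its methods, and the example paths are all hypothetical and only model which files a task of a given vertex would have localized under each approach.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model: which files the tasks of each vertex would have localized. */
final class VertexResourcesSketch {
    // vertex name -> HDFS paths attached to that vertex only
    private final Map<String, Set<String>> byVertex = new LinkedHashMap<>();

    /** Attach a resource to a single vertex (the per-vertex style described above). */
    void addToVertex(String vertex, String hdfsPath) {
        byVertex.computeIfAbsent(vertex, v -> new LinkedHashSet<>()).add(hdfsPath);
    }

    /** Resources localized for tasks of one vertex when resources are scoped per vertex. */
    Set<String> perVertex(String vertex) {
        Set<String> paths = byVertex.get(vertex);
        return paths == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(paths);
    }

    /** Resources every task would receive if everything were shared job-wide. */
    Set<String> jobWide() {
        Set<String> all = new LinkedHashSet<>();
        for (Set<String> paths : byVertex.values()) {
            all.addAll(paths);
        }
        return all;
    }

    public static void main(String[] args) {
        VertexResourcesSketch dag = new VertexResourcesSketch();
        dag.addToVertex("vertex1", "/data/huge-lookup-table.bin");
        dag.addToVertex("vertex2", "/apps/udfs.jar");
        System.out.println("vertex2, per-vertex: " + dag.perVertex("vertex2"));
        System.out.println("vertex2, job-wide:   " + dag.jobWide());
    }
}
```

In this model a vertex2 task only sees the lookup table in the job-wide case, which is the overhead the paragraph above attributes to dist-cache.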

Hitesh, the archive jars are an interesting point. Without
YARN/LocalResource-NM/Tez support to help edit the classpath (or even to
notify users that they need to do something to make that archive available
via the classpath), it's hard for users to do the right thing.

Bikas

-----Original Message-----
From: Hitesh Shah [mailto:hitesh@apache.org]
Sent: Friday, July 26, 2013 3:03 PM
To: dev@tez.incubator.apache.org
Subject: Re: Distributed Cache in Tez

Hi Achal,

Yes, you are right - I forgot to mention that bit. Distributed cache does
also modify the classpath to account for the jars being distributed.

One thing to note is that LocalResources support 2 modes - file mode and
archive mode. In file mode, the jar will be localized as a single
file. In archive mode, the jar will be unzipped. By default, Tez adds
PWD/* to the classpath of a task, so all local resources in file mode are
accounted for. For archive mode, the onus is on the user to modify the
classpath as needed to ensure that all the paths of the unzipped structure
are accounted for in the classpath for the task.

Thanks for raising these questions. If you see something lacking in the
javadocs, would you mind filing JIRAs so that we can address such issues?

thanks
-- Hitesh


Re: Distributed Cache in Tez

Posted by Hitesh Shah <hi...@apache.org>.

Hi Achal,

Yes, you are right - I forgot to mention that bit. Distributed cache does also modify the classpath to account for the jars being distributed.

One thing to note is that LocalResources support 2 modes - file mode and archive mode. In file mode, the jar will be localized as a single file. In archive mode, the jar will be unzipped. By default, Tez adds PWD/* to the classpath of a task, so all local resources in file mode are accounted for. For archive mode, the onus is on the user to modify the classpath as needed to ensure that all the paths of the unzipped structure are accounted for in the classpath for the task.
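
As an illustration of the file/archive distinction, the following plain-Java sketch builds a task classpath string. It starts with the default PWD/* entry and adds entries only for archive-mode resources. Everything in it is an assumption made for the example rather than Tez or YARN API: the Mode enum, the buildClasspath helper, the literal "PWD" placeholder, and the idea that the caller supplies the jar directories inside each unpacked archive. It only handles directories that contain jars; an archive that unpacks to loose class files would need the directory itself on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: derive classpath entries for file-mode and archive-mode resources. */
final class TaskClasspathSketch {
    enum Mode { FILE, ARCHIVE }

    // Placeholder for the container working directory, written as in the thread.
    static final String PWD = "PWD";

    /**
     * @param resources        link name in the working dir -> localization mode
     * @param jarDirsInArchive archive link name -> directories inside the unpacked
     *                         archive that hold jars ("" means the archive root)
     */
    static String buildClasspath(Map<String, Mode> resources,
                                 Map<String, List<String>> jarDirsInArchive) {
        List<String> entries = new ArrayList<>();
        entries.add(PWD + "/*"); // default entry; covers jars localized in file mode
        for (Map.Entry<String, Mode> resource : resources.entrySet()) {
            if (resource.getValue() != Mode.ARCHIVE) {
                continue; // file-mode jars sit directly in PWD and are already covered
            }
            String root = PWD + "/" + resource.getKey();
            List<String> dirs = jarDirsInArchive.get(resource.getKey());
            if (dirs == null) {
                dirs = Collections.emptyList(); // layout unknown: nothing can be added
            }
            for (String dir : dirs) {
                entries.add(dir.isEmpty() ? root + "/*" : root + "/" + dir + "/*");
            }
        }
        return String.join(":", entries);
    }

    public static void main(String[] args) {
        Map<String, Mode> resources = new LinkedHashMap<>();
        resources.put("udfs.jar", Mode.FILE);
        resources.put("deps", Mode.ARCHIVE);
        Map<String, List<String>> layout = new LinkedHashMap<>();
        layout.put("deps", Arrays.asList("", "lib"));
        System.out.println(buildClasspath(resources, layout));
    }
}
```

The caller has to pass the archive layout in explicitly, which mirrors the point that only the user knows what the unzipped structure looks like.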

Thanks for raising these questions. If you see something lacking in the javadocs, would you mind filing JIRAs so that we can address such issues?

thanks
-- Hitesh 


On Jul 26, 2013, at 2:22 PM, Achal Soni wrote:

> Hi Hitesh,
> 
> I think the DistributedCache name is very misleading because it really
> doesn't act much like a cache, for the reasons you stated above. The
> management of these additional files and jars is, I think, a different
> discussion and definitely out of the scope of Tez. I agree that clients
> like Pig and Hive should be more mindful and perhaps develop systems for
> managing these files.
>
> I think that the LocalResources way is perfectly suitable. What I don't
> quite get is whether there is any difference between what DistributedCache
> offers to the client and what LocalResources offer. For example, Pig needs
> to distribute jars to the nodes. These need to be added to the classpath.
> Does DC do that while the LR way won't?
> 
> Thanks,
> Achal

Re: Distributed Cache in Tez

Posted by Achal Soni <as...@twitter.com>.

Hi Hitesh,

I think the DistributedCache name is very misleading because it really
doesn't act much like a cache, for the reasons you stated above. The
management of these additional files and jars is, I think, a different
discussion and definitely out of the scope of Tez. I agree that clients
like Pig and Hive should be more mindful and perhaps develop systems for
managing these files.

I think that the LocalResources way is perfectly suitable. What I don't
quite get is whether there is any difference between what DistributedCache
offers to the client and what LocalResources offer. For example, Pig needs
to distribute jars to the nodes. These need to be added to the classpath.
Does DC do that while the LR way won't?

Thanks,
Achal


On Fri, Jul 26, 2013 at 2:11 PM, Hitesh Shah <hi...@apache.org> wrote:

> Hi Achal
>
> We want to force folks to use local resources as it makes the users more
> aware of how to use the cache.
>
> Pushing local files to distributed cache for each job does not bring any
> performance improvement. All it does is ensure that the local files are now
> available on the remote node in the cluster where the task is run. It also
> requires uploading the local files to HDFS each and every time. This also
> means that, given that there is a new HDFS file each and every time, the
> "cache" on the remote node cannot be reused.
>
> With local resources, the user is making a conscious choice of first
> uploading a local file to HDFS and then adding the HDFS file as a local
> resource for the remote task. As long as the file on HDFS remains
> unchanged, the remote node will re-use the local copy (the local copy is
> downloaded from HDFS once, the first time around). With this in mind, a
> user will be more mindful of when to upload a local file and how to re-use
> HDFS-based resources across jobs. A user would now realize the penalty
> of uploading a non-changing jar for each and every job (as was done by
> Hive earlier).
>
> In the case of helpers, are you looking at a helper method for creating
> local resources out of files that change for each and every job?
>
> Furthermore, there is the question of managing these uploaded files.
> When should they be deleted - after the job completes? If yes, is the AM
> supposed to delete them or the client? What if a client does not hang
> around for the job to complete or is killed before it can clean up the
> files?
>
> thanks
> -- Hitesh

Re: Distributed Cache in Tez

Posted by Hitesh Shah <hi...@apache.org>.

Hi Achal 

We want to force folks to use local resources as it makes the users more aware of how to use the cache. 

Pushing local files to distributed cache for each job does not bring any performance improvement. All it does is ensure that the local files are now available on the remote node in the cluster where the task is run. It also requires uploading the local files to HDFS each and every time. This also means that, given that there is a new HDFS file each and every time, the "cache" on the remote node cannot be reused.

With local resources, the user is making a conscious choice of first uploading a local file to HDFS and then adding the HDFS file as a local resource for the remote task. As long as the file on HDFS remains unchanged, the remote node will re-use the local copy (the local copy is downloaded from HDFS once, the first time around). With this in mind, a user will be more mindful of when to upload a local file and how to re-use HDFS-based resources across jobs. A user would now realize the penalty of uploading a non-changing jar for each and every job (as was done by Hive earlier).
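
The reuse behaviour described above can be sketched with a toy cache in plain Java. This is not the YARN localizer. It assumes, for illustration only, that a node identifies a remote file by its HDFS path, size and modification time; the names LocalizationCacheSketch and ResourceKey and the paths are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical model of a node-local cache of files downloaded from HDFS. */
final class LocalizationCacheSketch {

    /** Identity of a remote file as the node sees it (an assumption, not the YARN class). */
    static final class ResourceKey {
        final String hdfsPath;
        final long size;
        final long modificationTime;

        ResourceKey(String hdfsPath, long size, long modificationTime) {
            this.hdfsPath = hdfsPath;
            this.size = size;
            this.modificationTime = modificationTime;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof ResourceKey)) {
                return false;
            }
            ResourceKey that = (ResourceKey) other;
            return hdfsPath.equals(that.hdfsPath)
                    && size == that.size
                    && modificationTime == that.modificationTime;
        }

        @Override
        public int hashCode() {
            return Objects.hash(hdfsPath, size, modificationTime);
        }
    }

    private final Map<ResourceKey, String> localCopies = new HashMap<>();
    private int downloads = 0;

    /** Returns the local path, "downloading" only when no matching copy exists yet. */
    String localize(ResourceKey key) {
        String local = localCopies.get(key);
        if (local == null) {
            downloads++;
            local = "/local/cache/" + downloads;
            localCopies.put(key, local);
        }
        return local;
    }

    int downloads() {
        return downloads;
    }

    public static void main(String[] args) {
        LocalizationCacheSketch node = new LocalizationCacheSketch();
        // Two jobs referencing the same unchanged HDFS file: one download.
        node.localize(new ResourceKey("/apps/pig/udfs.jar", 1024L, 1000L));
        node.localize(new ResourceKey("/apps/pig/udfs.jar", 1024L, 1000L));
        // Two jobs that each re-upload the jar to a fresh location: two more downloads.
        node.localize(new ResourceKey("/tmp/job_1/udfs.jar", 1024L, 2000L));
        node.localize(new ResourceKey("/tmp/job_2/udfs.jar", 1024L, 3000L));
        System.out.println("downloads: " + node.downloads());
    }
}
```

Under these assumptions, a jar uploaded once and referenced by many jobs is fetched a single time per node, whereas a jar re-uploaded for every job never hits the cache.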

In the case of helpers, are you looking at a helper method for creating local resources out of files that change for each and every job? 

Furthermore, there is the question of managing these uploaded files. When should they be deleted - after the job completes? If yes, is the AM supposed to delete them or the client? What if a client does not hang around for the job to complete or is killed before it can clean up the files?

thanks
-- Hitesh

On Jul 26, 2013, at 1:59 PM, Achal Soni wrote:

> Hey all,
> 
> Has any thought been given to a distributed cache in Tez? It seems that it is
> almost as simple as adding local files to vertices via YARN.
> 
> Is there any insight into how DistributedCache differs from adding
> LocalResources? Should I be looking into MRApps for helper methods?
> 
> Thanks!
> 
> Achal