Posted to general@hadoop.apache.org by zhangguoping zhangguoping <zh...@gmail.com> on 2010/07/06 10:53:45 UTC

Distributed Cache

From the book "Hadoop: The Definitive Guide", p. 242:
>>
When you launch a job, Hadoop copies the files specified by the -files and
-archives options to the jobtracker's filesystem (normally HDFS). Then, before
a task is run, the tasktracker copies the files from the jobtracker's
filesystem to a local disk—the cache—so the task can access the files.
>>

I wonder why Hadoop wants to copy the files to the jobtracker's filesystem.
Since they are already in HDFS, they should be available to the tasks.
Any considerations?

Re: Distributed Cache

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

> From the book "Hadoop: The Definitive Guide", p. 242:
>>>
> When you launch a job, Hadoop copies the files specified by the -files and
> -archives options to the jobtracker's filesystem (normally HDFS). Then,
> before a task is run, the tasktracker copies the files from the jobtracker's
> filesystem to a local disk—the cache—so the task can access the files.
>>>
>
> I wonder why Hadoop wants to copy the files to the jobtracker's filesystem.
> Since they are already in HDFS, they should be available to the tasks.
> Any considerations?

Unlike the input data files for M/R tasks, -files and -archives are
options for copying additional files (such as configuration files)
that all of the M/R tasks might need when running. Such files typically
need to be transferred from the local machine where the job is
launched to the cluster nodes where the tasks run. Think of them as
convenient shortcuts for distributing files to all the tasks.
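
For example, here is a rough sketch of that pattern (the jar, class, and
file names are all made up): a side file shipped with -files can be opened
by its plain name inside a task, because the tasktracker symlinks the
localized copy into the task's working directory.

    // Launch (hypothetical names):
    //   hadoop jar myjob.jar MyJob -files /local/path/lookup.txt input output

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Set<String> lookup = new HashSet<String>();

        @Override
        protected void setup(Context context) throws IOException {
            // lookup.txt was distributed with -files and localized on this node
            BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    lookup.add(line.trim());
                }
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit only records whose first field appears in the side file
            String field = value.toString().split("\t")[0];
            if (lookup.contains(field)) {
                context.write(new Text(field), value);
            }
        }
    }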

Makes sense?

Thanks
Hemanth

Re: Distributed Cache

Posted by Amareshwari Sri Ramadasu <am...@yahoo-inc.com>.
If they are already in the JobTracker's filesystem (HDFS), the job client does not copy them. It copies them only if they are on the local filesystem or some other filesystem.
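
For illustration, a rough sketch of that distinction (the hostname and paths
below are made up): a cache file referenced by an hdfs:// URI is only
recorded at submit time, whereas a local path is first copied up to HDFS.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class CacheFileExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // File already on HDFS (the jobtracker's filesystem): the
            // submit-time copy is skipped; the URI is just recorded, and each
            // tasktracker localizes it to its own disk before running tasks.
            DistributedCache.addCacheFile(
                    new URI("hdfs://namenode/shared/lookup.txt"), conf);

            // The same distinction applies to -files on the command line:
            //   -files hdfs://namenode/shared/lookup.txt   (no copy to HDFS)
            //   -files /local/path/lookup.txt              (copied to HDFS first)
        }
    }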

On 7/6/10 2:23 PM, "zhangguoping zhangguoping" <zh...@gmail.com> wrote:


I wonder why Hadoop wants to copy the files to the jobtracker's filesystem.
Since they are already in HDFS, they should be available to the tasks.
Any considerations?