Posted to common-dev@hadoop.apache.org by Gang Luo <lg...@yahoo.com.cn> on 2010/08/20 17:08:12 UTC
where distributed cache start working
Hi all,
I went through the code but couldn't find the place where the distributed cache
starts working. Between DistributedCache.addCacheFile at the master node
and DistributedCache.getLocalCacheFiles at the client side, when and where do
the files get distributed?
Thanks,
-Gang
Re: where distributed cache start working
Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi Jeff,
I realize that the profiling runs within each JVM, while the distributed cache
seems to start before the JVM starts. That is probably why I couldn't trace it.
Thanks,
-Gang
----- Original Message ----
From: Jeff Zhang <zj...@gmail.com>
To: common-dev@hadoop.apache.org
Sent: 2010/8/23 (Mon) 12:47:31 AM
Subject: Re: where distributed cache start working
Are you debugging it with LocalJobRunner? In local mode, TaskRunner won't
be called.
In local mode, the mapper task runs in a thread rather than in a forked JVM.
TaskRunner is only called in distributed mode.
Re: where distributed cache start working
Posted by Jeff Zhang <zj...@gmail.com>.
Are you debugging it with LocalJobRunner? In local mode, TaskRunner won't
be called.
In local mode, the mapper task runs in a thread rather than in a forked JVM.
TaskRunner is only called in distributed mode.
2010/8/22 Gang Luo <lg...@yahoo.com.cn>:
> Thanks Jeff.
>
> However, are you sure TaskRunner.run() is also used in the new API? I used btrace
> to trace the function calls but didn't find this function being called
> anywhere.
>
>
> One more question about the distributed cache. After I call
> DistributedCache.purgeCache, I think the local cached files should be deleted or
> invalidated. However, when I run the same job multiple times with the purge
> operation at the end, I find the local files are never deleted and their
> modification time is still that of the first job run. How can I make my job
> re-distribute the cache anyway?
>
> Thanks,
> -Gang
--
Best Regards
Jeff Zhang
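The difference between the two modes can be seen without Hadoop. The following is a plain-Java illustration, not Hadoop code: a task run in a thread shares the launcher's JVM (same process ID), while a forked child is a separate process, so a tracer attached only to the child would not see work the parent did before the fork. The class and method names are hypothetical, and the sketch assumes Java 9 or later for ProcessHandle.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Illustration only: thread in the same JVM versus forked child JVM. Not Hadoop code. */
final class ThreadVersusForkSketch {

    /** Runs a "task" in a thread and returns the process ID that the task observes. */
    static long pidSeenByThreadTask() {
        AtomicLong seen = new AtomicLong(-1);
        Thread task = new Thread(() -> seen.set(ProcessHandle.current().pid()));
        task.start();
        try {
            task.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task", e);
        }
        return seen.get();
    }

    /** Forks a child JVM (it only prints its version) and returns the child's process ID. */
    static long pidOfForkedJvm() throws IOException, InterruptedException {
        // Reuse the running JVM's executable if it is known; otherwise assume "java" is on the PATH.
        String java = ProcessHandle.current().info().command().orElse("java");
        Process child = new ProcessBuilder(java, "-version")
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        long pid = child.pid();
        child.waitFor();
        return pid;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("launcher pid:         " + ProcessHandle.current().pid());
        System.out.println("thread task pid:      " + pidSeenByThreadTask());
        System.out.println("forked child JVM pid: " + pidOfForkedJvm());
    }
}
```

The thread task reports the launcher's own process ID, whereas the forked JVM reports a different one.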
Re: where distributed cache start working
Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,
> Thanks Arun. Changing the mtime is a good idea. However, given a file (the path is
> A/B/C/D/file) distributed to all the nodes, if I just change the mtime of the file
> to an earlier timestamp, it will not be replaced next time. Should I also change
> the mtime of all the directories along the path (A, B, C and D)? Whose
> timestamp is used by DistributedCache?
It is the timestamp of the file on DFS. So, if you modify the file's
timestamp on DFS, it should be re-distributed to all the nodes.
Thanks
Hemanth
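The rule Hemanth describes can be modelled in a few lines. This is a simplified sketch, not Hadoop's implementation: the class, its map, and the hdfs:// URI used in main are hypothetical. It assumes a refresh is triggered whenever the file's DFS timestamp differs from the one recorded at the last copy; the thread does not say whether the real check is "differs" or "newer". Directory timestamps along the path play no part in the model.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of mtime-based cache freshness. NOT Hadoop's actual code. */
final class TimestampCacheSketch {
    // Hypothetical bookkeeping: file URI -> DFS mtime recorded when the file was last copied.
    private final Map<String, Long> localizedMtime = new HashMap<>();
    private int copies = 0;

    /**
     * Decides whether the node must (re)copy the file.
     * Returns true if a copy was made, false if the existing local copy was reused.
     */
    boolean localize(String uri, long mtimeOnDfs) {
        Long recorded = localizedMtime.get(uri);
        if (recorded != null && recorded.longValue() == mtimeOnDfs) {
            return false; // same timestamp on DFS: the local copy is considered fresh
        }
        localizedMtime.put(uri, mtimeOnDfs); // assumption: any difference forces a refresh
        copies++;
        return true;
    }

    int copies() {
        return copies;
    }

    public static void main(String[] args) {
        TimestampCacheSketch node = new TimestampCacheSketch();
        String uri = "hdfs://nn/A/B/C/D/file"; // hypothetical URI
        System.out.println("first job copies:         " + node.localize(uri, 1000L));
        System.out.println("second job copies:        " + node.localize(uri, 1000L));
        System.out.println("after mtime change copies: " + node.localize(uri, 999L));
    }
}
```

In this model only the file's own timestamp on DFS is consulted, which matches the answer above.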
Re: where distributed cache start working
Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks Arun. Changing the mtime is a good idea. However, given a file (the path is
A/B/C/D/file) distributed to all the nodes, if I just change the mtime of the file
to an earlier timestamp, it will not be replaced next time. Should I also change
the mtime of all the directories along the path (A, B, C and D)? Whose
timestamp is used by DistributedCache?
Thanks.
-Gang
----- Original Message ----
From: Arun C Murthy <ac...@yahoo-inc.com>
To: mapreduce-user@hadoop.apache.org
Sent: 2010/8/22 (Sun) 9:38:02 PM
Subject: Re: where distributed cache start working
Moving to mapreduce-user@, bcc common-dev@. Please use the project-specific
lists.
DistributedCache.purgeCache isn't a public API. You shouldn't be calling it from
the task.
A simple way of doing what you want is to change the mtime of the cache files on
HDFS.
Arun
Re: where distributed cache start working
Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Moving to mapreduce-user@, bcc common-dev@. Please use the
project-specific lists.
DistributedCache.purgeCache isn't a public API. You shouldn't be
calling it from the task.
A simple way of doing what you want is to change the mtime of the
cache files on HDFS.
Arun
Re: where distributed cache start working
Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks Jeff.
However, are you sure TaskRunner.run() is also used in the new API? I used btrace
to trace the function calls but didn't find this function being called
anywhere.
One more question about the distributed cache. After I call
DistributedCache.purgeCache, I think the local cached files should be deleted or
invalidated. However, when I run the same job multiple times with the purge
operation at the end, I find the local files are never deleted and their
modification time is still that of the first job run. How can I make my job
re-distribute the cache anyway?
Thanks,
-Gang
----- Original Message ----
From: Jeff Zhang <zj...@gmail.com>
To: common-dev@hadoop.apache.org
Sent: 2010/8/20 (Fri) 11:22:49 AM
Subject: Re: where distributed cache start working
Hi Gang,
In TaskRunner's run() method, Hadoop downloads the cache files
that you set on the client side to the local disk; the forked child JVM
can then use these cache files locally.
--
Best Regards
Jeff Zhang
Re: where distributed cache start working
Posted by Jeff Zhang <zj...@gmail.com>.
Hi Gang,
In TaskRunner's run() method, Hadoop downloads the cache files
that you set on the client side to the local disk; the forked child JVM
can then use these cache files locally.
--
Best Regards
Jeff Zhang
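The order of events in Jeff's answer (register on the client, copy on the task node before the child starts, read locally in the child) can be sketched as a toy model. This is not Hadoop code: temporary directories stand in for HDFS and for the node's local disk, and the class, its fields, its methods, and the file name lookup.txt are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Toy model of "localize before launching the child". The names are hypothetical, not Hadoop's. */
final class LocalizeBeforeLaunchSketch {
    // Stand-ins for the job configuration that travels from the client to the task node.
    final List<Path> cacheFiles = new ArrayList<>();      // set on the client side
    final List<Path> localCacheFiles = new ArrayList<>(); // filled in on the task node

    /** Client side: only records which file should be cached; nothing is copied yet. */
    void addCacheFile(Path sharedFile) {
        cacheFiles.add(sharedFile);
    }

    /** Task-node side, before the child starts: copy every registered file to local disk. */
    void localize(Path localDir) {
        try {
            for (Path src : cacheFiles) {
                Path dst = localDir.resolve(src.getFileName());
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                localCacheFiles.add(dst);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Child side: the task reads only the copy that was localized earlier. */
    String readFirstLocalFile() {
        try {
            return new String(Files.readAllBytes(localCacheFiles.get(0)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the three steps in order against temporary directories. */
    static String demo() {
        try {
            Path shared = Files.createTempDirectory("shared-fs"); // stands in for HDFS
            Path local = Files.createTempDirectory("task-local"); // stands in for the node's disk
            Path file = Files.write(shared.resolve("lookup.txt"),
                    "k=v".getBytes(StandardCharsets.UTF_8));

            LocalizeBeforeLaunchSketch job = new LocalizeBeforeLaunchSketch();
            job.addCacheFile(file);          // 1. client registers the file
            job.localize(local);             // 2. task node copies it before the child starts
            return job.readFirstLocalFile(); // 3. child reads the local copy
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the model is the ordering: by the time the child-side code runs, the copy has already happened in a different step, which is why tracing only the child does not show it.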