Posted to common-dev@hadoop.apache.org by Gang Luo <lg...@yahoo.com.cn> on 2010/08/20 17:08:12 UTC

where distributed cache start working

Hi all,
I went through the code but couldn't find the place where the distributed cache
starts working. Between DistributedCache.addCacheFile at the master node
and DistributedCache.getLocalCacheFiles at the client side, when and where do
the files get distributed?


Thanks,
-Gang

Re: where distributed cache start working

Posted by Gang Luo <lg...@yahoo.com.cn>.

Hi Jeff,
I realize the profiling runs within each JVM, while the distributed cache
seems to start before the JVM starts. That is probably why I couldn't trace it.

Thanks,
-Gang





Re: where distributed cache start working

Posted by Jeff Zhang <zj...@gmail.com>.

Did you debug it using LocalJobRunner? In local mode, TaskRunner won't
be called.
In local mode, the mapper task runs in a thread rather than in a forked JVM.
TaskRunner is only called in distributed mode.
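
A minimal sketch of that difference, in plain Java with no Hadoop classes: a task run as a thread reports the same JVM identifier as its launcher, so a tracer that only watches forked child JVMs has nothing to observe in local mode. The class name LocalModeSketch and its helpers are hypothetical, and the runtime name (commonly pid@host) is used here only as a per-JVM identifier.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; none of these names come from Hadoop.
class LocalModeSketch {
    // Identifier of the JVM executing the caller (commonly "pid@host").
    static String jvmName() {
        return ManagementFactory.getRuntimeMXBean().getName();
    }

    // Runs a "task" as a thread, as the thread describes for local mode,
    // and returns the JVM identifier that the task observed.
    static String jvmSeenByThreadTask() {
        AtomicReference<String> seen = new AtomicReference<>();
        Thread task = new Thread(() -> seen.set(jvmName()));
        task.start();
        try {
            task.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("launcher JVM: " + jvmName());
        System.out.println("task JVM:     " + jvmSeenByThreadTask());
    }
}
```

Both lines print the same identifier. A task forked into a child JVM would report a different one.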




-- 
Best Regards

Jeff Zhang

Re: where distributed cache start working

Posted by Hemanth Yamijala <yh...@gmail.com>.

Hi,
> Thanks Arun. Changing the mtime is a good idea. However, given a file (the path is
> A/B/C/D/file) distributed to all the nodes, if I just change the mtime of the file
> to an earlier timestamp, it will not be replaced next time. Should I also change
> the mtime of all the directories along the path (A, B, C and D)? Whose
> timestamp is used by DistributedCache?

It is the timestamp of the file on DFS. So if you modify the file's
timestamp on DFS, it should be re-distributed to all the nodes.
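
A minimal sketch of that rule, in plain Java rather than Hadoop code: each node remembers the DFS mtime it saw when it localized a file, and fetches again only when the DFS mtime of that same file no longer matches. The class CacheFreshnessSketch and its methods are hypothetical, and treating any difference (not only a newer time) as stale is an assumption, since the thread does not say which comparison is used. Directory timestamps play no part in the check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one node's cache bookkeeping; not Hadoop's implementation.
class CacheFreshnessSketch {
    // DFS file path -> DFS mtime recorded when the file was last localized.
    private final Map<String, Long> localizedMtime = new HashMap<>();
    private int downloads = 0;

    // Returns true if the file had to be fetched (first use, or DFS mtime changed).
    boolean localize(String dfsPath, long dfsMtime) {
        Long recorded = localizedMtime.get(dfsPath);
        if (recorded != null && recorded == dfsMtime) {
            return false; // local copy is treated as current; nothing is copied
        }
        localizedMtime.put(dfsPath, dfsMtime); // assumption: any difference means stale
        downloads++;
        return true;
    }

    int downloads() {
        return downloads;
    }

    public static void main(String[] args) {
        CacheFreshnessSketch node = new CacheFreshnessSketch();
        System.out.println(node.localize("A/B/C/D/file", 100L)); // first job: fetched
        System.out.println(node.localize("A/B/C/D/file", 100L)); // same mtime: reused
        System.out.println(node.localize("A/B/C/D/file", 200L)); // mtime changed on DFS: fetched again
    }
}
```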

Thanks
Hemanth

Re: where distributed cache start working

Posted by Gang Luo <lg...@yahoo.com.cn>.

Thanks Arun. Changing the mtime is a good idea. However, given a file (the path is
A/B/C/D/file) distributed to all the nodes, if I just change the mtime of the file
to an earlier timestamp, it will not be replaced next time. Should I also change
the mtime of all the directories along the path (A, B, C and D)? Whose
timestamp is used by DistributedCache?

Thanks.
-Gang





Re: where distributed cache start working

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

Moving to mapreduce-user@, bcc common-dev@. Please use the
project-specific lists.

DistributedCache.purgeCache isn't a public API. You shouldn't be
calling it from the task.

A simple way of doing what you want is to change the mtime of the
cache files on HDFS.

Arun


Re: where distributed cache start working

Posted by Gang Luo <lg...@yahoo.com.cn>.

Thanks Jeff. 

However, are you sure TaskRunner.run() is also used in the new API? I used btrace
to trace the function calls but didn't find this function being called
anywhere.

One more question about the distributed cache. After I call
DistributedCache.purgeCache, I think the local cached files should be deleted or
invalidated. However, when I run the same job multiple times with the purge
operation at the end, I find the local files have never been deleted and their
modification time is that of the first job run. How can I ask my job to
re-distribute the cache anyway?

Thanks,
-Gang


Re: where distributed cache start working

Posted by Jeff Zhang <zj...@gmail.com>.

Hi Gang,

In TaskRunner's run() method, Hadoop downloads the cache files
that you set on the client side to the local node; the forked child JVM
can then use these cache files locally.
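
The order of those steps can be sketched in plain Java, with a Map standing in for the job configuration and temporary directories standing in for DFS and the node-local disk. Everything here is hypothetical: the class CacheFlowSketch, the two configuration keys, and the helper names only mirror the roles of addCacheFile, the launcher-side download, and getLocalCacheFiles. This is not Hadoop's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the flow; keys and names are not Hadoop's.
class CacheFlowSketch {
    static final String CACHE_FILES = "sketch.cache.files";
    static final String LOCAL_FILES = "sketch.cache.localFiles";

    // Step 1 (client side): record the source file in the job configuration.
    static void addCacheFile(Path source, Map<String, String> conf) {
        conf.merge(CACHE_FILES, source.toString(), (a, b) -> a + "," + b);
    }

    // Step 2 (task launcher, before the child starts): copy each file locally
    // and record the local paths in the configuration.
    static void localize(Map<String, String> conf, Path localDir) {
        String files = conf.get(CACHE_FILES);
        if (files == null) {
            return;
        }
        List<String> local = new ArrayList<>();
        try {
            for (String f : files.split(",")) {
                Path src = Paths.get(f);
                Path dst = localDir.resolve(src.getFileName());
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                local.add(dst.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        conf.put(LOCAL_FILES, String.join(",", local));
    }

    // Step 3 (inside the task): read back the local paths recorded in step 2.
    static List<Path> getLocalCacheFiles(Map<String, String> conf) {
        List<Path> out = new ArrayList<>();
        String files = conf.get(LOCAL_FILES);
        if (files != null) {
            for (String f : files.split(",")) {
                out.add(Paths.get(f));
            }
        }
        return out;
    }

    // Runs the three steps in order and returns what the task reads locally.
    static List<String> demo() {
        try {
            Path fakeDfs = Files.createTempDirectory("fake-dfs");
            Path nodeLocal = Files.createTempDirectory("node-local");
            Path lookup = fakeDfs.resolve("lookup.txt");
            Files.write(lookup, "k=v".getBytes(StandardCharsets.UTF_8));

            Map<String, String> conf = new HashMap<>();
            addCacheFile(lookup, conf);
            localize(conf, nodeLocal);
            List<String> contents = new ArrayList<>();
            for (Path p : getLocalCacheFiles(conf)) {
                contents.add(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
            }
            return contents;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The task never touches the original location. It only sees the local paths that the launcher recorded before the task started.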




-- 
Best Regards

Jeff Zhang