Posted to common-user@hadoop.apache.org by Stephen Watt <sw...@us.ibm.com> on 2010/03/17 18:12:03 UTC

Austin Hadoop Users Group - Tomorrow Evening (Thursday)

Hi Folks

The Austin HUG is meeting tomorrow night. I hope to see you there. We have 
speakers from Rackspace (Stu Hood on Cassandra) and IBM (Gino Bustelo on 
BigSheets).

Detailed Information is available at http://austinhug.blogspot.com/

Kind regards
Steve Watt

Re: when to send distributed cache file

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi Gang,
Yes, the time to distribute files is counted as part of the job's running time (more specifically, the setup time). The time is essentially what the TaskTracker needs to copy the files specified in the distributed cache to its local FS, generally from HDFS unless you have a separate FS for the JobTracker. So in general you might see small time gains when the files to be distributed have a relatively high replication factor.
Wrt blocks, AFAIK, even on HDFS, if the file size < block size, the actual space consumed is the file size itself. The overhead is in the metadata stored for that (small) file's block. So when the file is on local disk, it will still consume only its actual size, not the block size.
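That second point is easy to check with a quick local experiment: a file well below the HDFS block size occupies only its own length on an ordinary local filesystem (the file name below is illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SmallFileSize {
    public static void main(String[] args) throws IOException {
        // Write a 1 MB file, far below the default 64 MB HDFS block size.
        byte[] oneMb = new byte[1024 * 1024];
        Path f = Files.createTempFile("cachefile", ".bin");
        Files.write(f, oneMb);

        // The local filesystem charges only the actual length, not a block.
        System.out.println(Files.size(f)); // 1048576
        Files.delete(f);
    }
}
```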

Thanks,
Amogh


On 3/18/10 2:28 AM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Thanks Ravi.

Here are some observations. I ran job1 to generate some data used by the following job2, without replication. The total size of the job1 output is 25 MB, spread across 50 files. I use the distributed cache to send all the files to the nodes running job2's tasks. When job2 starts, it stays at "map 0% reduce 0%" for 10 minutes. When the job1 output is in 10 files (using 10 reducers in job1), the time consumed here is 2 minutes.

So I think the time to distribute cache files is actually counted as part of the total time of the MR job. And in order to send a cache file from HDFS to local disk, it sends at least one block (64 MB by default) even if the file is only 1 MB. Is that right? If so, how much space does that cache file take on the local disk, 64 MB or 1 MB?

-Gang




Hello Gang,
      The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
I am not sure whether the time required to distribute the cache is counted in the MapReduce job time, but it is included in the job submission process in JobClient.
--
Ravi

On 3/17/10 11:32 AM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Hi all,
I am wondering when Hadoop distributes the cache files. Is it the moment we call DistributedCache.addCacheFile()? Will the time to distribute the cache be counted as part of the MapReduce job time?

Thanks,
-Gang





Re: when to send distributed cache file

Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks Ravi.

Here are some observations. I ran job1 to generate some data used by the following job2, without replication. The total size of the job1 output is 25 MB, spread across 50 files. I use the distributed cache to send all the files to the nodes running job2's tasks. When job2 starts, it stays at "map 0% reduce 0%" for 10 minutes. When the job1 output is in 10 files (using 10 reducers in job1), the time consumed here is 2 minutes.

So I think the time to distribute cache files is actually counted as part of the total time of the MR job. And in order to send a cache file from HDFS to local disk, it sends at least one block (64 MB by default) even if the file is only 1 MB. Is that right? If so, how much space does that cache file take on the local disk, 64 MB or 1 MB?

-Gang



----- Original Message ----
From: Ravi Phulari <rp...@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>; Gang Luo <lg...@yahoo.com.cn>
Sent: 2010/3/17 (Wed) 3:52:24 PM
Subject: Re: when to send distributed cache file

Hello Gang,
      The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
I am not sure whether the time required to distribute the cache is counted in the MapReduce job time, but it is included in the job submission process in JobClient.
--
Ravi

On 3/17/10 11:32 AM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Hi all,
I am wondering when Hadoop distributes the cache files. Is it the moment we call DistributedCache.addCacheFile()? Will the time to distribute the cache be counted as part of the MapReduce job time?

Thanks,
-Gang


      

Re: when to send distributed cache file

Posted by Ravi Phulari <rp...@yahoo-inc.com>.
Hello Gang,
      The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
I am not sure whether the time required to distribute the cache is counted in the MapReduce job time, but it is included in the job submission process in JobClient.
--
Ravi

On 3/17/10 11:32 AM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Hi all,
I am wondering when Hadoop distributes the cache files. Is it the moment we call DistributedCache.addCacheFile()? Will the time to distribute the cache be counted as part of the MapReduce job time?

Thanks,
-Gang







when to send distributed cache file

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
I am wondering when Hadoop distributes the cache files. Is it the moment we call DistributedCache.addCacheFile()? Will the time to distribute the cache be counted as part of the MapReduce job time?

Thanks,
-Gang
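For reference, the lifecycle in question can be sketched with the classic (0.20-era) API: addCacheFile() on the client only records the URI in the job configuration, and the actual copy to each node's local disk happens during task setup, before any task of the job runs there. Class names and paths below are illustrative, not from this thread:

```java
// Sketch of the classic (Hadoop 0.20-era) DistributedCache pattern.
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CacheExample {
    public static class MyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void configure(JobConf conf) {
            try {
                // By the time configure() runs, the TaskTracker has already
                // localized the cached files onto its local filesystem.
                Path[] local = DistributedCache.getLocalCacheFiles(conf);
                // ... open local[0] and load the side data ...
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter) {
            // ... use the side data loaded in configure() ...
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheExample.class);
        // Registers the HDFS file at submission time; no data moves yet.
        DistributedCache.addCacheFile(new URI("/user/demo/lookup.txt"), conf);
        conf.setMapperClass(MyMapper.class);
        JobClient.runJob(conf);
    }
}
```

This requires a running Hadoop cluster (or local job runner) and the Hadoop jars on the classpath, so it is a sketch rather than a standalone program.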



      

Re: Austin Hadoop Users Group - Tomorrow Evening (Thursday)

Posted by Alexandre Jaquet <al...@gmail.com>.
Hi,

Please let me know if you will publish any documents, presentations,
videos, or other materials.

Thanks in advance

Alexandre Jaquet

2010/3/17 Stephen Watt <sw...@us.ibm.com>

> Hi Folks
>
> The Austin HUG is meeting tomorrow night. I hope to see you there. We have
> speakers from Rackspace (Stu Hood on Cassandra) and IBM (Gino Bustelo on
> BigSheets).
>
> Detailed Information is available at http://austinhug.blogspot.com/
>
> Kind regards
> Steve Watt