Posted to common-user@hadoop.apache.org by John Meza <j_...@hotmail.com> on 2013/04/09 06:58:45 UTC

Distributed cache: how big is too big?

I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing.
To make the Hadoop solution work I need to deploy the data directory to each DN when the job is executed. I know this isn't new and is commonly done with a Distributed Cache.
Based on experience, what are the common file sizes deployed in a Distributed Cache? I know smaller is better, but how big is too big? I have read that the larger the cache deployed, the longer the startup latency. I also assume there are other factors that play into this.
What I know so far:
- Default local.cache.size = 10 GB
- Range of desirable sizes for a Distributed Cache = 10 KB to 1 GB??
- A Distributed Cache is normally not used if larger than = ____?
Another option: put the data directories on each DN and provide the location to the TaskTracker?
thanks
John
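
A side note on that default: in Hadoop 1.x, local.cache.size is the per-TaskTracker cap (in bytes) on the local distributed-cache directory, and it acts as a cleanup threshold rather than a per-job limit. A minimal sketch, assuming that property name and its 10 GB default, of reading or raising it through a job's Configuration:

import org.apache.hadoop.conf.Configuration;

public class CacheSizeCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Default is 10737418240 bytes (10 GB); the value given here is only a fallback.
    long capBytes = conf.getLong("local.cache.size", 10L * 1024 * 1024 * 1024);
    System.out.println("local.cache.size = " + capBytes + " bytes");
    // Raising the cap (e.g. to 20 GB) only changes when old cache entries are purged,
    // not how much data a single job ships to each node.
    conf.setLong("local.cache.size", 20L * 1024 * 1024 * 1024);
  }
}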

RE: Distributed cache: how big is too big?

Posted by John Meza <j_...@hotmail.com>.
The Distributed Cache uses the shared file system (whichever is specified).
The Distributed Cache can be loaded via the GenericOptionsParser / ToolRunner parameters. Those parameters (-files, -archives, -libjars) are given on the command line and are available in a MR driver class that implements the Tool interface.
Those parameters, as well as the methods in the Distributed Cache API, load the files into the shared filesystem used by the JT. From there the framework manages the distribution to the DNs.
A couple of notable characteristics:
1. The Distributed Cache manages the deployment of the files into the cache directory, where they can be used by all the jobs that need them. The TT maintains a reference count to help ensure the file(s) aren't deleted prematurely.
2. Archives are unarchived, with directory structures kept intact if needed. This is an important requirement for my application; the directory structure is recreated during the unarchive.
Most of this info is directly from the Hadoop Definitive Guide and various other sources on the net.
I also look forward to comments and corrections from those with more experience.
John
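
A minimal sketch of the flow described above, assuming the Hadoop 1.x mapreduce API; the class name, archive path, and the #refdata link name are made up for illustration. Because the driver runs through ToolRunner, GenericOptionsParser consumes -archives before run() sees the remaining arguments, and each task finds the unpacked archive under the link name in its working directory:

// Submit with something like (paths hypothetical):
//   hadoop jar myjob.jar MyDriver -archives hdfs:///cache/refdata.tgz#refdata /in /out
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  public static class RefDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    private File refDataDir;

    @Override
    protected void setup(Context context) throws IOException {
      // The unpacked archive is exposed under the link name in the task's
      // working directory, with its internal directory structure preserved.
      refDataDir = new File("refdata");
      if (!refDataDir.isDirectory()) {
        throw new IOException("Reference data was not localized: " + refDataDir);
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... look up whatever is needed under refDataDir ...
      context.write(new Text("seen"), value);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    // args now holds only the leftover arguments (/in and /out above);
    // -files/-archives/-libjars were already handled by GenericOptionsParser.
    Job job = new Job(getConf(), "refdata-lookup");
    job.setJarByClass(MyDriver.class);
    job.setMapperClass(RefDataMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}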


Re: Distributed cache: how big is too big?

Posted by Bjorn Jonsson <bj...@gmail.com>.
I think the correct question is: why would you use the distributed cache for a large file that is read during map/reduce instead of plain hdfs? It does not sound wise to shuffle GBs of data onto all nodes on each job submission and then just remove it when the job is done. I would think about picking another "data strategy" and just use hdfs for the file. It's no problem to make sure the file is available on every node.

Anyway... maybe someone with more knowledge on this will chip in :)
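
A minimal sketch of that approach, assuming the Hadoop 1.x mapreduce API and a made-up /refdata/lookup.tsv path: the mapper opens the shared file directly from hdfs in setup() instead of localizing it through the distributed cache. The trade-off is that each task's read goes through HDFS (local only when a replica happens to sit on that node), which is where the replication-factor suggestion elsewhere in this thread comes in.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HdfsSideFileMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Hypothetical location of the shared reference file on hdfs.
    Path side = new Path("/refdata/lookup.tsv");
    FileSystem fs = FileSystem.get(context.getConfiguration());
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(side)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        // Assumes a simple key<TAB>value layout; adjust to the real data.
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          lookup.put(parts[0], parts[1]);
        }
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String hit = lookup.get(value.toString());
    if (hit != null) {
      context.write(value, new Text(hit));
    }
  }
}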


Re: Distributed cache: how big is too big?

Posted by Jay Vyas <ja...@gmail.com>.
Hmmm... maybe I'm missing something, but (@bjorn) why would you use hdfs as a replacement for the distributed cache?

After all, the distributed cache is just a file with replication over the whole cluster, which isn't in hdfs. Can't you just make the cache size big and store the file there?

What advantage is there to hdfs distribution of the file over all nodes?

RE: Distributed cache: how big is too big?

Posted by John Meza <j_...@hotmail.com>.
"a replication factor equal to the number of DN"
Hmmm... I'm not sure I understand: there are 8 DNs in my test cluster.

Re: Distributed cache: how big is too big?

Posted by Bjorn Jonsson <bj...@gmail.com>.
Put it once on hdfs with a replication factor equal to the number of DN. No startup latency on job submission, no max size, and you can access it from anywhere with fs since it sticks around until you replace it? Just a thought.
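
A minimal sketch of that idea for the 8-DN cluster mentioned earlier, assuming a made-up path: raising the file's replication factor to 8 asks the NameNode to keep a copy of every block on each of the 8 DataNodes, so reads can usually be served from a local replica. The shell equivalent would be hadoop fs -setrep 8 /refdata/lookup.tsv.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetRefDataReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Raise the replication of an already-written file to 8 (one copy per DN
    // in an 8-node cluster); the NameNode re-replicates in the background.
    fs.setReplication(new Path("/refdata/lookup.tsv"), (short) 8);
  }
}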
