Posted to common-user@hadoop.apache.org by Chris Dyer <re...@umd.edu> on 2008/03/12 23:28:09 UTC

scaling experiments on a static cluster?

Hi Hadoop mavens-
I'm hoping someone out there will have a quick solution for me.  I'm
trying to run some very basic scaling experiments for a rapidly
approaching paper deadline on a Hadoop 0.16.0 cluster that has ~20
nodes with 2 procs/node.  Ideally, I would run my code on clusters of
different sizes (1, 2, 4, 8, 16 nodes, or some such thing).  The
problem is that I am not able to reconfigure the cluster (in the long
run, i.e., before a final version of the paper, I assume this will be
possible, but for now it's not).  Setting the number of
mappers/reducers does not seem to be a viable option, at least not in
the trivial way, since the physical layout of the input files makes
Hadoop run a different number of tasks than I request (most of my
jobs consist of multiple MR steps; the initial one always runs on a
relatively small data set that fits into a single block, so the
Hadoop framework does honor my task number request on the first job--
but during the later ones it does not).
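
A rough sketch of what I mean by requesting task counts (the jar name,
class name, and paths below are placeholders, and I'm assuming the job
driver goes through ToolRunner so that the -D properties get picked
up).  As I understand it, mapred.map.tasks is only treated as a hint,
since the number of map tasks ultimately follows the input splits,
while mapred.reduce.tasks is honored exactly:

  # second MR step: ask for 16 maps and 16 reduces (paths are made up)
  hadoop jar myjob.jar SecondStep \
      -D mapred.map.tasks=16 \
      -D mapred.reduce.tasks=16 \
      /user/chris/step1-out /user/chris/step2-out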

My questions:
1) can I get around this limitation programmatically?  I.e., is there
a way to tell the framework to only use a subset of the nodes for DFS
/ mapping / reducing?
2) if not, what statistics would be good to report if I can only have
two data points -- a legacy "single-core" implementation of the
algorithms and a MapReduce version running on the full cluster?

Thanks for any suggestions!
Chris

Re: scaling experiments on a static cluster?

Posted by Ted Dunning <td...@veoh.com>.
Yes.

Increase the replication.  Wait.  Drop the replication.
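
Concretely, that might look something like this (the replication
factors and path are made up; the -w flag, if your hadoop version
supports it, makes the command wait until the extra copies have
actually been written):

  # temporarily over-replicate so blocks get copied onto the new nodes
  hadoop dfs -setrep -w 10 /user/chris/data
  # then drop back to the normal replication factor
  hadoop dfs -setrep 3 /user/chris/data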


On 3/12/08 3:44 PM, "Chris Dyer" <re...@umd.edu> wrote:

> Thanks-- that should work.  I'll follow up with the cluster
> administrators to see if I can get this to happen.  To rebalance the
> file storage, can I just set the replication factor using "hadoop dfs"?
> Chris
> 
> On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning <td...@veoh.com> wrote:
>> 
>>  What about just taking down half of the nodes and then loading your data
>>  into the remainder?  Should take about 20 minutes each time you remove nodes
>>  but only a few seconds each time you add some.  Remember that you need to
>>  reload the data each time (or rebalance it if growing the cluster) to get
>>  realistic numbers.
>> 
>>  My suggested procedure would be to take all but 2 nodes down, and then
>> 
>>  - run test
>>  - double number of nodes
>>  - rebalance file storage
>>  - lather, rinse, repeat
>> 


Re: scaling experiments on a static cluster?

Posted by Chris Dyer <re...@umd.edu>.
Thanks-- that should work.  I'll follow up with the cluster
administrators to see if I can get this to happen.  To rebalance the
file storage, can I just set the replication factor using "hadoop dfs"?
Chris

On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning <td...@veoh.com> wrote:
>
>  What about just taking down half of the nodes and then loading your data
>  into the remainder?  Should take about 20 minutes each time you remove nodes
>  but only a few seconds each time you add some.  Remember that you need to
>  reload the data each time (or rebalance it if growing the cluster) to get
>  realistic numbers.
>
>  My suggested procedure would be to take all but 2 nodes down, and then
>
>  - run test
>  - double number of nodes
>  - rebalance file storage
>  - lather, rinse, repeat
>

Re: scaling experiments on a static cluster?

Posted by Ted Dunning <td...@veoh.com>.
What about just taking down half of the nodes and then loading your data
into the remainder?  Should take about 20 minutes each time you remove nodes
but only a few seconds each time you add some.  Remember that you need to
reload the data each time (or rebalance it if growing the cluster) to get
realistic numbers.

My suggested procedure would be to take all but 2 nodes down, and then

- run test
- double number of nodes
- rebalance file storage
- lather, rinse, repeat
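
A rough shell sketch of one shrink-and-reload round, with hypothetical
host names and paths, assuming you can ssh to each node and run the
stock hadoop-daemon.sh script there:

  # take a node out of the test cluster (repeat for each node removed)
  ssh node09 /path/to/hadoop/bin/hadoop-daemon.sh stop tasktracker
  ssh node09 /path/to/hadoop/bin/hadoop-daemon.sh stop datanode

  # reload the input so its blocks live only on the remaining nodes
  hadoop dfs -rmr /user/chris/input
  hadoop dfs -put local-input/ /user/chris/input

When growing the cluster for the next round, start the daemons back up
on the next batch of nodes and rebalance (e.g. by temporarily raising
and then lowering the replication factor) instead of reloading.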

