Posted to user@mahout.apache.org by David Stuart <da...@progressivealliance.co.uk> on 2011/03/13 11:41:03 UTC

Scaling question

Hey,

I have done my initial tests locally and now want to build a cluster. My question is: I currently have three big machines (32 GB RAM and 2 x 6 cores); would it be more effective/faster to keep the machines as-is, or to divide them into virtual machines and have, say, 6 virtual machines per server?

Regards

David Stuart

Re: Scaling question

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Mar 13, 2011 at 6:49 PM, Lance Norskog <go...@gmail.com> wrote:

> Should they be striped (RAID 0) without duplication? This was the
> wisdom I've received.
>

Definitely not.

Just define a separate file system on each.  Hadoop handles that correctly.
 It helps to have enough mapper or reducer slots on each machine to make
sure that you have a thread per spindle.

Striping can cause catastrophic loss of reliability in larger clusters.
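A sketch of what "a separate file system on each" disk looked like in the Hadoop configuration of that era (hdfs-site.xml, 0.20/1.x property names); the mount points are hypothetical examples, not anything from the thread:

```xml
<!-- hdfs-site.xml: give the DataNode one directory per physical disk,
     comma-separated, instead of striping the disks together.
     The /mnt/diskN mount points are illustrative. -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data</value>
</property>
```

HDFS then rotates block writes across the listed directories, so each spindle gets its own independent I/O stream and a single failed disk costs only the blocks on it.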


> Before starting each job, run this: "cat inputdirectory/* >
> /dev/null". This loads the disk cache with the input files, which
> helps a surprising amount.
>

Depends on your application.  If you count the time required for the cat, it
probably doesn't help.  Also, if your data is larger than memory, it won't
help at all and may hurt because certain smaller resources will have been
totally evicted by this exercise.
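A self-contained sketch of the trick under discussion, with the trade-off spelled out in the comments (the input directory and file are created on the spot just so the example runs; in practice you would point at the job's real input):

```shell
# Create a throwaway input directory so the sketch is self-contained.
INPUT_DIR=$(mktemp -d)
dd if=/dev/zero of="$INPUT_DIR/part-00000" bs=1024 count=1024 2>/dev/null

# The tip: read everything once so the OS page cache holds the blocks.
# Note this warm-up read itself costs a full pass over the data, which
# is why it only pays off if the job re-reads the same data afterwards
# and the data fits in memory; otherwise it can evict more useful
# cached pages, as noted above.
cat "$INPUT_DIR"/* > /dev/null

warmed=$(ls "$INPUT_DIR" | wc -l | tr -d ' ')
echo "warmed $warmed file(s)"
rm -rf "$INPUT_DIR"
```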


> On Sun, Mar 13, 2011 at 4:45 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Hadoop doesn't make good use of SSDs.
> >
> > Just adding more spindles will do more than anything else.
> >
> > On Sun, Mar 13, 2011 at 5:38 AM, Dave Stuart <
> > david.stuart@progressivealliance.co.uk> wrote:
> >
> >> Thanks Sean,
> >>
> >> At the moment the disks aren't super fast (7200 rpm) so I was planning
> on
> >> adding some SSD into the mix.
>

Re: Scaling question

Posted by Lance Norskog <go...@gmail.com>.
Should they be striped (RAID 0) without duplication? This was the
wisdom I've received.

Before starting each job, run this: "cat inputdirectory/* >
/dev/null". This loads the disk cache with the input files, which
helps a surprising amount.

On Sun, Mar 13, 2011 at 4:45 PM, Ted Dunning <te...@gmail.com> wrote:
> Hadoop doesn't make good use of SSDs.
>
> Just adding more spindles will do more than anything else.
>
> On Sun, Mar 13, 2011 at 5:38 AM, Dave Stuart <
> david.stuart@progressivealliance.co.uk> wrote:
>
>> Thanks Sean,
>>
>> At the moment the disks aren't super fast (7200 rpm) so I was planning on
>> adding some SSD into the mix.
>> Thanks for the tips
>>
>> Regards,
>>
>> Dave
>>
>>
>>
>> On 13 Mar 2011, at 11:20, Sean Owen wrote:
>>
>> > There's no real point in making virtual machines in order to do more
>> > work per machine -- just make Hadoop run more workers per machine. A
>> > good first approximation is indeed to run one worker per core.
>> >
>> > I think you'll find a lot of Mahout-related jobs are I/O-bound, not
>> > CPU-bound. So you may reach a bottleneck with fewer workers than that.
>> > And then you may find you get more bang for your buck not with more
>> > RAM or cores but more and faster disks, and getting Hadoop to use
>> > them.
>> >
>> > On Sun, Mar 13, 2011 at 10:41 AM, David Stuart
>> > <da...@progressivealliance.co.uk> wrote:
>> >> Hey,
>> >>
>> >> I have done my initial tests locally and now want to build a
>> cluster. My question is: I currently have three big machines (32 GB RAM and 2
>> x 6 cores); would it be more effective/faster to keep the machines as-is, or to
>> divide them into virtual machines and have, say, 6 virtual machines per server?
>> >>
>> >> Regards
>> >>
>> >> David Stuart
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Scaling question

Posted by Ted Dunning <te...@gmail.com>.
Hadoop doesn't make good use of SSDs.

Just adding more spindles will do more than anything else.

On Sun, Mar 13, 2011 at 5:38 AM, Dave Stuart <
david.stuart@progressivealliance.co.uk> wrote:

> Thanks Sean,
>
> At the moment the disks aren't super fast (7200 rpm) so I was planning on
> adding some SSD into the mix.
> Thanks for the tips
>
> Regards,
>
> Dave
>
>
>
> On 13 Mar 2011, at 11:20, Sean Owen wrote:
>
> > There's no real point in making virtual machines in order to do more
> > work per machine -- just make Hadoop run more workers per machine. A
> > good first approximation is indeed to run one worker per core.
> >
> > I think you'll find a lot of Mahout-related jobs are I/O-bound, not
> > CPU-bound. So you may reach a bottleneck with fewer workers than that.
> > And then you may find you get more bang for your buck not with more
> > RAM or cores but more and faster disks, and getting Hadoop to use
> > them.
> >
> > On Sun, Mar 13, 2011 at 10:41 AM, David Stuart
> > <da...@progressivealliance.co.uk> wrote:
> >> Hey,
> >>
> >> I have done my initial tests locally and now want to build a
> cluster. My question is: I currently have three big machines (32 GB RAM and 2
> x 6 cores); would it be more effective/faster to keep the machines as-is, or to
> divide them into virtual machines and have, say, 6 virtual machines per server?
> >>
> >> Regards
> >>
> >> David Stuart
>
>

Re: Scaling question

Posted by Dave Stuart <da...@progressivealliance.co.uk>.
Thanks Sean,

At the moment the disks aren't super fast (7200 rpm) so I was planning on adding some SSD into the mix. 
Thanks for the tips

Regards,

Dave



On 13 Mar 2011, at 11:20, Sean Owen wrote:

> There's no real point in making virtual machines in order to do more
> work per machine -- just make Hadoop run more workers per machine. A
> good first approximation is indeed to run one worker per core.
> 
> I think you'll find a lot of Mahout-related jobs are I/O-bound, not
> CPU-bound. So you may reach a bottleneck with fewer workers than that.
> And then you may find you get more bang for your buck not with more
> RAM or cores but more and faster disks, and getting Hadoop to use
> them.
> 
> On Sun, Mar 13, 2011 at 10:41 AM, David Stuart
> <da...@progressivealliance.co.uk> wrote:
>> Hey,
>> 
>> I have done my initial tests locally and now want to build a cluster. My question is: I currently have three big machines (32 GB RAM and 2 x 6 cores); would it be more effective/faster to keep the machines as-is, or to divide them into virtual machines and have, say, 6 virtual machines per server?
>> 
>> Regards
>> 
>> David Stuart


Re: Scaling question

Posted by Sean Owen <sr...@gmail.com>.
There's no real point in making virtual machines in order to do more
work per machine -- just make Hadoop run more workers per machine. A
good first approximation is indeed to run one worker per core.

I think you'll find a lot of Mahout-related jobs are I/O-bound, not
CPU-bound. So you may reach a bottleneck with fewer workers than that.
And then you may find you get more bang for your buck not with more
RAM or cores but more and faster disks, and getting Hadoop to use
them.
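For the Hadoop versions current at the time, "more workers per machine" meant task slots configured in mapred-site.xml on each TaskTracker. The values below are only an illustrative starting point for a 2 x 6-core box, not anything recommended in the thread:

```xml
<!-- mapred-site.xml: task slots per TaskTracker. Roughly one slot per
     core as a first approximation, to be tuned downward if the jobs
     turn out to be I/O-bound rather than CPU-bound. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>10</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```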

On Sun, Mar 13, 2011 at 10:41 AM, David Stuart
<da...@progressivealliance.co.uk> wrote:
> Hey,
>
> I have done my initial tests locally and now want to build a cluster. My question is: I currently have three big machines (32 GB RAM and 2 x 6 cores); would it be more effective/faster to keep the machines as-is, or to divide them into virtual machines and have, say, 6 virtual machines per server?
>
> Regards
>
> David Stuart