Posted to common-user@hadoop.apache.org by buddha1021 <bu...@yahoo.cn> on 2009/02/19 02:08:55 UTC

the question about the common pc?

hi:
The documentation says that Hadoop distributes the data and processing
across clusters of commonly available computers. But what does "commonly
available computers" mean? 1U servers? Or the PCs that people use daily on
Windows?


Re: Re: Re: the question about the common pc?

Posted by buddha1021 <bu...@yahoo.cn>.
When I said "people daily used on windows", I wanted to specify the common
hardware (not the OS); I didn't mean Hadoop running on Windows! I mean Hadoop
running on a common PC's hardware, certainly with Linux as the OS!

Tim Wintle wrote:
> 
> On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote:
>> Hadoop is designed for high-performance computing equipment, but is
>> "claimed" to be fit for "daily PCs".
> 
> The phrase "High Performance Computing equipment" makes me think of
> InfiniBand, fibre all over the place, etc.
> 
> 
> Hadoop doesn't need that; it runs well on standard PC hardware - i.e. no
> special hardware you couldn't find in a standard PC. That doesn't mean
> you should run it on PCs that are being used for other things, though.
> 
> I found that Hadoop ran OK on fairly old hardware - a load of old
> PowerPC Macs (running Linux) churned through some jobs quickly, and
> I've actually run it on people's office machines during the night (not
> on Windows). I did end up having to add an extra switch for the part
> of the network that was only 100 Mbps to get the throughput, though.
> 
> Of course, ideally you would be running it on a rack of 1U servers, but
> that's still normally standard PC hardware.



Re: Re: Re: the question about the common pc?

Posted by Tim Wintle <ti...@teamrubber.com>.
On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote:
> Hadoop is designed for high-performance computing equipment, but is "claimed" to be fit for "daily PCs".

The phrase "High Performance Computing equipment" makes me think of
InfiniBand, fibre all over the place, etc.


Hadoop doesn't need that; it runs well on standard PC hardware - i.e. no
special hardware you couldn't find in a standard PC. That doesn't mean
you should run it on PCs that are being used for other things, though.

I found that Hadoop ran OK on fairly old hardware - a load of old
PowerPC Macs (running Linux) churned through some jobs quickly, and
I've actually run it on people's office machines during the night (not
on Windows). I did end up having to add an extra switch for the part
of the network that was only 100 Mbps to get the throughput, though.

Of course, ideally you would be running it on a rack of 1U servers, but
that's still normally standard PC hardware.





Re: the question about the common pc?

Posted by Tim Wintle <ti...@teamrubber.com>.
On Mon, 2009-02-23 at 11:14 +0000, Steve Loughran wrote:
> Dumbo provides Python support under Hadoop:
>   http://wiki.github.com/klbostee/dumbo
>   https://issues.apache.org/jira/browse/HADOOP-4304

Ooh, nice - I hadn't seen Dumbo. That's far cleaner than the Python
wrapper around streaming I'd hacked together.

I'm probably going to be using Hadoop again in the near future, so
I'll bookmark that - thanks, Steve.

Personally, I only need text-based records, so I'm fine using a wrapper
around streaming.
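
The wrapper is really just building the streaming command line and
shelling out. A stripped-down sketch - the streaming jar path and the
script names below are made up for illustration, so adjust them for your
installation:

#!/usr/bin/env python
# Hypothetical sketch of a wrapper around Hadoop streaming: build the
# command line and shell out. The jar location is an assumption -
# adjust it for your installation.
import subprocess

STREAMING_JAR = "/opt/hadoop/contrib/streaming/hadoop-streaming.jar"

def run_streaming_job(input_path, output_path, mapper, reducer):
    # -input/-output/-mapper/-reducer are standard streaming options;
    # -file ships the local scripts out to the worker nodes.
    cmd = ["hadoop", "jar", STREAMING_JAR,
           "-input", input_path,
           "-output", output_path,
           "-mapper", mapper,
           "-reducer", reducer,
           "-file", mapper,
           "-file", reducer]
    subprocess.check_call(cmd)

if __name__ == "__main__":
    run_streaming_job("in/logs", "out/wordcounts", "mapper.py", "reducer.py")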

Tim Wintle


Re: the question about the common pc?

Posted by Steve Loughran <st...@apache.org>.
Tim Wintle wrote:
> On Fri, 2009-02-20 at 13:07 +0000, Steve Loughran wrote:
>> I've been doing MapReduce work over small in-memory datasets
>> using Erlang, which works very well in such a context.
> 
> I've got some (mainly Python) scripts (that will probably be run with
> Hadoop streaming eventually) that I run over multiple CPUs/cores on a
> single machine by opening the appropriate number of named pipes and
> using tee and awk to split the workload.
> 
> Something like:
> 
>> mkfifo mypipe1
>> mkfifo mypipe2
>> awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1&
>> awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2&
>> ./get_lots_of_data | tee mypipe1 > mypipe2
> 
> (wait until it's done... or send a signal from the "get_lots_of_data"
> process on completion if it's a cronjob)
> 
>> sort -m map_out* | ./reducer > reduce_out
> 
> This works around the global interpreter lock in Python quite nicely and
> doesn't need the people who write the scripts (who may not be programmers)
> to understand multiple processes etc., just stdin and stdout.
> 

Dumbo provides Python support under Hadoop:
  http://wiki.github.com/klbostee/dumbo
  https://issues.apache.org/jira/browse/HADOOP-4304
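
For reference, the canonical Dumbo word count is tiny - roughly this,
going from memory of the Dumbo tutorial, so treat it as a sketch:

# Rough sketch of a Dumbo word count (from memory of the Dumbo
# tutorial): mapper and reducer are plain generators, and dumbo.run
# wires them into a streaming job.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)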

As well as that, given Hadoop is Java 1.6+, there's no reason why it
couldn't support the javax.script engine, with JavaScript working
without extra JAR files, and Groovy and Jython once their JARs were put
on the classpath. Some work would probably be needed to make it easier
to use these languages, and then there are the tests...

Re: the question about the common pc?

Posted by Tim Wintle <ti...@teamrubber.com>.
On Fri, 2009-02-20 at 13:07 +0000, Steve Loughran wrote:
> I've been doing MapReduce work over small in-memory datasets
> using Erlang, which works very well in such a context.

I've got some (mainly Python) scripts (that will probably be run with
Hadoop streaming eventually) that I run over multiple CPUs/cores on a
single machine by opening the appropriate number of named pipes and
using tee and awk to split the workload.

Something like:

> mkfifo mypipe1
> mkfifo mypipe2
> awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1&
> awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2&
> ./get_lots_of_data | tee mypipe1 > mypipe2

(wait until it's done... or send a signal from the "get_lots_of_data"
process on completion if it's a cronjob)

> sort -m map_out* | ./reducer > reduce_out

This works around the global interpreter lock in Python quite nicely and
doesn't need the people who write the scripts (who may not be programmers)
to understand multiple processes etc., just stdin and stdout.
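
For concreteness, ./mapper above can be any line-oriented filter that
reads stdin and writes stdout; a hypothetical word-count mapper is just:

#!/usr/bin/env python
# Hypothetical stdin/stdout mapper for the pipeline above: read
# lines, emit tab-separated key/value pairs, one per word.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t%d\n" % (word, 1))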

Tim Wintle


Re: the question about the common pc?

Posted by Steve Loughran <st...@apache.org>.
柳松 wrote:
> Actually, there's a widespread misunderstanding of this "common PC". Common PC doesn't mean the PCs that are used daily; it means that the performance of each node can be measured against a common PC's computing power.
> 
> As a matter of fact, we don't use Gb Ethernet for daily PCs' communication, we don't use Linux for our document processing, and most importantly, Hadoop cannot run effectively on those "daily PCs".
> 
> Hadoop is designed for high-performance computing equipment, but is "claimed" to be fit for "daily PCs".
> 
> Hadoop for PCs? What a joke.

Hadoop is designed to build a high-throughput data-processing
infrastructure from commodity PC parts: SATA, not RAID or SAN; x86 + Linux,
not supercomputer hardware and OS. You can bring it up on lighter-weight
systems, but it has a minimum overhead that is quite steep for small
datasets. I've been doing MapReduce work over small in-memory datasets
using Erlang, which works very well in such a context.

- you need a good network, with DNS working (fast), and a good backbone and
switches
- the faster your disks, the better your throughput
- ECC memory makes a lot of sense
- you need a good cluster management setup, unless you like SSH-ing to 20
boxes to find out which one is playing up

Re: the question about the common pc?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Feb 18, 2009, at 11:43 PM, 柳松 wrote:

> Actually, there's a widespread misunderstanding of this "common PC".
> Common PC doesn't mean the PCs that are used daily; it means that the
> performance of each node can be measured against a common PC's
> computing power.
>
> As a matter of fact, we don't use Gb Ethernet for daily PCs'
> communication,

I certainly do.

> we don't use Linux for our document processing,

I do.

> and most importantly, Hadoop cannot run effectively on those "daily
> PCs".
>

Maybe your PC is under-spec'd?

>
> Hadoop is designed for high-performance computing equipment, but is
> "claimed" to be fit for "daily PCs".
>

Our students run it on Pentium IIIs with 20GB HDDs.  Try finding a new
laptop with specs that low.

> Hadoop for PCs? What a joke.
>

The truth is that Hadoop scales to the gear you have.  If you throw it on
a bunch of Windows desktops, it'll perform like a bunch of Windows
desktops.  If you run it on the "student test cluster", it'll perform
like Java on PIIIs.  If you run it on a new high-performance
cluster ... well, you get the point.

If you want to run Hadoop for development work, I'd say you want to  
use your desktop.  If you want to run Hadoop for production work, I'd  
recommend a "production environment" - decently powered 1U Linux  
servers with large disks (or whatever the recommendation is on the  
wiki).

Brian

>
>> -----Original Message-----
>> From: buddha1021 <bu...@yahoo.cn>
>> Sent: Thursday, February 19, 2009
>> To: core-user@hadoop.apache.org
>> Cc:
>> Subject: Re: the question about the common pc?
>>
>>
>> And, are the nodes the PCs that people use daily on Windows, or 1U
>> servers?
>>
>> buddha1021 wrote:
>>>
>>> hi:
>>> The documentation says that Hadoop distributes the data and processing
>>> across clusters of commonly available computers. But what does
>>> "commonly available computers" mean? 1U servers? Or the PCs that
>>> people use daily on Windows?
>>>
>>


Re: Re: the question about the common pc?

Posted by 柳松 <la...@126.com>.
Actually, there's a widespread misunderstanding of this "common PC". Common PC doesn't mean the PCs that are used daily; it means that the performance of each node can be measured against a common PC's computing power.

As a matter of fact, we don't use Gb Ethernet for daily PCs' communication, we don't use Linux for our document processing, and most importantly, Hadoop cannot run effectively on those "daily PCs".

Hadoop is designed for high-performance computing equipment, but is "claimed" to be fit for "daily PCs".

Hadoop for PCs? What a joke.

 
> -----Original Message-----
> From: buddha1021 <bu...@yahoo.cn>
> Sent: Thursday, February 19, 2009
> To: core-user@hadoop.apache.org
> Cc: 
> Subject: Re: the question about the common pc?
> 
> 
> And, are the nodes the PCs that people use daily on Windows, or 1U servers?
> 
> buddha1021 wrote:
> > 
> > hi:
> > The documentation says that Hadoop distributes the data and processing
> > across clusters of commonly available computers. But what does "commonly
> > available computers" mean? 1U servers? Or the PCs that people use daily
> > on Windows?
> > 
> 
> 

Re: the question about the common pc?

Posted by buddha1021 <bu...@yahoo.cn>.
And, are the nodes the PCs that people use daily on Windows, or 1U servers?

buddha1021 wrote:
> 
> hi:
> The documentation says that Hadoop distributes the data and processing
> across clusters of commonly available computers. But what does "commonly
> available computers" mean? 1U servers? Or the PCs that people use daily
> on Windows?
> 
