Posted to common-user@hadoop.apache.org by buddha1021 <bu...@yahoo.cn> on 2009/02/19 02:08:55 UTC
the question about the common pc?
hi:
The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server, or the PCs that people use daily with Windows?
--
View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092022.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Re:Re: the question about the common pc?
Posted by buddha1021 <bu...@yahoo.cn>.
When I said "people daily used on windows", I wanted to specify the common
hardware (not the OS); I didn't mean that Hadoop runs on Windows! I mean
Hadoop running on a common PC's hardware, certainly with Linux as the OS!
Tim Wintle wrote:
>
> On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote:
>> Hadoop is designed for High Performance Computing equipment, but is
>> "claimed" to be fit for "daily PC"s.
>
> The phrase "High Performance Computing equipment" makes me think of
> InfiniBand, fibre all over the place, etc.
>
> Hadoop doesn't need that; it runs well on standard PC hardware, i.e. no
> special hardware you couldn't find in a standard PC. That doesn't mean
> you should run it on PCs that are being used for other things, though.
>
> I found that Hadoop ran OK on fairly old hardware: a load of old
> PowerPC Macs (running Linux) churned through some jobs quickly, and
> I've actually run it on people's office machines during the nights (not
> on Windows). I did end up having to add an extra switch for the part
> of the network that was only 100 Mbps to get the throughput, though.
>
> Of course, ideally you would be running it on a rack of 1U servers, but
> that's still normally standard PC hardware.
Re: Re:Re: the question about the common pc?
Posted by Tim Wintle <ti...@teamrubber.com>.
On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote:
> Hadoop is designed for High Performance Computing equipment, but is "claimed" to be fit for "daily PC"s.
The phrase "High Performance Computing equipment" makes me think of
InfiniBand, fibre all over the place, etc.
Hadoop doesn't need that; it runs well on standard PC hardware, i.e. no
special hardware you couldn't find in a standard PC. That doesn't mean
you should run it on PCs that are being used for other things, though.
I found that Hadoop ran OK on fairly old hardware: a load of old
PowerPC Macs (running Linux) churned through some jobs quickly, and
I've actually run it on people's office machines during the nights (not
on Windows). I did end up having to add an extra switch for the part
of the network that was only 100 Mbps to get the throughput, though.
Of course, ideally you would be running it on a rack of 1U servers, but
that's still normally standard PC hardware.
Re: the question about the common pc?
Posted by Tim Wintle <ti...@teamrubber.com>.
On Mon, 2009-02-23 at 11:14 +0000, Steve Loughran wrote:
> Dumbo provides Python support under Hadoop:
> http://wiki.github.com/klbostee/dumbo
> https://issues.apache.org/jira/browse/HADOOP-4304
Ooh, nice. I hadn't seen Dumbo; that's far cleaner than the Python
wrapper to streaming I'd hacked together.
I'm probably going to be using Hadoop more again in the near future, so
I'll bookmark that. Thanks, Steve.
Personally I only need text-based records, so I'm fine using a wrapper
around streaming.
Tim Wintle
Re: the question about the common pc?
Posted by Steve Loughran <st...@apache.org>.
Tim Wintle wrote:
> On Fri, 2009-02-20 at 13:07 +0000, Steve Loughran wrote:
>> I've been doing MapReduce work over small in-memory datasets
>> using Erlang, which works very well in such a context.
>
> I've got some (mainly Python) scripts (that will probably be run with
> Hadoop Streaming eventually) that I run over multiple CPUs/cores on a
> single machine by opening the appropriate number of named pipes and
> using tee and awk to split the workload.
>
> something like
>
>> mkfifo mypipe1
>> mkfifo mypipe2
>> awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1 &
>> awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2 &
>> ./get_lots_of_data | tee mypipe1 > mypipe2
>
> (wait until it's done... or send a signal from the "get_lots_of_data"
> process on completion if it's a cronjob)
>
>> sort -m map_out* | ./reducer > reduce_out
>
> This works around the global interpreter lock in Python quite nicely, and
> doesn't require the people who write the scripts (who may not be
> programmers) to understand multiple processes etc., just stdin and stdout.
>
Dumbo provides Python support under Hadoop:
http://wiki.github.com/klbostee/dumbo
https://issues.apache.org/jira/browse/HADOOP-4304
As well as that, given that Hadoop is Java 1.6+, there's no reason why it
couldn't support the javax.script engine, with JavaScript working
without extra JAR files, and Groovy and Jython once their JARs were put on
the classpath. Some work would probably be needed to make these
languages easier to use, and then there are the tests...
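For a sense of what Dumbo buys you: a job shrinks to two generator functions. A minimal word-count sketch in Dumbo's style (the function signatures and the run() call follow Dumbo's README of the time; treat the details as approximate):

```python
def mapper(key, value):
    # value is one line of input text; emit (word, 1) for each word
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values iterates over every count emitted for this key
    yield key, sum(values)

# With Dumbo installed, the pair is wired into a Hadoop (or local) job with:
#   import dumbo
#   dumbo.run(mapper, reducer)
```

No pipes, processes, or Writables in sight, which is exactly the appeal over a hand-rolled streaming wrapper.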
Re: the question about the common pc?
Posted by Tim Wintle <ti...@teamrubber.com>.
On Fri, 2009-02-20 at 13:07 +0000, Steve Loughran wrote:
> I've been doing MapReduce work over small in-memory datasets
> using Erlang, which works very well in such a context.
I've got some (mainly Python) scripts (that will probably be run with
Hadoop Streaming eventually) that I run over multiple CPUs/cores on a
single machine by opening the appropriate number of named pipes and
using tee and awk to split the workload. Something like:
> mkfifo mypipe1
> mkfifo mypipe2
> awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1 &
> awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2 &
> ./get_lots_of_data | tee mypipe1 > mypipe2
(wait until it's done... or send a signal from the "get_lots_of_data"
process on completion if it's a cronjob)
> sort -m map_out* | ./reducer > reduce_out
This works around the global interpreter lock in Python quite nicely, and
doesn't require the people who write the scripts (who may not be
programmers) to understand multiple processes etc., just stdin and stdout.
Tim Wintle
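The stdin/stdout contract Tim describes is all a script author needs to know: a mapper is just a filter. A minimal word-count-style sketch (the file name mapper.py is hypothetical; the tab-separated key/value lines follow Hadoop Streaming's default convention):

```python
# mapper.py: reads records from stdin, writes tab-separated key/value pairs
# to stdout. Hadoop Streaming, or the awk/tee pipeline above, does the plumbing.
import sys

def map_line(line):
    """Turn one input line into (key, value) pairs; here, (word, 1)."""
    return [(word, 1) for word in line.split()]

def main(stream=sys.stdin, out=sys.stdout):
    for line in stream:
        for key, value in map_line(line):
            out.write("%s\t%d\n" % (key, value))

if __name__ == "__main__":
    main()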
Re: the question about the common pc?
Posted by Steve Loughran <st...@apache.org>.
柳松 wrote:
> Actually, there's a widespread misunderstanding of this "common PC". Common PC doesn't mean the PCs that are used daily; it means that the performance of each node can be measured by a common PC's computing power.
>
> As a matter of fact, we don't use Gb Ethernet for daily PCs' communication, we don't use Linux for our document processing, and most importantly, Hadoop cannot run effectively on those "daily PC"s.
>
> Hadoop is designed for High Performance Computing equipment, but is "claimed" to be fit for "daily PC"s.
>
> Hadoop for PCs? What a joke.
Hadoop is designed to build a high-throughput data-processing
infrastructure from commodity PC parts: SATA rather than RAID or SAN,
x86 + Linux rather than supercomputer hardware and OS. You can bring it
up on lighter-weight systems, but it has a minimum overhead that is
quite steep for small datasets. I've been doing MapReduce work over
small in-memory datasets using Erlang, which works very well in such a
context.
- you need a good network, with DNS working (fast), good backbone and
switches
- the faster your disks, the better your throughput
- ECC memory makes a lot of sense
- you need a good cluster management setup unless you like SSH-ing to 20
boxes to find out which one is playing up
Re: the question about the common pc?
Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Feb 18, 2009, at 11:43 PM, 柳松 wrote:
> Actually, there's a widespread misunderstanding of this "common PC".
> Common PC doesn't mean the PCs that are used daily; it means that the
> performance of each node can be measured by a common PC's computing
> power.
>
> As a matter of fact, we don't use Gb Ethernet for daily PCs'
> communication,
I certainly do.
> we don't use Linux for our document processing,
I do.
> and most importantly, Hadoop cannot run effectively on those "daily
> PC"s.
>
Maybe your PC is under-spec'd?
>
> Hadoop is designed for High Performance Computing equipment, but is
> "claimed" to be fit for "daily PC"s.
>
Our students run it on Pentium IIIs with 20 GB HDDs. Try finding a new
laptop with specs that low.
> Hadoop for PCs? What a joke.
>
The truth is that Hadoop scales to the gear you have. If you throw it at
a bunch of Windows desktops, it'll perform like a bunch of Windows
desktops. If you run it on the "student test cluster", it'll perform
like Java on PIIIs. If you run it on a new high-performance
cluster... well, you get the point.
If you want to run Hadoop for development work, I'd say you want to
use your desktop. If you want to run Hadoop for production work, I'd
recommend a "production environment": decently powered 1U Linux
servers with large disks (or whatever the recommendation is on the
wiki).
Brian
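For the development-on-a-desktop case, Hadoop's pseudo-distributed mode runs all the daemons on a single machine. A minimal configuration sketch (the file name and property name follow the docs of the era; older releases put everything in hadoop-site.xml instead, and port 9000 is just the conventional choice from the examples):

```xml
<?xml version="1.0"?>
<!-- conf/core-site.xml: point the filesystem at a local single-node HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

On a production cluster the same property points at the real namenode host; the wiki Brian mentions covers the rest of the tuning.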
>
> > -----Original Message-----
>> From: buddha1021 <bu...@yahoo.cn>
>> Sent: Thursday, February 19, 2009
>> To: core-user@hadoop.apache.org
>> Cc:
>> Subject: Re: the question about the common pc?
>>
>>
>> and, are the nodes the PCs that people use daily with Windows, or
>> 1U servers?
>>
>> buddha1021 wrote:
>>>
>>> hi:
>>> The documentation says that Hadoop distributes the data and
>>> processing across clusters of commonly available computers. But
>>> what does "commonly available computers" mean? A 1U server, or the
>>> PCs that people use daily with Windows?
>>>
>>
Re:Re: the question about the common pc?
Posted by 柳松 <la...@126.com>.
Actually, there's a widespread misunderstanding of this "common PC". Common PC doesn't mean the PCs that are used daily; it means that the performance of each node can be measured by a common PC's computing power.
As a matter of fact, we don't use Gb Ethernet for daily PCs' communication, we don't use Linux for our document processing, and most importantly, Hadoop cannot run effectively on those "daily PC"s.
Hadoop is designed for High Performance Computing equipment, but is "claimed" to be fit for "daily PC"s.
Hadoop for PCs? What a joke.
> -----Original Message-----
> From: buddha1021 <bu...@yahoo.cn>
> Sent: Thursday, February 19, 2009
> To: core-user@hadoop.apache.org
> Cc:
> Subject: Re: the question about the common pc?
>
>
> and, are the nodes the PCs that people use daily with Windows, or 1U servers?
>
> buddha1021 wrote:
> >
> > hi:
> > The documentation says that Hadoop distributes the data and processing
> > across clusters of commonly available computers. But what does "commonly
> > available computers" mean? A 1U server, or the PCs that people use daily
> > with Windows?
> >
>
Re: the question about the common pc?
Posted by buddha1021 <bu...@yahoo.cn>.
and, are the nodes the PCs that people use daily with Windows, or 1U servers?
buddha1021 wrote:
>
> hi:
> The documentation says that Hadoop distributes the data and processing
> across clusters of commonly available computers. But what does "commonly
> available computers" mean? A 1U server, or the PCs that people use daily
> with Windows?
>