Posted to user@nutch.apache.org by AJ Chen <ca...@gmail.com> on 2006/10/26 18:09:30 UTC

map-reduce very slow on single machine

I'm using 0.9-dev code to crawl the web on a single machine. With the default
configuration, it takes ~5 hours to fetch 100,000 pages, but also >5 hours
doing map-reduce. Is this the expected performance for the map-reduce phase
relative to the fetch phase? It seems to me map-reduce takes too much time. Is
there anything to configure in order to reduce the time spent in map-reduce?
I'd appreciate any suggestion on how to improve web search performance on a
single machine.

Thanks,

AJ
http://web2express.org

Re: map-reduce very slow on single machine

Posted by Zaheed Haque <za...@gmail.com>.
On 10/26/06, AJ Chen <ca...@gmail.com> wrote:
> Current version of nutch uses mapred regardless of the number of computer
> nodes. So, for applications using a single computer and the default
> configuration (i.e. no distributed crawling), the issue is not about the
> performance gain from mapred, but rather how to minimize the overhead from
> mapred. Does anyone have a good performance benchmark for running nutch
> 0.8 or 0.9 on a single machine? In particular, how much time spent in the
> map-reduce phases is reasonable relative to the time used by the fetching
> phase?
>
> Can anyone tell whether 4 hours of doing "reduce > reduce" after fetching
> 100,000 pages in 5 hours is within expectations? If it's not right, what
> might be the cause?

How much memory do you have? Also, what's the Java heap size? I have the
following configuration on my test server:

6 GB memory
one 64-bit AMD CPU
Java heap of 4 GB

and I can do 250,000 pages, crawl to index, in about 2-3 hours.

Also, I am just fetching strict HTML pages, no pdf, no word, etc.

Sorry, it's very difficult to say what the problem could be.

Regards,
Zaheed
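
Something along these lines is what I mean; a sketch only, and the exact
property names and values should be checked against conf/nutch-default.xml
and conf/hadoop-default.xml for your version:

```xml
<!-- conf/hadoop-site.xml: give the child map/reduce JVMs a larger heap
     (the hadoop-era default is small, e.g. -Xmx200m) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>

<!-- conf/nutch-site.xml: fetch/parse strict HTML only, no parse-pdf or
     parse-msword plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```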


> AJ
>
>
> On 10/26/06, Josef Novak <jo...@gmail.com> wrote:
> >
> > Hi AJ,
> >
> > I very well may be wrong, but as I understand it, nutch/hadoop
> > implements map/reduce primarily as a means of efficiently and reliably
> > distributing work among the nodes in a (large) cluster of consumer-grade
> > machines.  I suspect there is not much to be gained from running it on
> > a single machine.
> >
> > http://labs.google.com/papers/mapreduce.html
> > http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
> > http://wiki.apache.org/lucene-hadoop/
> >
> >
> > happy hunting,
> > joe
> >
> >
> > On 10/27/06, AJ Chen <ca...@gmail.com> wrote:
> > > I'm using 0.9-dev code to crawl the web on a single machine. With the
> > > default configuration, it takes ~5 hours to fetch 100,000 pages, but
> > > also >5 hours doing map-reduce. Is this the expected performance for
> > > the map-reduce phase relative to the fetch phase? It seems to me
> > > map-reduce takes too much time. Is there anything to configure in
> > > order to reduce the time spent in map-reduce?  I'd appreciate any
> > > suggestion on how to improve web search performance on a single
> > > machine.
> > >
> > > Thanks,
> > >
> > > AJ
> > > http://web2express.org
> > >
> > >
> >
>
>
>
> --
> AJ Chen, PhD
> http://web2express.org
>
>

Re: map-reduce very slow on single machine

Posted by AJ Chen <ca...@gmail.com>.
Current version of nutch uses mapred regardless of the number of computer
nodes. So, for applications using a single computer and the default
configuration (i.e. no distributed crawling), the issue is not about the
performance gain from mapred, but rather how to minimize the overhead from
mapred. Does anyone have a good performance benchmark for running nutch 0.8
or 0.9 on a single machine? In particular, how much time spent in the
map-reduce phases is reasonable relative to the time used by the fetching
phase?

Can anyone tell whether 4 hours of doing "reduce > reduce" after fetching
100,000 pages in 5 hours is within expectations? If it's not right, what
might be the cause?
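
A sketch of the kind of settings presumably in question, assuming the
standard Hadoop property names of that era apply (conf/hadoop-site.xml);
the values here are illustrative guesses, not tested recommendations:

```xml
<!-- keep task counts small so a single box is not oversubscribed -->
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
</property>

<!-- a larger in-memory sort buffer should mean fewer spills to disk
     during the sort/reduce step -->
<property>
  <name>io.sort.mb</name>
  <value>100</value>
</property>
```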

AJ


On 10/26/06, Josef Novak <jo...@gmail.com> wrote:
>
> Hi AJ,
>
> I very well may be wrong, but as I understand it, nutch/hadoop
> implements map/reduce primarily as a means of efficiently and reliably
> distributing work among the nodes in a (large) cluster of consumer-grade
> machines.  I suspect there is not much to be gained from running it on a
> single machine.
>
> http://labs.google.com/papers/mapreduce.html
> http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
> http://wiki.apache.org/lucene-hadoop/
>
>
> happy hunting,
> joe
>
>
> On 10/27/06, AJ Chen <ca...@gmail.com> wrote:
> > I'm using 0.9-dev code to crawl the web on a single machine. With the
> > default configuration, it takes ~5 hours to fetch 100,000 pages, but
> > also >5 hours doing map-reduce. Is this the expected performance for
> > the map-reduce phase relative to the fetch phase? It seems to me
> > map-reduce takes too much time. Is there anything to configure in order
> > to reduce the time spent in map-reduce?  I'd appreciate any suggestion
> > on how to improve web search performance on a single machine.
> >
> > Thanks,
> >
> > AJ
> > http://web2express.org
> >
> >
>



-- 
AJ Chen, PhD
http://web2express.org

Re: map-reduce very slow on single machine

Posted by Josef Novak <jo...@gmail.com>.
Hi AJ,

I very well may be wrong, but as I understand it, nutch/hadoop
implements map/reduce primarily as a means of efficiently and reliably
distributing work among the nodes in a (large) cluster of consumer-grade
machines.  I suspect there is not much to be gained from running it on a
single machine.

http://labs.google.com/papers/mapreduce.html
http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
http://wiki.apache.org/lucene-hadoop/
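
As a toy sketch of the model (plain Python rather than Hadoop's actual API,
with made-up function names), the classic word count shows the two phases:

```python
from collections import defaultdict

def map_phase(docs):
    # emit (word, 1) pairs, as a Hadoop mapper would
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # group pairs by key and sum the counts, as a reducer would
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick fox", "the lazy dog"]
print(reduce_phase(map_phase(docs)))
```

The win on a cluster is that the map calls (and the grouped reduce calls)
run in parallel on different machines; on one box you mostly pay the
framework's sorting and bookkeeping overhead.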


happy hunting,
joe


On 10/27/06, AJ Chen <ca...@gmail.com> wrote:
> I'm using 0.9-dev code to crawl the web on a single machine. With the
> default configuration, it takes ~5 hours to fetch 100,000 pages, but also
> >5 hours doing map-reduce. Is this the expected performance for the
> map-reduce phase relative to the fetch phase? It seems to me map-reduce
> takes too much time. Is there anything to configure in order to reduce the
> time spent in map-reduce?  I'd appreciate any suggestion on how to improve
> web search performance on a single machine.
>
> Thanks,
>
> AJ
> http://web2express.org
>
>