Posted to user@nutch.apache.org by "renaud@apache.org" <re...@apache.org> on 2007/09/01 21:28:12 UTC

Re: hadoop on single machine

hi Tomislav,
> Hi Renaud,
> thank you for your reply. This is valuable information, but can you
> elaborate a little bit more, like:
>
> you say: Nutch is "always" using Hadoop.
>
> I assume it does not use the Hadoop Distributed File System (HDFS) when
> running on a single machine by default?
>
> The Hadoop homepage says: Hadoop implements MapReduce, using the HDFS.
>
> If there is no distributed file system over the computer nodes
> (single-machine configuration), what does Hadoop do?
>   
Well, you're not using the full potential of Hadoop's HDFS when using 
Nutch on a single machine (still, Hadoop is handling the map-reduce 
logic, the configuration objects, etc). It's like using a chainsaw to 
cut a toothpick ;-) Nevertheless, Nutch is a very good choice for 
single-machine deployments: high-performance, reliable and easy to 
customize.
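For example, even the plain single-machine crawl from the NutchTutorial
already runs through Hadoop's local job runner and the local file system;
something like this (directory names and parameters are just an example):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
  ls crawl    # crawldb, linkdb, segments/ are plain local directories

Every step is still submitted to Hadoop, it simply executes in-process
against the local disk instead of on a cluster.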
> When running the crawl/recrawl cycle (generate/fetch/update),
> what processes is Hadoop running?
Have a look at the Crawl.java class; it drives the whole generate/fetch/update cycle.
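Roughly, Crawl.java chains together the same steps you can run by hand
with bin/nutch, and each of them is submitted as a Hadoop job. A sketch
(the segment name below is just a placeholder, use whatever generate
actually creates for you):

  bin/nutch inject crawl/crawldb urls              # seed the crawldb
  bin/nutch generate crawl/crawldb crawl/segments  # build a fetchlist
  bin/nutch fetch crawl/segments/20070901000000    # fetch the segment
  bin/nutch updatedb crawl/crawldb crawl/segments/20070901000000
  bin/nutch invertlinks crawl/linkdb crawl/segments/20070901000000
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb \
      crawl/segments/20070901000000

Crawl.java also runs a dedup and index-merge step at the end, if I
remember the code correctly.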
> How can I monitor them to see what is
> going on (like how many URLs are fetched and how many are still
> unfetched from the fetchlist)? Is there a GUI for this?
>   
No GUI, but the command-line tools can give you information (e.g.
readdb, see http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb).
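For instance, to get the fetched/unfetched counts (assuming your crawldb
lives under crawl/crawldb):

  bin/nutch readdb crawl/crawldb -stats

That prints the total number of URLs plus a breakdown per status
(fetched, unfetched, gone, ...), which is usually enough to follow a
crawl/recrawl cycle.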
> you say: Fetching 100 sites of 1,000 pages each with a single machine
> should definitely be OK
>
> What about recrawl on a regular basis (once a day or even more often)?
>   
It depends on your configuration and connection, but you can expect to
fetch 10-30 pages per second, so fetching 100K pages will take less than
3 hours. Regarding disk space, with an estimate of 10 KB per page for the
index, you will need roughly 1 GB.
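Back-of-the-envelope, taking the low end of 10 pages/second and ~10 KB of
index per page:

  100,000 pages / 10 pages per second = 10,000 s, i.e. just under 3 hours
  100,000 pages * 10 KB per page      = ~1 GB of index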
See more on http://wiki.apache.org/nutch/HardwareRequirements

HTH,
Renaud
> Sorry if these are basic questions, but I am trying to learn about Nutch
> and Hadoop. 
>
> Thanks,
>      Tomislav
>
>  
>
>
> On Thu, 2007-08-30 at 18:06 -0400, renaud@apache.org wrote:
>   
>> hi Tomislav,
>>
>> The Nutch Tutorial is the way to go. Fetching 100 sites of 1,000 pages 
>> each with a single machine should definitely be OK. You might want to add 
>> more machines if many people are searching your index.
>>
>> BTW, Nutch is "always" using Hadoop. When testing locally or when using 
>> only one machine, Hadoop just uses the local file system. So even the 
>> NutchTutorial uses Hadoop.
>>
>> HTH,
>> Renaud
>>
>>     
>>> Would it be recommended to use Hadoop for crawling (100 sites with 1000
>>> pages each) on a single machine? What would be the benefit?
>>> Something like what is described at
>>> http://wiki.apache.org/nutch/NutchHadoopTutorial, but on a single
>>> machine.
>>>
>>>
>>> Or is the simple crawl/recrawl (without Hadoop, as described in the Nutch
>>> tutorial on the wiki,
>>> http://wiki.apache.org/nutch/NutchTutorial, plus the recrawl script from
>>> the wiki) the way to go?
>>>
>>> Thanks,
>>>        Tomislav
>>>
>>>
>>>   
>>>       
>
>
>   


Re: hadoop on single machine

Posted by Tomislav Poljak <tp...@gmail.com>.
Hi Renaud,
what would be a recommended hardware specification for a machine running
the searcher web application with 15K users per day on this index (100K
pages)? What is a good practice for getting the index from the crawl
machine to the search machine (if using separate machines for crawling
and searching)?

Thanks,
     Tomislav
