Posted to user@nutch.apache.org by og...@yahoo.com on 2005/08/02 22:12:30 UTC

Re: [Nutch-general] Re: Memory usage2

Wow, a pile of questions. :)
Is this for a web-wide search engine?

Otis


--- Jay Pound <we...@poundwebhosting.com> wrote:

> What's the bottleneck for the slow searching? I'm monitoring it, and it's
> doing about 57% CPU load when I'm searching; it takes about 50 seconds to
> bring up the results page the first time, then if I search for the same
> thing again it's much faster.
> Doug, can I trash my segments after they are indexed? I don't want to have
> cached access to the pages; do the segments still need to be there? My
> 30-million-page index/segment is using over 300 GB. I have the space, but
> when I get to the hundreds of millions of pages I will run out of room on
> my RAID controllers for HD expansion, so I'm planning on moving to Lustre
> if NDFS is not stable by then. I plan on having a multi-billion-page index
> if the memory requirements for that can be below 16 GB per search node.
> Right now I'm getting pretty crappy results from my 30 million pages. I
> read the whitepaper on Authoritative Sources in a Hyperlinked Environment
> because someone said that's how the Nutch algorithm works, so I'm assuming
> that as my index grows the pages that deserve top placement will receive
> top placement, but I don't know if I should re-fetch a new set of segments
> with root URLs just ending in US extensions (.com, .edu, etc.). I made a
> small set testing this theory (100,000 pages) and its results were much
> better than my results from the 30-million-page index. What's your thought
> on this? Am I right in thinking that the pages with the most pages linking
> to them will show up first? So if I index 500 million pages, my results
> should be on par with the rest of the "big dogs"?
> 
> One last important question: if I merge my indexes, will searching be
> faster than if I don't merge them? I currently have 20 directories of
> 1-1.7 million pages each. And if I split these indexes across multiple
> machines, will the searching be faster? I couldn't get the nutch server
> to work, but I'm using 0.6.
> 
> I have a very fast server; I didn't know if the searching would take
> advantage of SMP. Fetching will, and I can run multiple indexing jobs at
> the same time. My HD array does 200 MB/s I/O. I have the new dual-core
> Opteron 275 (Italy core) with 4 GB of RAM, working my way up to 16 GB and
> a second processor when I need them, with 1.28 TB of HD space for Nutch
> currently and expansion up to 5.12 TB. I'm currently running Windows 2000
> on it, as there is no SUSE 9.3 driver yet for my RAID cards (HighPoint
> 2220), so my scalability will be up to 960 MB/s with all the drives in
> the system and 4x2.2 GHz processor cores. Until I need to cluster, that's
> what I have to play with for Nutch, in case you guys needed to know what
> hardware I'm running.
> Thank you
> -Jay Pound
> Fromped.com
> BTW, Windows 2000 is not 100% stable with dual-core processors. Nutch is
> OK, but it can't do too many things at once or I'll get a kernel in-page
> error (guess it's time to migrate to Windows Server 2003, damn).
> ----- Original Message ----- 
> From: "Doug Cutting" <cu...@nutch.org>
> To: <nu...@lucene.apache.org>
> Sent: Tuesday, August 02, 2005 1:53 PM
> Subject: Re: Memory usage
> 
> 
> > Try the following settings in your nutch-site.xml:
> >
> > <property>
> >    <name>io.map.index.skip</name>
> >    <value>7</value>
> > </property>
> >
> > <property>
> >    <name>indexer.termIndexInterval</name>
> >    <value>1024</value>
> > </property>
> >
> > The first causes data files to use considerably less memory.
> >
> > The second affects index creation, so it must be done before you create
> > the index you search.  It's okay if your segment indexes were created
> > without it; you can just (re-)merge the indexes, and the merged index
> > will pick up the setting and use less memory when searching.
> >
> > Combining these two, I have searched a 40+M-page index on a machine
> > using about 500 MB of RAM.  That said, search times with such a large
> > index are not good.  At some point, as your collection grows, you will
> > want to merge multiple indexes containing different subsets of
> > segments, put each on a separate box, and search them with distributed
> > search.
> >
> > Doug
> >
> > Jay Pound wrote:
> > > I'm testing an index of 30 million pages; it requires 1.5 GB of RAM
> > > to search using Tomcat 5. I plan on having an index with multiple
> > > billion pages, but if this is to scale, then even with 16 GB of RAM I
> > > won't be able to have an index larger than 320 million pages. How can
> > > I distribute the memory requirements across multiple machines, or is
> > > there another servlet container (like Resin) that will require less
> > > memory to operate? Has anyone else run into this?
> > > Thanks,
> > > -Jay Pound
> > >
> > >
> >
> >
> 
> 
> 
> 
> 
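A practical footnote on Doug's two settings and the merge step: indexer.termIndexInterval controls how much of the term dictionary Lucene keeps in memory at search time (roughly one term in every termIndexInterval entries), so raising it from the default (128, if memory serves) to 1024 should shrink that in-memory term index by about a factor of eight, at a small cost in term-lookup speed. The "(re-)merge" itself can be driven from bin/nutch; the sketch below is only indicative, since the exact command name and argument order differ between 0.6/0.7-era releases (run bin/nutch with no arguments to see what your copy supports):

# 1. add io.map.index.skip and indexer.termIndexInterval to nutch-site.xml
# 2. re-merge the per-segment indexes; the merged index is written with the
#    new termIndexInterval and so needs less RAM to search
./bin/nutch merge index segments/*
# 3. point the search webapp at the directory holding the merged index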


Re: distributed search

Posted by Jay Pound <we...@poundwebhosting.com>.
Thank you, Piotr, and sorry for misspelling your name in the long e-mail.
-J



Re: distributed search

Posted by Piotr Kosiorowski <pk...@gmail.com>.
If you have two search servers,
search1.mydomain.com
search2.mydomain.com
then on each of them run:
./bin/nutch server 1234 /index

Now go to your Tomcat box. In the directory where you used to have the
"segments" dir (either the Tomcat startup directory or the directory
specified in the Nutch config XML), create a "search-servers.txt" file
containing:
search1.mydomain.com 1234
search2.mydomain.com 1234

And move your old segment/index directories somewhere else so they are not
used by accident.
You should now see search activity in your search servers' logs.
Regards
Piotr
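For what it's worth, here is the same setup written out end to end as a sketch. It assumes two search boxes, each holding its own index and segments under /index, and that the web UI finds the directory containing search-servers.txt through the searcher.dir property (that property name, and its default of the Tomcat startup directory, are my assumption for the 0.6/0.7 era; check nutch-default.xml in your release):

# on search1.mydomain.com and on search2.mydomain.com
./bin/nutch server 1234 /index

# on the Tomcat box, in the directory named by searcher.dir
cat > search-servers.txt <<EOF
search1.mydomain.com 1234
search2.mydomain.com 1234
EOF

# leave no local index/ or segments/ directories in that directory,
# or the webapp may search them instead of the remote servers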





On 8/2/05, webmaster <sa...@www.poundwebhosting.com> wrote:
> I read the wiki on the server option. How does it talk with Tomcat for the
> search? It says:
> ./bin/nutch server <port> <index dir>
> ./bin/nutch server 1234 /index
> 
> How do the servers talk with each other to find the other servers in the
> cluster?
> -Jay
>

distributed search

Posted by webmaster <sa...@www.poundwebhosting.com>.
I read the wiki on the server option. How does it talk with Tomcat for the
search? It says:
./bin/nutch server <port> <index dir>
./bin/nutch server 1234 /index

How do the servers talk with each other to find the other servers in the
cluster?
-Jay

Re: [Nutch-general] Re: Memory usage2

Posted by Jay Pound <we...@poundwebhosting.com>.
This is going to be a web-wide search engine; I just want to be able to set
it up for each language. Right now it returns results for all languages, so
the results are not so good.
I'm trying to get pruning to work but don't know how; once I do, I'll make a
smaller index for each language out of a larger index containing all
languages.
-J
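On the pruning step, a hedged sketch only: it assumes the index carries a per-document lang field (i.e. the language-identifier plugin was enabled at index time, which is an assumption about this setup) and that your Nutch build ships the PruneIndexTool class, which appeared around the 0.7 timeframe and so may not exist in 0.6. The flags, and whether matching documents are kept or deleted, should be verified against the tool's own usage output:

# print the tool's usage message for your version first
./bin/nutch org.apache.nutch.tools.PruneIndexTool

# hypothetical invocation: prune the index "index" by the Lucene query
# lang:en, dry-running before touching anything
echo "lang:en" > queries.txt
./bin/nutch org.apache.nutch.tools.PruneIndexTool index -dryrun -queries queries.txt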
----- Original Message ----- 
From: "Sébastien LE CALLONNEC" <sl...@yahoo.ie>
To: <nu...@lucene.apache.org>
Sent: Tuesday, August 02, 2005 4:34 PM
Subject: Re: [Nutch-general] Re: Memory usage2


> Obviously not:  it must be for « [urls] just ending in US
> extensions(.com.edu etc...) ». :))
>
> Anyway, it all sounds very impressive!  Good luck with your
> investigations and please keep us posted.
>
>
> Regards,
> Sébastien.
>
>


Re: [Nutch-general] Re: Memory usage2

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Obviously not:  it must be for « [urls] just ending in US
extensions(.com.edu etc...) ». :))

Anyway, it all sounds very impressive!  Good luck with your
investigations and please keep us posted.


Regards,
Sébastien.

