Posted to user@nutch.apache.org by Andy Morris <an...@woodward.edu> on 2006/02/05 19:54:13 UTC

How deep to go

How deep should a good intranet crawl be: 10? 20?
I still can't get all of my site searchable.

Here is my situation...
I want to crawl just a local site for our intranet.  We have just
rolled out an ASP-only website, replacing a pure HTML site.  I ran
Nutch on the old site and got great results.  Since moving to the new
site I am having a devil of a time retrieving good information and am
missing a ton of info altogether.  I am not sure what settings I need
to change to get good results.  One setting I have changed does
produce results, but it seems to crawl other websites and not just my
domain: in the last line of the crawl-urlfilter file I replaced the -
with + so it does not ignore other information.  Our site is
www.woodward.edu.  I was wondering if someone on this list could crawl
this site, and only this domain, and see what they come up with.
Woodward.edu is the domain.  I am just stumped as to what to do next.
I am running a nightly build from January 26th, 2006.
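
[For reference, the usual way to limit a crawl to one domain is to leave
the final catch-all "-." line in crawl-urlfilter.txt in place and add an
accept rule for the domain above it, rather than flipping the - to +.
A quick sanity check of that style of rule, mirrored as a Python regex
(the pattern below follows the tutorial's illustrative single-domain
rule and is not taken from Andy's actual file):]

```python
import re

# Typical crawl-urlfilter.txt accept rule for a single domain:
#   +^http://([a-z0-9]*\.)*woodward.edu/
# Nutch strips the leading "+" and treats the rest as a regex;
# the Python pattern below mirrors it for a quick sanity check.
pattern = re.compile(r"^http://([a-z0-9]*\.)*woodward\.edu/")

assert pattern.match("http://www.woodward.edu/index.asp")      # main site: accepted
assert pattern.match("http://search.woodward.edu/")            # subdomain: accepted
assert not pattern.match("http://www.example.com/somepage")    # other domain: rejected
```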

My criteria for our local search are to be able to search PDF, DOC,
image, and web content.  You can go to http://search.woodward.edu and
see what the search page pulls up.

Thanks for any help this list can provide.
Andy Morris 

Re: How deep to go

Posted by Stefan Groschupf <sg...@media-style.com>.
Instead of using the crawl command, I personally prefer the manual
commands.  I use a small script that runs the steps from
http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
in a never-ending loop, waiting a day between iterations.
This will make sure that you get all links that match your URL filter.
Just don't forget to remove old segments and merge the indexes
together; more about such things can be found in the mail archive.
Also don't forget to enable the plugins you need (e.g. the PDF parser).
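
[A minimal sketch of such a loop, assuming the generate/fetch/updatedb
cycle from the whole-web tutorial.  The "crawldb" and "segments" paths,
the segment-naming helper, and the one-day wait are placeholders, and
the exact bin/nutch arguments vary between Nutch versions:]

```python
import subprocess
import time

def crawl_round_commands(segment):
    """The bin/nutch commands for one generate/fetch/updatedb round.

    Paths ("crawldb", "segments") are assumptions; adjust for your install.
    """
    return [
        ["bin/nutch", "generate", "crawldb", "segments"],  # select URLs due for fetching
        ["bin/nutch", "fetch", segment],                   # fetch the selected pages
        ["bin/nutch", "updatedb", "crawldb", segment],     # merge newly found links back
    ]

def run_forever(newest_segment):
    """Never-ending loop: one crawl round per day.

    `newest_segment` is a caller-supplied function returning the path of
    the segment directory created by the generate step.
    """
    while True:
        for cmd in crawl_round_commands(newest_segment()):
            subprocess.run(cmd, check=True)
        time.sleep(24 * 60 * 60)  # wait a day between iterations
```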

HTH
Stefan
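
[On the plugin point: parsers are enabled through the plugin.includes
property, normally overridden in conf/nutch-site.xml.  A sketch; the
value shown is an illustrative plugin list, and the plugin names
available depend on your build:]

```xml
<!-- conf/nutch-site.xml (overrides nutch-default.xml) -->
<property>
  <name>plugin.includes</name>
  <!-- illustrative list: parse-pdf for PDFs, parse-msword for .doc files -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>
```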

On 2006/02/05 19:54, Andy Morris wrote:

> How deep should a good intranet crawl be...10-20?
> I still can't get all of my site searchable..
> [...]

---------------------------------------------------------------
company:  http://www.media-style.com
forum:    http://www.text-mining.org
blog:     http://www.find23.net