Posted to user@nutch.apache.org by Enis Soztutar <en...@gmail.com> on 2007/04/02 11:06:47 UTC

Re: Wildly different crawl results depending on environment...

Briggs wrote:
> nutch 0.7.2
>
> I have 2 scenarios (both using the exact same configurations):
>
> 1) Running the crawl tool from the command line:
>
>    ./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5
>
> 2) Running the crawl tool from a web app somewhere in code like:
>
>    final String[] args = new String[]{
>        "-local", "/tmp/urlfile.txt",
>        "-dir", "/tmp/somedir",
>        "-depth", "5"};
>
>    CrawlTool.main(args);
>
>
> When I run the first scenario, I may get thousands of pages, but when
> I run the second scenario my results vary wildly.  I mean, I get
> perhaps 0, 1, 10+, or 100+ pages.  But I rarely ever get a good crawl
> from within a web application.  So, there are many things that could
> be going wrong here....
>
> 1) Is there some sort of parsing issue?  An XML parser, regex,
> timeouts... something?  I'm not sure, but it just won't crawl as well
> as it does in 'standalone mode'.
>
> 2) Is it a bad idea to use many concurrent CrawlTools, or even to
> reuse a crawl tool (more than once) within an instance of a JVM?  It
> seems to have problems doing this.  I am thinking there are some
> static references that don't handle such use well, but this is just
> a wild guess that I am not sure of.
>
>
>
Checking the logs might help in this case. From my experience, I can 
say that there can be classloading problems when the crawl runs in a 
servlet container. I suggest you also try running the crawl step-wise, 
by first running inject, then generate, fetch, etc.
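
To make that concrete, here is a minimal sketch of a step-wise run
driven from Java, so it could be kicked off from the web app as well.
Each step is launched as its own bin/nutch process, which also keeps
any static state in the tools from leaking between steps. The
NUTCH_HOME path and the url file are placeholders, and the tool
arguments are assumptions based on the 0.7 whole-web tutorial, so
check them against your installation:

    // Sketch: run a Nutch 0.7 crawl step-wise instead of via CrawlTool.main().
    // Each step runs as its own "bin/nutch" process, so static state in the
    // tools dies with that step's JVM.  NUTCH_HOME, the url file path, and
    // the exact tool arguments are assumptions (based on the 0.7 tutorial).
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class StepwiseCrawl {

        static final File NUTCH_HOME = new File("/opt/nutch-0.7.2"); // assumption

        // Run one "bin/nutch <tool> <args...>" step and wait for it to finish.
        static void nutch(String... args) throws Exception {
            List<String> cmd = new ArrayList<String>();
            cmd.add(new File(NUTCH_HOME, "bin/nutch").getPath());
            cmd.addAll(Arrays.asList(args));
            ProcessBuilder pb = new ProcessBuilder(cmd);
            pb.directory(NUTCH_HOME);
            pb.redirectErrorStream(true); // fold stderr into stdout
            Process p = pb.start();
            // Drain the child's output so it cannot block on a full pipe.
            BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()));
            for (String line; (line = r.readLine()) != null; ) {
                System.out.println(line);
            }
            if (p.waitFor() != 0) {
                throw new RuntimeException("nutch step failed: " + args[0]);
            }
        }

        public static void main(String[] argv) throws Exception {
            nutch("admin", "db", "-create");                 // create the web db
            nutch("inject", "db", "-urlfile", "/tmp/urlfile.txt");
            for (int round = 0; round < 5; round++) {        // one round per depth
                nutch("generate", "db", "segments");         // write a fetchlist
                String segment = latestSegment();            // newest segment dir
                nutch("fetch", segment);                     // fetch its pages
                nutch("updatedb", "db", segment);            // fold results back
            }
        }

        // Newest directory under segments/, i.e. the one generate just created.
        static String latestSegment() {
            File[] dirs = new File(NUTCH_HOME, "segments").listFiles();
            File newest = dirs[0];
            for (File d : dirs) {
                if (d.lastModified() > newest.lastModified()) newest = d;
            }
            return newest.getPath();
        }
    }

Running the rounds this way also makes it obvious which individual
step (inject, generate, fetch, updatedb) misbehaves inside the
container, instead of one opaque crawl.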




Re: Wildly different crawl results depending on environment...

Posted by Briggs <ac...@gmail.com>.
Thanks, I'll look into it. Though I have never really tried that
level of granularity, so I'll have to figure out what you just told
me to do!  hah.
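
As a stopgap, if CrawlTool has to stay inside the webapp's JVM, one
hedged sketch is to serialize the calls so at most one crawl runs at a
time. This assumes, as suspected above, that the 0.7 tools hold static
state and are not safe to run concurrently; it does not clean up
whatever state a crawl leaves behind, and the
org.apache.nutch.tools.CrawlTool class name should be checked against
your version:

    // Sketch: serialize in-process crawls.  Assumes the 0.7 tools hold static
    // state and are unsafe to run concurrently (the suspicion voiced above);
    // running each crawl in a separate process is still the safer option.
    import org.apache.nutch.tools.CrawlTool; // class name: check your version

    public final class SerializedCrawler {

        private static final Object LOCK = new Object();

        public static void crawl(String urlFile, String dir, int depth)
                throws Exception {
            synchronized (LOCK) { // at most one crawl per JVM at a time
                CrawlTool.main(new String[] {
                    "-local", urlFile,
                    "-dir", dir,
                    "-depth", String.valueOf(depth),
                });
            }
        }
    }

Even serialized, a second in-process run can still inherit leftover
static state, which would be consistent with the wildly varying
results above; the separate-process approach sketched earlier avoids
that entirely.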





-- 
"Concious decisions by concious minds are what make reality real"