Posted to user@nutch.apache.org by Ar...@csiro.au on 2011/12/19 08:32:53 UTC

Runaway fetcher threads

Hi,

I've observed an interesting phenomenon that is not hard to reproduce and that I think should not be happening:

If you have N fetcher threads, inject, say, 2xN URLs of VERY large files plus a few smaller files to fetch and run something that uses org.apache.nutch.crawl.Crawl. The big files will take forever to download and the threads will be killed. The process then will proceed to the indexing stage. However, you will see fetcher threads output in the logs intermixed with the output of the indexer. This shows that they were not terminated properly (or at all?).

Regards,

Arkadi

RE: Runaway fetcher threads

Posted by Ar...@csiro.au.

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: Tuesday, 20 December 2011 10:08 AM
> To: user@nutch.apache.org
> Subject: Re: Runaway fetcher threads
> 
> Hi,
> 
> > Hi Markus,
> >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> > > Sent: Monday, 19 December 2011 9:24 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Runaway fetcher threads
> > >
> > > On Monday 19 December 2011 08:32:53 Arkadi.Kosmynin@csiro.au wrote:
> > > > Hi,
> > > >
> > > > I've observed an interesting phenomenon that is not hard to reproduce
> > > > and that I think should not be happening:
> > > >
> > > > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> > > > files plus a few smaller files to fetch and run something that uses
> > > > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > > > download and the threads will be killed. The process then will proceed
> > > > to the indexing stage. However, you will see fetcher threads output in
> > > > the logs intermixed with the output of the indexer. This shows that
> > > > they were not terminated properly (or at all?).
> > >
> > > Hi, what version are you running? Sounds like an old one. Can you try
> > > with a more recent version if that is the case?
> >
> > I am using 1.4 latest release.
> 
> Then how can fetcher logs be `intermixed` with indexer logs? Or is this a
> local instance where you run multiple local jobs concurrently?

Yes, I am running Nutch in local mode. All output goes to one log file. But in this file, fetcher records appear after, and mixed with, the indexer records. This is what looks abnormal. By the time the indexer starts, the fetcher call must have returned (see the Crawl class). Evidently, some fetcher threads were left running.
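
To illustrate the kind of behaviour described above, here is a minimal Java sketch. It is not Nutch's Fetcher code: the class RunawayThreadDemo and the method fetchPhase are hypothetical names made up for this example, and the "download" is just a sleep loop. It only shows that a coordinating method which gives up waiting and returns does not stop its worker threads unless it explicitly interrupts and joins them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a fetch phase; this is NOT Nutch's Fetcher code.
class RunawayThreadDemo {

    // Starts n workers that simulate downloads that never finish, waits
    // timeoutMs, then returns. With stopWorkers == false the workers are
    // abandoned and keep running after this method has returned.
    static List<Thread> fetchPhase(int n, long timeoutMs, boolean stopWorkers) {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(10); // pretend to read another chunk
                    }
                } catch (InterruptedException e) {
                    // Interrupted by the coordinator: stop "downloading".
                }
            }, "fetcher-" + i);
            t.setDaemon(true); // do not keep the JVM alive in this demo
            t.start();
            workers.add(t);
        }
        try {
            Thread.sleep(timeoutMs); // the coordinator gives up waiting
            if (stopWorkers) {
                for (Thread t : workers) t.interrupt();
                for (Thread t : workers) t.join(); // wait until they are gone
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return workers;
    }

    public static void main(String[] args) {
        List<Thread> abandoned = fetchPhase(2, 100, false);
        // A "next stage" would start here while these threads still run.
        System.out.println("abandoned worker alive: " + abandoned.get(0).isAlive());

        List<Thread> stopped = fetchPhase(2, 100, true);
        System.out.println("stopped worker alive: " + stopped.get(0).isAlive());
    }
}
```

In the first call the method returns while its workers are still alive, which is the situation that would let their log lines show up after a later stage has started. Whether the real fetcher behaves this way is exactly the open question in this thread.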

> 
> I've never seen fetcher and indexer output together in one log or part of a
> log (in that case it's running local).
> 
> 
> >
> > > In any case, if this is about evenly distributing files across fetch
> > > lists, this cannot be based on file size as it is unknown beforehand.
> > > That is only possible when recrawling large files with a modified
> > > generator and updater that adds the Content-Length field as CrawlDatum
> > > metadata.
> >
> > No, this is not related to evenly distributing files across fetch lists.
> >
> > > > Regards,
> > > >
> > > > Arkadi
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex

Re: Runaway fetcher threads

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

> Hi Markus,
> 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> > Sent: Monday, 19 December 2011 9:24 PM
> > To: user@nutch.apache.org
> > Subject: Re: Runaway fetcher threads
> > 
> > On Monday 19 December 2011 08:32:53 Arkadi.Kosmynin@csiro.au wrote:
> > > Hi,
> > > 
> > > I've observed an interesting phenomenon that is not hard to reproduce
> > > and that I think should not be happening:
> > >
> > > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> > > files plus a few smaller files to fetch and run something that uses
> > > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > > download and the threads will be killed. The process then will proceed
> > > to the indexing stage. However, you will see fetcher threads output in
> > > the logs intermixed with the output of the indexer. This shows that
> > > they were not terminated properly (or at all?).
> > 
> > Hi, what version are you running? Sounds like an old one. Can you try
> > with a more recent version if that is the case?
> 
> I am using 1.4 latest release.

Then how can fetcher logs be `intermixed` with indexer logs? Or is this a 
local instance where you run multiple local jobs concurrently?

I've never seen fetcher and indexer output together in one log or part of a 
log (in that case it's running local).


> 
> > In any case, if this is about evenly distributing files across fetch
> > lists, this cannot be based on file size as it is unknown beforehand.
> > That is only possible when recrawling large files with a modified
> > generator and updater that adds the Content-Length field as CrawlDatum
> > metadata.
> 
> No, this is not related to evenly distributing files across fetch lists.
> 
> > > Regards,
> > > 
> > > Arkadi
> > 
> > --
> > Markus Jelsma - CTO - Openindex

RE: Runaway fetcher threads

Posted by Ar...@csiro.au.
Hi Markus,

> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: Monday, 19 December 2011 9:24 PM
> To: user@nutch.apache.org
> Subject: Re: Runaway fetcher threads
> 
> 
> 
> On Monday 19 December 2011 08:32:53 Arkadi.Kosmynin@csiro.au wrote:
> > Hi,
> >
> > I've observed an interesting phenomenon that is not hard to reproduce
> > and that I think should not be happening:
> >
> > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> > files plus a few smaller files to fetch and run something that uses
> > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > download and the threads will be killed. The process then will proceed
> > to the indexing stage. However, you will see fetcher threads output in
> > the logs intermixed with the output of the indexer. This shows that
> > they were not terminated properly (or at all?).
> 
> Hi, what version are you running? Sounds like an old one. Can you try
> with a more recent version if that is the case?

I am using 1.4 latest release.


> 
> In any case, if this is about evenly distributing files across fetch
> lists, this cannot be based on file size as it is unknown beforehand.
> That is only possible when recrawling large files with a modified
> generator and updater that adds the Content-Length field as CrawlDatum
> metadata.

No, this is not related to evenly distributing files across fetch lists.

> 
> >
> > Regards,
> >
> > Arkadi
> 
> --
> Markus Jelsma - CTO - Openindex

Re: Runaway fetcher threads

Posted by Markus Jelsma <ma...@openindex.io>.

On Monday 19 December 2011 08:32:53 Arkadi.Kosmynin@csiro.au wrote:
> Hi,
> 
> I've observed an interesting phenomenon that is not hard to reproduce and
> that I think should not be happening:
> 
> If you have N fetcher threads, inject, say, 2xN URLs of VERY large files
> plus a few smaller files to fetch and run something that uses
> org.apache.nutch.crawl.Crawl. The big files will take forever to download
> and the threads will be killed. The process then will proceed to the
> indexing stage. However, you will see fetcher threads output in the logs
> intermixed with the output of the indexer. This shows that they were not
> terminated properly (or at all?).

Hi, what version are you running? Sounds like an old one. Can you try with a
more recent version if that is the case?

In any case, if this is about evenly distributing files across fetch lists, this
cannot be based on file size because the size is unknown beforehand. That is only
possible when recrawling large files with a modified generator and updater
that adds the Content-Length field as CrawlDatum metadata.
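
As a rough illustration of the size-based idea, assuming the sizes are already known from an earlier crawl, here is a small Java sketch. It does not use any Nutch classes: SizeBalancedFetchLists, Bucket and partition are hypothetical names, and the map of URL to byte count stands in for whatever a modified updater would have stored. It uses a simple greedy heuristic (largest file first, always into the currently lightest list), which is just one possible choice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; not part of Nutch and not how its generator works.
class SizeBalancedFetchLists {

    // One fetch list together with the total number of bytes assigned to it.
    static final class Bucket {
        final List<String> urls = new ArrayList<>();
        long totalBytes = 0;
    }

    // Splits URLs with known sizes into numLists lists of similar total size.
    static List<Bucket> partition(Map<String, Long> knownSizes, int numLists) {
        // Largest files first, so that small files can even out the totals.
        List<Map.Entry<String, Long>> entries = new ArrayList<>(knownSizes.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        // Min-heap ordered by the bytes assigned so far.
        PriorityQueue<Bucket> heap =
            new PriorityQueue<>(Comparator.comparingLong((Bucket b) -> b.totalBytes));
        List<Bucket> buckets = new ArrayList<>();
        for (int i = 0; i < numLists; i++) {
            Bucket b = new Bucket();
            buckets.add(b);
            heap.add(b);
        }

        for (Map.Entry<String, Long> e : entries) {
            Bucket lightest = heap.poll(); // remove before changing its key
            lightest.urls.add(e.getKey());
            lightest.totalBytes += e.getValue();
            heap.add(lightest);            // re-insert with the new total
        }
        return buckets;
    }
}
```

URLs seen for the first time have no recorded size, so a scheme like this can only help on a recrawl.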

> 
> Regards,
> 
> Arkadi

-- 
Markus Jelsma - CTO - Openindex