Posted to user@nutch.apache.org by Ar...@csiro.au on 2011/12/19 08:32:53 UTC
Runaway fetcher threads
Hi,
I've observed an interesting phenomenon that is not hard to reproduce and that I think should not be happening:
With N fetcher threads, inject, say, 2xN URLs of VERY large files plus a few smaller files to fetch, and run something that uses org.apache.nutch.crawl.Crawl. The big files will take forever to download and the threads will be killed. The process will then proceed to the indexing stage. However, you will see fetcher thread output in the logs intermixed with the output of the indexer. This shows that the threads were not terminated properly (or at all?).
Regards,
Arkadi
RE: Runaway fetcher threads
Posted by Ar...@csiro.au.
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: Tuesday, 20 December 2011 10:08 AM
> To: user@nutch.apache.org
> Subject: Re: Runaway fetcher threads
>
> Hi,
>
> > Hi Markus,
> >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> > > Sent: Monday, 19 December 2011 9:24 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Runaway fetcher threads
> > >
> > > On Monday 19 December 2011 08:32:53 Arkadi.Kosmynin@csiro.au wrote:
> > > > Hi,
> > > >
> > > > I've observed an interesting phenomenon that is not hard to reproduce
> > > > and that I think should not be happening:
> > > >
> > > > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> > > > files plus a few smaller files to fetch and run something that uses
> > > > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > > > download and the threads will be killed. The process then will proceed
> > > > to the indexing stage. However, you will see fetcher threads output in
> > > > the logs intermixed with the output of the indexer. This shows that
> > > > they were not terminated properly (or at all?).
> > >
> > > Hi, what version are you running? Sounds like an old one. Can you try
> > > with a more recent version if that is the case?
> >
> > I am using 1.4 latest release.
>
> Then how can fetcher logs be `intermixed` with indexer logs? Or is this a
> local instance where you run multiple local jobs concurrently?
Yes, I am running Nutch in local mode. All output goes to one log file. But in this file, fetcher records appear after, and mixed in with, the indexer records. This is what looks abnormal. By the time the indexer starts, the fetcher call must have returned (see the Crawl class). Evidently, some fetcher threads were left running.
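To make the symptom concrete, here is a minimal, self-contained Java sketch of the general mechanism. It is not Nutch code, and it does not claim to show what the Nutch fetcher actually does. The class RunawayThreadSketch, the methods fetch and run, and the log strings are all hypothetical names invented for illustration. It only shows that when a caller stops waiting for a worker thread without stopping it, the worker can keep writing to the shared log after the next stage has started, and that interrupting and joining the worker before returning avoids this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class RunawayThreadSketch {

    // Stand-in for the single log file that every stage writes to in local mode.
    static final List<String> LOG = new CopyOnWriteArrayList<>();

    // Hypothetical "fetch phase": starts one slow worker, waits at most timeoutMs,
    // then returns. With stopWorker == false the worker is simply abandoned.
    static Thread fetch(long timeoutMs, boolean stopWorker, CountDownLatch download)
            throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                download.await(); // simulates a download that outlives the timeout
                LOG.add("fetcher: finished download");
            } catch (InterruptedException e) {
                LOG.add("fetcher: interrupted, giving up");
            }
        }, "fetcher-0");
        worker.start();
        worker.join(timeoutMs);   // give up waiting after the timeout
        if (stopWorker) {
            worker.interrupt();   // ask the worker to stop ...
            worker.join();        // ... and wait until it really has
        }
        return worker;
    }

    // Runs "fetch" and then "index", and returns the resulting log lines in order.
    static List<String> run(boolean stopWorker) {
        LOG.clear();
        CountDownLatch download = new CountDownLatch(1);
        try {
            Thread worker = fetch(50, stopWorker, download);
            LOG.add("indexer: started");
            download.countDown(); // the slow download completes only now
            worker.join();        // keeps the demo deterministic
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // fetcher line appears after the indexer line
        System.out.println(run(true));  // fetcher line appears before the indexer line
    }
}
```

With run(false) the fetcher line lands after the indexer line, which is the pattern I see in my log. With run(true) the worker is interrupted and joined before the fetch phase returns, so nothing from it can appear later.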
>
> I've never seen fetcher and indexer output together in one log or part of a
> log (in that case it's running local).
>
>
> >
> > > In any case, if this is about evenly distributing files across fetch
> > > lists, this cannot be based on file size as it is unknown beforehand.
> > > That is only possible when recrawling large files with a modified
> > > generator and updater that adds the Content-Length field as CrawlDatum
> > > metadata.
> >
> > No, this is not related to evenly distributing files across fetch lists.
> >
> > > > Regards,
> > > >
> > > > Arkadi
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
Re: Runaway fetcher threads
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
> Hi Markus,
>
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> > Sent: Monday, 19 December 2011 9:24 PM
> > To: user@nutch.apache.org
> > Subject: Re: Runaway fetcher threads
> >
> > On Monday 19 December 2011 08:32:53 Arkadi.Kosmynin@csiro.au wrote:
> > > Hi,
> > >
> > > I've observed an interesting phenomenon that is not hard to reproduce
> > > and that I think should not be happening:
> > >
> > > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> > > files plus a few smaller files to fetch and run something that uses
> > > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > > download and the threads will be killed. The process then will proceed
> > > to the indexing stage. However, you will see fetcher threads output in
> > > the logs intermixed with the output of the indexer. This shows that
> > > they were not terminated properly (or at all?).
> >
> > Hi, what version are you running? Sounds like an old one. Can you try
> > with a more recent version if that is the case?
>
> I am using 1.4 latest release.
Then how can fetcher logs be `intermixed` with indexer logs? Or is this a
local instance where you run multiple local jobs concurrently?
I've never seen fetcher and indexer output together in one log or part of a
log (in that case it's running local).
>
> > In any case, if this is about evenly distributing files across fetch
> > lists, this cannot be based on file size as it is unknown beforehand.
> > That is only possible when recrawling large files with a modified
> > generator and updater that adds the Content-Length field as CrawlDatum
> > metadata.
>
> No, this is not related to evenly distributing files across fetch lists.
>
> > > Regards,
> > >
> > > Arkadi
> >
> > --
> > Markus Jelsma - CTO - Openindex
RE: Runaway fetcher threads
Posted by Ar...@csiro.au.
Hi Markus,
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: Monday, 19 December 2011 9:24 PM
> To: user@nutch.apache.org
> Subject: Re: Runaway fetcher threads
>
>
>
> On Monday 19 December 2011 08:32:53 Arkadi.Kosmynin@csiro.au wrote:
> > Hi,
> >
> > I've observed an interesting phenomenon that is not hard to reproduce
> > and that I think should not be happening:
> >
> > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> > files plus a few smaller files to fetch and run something that uses
> > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > download and the threads will be killed. The process then will proceed
> > to the indexing stage. However, you will see fetcher threads output in
> > the logs intermixed with the output of the indexer. This shows that
> > they were not terminated properly (or at all?).
>
> Hi, what version are you running? Sounds like an old one. Can you try
> with a more recent version if that is the case?
I am using 1.4 latest release.
>
> In any case, if this is about evenly distributing files across fetch
> lists, this cannot be based on file size as it is unknown beforehand.
> That is only possible when recrawling large files with a modified
> generator and updater that adds the Content-Length field as CrawlDatum
> metadata.
No, this is not related to evenly distributing files across fetch lists.
>
> >
> > Regards,
> >
> > Arkadi
>
> --
> Markus Jelsma - CTO - Openindex
Re: Runaway fetcher threads
Posted by Markus Jelsma <ma...@openindex.io>.
On Monday 19 December 2011 08:32:53 Arkadi.Kosmynin@csiro.au wrote:
> Hi,
>
> I've observed an interesting phenomenon that is not hard to reproduce and
> that I think should not be happening:
>
> If you have N fetcher threads, inject, say, 2xN URLs of VERY large files
> plus a few smaller files to fetch and run something that uses
> org.apache.nutch.crawl.Crawl. The big files will take forever to download
> and the threads will be killed. The process then will proceed to the
> indexing stage. However, you will see fetcher threads output in the logs
> intermixed with the output of the indexer. This shows that they were not
> terminated properly (or at all?).
Hi, what version are you running? Sounds like an old one. Can you try with a
more recent version if that is the case?
In any case, if this is about evenly distributing files across fetch lists, this
cannot be based on file size, as it is unknown beforehand. That is only
possible when recrawling large files with a modified generator and updater
that add the Content-Length field as CrawlDatum metadata.
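As a rough illustration of what size-aware distribution could look like once sizes are known from an earlier crawl, here is a small plain-Java sketch. It uses no Nutch classes at all: SizeBalancedListsSketch, partition, and the URL-to-size map are hypothetical, and it assumes the Content-Length values have already been read back from wherever they were stored. It uses a simple greedy rule (largest file first, always into the currently lightest list), which is one possible heuristic and not something the generator does today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SizeBalancedListsSketch {

    // Splits URLs into numLists lists so that the summed known sizes stay roughly
    // even: take the largest file first and put it into the lightest list so far.
    static List<List<String>> partition(Map<String, Long> knownSizes, int numLists) {
        List<List<String>> lists = new ArrayList<>();
        long[] totals = new long[numLists];
        for (int i = 0; i < numLists; i++) {
            lists.add(new ArrayList<>());
        }

        // Sort entries by size, largest first (stable for equal sizes).
        List<Map.Entry<String, Long>> entries = new ArrayList<>(knownSizes.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        for (Map.Entry<String, Long> entry : entries) {
            int lightest = 0;
            for (int i = 1; i < numLists; i++) {
                if (totals[i] < totals[lightest]) {
                    lightest = i; // ties keep the lower index
                }
            }
            lists.get(lightest).add(entry.getKey());
            totals[lightest] += entry.getValue();
        }
        return lists;
    }
}
```

For sizes a=900, b=500, c=400, d=100 and two lists, this puts a and d in the first list and b and c in the second, so neither list ends up holding all the large files.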
>
> Regards,
>
> Arkadi
--
Markus Jelsma - CTO - Openindex