Posted to user@nutch.apache.org by qu...@webmail.co.za on 2005/06/22 22:01:29 UTC

Nutch Lockup/Freeze (Fetcher)

Is anyone experiencing freezes when fetching with 50 threads?
If I use 5 threads everything is fine - if I raise it to 10
it freezes at random times when fetching a segment.

Any ideas?
_____________________________________________________________________
For super low premiums, click here http://www.dialdirect.co.za/quote

Re: PDFBox (Re: Nutch Lockup/Freeze (Fetcher) - HELP!!)

Posted by Juho Mäkinen <ju...@gmail.com>.
On 6/29/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Juho Mäkinen wrote:
> > I did some research and I traced the problem to be somewhere inside
> > HttpRequest of protocol-httpclient.
> 
> If you enabled the PDF parser, the version of PDFBox that is currently
> in SVN is known to be broken - for some PDFs a bug in CMap handling can
> ....

I'm not using the PDF parser, so that can't be the problem.

 - Juho Mäkinen, http://www.juhonkoti.net

PDFBox (Re: Nutch Lockup/Freeze (Fetcher) - HELP!!)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Juho Mäkinen wrote:
> I did some research and I traced the problem to be somewhere inside
> HttpRequest of protocol-httpclient.

If you enabled the PDF parser, the version of PDFBox that is currently 
in SVN is known to be broken - for some PDFs a bug in CMap handling can 
cause an endless loop. Please download the latest binary from 
http://www.pdfbox.org/dist , and try again.

I didn't commit the latest PDFBox because it hasn't been released yet. As soon
as there is a new release I'll update the one in our SVN. Until then you
need to follow the above procedure.

I also attached a simple tool to create fetchlists based on a list of
arbitrary URLs. This comes in handy if you want to test various parts of
Nutch with arbitrary URLs that don't come from the DB.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Nutch Maximum urls per domain?

Posted by qu...@webmail.co.za.
Hi there

Is there any way to limit the fetching per domain to a set
number? E.g. only 1000 URLs for each domain in the
fetchlist?

Any ideas?

Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by Juho Mäkinen <ju...@gmail.com>.
> > I did some research and I traced the problem to be somewhere inside
> > HttpRequest of protocol-httpclient.
> I had a similar report from someone else, and I'll try to find out what
> is happening. Thanks for this debugging output, it is helpful - if you
> find something else, please let me know.

It seems that, at least in most cases (I don't know if in every case), inside
HttpResponse, on the line
while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1 && tryAndRead > 0) {
read returns just one byte (bufferFilled == 1). Normally it returns
buffer.length, and it also returns full buffers from the same socket,
but for some reason it goes on a rampage
and starts returning one byte at a time.

I created an ugly workaround by adding a counter, which starts from 10
and decreases every time bufferFilled == 1. Once the counter
reaches zero, it aborts the read by breaking out of the inner while loop. This
leaves the fetched page corrupted, but at least it won't halt
the whole fetch of thousands of pages.
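
The workaround described above can be sketched roughly as follows. This is an
illustration only, not the actual patch: TrickleGuard, readWithGuard and the
maxOneByteReads parameter are hypothetical names, and the real loop in
HttpResponse also checks tryAndRead, which is left out here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of Nutch. */
class TrickleGuard {

    /**
     * Reads the stream into memory, but gives up once read() has returned
     * exactly one byte maxOneByteReads times. The result may be truncated.
     */
    static byte[] readWithGuard(InputStream in, int bufferSize, int maxOneByteReads)
            throws IOException {
        byte[] buffer = new byte[bufferSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int oneByteReadsLeft = maxOneByteReads; // the post above starts this at 10
        int bufferFilled;
        while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, bufferFilled);
            if (bufferFilled == 1 && --oneByteReadsLeft <= 0) {
                break; // abort: the page will be incomplete, but the fetch moves on
            }
        }
        return out.toByteArray();
    }
}
```

Calling TrickleGuard.readWithGuard(in, 8192, 10) mirrors the counter of 10 from
the post. A legitimate response whose last chunk happens to be one byte long
only uses up one count, so it is not cut short on its own.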

 - Juho Mäkinen, http://www.juhonkoti.net

Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by Andrzej Bialecki <ab...@getopt.org>.
Juho Mäkinen wrote:
> I did some research and I traced the problem to be somewhere inside
> HttpRequest of protocol-httpclient.

I had a similar report from someone else, and I'll try to find out what 
is happening. Thanks for this debugging output, it is helpful - if you 
find something else, please let me know.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by Juho Mäkinen <ju...@gmail.com>.
I did some research and traced the problem to somewhere inside
HttpResponse of protocol-httpclient.

I added some System.err.println calls for debugging to the
HttpResponse constructor:
  public HttpResponse(String orig, URL url) throws IOException {
    System.err.println("started HttpResponse");

    origURL = url;
    origUrl = url.toString();
    url = new URL(url.getProtocol(), "127.0.0.1", url.getFile());
    orig = url.toString();

    this.orig = origUrl;
    this.base = origURL.toString();

    GetMethod get = new GetMethod(url.toString());

    get.setFollowRedirects(false);
    get.setStrictMode(false);
    get.setRequestHeader("User-Agent", Http.AGENT_STRING);
    get.setHttp11(false);
    get.setMethodRetryHandler(null);
    try {
      code = Http.getClient().executeMethod(get);

      System.err.println("6");
      Header[] heads = get.getResponseHeaders();

      for (int i = 0; i < heads.length; i++) {
        headers.put(heads[i].getName(), heads[i].getValue());
      }
      System.err.println("7, " + code);
      if (code == 200) {
        System.err.println("8");
        InputStream in = get.getResponseBodyAsStream();
        byte[] buffer = new byte[Http.BUFFER_SIZE];
        System.err.println("9");
        int bufferFilled = 0;
        int totalRead = 0;
        System.err.println("10");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int tryAndRead = calculateTryToRead(totalRead);
        System.err.println("11");
        while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1 && tryAndRead > 0) {
          System.err.println("12, " + bufferFilled);
          totalRead += bufferFilled;
          out.write(buffer, 0, bufferFilled);
          tryAndRead = calculateTryToRead(totalRead);
          System.err.println("12.2");
        }
        System.err.println("13");
        content = out.toByteArray();
        in.close();
        System.err.println("14");
      }
    } catch (org.apache.commons.httpclient.ProtocolException pe) {
      pe.printStackTrace();
      throw new IOException(pe.toString());
    } finally {
      get.releaseConnection();
    }
  }


And here is a snapshot of the output:
050627 141912 fetching http://xxx/yyy/zzz/errors_ids100.html
started HttpResponse
6
7, 200
8
9
10
11
12, 8192
12.2
12, 7880
12.2
050627 141912 Thread[fetcher0,5,fetcher]
050627 141912 Thread[MultiThreadedHttpConnectionManager cleanup,5,fetcher]
12, 8192
12.2
12, 8191
12.2
12, 8192
12.2
12, 8191
12.2
12, 8192
12.2
12, 8191
12.2
12, 8192
12.2
13
050627 141913 Thread[fetcher0,5,fetcher]
050627 141913 Thread[MultiThreadedHttpConnectionManager cleanup,5,fetcher]
050627 141914 Thread[fetcher0,5,fetcher]
050627 141914 Thread[MultiThreadedHttpConnectionManager cleanup,5,fetcher]
050627 141915 Thread[fetcher0,5,fetcher]
050627 141915 Thread[MultiThreadedHttpConnectionManager cleanup,5,fetcher]
050627 141916 Thread[fetcher0,5,fetcher]
050627 141916 Thread[MultiThreadedHttpConnectionManager cleanup,5,fetcher]
050627 141917 Thread[fetcher0,5,fetcher]
** and looping **



Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by Juho Mäkinen <ju...@gmail.com>.
I turned -logLevel finest on with bin/nutch fetch and I got these few debug
lines looping for ever when the fetcher freezes, hope this helps:

050627 133307 Thread[MultiThreadedHttpConnectionManager cleanup,5,fetcher]
050627 133308 Thread[fetcher0,5,fetcher]
050627 133308 Thread[MultiThreadedHttpConnectionManager cleanup,5,fetcher]
050627 133309 Thread[fetcher0,5,fetcher]
050627 133309 Thread[MultiThreadedHttpConnectionManager cleanup,5,fetcher]
050627 133310 Thread[fetcher0,5,fetcher]


I'm using nutch-nightly (nutch-2005-06-19.tar.gz)

 - Juho Mäkinen, http://www.juhonkoti.net

On 6/23/05, Andy Liu <an...@gmail.com> wrote:
> If you have an older version of Nutch you may have the older version
> of NekoHTML which was causing fetcher threads to lockup.
> 
> http://issues.apache.org/jira/browse/NUTCH-17
> 
> On 6/23/05, quovadis@webmail.co.za <qu...@webmail.co.za> wrote:
> > Hi Andrzej
> >
> > Looks like using a newer version eliminates this issue -
> > I'll get back to you after its completed a few fetches.
> >
> >
> >
> > On Thu, 23 Jun 2005 11:53:35 +0200
> >  Andrzej Bialecki <ab...@getopt.org> wrote:
> > > quovadis@webmail.co.za wrote:
> > >
> > > > (LOCKED UP - pressed control-c and got cygwin prompt)
> > > > Administrator@MACHINE-C /nutch-0.6
> > >
> > > LOCKED UP is a very subjective term ;-) Don;t touch
> > > Ctrl-C, but instead please press Ctrl-Break for a full
> > > thread dump, copy it and send it here.
> > >
> > > Also, the official 0.6 release is quite old, you should
> > > probably try the newer version (one of the nightly
> > > builds), and see if the problem persists.
> > >
> > > --
> > > Best regards,
> > > Andrzej Bialecki     <><
> > >   ___. ___ ___ ___ _ _
> > >   __________________________________
> > > [__ || __|__/|__||\/|  Information Retrieval, Semantic
> > > Web
> > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > http://www.sigram.com  Contact: info at sigram dot com
> > >
> >
> >
>

Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by Andy Liu <an...@gmail.com>.
If you have an older version of Nutch you may have the older version
of NekoHTML which was causing fetcher threads to lockup.

http://issues.apache.org/jira/browse/NUTCH-17

On 6/23/05, quovadis@webmail.co.za <qu...@webmail.co.za> wrote:
> Hi Andrzej
> 
> Looks like using a newer version eliminates this issue -
> I'll get back to you after its completed a few fetches.
> 
> 
> 
> On Thu, 23 Jun 2005 11:53:35 +0200
>  Andrzej Bialecki <ab...@getopt.org> wrote:
> > quovadis@webmail.co.za wrote:
> >
> > > (LOCKED UP - pressed control-c and got cygwin prompt)
> > > Administrator@MACHINE-C /nutch-0.6
> >
> > LOCKED UP is a very subjective term ;-) Don;t touch
> > Ctrl-C, but instead please press Ctrl-Break for a full
> > thread dump, copy it and send it here.
> >
> > Also, the official 0.6 release is quite old, you should
> > probably try the newer version (one of the nightly
> > builds), and see if the problem persists.
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >   ___. ___ ___ ___ _ _
> >   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic
> > Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> 
>

Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by qu...@webmail.co.za.
Hi Andrzej

Looks like using a newer version eliminates this issue -
I'll get back to you after it's completed a few fetches.



On Thu, 23 Jun 2005 11:53:35 +0200
 Andrzej Bialecki <ab...@getopt.org> wrote:
> quovadis@webmail.co.za wrote:
> 
> > (LOCKED UP - pressed control-c and got cygwin prompt)
> > Administrator@MACHINE-C /nutch-0.6
> 
> LOCKED UP is a very subjective term ;-) Don;t touch
> Ctrl-C, but instead please press Ctrl-Break for a full
> thread dump, copy it and send it here.
> 
> Also, the official 0.6 release is quite old, you should
> probably try the newer version (one of the nightly
> builds), and see if the problem persists.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _
>   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic
> Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 


Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by Andrzej Bialecki <ab...@getopt.org>.
quovadis@webmail.co.za wrote:

> (LOCKED UP - pressed control-c and got cygwin prompt)
> Administrator@MACHINE-C /nutch-0.6

LOCKED UP is a very subjective term ;-) Don't touch Ctrl-C, but instead
please press Ctrl-Break for a full thread dump, copy it and send it here.

Also, the official 0.6 release is quite old, you should probably try the 
newer version (one of the nightly builds), and see if the problem persists.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by qu...@webmail.co.za.
Windows 2000 Standard Server 4GB Ram
J2SE v1.4.2_08 (java_opts: -ms 512mb ram, xmx 1024mb ram)
Nutch v0.6  (All components standard incl. fetcher)

The fetcher does its usual thing; the interesting thing is
that it usually locks up only at the end (or near the end)
of a segment being fetched - be it 100, 1000 or 10000
URLs. If I set the fetcher threads to 5 then it works but
obviously is slow - if I set it to 100 it flies but locks
up randomly near the end of fetching a segment.

The last 3 lines before fetcher dies:
050623 114407 fetch of http://www.veza.co.za/ failed with: net.nutch.protocol.http.HttpException: java.net.SocketTimeoutException: Read timed out
050623 114410 fetch of http://www.visserinc.co.za/ failed with: net.nutch.protocol.http.HttpException: java.net.SocketTimeoutException: Read timed out
050623 114414 fetch of http://www.webdesigner.co.za/ failed with: net.nutch.protocol.http.HttpException: java.net.SocketTimeoutException: Read timed out
(LOCKED UP - pressed control-c and got cygwin prompt)
Administrator@MACHINE-C /nutch-0.6
$

Is there any other information you need?
Thanks in advance

On Thu, 23 Jun 2005 11:16:12 +0200
 Andrzej Bialecki <ab...@getopt.org> wrote:
> quovadis@webmail.co.za wrote:
> > Most of the time the following error occurs near the end
> > just before the fetcher freezes/locks up:
> > java.net.SocketTimeoutException: Read timed out
> > 
> > Any1 have any ideas?
> 
> Please do a full thread dump (Ctrl-\ on Unix, Ctrl-Break
> on Windows), and also provide us with more details about
> your environment (nutch version / revision, which HTTP
> plugin, OS and JDK version). If there are stacktraces at
> the end, please send them too.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _
>   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic
> Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 


Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by Andrzej Bialecki <ab...@getopt.org>.
quovadis@webmail.co.za wrote:
> Most of the time the following error occurs near the end
> just before the fetcher freezes/locks up:
> java.net.SocketTimeoutException: Read timed out
> 
> Any1 have any ideas?

Please do a full thread dump (Ctrl-\ on Unix, Ctrl-Break on Windows),
and also provide us with more details about your environment (nutch
version / revision, which HTTP plugin, OS and JDK version). If there are
stacktraces at the end, please send them too.
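
If the console key combination is awkward to use (for example under Cygwin),
a similar dump can also be produced from inside the JVM. A minimal sketch,
assuming Java 5 or later: Thread.getAllStackTraces does not exist in J2SE 1.4,
so this would not run on the 1.4.2 setup reported in this thread. ThreadDumper
is a hypothetical name, not a Nutch class.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch. */
class ThreadDumper {

    /** Builds a plain-text listing of every live thread and its current stack. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Printing ThreadDumper.dumpAllThreads() to System.err from a watchdog thread
while the fetcher appears stuck would show which call each fetcher thread is
blocked in.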


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch Lockup/Freeze (Fetcher) - HELP!!

Posted by qu...@webmail.co.za.
Most of the time the following error occurs near the end
just before the fetcher freezes/locks up:
java.net.SocketTimeoutException: Read timed out

Does anyone have any ideas?


On Thu, 23 Jun 2005 09:40:29 +0300
 Juho Mäkinen <ju...@gmail.com> wrote:
> I have also noticed similar problems here. I'm running only one fetching thread
> and the fetching randomly stops for some reason. I once managed
> to open the lock by restarting the apache server which the fetcher
> was crawling, but that's just once :(
> 
> I also don't see any problems with dns queries, so your idea didn't work
> here either. It's strange, because nutch should have a socket
> timeout, which works in most cases, but not in these freezes.
> I'm still looking and studying what could cause this.
> 
>  - Juho Mäkinen, http://www.juhonkoti.net
> 
> On 6/23/05, Sami Siren <s....@sonera.inet.fi> wrote:
> > I have experienced similar random freezing in fetcher but after setting
> > up a local caching dns these problems went away.
> > 
> > At least in my case the problem was due to connectivity to (some) remote
> > name servers. You can verify if this is your problem by doing something
> > like netstat -na|grep ":53 " while you suspect to have a frozen fetch
> > and look for connections that will not go away.
> > --
> >   Sami Siren
> > 
> > quovadis@webmail.co.za wrote:
> > > Anyone experiencing freezes when fetching with 50 threads ?
> > > If I use 5 threads everything is fine - if i raise it to 10
> > > it freezes and random times when fetching a segment.
> > >
> > > Any ideas?
> > >
>
> > >
> > 
> >


Re: Nutch Lockup/Freeze (Fetcher)

Posted by Juho Mäkinen <ju...@gmail.com>.
I have also noticed similar problems here. I'm running only one fetching thread
and the fetching randomly stops for some reason. I once managed
to open the lock by restarting the apache server which the fetcher
was crawling, but that's just once :(

I also don't see any problems with dns queries, so your idea didn't work
here either. It's strange, because nutch should have a socket
timeout, which works in most cases, but not in these freezes.
I'm still looking and studying what could cause this.
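
One general point about socket timeouts, offered as background rather than as
a diagnosis of this freeze: a read timeout on a Java socket applies to each
blocking read separately, so a server that keeps sending a little data just
before every timeout can hold a connection open indefinitely. A minimal sketch
of an overall time limit for reading a body is below. DeadlineReader,
readWithDeadline and maxMillis are hypothetical names, not Nutch settings, and
the limit is only checked between reads, so it still depends on the per-read
socket timeout to bound each individual read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of Nutch. */
class DeadlineReader {

    /**
     * Reads the stream into memory, stopping after roughly maxMillis in total.
     * The result may be truncated if the limit is reached first.
     */
    static byte[] readWithDeadline(InputStream in, int bufferSize, long maxMillis)
            throws IOException {
        long deadline = System.nanoTime() + maxMillis * 1000000L;
        byte[] buffer = new byte[bufferSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
            if (System.nanoTime() - deadline >= 0) {
                break; // total time budget used up; return what we have
            }
        }
        return out.toByteArray();
    }
}
```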

 - Juho Mäkinen, http://www.juhonkoti.net

On 6/23/05, Sami Siren <s....@sonera.inet.fi> wrote:
> I have experienced similar random freezing in fetcher but after setting
> up a local caching dns these problems went away.
> 
> At least in my case the problem was due to connectivity to (some) remote
> name servers. You can verify if this is your problem by doing something
> like netstat -na|grep ":53 " while you suspect to have a frozen fetch
> and look for connections that will not go away.
> 
> --
>   Sami Siren
> 
> quovadis@webmail.co.za wrote:
> > Anyone experiencing freezes when fetching with 50 threads ?
> > If I use 5 threads everything is fine - if i raise it to 10
> > it freezes and random times when fetching a segment.
> >
> > Any ideas?
> >
> 
>

Re: Nutch Lockup/Freeze (Fetcher)

Posted by Sami Siren <s....@sonera.inet.fi>.
I have experienced similar random freezing in the fetcher, but after setting
up a local caching DNS these problems went away.

At least in my case the problem was due to connectivity to (some) remote
name servers. You can verify if this is your problem by doing something
like netstat -na|grep ":53 " while you suspect a fetch is frozen
and looking for connections that will not go away.
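
The netstat check looks at the problem from outside the JVM. A complementary
check from inside Java is to time a name lookup and give up after a limit. The
sketch below is only an illustration under stated assumptions: DnsProbe and
timeLookup are hypothetical names, and it assumes Java 8 or later
(java.util.concurrent and lambdas are not available on J2SE 1.4).

```java
import java.net.InetAddress;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of Nutch. */
class DnsProbe {

    /** Returns the lookup time in milliseconds, or -1 on failure or timeout. */
    static long timeLookup(final String host, long timeoutMillis) {
        // Daemon thread, so a lookup that never returns cannot keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "dns-probe");
            t.setDaemon(true);
            return t;
        });
        long start = System.nanoTime();
        try {
            Callable<InetAddress> task = () -> InetAddress.getByName(host);
            Future<InetAddress> result = pool.submit(task);
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return (System.nanoTime() - start) / 1000000L;
        } catch (Exception e) {
            return -1; // unknown host, timeout, or interruption
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Running DnsProbe.timeLookup against a handful of hosts from the fetchlist
while a fetch looks frozen would show whether lookups are slow or hanging.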

--
  Sami Siren

quovadis@webmail.co.za wrote:
> Anyone experiencing freezes when fetching with 50 threads ?
> If I use 5 threads everything is fine - if i raise it to 10
> it freezes and random times when fetching a segment.
> 
> Any ideas?
>