Posted to user@nutch.apache.org by Vishal Shah <vi...@rediff.co.in> on 2006/09/08 12:45:52 UTC

Reduce Error during fetch

Hi,
 
   I've been trying to get the Nutch fetcher to work for a couple of
days, but it always hangs on one of the reduce processes, and the job is
aborted. I am using numFetchers=24 during generate, and 24 map tasks and 6
reduce tasks during fetch, on a 3-machine cluster. The task that failed
was tried at least 3 times before the job was aborted.
 
  I looked into the logs on one of the machines with the failed tasks,
and I see these errors:
 
1) 2006-09-08 18:04:03,294 INFO  mapred.TaskTracker -
task_0003_r_000004_3: Task failed to report status for 608 seconds.
Killing.
 
2) 
java.lang.IllegalStateException
        at org.mortbay.jetty.servlet.ServletHttpResponse.getWriter(ServletHttpResponse.java:561)
        at org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:122)
        at org.apache.jasper.runtime.JspWriterImpl.flushBuffer(JspWriterImpl.java:115)
        at org.apache.jasper.runtime.PageContextImpl.release(PageContextImpl.java:190)
        at org.apache.jasper.runtime.JspFactoryImpl.internalReleasePageContext(JspFactoryImpl.java:115)
        at org.apache.jasper.runtime.JspFactoryImpl.releasePageContext(JspFactoryImpl.java:75)
        at org.apache.hadoop.mapred.getMapOutput_jsp._jspService(getMapOutput_jsp.java:100)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
 
Any idea where the problem is, and how to rectify it?
 
Regards,
 
-vishal.

RE: Reduce Error during fetch

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Mike,

  I tried removing one regex as described in
http://issues.apache.org/jira/browse/NUTCH-233

 I am not 100% sure that this is what eliminated the error, since a lot of
things have changed since then: the seed list, an updated Nutch trunk, and
I am now doing an internal crawl on my seeds. It's worth a shot to try
changing the regex as described, or removing it completely if you don't
need that kind of thing.
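
To narrow down which filter regex is responsible, one option is to time each
pattern against a suspect URL outside the crawl. The sketch below is in Python
rather than Java, so its timings are only indicative of how Nutch's regex
filter would behave. The helper name `time_search` and the sample pattern are
made up for illustration and are not the NUTCH-233 pattern.

```python
import re
import time
from typing import Tuple


def time_search(pattern: str, text: str) -> Tuple[bool, float]:
    """Run one regex search and return (matched, elapsed_seconds)."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    matched = compiled.search(text) is not None
    elapsed = time.perf_counter() - start
    return matched, elapsed


if __name__ == "__main__":
    # Hypothetical filter pattern and URL, for illustration only.
    sample_pattern = r"\.(gif|jpg|png)$"
    sample_url = "http://example.com/images/logo.gif"
    matched, elapsed = time_search(sample_pattern, sample_url)
    print(f"matched={matched} elapsed={elapsed:.6f}s")
```

A pattern whose elapsed time grows sharply as the URL gets longer is a likely
candidate for simplifying or removing.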
 
Regards,
 
Vishal.
 
-----Original Message-----
From: Mike Smith [mailto:mike.smith.dev@gmail.com] 
Sent: Wednesday, October 18, 2006 4:50 AM
To: nutch-user@lucene.apache.org; vishals@rediff.co.in
Subject: Re: Reduce Error during fetch
 
Hi Vishal, 
 
I am experiencing the same problem. It gets stuck in the reduce stage
and finally fails with a timeout. Did removing or simplifying the regex
solve the problem?
 
Thanks, Mike 

 
 

Re: Reduce Error during fetch

Posted by Mike Smith <mi...@gmail.com>.
Hi Vishal,

I am experiencing the same problem. It gets stuck in the reduce stage and
finally fails with a timeout. Did removing or simplifying the regex solve
the problem?

Thanks, Mike


On 9/11/06, Vishal Shah <vi...@rediff.co.in> wrote:
>
> Hi Dennis,
>
>   Thanks for the reply. I can't avoid using the regex matching, I have
> some patterns in the hostname that can't be matched using either prefix
> or suffix filters. However, I will try it your way using simpler regexes
> just to test your theory.
>
> Regards,
>
> -vishal.

RE: Reduce Error during fetch

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Dennis,

   Thanks for the reply. I can't avoid regex matching; I have some
patterns in the hostname that can't be matched using either prefix
or suffix filters. However, I will try it your way with simpler regexes,
just to test your theory.
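
One way to keep regex matching while limiting its cost is to apply a simple
pattern to the hostname only instead of the whole URL. The following Python
sketch assumes a made-up hostname pattern (`HOST_PATTERN`) and helper name
(`host_matches`); it is not Nutch code, and the actual patterns in use are not
given in this thread.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: hosts such as news1.example.com or news42.example.com.
# It has no nested quantifiers, so there is little room for heavy backtracking.
HOST_PATTERN = re.compile(r"news\d+\.example\.com")


def host_matches(url: str) -> bool:
    """Match the pattern against the hostname only, ignoring path and query."""
    host = urlsplit(url).hostname  # None when the URL has no network location
    if host is None:
        return False
    # fullmatch requires the whole hostname to match, not just a substring.
    return HOST_PATTERN.fullmatch(host) is not None


if __name__ == "__main__":
    for candidate in ("http://news12.example.com/a/b?c=d",
                      "http://www.example.com/news12.example.com/"):
        print(candidate, host_matches(candidate))
```

Because the path and query string are never fed to the regex, long or
repetitive paths cannot slow the match down.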

Regards,

-vishal.


-----Original Message-----
From: Dennis Kubes [mailto:nutch-dev@dragonflymc.com] 
Sent: Friday, September 08, 2006 11:30 PM
To: nutch-user@lucene.apache.org
Subject: Re: Reduce Error during fetch

You may be running into problems with regex stalls on filtering.  Try
removing the regex filter from the nutch-site.xml plugin.includes
property.  I was having similar problems before switching to just use
prefix and suffix filters as below.  I attached my prefix and suffix url
filter files that go in conf.  I am only indexing http files so you may
need to modify these.  Hope this helps.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(suffix|prefix)|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

Dennis



Re: Reduce Error during fetch

Posted by Dennis Kubes <nu...@dragonflymc.com>.
You may be running into problems with regex stalls on filtering.  Try 
removing the regex filter from the nutch-site.xml plugin.includes 
property.  I was having similar problems before switching to just use 
prefix and suffix filters as below.  I attached my prefix and suffix url 
filter files that go in conf.  I am only indexing http files so you may 
need to modify these.  Hope this helps.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(suffix|prefix)|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
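
As a rough illustration of why prefix and suffix filters cannot stall the way
a backtracking regex can, the checks can be sketched with plain string
comparisons. This is a Python sketch, not the code of the urlfilter-prefix or
urlfilter-suffix plugins. The rule lists and the function name `accept_url`
are made up for the example, and treating the suffixes as a reject list is an
assumption of the sketch.

```python
# Hypothetical rule lists for illustration only; not taken from the thread.
ALLOWED_PREFIXES = ("http://",)
REJECTED_SUFFIXES = (".gif", ".jpg", ".zip", ".exe")


def accept_url(url: str) -> bool:
    """Return True if the URL passes both the prefix and the suffix check."""
    lowered = url.lower()
    # str.startswith / str.endswith accept a tuple of candidates and only
    # compare characters at the start or end, so there is no backtracking.
    if not lowered.startswith(ALLOWED_PREFIXES):
        return False
    if lowered.endswith(REJECTED_SUFFIXES):
        return False
    return True


if __name__ == "__main__":
    for candidate in ("http://example.com/index.html",
                      "http://example.com/logo.GIF",
                      "ftp://example.com/file.txt"):
        print(candidate, accept_url(candidate))
```

The cost of each check is bounded by the lengths of the URL and the rule
strings, which is what makes this kind of filter predictable on unusual URLs.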

Dennis
