Posted to user@nutch.apache.org by Rajesh Munavalli <fi...@gmail.com> on 2006/04/05 17:22:06 UTC

details: stackoverflow error

I had earlier posted this message to the list but haven't got any response.
Here are more details.

Nutch version: nutch-0.7.2
URL File: contains a single URL. File name: "urls"
Crawl-url-filter: is set to grab all URLs

Command: bin/nutch crawl urls -dir crawl.test -depth 3
Error: java.lang.StackOverflowError

The error occurs while it executes the "UpdateDatabaseTool".

One solution I can think of is to provide more stack memory. But is there a
better solution to this?

Thanks,

Rajesh

Re: details: stackoverflow error

Posted by Jérôme Charron <je...@gmail.com>.
> > Stefan, do you refer to NUTCH-233?
> No:
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200603.mbox/%3C6F8149F4-E383-4647-B03F-B4C53467B9D3@media-style.com%3E

OK.
What about the dichotomy approach you suggested to Doug in order to find
the record that fails?
(do you know if it is implemented in Hadoop? Or planned to be?)
I agree with Doug that it is more a Hadoop issue than a Nutch one.
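
For reference, a hedged sketch of that dichotomy (bisection) idea; the
Processor interface and all names below are made up for illustration, this is
not Hadoop or Nutch code, and it assumes the failure is reproducible from a
single record on its own:

import java.util.List;

// Illustrative only: given a batch where processing one record fails,
// repeatedly re-run the half that still fails until one record remains.
public class FailingRecordBisector {

  public interface Processor {
    void process(List batch) throws Exception;    // throws (or errors) on the bad record
  }

  public static Object findFailingRecord(List records, Processor p) {
    List candidates = records;
    while (candidates.size() > 1) {
      int mid = candidates.size() / 2;
      List firstHalf = candidates.subList(0, mid);
      // keep whichever half still reproduces the failure
      candidates = fails(p, firstHalf)
          ? firstHalf
          : candidates.subList(mid, candidates.size());
    }
    return candidates.isEmpty() ? null : candidates.get(0);
  }

  static boolean fails(Processor p, List batch) {
    try {
      p.process(batch);
      return false;
    } catch (Throwable t) {                        // Throwable also catches StackOverflowError
      return true;
    }
  }
}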

Re: details: stackoverflow error

Posted by Jérôme Charron <je...@gmail.com>.
> >> Perhaps we could enhance the logic of the loop at Fetcher.java:
> >> 320. Currently this exits the fetcher when all threads exceed a
> >> timeout. Instead it could kill any thread that exceeds the
> >> timeout, and restart a new thread to replace it.  So instead of
> >> just keeping a count of fetcher threads, we could maintain a table
> >> of all running fetcher threads, each with a lastRequestStart time,
> >> rather than a global lastRequestStart. Then, in this loop, we can
> >> check to see if any thread has exceeded the maximum timeout, and,
> >> if it has, kill it and start a new thread.  When no urls remain,
> >> threads will exit and remove themselves from the set of threads,
> >> so the loop can exit as it does now, when there are no more
> >> running fetcher threads.  Does this make sense?  It would prevent
> >> all sorts of thread hangs, not just in regexes.
> >
> > +1, sounds like a good solution to this.
>
> +1 a much better solution than my suggestion!

+1. Who takes it?

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/

Re: details: stackoverflow error

Posted by Stefan Groschupf <sg...@media-style.com>.
> Doug Cutting wrote:
>> Perhaps we could enhance the logic of the loop at Fetcher.java: 
>> 320. Currently this exits the fetcher when all threads exceed a  
>> timeout. Instead it could kill any thread that exceeds the  
>> timeout, and restart a new thread to replace it.  So instead of  
>> just keeping a count of fetcher threads, we could maintain a table  
>> of all running fetcher threads, each with a lastRequestStart time,  
>> rather than a global lastRequestStart. Then, in this loop, we can  
>> check to see if any thread has exceeded the maximum timeout, and,  
>> if it has, kill it and start a new thread.  When no urls remain,  
>> threads will exit and remove themselves from the set of threads,  
>> so the loop can exit as it does now, when there are no more  
>> running fetcher threads.  Does this make sense?  It would prevent  
>> all sorts of thread hangs, not just in regexes.
>
> +1, sounds like a good solution to this.

+1 a much better solution than my suggestion!


Re: details: stackoverflow error

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Perhaps we could enhance the logic of the loop at Fetcher.java:320. 
> Currently this exits the fetcher when all threads exceed a timeout. 
> Instead it could kill any thread that exceeds the timeout, and restart 
> a new thread to replace it.  So instead of just keeping a count of 
> fetcher threads, we could maintain a table of all running fetcher 
> threads, each with a lastRequestStart time, rather than a global 
> lastRequestStart. Then, in this loop, we can check to see if any 
> thread has exceeded the maximum timeout, and, if it has, kill it and 
> start a new thread.  When no urls remain, threads will exit and remove 
> themselves from the set of threads, so the loop can exit as it does 
> now, when there are no more running fetcher threads.  Does this make 
> sense?  It would prevent all sorts of thread hangs, not just in regexes.

+1, sounds like a good solution to this.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: details: stackoverflow error

Posted by Doug Cutting <cu...@apache.org>.
Stefan Groschupf wrote:
>>> I already suggested to add a kind of timeout mechanism here and had
>>> done this for my installation,
>>> however the patch  suggestion was rejected since it was a 'non
>>> reproducible' problem.
>>>
>> Stefan, do you refer to NUTCH-233?
> 
> No:
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200603.mbox/%3C6F8149F4-E383-4647-B03F-B4C53467B9D3@media-style.com%3E

I don't think that's why it was rejected.  Spawning an extra thread for 
every url and every rule is pretty crude.  Hadoop should indeed have a 
better mechanism to handle this sort of thing, but there's no reason we 
cannot also first fix this in the fetcher.

Perhaps we could enhance the logic of the loop at Fetcher.java:320. 
Currently this exits the fetcher when all threads exceed a timeout. 
Instead it could kill any thread that exceeds the timeout, and restart a 
new thread to replace it.  So instead of just keeping a count of fetcher 
threads, we could maintain a table of all running fetcher threads, each 
with a lastRequestStart time, rather than a global lastRequestStart. 
Then, in this loop, we can check to see if any thread has exceeded the 
maximum timeout, and, if it has, kill it and start a new thread.  When 
no urls remain, threads will exit and remove themselves from the set of 
threads, so the loop can exit as it does now, when there are no more 
running fetcher threads.  Does this make sense?  It would prevent all 
sorts of thread hangs, not just in regexes.

Doug
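
For reference, a minimal self-contained sketch of the per-thread timeout table
described above. All class, field, and method names here are illustrative
stand-ins, not the actual Fetcher.java code, and interrupt() is used only as an
approximation of "kill":

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Each fetcher thread records when its current request started, and a monitor
// loop replaces any thread that has been stuck longer than the timeout.
public class FetcherMonitorSketch {

  static final long TIMEOUT_MS = 5000;   // max time allowed per request (illustrative)
  static final int THREAD_COUNT = 3;

  // table of running fetcher threads -> time their current request started
  static final Map running = new HashMap();

  // stand-in for fetching one url; a real fetcher would do protocol work here
  static void fetchOneUrl() throws InterruptedException {
    Thread.sleep(100);
  }

  static Thread startFetcherThread() {
    Thread t = new Thread(new Runnable() {
      public void run() {
        try {
          for (int i = 0; i < 20; i++) {            // pretend 20 urls remain
            synchronized (running) {                // record per-thread lastRequestStart
              running.put(Thread.currentThread(), new Long(System.currentTimeMillis()));
            }
            fetchOneUrl();
          }
        } catch (InterruptedException e) {
          // interrupted by the monitor: treated as "killed"
        } finally {
          synchronized (running) {                  // remove ourselves when done
            running.remove(Thread.currentThread());
          }
        }
      }
    });
    t.start();
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    for (int i = 0; i < THREAD_COUNT; i++) startFetcherThread();

    while (true) {
      Thread.sleep(1000);
      synchronized (running) {
        if (running.isEmpty()) break;               // no more running fetcher threads
        long now = System.currentTimeMillis();
        Iterator it = new HashMap(running).entrySet().iterator();
        while (it.hasNext()) {
          Map.Entry e = (Map.Entry) it.next();
          Thread t = (Thread) e.getKey();
          long started = ((Long) e.getValue()).longValue();
          if (now - started > TIMEOUT_MS) {         // this thread exceeded the timeout:
            t.interrupt();                          // "kill" it
            running.remove(t);
            startFetcherThread();                   // and start a replacement
          }
        }
      }
    }
  }
}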

Re: details: stackoverflow error

Posted by Stefan Groschupf <sg...@media-style.com>.
On 07.04.2006, at 22:13, Jérôme Charron wrote:

>> I already suggested to add a kind of timeout mechanism here and had
>> done this for my installation,
>> however the patch  suggestion was rejected since it was a 'non
>> reproducible' problem.
>>
> Stefan, do you refer to NUTCH-233?
No:
http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200603.mbox/%3C6F8149F4-E383-4647-B03F-B4C53467B9D3@media-style.com%3E



Re: details: stackoverflow error

Posted by Jérôme Charron <je...@gmail.com>.
> I already suggested to add a kind of timeout mechanism here and had
> done this for my installation,
> however the patch  suggestion was rejected since it was a 'non
> reproducible' problem.
>
Stefan, do you refer to NUTCH-233?
It is not rejected, but simply postponed... waiting for more feedback...

Regards

Jérôme

Re: details: stackoverflow error

Posted by Stefan Groschupf <sg...@media-style.com>.
I already suggested adding a kind of timeout mechanism here and had done
this for my installation; however, the patch suggestion was rejected since
it was a 'non-reproducible' problem.

:-/

On 07.04.2006, at 21:55, Rajesh Munavalli wrote:

> Hi Piotr,
>          Thanks for the help. I think I found the source of the  
> error. It
> was in the "crawl-urlfilter.txt".
>
> I had the following regular expression to grab all the URLs:
> +^http://([a-z0-9]*\.)*(a-z0-9*)*
>
> The regex evaluation must have run into an infinite loop.
>
> Thanks,
>
> Rajesh
>
>
> On 4/7/06, Piotr Kosiorowski <pk...@gmail.com> wrote:
>>
>> Hello Rajesh,
>> I have run  bin/nutch crawl urls -dir crawl.test -depth 3
>> on standard nutch-0.7.2 setup.
>> The urls file contains http://www.math.psu.edu/MathLists/Contents.html only.
>> In crawl-urlfilter I have changed the url pattern to:
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://
>>
>> JVM: java version "1.4.2_06"
>> Linux
>>
>> It runs without problems.
>> Please reinstall from the distribution, make only the required changes, and
>> retest. If it fails, we will try to track it down again.
>> Regards
>> Piotr
>>
>>
>>
>> Rajesh Munavalli wrote:
>>> Forgot to mention one more parameter. Modify the crawl-urlfilter to
>> accept
>>> any URL.
>>>
>>> On 4/6/06, Rajesh Munavalli <fi...@gmail.com> wrote:
>>>>  Java version: JSDK 1.4.2_08
>>>> URL Seed: http://www.math.psu.edu/MathLists/Contents.html
>>>>
>>>> I even tried allocating more stack memory using "-Xss", process  
>>>> memory
>>>> "-Xms" option. However, if I run the individual tools  
>>>> (fetchlisttool,
>>>> fetcher, updatedb..etc) separately from the shell, it works fine.
>>>>
>>>> Thanks,
>>>>  --Rajesh
>>>>
>>>>
>>>>
>>>> On 4/6/06, Piotr Kosiorowski <pk...@gmail.com> wrote:
>>>>> Which Java version do you use?
>>>>> Is it the same for all urls or only for specific one?
>>>>> If URL you are trying to crawl is public you can send it to me  
>>>>> (off
>> list
>>>>>
>>>>> if you wish) and I can check it on my machine.
>>>>> Regards
>>>>> Piotr
>>>>>
>>>>> Rajesh Munavalli wrote:
>>>>>> I had earlier posted this message to the list but haven't got any
>>>>> response.
>>>>>> Here are more details.
>>>>>>
>>>>>> Nutch version: nutch-0.7.2
>>>>>> URL File: contains a single URL. File name: "urls"
>>>>>> Crawl-url-filter: is set to grab all URLs
>>>>>>
>>>>>> Command: bin/nutch crawl urls -dir crawl.test -depth 3
>>>>>> Error: java.lang.StackOverflowError
>>>>>>
>>>>>> The error occurs while it executes the "UpdateDatabaseTool".
>>>>>>
>>>>>> One solution I can think of is to provide more stack memory.  
>>>>>> But is
>>>>> there a
>>>>>> better solution to this?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Rajesh
>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: details: stackoverflow error

Posted by Rajesh Munavalli <fi...@gmail.com>.
Hi Piotr,
         Thanks for the help. I think I found the source of the error. It
was in the "crawl-urlfilter.txt".

I had the following regular expression to grab all the URLs:
+^http://([a-z0-9]*\.)*(a-z0-9*)*

The regex evaluation must have run into an infinite loop.

Thanks,

Rajesh
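
For reference, a small self-contained demo of why that pattern is risky. It
uses java.util.regex, which is not necessarily the engine Nutch's urlfilter
plugin uses, and the long test URL is synthetic; depending on the engine, the
input, and the stack size, the match may merely be slow or may end in the same
StackOverflowError, so the attempt is wrapped:

import java.util.regex.Pattern;

public class UrlFilterRegexDemo {
  public static void main(String[] args) {
    String risky = "^http://([a-z0-9]*\\.)*(a-z0-9*)*";  // pattern from crawl-urlfilter.txt
    String safer = "^http://";                           // accept-all, no nested quantifiers

    StringBuffer sb = new StringBuffer("http://");
    for (int i = 0; i < 100000; i++) sb.append("a.");    // synthetic url with many host labels
    String url = sb.toString();

    try {
      boolean matched = Pattern.compile(risky).matcher(url).find();
      System.out.println("risky pattern finished, matched=" + matched);
    } catch (StackOverflowError e) {
      System.out.println("risky pattern exhausted the stack");
    }

    System.out.println("safer pattern matched="
        + Pattern.compile(safer).matcher(url).find());
  }
}

A pattern without nested quantifiers, for example the plain +^http:// that
Piotr used, avoids the problem entirely.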


On 4/7/06, Piotr Kosiorowski <pk...@gmail.com> wrote:
>
> Hello Rajesh,
> I have run  bin/nutch crawl urls -dir crawl.test -depth 3
> on standard nutch-0.7.2 setup.
> The urls file contains http://www.math.psu.edu/MathLists/Contents.html only.
> In crawl-urlfilter I have changed the url pattern to:
> # accept hosts in MY.DOMAIN.NAME
> +^http://
>
> JVM: java version "1.4.2_06"
> Linux
>
> It runs without problems.
> Please reinstall from the distribution, make only the required changes, and
> retest. If it fails, we will try to track it down again.
> Regards
> Piotr
>
>
>
> Rajesh Munavalli wrote:
> > Forgot to mention one more parameter. Modify the crawl-urlfilter to
> accept
> > any URL.
> >
> > On 4/6/06, Rajesh Munavalli <fi...@gmail.com> wrote:
> >>  Java version: JSDK 1.4.2_08
> >> URL Seed: http://www.math.psu.edu/MathLists/Contents.html
> >>
> >> I even tried allocating more stack memory using "-Xss", process memory
> >> "-Xms" option. However, if I run the individual tools (fetchlisttool,
> >> fetcher, updatedb..etc) separately from the shell, it works fine.
> >>
> >> Thanks,
> >>  --Rajesh
> >>
> >>
> >>
> >> On 4/6/06, Piotr Kosiorowski <pk...@gmail.com> wrote:
> >>> Which Java version do you use?
> >>> Is it the same for all urls or only for specific one?
> >>> If URL you are trying to crawl is public you can send it to me (off
> list
> >>>
> >>> if you wish) and I can check it on my machine.
> >>> Regards
> >>> Piotr
> >>>
> >>> Rajesh Munavalli wrote:
> >>>> I had earlier posted this message to the list but haven't got any
> >>> response.
> >>>> Here are more details.
> >>>>
> >>>> Nutch version: nutch-0.7.2
> >>>> URL File: contains a single URL. File name: "urls"
> >>>> Crawl-url-filter: is set to grab all URLs
> >>>>
> >>>> Command: bin/nutch crawl urls -dir crawl.test -depth 3
> >>>> Error: java.lang.StackOverflowError
> >>>>
> >>>> The error occurs while it executes the "UpdateDatabaseTool".
> >>>>
> >>>> One solution I can think of is to provide more stack memory. But is
> >>> there a
> >>>> better solution to this?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Rajesh
> >>>>
> >>>
> >>>
> >>>
> >
>
>

Re: details: stackoverflow error

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello Rajesh,
I have run  bin/nutch crawl urls -dir crawl.test -depth 3
on standard nutch-0.7.2 setup.
The urls file contains http://www.math.psu.edu/MathLists/Contents.html only.
In crawl-urlfilter I have changed the url pattern to:
# accept hosts in MY.DOMAIN.NAME
+^http://

JVM: java version "1.4.2_06"
Linux

It runs without problems.
Please reinstall from the distribution, make only the required changes, and
retest. If it fails, we will try to track it down again.
Regards
Piotr
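
For reference, an accept-everything crawl-urlfilter.txt along these lines might
look roughly as follows; the skip rules are paraphrased and the exact suffix
list is illustrative only, not the verbatim stock 0.7.2 file:

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip some binary suffixes we cannot parse (illustrative list)
-\.(gif|jpg|jpeg|png|ico|css|zip|gz|pdf|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept any other http url
+^http://

# skip everything else
-.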



Rajesh Munavalli wrote:
> Forgot to mention one more parameter. Modify the crawl-urlfilter to accept
> any URL.
> 
> On 4/6/06, Rajesh Munavalli <fi...@gmail.com> wrote:
>>  Java version: JSDK 1.4.2_08
>> URL Seed: http://www.math.psu.edu/MathLists/Contents.html
>>
>> I even tried allocating more stack memory using "-Xss", process memory
>> "-Xms" option. However, if I run the individual tools (fetchlisttool,
>> fetcher, updatedb..etc) separately from the shell, it works fine.
>>
>> Thanks,
>>  --Rajesh
>>
>>
>>
>> On 4/6/06, Piotr Kosiorowski <pk...@gmail.com> wrote:
>>> Which Java version do you use?
>>> Is it the same for all urls or only for specific one?
>>> If URL you are trying to crawl is public you can send it to me (off list
>>>
>>> if you wish) and I can check it on my machine.
>>> Regards
>>> Piotr
>>>
>>> Rajesh Munavalli wrote:
>>>> I had earlier posted this message to the list but haven't got any
>>> response.
>>>> Here are more details.
>>>>
>>>> Nutch version: nutch-0.7.2
>>>> URL File: contains a single URL. File name: "urls"
>>>> Crawl-url-filter: is set to grab all URLs
>>>>
>>>> Command: bin/nutch crawl urls -dir crawl.test -depth 3
>>>> Error: java.lang.StackOverflowError
>>>>
>>>> The error occurs while it executes the "UpdateDatabaseTool".
>>>>
>>>> One solution I can think of is to provide more stack memory. But is
>>> there a
>>>> better solution to this?
>>>>
>>>> Thanks,
>>>>
>>>> Rajesh
>>>>
>>>
>>>
>>>
> 


Re: details: stackoverflow error

Posted by Rajesh Munavalli <fi...@gmail.com>.
Forgot to mention one more parameter. Modify the crawl-urlfilter to accept
any URL.

On 4/6/06, Rajesh Munavalli <fi...@gmail.com> wrote:
>
>  Java version: JSDK 1.4.2_08
> URL Seed: http://www.math.psu.edu/MathLists/Contents.html
>
> I even tried allocating more stack memory using "-Xss", process memory
> "-Xms" option. However, if I run the individual tools (fetchlisttool,
> fetcher, updatedb..etc) separately from the shell, it works fine.
>
> Thanks,
>  --Rajesh
>
>
>
> On 4/6/06, Piotr Kosiorowski <pk...@gmail.com> wrote:
> >
> > Which Java version do you use?
> > Is it the same for all urls or only for specific one?
> > If URL you are trying to crawl is public you can send it to me (off list
> >
> > if you wish) and I can check it on my machine.
> > Regards
> > Piotr
> >
> > Rajesh Munavalli wrote:
> > > I had earlier posted this message to the list but haven't got any
> > response.
> > > Here are more details.
> > >
> > > Nutch version: nutch-0.7.2
> > > URL File: contains a single URL. File name: "urls"
> > > Crawl-url-filter: is set to grab all URLs
> > >
> > > Command: bin/nutch crawl urls -dir crawl.test -depth 3
> > > Error: java.lang.StackOverflowError
> > >
> > > The error occurs while it executes the "UpdateDatabaseTool".
> > >
> > > One solution I can think of is to provide more stack memory. But is
> > there a
> > > better solution to this?
> > >
> > > Thanks,
> > >
> > > Rajesh
> > >
> >
> >
> >
> >
>

Re: details: stackoverflow error

Posted by Rajesh Munavalli <fi...@gmail.com>.
Java version: JSDK 1.4.2_08
URL Seed: http://www.math.psu.edu/MathLists/Contents.html

I even tried allocating more stack memory using the "-Xss" option and more
process memory using "-Xms". However, if I run the individual tools
(fetchlisttool, fetcher, updatedb, etc.) separately from the shell, it works fine.

Thanks,
--Rajesh



On 4/6/06, Piotr Kosiorowski <pk...@gmail.com> wrote:
>
> Which Java version do you use?
> Is it the same for all urls or only for specific one?
> If URL you are trying to crawl is public you can send it to me (off list
> if you wish) and I can check it on my machine.
> Regards
> Piotr
>
> Rajesh Munavalli wrote:
> > I had earlier posted this message to the list but haven't got any
> response.
> > Here are more details.
> >
> > Nutch version: nutch-0.7.2
> > URL File: contains a single URL. File name: "urls"
> > Crawl-url-filter: is set to grab all URLs
> >
> > Command: bin/nutch crawl urls -dir crawl.test -depth 3
> > Error: java.lang.StackOverflowError
> >
> > The error occurs while it executes the "UpdateDatabaseTool".
> >
> > One solution I can think of is to provide more stack memory. But is
> there a
> > better solution to this?
> >
> > Thanks,
> >
> > Rajesh
> >
>
>
>
>

Re: details: stackoverflow error

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Which Java version do you use?
Is it the same for all urls or only for specific one?
If URL you are trying to crawl is public you can send it to me (off list 
if you wish) and I can check it on my machine.
Regards
Piotr

Rajesh Munavalli wrote:
> I had earlier posted this message to the list but haven't got any response.
> Here are more details.
> 
> Nutch version: nutch-0.7.2
> URL File: contains a single URL. File name: "urls"
> Crawl-url-filter: is set to grab all URLs
> 
> Command: bin/nutch crawl urls -dir crawl.test -depth 3
> Error: java.lang.StackOverflowError
> 
> The error occurs while it executes the "UpdateDatabaseTool".
> 
> One solution I can think of is to provide more stack memory. But is there a
> better solution to this?
> 
> Thanks,
> 
> Rajesh
>