Posted to dev@nutch.apache.org by Marko Bauhardt <mb...@media-style.com> on 2005/10/18 09:50:49 UTC

RegexUrlFilter hangs up

Hi all,
I use nutch-mapred from the svn branch. Sometimes the reduce job of
the fetch process hangs. The thread dump shows that the
RegexURLFilter is running at that point.
In regex-urlfilter.txt I commented out the line
#-[?*!@=]

because I want to fetch dynamic URLs such as JSP pages.
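
For readers unfamiliar with the rule format, here is a minimal sketch of how a rule file like regex-urlfilter.txt is evaluated: each active line is a '+' (accept) or '-' (reject) followed by a regular expression, and the first matching rule decides. It assumes java.util.regex in place of the ORO Perl5Matcher seen in the thread dump, and the class name, the example URL, and the trailing "+." rule are illustrative only, not taken from this message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a sign-prefixed regex URL filter; not the Nutch class.
public class UrlRuleSketch {
    // One rule: accept (+) or reject (-) plus a compiled pattern.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Parse rule lines, skipping blank lines and '#' comments.
    static List<Rule> parse(String[] lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // a commented-out rule is simply never applied
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // First matching rule wins; a URL that matches no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://example.com/page.jsp?id=1"; // illustrative URL
        String[] active = { "-[?*!@=]", "+." };
        String[] disabled = { "#-[?*!@=]", "+." };
        System.out.println(accepts(parse(active), url));   // false: '?' and '=' hit the reject rule
        System.out.println(accepts(parse(disabled), url)); // true: only "+." is left to match
    }
}
```

With the reject rule commented out, URLs containing '?' or '=' fall through to later rules, so every dynamic URL is now tested against the patterns that follow it in the file.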



Here is the thread dump.

051017 151123 reduce > reduce
Full thread dump Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode):

"MultiThreadedHttpConnectionManager cleanup" daemon prio=1 tid=0x08249fa0 nid=0x7645 in Object.wait() [6d489000..6d489868]
         at java.lang.Object.wait(Native Method)
         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:111)
         - locked <0x753a19c0> (a java.lang.ref.ReferenceQueue$Lock)
         at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQueueThread.run(MultiThreadedHttpConnectionManager.java:1100)

"Thread-1" prio=1 tid=0x082149b0 nid=0x7645 runnable [6efc3000..6efc3868]
         at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
         at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
         at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
         at org.apache.oro.text.regex.Perl5Matcher.__tryExpression(Unknown Source)
         at org.apache.oro.text.regex.Perl5Matcher.__interpret(Unknown Source)
         at org.apache.oro.text.regex.Perl5Matcher.contains(Unknown Source)
         at org.apache.oro.text.regex.Perl5Matcher.contains(Unknown Source)
         at org.apache.nutch.net.RegexURLFilter.filter(RegexURLFilter.java:114)
         - locked <0x753d8cc8> (a org.apache.nutch.net.RegexURLFilter)
         at org.apache.nutch.net.URLFilters.filter(URLFilters.java:77)
         at org.apache.nutch.crawl.ParseOutputFormat$1.write(ParseOutputFormat.java:71)
         at org.apache.nutch.crawl.FetcherOutputFormat$1.write(FetcherOutputFormat.java:78)
         at org.apache.nutch.mapred.ReduceTask$2.collect(ReduceTask.java:247)
         at org.apache.nutch.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
         at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)

"Signal Dispatcher" daemon prio=1 tid=0x080a6ff8 nid=0x7645 waiting on condition [0..0]

"Finalizer" daemon prio=1 tid=0x080933e8 nid=0x7645 in Object.wait() [70159000..70159868]
         at java.lang.Object.wait(Native Method)
         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:111)
         - locked <0x75350780> (a java.lang.ref.ReferenceQueue$Lock)
         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:127)
         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=1 tid=0x08091978 nid=0x7645 in Object.wait() [701da000..701da868]
         at java.lang.Object.wait(Native Method)
         at java.lang.Object.wait(Object.java:429)
         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:115)
         - locked <0x753507e8> (a java.lang.ref.Reference$Lock)

"main" prio=1 tid=0x0805c0d8 nid=0x7645 waiting on condition [bfffb000..bfffb41c]
         at java.lang.Thread.sleep(Native Method)
         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:294)
         at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:333)
         at org.apache.nutch.crawl.Fetcher.main(Fetcher.java:362)

"VM Thread" prio=1 tid=0x08090718 nid=0x7645 runnable

"VM Periodic Task Thread" prio=1 tid=0x6fb01420 nid=0x7645 waiting on condition

"Suspend Checker Thread" prio=1 tid=0x080a65f0 nid=0x7645 runnable



Re: RegexUrlFilter hangs up

Posted by Marko Bauhardt <mb...@media-style.com>.
On 18.10.2005 at 17:55, Doug Cutting wrote:

> What makes you think that the fetcher is hung?




The log file stopped getting new entries and the segment stopped
growing. I waited about 8 hours, but the last entry in my log file is
still '051017 151123 reduce > reduce'.
I use mapred on the local fs.
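
Since the thread dump shows the reduce thread inside the regex matcher, one way to narrow this down is to run each rule against a suspect URL under a time limit and see which one fails to return. The following is only a sketch under assumptions: it uses java.util.regex and java.util.concurrent (a newer JVM than the 1.4.2 in the dump) instead of ORO, and the class and method names are hypothetical. The message does not say which rule or URL was involved, so nothing here identifies the actual cause.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Hypothetical helper for testing one pattern against one URL with a time limit.
public class TimedMatchSketch {
    // Wrapper that lets an interrupt abort a long-running match:
    // the matcher reads input through charAt, which checks the interrupt flag.
    static final class Interruptible implements CharSequence {
        private final CharSequence inner;
        Interruptible(CharSequence inner) { this.inner = inner; }
        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new IllegalStateException("match interrupted");
            }
            return inner.charAt(index);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int start, int end) {
            return new Interruptible(inner.subSequence(start, end));
        }
        public String toString() { return inner.toString(); }
    }

    // Returns TRUE or FALSE for the match result, or null if the limit was hit.
    static Boolean findWithin(final Pattern pattern, final String input, long millis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = () -> pattern.matcher(new Interruptible(input)).find();
            Future<Boolean> future = pool.submit(task);
            try {
                return future.get(millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupts the worker so charAt aborts the match
                return null;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Pattern rule = Pattern.compile("[?*!@=]");        // the rule quoted in this thread
        String url = "http://example.com/page.jsp?id=1"; // illustrative URL
        Boolean result = findWithin(rule, url, 2000);
        System.out.println(result == null ? "timed out" : "matched: " + result);
    }
}
```

A null result for some rule and URL pair would point at that pattern as the one keeping the matcher busy; a quick true or false for every pair would suggest looking elsewhere.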



Re: RegexUrlFilter hangs up

Posted by Doug Cutting <cu...@nutch.org>.
What makes you think that the fetcher is hung?

Doug
