Posted to user@nutch.apache.org by Koch Martina <Ko...@huberverlag.de> on 2009/02/12 16:16:22 UTC

Fetcher2 crashes with current trunk

Hi all,

we use the current trunk as of 2009-02-04 with the patch for CrawlDbMerger (NUTCH-683) applied manually.
We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
When we use Fetcher2, we can run this cycle four times in a row without any problems. When we start the fifth cycle, the Injector crashes with the following error log:

2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
java.lang.RuntimeException: java.lang.NullPointerException
       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
       at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
       at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
       at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
       at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.NullPointerException
       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
       ... 13 more
2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.crawl.Injector.main(Injector.java:180)
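
Looking at the trace, the NPE comes out of MapWritable.readFields(), i.e. while the value classes stored in a CrawlDatum's metadata map are re-instantiated via ReflectionUtils.newInstance(). Just to show where those calls sit, a plain Writable round-trip over the same code path looks like this (only a sketch; the key and value are made up and not taken from our crawl):

    import org.apache.hadoop.io.DataInputBuffer;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;

    public class MapWritableRoundTrip {
      public static void main(String[] args) throws Exception {
        MapWritable meta = new MapWritable();
        meta.put(new Text("example-key"), new Text("example-value")); // made-up entry
        DataOutputBuffer out = new DataOutputBuffer();
        meta.write(out);                      // serializes the class ids plus the entries
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        MapWritable copy = new MapWritable();
        copy.readFields(in);                  // re-creates each value via ReflectionUtils.newInstance()
        System.out.println(copy.size());      // prints 1
      }
    }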

After that the crawldb is broken and can no longer be accessed, e.g. with the readdb <crawldb> -stats command.
When we use Fetcher instead of Fetcher2 for exactly the same task, we can run as many cycles as we like without any problems or crashes.
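
For reference, one cycle as we run it looks roughly like this (paths simplified; the Fetcher2 line just runs the class through bin/nutch, and fetcher options such as -threads or -noParsing are left out):

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
    SEG=crawl/segments/`ls crawl/segments | tail -1`
    bin/nutch org.apache.nutch.fetcher.Fetcher2 $SEG   # or: bin/nutch fetch $SEG for the old Fetcher
    bin/nutch parse $SEG
    bin/nutch updatedb crawl/crawldb $SEG
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch readdb crawl/crawldb -stats               # the check that fails once the db is corrupt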

Besides this error, we've observed that the fetch cycle with Fetcher is about twice as fast as with Fetcher2, although we use exactly the same settings in nutch-site.xml (sketched as XML below, after the list):
generate.max.per.host  - 100
fetcher.threads.per.host - 1
fetcher.server.delay - 0
for an initial url list with 30 URLs of different hosts.
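
In nutch-site.xml that is, roughly:

    <property>
      <name>generate.max.per.host</name>
      <value>100</value>
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>0</value>
    </property>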

Has anybody observed similar errors or performance issues?

Kind regards,
Martina

Re: Fetcher2 crashes with current trunk

Posted by Koch Martina <Ko...@huberverlag.de>.
Hi,

all crawls we performed over the weekend were fine, no crawldb crash - perfect!
But I still see warnings like these in the log:

2009-02-23 09:18:19,221 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www3.daserste.de/forum/showthread.php?t=1200427&goto=newpost
2009-02-23 00:51:56,113 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
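
In case it is useful, we just grep them out of the logs to see how often they occur (the log path depends on the setup):

    grep -c "Can't read fetch time" logs/hadoop.log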

Something to be concerned about?

Kind regards,
Martina






-----Original Message-----
From: Doğacan Güney [mailto:dogacan@gmail.com]
Sent: Friday, 20 February 2009 15:27
To: nutch-user@lucene.apache.org
Subject: Re: Fetcher2 crashes with current trunk

Hi,

On Fri, Feb 20, 2009 at 13:03, Koch Martina <Ko...@huberverlag.de> wrote:
> Hi,
>
> I've applied the patch and run a couple of tests - so far without any crashes, that means the bug seems to be fixed. I'll keep testing over the weekend and report, if the error occurs again.
>
> Thank you very much for your time and help!
>

No problem :) I will wait over the weekend and commit the patch
if you do not encounter another error.

> Kind regards,
> Martina
>
>
> -----Original Message-----
> From: Doğacan Güney [mailto:dogacan@gmail.com]
> Sent: Friday, 20 February 2009 09:55
> To: nutch-user@lucene.apache.org
> Subject: Re: Fetcher2 crashes with current trunk
>
> Hi,
>
> Can you try again with the patch for NUTCH-698 ?
>
>
> --
> Doğacan Güney
>



-- 
Doğacan Güney

Re: Fetcher2 crashes with current trunk

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On Fri, Feb 20, 2009 at 13:03, Koch Martina <Ko...@huberverlag.de> wrote:
> Hi,
>
> I've applied the patch and run a couple of tests - so far without any crashes, that means the bug seems to be fixed. I'll keep testing over the weekend and report, if the error occurs again.
>
> Thank you very much for your time and help!
>

No problem :) I will wait over the weekend and commit the patch
if you do not encounter another error.

> Kind regards,
> Martina
>
>
> -----Original Message-----
> From: Doğacan Güney [mailto:dogacan@gmail.com]
> Sent: Friday, 20 February 2009 09:55
> To: nutch-user@lucene.apache.org
> Subject: Re: Fetcher2 crashes with current trunk
>
> Hi,
>
> Can you try again with the patch for NUTCH-698 ?
>
>
> --
> Doğacan Güney
>



-- 
Doğacan Güney

Re: Fetcher2 crashes with current trunk

Posted by Koch Martina <Ko...@huberverlag.de>.
Hi,

I've applied the patch and run a couple of tests - so far without any crashes, so the bug seems to be fixed. I'll keep testing over the weekend and report if the error occurs again.
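
For the record, applying it was nothing special (the patch file name is made up here; -p0 matches an svn-style diff):

    cd nutch-trunk
    patch -p0 < NUTCH-698.patch
    ant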

Thank you very much for your time and help!

Kind regards,
Martina


-----Original Message-----
From: Doğacan Güney [mailto:dogacan@gmail.com]
Sent: Friday, 20 February 2009 09:55
To: nutch-user@lucene.apache.org
Subject: Re: Fetcher2 crashes with current trunk

Hi,

Can you try again with the patch for NUTCH-698?


-- 
Doğacan Güney

Re: Fetcher2 crashes with current trunk

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

Can you try again with the patch for NUTCH-698?


-- 
Doğacan Güney

Re: Fetcher2 crashes with current trunk

Posted by Sami Siren <ss...@gmail.com>.
Doğacan Güney wrote:
> I think I have found the bug here, but I am in a hurry now, I will
> create a JIRA issue
> and post (what is hopefully) the fix later today.
>   

Great! thanks.

--
 Sami Siren
> On Tue, Feb 17, 2009 at 21:39, Doğacan Güney <do...@gmail.com> wrote:
>   
>> 2009/2/17 Sami Siren <ss...@gmail.com>:
>>     
>>> Do we have a Jira issue for this, seems like a blocker for 1.0 to me if it is reproducible.
>>>
>>>       
>> No we don't. But you are right that we should. I am very busy and I
>> forgot about it. I will
>> examine this problem in more detail tomorrow and will open an issue if
>> I can reproduce
>> the bug.
>>
>>     
>>> --
>>> Sami Siren
>>>
>>>
>>> Doğacan Güney wrote:
>>>       
>>>> Thanks for detailed analysis. I will take a look and get back to you.
>>>>
>>>> On Mon, Feb 16, 2009 at 13:41, Koch Martina <Ko...@huberverlag.de> wrote:
>>>>
>>>>         
>>>>> Hi,
>>>>>
>>>>> sorry for the late reply. We did some further digging and found that the error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just happens much later (after about 20 fetch cycles).
>>>>> We did many test runs, eliminated as much plugins as possible and identified URLs which are most likely to fail.
>>>>> With the following configuration we get a corrupt crawldb after two fetch2 cycles:
>>>>> - activated plugins: protocol-http, parse-html, feed
>>>>> - generate.max.per.host - 100
>>>>> - URLs to fetch:
>>>>> http://www.prosieben.de/service/newsflash/
>>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
>>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
>>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
>>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
>>>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
>>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
>>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
>>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
>>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
>>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
>>>>> http://www.prosieben.de/spielfilm_serie/topstories/61051/
>>>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>>>
>>>>> When starting from an higher URL like http://www.prosieben.de these URLs get the following warn message after some fetch cycles:
>>>>> WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>>>> But the crawldb does not get corrupt immediately after the first occurence of such messages, it gets corrupted some cyles later.
>>>>>
>>>>> Any suggestions are highly appreciated.
>>>>> Something seems to go wrong with the feed plugin, but I can't diagnose exactly when and why...
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Kind regards,
>>>>> Martina
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Doğacan Güney [mailto:dogacan@gmail.com]
>>>>> Sent: Friday, 13 February 2009 09:37
>>>>> To: nutch-user@lucene.apache.org
>>>>> Subject: Re: Fetcher2 crashes with current trunk
>>>>>
>>>>> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <Ko...@huberverlag.de> wrote:
>>>>>
>>>>>           
>>>>>> Hi all,
>>>>>>
>>>>>> we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied.
>>>>>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
>>>>>> When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log:
>>>>>>
>>>>>> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
>>>>>> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>>>>>> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
>>>>>> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
>>>>>> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
>>>>>> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
>>>>>> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
>>>>>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
>>>>>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
>>>>>> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
>>>>>> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
>>>>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>>>>      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>>>>>      at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>>>>>      at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>>>>>      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>>>      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>>>      at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>>>>>      at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>>>>>      at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>>>>>      at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>>>>>      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>>>>>      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>>>>>      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>>>      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>>>>>> Caused by: java.lang.NullPointerException
>>>>>>      at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>>>      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>>>>>      ... 13 more
>>>>>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>>>>>      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>>>>>      at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>>>>>      at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>      at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>>>>>
>>>>>> After that the crawldb is broken and can't be accessed e.g. with the readdb <crawldb> -stats command.
>>>>>> When we use for exactly the same task Fetcher instead of Fetcher2, we can do as many cycles as we like without any problems or crashes.
>>>>>>
>>>>>> Besides this error we've observed that the fetch-cycle with Fetcher is about twice as fast as Fetcher2, although we use the exact same settings in the nutch-site:
>>>>>> generate.max.per.host  - 100
>>>>>> fetcher.threads.per.host - 1
>>>>>> fetcher.server.delay - 0
>>>>>> for an initial url list with 30 URLs of different hosts.
>>>>>>
>>>>>> Has anybody observed similar errors or performance issues?
>>>>>>
>>>>>>
>>>>>>             
>>>>> Fetcher - Fetcher2 performance is a confusing issue. There have been
>>>>> reports that both
>>>>> have been faster than the other. Fetcher2 has a much more flexible and
>>>>> smarter architecture
>>>>> compared to Fetcher so I can only think that this is some sort of bug
>>>>> in Fetcher2 that degrades
>>>>> performance.
>>>>>
>>>>> However, your other problem (Fetcher2 crash) is very weird. I went
>>>>> through Fetcher and Fetcher2
>>>>> code and there is nothing different in them that will make one work
>>>>> and the other fail. Does this
>>>>> error consistently happen if you try it again with Fetcher2 from scratch?
>>>>>
>>>>>
>>>>>           
>>>>>> Kind regards,
>>>>>> Martina
>>>>>>
>>>>>>
>>>>>>             
>>>>> --
>>>>> Doğacan Güney
>>>>>
>>>>>
>>>>>           
>>>>
>>>>
>>>>         
>>>       
>>
>> --
>> Doğacan Güney
>>
>>     
>
>
>
>   


Re: Fetcher2 crashes with current trunk

Posted by Doğacan Güney <do...@gmail.com>.
I think I have found the bug here, but I am in a hurry now. I will create a JIRA issue
and post what is hopefully the fix later today.

On Tue, Feb 17, 2009 at 21:39, Doğacan Güney <do...@gmail.com> wrote:
> 2009/2/17 Sami Siren <ss...@gmail.com>:
>> Do we have a Jira issue for this, seems like a blocker for 1.0 to me if it is reproducible.
>>
>
> No we don't. But you are right that we should. I am very busy and I
> forgot about it. I will
> examine this problem in more detail tomorrow and will open an issue if
> I can reproduce
> the bug.
>
>> --
>> Sami Siren
>>
>>
>> Doğacan Güney wrote:
>>>
>>> Thanks for detailed analysis. I will take a look and get back to you.
>>>
>>> On Mon, Feb 16, 2009 at 13:41, Koch Martina <Ko...@huberverlag.de> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> sorry for the late reply. We did some further digging and found that the error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just happens much later (after about 20 fetch cycles).
>>>> We did many test runs, eliminated as much plugins as possible and identified URLs which are most likely to fail.
>>>> With the following configuration we get a corrupt crawldb after two fetch2 cycles:
>>>> - activated plugins: protocol-http, parse-html, feed
>>>> - generate.max.per.host - 100
>>>> - URLs to fetch:
>>>> http://www.prosieben.de/service/newsflash/
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
>>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
>>>> http://www.prosieben.de/spielfilm_serie/topstories/61051/
>>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>>
>>>> When starting from an higher URL like http://www.prosieben.de these URLs get the following warn message after some fetch cycles:
>>>> WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>>> But the crawldb does not get corrupt immediately after the first occurence of such messages, it gets corrupted some cyles later.
>>>>
>>>> Any suggestions are highly appreciated.
>>>> Something seems to go wrong with the feed plugin, but I can't diagnose exactly when and why...
>>>>
>>>> Thanks in advance.
>>>>
>>>> Kind regards,
>>>> Martina
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Doğacan Güney [mailto:dogacan@gmail.com]
>>>> Sent: Friday, 13 February 2009 09:37
>>>> To: nutch-user@lucene.apache.org
>>>> Subject: Re: Fetcher2 crashes with current trunk
>>>>
>>>> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <Ko...@huberverlag.de> wrote:
>>>>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied.
>>>>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
>>>>> When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log:
>>>>>
>>>>> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
>>>>> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>>>>> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
>>>>> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
>>>>> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
>>>>> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
>>>>> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
>>>>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
>>>>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
>>>>> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
>>>>> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
>>>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>>>      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>>>>      at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>>>>      at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>>>>      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>>      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>>      at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>>>>      at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>>>>      at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>>>>      at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>>>>      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>>>>      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>>>>      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>>      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>>>>> Caused by: java.lang.NullPointerException
>>>>>      at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>>      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>>>>      ... 13 more
>>>>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>>>>      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>>>>      at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>>>>      at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>      at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>>>>
>>>>> After that the crawldb is broken and can't be accessed e.g. with the readdb <crawldb> -stats command.
>>>>> When we use for exactly the same task Fetcher instead of Fetcher2, we can do as many cycles as we like without any problems or crashes.
>>>>>
>>>>> Besides this error we've observed that the fetch-cycle with Fetcher is about twice as fast as Fetcher2, although we use the exact same settings in the nutch-site:
>>>>> generate.max.per.host  - 100
>>>>> fetcher.threads.per.host - 1
>>>>> fetcher.server.delay - 0
>>>>> for an initial url list with 30 URLs of different hosts.
>>>>>
>>>>> Has anybody observed similar errors or performance issues?
>>>>>
>>>>>
>>>>
>>>> Fetcher - Fetcher2 performance is a confusing issue. There have been
>>>> reports that both
>>>> have been faster than the other. Fetcher2 has a much more flexible and
>>>> smarter architecture
>>>> compared to Fetcher so I can only think that this is some sort of bug
>>>> in Fetcher2 that degrades
>>>> performance.
>>>>
>>>> However, your other problem (Fetcher2 crash) is very weird. I went
>>>> through Fetcher and Fetcher2
>>>> code and there is nothing different in them that will make one work
>>>> and the other fail. Does this
>>>> error consistently happen if you try it again with Fetcher2 from scratch?
>>>>
>>>>
>>>>>
>>>>> Kind regards,
>>>>> Martina
>>>>>
>>>>>
>>>>
>>>> --
>>>> Doğacan Güney
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
> --
> Doğacan Güney
>



-- 
Doğacan Güney

Re: Fetcher2 crashes with current trunk

Posted by Doğacan Güney <do...@gmail.com>.
2009/2/17 Sami Siren <ss...@gmail.com>:
> Do we have a Jira issue for this, seems like a blocker for 1.0 to me if it is reproducible.
>

No, we don't. But you are right that we should. I am very busy and I forgot about it.
I will examine this problem in more detail tomorrow and will open an issue if I can
reproduce the bug.

> --
> Sami Siren
>
>
> Doğacan Güney wrote:
>>
>> Thanks for detailed analysis. I will take a look and get back to you.
>>
>> On Mon, Feb 16, 2009 at 13:41, Koch Martina <Ko...@huberverlag.de> wrote:
>>
>>>
>>> Hi,
>>>
>>> sorry for the late reply. We did some further digging and found that the error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just happens much later (after about 20 fetch cycles).
>>> We did many test runs, eliminated as much plugins as possible and identified URLs which are most likely to fail.
>>> With the following configuration we get a corrupt crawldb after two fetch2 cycles:
>>> - activated plugins: protocol-http, parse-html, feed
>>> - generate.max.per.host - 100
>>> - URLs to fetch:
>>> http://www.prosieben.de/service/newsflash/
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
>>> http://www.prosieben.de/kino_dvd/news/60897/
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
>>> http://www.prosieben.de/spielfilm_serie/topstories/61051/
>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>
>>> When starting from an higher URL like http://www.prosieben.de these URLs get the following warn message after some fetch cycles:
>>> WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>> But the crawldb does not get corrupt immediately after the first occurence of such messages, it gets corrupted some cyles later.
>>>
>>> Any suggestions are highly appreciated.
>>> Something seems to go wrong with the feed plugin, but I can't diagnose exactly when and why...
>>>
>>> Thanks in advance.
>>>
>>> Kind regards,
>>> Martina
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Doğacan Güney [mailto:dogacan@gmail.com]
>>> Sent: Friday, 13 February 2009 09:37
>>> To: nutch-user@lucene.apache.org
>>> Subject: Re: Fetcher2 crashes with current trunk
>>>
>>> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <Ko...@huberverlag.de> wrote:
>>>
>>>>
>>>> Hi all,
>>>>
>>>> we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied.
>>>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
>>>> When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log:
>>>>
>>>> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
>>>> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>>>> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
>>>> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
>>>> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
>>>> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
>>>> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
>>>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
>>>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
>>>> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
>>>> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
>>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>>      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>>>      at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>>>      at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>>>      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>      at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>>>      at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>>>      at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>>>      at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>>>      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>>>      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>>>      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>>>> Caused by: java.lang.NullPointerException
>>>>      at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>>>      ... 13 more
>>>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>>>      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>>>      at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>>>      at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>      at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>>>
>>>> After that the crawldb is broken and can't be accessed e.g. with the readdb <crawldb> -stats command.
>>>> When we use for exactly the same task Fetcher instead of Fetcher2, we can do as many cycles as we like without any problems or crashes.
>>>>
>>>> Besides this error we've observed that the fetch-cycle with Fetcher is about twice as fast as Fetcher2, although we use the exact same settings in the nutch-site:
>>>> generate.max.per.host  - 100
>>>> fetcher.threads.per.host - 1
>>>> fetcher.server.delay - 0
>>>> for an initial url list with 30 URLs of different hosts.
>>>>
>>>> Has anybody observed similar errors or performance issues?
>>>>
>>>>
>>>
>>> Fetcher - Fetcher2 performance is a confusing issue. There have been
>>> reports that both
>>> have been faster than the other. Fetcher2 has a much more flexible and
>>> smarter architecture
>>> compared to Fetcher so I can only think that this is some sort of bug
>>> in Fetcher2 that degrades
>>> performance.
>>>
>>> However, your other problem (Fetcher2 crash) is very weird. I went
>>> through Fetcher and Fetcher2
>>> code and there is nothing different in them that will make one work
>>> and the other fail. Does this
>>> error consistently happen if you try it again with Fetcher2 from scratch?
>>>
>>>
>>>>
>>>> Kind regards,
>>>> Martina
>>>>
>>>>
>>>
>>> --
>>> Doğacan Güney
>>>
>>>
>>
>>
>>
>>
>
>



-- 
Doğacan Güney

Re: Fetcher2 crashes with current trunk

Posted by Sami Siren <ss...@gmail.com>.
Do we have a Jira issue for this? It seems like a blocker for 1.0 to me if
it is reproducible.

--
 Sami Siren


Doğacan Güney wrote:
> Thanks for detailed analysis. I will take a look and get back to you.
>
> On Mon, Feb 16, 2009 at 13:41, Koch Martina <Ko...@huberverlag.de> wrote:
>   
>> Hi,
>>
>> sorry for the late reply. We did some further digging and found that the error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just happens much later (after about 20 fetch cycles).
>> We did many test runs, eliminated as much plugins as possible and identified URLs which are most likely to fail.
>> With the following configuration we get a corrupt crawldb after two fetch2 cycles:
>> - activated plugins: protocol-http, parse-html, feed
>> - generate.max.per.host - 100
>> - URLs to fetch:
>> http://www.prosieben.de/service/newsflash/
>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
>> http://www.prosieben.de/kino_dvd/news/60897/
>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
>> http://www.prosieben.de/spielfilm_serie/topstories/61051/
>> http://www.prosieben.de/kino_dvd/news/60897/
>>
>> When starting from an higher URL like http://www.prosieben.de these URLs get the following warn message after some fetch cycles:
>> WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>> But the crawldb does not get corrupt immediately after the first occurence of such messages, it gets corrupted some cyles later.
>>
>> Any suggestions are highly appreciated.
>> Something seems to go wrong with the feed plugin, but I can't diagnose exactly when and why...
>>
>> Thanks in advance.
>>
>> Kind regards,
>> Martina
>>
>>
>>
>> -----Original Message-----
>> From: Doğacan Güney [mailto:dogacan@gmail.com]
>> Sent: Friday, 13 February 2009 09:37
>> To: nutch-user@lucene.apache.org
>> Subject: Re: Fetcher2 crashes with current trunk
>>
>> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <Ko...@huberverlag.de> wrote:
>>     
>>> Hi all,
>>>
>>> we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied.
>>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
>>> When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log:
>>>
>>> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
>>> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>>> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
>>> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
>>> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
>>> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
>>> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
>>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
>>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
>>> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
>>> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>>       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>>       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>       at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>>       at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>>       at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>>       at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>>> Caused by: java.lang.NullPointerException
>>>       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>>       ... 13 more
>>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>>       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>>       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>       at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>>
>>> After that the crawldb is broken and can't be accessed e.g. with the readdb <crawldb> -stats command.
>>> When we use for exactly the same task Fetcher instead of Fetcher2, we can do as many cycles as we like without any problems or crashes.
>>>
>>> Besides this error we've observed that the fetch-cycle with Fetcher is about twice as fast as Fetcher2, although we use the exact same settings in the nutch-site:
>>> generate.max.per.host  - 100
>>> fetcher.threads.per.host - 1
>>> fetcher.server.delay - 0
>>> for an initial url list with 30 URLs of different hosts.
>>>
>>> Has anybody observed similar errors or performance issues?
>>>
>>>       
>> Fetcher - Fetcher2 performance is a confusing issue. There have been
>> reports that both
>> have been faster than the other. Fetcher2 has a much more flexible and
>> smarter architecture
>> compared to Fetcher so I can only think that this is some sort of bug
>> in Fetcher2 that degrades
>> performance.
>>
>> However, your other problem (Fetcher2 crash) is very weird. I went
>> through Fetcher and Fetcher2
>> code and there is nothing different in them that will make one work
>> and the other fail. Does this
>> error consistently happen if you try it again with Fetcher2 from scratch?
>>
>>     
>>> Kind regards,
>>> Martina
>>>
>>>       
>>
>> --
>> Doğacan Güney
>>
>>     
>
>
>
>   


Re: Fetcher2 crashes with current trunk

Posted by Doğacan Güney <do...@gmail.com>.
Thanks for the detailed analysis. I will take a look and get back to you.

On Mon, Feb 16, 2009 at 13:41, Koch Martina <Ko...@huberverlag.de> wrote:
> Hi,
>
> sorry for the late reply. We did some further digging and found that the error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just happens much later (after about 20 fetch cycles).
> We did many test runs, eliminated as much plugins as possible and identified URLs which are most likely to fail.
> With the following configuration we get a corrupt crawldb after two fetch2 cycles:
> - activated plugins: protocol-http, parse-html, feed
> - generate.max.per.host - 100
> - URLs to fetch:
> http://www.prosieben.de/service/newsflash/
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
> http://www.prosieben.de/kino_dvd/news/60897/
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
> http://www.prosieben.de/spielfilm_serie/topstories/61051/
> http://www.prosieben.de/kino_dvd/news/60897/
>
> When starting from an higher URL like http://www.prosieben.de these URLs get the following warn message after some fetch cycles:
> WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
> But the crawldb does not get corrupt immediately after the first occurence of such messages, it gets corrupted some cyles later.
>
> Any suggestions are highly appreciated.
> Something seems to go wrong with the feed plugin, but I can't diagnose exactly when and why...
>
> Thanks in advance.
>
> Kind regards,
> Martina
>
>
>
> -----Original Message-----
> From: Doğacan Güney [mailto:dogacan@gmail.com]
> Sent: Friday, 13 February 2009 09:37
> To: nutch-user@lucene.apache.org
> Subject: Re: Fetcher2 crashes with current trunk
>
> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <Ko...@huberverlag.de> wrote:
>> Hi all,
>>
>> we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied.
>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
>> When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log:
>>
>> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
>> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
>> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
>> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
>> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
>> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
>> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
>> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
>> java.lang.RuntimeException: java.lang.NullPointerException
>>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>       at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>       at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>       at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>       at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>> Caused by: java.lang.NullPointerException
>>       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>       ... 13 more
>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>       at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>
>> After that the crawldb is broken and can't be accessed e.g. with the readdb <crawldb> -stats command.
>> When we use for exactly the same task Fetcher instead of Fetcher2, we can do as many cycles as we like without any problems or crashes.
>>
>> Besides this error we've observed that the fetch-cycle with Fetcher is about twice as fast as Fetcher2, although we use the exact same settings in the nutch-site:
>> generate.max.per.host  - 100
>> fetcher.threads.per.host - 1
>> fetcher.server.delay - 0
>> for an initial url list with 30 URLs of different hosts.
>>
>> Has anybody observed similar errors or performance issues?
>>
>
> Fetcher - Fetcher2 performance is a confusing issue. There have been
> reports that both
> have been faster than the other. Fetcher2 has a much more flexible and
> smarter architecture
> compared to Fetcher so I can only think that this is some sort of bug
> in Fetcher2 that degrades
> performance.
>
> However, your other problem (Fetcher2 crash) is very weird. I went
> through Fetcher and Fetcher2
> code and there is nothing different in them that will make one work
> and the other fail. Does this
> error consistently happen if you try it again with Fetcher2 from scratch?
>
>> Kind regards,
>> Martina
>>
>
>
>
> --
> Doğacan Güney
>



-- 
Doğacan Güney

Re: Fetcher2 crashes with current trunk

Posted by Koch Martina <Ko...@huberverlag.de>.
Hi,

sorry for the late reply. We did some further digging and found that the error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the error just happens much later (after about 20 fetch cycles).
We did many test runs, eliminated as many plugins as possible and identified the URLs which are most likely to fail.
With the following configuration we get a corrupt crawldb after two fetch2 cycles:
- activated plugins: protocol-http, parse-html, feed (see the nutch-site.xml sketch after the URL list)
- generate.max.per.host - 100
- URLs to fetch:
http://www.prosieben.de/service/newsflash/
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
http://www.prosieben.de/kino_dvd/news/60897/
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
http://www.prosieben.de/spielfilm_serie/topstories/61051/
http://www.prosieben.de/kino_dvd/news/60897/ 
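
The plugin selection above corresponds to roughly this in our nutch-site.xml (a sketch, listing only the plugins we kept enabled):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|parse-html|feed</value>
    </property>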

When starting from a higher-level URL like http://www.prosieben.de, these URLs get the following warning after some fetch cycles:
WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
But the crawldb does not get corrupted immediately after the first occurrence of such messages; it gets corrupted some cycles later.

Any suggestions are highly appreciated. 
Something seems to go wrong with the feed plugin, but I can't diagnose exactly when and why...

Thanks in advance.

Kind regards,
Martina



-----Original Message-----
From: Doğacan Güney [mailto:dogacan@gmail.com]
Sent: Friday, 13 February 2009 09:37
To: nutch-user@lucene.apache.org
Subject: Re: Fetcher2 crashes with current trunk

On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <Ko...@huberverlag.de> wrote:
> Hi all,
>
> we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied.
> We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
> When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log:
>
> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
> java.lang.RuntimeException: java.lang.NullPointerException
>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>       at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>       at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>       at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>       at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> Caused by: java.lang.NullPointerException
>       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>       ... 13 more
> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>
> After that the crawldb is broken and can't be accessed e.g. with the readdb <crawldb> -stats command.
> When we use Fetcher instead of Fetcher2 for exactly the same task, we can run as many cycles as we like without any problems or crashes.
>
> Besides this error, we've observed that the fetch cycle with Fetcher is about twice as fast as with Fetcher2, although we use exactly the same settings in nutch-site.xml:
> generate.max.per.host  - 100
> fetcher.threads.per.host - 1
> fetcher.server.delay - 0
> for an initial url list with 30 URLs of different hosts.
>
> Has anybody observed similar errors or performance issues?
>

Fetcher vs. Fetcher2 performance is a confusing issue. There have been
reports of each being faster than the other. Fetcher2 has a much more
flexible and smarter architecture than Fetcher, so I can only think that
some bug in Fetcher2 is degrading its performance.

However, your other problem (the Fetcher2 crash) is very weird. I went
through the Fetcher and Fetcher2 code and there is nothing in them that
should make one work and the other fail. Does the error happen
consistently if you try again with Fetcher2 from scratch?

> Kind regards,
> Martina
>



-- 
Doğacan Güney

Re: Fetcher2 crashes with current trunk

Posted by Doğacan Güney <do...@gmail.com>.
On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <Ko...@huberverlag.de> wrote:
> Hi all,
>
> we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied.
> We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
> When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log:
>
> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
> java.lang.RuntimeException: java.lang.NullPointerException
>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>       at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>       at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>       at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>       at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> Caused by: java.lang.NullPointerException
>       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>       ... 13 more
> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>
> After that the crawldb is broken and can't be accessed e.g. with the readdb <crawldb> -stats command.
> When we use Fetcher instead of Fetcher2 for exactly the same task, we can run as many cycles as we like without any problems or crashes.
>
> Besides this error, we've observed that the fetch cycle with Fetcher is about twice as fast as with Fetcher2, although we use exactly the same settings in nutch-site.xml:
> generate.max.per.host  - 100
> fetcher.threads.per.host - 1
> fetcher.server.delay - 0
> for an initial url list with 30 URLs of different hosts.
>
> Has anybody observed similar errors or performance issues?
>

Fetcher vs. Fetcher2 performance is a confusing issue. There have been
reports of each being faster than the other. Fetcher2 has a much more
flexible and smarter architecture than Fetcher, so I can only think that
some bug in Fetcher2 is degrading its performance.
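
To give an idea of what I mean by the architecture: Fetcher2 keeps a separate fetch queue per host and only hands a URL to a thread once that host's politeness delay has elapsed. Roughly along these lines (a toy sketch with made-up names, not the actual Nutch classes):

// Toy illustration of per-host queueing: each host gets its own FIFO, and a URL
// is only handed out once that host's politeness delay has elapsed.
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

public class PerHostQueues {
  private static class HostQueue {
    final Queue<String> urls = new LinkedList<String>();
    long nextFetchTime = 0L;              // earliest time this host may be hit again
  }

  private final Map<String, HostQueue> queues = new HashMap<String, HostQueue>();
  private final long delayMs;             // roughly what fetcher.server.delay controls

  public PerHostQueues(long delayMs) { this.delayMs = delayMs; }

  public void add(String host, String url) {
    HostQueue q = queues.get(host);
    if (q == null) { q = new HostQueue(); queues.put(host, q); }
    q.urls.add(url);
  }

  /** Returns a URL whose host may be fetched now, or null if every host must wait. */
  public String next(long now) {
    for (HostQueue q : queues.values()) {
      if (!q.urls.isEmpty() && now >= q.nextFetchTime) {
        q.nextFetchTime = now + delayMs;  // reserve the politeness window
        return q.urls.poll();
      }
    }
    return null;
  }
}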

However, your other problem (the Fetcher2 crash) is very weird. I went
through the Fetcher and Fetcher2 code and there is nothing in them that
should make one work and the other fail. Does the error happen
consistently if you try again with Fetcher2 from scratch?
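
For what it's worth, the trace itself hints at one possible mechanism (just a guess, not a diagnosis): MapWritable.readFields() maps each stored class id back to a class, and if that lookup comes back empty, the following ReflectionUtils.newInstance(null, conf) fails with exactly the NullPointerException in your log. A tiny illustration of that failure mode:

// Illustration only: asking ReflectionUtils to instantiate a null class reproduces
// the NullPointerException signature from the Injector log. It does not prove that
// this is what happens in the crawldb; it only shows the mechanism.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class NpeMechanism {
  public static void main(String[] args) {
    Class<? extends Writable> valueClass = null; // what an unresolved class id leaves you with
    // Throws java.lang.NullPointerException at ConcurrentHashMap.get(...)
    //   at org.apache.hadoop.util.ReflectionUtils.newInstance(...)
    ReflectionUtils.newInstance(valueClass, new Configuration());
  }
}

If that is what happens here, the interesting question is how a class id that can no longer be resolved ends up in a CrawlDatum's metadata by the time the Injector merges the db.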

> Kind regards,
> Martina
>



-- 
Doğacan Güney