Posted to dev@nutch.apache.org by "yoursoft@freemail.hu" <yo...@freemail.hu> on 2005/09/01 15:30:40 UTC

nutch 0.7 bug?

Dear Developers!

I tested Nutch 0.7 with all the parser plugins and found the following:

-------------------------------------------------------------------------
The fetch breaks with errors such as the following:
-------------------------------------------------------------------------
050901 110915 fetch okay, but can't parse http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, reason: failed
(2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved files are unsupported at this time
050901 110915 fetching http://en.mimi.hu/fishing/scad.html
050901 110917 SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
        at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110917 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
etc.
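
A possible reading of the trace above (an assumption drawn only from the stack frames, not a confirmed diagnosis): the NullPointerException is raised from MapFile$Writer.append at MapFile.java:127, while the later failures come from checkKey, called at MapFile.java:120. That suggests the key check has already accepted key 319 by the time the value write fails. If the entry counter in ArrayFile is advanced only after a successful append, every later append would retry key 319 and fail with "key out of order: 319 after 319". The Java sketch below models that cascade with hypothetical stand-in classes (OrderedWriter, CountingWriter); it is not the Nutch source code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class KeyOutOfOrderSketch {

    /** Hypothetical stand-in for a writer that requires strictly increasing keys. */
    static class OrderedWriter {
        private long lastKey = -1;
        private final List<String> data = new ArrayList<>();

        void append(long key, String value) throws IOException {
            if (key <= lastKey) {
                throw new IOException("key out of order: " + key + " after " + lastKey);
            }
            lastKey = key;            // the key is recorded first ...
            data.add(value.trim());   // ... then the value is written; a null value throws here
        }
    }

    /** Hypothetical stand-in for a writer that numbers its entries itself. */
    static class CountingWriter {
        private final OrderedWriter out = new OrderedWriter();
        private long count = 0;

        void append(String value) throws IOException {
            out.append(count, value); // if this throws, count is NOT advanced
            count++;
        }
    }

    /** Appends four values, one of them null, and collects every exception raised. */
    static List<Exception> simulate() {
        CountingWriter writer = new CountingWriter();
        List<Exception> errors = new ArrayList<>();
        String[] values = {"page 0", null, "page 2", "page 3"};
        for (String value : values) {
            try {
                writer.append(value);
            } catch (Exception e) {
                errors.add(e);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // One NullPointerException, then a "key out of order" error for each later append.
        for (Exception e : simulate()) {
            System.out.println(e.getClass().getName() + ": " + e.getMessage());
        }
    }
}
```

If this reading is right, the repeated IOExceptions are only a symptom, and the first NullPointerException in ParseData.write is the error to investigate first.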

---------------------------------------------------------------------------
These are the differences between nutch-site.xml and nutch-default.xml:
---------------------------------------------------------------------------
 ***** nutch-default.xml
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
***** NUTCH-SITE.XML
  <name>http.timeout</name>
  <value>30000</value>
  <description>The default network timeout, in milliseconds.</description>
*****

***** nutch-default.xml
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to
***** NUTCH-SITE.XML
  <name>http.max.delays</name>
  <value>6</value>
  <description>The number of times a thread will delay when trying to
*****

***** nutch-default.xml
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>http.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>file.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>ftp.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>ftp.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
***** NUTCH-SITE.XML
  <name>db.max.outlinks.per.page</name>
  <value>200</value>
  <description>The maximum number of outlinks that we'll process for a page.
*****

***** nutch-default.xml
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
***** NUTCH-SITE.XML
  <name>db.fetch.retry.max</name>
  <value>6</value>
  <description>The maximum number of times a url that has encountered
*****

***** nutch-default.xml
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
***** NUTCH-SITE.XML
  <name>fetcher.server.delay</name>
  <value>30.0</value>
  <description>The number of seconds the fetcher will delay between
*****

***** nutch-default.xml
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
***** NUTCH-SITE.XML
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
*****

***** nutch-default.xml
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
***** NUTCH-SITE.XML
  <name>fetcher.threads.per.host</name>
  <value>100</value>
  <description>This number is the maximum number of threads that
*****

***** nutch-default.xml
  <name>parser.threads.parse</name>
  <value>10</value>
  <description>Number of ParserThreads ParseSegment should use.</description>
***** NUTCH-SITE.XML
  <name>parser.threads.parse</name>
  <value>100</value>
  <description>Number of ParserThreads ParseSegment should use.</description>
*****

***** nutch-default.xml
  <name>indexer.minMergeDocs</name>
  <value>50</value>
  <description>This number determines the minimum number of Lucene
***** NUTCH-SITE.XML
  <name>indexer.minMergeDocs</name>
  <value>10000</value>
  <description>This number determines the minimum number of Lucene
*****

***** nutch-default.xml
  <name>indexer.maxMergeDocs</name>
  <value>50</value>
  <description>This number determines the maximum number of Lucene
***** NUTCH-SITE.XML
  <name>indexer.maxMergeDocs</name>
  <value>10000000</value>
  <description>This number determines the maximum number of Lucene
*****

***** nutch-default.xml
  <name>searcher.dir</name>
  <value>.</value>
  <description>
***** NUTCH-SITE.XML
  <name>searcher.dir</name>
  <value>/srv/db/</value>
  <description>
*****

***** nutch-default.xml
  <name>ipc.client.timeout</name>
  <value>10000</value>
  <description>Defines the timeout for IPC calls in milliseconds.</description>
***** NUTCH-SITE.XML
  <name>ipc.client.timeout</name>
  <value>20000</value>
  <description>Defines the timeout for IPC calls in milliseconds.</description>
*****

***** nutch-default.xml
  <name>plugin.includes</name>
  
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
***** NUTCH-SITE.XML
  <name>plugin.includes</name>
  
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-
basic|more|site|url)</value>
  <description>Regular expression naming plugin directory names to
*****

***** nutch-default.xml
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
  <description>The character encoding to fall back to when no other information
***** NUTCH-SITE.XML
  <name>parser.character.encoding.default</name>
  <value>iso-8859-2</value>
  <description>The character encoding to fall back to when no other information
*****

Any idea what the source of the problem is?

Best Regards:
    Ferenc

Re: [Nutch-dev] Re: nutch 0.7 bug?

Posted by YourSoft <yo...@freemail.hu>.
Dear Michael,

Thanks for your mail, but I think these are two different problems. I 
don't use the RSS parser.

Ferenc

Michael Nebel wrote:

> Just for the mail archives: please see also NUTCH-89.
>
> Thread closed?
>
> Michael
>
>
>
> yoursoft@freemail.hu wrote:
>
>> Hi Michael,
>>
>> I am going back to a nightly build.
>> I think this problem is related to the 'fetcher.threads.per.host' 
>> value when it is greater than 1.
>> Other possible sources are fetcher.threads.fetch, 
>> fetcher.threads.per.host, or parser.threads.parse.
>>
>> Best Regards,
>>    Ferenc
>>
>>> Hi Ferenc,
>>>
>>> I see the same errors. As I've seen a running installation 
>>> yesterday, I think it's a configuration mistake. So far I have no 
>>> idea where. Have you made any progress?
>>>
>>> Regards
>>>
>>>     Michael
>>>
>>>
>
>


Re: nutch 0.7 bug?

Posted by Michael Nebel <mi...@nebel.de>.
Just for the mail archives: please see also NUTCH-89.

Thread closed?

Michael



yoursoft@freemail.hu wrote:

> Hi Michael,
> 
> I am going back to a nightly build.
> I think this problem is related to the 'fetcher.threads.per.host' 
> value when it is greater than 1.
> Other possible sources are fetcher.threads.fetch, 
> fetcher.threads.per.host, or parser.threads.parse.
> 
> Best Regards,
>    Ferenc
> 
>> Hi Ferenc,
>>
>> I see the same errors. As I've seen a running installation yesterday, 
>> I think it's a configuration mistake. So far I have no idea where. 
>> Have you made any progress?
>>
>> Regards
>>
>>     Michael
>>
>>


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: nutch 0.7 bug?

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Hi Michael,

I am going back to a nightly build.
I think this problem is related to the 'fetcher.threads.per.host' value, 
when it is bigger than 1.
Other possible sources: fetcher.threads.fetch, 
fetcher.threads.per.host, or parser.threads.parse.
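
One possible reading of the stack traces in the original report, offered 
only as a sketch: the first failure is a NullPointerException inside 
append(), and every later append() then fails with "key out of order: 
319 after 319". That pattern would appear if the key is recorded before 
the value is written and the caller's counter advances only after a 
successful write. The classes below (CheckedWriter, CountingWriter) are 
hypothetical stand-ins, not Nutch's MapFile or ArrayFile code, and the 
sketch neither confirms nor rules out the thread settings as the trigger.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for a key-checked, append-only writer; not Nutch's code. */
final class KeyOrderSketch {

    /** Checks that keys strictly increase, then "serializes" the value. */
    static final class CheckedWriter {
        private long lastKey = -1;

        void append(long key, Object value) throws IOException {
            if (key <= lastKey) {
                throw new IOException("key out of order: " + key + " after " + lastKey);
            }
            lastKey = key;      // the key is recorded before the value is written
            value.toString();   // stands in for value.write(out); throws NPE when value is null
        }
    }

    /** Assigns keys from a counter that advances only when append() returns normally. */
    static final class CountingWriter {
        private final CheckedWriter out = new CheckedWriter();
        private long count = 0;

        void append(Object value) throws IOException {
            out.append(count, value);
            count++;            // skipped when the line above throws
        }
    }

    /** Appends each value in turn and returns one status line per attempt. */
    static List<String> run(Object... values) {
        CountingWriter writer = new CountingWriter();
        List<String> log = new ArrayList<>();
        for (Object v : values) {
            try {
                writer.append(v);
                log.add("ok");
            } catch (IOException e) {
                log.add(e.getMessage());
            } catch (NullPointerException e) {
                log.add("NullPointerException");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // One null value poisons the writer: every later append reuses the same key.
        run("a", null, "b", "c").forEach(System.out::println);
    }
}
```

In this sketch a single bad value is enough to make every following 
append fail, regardless of how many threads are running.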

Best Regards,
    Ferenc

> Hi Ferenc,
>
> I see the same errors. Since I saw a running installation yesterday, 
> I think it's a configuration mistake. So far I have no idea where. 
> Have you made any progress?
>
> Regards
>
>     Michael
>
>


Re: nutch 0.7 bug?

Posted by Michael Nebel <mi...@nebel.de>.
Hi Ferenc,

I see the same errors. Since I saw a running installation yesterday, I 
think it's a configuration mistake. So far I have no idea where. Have 
you made any progress?

Regards

	Michael




-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: Event queues vs threads

Posted by Paul Baclace <pe...@baclace.net>.
Doug Cutting wrote:
 >Kelvin Tan wrote:
>> fetcher as a series of event queues (ala SEDA) instead 
>> of with threads.
> 
> I have never been able to write an async version of things with Java's 
> nio that outperforms a threaded version.  In theory it is possible, 
> since you can avoid thread switching overheads.  But in practice I have 
> found it difficult.

I read the David Culler, et al SEDA paper a while back and I think
the real benefit is twofold:  (1) more concurrent connections and
(2) graceful degradation (meaning fair scheduling) at maximum load.
IIRC, they hint at competitive-with-apache web serving, but this
depends on specific mix of requests/file sizes, etc.; Tomcat can
also beat the apache web server under some conditions.

Services that need to maintain lots of mostly-idle connections
(like instant messaging) benefit the most from a SEDA architecture.

It should be possible to have graceful degradation with a
thread-oriented architecture.  Perhaps a self-tuning procedure
that, for a specific installation, could discover the parameter
settings to get the most out of a server and have it refuse requests
that would push it into the unfair scheduling zone.
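
A minimal sketch of the "refuse requests" half of that idea, assuming 
the limits are already known (the self-tuning step is not shown). It 
uses java.util.concurrent's ThreadPoolExecutor with a bounded queue and 
the AbortPolicy handler, so work beyond the configured capacity is 
rejected instead of queued without limit. The class name, method name, 
and numbers are illustrative only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative only: a fixed pool that refuses work once threads and queue are full. */
final class BoundedPoolSketch {

    /** Submits `tasks` blocking jobs and returns how many the pool refused. */
    static int countRejected(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());   // throws instead of queueing more
        CountDownLatch release = new CountDownLatch(1);  // keeps every worker busy
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();             // simulate a long-running request
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;                          // the server is "full": refuse
                }
            }
        } finally {
            release.countDown();                         // let accepted jobs finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 threads + 3 queue slots accept 5 of 10 jobs; the rest are refused.
        System.out.println(countRejected(2, 3, 10));
    }
}
```

A self-tuning procedure of the kind described would then amount to 
searching for the thread count and queue capacity at which a given 
installation still schedules fairly.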

Paul

Re: Event queues vs threads

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hi,
I think some old blog entries are quite interesting if someone wants to 
find out more details about nio:
http://jroller.com/page/pyrasun/20040426
Regards,
Piotr



Re: Event queues vs threads

Posted by Doug Cutting <cu...@nutch.org>.
Kelvin Tan wrote:
> Interesting. I haven't tried it myself. Do you have any code/benchmarks for this?

I never committed it anywhere.  I initially tried to write Nutch's IPC 
mechanism with nio and it was slow and buggy.  One problem was that I 
needed to switch streams to non-blocking mode in order to read 
arbitrarily large objects, then switch them back to blocking mode in 
order to select() on them.  But you can't change this state and remove 
them from the selector without going through the scheduler.  So the 
benefit of skipping the scheduler wasn't there.  If I was willing to 
fragment objects into fixed size chunks then it might have worked, but 
that's a lot of work.  It's a strange limitation, since with native 
sockets one can select and then perform arbitrary stream i/o, not 
limited to a single buffer.
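
The mode-switching constraint can be reproduced in a small sketch. 
SelectableChannel.configureBlocking is documented to throw 
IllegalBlockingModeException when a channel that is registered with a 
selector is switched to blocking mode, and a cancelled key is only 
deregistered during the next selection operation. The example uses an 
in-process Pipe rather than a socket so that it needs no network; the 
class and method names are illustrative, and this is not the IPC code 
discussed above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.IllegalBlockingModeException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

/** Illustrative only: blocking mode cannot be restored while a channel is registered. */
final class BlockingModeSketch {

    /** Returns { rejectedWhileRegistered, blockingAfterDeregistering }. */
    static boolean[] demo() {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try (Pipe.SourceChannel source = pipe.source();
                 Pipe.SinkChannel sink = pipe.sink()) {
                source.configureBlocking(false);     // required before register()
                SelectionKey key = source.register(selector, SelectionKey.OP_READ);

                boolean rejected = false;
                try {
                    source.configureBlocking(true);  // still registered: not allowed
                } catch (IllegalBlockingModeException e) {
                    rejected = true;
                }

                key.cancel();                        // request deregistration
                selector.selectNow();                // a selection pass flushes cancelled keys
                source.configureBlocking(true);      // now permitted
                return new boolean[] { rejected, source.isBlocking() };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("rejected while registered: " + r[0]);
        System.out.println("blocking after deregistering: " + r[1]);
    }
}
```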

Also, there's an nio version of Lucene's Directory that's a bit slower 
than the non-nio version, but this is not using select() or anything.

> Are you aware of others facing the same problem? 

How much non-blocking nio code do you find in real Java code?  I have 
not seen a lot.

I did find that Sun has implemented a high-performance HTTP client using 
nio.  This is documented at:

http://blogs.sun.com/roller/resources/fp/grizzly.pdf

From what I can tell the primary benefit is in number of simultaneous 
clients, not in throughput.  Does a crawler require 1000's of 
simultaneous connections?  If so, then it looks like careful use of nio 
could offer some real benefits.

Doug

Re: Event queues vs threads

Posted by Kelvin Tan <ke...@relevanz.com>.

On Thu, 01 Sep 2005 09:58:49 -0700, Doug Cutting wrote:
> Kelvin Tan wrote:
>> Each of these stages will be handled in its own thread (except
>> for HTML parsing and scoring, which may actually benefit from
>> having multiple threads). With the introduction of non-blocking
>> IO, I think threads should be used only where parallel
>> computation offers performance advantages.
>>
>> Breaking up HttpRequest and HttpResponse, will also pave the way
>> for a non-blocking HTTP implementation.
>>
> I have never been able to write an async version of things with
> Java's nio that outperforms a threaded version.  In theory it is
> possible, since you can avoid thread switching overheads.  But in
> practice I have found it difficult.
>
> Doug

Interesting. I haven't tried it myself. Do you have any code/benchmarks for this? Are you aware of others facing the same problem? 

k


Re: Event queues vs threads

Posted by Doug Cutting <cu...@nutch.org>.
Kelvin Tan wrote:
> Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance advantages.
> 
> Breaking up HttpRequest and HttpResponse, will also pave the way for a non-blocking HTTP implementation.

I have never been able to write an async version of things with Java's 
nio that outperforms a threaded version.  In theory it is possible, 
since you can avoid thread switching overheads.  But in practice I have 
found it difficult.

Doug

Event queues vs threads

Posted by Kelvin Tan <ke...@relevanz.com>.
I'm toying around with the idea of implementing the fetcher as a series of event queues (à la SEDA) instead of with threads. This is done by breaking up the fetching operation into a series of stages connected by queues, instead of one FetcherThread per task.

The stages I see are:

1. CrawlStarter (url injection)
2. URL filtering and normalizing
3. HttpRequest
4. HttpResponse
5. DB of fetched MD5 hashes
6. DB of fetched URLs
7. Parse and link extraction
8. Output
9. Link/Page Scoring

Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance advantages.
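
The stage-and-queue structure described above can be sketched roughly as 
follows. This is only an illustration with two placeholder stages, one 
thread per stage, connected by BlockingQueues; the stage bodies (a trim 
for normalizing, a string prefix instead of a real HTTP request) and all 
names are made up for the example and are not taken from Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Illustrative only: single-threaded stages connected by queues. */
final class StagedPipelineSketch {
    // End-of-stream marker, compared by identity so no real item can match it.
    private static final String END = new String("<end>");

    /** Starts one thread that takes from `in`, applies `work`, and puts to `out`. */
    static Thread stage(String name, BlockingQueue<String> in, BlockingQueue<String> out,
                        Function<String, String> work) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String item = in.take();
                    if (item == END) {
                        out.put(END);                // pass the marker downstream and stop
                        return;
                    }
                    out.put(work.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.start();
        return t;
    }

    /** Pushes the URLs through both stages and collects the results in order. */
    static List<String> run(List<String> urls) {
        BlockingQueue<String> injected = new LinkedBlockingQueue<>();
        BlockingQueue<String> normalized = new LinkedBlockingQueue<>();
        BlockingQueue<String> fetched = new LinkedBlockingQueue<>();

        stage("normalize", injected, normalized, String::trim);        // placeholder
        stage("fetch", normalized, fetched, url -> "fetched:" + url);  // placeholder, no network

        List<String> results = new ArrayList<>();
        try {
            for (String url : urls) {
                injected.put(url);
            }
            injected.put(END);
            for (String r = fetched.take(); r != END; r = fetched.take()) {
                results.add(r);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(" http://example.org/a ", "http://example.org/b")));
    }
}
```

Because each stage has exactly one thread, the stage bodies need no 
locking of their own; the queues are the only shared state.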

Breaking up HttpRequest and HttpResponse will also pave the way for a non-blocking HTTP implementation.

A big advantage also arises from a decrease in programmatic complexity (and possibly a gain in performance). With most of the stages being guaranteed to be single-threaded, threading/synchronization issues are dramatically reduced. This may not be so evident in the current/map-red fetch code, but because of the completely online nature of nutch-84/OC, this does simplify things considerably. 

I'll need to dig a bit more to see how this can be conceptually translated into map-reduce, but I imagine it's doable. Perhaps each stage gets mapped then reduced?

Any thoughts?