Posted to dev@nutch.apache.org by "yoursoft@freemail.hu" <yo...@freemail.hu> on 2005/09/01 15:30:40 UTC
nutch 0.7 bug?
Dear Developers!
I tested Nutch 0.7 with all the parser plugins and found the following:
-------------------------------------------------------------------------
The fetch breaks with errors such as the following:
-------------------------------------------------------------------------
050901 110915 fetch okay, but can't parse http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, reason: failed (2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved files are unsupported at this time
050901 110915 fetching http://en.mimi.hu/fishing/scad.html
050901 110917 SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
    at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
    at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
    at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110917 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
    at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
    at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
    at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
    at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
    at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
    at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
    at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
    at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
    at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
    at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
    at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
    at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
etc.
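
One possible reading of the trace above, offered as an assumption and not a confirmed diagnosis: the first failure is the NullPointerException thrown while the value is being written (MapFile.java:127), which comes after the key check (MapFile.java:120). If the writer has already recorded key 319 at that point but its record counter is never advanced, every later append would reuse 319 and fail in checkKey. The sketch below reproduces that pattern with a hypothetical OrderedWriter class. It is not Nutch's actual code, and every name in it is made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class KeyOrderDemo {
    /** Hypothetical stand-in for an append-only writer keyed by a running counter. */
    static class OrderedWriter {
        private long lastKey = -1;  // last key accepted by checkKey
        private long count = 0;     // key that the next append will use
        final List<String> records = new ArrayList<>();

        void append(String value) throws IOException {
            checkKey(count);            // the key is remembered before the value is written
            records.add(value.trim());  // throws NullPointerException when value is null
            count++;                    // never reached if the line above throws
        }

        private void checkKey(long key) throws IOException {
            if (key <= lastKey) {
                throw new IOException("key out of order: " + key + " after " + lastKey);
            }
            lastKey = key;
        }
    }

    /** Appends three values, the second of which is null, and records each outcome. */
    static List<String> run() {
        OrderedWriter writer = new OrderedWriter();
        List<String> outcomes = new ArrayList<>();
        String[] inputs = {"page-0", null, "page-2"};
        for (String input : inputs) {
            try {
                writer.append(input);
                outcomes.add("ok");
            } catch (NullPointerException e) {
                outcomes.add("NullPointerException");
            } catch (IOException e) {
                outcomes.add("IOException: " + e.getMessage());
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        for (String outcome : run()) {
            System.out.println(outcome);
        }
    }
}
```

If this reading is right, the repeated "key out of order" errors are a follow-on symptom, and the thing to chase is why a null value reached the writer in the first place.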
---------------------------------------------------------------------------
These are the differences between nutch-site.xml and nutch-default.xml:
---------------------------------------------------------------------------
***** nutch-default.xml
<name>http.timeout</name>
<value>10000</value>
<description>The default network timeout, in milliseconds.</description>
***** NUTCH-SITE.XML
<name>http.timeout</name>
<value>30000</value>
<description>The default network timeout, in milliseconds.</description>
*****
***** nutch-default.xml
<name>http.max.delays</name>
<value>3</value>
<description>The number of times a thread will delay when trying to
***** NUTCH-SITE.XML
<name>http.max.delays</name>
<value>6</value>
<description>The number of times a thread will delay when trying to
*****
***** nutch-default.xml
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
<name>http.content.limit</name>
<value>130000</value>
<description>The length limit for downloaded content, in bytes.
*****
***** nutch-default.xml
<name>file.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
<name>file.content.limit</name>
<value>130000</value>
<description>The length limit for downloaded content, in bytes.
*****
***** nutch-default.xml
<name>ftp.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
<name>ftp.content.limit</name>
<value>130000</value>
<description>The length limit for downloaded content, in bytes.
*****
***** nutch-default.xml
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for a page.
***** NUTCH-SITE.XML
<name>db.max.outlinks.per.page</name>
<value>200</value>
<description>The maximum number of outlinks that we'll process for a page.
*****
***** nutch-default.xml
<name>db.fetch.retry.max</name>
<value>3</value>
<description>The maximum number of times a url that has encountered
***** NUTCH-SITE.XML
<name>db.fetch.retry.max</name>
<value>6</value>
<description>The maximum number of times a url that has encountered
*****
***** nutch-default.xml
<name>fetcher.server.delay</name>
<value>5.0</value>
<description>The number of seconds the fetcher will delay between
***** NUTCH-SITE.XML
<name>fetcher.server.delay</name>
<value>30.0</value>
<description>The number of seconds the fetcher will delay between
*****
***** nutch-default.xml
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
***** NUTCH-SITE.XML
<name>fetcher.threads.fetch</name>
<value>100</value>
<description>The number of FetcherThreads the fetcher should use.
*****
***** nutch-default.xml
<name>fetcher.threads.per.host</name>
<value>1</value>
<description>This number is the maximum number of threads that
***** NUTCH-SITE.XML
<name>fetcher.threads.per.host</name>
<value>100</value>
<description>This number is the maximum number of threads that
*****
***** nutch-default.xml
<name>parser.threads.parse</name>
<value>10</value>
<description>Number of ParserThreads ParseSegment should
use.</description>
***** NUTCH-SITE.XML
<name>parser.threads.parse</name>
<value>100</value>
<description>Number of ParserThreads ParseSegment should
use.</description>
*****
***** nutch-default.xml
<name>indexer.minMergeDocs</name>
<value>50</value>
<description>This number determines the minimum number of Lucene
***** NUTCH-SITE.XML
<name>indexer.minMergeDocs</name>
<value>10000</value>
<description>This number determines the minimum number of Lucene
*****
***** nutch-default.xml
<name>indexer.maxMergeDocs</name>
<value>50</value>
<description>This number determines the maximum number of Lucene
***** NUTCH-SITE.XML
<name>indexer.maxMergeDocs</name>
<value>10000000</value>
<description>This number determines the maximum number of Lucene
*****
***** nutch-default.xml
<name>searcher.dir</name>
<value>.</value>
<description>
***** NUTCH-SITE.XML
<name>searcher.dir</name>
<value>/srv/db/</value>
<description>
*****
***** nutch-default.xml
<name>ipc.client.timeout</name>
<value>10000</value>
<description>Defines the timeout for IPC calls in milliseconds.
</description>
***** NUTCH-SITE.XML
<name>ipc.client.timeout</name>
<value>20000</value>
<description>Defines the timeout for IPC calls in milliseconds.
</description>
*****
***** nutch-default.xml
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
***** NUTCH-SITE.XML
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-
basic|more|site|url)</value>
<description>Regular expression naming plugin directory names to
*****
***** nutch-default.xml
<name>parser.character.encoding.default</name>
<value>windows-1252</value>
<description>The character encoding to fall back to when no other
information
***** NUTCH-SITE.XML
<name>parser.character.encoding.default</name>
<value>iso-8859-2</value>
<description>The character encoding to fall back to when no other
information
*****
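
A side note on the plugin.includes value in the list above: as pasted, the nutch-site.xml expression ends in "query-" followed by "basic|more|site|url)", with no opening parenthesis after "query-". That may simply be a wrapping artifact of the mail, but it is cheap to rule out. The sketch below checks whether an expression compiles under java.util.regex and which plugin directory names it accepts. The class and method names are hypothetical, and whole-string matching is an assumption about how the expression is applied.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PluginIncludesCheck {
    // The nutch-site.xml value with the wrapped line joined exactly as it appears in the mail.
    static final String AS_PASTED =
        "protocol-httpclient|urlfilter-regex"
        + "|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)"
        + "|index-(basic|more)|query-basic|more|site|url)";

    // The same value with the parenthesis restored (an assumption about the intended value).
    static final String BALANCED =
        "protocol-httpclient|urlfilter-regex"
        + "|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)"
        + "|index-(basic|more)|query-(basic|more|site|url)";

    /** Returns true if the expression is a syntactically valid java.util.regex pattern. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    /** Whole-string match of a plugin directory name against the expression. */
    static boolean included(String regex, String pluginDir) {
        return Pattern.compile(regex).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        System.out.println("as pasted compiles: " + compiles(AS_PASTED));
        System.out.println("balanced compiles:  " + compiles(BALANCED));
        for (String dir : new String[] {"parse-msword", "query-more", "parse-rss"}) {
            System.out.println(dir + " included: " + included(BALANCED, dir));
        }
    }
}
```

Running the same check against the exact string from the local nutch-site.xml would show whether the file itself has the unbalanced parenthesis or only the mail does.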
Any idea what the source of the problem is?
Best Regards:
Ferenc
Re: [Nutch-dev] Re: nutch 0.7 bug?
Posted by YourSoft <yo...@freemail.hu>.
Dear Michael,
Thanks for your mail, but I think these are two different problems. I
don't use the RSS parser.
Ferenc
Michael Nebel wrote:
> Just for the mail archives: please see also NUTCH-89.
>
> Thread closed?
>
> Michael
>
>
>
> yoursoft@freemail.hu wrote:
>
>> Hi Michael,
>>
>> I am going back to a nightly build.
>> I think this problem is related to the 'fetcher.threads.per.host' value
>> when it is greater than 1.
>> Other possible sources are fetcher.threads.fetch,
>> fetcher.threads.per.host, or parser.threads.parse.
>>
>> Best Regards,
>> Ferenc
>>
>>> Hi Ferenc,
>>>
>>> I see the same errors. As I saw a running installation
>>> yesterday, I think it's a configuration mistake. So far I have no
>>> idea where. Have you made any progress?
>>>
>>> Regards
>>>
>>> Michael
>
>
Re: nutch 0.7 bug?
Posted by Michael Nebel <mi...@nebel.de>.
Just for the mail archives: please see also NUTCH-89.
Thread closed?
Michael
yoursoft@freemail.hu wrote:
> Hi Michael,
>
> I am going back to a nightly build.
> I think this problem is related to the 'fetcher.threads.per.host' value
> when it is greater than 1.
> Other possible sources are fetcher.threads.fetch,
> fetcher.threads.per.host, or parser.threads.parse.
>
> Best Regards,
> Ferenc
>
>> Hi Ferenc,
>>
>> I see the same errors. As I saw a running installation yesterday,
>> I think it's a configuration mistake. So far I have no idea where.
>> Have you made any progress?
>>
>> Regards
>>
>> Michael
--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
Re: nutch 0.7 bug?
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Hi Michael,
I am going back to a nightly build.
I think this problem is related to the 'fetcher.threads.per.host' value
when it is greater than 1.
Other possible sources are fetcher.threads.fetch,
fetcher.threads.per.host, or parser.threads.parse.
Best Regards,
Ferenc
> Hi Ferenc,
>
> I see the same errors. As I saw a running installation yesterday,
> I think it's a configuration mistake. So far I have no idea where.
> Have you made any progress?
>
> Regards
>
> Michael
>
>
> yoursoft@freemail.hu wrote:
>
>> Dear Developers!
>>
>> I tested Nutch 0.7 with all the parser plugins and found the
>> following:
>>
>> -------------------------------------------------------------------------
>>
>> The fetch breaks with errors such as the following:
>> -------------------------------------------------------------------------
>>
>> 050901 110915 fetch okay, but can't parse http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, reason: failed (2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved files are unsupported at this time
>> 050901 110915 fetching http://en.mimi.hu/fishing/scad.html
>> 050901 110917 SEVERE error writing output:java.lang.NullPointerException
>> java.lang.NullPointerException
>>     at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
>>     at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
>>     at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
>>     at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>> 050901 110917 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
>> java.io.IOException: key out of order: 319 after 319
>>     at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>     at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>     at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>> Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
>>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
>>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
>> 050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
>> java.io.IOException: key out of order: 319 after 319
>>     at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>     at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>> at
>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>
>> at
>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>
>> at
>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>> 050901 110921 SEVERE error writing output:java.io.IOException: key
>> out of order: 319 after 319
>> java.io.IOException: key out of order: 319 after 319
>> at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>> at
>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>
>> at
>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>
>> at
>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>> 050901 110921 SEVERE error writing output:java.io.IOException: key
>> out of order: 319 after 319
>> java.io.IOException: key out of order: 319 after 319
>> at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>> etc.
>>
>> ---------------------------------------------------------------------------
>>
>> There are the differences between nutch-site.xml and nutch-default.xml:
>> ---------------------------------------------------------------------------
>>
>> ***** nutch-default.xml
>> <name>http.timeout</name>
>> <value>10000</value>
>> <description>The default network timeout, in
>> milliseconds.</description>
>> ***** NUTCH-SITE.XML
>> <name>http.timeout</name>
>> <value>30000</value>
>> <description>The default network timeout, in
>> milliseconds.</description>
>> *****
>>
>> ***** nutch-default.xml
>> <name>http.max.delays</name>
>> <value>3</value>
>> <description>The number of times a thread will delay when trying to
>> ***** NUTCH-SITE.XML
>> <name>http.max.delays</name>
>> <value>6</value>
>> <description>The number of times a thread will delay when trying to
>> *****
>>
>> ***** nutch-default.xml
>> <name>http.content.limit</name>
>> <value>65536</value>
>> <description>The length limit for downloaded content, in bytes.
>> ***** NUTCH-SITE.XML
>> <name>http.content.limit</name>
>> <value>130000</value>
>> <description>The length limit for downloaded content, in bytes.
>> *****
>>
>> ***** nutch-default.xml
>> <name>file.content.limit</name>
>> <value>65536</value>
>> <description>The length limit for downloaded content, in bytes.
>> ***** NUTCH-SITE.XML
>> <name>file.content.limit</name>
>> <value>130000</value>
>> <description>The length limit for downloaded content, in bytes.
>> *****
>>
>> ***** nutch-default.xml
>> <name>ftp.content.limit</name>
>> <value>65536</value>
>> <description>The length limit for downloaded content, in bytes.
>> ***** NUTCH-SITE.XML
>> <name>ftp.content.limit</name>
>> <value>130000</value>
>> <description>The length limit for downloaded content, in bytes.
>> *****
>>
>> ***** nutch-default.xml
>> <name>db.max.outlinks.per.page</name>
>> <value>100</value>
>> <description>The maximum number of outlinks that we'll process for a
>> page.
>> ***** NUTCH-SITE.XML
>> <name>db.max.outlinks.per.page</name>
>> <value>200</value>
>> <description>The maximum number of outlinks that we'll process for a
>> page.
>> *****
>>
>> ***** nutch-default.xml
>> <name>db.fetch.retry.max</name>
>> <value>3</value>
>> <description>The maximum number of times a url that has encountered
>> ***** NUTCH-SITE.XML
>> <name>db.fetch.retry.max</name>
>> <value>6</value>
>> <description>The maximum number of times a url that has encountered
>> *****
>>
>> ***** nutch-default.xml
>> <name>fetcher.server.delay</name>
>> <value>5.0</value>
>> <description>The number of seconds the fetcher will delay between
>> ***** NUTCH-SITE.XML
>> <name>fetcher.server.delay</name>
>> <value>30.0</value>
>> <description>The number of seconds the fetcher will delay between
>> *****
>>
>> ***** nutch-default.xml
>> <name>fetcher.threads.fetch</name>
>> <value>10</value>
>> <description>The number of FetcherThreads the fetcher should use.
>> ***** NUTCH-SITE.XML
>> <name>fetcher.threads.fetch</name>
>> <value>100</value>
>> <description>The number of FetcherThreads the fetcher should use.
>> *****
>>
>> ***** nutch-default.xml
>> <name>fetcher.threads.per.host</name>
>> <value>1</value>
>> <description>This number is the maximum number of threads that
>> ***** NUTCH-SITE.XML
>> <name>fetcher.threads.per.host</name>
>> <value>100</value>
>> <description>This number is the maximum number of threads that
>> *****
>>
>> ***** nutch-default.xml
>> <name>parser.threads.parse</name>
>> <value>10</value>
>> <description>Number of ParserThreads ParseSegment should
>> use.</description>
>> ***** NUTCH-SITE.XML
>> <name>parser.threads.parse</name>
>> <value>100</value>
>> <description>Number of ParserThreads ParseSegment should
>> use.</description>
>> *****
>>
>> ***** nutch-default.xml
>> <name>indexer.minMergeDocs</name>
>> <value>50</value>
>> <description>This number determines the minimum number of Lucene
>> ***** NUTCH-SITE.XML
>> <name>indexer.minMergeDocs</name>
>> <value>10000</value>
>> <description>This number determines the minimum number of Lucene
>> *****
>>
>> ***** nutch-default.xml
>> <name>indexer.maxMergeDocs</name>
>> <value>50</value>
>> <description>This number determines the maximum number of Lucene
>> ***** NUTCH-SITE.XML
>> <name>indexer.maxMergeDocs</name>
>> <value>10000000</value>
>> <description>This number determines the maximum number of Lucene
>> *****
>>
>> ***** nutch-default.xml
>> <name>searcher.dir</name>
>> <value>.</value>
>> <description>
>> ***** NUTCH-SITE.XML
>> <name>searcher.dir</name>
>> <value>/srv/db/</value>
>> <description>
>> *****
>>
>> ***** nutch-default.xml
>> <name>ipc.client.timeout</name>
>> <value>10000</value>
>> <description>Defines the timeout for IPC calls in milliseconds.
>> </description>
>> ***** NUTCH-SITE.XML
>> <name>ipc.client.timeout</name>
>> <value>20000</value>
>> <description>Defines the timeout for IPC calls in milliseconds.
>> </description>
>> *****
>>
>> ***** nutch-default.xml
>> <name>plugin.includes</name>
>>
>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
>>
>> <description>Regular expression naming plugin directory names to
>> ***** NUTCH-SITE.XML
>> <name>plugin.includes</name>
>>
>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-
>>
>> basic|more|site|url)</value>
>> <description>Regular expression naming plugin directory names to
>> *****
>>
>> ***** nutch-default.xml
>> <name>parser.character.encoding.default</name>
>> <value>windows-1252</value>
>> <description>The character encoding to fall back to when no other
>> information
>> ***** NUTCH-SITE.XML
>> <name>parser.character.encoding.default</name>
>> <value>iso-8859-2</value>
>> <description>The character encoding to fall back to when no other
>> information
>> *****
>>
>> Any idea what is the problem source?
>>
>> Best Regards:
>> Ferenc
>
>
>
Re: nutch 0.7 bug?
Posted by Michael Nebel <mi...@nebel.de>.
Hi Ferenc,
I see the same errors. Since I saw a working installation yesterday, I
think it's a configuration mistake, but so far I have no idea where. Have
you made any progress?
Regards
Michael
yoursoft@freemail.hu wrote:
> Dear Developers!
>
> I tested nutch 0.7 with all the parser plugins, and found the followings:
>
> -------------------------------------------------------------------------
> The fetch broken by with e.g. followings:
> -------------------------------------------------------------------------
> 050901 110915 fetch okay, but can't parse
> http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc,
> reason: failed
> (2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved
> files are unsupported at this time
> 050901 110915 fetching http://en.mimi.hu/fishing/scad.html
> 050901 110917 SEVERE error writing output:java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
> at
> org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> 050901 110917 SEVERE error writing output:java.io.IOException: key out
> of order: 319 after 319
> java.io.IOException: key out of order: 319 after 319
> at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> Exception in thread "main" java.lang.RuntimeException: SEVERE error
> logged. Exiting fetcher.
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
> 050901 110921 SEVERE error writing output:java.io.IOException: key out
> of order: 319 after 319
> java.io.IOException: key out of order: 319 after 319
> at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> 050901 110921 SEVERE error writing output:java.io.IOException: key out
> of order: 319 after 319
> java.io.IOException: key out of order: 319 after 319
> at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>
> at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> 050901 110921 SEVERE error writing output:java.io.IOException: key out
> of order: 319 after 319
> java.io.IOException: key out of order: 319 after 319
> at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
> etc.
>
> ---------------------------------------------------------------------------
> There are the differences between nutch-site.xml and nutch-default.xml:
> ---------------------------------------------------------------------------
> ***** nutch-default.xml
> <name>http.timeout</name>
> <value>10000</value>
> <description>The default network timeout, in milliseconds.</description>
> ***** NUTCH-SITE.XML
> <name>http.timeout</name>
> <value>30000</value>
> <description>The default network timeout, in milliseconds.</description>
> *****
>
> ***** nutch-default.xml
> <name>http.max.delays</name>
> <value>3</value>
> <description>The number of times a thread will delay when trying to
> ***** NUTCH-SITE.XML
> <name>http.max.delays</name>
> <value>6</value>
> <description>The number of times a thread will delay when trying to
> *****
>
> ***** nutch-default.xml
> <name>http.content.limit</name>
> <value>65536</value>
> <description>The length limit for downloaded content, in bytes.
> ***** NUTCH-SITE.XML
> <name>http.content.limit</name>
> <value>130000</value>
> <description>The length limit for downloaded content, in bytes.
> *****
>
> ***** nutch-default.xml
> <name>file.content.limit</name>
> <value>65536</value>
> <description>The length limit for downloaded content, in bytes.
> ***** NUTCH-SITE.XML
> <name>file.content.limit</name>
> <value>130000</value>
> <description>The length limit for downloaded content, in bytes.
> *****
>
> ***** nutch-default.xml
> <name>ftp.content.limit</name>
> <value>65536</value>
> <description>The length limit for downloaded content, in bytes.
> ***** NUTCH-SITE.XML
> <name>ftp.content.limit</name>
> <value>130000</value>
> <description>The length limit for downloaded content, in bytes.
> *****
>
> ***** nutch-default.xml
> <name>db.max.outlinks.per.page</name>
> <value>100</value>
> <description>The maximum number of outlinks that we'll process for a page.
> ***** NUTCH-SITE.XML
> <name>db.max.outlinks.per.page</name>
> <value>200</value>
> <description>The maximum number of outlinks that we'll process for a page.
> *****
>
> ***** nutch-default.xml
> <name>db.fetch.retry.max</name>
> <value>3</value>
> <description>The maximum number of times a url that has encountered
> ***** NUTCH-SITE.XML
> <name>db.fetch.retry.max</name>
> <value>6</value>
> <description>The maximum number of times a url that has encountered
> *****
>
> ***** nutch-default.xml
> <name>fetcher.server.delay</name>
> <value>5.0</value>
> <description>The number of seconds the fetcher will delay between
> ***** NUTCH-SITE.XML
> <name>fetcher.server.delay</name>
> <value>30.0</value>
> <description>The number of seconds the fetcher will delay between
> *****
>
> ***** nutch-default.xml
> <name>fetcher.threads.fetch</name>
> <value>10</value>
> <description>The number of FetcherThreads the fetcher should use.
> ***** NUTCH-SITE.XML
> <name>fetcher.threads.fetch</name>
> <value>100</value>
> <description>The number of FetcherThreads the fetcher should use.
> *****
>
> ***** nutch-default.xml
> <name>fetcher.threads.per.host</name>
> <value>1</value>
> <description>This number is the maximum number of threads that
> ***** NUTCH-SITE.XML
> <name>fetcher.threads.per.host</name>
> <value>100</value>
> <description>This number is the maximum number of threads that
> *****
>
> ***** nutch-default.xml
> <name>parser.threads.parse</name>
> <value>10</value>
> <description>Number of ParserThreads ParseSegment should
> use.</description>
> ***** NUTCH-SITE.XML
> <name>parser.threads.parse</name>
> <value>100</value>
> <description>Number of ParserThreads ParseSegment should
> use.</description>
> *****
>
> ***** nutch-default.xml
> <name>indexer.minMergeDocs</name>
> <value>50</value>
> <description>This number determines the minimum number of Lucene
> ***** NUTCH-SITE.XML
> <name>indexer.minMergeDocs</name>
> <value>10000</value>
> <description>This number determines the minimum number of Lucene
> *****
>
> ***** nutch-default.xml
> <name>indexer.maxMergeDocs</name>
> <value>50</value>
> <description>This number determines the maximum number of Lucene
> ***** NUTCH-SITE.XML
> <name>indexer.maxMergeDocs</name>
> <value>10000000</value>
> <description>This number determines the maximum number of Lucene
> *****
>
> ***** nutch-default.xml
> <name>searcher.dir</name>
> <value>.</value>
> <description>
> ***** NUTCH-SITE.XML
> <name>searcher.dir</name>
> <value>/srv/db/</value>
> <description>
> *****
>
> ***** nutch-default.xml
> <name>ipc.client.timeout</name>
> <value>10000</value>
> <description>Defines the timeout for IPC calls in milliseconds.
> </description>
> ***** NUTCH-SITE.XML
> <name>ipc.client.timeout</name>
> <value>20000</value>
> <description>Defines the timeout for IPC calls in milliseconds.
> </description>
> *****
>
> ***** nutch-default.xml
> <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
>
> <description>Regular expression naming plugin directory names to
> ***** NUTCH-SITE.XML
> <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-
>
> basic|more|site|url)</value>
> <description>Regular expression naming plugin directory names to
> *****
>
> ***** nutch-default.xml
> <name>parser.character.encoding.default</name>
> <value>windows-1252</value>
> <description>The character encoding to fall back to when no other
> information
> ***** NUTCH-SITE.XML
> <name>parser.character.encoding.default</name>
> <value>iso-8859-2</value>
> <description>The character encoding to fall back to when no other
> information
> *****
>
> Any idea what is the problem source?
>
> Best Regards:
> Ferenc
--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
Re: Event queues vs threads
Posted by Paul Baclace <pe...@baclace.net>.
Doug Cutting wrote:
>Kelvin Tan wrote:
>> fetcher as a series of event queues (ala SEDA) instead
>> of with threads.
>
> I have never been able to write a async version of things with Java's
> nio that outperforms a threaded version. In theory it is possible,
> since you can avoid thread switching overheads. But in practice I have
> found it difficult.
I read the David Culler et al. SEDA paper a while back and I think
the real benefit is twofold: (1) more concurrent connections and
(2) graceful degradation (meaning fair scheduling) at maximum load.
IIRC, they hint at competitive-with-apache web serving, but this
depends on specific mix of requests/file sizes, etc.; Tomcat can
also beat the apache web server under some conditions.
Services that need to maintain lots of mostly-idle connections
(like instant messaging) benefit the most from a SEDA architecture.
It should be possible to have graceful degradation with a
thread-oriented architecture. Perhaps a self-tuning procedure
could, for a specific installation, discover the parameter
settings that get the most out of a server and have it refuse requests
that would push it into the unfair scheduling zone.
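The refuse-when-saturated part can be sketched with a bounded thread pool from java.util.concurrent. This is only a minimal illustration under assumed numbers (2 workers, 2 queue slots, 10 requests); a self-tuning procedure would have to discover such limits rather than hard-code them, and the class name BoundedPoolSketch is made up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolSketch {
    /** Submits blocking tasks to a fixed-capacity pool and returns how many were refused. */
    static int countRefused(int workers, int queueSlots, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSlots),
                new ThreadPoolExecutor.AbortPolicy());  // refuse rather than queue without bound
        CountDownLatch release = new CountDownLatch(1);
        int refused = 0;
        for (int i = 0; i < requests; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await();                // simulate a slow request
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                refused++;                              // the pool is at capacity
            }
        }
        release.countDown();                            // let the accepted requests finish
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return refused;
    }

    public static void main(String[] args) {
        System.out.println("refused: " + countRefused(2, 2, 10));
    }
}
```

Requests beyond the workers plus the queue slots are turned away immediately instead of piling up, which is one simple way to keep a threaded server out of the overloaded zone.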
Paul
Re: Event queues vs threads
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hi,
I think some old blog entries are quite interesting if someone wants to
find out more details about nio.
http://jroller.com/page/pyrasun/20040426
Regards,
Piotr
Doug Cutting wrote:
> Kelvin Tan wrote:
>
>> Interesting. I haven't tried it myself. Do you have any
>> code/benchmarks for this?
>
>
> I never committed it anywhere. I initially tried to write Nutch's IPC
> mechanism with nio and it was slow and buggy. One problem was that I
> needed to switch streams to non-blocking mode in order to read
> arbitrarily large objects, then switch them back to blocking mode in
> order to select() on them. But you can't change this state and remove
> them from the selector without going through the scheduler. So the
> benefit of skipping the scheduler wasn't there. If I was willing to
> fragment objects into fixed size chunks then it might have worked, but
> that's a lot of work. It's a strange limitation, since with native
> sockets one can select and then perform arbitrary stream i/o, not
> limited to a single buffer.
>
> Also, there's an nio version of Lucene's Directory that's a bit slower
> than the non-nio version, but this is not using select() or anything.
>
>> Are you aware of others facing the same problem?
>
>
> How much non-blocking nio code do you find in real Java code? I have
> not seen a lot.
>
> I did find that Sun has implemented a high-performance HTTP client using
> nio. This is documented at:
>
> http://blogs.sun.com/roller/resources/fp/grizzly.pdf
>
> From what I can tell the primary benefit is in number of simultaneous
> clients, not in throughput. Does a crawler require 1000's of
> simultaneous connections? If so, then it looks like careful use of nio
> could offer some real benefits.
>
> Doug
>
Re: Event queues vs threads
Posted by Doug Cutting <cu...@nutch.org>.
Kelvin Tan wrote:
> Interesting. I haven't tried it myself. Do you have any code/benchmarks for this?
I never committed it anywhere. I initially tried to write Nutch's IPC
mechanism with nio and it was slow and buggy. One problem was that I
needed to switch streams to non-blocking mode in order to read
arbitrarily large objects, then switch them back to blocking mode in
order to select() on them. But you can't change this state and remove
them from the selector without going through the scheduler. So the
benefit of skipping the scheduler wasn't there. If I was willing to
fragment objects into fixed size chunks then it might have worked, but
that's a lot of work. It's a strange limitation, since with native
sockets one can select and then perform arbitrary stream i/o, not
limited to a single buffer.
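The blocking-mode restriction can be reproduced in a few lines. The sketch below is not the IPC code described above; it uses an in-process java.nio.channels.Pipe instead of a socket so it needs no network, and the class name BlockingModeSketch is made up. It shows only the documented rule that a channel cannot be switched to blocking mode while it is registered with a selector, and that a cancelled key is deregistered during a later selection operation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.IllegalBlockingModeException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class BlockingModeSketch {
    /** Returns {refusedWhileRegistered, blockingAfterDeregistration}. */
    static boolean[] run() {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();                     // in-process channel pair
            Pipe.SourceChannel source = pipe.source();
            try {
                source.configureBlocking(false);         // required before registering
                SelectionKey key = source.register(selector, SelectionKey.OP_READ);

                boolean refused = false;
                try {
                    source.configureBlocking(true);      // still registered: not allowed
                } catch (IllegalBlockingModeException e) {
                    refused = true;
                }

                key.cancel();                            // request deregistration
                selector.selectNow();                    // a selection pass processes the cancelled key
                source.configureBlocking(true);          // now permitted
                return new boolean[] {refused, source.isBlocking()};
            } finally {
                source.close();
                pipe.sink().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("refused while registered: " + r[0]);
        System.out.println("blocking after deregistration: " + r[1]);
    }
}
```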
Also, there's an nio version of Lucene's Directory that's a bit slower
than the non-nio version, but this is not using select() or anything.
> Are you aware of others facing the same problem?
How much non-blocking nio code do you find in real Java code? I have
not seen a lot.
I did find that Sun has implemented a high-performance HTTP server using
nio. This is documented at:
http://blogs.sun.com/roller/resources/fp/grizzly.pdf
From what I can tell the primary benefit is in number of simultaneous
clients, not in throughput. Does a crawler require 1000's of
simultaneous connections? If so, then it looks like careful use of nio
could offer some real benefits.
Doug
Re: Event queues vs threads
Posted by Kelvin Tan <ke...@relevanz.com>.
On Thu, 01 Sep 2005 09:58:49 -0700, Doug Cutting wrote:
> Kelvin Tan wrote:
>> Each of these stages will be handled in its own thread (except
>> for HTML parsing and scoring, which may actually benefit from
>> having multiple threads). With the introduction of non-blocking
>> IO, I think threads should be used only where parallel
>> computation offers performance advantages.
>>
>> Breaking up HttpRequest and HttpResponse, will also pave the way
>> for a non-blocking HTTP implementation.
>>
> I have never been able to write a async version of things with
> Java's nio that outperforms a threaded version. In theory it is
> possible, since you can avoid thread switching overheads. But in
> practice I have found it difficult.
>
> Doug
Interesting. I haven't tried it myself. Do you have any code/benchmarks for this? Are you aware of others facing the same problem?
k
Re: Event queues vs threads
Posted by Doug Cutting <cu...@nutch.org>.
Kelvin Tan wrote:
> Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance advantages.
>
> Breaking up HttpRequest and HttpResponse, will also pave the way for a non-blocking HTTP implementation.
I have never been able to write an async version of things with Java's
nio that outperforms a threaded version. In theory it is possible,
since you can avoid thread switching overheads. But in practice I have
found it difficult.
Doug
Event queues vs threads
Posted by Kelvin Tan <ke...@relevanz.com>.
I'm toying with the idea of implementing the fetcher as a series of event queues (à la SEDA) instead of with threads. This is done by breaking up the fetching operation into a series of stages connected by queues, instead of one FetcherThread per task.
The stages I see are:
1. CrawlStarter (url injection)
2. URL filtering and normalizing
3. HttpRequest
4. HttpResponse
5. DB of fetched MD5 hashes
6. DB of fetched URLs
7. Parse and link extraction
8. Output
9. Link/Page Scoring
Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance advantages.
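The stage idea above might look roughly like this in Java. It is a sketch under several assumptions: only two of the nine stages are shown, the "fetch" step is a placeholder that makes no HTTP request, and the names StagePipelineSketch, stage, and END are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

class StagePipelineSketch {
    // Hypothetical end-of-stream marker; assumes no real item has this value.
    private static final String END = "__END__";

    /** Starts one thread that moves items from in to out; a null result drops the item. */
    static Thread stage(BlockingQueue<String> in, BlockingQueue<String> out,
                        Function<String, String> step) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String item = in.take();
                    if (END.equals(item)) {
                        out.put(END);          // pass the marker on so later stages stop too
                        return;
                    }
                    String result = step.apply(item);
                    if (result != null) {
                        out.put(result);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Runs URLs through a filter stage and a placeholder fetch stage. */
    static List<String> run(List<String> urls) {
        BlockingQueue<String> injected = new LinkedBlockingQueue<>();
        BlockingQueue<String> filtered = new LinkedBlockingQueue<>();
        BlockingQueue<String> fetched = new LinkedBlockingQueue<>();
        stage(injected, filtered, u -> u.startsWith("http://") ? u : null);
        stage(filtered, fetched, u -> "fetched:" + u);
        injected.addAll(urls);
        injected.add(END);
        List<String> results = new ArrayList<>();
        try {
            for (String r = fetched.take(); !END.equals(r); r = fetched.take()) {
                results.add(r);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("http://a/", "ftp://b/", "http://c/")));
    }
}
```

Because each stage here has exactly one thread, the state inside a stage needs no locking; the queues are the only shared objects.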
Breaking up HttpRequest and HttpResponse will also pave the way for a non-blocking HTTP implementation.
A big advantage also arises from a decrease in programmatic complexity (and possibly a gain in performance). With most of the stages being guaranteed to be single-threaded, threading/synchronization issues are dramatically reduced. This may not be so evident in the current/map-red fetch code, but because of the completely online nature of nutch-84/OC, this does simplify things considerably.
I'll need to dig a bit more to see how this can be conceptually translated into map-reduce, but I imagine it's doable. Perhaps each stage gets mapped then reduced?
Any thoughts?