Posted to dev@nutch.apache.org by YourSoft <yo...@freemail.hu> on 2005/09/10 17:57:34 UTC
Re: [Nutch-dev] Re: nutch 0.7 bug?
Dear Michael,
Thanks for your mail. But I think these are two different problems: I
don't use the RSS parser.
Ferenc
Michael Nebel wrote:
> Just for the mail archives: please see also NUTCH-89.
>
> Thread closed?
>
> Michael
>
>
>
> yoursoft@freemail.hu wrote:
>
>> Hi Michael,
>>
>> I am going back to a nightly build.
>> I think this problem is related to the 'fetcher.threads.per.host' value,
>> when it is greater than 1.
>> The candidate settings are fetcher.threads.fetch,
>> fetcher.threads.per.host, and parser.threads.parse.
>>
>> Best Regards,
>> Ferenc
>>
>>> Hi Ferenc,
>>>
>>> I see the same errors. Since I saw a working installation
>>> yesterday, I think it's a configuration mistake. So far I have no
>>> idea where. Have you made any progress?
>>>
>>> Regards
>>>
>>> Michael
>>>
>>>
>>> yoursoft@freemail.hu wrote:
>>>
>>>> Dear Developers!
>>>>
>>>> I tested nutch 0.7 with all the parser plugins, and found the
>>>> following:
>>>>
>>>> -------------------------------------------------------------------------
>>>>
>>>> The fetch breaks with, e.g., the following:
>>>> -------------------------------------------------------------------------
>>>>
>>>> 050901 110915 fetch okay, but can't parse
>>>> http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc,
>>>> reason: failed
>>>> (2,200): org.apache.nutch.parse.msword.FastSavedException:
>>>> Fast-saved files are unsupported at this time
>>>> 050901 110915 fetching http://en.mimi.hu/fishing/scad.html
>>>> 050901 110917 SEVERE error writing
>>>> output:java.lang.NullPointerException
>>>> java.lang.NullPointerException
>>>> at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
>>>> at
>>>> org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
>>>> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
>>>> at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>>>
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>>>
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110917 SEVERE error writing output:java.io.IOException: key
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>> at
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>> at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>>>
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>>>
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> Exception in thread "main" java.lang.RuntimeException: SEVERE error
>>>> logged. Exiting fetcher.
>>>> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
>>>> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>> at
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>> at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>>>
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>>>
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>> at
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>> at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>>>
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>>>
>>>> at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>> at
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>> at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>> etc.
>>>>
>>>> ---------------------------------------------------------------------------
>>>>
>>>> Here are the differences between nutch-site.xml and
>>>> nutch-default.xml:
>>>> ---------------------------------------------------------------------------
>>>>
>>>> ***** nutch-default.xml
>>>> <name>http.timeout</name>
>>>> <value>10000</value>
>>>> <description>The default network timeout, in
>>>> milliseconds.</description>
>>>> ***** NUTCH-SITE.XML
>>>> <name>http.timeout</name>
>>>> <value>30000</value>
>>>> <description>The default network timeout, in
>>>> milliseconds.</description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>http.max.delays</name>
>>>> <value>3</value>
>>>> <description>The number of times a thread will delay when trying to
>>>> ***** NUTCH-SITE.XML
>>>> <name>http.max.delays</name>
>>>> <value>6</value>
>>>> <description>The number of times a thread will delay when trying to
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>http.content.limit</name>
>>>> <value>65536</value>
>>>> <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>> <name>http.content.limit</name>
>>>> <value>130000</value>
>>>> <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>file.content.limit</name>
>>>> <value>65536</value>
>>>> <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>> <name>file.content.limit</name>
>>>> <value>130000</value>
>>>> <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>ftp.content.limit</name>
>>>> <value>65536</value>
>>>> <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>> <name>ftp.content.limit</name>
>>>> <value>130000</value>
>>>> <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>db.max.outlinks.per.page</name>
>>>> <value>100</value>
>>>> <description>The maximum number of outlinks that we'll process for
>>>> a page.
>>>> ***** NUTCH-SITE.XML
>>>> <name>db.max.outlinks.per.page</name>
>>>> <value>200</value>
>>>> <description>The maximum number of outlinks that we'll process for
>>>> a page.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>db.fetch.retry.max</name>
>>>> <value>3</value>
>>>> <description>The maximum number of times a url that has encountered
>>>> ***** NUTCH-SITE.XML
>>>> <name>db.fetch.retry.max</name>
>>>> <value>6</value>
>>>> <description>The maximum number of times a url that has encountered
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>fetcher.server.delay</name>
>>>> <value>5.0</value>
>>>> <description>The number of seconds the fetcher will delay between
>>>> ***** NUTCH-SITE.XML
>>>> <name>fetcher.server.delay</name>
>>>> <value>30.0</value>
>>>> <description>The number of seconds the fetcher will delay between
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>fetcher.threads.fetch</name>
>>>> <value>10</value>
>>>> <description>The number of FetcherThreads the fetcher should use.
>>>> ***** NUTCH-SITE.XML
>>>> <name>fetcher.threads.fetch</name>
>>>> <value>100</value>
>>>> <description>The number of FetcherThreads the fetcher should use.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>fetcher.threads.per.host</name>
>>>> <value>1</value>
>>>> <description>This number is the maximum number of threads that
>>>> ***** NUTCH-SITE.XML
>>>> <name>fetcher.threads.per.host</name>
>>>> <value>100</value>
>>>> <description>This number is the maximum number of threads that
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>parser.threads.parse</name>
>>>> <value>10</value>
>>>> <description>Number of ParserThreads ParseSegment should
>>>> use.</description>
>>>> ***** NUTCH-SITE.XML
>>>> <name>parser.threads.parse</name>
>>>> <value>100</value>
>>>> <description>Number of ParserThreads ParseSegment should
>>>> use.</description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>indexer.minMergeDocs</name>
>>>> <value>50</value>
>>>> <description>This number determines the minimum number of Lucene
>>>> ***** NUTCH-SITE.XML
>>>> <name>indexer.minMergeDocs</name>
>>>> <value>10000</value>
>>>> <description>This number determines the minimum number of Lucene
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>indexer.maxMergeDocs</name>
>>>> <value>50</value>
>>>> <description>This number determines the maximum number of Lucene
>>>> ***** NUTCH-SITE.XML
>>>> <name>indexer.maxMergeDocs</name>
>>>> <value>10000000</value>
>>>> <description>This number determines the maximum number of Lucene
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>searcher.dir</name>
>>>> <value>.</value>
>>>> <description>
>>>> ***** NUTCH-SITE.XML
>>>> <name>searcher.dir</name>
>>>> <value>/srv/db/</value>
>>>> <description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>ipc.client.timeout</name>
>>>> <value>10000</value>
>>>> <description>Defines the timeout for IPC calls in milliseconds.
>>>> </description>
>>>> ***** NUTCH-SITE.XML
>>>> <name>ipc.client.timeout</name>
>>>> <value>20000</value>
>>>> <description>Defines the timeout for IPC calls in milliseconds.
>>>> </description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>plugin.includes</name>
>>>>
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
>>>>
>>>> <description>Regular expression naming plugin directory names to
>>>> ***** NUTCH-SITE.XML
>>>> <name>plugin.includes</name>
>>>>
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-(basic|more|site|url)</value>
>>>> <description>Regular expression naming plugin directory names to
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>> <name>parser.character.encoding.default</name>
>>>> <value>windows-1252</value>
>>>> <description>The character encoding to fall back to when no other
>>>> information
>>>> ***** NUTCH-SITE.XML
>>>> <name>parser.character.encoding.default</name>
>>>> <value>iso-8859-2</value>
>>>> <description>The character encoding to fall back to when no other
>>>> information
>>>> *****
>>>>
>>>> Any idea what the source of the problem is?
>>>>
>>>> Best Regards:
>>>> Ferenc
>>>
>>>
>>>
>>>
>>>
>
>
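
One possible reading of the two stack traces quoted above (an interpretation, not something confirmed in this thread): the NullPointerException is thrown inside MapFile$Writer.append at line 127, after the key check at line 120 has already passed. The repeated "key out of order: 319 after 319" errors would then follow if the caller's record counter is only advanced after a successful append. The sketch below models that sequence in plain Java. The classes OrderedWriter and CountingWriter and the method run are hypothetical stand-ins written for illustration; they are not the Nutch classes, and IllegalStateException is used here in place of the IOException seen in the log.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyOrderSketch {

    /** Hypothetical stand-in for a writer that insists on strictly increasing keys. */
    static class OrderedWriter {
        private long lastKey = -1;
        final List<String> records = new ArrayList<>();

        void append(long key, String value) {
            // Step 1: the key check passes and the key is remembered.
            if (key <= lastKey) {
                throw new IllegalStateException(
                        "key out of order: " + key + " after " + lastKey);
            }
            lastKey = key;
            // Step 2: writing a null value fails, but the key is already recorded.
            records.add(key + ":" + value.length());
        }
    }

    /** Hypothetical stand-in for a writer that numbers its records itself. */
    static class CountingWriter {
        private final OrderedWriter out = new OrderedWriter();
        private long count = 0;

        void append(String value) {
            out.append(count, value);
            count++; // skipped when out.append throws, so the next call reuses the key
        }
    }

    /** Appends each value in turn and returns one line per failure, in order. */
    static List<String> run(String... values) {
        CountingWriter writer = new CountingWriter();
        List<String> errors = new ArrayList<>();
        for (String v : values) {
            try {
                writer.append(v);
            } catch (RuntimeException e) {
                errors.add(e.getClass().getSimpleName() + ": " + e.getMessage());
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // One null value poisons every later append in this model.
        for (String line : run("a", null, "b", "c")) {
            System.out.println(line);
        }
    }
}
```

In this model a single failed write is enough to make every later append fail with the same key, which matches the shape of the log: one NullPointerException followed by the same "key out of order" message over and over. If that reading is right, the thread settings would only affect how quickly the first bad record is reached, and the null value handed to ParseData.write would be the thing to chase.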