Posted to dev@nutch.apache.org by YourSoft <yo...@freemail.hu> on 2005/09/10 17:57:34 UTC

Re: [Nutch-dev] Re: nutch 0.7 bug?

Dear Michael,

Thanks for your mail. But I think these are two different problems: I 
don't use the RSS parser.

Ferenc
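
The point about the RSS parser can be checked against the plugin.includes value from nutch-site.xml quoted further down in this thread. The following is only a sketch under two assumptions that the thread does not confirm: that the wrapped "query-" line was meant to read "query-(basic|more|site|url)", and that a plugin is enabled when its directory name matches the whole expression. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {

    // plugin.includes value from the quoted nutch-site.xml, with the wrapped
    // line rejoined. ASSUMPTION: the opening parenthesis after "query-" was
    // lost in the mail wrap.
    static final String SITE_INCLUDES =
        "protocol-httpclient|urlfilter-regex|"
        + "parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|"
        + "index-(basic|more)|query-(basic|more|site|url)";

    // ASSUMPTION: the expression must match the entire plugin directory name.
    static boolean isIncluded(String pluginName) {
        return Pattern.matches(SITE_INCLUDES, pluginName);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"parse-msword", "parse-rss", "index-more"}) {
            System.out.println(name + " -> " + isIncluded(name));
        }
    }
}
```

Under those assumptions, parse-msword and index-more are included while parse-rss is not, which agrees with the statement above.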

Michael Nebel wrote:

> Just for the mail archives: please see also NUTCH-89.
>
> Thread closed?
>
> Michael
>
>
>
> yoursoft@freemail.hu wrote:
>
>> Hi Michael,
>>
>> I am going back to a nightly build.
>> I think this problem is related to the 'fetcher.threads.per.host' value 
>> when it is bigger than 1.
>> Other possible sources are fetcher.threads.fetch, 
>> fetcher.threads.per.host, or parser.threads.parse.
>>
>> Best Regards,
>>    Ferenc
>>
>>> Hi Ferenc,
>>>
>>> I see the same errors. As I've seen a running installation 
>>> yesterday, I think it's a configuration mistake. By now I have no 
>>> idea where. Have you made any progress?
>>>
>>> Regards
>>>
>>>     Michael
>>>
>>>
>>> yoursoft@freemail.hu wrote:
>>>
>>>> Dear Developers!
>>>>
>>>> I tested Nutch 0.7 with all the parser plugins, and found the 
>>>> following:
>>>>
>>>> ------------------------------------------------------------------------- 
>>>>
>>>> The fetch breaks with, e.g., the following:
>>>> ------------------------------------------------------------------------- 
>>>>
>>>> 050901 110915 fetch okay, but can't parse 
>>>> http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, 
>>>> reason: failed
>>>> (2,200): org.apache.nutch.parse.msword.FastSavedException: 
>>>> Fast-saved files are unsupported at this time
>>>> 050901 110915 fetching http://en.mimi.hu/fishing/scad.html
>>>> 050901 110917 SEVERE error writing 
>>>> output:java.lang.NullPointerException
>>>> java.lang.NullPointerException
>>>>        at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
>>>>        at 
>>>> org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
>>>>        at 
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) 
>>>>
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 
>>>>
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110917 SEVERE error writing output:java.io.IOException: key 
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>>        at 
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>>        at 
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) 
>>>>
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 
>>>>
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> Exception in thread "main" java.lang.RuntimeException: SEVERE error 
>>>> logged.  Exiting fetcher.
>>>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
>>>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key 
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>>        at 
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>>        at 
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) 
>>>>
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 
>>>>
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key 
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>>        at 
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>>        at 
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) 
>>>>
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 
>>>>
>>>>        at 
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key 
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>>        at 
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>>        at 
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>> etc.
>>>>
>>>> --------------------------------------------------------------------------- 
>>>>
>>>> These are the differences between nutch-site.xml and 
>>>> nutch-default.xml:
>>>> --------------------------------------------------------------------------- 
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>http.timeout</name>
>>>>  <value>10000</value>
>>>>  <description>The default network timeout, in 
>>>> milliseconds.</description>
>>>> ***** NUTCH-SITE.XML
>>>>  <name>http.timeout</name>
>>>>  <value>30000</value>
>>>>  <description>The default network timeout, in 
>>>> milliseconds.</description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>http.max.delays</name>
>>>>  <value>3</value>
>>>>  <description>The number of times a thread will delay when trying to
>>>> ***** NUTCH-SITE.XML
>>>>  <name>http.max.delays</name>
>>>>  <value>6</value>
>>>>  <description>The number of times a thread will delay when trying to
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>http.content.limit</name>
>>>>  <value>65536</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>http.content.limit</name>
>>>>  <value>130000</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>file.content.limit</name>
>>>>  <value>65536</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>file.content.limit</name>
>>>>  <value>130000</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>ftp.content.limit</name>
>>>>  <value>65536</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>ftp.content.limit</name>
>>>>  <value>130000</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>db.max.outlinks.per.page</name>
>>>>  <value>100</value>
>>>>  <description>The maximum number of outlinks that we'll process for 
>>>> a page.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>db.max.outlinks.per.page</name>
>>>>  <value>200</value>
>>>>  <description>The maximum number of outlinks that we'll process for 
>>>> a page.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>db.fetch.retry.max</name>
>>>>  <value>3</value>
>>>>  <description>The maximum number of times a url that has encountered
>>>> ***** NUTCH-SITE.XML
>>>>  <name>db.fetch.retry.max</name>
>>>>  <value>6</value>
>>>>  <description>The maximum number of times a url that has encountered
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>fetcher.server.delay</name>
>>>>  <value>5.0</value>
>>>>  <description>The number of seconds the fetcher will delay between
>>>> ***** NUTCH-SITE.XML
>>>>  <name>fetcher.server.delay</name>
>>>>  <value>30.0</value>
>>>>  <description>The number of seconds the fetcher will delay between
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>fetcher.threads.fetch</name>
>>>>  <value>10</value>
>>>>  <description>The number of FetcherThreads the fetcher should use.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>fetcher.threads.fetch</name>
>>>>  <value>100</value>
>>>>  <description>The number of FetcherThreads the fetcher should use.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>fetcher.threads.per.host</name>
>>>>  <value>1</value>
>>>>  <description>This number is the maximum number of threads that
>>>> ***** NUTCH-SITE.XML
>>>>  <name>fetcher.threads.per.host</name>
>>>>  <value>100</value>
>>>>  <description>This number is the maximum number of threads that
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>parser.threads.parse</name>
>>>>  <value>10</value>
>>>>  <description>Number of ParserThreads ParseSegment should 
>>>> use.</description>
>>>> ***** NUTCH-SITE.XML
>>>>  <name>parser.threads.parse</name>
>>>>  <value>100</value>
>>>>  <description>Number of ParserThreads ParseSegment should 
>>>> use.</description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>indexer.minMergeDocs</name>
>>>>  <value>50</value>
>>>>  <description>This number determines the minimum number of Lucene
>>>> ***** NUTCH-SITE.XML
>>>>  <name>indexer.minMergeDocs</name>
>>>>  <value>10000</value>
>>>>  <description>This number determines the minimum number of Lucene
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>indexer.maxMergeDocs</name>
>>>>  <value>50</value>
>>>>  <description>This number determines the maximum number of Lucene
>>>> ***** NUTCH-SITE.XML
>>>>  <name>indexer.maxMergeDocs</name>
>>>>  <value>10000000</value>
>>>>  <description>This number determines the maximum number of Lucene
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>searcher.dir</name>
>>>>  <value>.</value>
>>>>  <description>
>>>> ***** NUTCH-SITE.XML
>>>>  <name>searcher.dir</name>
>>>>  <value>/srv/db/</value>
>>>>  <description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>ipc.client.timeout</name>
>>>>  <value>10000</value>
>>>>  <description>Defines the timeout for IPC calls in milliseconds. 
>>>> </description>
>>>> ***** NUTCH-SITE.XML
>>>>  <name>ipc.client.timeout</name>
>>>>  <value>20000</value>
>>>>  <description>Defines the timeout for IPC calls in milliseconds. 
>>>> </description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>plugin.includes</name>
>>>>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
>>>>  <description>Regular expression naming plugin directory names to
>>>> ***** NUTCH-SITE.XML
>>>>  <name>plugin.includes</name>
>>>>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-(basic|more|site|url)</value>
>>>>  <description>Regular expression naming plugin directory names to
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>parser.character.encoding.default</name>
>>>>  <value>windows-1252</value>
>>>>  <description>The character encoding to fall back to when no other 
>>>> information
>>>> ***** NUTCH-SITE.XML
>>>>  <name>parser.character.encoding.default</name>
>>>>  <value>iso-8859-2</value>
>>>>  <description>The character encoding to fall back to when no other 
>>>> information
>>>> *****
>>>>
>>>> Any idea what the source of the problem is?
>>>>
>>>> Best Regards:
>>>>    Ferenc
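
The repeated "key out of order: 319 after 319" message in the trace quoted above reads as if the writer was handed key 319 a second time. The following is a minimal sketch of how an ordering check of that kind produces such a message. It is a hypothetical stand-in, not Nutch's MapFile.Writer: the class names, the strict "greater than the previous key" rule, and the message format are assumptions modelled on the log text only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class OrderedWriterSketch {

    // Toy append-only writer. NOT Nutch's MapFile.Writer; it only illustrates
    // an ordering check that requires strictly increasing keys (assumption).
    static class OrderedWriter {
        private long lastKey = -1;
        private final List<String> values = new ArrayList<>();

        // Rejects any key that is not greater than the previously appended key.
        synchronized void append(long key, String value) throws IOException {
            if (key <= lastKey) {
                throw new IOException("key out of order: " + key + " after " + lastKey);
            }
            lastKey = key;
            values.add(value);
        }

        synchronized int size() {
            return values.size();
        }
    }

    // Appends the same key twice and returns the resulting error message,
    // or null if both appends were accepted.
    static String appendSameKeyTwice(long key) {
        OrderedWriter writer = new OrderedWriter();
        try {
            writer.append(key, "first entry");
            writer.append(key, "second entry"); // same key again
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(appendSameKeyTwice(319));
    }
}
```

In this sketch the second append with an unchanged key is rejected. Whether the real failure comes from a thread-count setting, as suggested in the thread, is not established here.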