Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/08/31 18:34:39 UTC

merge mapred to trunk

Currently we have three versions of nutch: trunk, 0.7 and mapred.  This 
increases the chances for conflicts.  I would thus like to merge the 
mapred branch into trunk soon.  The soonest I could actually start this 
is next week.  Are there any objections?

Doug

Re: merge mapred to trunk

Posted by Doug Cutting <cu...@nutch.org>.
Jérôme Charron wrote:
> I haven't taken a look at the mapred branch yet.
> It will be a good surprise to discover it in the trunk... ;-)

I will make some effort to document things more before I merge to trunk, 
so that folks know what they're getting.  Many things have changed 
(e.g., segment format).  Several things have not yet been fully worked 
out and/or implemented (e.g., segment merging).  But the basics are all 
working (intranet & whole-web crawling, indexing & search), both in 
standalone and distributed configurations.  My focus has been stress 
testing the distributed infrastructure (NDFS & MapReduce).  We've 
discovered and fixed a number of bugs in this over recent weeks, so it 
is getting ever more stable.  I'm hoping that others can help fill in 
the gaps in tools.

Once the merge is done I'd like to make a few other changes.

These are:

   1. Remove most static references to NutchConf outside of main() 
routines.  The MapReduce-based versions of the command line tools have 
no such references.  The biggest change here will be to plugins. 
Plugin APIs should probably all be modified to use a factory, and the 
factory should be constructed from a NutchConf, e.g., something like 
(see the sketch after this list):
   public static PluginXFactory PluginXFactory.getFactory(NutchConf);
   public PluginX PluginXFactory.getPlugin(...);
This should permit folks to more easily configure things programmatically 
(think JMX) and to run multiple configurations in a single JVM.

   2. FetchListEntry has been mostly replaced with a new, simpler 
data structure called a CrawlDatum.  FetchListEntry is used in the 
IndexingFilter API to pass the url, fetch date and incoming anchors. 
Currently, in the mapred branch, the indexer creates a dummy 
FetchListEntry to pass to plugins.  But instead the IndexingFilter API 
should probably be altered to pass the CrawlDatum, anchors and url.
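
To make these two more concrete, here is a rough sketch of the shape I have 
in mind.  The class, method and property names (PluginX, PluginXFactory, 
pluginX.example.limit) are invented for the example, and the IndexingFilter 
signature shown is illustrative only, not actual code:

   // Sketch only -- not actual Nutch code.  A factory is built from a
   // NutchConf instance instead of reading a static/global configuration,
   // so two differently-configured factories can live in one JVM.
   interface PluginX {
     void process(String url);
   }

   public class PluginXFactory {
     private final NutchConf conf;

     private PluginXFactory(NutchConf conf) {
       this.conf = conf;
     }

     public static PluginXFactory getFactory(NutchConf conf) {
       return new PluginXFactory(conf);
     }

     public PluginX getPlugin() {
       // everything the plugin needs is read from this factory's conf;
       // the property name below is made up for the example
       final int limit = conf.getInt("pluginX.example.limit", 10);
       return new PluginX() {
         public void process(String url) {
           System.out.println("processing " + url + " (limit=" + limit + ")");
         }
       };
     }
   }

   // For (2), the IndexingFilter method could take the new data directly,
   // e.g. (signature illustrative only):
   //
   //   Document filter(Document doc, Parse parse, String url,
   //                   CrawlDatum datum, String[] anchors)
   //     throws IndexingException;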

I have avoided making these changes since they would make it difficult 
to merge improvements to plugins into the mapred branch.  But, once we 
have moved mapred to trunk, we should make them soon.  Incompatible API 
changes are best made early, so that folks have more time to work with them.

Does this all sound reasonable?

Doug


Re: merge mapred to trunk

Posted by Jérôme Charron <je...@gmail.com>.
On 8/31/05, Piotr Kosiorowski <pk...@gmail.com> wrote:
> 
> Doug Cutting wrote:
> > Currently we have three versions of nutch: trunk, 0.7 and mapred. This
> > increases the chances for conflicts. I would thus like to merge the
> > mapred branch into trunk soon. The soonest I could actually start this
> > is next week. Are there any objections?

+1
I haven't taken a look at the mapred branch yet.
It will be a good surprise to discover it in the trunk... ;-)

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: merge mapred to trunk

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This 
> increases the chances for conflicts.  I would thus like to merge the 
> mapred branch into trunk soon.  The soonest I could actually start this 
> is next week.  Are there any objections?
> 
> Doug
> 
+1
P.


Re: [Nutch-dev] Re: nutch 0.7 bug?

Posted by YourSoft <yo...@freemail.hu>.
Dear Michael,

Thanks for your mail, but I think these are two different problems.  I 
don't use the RSS parser.

Ferenc

Michael Nebel wrote:

> Just for the mail archives: please see also NUTCH-89.
>
> Thread closed?
>
> Michael
>
>
>
> yoursoft@freemail.hu wrote:
>
>> Hi Michael,
>>
>> I am going back to a nightly build.
>> I think this problem is related to the 'fetcher.threads.per.host' value 
>> when it is bigger than 1.
>> Other possible sources are fetcher.threads.fetch or 
>> fetcher.threads.per.host or parser.threads.parse.
>>
>> Best Regards,
>>    Ferenc
>>
>>> Hi Ferenc,
>>>
>>> I see the same errors. As I've seen a running installation 
>>> yesterday, I think it's a configuration mistake. By now I have no 
>>> idea where. Have you made any progress?
>>>
>>> Regards
>>>
>>>     Michael
>>>
>>>


Re: nutch 0.7 bug?

Posted by Michael Nebel <mi...@nebel.de>.
Just for the mail archives: please see also NUTCH-89.

Thread closed?

Michael



yoursoft@freemail.hu wrote:

> Hi Michael,
> 
> I am going back to a nightly build.
> I think this problem is related to the 'fetcher.threads.per.host' value 
> when it is bigger than 1.
> Other possible sources are fetcher.threads.fetch or 
> fetcher.threads.per.host or parser.threads.parse.
> 
> Best Regards,
>    Ferenc
> 
>> Hi Ferenc,
>>
>> I see the same errors. As I've seen a running installation yesterday, 
>> I think it's a configuration mistake. By now I have no idea where. 
>> Have you made any progress?
>>
>> Regards
>>
>>     Michael
>>
>>


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: nutch 0.7 bug?

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Hi Michael,

I am going back to a nightly build.
I think this problem is related to the 'fetcher.threads.per.host' value 
when it is bigger than 1.
Other possible sources are fetcher.threads.fetch or 
fetcher.threads.per.host or parser.threads.parse.
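
For reference, pinning it back to the default in nutch-site.xml is just the 
usual property override (standard nutch-site.xml syntax; the description 
line is optional):

  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
    <description>Maximum number of fetcher threads per host.</description>
  </property>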

Best Regards,
    Ferenc

> Hi Ferenc,
>
> I see the same errors. As I've seen a running installation yesterday, 
> I think it's a configuration mistake. By now I have no idea where. 
> Have you made any progress?
>
> Regards
>
>     Michael
>
>


Re: nutch 0.7 bug?

Posted by Michael Nebel <mi...@nebel.de>.
Hi Ferenc,

I see the same errors. As I've seen a running installation yesterday, I 
think it's a configuration mistake. By now I have no idea where. Have 
you made any progress?

Regards

	Michael


yoursoft@freemail.hu wrote:

> Dear Developers!
> 
> I tested nutch 0.7 with all the parser plugins, and found the following:
> 
> -------------------------------------------------------------------------
> The fetch was broken by, e.g., the following:
> -------------------------------------------------------------------------
> 050901 110915 fetch okay, but can't parse 
> http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, 
> reason: failed
> (2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved 
> files are unsupported at this time
> 050901 110915 fetching http://en.mimi.hu/fishing/scad.html
> 050901 110917 SEVERE error writing output:java.lang.NullPointerException
> java.lang.NullPointerException
>        at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
>        at 
> org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
>        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 
> 
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> 050901 110917 SEVERE error writing output:java.io.IOException: key out 
> of order: 319 after 319
> java.io.IOException: key out of order: 319 after 319
>        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 
> 
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> Exception in thread "main" java.lang.RuntimeException: SEVERE error 
> logged.  Exiting fetcher.
>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
> 050901 110921 SEVERE error writing output:java.io.IOException: key out 
> of order: 319 after 319
> java.io.IOException: key out of order: 319 after 319
>        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 
> 
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> 050901 110921 SEVERE error writing output:java.io.IOException: key out 
> of order: 319 after 319
> java.io.IOException: key out of order: 319 after 319
>        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) 
> 
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> 050901 110921 SEVERE error writing output:java.io.IOException: key out 
> of order: 319 after 319
> java.io.IOException: key out of order: 319 after 319
>        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
> etc.
> 
> ---------------------------------------------------------------------------
> These are the differences between nutch-site.xml and nutch-default.xml:
> ---------------------------------------------------------------------------
> ***** nutch-default.xml
>  <name>http.timeout</name>
>  <value>10000</value>
>  <description>The default network timeout, in milliseconds.</description>
> ***** NUTCH-SITE.XML
>  <name>http.timeout</name>
>  <value>30000</value>
>  <description>The default network timeout, in milliseconds.</description>
> *****
> 
> ***** nutch-default.xml
>  <name>http.max.delays</name>
>  <value>3</value>
>  <description>The number of times a thread will delay when trying to
> ***** NUTCH-SITE.XML
>  <name>http.max.delays</name>
>  <value>6</value>
>  <description>The number of times a thread will delay when trying to
> *****
> 
> ***** nutch-default.xml
>  <name>http.content.limit</name>
>  <value>65536</value>
>  <description>The length limit for downloaded content, in bytes.
> ***** NUTCH-SITE.XML
>  <name>http.content.limit</name>
>  <value>130000</value>
>  <description>The length limit for downloaded content, in bytes.
> *****
> 
> ***** nutch-default.xml
>  <name>file.content.limit</name>
>  <value>65536</value>
>  <description>The length limit for downloaded content, in bytes.
> ***** NUTCH-SITE.XML
>  <name>file.content.limit</name>
>  <value>130000</value>
>  <description>The length limit for downloaded content, in bytes.
> *****
> 
> ***** nutch-default.xml
>  <name>ftp.content.limit</name>
>  <value>65536</value>
>  <description>The length limit for downloaded content, in bytes.
> ***** NUTCH-SITE.XML
>  <name>ftp.content.limit</name>
>  <value>130000</value>
>  <description>The length limit for downloaded content, in bytes.
> *****
> 
> ***** nutch-default.xml
>  <name>db.max.outlinks.per.page</name>
>  <value>100</value>
>  <description>The maximum number of outlinks that we'll process for a page.
> ***** NUTCH-SITE.XML
>  <name>db.max.outlinks.per.page</name>
>  <value>200</value>
>  <description>The maximum number of outlinks that we'll process for a page.
> *****
> 
> ***** nutch-default.xml
>  <name>db.fetch.retry.max</name>
>  <value>3</value>
>  <description>The maximum number of times a url that has encountered
> ***** NUTCH-SITE.XML
>  <name>db.fetch.retry.max</name>
>  <value>6</value>
>  <description>The maximum number of times a url that has encountered
> *****
> 
> ***** nutch-default.xml
>  <name>fetcher.server.delay</name>
>  <value>5.0</value>
>  <description>The number of seconds the fetcher will delay between
> ***** NUTCH-SITE.XML
>  <name>fetcher.server.delay</name>
>  <value>30.0</value>
>  <description>The number of seconds the fetcher will delay between
> *****
> 
> ***** nutch-default.xml
>  <name>fetcher.threads.fetch</name>
>  <value>10</value>
>  <description>The number of FetcherThreads the fetcher should use.
> ***** NUTCH-SITE.XML
>  <name>fetcher.threads.fetch</name>
>  <value>100</value>
>  <description>The number of FetcherThreads the fetcher should use.
> *****
> 
> ***** nutch-default.xml
>  <name>fetcher.threads.per.host</name>
>  <value>1</value>
>  <description>This number is the maximum number of threads that
> ***** NUTCH-SITE.XML
>  <name>fetcher.threads.per.host</name>
>  <value>100</value>
>  <description>This number is the maximum number of threads that
> *****
> 
> ***** nutch-default.xml
>  <name>parser.threads.parse</name>
>  <value>10</value>
>  <description>Number of ParserThreads ParseSegment should 
> use.</description>
> ***** NUTCH-SITE.XML
>  <name>parser.threads.parse</name>
>  <value>100</value>
>  <description>Number of ParserThreads ParseSegment should 
> use.</description>
> *****
> 
> ***** nutch-default.xml
>  <name>indexer.minMergeDocs</name>
>  <value>50</value>
>  <description>This number determines the minimum number of Lucene
> ***** NUTCH-SITE.XML
>  <name>indexer.minMergeDocs</name>
>  <value>10000</value>
>  <description>This number determines the minimum number of Lucene
> *****
> 
> ***** nutch-default.xml
>  <name>indexer.maxMergeDocs</name>
>  <value>50</value>
>  <description>This number determines the maximum number of Lucene
> ***** NUTCH-SITE.XML
>  <name>indexer.maxMergeDocs</name>
>  <value>10000000</value>
>  <description>This number determines the maximum number of Lucene
> *****
> 
> ***** nutch-default.xml
>  <name>searcher.dir</name>
>  <value>.</value>
>  <description>
> ***** NUTCH-SITE.XML
>  <name>searcher.dir</name>
>  <value>/srv/db/</value>
>  <description>
> *****
> 
> ***** nutch-default.xml
>  <name>ipc.client.timeout</name>
>  <value>10000</value>
>  <description>Defines the timeout for IPC calls in milliseconds. 
> </description>
> ***** NUTCH-SITE.XML
>  <name>ipc.client.timeout</name>
>  <value>20000</value>
>  <description>Defines the timeout for IPC calls in milliseconds. 
> </description>
> *****
> 
> ***** nutch-default.xml
>  <name>plugin.includes</name>
>  
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value> 
> 
>  <description>Regular expression naming plugin directory names to
> ***** NUTCH-SITE.XML
>  <name>plugin.includes</name>
>  
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-(basic|more|site|url)</value>
>  <description>Regular expression naming plugin directory names to
> *****
> 
> ***** nutch-default.xml
>  <name>parser.character.encoding.default</name>
>  <value>windows-1252</value>
>  <description>The character encoding to fall back to when no other 
> information
> ***** NUTCH-SITE.XML
>  <name>parser.character.encoding.default</name>
>  <value>iso-8859-2</value>
>  <description>The character encoding to fall back to when no other 
> information
> *****
> 
> Any idea what the source of the problem is?
> 
> Best Regards:
>    Ferenc


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: Event queues vs threads

Posted by Paul Baclace <pe...@baclace.net>.
Doug Cutting wrote:
> Kelvin Tan wrote:
>> fetcher as a series of event queues (ala SEDA) instead 
>> of with threads.
> 
> I have never been able to write an async version of things with Java's 
> nio that outperforms a threaded version.  In theory it is possible, 
> since you can avoid thread switching overheads.  But in practice I have 
> found it difficult.

I read the David Culler, et al. SEDA paper a while back and I think
the real benefit is twofold: (1) more concurrent connections and
(2) graceful degradation (meaning fair scheduling) at maximum load.
IIRC, they hint at competitive-with-Apache web serving, but this
depends on the specific mix of requests, file sizes, etc.; Tomcat can
also beat the Apache web server under some conditions.

Services that need to maintain lots of mostly-idle connections
(like instant messaging) benefit the most from a SEDA architecture.

It should be possible to have graceful degradation with a
thread-oriented architecture.  Perhaps a self-tuning procedure could,
for a specific installation, discover the parameter settings that get
the most out of a server and have it refuse requests that would push
it into the unfair scheduling zone.

Paul

Re: Event queues vs threads

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hi,
I think some old blog entries are quite interesting, if someone wants to 
find out some details about nio.
http://jroller.com/page/pyrasun/20040426
Regards,
Piotr

Doug Cutting wrote:
> Kelvin Tan wrote:
> 
>> Interesting. I haven't tried it myself. Do you have any 
>> code/benchmarks for this?
> 
> 
> I never committed it anywhere.  I initially tried to write Nutch's IPC 
> mechanism with nio and it was slow and buggy.  One problem was that I 
> needed to switch streams to non-blocking mode in order to read 
> arbitrarily large objects, then switch them back to blocking mode in 
> order to select() on them.  But you can't change this state and remove 
> them from the selector without going through the scheduler.  So the 
> benefit of skipping the scheduler wasn't there.  If I was willing to 
> fragment objects into fixed size chunks then it might have worked, but 
> that's a lot of work.  It's a strange limitation, since with native 
> sockets one can select and then perform arbitrary stream i/o, not 
> limited to a single buffer.
> 
> Also, there's an nio version of Lucene's Directory that's a bit slower 
> than the non-nio version, but this is not using select() or anything.
> 
>> Are you aware of others facing the same problem? 
> 
> 
> How much non-blocking nio code do you find in real Java code?  I have 
> not seen a lot.
> 
> I did find that Sun has implemented a high-performance HTTP client using 
> nio.  This is documented at:
> 
> http://blogs.sun.com/roller/resources/fp/grizzly.pdf
> 
> From what I can tell the primary benefit is in number of simultaneous 
> clients, not in throughput.  Does a crawler require 1000's of 
> simultaneous connections?  If so, then it looks like careful use of nio 
> could offer some real benefits.
> 
> Doug
> 


Re: Event queues vs threads

Posted by Doug Cutting <cu...@nutch.org>.
Kelvin Tan wrote:
> Interesting. I haven't tried it myself. Do you have any code/benchmarks for this?

I never committed it anywhere.  I initially tried to write Nutch's IPC 
mechanism with nio and it was slow and buggy.  One problem was that I 
needed to switch streams to non-blocking mode in order to read 
arbitrarily large objects, then switch them back to blocking mode in 
order to select() on them.  But you can't change this state and remove 
them from the selector without going through the scheduler.  So the 
benefit of skipping the scheduler wasn't there.  If I was willing to 
fragment objects into fixed size chunks then it might have worked, but 
that's a lot of work.  It's a strange limitation, since with native 
sockets one can select and then perform arbitrary stream i/o, not 
limited to a single buffer.
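
To make that limitation concrete, here is a minimal sketch (plain JDK nio, 
nothing Nutch-specific, placeholder host): a channel must be non-blocking 
while it is registered, so switching back to blocking stream reads means 
cancelling the key and running another selection operation first.

   import java.net.InetSocketAddress;
   import java.nio.channels.SelectionKey;
   import java.nio.channels.Selector;
   import java.nio.channels.SocketChannel;

   public class BlockingToggleSketch {
     public static void main(String[] args) throws Exception {
       Selector selector = Selector.open();
       SocketChannel ch =
         SocketChannel.open(new InetSocketAddress("example.org", 80));

       ch.configureBlocking(false);              // required before register()
       SelectionKey key = ch.register(selector, SelectionKey.OP_READ);

       // ch.configureBlocking(true) at this point throws
       // IllegalBlockingModeException; the channel stays registered until
       // the selector's next selection operation, so we have to go back
       // through the scheduler:
       key.cancel();
       selector.selectNow();                     // completes deregistration
       ch.configureBlocking(true);               // blocking stream i/o is legal again

       ch.close();
       selector.close();
     }
   }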

Also, there's an nio version of Lucene's Directory that's a bit slower 
than the non-nio version, but this is not using select() or anything.

> Are you aware of others facing the same problem? 

How much non-blocking nio code do you find in real Java code?  I have 
not seen a lot.

I did find that Sun has implemented a high-performance HTTP client using 
nio.  This is documented at:

http://blogs.sun.com/roller/resources/fp/grizzly.pdf

From what I can tell the primary benefit is in number of simultaneous 
clients, not in throughput.  Does a crawler require 1000's of 
simultaneous connections?  If so, then it looks like careful use of nio 
could offer some real benefits.

Doug

Re: Event queues vs threads

Posted by Kelvin Tan <ke...@relevanz.com>.

On Thu, 01 Sep 2005 09:58:49 -0700, Doug Cutting wrote:
> Kelvin Tan wrote:
>> Each of these stages will be handled in its own thread (except
>> for HTML parsing and scoring, which may actually benefit from
>> having multiple threads). With the introduction of non-blocking
>> IO, I think threads should be used only where parallel
>> computation offers performance advantages.
>>
>> Breaking up HttpRequest and HttpResponse, will also pave the way
>> for a non-blocking HTTP implementation.
>>
> I have never been able to write an async version of things with
> Java's nio that outperforms a threaded version.  In theory it is
> possible, since you can avoid thread switching overheads.  But in
> practice I have found it difficult.
>
> Doug

Interesting. I haven't tried it myself. Do you have any code/benchmarks for this? Are you aware of others facing the same problem? 

k


Re: Event queues vs threads

Posted by Doug Cutting <cu...@nutch.org>.
Kelvin Tan wrote:
> Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance advantages.
> 
> Breaking up HttpRequest and HttpResponse, will also pave the way for a non-blocking HTTP implementation.

I have never been able to write an async version of things with Java's 
nio that outperforms a threaded version.  In theory it is possible, 
since you can avoid thread switching overheads.  But in practice I have 
found it difficult.

Doug

Event queues vs threads

Posted by Kelvin Tan <ke...@relevanz.com>.
I'm toying around with the idea of implementing the fetcher as a series of event queues (a la SEDA) instead of with threads. This is done by breaking up the fetching operation into a series of stages connected by queues, instead of one fetcher thread per task.

The stages I see are:

1. CrawlStarter (url injection)
2. URL filtering and normalizing
3. HttpRequest
4. HttpResponse
5. DB of fetched MD5 hashes
6. DB of fetched URLs
7. Parse and link extraction
8. Output
9. Link/Page Scoring

Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance advantages.

Breaking up HttpRequest and HttpResponse will also pave the way for a non-blocking HTTP implementation.
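
For what it's worth, here is a toy sketch of the stage-plus-queue shape, using plain threads and BlockingQueues; the stage bodies, class name and URLs are stand-ins, not real Nutch code:

   import java.util.concurrent.ArrayBlockingQueue;
   import java.util.concurrent.BlockingQueue;

   public class StagePipelineSketch {
     static final String DONE = "__done__";        // end-of-stream marker

     public static void main(String[] args) throws InterruptedException {
       final BlockingQueue<String> toFilter = new ArrayBlockingQueue<String>(100);
       final BlockingQueue<String> toFetch  = new ArrayBlockingQueue<String>(100);

       // Stage: URL filtering/normalizing -- one thread per stage, so no
       // locking inside the stage; only the queues are shared.
       Thread filter = new Thread(new Runnable() {
         public void run() {
           try {
             while (true) {
               String url = toFilter.take();
               if (url.equals(DONE)) { toFetch.put(DONE); break; }
               if (url.startsWith("http://")) toFetch.put(url);
             }
           } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
         }
       });

       // Stage: HttpRequest stand-in -- would issue the request; here it prints.
       Thread fetch = new Thread(new Runnable() {
         public void run() {
           try {
             while (true) {
               String url = toFetch.take();
               if (url.equals(DONE)) break;
               System.out.println("would fetch " + url);
             }
           } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
         }
       });

       filter.start(); fetch.start();

       // Stage: CrawlStarter stand-in -- injects urls into the first queue.
       toFilter.put("http://example.org/");
       toFilter.put("ftp://skipped.example.org/");
       toFilter.put(DONE);

       filter.join(); fetch.join();
     }
   }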

A big advantage also arises from a decrease in programmatic complexity (and possibly a gain in performance). With most of the stages guaranteed to be single-threaded, threading/synchronization issues are dramatically reduced. This may not be so evident in the current/map-red fetch code, but because of the completely online nature of nutch-84/OC, this does simplify things considerably. 

I'll need to dig a bit more to see how this can be conceptually translated into map-reduce, but I imagine it's doable. Perhaps each stage gets mapped then reduced?

Any thoughts?


nutch 0.7 bug?

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Developers!

I tested nutch 0.7 with all the parser plugins, and found the following:

-------------------------------------------------------------------------
The fetch was broken by, e.g., the following:
-------------------------------------------------------------------------
050901 110915 fetch okay, but can't parse http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, reason: failed(2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved files are unsupported at this time
050901 110915 fetching http://en.mimi.hu/fishing/scad.html
050901 110917 SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
        at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110917 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
etc.

---------------------------------------------------------------------------
Here are the differences between nutch-site.xml and nutch-default.xml:
---------------------------------------------------------------------------
***** nutch-default.xml
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
***** NUTCH-SITE.XML
  <name>http.timeout</name>
  <value>30000</value>
  <description>The default network timeout, in milliseconds.</description>
*****

***** nutch-default.xml
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to
***** NUTCH-SITE.XML
  <name>http.max.delays</name>
  <value>6</value>
  <description>The number of times a thread will delay when trying to
*****

***** nutch-default.xml
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>http.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>file.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>ftp.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>ftp.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
***** NUTCH-SITE.XML
  <name>db.max.outlinks.per.page</name>
  <value>200</value>
  <description>The maximum number of outlinks that we'll process for a page.
*****

***** nutch-default.xml
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
***** NUTCH-SITE.XML
  <name>db.fetch.retry.max</name>
  <value>6</value>
  <description>The maximum number of times a url that has encountered
*****

***** nutch-default.xml
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
***** NUTCH-SITE.XML
  <name>fetcher.server.delay</name>
  <value>30.0</value>
  <description>The number of seconds the fetcher will delay between
*****

***** nutch-default.xml
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
***** NUTCH-SITE.XML
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
*****

***** nutch-default.xml
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
***** NUTCH-SITE.XML
  <name>fetcher.threads.per.host</name>
  <value>100</value>
  <description>This number is the maximum number of threads that
*****

***** nutch-default.xml
  <name>parser.threads.parse</name>
  <value>10</value>
  <description>Number of ParserThreads ParseSegment should 
use.</description>
***** NUTCH-SITE.XML
  <name>parser.threads.parse</name>
  <value>100</value>
  <description>Number of ParserThreads ParseSegment should 
use.</description>
*****

***** nutch-default.xml
  <name>indexer.minMergeDocs</name>
  <value>50</value>
  <description>This number determines the minimum number of Lucene
***** NUTCH-SITE.XML
  <name>indexer.minMergeDocs</name>
  <value>10000</value>
  <description>This number determines the minimum number of Lucene
*****

***** nutch-default.xml
  <name>indexer.maxMergeDocs</name>
  <value>50</value>
  <description>This number determines the maximum number of Lucene
***** NUTCH-SITE.XML
  <name>indexer.maxMergeDocs</name>
  <value>10000000</value>
  <description>This number determines the maximum number of Lucene
*****

***** nutch-default.xml
  <name>searcher.dir</name>
  <value>.</value>
  <description>
***** NUTCH-SITE.XML
  <name>searcher.dir</name>
  <value>/srv/db/</value>
  <description>
*****

***** nutch-default.xml
  <name>ipc.client.timeout</name>
  <value>10000</value>
  <description>Defines the timeout for IPC calls in milliseconds. 
</description>
***** NUTCH-SITE.XML
  <name>ipc.client.timeout</name>
  <value>20000</value>
  <description>Defines the timeout for IPC calls in milliseconds. 
</description>
*****

***** nutch-default.xml
  <name>plugin.includes</name>
  
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
***** NUTCH-SITE.XML
  <name>plugin.includes</name>
  
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-
basic|more|site|url)</value>
  <description>Regular expression naming plugin directory names to
*****

***** nutch-default.xml
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
  <description>The character encoding to fall back to when no other 
information
***** NUTCH-SITE.XML
  <name>parser.character.encoding.default</name>
  <value>iso-8859-2</value>
  <description>The character encoding to fall back to when no other 
information
*****

Any idea what the source of the problem is?

Best Regards:
    Ferenc

Re: merge mapred to trunk

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This 
> increases the chances for conflicts.  I would thus like to merge the 
> mapred branch into trunk soon.  The soonest I could actually start this 
> is next week.  Are there any objections?

++1 :-)


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: merge mapred to trunk

Posted by Doug Cutting <cu...@nutch.org>.
I will postpone the merge of the mapred branch into trunk until I have a 
chance to (a) add some MapReduce documentation; and (b) implement 
MapReduce-based dedup.
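
For anyone who has not yet looked at the branch, dedup fits the model quite naturally: key each index entry by its content hash (and, in a second pass, by URL), then keep one representative per key and delete the rest. The toy below only illustrates that grouping logic over plain collections; the class and field names are invented, and it is not a preview of the actual tool or of whatever keep-the-best policy it ends up using.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Group index entries by content hash ("map"), then keep one representative
// per group and mark the rest for deletion ("reduce").
public class DedupSketch {

  static class IndexedDoc {
    final String url; final String md5; final float score;
    IndexedDoc(String url, String md5, float score) {
      this.url = url; this.md5 = md5; this.score = score;
    }
  }

  public static void main(String[] args) {
    List docs = Arrays.asList(new IndexedDoc[] {
        new IndexedDoc("http://a.example/x", "d41d8cd9", 1.2f),
        new IndexedDoc("http://b.example/y", "d41d8cd9", 0.7f),   // duplicate content
        new IndexedDoc("http://c.example/z", "90015098", 0.9f)
    });

    // "map": key each document by its content hash
    Map byHash = new HashMap();                  // md5 -> List of IndexedDoc
    for (Iterator it = docs.iterator(); it.hasNext();) {
      IndexedDoc d = (IndexedDoc) it.next();
      List group = (List) byHash.get(d.md5);
      if (group == null) { group = new ArrayList(); byHash.put(d.md5, group); }
      group.add(d);
    }

    // "reduce": per hash, keep the highest-scoring entry and report the rest
    for (Iterator it = byHash.values().iterator(); it.hasNext();) {
      List group = (List) it.next();
      IndexedDoc keep = (IndexedDoc) group.get(0);
      for (int i = 1; i < group.size(); i++) {
        IndexedDoc d = (IndexedDoc) group.get(i);
        if (d.score > keep.score) keep = d;
      }
      for (int i = 0; i < group.size(); i++) {
        IndexedDoc d = (IndexedDoc) group.get(i);
        if (d != keep) System.out.println("delete duplicate: " + d.url);
      }
    }
  }
}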

Doug

Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This 
> increases the chances for conflicts.  I would thus like to merge the 
> mapred branch into trunk soon.  The soonest I could actually start this 
> is next week.  Are there any objections?
> 
> Doug

Re: merge mapred to trunk

Posted by Kelvin Tan <ke...@relevanz.com>.

On Wed, 31 Aug 2005 14:37:54 -0700, Doug Cutting wrote:
> ogjunk-nutch@yahoo.com wrote:
>> I, too, am looking forward to this, but I am wondering what that
>> will do to Kelvin Tan's recent contribution, especially since I
>> saw that both MapReduce and Kelvin's code change how
>> FetchListEntry works.  If merging mapred to trunk means losing
>> Kelvin's changes, then I suggest one of the Nutch developers
>> evaluates Kelvin's modifications and, if they are good, commits
>> them to trunk, and then makes the final pre-mapred release (e.g.
>> release-0.8).
>>
>
> It won't lose Kelvin's patch: it will still be a patch to 0.7.
>
> What I worry about is the alternate scenario: that Kelvin & others
> invest a lot of effort making this work with 0.7, while the mapred-
> based code diverges even further.  It would be best if Kelvin's
> patch is ported to the mapred branch sooner rather than later, then
> maintained there.
>
> Doug

Agreed. I have some time in the coming weeks, and will work full-time to evolve the patch to be more compatible with Nutch, especially map-red.

k


Re: merge mapred to trunk

Posted by og...@yahoo.com.
--- Doug Cutting <cu...@nutch.org> wrote:

> ogjunk-nutch@yahoo.com wrote:
> > I, too, am looking forward to this, but I am wondering what that will
> > do to Kelvin Tan's recent contribution, especially since I saw that
> > both MapReduce and Kelvin's code change how FetchListEntry works.  If
> > merging mapred to trunk means losing Kelvin's changes, then I suggest
> > one of the Nutch developers evaluates Kelvin's modifications and, if they
> > are good, commits them to trunk, and then makes the final pre-mapred
> > release (e.g. release-0.8).
> 
> It won't lose Kelvin's patch: it will still be a patch to 0.7.

Ah, right, we could always make a 0.7.* release from release 0.7.

> What I worry about is the alternate scenario: that Kelvin & others 
> invest a lot of effort making this work with 0.7, while the mapred-based 
> code diverges even further.  It would be best if Kelvin's patch is 
> ported to the mapred branch sooner rather than later, then maintained there.

I agree.  I'll actually see Kelvin in person tomorrow, so we'll see if
this is something he can do.  It looks like he added some much-needed
functionality in his patch, so it'd be good to keep it.

Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.

Re: merge mapred to trunk

Posted by Doug Cutting <cu...@nutch.org>.
ogjunk-nutch@yahoo.com wrote:
> I, too, am looking forward to this, but I am wondering what that will
> do to Kelvin Tan's recent contribution, especially since I saw that
> both MapReduce and Kelvin's code change how FetchListEntry works.  If
> merging mapred to trunk means losing Kelvin's changes, then I suggest
> one of the Nutch developers evaluates Kelvin's modifications and, if they
> are good, commits them to trunk, and then makes the final pre-mapred
> release (e.g. release-0.8).

It won't lose Kelvin's patch: it will still be a patch to 0.7.

What I worry about is the alternate scenario: that Kelvin & others 
invest a lot of effort making this work with 0.7, while the mapred-based 
code diverges even further.  It would be best if Kelvin's patch is 
ported to the mapred branch sooner rather than later, then maintained there.

Doug

Re: merge mapred to trunk

Posted by og...@yahoo.com.
> Currently we have three versions of nutch: trunk, 0.7 and mapred. 
> This 
> increases the chances for conflicts.  I would thus like to merge the 
> mapred branch into trunk soon.  The soonest I could actually start
> this is next week.  Are there any objections?

I, too, am looking forward to this, but I am wondering what that will
do to Kelvin Tan's recent contribution, especially since I saw that
both MapReduce and Kelvin's code change how FetchListEntry works.  If
merging mapred to trunk means losing Kelvin's changes, then I suggest
one of the Nutch developers evaluates Kelvin's modifications and, if they
are good, commits them to trunk, and then makes the final pre-mapred
release (e.g. release-0.8).

Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.