Posted to user@nutch.apache.org by Olive g <ol...@hotmail.com> on 2006/04/05 02:24:29 UTC
Re: Query on merged indexes returned 0 hit - more issues
Hi gurus,
I tried the workaround and found some more issues. It appears to me that
invertlinks does not work properly with more than 5 input parts.
For example, the following command (with the number of map tasks set to 5 and
the number of reduce tasks set to 5, using DFS, Nutch 0.8)
../search/bin/nutch invertlinks test5/linkdb test5/segments/20060403192429
test5/segments/20060403193814 >& linkdb-test5&
generated basically the same error for all 5 reduce tasks:
java.rmi.RemoteException: java.io.IOException: Could not complete write to
file /user/root/test5/linkdb/362527374/part-00000/.data.crc
by DFSClient_441718647
	at java.lang.Throwable.<init>(Throwable.java:57)
	at java.lang.Throwable.<init>(Throwable.java:68)
	at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)
The contents of test5/segments/20060403192429/content/ are:
/user/root/test5/segments/20060403192429/content/part-00000 123617
/user/root/test5/segments/20060403192429/content/part-00001 141105
/user/root/test5/segments/20060403192429/content/part-00002 168565
/user/root/test5/segments/20060403192429/content/part-00003 179788
/user/root/test5/segments/20060403192429/content/part-00004 70356
The contents of test5/segments/20060403193814/content/ are:
/user/root/test5/segments/20060403193814/content/part-00000 103014
/user/root/test5/segments/20060403193814/content/part-00001 159010
/user/root/test5/segments/20060403193814/content/part-00002 92892
/user/root/test5/segments/20060403193814/content/part-00003 103847
/user/root/test5/segments/20060403193814/content/part-00004 102626
In the example above there are 10 input parts across the two segments. I
noticed that the failure never occurs when there are no more than 5 input
parts, and it consistently occurs when there are more than 5, even if they
are all in the same segment.
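In case it helps anyone reproduce this, here is a small local-filesystem sketch for counting the part-XXXXX dirs that would be fed to invertlinks (the directory layout below is illustrative; on DFS you would list with bin/hadoop dfs -ls instead of plain ls):

```shell
#!/bin/sh
# count_parts SEGMENTS_DIR: count part-XXXXX dirs under each segment's
# content/ dir; in my tests, more than 5 total parts triggers the failure.
count_parts() {
  total=0
  for seg in "$1"/*; do
    n=$(ls -d "$seg"/content/part-* 2>/dev/null | wc -l)
    n=$((n))  # normalize whitespace that some wc implementations emit
    echo "$seg: $n parts"
    total=$((total + n))
  done
  echo "total input parts: $total"
}
```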
This problem is urgent because it blocks incremental crawling, whether by
merging segments or by crawling to increasing depth: after five more
incremental crawls we end up with six parts.
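For reference, the renaming step of the workaround (copying part dirs from one DB into another so the part numbers continue sequentially) can be sketched like this on a local filesystem; the function name and paths are mine, and on DFS the same copies and renames would be done with bin/hadoop dfs -cp / -mv:

```shell
#!/bin/sh
# merge_parts SRC DEST: copy part-XXXXX dirs from SRC into DEST, renaming
# them so the numbering continues after DEST's existing parts
# (e.g. DEST has part-00000, SRC's part-00000 becomes DEST's part-00001).
merge_parts() {
  offset=$(ls -d "$2"/part-* 2>/dev/null | wc -l)
  offset=$((offset))  # normalize whitespace from wc
  for p in "$1"/part-*; do
    cp -r "$p" "$2/$(printf 'part-%05d' "$offset")"
    offset=$((offset + 1))
  done
}
```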
Please let me know what you think.
Thank you!
Olive
>From: Andrzej Bialecki <ab...@getopt.org>
>Reply-To: nutch-user@lucene.apache.org
>To: nutch-user@lucene.apache.org
>Subject: Re: Query on merged indexes returned 0 hit - test case included
>(Nutch 0.8)
>Date: Tue, 04 Apr 2006 19:20:43 +0200
>
>Olive g wrote:
>>Thank you! Zaheed sent out a workaround in another thread as follows. Do
>>you think this would
>>work (on Nutch 0.8 w/ DFS).
>>
>
>Yes, it should work. This is a cheap way to merge two DBs - thanks Zaheed!
>Just remember to rename the part-xxxxx dirs so that they are sequential.
>
>>Also, when do you expect to port the feature to 0.8 (I know it's not the
>>highest priority for
>>you :)) - but really, merging index is critical for incremental crawls. Is
>>it possible that it can be
>>implemented sooner? Please ... Our project depends on this ...
>>
>
>These features (incremental updates, merging indexes) are already supported
>if you use individual command-line tools and a single DB. So, I'm not
>planning to do anything about it.
>
>--
>Best regards,
>Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __________________________________
>[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>___|||__||  \|  || |   Embedded Unix, System Integration
>http://www.sigram.com Contact: info at sigram dot com
>
>