Posted to user@nutch.apache.org by Olive g <ol...@hotmail.com> on 2006/04/06 14:25:48 UTC

please help!! invertlinks does not work properly with more than 5 input parts (0.8)

Hi gurus,

I posted questions on how to do incremental crawls on 0.8 a few days ago; 
thank you all for your help. However, when I tried the workaround (see 
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04111.html), 
invertlinks crashed when there were more than 5 input parts.

Here are the details:

For example, the following command (with the number of map tasks set to 5 and 
the number of reduce tasks set to 5, using dfs, nutch 0.8):

../search/bin/nutch invertlinks test5/linkdb test5/segments/20060403192429 test5/segments/20060403193814 >& linkdb-test5&

generated basically the same error for all 5 reduce tasks:


java.rmi.RemoteException: java.io.IOException: Could not complete write to file /user/root/test5/linkdb/362527374/part-00000/.data.crc by DFSClient_441718647
    at java.lang.Throwable.<init>(Throwable.java:57)
    at java.lang.Throwable.<init>(Throwable.java:68)
    at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)


The contents of test5/segments/20060403192429/content/ are:

/user/root/test5/segments/20060403192429/content/part-00000     123617
/user/root/test5/segments/20060403192429/content/part-00001     141105
/user/root/test5/segments/20060403192429/content/part-00002     168565
/user/root/test5/segments/20060403192429/content/part-00003     179788
/user/root/test5/segments/20060403192429/content/part-00004     70356

The contents of test5/segments/20060403193814/content/ are:

/user/root/test5/segments/20060403193814/content/part-00000     103014
/user/root/test5/segments/20060403193814/content/part-00001     159010
/user/root/test5/segments/20060403193814/content/part-00002     92892
/user/root/test5/segments/20060403193814/content/part-00003     103847
/user/root/test5/segments/20060403193814/content/part-00004     102626


In the example above there are 10 input parts in two segments. I noticed 
that this doesn't happen when there are no more than 5 input parts and it 
consistently happens when there are more than 5, even if they are in the 
same segment.
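
For anyone reproducing the count, here is a minimal sketch that tallies the part files per segment from the two listings above. It is only an illustration: the helper name count_parts is hypothetical, it is not part of Nutch, and it assumes each listing line is a path followed by a number, exactly as pasted in this message.

```python
from collections import Counter

# Listing lines copied from this message: "<path> <number>" per line.
LISTING = """\
/user/root/test5/segments/20060403192429/content/part-00000     123617
/user/root/test5/segments/20060403192429/content/part-00001     141105
/user/root/test5/segments/20060403192429/content/part-00002     168565
/user/root/test5/segments/20060403192429/content/part-00003     179788
/user/root/test5/segments/20060403192429/content/part-00004     70356
/user/root/test5/segments/20060403193814/content/part-00000     103014
/user/root/test5/segments/20060403193814/content/part-00001     159010
/user/root/test5/segments/20060403193814/content/part-00002     92892
/user/root/test5/segments/20060403193814/content/part-00003     103847
/user/root/test5/segments/20060403193814/content/part-00004     102626
"""


def count_parts(listing: str) -> Counter:
    """Count part-NNNNN entries per segment name (hypothetical helper)."""
    counts = Counter()
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip blank lines
        path = line.split()[0]
        pieces = path.split("/")
        # Assumed layout: .../segments/<segment>/content/part-NNNNN
        if "segments" in pieces and pieces[-1].startswith("part-"):
            segment = pieces[pieces.index("segments") + 1]
            counts[segment] += 1
    return counts


parts = count_parts(LISTING)
total = sum(parts.values())
for segment, n in sorted(parts.items()):
    print(segment, n)
print("total input parts:", total)
```

With the listings above this gives 5 parts for each segment, 10 in total, which matches the failing case described here.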

The urgency of this problem is that it prevents incremental crawling, 
whether by merging segments or by incremental depth crawling, because after 
5 more incremental crawls we have 6 parts.

Please let me know what you think. Did I miss any configuration? Is there a 
workaround for the workaround?

Thank you!

Olive



Re: please help!! invertlinks does not work properly with more than 5 input parts (0.8)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Olive g wrote:
> Hi gurus,
>
> I posted questions on how to do incremental crawls on 0.8 a few days 
> ago and thank you all for your help. However, when I tried to 
> workaround (see 
> http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04111.html), 
> inverlinks crashed when there were more than 5 input parts.
>

You should understand very clearly that what you are doing is NOT 
supported and very non-standard. It might (or might not) have worked as 
a one time workaround to get you out of trouble.

Nutch DOES support incremental crawling and indexing, and the way it 
does so is described in the tutorial 
(http://wiki.apache.org/nutch/NutchTutorial). Please follow the section 
of the tutorial titled "Step-by-Step or Whole-web Crawling" - you will 
save yourself (and us) a lot of grief.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com