Posted to user@nutch.apache.org by Sebastian Schick <sc...@informatik.uni-rostock.de> on 2007/10/02 14:19:14 UTC

Re: incremental crawling

Hello,


we are running into the same problem you described!
The rebuild issue is also important for us, but much more important is
the fact that all custom fields are discarded.

Are there any solutions now?


Regards, 

Sebastian


charlie w wrote:
> 
> Thanks for that link (and a note to self; don't ask a question on the list
> just before going on vacation...)
> 
> Perhaps I don't understand the patch, but it seems that it is only
> meant to avoid recrawling content that hasn't changed.  It doesn't really
> have to do with avoiding a rebuild of the entire index if I add a
> document;
> or does it?
> 
> Does Nutch have the ability to add to an index without a complete rebuild,
> or is a complete rebuild required if I add even a single document?
> 
> Furthermore, even if I were to decide that the complete rebuild is
> acceptable, Nutch is still discarding my custom fields from all documents
> that are not being updated.  Why is this happening?
> 
> I appreciate the help; thanks.
> -Charlie
> 
> 
> 
> On 4/14/07, rubdabadub <ru...@gmail.com> wrote:
>>
>> Hi Charlie:
>>
>> On 4/14/07, c wanek <sp...@gmail.com> wrote:
>> > Greetings,
>> >
>> > Now I'm at the point where I would like to add to my crawl, with a new
>> > set of seed urls.  Using a variation on the recrawl script on the wiki,
>> > I can make this happen, but I am running into what is, for me, a
>> > showstopper issue.  The custom fields I added to the documents of the
>> > first crawl are lost when the documents from the second crawl are added
>> > to the index.
>>
>> Nutch is all about writing once. All operations write once; this is how
>> map-reduce works. This is why incremental crawling is difficult. But :-)
>>
>> http://issues.apache.org/jira/browse/NUTCH-61
>>
>> Like you, many others want this to happen. And to the best of my
>> knowledge, Andrzej Bialecki will be addressing the issue after the 0.9
>> release, which is due anytime now :-)
>>
>> So you might give it a go with NUTCH-61, but NOTE it doesn't work with
>> the current trunk.
>>
>> Regards
>> raj
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/incremental-crawling-tf3574227.html#a12997582