Posted to general@lucene.apache.org by "Leimbach, Johannes" <JL...@CONET.DE> on 2006/08/08 15:48:16 UTC

Need advice for doing incremental Index updates

Hello,

 

I need some advice regarding incremental index updates. 

 

There are three cases I need to handle when iterating over the
source files (the files that need to be indexed):

1.	A file did not change since the last update
2.	A file did change since the last update
3.	A file was removed since the last update

 

Case 1 is easy.

Case 2 as well: just remove the old file from the index and add the new one.

Case 3 is bugging me.

 

How can I find out whether a file that is referenced in the index no
longer exists?

 

The blunt solution would be to retrieve *all* file paths from the index
and check whether each one exists. If it does, move on; if the file does
not exist on disk, remove it from the index. The problem I have with this
is that I am possibly pulling a lot of data from the Lucene index. I
will also do a lot of local filesystem checks. Sloooow?!
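
A rough sketch of this blunt approach, in Python rather than against the
Lucene API. The names `indexed_paths` and `remove_from_index` are
hypothetical stand-ins for "enumerate every stored path" and "delete one
document by path"; a real index would supply its own equivalents.

```python
import os
from typing import Callable, Iterable, List


def prune_missing(indexed_paths: Iterable[str],
                  remove_from_index: Callable[[str], None]) -> List[str]:
    """Remove every indexed path that no longer exists on disk.

    Both arguments are hypothetical stand-ins for whatever the real
    index offers; this is not the Lucene API.
    """
    removed = []
    # Copy the paths first so that removals cannot disturb the iteration.
    for path in list(indexed_paths):
        if not os.path.exists(path):  # one filesystem check per document
            remove_from_index(path)
            removed.append(path)
    return removed
```

The cost is one path lookup and one filesystem check per indexed
document, regardless of how few files actually changed.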

 

Another idea I had is to introduce an "index version" integer. This
number will be unique for each run of the indexing process, so each
time my indexer program is started, a new "index version" is created.
Every file that exists in the index and gets processed will then have
the current "index version" number stored as a document field.

This way all newly added and modified documents will have an up-to-date
"index version" flag after indexing is complete.

Now, to remove all physically deleted files from the index, I would
select all documents that still have an old "index version" flag stored
in them. Every document with such an old number can be safely removed.

The problem with this solution is that *every* document in the index
will get updated: first the old index version field is removed, then the
new field is added.

On the plus side, removing deleted files will be very fast.
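
The "index version" idea can be sketched as follows, again in Python and
not against the Lucene API. A plain dict stands in for the index, and
comparing modification times to detect a changed file is an assumption
made only for this illustration; all names are hypothetical.

```python
from typing import Dict, List


def incremental_update(index: Dict[str, dict],
                       files_on_disk: Dict[str, float],
                       run_version: int) -> List[str]:
    """Run one indexing pass using an "index version" stamp.

    `index` maps path -> {"mtime": ..., "version": ...} and is a plain
    dict standing in for the real index; `files_on_disk` maps
    path -> modification time. Both shapes are assumptions.
    """
    for path, mtime in files_on_disk.items():
        doc = index.get(path)
        if doc is None or doc["mtime"] != mtime:
            # New or modified file: (re)index it with the current stamp.
            index[path] = {"mtime": mtime, "version": run_version}
        else:
            # Unchanged file: it still has to be re-stamped, so every
            # document is touched on every run.
            doc["version"] = run_version

    # Anything not stamped during this run was not seen on disk.
    stale = [p for p, doc in index.items() if doc["version"] != run_version]
    for path in stale:
        del index[path]
    return stale
```

The final cleanup is a single selection on the version field, but the
loop before it writes to every document, changed or not.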

 

 

What would you recommend for doing incremental updates?

I fear the first approach will be utterly slow for small updates, whereas
the second will be a lot faster, though adding documents is slower
because of the additional field update for every document.

 

Thanks for your advice,

Johannes :-)

 

 


Re: Need advice for doing incremental Index updates

Posted by John <jl...@e5systems.com>.
Hi,
If I run the incremental process and walk my directory tree of files on
every run, won't that cost more time?
I would have to run a thread to do what you describe, and it would run
all the time.
Thanks,
john




----- Original Message ----- 
From: "Chris Hostetter" <ho...@fucit.org>
To: <ge...@lucene.apache.org>
Sent: Wednesday, August 09, 2006 5:32 AM
Subject: Re: Need advice for doing incremental Index updates


>
> I would solve your problem external to the index ... every time you run
> your incremental process, as you walk your directory tree of files (adding
> the new ones, deleting/re-adding the modified ones), record every file and
> save that somewhere.  When you are all done, compare the list from this
> run with the list from the last run -- any file in the old list and not in
> the new list is a document to be deleted.
>
> -Hoss 



Re: Need advice for doing incremental Index updates

Posted by Chris Hostetter <ho...@fucit.org>.
I would solve your problem external to the index ... every time you run
your incremental process, as you walk your directory tree of files (adding
the new ones, deleting/re-adding the modified ones), record every file and
save that somewhere.  When you are all done, compare the list from this
run with the list from the last run -- any file in the old list and not in
the new list is a document to be deleted.
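
A minimal sketch of that comparison, in Python. It assumes the list of
seen files is kept in a plain text "manifest" with one path per line;
the manifest format and the function names are illustrative choices, not
anything Lucene provides.

```python
from typing import Iterable, Set


def files_to_delete(previous_run: Iterable[str],
                    current_run: Iterable[str]) -> Set[str]:
    """Return the paths recorded last time but not seen in this run."""
    return set(previous_run) - set(current_run)


def load_manifest(path: str) -> Set[str]:
    """Read one path per line; a missing manifest means no previous run."""
    try:
        with open(path, encoding="utf-8") as fh:
            return {line.rstrip("\n") for line in fh if line.strip()}
    except FileNotFoundError:
        return set()


def save_manifest(path: str, seen: Iterable[str]) -> None:
    """Write the paths seen in this run, one per line, for the next run."""
    with open(path, "w", encoding="utf-8") as fh:
        for p in sorted(seen):
            fh.write(p + "\n")
```

For example, `files_to_delete(["a", "b", "c"], ["b", "c", "d"])` yields
`{"a"}`: only "a" was in the old list and is missing from the new one.
The index itself is touched only for documents that actually need to be
deleted. Paths containing newline characters would need a different
manifest format.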





-Hoss