You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Ivan Vasilev <iv...@sirma.bg> on 2008/08/19 14:01:37 UTC

Updating tag-indexes

Hi Lucene Guys,

I have a question that is simple but is important for me. I did not 
found the answer in the javadoc so I am asking here.
When adding Document-s by the method IndexWriter.addDocument(doc) does 
the documents obtain Lucene IDs in the order that they are added to the 
IndexWriter? I mean will first added doc be with Lucene ID 0, second 
added with Lucene ID 1, etc?

Bellow I describe why I am asking this.
We plan to split our index to two separate indexes that will be read by 
ParallelReader class. This is so because the one of them will contain 
field(s) that will be indexed and stored and it will be frequently 
changed. So to have always correct data returned from the ParallelReader 
when changing documents in the small index the Lucene IDs of these docs 
have to remain the same.
To do this Karl Wettin suggests a solution described in *LUCENE-879 
<https://issues.apache.org/jira/browse/LUCENE-879>*. I do not like this 
solution because it is connected to changing Lucene source code, and 
after each refactoring potentially I will have problems. The solution is 
related to optimizing index so it will not be reasonably faster than the 
one that I prefer. And it is:
1. Read the whole index and reconstruct the documents including index 
data by using TermDocs and TermEnum classes;
2. Change the needed documents;
3. Index documents in new index that will replace the initial one.
I can even simplify this algorithm (and the speed) if all the fields 
will be always stored - I can read just the stored data and based on 
this to reconstruct the content of the docs and re index them in new.

But anyway everything in the my approaches will depend on this - are 
LuceneIDs in the index ordered in the same way as docs are added to the 
IndexWriter.

Thanks in Advance,
Ivan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Updating tag-indexes

Posted by Erick Erickson <er...@gmail.com>.
I'd add to Michael's mail the *strong* recommendation that you provide
your own unique doc IDs and use *those* instead. It'll save you a world
of grief. Whenever you need to add a new doc to an existing index, you
can get the maximum of *your* unique IDs and increment it yourself.

One thing to remember is that not all Lucene docs need to have the same
fields. So it's even possible to have a *very special* document that
contains
meta-data about your index, say the last used of your generated IDs and
keep that meta-data doc up to date. If you put fields in that doc that are
NOT
in any other doc, you don't have to worry about accidentally getting this
meta-data doc in your searches....

Best
Erick

On Tue, Aug 19, 2008 at 8:01 AM, Ivan Vasilev <iv...@sirma.bg> wrote:

> Hi Lucene Guys,
>
> I have a question that is simple but is important for me. I did not found
> the answer in the javadoc so I am asking here.
> When adding Document-s by the method IndexWriter.addDocument(doc) does the
> documents obtain Lucene IDs in the order that they are added to the
> IndexWriter? I mean will first added doc be with Lucene ID 0, second added
> with Lucene ID 1, etc?
>
> Bellow I describe why I am asking this.
> We plan to split our index to two separate indexes that will be read by
> ParallelReader class. This is so because the one of them will contain
> field(s) that will be indexed and stored and it will be frequently changed.
> So to have always correct data returned from the ParallelReader when
> changing documents in the small index the Lucene IDs of these docs have to
> remain the same.
> To do this Karl Wettin suggests a solution described in *LUCENE-879 <
> https://issues.apache.org/jira/browse/LUCENE-879>*. I do not like this
> solution because it is connected to changing Lucene source code, and after
> each refactoring potentially I will have problems. The solution is related
> to optimizing index so it will not be reasonably faster than the one that I
> prefer. And it is:
> 1. Read the whole index and reconstruct the documents including index data
> by using TermDocs and TermEnum classes;
> 2. Change the needed documents;
> 3. Index documents in new index that will replace the initial one.
> I can even simplify this algorithm (and the speed) if all the fields will
> be always stored - I can read just the stored data and based on this to
> reconstruct the content of the docs and re index them in new.
>
> But anyway everything in the my approaches will depend on this - are
> LuceneIDs in the index ordered in the same way as docs are added to the
> IndexWriter.
>
> Thanks in Advance,
> Ivan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Updating tag-indexes

Posted by Ivan Vasilev <iv...@sirma.bg>.
Thanks Michael and Erick,

Yes we have our own unique IDs for the docs - we use them for other 
purposes, but here we need to deal with Lucene IDs as it is important 
for ParalelReader.
Ok then I will implement the algorithm for updating the small tag-index 
as I intended and as it is not guarantied from the API I will take care 
to check if docIDs remain to be assigned sequentially in each new Lucene 
release.
I will also take care about deletions - first before starting update 
process for the small index I will keep info about deleted docs, then 
will undelete them. After that will start update algorithm. When it 
finishes if there are deletions this will mean that some exception 
occurred. The same will mean if the number of the docs is less than 
initial (then some collapse of deletions occurred) - will have to 
restart the process. If everything is ok then will delete again the docs 
about I have kept info in the begining.

Thanks Once Again,
Ivan



Michael McCandless wrote:
>
> Yes, docIDs are currently sequentially assigned, starting with 0.
>
> BUT: on hitting an exception (say in your analyzer) it will usually 
> use up a docID (and then immediately mark it as deleted).
>
> Also, this behavior isn't "promised" in the API, ie it could in theory 
> (though I think it unlikely) change in a future release of Lucene.
>
> And remember when a merge completes (or, optimize), any deleted docs 
> will "collapse down" all docIDs after them.
>
> Mike
>
> Ivan Vasilev wrote:
>
>> Hi Lucene Guys,
>>
>> I have a question that is simple but is important for me. I did not 
>> found the answer in the javadoc so I am asking here.
>> When adding Document-s by the method IndexWriter.addDocument(doc) 
>> does the documents obtain Lucene IDs in the order that they are added 
>> to the IndexWriter? I mean will first added doc be with Lucene ID 0, 
>> second added with Lucene ID 1, etc?
>>
>> Bellow I describe why I am asking this.
>> We plan to split our index to two separate indexes that will be read 
>> by ParallelReader class. This is so because the one of them will 
>> contain field(s) that will be indexed and stored and it will be 
>> frequently changed. So to have always correct data returned from the 
>> ParallelReader when changing documents in the small index the Lucene 
>> IDs of these docs have to remain the same.
>> To do this Karl Wettin suggests a solution described in *LUCENE-879 
>> <https://issues.apache.org/jira/browse/LUCENE-879>*. I do not like 
>> this solution because it is connected to changing Lucene source code, 
>> and after each refactoring potentially I will have problems. The 
>> solution is related to optimizing index so it will not be reasonably 
>> faster than the one that I prefer. And it is:
>> 1. Read the whole index and reconstruct the documents including index 
>> data by using TermDocs and TermEnum classes;
>> 2. Change the needed documents;
>> 3. Index documents in new index that will replace the initial one.
>> I can even simplify this algorithm (and the speed) if all the fields 
>> will be always stored - I can read just the stored data and based on 
>> this to reconstruct the content of the docs and re index them in new.
>>
>> But anyway everything in the my approaches will depend on this - are 
>> LuceneIDs in the index ordered in the same way as docs are added to 
>> the IndexWriter.
>>
>> Thanks in Advance,
>> Ivan
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> __________ NOD32 3366 (20080819) Information __________
>
> This message was checked by NOD32 antivirus system.
> http://www.eset.com
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Updating tag-indexes

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yes, docIDs are currently sequentially assigned, starting with 0.

BUT: on hitting an exception (say in your analyzer) it will usually  
use up a docID (and then immediately mark it as deleted).

Also, this behavior isn't "promised" in the API, ie it could in theory  
(though I think it unlikely) change in a future release of Lucene.

And remember when a merge completes (or, optimize), any deleted docs  
will "collapse down" all docIDs after them.

Mike

Ivan Vasilev wrote:

> Hi Lucene Guys,
>
> I have a question that is simple but is important for me. I did not  
> found the answer in the javadoc so I am asking here.
> When adding Document-s by the method IndexWriter.addDocument(doc)  
> does the documents obtain Lucene IDs in the order that they are  
> added to the IndexWriter? I mean will first added doc be with Lucene  
> ID 0, second added with Lucene ID 1, etc?
>
> Bellow I describe why I am asking this.
> We plan to split our index to two separate indexes that will be read  
> by ParallelReader class. This is so because the one of them will  
> contain field(s) that will be indexed and stored and it will be  
> frequently changed. So to have always correct data returned from the  
> ParallelReader when changing documents in the small index the Lucene  
> IDs of these docs have to remain the same.
> To do this Karl Wettin suggests a solution described in *LUCENE-879 <https://issues.apache.org/jira/browse/LUCENE-879 
> >*. I do not like this solution because it is connected to changing  
> Lucene source code, and after each refactoring potentially I will  
> have problems. The solution is related to optimizing index so it  
> will not be reasonably faster than the one that I prefer. And it is:
> 1. Read the whole index and reconstruct the documents including  
> index data by using TermDocs and TermEnum classes;
> 2. Change the needed documents;
> 3. Index documents in new index that will replace the initial one.
> I can even simplify this algorithm (and the speed) if all the fields  
> will be always stored - I can read just the stored data and based on  
> this to reconstruct the content of the docs and re index them in new.
>
> But anyway everything in the my approaches will depend on this - are  
> LuceneIDs in the index ordered in the same way as docs are added to  
> the IndexWriter.
>
> Thanks in Advance,
> Ivan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org