You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Justin Swanhart <gr...@gmail.com> on 2004/10/29 20:48:01 UTC

When do document ids change

Given an FSDirectory based index A.
Documents are added to A with an IndexWriter
  minMergeDocs = 20000
  mergeFactor = 3

Documents are never deleted.

Once the RAMDirectory merges documents to the index:

a) will the documentID values for index A ever change?
b) can a mapping between a term in the document and newly created
documentID be made?

Why I am asking this question:
I have a database with about 10M rows in it.  My search engine needs
to be able to quickly
get all the rows back from the database that match a query.  All the
rows need to be
returned at once, because the entire result set is sorted based on user input.  

What I want to do:
When a documentID gets assigned to a document, I want to update the
database row with
that matches the document field "id" with the lucene documentID.  That
way, I can use a
hitcollector to gather just the documentID values from the search and
insert them into a
temporary cache table, then grab the matching rows from the database. 
This will work assuming the documentID values for the given document
never change.

Currently, running an IndexSearcher.search() and getting all the rows
back takes between
5 and 30 seconds for most queries, which is certainly not fast enough.
 The time it takes to collect the documentIDs however is less than 1
second.  All the time is taken by calling
hits.doc() for each document to get the "id" field to insert into the database. 

So finally,  will what I want to do work, and if so, how can I go
about updating the database when the documentID is created?

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: When do document ids change

Posted by Paul Elschot <pa...@xs4all.nl>.

Justin,

On Friday 29 October 2004 20:48,  you wrote:
> Given an FSDirectory based index A.
> Documents are added to A with an IndexWriter
>   minMergeDocs = 20000
>   mergeFactor = 3
> 
> Documents are never deleted.
> 
> Once the RAMDirectory merges documents to the index:
> 
> a) will the documentID values for index A ever change?

A document id may change after deleting a document that was
added earlier than the document.
Adding more docs may then change the id.
Optimizing the index will then change the id.

> b) can a mapping between a term in the document and newly created
> documentID be made?

Yes. See below on how.

> Why I am asking this question:
> I have a database with about 10M rows in it.  My search engine needs
> to be able to quickly
> get all the rows back from the database that match a query.  All the
> rows need to be
> returned at once, because the entire result set is sorted based on user input.  

Did you try IndexSearcher.search() or Search.search() with a Sort argument?

> What I want to do:
> When a documentID gets assigned to a document, I want to update the
> database row with
> that matches the document field "id" with the lucene documentID.  That
> way, I can use a
> hitcollector to gather just the documentID values from the search and
> insert them into a
> temporary cache table, then grab the matching rows from the database. 
> This will work assuming the documentID values for the given document
> never change.

It will work on the condition that documents are never (in the absolute sense)
deleted from the lucene index, and that one never merges indexes.

> Currently, running an IndexSearcher.search() and getting all the rows
> back takes between
> 5 and 30 seconds for most queries, which is certainly not fast enough.
>  The time it takes to collect the documentIDs however is less than 1
> second.  All the time is taken by calling
> hits.doc() for each document to get the "id" field to insert into the database. 

One can speed up retrieving data from Lucene indexes by retrieving
in the order of docId, via indexReader.document(docId). Make sure
no other threads are using the index at the same time.
One can also store the Lucene files with the stored fields on another disk,
but for that some coding is needed.

You may have to implement your own HitCollector. 
Lucene does not guarantee that the hits are collected in 
order of docId, but the collecting order is normally not far off.

> So finally,  will what I want to do work, and if so, how can I go

It will work, but I would not recommend it. Just retrieve what you
need from the Lucene index in the order of the docId's.
Try and store as little data per document as possible.

> about updating the database when the documentID is created?

To know the docId use an indexed primary key in lucene and search
for it using IndexReader.termDocs(new Term(keyField, keyValue)).

Regards,
Paul Elschot.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org