Posted to dev@lucene.apache.org by Shane O'Sullivan <sh...@gmail.com> on 2005/10/11 17:58:32 UTC

Are Non-consecutive Document IDs feasible?

Hi all,

As far as I understand today, Lucene assigns docIDs to documents according
to the order in which the documents are added to the index. Hence, docIDs
are assigned by the engine in a sequential manner, without gaps. This order
of document identifiers then determines the order of the postings in the
postings lists, i.e. all postings lists are sorted by docID. It also means
that the same document appearing in two different indices would probably not
have the same docID (unless some extreme care was taken to insert documents
in the same order).

There are situations where the application wants to determine the docID for
the index, i.e. to control the ordering of occurrences in the postings
lists. This is useful to ensure, for example, that a document has a stable
and consistent document identifier regardless of insertion order to an
index.

In such cases, the application would want to pass into the index the
numeric identifier of the document. However, such identifiers may not be
sequential, i.e. it's possible that there would be a document with docID M
without there being any document whose docID is M-1.

Q1. How difficult would it be to change Lucene to accept the docIDs from the
application, and not care about any possible gaps those ids may have?
One possible problem is that, since the docIDs could become very large and
are non-sequential, creating a single array indexed by docID would not be
feasible.

Q2. Does Lucene's search code depend on the fact that document IDs are
sequential?

Thanks

Shane

Re: Are Non-consecutive Document IDs feasible?

Posted by Yonik Seeley <ys...@gmail.com>.
Yes, Lucene depends on consecutive docIDs.

On the query side, the following things come to mind:
- for sorting, the FieldCache allocates arrays up to maxDoc()
- for deleted documents, it's a BitVector up to maxDoc()
- Some queries like MatchAllDocsQuery do a linear scan over every docID up
to maxDoc(), skipping deleted documents
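
To make the cost of the points above concrete, here is a rough sketch of how
dense, docID-indexed structures grow with maxDoc rather than with the number
of documents. It is a toy model, not Lucene code: the class name, the method
names, the one-int-per-document cache layout, and the example id range are all
assumptions chosen for illustration.

```java
// Toy model, not Lucene code: estimates the size of dense per-document
// structures that are indexed directly by docID.
final class DenseDocIdCost {

    // A one-bit-per-document deletion marker needs ceil(maxDoc / 8) bytes.
    static long deletedBitsBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    // A sort cache holding one 4-byte int per document (an assumed layout)
    // needs 4 * maxDoc bytes.
    static long intCacheBytes(long maxDoc) {
        return 4L * maxDoc;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L;              // documents actually indexed
        long largestExternalId = 2_000_000_000L; // hypothetical sparse id space

        // Consecutive ids: maxDoc equals the document count.
        System.out.println("consecutive ids: " + intCacheBytes(docCount) + " bytes");
        // Application-chosen ids: the array must cover the largest id plus one,
        // no matter how few documents exist.
        System.out.println("sparse ids:      " + intCacheBytes(largestExternalId + 1) + " bytes");
    }
}
```

With gaps allowed, the array length is driven by the largest id in use, so a
small index with large ids would pay for slots that no document occupies.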

Just add a field to every document that will act as the id. If you need more
performance you could cache the mapping from external_id -> internal_id.
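
A minimal sketch of such an external_id -> internal_id cache might look like
the following. It is a toy model rather than Lucene code: the class
ExternalIdCache and its methods are hypothetical, and internal ids are simply
handed out in insertion order to mimic consecutive docIDs.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model, not Lucene code: maps application-assigned ids to the
// consecutive internal ids that an index hands out in insertion order.
final class ExternalIdCache {
    private final Map<Long, Integer> externalToInternal = new HashMap<>();
    private int nextInternalId = 0;

    // Registers a document and returns its internal id. Adding an external id
    // that is already known returns the id assigned the first time.
    int add(long externalId) {
        Integer existing = externalToInternal.get(externalId);
        if (existing != null) {
            return existing;
        }
        int internalId = nextInternalId++;
        externalToInternal.put(externalId, internalId);
        return internalId;
    }

    // Returns the internal id for an external id, or -1 if it is unknown.
    int lookup(long externalId) {
        return externalToInternal.getOrDefault(externalId, -1);
    }
}
```

The external ids can be arbitrarily sparse because only the ids actually in
use occupy memory. In a real index the internal ids can change when segments
are merged after deletions, so a cache like this would presumably need to be
rebuilt whenever the index is reopened.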

-Yonik


Re: Are Non-consecutive Document IDs feasible?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
How about indexing a field with your application-centric id? This is
_the_ way this sort of thing is handled. You could then query for a
specific id using a TermQuery.
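
To illustrate why an id field gives a stable handle, here is a small sketch in
which two indexes receive the same documents in different orders. It is a toy
model, not Lucene code: ToyIndex and its methods are hypothetical, and the
exact-match map only stands in for the idea of looking up a single term in an
id field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Lucene code: internal docIDs follow insertion order, while
// an exact-match "id field" maps each application id to its internal docID.
final class ToyIndex {
    private final List<String> bodies = new ArrayList<>();        // position = internal docID
    private final Map<String, Integer> idTerms = new HashMap<>(); // id term -> internal docID

    // Adds a document and returns the internal docID it was assigned.
    int add(String externalId, String body) {
        int docId = bodies.size();
        bodies.add(body);
        idTerms.put(externalId, docId);
        return docId;
    }

    // Exact-match lookup on the id field; returns null if no document has it.
    String findByExternalId(String externalId) {
        Integer docId = idTerms.get(externalId);
        return docId == null ? null : bodies.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex first = new ToyIndex();
        ToyIndex second = new ToyIndex();
        first.add("X", "alpha");
        first.add("Y", "beta");
        second.add("Y", "beta");   // same documents, opposite insertion order
        second.add("X", "alpha");
        // Internal docIDs differ between the two indexes, but the lookup by
        // application id resolves to the same document in both.
        System.out.println(first.findByExternalId("X").equals(second.findByExternalId("X")));
    }
}
```

The application never needs to know the internal docID; it addresses the
document by its own identifier, which stays the same across indexes.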

     Erik



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: Are Non-consecutive Document IDs feasible?

Posted by Robert Engels <re...@ix.netcom.com>.
Just add another field to the document that serves as your "external" document
identifier. That is essentially what the request is asking for: another
layer of indirection between identifiers and physical locations in the
index.
