You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Trond Aksel Myklebust <ta...@online.no> on 2005/10/16 16:17:15 UTC

Large queries

How is Lucene handling very large queries? I have 6million documents, which
each has a "docID" field. There is a total of 20000 distinct docID's, so
many documents got the same docID which consists of a filename (only name,
not path).

Sometimes, I must get all documents that has one of 10 docID's, and
sometimes I need to get all documents that has one of 10000 docIDs. Is there
any other way than doing a query: docID:(file1 file2 file3 file4..) ?

 

Trond A Myklebust

Re: Large queries

Posted by jian chen <ch...@gmail.com>.

Hi, Trond,

It should be no problem for Lucene to handle 6 million documents.

For your query, it seems you want to do a disjunctive (or'ed) query for
multiple terms, 10 terms or 10000 terms for example. The worst case I can
think of is, you can very easily write your own query class to handle this,
utilizing the TermDocs iterator class.

Say, if you want to have one of 10 docID's, you have 10 TermDocs. Each
TermDoc corresponds to a term. Then, you can do a multi-way (in this case,
10 way) merge of these 10 TermDocs, and generate a final list of the doc
ids.

I suggest that you can look at PhraseQuery and PhraseScrorer to see how it
does the conjunctive merge to find the docs that contains all the terms. In
your case, instead of doing intersection, you are doing a union of all the
term docs, right?

Maybe there is already some query class that comes with the Lucene package
that does this. However, the method I described should also help just in
case.

Cheers,

Jian

On 10/16/05, Trond Aksel Myklebust <ta...@online.no> wrote:
>
> How is Lucene handling very large queries? I have 6million documents,
> which
> each has a "docID" field. There is a total of 20000 distinct docID's, so
> many documents got the same docID which consists of a filename (only name,
> not path).
>
> Sometimes, I must get all documents that has one of 10 docID's, and
> sometimes I need to get all documents that has one of 10000 docIDs. Is
> there
> any other way than doing a query: docID:(file1 file2 file3 file4..) ?
>
>
>
> Trond A Myklebust
>
>
>
>
>
>

Re: Large queries

Posted by jian chen <ch...@gmail.com>.

Hi, Trond,

By the way, it appears to me that Lucene uses the iterator pattern a lot,
like SegmentTermEnum, TermDocs, TermPositions, etc. Each iterator uses the
underlying fix sized buffer to load a chunck of data at a time. So, even you
have millions of documents, you shouldn't run into memory problem given
Lucene's index data structure.

That said, there are some parameters that you can tweak given really large
amount of documents. But I suggest that you don't need to worry about it
unless you really run into it.

Cheers,

Jian

On 10/16/05, Trond Aksel Myklebust <ta...@online.no> wrote:
>
> How is Lucene handling very large queries? I have 6million documents,
> which
> each has a "docID" field. There is a total of 20000 distinct docID's, so
> many documents got the same docID which consists of a filename (only name,
> not path).
>
> Sometimes, I must get all documents that has one of 10 docID's, and
> sometimes I need to get all documents that has one of 10000 docIDs. Is
> there
> any other way than doing a query: docID:(file1 file2 file3 file4..) ?
>
>
>
> Trond A Myklebust
>
>
>
>
>
>