You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2006/11/09 21:49:26 UTC

Re: Searching by bit masks

Let me see if I have a clue what you're trying to do. Warning: I'm a bit
confused since "filter" has a very specific meaning in Lucene, so when you
talk about filters I'm assuming that you're NOT talking about Lucene
filters, but rather just a set of flags you're associating with each
document, and then a set of flags with the same semantics at search time.

If that's all true, here's the first approach I would take...

Don't store a bitmask in Lucene. Rather, index a field for each flag with
each document. NOTE: you do NOT have to have the same fields for all
documents....

Something like
Document doc = new Document();
doc.add("flag1", "Y");
doc.add("flag2", "Y");
IndexWriter.add(doc);

Document doc = new Document();
doc.add("flag1", "Y");   // NOTE: no "flag2" field. No problem.
IndexWriter.add(doc);

Now your searches are simple. Just search for the "or" of the fields with
the flags you're interested in. i.e. "flag1=Y" or "flag2="Y".....

If you want to get fancy, you can use TermDocs to enumerate the document IDs
with values for specified fields and perhaps even create a Lucene filter
based on that enumeration. This is much faster than you may think.

You probably want to index but not store such flags.

I suspect that this will be waaaaay faster than trying to inspect a binary
field in each document and then see if the bits were set, because that would
require you to read each document rather than just look at the terms in the
index.

I doubt that this will add much in the way of size to your index, and
anyway, disk space is cheap.

NOTE: the down side here is that you must delete and re-add a document to
modify it, which may be slow. But you'd have to do that when you updated
your bit mask anyway.....


Another approach: create a set of Lucene Filters (really, these are just
Java bitsets), one for each flag. All this is a bitset with one bit for each
document, or about 1M of memory per flag with 8M docs. So you'd populate
flag1Filter, flag2Filter... and have these ready whenever you needed them.

You should very rapidly be able to do any of the logical operations on these
bitsets (OR in your case) and use the resulting Lucene filter in your query.
It's up to you whether you create these filters as part of server warm-up or
just create them when needed, letting the first user who encounters them pay
the price for creating them. This is kind of the Solr warm-up idea. The
CachingWrapperFilter class should keep these around for you. Creating a
Lucene filter is much faster than I thought. See Lucene In Action for a
sample.

Of course, I may be entirely misunderstanding your problem, in which case
I'd ask you to explain a bit more <G>.

Best
Erick

On 11/9/06, Larry Taylor <lt...@employon.com> wrote:
>
> Hello,
>
> I am currently evaluating Lucene to see if it would be appropriate to
> replace my company's current search software. So far everything has been
> looking great, however there is one requirement that I am not too
> certain about.
>
> What we need to do is to be able to store a bit mask specifying various
> filter flags for a document in the index and then search this field by
> specifying another bit mask with desired filters, returning documents
> that have any of the specified flags set. In other words, we are doing a
> bitwise OR on the stored filter bit mask and the specified filter bit
> mask and if it is non-zero, we want to return the document.
>
> Before I started toying around with various options myself, I wanted to
> see if any of you good folks in the Lucene community had some
> suggestions for an efficient way to implement this.
>
> We currently need to index ~8,000,000 documents. We have several filter
> flag fields, the most important of which currently has 7 possible flags
> with any combination of the flags being valid. The number of flags is
> expected to increase rather rapidly in the near future.
>
> My preemptive thanks for your suggestions,
>
>
> Lawrence Taylor
> Senior Software Engineer
> Employon
> Message was edited by: ltaylor.employon
>
>
>

RE: Searching by bit masks

Posted by Larry Taylor <lt...@employon.com>.
Excellent, caching filters seem to fit the bill best so will use those
with the flags stored in the underlying index in the format you
suggested.  Thank you for the assistance.

Larry

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org] 
Sent: Friday, November 10, 2006 12:27 PM
To: java-user@lucene.apache.org
Subject: Re: Searching by bit masks

Erick Erickson wrote:
> Something like
> Document doc = new Document();
> doc.add("flag1", "Y");
> doc.add("flag2", "Y");
> IndexWriter.add(doc);

Fields have overheads.  It would be more efficient to implement this as 
a single field with a different value for each boolean flag (as others 
have suggested).

> Another approach: create a set of Lucene Filters (really, these are
just
> Java bitsets), one for each flag. All this is a bitset with one bit
for 
> each
> document, or about 1M of memory per flag with 8M docs. So you'd
populate
> flag1Filter, flag2Filter... and have these ready whenever you needed
them.

Cached filters will be faster especially when a large portion of the 
documents have the flag set.  If, for example, you have a flag that is 
set in half the documents that is specified in half the queries, then a 
cached filter will have a large impact on not only the performance of 
those queries but on the performance of your service as a whole.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Searching by bit masks

Posted by Doug Cutting <cu...@apache.org>.
Erick Erickson wrote:
> Something like
> Document doc = new Document();
> doc.add("flag1", "Y");
> doc.add("flag2", "Y");
> IndexWriter.add(doc);

Fields have overheads.  It would be more efficient to implement this as 
a single field with a different value for each boolean flag (as others 
have suggested).

> Another approach: create a set of Lucene Filters (really, these are just
> Java bitsets), one for each flag. All this is a bitset with one bit for 
> each
> document, or about 1M of memory per flag with 8M docs. So you'd populate
> flag1Filter, flag2Filter... and have these ready whenever you needed them.

Cached filters will be faster especially when a large portion of the 
documents have the flag set.  If, for example, you have a flag that is 
set in half the documents that is specified in half the queries, then a 
cached filter will have a large impact on not only the performance of 
those queries but on the performance of your service as a whole.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org