You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Larry Taylor <lt...@employon.com> on 2006/11/09 21:19:48 UTC

Searching by bit masks

Hello, 

I am currently evaluating Lucene to see if it would be appropriate to
replace my company's current search software. So far everything has been
looking great, however there is one requirement that I am not too
certain about. 

What we need to do is to be able to store a bit mask specifying various
filter flags for a document in the index and then search this field by
specifying another bit mask with desired filters, returning documents
that have any of the specified flags set. In other words, we are doing a
bitwise OR on the stored filter bit mask and the specified filter bit
mask and if it is non-zero, we want to return the document. 

Before I started toying around with various options myself, I wanted to
see if any of you good folks in the Lucene community had some
suggestions for an efficient way to implement this. 

We currently need to index ~8,000,000 documents. We have several filter
flag fields, the most important of which currently has 7 possible flags
with any combination of the flags being valid. The number of flags is
expected to increase rather rapidly in the near future. 

My preemptive thanks for your suggestions,


Lawrence Taylor
Senior Software Engineer
Employon
Message was edited by: ltaylor.employon

Re: Searching by bit masks

Posted by Erick Erickson <er...@gmail.com>.

You could store a value for each flag then be careful about what analyzers
you use. For instance, using WhitespaceAnalyzer (index AND search) and doing
your own casing. That is, make sure you lowercase as necessary (NOTE:
operators AND, OR NOT must not be lowercased if you send them through
queryparser) when you index and when you query.

Doc field for these is "flag"
Doc 1 has tokens indexed "joke=y", "adult=n" (input stream is "joke=y
adult=n")
Doc 2 has tokens indexed "joke=y", "adult=y"

Now your query for site 1 looks like "joke=y" AND "adult=y" (looking at flag
field)
and site 2 is "joke=y" AND "adult=n" (ditto)

Non jokes would just be "joke=n"

etc.....

Note especially that if you used, say, StandardAnalyzer, you'd get tokens
"joke", "y", "n", "adult" in your flag field for doc1 and "joke", "y",
"adult" for doc 2, which is not what you want at all.

This will certainly increase your size a bit. Do you have any idea how big
your indexes are going to be? If storing each field makes the index grow
from, say, 500M to 550M, you won't care. If you're storing a bazillion
documents and it'll bloat your index from 10G to 20G, you probably do. One
pertinent question is "how many fields do you expect to store per
document?"...

Best
Erick

On 11/28/06, Biggy <bi...@web.de> wrote:
>
>
> The background of this is also separating content according to domains
>
> Example:
> - pictureA (marked as a "joke" #flag :1)
> - pictureB (marked as a "adult picture" #flag: 2)
> Site1: Users allowed to view everything (pictureA, pictureB )
> Site2: Users allowed to view everything except pictureB (no adult content)
>
> This szenario, for instance means a query from each site via sql could be
> Site1: ... status & 3 ; // all pictures (joke,adult)
> Site2:...  not (status & 1) ; // no adult stuff
>
> PROBLEMS
> Because the business rules are a negation - everything except this and
> that.
> We have a problem
> adding new content types. Adding a new picture type means changing the
> whole
> picture flags with the new status.
>
> So backward compatibility is not possible.
>
> That's why i thought with Lucene, i could search using "NOT"
> That is: Give me all non-adult pictures in case of Site2
>
> Any suggestions to overcome this flag problem, without changing the DB
> status and re-indexing everything on new picture types.
>
> thanks for good advice thus far
>
>
>
> Erick Erickson wrote:
> >
> > Lucene will automatically separate tokens during index and search if you
> > use
> > the right analyzer. See the various classes that implement Analyzer. I
> > don't
> > know if you really wanted to use the numeric literals, but I wouldn't.
> The
> > analyzers that do the most for you (automatically break up on spaces,
> > lowercase, etc) often ignore numbers. Just in case you were thinking
> about
> > doing it that way....
> >
> > I would NOT store the inverse and then use NOT. the NOT operator doesn't
> > behave as you expect, it's not a strict boolean operator. See the thread
> > titled *Another problem with the QueryParser *in this list. And anything
> > else Chris or Yonik or ...  has to say on the subject. This is a source
> of
> > considerable confusion. For instance, you can't query on just the phrase
> > "NOT no_music". Not to mention what happens if/when a user can actually
> > NOT
> > in the query.
> >
> > In general, I *strongly* recommend doing it the simple, intuitive way
> > first.
> > Only get fancy if you actually have something to gain. Here, you're
> > talking
> > about some storage savings. Maybe (have you checked how big your index
> > will
> > be? Will this approach be enough to matter? How do you know?). You're
> > creating code that will confuse not only yourself but whoever has to get
> > into this code later.
> >
> > By rushing in and doing an optimization (which you neither  *know* nor
> can
> > reasonably expect to gain you anything measurable since you don't know
> the
> > internals of Lucene well enough to predict. Neither do I BTW...) you're
> > creating complexity which you don't know is necessary. I'd only go there
> > if
> > doing it the straight-forward way shows performance issues. I'd also bet
> > that any performance issues you see are not related to this issue......
> >
> > Best
> > Erick
> >
> > On 11/28/06, Biggy <bi...@web.de> wrote:
> >>
> >>
> >>
> >> OK here what i've come up with - After reading your suggestions
> >> - bit set from DB stays untouched
> >> - only one field shall be used to store interest field bits in the
> >> document:
> >> "interest". Saves disk space.
> >> - The bits shall be not be converted to readable string but added as
> >> values
> >> separated by space " "
> >> ====Code Below====
> >> -----------------
> >> public Document getDocument(int db_interest_bits)
> >> {
> >>    String interest_string ="";
> >>    // sport
> >>    if (db_interest_bits & 1) {
> >>        interest_string +="1"+" "; // empty space as delimiter
> >>    }
> >>    // music
> >>    if (bitsfromdb & 2) {
> >>        interest_string +="2"+" "; // empty space as delimiter
> >>    }
> >>
> >>    Document doc = new Document();
> >>    doc.add("interest", interest_string);
> >>    // how do i tell Lucene to separate tokens on search ?
> >>
> >>    return doc;
> >> }
> >> ---------------
> >>
> >> FURTHERMORE - i realized that almost all potential values are often set
> >> i.e.
> >> sport music film
> >> sport music
> >> sport music film
> >> sport music film
> >> sport music
> >> music
> >>
> >> So i was thinking : How about doing the reverse when it comes to
> building
> >> the index ?
> >> I would onyl store the fields that are not set.
> >> The search would be a negation.
> >>
> >> Example Values ofd interest:
> >> 1. "no_film" => Only a film is not set
> >> 2. "no_sport no_film" => film and sport are not set
> >> 3. "" => all values are set since this is a negation
> >>
> >>
> >> It follows, searching for people interested in music:
> >> => search for NOT no_music
> >>
> >> QUESTION
> >> How does the perfomance of a negative search NOT compare to a normal
> one
> >> I.E.
> >> "NOT no_music" vs "music" search under the premise that most interest
> >> flags
> >> are set ?
> >>
> >>
> >>
> >> ---------
> >>
> >> Daniel Noll-3 wrote:
> >> >
> >> > Erick Erickson wrote:
> >> >> Well, you really have the code already <G>. From the top...
> >> >>
> >> >> 1> there's no good way to support searching bitfields If you wanted,
> >> you
> >> >> could probably store it as a small integer and then search on it,
> but
> >> >> that's
> >> >> waaay too complicated than you want.
> >> >>
> >> >> 2> Add the fields like you have the snippet from, something like
> >> >> Document doc = new Document.
> >> >> if (bitsfromdb & 1) {
> >> >>    doc.add("sport", "y");
> >> >> }
> >> >> if (bitsfromdb & 2) {
> >> >>    doc.add("music", "y");
> >> >> }
> >> >
> >> > Beware that if there are a large number of bits, this is going to
> >> impact
> >> > memory usage due to there being more fields.
> >> >
> >> > Perhaps a better way would be to use a single "bits" field and store
> >> the
> >> > words "sport", "music", ... in that field.
> >> >
> >> > Daniel
> >> >
> >> >
> >> > --
> >> > Daniel Noll
> >> >
> >> > Nuix Pty Ltd
> >> > Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280
> >> 0699
> >> > Web: http://nuix.com/                               Fax: +61 2 9212
> >> 6902
> >> >
> >> > This message is intended only for the named recipient. If you are not
> >> > the intended recipient you are notified that disclosing, copying,
> >> > distributing or taking any action in reliance on the contents of this
> >> > message or attachment is strictly prohibited.
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7576286
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7581771
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Searching by bit masks

Posted by Biggy <bi...@web.de>.

The background of this is also separating content according to domains

Example: 
- pictureA (marked as a "joke" #flag :1)
- pictureB (marked as a "adult picture" #flag: 2)
Site1: Users allowed to view everything (pictureA, pictureB )
Site2: Users allowed to view everything except pictureB (no adult content)

This szenario, for instance means a query from each site via sql could be
Site1: ... status & 3 ; // all pictures (joke,adult)
Site2:...  not (status & 1) ; // no adult stuff

PROBLEMS
Because the business rules are a negation - everything except this and that.
We have a problem
adding new content types. Adding a new picture type means changing the whole
picture flags with the new status. 

So backward compatibility is not possible.

That's why i thought with Lucene, i could search using "NOT"
That is: Give me all non-adult pictures in case of Site2

Any suggestions to overcome this flag problem, without changing the DB
status and re-indexing everything on new picture types.

thanks for good advice thus far



Erick Erickson wrote:
> 
> Lucene will automatically separate tokens during index and search if you
> use
> the right analyzer. See the various classes that implement Analyzer. I
> don't
> know if you really wanted to use the numeric literals, but I wouldn't. The
> analyzers that do the most for you (automatically break up on spaces,
> lowercase, etc) often ignore numbers. Just in case you were thinking about
> doing it that way....
> 
> I would NOT store the inverse and then use NOT. the NOT operator doesn't
> behave as you expect, it's not a strict boolean operator. See the thread
> titled *Another problem with the QueryParser *in this list. And anything
> else Chris or Yonik or ...  has to say on the subject. This is a source of
> considerable confusion. For instance, you can't query on just the phrase
> "NOT no_music". Not to mention what happens if/when a user can actually
> NOT
> in the query.
> 
> In general, I *strongly* recommend doing it the simple, intuitive way
> first.
> Only get fancy if you actually have something to gain. Here, you're
> talking
> about some storage savings. Maybe (have you checked how big your index
> will
> be? Will this approach be enough to matter? How do you know?). You're
> creating code that will confuse not only yourself but whoever has to get
> into this code later.
> 
> By rushing in and doing an optimization (which you neither  *know* nor can
> reasonably expect to gain you anything measurable since you don't know the
> internals of Lucene well enough to predict. Neither do I BTW...) you're
> creating complexity which you don't know is necessary. I'd only go there
> if
> doing it the straight-forward way shows performance issues. I'd also bet
> that any performance issues you see are not related to this issue......
> 
> Best
> Erick
> 
> On 11/28/06, Biggy <bi...@web.de> wrote:
>>
>>
>>
>> OK here what i've come up with - After reading your suggestions
>> - bit set from DB stays untouched
>> - only one field shall be used to store interest field bits in the
>> document:
>> "interest". Saves disk space.
>> - The bits shall be not be converted to readable string but added as
>> values
>> separated by space " "
>> ====Code Below====
>> -----------------
>> public Document getDocument(int db_interest_bits)
>> {
>>    String interest_string ="";
>>    // sport
>>    if (db_interest_bits & 1) {
>>        interest_string +="1"+" "; // empty space as delimiter
>>    }
>>    // music
>>    if (bitsfromdb & 2) {
>>        interest_string +="2"+" "; // empty space as delimiter
>>    }
>>
>>    Document doc = new Document();
>>    doc.add("interest", interest_string);
>>    // how do i tell Lucene to separate tokens on search ?
>>
>>    return doc;
>> }
>> ---------------
>>
>> FURTHERMORE - i realized that almost all potential values are often set
>> i.e.
>> sport music film
>> sport music
>> sport music film
>> sport music film
>> sport music
>> music
>>
>> So i was thinking : How about doing the reverse when it comes to building
>> the index ?
>> I would onyl store the fields that are not set.
>> The search would be a negation.
>>
>> Example Values ofd interest:
>> 1. "no_film" => Only a film is not set
>> 2. "no_sport no_film" => film and sport are not set
>> 3. "" => all values are set since this is a negation
>>
>>
>> It follows, searching for people interested in music:
>> => search for NOT no_music
>>
>> QUESTION
>> How does the perfomance of a negative search NOT compare to a normal one
>> I.E.
>> "NOT no_music" vs "music" search under the premise that most interest
>> flags
>> are set ?
>>
>>
>>
>> ---------
>>
>> Daniel Noll-3 wrote:
>> >
>> > Erick Erickson wrote:
>> >> Well, you really have the code already <G>. From the top...
>> >>
>> >> 1> there's no good way to support searching bitfields If you wanted,
>> you
>> >> could probably store it as a small integer and then search on it, but
>> >> that's
>> >> waaay too complicated than you want.
>> >>
>> >> 2> Add the fields like you have the snippet from, something like
>> >> Document doc = new Document.
>> >> if (bitsfromdb & 1) {
>> >>    doc.add("sport", "y");
>> >> }
>> >> if (bitsfromdb & 2) {
>> >>    doc.add("music", "y");
>> >> }
>> >
>> > Beware that if there are a large number of bits, this is going to
>> impact
>> > memory usage due to there being more fields.
>> >
>> > Perhaps a better way would be to use a single "bits" field and store
>> the
>> > words "sport", "music", ... in that field.
>> >
>> > Daniel
>> >
>> >
>> > --
>> > Daniel Noll
>> >
>> > Nuix Pty Ltd
>> > Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280
>> 0699
>> > Web: http://nuix.com/                               Fax: +61 2 9212
>> 6902
>> >
>> > This message is intended only for the named recipient. If you are not
>> > the intended recipient you are notified that disclosing, copying,
>> > distributing or taking any action in reliance on the contents of this
>> > message or attachment is strictly prohibited.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7576286
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7581771
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Searching by bit masks

Posted by Erick Erickson <er...@gmail.com>.

Lucene will automatically separate tokens during index and search if you use
the right analyzer. See the various classes that implement Analyzer. I don't
know if you really wanted to use the numeric literals, but I wouldn't. The
analyzers that do the most for you (automatically break up on spaces,
lowercase, etc) often ignore numbers. Just in case you were thinking about
doing it that way....

I would NOT store the inverse and then use NOT. the NOT operator doesn't
behave as you expect, it's not a strict boolean operator. See the thread
titled *Another problem with the QueryParser *in this list. And anything
else Chris or Yonik or ...  has to say on the subject. This is a source of
considerable confusion. For instance, you can't query on just the phrase
"NOT no_music". Not to mention what happens if/when a user can actually NOT
in the query.

In general, I *strongly* recommend doing it the simple, intuitive way first.
Only get fancy if you actually have something to gain. Here, you're talking
about some storage savings. Maybe (have you checked how big your index will
be? Will this approach be enough to matter? How do you know?). You're
creating code that will confuse not only yourself but whoever has to get
into this code later.

By rushing in and doing an optimization (which you neither  *know* nor can
reasonably expect to gain you anything measurable since you don't know the
internals of Lucene well enough to predict. Neither do I BTW...) you're
creating complexity which you don't know is necessary. I'd only go there if
doing it the straight-forward way shows performance issues. I'd also bet
that any performance issues you see are not related to this issue......

Best
Erick

On 11/28/06, Biggy <bi...@web.de> wrote:
>
>
>
> OK here what i've come up with - After reading your suggestions
> - bit set from DB stays untouched
> - only one field shall be used to store interest field bits in the
> document:
> "interest". Saves disk space.
> - The bits shall be not be converted to readable string but added as
> values
> separated by space " "
> ====Code Below====
> -----------------
> public Document getDocument(int db_interest_bits)
> {
>    String interest_string ="";
>    // sport
>    if (db_interest_bits & 1) {
>        interest_string +="1"+" "; // empty space as delimiter
>    }
>    // music
>    if (bitsfromdb & 2) {
>        interest_string +="2"+" "; // empty space as delimiter
>    }
>
>    Document doc = new Document();
>    doc.add("interest", interest_string);
>    // how do i tell Lucene to separate tokens on search ?
>
>    return doc;
> }
> ---------------
>
> FURTHERMORE - i realized that almost all potential values are often set
> i.e.
> sport music film
> sport music
> sport music film
> sport music film
> sport music
> music
>
> So i was thinking : How about doing the reverse when it comes to building
> the index ?
> I would onyl store the fields that are not set.
> The search would be a negation.
>
> Example Values ofd interest:
> 1. "no_film" => Only a film is not set
> 2. "no_sport no_film" => film and sport are not set
> 3. "" => all values are set since this is a negation
>
>
> It follows, searching for people interested in music:
> => search for NOT no_music
>
> QUESTION
> How does the perfomance of a negative search NOT compare to a normal one
> I.E.
> "NOT no_music" vs "music" search under the premise that most interest
> flags
> are set ?
>
>
>
> ---------
>
> Daniel Noll-3 wrote:
> >
> > Erick Erickson wrote:
> >> Well, you really have the code already <G>. From the top...
> >>
> >> 1> there's no good way to support searching bitfields If you wanted,
> you
> >> could probably store it as a small integer and then search on it, but
> >> that's
> >> waaay too complicated than you want.
> >>
> >> 2> Add the fields like you have the snippet from, something like
> >> Document doc = new Document.
> >> if (bitsfromdb & 1) {
> >>    doc.add("sport", "y");
> >> }
> >> if (bitsfromdb & 2) {
> >>    doc.add("music", "y");
> >> }
> >
> > Beware that if there are a large number of bits, this is going to impact
> > memory usage due to there being more fields.
> >
> > Perhaps a better way would be to use a single "bits" field and store the
> > words "sport", "music", ... in that field.
> >
> > Daniel
> >
> >
> > --
> > Daniel Noll
> >
> > Nuix Pty Ltd
> > Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
> > Web: http://nuix.com/                               Fax: +61 2 9212 6902
> >
> > This message is intended only for the named recipient. If you are not
> > the intended recipient you are notified that disclosing, copying,
> > distributing or taking any action in reliance on the contents of this
> > message or attachment is strictly prohibited.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7576286
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Searching by bit masks

Posted by Biggy <bi...@web.de>.


OK here what i've come up with - After reading your suggestions
- bit set from DB stays untouched
- only one field shall be used to store interest field bits in the document:
"interest". Saves disk space.
- The bits shall be not be converted to readable string but added as values
separated by space " "
====Code Below====
-----------------
public Document getDocument(int db_interest_bits)
{
   String interest_string ="";
   // sport
   if (db_interest_bits & 1) {
       interest_string +="1"+" "; // empty space as delimiter
   }
   // music
   if (bitsfromdb & 2) {
       interest_string +="2"+" "; // empty space as delimiter
   } 

   Document doc = new Document(); 
   doc.add("interest", interest_string); 
   // how do i tell Lucene to separate tokens on search ?

   return doc;
}
---------------

FURTHERMORE - i realized that almost all potential values are often set
i.e.
sport music film
sport music
sport music film
sport music film
sport music
music

So i was thinking : How about doing the reverse when it comes to building
the index ?
I would onyl store the fields that are not set.
The search would be a negation.

Example Values ofd interest:
1. "no_film" => Only a film is not set
2. "no_sport no_film" => film and sport are not set
3. "" => all values are set since this is a negation


It follows, searching for people interested in music:
=> search for NOT no_music

QUESTION
How does the perfomance of a negative search NOT compare to a normal one
I.E. 
"NOT no_music" vs "music" search under the premise that most interest flags
are set ?



---------

Daniel Noll-3 wrote:
> 
> Erick Erickson wrote:
>> Well, you really have the code already <G>. From the top...
>> 
>> 1> there's no good way to support searching bitfields If you wanted, you
>> could probably store it as a small integer and then search on it, but 
>> that's
>> waaay too complicated than you want.
>> 
>> 2> Add the fields like you have the snippet from, something like
>> Document doc = new Document.
>> if (bitsfromdb & 1) {
>>    doc.add("sport", "y");
>> }
>> if (bitsfromdb & 2) {
>>    doc.add("music", "y");
>> }
> 
> Beware that if there are a large number of bits, this is going to impact 
> memory usage due to there being more fields.
> 
> Perhaps a better way would be to use a single "bits" field and store the 
> words "sport", "music", ... in that field.
> 
> Daniel
> 
> 
> -- 
> Daniel Noll
> 
> Nuix Pty Ltd
> Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
> Web: http://nuix.com/                               Fax: +61 2 9212 6902
> 
> This message is intended only for the named recipient. If you are not
> the intended recipient you are notified that disclosing, copying,
> distributing or taking any action in reliance on the contents of this
> message or attachment is strictly prohibited.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7576286
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Searching by bit masks

Posted by Daniel Noll <da...@nuix.com>.

Erick Erickson wrote:
> Well, you really have the code already <G>. From the top...
> 
> 1> there's no good way to support searching bitfields If you wanted, you
> could probably store it as a small integer and then search on it, but 
> that's
> waaay too complicated than you want.
> 
> 2> Add the fields like you have the snippet from, something like
> Document doc = new Document.
> if (bitsfromdb & 1) {
>    doc.add("sport", "y");
> }
> if (bitsfromdb & 2) {
>    doc.add("music", "y");
> }

Beware that if there are a large number of bits, this is going to impact 
memory usage due to there being more fields.

Perhaps a better way would be to use a single "bits" field and store the 
words "sport", "music", ... in that field.

Daniel

-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Searching by bit masks

Posted by Erick Erickson <er...@gmail.com>.

Well, you really have the code already <G>. From the top...

1> there's no good way to support searching bitfields If you wanted, you
could probably store it as a small integer and then search on it, but that's
waaay too complicated than you want.

2> Add the fields like you have the snippet from, something like
Document doc = new Document.
if (bitsfromdb & 1) {
    doc.add("sport", "y");
}
if (bitsfromdb & 2) {
    doc.add("music", "y");
}
.
.
.
IndexWriter.add(doc);

Now, when searching, search on things like new Term("sport", "y"))..... and
you'll only get the documents that correspond to the 2s bit being set.

Watch out for capitalization. Y may not be equivalent to y. It depends on
the analyzer you use at index AND search time.

You can or as many of these together as you want. In your example, you could
have up to 4 sub-clauses just for the bitmask-equivalents.

NOTE: the documents won't all have the same fields. A document may not have,
for instance, the "sports" field. This is OK in Lucene, but not the first
thing folks with their DB hat on think of....

Get a copy of Luke (google lucene luke) and get familiar with it for
examining your index and the effects of various analyzers. Really, really,
really get a copy of Luke. Really.

Do you have a copy of "Lucene In Action"? If not, I highly recommend it. It
has tons of useful examples as well as a good introduction to many of the
concepts. It's written to the 1.4 codebase, so be warned that there are some
incompatibilities that are, for the most part, minor.

Best
Erick
On 11/27/06, Biggy <bi...@web.de> wrote:
>
>
>
> i have the same problem here. I have an interest bit field, which i
> receive from the applciation backend. I have control over how the
> docuemtns
> are built.
> To be specific, the field looks like this:
>
> ID: interest
> 1 : sport
> 2 : music
> 4 : film
> 8 : clubs
>
> So someone interested in sports and music can be found by "interest & 3"
> =>
> e.g. when using SQL.
>
> I do not wish to Post-Filter the results
> On to Lucene, Is there a filter which supports this kind of query ?
>
> Someone suggested splitting the bits into fields:
> > Document doc = new Document();
> > doc.add("flag1", "Y");
> > doc.add("flag2", "Y");
> > IndexWriter.add(doc);
> Is this helpful at all ?
>
> Code would be helpful too as i am a newbie
>
>
>
> ltaylor.employon wrote:
> >
> > Hello,
> >
> > I am currently evaluating Lucene to see if it would be appropriate to
> > replace my company's current search software. So far everything has been
> > looking great, however there is one requirement that I am not too
> > certain about.
> >
> > What we need to do is to be able to store a bit mask specifying various
> > filter flags for a document in the index and then search this field by
> > specifying another bit mask with desired filters, returning documents
> > that have any of the specified flags set. In other words, we are doing a
> > bitwise OR on the stored filter bit mask and the specified filter bit
> > mask and if it is non-zero, we want to return the document.
> >
> > Before I started toying around with various options myself, I wanted to
> > see if any of you good folks in the Lucene community had some
> > suggestions for an efficient way to implement this.
> >
> > We currently need to index ~8,000,000 documents. We have several filter
> > flag fields, the most important of which currently has 7 possible flags
> > with any combination of the flags being valid. The number of flags is
> > expected to increase rather rapidly in the near future.
> >
> > My preemptive thanks for your suggestions,
> >
> >
> > Lawrence Taylor
> > Senior Software Engineer
> > Employon
> > Message was edited by: ltaylor.employon
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7564237
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Searching by bit masks

Posted by Biggy <bi...@web.de>.


i have the same problem here. I have an interest bit field, which i 
receive from the applciation backend. I have control over how the docuemtns
are built.
To be specific, the field looks like this:

ID: interest
1 : sport
2 : music
4 : film
8 : clubs

So someone interested in sports and music can be found by "interest & 3" =>
e.g. when using SQL.

I do not wish to Post-Filter the results
On to Lucene, Is there a filter which supports this kind of query ?

Someone suggested splitting the bits into fields:
> Document doc = new Document();
> doc.add("flag1", "Y");
> doc.add("flag2", "Y");
> IndexWriter.add(doc); 
Is this helpful at all ?

Code would be helpful too as i am a newbie



ltaylor.employon wrote:
> 
> Hello, 
> 
> I am currently evaluating Lucene to see if it would be appropriate to
> replace my company's current search software. So far everything has been
> looking great, however there is one requirement that I am not too
> certain about. 
> 
> What we need to do is to be able to store a bit mask specifying various
> filter flags for a document in the index and then search this field by
> specifying another bit mask with desired filters, returning documents
> that have any of the specified flags set. In other words, we are doing a
> bitwise OR on the stored filter bit mask and the specified filter bit
> mask and if it is non-zero, we want to return the document. 
> 
> Before I started toying around with various options myself, I wanted to
> see if any of you good folks in the Lucene community had some
> suggestions for an efficient way to implement this. 
> 
> We currently need to index ~8,000,000 documents. We have several filter
> flag fields, the most important of which currently has 7 possible flags
> with any combination of the flags being valid. The number of flags is
> expected to increase rather rapidly in the near future. 
> 
> My preemptive thanks for your suggestions,
> 
> 
> Lawrence Taylor
> Senior Software Engineer
> Employon
> Message was edited by: ltaylor.employon
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7564237
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Searching by bit masks

Posted by John Haxby <jc...@scalix.com>.

Larry Taylor wrote:
> What we need to do is to be able to store a bit mask specifying various
> filter flags for a document in the index and then search this field by
> specifying another bit mask with desired filters, returning documents
> that have any of the specified flags set. In other words, we are doing a
> bitwise OR on the stored filter bit mask and the specified filter bit
> mask and if it is non-zero, we want to return the document. 
>   
We did this by naming the flags and storing them in a single field.  So 
if we have three flags called, oh, alice, bob and clive and we wanted to 
indicate alice and clive were set and bob clear then we store "alice", 
"unbob" and "clive" in the "flags" field.

That seems to be quite efficient -- I can now search for any combination 
of flags (or bits) set or unset and I can add new flags at will.

jch

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Searching by bit masks

Posted by Larry Taylor <lt...@employon.com>.

Excellent, caching filters seem to fit the bill best so will use those
with the flags stored in the underlying index in the format you
suggested.  Thank you for the assistance.

Larry

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org] 
Sent: Friday, November 10, 2006 12:27 PM
To: java-user@lucene.apache.org
Subject: Re: Searching by bit masks

Erick Erickson wrote:
> Something like
> Document doc = new Document();
> doc.add("flag1", "Y");
> doc.add("flag2", "Y");
> IndexWriter.add(doc);

Fields have overheads.  It would be more efficient to implement this as 
a single field with a different value for each boolean flag (as others 
have suggested).

> Another approach: create a set of Lucene Filters (really, these are
just
> Java bitsets), one for each flag. All this is a bitset with one bit
for 
> each
> document, or about 1M of memory per flag with 8M docs. So you'd
populate
> flag1Filter, flag2Filter... and have these ready whenever you needed
them.

Cached filters will be faster especially when a large portion of the 
documents have the flag set.  If, for example, you have a flag that is 
set in half the documents that is specified in half the queries, then a 
cached filter will have a large impact on not only the performance of 
those queries but on the performance of your service as a whole.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Searching by bit masks

Posted by Doug Cutting <cu...@apache.org>.

Erick Erickson wrote:
> Something like
> Document doc = new Document();
> doc.add("flag1", "Y");
> doc.add("flag2", "Y");
> IndexWriter.add(doc);

Fields have overheads.  It would be more efficient to implement this as 
a single field with a different value for each boolean flag (as others 
have suggested).

> Another approach: create a set of Lucene Filters (really, these are just
> Java bitsets), one for each flag. All this is a bitset with one bit for 
> each
> document, or about 1M of memory per flag with 8M docs. So you'd populate
> flag1Filter, flag2Filter... and have these ready whenever you needed them.

Cached filters will be faster especially when a large portion of the 
documents have the flag set.  If, for example, you have a flag that is 
set in half the documents that is specified in half the queries, then a 
cached filter will have a large impact on not only the performance of 
those queries but on the performance of your service as a whole.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Searching by bit masks

Posted by Erick Erickson <er...@gmail.com>.

Let me see if I have a clue what you're trying to do. Warning: I'm a bit
confused since "filter" has a very specific meaning in Lucene, so when you
talk about filters I'm assuming that you're NOT talking about Lucene
filters, but rather just a set of flags you're associating with each
document, and then a set of flags with the same semantics at search time.

If that's all true, here's the first approach I would take...

Don't store a bitmask in Lucene. Rather, index a field for each flag with
each document. NOTE: you do NOT have to have the same fields for all
documents....

Something like
Document doc = new Document();
doc.add("flag1", "Y");
doc.add("flag2", "Y");
IndexWriter.add(doc);

Document doc = new Document();
doc.add("flag1", "Y");   // NOTE: no "flag2" field. No problem.
IndexWriter.add(doc);

Now your searches are simple. Just search for the "or" of the fields with
the flags you're interested in. i.e. "flag1=Y" or "flag2="Y".....

If you want to get fancy, you can use TermDocs to enumerate the document IDs
with values for specified fields and perhaps even create a Lucene filter
based on that enumeration. This is much faster than you may think.

You probably want to index but not store such flags.

I suspect that this will be waaaaay faster than trying to inspect a binary
field in each document and then see if the bits were set, because that would
require you to read each document rather than just look at the terms in the
index.

I doubt that this will add much in the way of size to your index, and
anyway, disk space is cheap.

NOTE: the down side here is that you must delete and re-add a document to
modify it, which may be slow. But you'd have to do that when you updated
your bit mask anyway.....

Another approach: create a set of Lucene Filters (really, these are just
Java bitsets), one for each flag. All this is a bitset with one bit for each
document, or about 1M of memory per flag with 8M docs. So you'd populate
flag1Filter, flag2Filter... and have these ready whenever you needed them.

You should very rapidly be able to do any of the logical operations on these
bitsets (OR in your case) and use the resulting Lucene filter in your query.
It's up to you whether you create these filters as part of server warm-up or
just create them when needed, letting the first user who encounters them pay
the price for creating them. This is kind of the Solr warm-up idea. The
CachingWrapperFilter class should keep these around for you. Creating a
Lucene filter is much faster than I thought. See Lucene In Action for a
sample.

Of course, I may be entirely misunderstanding your problem, in which case
I'd ask you to explain a bit more <G>.

Best
Erick

On 11/9/06, Larry Taylor <lt...@employon.com> wrote:
>
> Hello,
>
> I am currently evaluating Lucene to see if it would be appropriate to
> replace my company's current search software. So far everything has been
> looking great, however there is one requirement that I am not too
> certain about.
>
> What we need to do is to be able to store a bit mask specifying various
> filter flags for a document in the index and then search this field by
> specifying another bit mask with desired filters, returning documents
> that have any of the specified flags set. In other words, we are doing a
> bitwise OR on the stored filter bit mask and the specified filter bit
> mask and if it is non-zero, we want to return the document.
>
> Before I started toying around with various options myself, I wanted to
> see if any of you good folks in the Lucene community had some
> suggestions for an efficient way to implement this.
>
> We currently need to index ~8,000,000 documents. We have several filter
> flag fields, the most important of which currently has 7 possible flags
> with any combination of the flags being valid. The number of flags is
> expected to increase rather rapidly in the near future.
>
> My preemptive thanks for your suggestions,
>
>
> Lawrence Taylor
> Senior Software Engineer
> Employon
> Message was edited by: ltaylor.employon
>
>
>