Posted to dev@lucene.apache.org by Israel Ekpo <is...@gmail.com> on 2010/05/13 23:27:08 UTC

Bitwise Operations on Integer Fields in Lucene and Solr Index

Hello Lucene and Solr Community

I have a custom org.apache.lucene.search.Filter that I would like to
contribute to the Lucene and Solr projects.

So I would need some direction as to how to create an issue or submit a
patch.

It looks like there have been changes to the way this is done since the
latest merge of the two projects (Lucene and Solr).

Recently, some Solr users have been looking for a way to perform bitwise
operations between an integer value and some fields in the index.

So, I wrote a Solr QParser plugin to do this using a custom Lucene Filter.

This package makes it possible to filter results returned from a query based
on the results of a bitwise operation on an integer field in the documents
returned from the pre-constructed query.

You can perform three basic types of operations on these integer fields:

    * BitwiseOperation.BITWISE_AND (bitwise AND)
    * BitwiseOperation.BITWISE_OR (bitwise inclusive OR)
    * BitwiseOperation.BITWISE_XOR (bitwise exclusive OR)

You can also negate the results of these operations.

For example, imagine there is an integer field in the index named "flags"
with a value of 8 (1000 in binary). The following results are expected:

   1. A source value of 8 will match during a BitwiseOperation.BITWISE_AND
operation, with negate set to false.
   2. A source value of 4 will match during a BitwiseOperation.BITWISE_AND
operation, with negate set to true.
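
The two cases above can be checked with a small standalone sketch. It assumes
that a document matches when the result of the bitwise operation is non-zero,
and that negate simply inverts that outcome. The class, enum, and method names
are hypothetical and are not the BitwiseFilter API.

```java
// Hypothetical sketch: not the BitwiseFilter implementation.
// Assumption: a non-zero result of the bitwise operation counts as a match.
class BitwiseMatchSketch {

    enum Op { AND, OR, XOR }

    static boolean matches(int fieldValue, int sourceValue, Op op, boolean negate) {
        int result;
        switch (op) {
            case AND: result = fieldValue & sourceValue; break;
            case OR:  result = fieldValue | sourceValue; break;
            default:  result = fieldValue ^ sourceValue; break; // XOR
        }
        boolean hit = result != 0;
        // negate flips the outcome of the test
        return hit != negate;
    }

    public static void main(String[] args) {
        int flags = 8; // 1000 in binary
        // Case 1: 1000 & 1000 = 1000 (non-zero), negate = false
        System.out.println(matches(flags, 8, Op.AND, false));
        // Case 2: 1000 & 0100 = 0000 (zero), negate = true
        System.out.println(matches(flags, 4, Op.AND, true));
    }
}
```

Under that assumption both calls print true, which agrees with the two expected
results listed above.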

The BitwiseFilter constructor accepts the following values

    * The name of the integer field (a string)
    * The BitwiseOperation value, for example BitwiseOperation.BITWISE_XOR
    * The source value (an integer)
    * A boolean value indicating whether or not to negate the results of the
operation
    * A pre-constructed org.apache.lucene.search.Query

Here is an example of how you would use it with Solr

http://localhost:8983/solr/bitwise/select/?q={!bitwise field=user_permissions op=AND source=3 negate=true}state:FL

http://localhost:8983/solr/bitwise/select/?q={!bitwise field=user_permissions op=AND source=3}state:FL
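
When such a request is sent from code rather than typed into a browser, the
braces and spaces in the local-params prefix need to be URL-encoded. The
following is a minimal sketch of building the request URL. It assumes the
parser is registered under the name "bitwise" and takes the field, op, source,
and negate parameters shown above; the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a Solr select URL with an encoded q parameter.
// Assumption: the QParser plugin is registered under the name "bitwise".
class BitwiseUrlSketch {

    static String buildSelectUrl(String baseUrl, String field, String op,
                                 int source, boolean negate, String query) {
        StringBuilder q = new StringBuilder("{!bitwise field=").append(field)
                .append(" op=").append(op)
                .append(" source=").append(source);
        if (negate) {
            q.append(" negate=true");
        }
        q.append('}').append(query);
        // URLEncoder.encode(String, Charset) requires Java 10 or later.
        return baseUrl + "?q=" + URLEncoder.encode(q.toString(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildSelectUrl(
                "http://localhost:8983/solr/bitwise/select/",
                "user_permissions", "AND", 3, true, "state:FL"));
    }
}
```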

Here is an example of how you would use it with Lucene

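// BitwiseFilter, BitwiseOperation, BitwiseResultFilter and BitwiseTestBase
// come from the attached package; their package name is not shown here.
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
// Assumed to be the Lucene query parser's ParseException.
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

public class BitwiseTestSearch extends BitwiseTestBase {

    public BitwiseTestSearch()
    {

    }

    public void search() throws IOException, ParseException
    {
        // setupSearch(), isearcher, shutdown() and the *_KEY constants are
        // presumably inherited from BitwiseTestBase
        setupSearch();

        // term
        Term t = new Term(COUNTRY_KEY, "us");

        // term query
        Query q = new TermQuery(t);

        // maximum number of documents to display
        int limit = 1000;

        // value to combine with the field value of each document
        int sourceValue = 0;

        // whether to invert the outcome of the bitwise test
        boolean negate = false;

        BitwiseFilter bitwiseFilter = new BitwiseFilter(USER_PERMS_KEY,
                BitwiseOperation.BITWISE_XOR, sourceValue, negate, q);

        // restrict the results of q to the documents accepted by the filter
        Query fq = new FilteredQuery(q, bitwiseFilter);

        ScoreDoc[] hits = isearcher.search(fq, null, limit).scoreDocs;

        BitwiseResultFilter resultFilter = bitwiseFilter.getResultFilter();

        for (int i = 0; i < hits.length; i++) {

            Document hitDoc = isearcher.doc(hits[i].doc);

            System.out.println(FIRST_NAME_KEY + " field has a value of "
                    + hitDoc.get(FIRST_NAME_KEY));
            System.out.println(LAST_NAME_KEY + " field has a value of "
                    + hitDoc.get(LAST_NAME_KEY));
            System.out.println(ACTIVE_KEY + " field has a value of "
                    + hitDoc.get(ACTIVE_KEY));

            System.out.println(USER_PERMS_KEY + " field has a value of "
                    + hitDoc.get(USER_PERMS_KEY));

            System.out.println("doc ID --> " + hits[i].doc);

            System.out.println("...............................................................");
        }

        System.out.println("sourceValue = " + sourceValue + ", operation = "
                + resultFilter.getOperation().getOperationName()
                + ", negate = " + negate);

        System.out.println("A total of " + hits.length
                + " documents were found from the search\n");

        shutdown();
    }

    public static void main(String[] args) throws IOException,
            ParseException
    {
        BitwiseTestSearch search = new BitwiseTestSearch();

        search.search();
    }
}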

Any guidance would be highly appreciated.

Thanks.


-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/

Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

Posted by Israel Ekpo <is...@gmail.com>.
Correction,

I meant to list

https://issues.apache.org/jira/browse/LUCENE-2460
https://issues.apache.org/jira/browse/SOLR-1913



On Thu, May 13, 2010 at 10:13 PM, Israel Ekpo <is...@gmail.com> wrote:

> I have created two ISSUES as new features
>
> https://issues.apache.org/jira/browse/LUCENE-1560
>
> https://issues.apache.org/jira/browse/SOLR-1913
>
> The first one is for the Lucene Filter.
>
> The second one is for the Solr QParserPlugin
>
> The source code and jar files are attached and the Solr plugin is available
> for use immediately.
>
>
>
>
> On Thu, May 13, 2010 at 6:42 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>> On 2010-05-13 23:27, Israel Ekpo wrote:
>> > Hello Lucene and Solr Community
>> >
>> > I have a custom org.apache.lucene.search.Filter that I would like to
>> > contribute to the Lucene and Solr projects.
>> >
>> > So I would need some direction as to how to create and ISSUE or submit a
>> > patch.
>> >
>> > It looks like there have been changes to the way this is done since the
>> > latest merge of the two projects (Lucene and Solr).
>> >
>> > Recently, some Solr users have been looking for a way to perform bitwise
>> > operations between and integer value and some fields in the Index
>> >
>> > So, I wrote a Solr QParser plugin to do this using a custom Lucene
>> Filter.
>> >
>> > This package makes it possible to filter results returned from a query
>> based
>> > on the results of a bitwise operation on an integer field in the
>> documents
>> > returned from the pre-constructed query.
>>
>> Hi,
>>
>> What a coincidence! :) I'm working on something very similar, only the
>> use case that I need to support is slightly different - I want to
>> support a ranked search based on a bitwise overlap of query value and
>> field value. That is, the number of differing bits would reduce the
>> score. This scenario occurs e.g. during near-duplicate detection that
>> uses fuzzy signatures, on document- or sentence levels.
>>
>> I'm going to submit my code early next week, it still needs some
>> polishing. I have two ways to execute this query, neither of which uses
>> filters at the moment:
>>
>> * method 1: during indexing the bits in the fields are turned into
>> on/off terms on the same field, and during search a BooleanQuery is
>> formed from the int value with the same terms. Scoring is courtesy of
>> BooleanScorer. This method supports only a single int value per field.
>>
>> * method 2, incomplete yet - during indexing the bits are turned into
>> terms as before, but this method supports multiple int values per field:
>> terms that correspond to bitmasks on the same value are put at the same
>> positions. Then a specialized Query / Scorer traverses all 32 posting
>> lists in parallel, moving through all matching docs and scoring
>> according to how many terms matched at the same position.
>>
>> I wrapped this in a Solr FieldType, and instead of using a custom
>> QParser plugin I simply implemented FieldType.getFieldQuery().
>>
>> It would be great to work out a convenient user-level API for this
>> feature, both the scoring and the non-scoring case.
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
>
> --
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
> http://www.israelekpo.com/
>



-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/

Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

Posted by Israel Ekpo <is...@gmail.com>.
I have created two issues as new features:

https://issues.apache.org/jira/browse/LUCENE-1560

https://issues.apache.org/jira/browse/SOLR-1913

The first one is for the Lucene Filter.

The second one is for the Solr QParserPlugin

The source code and jar files are attached and the Solr plugin is available
for use immediately.



On Thu, May 13, 2010 at 6:42 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2010-05-13 23:27, Israel Ekpo wrote:
> > Hello Lucene and Solr Community
> >
> > I have a custom org.apache.lucene.search.Filter that I would like to
> > contribute to the Lucene and Solr projects.
> >
> > So I would need some direction as to how to create and ISSUE or submit a
> > patch.
> >
> > It looks like there have been changes to the way this is done since the
> > latest merge of the two projects (Lucene and Solr).
> >
> > Recently, some Solr users have been looking for a way to perform bitwise
> > operations between and integer value and some fields in the Index
> >
> > So, I wrote a Solr QParser plugin to do this using a custom Lucene
> Filter.
> >
> > This package makes it possible to filter results returned from a query
> based
> > on the results of a bitwise operation on an integer field in the
> documents
> > returned from the pre-constructed query.
>
> Hi,
>
> What a coincidence! :) I'm working on something very similar, only the
> use case that I need to support is slightly different - I want to
> support a ranked search based on a bitwise overlap of query value and
> field value. That is, the number of differing bits would reduce the
> score. This scenario occurs e.g. during near-duplicate detection that
> uses fuzzy signatures, on document- or sentence levels.
>
> I'm going to submit my code early next week, it still needs some
> polishing. I have two ways to execute this query, neither of which uses
> filters at the moment:
>
> * method 1: during indexing the bits in the fields are turned into
> on/off terms on the same field, and during search a BooleanQuery is
> formed from the int value with the same terms. Scoring is courtesy of
> BooleanScorer. This method supports only a single int value per field.
>
> * method 2, incomplete yet - during indexing the bits are turned into
> terms as before, but this method supports multiple int values per field:
> terms that correspond to bitmasks on the same value are put at the same
> positions. Then a specialized Query / Scorer traverses all 32 posting
> lists in parallel, moving through all matching docs and scoring
> according to how many terms matched at the same position.
>
> I wrapped this in a Solr FieldType, and instead of using a custom
> QParser plugin I simply implemented FieldType.getFieldQuery().
>
> It would be great to work out a convenient user-level API for this
> feature, both the scoring and the non-scoring case.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>


-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/

Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-05-13 23:27, Israel Ekpo wrote:
> Hello Lucene and Solr Community
> 
> I have a custom org.apache.lucene.search.Filter that I would like to
> contribute to the Lucene and Solr projects.
> 
> So I would need some direction as to how to create and ISSUE or submit a
> patch.
> 
> It looks like there have been changes to the way this is done since the
> latest merge of the two projects (Lucene and Solr).
> 
> Recently, some Solr users have been looking for a way to perform bitwise
> operations between and integer value and some fields in the Index
> 
> So, I wrote a Solr QParser plugin to do this using a custom Lucene Filter.
> 
> This package makes it possible to filter results returned from a query based
> on the results of a bitwise operation on an integer field in the documents
> returned from the pre-constructed query.

Hi,

What a coincidence! :) I'm working on something very similar, only the
use case that I need to support is slightly different - I want to
support a ranked search based on a bitwise overlap of query value and
field value. That is, the number of differing bits would reduce the
score. This scenario occurs e.g. during near-duplicate detection that
uses fuzzy signatures, at the document or sentence level.
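
As a rough illustration of this kind of ranking, and not the code mentioned in
this message, the number of differing bits between two 32-bit values is the
population count of their XOR. The sketch below turns that count into a score
between 0 and 1. The linear normalization and all names are assumptions made
for the example.

```java
// Hypothetical sketch of a bit-overlap score for 32-bit signatures.
// Assumption: the score drops linearly with the number of differing bits.
class BitOverlapScoreSketch {

    // Number of bit positions in which the two values differ (Hamming distance).
    static int differingBits(int queryValue, int fieldValue) {
        return Integer.bitCount(queryValue ^ fieldValue);
    }

    // 1.0 when all 32 bits agree, 0.0 when all 32 bits differ.
    static float score(int queryValue, int fieldValue) {
        return 1.0f - differingBits(queryValue, fieldValue) / 32.0f;
    }

    public static void main(String[] args) {
        int signature = 0b1000;
        int nearDuplicate = 0b1100; // differs from signature in one bit
        System.out.println(differingBits(signature, nearDuplicate));
        System.out.println(score(signature, nearDuplicate));
    }
}
```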

I'm going to submit my code early next week; it still needs some
polishing. I have two ways to execute this query, neither of which uses
filters at the moment:

* method 1: during indexing the bits in the fields are turned into
on/off terms on the same field, and during search a BooleanQuery is
formed from the int value with the same terms. Scoring is courtesy of
BooleanScorer. This method supports only a single int value per field.

* method 2 (not yet complete): during indexing the bits are turned into
terms as before, but this method supports multiple int values per field:
terms that correspond to bitmasks on the same value are put at the same
positions. Then a specialized Query / Scorer traverses all 32 posting
lists in parallel, moving through all matching docs and scoring
according to how many terms matched at the same position.
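
The idea behind method 1 can be sketched without Lucene. Each of the 32 bits
becomes one term that records both the bit index and its on/off state, and the
number of terms shared by the query value and the field value stands in for
the score. The term naming ("b<index>_<state>") and the class are hypothetical,
and plain counting is a simplification: BooleanScorer would also apply
Lucene's usual scoring factors.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of method 1: one on/off term per bit.
class BitTermsSketch {

    // Turns an int into 32 terms such as "b0_1" (bit 0 on) or "b1_0" (bit 1 off).
    static List<String> toTerms(int value) {
        List<String> terms = new ArrayList<>(32);
        for (int bit = 0; bit < 32; bit++) {
            int state = (value >>> bit) & 1;
            terms.add("b" + bit + "_" + state);
        }
        return terms;
    }

    // Counts the terms shared by the query value and the indexed field value,
    // roughly what a BooleanQuery of optional term clauses would reward.
    static int matchingTerms(int queryValue, int fieldValue) {
        Set<String> indexed = new HashSet<>(toTerms(fieldValue));
        int count = 0;
        for (String term : toTerms(queryValue)) {
            if (indexed.contains(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(toTerms(5).subList(0, 4));
        System.out.println(matchingTerms(0b1000, 0b1100));
    }
}
```

With this encoding the number of matching terms equals 32 minus the number of
differing bits, so a closer signature shares more terms with the query.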

I wrapped this in a Solr FieldType, and instead of using a custom
QParser plugin I simply implemented FieldType.getFieldQuery().

It would be great to work out a convenient user-level API for this
feature, both the scoring and the non-scoring case.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

