Posted to solr-user@lucene.apache.org by Bill Au <bi...@gmail.com> on 2009/08/11 23:30:02 UTC

Using Lucene's payload in Solr

It looks like things have changed a bit since this subject was last brought
up here.  I see that there is support in Solr/Lucene for indexing payload
data (DelimitedPayloadTokenFilterFactory and DelimitedPayloadTokenFilter).
Overriding the Similarity class is straightforward.  So the last piece of
the puzzle is to use a BoostingTermQuery when searching.  Solr's
LuceneQParserPlugin uses SolrQueryParser under the covers, so I think all I
need to do is write my own query parser plugin that uses a custom query
parser, with the only difference being in the getFieldQuery() method, where
a BoostingTermQuery is used instead of a TermQuery.

Am I on the right track?  Has anyone done something like this already?
Since Solr already has indexing support for payloads, I was hoping that query
support is already available or in the works.  If not, I am willing to
contribute but will probably need some guidance since my knowledge of Solr's
query parsers is weak.

Bill

Re: Using Lucene's payload in Solr

Posted by Bill Au <bi...@gmail.com>.
Thanks for sharing your code, Ken.  It is pretty much the same code that I
have written except that my custom QueryParser extends Solr's
SolrQueryParser instead of Lucene's QueryParser.  I am also using BFTQ
instead of BTQ.  I have tested it and do see the payload being used in the
explain output.

Functionally I have everything working now.  I still have to decide how I
want to index the payload (using DelimitedPayloadTokenFilter or my own
custom format/code).
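
For the DelimitedPayloadTokenFilter route, a fieldType along these lines
should be all that's needed on the indexing side (names here are just
placeholders; the delimiter defaults to '|' and encoder="float" stores each
payload as a 4-byte float):

    <fieldType name="text_payload" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
      </analyzer>
    </fieldType>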

Bill

On Thu, Aug 13, 2009 at 11:31 AM, Ensdorf Ken <En...@zoominfo.com> wrote:

> > > It looks like things have changed a bit since this subject was last
> > > brought
> > > up here.  I see that there are support in Solr/Lucene for indexing
> > > payload
> > > data (DelimitedPayloadTokenFilterFactory and
> > > DelimitedPayloadTokenFilter).
> > > Overriding the Similarity class is straight forward.  So the last
> > > piece of
> > > the puzzle is to use a BoostingTermQuery when searching.  I think
> > > all I need
> > > to do is to subclass Solr's LuceneQParserPlugin uses SolrQueryParser
> > > under
> > > the cover.  I think all I need to do is to write my own query parser
> > > plugin
> > > that uses a custom query parser, with the only difference being in
> > the
> > > getFieldQuery() method where a BoostingTermQuery is used instead of a
> > > TermQuery.
> >
> > The BTQ is now deprecated in favor of the BoostingFunctionTermQuery,
> > which gives some more flexibility in terms of how the spans in a
> > single document are scored.
> >
> > >
> > > Am I on the right track?
> >
> > Yes.
> >
> > > Has anyone done something like this already?
> >
>
> I wrote a QParserPlugin that seems to do the trick.  This is minimally
> tested - we're not actually using it at the moment, but should get you
> going.  Also, as Grant suggested, you may want to sub BFTQ for BTQ below:
>
> package com.zoominfo.solr.analysis;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.queryParser.*;
> import org.apache.lucene.search.*;
> import org.apache.lucene.search.payloads.BoostingTermQuery;
> import org.apache.solr.common.params.*;
> import org.apache.solr.common.util.NamedList;
> import org.apache.solr.request.SolrQueryRequest;
> import org.apache.solr.search.*;
>
> public class BoostingTermQParserPlugin extends QParserPlugin {
>  public static String NAME = "zoom";
>
>  public void init(NamedList args) {
>  }
>
>  public QParser createParser(String qstr, SolrParams localParams,
> SolrParams params, SolrQueryRequest req) {
>        System.out.print("BoostingTermQParserPlugin::createParser\n");
>    return new BoostingTermQParser(qstr, localParams, params, req);
>  }
> }
>
> class BoostingTermQueryParser extends QueryParser {
>
>        public BoostingTermQueryParser(String f, Analyzer a) {
>                super(f, a);
>
>  System.out.print("BoostingTermQueryParser::BoostingTermQueryParser\n");
>        }
>
>    @Override
>        protected Query newTermQuery(Term term){
>                System.out.print("BoostingTermQueryParser::newTermQuery\n");
>            return new BoostingTermQuery(term);
>        }
> }
>
> class BoostingTermQParser extends QParser {
>  String sortStr;
>  QueryParser lparser;
>
>  public BoostingTermQParser(String qstr, SolrParams localParams, SolrParams
> params, SolrQueryRequest req) {
>    super(qstr, localParams, params, req);
>        System.out.print("BoostingTermQParser::BoostingTermQParser\n");
>  }
>
>
>  public Query parse() throws ParseException {
>        System.out.print("BoostingTermQParser::parse\n");
>    String qstr = getString();
>
>    String defaultField = getParam(CommonParams.DF);
>    if (defaultField==null) {
>      defaultField =
> getReq().getSchema().getSolrQueryParser(null).getField();
>    }
>
>    lparser = new BoostingTermQueryParser(defaultField,
> getReq().getSchema().getQueryAnalyzer());
>
>    // these could either be checked & set here, or in the SolrQueryParser
> constructor
>    String opParam = getParam(QueryParsing.OP);
>    if (opParam != null) {
>      lparser.setDefaultOperator("AND".equals(opParam) ?
> QueryParser.Operator.AND : QueryParser.Operator.OR);
>    } else {
>      // try to get default operator from schema
>
>  lparser.setDefaultOperator(getReq().getSchema().getSolrQueryParser(null).getDefaultOperator());
>    }
>
>    return lparser.parse(qstr);
>  }
>
>
>  public String[] getDefaultHighlightFields() {
>    return new String[]{lparser.getField()};
>  }
>
> }
>

RE: Using Lucene's payload in Solr

Posted by Ensdorf Ken <En...@zoominfo.com>.
> > It looks like things have changed a bit since this subject was last
> > brought
> > up here.  I see that there are support in Solr/Lucene for indexing
> > payload
> > data (DelimitedPayloadTokenFilterFactory and
> > DelimitedPayloadTokenFilter).
> > Overriding the Similarity class is straight forward.  So the last
> > piece of
> > the puzzle is to use a BoostingTermQuery when searching.  I think
> > all I need
> > to do is to subclass Solr's LuceneQParserPlugin uses SolrQueryParser
> > under
> > the cover.  I think all I need to do is to write my own query parser
> > plugin
> > that uses a custom query parser, with the only difference being in
> the
> > getFieldQuery() method where a BoostingTermQuery is used instead of a
> > TermQuery.
>
> The BTQ is now deprecated in favor of the BoostingFunctionTermQuery,
> which gives some more flexibility in terms of how the spans in a
> single document are scored.
>
> >
> > Am I on the right track?
>
> Yes.
>
> > Has anyone done something like this already?
>

I wrote a QParserPlugin that seems to do the trick.  This is minimally tested - we're not actually using it at the moment, but should get you going.  Also, as Grant suggested, you may want to sub BFTQ for BTQ below:

package com.zoominfo.solr.analysis;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.search.payloads.BoostingTermQuery;
import org.apache.solr.common.params.*;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.*;

public class BoostingTermQParserPlugin extends QParserPlugin {
  public static String NAME = "zoom";

  public void init(NamedList args) {
  }

  public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    System.out.print("BoostingTermQParserPlugin::createParser\n");
    return new BoostingTermQParser(qstr, localParams, params, req);
  }
}

class BoostingTermQueryParser extends QueryParser {

  public BoostingTermQueryParser(String f, Analyzer a) {
    super(f, a);
    System.out.print("BoostingTermQueryParser::BoostingTermQueryParser\n");
  }

  @Override
  protected Query newTermQuery(Term term) {
    System.out.print("BoostingTermQueryParser::newTermQuery\n");
    return new BoostingTermQuery(term);
  }
}

class BoostingTermQParser extends QParser {
  String sortStr;
  QueryParser lparser;

  public BoostingTermQParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
    System.out.print("BoostingTermQParser::BoostingTermQParser\n");
  }

  public Query parse() throws ParseException {
    System.out.print("BoostingTermQParser::parse\n");
    String qstr = getString();

    String defaultField = getParam(CommonParams.DF);
    if (defaultField == null) {
      defaultField = getReq().getSchema().getSolrQueryParser(null).getField();
    }

    lparser = new BoostingTermQueryParser(defaultField, getReq().getSchema().getQueryAnalyzer());

    // these could either be checked & set here, or in the SolrQueryParser constructor
    String opParam = getParam(QueryParsing.OP);
    if (opParam != null) {
      lparser.setDefaultOperator("AND".equals(opParam) ? QueryParser.Operator.AND : QueryParser.Operator.OR);
    } else {
      // try to get default operator from schema
      lparser.setDefaultOperator(getReq().getSchema().getSolrQueryParser(null).getDefaultOperator());
    }

    return lparser.parse(qstr);
  }

  public String[] getDefaultHighlightFields() {
    return new String[]{lparser.getField()};
  }
}
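
To wire it up, registering the plugin in solrconfig.xml and selecting it
with defType should be all that's needed (the parser name matches the NAME
field above):

  <queryParser name="zoom" class="com.zoominfo.solr.analysis.BoostingTermQParserPlugin"/>

Then query with defType=zoom, or per-query via local params, e.g.
q={!zoom}title:solr.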

Re: Using Lucene's payload in Solr

Posted by Bill Au <bi...@gmail.com>.
I need to boost a field differently according to the content of the field.
Here is an example:

<doc>
  <field name="name">Solr</field>
  <field name="category" payload="3.0">information retrieval</field>
  <field name="category" payload="2.0">webapp</field>
  <field name="category" payload="2.0">java</field>
  <field name="category" payload="1.0">xml</field>
</doc>
<doc>
  <field name="name">Tomcat</field>
  <field name="category" payload="3.0">webapp</field>
  <field name="category" payload="2.0">java</field>
</doc>
<doc>
  <field name="name">XMLSpy</field>
  <field name="category" payload="3.0">xml</field>
  <field name="category" payload="2.0">ide</field>
</doc>

A search on category:webapp should return Tomcat before Solr.  A search on
category:xml should return XMLSpy before Solr.
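
With the DelimitedPayloadTokenFilter approach, that per-value boost would be
expressed as a payload on each token of the value rather than as an XML
attribute, e.g. (assuming the category field uses a payload-aware analyzer
as sketched earlier):

<field name="category">information|3.0 retrieval|3.0</field>
<field name="category">webapp|2.0</field>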

Bill

On Thu, Aug 13, 2009 at 12:13 PM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Aug 13, 2009, at 11:58 AM, Bill Au wrote:
>
>  Thanks for the tip on BFTQ.  I have been using a nightly build before that
>> was committed.  I have upgrade to the latest nightly build and will use
>> that
>> instead of BTQ.
>>
>> I got DelimitedPayloadTokenFilter to work and see that the terms and
>> payload
>> of the field are correct but the delimiter and payload are stored so they
>> appear in the response also.  Here is an example:
>>
>> XML for indexing:
>> <field name="title">Solr|2.0 In|2.0 Action|2.0</field>
>>
>>
>> XML response:
>> <doc>
>> <str name"title">Solr|2.0 In|2.0 Action|2.0</str>
>> </doc>
>>
>
>
> Correct.
>
>
>>
>>>  I want to set payload on a field that has a variable number of words.
>>  So I
>> guess I can use a copy field with a PatternTokenizerFactory to filter out
>> the delimiter and payload.
>>
>> I am thinking maybe I can do this instead when indexing:
>>
>> XML for indexing:
>> <field name="title" payload="2.0">Solr In Action</field>
>>
>
> Hmmm, interesting, what's your motivation vs. boosting the field?
>
>
>
>
>> This will simplify indexing as I don't have to repeat the payload for each
>> word in the field.  I do have to write a payload aware update handler.  It
>> looks like I can use Lucene's NumericPayloadTokenFilter in my custom
>> update
>> handler to
>>
>> Any thoughts/comments/suggestions?
>>
>>
>
>  Bill
>>
>>
>> On Wed, Aug 12, 2009 at 7:13 AM, Grant Ingersoll <gsingers@apache.org
>> >wrote:
>>
>>
>>> On Aug 11, 2009, at 5:30 PM, Bill Au wrote:
>>>
>>> It looks like things have changed a bit since this subject was last
>>>
>>>> brought
>>>> up here.  I see that there are support in Solr/Lucene for indexing
>>>> payload
>>>> data (DelimitedPayloadTokenFilterFactory and
>>>> DelimitedPayloadTokenFilter).
>>>> Overriding the Similarity class is straight forward.  So the last piece
>>>> of
>>>> the puzzle is to use a BoostingTermQuery when searching.  I think all I
>>>> need
>>>> to do is to subclass Solr's LuceneQParserPlugin uses SolrQueryParser
>>>> under
>>>> the cover.  I think all I need to do is to write my own query parser
>>>> plugin
>>>> that uses a custom query parser, with the only difference being in the
>>>> getFieldQuery() method where a BoostingTermQuery is used instead of a
>>>> TermQuery.
>>>>
>>>>
>>> The BTQ is now deprecated in favor of the BoostingFunctionTermQuery,
>>> which
>>> gives some more flexibility in terms of how the spans in a single
>>> document
>>> are scored.
>>>
>>>
>>>  Am I on the right track?
>>>>
>>>>
>>> Yes.
>>>
>>> Has anyone done something like this already?
>>>
>>>>
>>>>
>>> I intend to, but haven't started.
>>>
>>> Since Solr already has indexing support for payload, I was hoping that
>>>
>>>> query
>>>> support is already in the works if not available already.  If not, I am
>>>> willing to contribute but will probably need some guidance since my
>>>> knowledge in Solr query parser is weak.
>>>>
>>>>
>>>
>>> https://issues.apache.org/jira/browse/SOLR-1337
>>>
>>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Using Lucene's payload in Solr

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 13, 2009, at 11:58 AM, Bill Au wrote:

> Thanks for the tip on BFTQ.  I have been using a nightly build  
> before that
> was committed.  I have upgrade to the latest nightly build and will  
> use that
> instead of BTQ.
>
> I got DelimitedPayloadTokenFilter to work and see that the terms and  
> payload
> of the field are correct but the delimiter and payload are stored so  
> they
> appear in the response also.  Here is an example:
>
> XML for indexing:
> <field name="title">Solr|2.0 In|2.0 Action|2.0</field>
>
>
> XML response:
> <doc>
> <str name"title">Solr|2.0 In|2.0 Action|2.0</str>
> </doc>


Correct.

>
>>
> I want to set payload on a field that has a variable number of  
> words.  So I
> guess I can use a copy field with a PatternTokenizerFactory to  
> filter out
> the delimiter and payload.
>
> I am thinking maybe I can do this instead when indexing:
>
> XML for indexing:
> <field name="title" payload="2.0">Solr In Action</field>

Hmmm, interesting, what's your motivation vs. boosting the field?


>
> This will simplify indexing as I don't have to repeat the payload  
> for each
> word in the field.  I do have to write a payload aware update  
> handler.  It
> looks like I can use Lucene's NumericPayloadTokenFilter in my custom  
> update
> handler to
>
> Any thoughts/comments/suggestions?
>


> Bill
>
>
> On Wed, Aug 12, 2009 at 7:13 AM, Grant Ingersoll  
> <gs...@apache.org>wrote:
>
>>
>> On Aug 11, 2009, at 5:30 PM, Bill Au wrote:
>>
>> It looks like things have changed a bit since this subject was last
>>> brought
>>> up here.  I see that there are support in Solr/Lucene for indexing  
>>> payload
>>> data (DelimitedPayloadTokenFilterFactory and  
>>> DelimitedPayloadTokenFilter).
>>> Overriding the Similarity class is straight forward.  So the last  
>>> piece of
>>> the puzzle is to use a BoostingTermQuery when searching.  I think  
>>> all I
>>> need
>>> to do is to subclass Solr's LuceneQParserPlugin uses  
>>> SolrQueryParser under
>>> the cover.  I think all I need to do is to write my own query parser
>>> plugin
>>> that uses a custom query parser, with the only difference being in  
>>> the
>>> getFieldQuery() method where a BoostingTermQuery is used instead  
>>> of a
>>> TermQuery.
>>>
>>
>> The BTQ is now deprecated in favor of the  
>> BoostingFunctionTermQuery, which
>> gives some more flexibility in terms of how the spans in a single  
>> document
>> are scored.
>>
>>
>>> Am I on the right track?
>>>
>>
>> Yes.
>>
>> Has anyone done something like this already?
>>>
>>
>> I intend to, but haven't started.
>>
>> Since Solr already has indexing support for payload, I was hoping  
>> that
>>> query
>>> support is already in the works if not available already.  If not,  
>>> I am
>>> willing to contribute but will probably need some guidance since my
>>> knowledge in Solr query parser is weak.
>>>
>>
>>
>> https://issues.apache.org/jira/browse/SOLR-1337
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Using Lucene's payload in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: Is it possible to have the copyField strip off the payload while it is
: copying since doing it in the analysis phrase is too late?  Or should I
: start looking into using UpdateProcessors as Chris had suggested?

"nope" and "yep"

I've had an idea in the back of my mind for a while now about adding more 
options to the fieldTypes to specify how the *stored* values should be 
modified when indexing ... but there's nothing there to do that yet.  You 
have to make the modifications in an UpdateProcessor (or in a response 
writer).

: >> It seems like it might be simpler have two new (generic) UpdateProcessors:
: >> one that can clone fieldA into fieldB, and one that can do regex mutations
: >> on fieldB ... neither needs to know about payloads at all, but the first
: >> can made a copy of "2.0|Solr In Action" and the second can strip off the
: >> "2.0|" from the copy.
: >>
: >> then you can write a new NumericPayloadRegexTokenizer that takes in two
: >> regex expressions -- one that knows how to extract the payload from a
: >> piece of input, and one that specifies the tokenization.
: >>
: >> those three classes seem easier to implemnt, easier to maintain, and more
: >> generally reusable then a custom xml request handler for your updates.


-Hoss


Re: Using Lucene's payload in Solr

Posted by Bill Au <bi...@gmail.com>.
While testing my code I discovered that my copyField with PatternTokenizer
does not do what I want.  This is what I am indexing into Solr:

<field name="title">2.0|Solr In Action</field>

My copyField is simply:

   <copyField source="title" dest="titleRaw"/>

field titleRaw is of type title_raw:

    <fieldType name="title_raw" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[^#]*#(.*)"
group="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

For my example input, "Solr in Action" is indexed into the titleRaw field
without the payload.  But the payload is still stored.  So when I retrieve
the field titleRaw I still get back "2.0|Solr in Action" when what I really
want is just "Solr in Action".

Is it possible to have the copyField strip off the payload while it is
copying, since doing it in the analysis phase is too late?  Or should I
start looking into using UpdateProcessors as Chris had suggested?

Bill

On Fri, Aug 21, 2009 at 12:04 PM, Bill Au <bi...@gmail.com> wrote:

> I ended up not using an XML attribute for the payload since I need to
> return the payload in query response.  So I ended up going with:
>
> <field name="title">2.0|Solr In Action</field>
>
> My payload is numeric so I can pick a non-numeric delimiter (ie '|').
> Putting the payload in front means I don't have to worry about the delimiter
> appearing in the value.  The payload is required in my case so I can simply
> look for the first occurrence of the delimiter and ignore the possibility of
> the delimiter appearing in the value.
>
> I ended up writing a custom Tokenizer and a copy field with a
> PatternTokenizerFactory to filter out the delimiter and payload.  That's is
> straight forward in terms of implementation.  On top of that I can still use
> the CSV loader, which I really like because of its speed.
>
> Bill.
>
> On Thu, Aug 20, 2009 at 10:36 PM, Chris Hostetter <
> hossman_lucene@fucit.org> wrote:
>
>>
>> : of the field are correct but the delimiter and payload are stored so
>> they
>> : appear in the response also.  Here is an example:
>>         ...
>> : I am thinking maybe I can do this instead when indexing:
>> :
>> : XML for indexing:
>> : <field name="title" payload="2.0">Solr In Action</field>
>> :
>> : This will simplify indexing as I don't have to repeat the payload for
>> each
>>
>> but now you're into a custom request handler for the updates to deal with
>> the custom XML attribute so you can't use DIH, or CSV loading.
>>
>> It seems like it might be simpler have two new (generic) UpdateProcessors:
>> one that can clone fieldA into fieldB, and one that can do regex mutations
>> on fieldB ... neither needs to know about payloads at all, but the first
>> can made a copy of "2.0|Solr In Action" and the second can strip off the
>> "2.0|" from the copy.
>>
>> then you can write a new NumericPayloadRegexTokenizer that takes in two
>> regex expressions -- one that knows how to extract the payload from a
>> piece of input, and one that specifies the tokenization.
>>
>> those three classes seem easier to implemnt, easier to maintain, and more
>> generally reusable then a custom xml request handler for your updates.
>>
>>
>> -Hoss
>>
>>
>

Re: Using Lucene's payload in Solr

Posted by Bill Au <bi...@gmail.com>.
I ended up not using an XML attribute for the payload since I need to return
the payload in the query response.  So I ended up going with:

<field name="title">2.0|Solr In Action</field>

My payload is numeric so I can pick a non-numeric delimiter (ie '|').
Putting the payload in front means I don't have to worry about the delimiter
appearing in the value.  The payload is required in my case so I can simply
look for the first occurrence of the delimiter and ignore the possibility of
the delimiter appearing in the value.

I ended up writing a custom Tokenizer and a copy field with a
PatternTokenizerFactory to filter out the delimiter and payload.  That's
straightforward in terms of implementation.  On top of that I can still use
the CSV loader, which I really like because of its speed.
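
A rough sketch of what such a Tokenizer can look like on the Lucene 2.9
attribute API (this is not the actual code from the thread; the class name
is made up, reset() and offset handling are omitted, and the value is
assumed to always start with "<payload>|"):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public final class LeadingPayloadTokenizer extends Tokenizer {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
  private String[] words;    // the words after the leading "2.0|" prefix
  private Payload payload;   // the decoded leading payload, shared by every token
  private int pos;

  public LeadingPayloadTokenizer(Reader input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (words == null) {
      // read the whole field value, e.g. "2.0|Solr In Action"
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[256];
      for (int n = input.read(buf); n != -1; n = input.read(buf)) {
        sb.append(buf, 0, n);
      }
      String value = sb.toString();
      int bar = value.indexOf('|');
      float boost = Float.parseFloat(value.substring(0, bar));
      payload = new Payload(PayloadHelper.encodeFloat(boost));
      words = value.substring(bar + 1).split("\\s+");
      pos = 0;
    }
    if (pos >= words.length) {
      return false;
    }
    clearAttributes();
    termAtt.setTermBuffer(words[pos++]);  // emit the next word as a term
    payAtt.setPayload(payload);           // attach the same payload to every term
    return true;
  }
}

It would still need a small TokenizerFactory subclass to be usable from
schema.xml.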

Bill.

On Thu, Aug 20, 2009 at 10:36 PM, Chris Hostetter
<ho...@fucit.org>wrote:

>
> : of the field are correct but the delimiter and payload are stored so they
> : appear in the response also.  Here is an example:
>         ...
> : I am thinking maybe I can do this instead when indexing:
> :
> : XML for indexing:
> : <field name="title" payload="2.0">Solr In Action</field>
> :
> : This will simplify indexing as I don't have to repeat the payload for
> each
>
> but now you're into a custom request handler for the updates to deal with
> the custom XML attribute so you can't use DIH, or CSV loading.
>
> It seems like it might be simpler have two new (generic) UpdateProcessors:
> one that can clone fieldA into fieldB, and one that can do regex mutations
> on fieldB ... neither needs to know about payloads at all, but the first
> can made a copy of "2.0|Solr In Action" and the second can strip off the
> "2.0|" from the copy.
>
> then you can write a new NumericPayloadRegexTokenizer that takes in two
> regex expressions -- one that knows how to extract the payload from a
> piece of input, and one that specifies the tokenization.
>
> those three classes seem easier to implemnt, easier to maintain, and more
> generally reusable then a custom xml request handler for your updates.
>
>
> -Hoss
>
>

Re: Using Lucene's payload in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: of the field are correct but the delimiter and payload are stored so they
: appear in the response also.  Here is an example:
	...
: I am thinking maybe I can do this instead when indexing:
: 
: XML for indexing:
: <field name="title" payload="2.0">Solr In Action</field>
: 
: This will simplify indexing as I don't have to repeat the payload for each

but now you're into a custom request handler for the updates to deal with 
the custom XML attribute, so you can't use DIH or CSV loading.

It seems like it might be simpler to have two new (generic) UpdateProcessors: 
one that can clone fieldA into fieldB, and one that can do regex mutations 
on fieldB ... neither needs to know about payloads at all, but the first 
can make a copy of "2.0|Solr In Action" and the second can strip off the 
"2.0|" from the copy.

Then you can write a new NumericPayloadRegexTokenizer that takes in two 
regex expressions -- one that knows how to extract the payload from a 
piece of input, and one that specifies the tokenization.

Those three classes seem easier to implement, easier to maintain, and more 
generally reusable than a custom XML request handler for your updates.
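
A rough sketch of that idea, folding the clone and the regex strip into a
single factory for brevity (all class and parameter names here are made up;
the Solr 1.4 UpdateRequestProcessor API is assumed):

package com.example.solr.update;   // hypothetical package

import java.io.IOException;
import java.util.Collection;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/** Clones a source field into a dest field, applying a regex replacement on the way. */
public class CloneRegexUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  private String source;
  private String dest;
  private String pattern;   // e.g. ^[^|]*\| to strip a leading "2.0|"

  @Override
  public void init(NamedList args) {
    source  = (String) args.get("source");
    dest    = (String) args.get("dest");
    pattern = (String) args.get("pattern");
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Collection<Object> values = doc.getFieldValues(source);
        if (values != null) {
          for (Object v : values) {
            // copy each value into dest, dropping the payload prefix
            doc.addField(dest, v.toString().replaceAll(pattern, ""));
          }
        }
        super.processAdd(cmd);
      }
    };
  }
}

It would then go into an updateRequestProcessorChain in solrconfig.xml along
the lines of:

  <updateRequestProcessorChain name="strip-payload">
    <processor class="com.example.solr.update.CloneRegexUpdateProcessorFactory">
      <str name="source">title</str>
      <str name="dest">titleRaw</str>
      <str name="pattern">^[^|]*\|</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>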


-Hoss


Re: Using Lucene's payload in Solr

Posted by Bill Au <bi...@gmail.com>.
Thanks for the tip on BFTQ.  I have been using a nightly build before that
was committed.  I have upgraded to the latest nightly build and will use that
instead of BTQ.

I got DelimitedPayloadTokenFilter to work and see that the terms and payload
of the field are correct, but the delimiter and payload are stored, so they
appear in the response also.  Here is an example:

XML for indexing:
<field name="title">Solr|2.0 In|2.0 Action|2.0</field>


XML response:
<doc>
<str name"title">Solr|2.0 In|2.0 Action|2.0</str>
</doc>

>
I want to set payload on a field that has a variable number of words.  So I
guess I can use a copy field with a PatternTokenizerFactory to filter out
the delimiter and payload.

I am thinking maybe I can do this instead when indexing:

XML for indexing:
<field name="title" payload="2.0">Solr In Action</field>

This will simplify indexing as I don't have to repeat the payload for each
word in the field.  I do have to write a payload aware update handler.  It
looks like I can use Lucene's NumericPayloadTokenFilter in my custom update
handler to

Any thoughts/comments/suggestions?

Bill


On Wed, Aug 12, 2009 at 7:13 AM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Aug 11, 2009, at 5:30 PM, Bill Au wrote:
>
>  It looks like things have changed a bit since this subject was last
>> brought
>> up here.  I see that there are support in Solr/Lucene for indexing payload
>> data (DelimitedPayloadTokenFilterFactory and DelimitedPayloadTokenFilter).
>> Overriding the Similarity class is straight forward.  So the last piece of
>> the puzzle is to use a BoostingTermQuery when searching.  I think all I
>> need
>> to do is to subclass Solr's LuceneQParserPlugin uses SolrQueryParser under
>> the cover.  I think all I need to do is to write my own query parser
>> plugin
>> that uses a custom query parser, with the only difference being in the
>> getFieldQuery() method where a BoostingTermQuery is used instead of a
>> TermQuery.
>>
>
> The BTQ is now deprecated in favor of the BoostingFunctionTermQuery, which
> gives some more flexibility in terms of how the spans in a single document
> are scored.
>
>
>> Am I on the right track?
>>
>
> Yes.
>
>  Has anyone done something like this already?
>>
>
> I intend to, but haven't started.
>
>  Since Solr already has indexing support for payload, I was hoping that
>> query
>> support is already in the works if not available already.  If not, I am
>> willing to contribute but will probably need some guidance since my
>> knowledge in Solr query parser is weak.
>>
>
>
> https://issues.apache.org/jira/browse/SOLR-1337
>

Re: Using Lucene's payload in Solr

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 11, 2009, at 5:30 PM, Bill Au wrote:

> It looks like things have changed a bit since this subject was last  
> brought
> up here.  I see that there are support in Solr/Lucene for indexing  
> payload
> data (DelimitedPayloadTokenFilterFactory and  
> DelimitedPayloadTokenFilter).
> Overriding the Similarity class is straight forward.  So the last  
> piece of
> the puzzle is to use a BoostingTermQuery when searching.  I think  
> all I need
> to do is to subclass Solr's LuceneQParserPlugin uses SolrQueryParser  
> under
> the cover.  I think all I need to do is to write my own query parser  
> plugin
> that uses a custom query parser, with the only difference being in the
> getFieldQuery() method where a BoostingTermQuery is used instead of a
> TermQuery.

The BTQ is now deprecated in favor of the BoostingFunctionTermQuery,  
which gives some more flexibility in terms of how the spans in a  
single document are scored.

>
> Am I on the right track?

Yes.

> Has anyone done something like this already?

I intend to, but haven't started.

> Since Solr already has indexing support for payload, I was hoping  
> that query
> support is already in the works if not available already.  If not, I  
> am
> willing to contribute but will probably need some guidance since my
> knowledge in Solr query parser is weak.


https://issues.apache.org/jira/browse/SOLR-1337