Posted to general@lucene.apache.org by Gagandeep singh <ga...@gmail.com> on 2013/03/31 07:18:30 UTC

New implementation of MLT

Hi folks

We started using the default implementation of MLT
(org.apache.solr.handler.MoreLikeThisHandler) recently and found that there
are a couple of things it lacks:

   1. Searching for terms in the same field as the original document:
      - the current implementation picks the top field to search an
      interesting term in based on docFreq; however, this can give bad
      results if, say, the original product is from brand:"RED Valentino"
      and we end up searching for "red" in the color field.
   2. Phrase boosts:
      - if the product name is "business cards", then it makes sense to
      give a phrase boost to products which are also business cards.
   3. Support for bq, bf, fq, multiplicative boost:
      - you might want to filter out_of_stock products, or give a
      multiplicative boost to a product based on its price similarity or
      launch date.
   4. Support for explainOther
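
To make points 1 to 3 concrete, here is a rough Python sketch of the kind of
request parameters such a query builder might produce. It is only an
illustration under stated assumptions, not the MLTQueryParser itself: the
function build_mlt_params, the field names (brand, name, in_stock) and the
boost value are all made up, and it assumes the request is parsed by Solr's
edismax parser, which accepts fielded clauses in q plus fq filters.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_mlt_params(doc, fields, phrase_boost=5.0, filters=()):
    """Hypothetical helper: build request parameters for a 'more like this' search.

    Every term is searched only in the field it came from (point 1), and
    multi-word values also get a boosted phrase clause (point 2).
    """
    clauses = []
    for field in fields:
        value = doc.get(field)
        if not value:
            continue
        terms = value.split()
        for term in terms:
            clauses.append(f"{field}:{quote(term)}")
        if len(terms) > 1:
            clauses.append(f"{field}:{quote(value)}^{phrase_boost}")
    return {
        "defType": "edismax",       # assumption: edismax handles the request
        "q": " OR ".join(clauses),
        "fq": list(filters),        # point 3: filter queries
    }


# Made-up source document and filter, for illustration only.
doc = {"brand": "RED Valentino", "name": "business cards"}
params = build_mlt_params(doc, ["brand", "name"], filters=["in_stock:true"])
print(params["q"])
```

Because each clause carries its own field name, "RED" from the brand can
never end up being searched in a color field.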

We had a use case for each of these and I ended up writing my own
MLTQueryParser which builds the MLT query for a given document. It also has
a new concept called childDocs. You can think of some documents as
products, and a collection of products can be thought of as a category page.
You could search for similar documents based on the products a category
page has.

I was wondering if you guys would be interested in an alternate
implementation of MLT that supports all the knobs that solr search does. I
could post a patch file maybe?

Thanks
Gagan

RE: Ranged Query

Posted by Demian Katz <de...@villanova.edu>.
The documentation is pretty vague, but I think you may be on the right track with MultiTermAwareComponent - is this an interface that your custom analyzer needs to implement?

- Demian

From: Osullivan L. [mailto:L.Osullivan@swansea.ac.uk]
Sent: Wednesday, April 03, 2013 8:08 AM
To: general@lucene.apache.org
Cc: vufind-tech@lists.sourceforge.net; dev@lucene.apache.org
Subject: [VuFind-Tech] Ranged Query

Greetings,

I have a custom analyzer which converts Library of Congress Callnumbers into normalized strings:

   <fieldType name="LCNormalized" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="org.vufind.solr.analysis.LCCNormalizeFilterFactory"/>
      </analyzer>
    </fieldType>
   <field name="callnumber-normalized" type="LCNormalized" indexed="true" stored="true" />

Thus, values like:

PQ239.Z56
PQ239.H63 2008
PQ239.S62 1982
PQ239.B68 1983
PQ2390.S35 A5
PQ2390.S35 B8 1898
PQ2389 .R65 F3 1854 t.1
PQ239.A7 1969
PQ2.N6 1959
PQ22.A4 D47 1949
PQ238.L57 1985

become:

PQ 0239.000000 Z0.560000
PQ 0239.000000 H0.630000 002008
PQ 0239.000000 S0.620000 001982
PQ 0239.000000 B0.680000 001983
PQ 2390.000000 S0.350000 A0.500000
PQ 2390.000000 S0.350000 B0.800000 001898
PQ 2389.000000 R0.650000 F0.300000 001854 T.000001
PQ 0002.000000 N0.600000 001959
PQ 0022.000000 A0.400000 D0.470000 001949
PQ 0238.000000 L0.570000 001985

This allows items to be accurately sorted by callnumber.

I would also like to perform ranged searches on the normalised callnumber but whereas callnumber-normalized=[DS+TO+FE] will correctly list items with callnumbers between DS and FE, starting with DT* and finishing with FD* , callnumber-normalized=[DS763+TO+FE] incorrectly starts at DT* and finishes with FD*.

Can anyone explain why this might be the case?

Looking at http://wiki.apache.org/solr/MultitermQueryAnalysis#Current_components_that_implement_MultiTermAwareComponent, would I have to add one of the MultiTermAware Factories to make this work?

Thanks,

Luke

--

Luke O'Sullivan

Systems Developer

Web Team

Swansea University, Singleton Park, Swansea SA2 8PP, UK

l.osullivan@swansea.ac.uk

01792 602772

@l_os_cymru

Re: Ranged Query

Posted by Erick Erickson <er...@gmail.com>.
Assuming your custom filter emits one and only one token, making it
multiTermAware is fine. What that means is that when you add wildcards
to your query terms, your filter will automatically be put into the
filter analysis chain at query time as well as index time. You care if
you expect to search exact terms, i.e. you want to search on something
like PQ239.H* and hit PQ 0239.000000 H0.630000 002008.

Hmmmm, that could be your problem here, DS763 won't match anything
before DT since all your DS entries are "DS " and DS763 is after
everything that starts with "DS ". So I'm guessing that if you started
with "DS 0763" you'd get what you wanted? That'd be evidence that you
do need MultiTermAwareness.... But attach &debug=query to see exactly
what the results of query parsing are, because I'm reaching a bit here
and making the assumption that the standard analysis chain gets called
in the range case.
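
The ordering argument above can be checked with a small Python sketch. The
indexed terms below are made-up values in the normalized format from the
original post, and normalize_bound is a hypothetical helper (not the VuFind
filter) that only pads the class number. Plain string comparison stands in
for the index's term order, which should agree for ASCII terms like these.

```python
import re

# Made-up indexed terms in the normalized format from the original post.
indexed_terms = [
    "DS 0100.000000 A0.500000",
    "DS 0763.000000 B0.200000",
    "DS 0900.000000 C0.300000",
    "DT 0001.000000 A0.100000",
    "FD 0010.000000 Z0.900000",
]


def range_query(terms, lower, upper):
    """Inclusive range over the terms, using plain string order."""
    return [t for t in terms if lower <= t <= upper]


def normalize_bound(bound):
    """Hypothetical helper: 'DS763' -> 'DS 0763.000000'.

    Only handles class letters followed by an integer; anything else
    (for example 'FE') is returned unchanged.
    """
    match = re.fullmatch(r"([A-Z]+)\s*(\d+)", bound)
    if match is None:
        return bound
    letters, number = match.groups()
    return f"{letters} {int(number):04d}.000000"


# Un-normalized lower bound: "DS7..." sorts after every "DS ..." term,
# because the space character sorts before any digit.
raw_hits = range_query(indexed_terms, "DS763", "FE")

# Normalized lower bound: the DS entries from 0763 upwards are included.
normalized_hits = range_query(indexed_terms, normalize_bound("DS763"), "FE")

print(raw_hits)
print(normalized_hits)
```

With the raw bound, the first hit is a DT entry, which matches the behaviour
described in the question.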

As an aside, if it's still early in your project's life-cycle,
consider changing the hyphens in your field names to underscores.
Hyphens will work, but it's _really_ easy to miss a space somewhere
and have them treated as the NOT operator and then have to debug
things.....

Best
Erick

On Wed, Apr 3, 2013 at 8:08 AM, Osullivan L. <L....@swansea.ac.uk> wrote:
> Greetings,
>
> I have a custom analyzer which converts Library of Congress Callnumbers into
> normalized strings:
>
>    <fieldType name="LCNormalized" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
>       <analyzer>
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="org.vufind.solr.analysis.LCCNormalizeFilterFactory"/>
>       </analyzer>
>     </fieldType>
>    <field name="callnumber-normalized" type="LCNormalized" indexed="true"
> stored="true" />
>
> Thus, values like:
>
> PQ239.Z56
> PQ239.H63 2008
> PQ239.S62 1982
> PQ239.B68 1983
> PQ2390.S35 A5
> PQ2390.S35 B8 1898
> PQ2389 .R65 F3 1854 t.1
> PQ239.A7 1969
> PQ2.N6 1959
> PQ22.A4 D47 1949
> PQ238.L57 1985
>
> become:
>
> PQ 0239.000000 Z0.560000
> PQ 0239.000000 H0.630000 002008
> PQ 0239.000000 S0.620000 001982
> PQ 0239.000000 B0.680000 001983
> PQ 2390.000000 S0.350000 A0.500000
> PQ 2390.000000 S0.350000 B0.800000 001898
> PQ 2389.000000 R0.650000 F0.300000 001854 T.000001
> PQ 0002.000000 N0.600000 001959
> PQ 0022.000000 A0.400000 D0.470000 001949
> PQ 0238.000000 L0.570000 001985
>
> This allows items to be accurately sorted by callnumber.
>
> I would also like to perform ranged searches on the normalised callnumber
> but whereas callnumber-normalized=[DS+TO+FE] will correctly list items with
> callnumbers between DS and FE, starting with DT* and finishing with FD* ,
> callnumber-normalized=[DS763+TO+FE] incorrectly starts at DT* and finishes
> with FD*.
>
> Can anyone explain why this might be the case?
>
> Looking at
> http://wiki.apache.org/solr/MultitermQueryAnalysis#Current_components_that_implement_MultiTermAwareComponent,
> would I have to add one of the MultiTermAware Factories to make this work?
>
> Thanks,
>
> Luke
>
> --
> Luke O'Sullivan
> Systems Developer
> Web Team
> Swansea University, Singleton Park, Swansea SA2 8PP, UK
> l.osullivan@swansea.ac.uk
> 01792 602772
> @l_os_cymru



Re: Ranged Query

Posted by Uwe Schindler <uw...@thetaphi.de>.



"Osullivan L." <L....@swansea.ac.uk> schrieb:

>Greetings,
>
>I have a custom analyzer which converts Library of Congress Callnumbers
>into normalized strings:
>
><fieldType name="LCNormalized" class="solr.TextField"
>sortMissingLast="true" omitNorms="true">
>      <analyzer>
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>   <filter class="org.vufind.solr.analysis.LCCNormalizeFilterFactory"/>
>      </analyzer>
>    </fieldType>
><field name="callnumber-normalized" type="LCNormalized" indexed="true"
>stored="true" />
>
>Thus, values like:
>
>PQ239.Z56
>PQ239.H63 2008
>PQ239.S62 1982
>PQ239.B68 1983
>PQ2390.S35 A5
>PQ2390.S35 B8 1898
>PQ2389 .R65 F3 1854 t.1
>PQ239.A7 1969
>PQ2.N6 1959
>PQ22.A4 D47 1949
>PQ238.L57 1985
>
>become:
>
>PQ 0239.000000 Z0.560000
>PQ 0239.000000 H0.630000 002008
>PQ 0239.000000 S0.620000 001982
>PQ 0239.000000 B0.680000 001983
>PQ 2390.000000 S0.350000 A0.500000
>PQ 2390.000000 S0.350000 B0.800000 001898
>PQ 2389.000000 R0.650000 F0.300000 001854 T.000001
>PQ 0002.000000 N0.600000 001959
>PQ 0022.000000 A0.400000 D0.470000 001949
>PQ 0238.000000 L0.570000 001985
>
>This allows items to be accurately sorted by callnumber.
>
>I would also like to perform ranged searches on the normalised
>callnumber but whereas callnumber-normalized=[DS+TO+FE] will correctly
>list items with callnumbers between DS and FE, starting with DT* and
>finishing with FD* , callnumber-normalized=[DS763+TO+FE] incorrectly
>starts at DT* and finishes with FD*.
>
>Can anyone explain why this might be the case?
>
>Looking at
>http://wiki.apache.org/solr/MultitermQueryAnalysis#Current_components_that_implement_MultiTermAwareComponent,
>would I have to add one of the MultiTermAware Factories to make this
>work?
>
>Thanks,
>
>Luke
>
>--
>Luke O'Sullivan
>Systems Developer
>Web Team
>Swansea University, Singleton Park, Swansea SA2 8PP, UK
>l.osullivan@swansea.ac.uk
>01792 602772
>@l_os_cymru

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

Ranged Query

Posted by "Osullivan L." <L....@swansea.ac.uk>.
Greetings,

I have a custom analyzer which converts Library of Congress Callnumbers into normalized strings:

   <fieldType name="LCNormalized" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="org.vufind.solr.analysis.LCCNormalizeFilterFactory"/>
      </analyzer>
    </fieldType>
   <field name="callnumber-normalized" type="LCNormalized" indexed="true" stored="true" />

Thus, values like:

PQ239.Z56
PQ239.H63 2008
PQ239.S62 1982
PQ239.B68 1983
PQ2390.S35 A5
PQ2390.S35 B8 1898
PQ2389 .R65 F3 1854 t.1
PQ239.A7 1969
PQ2.N6 1959
PQ22.A4 D47 1949
PQ238.L57 1985

become:

PQ 0239.000000 Z0.560000
PQ 0239.000000 H0.630000 002008
PQ 0239.000000 S0.620000 001982
PQ 0239.000000 B0.680000 001983
PQ 2390.000000 S0.350000 A0.500000
PQ 2390.000000 S0.350000 B0.800000 001898
PQ 2389.000000 R0.650000 F0.300000 001854 T.000001
PQ 0002.000000 N0.600000 001959
PQ 0022.000000 A0.400000 D0.470000 001949
PQ 0238.000000 L0.570000 001985

This allows items to be accurately sorted by callnumber.
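
As a quick illustration of the sorting difference, here is a small Python
sketch that uses only the sample values listed above (PQ239.A7 1969 is left
out because no normalized form is listed for it). It compares a plain string
sort of the raw callnumbers with a sort on the normalized strings; nothing
here reimplements the filter itself.

```python
# Raw callnumber -> normalized string, copied from the lists above.
NORMALIZED = {
    "PQ239.Z56": "PQ 0239.000000 Z0.560000",
    "PQ239.H63 2008": "PQ 0239.000000 H0.630000 002008",
    "PQ239.S62 1982": "PQ 0239.000000 S0.620000 001982",
    "PQ239.B68 1983": "PQ 0239.000000 B0.680000 001983",
    "PQ2390.S35 A5": "PQ 2390.000000 S0.350000 A0.500000",
    "PQ2390.S35 B8 1898": "PQ 2390.000000 S0.350000 B0.800000 001898",
    "PQ2389 .R65 F3 1854 t.1": "PQ 2389.000000 R0.650000 F0.300000 001854 T.000001",
    "PQ2.N6 1959": "PQ 0002.000000 N0.600000 001959",
    "PQ22.A4 D47 1949": "PQ 0022.000000 A0.400000 D0.470000 001949",
    "PQ238.L57 1985": "PQ 0238.000000 L0.570000 001985",
}

# Plain string sort of the raw values: PQ2389 lands before the PQ239.* items.
by_raw = sorted(NORMALIZED)

# Sorting on the normalized form puts PQ2389 after all PQ239.* items.
by_normalized = sorted(NORMALIZED, key=NORMALIZED.get)

for callnumber in by_normalized:
    print(callnumber)
```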

I would also like to perform ranged searches on the normalised callnumber. However, whereas callnumber-normalized=[DS+TO+FE] correctly lists items with callnumbers between DS and FE (starting with DT* and finishing with FD*), callnumber-normalized=[DS763+TO+FE] incorrectly also starts at DT* and finishes with FD*.

Can anyone explain why this might be the case?

Looking at http://wiki.apache.org/solr/MultitermQueryAnalysis#Current_components_that_implement_MultiTermAwareComponent, would I have to add one of the MultiTermAware Factories to make this work?

Thanks,

Luke

--
Luke O'Sullivan
Systems Developer
Web Team
Swansea University, Singleton Park, Swansea SA2 8PP, UK
l.osullivan@swansea.ac.uk
01792 602772
@l_os_cymru
