Posted to solr-user@lucene.apache.org by Sandeep Mestry <sa...@gmail.com> on 2013/04/03 15:55:55 UTC

Question on Exact Matches - edismax

Hi All,

I have a requirement whereby exact matches on two fields (Series Title,
Title) should rank higher than partial matches. The configuration is
shown below:

<requestHandler name="assetdismax" class="solr.SearchHandler" >
        <lst name="defaults">
            <str name="defType">edismax</str>
            <str name="echoParams">explicit</str>
            <float name="tie">0.01</float>
            <str name="qf">pg_series_title_ci^500 title_ci^300
pg_series_title^200 title^25 classifications^15 classifications_texts^15
parent_classifications^10 synonym_classifications^5 pg_brand_title^5
pg_series_working_title^5 p_programme_title^5 p_item_title^5
p_interstitial_title^5 description^15 pg_series_description annotations^0.1
classification_notes^0.05 pv_program_version_number^2
pv_program_version_number_ci^2 pv_program_number^2 pv_program_number_ci^2
p_program_number^2 ma_version_number^2 ma_recording_location
ma_contributions^0.001 rel_pg_series_title rel_programme_title
rel_programme_number rel_programme_number_ci pg_uuid^0.5 p_uuid^0.5
pv_uuid^0.5 ma_uuid^0.5</str>
            <str name="pf">pg_series_title_ci^500 title_ci^500</str>
            <int name="ps">0</int>
            <str name="q.alt">*:*</str>
            <str name="mm">100%</str>
            <str name="q.op">AND</str>
            <str name="facet">true</str>
            <str name="facet.limit">-1</str>
            <str name="facet.mincount">1</str>
        </lst>
    </requestHandler>

As you can see above, the search runs against many fields. What I want is
for documents with exact matches on the series title and title fields to
rank higher than the rest.

I have added two case-insensitive fields (pg_series_title_ci, title_ci) for
series title and title, and boosted them above the tokenized fields and the
rest of the fields. I have also implemented a similarity class to override
idf. However, documents with partial matches in title and other fields
still rank higher than an exact match in pg_series_title_ci.

Many Thanks,
Sandeep

Re: Question on Exact Matches - edismax

Posted by Sandeep Mestry <sa...@gmail.com>.
Another problem I see in Solr analysis is that a query term which matches
the tokenized field does not match the case-insensitive field.
For example, when I search for 'coast to coast', the tokenized series
title (pg_series_title) matches, but the ci field (pg_series_title_ci)
does not.

The definitions of both fields are below:

<fieldType name="text_wc" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
                splitOnNumerics="0" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
                splitOnNumerics="0" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>


<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true"
           omitNorms="true" compressThreshold="10">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<field name="pg_series_title" type="text_wc" indexed="true" stored="true"
multiValued="false" />
<field name="pg_series_title_ci" type="string_ci" indexed="true"
stored="true" multiValued="false" />

<copyField source="pg_series_title" dest="pg_series_title_ci" />

Can this copyField directive be an issue? Should it be the other way round,
or does it not matter?

Thanks,
Sandeep
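
On the copyField question: as far as I know, copyField passes the raw
incoming value of the source field to the destination before any analysis
runs, and each field then applies its own analyzer. Under that assumption
the direction shown (tokenized source, string_ci destination) should be
fine, provided documents are indexed with pg_series_title populated. A
sketch of the same pairing for title follows. The title and title_ci
definitions are not shown in this thread, so the attributes here are
assumed:

```xml
<!-- Assumed definitions: the thread only shows the pg_series_title pair. -->
<field name="title" type="text_wc" indexed="true" stored="true" multiValued="false" />
<!-- Whole-value, lower-cased copy used only for exact matching. -->
<field name="title_ci" type="string_ci" indexed="true" stored="false" multiValued="false" />

<!-- Source is the field that clients populate; dest receives the same raw
     text and analyzes it with string_ci (KeywordTokenizer + lowercase). -->
<copyField source="title" dest="title_ci" />
```

Reversing the direction would only make sense if clients sent the _ci field
and left the tokenized field empty.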






Re: Question on Exact Matches - edismax

Posted by Sandeep Mestry <sa...@gmail.com>.
Hi Jan,

Thanks for your reply. I have defined string_ci as follows:

<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true"
           omitNorms="true" compressThreshold="10">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

When I analysed the query in Solr, I saw that a document containing
pg_series_title_ci:"funny" matches a search for
pg_series_title_ci:"funny games" and ranks higher than the document
containing the exact match. I could use the default string type instead,
but then matching would be case-sensitive.

Thanks,
Sandeep
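
One possible reading of the "funny" versus "funny games" ranking, offered
as an assumption rather than a diagnosis: edismax applies qf to each
whitespace-separated query term on its own, so against a KeywordTokenizer
field the terms funny and games are looked up as separate tokens. Only a
document whose entire title is "funny" can then match pg_series_title_ci
through qf, and it does so at boost 500. The whole-query match comes from
pf. A sketch that keeps the _ci fields out of qf and gives pf the larger
boost (handler name, boost values and the shortened qf list are
illustrative, not a tested configuration):

```xml
<!-- Sketch only: handler name, boosts and the shortened qf list are illustrative. -->
<requestHandler name="assetdismax_exact" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Per-term matching: tokenized fields only. -->
    <str name="qf">pg_series_title^200 title^25 description^15</str>
    <!-- Whole-query phrase boost: the KeywordTokenizer fields see the
         full query string as a single lower-cased token. -->
    <str name="pf">pg_series_title_ci^1000 title_ci^1000</str>
    <int name="ps">0</int>
    <float name="tie">0.01</float>
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```

Running the same query with debugQuery=true shows which clauses contribute
to each document's score, which should confirm or refute this reading.
Single-word queries may still need the _ci fields in qf to get an exact
match boost.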



Re: Question on Exact Matches - edismax

Posted by Jan Høydahl <ja...@cominvent.com>.
Can you show us your *_ci field type? Solr does not really have a way to tell whether a match is "exact" or only partial, but you could hack around it with the fieldType. See https://github.com/cominvent/exactmatch for a possible solution.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
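
A sketch of the kind of fieldType hack that could serve here. It is not
taken from the linked repository; the type name, field name and marker
tokens are all hypothetical. The idea is to wrap the whole value in start
and end markers before tokenizing, so that a phrase match with ps=0 only
succeeds when the query covers the entire field value:

```xml
<!-- Hypothetical names throughout; the marker tokens must never occur in real titles. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Wrap the whole value in start/end markers before tokenizing. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(.*)$" replacement="xxstart $1 xxend" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<field name="pg_series_title_exact" type="text_exact" indexed="true" stored="false" multiValued="false" />
<copyField source="pg_series_title" dest="pg_series_title_exact" />

<!-- In the handler defaults, use the field in pf (with ps=0) rather than qf,
     because in qf each query term would be wrapped in its own pair of markers. -->
<!-- <str name="pf">pg_series_title_exact^1000</str> -->
```

With this in place, "funny games" as a phrase becomes
"xxstart funny games xxend", which a title such as "Funny Games 2" cannot
match because xxend does not directly follow games.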
