
Posted to solr-user@lucene.apache.org by gorjida <al...@sciencescape.net> on 2014/07/31 18:56:31 UTC

Solr gives the same fieldnorm for two different-size fields

I use Solr to search over a collection of institution names. My Solr index
contains multiple fields such as name, country, city, and so on. A sample
document looks like this:

{
        "solr_id": 130950,
        "rg_id": 140239,
        "rg_parent_id": 1438,
        "name": "University of California Berkeley Research",
        "ext_name": "",
        "city": "Berkeley",
        "country": "US",
        "state": "CA",
        "type": "academic/gen",
        "ext_city": "",
        "zip": "94720-5100",
        "_version_": 1474909528315134000
      },

I need to search over this index. My query looks like this:

name: (university of california berkeley)

After running this query, the top two matches are as follows:

{
        "solr_id": 130950,
        "rg_id": 140239,
        "rg_parent_id": 1438,
        "name": "University of California Berkeley Research",
        "ext_name": "",
        "city": "Berkeley",
        "country": "US",
        "state": "CA",
        "type": "academic/gen",
        "ext_city": "",
        "zip": "94720-5100",
        "_version_": 1474909528315134000,
        "score": 1.8849033
      },
      {
        "solr_id": 350,
        "rg_id": 1438,
        "rg_parent_id": 1439,
        "name": "University of California Berkeley",
        "ext_name": "",
        "city": "Berkeley",
        "country": "US",
        "state": "CA",
        "type": "academic",
        "ext_city": "",
        "zip": "94720",
        "_version_": 1474909520371122200,
        "score": 1.8849033
      },

As you can see, both "University of California Berkeley Research" and
"University of California Berkeley" get the same score (1.8849033). FYI, my
schema looks like this:

    <fieldType name="text_general" class="solr.TextField" omitNorms="false"
               autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

I also checked the debug output and noticed that both documents return the
same fieldNorm (0.5). The bizarre thing is that Solr works fine for these
queries:
--- name: (university of toronto)
--- name: (university of california los angeles)

It seems that Solr fails once the number of tokens in the document field is
equal to 4. For the queries above, the first one (university of toronto)
has three tokens and the second one has five tokens. I am totally stuck on
why Solr cannot provide different fieldNorms for (University of California
Berkeley) and (University of California Berkeley Research). I also do not
understand why this happens only when the field has 4 tokens. I would
appreciate any feedback.

PS. I have also tested solr.StopFilterFactory with ignoreCase="true", and
the problem is still not resolved.

Regards,

Ali



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-gives-the-same-fieldnorm-for-two-different-size-fields-tp4150418.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr gives the same fieldnorm for two different-size fields

Posted by Erick Erickson <er...@gmail.com>.

And it won't be <G>. Basically, the norms are an approximation (they used
to be just a byte long), so fields of "close" lengths will have the same
value here.
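
As a rough illustration of why close lengths collapse, here is a minimal
Python sketch of a one-byte norm encoding. It assumes the Lucene 4.x
DefaultSimilarity behaviour as I understand it: the norm is
1/sqrt(number of indexed terms), squeezed into a byte in the style of
SmallFloat.floatToByte315. The Python function names are made up for this
illustration, and the port is a sketch, not the actual implementation.

```python
import math
import struct

MANTISSA_BITS = 3   # assumed layout: 3 "mantissa" bits, 5 exponent bits
ZERO_EXP = 15       # assumed zero-exponent point of the byte format


def float_to_byte315(f: float) -> int:
    """Squeeze a float into one byte, truncating most of the precision."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # raw IEEE-754 bits
    small = bits >> (24 - MANTISSA_BITS)
    fzero = (63 - ZERO_EXP) << MANTISSA_BITS
    if small <= fzero:
        return 0 if bits <= 0 else 1   # underflow: zero or smallest value
    if small >= fzero + 0x100:
        return 255                     # overflow: largest value
    return small - fzero


def byte315_to_float(b: int) -> float:
    """Expand the stored byte back into the float used at scoring time."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - MANTISSA_BITS)) + ((63 - ZERO_EXP) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def field_norm(num_terms: int) -> float:
    """Length norm after a round trip through the one-byte encoding."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, field_norm(n))
```

Under this encoding, 2 terms decode to 0.625, 3 and 4 terms both decode to
0.5, and 5 terms decode to 0.4375. If the stop filter drops "of" (an
assumption about the default stop word list), the two Berkeley names are
indexed with 3 and 4 terms, which would explain the identical fieldNorm and
also why the Toronto and Los Angeles queries looked fine.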

Why is this an issue? If you back up a second, is a word appearing in a
4-word field really "enough" more important than one appearing in a 5-word
field to require a distinction?

Lately you can specify field norms that are longer than a byte, but the
overall problem still remains.

Frankly, though, I think this is something that's a distraction and that
users won't notice.

FWIW,
Erick



Re: Solr gives the same fieldnorm for two different-size fields

Posted by Umesh Prasad <um...@gmail.com>.

What you really need is a covering-type match. I feel your use case fits
this pattern:

Score (exact match in order) > Score (exact match without order) >
Score (non-exact match)

Example query: a b c

Example docs:
  d1: a b c
  d2: a c b
  d3: c a b
  d4: a b c d
  d5: a b c d e

Use case 1: Only an exact match is a match. So only d1 matches.
Use case 2: Only in-order matches count. So d2 and d3 aren't matches. Scores
are d1 > d4 > d5.
Use case 3: Only in-order matches count, and only one extra term is allowed.
So d2, d3 and d5 aren't matches. Scores are d1 > d4.
Use case 4: All are matches, and d1 > d2 > d3 > d4 > d5.

All of these use cases can be satisfied with SpanQueries, which track the
positions at which terms match. For a covering match, you will need to add
start and end sentinel terms during indexing.
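
To make the tiers concrete, here is a small sketch in plain Python. It is
not the Lucene SpanQuery API; it only mimics the ordering described above
on whitespace-split tokens. The names match_tier, rank and DOCS are
hypothetical and exist only for this illustration.

```python
# Example docs from the discussion above.
DOCS = {
    "d1": "a b c",
    "d2": "a c b",
    "d3": "c a b",
    "d4": "a b c d",
    "d5": "a b c d e",
}


def match_tier(query: str, field: str) -> int:
    """Return 3 for an exact in-order match, 2 for the same terms in any
    order, 1 when all query terms are present with extras, 0 otherwise."""
    q = query.lower().split()
    f = field.lower().split()
    if f == q:
        return 3
    if sorted(f) == sorted(q):
        return 2
    if set(q) <= set(f):
        return 1
    return 0


def rank(query: str, docs: dict) -> list:
    """Order matching docs by tier, then by fewer extra terms."""
    scored = [
        (match_tier(query, text), -len(text.split()), name)
        for name, text in docs.items()
    ]
    # The sort is stable, so docs with equal keys keep their input order.
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return [name for tier, _, name in scored if tier > 0]


if __name__ == "__main__":
    print(rank("a b c", DOCS))
```

This sketch leaves d2 and d3 tied (they come out in input order); telling
them apart would need something like a positional distance, which is what
the span queries provide.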

There is an excellent post by Mark Miller about span queries:
http://searchhub.org/2009/07/18/the-spanquery/
Solr's Surround query parser allows you to create SpanQueries:
http://wiki.apache.org/solr/SurroundQueryParser
Or you can plug your own query parser into Solr to do the same.

You can find some more links here:
http://search-lucene.com/?q=span+queries&fc_project=Lucene&fc_project=Solr






-- 
---
Thanks & Regards
Umesh Prasad

Re: Solr gives the same fieldnorm for two different-size fields

Posted by Erick Erickson <er...@gmail.com>.

You can consider, say, a copyField directive that copies the field into a
string type (or perhaps KeywordTokenizer followed by LowerCaseFilter), and
then match or boost on an exact match rather than trying to make scoring
fill this role.

In any case, I'm thinking of normalizing the sensitive fields and indexing
them as a single token (i.e. the string type or KeywordTokenizer) to
disambiguate these cases.
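
A minimal sketch of the idea in plain Python, outside Solr: treat the whole
field value as one lowercased token and move exact matches to the top. The
helper names exact_key and boost_exact are hypothetical, and the
lowercasing only approximates what a KeywordTokenizer followed by a
LowerCaseFilter would index.

```python
def exact_key(name: str) -> str:
    """Whole value as a single lowercased token (no word splitting)."""
    return name.lower()


def boost_exact(query: str, hits: list) -> list:
    """Move hits whose single-token key equals the query's key to the top.

    sorted() is stable, so the original relevance order is kept within the
    exact and non-exact groups.
    """
    key = exact_key(query)
    return sorted(hits, key=lambda name: exact_key(name) != key)


if __name__ == "__main__":
    hits = [
        "University of California Berkeley Research",
        "University of California Berkeley",
    ]
    for name in boost_exact("university of california berkeley", hits):
        print(name)
```

In Solr itself the same effect would come from querying the copied field
with a boost, so that the tie between the two documents is broken by the
exact match and not by length normalization.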

Because otherwise I fear you'll get one situation to work, then fail on the
next case. In your example, you're trying to use length normalization to
influence scoring so that the doc with the shorter field sorts above the
doc with the longer field. But what are you going to do when your target is
"university of california berkeley research"? Rely on matching all the
terms? And so on...

Best,
Erick


Re: Solr gives the same fieldnorm for two different-size fields

Posted by gorjida <al...@sciencescape.net>.

Thanks so much for your reply. In my case, it really matters because I am
trying to find the correct institution match for an affiliation string. For
example, if an author belongs to the "university of Toronto", his/her
affiliation should be normalized against the Solr index. In this case,
"University of California Berkeley Research" is a different place from
"university of california berkeley". I see that the top matches are tied in
score for this specific example. I can break the tie using other
techniques. However, I am keen to know whether this is a common problem in
Solr.

Regards,

Ali  


