Posted to solr-user@lucene.apache.org by "S.L" <si...@gmail.com> on 2014/12/08 15:49:17 UTC

Length norm not functioning in solr queries.

I have two documents doc1 and doc2 and each one of those has a field called
phoneName.

doc1:phoneName:"Details about  Apple iPhone 4s - 16GB - White (Verizon)
Smartphone Factory Unlocked"

doc2:phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"

If I search with

q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true

doc1 and doc2 both get an identical score of 9.961212. Since the phoneName
field in doc2 is shorter, I would expect doc2 to have the higher score.

The phoneName field is defined as follows. Nowhere do I specify
omitNorms="true", yet the length norm does not seem to be applied at all.
Can someone let me know what the issue is here?

        <field name="phoneName" type="text_en_splitting" indexed="true"
            stored="true" required="true" />
        <fieldType name="text_en_splitting" class="solr.TextField"
            positionIncrementGap="100" autoGeneratePhraseQueries="true">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <!-- in this example, we will only use synonyms at query time
                <filter class="solr.SynonymFilterFactory"
                    synonyms="index_synonyms.txt" ignoreCase="true"
                    expand="false"/> -->
                <!-- Case insensitive stop word removal. add
                    enablePositionIncrements=true in both the index and query
                    analyzers to leave a 'gap' for more accurate phrase
                    queries. -->
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="lang/stopwords_en.txt"
                    enablePositionIncrements="true" />
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.KeywordMarkerFilterFactory"
                    protected="protwords.txt" />
                <filter class="solr.PorterStemFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="solr.SynonymFilterFactory"
                    synonyms="synonyms.txt" ignoreCase="true" expand="true" />
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="lang/stopwords_en.txt"
                    enablePositionIncrements="true" />
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="0" catenateNumbers="0" catenateAll="0"
                    splitOnCaseChange="1" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.KeywordMarkerFilterFactory"
                    protected="protwords.txt" />
                <filter class="solr.PorterStemFilterFactory" />
            </analyzer>
        </fieldType>
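
One explanation raised later in this thread is the precision lost when Lucene stores the length norm in a single byte. The following is a minimal Python sketch of that encoding, assuming the Lucene 4.x DefaultSimilarity behaviour (norm = 1/sqrt(number of terms) with an index-time boost of 1.0, stored via SmallFloat.floatToByte315). The Python function names are illustrative, and the term counts are examples rather than the analyzed lengths of doc1 and doc2, which depend on the filter chain above.

```python
import math
import struct


def float_to_byte315(value):
    """Port of Lucene's SmallFloat.floatToByte315 (assumed 4.x behaviour)."""
    # Reinterpret the float32 bit pattern as a signed 32-bit integer.
    bits = struct.unpack(">i", struct.pack(">f", value))[0]
    small = bits >> (24 - 3)
    zero_point = (63 - 15) << 3
    if small <= zero_point:
        # Underflow: zero/negative maps to 0, tiny positives map to 1.
        return 0 if bits <= 0 else 1
    if small >= zero_point + 0x100:
        # Overflow: clamp to the largest byte value.
        return 255
    return small - zero_point


def byte315_to_float(encoded):
    """Port of Lucene's SmallFloat.byte315ToFloat."""
    if encoded == 0:
        return 0.0
    bits = ((encoded & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_length_norm(num_terms):
    """Length norm as it would be read back after the one-byte round trip."""
    exact = 1.0 / math.sqrt(num_terms)
    return byte315_to_float(float_to_byte315(exact))


# Compare the exact norm with the value that survives the byte encoding.
for n in range(7, 18):
    print(n, round(1.0 / math.sqrt(n), 4), stored_length_norm(n))
```

Under these assumptions, field lengths of 11 to 16 terms all read back the same stored norm, as do lengths of 8 to 10, so two short product names can easily end up with identical length norms even though omitNorms is not set.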

Re: Length norm not functioning in solr queries.

Posted by "S.L" <si...@gmail.com>.
Yes, I understand that reindexing is necessary. However, for some reason I
was not able to invoke the JS script from the update processor, so I ended
up using a Java-only solution at index time.

Thanks.


Re: Length norm not functioning in solr queries.

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

No special steps to be taken for cloud setup. Please note that for both solutions, re-index is mandatory.

Ahmet



On Thursday, December 11, 2014 12:15 PM, S.L <si...@gmail.com> wrote:
Ahmet,

Thank you , as the configurations in SolrCloud are uploaded to zookeeper ,
are there any special steps that need to be taken to make this work in
SolrCloud ?


On Wed, Dec 10, 2014 at 4:32 AM, Ahmet Arslan <io...@yahoo.com.invalid>
wrote:
>
> Hi,
>
> Or even better, you can use your new field for tie break purposes. Where
> scores are identical.
> e.g. sort=score desc, wordCount asc
>
> Ahmet
>
>
> On Wednesday, December 10, 2014 11:29 AM, Ahmet Arslan <io...@yahoo.com>
> wrote:
> Hi,
>
> You mean update processor factory?
>
> Here is augmented (wordCount field added) version of your example :
>
> doc1:
>
> phoneName:"Details about  Apple iPhone 4s - 16GB - White (Verizon)
> Smartphone Factory Unlocked"
> wordCount: 11
>
> doc2:
>
> phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"
> wordCount: 9
>
>
> First task is simply calculate wordCount values. You can do it in your
> indexing code, or other places.
> I quickly skimmed existing update processors but I couldn't find stock
> implementation.
> CountFieldValuesUpdateProcessorFactory fooled me, but it looks like it is
> all about multivalued fields.
>
> I guess, A simple javascript that splits on whitespace and returns the
> produced array size would do the trick :
> StatelessScriptUpdateProcessorFactory
>
>
>
> At this point you have a int field named word count.
> boost=div(1,wordCount) should work. Or you can came up with more
> sophisticated math formula.
>
> Ahmet
>
>
> On Wednesday, December 10, 2014 11:12 AM, S.L <si...@gmail.com>
> wrote:
> Hi Ahmet,
>
> Is there already an implementation of the suggested work around ? Thanks.
>
>
> On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan <io...@yahoo.com.invalid>
> wrote:
>
> > Hi,
> >
> > Default length norm is not best option for differentiating very short
> > documents, like product names.
> > Please see :
> > http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec
> >
> > I suggest you to create an additional integer field, that holds number of
> > tokens. You can populate it via update processor. And then penalise
> (using
> > fuction queries) according to that field. This way you have more fine
> > grained and flexible control over it.
> >
> > Ahmet
> >
> >
> >
> > On Tuesday, December 9, 2014 12:22 PM, S.L <si...@gmail.com>
> > wrote:
> > Hi ,
> >
> > Mikhail Thanks , I looked at the explain and this is what I see for the
> two
> > different documents in questions, they have identical scores   even
> though
> > the document 2 has a shorter productName field, I do not see any
> lenghtNorm
> > related information in the explain.
> >
> > Also I am not exactly clear on what needs to be looked in the API ?
> >
> > *Search Query* : q=iphone+4s+16gb&qf= productName&mm=1&pf=
> > productName&ps=1&pf2= productName&pf3=
> > productName&stopwords=true&lowercaseOperators=true
> >
> > *productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
> > Unlocked *
> >
> >
> >    - *100%* 10.649221 sum of the following:
> >       - *10.58%* 1.1270299 sum of the following:
> >          - *2.1%* 0.22383358 productName:iphon
> >          - *3.47%* 0.36922288 productName:"4 s"
> >          - *5.01%* 0.53397346 productName:"16 gb"
> >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> >       - *27.79%* 2.959255 sum of the following:
> >          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
> >          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
> >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> >
> >
> > *productName Apple iPhone 4S 16GB for Net10, No Contract, White*
> >
> >
> >    - *100%* 10.649221 sum of the following:
> >       - *10.58%* 1.1270299 sum of the following:
> >          - *2.1%* 0.22383358 productName:iphon
> >          - *3.47%* 0.36922288 productName:"4 s"
> >          - *5.01%* 0.53397346 productName:"16 gb"
> >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> >       - *27.79%* 2.959255 sum of the following:
> >          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
> >          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
> >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> >
> >
> >
> >
> >
> > On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev <
> > mkhludnev@griddynamics.com> wrote:
> >
> > > It's worth looking into <explain> to check the particular scoring values.
> > > But the most likely suspect is the precision lost when float norms are
> > > stored as byte values. See the javadoc for
> > > DefaultSimilarity.encodeNormValue(float).
> > >
> > >
> > > On Mon, Dec 8, 2014 at 5:49 PM, S.L <si...@gmail.com> wrote:
> > >
> > > > I have two documents doc1 and doc2 and each one of those has a field
> > > called
> > > > phoneName.
> > > >
> > > > doc1:phoneName:"Details about  Apple iPhone 4s - 16GB - White
> (Verizon)
> > > > Smartphone Factory Unlocked"
> > > >
> > > > doc2:phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"
> > > >
> > > > Here if I search for
> > > >
> > > >
> > >
> >
> q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true
> > > >
> > > > Doc1 and Doc2 both have the same identical score , but since the
> field
> > > > phoneName in the doc2 has shorter length I would expect it to have a
> > > higher
> > > > score , but both have an identical score of 9.961212.
> > > >
> > > > The phoneName filed is defined as follows.As we can see no where am I
> > > > specifying omitNorms=True, still the behavior seems to be that the
> > length
> > > > norm is not functioning at all. Can some one let me know whats the
> > issue
> > > > here ?
> > > >
> > > >         <field name="phoneName" type="text_en_splitting"
> indexed="true"
> > > >             stored="true" required="true" />
> > > >         <fieldType name="text_en_splitting" class="solr.TextField"
> > > >             positionIncrementGap="100"
> > autoGeneratePhraseQueries="true">
> > > >             <analyzer type="index">
> > > >                 <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > > >                 <!-- in this example, we will only use synonyms at
> > query
> > > > time <filter
> > > >                     class="solr.SynonymFilterFactory"
> > > > synonyms="index_synonyms.txt" ignoreCase="true"
> > > >                     expand="false"/> -->
> > > >                 <!-- Case insensitive stop word removal. add
> > > > enablePositionIncrements=true
> > > >                     in both the index and query analyzers to leave a
> > > 'gap'
> > > > for more accurate
> > > >                     phrase queries. -->
> > > >                 <filter class="solr.StopFilterFactory"
> > ignoreCase="true"
> > > >                     words="lang/stopwords_en.txt"
> > > > enablePositionIncrements="true" />
> > > >                 <filter class="solr.WordDelimiterFilterFactory"
> > > >                     generateWordParts="1" generateNumberParts="1"
> > > > catenateWords="1"
> > > >                     catenateNumbers="1" catenateAll="0"
> > > > splitOnCaseChange="1" />
> > > >                 <filter class="solr.LowerCaseFilterFactory" />
> > > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > > protected="protwords.txt" />
> > > >                 <filter class="solr.PorterStemFilterFactory" />
> > > >             </analyzer>
> > > >             <analyzer type="query">
> > > >                 <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > > >                 <filter class="solr.SynonymFilterFactory"
> > > > synonyms="synonyms.txt"
> > > >                     ignoreCase="true" expand="true" />
> > > >                 <filter class="solr.StopFilterFactory"
> > ignoreCase="true"
> > > >                     words="lang/stopwords_en.txt"
> > > > enablePositionIncrements="true" />
> > > >                 <filter class="solr.WordDelimiterFilterFactory"
> > > >                     generateWordParts="1" generateNumberParts="1"
> > > > catenateWords="0"
> > > >                     catenateNumbers="0" catenateAll="0"
> > > > splitOnCaseChange="1" />
> > > >                 <filter class="solr.LowerCaseFilterFactory" />
> > > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > > protected="protwords.txt" />
> > > >                 <filter class="solr.PorterStemFilterFactory" />
> > > >             </analyzer>
> > > >         </fieldType>
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > > <mk...@griddynamics.com>
> > >
> >
>

Re: Length norm not functioning in solr queries.

Posted by "S.L" <si...@gmail.com>.
Ahmet,

Thank you , as the configurations in SolrCloud are uploaded to zookeeper ,
are there any special steps that need to be taken to make this work in
SolrCloud ?


Re: Length norm not functioning in solr queries.

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

Or, even better, you can use your new field as a tie-breaker where scores are identical,
e.g. sort=score desc, wordCount asc
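
As a rough illustration of what that sort does, here is a minimal Python sketch that orders results by score descending and then by wordCount ascending. The result list is made up for the example (doc3 and its values are hypothetical); only the ordering rule comes from the sort parameter above.

```python
# Hypothetical result list; doc1 and doc2 tie on score, doc3 scores lower.
results = [
    {"id": "doc1", "score": 9.961212, "wordCount": 11},
    {"id": "doc2", "score": 9.961212, "wordCount": 9},
    {"id": "doc3", "score": 7.5, "wordCount": 4},
]

# Equivalent of "sort=score desc, wordCount asc": the negated score gives
# descending order, and wordCount only matters when scores are equal.
ranked = sorted(results, key=lambda doc: (-doc["score"], doc["wordCount"]))

for doc in ranked:
    print(doc["id"], doc["score"], doc["wordCount"])
```

Note that the shorter doc3 does not jump ahead: wordCount only breaks ties, so the relevance score still decides the main ordering.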

Ahmet


On Wednesday, December 10, 2014 11:29 AM, Ahmet Arslan <io...@yahoo.com> wrote:
Hi,

Do you mean an update processor factory?

Here is an augmented version of your example (wordCount field added):

doc1:

phoneName:"Details about  Apple iPhone 4s - 16GB - White (Verizon)
Smartphone Factory Unlocked"
wordCount: 11

doc2:

phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"
wordCount: 9


The first task is simply to calculate the wordCount values. You can do it in your indexing code or elsewhere.
I quickly skimmed the existing update processors but couldn't find a stock implementation.
CountFieldValuesUpdateProcessorFactory fooled me, but it looks like it only counts the values of multivalued fields.

I guess a simple JavaScript script that splits on whitespace and returns the size of the resulting array would do the trick, run via StatelessScriptUpdateProcessorFactory.
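
The counting step itself can be sketched in a few lines. This is a Python stand-in for the indexing-side logic, not the update processor script, and the function name word_count is an assumption made for the example. A plain whitespace split counts the two standalone "-" tokens in doc1 and yields 13, so the sketch skips tokens that contain no letter or digit, which reproduces the 11 and 9 shown in the example above.

```python
def word_count(text):
    """Count whitespace-separated tokens that contain a letter or digit."""
    return sum(
        1 for token in text.split() if any(ch.isalnum() for ch in token)
    )


doc1 = (
    "Details about  Apple iPhone 4s - 16GB - White (Verizon) "
    "Smartphone Factory Unlocked"
)
doc2 = "Apple iPhone 4S 16GB for Net10, No Contract, White"

for name, text in (("doc1", doc1), ("doc2", doc2)):
    print(name, "wordCount =", word_count(text))
```

Whatever rule is chosen, the same rule has to be applied to every document at index time, otherwise the counts are not comparable.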



At this point you have an int field named wordCount. boost=div(1,wordCount) should work, or you can come up with a more sophisticated formula.
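
To see the effect of such a boost, here is a small Python sketch of the arithmetic. It assumes the boost is applied multiplicatively to the relevance score, as the edismax boost parameter does; the base score is taken from the explain output in this thread purely as an illustrative number, and the helper name boosted_score is hypothetical.

```python
def boosted_score(base_score, word_count):
    """Apply div(1, wordCount) as a multiplicative boost to the base score."""
    return base_score * (1.0 / word_count)


base = 10.649221  # illustrative: two documents with an identical base score

short_doc = boosted_score(base, 9)   # e.g. a 9-word product name
long_doc = boosted_score(base, 11)   # e.g. an 11-word product name

print("9 words: ", short_doc)
print("11 words:", long_doc)
```

With identical base scores the shorter name now ranks first. Because 1/wordCount falls off quickly, a gentler function of wordCount may suit cases where longer names should not be penalised as heavily.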

Ahmet


On Wednesday, December 10, 2014 11:12 AM, S.L <si...@gmail.com> wrote:
Hi Ahmet,

Is there already an implementation of the suggested workaround? Thanks.


On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Hi,
>
> Default length norm is not best option for differentiating very short
> documents, like product names.
> Please see :
> http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec
>
> I suggest you to create an additional integer field, that holds number of
> tokens. You can populate it via update processor. And then penalise (using
> fuction queries) according to that field. This way you have more fine
> grained and flexible control over it.
>
> Ahmet
>
>
>
> On Tuesday, December 9, 2014 12:22 PM, S.L <si...@gmail.com>
> wrote:
> Hi ,
>
> Mikhail Thanks , I looked at the explain and this is what I see for the two
> different documents in questions, they have identical scores   even though
> the document 2 has a shorter productName field, I do not see any lenghtNorm
> related information in the explain.
>
> Also I am not exactly clear on what needs to be looked in the API ?
>
> *Search Query* : q=iphone+4s+16gb&qf= productName&mm=1&pf=
> productName&ps=1&pf2= productName&pf3=
> productName&stopwords=true&lowercaseOperators=true
>
> *productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
> Unlocked *
>
>
>    - *100%* 10.649221 sum of the following:
>       - *10.58%* 1.1270299 sum of the following:
>          - *2.1%* 0.22383358 productName:iphon
>          - *3.47%* 0.36922288 productName:"4 s"
>          - *5.01%* 0.53397346 productName:"16 gb"
>       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
>       - *27.79%* 2.959255 sum of the following:
>          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
>          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
>       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
>
>
> *productName Apple iPhone 4S 16GB for Net10, No Contract, White*
>
>
>    - *100%* 10.649221 sum of the following:
>       - *10.58%* 1.1270299 sum of the following:
>          - *2.1%* 0.22383358 productName:iphon
>          - *3.47%* 0.36922288 productName:"4 s"
>          - *5.01%* 0.53397346 productName:"16 gb"
>       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
>       - *27.79%* 2.959255 sum of the following:
>          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
>          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
>       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
>
>
>
>
>
> On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
> > It's worth to look into <explain> to check particular scoring values. But
> > for most suspect is the reducing precision when float norms are stored in
> > byte vals. See javadoc for DefaultSimilarity.encodeNormValue(float)
> >
> >
> > On Mon, Dec 8, 2014 at 5:49 PM, S.L <si...@gmail.com> wrote:
> >
> > > I have two documents doc1 and doc2 and each one of those has a field
> > called
> > > phoneName.
> > >
> > > doc1:phoneName:"Details about  Apple iPhone 4s - 16GB - White (Verizon)
> > > Smartphone Factory Unlocked"
> > >
> > > doc2:phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"
> > >
> > > Here if I search for
> > >
> > >
> >
> q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true
> > >
> > > Doc1 and Doc2 both have the same identical score , but since the field
> > > phoneName in the doc2 has shorter length I would expect it to have a
> > higher
> > > score , but both have an identical score of 9.961212.
> > >
> > > The phoneName filed is defined as follows.As we can see no where am I
> > > specifying omitNorms=True, still the behavior seems to be that the
> length
> > > norm is not functioning at all. Can some one let me know whats the
> issue
> > > here ?
> > >
> > >         <field name="phoneName" type="text_en_splitting" indexed="true"
> > >             stored="true" required="true" />
> > >         <fieldType name="text_en_splitting" class="solr.TextField"
> > >             positionIncrementGap="100"
> autoGeneratePhraseQueries="true">
> > >             <analyzer type="index">
> > >                 <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > >                 <!-- in this example, we will only use synonyms at
> query
> > > time <filter
> > >                     class="solr.SynonymFilterFactory"
> > > synonyms="index_synonyms.txt" ignoreCase="true"
> > >                     expand="false"/> -->
> > >                 <!-- Case insensitive stop word removal. add
> > > enablePositionIncrements=true
> > >                     in both the index and query analyzers to leave a
> > 'gap'
> > > for more accurate
> > >                     phrase queries. -->
> > >                 <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> > >                     words="lang/stopwords_en.txt"
> > > enablePositionIncrements="true" />
> > >                 <filter class="solr.WordDelimiterFilterFactory"
> > >                     generateWordParts="1" generateNumberParts="1"
> > > catenateWords="1"
> > >                     catenateNumbers="1" catenateAll="0"
> > > splitOnCaseChange="1" />
> > >                 <filter class="solr.LowerCaseFilterFactory" />
> > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > protected="protwords.txt" />
> > >                 <filter class="solr.PorterStemFilterFactory" />
> > >             </analyzer>
> > >             <analyzer type="query">
> > >                 <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > >                 <filter class="solr.SynonymFilterFactory"
> > > synonyms="synonyms.txt"
> > >                     ignoreCase="true" expand="true" />
> > >                 <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> > >                     words="lang/stopwords_en.txt"
> > > enablePositionIncrements="true" />
> > >                 <filter class="solr.WordDelimiterFilterFactory"
> > >                     generateWordParts="1" generateNumberParts="1"
> > > catenateWords="0"
> > >                     catenateNumbers="0" catenateAll="0"
> > > splitOnCaseChange="1" />
> > >                 <filter class="solr.LowerCaseFilterFactory" />
> > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > protected="protwords.txt" />
> > >                 <filter class="solr.PorterStemFilterFactory" />
> > >             </analyzer>
> > >         </fieldType>
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > <mk...@griddynamics.com>
> >
>

Re: Length norm not functioning in solr queries.

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

You mean update processor factory?

Here is an augmented version of your example (wordCount field added):

doc1:

phoneName:"Details about  Apple iPhone 4s - 16GB - White (Verizon)
Smartphone Factory Unlocked"
wordCount: 11

doc2:

phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"
wordCount: 9


The first task is simply to calculate the wordCount values. You can do it in your indexing code or elsewhere.
I quickly skimmed the existing update processors but couldn't find a stock implementation.
CountFieldValuesUpdateProcessorFactory fooled me, but it looks like it is all about multivalued fields.

I guess a simple JavaScript that splits on whitespace and returns the size of the resulting array would do the trick: StatelessScriptUpdateProcessorFactory



At this point you have an int field named wordCount. boost=div(1,wordCount) should work. Or you can come up with a more sophisticated math formula.
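
If you compute the value in your indexing code, a minimal sketch could look like the following. It is written in Python purely for illustration, and the function names are hypothetical. One assumption to note: it skips tokens with no letter or digit (the lone "-" in doc1), which is what gives the 11 and 9 shown above; a plain whitespace split would count those dashes too.

```python
def word_count(name):
    """Count whitespace-separated tokens that contain a letter or digit."""
    return sum(1 for token in name.split() if any(ch.isalnum() for ch in token))


def length_boost(name):
    """Mirror boost=div(1,wordCount): shorter names get a larger multiplier."""
    # Guard against an empty name so the division never fails.
    return 1.0 / max(word_count(name), 1)


doc1 = "Details about  Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked"
doc2 = "Apple iPhone 4S 16GB for Net10, No Contract, White"

for label, name in (("doc1", doc1), ("doc2", doc2)):
    print(label, word_count(name), round(length_boost(name), 4))
```

With these two names the shorter doc2 ends up with the larger multiplier, so it would rank first once the boost is applied.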

Ahmet

On Wednesday, December 10, 2014 11:12 AM, S.L <si...@gmail.com> wrote:
Hi Ahmet,

Is there already an implementation of the suggested workaround? Thanks.


On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Hi,
>
> Default length norm is not best option for differentiating very short
> documents, like product names.
> Please see :
> http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec
>
> I suggest you to create an additional integer field, that holds number of
> tokens. You can populate it via update processor. And then penalise (using
> fuction queries) according to that field. This way you have more fine
> grained and flexible control over it.
>
> Ahmet
>
>
>
> On Tuesday, December 9, 2014 12:22 PM, S.L <si...@gmail.com>
> wrote:
> Hi ,
>
> Mikhail Thanks , I looked at the explain and this is what I see for the two
> different documents in questions, they have identical scores   even though
> the document 2 has a shorter productName field, I do not see any lenghtNorm
> related information in the explain.
>
> Also I am not exactly clear on what needs to be looked in the API ?
>
> *Search Query* : q=iphone+4s+16gb&qf= productName&mm=1&pf=
> productName&ps=1&pf2= productName&pf3=
> productName&stopwords=true&lowercaseOperators=true
>
> *productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
> Unlocked *
>
>
>    - *100%* 10.649221 sum of the following:
>       - *10.58%* 1.1270299 sum of the following:
>          - *2.1%* 0.22383358 productName:iphon
>          - *3.47%* 0.36922288 productName:"4 s"
>          - *5.01%* 0.53397346 productName:"16 gb"
>       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
>       - *27.79%* 2.959255 sum of the following:
>          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
>          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
>       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
>
>
> *productName Apple iPhone 4S 16GB for Net10, No Contract, White*
>
>
>    - *100%* 10.649221 sum of the following:
>       - *10.58%* 1.1270299 sum of the following:
>          - *2.1%* 0.22383358 productName:iphon
>          - *3.47%* 0.36922288 productName:"4 s"
>          - *5.01%* 0.53397346 productName:"16 gb"
>       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
>       - *27.79%* 2.959255 sum of the following:
>          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
>          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
>       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
>
>
>
>
>
> On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
> > It's worth to look into <explain> to check particular scoring values. But
> > for most suspect is the reducing precision when float norms are stored in
> > byte vals. See javadoc for DefaultSimilarity.encodeNormValue(float)
> >
> >
> > On Mon, Dec 8, 2014 at 5:49 PM, S.L <si...@gmail.com> wrote:
> >
> > > I have two documents doc1 and doc2 and each one of those has a field
> > called
> > > phoneName.
> > >
> > > doc1:phoneName:"Details about  Apple iPhone 4s - 16GB - White (Verizon)
> > > Smartphone Factory Unlocked"
> > >
> > > doc2:phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"
> > >
> > > Here if I search for
> > >
> > >
> >
> q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true
> > >
> > > Doc1 and Doc2 both have the same identical score , but since the field
> > > phoneName in the doc2 has shorter length I would expect it to have a
> > higher
> > > score , but both have an identical score of 9.961212.
> > >
> > > The phoneName filed is defined as follows.As we can see no where am I
> > > specifying omitNorms=True, still the behavior seems to be that the
> length
> > > norm is not functioning at all. Can some one let me know whats the
> issue
> > > here ?
> > >
> > >         <field name="phoneName" type="text_en_splitting" indexed="true"
> > >             stored="true" required="true" />
> > >         <fieldType name="text_en_splitting" class="solr.TextField"
> > >             positionIncrementGap="100"
> autoGeneratePhraseQueries="true">
> > >             <analyzer type="index">
> > >                 <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > >                 <!-- in this example, we will only use synonyms at
> query
> > > time <filter
> > >                     class="solr.SynonymFilterFactory"
> > > synonyms="index_synonyms.txt" ignoreCase="true"
> > >                     expand="false"/> -->
> > >                 <!-- Case insensitive stop word removal. add
> > > enablePositionIncrements=true
> > >                     in both the index and query analyzers to leave a
> > 'gap'
> > > for more accurate
> > >                     phrase queries. -->
> > >                 <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> > >                     words="lang/stopwords_en.txt"
> > > enablePositionIncrements="true" />
> > >                 <filter class="solr.WordDelimiterFilterFactory"
> > >                     generateWordParts="1" generateNumberParts="1"
> > > catenateWords="1"
> > >                     catenateNumbers="1" catenateAll="0"
> > > splitOnCaseChange="1" />
> > >                 <filter class="solr.LowerCaseFilterFactory" />
> > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > protected="protwords.txt" />
> > >                 <filter class="solr.PorterStemFilterFactory" />
> > >             </analyzer>
> > >             <analyzer type="query">
> > >                 <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > >                 <filter class="solr.SynonymFilterFactory"
> > > synonyms="synonyms.txt"
> > >                     ignoreCase="true" expand="true" />
> > >                 <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> > >                     words="lang/stopwords_en.txt"
> > > enablePositionIncrements="true" />
> > >                 <filter class="solr.WordDelimiterFilterFactory"
> > >                     generateWordParts="1" generateNumberParts="1"
> > > catenateWords="0"
> > >                     catenateNumbers="0" catenateAll="0"
> > > splitOnCaseChange="1" />
> > >                 <filter class="solr.LowerCaseFilterFactory" />
> > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > protected="protwords.txt" />
> > >                 <filter class="solr.PorterStemFilterFactory" />
> > >             </analyzer>
> > >         </fieldType>
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > <mk...@griddynamics.com>
> >
>

Re: Length norm not functioning in solr queries.

Posted by "S.L" <si...@gmail.com>.
Mikhail,

Thank you for confirming this. However, Ahmet's proposal seems simpler
to implement to me.

On Wed, Dec 10, 2014 at 5:07 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:
>
> S.L,
>
> I briefly skimmed Lucene50NormsConsumer.writeNormsField(), my conclusion
> is: if you supply own similarity, which just avoids putting float to byte
> in Similarity.computeNorm(FieldInvertState), you get right this value in .
> Similarity.decodeNormValue(long).
> You may wonder but this is what's exactly done in PreciseDefaultSimilarity
> in TestLongNormValueSource. I think you can just use it.
>
> On Wed, Dec 10, 2014 at 12:11 PM, S.L <si...@gmail.com> wrote:
>
> > Hi Ahmet,
> >
> > Is there already an implementation of the suggested work around ? Thanks.
> >
> > On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan <io...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi,
> > >
> > > Default length norm is not best option for differentiating very short
> > > documents, like product names.
> > > Please see :
> > > http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec
> > >
> > > I suggest you to create an additional integer field, that holds number
> of
> > > tokens. You can populate it via update processor. And then penalise
> > (using
> > > fuction queries) according to that field. This way you have more fine
> > > grained and flexible control over it.
> > >
> > > Ahmet
> > >
> > >
> > >
> > > On Tuesday, December 9, 2014 12:22 PM, S.L <si...@gmail.com>
> > > wrote:
> > > Hi ,
> > >
> > > Mikhail Thanks , I looked at the explain and this is what I see for the
> > two
> > > different documents in questions, they have identical scores   even
> > though
> > > the document 2 has a shorter productName field, I do not see any
> > lenghtNorm
> > > related information in the explain.
> > >
> > > Also I am not exactly clear on what needs to be looked in the API ?
> > >
> > > *Search Query* : q=iphone+4s+16gb&qf= productName&mm=1&pf=
> > > productName&ps=1&pf2= productName&pf3=
> > > productName&stopwords=true&lowercaseOperators=true
> > >
> > > *productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
> > > Unlocked *
> > >
> > >
> > >    - *100%* 10.649221 sum of the following:
> > >       - *10.58%* 1.1270299 sum of the following:
> > >          - *2.1%* 0.22383358 productName:iphon
> > >          - *3.47%* 0.36922288 productName:"4 s"
> > >          - *5.01%* 0.53397346 productName:"16 gb"
> > >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> > >       - *27.79%* 2.959255 sum of the following:
> > >          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
> > >          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
> > >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> > >
> > >
> > > *productName Apple iPhone 4S 16GB for Net10, No Contract, White*
> > >
> > >
> > >    - *100%* 10.649221 sum of the following:
> > >       - *10.58%* 1.1270299 sum of the following:
> > >          - *2.1%* 0.22383358 productName:iphon
> > >          - *3.47%* 0.36922288 productName:"4 s"
> > >          - *5.01%* 0.53397346 productName:"16 gb"
> > >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> > >       - *27.79%* 2.959255 sum of the following:
> > >          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
> > >          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
> > >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev <
> > > mkhludnev@griddynamics.com> wrote:
> > >
> > > > It's worth to look into <explain> to check particular scoring values.
> > But
> > > > for most suspect is the reducing precision when float norms are
> stored
> > in
> > > > byte vals. See javadoc for DefaultSimilarity.encodeNormValue(float)
> > > >
> > > >
> > > > On Mon, Dec 8, 2014 at 5:49 PM, S.L <si...@gmail.com>
> wrote:
> > > >
> > > > > I have two documents doc1 and doc2 and each one of those has a
> field
> > > > called
> > > > > phoneName.
> > > > >
> > > > > doc1:phoneName:"Details about  Apple iPhone 4s - 16GB - White
> > (Verizon)
> > > > > Smartphone Factory Unlocked"
> > > > >
> > > > > doc2:phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"
> > > > >
> > > > > Here if I search for
> > > > >
> > > > >
> > > >
> > >
> >
> q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true
> > > > >
> > > > > Doc1 and Doc2 both have the same identical score , but since the
> > field
> > > > > phoneName in the doc2 has shorter length I would expect it to have
> a
> > > > higher
> > > > > score , but both have an identical score of 9.961212.
> > > > >
> > > > > The phoneName filed is defined as follows.As we can see no where
> am I
> > > > > specifying omitNorms=True, still the behavior seems to be that the
> > > length
> > > > > norm is not functioning at all. Can some one let me know whats the
> > > issue
> > > > > here ?
> > > > >
> > > > >         <field name="phoneName" type="text_en_splitting"
> > indexed="true"
> > > > >             stored="true" required="true" />
> > > > >         <fieldType name="text_en_splitting" class="solr.TextField"
> > > > >             positionIncrementGap="100"
> > > autoGeneratePhraseQueries="true">
> > > > >             <analyzer type="index">
> > > > >                 <tokenizer class="solr.WhitespaceTokenizerFactory"
> />
> > > > >                 <!-- in this example, we will only use synonyms at
> > > query
> > > > > time <filter
> > > > >                     class="solr.SynonymFilterFactory"
> > > > > synonyms="index_synonyms.txt" ignoreCase="true"
> > > > >                     expand="false"/> -->
> > > > >                 <!-- Case insensitive stop word removal. add
> > > > > enablePositionIncrements=true
> > > > >                     in both the index and query analyzers to leave
> a
> > > > 'gap'
> > > > > for more accurate
> > > > >                     phrase queries. -->
> > > > >                 <filter class="solr.StopFilterFactory"
> > > ignoreCase="true"
> > > > >                     words="lang/stopwords_en.txt"
> > > > > enablePositionIncrements="true" />
> > > > >                 <filter class="solr.WordDelimiterFilterFactory"
> > > > >                     generateWordParts="1" generateNumberParts="1"
> > > > > catenateWords="1"
> > > > >                     catenateNumbers="1" catenateAll="0"
> > > > > splitOnCaseChange="1" />
> > > > >                 <filter class="solr.LowerCaseFilterFactory" />
> > > > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > > > protected="protwords.txt" />
> > > > >                 <filter class="solr.PorterStemFilterFactory" />
> > > > >             </analyzer>
> > > > >             <analyzer type="query">
> > > > >                 <tokenizer class="solr.WhitespaceTokenizerFactory"
> />
> > > > >                 <filter class="solr.SynonymFilterFactory"
> > > > > synonyms="synonyms.txt"
> > > > >                     ignoreCase="true" expand="true" />
> > > > >                 <filter class="solr.StopFilterFactory"
> > > ignoreCase="true"
> > > > >                     words="lang/stopwords_en.txt"
> > > > > enablePositionIncrements="true" />
> > > > >                 <filter class="solr.WordDelimiterFilterFactory"
> > > > >                     generateWordParts="1" generateNumberParts="1"
> > > > > catenateWords="0"
> > > > >                     catenateNumbers="0" catenateAll="0"
> > > > > splitOnCaseChange="1" />
> > > > >                 <filter class="solr.LowerCaseFilterFactory" />
> > > > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > > > protected="protwords.txt" />
> > > > >                 <filter class="solr.PorterStemFilterFactory" />
> > > > >             </analyzer>
> > > > >         </fieldType>
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Sincerely yours
> > > > Mikhail Khludnev
> > > > Principal Engineer,
> > > > Grid Dynamics
> > > >
> > > > <http://www.griddynamics.com>
> > > > <mk...@griddynamics.com>
> > > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mk...@griddynamics.com>
>

Re: Length norm not functioning in solr queries.

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
S.L,

I briefly skimmed Lucene50NormsConsumer.writeNormsField(). My conclusion
is: if you supply your own similarity, which simply avoids packing the float into a byte
in Similarity.computeNorm(FieldInvertState), you get exactly that value back in
Similarity.decodeNormValue(long).
You may be surprised, but this is exactly what PreciseDefaultSimilarity
in TestLongNormValueSource does. I think you can just use it.
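
To illustrate the precision loss that such a similarity avoids, here is a small Python port of the one-byte float encoding that, to my understanding, the Lucene 4.x DefaultSimilarity applies to 1/sqrt(numTerms) (SmallFloat.floatToByte315). Treat it as a sketch for experimenting, not as the Lucene source; the function names are my own.

```python
import math
import struct


def float_to_byte315(f):
    """Pack a float into one byte, keeping only the top bits of the mantissa."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # raw bits as signed int
    small = bits >> 21
    fzero = (63 - 15) << 3
    if small <= fzero:
        return 0 if bits <= 0 else 1  # underflow: zero or smallest positive
    if small >= fzero + 0x100:
        return 255  # overflow: largest representable value
    return small - fzero


def byte315_to_float(b):
    """Expand the byte back into a float; the dropped mantissa bits are gone."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_length_norm(num_terms):
    """Length norm as it survives the encode/decode round trip."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))


for n in (8, 9, 10, 11):
    print(n, round(1.0 / math.sqrt(n), 4), stored_length_norm(n))
```

Field lengths of 8, 9 and 10 terms all come back as 0.3125, while 11 comes back as 0.25. Two short fields whose analyzed lengths fall into the same bucket therefore get identical norms, however different they look in the raw text.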

On Wed, Dec 10, 2014 at 12:11 PM, S.L <si...@gmail.com> wrote:

> Hi Ahmet,
>
> Is there already an implementation of the suggested work around ? Thanks.
>
> On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan <io...@yahoo.com.invalid>
> wrote:
>
> > Hi,
> >
> > Default length norm is not best option for differentiating very short
> > documents, like product names.
> > Please see :
> > http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec
> >
> > I suggest you to create an additional integer field, that holds number of
> > tokens. You can populate it via update processor. And then penalise
> (using
> > fuction queries) according to that field. This way you have more fine
> > grained and flexible control over it.
> >
> > Ahmet
> >
> >
> >
> > On Tuesday, December 9, 2014 12:22 PM, S.L <si...@gmail.com>
> > wrote:
> > Hi ,
> >
> > Mikhail Thanks , I looked at the explain and this is what I see for the
> two
> > different documents in questions, they have identical scores   even
> though
> > the document 2 has a shorter productName field, I do not see any
> lenghtNorm
> > related information in the explain.
> >
> > Also I am not exactly clear on what needs to be looked in the API ?
> >
> > *Search Query* : q=iphone+4s+16gb&qf= productName&mm=1&pf=
> > productName&ps=1&pf2= productName&pf3=
> > productName&stopwords=true&lowercaseOperators=true
> >
> > *productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
> > Unlocked *
> >
> >
> >    - *100%* 10.649221 sum of the following:
> >       - *10.58%* 1.1270299 sum of the following:
> >          - *2.1%* 0.22383358 productName:iphon
> >          - *3.47%* 0.36922288 productName:"4 s"
> >          - *5.01%* 0.53397346 productName:"16 gb"
> >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> >       - *27.79%* 2.959255 sum of the following:
> >          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
> >          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
> >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> >
> >
> > *productName Apple iPhone 4S 16GB for Net10, No Contract, White*
> >
> >
> >    - *100%* 10.649221 sum of the following:
> >       - *10.58%* 1.1270299 sum of the following:
> >          - *2.1%* 0.22383358 productName:iphon
> >          - *3.47%* 0.36922288 productName:"4 s"
> >          - *5.01%* 0.53397346 productName:"16 gb"
> >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> >       - *27.79%* 2.959255 sum of the following:
> >          - *10.97%* 1.1680154 productName:"iphon 4 s"~1
> >          - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
> >       - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
> >
> >
> >
> >
> >
> > On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev <
> > mkhludnev@griddynamics.com> wrote:
> >
> > > It's worth to look into <explain> to check particular scoring values.
> But
> > > for most suspect is the reducing precision when float norms are stored
> in
> > > byte vals. See javadoc for DefaultSimilarity.encodeNormValue(float)
> > >
> > >
> > > On Mon, Dec 8, 2014 at 5:49 PM, S.L <si...@gmail.com> wrote:
> > >
> > > > I have two documents doc1 and doc2 and each one of those has a field
> > > called
> > > > phoneName.
> > > >
> > > > doc1:phoneName:"Details about  Apple iPhone 4s - 16GB - White
> (Verizon)
> > > > Smartphone Factory Unlocked"
> > > >
> > > > doc2:phoneName:"Apple iPhone 4S 16GB for Net10, No Contract, White"
> > > >
> > > > Here if I search for
> > > >
> > > >
> > >
> >
> q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true
> > > >
> > > > Doc1 and Doc2 both have the same identical score , but since the
> field
> > > > phoneName in the doc2 has shorter length I would expect it to have a
> > > higher
> > > > score , but both have an identical score of 9.961212.
> > > >
> > > > The phoneName filed is defined as follows.As we can see no where am I
> > > > specifying omitNorms=True, still the behavior seems to be that the
> > length
> > > > norm is not functioning at all. Can some one let me know whats the
> > issue
> > > > here ?
> > > >
> > > >         <field name="phoneName" type="text_en_splitting"
> indexed="true"
> > > >             stored="true" required="true" />
> > > >         <fieldType name="text_en_splitting" class="solr.TextField"
> > > >             positionIncrementGap="100"
> > autoGeneratePhraseQueries="true">
> > > >             <analyzer type="index">
> > > >                 <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > > >                 <!-- in this example, we will only use synonyms at
> > query
> > > > time <filter
> > > >                     class="solr.SynonymFilterFactory"
> > > > synonyms="index_synonyms.txt" ignoreCase="true"
> > > >                     expand="false"/> -->
> > > >                 <!-- Case insensitive stop word removal. add
> > > > enablePositionIncrements=true
> > > >                     in both the index and query analyzers to leave a
> > > 'gap'
> > > > for more accurate
> > > >                     phrase queries. -->
> > > >                 <filter class="solr.StopFilterFactory"
> > ignoreCase="true"
> > > >                     words="lang/stopwords_en.txt"
> > > > enablePositionIncrements="true" />
> > > >                 <filter class="solr.WordDelimiterFilterFactory"
> > > >                     generateWordParts="1" generateNumberParts="1"
> > > > catenateWords="1"
> > > >                     catenateNumbers="1" catenateAll="0"
> > > > splitOnCaseChange="1" />
> > > >                 <filter class="solr.LowerCaseFilterFactory" />
> > > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > > protected="protwords.txt" />
> > > >                 <filter class="solr.PorterStemFilterFactory" />
> > > >             </analyzer>
> > > >             <analyzer type="query">
> > > >                 <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > > >                 <filter class="solr.SynonymFilterFactory"
> > > > synonyms="synonyms.txt"
> > > >                     ignoreCase="true" expand="true" />
> > > >                 <filter class="solr.StopFilterFactory"
> > ignoreCase="true"
> > > >                     words="lang/stopwords_en.txt"
> > > > enablePositionIncrements="true" />
> > > >                 <filter class="solr.WordDelimiterFilterFactory"
> > > >                     generateWordParts="1" generateNumberParts="1"
> > > > catenateWords="0"
> > > >                     catenateNumbers="0" catenateAll="0"
> > > > splitOnCaseChange="1" />
> > > >                 <filter class="solr.LowerCaseFilterFactory" />
> > > >                 <filter class="solr.KeywordMarkerFilterFactory"
> > > > protected="protwords.txt" />
> > > >                 <filter class="solr.PorterStemFilterFactory" />
> > > >             </analyzer>
> > > >         </fieldType>
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > > <mk...@griddynamics.com>
> > >
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: Length norm not functioning in solr queries.

Posted by "S.L" <si...@gmail.com>.
Hi Ahmet,

Is there already an implementation of the suggested workaround? Thanks.


Re: Length norm not functioning in solr queries.

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

The default length norm is not the best option for differentiating very short documents, such as product names.
Please see: http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec

I suggest creating an additional integer field that holds the number of tokens. You can populate it via an update processor and then penalise longer documents (using function queries) according to that field. This gives you more fine-grained and flexible control over the length penalty.
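
To make the suggestion concrete, here is a minimal sketch in Python (chosen only for brevity). It is an illustration under assumptions, not an existing implementation: the field name `token_count` is hypothetical, a plain whitespace split stands in for the real analyzer chain, and the `recip` constants are arbitrary. In Solr itself the count would be written at index time by an update processor, and the penalty could then be applied with a function query such as `recip(x,m,a,b)`, which computes a / (m*x + b).

```python
def token_count(text: str) -> int:
    """Count whitespace-separated tokens (a stand-in for the real analyzer chain)."""
    return len(text.split())


def recip(x: float, m: float, a: float, b: float) -> float:
    """Same formula as Solr's recip(x,m,a,b) function query: a / (m*x + b)."""
    return a / (m * x + b)


# The two product names from the original question.
doc1 = "Details about  Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked"
doc2 = "Apple iPhone 4S 16GB for Net10, No Contract, White"

# Hypothetical request parameter: a multiplicative boost that shrinks as the
# (hypothetical) token_count field grows. The constants 1, 10, 10 are arbitrary.
boost_param = "boost=recip(token_count,1,10,10)"

for name, text in (("doc1", doc1), ("doc2", doc2)):
    n = token_count(text)
    print(name, n, round(recip(n, 1, 10, 10), 4))
```

With these constants the 9-token name gets a multiplier of about 0.5263 and the 13-token name about 0.4348, so the shorter name ranks higher when the text scores are otherwise equal. Unlike the stored norm, this value is not quantised, which is where the finer control comes from.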

Ahmet




Re: Length norm not functioning in solr queries.

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
I wonder why your explain output is so brief; mine looks like this:

    <str>
0.4500489 = (MATCH) weight(text:inc in 17) [DefaultSimilarity], result of:
  0.4500489 = fieldWeight in 17, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    2.880313 = idf(docFreq=8, maxDocs=59)
    0.15625 = fieldNorm(doc=17)</str>
    <str>
0.4500489 = (MATCH) weight(text:inc in 27) [DefaultSimilarity], result of:
  0.4500489 = fieldWeight in 27, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    2.880313 = idf(docFreq=8, maxDocs=59)
    0.15625 = fieldNorm(doc=27)</str>

Here we can see the fieldNorm factors. These two docs are rather different,
yet their norm factors are equal.
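
The numbers in the explain output above can be reproduced by hand. The sketch below assumes the classic TF-IDF formulas used by DefaultSimilarity, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the helper names are made up for this illustration.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Values taken from the explain output above.
field_norm = 0.15625
field_weight = tf(1.0) * idf(8, 59) * field_norm

print(round(idf(8, 59), 6), round(field_weight, 7))
```

The fieldNorm is the only document-specific factor in this product, so two documents whose norms are stored as the same value end up with the same fieldWeight for a term.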

> Also I am not exactly clear on what needs to be looked in the API ?

Because there you can see exactly how it loses precision when the float
field norm is stored in a byte.
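
As a rough illustration of that precision loss, the following Python sketch re-implements the single-byte float encoding as I understand it from Lucene's SmallFloat.floatToByte315 / byte315ToFloat. Treat it as an assumption and check it against the Lucene source for your version. It also assumes a length norm of 1/sqrt(numTerms) with no index-time boost; `stored_norm` is a made-up helper name.

```python
import math
import struct


def float_to_byte315(f: float) -> int:
    """Pack a float into one unsigned byte (re-implemented from SmallFloat; an assumption)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    # Shifting right by 21 keeps the sign, the exponent and only the top two
    # stored mantissa bits; everything below that is truncated.
    small = bits >> (24 - 3)
    if small <= ((63 - 15) << 3):
        return 0 if bits <= 0 else 1   # zero/negative -> 0, underflow -> smallest value
    if small >= ((63 - 15) << 3) + 0x100:
        return 255                     # overflow -> largest value
    return small - ((63 - 15) << 3)


def byte315_to_float(b: int) -> float:
    """Inverse of float_to_byte315."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_norm(num_terms: int) -> float:
    """Length norm 1/sqrt(numTerms) after a round trip through the byte encoding."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))


for n in range(1, 18):
    print(n, stored_norm(n))
```

Under these assumptions, fields with 8 to 10 terms all come back as 0.3125 and fields with 11 to 16 terms all come back as 0.25, so two short product names of somewhat different length can easily share one fieldNorm. The 0.15625 in the explain output above is what any field of 29 to 40 terms would decode to.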







Re: Length norm not functioning in solr queries.

Posted by "S.L" <si...@gmail.com>.
Hi,

Thanks, Mikhail. I looked at the explain output, and this is what I see for the
two documents in question: they have identical scores even though document 2
has a shorter productName field, and I do not see any lengthNorm-related
information in the explain output.

Also, I am not exactly clear on what I need to look at in the API.

*Search Query*: q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true

*productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
Unlocked *


   - *100%* 10.649221 sum of the following:
      - *10.58%* 1.1270299 sum of the following:
         - *2.1%* 0.22383358 productName:iphon
         - *3.47%* 0.36922288 productName:"4 s"
         - *5.01%* 0.53397346 productName:"16 gb"
      - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
      - *27.79%* 2.959255 sum of the following:
         - *10.97%* 1.1680154 productName:"iphon 4 s"~1
         - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
      - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1


*productName Apple iPhone 4S 16GB for Net10, No Contract, White*


   - *100%* 10.649221 sum of the following:
      - *10.58%* 1.1270299 sum of the following:
         - *2.1%* 0.22383358 productName:iphon
         - *3.47%* 0.36922288 productName:"4 s"
         - *5.01%* 0.53397346 productName:"16 gb"
      - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
      - *27.79%* 2.959255 sum of the following:
         - *10.97%* 1.1680154 productName:"iphon 4 s"~1
         - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
      - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1





Re: Length norm not functioning in solr queries.

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
It's worth looking into the <explain> output to check the particular scoring
values. But the most likely suspect is the reduced precision when float norms
are stored as byte values. See the javadoc for
DefaultSimilarity.encodeNormValue(float).




