Posted to dev@lucene.apache.org by "Sean Timm (JIRA)" <ji...@apache.org> on 2008/08/21 21:32:44 UTC
[jira] Created: (LUCENE-1360) A Similarity class which has unique
length norms for numTerms <= 10
A Similarity class which has unique length norms for numTerms <= 10
-------------------------------------------------------------------
Key: LUCENE-1360
URL: https://issues.apache.org/jira/browse/LUCENE-1360
Project: Lucene - Java
Issue Type: Improvement
Reporter: Sean Timm
Priority: Trivial
Attachments: ShortFieldNormSimilarity.java
A Similarity class which extends DefaultSimilarity and simply overrides lengthNorm. lengthNorm is implemented as a lookup for numTerms <= 10, else as {{1/sqrt(numTerms)}}. This prevents term counts below 11 from sharing the same lengthNorm once it is stored as a single byte in the index.
This is useful if your search is only on short fields such as titles or product descriptions.
See mailing list discussion: http://www.nabble.com/How-to-boost-the-score-higher-in-case-user-query-matches-entire-field-value-than-just-some-words-within-a-field-td19079221.html
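The lookup described above might look roughly like the following. This is only a sketch, not the attached ShortFieldNormSimilarity.java: the class name ShortFieldNormSketch is hypothetical, the table values are assumed to be the ones in the ARR array of the generator code posted in this thread, and the method is kept free of Lucene types so it compiles on its own. In an actual Similarity subclass the same logic would sit inside the lengthNorm override.

```java
/**
 * Sketch of a lookup-based length norm for short fields.
 * Hypothetical stand-in for the attached class; no Lucene dependency.
 */
final class ShortFieldNormSketch {

    // ASSUMPTION: values copied from the ARR array in the generator code
    // posted in this thread. Index 0 covers an empty field.
    private static final float[] NORM_TABLE = {
        0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
        0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f
    };

    /** Table lookup for numTerms <= 10, otherwise 1/sqrt(numTerms). */
    static float lengthNorm(int numTerms) {
        if (numTerms >= 0 && numTerms < NORM_TABLE.length) {
            return NORM_TABLE[numTerms];
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Print term count and norm for a few short and longer fields.
        for (int n = 1; n <= 12; n++) {
            System.out.println(n + "," + lengthNorm(n));
        }
    }
}
```

Each table entry is chosen so that it differs from its neighbours, so fields of 1 to 10 terms each get their own norm, and the values keep decreasing into the 1/sqrt(numTerms) range from 11 terms onward.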
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Updated: (LUCENE-1360) A Similarity class which has unique
length norms for numTerms <= 10
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lance Norskog updated LUCENE-1360:
----------------------------------
Attachment: LUCENE-1380 visualization.pdf
This is a graph of the standard norms against the results of this patch. The orange/red dots at the left are the elevated values for boosting short documents.
Both displays show the norms after the 8-bit encode/decode process, rather than raw 1/x. Here is the code for the generator:
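import org.apache.lucene.util.SmallFloat;

public class FloatEncode {
    // Elevated norms for short fields, indexed by term count (index 0 is unused).
    private static final float[] ARR = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
            0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

    public static void main(String[] args) {
        // Prints one CSV row per term count: i, standard norm, patched norm.
        for (int i = 1; i < 100; i++) {
            // Raw value is 1/i, as in the original post; the issue description
            // uses 1/sqrt(numTerms) for the length norm itself.
            float f = 1.0f / i;
            // Round-trip through the single-byte encoding used for stored norms.
            byte b = SmallFloat.floatToByte315(f);
            float f2 = SmallFloat.byte315ToFloat(b);
            // Use the lookup value for short fields, else the decoded norm.
            float ff = (i < ARR.length) ? ARR[i] : f2;
            System.out.println(i + "," + f2 + "," + ff);
        }
    }
}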
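To see why short fields collide in the first place, the encode/decode step can be reproduced without Lucene on the classpath. The sketch below is an assumption-laden illustration: the class NormCollisionSketch and the method storedNorm are hypothetical names, and the two conversion methods are re-implemented from memory of SmallFloat.floatToByte315 and SmallFloat.byte315ToFloat, so prefer the library methods in real code. Unlike the generator above, it starts from 1/sqrt(numTerms), the formula given in the issue description.

```java
/**
 * Shows which term counts from 1 to 10 end up with the same stored norm.
 * The encode/decode methods are an approximation of Lucene's SmallFloat,
 * re-implemented here only to keep the example self-contained.
 */
final class NormCollisionSketch {

    // ASSUMPTION: mirrors SmallFloat.floatToByte315 as recalled, not copied.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        int zero = (63 - 15) << 3;
        if (small <= zero) {
            // Zero and negatives map to 0, underflow to the smallest non-zero byte.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= zero + 0x100) {
            return (byte) -1; // overflow maps to the largest value
        }
        return (byte) (small - zero);
    }

    // ASSUMPTION: mirrors SmallFloat.byte315ToFloat as recalled, not copied.
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** 1/sqrt(numTerms) after a round trip through the single-byte encoding. */
    static float storedNorm(int numTerms) {
        float raw = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(raw));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + "," + storedNorm(n));
        }
    }
}
```

Under these assumptions, fields of 3 and 4 terms decode to the same norm, as do 6 and 7, and 8 through 10, which is the loss of resolution the lookup table is meant to avoid.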
[jira] Updated: (LUCENE-1360) A Similarity class which has unique
length norms for numTerms <= 10
Posted by "Sean Timm (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Timm updated LUCENE-1360:
------------------------------
Attachment: ShortFieldNormSimilarity.java
[jira] Assigned: (LUCENE-1360) A Similarity class which has unique
length norms for numTerms <= 10
Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Otis Gospodnetic reassigned LUCENE-1360:
----------------------------------------
Assignee: Otis Gospodnetic
[jira] Commented: (LUCENE-1360) A Similarity class which has unique
length norms for numTerms <= 10
Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759054#action_12759054 ]
Shalin Shekhar Mangar commented on LUCENE-1360:
-----------------------------------------------
I'm interested in this issue as well.
[jira] Issue Comment Edited: (LUCENE-1360) A Similarity class which
has unique length norms for numTerms <= 10
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780897#action_12780897 ]
Lance Norskog edited comment on LUCENE-1360 at 11/21/09 2:58 AM:
-----------------------------------------------------------------
This is a graph of the standard norms against the results of this patch. The orange/red dots at the left are the elevated values for boosting short documents.
Both displays show the norms after the 8-bit encode/decode process, rather than raw 1/x. Here is the code for the generator:
{code}
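import org.apache.lucene.util.SmallFloat;

public class FloatEncode {
    // Elevated norms for short fields, indexed by term count (index 0 is unused).
    private static final float[] ARR = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
            0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

    public static void main(String[] args) {
        // Prints one CSV row per term count: i, standard norm, patched norm.
        for (int i = 1; i < 100; i++) {
            // Raw value is 1/i, as in the original post; the issue description
            // uses 1/sqrt(numTerms) for the length norm itself.
            float f = 1.0f / i;
            // Round-trip through the single-byte encoding used for stored norms.
            byte b = SmallFloat.floatToByte315(f);
            float f2 = SmallFloat.byte315ToFloat(b);
            // Use the lookup value for short fields, else the decoded norm.
            float ff = (i < ARR.length) ? ARR[i] : f2;
            System.out.println(i + "," + f2 + "," + ff);
        }
    }
}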
{code}
[jira] Issue Comment Edited: (LUCENE-1360) A Similarity class which
has unique length norms for numTerms <= 10
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780897#action_12780897 ]
Lance Norskog edited comment on LUCENE-1360 at 11/21/09 3:03 AM:
-----------------------------------------------------------------
This is a graph of the standard norms against the results of this patch. The orange/red dots at the left are the elevated values for boosting short documents.
Both displays show the norms after the 8-bit encode/decode process, rather than raw 1/x. Here is the code for the generator:
{code}
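import org.apache.lucene.util.SmallFloat;

public class FloatEncode {
    // Elevated norms for short fields, indexed by term count (index 0 is unused).
    private static final float[] ARR = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
            0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

    public static void main(String[] args) {
        // Prints one CSV row per term count: i, standard norm, patched norm.
        for (int i = 1; i < 100; i++) {
            // Raw value is 1/i, as in the original post; the issue description
            // uses 1/sqrt(numTerms) for the length norm itself.
            float f = 1.0f / i;
            // Round-trip through the single-byte encoding used for stored norms.
            byte b = SmallFloat.floatToByte315(f);
            float f2 = SmallFloat.byte315ToFloat(b);
            // Use the lookup value for short fields, else the decoded norm.
            float ff = (i < ARR.length) ? ARR[i] : f2;
            System.out.println(i + "," + f2 + "," + ff);
        }
    }
}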
{code}
(Please pretend I named it LUCENE-1360 instead of LUCENE-1380.)
[jira] Commented: (LUCENE-1360) A Similarity class which has unique
length norms for numTerms <= 10
Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758966#action_12758966 ]
Lance Norskog commented on LUCENE-1360:
---------------------------------------
Is this code still interesting? That is, are there newer tools in Lucene that handle this problem?
I have found searching movie titles, product descriptions etc. difficult to manage really well. Mainstream text retrieval research & applied tech is very strongly biased towards bodies of text.