Posted to solr-dev@lucene.apache.org by "Mike Krimerman (JIRA)" <ji...@apache.org> on 2007/10/26 00:25:50 UTC

[jira] Created: (SOLR-395) Spell-check should return frequencies of word and suggestions

Spell-check should return frequencies of word and suggestions
-------------------------------------------------------------

                 Key: SOLR-395
                 URL: https://issues.apache.org/jira/browse/SOLR-395
             Project: Solr
          Issue Type: Improvement
          Components: search
    Affects Versions: 1.3
            Reporter: Mike Krimerman
            Priority: Minor
             Fix For: 1.3


When issuing a spell-check, the word being searched for might be present in the index with a very low frequency (for example, a misspelling that made its way into the index). It might therefore be helpful if the client receives the frequency of the word plus the frequencies of each of the suggestions.
This feature should be optional (using a URL param).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Krimerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Krimerman updated SOLR-395:
--------------------------------

    Attachment: returnFrequencies.patch

Patch for returning frequencies for the word and its suggestions.
Lucene's suggestions are sorted by distance first and frequency second (if applicable).

The patch adds two fields:
 * a frequency field for the word
 * a list of frequencies (same length as the suggestion list).





[jira] Updated: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Krimerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Krimerman updated SOLR-395:
--------------------------------

    Attachment: extended_results.diff

The attached patch combines patches for issues 375, 395, 401 and some more:
# (375) Adds the *exist* property for a single-word spell-check: whether the word exists in the dictionary
# Adds the *sp.query.onlyMorePopular* option for returning only suggestions that are more popular than the query word(s)
# The *sp.query.extendedResults* option implies a multi-word query and returns frequencies for each word in the query and for each suggestion.
# (401) A minimum *threshold* for adding words to the spell-check dictionary, expressed as the fraction (percent/100) of documents in which a word must appear.
# *Arguments* are now prefixed with 'sp'; the old names remain supported for backwards compatibility.
## _sp.dictionary.indexDir_ - backwards compatible with _spellcheckerIndexDir_
## _sp.dictionary.termSourceField_ - backwards compatible with _termSourceField_
## _sp.dictionary.threshold_ - threshold for words to enter dictionary
## _sp.query.suggestionCount_ - backwards compatible with _suggestionCount_
## _sp.query.accuracy_ - backwards compatible with _accuracy_
## _sp.query.onlyMorePopular_ - only more popular suggestions
## _sp.query.extendedResults_ - multi-word query and a response with frequencies
# (375) A *unit-test* file, extended and modified to test 401
# Formatted extended-results to be more friendly for Python/Ruby






[jira] Updated: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas updated SOLR-395:
----------------------------

    Component/s:     (was: search)
                 spellchecker



[jira] Commented: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Scott Tabar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537958 ] 

Scott Tabar commented on SOLR-395:
----------------------------------

I will be making changes to SOLR-375 to display the frequency for the word that is being checked instead of using the boolean exists.  This should not be conditional on a parameter, but should be part of the default results, as the exists modification is currently implemented.

It would not be a problem to incorporate these changes into SOLR-375, and for me to add additional unit tests to cover the frequency modifications.

Mike (both), do you have any other suggestions to enhance the SpellCheckerRequestHandler?

I have not run this code, but from reviewing the patch it appears that the frequency list is parallel to, and separate from, the suggestion list.  This is great from the perspective of backwards compatibility, but would it make more sense to alter the suggestion list's data structure to tie each frequency more strongly to the word that is being suggested?  Right now only the frequency is of interest, but if additional information were provided later, say the value of "distance", there would be a logical place for it; otherwise we would have yet another "list" of "values".  An organized data structure would also be more conducive to using Java's "for each" or Prototype's "each" construct without needing to track index values into one array or the other.  I realize this may be more a matter of stylistic preference, but now is the time to make a change if it is so desired.
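
To make the trade-off concrete, here is a small Python sketch contrasting the two layouts. The field names "word", "frequency" and "distance" and the distance value are hypothetical, chosen only for illustration; the words and frequencies are sample values.

```python
# Layout A: two parallel lists, as in the first patch.
suggestions = ["hats", "hans", "haul"]
frequencies = [6794, 5986, 3152]


def iterate_parallel(words, freqs):
    """Consumers must keep both lists aligned by index."""
    return [(words[i], freqs[i]) for i in range(len(words))]


def to_records(words, freqs):
    """Layout B: one record per suggestion, so new fields need no extra list."""
    return [{"word": w, "frequency": f} for w, f in zip(words, freqs)]


records = to_records(suggestions, frequencies)
records[0]["distance"] = 1  # hypothetical extra field and value

# A plain "for each" loop is enough; no index bookkeeping is needed.
for record in records:
    print(record["word"], record["frequency"])
```

With layout B, adding a field such as a distance touches only the record, whereas layout A would need a third list kept in step with the other two.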

One way to integrate the frequency of the suggestion is to make it an attribute on the <str> tag, such as <str frequency="1283">happy</str>.  This may help with backwards compatibility, but there is not much support for attributes within Solr, so that could prevent its use.





[jira] Issue Comment Edited: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539661 ] 

klaasm edited comment on SOLR-395 at 11/2/07 11:35 AM:
-----------------------------------------------------------

A Python example:
{noformat}
{
  'responseHeader': {
    'status':0,
    'QTime':16
  },
  'result':{
    'pithon':{
      'frequency':5,
      'suggestions':['python',{'frequency':18785}]
    },
    'haus':{
      'frequency':482,
      'suggestions':['hats',{'frequency':6794},'hans',
{'frequency':5986},'haul',{'frequency':3152},'haas',
{'frequency':1054},'hays',{'frequency':533}]
    },
    'endication':{
      'frequency':0,
      'suggestions':['indication',{'frequency':9634},'syndication',
{'frequency':17777},'dedication',{'frequency':4470},'medication',
{'frequency':3746},'indications',{'frequency':2783}]
    }
  }
}
{noformat}
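
In this response format each suggestion word is followed immediately by a dict holding its data, so the list alternates between strings and dicts. A client could pair them up roughly as follows; this is a sketch that assumes the alternating layout shown above always holds, and the helper name pair_suggestions is made up for the example.

```python
def pair_suggestions(flat):
    """Turn ['python', {'frequency': 18785}, ...] into [('python', 18785), ...]."""
    words = flat[0::2]   # even positions hold the suggested words
    infos = flat[1::2]   # odd positions hold the per-suggestion dicts
    return [(word, info["frequency"]) for word, info in zip(words, infos)]


# The 'haus' entry from the example response above.
haus = ['hats', {'frequency': 6794}, 'hans', {'frequency': 5986},
        'haul', {'frequency': 3152}, 'haas', {'frequency': 1054},
        'hays', {'frequency': 533}]

for word, frequency in pair_suggestions(haus):
    print(word, frequency)
```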

  


[jira] Commented: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539661 ] 

Mike Klaas commented on SOLR-395:
---------------------------------

A Python example:
{code:java}
{
  'responseHeader': {
    'status':0,
    'QTime':16
  },
  'result':{
    'pithon':{
      'frequency':5,
      'suggestions':['python',{'frequency':18785}]
    },
    'haus':{
      'frequency':482,
      'suggestions':['hats',{'frequency':6794},'hans',
{'frequency':5986},'haul',{'frequency':3152},'haas',
{'frequency':1054},'hays',{'frequency':533}]
    },
    'endication':{
      'frequency':0,
      'suggestions':['indication',{'frequency':9634},'syndication',
{'frequency':17777},'dedication',{'frequency':4470},'medication',
{'frequency':3746},'indications',{'frequency':2783}]
    }
  }
}
{/code}



[jira] Commented: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Krimerman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538061 ] 

Mike Krimerman commented on SOLR-395:
-------------------------------------

The separate list of frequencies is indeed for backwards compatibility. If backwards compatibility is not an issue, it seems preferable to do as you suggested and add a frequency to each suggestion.
If the distance can be added, it would be a nice addition. Lucene sorts the suggestion list by distance first and frequency second.

Regarding the XML formatting, that would be a nice addition. However, I was under the impression that Solr uses only tag elements (and not attributes) for responses. How would the frequency be returned if a JSON or Python response is requested?

Another nice addition might be to implement the selection of the most prominent suggestion; however, that might require some heuristics and might not be generic.




[jira] Assigned: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas reassigned SOLR-395:
-------------------------------

    Assignee: Mike Klaas



[jira] Commented: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Krimerman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539653 ] 

Mike Krimerman commented on SOLR-395:
-------------------------------------

The new format produces the following output (querying for pithon+progremming with extendedResults=true):
{code:xml} 
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">173</int>
    </lst>
    <lst name="result">
        <lst name="pithon">
            <int name="frequency">5</int>
            <lst name="suggestions">
                <lst name="python">
                    <int name="frequency">18785</int>
                </lst>
            </lst>
        </lst>
        <lst name="progremming">
            <int name="frequency">0</int>
            <lst name="suggestions">
                <lst name="programming">
                    <int name="frequency">70997</int>
                </lst>
                <lst name="progressing">
                    <int name="frequency">1930</int>
                </lst>
                <lst name="programing">
                    <int name="frequency">597</int>
                </lst>
                <lst name="progamming">
                    <int name="frequency">113</int>
                </lst>
                <lst name="reprogramming">
                    <int name="frequency">344</int>
                </lst>
            </lst>
        </lst>
    </lst>
</response>
{code}
In this example the best suggestions are the first ones. Some queries may return a suggestion which is very close to the query word, but with relatively low frequency (Lucene sorts results by distance first). In that case suggestions that are somewhat farther but with a much higher frequency should be chosen.
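
A client-side sketch of that idea in Python: parse the extended XML response, then prefer a later (more distant) suggestion only when it is much more frequent than the current pick. The ratio of 10 and the function names are assumptions for illustration; the thread does not define a specific heuristic.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above.
SAMPLE = """<response>
  <lst name="result">
    <lst name="progremming">
      <int name="frequency">0</int>
      <lst name="suggestions">
        <lst name="programming"><int name="frequency">70997</int></lst>
        <lst name="progressing"><int name="frequency">1930</int></lst>
        <lst name="programing"><int name="frequency">597</int></lst>
      </lst>
    </lst>
  </lst>
</response>"""


def parse_extended_results(xml_text):
    """Map each query word to (frequency, [(suggestion, frequency), ...])."""
    root = ET.fromstring(xml_text)
    result = root.find("lst[@name='result']")
    parsed = {}
    for word in result.findall("lst"):
        freq = int(word.find("int[@name='frequency']").text)
        suggestions = [
            (s.get("name"), int(s.find("int[@name='frequency']").text))
            for s in word.find("lst[@name='suggestions']").findall("lst")
        ]
        parsed[word.get("name")] = (freq, suggestions)
    return parsed


def pick_suggestion(suggestions, ratio=10):
    """Hypothetical heuristic: start from the closest suggestion (first in
    the list) and switch to a later one only if it is at least `ratio`
    times more frequent than the current pick."""
    if not suggestions:
        return None
    best_word, best_freq = suggestions[0]
    for word, freq in suggestions[1:]:
        if freq >= ratio * best_freq:
            best_word, best_freq = word, freq
    return best_word


frequency, candidates = parse_extended_results(SAMPLE)["progremming"]
print(frequency, pick_suggestion(candidates))
```

The ratio is a tuning knob: a larger value trusts Lucene's distance ordering more, a smaller one leans toward popular words.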




[jira] Commented: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538064 ] 

Mike Klaas commented on SOLR-395:
---------------------------------

If the extra data is only present when some parameter is present, backward compatibility is not affected.
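
A minimal Python sketch of this opt-in approach, under the assumption that the request parameters arrive as a dict of strings. The response shapes and helper names here are hypothetical and are not taken from the patch.

```python
def wants_extended(params):
    """True only when the caller explicitly asks for extended results."""
    return params.get("sp.query.extendedResults", "false").lower() == "true"


def build_result(params, suggestions, frequencies):
    """Return the plain suggestion list unless extended results are requested."""
    if not wants_extended(params):
        # Existing clients never send the flag, so they see the old shape.
        return {"suggestions": list(suggestions)}
    return {
        "suggestions": {
            word: {"frequency": freq}
            for word, freq in zip(suggestions, frequencies)
        }
    }


print(build_result({}, ["python"], [18785]))
print(build_result({"sp.query.extendedResults": "true"}, ["python"], [18785]))
```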





[jira] Resolved: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas resolved SOLR-395.
-----------------------------

    Resolution: Fixed

Committed!  Thanks Mike and Scott.



[jira] Commented: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539612 ] 

Yonik Seeley commented on SOLR-395:
-----------------------------------

> the new format is extensible: new data can be added to the suggestions without breaking compatibility.

That's always a good thing... could you give an example of the new format for those of us too lazy to try it out ourselves?



[jira] Commented: (SOLR-395) Spell-check should return frequencies of word and suggestions

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537778 ] 

Mike Klaas commented on SOLR-395:
---------------------------------

Might it be better to rename the fields "queryFreq"/"suggestionFreqs"?  (or something more different than "frequency" + "frequencies")
