Posted to solr-dev@lucene.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2006/12/12 07:43:20 UTC

[jira] Created: (SOLR-81) Add Query Spellchecker functionality

Add Query Spellchecker functionality
------------------------------------

                 Key: SOLR-81
                 URL: http://issues.apache.org/jira/browse/SOLR-81
             Project: Solr
          Issue Type: New Feature
          Components: search
            Reporter: Otis Gospodnetic
            Priority: Minor


Use the simple approach of n-gramming outside of Solr and indexing n-gram documents.  For example:

<doc>
<field name="word">lettuce</field>
<field name="start3">let</field>
<field name="gram3">let ett ttu tuc uce</field>
<field name="end3">uce</field>
<field name="start4">lett</field>
<field name="gram4">lett ettu ttuc tuce</field>
<field name="end4">tuce</field>
</doc>

See:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg01254.html
Java clients: SOLR-20 (add delete commit optimize), SOLR-30 (search)
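
A minimal sketch of the "n-gramming outside of Solr" step, assuming plain Java on the client side. The class and method names (NGramDocBuilder, build) are hypothetical and not part of Solr or Lucene; the field names follow the example document above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical client-side helper (not a Solr or Lucene class): builds the
// startN / gramN / endN fields for one word, in the layout of the <doc> above.
class NGramDocBuilder {

    // Returns field name -> field value for every gram size in minGram..maxGram.
    static Map<String, String> build(String word, int minGram, int maxGram) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("word", word);
        for (int n = minGram; n <= maxGram; n++) {
            if (word.length() < n) {
                continue; // word is too short to yield any n-gram of this size
            }
            StringJoiner grams = new StringJoiner(" ");
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n)); // sliding window of width n
            }
            fields.put("start" + n, word.substring(0, n));
            fields.put("gram" + n, grams.toString());
            fields.put("end" + n, word.substring(word.length() - n));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints the fields as <field> elements (no XML escaping is done here).
        System.out.println("<doc>");
        build("lettuce", 3, 4).forEach((name, value) ->
            System.out.println("<field name=\"" + name + "\">" + value + "</field>"));
        System.out.println("</doc>");
    }
}
```

The resulting document would then be posted to Solr like any other add request; that part is left out of the sketch.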


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483805 ] 

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

Hoss, another poooooossibly interesting and useful addition:

Make use of public RAMDirectory(Directory dir) and allow one to specify that even though the spellchecker index exists in FS, use it only to pull it into a RAMDir-based index.  Might not be a huge win because most spellchecker indices are probably pretty small and easily fit in RAM already, even when they are FSDir-based, but I thought I'd mention it anyway.




> Attachments: hoss.spell.patch, SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Attachment: SOLR-81-spellchecker.patch

Here is a new version of the patch:

- No token filters, no schema.xml changes, no invariant properties used
- Only 1 SpellCheckerRequestHandler that either returns spelling suggestions or rebuilds the spellchecker index if cmd=rebuild is specified
- SpellChecker instance is no longer static
- kept spellchecker.xml example doc
- still using absolute path for index dir
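
To illustrate the single-handler design from the list above, here is a small Solr-free model. It is only a sketch: apart from cmd=rebuild, every name in it (SingleSpellHandlerSketch, handle, the q parameter, the returned strings) is an assumption made for the example, and the real suggestion lookup done by Lucene's SpellChecker is replaced by a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of one handler serving both commands; not the patch's code.
class SingleSpellHandlerSketch {

    // Per-instance state (not static), so several handlers could coexist.
    private List<String> dictionary = new ArrayList<>();

    // sourceTerms stands in for the terms that a rebuild would read.
    String handle(Map<String, String> params, List<String> sourceTerms) {
        if ("rebuild".equals(params.get("cmd"))) {
            // Stand-in for rebuilding the spellchecker index.
            dictionary = new ArrayList<>(sourceTerms);
            return "rebuilt:" + dictionary.size();
        }
        // Default path: a suggestion request. Placeholder logic only:
        // a word already in the dictionary needs no suggestion.
        String q = params.getOrDefault("q", "");
        return dictionary.contains(q) ? "ok:" + q : "suggest:" + q;
    }
}
```

Keeping the dictionary in an instance field rather than a static one is what allows two handlers registered under different names to hold different dictionaries.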

I'm still recovering from a +13h timezone change, so please shout if I missed anything.  I'd like to commit this by the end of the week, so please help me finalize this.

Adam: if you are doing any work on this, please email or comment here, so we don't duplicate the effort.



> Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473848 ] 

Adam Hiatt commented on SOLR-81:
--------------------------------

What was the bug? I couldn't tell from the Lucene issue description.





> Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483795 ] 

Ryan McKinley commented on SOLR-81:
-----------------------------------

>  * can't do relative path to dataDir, because we can't getdataDir,
>    because SolrCore isn't done initializing yet.

With SOLR-182, SolrCore gets initialized first, so we could use relative paths during handler initialization.


> Attachments: hoss.spell.patch, SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12460405 ] 
            
Otis Gospodnetic commented on SOLR-81:
--------------------------------------

This patch contains 3 new classes for org.apache.solr.analysis:
1. NGramTokenizerFactory
2. NGramTokenizer
3. NGramTokenizerTest (all tests pass)
+ 1 modified class:
4. BaseTokenizerFactory

I *think* the above can be configured in schema.xml as follows:

    <fieldtype name="gram1" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <tokenizer class="solr.NGramTokenizerFactory" minGram="1" maxGram="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="gram2" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <tokenizer class="solr.NGramTokenizerFactory" minGram="2" maxGram="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="gram3" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <tokenizer class="solr.NGramTokenizerFactory" minGram="3" maxGram="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

And I *believe* the following fields would have to be defined (to match the fields in Spellchecker.java):

<field name="word" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="start1" type="string" indexed="true" stored="true" multiValued="false"/>  **
<field name="end1" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="start2" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="end2" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="start3" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="end3" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="start4" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="end4" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="gram1" type="gram1" indexed="true" stored="true" multiValued="false"/>
<field name="gram2" type="gram2" indexed="true" stored="true" multiValued="false"/>
<field name="gram3" type="gram3" indexed="true" stored="true" multiValued="false"/>
<field name="gram4" type="gram4" indexed="true" stored="true" multiValued="false"/>

c.f. http://wiki.apache.org/jakarta-lucene/SpellChecker
I am not sure how to configure the fields marked with  ** above.
Maybe I don't even need startN/endN fields.  I am not sure how endN fields would be useful.  The startN are probably useful because those can get an extra boost.

I *think* the above config (except for ** fields, which I don't know how to handle) will do the following.
If the input (query string) is "pork", my ngrammer may generate the following uni- and bi-gram tokens:

  p o r k po or rk

And this is how I think they will get mapped to fields and indexed:
word: pork
gram1: p o r k
gram2: po or rk
start1: p **
start2: po **
end1: k **
end2: rk **

Again, not sure how to achieve **.

I haven't actually tried this.  I am only modifying my local example/solr/conf/schema.xml for now, and I haven't actually indexed anything with the above config.

Thoughts/comments?

> Attachments: SOLR-81-ngram.patch

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481759 ] 

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

There is a useless (I think) static IndexReader in there:
    private static IndexReader reader = null;

If we set this to some real IndexReader, we can get the SpellChecker to act as follows (from its coffeedocs):

....
   * @param ir the indexReader of the user index (can be null see field param)
   * @param field String the field of the user index: if field is not null, the suggested
   * words are restricted to the words present in this field.
   * @param morePopular boolean return only the suggest words that are more frequent than the searched word
   * (only if restricted mode = (indexReader!=null and field!=null)
....
  public String[] suggestSimilar(String word, int numSug, IndexReader ir,
      String field, boolean morePopular) throws IOException {

So, should we do this on init:
  reader = req.getSearcher().getReader();
?
Or maybe add a new param to solrconfig.xml's declaration of the SpellCheckerRequestHandler that turns this on/off?

Thoughts?
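
For reference, a toy model of the "restricted mode" described in the javadoc quoted above. It follows only the wording of that javadoc and does not call Lucene; the class name, the filter method, and the fieldFreq map (standing in for per-term frequencies in the chosen field of the user index) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only; the actual SpellChecker implementation may differ in detail.
class RestrictedSuggestSketch {

    // Keeps the candidates allowed in restricted mode:
    //  - the candidate must occur in the field (frequency > 0)
    //  - if morePopular is set, it must be more frequent than the searched word
    static List<String> filter(String word, List<String> candidates,
                               Map<String, Integer> fieldFreq, boolean morePopular) {
        int wordFreq = fieldFreq.getOrDefault(word, 0);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            int freq = fieldFreq.getOrDefault(candidate, 0);
            if (freq == 0) {
                continue; // not present in the field
            }
            if (morePopular && freq <= wordFreq) {
                continue; // not more frequent than the searched word
            }
            kept.add(candidate);
        }
        return kept;
    }
}
```

With a null reader (the current state of the handler), none of this filtering can happen, which is why the static IndexReader as written has no effect.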


> Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12460331 ] 
            
Otis Gospodnetic commented on SOLR-81:
--------------------------------------

Ogün - yes, that Spellchecker class in Lucene's contrib/spellchecker has 1.0f defined as the boost for the last n-gram.  I'm not even sure if that's needed.  I talked to Bob Carpenter (alias-i.com) about it recently, and he said boosting the end ngram doesn't make sense, if I remember correctly.  I'm inclined to go remove that from the source completely.  Thoughts?

I'm unsure about how to integrate the Lucene spellchecker code into Solr, though.  There is no "n-gram tokenizer" per se in the spellchecker extension, so I can't really point NGramFilter config in Solr's schema.xml to anything in that spellchecker library.... I can write my own n-gram Filter, that's not a problem, but you said you made use of the Lucene spellchecker code, and I can't see how to do that.

Did you simply create your own NGramFilter that creates the same ngrams as Spellchecker.java, and then used the Spellchecker.suggest(String word) method *only* for fetching/getting alternative spelling suggestions?



[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Hiatt updated SOLR-81:
---------------------------

    Attachment: SOLR-81-edgengram-ngram.patch

This new patch provides a superset of the functionality of Otis's original patch. Specifically, it includes edge n-gram tokenizers based on Otis's Lucene analyzer contrib. I modified this tokenizer to output edge n-grams in a range of sizes (i.e., you can tokenize the string "abc" with a range of 1-2, resulting in "a", "ab").  This patch also fixes a bug in the n-gram factory and provides some code cleanup.

For clarity's sake, this patch supplants 'SOLR-81-ngram.patch'.
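
The range behaviour described above ("abc" with a range of 1-2 giving "a", "ab") can be sketched in plain Java as follows. This is an illustration only, not the tokenizer from the patch; the class and method names are hypothetical, and it covers front-edge grams only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for an edge n-gram tokenizer; not the patch's code.
class EdgeNGramSketch {

    // Returns the front-edge n-grams of input with lengths minGram..maxGram,
    // shortest first. Sizes longer than the input are skipped.
    static List<String> frontEdgeNGrams(String input, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, input.length());
        for (int n = Math.max(minGram, 1); n <= limit; n++) {
            grams.add(input.substring(0, n)); // prefix of length n
        }
        return grams;
    }
}
```

Calling frontEdgeNGrams("abc", 1, 2) yields the two prefixes from the example in the comment above.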

> Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram.patch

[jira] Resolved: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved SOLR-81.
----------------------------------

    Resolution: Fixed

Thanks Hoss, finito!


> Attachments: hoss.spell.patch, SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch

[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Attachment: SOLR-81-ngram.patch

- Fixed SpellCheckerRequestHandler (needed that init method)
- Fixed SpellCheckerRequestHandler name in solrconfig.xml
- Removed unused field types, fields, and copy fields from schema.xml

The indexing part has a bug (will bring it to solr-user in a moment).

> Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477547 ] 

Yonik Seeley commented on SOLR-81:
----------------------------------

Is spelling check normally going to be integrated into the "main" index, or will it normally be a separate index?
If the latter, does it make more sense for some of this (the field definitions & handler) to be in contrib instead of core?

Any other way to avoid "cluttering" the current schema.xml?

If spelling check is to be a core feature (that one can turn on for any field in any index), it seems like it needs to be easier to configure.  Having the user define all the ngram fields, fieldTypes, and copyField statements doesn't seem ideal.

If, however, this is more of a "configuration" of solr used for spell-checking, it might make more sense for contrib.

> Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479670 ] 

Hoss Man commented on SOLR-81:
------------------------------

Okay, assuming what we are talking about is adding the existing Lucene SpellChecker as a hook into a Solr instance, where the dictionary may be built externally, or it may be built based on the main source index, here are my comments based on the most recent patch (from Adam ... as i recall it already incorporates most of Otis's stuff)


1) we should definitely move the *NGramTokenizerFactories into a separate issue since they don't come into play here.

2) when configuring where the SpellChecker Directory lives, we should probably support three options:  a path relative to dataDir (so the regular replication scripts can copy a SpellChecker index from a master to slave right along with the main index), an absolute path, or a RAMDirectory

3) it seems like the functionality present in SpellCheckerRequestHandler and SpellCheckerCommitRequestHandler should all be in one request handler (where special query time input triggers the rebuild).  that way no redundant configuration is required.  There should also be an option for "reloading" the SpellChecker instance from disk (ie: reopening its IndexReader) without rebuilding -- which would be useful for people who are (re)building the SpellChecker index external from Solr and need a way to tell Solr to start using the new one

A key use case i'm imagining is that a master could have a postCommit listener configured to ping "qt=spell&rebuild=true" after each commit, while a slave could have "qt=spell&rebuild=true" to pick up the changed SpellCheck index.

4) it's not really safe to trust this...

+        IndexReader indexReader = req.getSearcher().getReader();
+        Dictionary dictionary = new LuceneDictionary(indexReader, termSourceField);

...the source field might be encoded in some way.  We really need a subclass of LuceneDictionary that knows about the IndexSchema/FieldType of termSourceField to extract the Readable value from the indexed terms (either that or we go ahead and feed SpellChecker the raw terms, and then at query time run the user's (misspelled) input through the query time analyzer for termSourceField before passing it to spellChecker.suggestSimilar)

5) we definitely shouldn't have a "private static SpellChecker" it should be possible to have multiple SpellChecker instances (from different dictionaries) just by registering multiple instances of the handler .. at first glance this seems like it might make adding SpellChecking functionality to the other requesthandlers hard .. except that they can call core.getRequestHandler(name) ... so we can still add code to other request handlers so that they can be configured to ask for a SpellCheckerRequestHandler by name, and delegate some spell checking functionality to it.

6) as far as configuring things like spellcheckerIndexDir and termSourceField, there's no reason to do that in the "invariants" list .. that's really for things that the code allows to be query time params, but the person configuring Solr doesn't want query clients to be able to specify it.  separate init params can be used and accessed directly from the init method (just like XSLTResponseWriter)...

+    <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler">
+        <!-- default values for query parameters -->
+        <lst name="defaults">
+            <str name="echoParams">explicit</str>
+            <int name="suggestionCount">1</int>
+            <float name="accuracy">0.5</float>
+        </lst>
+
+        <!-- main init params for handler -->
+        <str name="spellcheckerIndexDir">/tmp/spellchecker</str>
+        <str name="termSourceField">word</str>
+    </requestHandler>
+

7) adding a special field to the example schema (and example docs) to demonstrate building the SpellChecker index is a bit confusing ... we should just build the dictionary off of an existing field that contains data (like "name" or "text") to demonstrate the common use case.
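
As a rough illustration of the three directory options in point 2, the following Java sketch resolves a configured location. It uses only java.nio and does not touch Lucene; the class name, the resolve method, and the convention that a missing value means "use an in-memory (RAMDirectory) index" are all assumptions made for this example, not something the patch defines.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; the "no path means RAM" convention is an assumption.
class SpellIndexLocationSketch {

    // Returns the on-disk location for the spellchecker index, or null when
    // nothing is configured (the caller would then use an in-memory index).
    static Path resolve(String configured, Path dataDir) {
        if (configured == null || configured.isEmpty()) {
            return null; // option 3: in-memory index
        }
        Path path = Paths.get(configured);
        if (path.isAbsolute()) {
            return path; // option 2: absolute path, used as given
        }
        // option 1: relative path, placed under dataDir so that replication
        // of dataDir would carry the spellchecker index along
        return dataDir.resolve(path).normalize();
    }
}
```

On a POSIX system, a configured value of "spellchecker" with a dataDir of /var/solr/data would resolve under that dataDir, while "/tmp/spellchecker" would be returned unchanged.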



> Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477568 ] 

Adam Hiatt commented on SOLR-81:
--------------------------------

> Is spelling check normally going to be integrated into the "main" index, or will it normally be a separate index?
AH: It is a separate index.

> If the latter, does it make more sense for some of this (the field definitions & handler) to be in contrib instead of core? 
AH: That would be fine by me. However, it should be noted that it can be turned on for any field.

> Any other way to avoid "cluttering" the current schema.xml?
> If spelling check is to be a core feature (that one can turn on for any field in any index), it seems like it needs to be easier to configure. Having the user 
> define all the ngram fields, fieldTypes, and copyField statements doesn't seem ideal. 
AH: I think there is some confusion over Otis's version and mine. I was never able to get Otis's version (single index using ngram types + copyFields) working fully, so I went with the pure SpellChecker implementation that doesn't require any of that (no schema.xml additions). It just needs the user to use a custom request handler to query for spelling corrections (Otis wrote the original) and a custom commit handler (based on CommitRequestHandler) to rebuild the spell checker index.

For the record, the version I committed is: https://issues.apache.org/jira/secure/attachment/12352485/SOLR-81-spellchecker.patch


[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12460397 ] 
            
Erik Hatcher commented on SOLR-81:
----------------------------------

The way the spell checker works is to have a separate spell checking index.  This could be integrated into Solr with a custom cache that builds the dictionary index into a RAMDirectory.  I've done this in Collex for AJAX suggestions.  Will it scale?  I'm not sure, but I suspect for many Solr installations it'd fit into RAM just fine.  Tie in a custom request handler (and underlying Util class like highlighting, etc) and you're all set!   :)

What have I overlooked or oversimplified?


[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483804 ] 

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

I haven't applied this and tried it, but I looked at the patch, and like the changes.  The only issues I could spot are 3 typos that we can clean up later.


[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/SOLR-81?page=all ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Comment: was deleted


[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Attachment: SOLR-81-ngram-schema.patch

schema.xml changes:
- to make use of NGramTokenizerFactory and EdgeNGramTokenizerFactory
- to define some <field>s and some <copyField>s to be used with Lucene SpellChecker.



[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/SOLR-81?page=all ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Attachment: SOLR-81-ngram.patch

This patch contains 3 new classes for org.apache.solr.analysis:
1. NGramTokenizerFactory
2. NGramTokenizer
3. NGramTokenizerTest (all tests pass)

I *think* the above can be configured in schema.xml as follows:

    <fieldtype name="wordField" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramTokenizerFactory"/>
      </analyzer>
    </fieldtype>

And I *believe* the following fields would have to be defined (to match the fields in Spellchecker.java):
<field name="word"   type="string" indexed="true" stored="true" multiValued="false"/>
<field name="start1" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="end1"   type="string" indexed="true" stored="true" multiValued="false"/>
<field name="start2" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="end2"   type="string" indexed="true" stored="true" multiValued="false"/>
<field name="start3" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="end3"   type="string" indexed="true" stored="true" multiValued="false"/>
<field name="start4" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="end4"   type="string" indexed="true" stored="true" multiValued="false"/>
<field name="gram1"  type="string" indexed="true" stored="true" multiValued="false"/>
<field name="gram2"  type="string" indexed="true" stored="true" multiValued="false"/>
<field name="gram3"  type="string" indexed="true" stored="true" multiValued="false"/>
<field name="gram4"  type="string" indexed="true" stored="true" multiValued="false"/>

c.f. http://wiki.apache.org/jakarta-lucene/SpellChecker

What I'm not sure about is how I'll get Solr to put the right ngrams into the right fields (defined above and also as a set of copyFields).
For example, if the input (query string) is "pork", my ngrammer may generate the following uni- and bi-gram tokens:

  p o r k po or rk

The following should then happen:
word: pork
start1: p
start2: po
gram1: p o r k
gram2: po or rk
end1: k
end2: rk

Not sure how to accomplish that...
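
A minimal sketch of the field assignment described above, written outside of Solr and assuming the whole word is available as a String. The class name NGramFields and its build method are hypothetical and do not come from any SOLR-81 patch; the sketch only shows which n-grams would land in which startN/gramN/endN field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any SOLR-81 patch: splits one word into
// the per-size fields (startN, gramN, endN) used by the n-gram documents.
class NGramFields {

    // Returns field name -> value for every gram size in [minGram, maxGram]
    // that is not longer than the word itself.
    static Map<String, String> build(String word, int minGram, int maxGram) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("word", word);
        for (int n = minGram; n <= maxGram && n <= word.length(); n++) {
            // Slide a window of width n across the word.
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
            fields.put("start" + n, grams.get(0));                 // first gram
            fields.put("gram" + n, String.join(" ", grams));       // all grams
            fields.put("end" + n, grams.get(grams.size() - 1));    // last gram
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints one "field: value" line per generated field.
        build("pork", 1, 2).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Doing the split in one place like this sidesteps the copyField question, at the cost of n-gramming outside the analyzer chain, which is what the issue description originally proposed.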



[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478277 ] 

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

Adam:

I can merge our patches to produce a unified one.

NOTE:
The SpellCheckerCommitRequestHandler assumes that:
  a) one wants to populate the spellchecker index with data from another Lucene index.
  b) the Lucene index to be used for populating is available on the same box where the spellchecker service is running.

I think both a) and b) are good - let those who want this functionality have it.
However, some may not be able to live with these assumptions (e.g. one may want to have a server dedicated to spellchecker service, and may not want to push the source Lucene index to the spellchecker box.)  For those people, the approach that includes schema.xml modifications will be required, unless I'm missing something.  Am I?

Also, I think this is a mistake:

accuracy = p.getFloat("accuracy", DEFAULT_NUM_SUGGESTIONS);

You probably wanted DEFAULT_ACCURACY there, but that doesn't exist yet, so I'll fix that.
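
To illustrate why this slips through, here is a small sketch with a hypothetical Map-based parameter lookup standing in for the handler's params object. The class name, the getFloat helper, and the value given to DEFAULT_NUM_SUGGESTIONS are assumptions made for illustration; only the 0.5 accuracy default is taken from this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the handler's init-param parsing. A plain Map
// replaces Solr's params object so the sketch compiles on its own.
class SpellCheckerDefaults {
    static final int DEFAULT_NUM_SUGGESTIONS = 1;  // assumed value, for illustration
    static final float DEFAULT_ACCURACY = 0.5f;    // matches the 0.5 used in this thread

    // Returns the named parameter as a float, or def when it is absent.
    static float getFloat(Map<String, String> params, String name, float def) {
        String raw = params.get(name);
        return raw == null ? def : Float.parseFloat(raw);
    }

    public static void main(String[] args) {
        Map<String, String> p = new HashMap<>();
        // Buggy form: the int constant silently widens to float, so it compiles.
        float wrong = getFloat(p, "accuracy", DEFAULT_NUM_SUGGESTIONS);
        // Intended form: a dedicated float default.
        float right = getFloat(p, "accuracy", DEFAULT_ACCURACY);
        System.out.println(wrong + " vs " + right);
    }
}
```

Because Java widens int to float implicitly, the compiler accepts the wrong constant without a warning, so a separate float constant is the only real guard.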



[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478900 ] 

Adam Hiatt commented on SOLR-81:
--------------------------------

In essence, point 1) is true. However, the way I have been using the SpellChecker index allows for the user to have a standalone spell checker as well as piggy-backing it off a primary index. 

Point 2) prevents the second use case I mentioned and also limits what can be done with the SpellChecker. 

WRT the issue of the NGram/EdgeNGram tokenizers: These should probably be split out into a separate patch/issue as they are not critical to the implementation. 

I like the idea of providing the SpellChecker index access functionality as a contrib that can be accessed from any RequestHandler, but it is useful to have a separate RequestHandler that can just provide spell checking functionality alone.


Re: [jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by Chris Hostetter <ho...@fucit.org>.

: Yeah, I've used the Lucene-based spellchecker before, I just never had
: to hook it up with Solr.  At this point I'm not interested in the fancy
: stuff (cache, RAMDir...), I just want to figure out how to configure it
: via schema.xml...

But the crux of the issue is that if you are maintaining a second index
inside your base Solr installation for the purposes of the Spellchecker
class, then you don't want or need to configure it in schema.xml -- it
lives outside the schema space.

I pointed this out the last time spellchecking came up: there are two
extremely different approaches involved when you talk about "implementing
a spelling/suggestion service with Solr"...

In the first approach, the main Solr index *is* the suggestion index ...
each Document represents a suggested word, with one stored field telling
you what the word is, and indexed fields containing the ngrams.  You could
populate this index from any initial source: a dictionary, logs of popular
query terms, or a dump of all terms in your corpus.  At query time, your
application would query this index separately from querying your "main"
Solr index containing your domain specific data.
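
As a rough illustration of the idea behind such a suggestion index (one entry per word, found through the n-grams it contains), here is an in-memory sketch. GramSuggester is a hypothetical class, not Solr or Lucene code; it ranks candidates only by the number of shared grams, whereas the Lucene SpellChecker also rescoring candidates by string distance is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the suggestion index: each "document"
// is a word, looked up through the n-grams it contains.
class GramSuggester {
    private final int n;
    private final Map<String, Set<String>> wordsByGram = new HashMap<>();

    GramSuggester(int n) { this.n = n; }

    // All distinct n-grams of a word.
    static Set<String> grams(String word, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // "Indexes" a word under each of its grams.
    void add(String word) {
        for (String g : grams(word, n)) {
            wordsByGram.computeIfAbsent(g, k -> new HashSet<>()).add(word);
        }
    }

    // Returns up to max words, most shared grams first (ties alphabetical),
    // excluding the query itself.
    List<String> suggest(String query, int max) {
        Map<String, Integer> hits = new HashMap<>();
        for (String g : grams(query, n)) {
            for (String w : wordsByGram.getOrDefault(g, Collections.<String>emptySet())) {
                if (!w.equals(query)) {
                    hits.merge(w, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(hits.keySet());
        ranked.sort((a, b) -> {
            int byHits = Integer.compare(hits.get(b), hits.get(a));
            return byHits != 0 ? byHits : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(max, ranked.size())));
    }

    public static void main(String[] args) {
        GramSuggester s = new GramSuggester(3);
        s.add("lettuce");
        s.add("letter");
        s.add("tuna");
        System.out.println(s.suggest("letuce", 2));
    }
}
```

The word list could come from any of the sources mentioned above (a dictionary, query logs, or a term dump); the lookup itself does not care where the words originated.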

The second approach is to have the spelling/suggestion index live inside
of your Solr installation, side by side with your main domain specific
index, so your Request Handler can talk to it directly, and it can be
populated directly using the terms in your corpus -- this sounds like the
approach you are taking, but in this approach there is no need for your
schema.xml to know anything about the index ... just use the SpellChecker
class as is: construct it with an empty RAMDirectory and call
indexDictionary on a LuceneDictionary pointed at your main Solr index.
The only code you really need to write is something to run clearIndex and
indexDictionary as a newSearcher hook (the easiest way probably being to
hang your SpellChecker instance off of a single element Solr cache and
write a Regenerator).

But like I said: you don't need to worry about making the schema know
about your ngrams -- you do that if you're going for the first approach.



-Hoss

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12460402 ] 
            
Otis Gospodnetic commented on SOLR-81:
--------------------------------------

Yeah, I've used the Lucene-based spellchecker before, I just never had to hook it up with Solr.  At this point I'm not interested in the fancy stuff (cache, RAMDir...), I just want to figure out how to configure it via schema.xml...


[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Hiatt updated SOLR-81:
---------------------------

    Attachment: SOLR-81-spellchecker.patch

Good call on the DEFAULT_ACCURACY constant. BTW it should probably be .5.

As for:
> The SpellCheckerCommitRequestHandler assumes that:
>  a) one wants to populate the spellchecker index with data from another Lucene index.
>  b) the Lucene index to be used for populating is available on the same box where the spellchecker service is running. 

This does not necessarily have to be true (well, a) sort of has to be true). The way I've been testing this is to make my primary index an index of search terms + related metadata. The SpellChecker simply creates a separate index for the pieces it needs to do its work. In essence this is a standalone spellchecker. However, as you note, this same setup allows for the primary index to be any field. Can you see a downside to this approach?



[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478302 ] 

Adam Hiatt commented on SOLR-81:
--------------------------------

BTW updated patch added.


[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473813 ] 

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

Adam:
Please look at LUCENE-759.  That incorporates your patch, fixes a bug I found in it, and introduces a new bug, so we are not too bored with bug-free code.  Any idea how to extract that last n-gram when using Side.BACK?
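
For the Side.BACK question, one possible way to think about it is sketched below. This is not the LUCENE-759 code: EdgeGrams and its Side enum are hypothetical, and the sketch assumes the whole token is already buffered as a String, which a tokenizer reading from a stream would have to arrange first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the LUCENE-759 code: edge n-grams taken from
// either end of a fully buffered word.
class EdgeGrams {
    enum Side { FRONT, BACK }

    // Returns one gram per size in [minGram, maxGram], skipping sizes
    // longer than the word.
    static List<String> edgeGrams(String word, Side side, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        int len = word.length();
        for (int n = minGram; n <= maxGram && n <= len; n++) {
            // FRONT grams start at offset 0; BACK grams end at the last character,
            // so their start offset depends on the total length.
            int start = (side == Side.FRONT) ? 0 : len - n;
            out.add(word.substring(start, start + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(edgeGrams("lettuce", Side.FRONT, 3, 4));
        System.out.println(edgeGrams("lettuce", Side.BACK, 3, 4));
    }
}
```

The point of the sketch is that the BACK offset cannot be computed until the input length is known, which may be why the last gram is awkward to extract while still reading.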



[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Attachment: SOLR-81-ngram.patch

Here is a new patch that should include everything:
- New SpellCheckerRequestHandler
- Updated solrconfig.xml that defines the above handler and maps it to /spellechecker
- New NGramTokenizerFactory
- New EdgeNGramTokenizerFactory (includes Adam's changes)
- Modified BaseTokenizerFactory (used by the above 2 factories)
- Updated schema.xml that configures the above 2 factories to tokenize the input words into n-grams as required by the Lucene contrib SpellChecker (see http://wiki.apache.org/jakarta-lucene/SpellChecker for examples of n-grams used by the SpellChecker)



[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Ogün Bilge (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12457871 ] 
            
Ogün Bilge commented on SOLR-81:
--------------------------------

I have created an NGramFilter that generates those gram fields from the "word" field.
It is configured in schema.xml simply by defining field types and using the 
copyField directive. 
The generated documents can be used with the Lucene spellchecker extension to fetch a 
suggested word.
Unfortunately, I wrote this extension during working hours, so I can't apply the 
ASF license to it, but if you have questions, please ask.
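
For readers following along, here is a minimal sketch of how the start/gram/end values from the issue description can be derived from a single word. It is not the NGramFilter mentioned above (that code was never attached); the class and method names are made up for illustration, and the only thing it tries to match is the "lettuce" example in the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: derives the start/gram/end values for one word. */
final class NGramFieldsSketch {

    /** All contiguous substrings of length n, left to right. */
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    /** First n characters, or "" when the word is shorter than n. */
    static String start(String word, int n) {
        return word.length() >= n ? word.substring(0, n) : "";
    }

    /** Last n characters, or "" when the word is shorter than n. */
    static String end(String word, int n) {
        return word.length() >= n ? word.substring(word.length() - n) : "";
    }

    public static void main(String[] args) {
        String word = "lettuce";
        // Print the same fields as the <doc> example: start3/gram3/end3, start4/gram4/end4.
        for (int n = 3; n <= 4; n++) {
            System.out.println("start" + n + " = " + start(word, n));
            System.out.println("gram" + n + "  = " + String.join(" ", grams(word, n)));
            System.out.println("end" + n + "   = " + end(word, n));
        }
    }
}
```

Returning an empty string for words shorter than n is just one possible choice; a real filter might skip the field entirely instead.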



[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484232 ] 

Hoss Man commented on SOLR-81:
------------------------------

I've committed my last patch with a few changes:

1) support for directories relative to dataDir, since SOLR-182 was committed - used this in the example solrconfig.xml
2) cleaned up some typos (thanks for reminding me, Otis)
3) whitespace reformatted (separate commit so diffs are easier to follow)

I think things are in a pretty good, generally usable state now ... Otis: how do you feel about resolving? (possibly opening new enhancement Jira issues for some of the other things discussed, like the reader idea, and copying the FSDirectory into a RAMDirectory?)

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: hoss.spell.patch, SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch
>
>

[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Hiatt updated SOLR-81:
---------------------------

    Attachment: SOLR-81-spellchecker.patch

This patch was developed off of Otis's previous patch. It fixes a numSuggestion bug + adds an accuracy argument for the spellchecker + adds a commit handler for updating the spell correction index. It removes the n-gram generation from the spell correction index generation because that isn't actually needed to build the index (SpellChecker does that for you).

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch
>
>

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12458052 ] 
            
Otis Gospodnetic commented on SOLR-81:
--------------------------------------

Something like this, then?

    <fieldtype name="queryString" class="solr.TextField" positionIncrementGap="1">
      <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory"/>  <!-- Or maybe just make an NGramAnalyzer? -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

Plus:

<copyField source="word" dest="word_start1"/>
<copyField source="word" dest="word_end1"/>
<copyField source="word" dest="word_start2"/>
<copyField source="word" dest="word_end2"/>
<copyField source="word" dest="word_start3"/>
<copyField source="word" dest="word_end3"/>
<copyField source="word" dest="word_gram1"/>
<copyField source="word" dest="word_gram2"/>
<copyField source="word" dest="word_gram3"/>
<copyField source="word" dest="word_gram4"/> 

I'd probably also want to give those word_start* n-grams some boost, though I don't see how to do that in schema.xml yet.



[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480660 ] 

Hoss Man commented on SOLR-81:
------------------------------

Otis: haven't had a chance to look at your newest patch yet, but just to clarify my comment#4... In the last patch i looked at, LuceneDictionary could be used to build the dictionary based on a field name from the index -- but this will only work for simple String or TextFields.

Theoretically, someone could write a ROT132FieldType that munges up the field values stored in it. If you were to try to build a SpellChecker index from such a field, nothing good would come of just using LuceneDictionary (because of the way it uses the raw TermEnum) ... but since we have the IndexSchema, we can get the FieldType for the field name we want to use, and then calling the "indexedToReadable" method on each indexed term will give you the "plain text" version.

it's a minor thing, but it's a good thing to take into account.

Alternately, we can just document that it doesn't make sense to use any field type except "StrField" (even TextField doesn't really make sense, since we can't anticipate what the Analyzer might have done).
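
To make the field type concern concrete, here is a small self-contained sketch. It does not use Solr's or Lucene's classes: SketchFieldType is a hypothetical stand-in for a schema field type, with a ROT13 variant playing the part of a type that munges its values, and dictionaryWords is an invented helper. The point is only that raw indexed terms need to go through something like indexedToReadable before they are usable as dictionary words.

```java
import java.util.ArrayList;
import java.util.List;

final class ReadableTermsSketch {

    /** Hypothetical stand-in for a schema field type; NOT Solr's FieldType class. */
    interface SketchFieldType {
        String toIndexed(String readable);
        String indexedToReadable(String indexed);
    }

    /** A simple string-like type: the indexed form is the readable form. */
    static final SketchFieldType PLAIN = new SketchFieldType() {
        public String toIndexed(String readable) { return readable; }
        public String indexedToReadable(String indexed) { return indexed; }
    };

    /** A type that munges values on the way into the index (ROT13 is its own inverse). */
    static final SketchFieldType ROT13 = new SketchFieldType() {
        public String toIndexed(String readable) { return rot13(readable); }
        public String indexedToReadable(String indexed) { return rot13(indexed); }
    };

    static String rot13(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                sb.append((char) ('a' + (c - 'a' + 13) % 26));
            } else if (c >= 'A' && c <= 'Z') {
                sb.append((char) ('A' + (c - 'A' + 13) % 26));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Decodes each raw indexed term before it is treated as a dictionary word. */
    static List<String> dictionaryWords(List<String> indexedTerms, SketchFieldType type) {
        List<String> words = new ArrayList<>();
        for (String term : indexedTerms) {
            words.add(type.indexedToReadable(term));
        }
        return words;
    }

    public static void main(String[] args) {
        // What the index would hold for the word "lettuce" under the munging type.
        List<String> rawTerms = List.of(ROT13.toIndexed("lettuce"));
        System.out.println("raw terms:     " + rawTerms);
        System.out.println("decoded words: " + dictionaryWords(rawTerms, ROT13));
    }
}
```

Reading the raw terms directly corresponds to treating every field as PLAIN, which is only safe for simple string-like types.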


> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch
>
>

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Ogün Bilge (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12458072 ] 
            
Ogün Bilge commented on SOLR-81:
--------------------------------

Yes, exactly as you described. 
The field names for the gram fields have to be like this to work with SpellChecker: 

<copyField source="word" dest="start1"/>
<copyField source="word" dest="end1"/>
<copyField source="word" dest="start2"/>
<copyField source="word" dest="end2"/>
<copyField source="word" dest="start3"/>
<copyField source="word" dest="end3"/>
<copyField source="word" dest="start4"/>
<copyField source="word" dest="end4"/>
<copyField source="word" dest="gram1"/>
<copyField source="word" dest="gram2"/>
<copyField source="word" dest="gram3"/>
<copyField source="word" dest="gram4"/> 

An NGramAnalyzer is enough; I used a WhitespaceTokenizer as the tokenizer, some other filters, and finally the 
NGramAnalyzer.

Unfortunately, the SpellChecker implementation is in a beta state, imho. It is not possible to set your own boost factors 
for the start or end grams. In the current release they are fixed at a factor of 2 for start grams and 1 for end grams.




[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-81:
-------------------------

    Attachment: hoss.spell.patch


patch makes a few changes, if there are no objections i'll try to commit this on monday....

 * fixed NPE if no q param (when using cmd)
 * fixed schema.xml to know about "words" field in spellchecker.xml
 * cmd=rebuild needs to be disabled if termSourceField is null
 * added "cmd=reopen" for people maintaining the spell index externally.
 * added support for ramDir based spell index.
 * can't do relative path to dataDir, because we can't get dataDir,
   because SolrCore isn't done initializing yet.
 * added more explanation to solrconfig.xml about meaning of params,
   and changed the default values to work for anyone (using ramdir)
 * I punted on the issue of field type encoding by making it clear in
   the solrconfig.xml comments that termSourceField needs to use a simple
   field type

Remaining issues...

 * should we add a firstSearcher or newSearcher hook to rebuild in
   the example solrconfig.xml ?
 * i don't have an opinion about passing an IndexReader to
   suggestSimilar, if we want to do that it shouldn't be a static reader,
   it should come from the current request ... in the meantime i changed
   the name of the current one to "nullReader" so it's clear what it is.
 * the indenting is currently a hodgepodge of 2 spaces vs 4 spaces ...
   i'll fix after committing (trying to keep the patch easy to read for now)

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: hoss.spell.patch, SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch
>
>

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Thomas Peuss (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511989 ] 

Thomas Peuss commented on SOLR-81:
----------------------------------

Hello Otis!

What happened to the TokenFilters included in the patch? They are in the patch but in trunk I don't see them.

CU
Thomas

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: hoss.spell.patch, SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch
>
>

[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/SOLR-81?page=all ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Attachment:     (was: SOLR-81-ngram.patch)

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: http://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-ngram.patch
>
>

[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Attachment: SOLR-81-ngram.patch

I think this is the final version - tested it, and it works.
I'd like to commit this next week, so please have a look if you have time.

This is what's in the patch:

A      lib/lucene-spellchecker-2.2-dev.jar
A      lib/lucene-analyzers-2.2-dev.jar
A      src/java/org/apache/solr/analysis/NGramTokenFilterFactory.java
M      src/java/org/apache/solr/analysis/BaseTokenFilterFactory.java
A      src/java/org/apache/solr/analysis/EdgeNGramTokenFilterFactory.java
A      src/java/org/apache/solr/request/SpellCheckerRequestHandler.java
C      example/solr/conf/schema.xml
M      example/solr/conf/solrconfig.xml

I think you can ignore that "C" -- there is no conflict in the file, actually.

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch
>
>

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478893 ] 

Hoss Man commented on SOLR-81:
------------------------------

Looking over both Otis's patches and Adam's patches for the first time, I find myself really confused.

As previously discussed in email, there are two completely different approaches that could be taken to achieve "spell correction" using Solr:

1) Use something like the Lucene SpellChecker contrib to make suggestions based on the data in the main solr index (defined by the solr schema) ... adding hooks to Solr to keep the SpellChecker system aware of changes to the main index, and hooks to allow request handlers to return suggestions with each query

2) Use the main solr index (defined by the schema) to store the dictionary of words, turning the entire solr instance into one giant SpellChecker.  In this case there would be a recommended schema.xml for users who want to set up a SpellChecker Solr instance, and possibly a custom RequestHandler that assumes you are using this schema.


These two patches both seem to be dealing with case #1, but they have hints of approach #2 ... for example, I don't entirely understand why they include the NGram token filter factories, since they don't seem to need the fields of the solr index to be tokenized in any special way (since the Lucene SpellChecker controls the format of its dictionary).   It's also not clear to me what the purpose of the SpellCheckerRequestHandler is ... if the main index is storing "real" user records, then wouldn't a helper method that existing request handlers (like dismax and standard) can optionally call to get the SpellChecker data be more useful?

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch
>
>

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481738 ] 

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

This is in SVN now, but I'm going to leave this open for another week, in case Hoss, Adam, or anyone else finds any issues.


> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch
>
>

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480409 ] 

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

Adam:
Have you started making the changes that Hoss proposed here?  Please let me know (today, if you can).  If you have not started, I'll make the changes.  If you've started, I'll hold off.

Hoss & Adam:

1) out with tokenizer factories - right, they are no longer needed.

2) I'll stick to the absolute path for now, get that in SVN, and then we can add support for other things... unless you show me an example of how easy it is to support other paths/locations

3) merging the handlers sounds ok:
  to get suggestions: ...?qt=spellchecker&cmd=suggest 
  to completely rebuild: ...?qt=spellchecker&cmd=rebuild
OK?
The use-case here is to rebuild the index every once in a while, *not* on every change of the main index.

4) I'll leave that for later, as I don't completely understand you there.

5) ok, no static SpellChecker

6) ok, sounds like we just need to remove the wrapping <lst name="invariants"> element

7) I actually liked having a separate example doc for demonstrating just the spellchecker functionality -- you don't have to know about those other documents/fields/values.  But if both Adam and Hoss think differently, we should go with the majority's opinion.
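
As a rough illustration of the merged-handler idea in point 3, a client might build the two request URLs along these lines. This is only a sketch: the base URL is an assumed local example instance, the class and method names are invented, and the "q" parameter carrying the word to check is an assumption based on its mention elsewhere in this thread rather than something spelled out in the proposal above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SpellcheckUrlSketch {

    // Assumed base URL for a local example instance; adjust for a real deployment.
    static final String BASE = "http://localhost:8983/solr/select";

    /** URL asking the spellchecker handler for suggestions for one word or phrase. */
    static String suggestUrl(String word) {
        // "q" is assumed to carry the input; it is URL-encoded so spaces etc. survive.
        return BASE + "?qt=spellchecker&cmd=suggest&q="
                + URLEncoder.encode(word, StandardCharsets.UTF_8);
    }

    /** URL asking the handler to completely rebuild the spelling index. */
    static String rebuildUrl() {
        return BASE + "?qt=spellchecker&cmd=rebuild";
    }

    public static void main(String[] args) {
        System.out.println(suggestUrl("lettuse"));
        System.out.println(rebuildUrl());
    }
}
```

Since the rebuild is meant to run only every once in a while, it would be triggered by something like a scheduled job hitting the rebuild URL, not by each update to the main index.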


> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch
>
>

[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12470295 ] 

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

Adam,

I took a look at your patch.  It looks like you brought over (copied) various n-gram tokenizer classes and their unit tests that I put in Lucene's contrib/analyzers/.... .  Did you do this on purpose?  I intentionally put those n-gram tokenizers under Lucene's contrib, as they are generic and not Solr-specific.  Thus, the only classes my patch has are classes that are Solr-specific:

src/java/org/apache/solr/analysis/EdgeNGramTokenizerFactory.java
src/java/org/apache/solr/analysis/NGramTokenizerFactory.java
src/java/org/apache/solr/analysis/BaseTokenizerFactory.java

And instead of copying the source classes from Lucene's contrib/analyzers/.... it adds the new jar built from those sources:
lib/lucene-analyzers-2.1-dev.jar

Plus:
lib/lucene-spellchecker-2.1-dev.jar
example/solr/conf/schema.xml

I have some locally modified code for this issue, that was not a part of the first patch.  I wanted to attach the updated patch assuming you didn't really want those few generic tokenizer classes copied from Lucene over to Solr, but because changes are now in two places, so to speak, let's do this to unify our work:

Could you please:
- open a new LUCENE issue or just reopen the one where I originally attached this code and post your patch to the Lucene tokenizers there.
- prepare a new patch for this issue and make sure it only contains Solr-specific classes (see above), plus those 2 Jars.  

I'll upload my patch for schema.xml, so you can see my config (your patch didn't have this), and make sure your changes to the code are in sync with that.

Finally, are you making use of this code somehow already?
One thing that is completely missing from this patch is the RequestHandler that knows how to take the input (a query string), and get suggestions for alternative spellings via a SpellChecker instance.  I have some NGramRequestHandler code locally, but the code is unfinished.


> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: https://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-edgengram-ngram.patch, SOLR-81-ngram.patch


[jira] Updated: (SOLR-81) Add Query Spellchecker functionality

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/SOLR-81?page=all ]

Otis Gospodnetic updated SOLR-81:
---------------------------------

    Attachment: SOLR-81-ngram.patch

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: http://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-ngram.patch
