Posted to solr-dev@lucene.apache.org by "Adam Hiatt (JIRA)" <ji...@apache.org> on 2007/03/29 01:53:25 UTC

[jira] Created: (SOLR-199) N-gram

N-gram
------

                 Key: SOLR-199
                 URL: https://issues.apache.org/jira/browse/SOLR-199
             Project: Solr
          Issue Type: New Feature
          Components: search
            Reporter: Adam Hiatt
            Priority: Trivial


This tracks the creation of a patch that adds the n-gram/edge n-gram tokenizing functionality that was initially part of SOLR-81 (spell checking). This was taken out because the Lucene SpellChecker class removed this dependency. Nonetheless, I think this is useful functionality and the addition is trivial. How does everyone feel about such an addition?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-199) N-gram

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492096 ] 

Otis Gospodnetic commented on SOLR-199:
---------------------------------------

Adam, I think Yonik is just saying that the n-gram stuff I added to Lucene's contrib/analyzers was added after 2.1 was released, so we'd need a version of that jar from the trunk at this time.  I see mentions of Solr 1.2, so perhaps we can grab the 2.2-dev version of that jar and add it to Solr starting with release 1.2?

Question: How is the spellchecker you are writing, or considering writing, going to be different from or better than the one in contrib/spellchecker?



[jira] Commented: (SOLR-199) N-gram

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491023 ] 

Yonik Seeley commented on SOLR-199:
-----------------------------------

Since there is no impact or even memory overhead if unused, and just a teeny bit of disk overhead, this patch looks fine to me.


[jira] Updated: (SOLR-199) N-gram

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Hiatt updated SOLR-199:
----------------------------

    Attachment: SOLR-199-n-gram.patch

This is the new patch, not just cut out of SOLR-81...

I removed references to the Base class and fixed the edge n-gram bug.


[jira] Closed: (SOLR-199) N-gram

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley closed SOLR-199.
-----------------------------

    Resolution: Fixed

Thanks Adam, I just committed this.


Re: [jira] Commented: (SOLR-199) N-gram

Posted by Adam Hiatt <ad...@cnet.com>.

Good point. That looks flat out broken.
-- Adam




On May 1, 2007, at 2:16 PM, Chris Hostetter wrote:

>
> : > NGramTokenizerFactory is referring to constants from, and constructing
> : an instance of, EdgeNGramTokenizer
>
> : Are you saying that this worries you because it is referenced in the example
> : schema and will thus break without the lucene-analyzers package? I do
> : agree that this example should probably be taken out for the time being
> : (at the least).
>
> no, i'm saying that the class NGramTokenizerFactory does not produce
> instances of NGramTokenizer ... it produces instances of
> EdgeNGramTokenizer (which is coincidentally what EdgeNGramTokenizerFactory
> does as well)
>
>
> -Hoss

Re: [jira] Commented: (SOLR-199) N-gram

Posted by Chris Hostetter <ho...@fucit.org>.

: > NGramTokenizerFactory is referring to constants from, and constructing
: an instance of, EdgeNGramTokenizer

: Are you saying that this worries you because it is referenced in the example
: schema and will thus break without the lucene-analyzers package? I do
: agree that this example should probably be taken out for the time being
: (at the least).

no, i'm saying that the class NGramTokenizerFactory does not produce
instances of NGramTokenizer ... it produces instances of
EdgeNGramTokenizer (which is coincidentally what EdgeNGramTokenizerFactory
does as well)


-Hoss
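
As a rough illustration of why this mix-up matters, the sketch below contrasts the two kinds of output in plain Java. The class and method names are hypothetical stand-ins, not the Lucene tokenizers or the Solr factories from the patch, and the real tokenizers may emit grams in a different order. The point is only that a factory meant to build an n-gram tokenizer but building an edge n-gram tokenizer would silently drop every gram that does not start at offset 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the Lucene or Solr classes.
final class NGramVersusEdgeSketch {

    // All substrings of length minGram..maxGram, starting at every offset.
    static List<String> nGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= maxGram; len++) {
            for (int start = 0; start + len <= text.length(); start++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    // Only the substrings anchored at the front edge (offset 0).
    static List<String> edgeNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, text.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(text.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("ipod", 2, 3));     // [ip, po, od, ipo, pod]
        System.out.println(edgeNGrams("ipod", 2, 3)); // [ip, ipo]
    }
}
```

With the same gram sizes, the edge variant returns a strict subset of the full n-gram output, so the two factories cannot be interchangeable.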

[jira] Commented: (SOLR-199) N-gram

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492948 ] 

Adam Hiatt commented on SOLR-199:
---------------------------------

I'll make those changes. I agree that we don't want bloated base classes.

> NGramTokenizerFactory is referring to constants from, and constructing an instance of, EdgeNGramTokenizer

Are you saying that this worries you because it is referenced in the example schema and will thus break without the lucene-analyzers package? I do agree that this example should probably be taken out for the time being (at the least).


[jira] Commented: (SOLR-199) N-gram

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492375 ] 

Otis Gospodnetic commented on SOLR-199:
---------------------------------------

+1 for getting this stuff into Solr then.  I imagine the patch is mostly what was in some of the SOLR-81 patches.
I think I saw Yonik mentioning cleaning up either solrconfig or schema, so that's something to keep in mind when applying this patch.



[jira] Updated: (SOLR-199) N-gram

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Hiatt updated SOLR-199:
----------------------------

    Attachment: SOLR-81-ngram.patch

Here is the patch.


[jira] Commented: (SOLR-199) N-gram

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485251 ] 

Adam Hiatt commented on SOLR-199:
---------------------------------

I do in fact have a need. I am creating an edge n-gram index to provide auto-complete functionality. This doesn't require the regular n-gram factory, but I put it in the patch as well. Furthermore, I am also planning to create an alternative spellchecker that does what you wanted to do initially (i.e., have a single spellchecking index). This would of course require n-gramming functionality.
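
To make the auto-complete use case concrete, here is a minimal sketch of the idea in plain Java, assuming an in-memory map rather than a real Lucene index. The class name, the method names and the sample terms are all hypothetical; a real setup would configure an edge n-gram tokenizer in the schema instead of building a map by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: an in-memory stand-in for an edge n-gram index.
final class EdgeNGramAutocompleteSketch {

    // Returns the leading n-grams of term with lengths minGram..maxGram.
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    // Maps each edge n-gram to the sorted set of terms that start with it.
    static Map<String, Set<String>> buildIndex(List<String> terms, int minGram, int maxGram) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String term : terms) {
            for (String gram : edgeNGrams(term, minGram, maxGram)) {
                index.computeIfAbsent(gram, k -> new TreeSet<>()).add(term);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index =
                buildIndex(Arrays.asList("ipod", "ipaq", "apple"), 1, 4);
        System.out.println(edgeNGrams("ipod", 1, 4)); // [i, ip, ipo, ipod]
        System.out.println(index.get("ip"));          // [ipaq, ipod]
    }
}
```

Because every indexed gram is a prefix of its term, a lookup on what the user has typed so far is an exact match on a gram, which is what makes this approach attractive for auto-complete.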


[jira] Commented: (SOLR-199) N-gram

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485065 ] 

Otis Gospodnetic commented on SOLR-199:
---------------------------------------

While I was the one who started that n-gram approach, and while I'm saving code and config changes from that work "just in case", I'm not sure why we'd add it to Solr at this point, if nothing is going to use it.  I'd say keep the patch here, so one can easily put that back into Solr when a need arises, but don't commit it unless there is some immediate need.  Do you have something that needs the n-gram stuff in Solr?



[jira] Commented: (SOLR-199) N-gram

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492078 ] 

Adam Hiatt commented on SOLR-199:
---------------------------------

Were you imagining that it would just be included, and that users would have to supply a Lucene 2.2+ analyzers jar themselves?


[jira] Commented: (SOLR-199) N-gram

Posted by "Adam Hiatt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492106 ] 

Adam Hiatt commented on SOLR-199:
---------------------------------

Quoted:
Adam, I think Yonik is just saying that the n-gram stuff I added to Lucene's contrib/analyzers was added after 2.1 was released, so we'd need a version of that jar from the trunk at this time. I see mentions of Solr 1.2, so perhaps we can grab the 2.2-dev version of that jar and add it to Solr starting with release 1.2?

Understood. I talked with Yonik and he mentioned possibly upgrading to Lucene 2.2-dev in the future. I'm not sure he intended that to happen in time for Solr 1.2, however. I suppose if it came to it, we could probably use the analyzers 2.2-dev jar with the 2.1 core. I'm guessing the API was stable, but I'm not sure if we want to complicate things that much.

Quoted:
Question: How is the spellchecker you are writing, or considering writing, going to be different from or better than the one in contrib/spellchecker?

The initial use case was actually to support autocomplete functionality, i.e., using the start (front-edge) n-gramming functionality to build tokens that we can match term fragments against.

However, I do still plan to write a native Solr spell checker based on this same patch sometime in the future. A native system brings several major improvements. First, it allows for truly native use of a Solr-configurable Lucene index. Second, we will be able to take advantage of native Solr caching. Third, we will be able to boost on arbitrary aspects. For example, take the misspelling 'ipad' and the indexed terms 'ipod' and 'ipaq'. Both indexed terms are the same edit distance away from the misspelling. They also have the same number of 2-grams (though not 3-grams). If we find that 'ipod' is the more valuable term, we can boost it slightly based on its popularity so that it comes out ahead. The final big win is the ability to spell check on individual input tokens. For example, assume that we have the term 'ipod' indexed in our spell checker, but not the term 'apple ipod', and the misspelling 'apple ipdo' is entered. The overlap between 'ipod' and 'apple ipdo' is slight enough to not warrant a suggestion. However, if we tokenize on whitespace and spell correct each token, we would be able to catch the 'ipdo' misspelling. I'm sure there are other use cases, but those are the ones that I've identified.
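
The two examples above can be sketched in plain Java. This is not the planned Solr spell checker: it assumes a plain Levenshtein edit distance, uses popularity only as a tie-breaker between equally close terms (a simpler stand-in for a score boost), and the class name, the popularity counts and the maxDistance cutoff are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; the popularity figures are invented for the example.
final class BoostedSpellCheckSketch {

    // Classic dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    // Picks the closest term within maxDistance; ties go to the more popular term.
    // Returns the token unchanged when no indexed term is close enough.
    static String suggest(String token, Map<String, Integer> popularity, int maxDistance) {
        String best = token;
        int bestDistance = Integer.MAX_VALUE;
        int bestPopularity = -1;
        for (Map.Entry<String, Integer> e : popularity.entrySet()) {
            int d = editDistance(token, e.getKey());
            if (d > maxDistance) {
                continue;
            }
            if (d < bestDistance || (d == bestDistance && e.getValue() > bestPopularity)) {
                best = e.getKey();
                bestDistance = d;
                bestPopularity = e.getValue();
            }
        }
        return best;
    }

    // Splits on whitespace and corrects each token independently.
    static String correctPhrase(String phrase, Map<String, Integer> popularity, int maxDistance) {
        String[] tokens = phrase.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = suggest(tokens[i], popularity, maxDistance);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Map<String, Integer> popularity = new HashMap<>();
        popularity.put("ipod", 100); // hypothetical popularity counts
        popularity.put("ipaq", 10);
        System.out.println(suggest("ipad", popularity, 2));             // ipod
        System.out.println(correctPhrase("apple ipdo", popularity, 2)); // apple ipod
    }
}
```

In this sketch 'ipad' is one edit away from both 'ipod' and 'ipaq', so popularity alone decides the suggestion. 'apple' is left untouched because no indexed term is within the cutoff, while 'ipdo' is corrected once it is checked as its own token.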




[jira] Commented: (SOLR-199) N-gram

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492947 ] 

Hoss Man commented on SOLR-199:
-------------------------------

NGramTokenizerFactory is referring to constants from, and constructing an instance of, EdgeNGramTokenizer

I'm also not crazy about some of the utilities being added to BaseTokenizerFactory ... at a minimum they need better names (like getStringArg), but i'm not really clear on what this is supposed to mean at all...

           protected int getInt(String name, int defaultVal, boolean useDefault)

...if i don't want to use the default, then what am i supposed to pass as the defaultVal?

how about if we don't make any changes to BaseTokenizerFactory and just let subclasses that want convenience methods for dealing with args use MapSolrParams and the methods it supports?
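
One way to avoid the ambiguous useDefault flag is to split the lookup into an optional form and a required form. The sketch below is a hypothetical helper in plain Java, not BaseTokenizerFactory and not MapSolrParams; the class name, the method names and the argument names minGramSize and maxGramSize are assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; it is not a Solr class.
final class FactoryArgsSketch {
    private final Map<String, String> args;

    FactoryArgsSketch(Map<String, String> args) {
        this.args = args;
    }

    // Optional argument: returns defaultVal when the key is absent.
    int getInt(String name, int defaultVal) {
        String value = args.get(name);
        return value == null ? defaultVal : Integer.parseInt(value);
    }

    // Required argument: fails loudly instead of accepting a dummy default.
    int getRequiredInt(String name) {
        String value = args.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing required argument: " + name);
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] unused) {
        Map<String, String> raw = new HashMap<>();
        raw.put("minGramSize", "2"); // assumed argument name
        FactoryArgsSketch args = new FactoryArgsSketch(raw);
        System.out.println(args.getInt("minGramSize", 1)); // 2
        System.out.println(args.getInt("maxGramSize", 5)); // 5 (default used)
    }
}
```

With two methods, the caller never has to invent a defaultVal for an argument that has no sensible default.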


[jira] Commented: (SOLR-199) N-gram

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491647 ] 

Yonik Seeley commented on SOLR-199:
-----------------------------------

This patch does rely on analyzers.jar from Lucene contrib.  No reason that jar shouldn't be part of Solr, but it needs a version after Lucene 2.1.
