You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Erick Erickson (Created) (JIRA)" <ji...@apache.org> on 2011/11/27 17:54:40 UTC

[jira] [Created] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
---------------------------------------------------------------------------------------------

                 Key: SOLR-2921
                 URL: https://issues.apache.org/jira/browse/SOLR-2921
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
    Affects Versions: 3.6, 4.0
         Environment: All
            Reporter: Erick Erickson
            Assignee: Erick Erickson
            Priority: Minor


SOLR-2918, which drastically improves the approach of SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:

 * ASCIIFoldingFilterFactory
 * LowerCaseFilterFactory
 * LowerCaseTokenizerFactory
 * MappingCharFilterFactory
 * PersianCharFilterFactory

When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!

But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.

Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.

Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.

ArabicNormalizationFilterFactory
GreekLowerCaseFilterFactory
HindiNormalizationFilterFactory
ICUFoldingFilterFactory
ICUNormalizer2FilterFactory
ICUTransformFilterFactory
IndicNormalizationFilterFactory
ISOLatin1AccentFilterFactory
PersianNormalizationFilterFactory
RussianLowerCaseFilterFactory
TurkishLowerCaseFilterFactory


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-2921:
---------------------------------

    Affects Version/s:     (was: 3.6)
    
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-2921:
---------------------------------

    Attachment: SOLR-2921-3x.patch

Here's a first cut at these. The tests in TestFoldingMultitermExtrasQuery are especially weak, any help here would be extremely welcome....

Basically, I stole the patterns from the associated filters and removed the ones that failed for reasons I didn't understand. And I haven't checked the remaining all that carefully, I have some stuff coming up for most of the rest of today and wanted to get the first cut out in front of people.

The attached patch applies against 3x, I'll need to tweak it for trunk but won't bother until after we finalize this.

I also haven't run the full test suite, so this patch should NOT be committed yet.

I'm not even going to try the following, I don't even know what to expect as proper results. If nobody steps up I'll split these out into another JIRA and hopefully someone with the appropriate knowledge (and keyboard) can volunteer:
   ArabicNormalizationFilterFactory
   HindiNormalizationFilterFactory
   IndicNormalizationFilterFactory
   PersianNormalizationFilterFactory
   ICUTransformFilterFactory  
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: SOLR-2921-3x.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234855#comment-13234855 ] 

Robert Muir commented on SOLR-2921:
-----------------------------------

Patch looks good: i think you should commit it and I'll follow up with the other ones.

only one nitpick:
{noformat}
-/** 
+/**
  * Factory for {@link TurkishLowerCaseFilter}.
  * <pre class="prettyprint" >
  * &lt;fieldType name="text_trlwr" class="solr.TextField" positionIncrementGap="100"&gt;
- *   &lt;analyzer&gt;
- *     &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
- *     &lt;filter class="solr.TurkishLowerCaseFilterFactory"/&gt;
- *   &lt;/analyzer&gt;
- * &lt;/fieldType&gt;</pre> 
+ * &lt;analyzer&gt;
+ * &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
+ * &lt;filter class="solr.TurkishLowerCaseFilterFactory"/&gt;
+ * &lt;/analyzer&gt;
+ * &lt;/fieldType&gt;</pre>
+ *
{noformat}

Did your IDE do this? I don't think we should lose the indentation of the example there.

                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234434#comment-13234434 ] 

Robert Muir commented on SOLR-2921:
-----------------------------------

{quote}
I'm not even going to try the following, I don't even know what to expect as proper results. If nobody steps up I'll split these out into another JIRA and hopefully someone with the appropriate knowledge (and keyboard) can volunteer:
{quote}

Just make progress with your patch and keep this issue open. I'll help with these.

As far as the existing patch, i think the multitermtest for ICU will not work (because the support is in a contrib module: analysis-extras).
If we want a test for that stuff i think it needs to sit under that contrib.
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: SOLR-2921-3x.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Resolved] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson resolved SOLR-2921.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.6

Let's open up any further issues in a new JIRA?
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-trunk.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-2921:
---------------------------------

    Attachment: SOLR-2921-3x.patch

Fixes test cases in analysis-extras so it runs from the command line not only in IntelliJ.
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159348#comment-13159348 ] 

Erick Erickson commented on SOLR-2921:
--------------------------------------

Hmmm, what about stemmers? Prefix only?
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234343#comment-13234343 ] 

Robert Muir commented on SOLR-2921:
-----------------------------------

{quote}
I guess I'm at a loss as to how to write tests for the various filters and tokenizers I listed, which is why I'm reluctant to just make them MultTermAwareComponents.
{quote}

Well, none of them are going to work perfectly with wildcards etc anyway, and we already have:
* tests that these filters do what they should
* tests that this MultiTerm stuff works

Looking at your 'quick cull' list i think these will 'generally work'. Sure I could list corner cases for maybe
half of them right off the top of my head, but we already know its not going to work perfectly. These aren't bugs
in the filters if they don't work 100% right when given query syntax instead of text, and its not bugs if they
have context-sensitive rules which won't work the way people expect with patterns.

So I don't think we need tests that show their half-way broken behavior? This is sort of a best-effort thing anyway.

                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236109#comment-13236109 ] 

Erick Erickson commented on SOLR-2921:
--------------------------------------

Cool! Thanks!
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-trunk.patch, SOLR-2921_rest.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-2921:
---------------------------------

    Description: 
SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:

 * ASCIIFoldingFilterFactory
 * LowerCaseFilterFactory
 * LowerCaseTokenizerFactory
 * MappingCharFilterFactory
 * PersianCharFilterFactory

When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!

But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.

Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.

Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.

ArabicNormalizationFilterFactory
GreekLowerCaseFilterFactory
HindiNormalizationFilterFactory
ICUFoldingFilterFactory
ICUNormalizer2FilterFactory
ICUTransformFilterFactory
IndicNormalizationFilterFactory
ISOLatin1AccentFilterFactory
PersianNormalizationFilterFactory
RussianLowerCaseFilterFactory
TurkishLowerCaseFilterFactory


  was:
SOLR-2918, which drastically improves the approach of SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:

 * ASCIIFoldingFilterFactory
 * LowerCaseFilterFactory
 * LowerCaseTokenizerFactory
 * MappingCharFilterFactory
 * PersianCharFilterFactory

When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!

But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.

Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.

Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.

ArabicNormalizationFilterFactory
GreekLowerCaseFilterFactory
HindiNormalizationFilterFactory
ICUFoldingFilterFactory
ICUNormalizer2FilterFactory
ICUTransformFilterFactory
IndicNormalizationFilterFactory
ISOLatin1AccentFilterFactory
PersianNormalizationFilterFactory
RussianLowerCaseFilterFactory
TurkishLowerCaseFilterFactory


    
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Oliver Schihin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263999#comment-13263999 ] 

Oliver Schihin commented on SOLR-2921:
--------------------------------------

Am I off topic, or is ICUCollationKeyFilterFactory a candidate, as well?
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-trunk.patch, SOLR-2921_rest.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Oliver Schihin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oliver Schihin updated SOLR-2921:
---------------------------------

    Comment: was deleted

(was: Am I off topic, or is ICUCollationKeyFilterFactory a candidate, as well?)
    
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-trunk.patch, SOLR-2921_rest.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2921:
------------------------------

    Attachment: SOLR-2921_rest.patch

erick, here's a patch for the rest. I'll commit it shortly.
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-trunk.patch, SOLR-2921_rest.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234571#comment-13234571 ] 

Erick Erickson commented on SOLR-2921:
--------------------------------------

Hey, man, it worked from inside IntelliJ. I thought getting to the schema in core from analysis-extras (where TestFoldingMultitermExtrasQuery lives) was funky. I'll have a patch up for that shortly.

                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: SOLR-2921-3x.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Mike Sokolov (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160576#comment-13160576 ] 

Mike Sokolov commented on SOLR-2921:
------------------------------------

I spoke hastily, and it's true that stemmers are different from those other multi-token things.  It would be kind of nice if it were possible to have a query for "do?s" actually match the a document containing "dogs", even when matching against a stemmed field, but I don't see how to do it without breaking all kinds of other things.  Consider how messed up range queries would get.  [dogs TO *] would match doge, doggone,   and other words in [dog TO dogs] which would be totally counterintuitive.
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160416#comment-13160416 ] 

Robert Muir commented on SOLR-2921:
-----------------------------------

{quote}
I guess it begs the question of what use adding stemmers to the mix would be.
{quote}

I agree with Mike.

Most stemmers are basically suffix-strippers and use heuristics like term length. They are not going to work with the syntax of various multitermqueries. no stemmer is going to stem dogs* to dog*. some might remove any non-alpha characters completely, and its not a bug that they do this. they are heuristical in nature and designed to work on natural language text... not syntax.

                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160593#comment-13160593 ] 

Robert Muir commented on SOLR-2921:
-----------------------------------

well Erick i think the ones you listed here are ok.

There are cases where they won't work correctly, but trying to do
multitermqueries with mappingcharfilter and asciifolding
filter are already problematic (eg ? won't match œ because 
its now 'oe').

Personally i think this is fine, but we should document
that things don't work correctly all the time, and we 
should not make changes to analysis components to try 
to make them cope  with multiterm queries syntax or 
anything (this would be bad design, it turns them into 
queryparsers).

If the user cares about the corner cases, then they just
specify the chain.
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160406#comment-13160406 ] 

Erick Erickson commented on SOLR-2921:
--------------------------------------

Not synonyms - agreed.
Not shinglers - agreed.
Not anything that might produce multiple tokens - agreed.

Stemmers... When do stemmers produce multiple tokens? My ignorance of all the possibilities knows no bounds. I was wondering if, in this case, stemmers really reduced to prefix queries. Maybe it's just a bad idea altogether, I guess it begs the question of what use adding stemmers to the mix would be. You want to match the root, just specify the root with an asterisk and be done with it. No need to introduce stemming into the MultiTermAwareComponent mix.

But this kind of question is exactly why I have this JIRA in place, we can collect reasons I wouldn't think of and record them. Before I mess something up with well-intentioned-but-wrong "help".
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Mike Sokolov (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160278#comment-13160278 ] 

Mike Sokolov commented on SOLR-2921:
------------------------------------

No not stemmers.  Not synonyms, not shinglers or anything that might produce multiple tokens.

                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-2921:
---------------------------------

    Attachment: SOLR-2921-trunk.patch
                SOLR-2921-3x.patch

3x r:    1303937
Trunk r: 1303939
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-trunk.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-2921:
---------------------------------

    Affects Version/s: 3.6

There is some desire (and suggestions) to do some more of these for 3.6, so re-labeling
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Posted by "Erick Erickson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160612#comment-13160612 ] 

Erick Erickson commented on SOLR-2921:
--------------------------------------

Mike:

stemmers - not going to make them MultiTermAware. No way. No how. Not on my watch, one succinct example and I'm convinced.

The beauty of the way Yonik and Robert directed this is that we can take care of the 80% case, not provide things that are *that* surprising and still have all the flexibility available to those who really need it. As Robert says, if they really want some "interesting" behavior, they can specify the complete chain.

Robert:

I guess I'm at a loss as to how to write tests for the various filters and tokenizers I listed, which is why I'm reluctant to just make them MultTermAwareComponents. Do you have any suggestions as to how I could get tests? I had enough surprises when I ran the tests in English that I'm reluctant to just plow ahead. As far as I understand, Arabic is caseless for instance.

I totally agree with your point that making the analysis components cope with syntax is evil. Not going there either.

Maybe the right action is to wait for someone to volunteer to be the guinea pig for the various filters, I suppose we could advertise for volunteers...
                
> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org