Posted to solr-dev@lucene.apache.org by "Ryan McKinley (JIRA)" <ji...@apache.org> on 2007/05/24 06:08:16 UTC

[jira] Created: (SOLR-248) Capitalization Filter Factory

Capitalization Filter Factory
-----------------------------

                 Key: SOLR-248
                 URL: https://issues.apache.org/jira/browse/SOLR-248
             Project: Solr
          Issue Type: New Feature
            Reporter: Ryan McKinley
            Priority: Minor


For tokens that are used in faceting, it is nice to have standard capitalization.  

I want "Aerial views" and "Aerial Views" to both be: "Aerial Views"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley reassigned SOLR-248:
----------------------------------

    Assignee: Ryan McKinley




[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498717 ] 

Yonik Seeley commented on SOLR-248:
-----------------------------------

> Implemented at the indexing level, I can have different values for the stored value and indexed terms.
One downside is that it complicates certain things like wildcard or prefix queries (capitalizing the first letter and lowercasing the second is something that the QueryParser does not support).

You could still store the values verbatim, and index as all lowercase.
Then the application could capitalize the results it gets back as it sees fit.
I do see value pushing this type of logic back to the search engine though.
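
A minimal sketch of that alternative, for illustration only: terms are lowercased for indexing and the application capitalizes facet labels for display. The function names (index_term, display_label) are hypothetical and the code is plain Python, not Solr's API.

```python
from collections import Counter

def index_term(value):
    # Index side: fold case so spelling variants collapse into one term.
    return value.lower()

def display_label(term):
    # Presentation side: capitalize each word of the lowercased term.
    return " ".join(w[:1].upper() + w[1:] for w in term.split(" "))

stored_values = ["Aerial views", "Aerial Views", "aerial views"]  # kept verbatim
facet_counts = Counter(index_term(v) for v in stored_values)
facet_labels = {display_label(term): n for term, n in facet_counts.items()}
print(facet_labels)  # {'Aerial Views': 3}
```

The stored values stay untouched, so the capitalization rule lives only in the application.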

Of course, I think this might be a more general problem in faceting... what to actually use as a label for display purposes vs what the terms in the index were (think price formatting, labels for more complex facet queries, etc).





[jira] Updated: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-248:
-------------------------------

    Attachment: SOLR-248-CapitalizationFilter.patch

Implementation and test...

<filter class="solr.CapitalizationFilterFactory" onlyFirstWord="false" keep="and or the is my or de" maxTokenLength="40" maxWordCount="4" okPrefix="McK" forceFirstLetter="true" />

onlyFirstWord="false" -- this capitalizes every word

keep="and or the is my or de" -- don't change capitalization for these words

forceFirstLetter="true" -- capitalize the first letter of the Token (not word) even if it is in the "keep" list

maxTokenLength="40" -- if the token is longer than 40 chars, don't even try to capitalize it

maxWordCount="4" -- if there are more than 4 words, don't try capitalizing
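
To make the interaction of these options concrete, here is a rough Python sketch written from the descriptions above. It is not the code in the patch. Assumptions: words are split on single spaces, "keep" matching is case-insensitive, okPrefix leaves a word alone when it already starts with the prefix, and with onlyFirstWord="true" the later words are lowercased.

```python
def capitalize_token(token, only_first_word=False, keep=frozenset(),
                     ok_prefixes=(), force_first_letter=True,
                     max_token_length=40, max_word_count=4):
    """Normalize the capitalization of one token (illustrative sketch)."""
    if len(token) > max_token_length:
        return token  # too long: don't even try
    words = token.split(" ")
    if len(words) > max_word_count:
        return token  # too many words: don't try
    out = []
    for i, word in enumerate(words):
        if word.lower() in keep:
            out.append(word)  # keep list: leave capitalization alone
        elif any(word.startswith(p) for p in ok_prefixes):
            out.append(word)  # prefix matches: trust the existing casing
        elif only_first_word and i > 0:
            out.append(word.lower())  # assumption: later words are lowercased
        else:
            out.append(word[:1].upper() + word[1:].lower())
    result = " ".join(out)
    if force_first_letter:
        # Applies to the first character of the token, even for a keep word.
        result = result[:1].upper() + result[1:]
    return result

keep = frozenset("and or the is my or de".split())
print(capitalize_token("Aerial views"))                            # Aerial Views
print(capitalize_token("the arts", keep=keep))                     # The Arts
print(capitalize_token("McKinley papers", ok_prefixes=("McK",)))   # McKinley Papers
```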





[jira] Updated: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-248:
-------------------------------

    Attachment: SOLR-248-CapitalizationFilter.patch

Updated patch; applies to trunk.




[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498700 ] 

Yonik Seeley commented on SOLR-248:
-----------------------------------

Hmmm, this feels slightly strange implementing at the indexing level.
What are the advantages/disadvantages vs just lowercasing for indexing and capitalizing at the presentation/application layer?





Re: [jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by Chris Hostetter <ho...@fucit.org>.
: I haven't looked at this specific code, but this is my preference in
: general.  multiple TokenFilters are created per-field instance on the
: index side, and per-query-term on the search side, so it's better to
: pull all the setup you can out of the Filter for performance reasons.

computation can be done at factory instantiation, but it can make sense to
put the code for the computation in static methods within the Filter class
itself -- so it's more reusable outside of Solr.
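
A rough sketch of that split, as a Python analogy rather than Solr's actual Java classes. The class and method names are illustrative. The factory parses its configuration once, each filter instance only holds a reference to the parsed set, and the computation sits in a static method on the filter so it can be called without any factory.

```python
class CapitalizationFilter:
    def __init__(self, tokens, keep):
        self.tokens = tokens
        self.keep = keep  # shared, already-parsed set; nothing is rebuilt here

    @staticmethod
    def capitalize(word, keep):
        """Pure helper: usable on its own, with no factory involved."""
        if word.lower() in keep:
            return word
        return word[:1].upper() + word[1:].lower()

    def __iter__(self):
        for token in self.tokens:
            yield " ".join(self.capitalize(w, self.keep)
                           for w in token.split(" "))


class CapitalizationFilterFactory:
    def __init__(self, args):
        # One-time setup: parse the configuration once, not once per filter.
        self.keep = frozenset(args.get("keep", "").lower().split())

    def create(self, tokens):
        # Cheap: called for every field instance or query term.
        return CapitalizationFilter(tokens, self.keep)


factory = CapitalizationFilterFactory({"keep": "and or the"})
print(list(factory.create(["aerial views", "arts and crafts"])))
# ['Aerial Views', 'Arts and Crafts']
```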



-Hoss


[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498834 ] 

Yonik Seeley commented on SOLR-248:
-----------------------------------

> Why is so much of the logic in the Factory?

I haven't looked at this specific code, but this is my preference in general.  Multiple TokenFilters are created per-field instance on the index side, and per-query-term on the search side, so it's better to pull all the setup you can out of the Filter for performance reasons.





[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498711 ] 

Ryan McKinley commented on SOLR-248:
------------------------------------

It is a little strange, but (in my case anyway) I think it makes sense...

I am indexing a bunch of metadata from a bunch of libraries (OAI-PMH) -- I want to display the data exactly as it came from the source, but for faceted browsing I need to normalize capitalization.

Implemented at the indexing level, I can have different values for the stored value and indexed terms.  Also, at the indexing level I can leverage existing Tokenizers and Filters to build the tokens that need capitalization. It keeps all the configuration in schema.xml and lets the OAI -> Solr XML conversion be a simple transformation. This way, whoever takes care of this need only learn Solr configuration, not ryan+solr configuration.

If it is not generally useful I can keep it elsewhere - that is why we have the nice plugin framework!






[jira] Resolved: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley resolved SOLR-248.
--------------------------------

    Resolution: Fixed

added a while ago




[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498841 ] 

Ryan McKinley commented on SOLR-248:
------------------------------------

> Why is so much of the logic in the Factory? 

It seemed silly to copy the same things over and over for each time the type is indexed or queried...  

> why is keep in a synchronized map,

I'm not sure it needs to be, but I was being cautious...   the map is only created once (and never edited) but could be accessed by many threads simultaneously.







[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498726 ] 

Ryan McKinley commented on SOLR-248:
------------------------------------

> 
>> Implemented at the indexing level, I can have different values for the stored value and indexed terms.
> One downside is that it complicates certain things like wildcard or prefix queries
>

Currently I'm using copyField and doing the prefix query on a different field... not great but it works!

> 
> Of course, I think this might be a more general problem in faceting... what to actually use as a label for display purposes vs what the terms in the index were (think price formatting, labels for more complex facet queries, etc).
> 

Interesting.  I could index with a lowercase filter then reformat the facet results...  I'll take a look at that after the deadline passes ;)





[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498488 ] 

Hoss Man commented on SOLR-248:
-------------------------------

1) would it make sense for the keep option to refer to a file, using the same format as StopFilter ... that way it's easy to reuse the same file (which seems like it would be a common case).

2) what is the point of forceFirstLetter="true" ? ... if you want to force capitalization, what's the point of making the keep list?

3) is okPrefix going to force the case for things that have that prefix in an alternate case, or only allow that casing to remain (ie: if I index McKeen, Mckeen, mckeen and MCKEEN what tokens do I wind up with?)




[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498492 ] 

Ryan McKinley commented on SOLR-248:
------------------------------------

> 
> 1) would it make sense for the keep option to refer to a file, using the same format as StopFilter ... that way it's easy to reuse the same file (which seems like it would be a common case).
> 

probably.  that is a good idea


> 2) what is the point of forceFirstLetter="true" ? ... if you want to force capitalization, what's the point of making the keep list?
> 

This is one that came of necessity!

with keep="the ..."  and input:
 "Grand army of the Republic", "the arts"

I want: "Grand Army of the Republic" and "The Arts"

"forceFirstLetter" only applies to the first character in the token, not to each word.


> 3) is okPrefix going to force the case for things that have that prefix in an alternate case, or only allow that casing to remain (ie: if I index McKeen, Mckeen, mckeen and MCKEEN what tokens do I wind up with?)
> 

As written, if the prefix matches, it assumes the word capitalization is correct.  For my input data, this is sufficient -- but it should probably do something smarter.

So, if you index "McKeen, Mckeen, mckeen, MCKEEN and McKEEN", you would get:

 "McKeen, Mckeen, Mckeen, Mckeen And McKEEN"

If "okPrefix" was treated as *the* capitalization for input where the lowercase prefix matches "mck", it would give:

 "McKeen, McKeen, McKeen, McKeen And McKeen"
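
The two behaviours can be compared side by side in a small Python sketch. This illustrates the rules as described here and is not the patch code; both function names are made up for the example.

```python
def capitalize(word):
    # Default rule: first letter upper, the rest lower.
    return word[:1].upper() + word[1:].lower()

def ok_prefix_as_written(word, prefix="McK"):
    # Current behaviour: an exact prefix match means the casing is trusted.
    if word.startswith(prefix):
        return word
    return capitalize(word)

def ok_prefix_canonical(word, prefix="McK"):
    # Alternative: a case-insensitive match forces the prefix's own casing.
    if word.lower().startswith(prefix.lower()):
        return prefix + word[len(prefix):].lower()
    return capitalize(word)

words = ["McKeen", "Mckeen", "mckeen", "MCKEEN", "and", "McKEEN"]
print([ok_prefix_as_written(w) for w in words])
# ['McKeen', 'Mckeen', 'Mckeen', 'Mckeen', 'And', 'McKEEN']
print([ok_prefix_canonical(w) for w in words])
# ['McKeen', 'McKeen', 'McKeen', 'McKeen', 'And', 'McKeen']
```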






[jira] Updated: (SOLR-248) Capitalization Filter Factory

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-248:
-------------------------------

    Attachment: SOLR-248-CapitalizationFilter.patch

1. Added better javadocs explaining the configuration.
2. removed synchronized map
3. put the Filter as a package private class in the Factory file -- since the filter relies on the factory, it is not particularly useful outside Solr.

I would like to add this soon





[jira] Commented: (SOLR-248) Capitalization Filter Factory

Posted by "J.J. Larrea (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498817 ] 

J.J. Larrea commented on SOLR-248:
----------------------------------

While I fully agree that faceting does raise some odd issues stemming from the display of normally-invisible indexed values to humans, and that it theoretically should be the responsibility of the front-end to translate index values into human-readable values, there are great practical advantages in both efficiency and convenience to making the indexed values "pretty", and to centralizing as much of that as possible in the Analysis stage.

In particular, I will try this and am very likely to put this into use this weekend, so thank you Ryan!  So I'm +1 to adding it to the Solr distribution, though to avoid confusing people it should have a JavaDoc comment explaining that the main use is in faceting to avoid having to introduce such common logic into the presentation-layer.

Regarding the implementation,

1. For 'keep' and 'okPrefix' (and were it not for backward-compatibility issues, for 'words' in StopFilter), it would be nice to have a means to specify either a direct list or a filename in the same parameter.  A simple approach might be something like keep="word word word..." vs. keep="<file", or even keep="<file <file word word" (with the requirement for backslash-escaping spaces in either)...  Or alternately something like txt:filename (vs. xml:filename, json:filename, etc.) with an unescaped : being significant.

2. Why is so much of the logic in the Factory?  This drags Solr-specific stuff in when a user might want to use just the Analyzer in a non-Solr context. Wouldn't it be better in general for Solr Analyzers to be self-complete, with the Factory merely being an adaptor between SolrParams & external resources and the Analyzer's constructor?

Also, why is keep in a synchronized map, since there is no mutator?  (I know, picky picky...)
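
The list-or-file syntax floated in item 1 could be parsed roughly like this. This is a Python sketch of the proposal only, not an existing Solr feature. The read_words callback is a hypothetical stand-in for whatever would load a word file, and the example file names are invented.

```python
import re

def parse_keep_spec(spec, read_words):
    """Split a keep="..." value into literal words plus words from files."""
    words = set()
    # Split on runs of spaces that are not preceded by a backslash.
    for entry in re.split(r"(?<!\\) +", spec.strip()):
        entry = entry.replace("\\ ", " ")  # unescape "\ " into a plain space
        if not entry:
            continue
        if entry.startswith("<"):
            words.update(read_words(entry[1:]))  # "<name" refers to a file
        else:
            words.add(entry)
    return words

# Stand-in for the file system, so the example is self-contained.
fake_files = {"stop.txt": ["the", "of"], "my file.txt": ["de"]}
spec = r"<stop.txt and or <my\ file.txt"
print(sorted(parse_keep_spec(spec, fake_files.__getitem__)))
# ['and', 'de', 'of', 'or', 'the']
```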

Good luck with the deadline!


