Posted to dev@lucene.apache.org by "Tom Burton-West (JIRA)" <ji...@apache.org> on 2010/11/01 18:20:23 UTC
[jira] Created: (SOLR-2210) Provide solr FilterFactory for Lucene ICUTokenizer
Provide solr FilterFactory for Lucene ICUTokenizer
--------------------------------------------------
Key: SOLR-2210
URL: https://issues.apache.org/jira/browse/SOLR-2210
Project: Solr
Issue Type: New Feature
Affects Versions: 3.1
Reporter: Tom Burton-West
Priority: Minor
The Lucene ICUTokenizer provides many benefits for multilingual tokenizing. There should be an ICUFilterFactory so that it can be used from Solr. Passing configuration parameters to it will probably raise some issues.
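To illustrate the configuration-parameter concern, here is a minimal Java sketch of a factory that receives string attributes and builds a tokenizer from them. It is not the Solr or Lucene API: the class name TokenizerFactorySketch, the init(Map) method, and the "locale" parameter are all hypothetical, and the JDK's java.text.BreakIterator stands in for ICUTokenizer so that the example has no dependencies.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a schema-configured factory; not the Solr or Lucene API. */
public class TokenizerFactorySketch {
    // Default used when no "locale" attribute is supplied.
    private Locale locale = Locale.ROOT;

    /** Receives the string attributes a schema-driven framework would pass to a factory. */
    public void init(Map<String, String> args) {
        String tag = args.get("locale"); // hypothetical parameter name
        if (tag != null) {
            locale = Locale.forLanguageTag(tag);
        }
    }

    /** Splits text at word boundaries and keeps segments containing a letter or digit. */
    public List<String> tokenize(String text) {
        // JDK word-boundary iterator, used here in place of ICUTokenizer.
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // Skip segments that are only whitespace or punctuation.
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        TokenizerFactorySketch factory = new TokenizerFactorySketch();
        factory.init(Map.of("locale", "en"));
        System.out.println(factory.tokenize("Provide a factory for Solr."));
    }
}
```

In this sketch every option arrives as a string attribute, so richer settings, such as a custom tokenizer configuration, would need either their own string encoding or a reference to an external resource.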
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[jira] Updated: (SOLR-2210) Provide solr FilterFactory for Lucene ICUTokenizer
Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/SOLR-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated SOLR-2210:
------------------------------
Attachment: SOLR-2210.patch
Here's a start: it makes an analysis-extras contrib with all the build logic, and factories for the ICU filters.
Still to do: add support for custom normalization and custom tokenizer config, and add filters for Smart Chinese and Stempel.
But I think it's OK to commit this as-is and improve it in svn.
[jira] Commented: (SOLR-2210) Provide solr FilterFactory for Lucene ICUTokenizer
Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/SOLR-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927382#action_12927382 ]
Robert Muir commented on SOLR-2210:
-----------------------------------
OK, I committed the baseline code (rev 1030012, rev 1030022 in 3x).
We can keep the issue open and just add patches against it for customization, etc.
I just wanted to get all the build-system stuff working, so this was easy.
[jira] Commented: (SOLR-2210) Provide solr FilterFactory for Lucene ICUTokenizer
Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/SOLR-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927053#action_12927053 ]
Robert Muir commented on SOLR-2210:
-----------------------------------
Actually, another idea would be to just make an 'extraAnalyzers' contrib.
Then we could also add factories for Smart Chinese, Polish, etc., without creating a ton of contribs.
I think this would be a good solution to expose all the Lucene analyzers to Solr,
since to me, LUCENE-2510 seems tricky.
[jira] Commented: (SOLR-2210) Provide solr FilterFactory for Lucene ICUTokenizer
Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/SOLR-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927048#action_12927048 ]
Robert Muir commented on SOLR-2210:
-----------------------------------
Thanks for opening this, Tom.
I've got some barebones filters for some of this stuff on my computer.
Because the ICU jar file is large, I was trying to see if I could solve LUCENE-2510 first, but this would only fix the problem for 4.0 anyway.
I think we should just make an ICU contrib for now, and put the factories (Tokenizer, Normalizer, Folding, Transliterator, Collation) and the jar file in there.