Posted to commits@roller.apache.org by "Kohei Nozaki (JIRA)" <ji...@apache.org> on 2015/12/10 13:00:16 UTC

[jira] [Commented] (ROL-2090) Lucene integration doesn't work well for entries that written in some languages

    [ https://issues.apache.org/jira/browse/ROL-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050650#comment-15050650 ] 

Kohei Nozaki commented on ROL-2090:
-----------------------------------

I guess a possible solution is to make the token limit and the {{Analyzer}} implementation class configurable via {{roller-custom.properties}}. If they were configurable, Japanese users, for example, could put a jar containing their preferred {{Analyzer}} implementation in the container's library directory and set its FQCN, along with the token limit, in {{roller-custom.properties}}.
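
As a rough illustration of that idea, the lookup could be sketched as below. This is only a sketch under assumptions, not Roller's actual code: the property keys {{search.index.analyzer.class}} and {{search.index.analyzer.maxTokenCount}} and the class {{SearchIndexSettings}} are hypothetical names, and the reflective helper assumes the configured class has a public no-argument constructor, which may not hold for every {{Analyzer}} in every Lucene version. The helper is generic so that the sketch compiles without Lucene on the classpath; in Roller the {{type}} argument would be {{Analyzer.class}} and the fallback would create the current {{StandardAnalyzer}}.

```java
import java.util.Properties;
import java.util.function.Supplier;

/** Sketch only: reads two hypothetical search settings from a Properties object. */
final class SearchIndexSettings {

    // Hypothetical property keys; they are assumptions made for this example.
    static final String ANALYZER_CLASS_KEY = "search.index.analyzer.class";
    static final String MAX_TOKENS_KEY = "search.index.analyzer.maxTokenCount";

    // Mirrors the hard-coded limit quoted in the issue description.
    static final int DEFAULT_MAX_TOKENS = 1000;

    private SearchIndexSettings() {
    }

    /** Returns the configured token limit, or the default if it is missing or invalid. */
    static int maxTokenCount(Properties props) {
        String raw = props.getProperty(MAX_TOKENS_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MAX_TOKENS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_MAX_TOKENS;
        } catch (NumberFormatException e) {
            return DEFAULT_MAX_TOKENS;
        }
    }

    /**
     * Instantiates the class named under the given key via its no-argument
     * constructor. Falls back to the supplier if the key is unset, the class
     * cannot be loaded or constructed, or it is not of the expected type.
     */
    static <T> T newInstance(Properties props, String key, Class<T> type, Supplier<T> fallback) {
        String className = props.getProperty(key);
        if (className == null || className.trim().isEmpty()) {
            return fallback.get();
        }
        try {
            Class<?> loaded = Class.forName(className.trim());
            return type.cast(loaded.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException | ClassCastException e) {
            return fallback.get();
        }
    }
}
```

Falling back silently keeps indexing working when the setting is mistyped, although a real implementation would probably want to log the failure so that the misconfiguration is visible.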

> Lucene integration doesn't work well for entries that written in some languages
> -------------------------------------------------------------------------------
>
>                 Key: ROL-2090
>                 URL: https://issues.apache.org/jira/browse/ROL-2090
>             Project: Apache Roller
>          Issue Type: Improvement
>          Components: Data Model & JPA Backend
>    Affects Versions: 5.1.2
>            Reporter: Kohei Nozaki
>            Assignee: Roller Unassigned
>            Priority: Minor
>
> Reported in http://benzaiten.dyndns.org/roller/ugya/entry/roller_500_to_510_migration (Japanese). Summary in English:
> h4. Japanese keywords don't match in the latter part of long entries
> This is caused by the maximum token limit in the following code. The author said that typical Japanese text is not split by white space, so the limit doesn't work well with it.
> {noformat}
> // Limit to 1000 tokens.
> LimitTokenCountAnalyzer analyzer = new LimitTokenCountAnalyzer(
>         IndexManagerImpl.getAnalyzer(), 1000);
> {noformat}
> h4. StandardAnalyzer doesn't work well with Japanese text
> Roller uses {{StandardAnalyzer}}, but there are language-specific alternatives such as {{CJKAnalyzer}} or {{JapaneseAnalyzer}}. The author said that these implementations improve accuracy for such languages. Because they are language-specific, we can't simply replace {{StandardAnalyzer}} with them, but we might want to switch analyzers in a flexible manner, for example by using the language configuration of each blog.
> {noformat}
> public static final Analyzer getAnalyzer() {
>     return new StandardAnalyzer(FieldConstants.LUCENE_VERSION);
> }
> {noformat}
> I'm still not sure what the proper solution would be, but I believe there is room for improvement here. Any advice would be appreciated.
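
The per-blog language idea in the quoted description could be sketched roughly as follows. This is an illustration under assumptions rather than a proposal for the final design: the class {{AnalyzerByLanguage}} and the language-to-analyzer mapping are hypothetical choices made for this example, and only the Lucene class names themselves are real. It returns class names instead of {{Analyzer}} instances so that it compiles without Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch only: picks an Analyzer class name from a blog's locale. */
final class AnalyzerByLanguage {

    // The analyzer Roller uses today, kept as the default.
    static final String DEFAULT_ANALYZER =
            "org.apache.lucene.analysis.standard.StandardAnalyzer";

    // Hypothetical mapping from ISO 639 language code to analyzer class name.
    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();

    static {
        BY_LANGUAGE.put("ja", "org.apache.lucene.analysis.ja.JapaneseAnalyzer");
        BY_LANGUAGE.put("zh", "org.apache.lucene.analysis.cjk.CJKAnalyzer");
        BY_LANGUAGE.put("ko", "org.apache.lucene.analysis.cjk.CJKAnalyzer");
    }

    private AnalyzerByLanguage() {
    }

    /** Returns the analyzer class name for the blog's locale, or the default. */
    static String analyzerClassFor(Locale blogLocale) {
        if (blogLocale == null) {
            return DEFAULT_ANALYZER;
        }
        return BY_LANGUAGE.getOrDefault(blogLocale.getLanguage(), DEFAULT_ANALYZER);
    }
}
```

One open question with this approach is that all blogs would then need either separate indexes or per-field analyzers, since documents indexed with different analyzers would live side by side.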


