Posted to dev@lucene.apache.org by "Christian Moen (Created) (JIRA)" <ji...@apache.org> on 2012/03/25 16:03:27 UTC

[jira] [Created] (LUCENE-3916) Consider different query and index segmentation for Japanese

Consider different query and index segmentation for Japanese
------------------------------------------------------------

                 Key: LUCENE-3916
                 URL: https://issues.apache.org/jira/browse/LUCENE-3916
             Project: Lucene - Java
          Issue Type: Improvement
          Components: modules/analysis
    Affects Versions: 3.6, 4.0
            Reporter: Christian Moen
            Priority: Minor


Kuromoji today uses search mode segmentation both at query and index time.

The benefit with search mode segmentation is that it segments compounds such as 関西国際空港 (Kansai International Airport) into 関西 (Kansai), 国際 (international), 空港 (airport), and leaves the compound 関西国際空港 as a synonym to 関西.

This segmentation allows us to get a match for 空港 (airport), which is good for recall, and we'd still get good precision when searching for the compound 関西国際空港 because of IDF.
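
As a rough illustration of the split-and-keep behaviour described above, here is a toy sketch. It does not call Kuromoji or Lucene: the hand-written COMPOUND_PARTS table and the segment() helper are hypothetical stand-ins for the dictionary-driven segmentation, and the positions only mimic the idea of the compound sitting at the same position as its first part.

```python
# Toy illustration only: a hand-written table stands in for Kuromoji's
# dictionary-driven segmentation. None of these names are Lucene APIs.
COMPOUND_PARTS = {"関西国際空港": ["関西", "国際", "空港"]}


def segment(text, mode):
    """Return (term, position) pairs for a single word in the given mode."""
    parts = COMPOUND_PARTS.get(text)
    if mode == "normal" or parts is None:
        # Normal mode (or an unknown word): keep the word as one token.
        return [(text, 0)]
    # Search mode: keep the compound at the position of its first part,
    # then emit each part at its own position.
    tokens = [(text, 0)]
    tokens += [(part, pos) for pos, part in enumerate(parts)]
    return tokens


search_tokens = segment("関西国際空港", "search")
normal_tokens = segment("関西国際空港", "normal")

# Terms that would end up in the index with search mode at index time.
index_terms = {term for term, _ in search_tokens}
print("空港" in index_terms)  # True: a query for 空港 can match this document
```

In this sketch the compound and 関西 share position 0, which is the sense in which the compound acts like a synonym of its first part.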

However, if we search for the compound 関西国際空港 (Kansai International Airport) our query becomes (by default) an OR-query with terms 関西 (Kansai), 関西国際空港 (Kansai International Airport), 国際 (international) and 空港 (airport).

This behaviour is by design when using OR as the default operator, but it also has the effect of returning generic hits like 空港 (airport) when the user searches for something very specific like 関西国際空港 (Kansai International Airport) -- and these hits are also highlighted.

This doesn't necessarily mean that ranking is flawed per se, but a user or application might prefer precision over recall.  In order to favour precision, we can consider using normal mode segmentation for queries, but retain search mode segmentation on the indexing side.
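
To make the trade-off concrete, here is a minimal sketch under stated assumptions: a two-document toy corpus, a hypothetical analyze() helper standing in for the two segmentation modes, and a plain set-union OR search with no scoring. It is not how Lucene evaluates queries; it only shows which documents match.

```python
# Toy model only: analyze(), build_index() and or_search() are hypothetical
# helpers, not Lucene or Solr APIs, and no ranking is modelled.
COMPOUND_PARTS = {"関西国際空港": ["関西", "国際", "空港"]}

DOCS = {
    "doc1": ["関西国際空港"],  # mentions the specific airport
    "doc2": ["空港"],          # mentions airports in general
}


def analyze(word, mode):
    """Return the set of terms produced for one word in the given mode."""
    parts = COMPOUND_PARTS.get(word)
    if mode == "search" and parts:
        return {word, *parts}  # split and keep the compound
    return {word}              # normal mode: compound stays whole


def build_index(docs, mode):
    """Build a term -> set of document ids mapping."""
    index = {}
    for doc_id, words in docs.items():
        for word in words:
            for term in analyze(word, mode):
                index.setdefault(term, set()).add(doc_id)
    return index


def or_search(index, query, mode):
    """Match any document containing at least one query term."""
    hits = set()
    for term in analyze(query, mode):
        hits |= index.get(term, set())
    return sorted(hits)


index = build_index(DOCS, "search")  # search mode retained at index time

both_search = or_search(index, "関西国際空港", "search")   # ['doc1', 'doc2']
normal_query = or_search(index, "関西国際空港", "normal")  # ['doc1']
recall_kept = or_search(index, "空港", "normal")           # ['doc1', 'doc2']
```

With search mode on both sides, the generic 空港 document matches the specific query. With normal mode at query time only, the specific query matches just doc1, while a query for 空港 still finds both documents because the index was built with search mode.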

Does anyone have any general opinion on this?  What would we do for other languages in the case of compound splitting?

Perhaps this can be dealt with as a documentation issue with a comment in {{schema.xml}} while keeping the current behaviour?

Many thanks for any input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3916) Consider different query and index segmentation for Japanese

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237878#comment-13237878 ] 

Robert Muir commented on LUCENE-3916:
-------------------------------------

For the case of compound splitting, split-and-keep is better than just splitting
(what Kuromoji did until recently), and I think the issues you see are mainly highlighting issues.

And yes, it's true that using search mode at index time is really no different than
adding synonyms for the compounds, but I don't think we should change the default
configuration for Japanese to one that uses different index and search analysis:
that's not ideal for an example.

Using different index and search analysis is really expert territory: I know the Solr example
does this with its English field type, and 90% of the time I think users will just
screw things up worse; I see this in crazy examples on the user lists all the time.

A commented-out note about how this acts just like synonyms and can be done
purely at index time might be good though.

In the future, now that we can split-and-keep, we could also consider adding support
for LUCENE-2892 (SOLR-2477), where if a user asks for a phrase explicitly, we don't decompound.

But still the tradeoff for this stuff is that if we make sophisticated examples
with different chains, chances are that any time a user modifies these chains
they are just gonna screw up their analysis badly. 
                

[jira] [Commented] (LUCENE-3916) Consider different query and index segmentation for Japanese

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240610#comment-13240610 ] 

Christian Moen commented on LUCENE-3916:
----------------------------------------

Thanks a lot, Robert.

I've added a comment about this in {{schema.xml}} as part of SOLR-3276.  I'm resolving this issue.


                

[jira] [Resolved] (LUCENE-3916) Consider different query and index segmentation for Japanese

Posted by "Christian Moen (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Moen resolved LUCENE-3916.
------------------------------------

    Resolution: Fixed
    