Posted to dev@lucene.apache.org by "Alex Simatov (JIRA)" <ji...@apache.org> on 2016/07/14 13:03:20 UTC

[jira] [Created] (LUCENE-7379) Search word request on Chinese is not working properly

Alex Simatov created LUCENE-7379:
------------------------------------

             Summary: Search word request on Chinese is not working properly
                 Key: LUCENE-7379
                 URL: https://issues.apache.org/jira/browse/LUCENE-7379
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/queryparser
    Affects Versions: 5.0
            Reporter: Alex Simatov


We used Lucene 2.3 in our project for years.
Some time ago we upgraded to Lucene 5.0.0.
Since then, Chinese analysis has stopped working as expected (I have not tested Japanese or Korean).

We have the following code to process the search request:

1. analyzer = new ClassicAnalyzer();
2. logger.Write2Log(queryString);
3. QueryParser qp = new QueryParser(fieldName, analyzer);
4. Query query = qp.parse(queryString);
5. logger.Write2Log(query.toString(fieldName));
6. int hits = searcher.search(query, 1).totalHits;

The analyzer on line 1 can be changed through configuration.
Line 2 logs the query string that we pass to Lucene.
Line 5 logs the query after Lucene has parsed it.

Normally we use the string 打不开~0.7 to match with 70% or higher similarity, and 打不开 to find exactly this word.
The ~0.7 syntax has been deprecated since version 4.0; however, it still works for English at least.

Previous behavior (on Lucene 2.3):
Line 2: 打不开~0.7 
Line 5: 打不开~0.7
If the text being searched contains the term, line 6 returns the correct result.

The same holds for 打不开 without the similarity suffix (without ~0.7).

Current behavior (on Lucene 5.0):
Line 2: 打不开~0.7 
Line 5: 打不开~0
As I understand it, Lucene converts the deprecated parameter into the newly supported one, which has a slightly different meaning (at least it behaves that way for English).
The text being searched contains 打不开, but line 6 reports that nothing was found.
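
A possible explanation for the ~0 on line 5, sketched under an assumption: the parser seems to map the legacy similarity to a whole number of edits, roughly floor((1 - similarity) * termLength), capped at 2. The class and method names below are hypothetical and this is not Lucene code; it only illustrates the arithmetic for a three-character term.

```java
public class FuzzyEditsSketch {
    // Assumption: fuzzy queries in Lucene 4+ support at most 2 edits.
    static final int MAX_EDITS = 2;

    // Hypothetical helper: converts a legacy similarity (0..1) to an edit distance.
    static int legacySimilarityToEdits(float minimumSimilarity, int termLength) {
        if (minimumSimilarity >= 1f) {
            // Values >= 1 are already treated as an edit count.
            return (int) Math.min(minimumSimilarity, MAX_EDITS);
        }
        if (minimumSimilarity == 0f) {
            return 0;
        }
        // Truncation toward zero is what turns 0.9 edits into 0 edits.
        return Math.min((int) ((1d - minimumSimilarity) * termLength), MAX_EDITS);
    }

    public static void main(String[] args) {
        String chinese = "打不开";
        int chineseLength = chinese.codePointCount(0, chinese.length()); // 3
        // (1 - 0.7) * 3 = 0.9, truncated to 0 edits
        System.out.println(legacySimilarityToEdits(0.7f, chineseLength));
        // (1 - 0.7) * 8 = 2.4, truncated to 2 edits
        System.out.println(legacySimilarityToEdits(0.7f, "settings".length()));
    }
}
```

If this assumption is right, a short term such as 打不开 always ends up with 0 edits at ~0.7, while longer English words still get 1 or 2 edits, which would match what we see for English.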

Line 2: 打不开 
Line 5: 打 不 开
Lucene added spaces, which are interpreted as the OR operator. As a result, line 6 reports the keyword as found even if the text being searched contains only the single character 不.
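
To illustrate the difference, here is a minimal simulation in plain Java (not Lucene code; all names are hypothetical). It assumes the analyzer emits one token per Chinese character, and compares OR matching over those tokens with matching them as a consecutive sequence, which is the behavior we expected.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UnigramMatchSketch {
    // Splits text into one token per code point, mimicking an analyzer
    // that emits each CJK character as a separate term (an assumption).
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // OR semantics: a single shared character is enough to match.
    static boolean matchesAny(String query, String document) {
        List<String> documentTokens = unigrams(document);
        for (String token : unigrams(query)) {
            if (documentTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Phrase semantics: all query tokens must appear consecutively and in order.
    static boolean matchesPhrase(String query, String document) {
        return Collections.indexOfSubList(unigrams(document), unigrams(query)) >= 0;
    }

    public static void main(String[] args) {
        String query = "打不开";
        String unrelated = "我不知道";   // shares only 不 with the query
        String relevant = "文件打不开了"; // contains the whole word

        System.out.println(matchesAny(query, unrelated));    // true
        System.out.println(matchesPhrase(query, unrelated)); // false
        System.out.println(matchesAny(query, relevant));     // true
        System.out.println(matchesPhrase(query, relevant));  // true
    }
}
```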

The same scenario was tested with CJKAnalyzer, ClassicAnalyzer and SmartChineseAnalyzer. The results are the same: none of them behaves like the analyzer in Lucene 2.3.

Is this a known problem? Could you please explain, or point to documentation on, how search should work for Chinese in the cases described above?
Thanks


