
Posted to solr-user@lucene.apache.org by 李威 <li...@antvision.cn> on 2013/03/12 02:15:50 UTC

It seems a issue of deal with chinese synonym for solr

In org.apache.solr.parser.SolrQueryParserBase, there is a method: "protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted)  throws SyntaxError"

The code below does not handle Chinese correctly.

"          BooleanClause.Occur occur = positionCount > 1 && operator == AND_OPERATOR ?
            BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;

"

For example, "北京市" and "北京" are synonyms. If I search for "北京市动物园", the expected parse result is "+(北京市 北京) +动物园", but it is actually parsed as "+北京市 +北京 +动物园".

The code works for English because English words are separated by spaces, so each word occupies only one position.

To handle Chinese, I think the decision should be based on the position increment rather than on the position count.
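
To make the idea concrete, here is a minimal sketch in plain Java of grouping by position increment. It does not use Lucene's classes: Tok, buildQuery and PositionGroupingSketch are hypothetical names for illustration only, and the position increments in main are assumed values for what a synonym-aware analyzer might emit (an increment of 0 meaning "same position as the previous token"). It is not the actual code in SolrQueryParserBase.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionGroupingSketch {

    // Hypothetical stand-in for one analyzed token: its text plus its
    // position increment (0 = stacked on the previous token, e.g. a synonym).
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // Groups tokens that share a position, then joins the groups.
    // With requireAll == true (AND operator) every position is required (+);
    // terms inside one position are alternatives.
    static String buildQuery(List<Tok> tokens, boolean requireAll) {
        List<List<String>> positions = new ArrayList<>();
        for (Tok t : tokens) {
            // A positive increment starts a new position; an increment of 0
            // adds the term to the current position.
            if (t.posInc > 0 || positions.isEmpty()) {
                positions.add(new ArrayList<>());
            }
            positions.get(positions.size() - 1).add(t.term);
        }
        StringBuilder sb = new StringBuilder();
        for (List<String> terms : positions) {
            if (sb.length() > 0) sb.append(' ');
            if (requireAll) sb.append('+');
            if (terms.size() == 1) {
                sb.append(terms.get(0));
            } else {
                sb.append('(').append(String.join(" ", terms)).append(')');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed analyzer output for "北京市动物园" with the synonym rule applied.
        List<Tok> tokens = List.of(
            new Tok("北京市", 1),
            new Tok("北京", 0),
            new Tok("动物园", 1));
        System.out.println(buildQuery(tokens, true)); // +(北京市 北京) +动物园
    }
}
```

The point of the sketch is only that the MUST/SHOULD choice is made per position, not per token, so synonyms at the same position stay optional relative to each other.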

Could you help take a look?




Thanks,

Wei Li

Re: It seems a issue of deal with chinese synonym for solr

Posted by Kuro Kurosaka <ku...@sonic.net>.

On 3/11/13 6:15 PM, 李威 wrote:
> In org.apache.solr.parser.SolrQueryParserBase, there is a method: "protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted)  throws SyntaxError"
>
> The code below does not handle Chinese correctly.
>
> "          BooleanClause.Occur occur = positionCount > 1 && operator == AND_OPERATOR ?
>              BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
>
> "
>
> For example, "北京市" and "北京" are synonyms. If I search for "北京市动物园", the expected parse result is "+(北京市 北京) +动物园", but it is actually parsed as "+北京市 +北京 +动物园".
>
> The code works for English because English words are separated by spaces, so each word occupies only one position.

An interesting feature of this example is that the difference between the two
synonyms is the omission of one token, "市" (city). Doesn't the same problem
happen if we define "London City" and "London" as synonyms and execute a query
like "London City Zoo"? Must a Chinese analyzer be used to reproduce this
problem?

I tried to test this but could not reproduce it. The result of query string
expansion using Solr 4.2's query interface with debug output shows:

<str name="parsedquery">MultiPhraseQuery(text:"(london london) city zoo")</str>

I see no plus (+). What query parser did you use?

-- 
Kuro Kurosaka

Re: It seems a issue of deal with chinese synonym for solr

Posted by Robert Muir <rc...@gmail.com>.

I agree. Actually, that top-level logic is fine. It's the loop that
follows that's wrong: it needs to look at the position increment and do
the right thing.

Want to open a JIRA issue?
