
Posted to solr-user@lucene.apache.org by liwei <ch...@163.com> on 2013/03/01 07:56:16 UTC

It seems a bug of deal with synonym.

In org.apache.solr.parser.SolrQueryParserBase there is a method: "protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted)  throws SyntaxError"

The code below does not handle Chinese correctly.

"          BooleanClause.Occur occur = positionCount > 1 && operator == AND_OPERATOR ?
            BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;

"

For example, "北京市" (Beijing City) and "北京" (Beijing) are synonyms. If I search for "北京市动物园" (Beijing City Zoo), the expected parse result is "+(北京市 北京) +动物园", but it is actually parsed to "+北京市 +北京 +动物园".

The code works for English because English words are separated by spaces, so each query term occupies only one position.

To handle Chinese, I think the decision should be based on the position increment rather than the position count: tokens with a position increment of 0 share a position and should be grouped as alternatives.
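
The grouping idea can be sketched in plain Java, without any Lucene dependency. This is only an illustration of the proposal, not the actual SolrQueryParserBase code: the Tok class and both method names are hypothetical, and the position increments are supplied by hand rather than read from an analyzer.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGrouping {
    /** Hypothetical stand-in for an analyzed token: its text and position increment. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Groups terms by position: a token with posInc == 0 joins the previous group. */
    static List<List<String>> groupByPosition(List<Tok> toks) {
        List<List<String>> groups = new ArrayList<>();
        for (Tok t : toks) {
            if (t.posInc > 0 || groups.isEmpty()) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(t.term);
        }
        return groups;
    }

    /** Renders the groups as a query string: MUST between positions, SHOULD inside one. */
    static String toAndQuery(List<List<String>> groups) {
        StringBuilder sb = new StringBuilder();
        for (List<String> g : groups) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('+');
            if (g.size() == 1) {
                sb.append(g.get(0));
            } else {
                sb.append('(').append(String.join(" ", g)).append(')');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Tok> toks = new ArrayList<>();
        toks.add(new Tok("北京市", 1)); // original token
        toks.add(new Tok("北京", 0));   // synonym at the same position
        toks.add(new Tok("动物园", 1)); // next position
        System.out.println(toAndQuery(groupByPosition(toks)));
    }
}
```

With the three tokens from the example, this produces the expected form "+(北京市 北京) +动物园" instead of three separate required clauses.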

Re: It seems a bug of deal with synonym.

Posted by Jienan Duan <jn...@gmail.com>.

Hi liwei,
I have run into this problem before. My solution is to expand the synonyms
first and then normalize them. For example, '北京市' and '北京' are synonyms,
so at index time my program converts both to '北京市', and at search time it
applies the same logic.

In solr.SynonymFilterFactory, the position increment between a synonym and
the original token is 0, so you can use SynonymFilterFactory to build a token
set for the query string and then normalize the synonyms in that set.

Here is a piece of code that I think may help you understand my solution:

// req is a SolrQueryRequest; qstr is the raw query string.
// Note: reusableTokenStream is the Lucene 3.x analysis API.
Analyzer analyzer = req.getSchema().getQueryAnalyzer();
if (null == analyzer) {
    return;
}
final TokenizerChain tokenizerChain = (TokenizerChain)
        req.getSchema().getField("title").getType().getQueryAnalyzer();
// Find the synonym filter configured on the "title" field, if there is one.
SynonymFilterFactory sff = null;
for (TokenFilterFactory tf : tokenizerChain.getTokenFilterFactories()) {
    if (tf instanceof SynonymFilterFactory) {
        sff = (SynonymFilterFactory) tf;
    }
}

StringReader reader = new StringReader(qstr);
StringBuilder buffer = new StringBuilder(128);
Set<String> tokenSet = new LinkedHashSet<String>();
TokenStream tokens = null;
TokenStream sf = null;
try {
    // Analyze the query string against the "title" field.
    tokens = analyzer.reusableTokenStream("title", reader);

    if (sff != null) {
        // Wrap the analyzed stream so synonyms are emitted with posIncr == 0.
        sf = sff.create(tokens);
        sf.reset();
        CharTermAttribute termAtt = sf.getAttribute(CharTermAttribute.class);
        PositionIncrementAttribute positionIncrementAttribute =
                sf.getAttribute(PositionIncrementAttribute.class);
        OffsetAttribute offsetAttribute = sf.getAttribute(OffsetAttribute.class);
        Set<String> duplicatedTokenSet = new HashSet<String>();
        while (sf.incrementToken()) {
            final String token = termAtt.toString().toLowerCase();
            final int posIncr = positionIncrementAttribute.getPositionIncrement();
            // Then you can normalize the synonyms to a standard word.
        }
    }
} catch (IOException e) {
    // Error handling was not shown in the original snippet.
    throw new RuntimeException(e);
}
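
The normalization step that the snippet leaves as a comment might look roughly like the following. It is a sketch under stated assumptions, not the code from my program as posted: the class and method names are hypothetical, the terms and position increments are passed in as plain lists instead of being read from a TokenStream, and the rule for picking the standard word (the longest term in the group, ties broken by string order) is only one possible deterministic choice.

```java
import java.util.ArrayList;
import java.util.List;

public class SynonymNormalizer {
    /** Assumed rule: the standard word is the longest term, ties broken by string order. */
    static String canonical(List<String> group) {
        String best = group.get(0);
        for (String s : group) {
            if (s.length() > best.length()
                    || (s.length() == best.length() && s.compareTo(best) < 0)) {
                best = s;
            }
        }
        return best;
    }

    /**
     * Collapses every run of tokens sharing a position (posInc == 0 after the
     * first token) into a single standard word. Both lists must be the same length.
     */
    static List<String> normalize(List<String> terms, List<Integer> posIncs) {
        List<String> out = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            if (posIncs.get(i) > 0 && !current.isEmpty()) {
                out.add(canonical(current)); // a new position starts: flush the group
                current = new ArrayList<>();
            }
            current.add(terms.get(i));
        }
        if (!current.isEmpty()) {
            out.add(canonical(current));
        }
        return out;
    }
}
```

Because the rule does not depend on which synonym came first, a query containing '北京' and a document containing '北京市' both end up with the same term, as long as the same logic runs at index time and at search time.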

Best Regards.


-- 
------------------------------------------------------
Not taking detours is itself the shortcut.
http://www.jnan.org/