Posted to solr-user@lucene.apache.org by ja...@nokia.com on 2010/11/22 14:49:23 UTC

DisMaxQParserPlugin and Tokenization

Hi,



Using the SearchHandler with the defType=dismax option enables the DisMaxQParserPlugin. From what I can tell, it just tokenizes on whitespace.

However, looking through the code, I could not find the place where this behavior is enforced. I only found that getFieldQuery() is called for each field, which either throws an "unknownField" exception or returns the correct analyzer, including tokenizer and filters, for the given field.

We want to use a fancier tokenizer/filter setup with the DisMaxQuery stuff.

Where is the best place to hook in?



Jan

Re: DisMaxQParserPlugin and Tokenization

Posted by Jan Kurella <ja...@nokia.com>.

Ok, I think I found it: the query parser used in the background "chunks"
the input by whitespace (and {}). Each of these chunks is then treated as
a "phrase". This is completely useless for languages that are not
tokenized by whitespace.
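
To illustrate the difference, here is a minimal sketch in plain Java (no Lucene or Solr classes). The class ChunkingDemo and both of its methods are made up for this illustration, and the bigram splitter is only an assumed stand-in for whatever analyzer the field really uses. It shows that whitespace chunking leaves an unspaced string as a single chunk, while a language-aware tokenizer would produce several tokens:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, no Lucene or Solr classes involved.
class ChunkingDemo {

    // Mimics the pre-analysis step described above: split on whitespace only.
    static List<String> whitespaceChunks(String query) {
        List<String> chunks = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            if (!chunk.isEmpty()) {
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    // Hypothetical stand-in for a language-aware tokenizer: overlapping
    // character bigrams over each chunk, a common choice for CJK text.
    static List<String> bigrams(String query) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : whitespaceChunks(query)) {
            if (chunk.length() < 2) {
                tokens.add(chunk);
                continue;
            }
            for (int i = 0; i + 2 <= chunk.length(); i++) {
                tokens.add(chunk.substring(i, i + 2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A four-character Japanese string written without spaces
        // (given as Unicode escapes to keep this source file ASCII-only).
        String query = "\u6771\u4EAC\u90FD\u5E81";
        System.out.println(whitespaceChunks(query));
        System.out.println(bigrams(query));
    }
}
```

For a four-character string written without spaces, whitespaceChunks returns one chunk, while bigrams returns three overlapping tokens.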

So I started a simple DisMaxQueryParser of my own. Can someone verify that
this code produces a DisMaxQuery? (Theory taken from here:
http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/)

{code}
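             // Analyze the whole input with the analyzer of the field "all".
             stream = analyzer.reusableTokenStream("all", input);
             TermAttribute oTermAtt = stream.addAttribute(TermAttribute.class);
             int clauses = 0; // number of top-level SHOULD clauses, one per token
             BooleanQuery result = new BooleanQuery();
             while (stream.incrementToken()) {
                 // One DisjunctionMaxQuery per token, with a tie breaker of 0.1.
                 DisjunctionMaxQuery clause = new DisjunctionMaxQuery(0.1f);
                 String oTermText = oTermAtt.term();
                 // Look the token up in every configured field.
                 for (int iF = 0; iF < fields.length; ++iF) {
                     Query oQuery = new SpanTermQuery(new Term(fields[iF], oTermText));
                     clause.add(oQuery);
                 }
                 result.add(new BooleanClause(clause, Occur.SHOULD));
                 // Count once per token, not once per field: the minimum
                 // applies to the SHOULD clauses added to result.
                 ++clauses;
             }
             result.setMinimumNumberShouldMatch((int) Math.ceil(0.75 * clauses)); // mm=75%
             return result;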
{code}

Is this basically what the DisMaxQueryParser would do if it tokenized the
full query without parsing for any of [+"{}]?
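
To make the intended structure concrete, here is a small sketch of the arithmetic in plain Java (no Lucene classes). The class DisMaxSketch and its method names are made up for this illustration, and the per-field scores are invented numbers. It assumes the usual disjunction-max scoring (best field score plus the tie breaker times the sum of the other field scores) and mirrors the Math.ceil rounding used above; Solr's own mm rounding may differ. Note that setMinimumNumberShouldMatch applies to the optional top-level clauses, of which the loop adds one per token:

```java
// Illustration only: plain Java arithmetic, no Lucene classes involved.
class DisMaxSketch {

    // Score of one disjunction-max clause: the best per-field score plus
    // tieBreaker times the sum of the remaining per-field scores.
    // Assumes non-negative scores; an empty array scores 0.
    static double disMaxScore(double[] fieldScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    // Number of optional clauses that must match when mm is given as a
    // fraction. clauseCount is the number of top-level SHOULD clauses.
    static int minShouldMatch(int clauseCount, double fraction) {
        return (int) Math.ceil(fraction * clauseCount);
    }

    public static void main(String[] args) {
        // Hypothetical scores of one token in three fields.
        double[] scores = {0.5, 2.0, 1.0};
        System.out.println(disMaxScore(scores, 0.1));
        // Four tokens, mm=75%.
        System.out.println(minShouldMatch(4, 0.75));
    }
}
```

With two tokens searched across three fields, counting token/field pairs gives minShouldMatch(6, 0.75) = 5, which two SHOULD clauses can never satisfy; counting tokens gives minShouldMatch(2, 0.75) = 2.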

Jan


On 24.11.2010 09:20, ext jan.kurella@nokia.com wrote:
> Sorry for the double post. Can someone point me to where the original query given to the DisMaxHandler/QParser is split?
>
> Jan

RE: DisMaxQParserPlugin and Tokenization

Posted by ja...@nokia.com.

Sorry for the double post. Can someone point me to where the original query given to the DisMaxHandler/QParser is split?

Jan
