Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2011/11/03 21:09:11 UTC

Re: Dismax and phrases

Interesting, in the case where you use quotes...

: +<result name="response" numFound="6888" start="0" maxScore="3.0879765">
	...
: </lst><str name="rawquerystring">"asuntojen hinnat"</str>
: <str name="querystring">"asuntojen hinnat"</str>

...there is one DisjunctionMaxQuery (expected) for the entire phrase, 
but in the sub-clauses for each individual field the clauses coming from 
your "_fi" fields are just building boolean "OR" queries of the terms from 
your phrase (instead of building an actual phrase query)...

: <str name="parsedquery">+DisjunctionMaxQuery((table.title_t:"asuntojen
: hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" |
: (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto
: table.description_fi:hinta) | table.description_t:"asuntojen hinnat" |
: graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto
: graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto
: table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" |
: text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) |
: (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto
: title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0
: FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)</str>

...is this perhaps a side effect of the new autoGeneratePhraseQueries 
option? ... you are explicitly specifying a quoted phrase, but 
maybe somewhere in the code path of the dismax parser that information is 
getting lost?

can you post the details of your schema.xml?  (ie: the "version" property 
on the schema file, and the dynamicField/field + fieldType definitions for 
all these fields)

In contrast, your unquoted example is working exactly as i'd expect.  a 
DisjunctionMaxQuery is built for each clause of the input, and the two 
DisjunctionMaxQuery objects are then combined in a BooleanQuery where the 
minNrShouldMatch property is set to "2"....

: +<result name="response" numFound="1065" start="0"
: maxScore="2.230382"></result>
	...
: <str name="rawquerystring">asuntojen hinnat</str>
: <str name="querystring">asuntojen hinnat</str>
: 
: <str name="parsedquery">+((DisjunctionMaxQuery((table.title_t:asuntojen^2.0 |
: title_t:asuntojen^2.0 | ingress_t:asuntojen | text_fi:asunto |
: table.description_fi:asunto | table.description_t:asuntojen |
: graphic.title_t:asuntojen^2.0 | graphic.title_fi:asunto^2.0 |
: table.title_fi:asunto^2.0 | table.contents_t:asuntojen | text_t:asuntojen |
: ingress_fi:asunto | table.contents_fi:asunto | title_fi:asunto^2.0)~0.01)
: DisjunctionMaxQuery((table.title_t:hinnat^2.0 | title_t:hinnat^2.0 |
: ingress_t:hinnat | text_fi:hinta | table.description_fi:hinta |
: table.description_t:hinnat | graphic.title_t:hinnat^2.0 |
: graphic.title_fi:hinta^2.0 | table.title_fi:hinta^2.0 |
: table.contents_t:hinnat | text_t:hinnat | ingress_fi:hinta |
: table.contents_fi:hinta | title_fi:hinta^2.0)~0.01))~2) () type:tie^6.0
: type:kuv^2.0 type:tau^2.0
: FunctionQuery((1.0/(3.16E-11*float(ms(const(1319438484878),date(date.modified_dt)))+1.0))^100.0)</str>
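
To make that structure concrete, here is a rough sketch in Python. It is not Solr code: the function names, the two-field list and the string rendering are made up for illustration, and it assumes every clause is required (an mm of 100%). It also skips per-field analysis, so it will not reproduce the stemmed "_fi" terms. It only shows one DisjunctionMaxQuery per input clause being combined under a min-should-match equal to the number of clauses:

```python
def dismax_clause(term, fields, tie=0.01):
    """Render one DisjunctionMaxQuery: the same term tried against every field."""
    alternatives = " | ".join(
        f"{name}:{term}" + (f"^{boost}" if boost != 1.0 else "")
        for name, boost in fields
    )
    return f"DisjunctionMaxQuery(({alternatives})~{tie})"


def unquoted_dismax(user_input, fields, tie=0.01):
    """One DisjunctionMaxQuery per whitespace-separated clause.

    Assumption: all clauses must match, so the "~N" suffix (mimicking the
    min-should-match shown in the debug output) is the clause count.
    """
    clauses = [dismax_clause(term, fields, tie) for term in user_input.split()]
    return f"+(({' '.join(clauses)})~{len(clauses)})"


# Hypothetical two-field subset of the qf list, just to keep the output short.
fields = [("title_t", 2.0), ("text_t", 1.0)]
print(unquoted_dismax("asuntojen hinnat", fields))
```

With two input words this renders two DisjunctionMaxQuery clauses followed by "~2", the same shape as the parsedquery quoted above.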


-Hoss

Re: Dismax and phrases

Posted by Chris Hostetter <ho...@fucit.org>.
: ...is this perhaps a side effect of the new autoGeneratePhraseQueries 
: option? ... you are explicitly specifying a quoted phrase, but 
: maybe somewhere in the code path of the dismax parser that information is 
: getting lost?

FWIW:

a) I just realized you said in your first message you were using Solr 
1.4.1, which *definitely* predates the autoGeneratePhraseQueries option - 
so i'm really at a loss to understand how you are getting that query 
structure (definitely want to see your configs)

b) I did some quick testing with Solr 3.4 using the example configs, and 
verified that regardless of how autoGeneratePhraseQueries is set on the 
fieldType for the "name" field, this request...

http://localhost:8983/solr/select/?fl=name&debugQuery=true&q=%22samsung%20hard%20drive%22&defType=dismax&qf=name&qs=100

...always produces a dismax query wrapped around a phrase query.
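
If you want to repeat that test against other fields or phrases, a small Python sketch like the following can build the same request URL. The parameter values are the ones from the request above; the variable names are just for illustration, and actually fetching the URL assumes a Solr instance running on localhost:8983:

```python
from urllib.parse import quote, urlencode

# Same parameters as the request above; the quoted phrase is what matters.
params = {
    "fl": "name",
    "debugQuery": "true",
    "q": '"samsung hard drive"',
    "defType": "dismax",
    "qf": "name",
    "qs": "100",
}

# quote_via=quote encodes spaces as %20 (the default would use "+").
url = "http://localhost:8983/solr/select/?" + urlencode(params, quote_via=quote)
print(url)
```

Swapping in a different qf or q value and comparing the parsedquery in the debug section is the quickest way to see whether a given field produces a phrase query.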


-Hoss

Re: Dismax and phrases

Posted by Chris Hostetter <ho...@fucit.org>.
: I am starting to wonder whether the module giving finnish language support
: (lingsoft) might be the cause?

It's entirely possible -- the details really matter when debugging 
things like this.

Since i don't have any access to these custom plugins, i don't know what 
they might be doing, or how they might be affecting the terms produced 
during analysis to explain why you are getting the structure you are -- 
but one explanation might be if every term produced by them gets a 
positionIncrement of "0" ... that would tell the query parser to treat 
them as alternatives -- it's the same thing SynonymFilter does.

you'd have to look at the output from the analysis tool, feeding your 
example input into the query analyzer to see what terms it produces (and 
what attributes those terms have).  if it is a position increment issue, 
then you should see the same "OR" style query structure (instead of a 
phrase query) even if you use the default "lucene" parser and give it a 
quoted phrase...

	text_fi:"asuntojen hinnat"
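
To illustrate the position increment idea, here is a simplified Python model of the decision. It is not the actual Lucene query parser code, and the function name and the (term, increment) token format are assumptions made up for this sketch. It only covers the two cases discussed here: every token advancing the position (a normal phrase), and every token after the first stacked on the same position (alternatives):

```python
def phrase_clause(field, tokens):
    """Model the query built for a quoted phrase on one field.

    tokens: list of (term, position_increment) pairs, as the query analyzer
    might emit them. An increment of 0 means "same position as the previous
    token", which is how synonyms are represented.
    """
    if len(tokens) == 1:
        return f"{field}:{tokens[0][0]}"
    later_increments = [inc for _, inc in tokens[1:]]
    if all(inc == 0 for inc in later_increments):
        # Everything sits on one position: treated as alternatives (OR).
        return "(" + " ".join(f"{field}:{term}" for term, _ in tokens) + ")"
    if all(inc > 0 for inc in later_increments):
        # Each token advances the position: an ordinary phrase query.
        return f'{field}:"' + " ".join(term for term, _ in tokens) + '"'
    # Mixed positions are deliberately left out of this sketch.
    raise NotImplementedError("mixed position increments are not modeled")


print(phrase_clause("text_t", [("asuntojen", 1), ("hinnat", 1)]))
print(phrase_clause("text_fi", [("asunto", 1), ("hinta", 0)]))
```

The first call renders a phrase query and the second renders the parenthesized "OR" form, which is the same difference seen between the "_t" and "_fi" clauses in the debug output.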


-Hoss

Re: Dismax and phrases

Posted by Hyttinen Lauri <la...@stat.fi>.
Hello,

I am starting to wonder whether the module providing Finnish language 
support (lingsoft) might be the cause?
As I said earlier, I inherited this project, so my understanding of 
all the bells and whistles is a bit limited.

Some selected parts from the schema.xml file:

<schema name="example" version="1.2">
...
<fieldType name="suomi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            />
    <filter class="lingSoft.LSFactory"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            />
    <filter class="lingSoft.LSFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            preserveOriginal="1"
            />
  </analyzer>
</fieldType>
...
<field name="text_fi" type="suomi" indexed="true" stored="true" 
multiValued="true" required="false" />
...
<dynamicField name="*_t"  type="text"  indexed="true"  stored="true" 
multiValued="true"/>
...
<!-- dynamic field for finnish language support with the lingsoft 
transformation -->
<dynamicField name="*_fi"  type="suomi"  indexed="true"  stored="true" 
multiValued="true" />
....
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" 
multiValued="true"/>
<dynamicField name="random_*" type="random" />
<dynamicField name="*" type="text" multiValued="true" index="true" 
stored="true" />

Best regards,
Lauri Hyttinen




-- 
Lauri Hyttinen
Information Services Planner
Tilastokeskus (Statistics Finland)
Unit
Visiting address: Työpajankatu 13, 00580 Helsinki
Postal address: PL 3 A, 00022 Tilastokeskus
tel. 09 1734 0000
lauri.hyttinen@tilastokeskus.fi
www.tilastokeskus.fi