Posted to solr-user@lucene.apache.org by Jerome Renard <je...@gmail.com> on 2011/01/24 18:31:07 UTC

Weird behaviour with phrase queries

Hi,

I have a problem with phrase queries: from time to time I do not get any
results, even though I know something should be returned.

The search is run against a field of type "text", whose definition is
available at the following URL:
- http://pastebin.com/Ncem7M8z

This field is defined with the following configuration:
<field name="meta_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true"/>

I use the following request handler:
<requestHandler name="custom" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">meta_text</str>
        <str name="pf">meta_text</str>
        <str name="bf"/>
        <str name="mm">1&lt;1 2&lt;-1 5&lt;-2 7&lt;60%</str>
        <int name="ps">100</int>
        <str name="q.alt">*:*</str>
    </lst>
</requestHandler>

Depending on the kind of phrase query I use I get either exactly what I am
looking for or nothing.

The index contents are all in French, so I suspected a problem with
accents, but phrase queries containing "é" and "è" characters, such as
"académie" or "ingénieur", do work.

As you will see, the "text" type uses the SnowballPorterFilterFactory
for the English language. I plan to fix that by using the correct language
for the index (French) and the following protwords: http://bit.ly/i8JeX6 .
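
For reference, the planned stemmer change could look roughly like the fragment below. This is only a sketch: the rest of the "text" analyzer chain is in the pastebin and is not reproduced here, and the file name protwords.txt is an assumption about where the protected words would be stored.

```xml
<!-- Sketch only: replaces the English stemmer line inside the "text" fieldType. -->
<!-- "protwords.txt" is an assumed file name for the protected-words list. -->
<filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
```

If the type declares separate index and query analyzers, the same filter would normally go in both, and the documents need to be re-indexed before the change can be judged.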

Apart from this mistake with the stemmer, did I do something (else) wrong?
Did I overlook something? What could explain why I do not always get
results for my phrase queries?

Thanks in advance for your feedback.

Best Regards,

--
Jérôme

Re: Weird behaviour with phrase queries

Posted by Em <ma...@yahoo.de>.
Hi Jerome,

Does your fieldtype contain a stopword filter?
That could be the root of all evil :-).

Could you provide us with the fieldtype definition and the explain output of
an example query?
Did you check analysis.jsp to have a look at the produced results?

Regards,
Em



Re: Weird behaviour with phrase queries

Posted by Jerome Renard <je...@gmail.com>.
Hi Erick,

On Tue, Jan 25, 2011 at 1:38 PM, Erick Erickson <er...@gmail.com> wrote:

> Frankly, this puzzles me. It *looks* like it should be OK. One warning, the
> analysis page sometimes is a bit misleading, so beware of that.
>
> But the output of your queries makes it look like the query is parsing as
> you expect, which leaves the question of whether your index contains what
> you think it does. You might get a copy of Luke, which allows you to
> examine what's actually in your index instead of what you think is in
> there. Sometimes there are surprises here!
>
>
Bingo! Some of the data was not in the index. Indexing it fixed the
problem.


> I didn't mean to re-index your whole corpus, I was thinking that you could
> just index a few documents in a test index so you have something small to
> look at.
>
> Sorry I can't spot what's happening right away.
>
>
No worries, thanks for your support :)

-- 
Jérôme

Re: Weird behaviour with phrase queries

Posted by Erick Erickson <er...@gmail.com>.
Frankly, this puzzles me. It *looks* like it should be OK. One warning, the
analysis page sometimes is a bit misleading, so beware of that.

But the output of your queries makes it look like the query is parsing as you
expect, which leaves the question of whether your index contains what
you think it does. You might get a copy of Luke, which allows you to examine
what's actually in your index instead of what you think is in there.
Sometimes there are surprises here!

I didn't mean to re-index your whole corpus, I was thinking that you could
just index a few documents in a test index so you have something small to
look at.

Sorry I can't spot what's happening right away.

Good luck!
Erick


Re: Weird behaviour with phrase queries

Posted by Jerome Renard <je...@gmail.com>.
Erick,

On Mon, Jan 24, 2011 at 9:57 PM, Erick Erickson <er...@gmail.com> wrote:

> Hmmm, I don't see any screen shots. Several things:
> 1> If your stopword file has comments, I'm not sure what the effect would
> be.
>

Ha, I thought comments were supported in stopwords.txt


> 2> Something's not right here, or I'm being fooled again. Your withresults
> xml has this line:
> <str name="parsedquery">+DisjunctionMaxQuery((meta_text:"ecol d ingenieur")~0.01) ()</str>
> and your noresults has this line:
> <str name="parsedquery">+DisjunctionMaxQuery((meta_text:"academi charpenti")~0.01) DisjunctionMaxQuery((meta_text:"academi charpenti"~100)~0.01)</str>
>
> the empty () in the first one often means you're NOT going to your
> configured dismax parser in solrconfig.xml. Yet that doesn't square with
> your custom qt, so I'm puzzled.
>
> Could we see your raw query string on the way in? It's almost as if you
> defined qt in one and defType in the other, which are not equivalent.
>

You are right, I fixed this problem (my bad).

> 3> It may take 12 hours to index, but you could experiment with a smaller
> subset. You say you know that the noresults one should return documents,
> what proof do you have? If there's a single document that you know should
> match this, just index it and a few others and you should be able to make
> many runs until you get to the bottom of this...
>
>
I could, but I always thought I had to fully re-index after updating
schema.xml. If I update only a few documents, will that take the changes
into account without breaking the rest?


> And obviously your stemming is happening on the query, are you sure it's
> happening at index time too?
>
>
Since you did not get the screenshots, you will find attached the full output
of the analysis for a phrase that works and for another that does not.

Thanks for your support

Best Regards,

--
Jérôme

Re: Weird behaviour with phrase queries

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, I don't see any screen shots. Several things:
1> If your stopword file has comments, I'm not sure what the effect would
be.

2> Something's not right here, or I'm being fooled again. Your withresults
xml has this line:
<str name="parsedquery">+DisjunctionMaxQuery((meta_text:"ecol d ingenieur")~0.01) ()</str>
and your noresults has this line:
<str name="parsedquery">+DisjunctionMaxQuery((meta_text:"academi charpenti")~0.01) DisjunctionMaxQuery((meta_text:"academi charpenti"~100)~0.01)</str>

The empty () in the first one often means you're NOT going to your
configured dismax parser in solrconfig.xml. Yet that doesn't square with
your custom qt, so I'm puzzled.

Could we see your raw query string on the way in? It's almost as if you
defined qt in one and defType in the other, which are not equivalent.

3> It may take 12 hours to index, but you could experiment with a smaller
subset. You say you know that the noresults one should return documents,
what proof do you have? If there's a single document that you know should
match this, just index it and a few others and you should be able to make
many runs until you get to the bottom of this...

And obviously your stemming is happening on the query, are you sure it's
happening at index time too?
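
On the qt versus defType point in item 2: qt selects a request handler by name, while defType selects the query parser used for q. One way to remove the ambiguity is to state the parser in the handler's own defaults. The fragment below is a sketch under the assumption that the Solr version in use accepts a defType default on solr.SearchHandler (Solr 1.4 and later, as far as I know); the handler name and parameters are copied from the original post.

```xml
<!-- Sketch: same defaults as the "custom" handler in the thread, -->
<!-- but registered as a SearchHandler with the dismax parser named explicitly. -->
<requestHandler name="custom" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">meta_text</str>
        <str name="pf">meta_text</str>
        <str name="mm">1&lt;1 2&lt;-1 5&lt;-2 7&lt;60%</str>
        <int name="ps">100</int>
        <str name="q.alt">*:*</str>
    </lst>
</requestHandler>
```

With this in place, every request sent with qt=custom goes through dismax, so both test queries should show the same shape of parsedquery in the debug output.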

Best
Erick


Re: Weird behaviour with phrase queries

Posted by Jerome Renard <je...@gmail.com>.
Hi Em, Erick

thanks for your feedback.

Em: yes. Here is the stopwords.txt I use:
-
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt
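
A note on the comments in that file: french_stop.txt is in Snowball format, where "|" starts a comment and a line may hold several words. Later Solr releases document a format="snowball" attribute on StopFilterFactory for exactly this kind of file. Whether the Solr version used in this thread supports it is an assumption to verify, and the file name stopwords.txt is taken from the message above.

```xml
<!-- Sketch: stop filter reading a Snowball-format word list. -->
<!-- format="snowball" is an assumption about the installed Solr version; check its documentation. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" format="snowball"/>
```

If the attribute is not available, stripping the "|" comments and putting one word per line gives a plain word list that the default format can read.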

On Mon, Jan 24, 2011 at 6:58 PM, Erick Erickson <er...@gmail.com> wrote:

> Try submitting your query from the admin page with &debugQuery=on and see
> if that helps. The output is pretty dense, so feel free to cut-paste the
> results for help.
>
> Your stemmers have English as the language, which could also be
> "interesting".
>
>
Yes, I noticed that; this will be fixed.


> As Em says, the analysis page may help here, but I'd start by taking out
> WordDelimiterFilterFactory, SnowballPorterFilterFactory and
> StopFilterFactory and build back up if you really need them. Although,
> again, the analysis page that's accessible from the admin page may help
> greatly (check "debug" in both index and query).
>
>
You will find attached two XML files: one with no results (noresult.xml.gz)
and one with a lot of results (withresults.xml.gz). You will also find
attached two screenshots showing that there is a highlighted section in the
"Index analyzer" section when analysing text.


> Oh, and you MUST re-index after changing your schema to have a true test.
>
>
Yes, the problem is that reindexing takes around 12 hours, which makes
testing really hard :/

Thanks in advance for your feedback.

Best Regards,

-- 
Jérôme

Re: Weird behaviour with phrase queries

Posted by Erick Erickson <er...@gmail.com>.
Try submitting your query from the admin page with &debugQuery=on and see
if that helps. The output is pretty dense, so feel free to cut-paste the
results for help.

Your stemmers have English as the language, which could also be
"interesting".

As Em says, the analysis page may help here, but I'd start by taking out
WordDelimiterFilterFactory, SnowballPorterFilterFactory and
StopFilterFactory and build back up if you really need them. Although,
again, the analysis page that's accessible from the admin page may help
greatly (check "debug" in both index and query).
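
To make the "take the filters out and build back up" step concrete, a stripped-down field type might look like the sketch below. The name text_debug is hypothetical, and the choice of tokenizer is an assumption, since the actual "text" definition is only available via the pastebin link.

```xml
<!-- Hypothetical debugging type: no word delimiter, stemmer or stop filter. -->
<!-- A single <analyzer> with no type attribute is used at both index and query time. -->
<fieldType name="text_debug" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

Pointing a test field at this type and indexing a handful of documents gives a baseline; filters can then be added back one at a time until the phrase query stops matching.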

Oh, and you MUST re-index after changing your schema to have a true test.

Best
Erick
