Posted to solr-user@lucene.apache.org by sureshrk19 <su...@gmail.com> on 2014/02/28 07:54:17 UTC

stopwords issue with edismax

Hi All,

I'm having a problem while searching for some string with a word defined in
stopwords.txt.

eg: I have 'of' defined in stopwords.txt 

My schema analyzers are defined as follows:

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>


I have defined the field as 'all_text' in schema.xml.

When I try to search for 
Case 1: a of b --> I don't get any response
Case 2: "a of b" --> There is some response
Case 3: a \of\ b --> There is some response.

I'm using stopwords in both the 'index' and 'query' analyzers, so this should
work for case 1 too, right?

Did I miss anything?

Thanks,
Suresh



--
View this message in context: http://lucene.472066.n3.nabble.com/stopwords-issue-with-edismax-tp4120339.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: stopwords issue with edismax

Posted by Jack Krupansky <ja...@basetechnology.com>.
Yes, if they are tokenized text fields, but I was assuming that "number" was 
a strictly numeric field.

That said, you could have numeric and non-tokenized string fields, but
copyField them to text fields (or a single text field) for purposes of
queries.
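
A minimal sketch of that approach, assuming all_text is the catch-all text
field and that "number" and "all_code" stay non-tokenized. The field types
and attributes shown here are assumptions for illustration, not the poster's
actual schema:

```xml
<!-- Assumed declarations: exact-match fields kept as non-tokenized strings -->
<field name="number" type="string" indexed="true" stored="true"/>
<field name="all_code" type="string" indexed="true" stored="true"/>

<!-- Copy their values into the tokenized, stop-word-filtered text field -->
<copyField source="number" dest="all_text"/>
<copyField source="all_code" dest="all_text"/>
```

With more than one source copied into it, all_text would need to be
multiValued, and qf would then list only the text fields.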

-- Jack Krupansky

-----Original Message----- 
From: sureshrk19
Sent: Tuesday, March 4, 2014 1:57 PM
To: solr-user@lucene.apache.org
Subject: Re: stopwords issue with edismax

Thanks Jack.

I was able to fix this problem by adding the stopwords filter to the
<fieldType> definitions for "number" and "all_code".
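
For illustration, such a definition might look like the sketch below. The
type name text_code is hypothetical, and the analyzer chain is simply copied
from the all_text index analyzer posted at the start of this thread; the
actual definitions used for "number" and "all_code" were not posted.

```xml
<!-- Hypothetical type name; the chain mirrors the all_text index analyzer -->
<fieldType name="text_code" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Same stop word list as all_text, so "of" is dropped here too -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the index-time analysis of a field generally means the affected
documents have to be reindexed before the change is fully visible.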







Re: stopwords issue with edismax

Posted by Jack Krupansky <ja...@basetechnology.com>.
As I suggested, you have a couple of fields that do not ignore stop words, so
the stop word must be present in at least one of those fields:

(number:of^3.0 | all_code:of^2.0)

The solution would be to remove the "number" and "all_code" fields from qf.
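
As a sketch, assuming the qf from the request handler defaults posted
earlier in this thread, the change would be to drop the two fields and keep
the remaining boosts as they are:

```xml
<!-- qf without "number" and "all_code"; the other fields and boosts are unchanged -->
<str name="qf">all_text name^5 party^3 ent_name^7</str>
```

Note that queries would then no longer match on values that occur only in
"number" or "all_code".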

-- Jack Krupansky

-----Original Message----- 
From: sureshrk19
Sent: Monday, March 3, 2014 1:05 AM
To: solr-user@lucene.apache.org
Subject: Re: stopwords issue with edismax

Jack,

Thanks for the reply.

Yes, your observation is right. I see that stopwords are not being ignored at
query time.
Say I'm searching for 'bank of america'. I expect that 'of' should not be
part of the search.
But here I see that 'of' is being sent. The query syntax is the same for the
'OR' and 'AND' operators, and 'OR' returns results as expected. But in my
case, I want to use 'AND'.

Here is debug query information...

"parsedquery":"(+((DisjunctionMaxQuery((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0))
DisjunctionMaxQuery((number:of^3.0 | all_code:of^2.0))
DisjunctionMaxQuery((ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0)))~3))/no_coord",
    "parsedquery_toString":"+(((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0)
(number:of^3.0 | all_code:of^2.0) (ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0))~3)"

Is there any reason why stopwords are not being ignored? I checked
schema.xml for the filter, and it is present:
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>






Re: stopwords issue with edismax

Posted by Jack Krupansky <ja...@basetechnology.com>.
Look at the parsed query ("parsedquery" in the debug output) by setting the
debugQuery=true parameter.

I think what is happening is that the query parser generates a separate
dismax query for each term, and each dismax query requires at least one of
its fields to contain the term. I suspect that some of your qf fields do not
ignore stop words, so the dismax for "of" will not be empty (although the
clauses for the fields that do filter stop words will not be present). That
dismax then fails to match anything, and since q.op=AND, the whole query
matches nothing.
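
A minimal sketch of one way to keep debug output switched on while
investigating, assuming a separate handler used only for experiments. The
handler name /select-debug is hypothetical, and the qf shown is only an
illustrative subset of fields:

```xml
<!-- Hypothetical handler for troubleshooting; not meant for production traffic -->
<requestHandler name="/select-debug" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">all_text number^3</str>
    <str name="q.op">AND</str>
    <!-- Adds the debug section, including the parsed query, to every response -->
    <str name="debugQuery">true</str>
  </lst>
</requestHandler>
```

The same effect can be had for a single request by appending
debugQuery=true to the query URL.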

-- Jack Krupansky

-----Original Message----- 
From: sureshrk19
Sent: Friday, February 28, 2014 1:12 PM
To: solr-user@lucene.apache.org
Subject: Re: stopwords issue with edismax

Thanks for taking time on this...

Here is my request handler definition:

<requestHandler name="/select" class="solr.SearchHandler">

     <lst name="defaults">
       <str name="defType">edismax</str>
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="df">all_text number party name all_code ent_name</str>
       <str name="qf">all_text number^3 name^5 party^3 all_code^2 ent_name^7</str>
       <str name="fl">id description</str>
       <str name="q.op">AND</str>


The name that is indexed is: a of b
When I search for a of b, I don't see any results.

When I change q.op to OR, I see results for this search.

I'm not sure why the same results are not returned when I search with the AND
operator.







Re: stopwords issue with edismax

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi,

From the URLs you provided, it is not clear that you are using the edismax query parser at all. That's why I asked for the complete list of parameters. Can you paste the request handler definition from solrconfig.xml?

Also, what do you expect, and what is not working for you?





On Friday, February 28, 2014 7:30 PM, sureshrk19 <su...@gmail.com> wrote:
<str name="echoParams">explicit</str>

I have the same setting for all handlers.

Another observation:

I get results when I use 'q.op=OR'; the default operator set in
solrconfig.xml is 'AND'.

The query that works is:
http://localhost:8080/solr/collection1/select?q=bank+america&wt=json&indent=true&q.op=OR








Re: stopwords issue with edismax

Posted by Ahmet Arslan <io...@yahoo.com>.
Can you give the parameters defined in the defaults section of the request handler in solrconfig.xml?

By the way, echoParams=all will list all parameters.
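
As a small sketch, this could be set in the handler's defaults, or passed as
echoParams=all on a single request instead:

```xml
<lst name="defaults">
  <!-- "all" echoes every parameter in effect, including the handler defaults -->
  <str name="echoParams">all</str>
</lst>
```

The echoed parameters then appear in the responseHeader section of the
response.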



On Friday, February 28, 2014 5:18 PM, sureshrk19 <su...@gmail.com> wrote:
Ahmet,

Thanks for the reply..

Here is the query:

http://localhost:8080/solr/collection1/select?q=a+of+b&fq=type%3AEntity&wt=json&indent=true

And here is my stopwords_en.txt content

a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or







Re: stopwords issue with edismax

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Suresh,

Can you give us the full set of parameters you use for edismax (qf, mm, etc.)?
And the content of your stopwords.txt. Is 'a' listed there too?

Ahmet


