Posted to solr-user@lucene.apache.org by erantone <er...@gmail.com> on 2015/05/01 01:31:34 UTC

Bug with full text search fields in multiple languages (solr 5)

Dear all,

I have defined two dynamic fields:

    <dynamicField name="*_texts_en" stored="true" type="text_en" multiValued="true" indexed="true"/>
    <dynamicField name="*_texts_pt" stored="true" type="text_pt" multiValued="true" indexed="true"/>

for documents in English and in Portuguese, with the following index and
query analyzers:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_pt" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PortugueseLightStemFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PortugueseLightStemFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

A document is either in Portuguese or in English. An English document
uses a field such as 'body_texts_en'; a Portuguese one uses
'body_texts_pt'.

However, I run into a problem when a search queries both fields at once
and solr.StopFilterFactory is in the filter chain. That is, when I search
without knowing the language of the query, I query Solr in this way:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "suco de limão",
      "defType": "edismax",
      "indent": "true",
      "qf": " body_texts_pt  body_texts_en",
      "wt": "json",
      "lowercaseOperators": "true",
      "stopwords": "true",
      "_": "1430434475811"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

The query above uses Portuguese terms. Even though the index contains
matching documents, no results are returned.
On the other hand, as soon as I either:
- remove 'body_texts_en' from the 'qf' param (in the Solr request), or
- remove all solr.StopFilterFactory filters from all analyzers,
the matching documents are correctly returned.

So the problem seems to come from combining solr.StopFilterFactory with a
simultaneous query against two fields, each of which has its own
solr.StopFilterFactory configuration (as shown above).

Is there any way to make the query above work as expected?

Thanks in advance.

With best regards,
Eric

--
View this message in context: http://lucene.472066.n3.nabble.com/Bug-with-full-text-search-fields-in-multiple-languages-solr-5-tp4203367.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Bug with full text search fields in multiple languages (solr 5)

Posted by Erick Erickson <er...@gmail.com>.

Eric:

First of all, kudos for your problem description. Plainly you've
1> tried to diagnose the problem.
2> taken the time to write it up for us.

Far too often we see problem statements like "it doesn't work, what's
wrong" (one of my pet peeves).

Anyway, on to your problem. This should work as you expect. This bit
is puzzling:

bq: remove 'body_texts_en' from 'qf' param (in the solr request)

The only thing that springs to mind here is that you might have your
"mm" parameter set to 100% or some such. That would require all clauses
to match, which could explain why removing the field returns matches.
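
To make the mm hypothesis concrete, here is a toy Python model (not Solr
code) of how it could produce the reported behavior. It assumes that "de"
is in the Portuguese stopword list but not the English one, that mm is
effectively 100%, and that the parser builds one clause per query term
across the fields in which that term survives analysis. The stopword
sets, the sample document, and all function names are made up for
illustration; stemming and ASCII folding are ignored.

```python
# Toy model only: the stopword sets and the document below are invented.
STOPWORDS = {
    "body_texts_pt": {"de", "a", "o"},
    "body_texts_en": {"the", "of", "a"},
}


def analyze(field, text):
    """Lowercase, split on whitespace, and drop that field's stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS[field]]


def build_clauses(query, fields):
    """One clause per query term, listing the fields where the term survives."""
    clauses = []
    for term in query.lower().split():
        alive = [f for f in fields if term not in STOPWORDS[f]]
        if alive:  # a term that is a stopword in every field yields no clause
            clauses.append((term, alive))
    return clauses


def matches(doc, clauses, mm_percent=100):
    """True if at least mm_percent of the clauses match in some listed field."""
    hit = sum(
        1 for term, fields in clauses
        if any(term in doc.get(f, []) for f in fields)
    )
    required = len(clauses) * mm_percent // 100
    return hit >= required


# A Portuguese document only populates the Portuguese field.
DOC = {"body_texts_pt": analyze("body_texts_pt", "Suco de limão gelado")}

BOTH = build_clauses("suco de limão", ["body_texts_pt", "body_texts_en"])
PT_ONLY = build_clauses("suco de limão", ["body_texts_pt"])

print(matches(DOC, BOTH))                 # both fields in qf, mm=100%
print(matches(DOC, PT_ONLY))              # only the Portuguese field in qf
print(matches(DOC, BOTH, mm_percent=50))  # both fields, relaxed mm
```

In this model the "de" clause exists only for body_texts_en, which a
Portuguese document never populates, so requiring every clause to match
fails. Whether that is what actually happens in your setup is something
the parsed query would confirm.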

You might try appending &debug=query to the URL and posting the output;
that would help diagnose this. Please also post the request handler
definition from solrconfig.xml.
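
As a rough sketch of what that request could look like, the Python below
rebuilds the query from the original post with debug=query added. The
host, port, core name ("mycore") and the helper name are assumptions for
illustration; only the q, defType, qf, wt and indent values come from
the post.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, q, qf):
    """Build a select URL that asks Solr to include the parsed query."""
    params = {
        "q": q,
        "defType": "edismax",
        "qf": qf,
        "wt": "json",
        "indent": "true",
        "debug": "query",  # adds the parsed query to the response
    }
    return base_url + "?" + urlencode(params)


# Hypothetical host and core name; adjust to your installation.
BASE_URL = "http://localhost:8983/solr/mycore/select"
URL = build_debug_url(BASE_URL, "suco de limão", "body_texts_pt body_texts_en")
print(URL)
```

The "debug" section of the response should then show how the query was
parsed against each field in qf.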

Just to check, you did re-index the entire corpus after you finalized
your stopwords files, correct?

The other thing that might help considerably is the admin/analysis page.
It shows exactly what transformations are performed on the field in
question at both index and query time. Note one point of confusion: on
the query side, the page analyzes what comes out of the query parsing
process, not the raw query string. So it is often wise to look at the
results of adding debug=query to the URL and paste the _parsed_ tokens
into the "query" side.
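
If you would rather script this than click through the admin UI, the
same kind of analysis can usually be requested over HTTP from Solr's
field analysis handler. The sketch below only builds such a URL. It
assumes the handler is reachable at /analysis/field on your core (check
solrconfig.xml); the host, core name and helper name are made up.

```python
from urllib.parse import urlencode


def build_analysis_url(core_url, field_name, index_text, query_text):
    """Build a field-analysis URL for one field's index and query chains."""
    params = {
        "analysis.fieldname": field_name,
        "analysis.fieldvalue": index_text,  # run through the index analyzer
        "analysis.query": query_text,       # run through the query analyzer
        "wt": "json",
    }
    return core_url + "/analysis/field?" + urlencode(params)


# Hypothetical host and core name; adjust to your installation.
CORE_URL = "http://localhost:8983/solr/mycore"
ANALYSIS_URL = build_analysis_url(
    CORE_URL, "body_texts_en", "Suco de limão gelado", "suco de limão"
)
print(ANALYSIS_URL)
```

Running it once per field in qf makes it easy to compare which tokens
each analyzer keeps for the same input.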

I'll be away from my computer for a few days, good luck!
Erick
