Posted to solr-user@lucene.apache.org by Germán Biozzoli <ge...@gmail.com> on 2009/10/18 03:32:25 UTC

Problem with Query Parser

Hi everybody

I have a simple but (for me) annoying problem. I'm a happy user of Solr
1.4 with a small collection of documents. Today one of the users
reported that a query returns documents that are not relevant to the
expression. The collection contains Spanish, Portuguese and English
text. Using the Solr administration interface I found that she was
right: if I search for the Spanish term "represion", the query matches
only on the word root, that is, it returns every document containing
the term "repres". Using the admin debug search I found this:


<lst name="debug">
<str name="rawquerystring">description:represion</str>
<str name="querystring">description:represion</str>
<str name="parsedquery">description:repres</str>
<str name="parsedquery_toString">description:repres</str>

The "ion" part of the term was removed by the query parser. My first
question is: where should I look to correct this, in schema.xml or in
solrconfig.xml?

In the schema, description is defined as:

<field name="description" type="text" indexed="true"
       multiValued="true" stored="true"/>

and text is:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

The only thing that looks suspicious to me is the EnglishPorter filter.
I've deleted it from the configuration but nothing changed. Should I
reindex the collection to see the changes? Should I also delete it from
the index section? What will I lose by deleting the English Porter filter?

Thanks a lot for the help
German

Re: Problem with Query Parser

Posted by Lance Norskog <go...@gmail.com>.

Another way to do multilingual indexing is to have a separate field
for each language. Solr/Lucene include language-specific analysis
(for example, stemmers) for some languages.
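
A minimal sketch of what the separate-field approach could look like in
schema.xml. The type and field names (text_es, description_es, and so on)
are hypothetical and not from this thread, and the filter chain is cut
down to the essentials; it assumes the SnowballPorterFilterFactory that
ships with Solr, whose language attribute selects the stemmer.

```xml
<!-- Sketch only: type and field names below are hypothetical. -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Spanish stemmer instead of the English Porter stemmer -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldType>

<fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- One field per language; each document fills the field matching its language. -->
<field name="description_es" type="text_es" indexed="true" stored="true" multiValued="true"/>
<field name="description_pt" type="text_pt" indexed="true" stored="true" multiValued="true"/>
<field name="description_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
```

With a layout like this the indexing client has to know each document's
language, and queries have to target the matching field (or several of
them).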

-- 
Lance Norskog
goksron@gmail.com

Re: Problem with Query Parser

Posted by Germán Biozzoli <ge...@gmail.com>.

Thanks Ahmet. The analysis page definitely shows the English Porter
filter as the killer ;)
Regards
German

Re: Problem with Query Parser

Posted by AHMET ARSLAN <io...@yahoo.com>.

> Hi everybody
>
> I have a simple but (for me) annoying problem. I'm a happy user of Solr
> 1.4 with a small collection of documents. Today one of the users
> reported that a query returns documents that are not relevant to the
> expression. The collection contains Spanish, Portuguese and English
> text. Using the Solr administration interface I found that she was
> right: if I search for the Spanish term "represion", the query matches
> only on the word root, that is, it returns every document containing
> the term "repres". Using the admin debug search I found this:
>
> <lst name="debug">
> <str name="rawquerystring">description:represion</str>
> <str name="querystring">description:represion</str>
> <str name="parsedquery">description:repres</str>
> <str name="parsedquery_toString">description:repres</str>
>
> The "ion" part of the term was removed by the query parser. My first
> question is: where should I look to correct this, in schema.xml or in
> solrconfig.xml?

> The only thing that looks suspicious to me is the EnglishPorter filter.

Yes, you are right. The "ion" part of the term was removed by it. You can verify this using the /admin/analysis.jsp page. It will show you which TokenFilterFactory removes it.

> I've deleted it from the configuration but nothing changed. Should
> I reindex the collection to see the changes?

Yes, re-indexing is necessary. The terms already in the index were stemmed when the documents were indexed, so they have to be rebuilt with the new analyzer.

> Should I also delete it from the index section?

You should remove the English Porter filter from both the query and the index analyzers.

> What will I lose by deleting the English Porter filter?

You will lose stemming. But since you have Spanish, Portuguese and English documents, applying the English Porter stemmer to all of them is not meaningful.
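
As a sketch of the change described above, this is the "text" field type
from the original post with the EnglishPorterFilterFactory line dropped
from both analyzers. Everything else is copied from the posted
configuration; whether the remaining filters suit the collection is an
assumption, not something tested here.

```xml
<!-- Sketch: original "text" type minus the English Porter stemmer.
     protwords.txt is no longer referenced because only the stemmer used it. -->
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

After Solr has picked up the edited schema.xml and the documents have
been re-indexed, the parsedquery in the debug output should read
description:represion rather than description:repres, assuming
synonyms.txt and stopwords.txt do not touch that term.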