Posted to solr-user@lucene.apache.org by Germán Biozzoli <ge...@gmail.com> on 2009/10/18 03:32:25 UTC
Problem with Query Parser
Hi everybody
I have a simple but (for me) annoying problem. I'm a happy user of Solr
1.4 with a small collection of documents. Today one of the users
reported that a query returns documents that are not relevant to the
search expression. I have Spanish, Portuguese and English text in the
collection. Using the Solr administration interface I found that she
was right: if I search for the Spanish term "represion", the search
matches only the word root, that is, it returns every document that
contains the term "repres". Using the admin debug search I found this:
<lst name="debug">
<str name="rawquerystring">description:represion</str>
<str name="querystring">description:represion</str>
<str name="parsedquery">description:repres</str>
<str name="parsedquery_toString">description:repres</str>
The "ion" part of the term was deleted by the query parser. My first
question is: where should I look to correct this, in schema.xml or in
solrconfig.xml?
In the schema, description is defined as:
<field name="description" type="text" indexed="true"
multiValued="true" stored="true"/>
and text is:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
The only thing that looks suspicious to me is the EnglishPorter filter.
I removed it from the configuration, but nothing changed. Should I
reindex the collection to see the changes? Should I also remove it from
the index section? What will I lose by removing the English Porter filter?
Thanks a lot for the help
German
Re: Problem with Query Parser
Posted by Lance Norskog <go...@gmail.com>.
Another way to do multi-lingual indexing is to have a separate field
for each language. Solr/Lucene have custom processing for some
languages.
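
A per-language layout along these lines could look like the following schema.xml fragment. This is only a sketch: the type names (text_es, text_pt, text_en) and field names (description_es, description_pt, description_en) are hypothetical and do not come from the original schema, and it assumes the indexing application knows each document's language and sends the text to the matching field. solr.SnowballPorterFilterFactory ships with Solr and selects a stemmer through its language attribute.

```xml
<!-- Sketch only: type and field names below are hypothetical. -->
<!-- These elements belong in the types section of schema.xml. -->
<fieldtype name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Spanish stemmer instead of the English Porter stemmer -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldtype>
<fieldtype name="text_pt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/>
  </analyzer>
</fieldtype>
<fieldtype name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldtype>

<!-- These elements belong in the fields section of schema.xml. -->
<field name="description_es" type="text_es" indexed="true"
       multiValued="true" stored="true"/>
<field name="description_pt" type="text_pt" indexed="true"
       multiValued="true" stored="true"/>
<field name="description_en" type="text_en" indexed="true"
       multiValued="true" stored="true"/>
```

With a layout like this, a query for description_es:represion would be stemmed by the Spanish stemmer only. The stop word, synonym and word delimiter filters from the original text type are left out here for brevity and could be added back per language.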
On Sun, Oct 18, 2009 at 12:25 PM, Germán Biozzoli
<ge...@gmail.com> wrote:
> Thanks Ahmet. The analysis page definitely shows the English Porter
> filter as the killer ;)
> Regards
> German
--
Lance Norskog
goksron@gmail.com
Re: Problem with Query Parser
Posted by Germán Biozzoli <ge...@gmail.com>.
Thanks Ahmet. The analysis page definitely shows the English Porter
filter as the killer ;)
Regards
German
On Sun, Oct 18, 2009 at 7:30 AM, AHMET ARSLAN <io...@yahoo.com> wrote:
> Yes you are right. "ion" part of the term was deleted by it. You can verify this using /admin/analysis.jsp page. It will tell you which TokenFilterFactory removes it.
Re: Problem with Query Parser
Posted by AHMET ARSLAN <io...@yahoo.com>.
> Hi everybody
>
> I have a simple but (for me) annoying problem. I'm happy user of Solr
> 1.4 with a small collection of documents. Today one of the users has
> reported that a query returns documents that are non-pertinent to the
> expression. I have spanish, portuguese and english text inside the
> collection. Using the Solr administration interface I've found that
> she was right, if I search for the spanish term "represion", I found
> just only the word root, I mean it returns every document with the
> term "repres". Using the admin-debug search I found this:
>
> <lst name="debug">
> <str name="rawquerystring">description:represion</str>
> <str name="querystring">description:represion</str>
> <str name="parsedquery">description:repres</str>
> <str name="parsedquery_toString">description:repres</str>
>
> the "ion" part of the term was deleted by the query parser. The first
> question is: I don't know now where should I see to correct this, at
> the schema.xml or at the solrconfig.xml.
>
> The only thing that is suspicious to me is the EnglishPorter.
Yes, you are right: the "ion" part of the term was removed by it. You can verify this on the /admin/analysis.jsp page, which will show you which TokenFilterFactory removes it.
> I've deleted from the configuration but nothing changes. Should
> I reindex the collection to see the changes?
Yes, re-indexing is necessary.
> Should I delete also from the index section?
You should remove the English Porter filter from both the query and the index analyzer.
> What will I lose deleting English porter?
You will lose stemming functionality. But since you have Spanish, Portuguese and English documents, applying the English Porter stemmer to all of them is not meaningful.
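
As a sketch of that change, the text type from the original post would look roughly like this once solr.EnglishPorterFilterFactory is removed from both analyzers. Everything else is copied unchanged from the posted schema. It assumes that no stemming at all is acceptable for this field, and the collection has to be re-indexed afterwards so that the indexed terms match the new query-time terms.

```xml
<!-- Sketch: the posted "text" type with the English Porter filter
     removed from both the index and the query analyzer. -->
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- solr.EnglishPorterFilterFactory removed here -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- solr.EnglishPorterFilterFactory removed here as well -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

With this type, a query for description:represion should be parsed as description:represion rather than description:repres.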