Posted to solr-user@lucene.apache.org by Daniel Alheiros <Da...@bbc.co.uk> on 2007/07/31 18:41:54 UTC

Highlighting question

Hi

I've started using highlighting and there is something that I consider a bit
odd... It may be caused by the way I'm indexing or querying I'm sure, but
just to avoid doing a huge number of tests...

I'm querying for "butter" and only exact matches of "butter" come back
highlighted. When I change my query to "butters", both "butter" and
"butters" are highlighted. Is it something that considers the word and its
reductions but does not match a word that contains the word in the query?

Thanks again,
Daniel


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.
					

Re: Highlighting question

Posted by Daniel Alheiros <Da...@bbc.co.uk>.
Thanks Yonik.

Noted and fixed. I'll take extra care with these scenarios.


Regards,
Daniel


On 1/8/07 20:08, "Yonik Seeley" <yo...@apache.org> wrote:

> On 8/1/07, Daniel Alheiros <Da...@bbc.co.uk> wrote:
>> I'm using the PorterStemmerFilterFactory when indexing but not when
>> querying.
> 
> That's problematic though.  During index time, if "city" is stemmed to
> "citi", then a search of "city" will find nothing unless it's stemmed
> too.
> 
> One should always use the same analyzer for indexing and querying (or
> at least "compatible" analyzers... filters that can inject tokens
> (synonym and word delimiter filters) are an exception)
> 
> -Yonik



Re: Highlighting question

Posted by Yonik Seeley <yo...@apache.org>.
On 8/1/07, Daniel Alheiros <Da...@bbc.co.uk> wrote:
> I'm using the PorterStemmerFilterFactory when indexing but not when
> querying.

That's problematic though.  During index time, if "city" is stemmed to
"citi", then a search of "city" will find nothing unless it's stemmed
too.

One should always use the same analyzer for indexing and querying (or
at least "compatible" analyzers... filters that can inject tokens
(synonym and word delimiter filters) are an exception)
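
A minimal sketch of what that could look like in schema.xml, assuming a
hypothetical fieldtype name "text_stemmed" and the English Porter stemmer
factory; the filter chain is only illustrative and should be adapted:

    <!-- Hypothetical fieldtype: the same stemming filter runs at index
         time and at query time, so "city" becomes "citi" on both sides
         and the search finds the document. -->
    <fieldtype name="text_stemmed" class="solr.TextField"
               positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EnglishPorterFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EnglishPorterFilterFactory"/>
        </analyzer>
    </fieldtype>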

-Yonik

Re: Highlighting question in a multi-language index

Posted by Daniel Alheiros <Da...@bbc.co.uk>.
Hi

I've narrowed it down: my problem here is related to the way I
store/index my fields in a multi-language index. I'm going to explain how
I'm doing it, and I hope you can come up with some nice way to solve my
problem:

My schema.xml contains the following definitions:

    <!--  definition for "Language Agnostic" text field -->
    <fieldtype name="text_basic" class="solr.TextField"
positionIncrementGap="100">
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        </analyzer>
    </fieldtype>

    <!--  Text definition for "ENGLISH" -->
    <fieldtype name="text_english" class="solr.TextField"
positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.SynonymFilterFactory"
synonyms="synonyms-english.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords-english.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
                <filter class="solr.EnglishPorterFilterFactory"
protected="protwords-english.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords-english.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
                <filter class="solr.EnglishPorterFilterFactory"
protected="protwords-english.txt"/>
        </analyzer>
    </fieldtype>

    <!--  CPS Text definition for "SPANISH" -->
    <fieldtype name="text_spanish" class="solr.TextField"
positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords-spanish.txt"/>
                <filter class="solr.SnowballPorterFilterFactory"
language="Spanish" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords-spanish.txt"/>
                <filter class="solr.SnowballPorterFilterFactory"
language="Spanish" />
        </analyzer>
    </fieldtype>

    <field name="body"    type="text_basic" indexed="false" stored="true"
multiValued="false" compressed="true" compressionThreshold="1024" />
    <field name="body_en" type="text_english" indexed="true" stored="false"
/>
    <field name="body_es" type="text_spanish" indexed="true" stored="false"
/>

    <copyField source="body_en"    dest="body"/>
    <copyField source="body_es"    dest="body"/>

So I'm indexing some fields, but in fact I'm storing a different one that
holds a copy of the data used when indexing (independently of the language
of the current document). Each document, depending on its language, will have
only one body_XX field (a document in English will have the field
body_en and a document in Spanish will have body_es).

When querying, I ask to highlight the generic "body" field (as
I need a stored field to use highlighting), but it only returns the
proper result if my stored field has the same query analyzer structure
as the language-dependent field. I can't do that, because I'm
indexing content in six completely different languages that don't share
much in terms of analysis...

The idea of having a generic set of fields (language independent) is to
avoid different interfaces for the search client (as the same search
client can search in any language). All these documents are in the same
index for deployment and content management simplicity, and because there
aren't so many documents that they can't live together (and the update
frequency is low).

Can you help me again with this? Is this solution feasible using Solr/Lucene,
or will I have to change my mind and change the client interface so that it
queries its language-specific fields (and I will need to make those
stored=true)?
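
A rough sketch of that second option, untested and only an assumption
about how it might look: the language fields become stored, and the client
requests highlighting on them through the hl.fl parameter instead of on
"body". The request shown in the comment is hypothetical.

    <!-- Sketch only: language-specific fields stored, so each can be
         highlighted with its own analyzer chain. -->
    <field name="body_en" type="text_english" indexed="true" stored="true"/>
    <field name="body_es" type="text_spanish" indexed="true" stored="true"/>

    <!-- Hypothetical request parameters for an English search:
         q=body_en:butter&hl=true&hl.fl=body_en -->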

Thanks again,
Daniel

On 1/8/07 10:43, "Daniel Alheiros" <Da...@bbc.co.uk> wrote:

> Hi Mike.
> 
> Thanks for your reply, but it seems that I haven't expressed myself clearly.
> Here I go:
> 
> When I search for "butter", I want all words containing "butter" (like
> "buttered", "butters" ...) to be highlighted.
> 
> I'm using the PorterStemmerFilterFactory when indexing but not when
> querying.
> 
> Regards,
> Daniel
> 
> 
> On 31/7/07 18:50, "Mike Klaas" <mi...@gmail.com> wrote:
> 
>> 
>> On 31-Jul-07, at 9:41 AM, Daniel Alheiros wrote:
>> 
>>> Hi
>>> 
>>> I've started using highlighting and there is something that I
>>> consider a bit
>>> odd... It may be caused by the way I'm indexing or querying I'm
>>> sure, but
>>> just to avoid doing a huge number of tests...
>>> 
>>> I'm querying for "butter" and only exact matches of butter are
>>> returning
>>> highlighted, when I change my query to "butters" it returns both
>>> "butter"
>>> and "butters" highlighted. Is it something that considers the word
>>> and its
>>> reductions but not match a word that contains the word in the query?
>> 
>> This is because the example Solr distribution is configured to do
>> stemming (see the definition for "text" fieldtype in schema.xml).
>> 
>> Remove PorterStemmerFilterFactory to do exact(er) searching/
>> highlighting only.
>> 
>> -Mike
> 
> 



Re: Highlighting question

Posted by Daniel Alheiros <Da...@bbc.co.uk>.
Hi Mike.

Thanks for your reply, but it seems that I haven't expressed myself clearly.
Here I go:

When I search for "butter", I want all words containing "butter" (like
"buttered", "butters" ...) to be highlighted.

I'm using the PorterStemmerFilterFactory when indexing but not when
querying.

Regards,
Daniel


On 31/7/07 18:50, "Mike Klaas" <mi...@gmail.com> wrote:

> 
> On 31-Jul-07, at 9:41 AM, Daniel Alheiros wrote:
> 
>> Hi
>> 
>> I've started using highlighting and there is something that I
>> consider a bit
>> odd... It may be caused by the way I'm indexing or querying I'm
>> sure, but
>> just to avoid doing a huge number of tests...
>> 
>> I'm querying for "butter" and only exact matches of butter are
>> returning
>> highlighted, when I change my query to "butters" it returns both
>> "butter"
>> and "butters" highlighted. Is it something that considers the word
>> and its
>> reductions but not match a word that contains the word in the query?
> 
> This is because the example Solr distribution is configured to do
> stemming (see the definition for "text" fieldtype in schema.xml).
> 
> Remove PorterStemmerFilterFactory to do exact(er) searching/
> highlighting only.
> 
> -Mike



Re: Highlighting question

Posted by Mike Klaas <mi...@gmail.com>.
On 31-Jul-07, at 9:41 AM, Daniel Alheiros wrote:

> Hi
>
> I've started using highlighting and there is something that I  
> consider a bit
> odd... It may be caused by the way I'm indexing or querying I'm  
> sure, but
> just to avoid doing a huge number of tests...
>
> I'm querying for "butter" and only exact matches of butter are  
> returning
> highlighted, when I change my query to "butters" it returns both  
> "butter"
> and "butters" highlighted. Is it something that considers the word  
> and its
> reductions but not match a word that contains the word in the query?

This is because the example Solr distribution is configured to do  
stemming (see the definition for "text" fieldtype in schema.xml).

Remove PorterStemmerFilterFactory to do exact(er) searching/ 
highlighting only.
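
For illustration, a fieldtype along these lines has no stemming filter, so
a query for "butter" would no longer match "butters" or "buttered". The
name "text_exact" is made up here, and the remaining filter chain is only
an assumption about what one might keep:

    <fieldtype name="text_exact" class="solr.TextField"
               positionIncrementGap="100">
        <!-- A single analyzer without a type attribute is used for both
             indexing and querying. -->
        <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldtype>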

-Mike