Posted to solr-user@lucene.apache.org by docmattman <ma...@live.com> on 2011/10/21 04:49:24 UTC

Highlighting misses some characters

I have highlighting enabled in my query.  If I search for "Apple", it
highlights "Appl".  If I search for "deleted", it highlights "delet", and
"agreed" highlights "agre".  How can I get it to highlight the full term
that I'm searching for instead of dropping letters?

I'm pretty new to Solr, so please let me know if there is any additional
information needed to assist me with this problem.

--
View this message in context: http://lucene.472066.n3.nabble.com/Highlighting-misses-some-characters-tp3439778p3439778.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Highlighting misses some characters

Posted by Dirceu Vieira <di...@gmail.com>.
Whether to remove the filter or not really depends on how it is used in the
search and what results are expected from it.
Have a look at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

I'd say you should find out exactly what the requirements for the search
are, read a bit about the tokenizers and token filters, and then you can
decide exactly what to do.

My wild guess is that you should remove the EdgeNGramFilter from the query
analyzer.
I believe that when you do that, your query will return "apple" when
searching for "appl" but will not return "appl" when searching for "apple".

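As a rough sketch of that suggestion, the query-side analyzer from the field
type posted in this thread could look like the following once
EdgeNGramFilterFactory is dropped.  Every other line is copied from the
posted schema.  The change is untested here, and the thread does not confirm
that it fixes the highlighting.

```xml
<!-- Sketch only: the query analyzer of the posted "text" field type,
     unchanged except that EdgeNGramFilterFactory is no longer applied.
     The index analyzer is assumed to stay exactly as posted. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
  <!-- EdgeNGramFilterFactory removed from the query side only -->
</analyzer>
```

The index analyzer keeps its EdgeNGramFilterFactory, so prefixes such as
"appl" remain in the index.  Because only the query analyzer changes,
existing documents should not need to be reindexed, although the core has to
be reloaded (or Solr restarted) before the schema change takes effect.
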
Regards,

Dirceu



-- 
Dirceu Vieira Júnior
-------------------------------------------------------------------
+47 9753 2473
dirceuvjr.blogspot.com
twitter.com/dirceuvjr

Re: Highlighting misses some characters

Posted by docmattman <ma...@live.com>.
Yeah, I'm using EdgeNGramFilterFactory.  Should I remove that?  I inherited
this index from another person who used to be part of the project, so there
may be a few things that need to be changed.  Here is my field type from the
schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
            maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
            maxGramSize="15"/>
  </analyzer>
</fieldType>


I'm not sure what all of these do, but like I said, someone else built the
system and now I'm in charge of getting it running correctly.


Re: Highlighting misses some characters

Posted by Dirceu Vieira <di...@gmail.com>.
Hi,

Are you using any kind of NGram tokenizer?
At first I would have said this is caused by stemming, but it doesn't look
like the stem and its derived words are being highlighted; it looks more
like parts of the word are...

If you use NGram or EdgeNGram this will generate tokens for each part of the
word (the size of the token is configurable).
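
To make that concrete, here is a small annotated fragment.  The gram sizes
(2 and 15) are the ones from the field type posted elsewhere in this thread;
the sample token is only an illustration.

```xml
<!-- Illustration only: with these settings, a single token such as "apple"
     is expanded into its leading prefixes
         ap, app, appl, apple
     that is, every prefix from minGramSize (2) up to maxGramSize (15)
     characters, capped at the length of the token itself.
     NGramFilterFactory would instead emit substrings starting at every
     position in the token, not only at the front. -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
```

Each of those prefixes becomes a separate indexed term, which is why a
partial word can match, and possibly be highlighted, on its own.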

If you're not using that, my second guess is that the term is being
truncated somehow.

If you could provide some more information about this case, that would help.


