Posted to solr-user@lucene.apache.org by Lasitha Wattaladeniya <wa...@gmail.com> on 2017/07/18 06:11:06 UTC

Highlighting words with special characters

Hi devs,

I have set up Solr highlighting with the default configuration (I only
changed fragsize to 0 to match any field length). It worked fine, but I
recently discovered that it doesn't highlight words with special
characters in the middle.

For example, say I have indexed the email address test.fsdg@ran.com into
an ngram field. When I search for the partial text fsdg, I get results,
but the match is not highlighted. It works as expected in all other
scenarios.

The ngram field has termVectors, termPositions, termOffsets set to true.

Can somebody please suggest what may be wrong here?

(Sorry for the unstructured text; I typed this on a mobile phone.)

Regards
Lasitha
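
The request described in this post can be sketched roughly as follows.
This is only an illustration: the host, core name (mycore) and field name
(email_ngram) are hypothetical and do not come from the thread. The
parameters hl, hl.fl and hl.fragsize are standard Solr highlighting
parameters.

```python
from urllib.parse import urlencode

# Hypothetical values: host, core name and field name are not from the thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"
NGRAM_FIELD = "email_ngram"


def build_highlight_query(term: str) -> str:
    """Return a select URL that searches one field and asks for highlighting."""
    params = {
        "q": f"{NGRAM_FIELD}:{term}",
        "hl": "true",          # enable highlighting
        "hl.fl": NGRAM_FIELD,  # field to highlight
        "hl.fragsize": "0",    # 0 = treat the whole field value as one fragment
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Partial-text search from the post: the term "fsdg".
    print(build_highlight_query("fsdg"))
```

The function only builds the URL; sending it and inspecting the
"highlighting" section of the response is left to whatever HTTP client is
at hand.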

Re: Highlighting words with special characters

Posted by Lasitha Wattaladeniya <wa...@gmail.com>.

Hi Shawn,

Yes, I can confirm: it works without any errors with multiple tokenizers.
The following is my analysis chain:

StandardTokenizerFactory (only in index)
StopFilterFactory
LowerCaseFilterFactory
ASCIIFoldingFilterFactory
EnglishPossessiveFilterFactory
StemmerOverrideFilterFactory (only in query)
NgramTokenizerFactory (only in index)

I'll look further into what you said about having a single tokenizer in
the analysis chain.

Regards,
Lasitha

On Thu, Jul 20, 2017 at 7:12 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 7/19/2017 8:31 PM, Lasitha Wattaladeniya wrote:
> > But I have NgramTokenizerFactory at the end of the indexing analyzer
> > chain, so the email address should still be tokenized. How does this
> > affect the highlighting? That's what I'm struggling to understand.
>
> You can only have one tokenizer in an analysis chain.  I have no idea
> what happens if you have more than one.  I personally would expect that
> to result in an initialization error, but maybe what it does is ignore
> the additional tokenizers.  Your experience seems to indicate that it
> does NOT result in an error.  Can you confirm?
>
> The analysis is done in this order:
>
> CharFilters
> Tokenizer
> Filters
>
> Thanks,
> Shawn
>
>
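
The order Shawn describes (CharFilters, then one Tokenizer, then Filters)
can be sketched as a small pipeline. This is a toy model under stated
assumptions, not Solr's implementation: every function below is a
hypothetical stand-in. It also shows n-gramming done as a token filter
after the single tokenizer. Solr does ship an n-gram token filter
(NGramFilterFactory), but whether that fits the schema discussed here is
not established in the thread.

```python
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run the three stages in order: char filters, ONE tokenizer, filters."""
    for char_filter in char_filters:      # 1. rewrite the raw character stream
        text = char_filter(text)
    tokens = tokenizer(text)              # 2. exactly one tokenizer makes tokens
    for token_filter in token_filters:    # 3. each filter transforms the tokens
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins, not the real Lucene/Solr components.
def dash_to_space(text: str) -> str:
    return text.replace("-", " ")


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def ngram_filter(min_n: int, max_n: int) -> TokenFilter:
    """Build a filter that replaces each token with its character n-grams."""
    def apply(tokens: List[str]) -> List[str]:
        out: List[str] = []
        for tok in tokens:
            for n in range(min_n, max_n + 1):
                out.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
        return out
    return apply


if __name__ == "__main__":
    print(analyze("Ab-Cd", [dash_to_space], whitespace_tokenizer,
                  [lowercase_filter, ngram_filter(1, 2)]))
```

The signature of analyze() takes a single tokenizer by design, which
mirrors the "one tokenizer per chain" rule; anything that should run after
tokenization has to be expressed as a filter.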

Re: Highlighting words with special characters

Posted by Lasitha Wattaladeniya <wa...@gmail.com>.

Hi Ahmet,

But I have NgramTokenizerFactory at the end of the indexing analyzer
chain, so the email address should still be tokenized. How does this
affect the highlighting? That's what I'm struggling to understand.

Solr version: 4.10.4

Regards,
Lasitha

On 20 Jul 2017 08:28, "Ahmet Arslan" <io...@yahoo.com.invalid> wrote:

Hi,
Maybe the name of UAX29URLEmailTokenizer is deceiving you? It does *not*
break URLs and emails apart. It recognises them and emits each one as a
single token.
Ahmet

Re: Highlighting words with special characters

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi,
Maybe the name of UAX29URLEmailTokenizer is deceiving you? It does *not*
break URLs and emails apart. It recognises them and emits each one as a
single token.
Ahmet
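
The difference between keeping an address whole and splitting it can be
sketched with two toy tokenizers that record character offsets. Both
functions and both regular expressions are hypothetical simplifications.
They do not reproduce UAX29URLEmailTokenizer or StandardTokenizer, whose
UAX#29 word-break rules are more subtle, and the sketch does not by itself
explain what the highlighter does with these tokens.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (term, start_offset, end_offset)

# Simplified patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def split_tokenize(text: str, base: int = 0) -> List[Token]:
    """Crude splitter: every run of letters/digits becomes its own token."""
    return [(m.group(), base + m.start(), base + m.end())
            for m in WORD_RE.finditer(text)]


def email_aware_tokenize(text: str) -> List[Token]:
    """Toy tokenizer that keeps each email address whole as ONE token."""
    tokens: List[Token] = []
    pos = 0
    for m in EMAIL_RE.finditer(text):
        # Text before the address is split normally.
        tokens.extend(split_tokenize(text[pos:m.start()], base=pos))
        # The address itself is emitted as a single token.
        tokens.append((m.group(), m.start(), m.end()))
        pos = m.end()
    tokens.extend(split_tokenize(text[pos:], base=pos))
    return tokens


if __name__ == "__main__":
    address = "test.fsdg@ran.com"
    print(email_aware_tokenize(address))
    print(split_tokenize(address))
```

With the address from the thread, the email-aware version yields one token
spanning offsets 0 to 17, while the crude splitter yields a separate
"fsdg" token at offsets 5 to 9.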

On Wednesday, July 19, 2017, 12:00:05 PM GMT+3, Lasitha Wattaladeniya <wa...@gmail.com> wrote:

Update,

I changed UAX29URLEmailTokenizerFactory to StandardTokenizerFactory, and
now it shows highlighted text fragments in the indexed email text.

But I don't understand this behavior. Can someone shed some light on it,
please?

Re: Highlighting words with special characters

Posted by Lasitha Wattaladeniya <wa...@gmail.com>.

Update,

I changed UAX29URLEmailTokenizerFactory to StandardTokenizerFactory, and
now it shows highlighted text fragments in the indexed email text.

But I don't understand this behavior. Can someone shed some light on it,
please?

On 18 Jul 2017 14:18, "Lasitha Wattaladeniya" <wa...@gmail.com> wrote:

> Furthermore, the ngram field has the following tokenizer/filter chain
> for index and query:
>
> UAX29URLEmailTokenizerFactory (only in index)
> StopFilterFactory
> LowerCaseFilterFactory
> ASCIIFoldingFilterFactory
> EnglishPossessiveFilterFactory
> StemmerOverrideFilterFactory (only in query)
> NgramTokenizerFactory (only in index)
>
> Regards,
> Lasitha

Re: Highlighting words with special characters

Posted by Lasitha Wattaladeniya <wa...@gmail.com>.

Furthermore, the ngram field has the following tokenizer/filter chain for
index and query:

UAX29URLEmailTokenizerFactory (only in index)
StopFilterFactory
LowerCaseFilterFactory
ASCIIFoldingFilterFactory
EnglishPossessiveFilterFactory
StemmerOverrideFilterFactory (only in query)
NgramTokenizerFactory (only in index)

Regards,
Lasitha

On 18 Jul 2017 14:11, "Lasitha Wattaladeniya" <wa...@gmail.com> wrote:

> Hi devs,
>
> I have set up Solr highlighting with the default configuration (I only
> changed fragsize to 0 to match any field length). It worked fine, but I
> recently discovered that it doesn't highlight words with special
> characters in the middle.
>
> For example, say I have indexed the email address test.fsdg@ran.com into
> an ngram field. When I search for the partial text fsdg, I get results,
> but the match is not highlighted. It works as expected in all other
> scenarios.
>
> The ngram field has termVectors, termPositions, termOffsets set to true.
>
> Can somebody please suggest what may be wrong here?
>
> (Sorry for the unstructured text; I typed this on a mobile phone.)
>
> Regards
> Lasitha
>