Posted to users@solr.apache.org by "DF2832368_jan@amberoad.de DF2832368_jan@amberoad.de" <ja...@amberoad.de> on 2021/03/08 10:05:14 UTC

CustomBreakIterator Performance Issues

Hello,

I am currently working on getting a custom BreakIterator for the Unified Highlighter to work, and I am struggling a bit with performance.

I need a BreakIterator that produces clean passage highlights: each highlight should start at a sentence start and end at a word end. There are also some odd edge cases to handle.

I have already written the BreakIterator and integrated it into our custom UnifiedHighlighter class, but when I use this iterator the QTime of all requests rises from ~1000 ms to 12000+ ms, which is not acceptable for this application.

Here is a link to my implementation. I can't find where I am being horribly inefficient. (I know that these functions get called very often.)

Any suggestions are welcome, including other approaches.

Are there any good resources for learning more about BreakIterators and related topics? Digging into the code is really hard here.

Another approach I am considering is to do this highlight "trimming" after the final highlights have been found. This would reduce the amount of logic that gets called, but I guess Solr's scoring system wouldn't be taken into account correctly.
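
A minimal sketch of that post-hoc trimming idea, assuming only the JDK's java.text.BreakIterator. The class and method names (PassageTrimmer, snap) are hypothetical and not part of Lucene or Solr. It moves the start of an already selected passage back to a sentence start and the end forward to a word boundary, so the break logic runs once per final passage rather than on every candidate. Passage scoring would still have been computed on the untrimmed passages, so the caveat about scoring applies.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not a Lucene/Solr class.
final class PassageTrimmer {

    /**
     * Returns {start, end}: start is moved back to the nearest sentence
     * start, end is moved forward to the nearest word boundary.
     */
    static int[] snap(String text, int start, int end, Locale locale) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);
        // Keep the offset if it already is a sentence boundary,
        // otherwise step back to the previous one.
        int s = sentences.isBoundary(start) ? start : sentences.preceding(start);

        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        // Keep the offset if it already is a word boundary,
        // otherwise step forward to the next one.
        int e = words.isBoundary(end) ? end : words.following(end);

        // BreakIterator.DONE (-1) means no boundary was found in that direction.
        return new int[] {
            Math.max(s, 0),
            e == BreakIterator.DONE ? text.length() : e
        };
    }
}
```

For example, snapping the offsets 17..27 in "Hello world. How are you today? Fine." should widen the passage to "How are you today".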

As I said, all suggestions are welcome. Thanks in advance.

Jan Ulrich Robens

Re: CustomBreakIterator Performance Issues

Posted by David Smiley <ds...@apache.org>.

The BreakIterator impls in the JDK (and likely IBM ICU) seem slow and can
sometimes dominate the performance of this highlighter.  I worked on a
large search project (which led to the creation of the UnifiedHighlighter)
and we used a technique of encoding the breaks directly into the text
a priori.  Each break was just a special character.  Perhaps use a
"vertical tab"?  On the Solr side, the BreakIterator then became a trivial
char-based iterator, which is already in Lucene/Solr.  You might do this
as well.  You could add a
custom Solr UpdateRequestProcessor (URP) that inserts these characters.
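
A minimal sketch of the two halves of this technique, using only the JDK. The class and method names (SeparatorBreaks, markSentences, boundaries) are hypothetical, and the vertical tab is just the separator suggested above. In a real setup the first half would presumably run inside the custom URP at index time, and the second half would be replaced by the separator-based BreakIterator that Lucene's uhighlight package provides (CustomSeparatorBreakIterator, to my knowledge).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene/Solr class.
final class SeparatorBreaks {

    // Assumed separator: vertical tab (code point 0x0B).
    static final char SEP = (char) 0x0B;

    /** Index time: run the slow sentence BreakIterator once and mark every sentence start. */
    static String markSentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        StringBuilder out = new StringBuilder(text.length() + 16);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (start > 0) {
                out.append(SEP); // separator goes in front of every sentence but the first
            }
            out.append(text, start, end);
        }
        return out.toString();
    }

    /** Query time: passage boundaries come from a plain character scan. */
    static List<Integer> boundaries(String marked) {
        List<Integer> result = new ArrayList<>();
        result.add(0);
        for (int i = marked.indexOf(SEP); i >= 0; i = marked.indexOf(SEP, i + 1)) {
            result.add(i + 1); // a passage starts right after each separator
        }
        if (result.get(result.size() - 1) != marked.length()) {
            result.add(marked.length());
        }
        return result;
    }
}
```

The marked text is what has to be stored and highlighted, since inserting separators shifts all character offsets. The separator may also need to be stripped or replaced before the snippet is shown to the user.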

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: CustomBreakIterator Performance Issues

Posted by "DF2832368_jan@amberoad.de DF2832368_jan@amberoad.de" <ja...@amberoad.de>.

And of course the link broke: https://drive.google.com/file/d/1wfZFQD6loTeA9_-eGrdwi9YGtJcNjKli/view?usp=sharing
