Posted to java-user@lucene.apache.org by Michael Stoppelman <st...@gmail.com> on 2007/07/19 02:28:44 UTC

StandardTokenizer is slowing down highlighting a lot

Hi all,

I was tracking down slowness in the contrib highlighter code, and the
seemingly simple tokenStream.next() appears to be the culprit. I've seen
multiple posts about this being a possible cause. Has anyone looked into how
to speed up StandardTokenizer? For my documents it's taking about 70ms per
document, which is a big ugh! I was thinking I might just cache the
TermVectors in memory if that will be faster. Anyone have another approach to
solving this problem?
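The caching idea can be sketched generically: a small LRU cache of per-document token lists, keyed by document id, so each document is tokenized at most once per process. This is a hypothetical sketch (the class and method names are made up, and a real version would hold Lucene Tokens or TermVectors rather than plain strings):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a tiny LRU cache of per-document token lists,
// keyed by document id. Names are invented; this is not a Lucene API.
class TokenCache {
    private final int capacity;
    private final LinkedHashMap<Integer, List<String>> cache;

    TokenCache(int capacity) {
        this.capacity = capacity;
        // accessOrder=true makes LinkedHashMap behave as an LRU map
        this.cache = new LinkedHashMap<Integer, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, List<String>> eldest) {
                return size() > TokenCache.this.capacity;
            }
        };
    }

    // Return cached tokens, or tokenize once (here: naive whitespace split)
    // and remember the result for next time.
    List<String> tokens(int docId, String text) {
        List<String> cached = cache.get(docId);
        if (cached != null) return cached;
        List<String> toks = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) toks.add(t);
        }
        cache.put(docId, toks);
        return toks;
    }

    boolean isCached(int docId) {
        return cache.containsKey(docId);
    }
}
```

Whether this pays off depends on how often the same documents are re-highlighted; for a one-shot highlight per search result it only moves the cost around.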

-M

Re: StandardTokenizer is slowing down highlighting a lot

Posted by Michael Stoppelman <st...@gmail.com>.
On 7/19/07, Mark Miller <ma...@gmail.com> wrote:
>
> I think it goes without saying that a semi-complex NFA or DFA is going
> to be quite a bit slower than say, breaking on whitespace. Not that I am
> against such a warning.


This is true for those very familiar with the code base and the Tokenizer
source code. I think a comment noting that a semi-complex NFA/DFA can be a
major performance hit in the highlighting code would save others time, imho.

> To support my point about writing a custom solution more exactly tailored
> to your needs:
>
> If you just remove the <NUM> recognizer in StandardTokenizer.jj you will
> gain 20-25% speed in my tests of small and large documents.
>
> Limiting what is considered a letter to just the language/encodings you
> need might also get some good returns.


Both good ideas. I just realized that the tokenizer for highlighting doesn't
need to be the same as the tokenizer for indexing, so I can make the
highlighting tokenizer much simpler. Everything will be fast and happy soon.
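A highlighting-only tokenizer along these lines could be as simple as a whitespace splitter that records character offsets, since offsets are all a highlighter needs to mark up the original text. A rough sketch (illustrative only, not the actual Lucene Tokenizer API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a minimal highlighting-only tokenizer: split on whitespace and
// keep character offsets so a highlighter can mark up the original text.
// Illustrative only; not the Lucene Tokenizer API.
class SimpleOffsetTokenizer {
    static class Token {
        final String text;
        final int start; // inclusive offset into the original string
        final int end;   // exclusive offset
        Token(String text, int start, int end) {
            this.text = text; this.start = start; this.end = end;
        }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0, n = input.length();
        while (i < n) {
            // skip whitespace, then consume the next run of non-whitespace
            while (i < n && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) tokens.add(new Token(input.substring(start, i), start, i));
        }
        return tokens;
    }
}
```

A single linear scan like this avoids the generated-automaton machinery entirely, which is why whitespace splitting is so much cheaper than a full grammar.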

-M

> - Mark
>
> Michael Stoppelman wrote:
> > Might be nice to add a line of documentation to the highlighter on the
> > possible
> > performance hit if one uses StandardAnalyzer which probably is a common
> > case.
> > Thanks for the speedy response.
> >
> > -M
> >
> > On 7/18/07, Mark Miller <ma...@gmail.com> wrote:
> >>
> >> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
> >> limited by JavaCC speed. You cannot shave much more performance out of
> >> the grammar as it is already about as simple as it gets. You should
> >> first see if you can get away without it and use a different Analyzer,
> >> or if you can re-implement just the functionality you need in a custom
> >> Analyzer. Do you really need the support for abbreviations, companies,
> >> email address, etc?
> >>
> >> If so:
> >>
> >> You can use the TokenSources class in the highlighter package to
> rebuild
> >> a TokenStream without re-analyzing if you store term offsets and
> >> positions in the index. I have not found this to be super beneficial,
> >> even when using the StandardAnalyzer to re-analyze, but it certainly
> >> could be faster if you have large enough documents.
> >>
> >> Your best bet is probably to use
> >> https://issues.apache.org/jira/browse/LUCENE-644, which is a
> >> non-positional Highlighter that finds offsets to highlight by looking
> up
> >> query term offset information in the index. For larger documents this
> >> can be much faster than using the standard contrib Highlighter, even if
> >> you're using TokenSources. LUCENE-644 has a much flatter curve than the
> >> contrib Highlighter as document size goes up.
> >>
> >> - Mark
> >>
> >> Michael Stoppelman wrote:
> >> > Hi all,
> >> >
> >> > I was tracking down slowness in the contrib highlighter code and it
> >> seems
> >> > the seemingly simple tokenStream.next() is the culprit.
> >> > I've seen multiple posts about this being a possible cause. Has
> anyone
> >> > looked into how to speed up StandardTokenizer? For my
> >> > documents it's taking about 70ms per document that's a big ugh! I was
> >> > thinking I might just cache the TermVectors in memory if
> >> > that will be faster. Anyone have another approach to solving this
> >> > problem?
> >> >
> >> > -M
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>

Re: StandardTokenizer is slowing down highlighting a lot

Posted by Mark Miller <ma...@gmail.com>.
I think it goes without saying that a semi-complex NFA or DFA is going 
to be quite a bit slower than say, breaking on whitespace. Not that I am 
against such a warning.

To support my point about writing a custom solution more exactly tailored
to your needs:

If you just remove the <NUM> recognizer in StandardTokenizer.jj you will 
gain 20-25% speed in my tests of small and large documents.

Limiting what is considered a letter to just the language/encodings you 
need might also get some good returns.

- Mark

Michael Stoppelman wrote:
> Might be nice to add a line of documentation to the highlighter on the
> possible
> performance hit if one uses StandardAnalyzer which probably is a common
> case.
> Thanks for the speedy response.
>
> -M
>
> On 7/18/07, Mark Miller <ma...@gmail.com> wrote:
>>
>> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
>> limited by JavaCC speed. You cannot shave much more performance out of
>> the grammar as it is already about as simple as it gets. You should
>> first see if you can get away without it and use a different Analyzer,
>> or if you can re-implement just the functionality you need in a custom
>> Analyzer. Do you really need the support for abbreviations, companies,
>> email address, etc?
>>
>> If so:
>>
>> You can use the TokenSources class in the highlighter package to rebuild
>> a TokenStream without re-analyzing if you store term offsets and
>> positions in the index. I have not found this to be super beneficial,
>> even when using the StandardAnalyzer to re-analyze, but it certainly
>> could be faster if you have large enough documents.
>>
>> Your best bet is probably to use
>> https://issues.apache.org/jira/browse/LUCENE-644, which is a
>> non-positional Highlighter that finds offsets to highlight by looking up
>> query term offset information in the index. For larger documents this
>> can be much faster than using the standard contrib Highlighter, even if
>> you're using TokenSources. LUCENE-644 has a much flatter curve than the
>> contrib Highlighter as document size goes up.
>>
>> - Mark
>>
>> Michael Stoppelman wrote:
>> > Hi all,
>> >
>> > I was tracking down slowness in the contrib highlighter code and it
>> seems
>> > the seemingly simple tokenStream.next() is the culprit.
>> > I've seen multiple posts about this being a possible cause. Has anyone
>> > looked into how to speed up StandardTokenizer? For my
>> > documents it's taking about 70ms per document that's a big ugh! I was
>> > thinking I might just cache the TermVectors in memory if
>> > that will be faster. Anyone have another approach to solving this
>> > problem?
>> >
>> > -M
>> >
>>
>



Re: StandardTokenizer is slowing down highlighting a lot

Posted by Michael Stoppelman <st...@gmail.com>.
Might be nice to add a line of documentation to the highlighter on the
possible
performance hit if one uses StandardAnalyzer which probably is a common
case.
Thanks for the speedy response.

-M

On 7/18/07, Mark Miller <ma...@gmail.com> wrote:
>
> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
> limited by JavaCC speed. You cannot shave much more performance out of
> the grammar as it is already about as simple as it gets. You should
> first see if you can get away without it and use a different Analyzer,
> or if you can re-implement just the functionality you need in a custom
> Analyzer. Do you really need the support for abbreviations, companies,
> email address, etc?
>
> If so:
>
> You can use the TokenSources class in the highlighter package to rebuild
> a TokenStream without re-analyzing if you store term offsets and
> positions in the index. I have not found this to be super beneficial,
> even when using the StandardAnalyzer to re-analyze, but it certainly
> could be faster if you have large enough documents.
>
> Your best bet is probably to use
> https://issues.apache.org/jira/browse/LUCENE-644, which is a
> non-positional Highlighter that finds offsets to highlight by looking up
> query term offset information in the index. For larger documents this
> can be much faster than using the standard contrib Highlighter, even if
> you're using TokenSources. LUCENE-644 has a much flatter curve than the
> contrib Highlighter as document size goes up.
>
> - Mark
>
> Michael Stoppelman wrote:
> > Hi all,
> >
> > I was tracking down slowness in the contrib highlighter code and it
> seems
> > the seemingly simple tokenStream.next() is the culprit.
> > I've seen multiple posts about this being a possible cause. Has anyone
> > looked into how to speed up StandardTokenizer? For my
> > documents it's taking about 70ms per document that's a big ugh! I was
> > thinking I might just cache the TermVectors in memory if
> > that will be faster. Anyone have another approach to solving this
> > problem?
> >
> > -M
> >
>

Re: StandardTokenizer is slowing down highlighting a lot

Posted by Stanislaw Osinski <st...@man.poznan.pl>.
>
> I am sure a faster StandardAnalyzer would be greatly appreciated.


I'm increasing the priority of that task then :)

> StandardAnalyzer appears widely used and horrendously slow. Even better
> would be a StandardAnalyzer that could have different recognizers
> enabled/disabled. For example, dropping NUM recognition if you don't
> need it in the current StandardAnalyzer gains like 25% speed.


That's a good idea, though I'd need to check if in case of JFlex there would
be considerable performance differences depending on the grammar.

Staszek

-- 
Stanislaw Osinski, stanislaw.osinski@carrot-search.com
http://www.carrot-search.com

Re: StandardTokenizer is slowing down highlighting a lot

Posted by Mark Miller <ma...@gmail.com>.
I would be very interested. I have been playing around with Antlr to see 
if it is any faster than JavaCC, but haven't seen great gains in my 
simple tests. I had not considered trying JFlex.

I am sure a faster StandardAnalyzer would be greatly appreciated. 
StandardAnalyzer appears widely used and horrendously slow. Even better 
would be a StandardAnalyzer that could have different recognizers 
enabled/disabled. For example, dropping NUM recognition if you don't 
need it in the current StandardAnalyzer gains like 25% speed.
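The enable/disable idea might look roughly like this toy tokenizer, where switching off number recognition lets digit runs be skipped outright instead of being matched and emitted. This is a hypothetical illustration, not the real StandardAnalyzer design:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of a tokenizer with a switchable recognizer: when
// recognizeNumbers is false, digit runs are skipped entirely, saving the
// work of matching and emitting them. Hypothetical; not StandardAnalyzer.
class ConfigurableTokenizer {
    private final boolean recognizeNumbers;

    ConfigurableTokenizer(boolean recognizeNumbers) {
        this.recognizeNumbers = recognizeNumbers;
    }

    List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int i = 0, n = input.length();
        while (i < n) {
            char c = input.charAt(i);
            if (Character.isLetter(c)) {
                int start = i;
                while (i < n && Character.isLetter(input.charAt(i))) i++;
                out.add(input.substring(start, i));
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < n && Character.isDigit(input.charAt(i))) i++;
                // only emit the digit run when number recognition is on
                if (recognizeNumbers) out.add(input.substring(start, i));
            } else {
                i++; // everything else is a separator
            }
        }
        return out;
    }
}
```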

- Mark

Stanislaw Osinski wrote:
>>
>> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
>> limited by JavaCC speed. You cannot shave much more performance out of
>> the grammar as it is already about as simple as it gets.
>
>
> JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years
> ago :) switched to JFlex, which for roughly the same grammar would sometimes
> be up to 10x (!) faster. You can have a look at our JFlex specification at:
>
> http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup 
>
>
> This one seems more complex than the StandardAnalyzer's but it's much faster
> anyway.
>
> If anyone is interested, I could prepare a JFlex based Analyzer equivalent
> (to the extent possible) to current StandardAnalyzer, which might offer nice
> indexing and highlighting speed-ups.
>
> Best,
>
> Staszek
>



Re: StandardTokenizer is slowing down highlighting a lot

Posted by Stanislaw Osinski <st...@man.poznan.pl>.
On 25/07/07, Yonik Seeley <yo...@apache.org> wrote:
>
> On 7/25/07, Stanislaw Osinski <st...@man.poznan.pl> wrote:
> > JavaCC is slow indeed.
>
> JavaCC is a very fast parser for a large document... the issue is
> small fields and JavaCC's use of an exception for flow control at the
> end of a value.  As JVMs have advanced, exception-as-control-flow has
> gotten comparatively slower.


In Carrot2 we tokenize mostly very short documents (search results), so in
this context JFlex proved much faster. I did a very rough performance test
of Highlighter using JFlex and JavaCC-generated analyzers with medium-sized
documents (up to ~1kB), and JFlex was still faster. What size would a
'large' document be?

> Does JFlex have a jar associated with it?  It's GPL (although you can
> freely use the files it generates under any license), so if there were
> other non-generated files required, we wouldn't be able to incorporate
> them.


You need JFlex jar only to generate the tokenizer (one Java class). The
generated tokenizer is standalone and doesn't need the JFlex jar to run.

Staszek

Re: StandardTokenizer is slowing down highlighting a lot

Posted by Yonik Seeley <yo...@apache.org>.
On 7/25/07, Stanislaw Osinski <st...@man.poznan.pl> wrote:
> JavaCC is slow indeed.

JavaCC is a very fast parser for a large document... the issue is
small fields and JavaCC's use of an exception for flow control at the
end of a value.  As JVMs have advanced, exception-as-control-flow has
gotten comparatively slower.
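The cost Yonik describes can be illustrated with a toy end-of-stream comparison: one variant ends its loop by throwing (as JavaCC-generated tokenizers do at end of input), the other by returning a sentinel. Both produce identical results; the throwing variant additionally pays for exception construction on every short field:

```java
import java.util.Arrays;
import java.util.Iterator;

// Toy illustration of end-of-stream signalling styles. Both methods count
// the same tokens; the exception-based one pays for constructing and
// catching a throwable at the end of every (possibly very short) input.
class EndOfStreamStyles {
    static class EndOfStream extends RuntimeException {}

    // End the loop by throwing, JavaCC-style.
    static int countWithException(String[] tokens) {
        Iterator<String> it = Arrays.asList(tokens).iterator();
        int count = 0;
        try {
            while (true) {
                if (!it.hasNext()) throw new EndOfStream();
                it.next();
                count++;
            }
        } catch (EndOfStream e) {
            return count;
        }
    }

    // End the loop on a null sentinel instead.
    static int countWithSentinel(String[] tokens) {
        Iterator<String> it = Arrays.asList(tokens).iterator();
        int count = 0;
        while (true) {
            String t = it.hasNext() ? it.next() : null;
            if (t == null) return count;
            count++;
        }
    }
}
```

For large documents the one-time exception is amortized away, which matches the observation that JavaCC is fast on big inputs but comparatively slow on many small fields.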

Does JFlex have a jar associated with it?  It's GPL (although you can
freely use the files it generates under any license), so if there were
other non-generated files required, we wouldn't be able to incorporate
them.

-Yonik



Re: StandardTokenizer is slowing down highlighting a lot

Posted by Stanislaw Osinski <st...@man.poznan.pl>.
> > If anyone is interested, I could prepare a JFlex based Analyzer
> > equivalent
> > (to the extent possible) to current StandardAnalyzer, which might
> > offer nice
> > indexing and highlighting speed-ups.
>
> +1.  I think a lot of people would be interested in a faster
> StandardAnalyzer.
>

I've attached a patch with the JFlex-based analyzer to
https://issues.apache.org/jira/browse/LUCENE-966. The code needs some
refactoring, but it shows some nice performance gains (5.5 to 8.1 times
faster than StandardAnalyzer on Sun JVMs).

Staszek


Re: StandardTokenizer is slowing down highlighting a lot

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 25, 2007, at 7:19 AM, Stanislaw Osinski wrote:

>>
>> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
>> limited by JavaCC speed. You cannot shave much more performance out of
>> the grammar as it is already about as simple as it gets.
>
>
> JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years
> ago :) switched to JFlex, which for roughly the same grammar would sometimes
> be up to 10x (!) faster. You can have a look at our JFlex specification at:
>
> http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup
>
> This one seems more complex than the StandardAnalyzer's but it's much faster
> anyway.
>
> If anyone is interested, I could prepare a JFlex based Analyzer equivalent
> (to the extent possible) to current StandardAnalyzer, which might offer nice
> indexing and highlighting speed-ups.

+1.  I think a lot of people would be interested in a faster  
StandardAnalyzer.




Re: StandardTokenizer is slowing down highlighting a lot

Posted by Stanislaw Osinski <st...@man.poznan.pl>.
>
> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
> limited by JavaCC speed. You cannot shave much more performance out of
> the grammar as it is already about as simple as it gets.


JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years
ago :) switched to JFlex, which for roughly the same grammar would sometimes
be up to 10x (!) faster. You can have a look at our JFlex specification at:

http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup

This one seems more complex than the StandardAnalyzer's but it's much faster
anyway.

If anyone is interested, I could prepare a JFlex based Analyzer equivalent
(to the extent possible) to current StandardAnalyzer, which might offer nice
indexing and highlighting speed-ups.

Best,

Staszek


Re: StandardTokenizer is slowing down highlighting a lot

Posted by Mark Miller <ma...@gmail.com>.
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really 
limited by JavaCC speed. You cannot shave much more performance out of 
the grammar as it is already about as simple as it gets. You should 
first see if you can get away without it and use a different Analyzer, 
or if you can re-implement just the functionality you need in a custom 
Analyzer. Do you really need the support for abbreviations, companies, 
email addresses, etc.?

If so:

You can use the TokenSources class in the highlighter package to rebuild 
a TokenStream without re-analyzing if you store term offsets and 
positions in the index. I have not found this to be super beneficial, 
even when using the StandardAnalyzer to re-analyze, but it certainly 
could be faster if you have large enough documents.

Your best bet is probably to use 
https://issues.apache.org/jira/browse/LUCENE-644, which is a 
non-positional Highlighter that finds offsets to highlight by looking up 
query term offset information in the index. For larger documents this 
can be much faster than using the standard contrib Highlighter, even if 
you're using TokenSources. LUCENE-644 has a much flatter curve than the
contrib Highlighter as document size goes up.
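The offset-lookup strategy can be sketched in miniature: once term offsets are known (from the index, in LUCENE-644's case), highlighting reduces to splicing tags into the stored text with no re-analysis at all. Here the offsets are supplied by hand purely for illustration:

```java
import java.util.List;

// Miniature sketch of offset-based highlighting: given (start, end) offsets
// for matched terms (sorted and non-overlapping), splice tags into the
// stored text. No tokenization happens at highlight time; in LUCENE-644 the
// offsets come from the index, here they are supplied by hand.
class OffsetHighlighter {
    static String highlight(String text, List<int[]> offsets) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (int[] span : offsets) {
            sb.append(text, pos, span[0]);                       // unmatched prefix
            sb.append("<b>").append(text, span[0], span[1]).append("</b>");
            pos = span[1];
        }
        sb.append(text, pos, text.length());                     // trailing text
        return sb.toString();
    }
}
```

Because the work is proportional to the number of matches rather than the document length, this is consistent with the flatter curve Mark describes for large documents.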

- Mark

Michael Stoppelman wrote:
> Hi all,
>
> I was tracking down slowness in the contrib highlighter code and it seems
> the seemingly simple tokenStream.next() is the culprit.
> I've seen multiple posts about this being a possible cause. Has anyone
> looked into how to speed up StandardTokenizer? For my
> documents it's taking about 70ms per document that's a big ugh! I was
> thinking I might just cache the TermVectors in memory if
> that will be faster. Anyone have another approach to solving this 
> problem?
>
> -M
>
