Posted to dev@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2002/11/20 07:59:32 UTC

Observations: profiling indexing process

Hello,

I decided to run a little Lucene app that does some indexing under a
profiler. (I used JMP, http://www.khelekore.org/jmp/, a rather simple
one).

The app uses StandardAnalyzer.
I've noticed that a lot of time is spent in StandardTokenizer and
various JavaCC-generated methods.
I am wondering if anyone tried replacing StandardTokenizer.jj with
something more efficient?

Also, StopFilter is using a Hashtable to store the list of stop words.
Has anyone tried using HashMap instead?

Thanks,
Otis




Re: Observations: profiling indexing process

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I realized soon after I sent the message that this is the case and I
knew somebody would quickly point it out :)
Still, if improving a piece costs next to nothing, why not do it :)

I changed my code locally to use HashMap.  I actually started with
HashSet, but with Sets one can't do set.get(object) :(
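
For the record, contains() can stand in for the get()-as-membership-test
trick, so a HashSet would also work for the stop check.  A rough sketch,
not the actual StopFilter code; the class name below is made up for the
example:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative only: the stop list is built once, then only queried.
    public class StopSetDemo {
        public static void main(String[] args) {
            String[] stopWords = { "a", "an", "and", "the" };
            Set stopSet = new HashSet(Arrays.asList(stopWords));

            System.out.println(stopSet.contains("the"));     // true  -> drop it
            System.out.println(stopSet.contains("lucene"));  // false -> keep it
        }
    }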

Anyhow, yes, there are bigger things to fix.

Otis

--- Brian Goetz <br...@quiotix.com> wrote:
> > > > I decided to run a little Lucene app that does some
> > > > indexing under a
> > > > profiler. (I used JMP,
> > > > http://www.khelekore.org/jmp/, a rather simple
> > > > one).
> > > > 
> > > > The app uses StandardAnalyzer.
> > > > I've noticed that a lot of time is spent in
> > > > StandardTokenizer and
> > > > various JavaCC-generated methods.
> > > > I am wondering if anyone tried replacing
> > > > StandardTokenizer.jj with
> > > > something more efficient?
> > > > 
> > > > Also, StopFilter is using a Hashtable to store the
> > > > list of stop words. 
> > > > Has anyone tried using HashMap instead?
> 
> HashMap is certainly a higher-performance choice, so long as the map
> is static for the duration of its lifetime and built in the
> constructor.  Otherwise, you could run afoul of thread-safety issues.
> And HashSet uses less memory.  
>
> But the bigger point is one that Doug convinced me of only after I
> went on a mad micro-optimization tear earlier in the project (Sorry,
> Doug, you were right) -- and that is that for the most part,
> tokenization is a very very small part of the total work done by the
> system.  Tokenization gets done once for each document, whereas the
> document gets merged, searched, and queried many times.  Time spent
> tweaking tokenizers for performance is likely wasted effort; that
> time
> could probably be much better spent improving the code in much more
> useful ways.
> 
> Sure, StandardTokenizer is slow.  But that tokenization effort gets
> spread over the many times the document is searched.  Even if it does
> a 1% better job at tokenizing, that might be worth a 100x increase in
> tokenizing time.  I think any effort you want to spend tweaking
> tokenizers would be much better spent doing a better job of
> tokenization and preprocessing (stemming, dealing intelligently with
> non-letters and word breaks, format stripping) than on performance
> tweaks.


Re: Observations: profiling indexing process

Posted by Brian Goetz <br...@quiotix.com>.
> > > I decided to run a little Lucene app that does some
> > > indexing under a
> > > profiler. (I used JMP,
> > > http://www.khelekore.org/jmp/, a rather simple
> > > one).
> > > 
> > > The app uses StandardAnalyzer.
> > > I've noticed that a lot of time is spent in
> > > StandardTokenizer and
> > > various JavaCC-generated methods.
> > > I am wondering if anyone tried replacing
> > > StandardTokenizer.jj with
> > > something more efficient?
> > > 
> > > Also, StopFilter is using a Hashtable to store the
> > > list of stop words. 
> > > Has anyone tried using HashMap instead?

HashMap is certainly a higher-performance choice, so long as the map
is static for the duration of its lifetime and built in the
constructor.  Otherwise, you could run afoul of thread-safety issues.
And HashSet uses less memory.  
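
Concretely, the safe pattern is to fill the map in the constructor and
never touch it again; after that, unsynchronized reads are fine, while
Hashtable pays for a lock on every call.  A rough sketch only, not the
actual StopFilter code, and the class name is made up for the example:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the "build once, read many" pattern described above.
    public class StopTable {
        private final Map stopTable = new HashMap();

        public StopTable(String[] stopWords) {
            for (int i = 0; i < stopWords.length; i++) {
                // populated only here, in the constructor
                stopTable.put(stopWords[i], stopWords[i]);
            }
        }

        public boolean contains(String word) {
            // read-only access after construction, no synchronization needed
            return stopTable.get(word) != null;
        }
    }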

But the bigger point is one that Doug convinced me of only after I
went on a mad micro-optimization tear earlier in the project (Sorry,
Doug, you were right) -- and that is that for the most part,
tokenization is a very very small part of the total work done by the
system.  Tokenization gets done once for each document, whereas the
document gets merged, searched, and queried many times.  Time spent
tweaking tokenizers for performance is likely wasted effort; that time
could probably be much better spent improving the code in much more
useful ways.

Sure, StandardTokenizer is slow.  But that tokenization effort gets
spread over the many times the document is searched.  Even if it does
a 1% better job at tokenizing, that might be worth a 100x increase in
tokenizing time.  I think any effort you want to spend tweaking
tokenizers would be much better spent doing a better job of
tokenization and preprocessing (stemming, dealing intelligently with
non-letters and word breaks, format stripping) than on performance
tweaks.
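
As a concrete example of the kind of preprocessing meant here, one could
append a stemming filter to the usual StandardAnalyzer chain.  This is
only an illustration using the analysis classes that ship with Lucene;
exact constructor signatures may differ between releases:

    import java.io.Reader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Rough sketch: the StandardAnalyzer-style chain with a Porter
    // stemmer appended as the extra preprocessing step.
    public class StemmingChain {
        private static final String[] STOP_WORDS = { "a", "an", "and", "the" };

        public static TokenStream stemmingStream(Reader reader) {
            TokenStream result = new StandardTokenizer(reader);
            result = new StandardFilter(result);      // acronym/apostrophe cleanup
            result = new LowerCaseFilter(result);
            result = new StopFilter(result, STOP_WORDS);
            result = new PorterStemFilter(result);    // stemming added at the end
            return result;
        }
    }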





Re: Observations: profiling indexing process

Posted by Rajive Dave <ha...@yahoo.com>.
The tokenizer we have is a pretty straightforward
implementation, specific to our grammar.  It is just a
tweaked version of the CharStream.java already checked in.
I don't think it's of any general use, and hence not
contributable.

Anyway, the point is: yes, JavaCC has substantial
overhead.
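
To illustrate what "home grown" can mean here (this is not our actual
code, just a sketch): a minimal hand-rolled scanner that collects runs
of letters and digits, with no JavaCC-generated state machine behind
it, can be as small as this:

    import java.io.IOException;
    import java.io.Reader;

    // Illustrative sketch of a hand-rolled word scanner.
    public class SimpleWordScanner {
        private final Reader in;
        private int c = -2;   // one-character lookahead; -2 means "not read yet"

        public SimpleWordScanner(Reader in) {
            this.in = in;
        }

        /** Returns the next run of letters/digits, or null at end of input. */
        public String next() throws IOException {
            if (c == -2) c = in.read();
            while (c != -1 && !Character.isLetterOrDigit((char) c)) {
                c = in.read();                    // skip separators
            }
            if (c == -1) return null;
            StringBuffer word = new StringBuffer();
            while (c != -1 && Character.isLetterOrDigit((char) c)) {
                word.append((char) c);
                c = in.read();
            }
            return word.toString();
        }
    }

Lower-casing, stop words and so on would still be layered on top, the
same as with the JavaCC version.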

Rajive

--- Otis Gospodnetic <ot...@yahoo.com>
wrote:
> Non-contributable?
> The implementation is just Java, then, with no alternative parser
> tools like ANTLR or some such?
> 
> Otis
> 
> --- Rajive Dave <ha...@yahoo.com> wrote:
> > Yep, we replaced the JavaCC-generated tokenizer with our own
> > home-grown one.  I think indexing speed nearly doubled
> > because our documents are rather large.
> > 
> > Rajive
> > 
> > --- Otis Gospodnetic <ot...@yahoo.com>
> > wrote:
> > > Hello,
> > > 
> > > I decided to run a little Lucene app that does some
> > > indexing under a
> > > profiler. (I used JMP,
> > > http://www.khelekore.org/jmp/, a rather simple
> > > one).
> > > 
> > > The app uses StandardAnalyzer.
> > > I've noticed that a lot of time is spent in
> > > StandardTokenizer and
> > > various JavaCC-generated methods.
> > > I am wondering if anyone tried replacing
> > > StandardTokenizer.jj with
> > > something more efficient?
> > > 
> > > Also, StopFilter is using a Hashtable to store the
> > > list of stop words.
> > > Has anyone tried using HashMap instead?
> > > 
> > > Thanks,
> > > Otis


Re: Observations: profiling indexing process

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Non-contributable?
The implementation is just Java, then, with no alternative parser
tools like ANTLR or some such?

Otis

--- Rajive Dave <ha...@yahoo.com> wrote:
> Yep, we replaced the JavaCC-generated tokenizer with our own
> home-grown one.  I think indexing speed nearly doubled
> because our documents are rather large.
> 
> Rajive
> 
> --- Otis Gospodnetic <ot...@yahoo.com>
> wrote:
> > Hello,
> > 
> > I decided to run a little Lucene app that does some
> > indexing under a
> > profiler. (I used JMP,
> > http://www.khelekore.org/jmp/, a rather simple
> > one).
> > 
> > The app uses StandardAnalyzer.
> > I've noticed that a lot of time is spent in
> > StandardTokenizer and
> > various JavaCC-generated methods.
> > I am wondering if anyone tried replacing
> > StandardTokenizer.jj with
> > something more efficient?
> > 
> > Also, StopFilter is using a Hashtable to store the
> > list of stop words. 
> > Has anyone tried using HashMap instead?
> > 
> > Thanks,
> > Otis


Re: Observations: profiling indexing process

Posted by Rajive Dave <ha...@yahoo.com>.
Yep, we replaced the JavaCC-generated tokenizer with our own
home-grown one.  I think indexing speed nearly doubled
because our documents are rather large.

Rajive

--- Otis Gospodnetic <ot...@yahoo.com>
wrote:
> Hello,
> 
> I decided to run a little Lucene app that does some
> indexing under a
> profiler. (I used JMP,
> http://www.khelekore.org/jmp/, a rather simple
> one).
> 
> The app uses StandardAnalyzer.
> I've noticed that a lot of time is spent in
> StandardTokenizer and
> various JavaCC-generated methods.
> I am wondering if anyone tried replacing
> StandardTokenizer.jj with
> something more efficient?
> 
> Also, StopFilter is using a Hashtable to store the
> list of stop words. 
> Has anyone tried using HashMap instead?
> 
> Thanks,
> Otis