Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2011/12/07 22:42:57 UTC

[lucy-dev] Some quick benchmarks

Some quick and completely unscientific benchmarks, indexing 1000 times 
the same 10K ASCII document:

RT = RegexTokenizer
ST = StandardTokenizer
CF = CaseFolder
N  = Normalizer

RT:    2.177s
RT+CF: 3.964s
RT+N:  2.556s
ST:    1.551s
ST+CF: 3.357s
ST+N:  1.931s
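
For scale, the timings above can be turned into rough throughput figures. This sketch assumes "1000 times the same 10K document" means roughly 1000 × 10,000 bytes = 10 MB of input; that interpretation, and the script itself, are back-of-the-envelope additions, not part of the original benchmark:

```python
# Rough per-configuration throughput, assuming 1000 docs x 10,000 bytes.
TIMES = {
    "RT": 2.177, "RT+CF": 3.964, "RT+N": 2.556,
    "ST": 1.551, "ST+CF": 3.357, "ST+N": 1.931,
}

TOTAL_BYTES = 1000 * 10_000  # ~10 MB of input text (assumed)

def throughput_mb_per_s(config):
    """Megabytes of input analyzed per second for a given chain."""
    return TOTAL_BYTES / TIMES[config] / 1e6

for config in TIMES:
    print(f"{config:6s} {throughput_mb_per_s(config):5.2f} MB/s")
```

Even the slowest chain (RT+CF) processes a few MB/s, which supports the later point in the thread that the analysis chain, not disk I/O, tends to be the bottleneck.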

It's also interesting that moving the tokenizer in front of the case 
folder or normalizer always gave me faster results.

Nick

Re: [lucy-dev] Some quick benchmarks

Posted by Nick Wellnhofer <we...@aevum.de>.
On 08/12/2011 23:38, Joe Schaefer wrote:
> When is all this nifty code going to land in trunk?  Don't
> wait for anyone to give you permission Nick, that decision
> is all yours.

I just merged branch LUCY-196-uax-tokenizer into trunk.

Nick

Re: [lucy-dev] Some quick benchmarks

Posted by Joe Schaefer <jo...@yahoo.com>.
When is all this nifty code going to land in trunk?  Don't
wait for anyone to give you permission Nick, that decision
is all yours.



----- Original Message -----
> From: Nick Wellnhofer <we...@aevum.de>
> To: lucy-dev@incubator.apache.org
> Cc: 
> Sent: Thursday, December 8, 2011 2:43 PM
> Subject: Re: [lucy-dev] Some quick benchmarks
> 
> On 08/12/11 20:04, Nathan Kurz wrote:
>>  I'm mostly listening in on this conversation because I haven't 
> thought
>>  much about indexing, but the magnitude of improvement here surprises
>>  me:  I wouldn't have thought that there would be that much time to
>>  shave off!    My presumption was that everything would be dominated by
>>  Disk IO, and that the actual tokenizing time would be tiny.   Are
>>  these numbers both working within memory with a pre-warmed cache so no
>>  disk reads are involved?  Also, have you controlled for whether the
>>  data is sync'ed to disk after the indexing?
> 
> These numbers are with pre-warmed cache. Also, the data isn't synced AFAIU. 
> But I think the analysis chain is CPU bound in the general case. All that 
> tokenizing, normalizing and stemming uses a lot of CPU cycles.
> 
>>  I'm not in a position to do it, but it might be insightful to do a
>>  quick profile of where these two are spending their time.  Are we
>>  gaining because the algorithm is faster, or because we have less
>>  function call overhead, or because of something confounding?
> 
> It's mainly that the algorithms are faster. The CaseFolder seems to be 
> especially slow but I have no idea why.
> 
>>  Oprofile
>>  on Linux is very easy to use once you have it set up.  In case you
>>  aren't familiar with it, this is a good intro:
>> 
> http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.
> 
> I have used it once and found it hard to set up on a virtual machine. But 
> it's very useful if you want to profile long-running processes.
> 
> Nick
> 

Re: [lucy-dev] Some quick benchmarks

Posted by Nick Wellnhofer <we...@aevum.de>.
On 08/12/11 20:04, Nathan Kurz wrote:
> I'm mostly listening in on this conversation because I haven't thought
> much about indexing, but the magnitude of improvement here surprises
> me:  I wouldn't have thought that there would be that much time to
> shave off!    My presumption was that everything would be dominated by
> Disk IO, and that the actual tokenizing time would be tiny.   Are
> these numbers both working within memory with a pre-warmed cache so no
> disk reads are involved?  Also, have you controlled for whether the
> data is sync'ed to disk after the indexing?

These numbers are with pre-warmed cache. Also, the data isn't synced 
AFAIU. But I think the analysis chain is CPU bound in the general case. 
All that tokenizing, normalizing and stemming uses a lot of CPU cycles.

> I'm not in a position to do it, but it might be insightful to do a
> quick profile of where these two are spending their time.  Are we
> gaining because the algorithm is faster, or because we have less
> function call overhead, or because of something confounding?

It's mainly that the algorithms are faster. The CaseFolder seems to be 
especially slow but I have no idea why.

> Oprofile
> on Linux is very easy to use once you have it set up.  In case you
> aren't familiar with it, this is a good intro:
> http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.

I have used it once and found it hard to set up on a virtual machine. But 
it's very useful if you want to profile long-running processes.

Nick

Re: [lucy-dev] Some quick benchmarks

Posted by Nathan Kurz <na...@verse.com>.
On Thu, Dec 8, 2011 at 10:02 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 08/12/2011 01:41, Marvin Humphrey wrote:
>
> Here is more data from a real world indexing run:
>
> RT+CF: 139 secs
> ST+N:  112 secs
>

Hi Nick --

I'm mostly listening in on this conversation because I haven't thought
much about indexing, but the magnitude of improvement here surprises
me:  I wouldn't have thought that there would be that much time to
shave off!    My presumption was that everything would be dominated by
Disk IO, and that the actual tokenizing time would be tiny.   Are
these numbers both working within memory with a pre-warmed cache so no
disk reads are involved?  Also, have you controlled for whether the
data is sync'ed to disk after the indexing?

I'm not in a position to do it, but it might be insightful to do a
quick profile of where these two are spending their time.  Are we
gaining because the algorithm is faster, or because we have less
function call overhead, or because of something confounding?  Oprofile
on Linux is very easy to use once you have it set up.  In case you
aren't familiar with it, this is a good intro:
http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.

Thanks!

--nate

Re: [lucy-dev] Some quick benchmarks

Posted by Nick Wellnhofer <we...@aevum.de>.
On 08/12/2011 01:41, Marvin Humphrey wrote:
> These numbers are great, and in line with some benchmarks I was also running
> today (raw data below).  StandardTokenizer and Normalizer are considerably
> faster than RegexTokenizer and the current implementation of CaseFolder, and
> thus the proposed EasyAnalyzer (StandardTokenizer, Normalizer,
> SnowballStemmer) outperforms PolyAnalyzer (CaseFolder, RegexTokenizer,
> SnowballStemmer) by a wide margin:
>
>      Time to index 1000 docs (10 reps, truncated mean)
>      =================================================
>      PolyAnalyzer   .576 secs
>      EasyAnalyzer   .436 secs

Here is more data from a real world indexing run:

RT+CF: 139 secs
ST+N:  112 secs

> Can't wait for StandardTokenizer to land in trunk!

I don't have any further work planned, so the branch is ready to be merged.

>> It's also interesting that moving the tokenizer in front of the case
>> folder or normalizer always gave me faster results.
>
> Yes, I get the same results.  When I first saw the effect, I thought it might
> be stack-memory-vs-malloc'd-buffer in Normalizer, but I was taken by surprise
> that CaseFolder behaved that way.  I have no explanation, but the results
> certainly argue for starting off analysis with tokenization.

In Normalizer it's probably because we have to scan the whole document 
twice to find the buffer size, which rarely if ever happens when working 
with tokenized words.

Also, the benefit of running the normalizer or case folder before the 
tokenizer isn't that great, because tokens and most of the text buffers 
are reused. So we don't really save on allocations.
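
That explanation can be sketched with a toy cost model, counting characters scanned. This is purely illustrative, not Lucy's actual code: it assumes normalizing costs one transform pass plus an extra sizing pass only when the input doesn't fit an existing reused buffer, so short tokens mostly skip the sizing pass while a whole document never does:

```python
TOKEN_BUF = 64  # hypothetical size of a reused per-token buffer

def normalize_cost(n_chars, buf_size):
    # One transform pass, plus a second sizing pass when the input
    # is too large for the existing buffer.
    passes = 1 if n_chars <= buf_size else 2
    return passes * n_chars

def tokenize_cost(n_chars):
    return n_chars  # single pass over the text

def normalizer_first(doc):
    # Normalize the whole document (always needs the sizing pass),
    # then tokenize the normalized result.
    return normalize_cost(len(doc), TOKEN_BUF) + tokenize_cost(len(doc))

def tokenizer_first(doc):
    # Tokenize once, then normalize each short token into a reused buffer.
    return tokenize_cost(len(doc)) + sum(
        normalize_cost(len(tok), TOKEN_BUF) for tok in doc.split()
    )

doc = "the quick brown fox jumps over the lazy dog " * 250  # ~11 KB
print(normalizer_first(doc), tokenizer_first(doc))
```

Under this model tokenizer-first always scans fewer characters, matching the measurements in the thread, though the real gap also involves cache behavior and allocation patterns.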

Nick

Re: [lucy-dev] Some quick benchmarks

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Dec 07, 2011 at 10:42:57PM +0100, Nick Wellnhofer wrote:
> Some quick and completely unscientific benchmarks, indexing 1000 times  
> the same 10K ASCII document:
>
> RT = RegexTokenizer
> ST = StandardTokenizer
> CF = CaseFolder
> N  = Normalizer
>
> RT:    2.177s
> RT+CF: 3.964s
> RT+N:  2.556s
> ST:    1.551s
> ST+CF: 3.357s
> ST+N:  1.931s

These numbers are great, and in line with some benchmarks I was also running
today (raw data below).  StandardTokenizer and Normalizer are considerably
faster than RegexTokenizer and the current implementation of CaseFolder, and
thus the proposed EasyAnalyzer (StandardTokenizer, Normalizer,
SnowballStemmer) outperforms PolyAnalyzer (CaseFolder, RegexTokenizer,
SnowballStemmer) by a wide margin:

    Time to index 1000 docs (10 reps, truncated mean)
    =================================================
    PolyAnalyzer   .576 secs
    EasyAnalyzer   .436 secs
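
The "truncated mean (6 kept, 4 discarded)" in the raw data below can be computed by, for example, sorting the run times and dropping the extremes before averaging. This is a sketch of one common trimming scheme (two dropped from each end); the benchmark harness's exact rule may differ:

```python
def truncated_mean(samples, keep=6):
    """Mean of the middle `keep` values after sorting; with 10 samples
    and keep=6, the 2 fastest and 2 slowest runs are discarded."""
    drop = (len(samples) - keep) // 2
    middle = sorted(samples)[drop:len(samples) - drop]
    return sum(middle) / len(middle)

# Run times from the first RegexTokenizer benchmark below:
times = [0.300, 0.299, 0.297, 0.300, 0.298,
         0.299, 0.297, 0.296, 0.300, 0.298]
print(truncated_mean(times))
```

Trimming like this makes the reported number robust against a stray slow rep (e.g. a GC pause or background process) without hand-picking runs.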

Can't wait for StandardTokenizer to land in trunk!

> It's also interesting that moving the tokenizer in front of the case  
> folder or normalizer always gave me faster results.

Yes, I get the same results.  When I first saw the effect, I thought it might
be stack-memory-vs-malloc'd-buffer in Normalizer, but I was taken by surprise
that CaseFolder behaved that way.  I have no explanation, but the results
certainly argue for starting off analysis with tokenization.

Marvin Humphrey

===========================================================================

~/projects/lucy_196/perl $ # RegexTokenizer, pattern => \S+
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.300  Docs: 1000
2    Secs: 0.299  Docs: 1000
3    Secs: 0.297  Docs: 1000
4    Secs: 0.300  Docs: 1000
5    Secs: 0.298  Docs: 1000
6    Secs: 0.299  Docs: 1000
7    Secs: 0.297  Docs: 1000
8    Secs: 0.296  Docs: 1000
9    Secs: 0.300  Docs: 1000
10   Secs: 0.298  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.298 secs 
Truncated mean (6 kept, 4 discarded): 0.298 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # StandardTokenizer
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.254  Docs: 1000
2    Secs: 0.251  Docs: 1000
3    Secs: 0.253  Docs: 1000
4    Secs: 0.251  Docs: 1000
5    Secs: 0.253  Docs: 1000
6    Secs: 0.252  Docs: 1000
7    Secs: 0.253  Docs: 1000
8    Secs: 0.253  Docs: 1000
9    Secs: 0.251  Docs: 1000
10   Secs: 0.254  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.253 secs 
Truncated mean (6 kept, 4 discarded): 0.253 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # CaseFolder
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.160  Docs: 1000
2    Secs: 0.159  Docs: 1000
3    Secs: 0.160  Docs: 1000
4    Secs: 0.159  Docs: 1000
5    Secs: 0.160  Docs: 1000
6    Secs: 0.158  Docs: 1000
7    Secs: 0.161  Docs: 1000
8    Secs: 0.158  Docs: 1000
9    Secs: 0.160  Docs: 1000
10   Secs: 0.158  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.159 secs 
Truncated mean (6 kept, 4 discarded): 0.159 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # Normalizer
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.150  Docs: 1000
2    Secs: 0.148  Docs: 1000
3    Secs: 0.150  Docs: 1000
4    Secs: 0.149  Docs: 1000
5    Secs: 0.150  Docs: 1000
6    Secs: 0.148  Docs: 1000
7    Secs: 0.150  Docs: 1000
8    Secs: 0.148  Docs: 1000
9    Secs: 0.151  Docs: 1000
10   Secs: 0.148  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.149 secs 
Truncated mean (6 kept, 4 discarded): 0.149 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # PolyAnalyzer, language => 'en'
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.577  Docs: 1000
2    Secs: 0.577  Docs: 1000
3    Secs: 0.579  Docs: 1000
4    Secs: 0.576  Docs: 1000
5    Secs: 0.576  Docs: 1000
6    Secs: 0.575  Docs: 1000
7    Secs: 0.576  Docs: 1000
8    Secs: 0.575  Docs: 1000
9    Secs: 0.586  Docs: 1000
10   Secs: 0.575  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.577 secs 
Truncated mean (6 kept, 4 discarded): 0.576 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # EasyAnalyzer, language => 'en'
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.437  Docs: 1000
2    Secs: 0.434  Docs: 1000
3    Secs: 0.436  Docs: 1000
4    Secs: 0.437  Docs: 1000
5    Secs: 0.436  Docs: 1000
6    Secs: 0.436  Docs: 1000
7    Secs: 0.441  Docs: 1000
8    Secs: 0.436  Docs: 1000
9    Secs: 0.435  Docs: 1000
10   Secs: 0.435  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.436 secs 
Truncated mean (6 kept, 4 discarded): 0.436 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # [ Normalizer, StandardTokenizer, SnowballStemmer(en) ]
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.470  Docs: 1000
2    Secs: 0.471  Docs: 1000
3    Secs: 0.472  Docs: 1000
4    Secs: 0.472  Docs: 1000
5    Secs: 0.477  Docs: 1000
6    Secs: 0.470  Docs: 1000
7    Secs: 0.468  Docs: 1000
8    Secs: 0.470  Docs: 1000
9    Secs: 0.471  Docs: 1000
10   Secs: 0.470  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.471 secs 
Truncated mean (6 kept, 4 discarded): 0.471 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # [ RegexTokenizer, CaseFolder, SnowballStemmer(en) ]
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.555  Docs: 1000
2    Secs: 0.558  Docs: 1000
3    Secs: 0.557  Docs: 1000
4    Secs: 0.555  Docs: 1000
5    Secs: 0.565  Docs: 1000
6    Secs: 0.556  Docs: 1000
7    Secs: 0.555  Docs: 1000
8    Secs: 0.558  Docs: 1000
9    Secs: 0.555  Docs: 1000
10   Secs: 0.553  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.557 secs 
Truncated mean (6 kept, 4 discarded): 0.556 secs
------------------------------------------------------------
~/projects/lucy_196/perl $