Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2011/12/07 22:42:57 UTC
[lucy-dev] Some quick benchmarks
Some quick and completely unscientific benchmarks, indexing the same 10K
ASCII document 1000 times:
RT = RegexTokenizer
ST = StandardTokenizer
CF = CaseFolder
N = Normalizer
RT: 2.177s
RT+CF: 3.964s
RT+N: 2.556s
ST: 1.551s
ST+CF: 3.357s
ST+N: 1.931s
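Spelled out as a quick back-of-the-envelope calculation (in Python, just for the arithmetic), those timings imply roughly a 29% speedup for StandardTokenizer over RegexTokenizer alone, and about 2x for the ST+N chain over RT+CF:

```python
# Relative speedups implied by the timings above
rt, st = 2.177, 1.551        # tokenizer alone
rt_cf, st_n = 3.964, 1.931   # tokenizer + case folder / normalizer

print(f"ST vs RT:      {(1 - st / rt) * 100:.0f}% faster")   # -> 29% faster
print(f"ST+N vs RT+CF: {rt_cf / st_n:.2f}x")                 # -> 2.05x
```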
It's also interesting that moving the tokenizer in front of the case
folder or normalizer always gave me faster results.
Nick
Re: [lucy-dev] Some quick benchmarks
Posted by Nick Wellnhofer <we...@aevum.de>.
On 08/12/2011 23:38, Joe Schaefer wrote:
> When is all this nifty code going to land in trunk? Don't
> wait for anyone to give you permission Nick, that decision
> is all yours.
I just merged branch LUCY-196-uax-tokenizer into trunk.
Nick
Re: [lucy-dev] Some quick benchmarks
Posted by Joe Schaefer <jo...@yahoo.com>.
When is all this nifty code going to land in trunk? Don't
wait for anyone to give you permission Nick, that decision
is all yours.
----- Original Message -----
> From: Nick Wellnhofer <we...@aevum.de>
> To: lucy-dev@incubator.apache.org
> Cc:
> Sent: Thursday, December 8, 2011 2:43 PM
> Subject: Re: [lucy-dev] Some quick benchmarks
>
> On 08/12/11 20:04, Nathan Kurz wrote:
>> I'm mostly listening in on this conversation because I haven't
> thought
>> much about indexing, but the magnitude of improvement here surprises
>> me: I wouldn't have thought that there would be that much time to
>> shave off! My presumption was that everything would be dominated by
>> Disk IO, and that the actual tokenizing time would be tiny. Are
>> these numbers both working within memory with a pre-warmed cache so no
>> disk reads are involved? Also, have you controlled for whether the
>> data is sync'ed to disk after the indexing?
>
> These numbers are with pre-warmed cache. Also, the data isn't synced AFAIU.
> But I think the analysis chain is CPU bound in the general case. All that
> tokenizing, normalizing and stemming uses a lot of CPU cycles.
>
>> I'm not in a position to do it, but it might be insightful to do a
>> quick profile of where these two are spending their time. Are we
>> gaining because the algorithm is faster, or because we have less
>> function call overhead, or because of something confounding?
>
> It's mainly that the algorithms are faster. The CaseFolder seems to be
> especially slow but I have no idea why.
>
>> Oprofile
>> on Linux is very easy to use once you have it set up. In case you
>> aren't familiar with it, this is a good intro:
>>
> http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.
>
> I have used it once and found it hard to set up on a virtual machine. But
> it's very useful if you want to profile long-running processes.
>
> Nick
>
Re: [lucy-dev] Some quick benchmarks
Posted by Nick Wellnhofer <we...@aevum.de>.
On 08/12/11 20:04, Nathan Kurz wrote:
> I'm mostly listening in on this conversation because I haven't thought
> much about indexing, but the magnitude of improvement here surprises
> me: I wouldn't have thought that there would be that much time to
> shave off! My presumption was that everything would be dominated by
> Disk IO, and that the actual tokenizing time would be tiny. Are
> these numbers both working within memory with a pre-warmed cache so no
> disk reads are involved? Also, have you controlled for whether the
> data is sync'ed to disk after the indexing?
These numbers are with pre-warmed cache. Also, the data isn't synced
AFAIU. But I think the analysis chain is CPU bound in the general case.
All that tokenizing, normalizing and stemming uses a lot of CPU cycles.
> I'm not in a position to do it, but it might be insightful to do a
> quick profile of where these two are spending their time. Are we
> gaining because the algorithm is faster, or because we have less
> function call overhead, or because of something confounding?
It's mainly that the algorithms are faster. The CaseFolder seems to be
especially slow but I have no idea why.
> Oprofile
> on Linux is very easy to use once you have it set up. In case you
> aren't familiar with it, this is a good intro:
> http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.
I have used it once and found it hard to set up on a virtual machine. But
it's very useful if you want to profile long-running processes.
Nick
Re: [lucy-dev] Some quick benchmarks
Posted by Nathan Kurz <na...@verse.com>.
On Thu, Dec 8, 2011 at 10:02 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 08/12/2011 01:41, Marvin Humphrey wrote:
>
> Here is more data from a real world indexing run:
>
> RT+CF: 139 secs
> ST+N: 112 secs
>
Hi Nick --
I'm mostly listening in on this conversation because I haven't thought
much about indexing, but the magnitude of improvement here surprises
me: I wouldn't have thought that there would be that much time to
shave off! My presumption was that everything would be dominated by
Disk IO, and that the actual tokenizing time would be tiny. Are
these numbers both working within memory with a pre-warmed cache so no
disk reads are involved? Also, have you controlled for whether the
data is sync'ed to disk after the indexing?
I'm not in a position to do it, but it might be insightful to do a
quick profile of where these two are spending their time. Are we
gaining because the algorithm is faster, or because we have less
function call overhead, or because of something confounding? Oprofile
on Linux is very easy to use once you have it set up. In case you
aren't familiar with it, this is a good intro:
http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.
Thanks!
--nate
Re: [lucy-dev] Some quick benchmarks
Posted by Nick Wellnhofer <we...@aevum.de>.
On 08/12/2011 01:41, Marvin Humphrey wrote:
> These numbers are great, and in line with some benchmarks I was also running
> today (raw data below). StandardTokenizer and Normalizer are considerably
> faster than RegexTokenizer and the current implementation of CaseFolder, and
> thus the proposed EasyAnalyzer (StandardTokenizer, Normalizer,
> SnowballStemmer) outperforms PolyAnalyzer (CaseFolder, RegexTokenizer,
> SnowballStemmer) by a wide margin:
>
> Time to index 1000 docs (10 reps, truncated mean)
> =================================================
> PolyAnalyzer .576 secs
> EasyAnalyzer .436 secs
Here is more data from a real world indexing run:
RT+CF: 139 secs
ST+N: 112 secs
> Can't wait for StandardTokenizer to land in trunk!
I don't have any further work planned, so the branch is ready to be merged.
>> It's also interesting that moving the tokenizer in front of the case
>> folder or normalizer always gave me faster results.
>
> Yes, I get the same results. When I first saw the effect, I thought it might
> be stack-memory-vs-malloc'd-buffer in Normalizer, but I was taken by surprise
> that CaseFolder behaved that way. I have no explanation, but the results
> certainly argue for starting off analysis with tokenization.
In Normalizer, it's probably because we have to scan the whole document
twice to determine the buffer size, which rarely if ever happens when
working with tokenized words.
Also, the benefit of running the normalizer or case folder before the
tokenizer isn't that great, because tokens and most of the text buffers
are reused, so we don't really save on allocations.
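The buffer-reuse effect described here can be sketched with a toy model (the class and names below are hypothetical illustrations, not Lucy's actual API): a normalizer that keeps a reusable output buffer only pays for an extra measuring pass and reallocation when the input outgrows the buffer, which a whole document does every time but short tokens almost never do.

```python
class ToyNormalizer:
    """Toy model of a normalizer with a reusable output buffer."""

    def __init__(self, capacity=64):
        self.capacity = capacity  # bytes currently allocated
        self.rescans = 0          # extra measuring passes + reallocations

    def normalize(self, text):
        if len(text) > self.capacity:
            # Input outgrew the buffer: scan it again to size the new
            # allocation, as happens when normalizing a whole document.
            self.rescans += 1
            self.capacity = len(text)
        return text.lower()

doc = "Hello World " * 1000  # ~12K document

whole = ToyNormalizer()
whole.normalize(doc)          # one oversized input -> extra scan

per_token = ToyNormalizer()
for tok in doc.split():
    per_token.normalize(tok)  # short tokens fit the reused buffer

print(whole.rescans, per_token.rescans)  # -> 1 0
```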
Nick
Re: [lucy-dev] Some quick benchmarks
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Dec 07, 2011 at 10:42:57PM +0100, Nick Wellnhofer wrote:
> Some quick and completely unscientific benchmarks, indexing the same 10K
> ASCII document 1000 times:
>
> RT = RegexTokenizer
> ST = StandardTokenizer
> CF = CaseFolder
> N = Normalizer
>
> RT: 2.177s
> RT+CF: 3.964s
> RT+N: 2.556s
> ST: 1.551s
> ST+CF: 3.357s
> ST+N: 1.931s
These numbers are great, and in line with some benchmarks I was also running
today (raw data below). StandardTokenizer and Normalizer are considerably
faster than RegexTokenizer and the current implementation of CaseFolder, and
thus the proposed EasyAnalyzer (StandardTokenizer, Normalizer,
SnowballStemmer) outperforms PolyAnalyzer (CaseFolder, RegexTokenizer,
SnowballStemmer) by a wide margin:
Time to index 1000 docs (10 reps, truncated mean)
=================================================
PolyAnalyzer .576 secs
EasyAnalyzer .436 secs
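The "truncated mean" the harness reports can be sketched as follows (assuming it discards the extremes symmetrically, which matches the "6 kept, 4 discarded" lines in the raw output for 10 reps):

```python
def truncated_mean(times, keep=6):
    # Sort, discard the extremes symmetrically, and average the rest;
    # with 10 reps and keep=6 this drops the 2 fastest and 2 slowest.
    drop = (len(times) - keep) // 2
    kept = sorted(times)[drop:len(times) - drop]
    return sum(kept) / len(kept)

# EasyAnalyzer rep timings from the raw data below
easy = [0.437, 0.434, 0.436, 0.437, 0.436,
        0.436, 0.441, 0.436, 0.435, 0.435]
print(f"{truncated_mean(easy):.3f}")  # -> 0.436, matching the report
```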
Can't wait for StandardTokenizer to land in trunk!
> It's also interesting that moving the tokenizer in front of the case
> folder or normalizer always gave me faster results.
Yes, I get the same results. When I first saw the effect, I thought it might
be stack-memory-vs-malloc'd-buffer in Normalizer, but I was taken by surprise
that CaseFolder behaved that way. I have no explanation, but the results
certainly argue for starting off analysis with tokenization.
Marvin Humphrey
===========================================================================
~/projects/lucy_196/perl $ # RegexTokenizer, pattern => \S+
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.300 Docs: 1000
2 Secs: 0.299 Docs: 1000
3 Secs: 0.297 Docs: 1000
4 Secs: 0.300 Docs: 1000
5 Secs: 0.298 Docs: 1000
6 Secs: 0.299 Docs: 1000
7 Secs: 0.297 Docs: 1000
8 Secs: 0.296 Docs: 1000
9 Secs: 0.300 Docs: 1000
10 Secs: 0.298 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.298 secs
Truncated mean (6 kept, 4 discarded): 0.298 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # StandardTokenizer
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.254 Docs: 1000
2 Secs: 0.251 Docs: 1000
3 Secs: 0.253 Docs: 1000
4 Secs: 0.251 Docs: 1000
5 Secs: 0.253 Docs: 1000
6 Secs: 0.252 Docs: 1000
7 Secs: 0.253 Docs: 1000
8 Secs: 0.253 Docs: 1000
9 Secs: 0.251 Docs: 1000
10 Secs: 0.254 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.253 secs
Truncated mean (6 kept, 4 discarded): 0.253 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # CaseFolder
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.160 Docs: 1000
2 Secs: 0.159 Docs: 1000
3 Secs: 0.160 Docs: 1000
4 Secs: 0.159 Docs: 1000
5 Secs: 0.160 Docs: 1000
6 Secs: 0.158 Docs: 1000
7 Secs: 0.161 Docs: 1000
8 Secs: 0.158 Docs: 1000
9 Secs: 0.160 Docs: 1000
10 Secs: 0.158 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.159 secs
Truncated mean (6 kept, 4 discarded): 0.159 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # Normalizer
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.150 Docs: 1000
2 Secs: 0.148 Docs: 1000
3 Secs: 0.150 Docs: 1000
4 Secs: 0.149 Docs: 1000
5 Secs: 0.150 Docs: 1000
6 Secs: 0.148 Docs: 1000
7 Secs: 0.150 Docs: 1000
8 Secs: 0.148 Docs: 1000
9 Secs: 0.151 Docs: 1000
10 Secs: 0.148 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.149 secs
Truncated mean (6 kept, 4 discarded): 0.149 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # PolyAnalyzer, language => 'en'
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.577 Docs: 1000
2 Secs: 0.577 Docs: 1000
3 Secs: 0.579 Docs: 1000
4 Secs: 0.576 Docs: 1000
5 Secs: 0.576 Docs: 1000
6 Secs: 0.575 Docs: 1000
7 Secs: 0.576 Docs: 1000
8 Secs: 0.575 Docs: 1000
9 Secs: 0.586 Docs: 1000
10 Secs: 0.575 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.577 secs
Truncated mean (6 kept, 4 discarded): 0.576 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # EasyAnalyzer, language => 'en'
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.437 Docs: 1000
2 Secs: 0.434 Docs: 1000
3 Secs: 0.436 Docs: 1000
4 Secs: 0.437 Docs: 1000
5 Secs: 0.436 Docs: 1000
6 Secs: 0.436 Docs: 1000
7 Secs: 0.441 Docs: 1000
8 Secs: 0.436 Docs: 1000
9 Secs: 0.435 Docs: 1000
10 Secs: 0.435 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.436 secs
Truncated mean (6 kept, 4 discarded): 0.436 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # [ Normalizer, StandardTokenizer, SnowballStemmer(en) ]
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.470 Docs: 1000
2 Secs: 0.471 Docs: 1000
3 Secs: 0.472 Docs: 1000
4 Secs: 0.472 Docs: 1000
5 Secs: 0.477 Docs: 1000
6 Secs: 0.470 Docs: 1000
7 Secs: 0.468 Docs: 1000
8 Secs: 0.470 Docs: 1000
9 Secs: 0.471 Docs: 1000
10 Secs: 0.470 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.471 secs
Truncated mean (6 kept, 4 discarded): 0.471 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ # [ RegexTokenizer, CaseFolder, SnowballStemmer(en) ]
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1 Secs: 0.555 Docs: 1000
2 Secs: 0.558 Docs: 1000
3 Secs: 0.557 Docs: 1000
4 Secs: 0.555 Docs: 1000
5 Secs: 0.565 Docs: 1000
6 Secs: 0.556 Docs: 1000
7 Secs: 0.555 Docs: 1000
8 Secs: 0.558 Docs: 1000
9 Secs: 0.555 Docs: 1000
10 Secs: 0.553 Docs: 1000
------------------------------------------------------------
Lucy 0.002
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.557 secs
Truncated mean (6 kept, 4 discarded): 0.556 secs
------------------------------------------------------------
~/projects/lucy_196/perl $