Posted to dev@lucy.apache.org by David Balmain <db...@gmail.com> on 2006/12/19 06:56:51 UTC
Re: OO Design -- performance
On 10/31/06, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> On Oct 23, 2006, at 6:44 PM, David Balmain wrote:
> >> The cost of a double-dereference to dispatch a method is
> >> insignificant. However, the syntax is godawful...
> >>
> >> len = instream->m->length_i(instream);
> >>
> >> ... which is why Dave has macro-fied it:
> >>
> >> #define is_length(mis) mis->m->length_i(mis)
> >>
> >> How about if we implement every method call in Lucy that way?
> >
> > Sounds good to me. I would have implemented all of Ferret's classes
> > this way except that I was naively worrying about the performance
> > detriment of the extra pointer dereference. I thought the extra speed
> > may be worth the cost in memory.
>
> I've pondered this a little more, and I think you have a point. The
> extra deref is a cost in this design, and for a very tight loop, it
> could make a difference. In general, pointer dereferencing is such a
> cheap operation it's not something worth worrying about. But this is
> a major design decision. We're not talking about a localized area of
> code. This is everywhere, so it's important to choose correctly.
I really think pointer dereferencing is such a cheap operation that it
isn't worth worrying about; it seems to have made very little
difference to Ferret's performance.
> Maybe we can treat the i/o classes as a special case, since I think
> that's where the majority of our function calls occur. Potentially,
> we could just have the Streams' macros alias functions directly...
>
> #define InStream_Write_VInt(instream) InStream_write_vint(instream)
>
> ... which is akin to declaring InStream_Write_VInt a "final" method.
> (I assume that at the compiler level, that's how "final" methods are
> implemented.)
>
> That's an option if Lucy adopts the same architecture that Ferret and
> KinoSearch have: no subclasses for InStream and OutStream.
>
> If the function calls are all macrofied, it's easy enough to switch
> up the macro definition and see if there's any effect. My guess is
> that there will be a small but measurable difference.
I think the difference will be very small indeed. But as you say,
since the function calls are macrofied, it really is easy to switch
out the function call with a macro. The question is which way to go
first. I think we should keep it consistent across the board and
leave this optimization for later. But that's my personal preference,
and it certainly isn't a dealbreaker for me.
> This being C, we can also copy a function pointer out of the vtable
> and use it directly if we identify a really tight loop somewhere as a
> bottleneck.
>
> InStream_read_vint_t read_vint = instream->_->read_vint;
> for (i = 0; i < a_lot; i++) {
>     *foo++ = read_vint(instream);
> }
>
> However, I suspect that if we treat the i/o classes as special cases
> and call their functions directly everywhere, we'll have addressed
> 90% of the bottlenecks.
True, the i/o methods, particularly the vint i/o methods, are the
major bottleneck. I spent a lot of time optimizing write_vint and
read_vint. They're ugly but they got me a massive 20% performance
gain. Also, adding a skip_vints method instead of just using read_vint
in a loop like Lucene gave me another 10%.
> > Anyway, I changed the Streams to test
> > the difference and I mean to refactor the rest of the classes like
> > this when I have time.
>
> Below my sig you'll find the output of a simple benchmarking app
> "speed.c" written by a guy named Michael Lee, run on my G4 Laptop and
> then also on a Pentium 4. It shows the relative costs of some simple
> operations. You can download it from
> <http://www.vmlinux.org/jocke/programming/www.ontek.com/mikey/speed.c>.
> His original 1997 article on C optimization (which has a broken link
> to speed.c) is at <http://www.prism.uvsq.fr/~cedb/local_copies/lee.html>.
>
> Since that article is nearly a decade old, it doesn't spend a lot of
> time on CPU caching, which supposedly is becoming more important
> these days. I don't have any practical experience doing those kinds
> of optimizations, though. Maybe we could stuff all the vtables in
> one giant struct and hope that with all our function pointers jammed
> up against one another we'd get less cache thrash. But I think
> that's chasing our tail.
Unless we can make this really transparent I don't think it is worth
it. I'm not exactly sure, but maybe just declaring all of the structs
next to each other would give the same benefit. I don't know enough
about C compilers to say for sure.
> The bottom line for me is that I'd prefer that the method calls be
> macrofied because it's going to clean up the source quite a bit. The
> best way to optimize is to find a better algorithm, and simpler,
> cleaner, smaller source code makes refactoring easier.
This is a must for me too. I couldn't agree more.
--
Dave Balmain
http://www.davebalmain.com/
Re: OO Design -- performance
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Feb 27, 2007, at 10:43 PM, David Balmain wrote:
>> Truncated mean time to index 1000 Reuters news stories:
>>
>> Before: 1.83 seconds
>> After: 1.82 seconds
>>
>> Glad it didn't take me very long to implement. :)
>
> Thanks for posting this. It'll probably save me a lot of time (or
> worry) in the future. ;-)
It's cake to flip a switch in boilerplater.pl and turn all the final
methods back into virtual methods, if you ever get the itch.
To put those numbers in perspective, I recently knocked 30% off of
KinoSearch's indexing benchmark by refactoring Token sorting to
minimize malloc calls and hashset building.
I think boilerplater.pl is pretty much feature-complete at this
point. There's nothing more on my to-do list for it, and it seems to
be working fine. When I can afford to take a break from KS work,
I'll do a big s/kino_/lucy_/g on it and we can start playing with it
here.
I had to change the syntax a bit, though. Automatically generating
the method names wasn't working well, because it couldn't detect
where I wanted CamelCase in the method name. For instance, I wanted
Folder_Open_InStream, but it gave me Folder_Open_Instream (lower case
s in Instream). So now, each method invoker is spec'd manually. The
header files got a little uglier, but the C files are perfect.
http://www.rectangular.com/svn/kinosearch/trunk/c_src/KinoSearch/Search/TopDocCollector.h
http://www.rectangular.com/svn/kinosearch/trunk/c_src/KinoSearch/Search/TopDocCollector.c
Oh yeah... I also implemented dynamic subclassing. :)
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: OO Design -- performance
Posted by David Balmain <db...@gmail.com>.
On 2/28/07, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> On Dec 21, 2006, at 3:19 PM, David Balmain wrote:
>
> > I'll be very interested to see if there is a noticeable performance
> > difference with the "final" classes.
>
> I implemented final methods and final classes a little while back.
>
> All of the following classes became final, which means that all of
> their methods became aliases for functions rather than double-
> dereference invocations via vtable:
>
> DelDocs
> PostingsWriter
> SegInfo
> TermInfo
> TermInfosWriter
> TermVectorsReader
> TermVectorsWriter
> InStream
> OutStream
> SortExternal
>
> Truncated mean time to index 1000 Reuters news stories:
>
> Before: 1.83 seconds
> After: 1.82 seconds
>
> Glad it didn't take me very long to implement. :)
Thanks for posting this. It'll probably save me a lot of time (or
worry) in the future. ;-)
--
Dave Balmain
http://www.davebalmain.com/
Re: OO Design -- performance
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 21, 2006, at 3:19 PM, David Balmain wrote:
> I'll be very interested to see if there is a noticeable performance
> difference with the "final" classes.
I implemented final methods and final classes a little while back.
All of the following classes became final, which means that all of
their methods became aliases for functions rather than double-
dereference invocations via vtable:
DelDocs
PostingsWriter
SegInfo
TermInfo
TermInfosWriter
TermVectorsReader
TermVectorsWriter
InStream
OutStream
SortExternal
Truncated mean time to index 1000 Reuters news stories:
Before: 1.83 seconds
After: 1.82 seconds
Glad it didn't take me very long to implement. :)
It's worth noting, though, that a lot of the VInt-writing was already
handled via function calls.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
slothbear:~/projects/ks/perl/t/benchmarks marvin$ perl -Mblib
indexers/kinosearch_indexer.plx --docs=1000 --reps=30
------------------------------------------------------------
1 Secs: 1.82 Docs: 1000
2 Secs: 1.83 Docs: 1000
3 Secs: 1.83 Docs: 1000
4 Secs: 1.84 Docs: 1000
5 Secs: 1.86 Docs: 1000
6 Secs: 1.81 Docs: 1000
7 Secs: 1.83 Docs: 1000
8 Secs: 1.83 Docs: 1000
9 Secs: 1.85 Docs: 1000
10 Secs: 1.82 Docs: 1000
11 Secs: 1.81 Docs: 1000
12 Secs: 1.80 Docs: 1000
13 Secs: 1.81 Docs: 1000
14 Secs: 1.82 Docs: 1000
15 Secs: 1.82 Docs: 1000
16 Secs: 1.82 Docs: 1000
17 Secs: 1.82 Docs: 1000
18 Secs: 2.23 Docs: 1000
19 Secs: 1.85 Docs: 1000
20 Secs: 1.86 Docs: 1000
21 Secs: 1.82 Docs: 1000
22 Secs: 1.84 Docs: 1000
23 Secs: 1.82 Docs: 1000
24 Secs: 1.85 Docs: 1000
25 Secs: 1.81 Docs: 1000
26 Secs: 1.84 Docs: 1000
27 Secs: 2.36 Docs: 1000
28 Secs: 1.87 Docs: 1000
29 Secs: 1.83 Docs: 1000
30 Secs: 1.81 Docs: 1000
------------------------------------------------------------
KinoSearch 0.20_01
Perl 5.8.6
Thread support: yes
Darwin 8.8.0 Power Macintosh
Mean: 1.86 secs
Truncated mean (16 kept, 14 discarded): 1.83 secs
------------------------------------------------------------
slothbear:~/projects/ks/perl/t/benchmarks marvin$
slothbear:~/projects/ks/perl/t/benchmarks marvin$ perl -Mblib
indexers/kinosearch_indexer.plx --docs=1000 --reps=30
------------------------------------------------------------
1 Secs: 1.89 Docs: 1000
2 Secs: 1.83 Docs: 1000
3 Secs: 1.83 Docs: 1000
4 Secs: 1.83 Docs: 1000
5 Secs: 1.81 Docs: 1000
6 Secs: 2.26 Docs: 1000
7 Secs: 2.00 Docs: 1000
8 Secs: 1.80 Docs: 1000
9 Secs: 1.83 Docs: 1000
10 Secs: 1.82 Docs: 1000
11 Secs: 1.80 Docs: 1000
12 Secs: 1.81 Docs: 1000
13 Secs: 1.92 Docs: 1000
14 Secs: 1.81 Docs: 1000
15 Secs: 2.32 Docs: 1000
16 Secs: 1.83 Docs: 1000
17 Secs: 1.81 Docs: 1000
18 Secs: 1.81 Docs: 1000
19 Secs: 1.79 Docs: 1000
20 Secs: 1.81 Docs: 1000
21 Secs: 1.79 Docs: 1000
22 Secs: 1.88 Docs: 1000
23 Secs: 1.82 Docs: 1000
24 Secs: 2.31 Docs: 1000
25 Secs: 1.83 Docs: 1000
26 Secs: 1.84 Docs: 1000
27 Secs: 1.83 Docs: 1000
28 Secs: 1.83 Docs: 1000
29 Secs: 1.83 Docs: 1000
30 Secs: 1.81 Docs: 1000
------------------------------------------------------------
KinoSearch 0.20_01
Perl 5.8.6
Thread support: yes
Darwin 8.8.0 Power Macintosh
Mean: 1.88 secs
Truncated mean (16 kept, 14 discarded): 1.82 secs
------------------------------------------------------------
slothbear:~/projects/ks/perl/t/benchmarks marvin$
Re: OO Design -- performance
Posted by David Balmain <db...@gmail.com>.
On 12/21/06, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> On Dec 18, 2006, at 9:56 PM, David Balmain wrote:
>
> > I think the difference will be very small indeed. But as you say,
> > since the function calls are macrofied it really is easy to switch out
> > the function call with a macro. The question is which way to go
> > first. I think we should keep it consistent across the board and leave
> > this optimization for later. But that is my personal preference and it
> > certainly isn't a dealbreaker for me.
>
> The way I've got things set up right now, all methods are virtual, so
> two lookups are required to locate the function address. However,
> with another 50 lines of code it's possible to implement "final"
> methods, which would be aliases for functions.
This code generator of yours is looking more and more useful. I'll be
very interested to see if there is a noticeable performance
difference with the "final" classes.
--
Dave Balmain
http://www.davebalmain.com/
Re: OO Design -- performance
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 18, 2006, at 9:56 PM, David Balmain wrote:
> I think the difference will be very small indeed. But as you say,
> since the function calls are macrofied it really is easy to switch out
> the function call with a macro. The question is which way to go
> first. I think we should keep it consistent across the board and leave
> this optimization for later. But that is my personal preference and it
> certainly isn't a dealbreaker for me.
The way I've got things set up right now, all methods are virtual, so
two lookups are required to locate the function address. However,
with another 50 lines of code it's possible to implement "final"
methods, which would be aliases for functions.
> True, the i/o methods, particularly the vint i/o methods, are the
> major bottleneck. I spent a lot of time optimizing write_vint and
> read_vint. They're ugly but they got me a massive 20% performance
> gain. Also, adding a skip_vints method instead of just using read_vint
> in a loop like Lucene gave me another 10%.
Mmmm, very interesting... I just spelunked your implementation.
Good stuff, I'm sure we'll want to use that.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/