Posted to dev@lucy.apache.org by David Balmain <db...@gmail.com> on 2006/12/19 06:56:51 UTC

Re: OO Design -- performance

On 10/31/06, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> On Oct 23, 2006, at 6:44 PM, David Balmain wrote:
> >> The cost of a double-dereference to dispatch a method is
> >> insignificant.  However, the syntax is godawful...
> >>
> >>       len = instream->m->length_i(instream);
> >>
> >> ... which is why Dave has macro-fied it:
> >>
> >>       #define is_length(mis) mis->m->length_i(mis)
> >>
> >> How about if we implement every method call in Lucy that way?
> >
> > Sounds good to me. I would have implemented all of Ferret's classes
> > this way except that I was naively worrying about the performance
> > cost of the extra pointer dereference. I thought the extra speed
> > might be worth the cost in memory.
>
> I've pondered this a little more, and I think you have a point.  The
> extra deref is a cost in this design, and for a very tight loop, it
> could make a difference.  In general, pointer dereferencing is such a
> cheap operation it's not something worth worrying about.  But this is
> a major design decision.  We're not talking about a localized area of
> code.  This is everywhere, so it's important to choose correctly.

I really do think that pointer dereferencing is such a cheap operation
that it isn't worth worrying about; it seems to have made very little
difference to Ferret's performance.

> Maybe we can treat the i/o classes as a special case, since I think
> that's where the majority of our function calls occur.  Potentially,
> we could just have the Streams' macros alias functions directly...
>
>      #define InStream_Write_VInt(instream) InStream_write_vint(instream)
>
> ... which is akin to declaring InStream_Write_VInt a "final" method.
> (I assume that at the compiler level, that's how "final" methods are
> implemented.)
>
> That's an option if Lucy adopts the same architecture that Ferret and
> KinoSearch have: no subclasses for InStream and OutStream.
>
> If the function calls are all macrofied, it's easy enough to switch
> up the macro definition and see if there's any effect.  My guess is
> that there will be a small but measurable difference.

I think the difference will be very small indeed. But as you say,
since the function calls are macrofied it really is easy to swap the
function call out for a virtual dispatch and back. The question is
which way to go first. I think we should keep it consistent across the
board and leave this optimization for later. But that is my personal
preference and it certainly isn't a dealbreaker for me.

> This being C, we can also copy a function pointer out of the vtable
> and use it directly if we identify a really tight loop somewhere as a
> bottleneck.
>
>      InStream_read_vint_t read_vint = instream->_->read_vint;
>      for (i = 0; i < a_lot; i++) {
>          *foo++ = read_vint(instream);
>      }
>
> However, I suspect that if we treat the i/o classes as special cases
> and call their functions directly everywhere, we'll have addressed
> 90% of the bottlenecks.

True, the i/o methods, particularly the vint i/o methods, are the
major bottleneck. I spent a lot of time optimizing write_vint and
read_vint. They're ugly but they got me a massive 20% performance
gain. Also, adding a skip_vints method instead of just using read_vint
in a loop like Lucene gave me another 10%.

> > Anyway, I changed the Streams to test
> > the difference and I mean to refactor the rest of the classes like
> > this when I have time.
>
> Below my sig you'll find the output of a simple benchmarking app
> "speed.c" written by a guy named Michael Lee, run on my G4 Laptop and
> then also on a Pentium 4.  It shows the relative costs of some simple
> operations.  You can download it from
> <http://www.vmlinux.org/jocke/programming/www.ontek.com/mikey/speed.c>.
> His original 1997 article on C optimization (which has a broken link
> to speed.c) is at <http://www.prism.uvsq.fr/~cedb/local_copies/lee.html>.
>
> Since that article is nearly a decade old, it doesn't spend a lot of
> time on CPU caching, which supposedly is becoming more important
> these days.  I don't have any practical experience doing those kinds
> of optimizations, though.  Maybe we could stuff all the vtables in
> one giant struct and hope that with all our function pointers jammed
> up against one another we'd get less cache thrash.  But I think
> that's chasing our tail.

Unless we can make this really transparent I don't think it is worth
it. I'm not exactly sure, but maybe just declaring all of the structs
next to each other would give the same benefit. I don't know enough
about C compilers to say for sure.

> The bottom line for me is that I'd prefer that the method calls be
> macrofied because it's going to clean up the source quite a bit.  The
> best way to optimize is to find a better algorithm, and simpler,
> cleaner, smaller source code makes refactoring easier.

This is a must for me too. I couldn't agree more.

-- 
Dave Balmain
http://www.davebalmain.com/

Re: OO Design -- performance

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Feb 27, 2007, at 10:43 PM, David Balmain wrote:

>> Truncated mean time to index 1000 Reuters news stories:
>>
>>     Before: 1.83 seconds
>>     After:  1.82 seconds
>>
>> Glad it didn't take me very long to implement.  :)
>
> Thanks for posting this. It'll probably save me a lot of time (or
> worry) in the future. ;-)

It's cake to flip a switch in boilerplater.pl and turn all the final  
methods back into virtual methods, if you ever get the itch.

To put those numbers in perspective, I recently knocked 30% off of  
KinoSearch's indexing benchmark by refactoring Token sorting to  
minimize malloc calls and hashset building.

I think boilerplater.pl is pretty much feature-complete at this  
point.  There's nothing more on my to-do list for it, and it seems to  
be working fine.  When I can afford to take a break from KS work,  
I'll do a big s/kino_/lucy_/g on it and we can start playing with it  
here.

I had to change the syntax a bit, though.  Automatically generating  
the method names wasn't working well, because it couldn't detect  
where I wanted CamelCase in the method name.  For instance, I wanted  
Folder_Open_InStream, but it gave me Folder_Open_Instream (lower case  
s in Instream).  So now, each method invoker is spec'd manually.  The  
header files got a little uglier, but the C files are perfect.

http://www.rectangular.com/svn/kinosearch/trunk/c_src/KinoSearch/Search/TopDocCollector.h
http://www.rectangular.com/svn/kinosearch/trunk/c_src/KinoSearch/Search/TopDocCollector.c

Oh yeah... I also implemented dynamic subclassing.  :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: OO Design -- performance

Posted by David Balmain <db...@gmail.com>.
On 2/28/07, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> On Dec 21, 2006, at 3:19 PM, David Balmain wrote:
>
> > I'll be very interested to see whether there is a noticeable
> > performance difference with the "final" classes.
>
> I implemented final methods and final classes a little while back.
>
> All of the following classes became final, which means that all of
> their methods became aliases for functions rather than double-
> dereference invocations via vtable:
>
>     DelDocs
>     PostingsWriter
>     SegInfo
>     TermInfo
>     TermInfosWriter
>     TermVectorsReader
>     TermVectorsWriter
>     InStream
>     OutStream
>     SortExternal
>
> Truncated mean time to index 1000 Reuters news stories:
>
>     Before: 1.83 seconds
>     After:  1.82 seconds
>
> Glad it didn't take me very long to implement.  :)

Thanks for posting this. It'll probably save me a lot of time (or
worry) in the future. ;-)

-- 
Dave Balmain
http://www.davebalmain.com/

Re: OO Design -- performance

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 21, 2006, at 3:19 PM, David Balmain wrote:

> I'll be very interested to see whether there is a noticeable
> performance difference with the "final" classes.

I implemented final methods and final classes a little while back.

All of the following classes became final, which means that all of  
their methods became aliases for functions rather than double- 
dereference invocations via vtable:

    DelDocs
    PostingsWriter
    SegInfo
    TermInfo
    TermInfosWriter
    TermVectorsReader
    TermVectorsWriter
    InStream
    OutStream
    SortExternal

Truncated mean time to index 1000 Reuters news stories:

    Before: 1.83 seconds
    After:  1.82 seconds

Glad it didn't take me very long to implement.  :)

It's worth noting, though, that a lot of the VInt-writing was already  
handled via function calls.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

slothbear:~/projects/ks/perl/t/benchmarks marvin$ perl -Mblib  
indexers/kinosearch_indexer.plx --docs=1000 --reps=30
------------------------------------------------------------
1    Secs: 1.82  Docs: 1000
2    Secs: 1.83  Docs: 1000
3    Secs: 1.83  Docs: 1000
4    Secs: 1.84  Docs: 1000
5    Secs: 1.86  Docs: 1000
6    Secs: 1.81  Docs: 1000
7    Secs: 1.83  Docs: 1000
8    Secs: 1.83  Docs: 1000
9    Secs: 1.85  Docs: 1000
10   Secs: 1.82  Docs: 1000
11   Secs: 1.81  Docs: 1000
12   Secs: 1.80  Docs: 1000
13   Secs: 1.81  Docs: 1000
14   Secs: 1.82  Docs: 1000
15   Secs: 1.82  Docs: 1000
16   Secs: 1.82  Docs: 1000
17   Secs: 1.82  Docs: 1000
18   Secs: 2.23  Docs: 1000
19   Secs: 1.85  Docs: 1000
20   Secs: 1.86  Docs: 1000
21   Secs: 1.82  Docs: 1000
22   Secs: 1.84  Docs: 1000
23   Secs: 1.82  Docs: 1000
24   Secs: 1.85  Docs: 1000
25   Secs: 1.81  Docs: 1000
26   Secs: 1.84  Docs: 1000
27   Secs: 2.36  Docs: 1000
28   Secs: 1.87  Docs: 1000
29   Secs: 1.83  Docs: 1000
30   Secs: 1.81  Docs: 1000
------------------------------------------------------------
KinoSearch 0.20_01
Perl 5.8.6
Thread support: yes
Darwin 8.8.0 Power Macintosh
Mean: 1.86 secs
Truncated mean (16 kept, 14 discarded): 1.83 secs
------------------------------------------------------------
slothbear:~/projects/ks/perl/t/benchmarks marvin$


slothbear:~/projects/ks/perl/t/benchmarks marvin$ perl -Mblib  
indexers/kinosearch_indexer.plx --docs=1000 --reps=30
------------------------------------------------------------
1    Secs: 1.89  Docs: 1000
2    Secs: 1.83  Docs: 1000
3    Secs: 1.83  Docs: 1000
4    Secs: 1.83  Docs: 1000
5    Secs: 1.81  Docs: 1000
6    Secs: 2.26  Docs: 1000
7    Secs: 2.00  Docs: 1000
8    Secs: 1.80  Docs: 1000
9    Secs: 1.83  Docs: 1000
10   Secs: 1.82  Docs: 1000
11   Secs: 1.80  Docs: 1000
12   Secs: 1.81  Docs: 1000
13   Secs: 1.92  Docs: 1000
14   Secs: 1.81  Docs: 1000
15   Secs: 2.32  Docs: 1000
16   Secs: 1.83  Docs: 1000
17   Secs: 1.81  Docs: 1000
18   Secs: 1.81  Docs: 1000
19   Secs: 1.79  Docs: 1000
20   Secs: 1.81  Docs: 1000
21   Secs: 1.79  Docs: 1000
22   Secs: 1.88  Docs: 1000
23   Secs: 1.82  Docs: 1000
24   Secs: 2.31  Docs: 1000
25   Secs: 1.83  Docs: 1000
26   Secs: 1.84  Docs: 1000
27   Secs: 1.83  Docs: 1000
28   Secs: 1.83  Docs: 1000
29   Secs: 1.83  Docs: 1000
30   Secs: 1.81  Docs: 1000
------------------------------------------------------------
KinoSearch 0.20_01
Perl 5.8.6
Thread support: yes
Darwin 8.8.0 Power Macintosh
Mean: 1.88 secs
Truncated mean (16 kept, 14 discarded): 1.82 secs
------------------------------------------------------------
slothbear:~/projects/ks/perl/t/benchmarks marvin$



Re: OO Design -- performance

Posted by David Balmain <db...@gmail.com>.
On 12/21/06, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> On Dec 18, 2006, at 9:56 PM, David Balmain wrote:
>
> > I think the difference will be very small indeed. But as you say,
> > since the function calls are macrofied it really is easy to swap the
> > function call out for a virtual dispatch and back. The question is
> > which way to go first. I think we should keep it consistent across the
> > board and leave this optimization for later. But that is my personal
> > preference and it certainly isn't a dealbreaker for me.
>
> The way I've got things set up right now, all methods are virtual, so
> two lookups are required to locate the function address.  However,
> with another 50 lines of code it's possible to implement "final"
> methods, which would be aliases for functions.

This code generator of yours is looking more and more useful. I'll be
very interested to see whether there is a noticeable performance
difference with the "final" classes.

-- 
Dave Balmain
http://www.davebalmain.com/

Re: OO Design -- performance

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 18, 2006, at 9:56 PM, David Balmain wrote:

> I think the difference will be very small indeed. But as you say,
> since the function calls are macrofied it really is easy to swap the
> function call out for a virtual dispatch and back. The question is
> which way to go first. I think we should keep it consistent across the
> board and leave this optimization for later. But that is my personal
> preference and it certainly isn't a dealbreaker for me.

The way I've got things set up right now, all methods are virtual, so  
two lookups are required to locate the function address.  However,  
with another 50 lines of code it's possible to implement "final"  
methods, which would be aliases for functions.

> True, the i/o methods, particularly the vint i/o methods, are the
> major bottleneck. I spent a lot of time optimizing write_vint and
> read_vint. They're ugly but they got me a massive 20% performance
> gain. Also, adding a skip_vints method instead of just using read_vint
> in a loop like Lucene gave me another 10%.

Mmmm, very interesting...  I just spelunked your implementation.   
Good stuff, I'm sure we'll want to use that.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/