You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by "Hyrum K. Wright" <hy...@mail.utexas.edu> on 2010/05/05 14:55:10 UTC

Re: [PATCH] Saving a few cycles, part 2/2

On Mon, Apr 26, 2010 at 6:10 PM, Peter Samuelson <pe...@p12n.org> wrote:

>
> [Philip Martin]
> > We have svn_ctype_isalpha which is a macro that does a table lookup.
> > We don't currently have svn_ctype_isxdigit but it could be added.
> >
> > > I'd rather see the whole thing become a single table lookup.  Untested
> > > patch appended.
> >
> > Is this a bottleneck?  I think using the svn_ctype.h macros should be
> > sufficient.
>
> I don't know if it's a bottleneck.  But given that what we actually
> want to know if 'what is the value of this hex digit' - using isxdigit,
> then isalpha as a conditional, then ASCII arithmetic is awfully
> roundabout compared to just a single table lookup.
>
> I mean, the whole point of those ctype functions is to avoid ASCII
> arithmetic.  So, while I don't claim my patch will be noticeably faster
> (it will be faster but probably not noticeably so), I do claim it's
> more elegant.


I would agree.  I saw the commit of Stefan's patch, and independently
thought about using a lookup table (though I prefer Peter's implementation).
 I don't have any hard performance numbers, but I gotta believe the lookup
table is simpler, as well as more performant.

-Hyrum

Re: [PATCH] Saving a few cycles, part 2/2

Posted by "Hyrum K. Wright" <hy...@mail.utexas.edu>.
On Wed, May 5, 2010 at 6:28 PM, Bolstridge, Andrew <
andy.bolstridge@intergraph.com> wrote:

> > -----Original Message-----
> > From: hyrum@hyrumwright.org [mailto:hyrum@hyrumwright.org] On Behalf
> Of
> > Hyrum K. Wright
> > Sent: Wednesday, May 05, 2010 3:55 PM
> > To: Peter Samuelson
> > Cc: dev@subversion.apache.org
> > Subject: Re: [PATCH] Saving a few cycles, part 2/2
> >
> ...
> > I would agree.  I saw the commit of Stefan's patch, and independently
> > thought about using a lookup table (though I prefer Peter's
> > implementation).
> >  I don't have any hard performance numbers, but I gotta believe the
> > lookup
> > table is simpler, as well as more performant.
> >
>
> It may be simpler, but generally not as fast as arithmetic. The problem
> (with modern processors) is that the table (or the entry you want) has
> to be loaded into the CPU cache and that may require a main RAM access
> that is very slow (relatively speaking) compared to a few maths
> operations.
>
> For example, you can do many instructions per clock cycle, but fetching
> 1 byte from main RAM (at 10ns delay) will take 30 clock cycles on a 3Ghz
> chip (and that's a lot of maths instructions.... over a thousand).
>

Completely off-topic: I'm interested in how over a thousand math
instructions can be done in only 30 clock cycles.  Even on highly parallel
data (which this example is not), and using a very long pipeline (which
modern processors top out at *maybe* 25 stages), your still talking about
over 33 ops/cycle.  My microarchitecture is a bit rusty, so I'm interested
to know how this works out.

I think it takes a lot longer than that, with slower latency RAM we use
> today and 3 levels of cache. But the point is that a lookup table is no
> longer necessarily faster than calculating the data. How times have
> changed :)
>

These are valid points, but I'm curious about them.  For instance, the
current code has function calls in it.  While the overhead might be lower,
because the compiler could pass arguments in registers instead of on the
stack, the functions most likely wouldn't already be resident in the i-cache
(at least not initially), and so would need to be loaded, just the same as
the lookup table would.  The lookup table, at only 256 bytes, fits in most
d-caches, and would stay resident for the entire loop.

The lookup table method also eliminates branches (hidden by the trinary
operators), which could lead to pipeline stalls.  Since the probability of
taking the branches is uniform (or ought to be with a cryptographic hash),
the branch predictor doesn't really have much chance of doing any good.


> Anyway, it's hardly going to be a big deal to overall performance, so
> whatever is easier to maintain is best.


+1 :)

-Hyrum

RE: [PATCH] Saving a few cycles, part 2/2

Posted by "Bolstridge, Andrew" <an...@intergraph.com>.
> -----Original Message-----
> From: hyrum@hyrumwright.org [mailto:hyrum@hyrumwright.org] On Behalf
Of
> Hyrum K. Wright
> Sent: Wednesday, May 05, 2010 3:55 PM
> To: Peter Samuelson
> Cc: dev@subversion.apache.org
> Subject: Re: [PATCH] Saving a few cycles, part 2/2
> 
...
> I would agree.  I saw the commit of Stefan's patch, and independently
> thought about using a lookup table (though I prefer Peter's
> implementation).
>  I don't have any hard performance numbers, but I gotta believe the
> lookup
> table is simpler, as well as more performant.
> 

It may be simpler, but generally not as fast as arithmetic. The problem
(with modern processors) is that the table (or the entry you want) has
to be loaded into the CPU cache and that may require a main RAM access
that is very slow (relatively speaking) compared to a few maths
operations. 

For example, you can do many instructions per clock cycle, but fetching
1 byte from main RAM (at 10ns delay) will take 30 clock cycles on a 3Ghz
chip (and that's a lot of maths instructions.... over a thousand).

I think it takes a lot longer than that, with slower latency RAM we use
today and 3 levels of cache. But the point is that a lookup table is no
longer necessarily faster than calculating the data. How times have
changed :)

Anyway, it's hardly going to be a big deal to overall performance, so
whatever is easier to maintain is best.