Posted to dev@stdcxx.apache.org by Andrew Black <ab...@roguewave.com> on 2006/02/10 20:12:35 UTC

Benchmarking stdcxx

  Greetings all.

I thought it might be interesting to do some benchmarking, comparing the 
performance of stdcxx with other standard libraries.  As there are a 
number of attributes that can be compared when doing a benchmark, and an 
even larger number of classes that can be looked at, there is a fair 
amount of choice in what to measure.  As a starting point, I chose to 
measure the runtime performance of stringstream objects.

Measurements were taken on my Linux box (a 1.9 GHz P4), under a light 
load (a number of running applications, most of them idle), using an 8d 
(single-threaded, release, shared) build of stdcxx.  Each test was run 
5 times in a row, with a count of 500000 iterations.  The following 
table lists the run times collected.  All times are in seconds.

+-------------------+---------------+----------------+
|     test name     |   gcc 3.2.3   |  stdcxx 4.1.3  |
+-------------------+-------+-------+--------+-------+
|                   |   usr | sys   |    usr | sys   |
+-------------------+-------+-------+--------+-------+
|    read_single    | 8.977 | 0.008 | 13.997 | 0.012 |
|                   | 7.856 | 0.008 | 13.913 | 0.016 |
|                   | 8.021 | 0.012 | 13.817 | 0.024 |
|                   | 7.736 | 0.020 | 28.634 | 0.016 |
|                   | 7.844 | 0.012 | 13.841 | 0.016 |
+-------------------+-------+-------+--------+-------+
|    read_multi     | 0.608 | 0.744 |  0.864 | 0.756 |
|                   | 0.688 | 0.704 |  0.860 | 0.736 |
|                   | 0.660 | 0.728 |  0.856 | 0.712 |
|                   | 0.608 | 0.792 |  0.848 | 0.724 |
|                   | 0.552 | 0.796 |  0.796 | 0.780 |
+-------------------+-------+-------+--------+-------+
|   write_single    | 1.976 | 0.000 | 30.450 | 0.048 |
|                   | 2.356 | 0.012 | 30.526 | 0.064 |
|                   | 1.984 | 0.000 | 30.354 | 0.032 |
|                   | 1.964 | 0.012 | 30.350 | 0.028 |
|                   | 1.936 | 0.000 | 30.286 | 0.036 |
+-------------------+-------+-------+--------+-------+
|   write_multi     | 1.172 | 2.352 | 32.326 | 2.320 |
|                   | 1.092 | 2.444 | 31.102 | 2.216 |
|                   | 1.164 | 2.360 | 30.482 | 2.248 |
|                   | 1.148 | 2.380 | 31.930 | 2.180 |
|                   | 1.000 | 2.532 | 29.534 | 2.272 |
+-------------------+-------+-------+--------+-------+
| read_write_single | 7.684 | 0.000 | 13.649 | 0.016 |
|                   | 7.684 | 0.012 | 13.685 | 0.016 |
|                   | 7.664 | 0.012 | 14.193 | 0.016 |
|                   | 8.353 | 0.012 | 13.745 | 0.016 |
|                   | 7.700 | 0.012 | 13.677 | 0.004 |
+-------------------+-------+-------+--------+-------+
| read_write_cycle  | 0.056 | 0.000 |  0.412 | 0.000 |
|                   | 0.056 | 0.000 |  0.424 | 0.004 |
|                   | 0.056 | 0.000 |  0.428 | 0.004 |
|                   | 0.056 | 0.000 |  0.420 | 0.004 |
|                   | 0.056 | 0.000 |  0.412 | 0.004 |
+-------------------+-------+-------+--------+-------+
| read_write_multi  | 0.664 | 0.732 |  1.028 | 0.716 |
|                   | 0.676 | 0.712 |  0.988 | 0.744 |
|                   | 0.632 | 0.752 |  1.036 | 0.716 |
|                   | 0.688 | 0.704 |  1.080 | 0.732 |
|                   | 0.632 | 0.732 |  0.940 | 0.804 |
+-------------------+-------+-------+--------+-------+
| write_read_single | 7.868 | 0.016 | 43.407 | 0.044 |
|                   | 7.896 | 0.012 | 43.895 | 0.044 |
|                   | 7.888 | 0.008 | 43.307 | 0.076 |
|                   | 7.912 | 0.012 | 43.391 | 0.032 |
|                   | 8.337 | 0.016 | 43.375 | 0.044 |
+-------------------+-------+-------+--------+-------+
| write_read_cycle  | 0.056 | 0.000 |  0.412 | 0.004 |
|                   | 0.056 | 0.000 |  0.404 | 0.016 |
|                   | 0.056 | 0.000 |  0.412 | 0.000 |
|                   | 0.056 | 0.000 |  0.420 | 0.000 |
|                   | 0.052 | 0.004 |  0.416 | 0.004 |
+-------------------+-------+-------+--------+-------+
| write_read_multi  | 7.340 | 2.404 | 43.591 | 2.408 |
|                   | 7.420 | 2.400 | 42.347 | 2.196 |
|                   | 7.440 | 2.376 | 45.227 | 2.336 |
|                   | 7.232 | 2.476 | 43.679 | 2.316 |
|                   | 7.348 | 2.488 | 44.271 | 2.348 |
+-------------------+-------+-------+--------+-------+

Analysis:
Using the numbers above, I did some basic analysis.  The system times 
for a given test appear to be roughly the same for both libraries, so I 
am setting those numbers aside for now.
To summarize the data, I use three statistical operations.
The first operation is the arithmetic average ('average') of the 
numbers, the classic sum-and-divide mean.  The second operation is the 
median, the middle value of the set.  The final operation is what I 
term the 'middle average': I calculate this by throwing out the highest 
and lowest values, then taking the arithmetic average of the remaining 
numbers.
In the tables below, the ratio indicates how much longer the stdcxx 
runs take compared to the gcc runs; 0% would mean they take the same 
amount of time.
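
For concreteness, here is an illustrative sketch (not part of the 
benchmark sources) of how the three statistics and the ratio are 
computed; middle_average assumes at least three samples per test:

   #include <algorithm>
   #include <cstddef>
   #include <numeric>
   #include <vector>

   // classic sum-and-divide mean
   static double average (const std::vector<double> &v) {
       return std::accumulate (v.begin (), v.end (), 0.0) / v.size ();
   }

   // middle value (for the 5 samples per test, the third smallest)
   static double median (std::vector<double> v) {
       std::sort (v.begin (), v.end ());
       const std::size_t n = v.size ();
       return n % 2 ? v [n / 2] : (v [n / 2 - 1] + v [n / 2]) / 2.0;
   }

   // drop the min and the max, then average the rest
   static double middle_average (std::vector<double> v) {
       std::sort (v.begin (), v.end ());
       return std::accumulate (v.begin () + 1, v.end () - 1, 0.0)
              / double (v.size () - 2);
   }

   // percent extra time stdcxx takes over gcc (0% means equal)
   static double ratio (double gcc, double stdcxx) {
       return (stdcxx / gcc - 1.0) * 100.0;
   }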

+-------------------+-------+--------+----------+
|    read_single    |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 8.087 | 16.840 |  108.25% |
+-------------------+-------+--------+----------+
|   middle average  | 7.907 | 13.917 |   76.01% |
+-------------------+-------+--------+----------+
|       median      | 7.856 | 13.913 |   77.10% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
|    read_multi     |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 0.623 |  0.845 |   35.56% |
+-------------------+-------+--------+----------+
|   middle average  | 0.625 |  0.855 |   36.67% |
+-------------------+-------+--------+----------+
|       median      | 0.608 |  0.856 |   40.79% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
|   write_single    |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 2.043 | 30.393 | 1387.53% |
+-------------------+-------+--------+----------+
|   middle average  | 1.975 | 30.385 | 1438.72% |
+-------------------+-------+--------+----------+
|       median      | 1.976 | 30.354 | 1436.13% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
|   write_multi     |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 1.115 | 31.075 | 2686.48% |
+-------------------+-------+--------+----------+
|   middle average  | 1.135 | 31.171 | 2647.18% |
+-------------------+-------+--------+----------+
|       median      | 1.148 | 31.102 | 2609.23% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
| read_write_single |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 7.817 | 13.790 |   76.41% |
+-------------------+-------+--------+----------+
|   middle average  | 7.689 | 13.720 |   78.20% |
+-------------------+-------+--------+----------+
|       median      | 7.684 | 13.685 |   78.10% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
| read_write_cycle  |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 0.056 |  0.419 |  648.57% |
+-------------------+-------+--------+----------+
|   middle average  | 0.056 |  0.419 |  647.62% |
+-------------------+-------+--------+----------+
|       median      | 0.056 |  0.420 |  650.00% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
| read_write_multi  |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 0.658 |  1.014 |   54.07% |
+-------------------+-------+--------+----------+
|   middle average  | 0.657 |  1.017 |   54.77% |
+-------------------+-------+--------+----------+
|       median      | 0.664 |  1.028 |   54.82% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
| write_read_single |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 7.980 | 43.475 |  444.79% |
+-------------------+-------+--------+----------+
|   middle average  | 7.899 | 43.391 |  449.35% |
+-------------------+-------+--------+----------+
|       median      | 7.896 | 43.391 |  449.53% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
| write_read_cycle  |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 0.055 |  0.413 |  647.83% |
+-------------------+-------+--------+----------+
|   middle average  | 0.056 |  0.413 |  638.10% |
+-------------------+-------+--------+----------+
|       median      | 0.056 |  0.412 |  635.71% |
+-------------------+-------+--------+----------+

+-------------------+-------+--------+----------+
| write_read_multi  |  gcc  | stdcxx |   ratio  |
+-------------------+-------+--------+----------+
|      average      | 7.356 | 43.823 |  495.74% |
+-------------------+-------+--------+----------+
|   middle average  | 7.369 | 43.847 |  494.99% |
+-------------------+-------+--------+----------+
|       median      | 7.348 | 43.679 |  494.43% |
+-------------------+-------+--------+----------+

Conclusions:
Looking over the processed numbers from the runs, one thing that jumps 
out at me is the write times, particularly the write_single and 
write_multi benchmarks.  Both of these benchmarks are an order of 
magnitude slower than their GCC counterparts (at least on this 
computer).  The write_multi benchmark in particular shows what happens 
when you stream large amounts of data (~250 MB in this case) into a 
stringstream without streaming any out.
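
For reference, the shape of that workload is roughly the following (an 
illustrative reconstruction, not the benchmark source; the chunk size 
is an assumption):

   // insert-only traffic into a stringstream, never extracting, so
   // the buffer must repeatedly reallocate and copy as it grows;
   // 512 bytes x 500000 iterations is roughly 250 MB
   #include <sstream>
   #include <string>

   int main () {
       std::stringstream sink;
       const std::string chunk (512, 'x');
       for (int i = 0; i != 500000; ++i)
           sink << chunk;    // write-only: nothing is streamed out
   }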

Future:
For those interested in trying to repeat these tests, I have attached 
the source files and makefile I used to generate these benchmarks.  This 
particular benchmark is a work in progress.  There are several 
additional things that could be benchmarked regarding stringstreams.  
These include allocation (default, string, copy), pseudo-random 
read/writes (rather than pattern read/writes), reads and writes of 
varying length strings, and reading/writing using something other than 
the insertion and extraction operators.

--Andrew Black

Re: Benchmarking stdcxx

Posted by Martin Sebor <se...@roguewave.com>.
I reran the benchmark after the latest changes (r380995). Since I ran
the tests on Solaris, I decided to extend the set of implementations
compared in this benchmark to also include the Sun libraries: both
the native Rogue Wave C++ Standard Library, version 2.1.1, and
STLport 4.5.2, which is bundled with the compiler and optionally
available under the -library=stlport4 switch.

All tests were compiled at -O2 and without thread safety (i.e., no -mt
or similar flag).

The results are below. In each row the value 1. indicates the best
result (the fastest runtime, measured by the user time of the process)
and represents the baseline against which all other timings were
calculated.

   +---------------+-------------------------------------------------+
   |               |          Sun C++ 5.7        |     GCC 4.0.2     |
   |   function    +---------+---------+---------+---------+---------+
   |               |  stdcxx |  native | stlport |  stdcxx |  native |
   +===============+=========+=========+=========+=========+=========+
   | default ctor  |   1.11  |   1.34  |   3.55  |   1.    |   1.22  |
   | char* ctor    |   1.    |   1.83  |   2.36  |   1.05  |   1.67  |
   | string ctor   |   1.03  |   1.20  |   1.50  |   1.    |   1.08  |
   | sputn         |   1.    |   1.36  |   2.56  |   1.10  |   1.78  |
   | insert char   |   1.11  |   1.18  |   2.09  |   1.    |   1.23  |
   | insert char*  |   1.    |   1.30  |   2.03  |   1.09  |   1.51  |
   | insert string |   1.74  |   1.57  |   2.19  |   1.64  |   1.    |
   +---------------+---------+---------+---------+---------+---------+

Martin Sebor wrote:
> Martin Sebor wrote:
> [...]
> 
>> I also ran some simple benchmarks:
>>
>>                    latest    gcc
>>                    stdcxx   4.0.2
>>   +---------------+-------+-------+
>>   | default ctor  |  1.00 |  1.22 |
>>   | char* ctor    |  1.00 |  1.59 |
>>   | string ctor   |  1.00 |  1.05 |
>>   | insert char   |  1.00 |   .96 |
>>   | insert char*  |  1.00 |   .56 |
>>   | insert string |  1.00 |   .53 |
>>   | sputn         |  1.00 |   .47 |
>>   +---------------+-------+-------+
>>
>> Clearly there is still some room for improvement. I tweaked the
>> allocation policy used by stringbuf to double the size of the buffer
>> (rather than growing by a factor of 1.6 or so) but that didn't make
>> any difference (which should have been expected).
>>
>> The last number is particularly puzzling because, AFAICT, xsputn()
>> (called by sputn) is optimal. I don't see a significant opportunity
>> for optimization there.
> 
> 
> Okay, I now see that it's not quite optimal and understand why. Our
> implementation uses the generic streambuf::xsputn() which copies the
> string into the buffer one chunk at a time, calling overflow() to
> process the contents of the buffer each time it runs out of space.
> This is optimal for filebuf (which flushes the buffer and starts
> writing from the beginning) but less so for stringbuf which must
> reallocate the buffer and copy its contents to the new one every
> time. This can be optimized by allocating the necessary amount of
> space ahead of time and simply copying the string into it in one
> shot. With this optimization in place the new numbers are:
> 
>   +---------------+-------+-------+
>   | default ctor  |  1.00 |  1.22 |
>   | char* ctor    |  1.00 |  1.58 |
>   | string ctor   |  1.00 |  1.06 |
>   | insert char   |  1.00 |  1.23 |
>   | insert char*  |  1.00 |  1.39 |
>   | insert string |  1.00 |   .60 |
>   | sputn         |  1.00 |  1.62 |
>   +---------------+-------+-------+
> 
> The string inserter still needs to be optimized but everything else
> is looking much better.
> 
> Martin


Re: Benchmarking stdcxx

Posted by Martin Sebor <se...@roguewave.com>.
Martin Sebor wrote:
[...]
> I also ran some simple benchmarks:
> 
>                    latest    gcc
>                    stdcxx   4.0.2
>   +---------------+-------+-------+
>   | default ctor  |  1.00 |  1.22 |
>   | char* ctor    |  1.00 |  1.59 |
>   | string ctor   |  1.00 |  1.05 |
>   | insert char   |  1.00 |   .96 |
>   | insert char*  |  1.00 |   .56 |
>   | insert string |  1.00 |   .53 |
>   | sputn         |  1.00 |   .47 |
>   +---------------+-------+-------+
> 
> Clearly there is still some room for improvement. I tweaked the
> allocation policy used by stringbuf to double the size of the buffer
> (rather than growing by a factor of 1.6 or so) but that didn't make
> any difference (which should have been expected).
> 
> The last number is particularly puzzling because, AFAICT, xsputn()
> (called by sputn) is optimal. I don't see a significant opportunity
> for optimization there.

Okay, I now see that it's not quite optimal and understand why. Our
implementation uses the generic streambuf::xsputn() which copies the
string into the buffer one chunk at a time, calling overflow() to
process the contents of the buffer each time it runs out of space.
This is optimal for filebuf (which flushes the buffer and starts
writing from the beginning) but less so for stringbuf which must
reallocate the buffer and copy its contents to the new one every
time. This can be optimized by allocating the necessary amount of
space ahead of time and simply copying the string into it in one
shot. With this optimization in place the new numbers are:

   +---------------+-------+-------+
   | default ctor  |  1.00 |  1.22 |
   | char* ctor    |  1.00 |  1.58 |
   | string ctor   |  1.00 |  1.06 |
   | insert char   |  1.00 |  1.23 |
   | insert char*  |  1.00 |  1.39 |
   | insert string |  1.00 |   .60 |
   | sputn         |  1.00 |  1.62 |
   +---------------+-------+-------+

The string inserter still needs to be optimized but everything else
is looking much better.
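
For illustration, here is a minimal sketch of the two strategies 
described above (hypothetical code, not the stdcxx sources; the 
growing_buf class and its growth constants are made up for the 
example):

   // generic_put() mimics the generic streambuf::xsputn(): copy a
   // chunk, grow (reallocate and copy the old contents) whenever the
   // buffer fills, and repeat. one_shot_put() is the optimization:
   // reserve the needed space once, then copy the string in one shot.
   #include <cstddef>
   #include <string>

   struct growing_buf {
       std::string data;    // backing store

       void grow () {       // one 'overflow': reallocate and copy
           data.reserve (data.capacity () ? data.capacity () * 8 / 5
                                          : 128);
       }

       void generic_put (const char *s, std::size_t n) {
           while (n) {
               const std::size_t room = data.capacity () - data.size ();
               if (0 == room) {
                   grow ();
                   continue;
               }
               const std::size_t chunk = room < n ? room : n;
               data.append (s, chunk);
               s += chunk;
               n -= chunk;
           }
       }

       void one_shot_put (const char *s, std::size_t n) {
           data.reserve (data.size () + n);   // single allocation
           data.append (s, n);                // single copy
       }
   };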

Martin

Re: Benchmarking stdcxx

Posted by Martin Sebor <se...@roguewave.com>.
Martin Sebor wrote:
> Andrew Black wrote:
> 
>> Greetings all.
>>
>> Part of the intent behind Subversion commits r379032 through r379035 
>> (as I understand them) was to address some of the slowness in the 
>> stringstream operations detected by the previous benchmarking run.  
>> With a fresh benchmark run, the following are the results I get.  
>> Results are user times in seconds, found by taking the average of 
>> 3 runs of 500000 iterations.

I also ran some simple benchmarks:

                    latest    gcc
                    stdcxx   4.0.2
   +---------------+-------+-------+
   | default ctor  |  1.00 |  1.22 |
   | char* ctor    |  1.00 |  1.59 |
   | string ctor   |  1.00 |  1.05 |
   | insert char   |  1.00 |   .96 |
   | insert char*  |  1.00 |   .56 |
   | insert string |  1.00 |   .53 |
   | sputn         |  1.00 |   .47 |
   +---------------+-------+-------+

Clearly there is still some room for improvement. I tweaked the
allocation policy used by stringbuf to double the size of the buffer
(rather than growing by a factor of 1.6 or so) but that didn't make
any difference (which should have been expected).
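
A quick back-of-the-envelope check shows why: any geometric growth 
factor yields the same asymptotic amount of copying, so going from 1.6 
to 2 only changes the constants. A hypothetical helper (not from the 
benchmark) that counts the work:

   #include <cstddef>
   #include <cstdio>

   // reallocations and total characters copied when appending n
   // characters one at a time under a geometric growth policy
   static void growth_cost (double factor, std::size_t initial,
                            std::size_t n) {
       std::size_t cap = initial, reallocs = 0, copied = 0;
       for (std::size_t size = 0; size != n; ++size) {
           if (size == cap) {   // buffer full: reallocate and copy
               copied += size;
               cap = std::size_t (cap * factor);
               ++reallocs;
           }
       }
       std::printf ("factor %.1f: %lu reallocs, %lu chars copied\n",
                    factor, (unsigned long) reallocs,
                    (unsigned long) copied);
   }

   int main () {
       growth_cost (1.6, 128, 1000000);   // ~20 reallocs, ~2.6M copied
       growth_cost (2.0, 128, 1000000);   // ~13 reallocs, ~1.0M copied
       return 0;
   }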

The last number is particularly puzzling because, AFAICT, xsputn()
(called by sputn) is optimal. I don't see a significant opportunity
for optimization there.

Martin

Re: Benchmarking stdcxx

Posted by Martin Sebor <se...@roguewave.com>.
Andrew Black wrote:
> Greetings all.
> 
> Part of the intent behind Subversion commits r379032 through r379035 
> (as I understand them) was to address some of the slowness in the 
> stringstream operations detected by the previous benchmarking run.  
> With a fresh benchmark run, the following are the results I get.  
> Results are user times in seconds, found by taking the average of 
> 3 runs of 500000 iterations.
> 
> +-------------------+-------+------------+---------+
> | stringstream_bm   | glibc | stdcxx_old | stdcxx  |
> +-------------------+-------+------------+---------+
> | read_write_multi  |  0.68 |       1.05 |    1.05 |
> | read_write_single |  7.99 |      14.43 |   14.41 |
> | write_multi       |  1.13 |      30.78 |    1.15 |
> | write_read_cycle  |  0.05 |       0.43 |    0.44 |
> | write_read_multi  |  7.58 |      46.53 | N/A     |
> | write_read_single |  8.08 |      45.76 |   15.37 |
> | write_single      |  2.00 |      31.02 |    2.36 |
> | read_multi        |  0.61 |       0.83 |    0.84 |
> | read_single       |  7.92 |      14.28 |   14.13 |
> | read_write_cycle  |  0.05 |       0.42 |    0.44 |
> +-------------------+-------+------------+---------+
> 
> The biggest improvements are in the write_single, write_multi, and 
> write_read_single tests (with the first two approaching the speed of 
> glibc).  The write_read_multi test also showed a fair amount of 
> improvement (to 1.16 seconds),

Great! That's a significant improvement! I suspect the difference
we continue to see in some of these numbers might be due to the
allocation policy employed by our implementation, which strives
for a balance between runtime speed and space efficiency. It would
be good to verify this hypothesis by changing our allocation policy
in the benchmark to match the native one.

> but the benchmark segfaults, so I must 
> discard the results of this test as unreliable.
> 
> I suppose it would make sense to note this failure with the STDCXX-149 
> JIRA incident.

I reproduced the core dump but I don't think there's anything wrong
with the library. I removed the stringstream code from the test
and the program still dumps core (under Sun dbx), so it looks like
there's something wrong with the test's pointer arithmetic. I would
suggest simplifying the benchmark code so that it does not rely on
tricky pointer manipulation and dynamic memory allocation (use
std::string instead).
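
For example, the read loops could keep their scratch buffer in a
std::string along these lines (a hypothetical shape, not the actual
benchmark code):

   #include <cstddef>
   #include <sstream>
   #include <string>

   // std::string owns, sizes, and frees the scratch memory; no raw
   // pointer arithmetic is needed
   static void read_chunks (std::stringstream &src, std::size_t len) {
       std::string buf (len, '\0');
       while (src.read (&buf [0], std::streamsize (buf.size ())))
           ;   // process buf here
   }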

Martin

Re: Benchmarking stdcxx

Posted by Andrew Black <ab...@roguewave.com>.
Greetings all.

Part of the intent behind Subversion commits r379032 through r379035 
(as I understand them) was to address some of the slowness in the 
stringstream operations detected by the previous benchmarking run.  
With a fresh benchmark run, the following are the results I get.  
Results are user times in seconds, found by taking the average of 
3 runs of 500000 iterations.

+-------------------+-------+------------+---------+
| stringstream_bm   | glibc | stdcxx_old | stdcxx  |
+-------------------+-------+------------+---------+
| read_write_multi  |  0.68 |       1.05 |    1.05 |
| read_write_single |  7.99 |      14.43 |   14.41 |
| write_multi       |  1.13 |      30.78 |    1.15 |
| write_read_cycle  |  0.05 |       0.43 |    0.44 |
| write_read_multi  |  7.58 |      46.53 | N/A     |
| write_read_single |  8.08 |      45.76 |   15.37 |
| write_single      |  2.00 |      31.02 |    2.36 |
| read_multi        |  0.61 |       0.83 |    0.84 |
| read_single       |  7.92 |      14.28 |   14.13 |
| read_write_cycle  |  0.05 |       0.42 |    0.44 |
+-------------------+-------+------------+---------+

The biggest improvements are in the write_single, write_multi, and 
write_read_single tests (with the first two approaching the speed of 
glibc).  The write_read_multi test also showed a fair amount of 
improvement (to 1.16 seconds), but the benchmark segfaults, so I must 
discard the results of this test as unreliable.

I suppose it would make sense to note this failure with the STDCXX-149 
JIRA incident.

--Andrew Black

Re: Benchmarking stdcxx

Posted by Martin Sebor <se...@roguewave.com>.
Andrew Black wrote:
>  Greetings all.
> 
> I thought it might be interesting to do some benchmarking, comparing the 
> performance of stdcxx with other standard libraries.  As there are a 
> number of attributes that can be compared when doing a benchmark, and an 
> even larger number of classes that can be looked at, there is a fair 
> amount of choice in what to measure.  As a starting point, I chose to 
> measure the runtime performance of stringstream objects.

Thanks! These are extremely valuable data. They clearly show that the
insertion and extraction of character arrays to and from stringstreams
is slower in stdcxx than in libstdc++. We need to figure out why,
especially in the most egregious cases.

To help us narrow down the area we should focus on, we should add a
few more test functions. The first one I would add is a new function
exercising just the class ctor:

   // N is the iteration count; these snippets assume <sstream> and
   // <cassert> have been included
   static void construct (int N) {
       for (int i = 0; i < N; ++i) {
           std::stringstream sink;
           assert (sink.goodbit == sink.rdstate ());
       }
   }

Assuming the results for just the ctor are comparable (in my
measurements with gcc 4.0.2, stdcxx was actually about 50% faster
than libstdc++ on this test), we can safely eliminate the ctor and
the dtor as the bottlenecks.

Next, I would add and benchmark another function to exercise the
sentry object that gets constructed in every inserter.

   static void ostream_sentry (int N) {
       for (int i = 0; i < N; ++i) {
           std::stringstream sink;
           assert (sink.goodbit == sink.rdstate ());

           const std::ostream::sentry guard (sink);
           assert (true == guard);
       }
   }

If the results of this test are similar as well (in my runs stdcxx
was about 30% faster than libstdc++), we can eliminate the sentry
as the cause of the problem.

As the next step, instead of benchmarking the entire insertion, I
would exercise just the streambuf::sputn() function (which ends up
getting called by our implementation of the inserter).

   // ldata is a character array of test data defined elsewhere in
   // the benchmark; sizeof ldata is its length in bytes
   static void streambuf_sputn (int N) {
       for (int i = 0; i < N; ++i) {
           std::stringstream sink;
           assert (sink.goodbit == sink.rdstate ());
           const int nput = i % sizeof ldata;
           const int n = sink.rdbuf ()->sputn (ldata, nput);
           assert (n == nput);
       }
   }

In my tests, this function indeed appeared to be the source of the
poor performance (10 times slower than the libstdc++ implementation
of the same).

From examining the code I knew that sputn() (which calls the virtual
function xsputn()) in turn calls the virtual streambuf member function
overflow(). I measured the performance of overflow() but it was the
same for both implementations.

   static void streambuf_overflow (int N) {
       struct pubbuf: std::stringbuf {
           using std::stringbuf::overflow;
       };
       for (int i = 0; i < N; ++i) {
           std::stringstream sink;
           assert (sink.goodbit == sink.rdstate ());
           // the cast exposes the protected overflow() member
           const int n = ((pubbuf*)sink.rdbuf ())->
               overflow ((unsigned char)i);
           assert (n == (unsigned char)i);
       }
   }

So the source of the performance problem seems to be in xsputn() or
in its interaction with overflow. To narrow it down even more, I
exercised xsputn() with the second argument of 0 (i.e., making it
insert a string of length 0). Again, stdcxx is quite a bit faster
in this case, this time by about 20%. Changing the second argument
to 1 brought the two results closer (libstdc++ was a tad faster but
not significantly so). Interestingly, though, increasing the value
of the second argument has a corresponding effect on the slowdown
in stdcxx.
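
A small variant of the streambuf_sputn() function above (hypothetical,
in the same style) makes this experiment easy to repeat with nput
pinned at 0, 1, or larger values:

   static void streambuf_sputn_fixed (int N, int nput) {
       for (int i = 0; i < N; ++i) {
           std::stringstream sink;
           assert (sink.goodbit == sink.rdstate ());
           // the length is now a parameter instead of i % sizeof ldata
           const int n = (int)sink.rdbuf ()->sputn (ldata, nput);
           assert (n == nput);
       }
   }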

By stepping through the code I noticed that xsputn() would call
overflow() for every character instead of only when the buffer was
full, as I expected. It seems that stringbuf::overflow() doesn't
make any put area available. Looking at the function more closely
revealed a bug in the put area pointer manipulation. Quickly fixing
the bug eliminated much of the performance problem. stdcxx is now
roughly comparable to libstdc++ (although on average still about
60% slower). I suspect that the remaining difference is due to the
allocation policy used by the stdcxx stringstream (a 128-character
initial buffer size with a growth factor of 1.6 or so).
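
For reference, here is a minimal sketch of what a correct overflow()
has to do (an assumed shape, not the actual stdcxx fix; the
simple_stringbuf class is made up for the example):

   #include <cstddef>
   #include <cstring>
   #include <streambuf>

   class simple_stringbuf: public std::streambuf {
       char*       buf_;
       std::size_t cap_;

   public:
       simple_stringbuf (): buf_ (0), cap_ (0) { }
       ~simple_stringbuf () { delete[] buf_; }

   protected:
       virtual int_type overflow (int_type c) {
           // grow the buffer, copying the old contents to the new one
           const std::size_t size = pptr () - pbase ();
           const std::size_t newcap =
               cap_ ? std::size_t (cap_ * 1.6) : 128;
           char* const newbuf = new char [newcap];
           if (buf_)
               std::memcpy (newbuf, buf_, size);
           delete[] buf_;
           buf_ = newbuf;
           cap_ = newcap;
           // crucial step: re-seat a non-empty put area so subsequent
           // writes go directly into the buffer; leaving the put area
           // empty forces a call to overflow() for every character
           setp (buf_, buf_ + cap_);
           pbump (int (size));
           if (traits_type::eq_int_type (c, traits_type::eof ()))
               return traits_type::not_eof (c);
           *pptr () = traits_type::to_char_type (c);
           pbump (1);
           return c;
       }
   };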

I created http://issues.apache.org/jira/browse/STDCXX-142 to track
this issue.

I'll have to test my quick fix but assuming it doesn't cause any
regressions I'll commit it on trunk. It would be good if you could
rerun your benchmarks (with the enhancements suggested above) and
post new results when the fix is available.

Btw., it would also be very nice to put together a harness (e.g.,
in the form of a portable shell script) that would run each test
some number of times and produce a table of the results. That way
we could easily rerun the whole benchmark and quickly post new
results after each change.

Again, thanks for doing this, it's very helpful!
Martin