You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Evgeny Kotkov <ev...@visualsvn.com> on 2015/09/02 15:33:31 UTC

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Stefan Fuhrmann <st...@apache.org> writes:

> Author: stefan2
> Date: Wed Sep  2 13:04:51 2015
> New Revision: 1700799
>
> URL: http://svn.apache.org/r1700799
> Log:
> [Combines r1698359 and r170078 into a single commit for better review.]
> Introduce a stream wrapper object that adds buffering support to any
> readable stream.  Use it on the stdin streams in our CL tools.
>
> As it turns out, parsing data from a stdin byte-by-byte incurs a
> massive overhead of 100% internal and 300% system load over a buffered
> stream.  'svnadmin load-revprops' sees a 5 times speedup if all data
> is in OS disc caches.  This is a realistic assumption in a "final sync
> and switch over to new repository" scenario.
>
> The other 2 uses of stdin either have less data to process (svnfsfs
> load-index) or parse only a small fraction of the stream (svnadmin load).
>
> * subversion/include/svn_io.h
>   (svn_stream_wrap_buffered_read): Declare the new stream constructor API.

[...]

If buffering makes difference for how well svnadmin load, load-revprops and
svnfsfs load-index behave, can't we use the standard mechanism for that,
say, apr_file_open_flags_stdin(..., APR_BUFFERED)?

What is the point of reimplementing something that's already a part of the APR?


Regards,
Evgeny Kotkov

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

>> Julian Foad <ju...@btopenworld.com> writes:
>>
>> > Also for:
>> >    svnrdump load
>> >    svnmucc put - <url>
>> >
>> > ? (These just came to mind when I thought about it.)
>>
>> I think, yes.  Thank you for pointing this out, Julian.
>
> After discussion at the Berlin hackathon, the APR-buffered solution has
> been implemented and all callers been update in r1702983.

Thanks!

I also updated svnrdump load and svndumpfilter in r1703072, r1703073 and
r1703074.


Regards,
Evgeny Kotkov

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Sat, Sep 12, 2015 at 1:46 AM, Evgeny Kotkov <ev...@visualsvn.com>
wrote:

> Julian Foad <ju...@btopenworld.com> writes:
>
> > Also for:
> >    svnrdump load
> >    svnmucc put - <url>
> >
> > ? (These just came to mind when I thought about it.)
>
> I think, yes.  Thank you for pointing this out, Julian.
>

After discussion at the Berlin hackathon, the APR-buffered
solution has been implemented and all callers been update
in r1702983.

-- Stefan^2.

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Julian Foad <ju...@btopenworld.com> writes:

> Also for:
>    svnrdump load
>    svnmucc put - <url>
>
> ? (These just came to mind when I thought about it.)

I think, yes.  Thank you for pointing this out, Julian.


Regards,
Evgeny Kotkov

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Julian Foad <ju...@btopenworld.com>.
Evgeny Kotkov wrote:
> - Enable buffering in three subcommands: svnadmin load, svnadmin load-revprops
>   and in svnfsfs load-index.

Also for:
   svnrdump load
   svnmucc put - <url>

? (These just came to mind when I thought about it.)

- Julian

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

> So, you are absolutely right about APR buffering being enough for our
> non-interactive "read stdin as a single large file" usage in load & friends.
>
> My last concern is as follows. We need an alternative implementation
> for svn_stream_for_stdin. I see 3 options:
>
> 1. svn_stream_for_stdin2 with buffered option.
> 2. Always enable buffering in svn_stream_for_stdin.
> 3. Some private API.

I very much prefer using the standard APR_BUFFERED buffering, as opposed
to rolling our own buffering adapter.  I also prefer the idea of exposing this
flag through the new variant of the API, svn_stream_for_stdin2(), and using
it in the three mentioned subcommands (option 1).

I don't see an abstraction leak in this approach.  Rather than that, exposing
this option is a necessary thing.  Only the caller knows if buffering makes
sense at the particular moment, and that is, for instance, what we know when
we enable it for commands like 'svnadmin load-revprops'.  Enabling buffering
when you're about to do something interactive is always a bad idea, but, again,
only the caller knows these details.  Furthermore, it is not that we invent
something new and push it through the API, as APR_BUFFERED is a well-known
and a quite stable concept.

Regarding other two options:

- We can't do 2), because doing so would induce a behavior change in all our
  command-line tools, and this is going to happen without any reason.  As we
  know that buffering makes sense for three mentioned subcommands, I'd say
  that we should limit the scope to them.

- I find 3) an unnecessary complication, as opposed to 1).  As we are just
  exposing a standard APR concept through our API, I think that we can make
  it public.

To sum everything up, I think that we should:

- Replace the custom buffering with APR_BUFFERED, and expose it through the
  new svn_stream_for_stdin2() API.  We could also place a @note in the API
  documentation saying that currently enabled buffering is equivalent to how
  the APR_BUFFERED flag behaves.

- Enable buffering in three subcommands: svnadmin load, svnadmin load-revprops
  and in svnfsfs load-index.

I could do that, if it works for you.


Regards,
Evgeny Kotkov

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Wed, Sep 9, 2015 at 12:31 PM, Evgeny Kotkov <ev...@visualsvn.com>
wrote:

> Stefan Fuhrmann <st...@wandisco.com> writes:
>
> >> The original commit begins using svn_stream_wrap_buffered_read() during
> >> svnadmin load-revprops and svnfsfs load-index.  This patch, however,
> does
> >> something entirely different and adds buffering to *every* stdin.
> >
> > Sorry for the confusion. I did not intend to actually change the
> > implementation of svn_stream_for_stdin this way but simply tried to
> > demonstrate a problem with APR buffer reads for "streamy" file handles.
>
> Thank you for the explanation.
>

After some debugging I found the reason for the ra-test hang. Once I knew
it,
it's been kind of obvious from the code: For buffered files,
apr_file_read() is
more or less equivalent to apr_file_write_full().

Changing that in APR might break APR internals (buffer may be half-filled
but not be at EOF) as well as user code (read_full required where a normal
read used to be sufficient).


> > The underlying problem is still present: If stdin can't deliver data fast
> > enough (e.g. some hick-up / long latency on the producer side of a dump |
> > load), a buffered APR file will error out while a non-buffered one will
> > simply wait & retry.
> >
> > However, I have yet to try and provoke the error specifically for the
> > dump | load scenario.
>
> I think that this problem is nonexistent.
>
> A program doesn't need to handle EAGAIN during read() unless it puts the
> file
> into the non-blocking mode with O_NONBLOCK [1].  We don't do that for
> STDIN,
> and, irrespectively of what happens with the data on the other side of the
> pipe, read() is going to block until the requested data is available.
>
> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html
>

You are right. I tried various methods to delay the producer side and
it never caused the consumer to error out. The ra-test hang mislead
me into believing that the EAGAIN handling was the problem.

So, you are absolutely right about APR buffering being enough for our
non-interactive "read stdin as a single large file" usage in load & friends.

My last concern is as follows. We need an alternative implementation
for svn_stream_for_stdin. I see 3 options:

1. svn_stream_for_stdin2 with buffered option.
2. Always enable buffering in svn_stream_for_stdin.
3. Some private API.

The problem with 1. is that if we use APR_BUFFERED for it, the new
API will be add leakage to the abstraction: the user has to know when
it is safe to enable buffering. This problem does not occur if the wrapper
stream is being used instead.

Thus, variant 2 is impossible with APR_BUFFERED but quite possible
using the buffering wrapper. The only way it could cause a problem is
if some code was to rely on the actual size of a read request from stdin.
Since the latter is managed by the OS + C runtime, it is hard to think of
a way to make this happen - but still. Having a fall-back would be nice,
i.e. variant 1 is safer than 2.

Option 3 is basically saying "the public API is not good enough but we
don't want to give you a better one". It is, however, the only variant
where I would be fully comfortable just using APR buffering.

Your thoughts?

-- Stefan^2.

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

>> The original commit begins using svn_stream_wrap_buffered_read() during
>> svnadmin load-revprops and svnfsfs load-index.  This patch, however, does
>> something entirely different and adds buffering to *every* stdin.
>
> Sorry for the confusion. I did not intend to actually change the
> implementation of svn_stream_for_stdin this way but simply tried to
> demonstrate a problem with APR buffer reads for "streamy" file handles.

Thank you for the explanation.

> The underlying problem is still present: If stdin can't deliver data fast
> enough (e.g. some hick-up / long latency on the producer side of a dump |
> load), a buffered APR file will error out while a non-buffered one will
> simply wait & retry.
>
> However, I have yet to try and provoke the error specifically for the
> dump | load scenario.

I think that this problem is nonexistent.

A program doesn't need to handle EAGAIN during read() unless it puts the file
into the non-blocking mode with O_NONBLOCK [1].  We don't do that for STDIN,
and, irrespectively of what happens with the data on the other side of the
pipe, read() is going to block until the requested data is available.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html


Regards,
Evgeny Kotkov

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Mon, Sep 7, 2015 at 6:38 PM, Evgeny Kotkov <ev...@visualsvn.com>
wrote:

> Stefan Fuhrmann <st...@wandisco.com> writes:
>
> > That is indeed a good idea, so I tried it in various ways:
> >
> > Index: subversion/libsvn_subr/stream.c
> > ===================================================================
> > --- subversion/libsvn_subr/stream.c    (revision 1701330)
> > +++ subversion/libsvn_subr/stream.c    (working copy)
> > @@ -1826,10 +1826,20 @@ svn_stream_for_stdin(svn_stream_t **in,
> apr_pool_t
> >    apr_status_t apr_err;
> >
> >    apr_err = apr_file_open_stdin(&stdin_file, pool);
> > +
> > +/* Alternatively to the above: tell APR to create a buffered file
> > +  apr_err = apr_file_open_flags_stdin(&stdin_file, APR_BUFFERED, pool);
> > +*/
> >    if (apr_err)
> >      return svn_error_wrap_apr(apr_err, "Can't open stdin");
>
> The original commit begins using svn_stream_wrap_buffered_read() during
> svnadmin load-revprops and svnfsfs load-index.  This patch, however, does
> something entirely different and adds buffering to *every* stdin.
>

Sorry for the confusion. I did not intend to actually change the
implementation of svn_stream_for_stdin this way but simply
tried to demonstrate a problem with APR buffer reads for
"streamy" file handles.

I am not sure why would we want to do this, but there certainly is a reason
> against a change like this — buffering can't be used when something
> interactive
> happens using stdin, e.g., with svnserve.
>

It is certainly useful to make buffering available as an option
(like you suggested below) instead of wrapping the stream
locally or wrapping it unconditionally. Currently, I'm not 100%
sure whether always wrapping stdin would be safe but I think
it is at least for correct usage:

A read to the wrapper will translate to at most 1 read at the
APR layer, so it will never request more data from stdin than
stdin can deliver at the moment. Exception: Single byte reads
may block scenarios just as and because they do at the APR
level.

So, whatever logic a stream API user applies to determine
that some interaction is required before new data can arrive
in stdin, that logic applies 1:1 with and without the wrapper.

So, why can't we enable buffering for these subcommands (perhaps including
> svnadmin load), leave everything else as is and avoid the necessity of
> having
> and maintaining the custom buffered stream adapter?
>

The underlying problem is still present: If stdin can't deliver
data fast enough (e.g. some hick-up / long latency on the
producer side of a dump | load), a buffered APR file will
error out while a non-buffered one will simply wait & retry.

However, I have yet to try and provoke the error specifically
for the dump | load scenario.

We could introduce svn_stream_for_stdin2(..., svn_boolean_t buffered, ...),
> and then the required change for svnadmin load-revprops would look as below
> (changes for other subcommands would look almost exactly the same):
>

Regardless of the buffering method used, +1 on adding
a rev'ed API.

-- Stefan^2.


>
> [[[
> Index: subversion/libsvn_subr/stream.c
> ===================================================================
> --- subversion/libsvn_subr/stream.c (revision 1701648)
> +++ subversion/libsvn_subr/stream.c (working copy)
> @@ -1820,12 +1820,20 @@ svn_stream_wrap_buffered_read(svn_stream_t *inner,
>
>
>  svn_error_t *
> -svn_stream_for_stdin(svn_stream_t **in, apr_pool_t *pool)
> +svn_stream_for_stdin2(svn_stream_t **in,
> +                      svn_boolean_t buffered,
> +                      apr_pool_t *pool)
>  {
>    apr_file_t *stdin_file;
>    apr_status_t apr_err;
> +  apr_int32_t flags;
>
> -  apr_err = apr_file_open_stdin(&stdin_file, pool);
> +  if (buffered)
> +    flags = APR_BUFFERED;
> +  else
> +    flags = 0;
> +
> +  apr_err = apr_file_open_flags_stdin(&stdin_file, flags, pool);
>    if (apr_err)
>      return svn_error_wrap_apr(apr_err, "Can't open stdin");
>
> Index: subversion/svnadmin/svnadmin.c
> ===================================================================
> --- subversion/svnadmin/svnadmin.c  (revision 1701648)
> +++ subversion/svnadmin/svnadmin.c  (working copy)
> @@ -1547,8 +1547,7 @@ subcommand_load_revprops(apr_getopt_t *os, void *b
>    SVN_ERR(open_repos(&repos, opt_state->repository_path, pool));
>
>    /* Read the stream from STDIN.  Users can redirect a file. */
> -  SVN_ERR(svn_stream_for_stdin(&stdin_stream, pool));
> -  stdin_stream = svn_stream_wrap_buffered_read(stdin_stream, pool);
> +  SVN_ERR(svn_stream_for_stdin2(&stdin_stream, TRUE, pool));
>
>    /* Progress feedback goes to STDOUT, unless they asked to suppress it.
> */
>    if (! opt_state->quiet)
>
> ]]]
>
>
> Regards,
> Evgeny Kotkov
>

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

> That is indeed a good idea, so I tried it in various ways:
>
> Index: subversion/libsvn_subr/stream.c
> ===================================================================
> --- subversion/libsvn_subr/stream.c    (revision 1701330)
> +++ subversion/libsvn_subr/stream.c    (working copy)
> @@ -1826,10 +1826,20 @@ svn_stream_for_stdin(svn_stream_t **in, apr_pool_t
>    apr_status_t apr_err;
>
>    apr_err = apr_file_open_stdin(&stdin_file, pool);
> +
> +/* Alternatively to the above: tell APR to create a buffered file
> +  apr_err = apr_file_open_flags_stdin(&stdin_file, APR_BUFFERED, pool);
> +*/
>    if (apr_err)
>      return svn_error_wrap_apr(apr_err, "Can't open stdin");

The original commit begins using svn_stream_wrap_buffered_read() during
svnadmin load-revprops and svnfsfs load-index.  This patch, however, does
something entirely different and adds buffering to *every* stdin.

I am not sure why would we want to do this, but there certainly is a reason
against a change like this — buffering can't be used when something interactive
happens using stdin, e.g., with svnserve.

So, why can't we enable buffering for these subcommands (perhaps including
svnadmin load), leave everything else as is and avoid the necessity of having
and maintaining the custom buffered stream adapter?

We could introduce svn_stream_for_stdin2(..., svn_boolean_t buffered, ...),
and then the required change for svnadmin load-revprops would look as below
(changes for other subcommands would look almost exactly the same):

[[[
Index: subversion/libsvn_subr/stream.c
===================================================================
--- subversion/libsvn_subr/stream.c (revision 1701648)
+++ subversion/libsvn_subr/stream.c (working copy)
@@ -1820,12 +1820,20 @@ svn_stream_wrap_buffered_read(svn_stream_t *inner,


 svn_error_t *
-svn_stream_for_stdin(svn_stream_t **in, apr_pool_t *pool)
+svn_stream_for_stdin2(svn_stream_t **in,
+                      svn_boolean_t buffered,
+                      apr_pool_t *pool)
 {
   apr_file_t *stdin_file;
   apr_status_t apr_err;
+  apr_int32_t flags;

-  apr_err = apr_file_open_stdin(&stdin_file, pool);
+  if (buffered)
+    flags = APR_BUFFERED;
+  else
+    flags = 0;
+
+  apr_err = apr_file_open_flags_stdin(&stdin_file, flags, pool);
   if (apr_err)
     return svn_error_wrap_apr(apr_err, "Can't open stdin");

Index: subversion/svnadmin/svnadmin.c
===================================================================
--- subversion/svnadmin/svnadmin.c  (revision 1701648)
+++ subversion/svnadmin/svnadmin.c  (working copy)
@@ -1547,8 +1547,7 @@ subcommand_load_revprops(apr_getopt_t *os, void *b
   SVN_ERR(open_repos(&repos, opt_state->repository_path, pool));

   /* Read the stream from STDIN.  Users can redirect a file. */
-  SVN_ERR(svn_stream_for_stdin(&stdin_stream, pool));
-  stdin_stream = svn_stream_wrap_buffered_read(stdin_stream, pool);
+  SVN_ERR(svn_stream_for_stdin2(&stdin_stream, TRUE, pool));

   /* Progress feedback goes to STDOUT, unless they asked to suppress it. */
   if (! opt_state->quiet)

]]]


Regards,
Evgeny Kotkov

Re: svn commit: r1700799 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Wed, Sep 2, 2015 at 3:33 PM, Evgeny Kotkov <ev...@visualsvn.com>
wrote:

> Stefan Fuhrmann <st...@apache.org> writes:
>
> > Author: stefan2
> > Date: Wed Sep  2 13:04:51 2015
> > New Revision: 1700799
> >
> > URL: http://svn.apache.org/r1700799
> > Log:
> > [Combines r1698359 and r170078 into a single commit for better review.]
> > Introduce a stream wrapper object that adds buffering support to any
> > readable stream.  Use it on the stdin streams in our CL tools.
> >
> > As it turns out, parsing data from a stdin byte-by-byte incurs a
> > massive overhead of 100% internal and 300% system load over a buffered
> > stream.  'svnadmin load-revprops' sees a 5 times speedup if all data
> > is in OS disc caches.  This is a realistic assumption in a "final sync
> > and switch over to new repository" scenario.
> >
> > The other 2 uses of stdin either have less data to process (svnfsfs
> > load-index) or parse only a small fraction of the stream (svnadmin load).
> >
> > * subversion/include/svn_io.h
> >   (svn_stream_wrap_buffered_read): Declare the new stream constructor
> API.
>
> [...]
>
> If buffering makes difference for how well svnadmin load, load-revprops and
> svnfsfs load-index behave, can't we use the standard mechanism for that,
> say, apr_file_open_flags_stdin(..., APR_BUFFERED)?
>

That is indeed a good idea, so I tried it in various ways:

Index: subversion/libsvn_subr/stream.c
===================================================================
--- subversion/libsvn_subr/stream.c    (revision 1701330)
+++ subversion/libsvn_subr/stream.c    (working copy)
@@ -1826,10 +1826,20 @@ svn_stream_for_stdin(svn_stream_t **in, apr_pool_t
   apr_status_t apr_err;

   apr_err = apr_file_open_stdin(&stdin_file, pool);
+
+/* Alternatively to the above: tell APR to create a buffered file
+  apr_err = apr_file_open_flags_stdin(&stdin_file, APR_BUFFERED, pool);
+*/
   if (apr_err)
     return svn_error_wrap_apr(apr_err, "Can't open stdin");

+/* Alternatively to the svn_stream_wrap_buffered_read() call or
+ * apr_file_open_flags_stdin, retroactively add a buffer to the APR file
+  apr_file_buffer_set(stdin_file, apr_palloc(pool, 4096), 4096);
+*/
+
   *in = svn_stream_from_aprfile2(stdin_file, TRUE, pool);
+  *in = svn_stream_wrap_buffered_read(*in, pool);

   return SVN_NO_ERROR;
 }



> What is the point of reimplementing something that's already a part of the
> APR?
>

As it turns out, APR buffering is not compatible with streamy files.
Any of the 2 alternatives in the patch above causes ra_test to hang.

The problem seems to be that APR does not handle EAGAIN nor
EWOULDBLOCK when reading in buffered mode. I.e. buffered mode
assumes that the contents is readily available. To fix this, our file
read wrappers would have to replicate this section from readwrite,c
(possibly in different variants for platforms):

#ifdef USE_WAIT_FOR_IO
        if (rv == -1 &&
            (errno == EAGAIN || errno == EWOULDBLOCK) &&
            thefile->timeout != 0) {
            apr_status_t arv = apr_wait_for_io_or_timeout(thefile, NULL, 1);
            if (arv != APR_SUCCESS) {
                *nbytes = bytes_read;
                return arv;
            }
            else {
                do {
                    rv = read(thefile->filedes, buf, *nbytes);
                } while (rv == -1 && errno == EINTR);
            }
        }
#endif

It seems to me that the stream wrapper is the cleaner solution and
that it fits nicely with the general ideas behind stream design pattern.
Having lower latency than even buffered APR files is a bonus.

-- Stefan^2.