You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@apr.apache.org by Greg Stein <gs...@gmail.com> on 2019/06/07 11:27:32 UTC

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

On Fri, Jun 7, 2019 at 6:02 AM <iv...@apache.org> wrote:
>...

> \@@ -114,116 +93,73 @@ APR_DECLARE(apr_status_t) apr_dir_read(a
>  {
>      apr_status_t rv;
>      char *fname;
> +    apr_wchar_t wdirname[APR_PATH_MAX];
>

That's a crazy huge stack frame. I recognize we were doing this before this
commit, but is it advisable? Should it be corrected?

Cheers,
-g

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by William A Rowe Jr <wr...@rowe-clan.net>.

On Tue, Jun 11, 2019 at 9:14 AM William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Tue, Jun 11, 2019 at 4:15 AM Branko Čibej <br...@apache.org> wrote:
>
>> On 07.06.2019 21:58, William A Rowe Jr wrote:
>> > I think the optimal way is to allocate a pair of apr thread-specific
>> > wchar buffers in each thread's pool on startup, and use those
>> > exclusively per-thread for wchar translations. We could be looking at
>> > 64k/thread exclusively for name translation, but it doesn't seem
>> > unreasonable.
>> >
>> > The alternative is to continue to use stack, we surely don't want to
>> > lock on acquiring or allocating name translation buffers. /shrug
>>
>> Since this is Windows, and there's no embedded Windows environment left
>> that I'm aware of that's still alive, we can continue with using the
>> stack ... but wouldn't it be so much better to alloca() the required
>> size instead of blindly burning through 64k every time? Obviously that
>> means counting the characters first.
>>
>
> I don't think there is any need to do so. A name buffer needs 32k (right
> now
> we limit that to 8k, but if we wanted to satisfy the OS-acceptable pathname
> length, we would blow that out 4x the current arbitrary limit.) For a
> rename,
> make that 2x buffers. But there is no benefit to wasting cycles determining
> the string length ahead of allocation, because the very next call can run
> right past that limit and hit the wall on available stack. Demanding that
> there
> always be a potential 32k runway of available stack doesn't seem excessive.
>
> The point to the stack is that it contracts immediately on return. So we
> aren't
> burning through a 64k buffer - we are ensuring that we have been topped-up
> to 64k remaining. The targets of these calls are all Win32 API
> invocations,
> we are never nesting them inside further big-buffer allocations of our own.
> What might happen in ntdll we have little control over.
>
>  We either reserve about 2x buffers for file name transliteration in heap
> per thread, or we use the thread stack. As long as we trust that our utf-8
> to ucs-2 logic is rock solid and the allocations and limits are correctly
> coded, this continues to be a safe approach. The 32k/64k are immediately
> given back in the current stack-based approach, but would simply be boat
> anchors most of the time if they are set aside in heap.
>
> I'm +/-0 on switching from stack to heap for this particular transformation
> but welcome good suggestions.
>

I failed to call this out, but guessing that we agree that if we preserve
wide
character string results (without discarding them after a single API call),
using the string size or moving towards counted byte string representation
makes a whole lot more sense for longer-term heap allocations.

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by William A Rowe Jr <wr...@rowe-clan.net>.

Brane, if that's sufficient, would you propose a rewording patch to dispel
any confusion about how the API behaves?

I understand where you confusion arose, but UCS-2<>UTF-16 was a thing
already when I wrote the code. What I didn't reflect in comments is that we
will throw away bad UTF-16 as quickly as we dispel UTF-8, even as Java
kept eating it up for some very serious consequences (yes, I was the first
reporter, so far as I know.)



On Wed, Jun 12, 2019 at 8:15 PM Branko Čibej <br...@apache.org> wrote:

> On 12.06.2019 16:47, William A Rowe Jr wrote:
> > On Tue, Jun 11, 2019 at 5:38 PM William A Rowe Jr <wrowe@rowe-clan.net
> > <ma...@rowe-clan.net>> wrote:
> >
> >     On Tue, Jun 11, 2019 at 1:44 PM Branko Čibej <brane@apache.org
> >     <ma...@apache.org>> wrote:
> >
> >>          We either reserve about 2x buffers for file name
> >>         transliteration in heap
> >>         per thread, or we use the thread stack. As long as we trust
> >>         that our utf-8
> >>         to ucs-2 logic is rock solid and the allocations and limits
> >>         are correctly
> >>         coded, this continues to be a safe approach.
> >
> >         Apropos of that, for 2.0 we're about to or have already
> >         ditched support for versions of Windows that do not have
> >         native UTF-8/UTF-16 conversions (ah, yes ... Windows has
> >         finally moved from UCS-2 to UTF-16). Wouldn't this be the
> >         right time to switch to using Windows' functions instead of
> >         staying with our own? Especially since, with the transition to
> >         UTF-16, we have to deal correctly with surrogate pairs,
> >         something our current code (IIRC) doesn't do.
> >
> >
> >     A bit of a misnomer, the code is full of references to ucs-2
> >     w/surrogate pair
> >     support, the combo of these is utf-16. The comments can be
> >     refreshed to
> >     today's utf-16 nomenclature.
> >
> >
> > How strongly do we feel about the naming? I have the following patch to
> > commit if we want to observe current convention in utf-naming and
> > deprecate
> > many (but not all) ucs references.
>
> As far as I'm concerned, the naming (which is an internal matter, not
> part of the API) can stay as long as the implementation is correct. A
> comment to the tune of "it's actually UTF-16 but started life in the
> olden days when only UCS-2 existed" is good enough.
>
> -- Brane
>
>

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by Branko Čibej <br...@apache.org>.

On 12.06.2019 16:47, William A Rowe Jr wrote:
> On Tue, Jun 11, 2019 at 5:38 PM William A Rowe Jr <wrowe@rowe-clan.net
> <ma...@rowe-clan.net>> wrote:
>
>     On Tue, Jun 11, 2019 at 1:44 PM Branko Čibej <brane@apache.org
>     <ma...@apache.org>> wrote:
>
>>          We either reserve about 2x buffers for file name
>>         transliteration in heap
>>         per thread, or we use the thread stack. As long as we trust
>>         that our utf-8
>>         to ucs-2 logic is rock solid and the allocations and limits
>>         are correctly
>>         coded, this continues to be a safe approach.
>
>         Apropos of that, for 2.0 we're about to or have already
>         ditched support for versions of Windows that do not have
>         native UTF-8/UTF-16 conversions (ah, yes ... Windows has
>         finally moved from UCS-2 to UTF-16). Wouldn't this be the
>         right time to switch to using Windows' functions instead of
>         staying with our own? Especially since, with the transition to
>         UTF-16, we have to deal correctly with surrogate pairs,
>         something our current code (IIRC) doesn't do.
>
>
>     A bit of a misnomer, the code is full of references to ucs-2
>     w/surrogate pair
>     support, the combo of these is utf-16. The comments can be
>     refreshed to 
>     today's utf-16 nomenclature.
>
>
> How strongly do we feel about the naming? I have the following patch to
> commit if we want to observe current convention in utf-naming and
> deprecate
> many (but not all) ucs references.

As far as I'm concerned, the naming (which is an internal matter, not
part of the API) can stay as long as the implementation is correct. A
comment to the tune of "it's actually UTF-16 but started life in the
olden days when only UCS-2 existed" is good enough.

-- Brane

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by William A Rowe Jr <wr...@rowe-clan.net>.

On Tue, Jun 11, 2019 at 5:38 PM William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Tue, Jun 11, 2019 at 1:44 PM Branko Čibej <br...@apache.org> wrote:
>
>>  We either reserve about 2x buffers for file name transliteration in heap
>> per thread, or we use the thread stack. As long as we trust that our utf-8
>> to ucs-2 logic is rock solid and the allocations and limits are correctly
>> coded, this continues to be a safe approach.
>>
>>
>> Apropos of that, for 2.0 we're about to or have already ditched support
>> for versions of Windows that do not have native UTF-8/UTF-16 conversions
>> (ah, yes ... Windows has finally moved from UCS-2 to UTF-16). Wouldn't this
>> be the right time to switch to using Windows' functions instead of staying
>> with our own? Especially since, with the transition to UTF-16, we have to
>> deal correctly with surrogate pairs, something our current code (IIRC)
>> doesn't do.
>>
>
> A bit of a misnomer, the code is full of references to ucs-2 w/surrogate
> pair
> support, the combo of these is utf-16. The comments can be refreshed to
> today's utf-16 nomenclature.
>

How strongly do we feel about the naming? I have the following patch to
commit if we want to observe current convention in utf-naming and deprecate
many (but not all) ucs references.

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by William A Rowe Jr <wr...@rowe-clan.net>.

On Tue, Jun 11, 2019 at 1:44 PM Branko Čibej <br...@apache.org> wrote:

>  We either reserve about 2x buffers for file name transliteration in heap
> per thread, or we use the thread stack. As long as we trust that our utf-8
> to ucs-2 logic is rock solid and the allocations and limits are correctly
> coded, this continues to be a safe approach.
>
>
> Apropos of that, for 2.0 we're about to or have already ditched support
> for versions of Windows that do not have native UTF-8/UTF-16 conversions
> (ah, yes ... Windows has finally moved from UCS-2 to UTF-16). Wouldn't this
> be the right time to switch to using Windows' functions instead of staying
> with our own? Especially since, with the transition to UTF-16, we have to
> deal correctly with surrogate pairs, something our current code (IIRC)
> doesn't do.
>

A bit of a misnomer, the code is full of references to ucs-2 w/surrogate
pair
support, the combo of these is utf-16. The comments can be refreshed to
today's utf-16 nomenclature.

Today's logic remains correct, and of course does the correct thing,
because
an unpaired utf-8 surrogate value would be very broken and even possibly a
security issue, much as decoding other invalid utf-8 bytestreams proved to
be.

If you want to look at win32 api's, feel free to benchmark; though I doubt
it
would outperform the current implementation.

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by Branko Čibej <br...@apache.org>.

On 11.06.2019 16:14, William A Rowe Jr wrote:
> On Tue, Jun 11, 2019 at 4:15 AM Branko Čibej <brane@apache.org
> <ma...@apache.org>> wrote:
>
>     On 07.06.2019 21:58, William A Rowe Jr wrote:
>     > I think the optimal way is to allocate a pair of apr thread-specific
>     > wchar buffers in each thread's pool on startup, and use those
>     > exclusively per-thread for wchar translations. We could be
>     looking at
>     > 64k/thread exclusively for name translation, but it doesn't seem
>     > unreasonable.
>     >
>     > The alternative is to continue to use stack, we surely don't want to
>     > lock on acquiring or allocating name translation buffers. /shrug
>
>     Since this is Windows, and there's no embedded Windows environment
>     left
>     that I'm aware of that's still alive, we can continue with using the
>     stack ... but wouldn't it be so much better to alloca() the required
>     size instead of blindly burning through 64k every time? Obviously that
>     means counting the characters first.
>
>
> I don't think there is any need to do so. A name buffer needs 32k
> (right now
> we limit that to 8k, but if we wanted to satisfy the OS-acceptable
> pathname
> length, we would blow that out 4x the current arbitrary limit.) For a
> rename,
> make that 2x buffers. But there is no benefit to wasting cycles
> determining
> the string length ahead of allocation, because the very next call can run
> right past that limit and hit the wall on available stack. Demanding
> that there
> always be a potential 32k runway of available stack doesn't seem
> excessive.
>
> The point to the stack is that it contracts immediately on return. So
> we aren't
> burning through a 64k buffer - we are ensuring that we have been topped-up
> to 64k remaining. The targets of these calls are all Win32 API
> invocations, 
> we are never nesting them inside further big-buffer allocations of our
> own.
> What might happen in ntdll we have little control over.

Point. Agreed.

>  We either reserve about 2x buffers for file name transliteration in heap
> per thread, or we use the thread stack. As long as we trust that our utf-8
> to ucs-2 logic is rock solid and the allocations and limits are correctly
> coded, this continues to be a safe approach.

Apropos of that, for 2.0 we're about to or have already ditched support
for versions of Windows that do not have native UTF-8/UTF-16 conversions
(ah, yes ... Windows has finally moved from UCS-2 to UTF-16). Wouldn't
this be the right time to switch to using Windows' functions instead of
staying with our own? Especially since, with the transition to UTF-16,
we have to deal correctly with surrogate pairs, something our current
code (IIRC) doesn't do.

-- Brane

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by Ivan Zhakov <iv...@visualsvn.com>.

On Tue, 11 Jun 2019 at 17:14, William A Rowe Jr <wr...@rowe-clan.net> wrote:
>
> On Tue, Jun 11, 2019 at 4:15 AM Branko Čibej <br...@apache.org> wrote:
>>
>> On 07.06.2019 21:58, William A Rowe Jr wrote:
>> > I think the optimal way is to allocate a pair of apr thread-specific
>> > wchar buffers in each thread's pool on startup, and use those
>> > exclusively per-thread for wchar translations. We could be looking at
>> > 64k/thread exclusively for name translation, but it doesn't seem
>> > unreasonable.
>> >
>> > The alternative is to continue to use stack, we surely don't want to
>> > lock on acquiring or allocating name translation buffers. /shrug
>>
>> Since this is Windows, and there's no embedded Windows environment left
>> that I'm aware of that's still alive, we can continue with using the
>> stack ... but wouldn't it be so much better to alloca() the required
>> size instead of blindly burning through 64k every time? Obviously that
>> means counting the characters first.
>
>
> I don't think there is any need to do so. A name buffer needs 32k (right now
> we limit that to 8k, but if we wanted to satisfy the OS-acceptable pathname
> length, we would blow that out 4x the current arbitrary limit.) For a rename,
> make that 2x buffers. But there is no benefit to wasting cycles determining
> the string length ahead of allocation, because the very next call can run
> right past that limit and hit the wall on available stack. Demanding that there
> always be a potential 32k runway of available stack doesn't seem excessive.
>
> The point to the stack is that it contracts immediately on return. So we aren't
> burning through a 64k buffer - we are ensuring that we have been topped-up
> to 64k remaining. The targets of these calls are all Win32 API invocations,
> we are never nesting them inside further big-buffer allocations of our own.
> What might happen in ntdll we have little control over.
>
>  We either reserve about 2x buffers for file name transliteration in heap
> per thread, or we use the thread stack. As long as we trust that our utf-8
> to ucs-2 logic is rock solid and the allocations and limits are correctly
> coded, this continues to be a safe approach. The 32k/64k are immediately
> given back in the current stack-based approach, but would simply be boat
> anchors most of the time if they are set aside in heap.
>
> I'm +/-0 on switching from stack to heap for this particular transformation
> but welcome good suggestions.
> .
I agree that in general allocating 16kb on stack should be avoided. However,
I also think that is it's not too problematic in this specific case, as it
happens within a function that only calls the Win32 API.

Also, this particular code should probably be changed anyway to avoid doing
UTF8->UCS conversions per every apr_dir_read(), and I have that on my TODO
list.

Speaking of this commit, my aim here was to fix accessing the uninitialized
memory, as I did in r1860747. And this turned out to be much simpler with
the ANSI code path removed.

--
Ivan Zhakov

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by William A Rowe Jr <wr...@rowe-clan.net>.

On Tue, Jun 11, 2019 at 4:15 AM Branko Čibej <br...@apache.org> wrote:

> On 07.06.2019 21:58, William A Rowe Jr wrote:
> > I think the optimal way is to allocate a pair of apr thread-specific
> > wchar buffers in each thread's pool on startup, and use those
> > exclusively per-thread for wchar translations. We could be looking at
> > 64k/thread exclusively for name translation, but it doesn't seem
> > unreasonable.
> >
> > The alternative is to continue to use stack, we surely don't want to
> > lock on acquiring or allocating name translation buffers. /shrug
>
> Since this is Windows, and there's no embedded Windows environment left
> that I'm aware of that's still alive, we can continue with using the
> stack ... but wouldn't it be so much better to alloca() the required
> size instead of blindly burning through 64k every time? Obviously that
> means counting the characters first.
>

I don't think there is any need to do so. A name buffer needs 32k (right now
we limit that to 8k, but if we wanted to satisfy the OS-acceptable pathname
length, we would blow that out 4x the current arbitrary limit.) For a
rename,
make that 2x buffers. But there is no benefit to wasting cycles determining
the string length ahead of allocation, because the very next call can run
right past that limit and hit the wall on available stack. Demanding that
there
always be a potential 32k runway of available stack doesn't seem excessive.

The point to the stack is that it contracts immediately on return. So we
aren't
burning through a 64k buffer - we are ensuring that we have been topped-up
to 64k remaining. The targets of these calls are all Win32 API invocations,
we are never nesting them inside further big-buffer allocations of our own.
What might happen in ntdll we have little control over.

 We either reserve about 2x buffers for file name transliteration in heap
per thread, or we use the thread stack. As long as we trust that our utf-8
to ucs-2 logic is rock solid and the allocations and limits are correctly
coded, this continues to be a safe approach. The 32k/64k are immediately
given back in the current stack-based approach, but would simply be boat
anchors most of the time if they are set aside in heap.

I'm +/-0 on switching from stack to heap for this particular transformation
but welcome good suggestions.
.

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by Branko Čibej <br...@apache.org>.

On 07.06.2019 21:58, William A Rowe Jr wrote:
> I think the optimal way is to allocate a pair of apr thread-specific
> wchar buffers in each thread's pool on startup, and use those
> exclusively per-thread for wchar translations. We could be looking at
> 64k/thread exclusively for name translation, but it doesn't seem
> unreasonable.
>
> The alternative is to continue to use stack, we surely don't want to
> lock on acquiring or allocating name translation buffers. /shrug

Since this is Windows, and there's no embedded Windows environment left
that I'm aware of that's still alive, we can continue with using the
stack ... but wouldn't it be so much better to alloca() the required
size instead of blindly burning through 64k every time? Obviously that
means counting the characters first.


> On Fri, Jun 7, 2019 at 6:27 AM Greg Stein <gstein@gmail.com
> <ma...@gmail.com>> wrote:
>
>     On Fri, Jun 7, 2019 at 6:02 AM <ivan@apache.org
>     <ma...@apache.org>> wrote:
>     >...
>
>         \@@ -114,116 +93,73 @@ APR_DECLARE(apr_status_t) apr_dir_read(a
>          {
>              apr_status_t rv;
>              char *fname;
>         +    apr_wchar_t wdirname[APR_PATH_MAX];
>
>
>     That's a crazy huge stack frame. I recognize we were doing this
>     before this commit, but is it advisable? Should it be corrected?
>
>     Cheers,
>     -g
>

Re: svn commit: r1860745 - /apr/apr/trunk/file_io/win32/dir.c

Posted by William A Rowe Jr <wr...@rowe-clan.net>.

I think the optimal way is to allocate a pair of apr thread-specific wchar
buffers in each thread's pool on startup, and use those exclusively
per-thread for wchar translations. We could be looking at 64k/thread
exclusively for name translation, but it doesn't seem unreasonable.

The alternative is to continue to use stack, we surely don't want to lock
on acquiring or allocating name translation buffers. /shrug

On Fri, Jun 7, 2019 at 6:27 AM Greg Stein <gs...@gmail.com> wrote:

> On Fri, Jun 7, 2019 at 6:02 AM <iv...@apache.org> wrote:
> >...
>
>> \@@ -114,116 +93,73 @@ APR_DECLARE(apr_status_t) apr_dir_read(a
>>  {
>>      apr_status_t rv;
>>      char *fname;
>> +    apr_wchar_t wdirname[APR_PATH_MAX];
>>
>
> That's a crazy huge stack frame. I recognize we were doing this before
> this commit, but is it advisable? Should it be corrected?
>
> Cheers,
> -g
>
>