You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@stdcxx.apache.org by Eric Lemings <Er...@roguewave.com> on 2008/03/11 23:51:33 UTC

STDCXX-435

 
>From Martin's comments in the Jira issue:
 
"...The function attempts to convert the source sequence up until the
terminating NUL (or an invalid byte) or until it has produced the
requested number of destitation characters. When the destination buffer
is large enough for more the number of characters in the source sequence
the function just keeps converting past the end."
 
More than that, I think the call to mbsrtowcs() is using the wrong
length.  It's using dst_len when it should be using src_len which is
smaller in the test case listed in the issue.
 
Some debugging excerpts:
 
    __rw_libc_do_in (state=@0x7fff5d880800
<ma...@0x7fff5d880800> , from=0x7fff5d880820 "abc",
    from_end=0x7fff5d880821 "bc", from_next=@0x7fff5d8807f8
<ma...@0x7fff5d8807f8> , to=0x7fff5d880810,
    to_limit=0x7fff5d880818, to_next=@0x7fff5d8807f0
<ma...@0x7fff5d8807f0> )
    at /work/stdcxx/trunk.1/src/wcodecvt.cpp:357

    372         const _RWSTD_SIZE_T ret = mbsrtowcs (pdst, &psrc,
dst_len, &state);
 
    (gdb) print src_len
    $6 = 1
    (gdb) print dst_len
    $7 = 2

I think it should actually use min(src_len, dst_len).
 
Thoughts?
 
Brad.

Re: STDCXX-435

Posted by Martin Sebor <se...@roguewave.com>.

Eric Lemings wrote:
>  
> 
>> -----Original Message-----
>> From: Martin Sebor [mailto:sebor@roguewave.com] 
>> Sent: Wednesday, March 12, 2008 10:52 AM
>> To: dev@stdcxx.apache.org
>> Subject: Re: STDCXX-435
>>
> ...
>>> I'll do some manual testing with this patch -- give it a test drive.
>> Thanks! All I think we need to do is to rerun the codecvt tests
>> with it and check for regressions. If there are none let me know
>> and I'll commit both the patch and add the regression test.
> 
> I applied the patch and reran the test case and several regression
> tests associated with std::codecvt.  It looks good.

Okay, thanks for testing it! I just checked it in along with
the regression test:

   http://svn.apache.org/viewvc?rev=636534&view=rev
   http://svn.apache.org/viewvc?rev=636553&view=rev

I tried to make the test portable but because of the hardcoded
locale names I suspect the test might need some tweaking...

Martin

RE: STDCXX-435

Posted by Eric Lemings <Er...@roguewave.com>.

 

> -----Original Message-----
> From: Martin Sebor [mailto:sebor@roguewave.com] 
> Sent: Wednesday, March 12, 2008 10:52 AM
> To: dev@stdcxx.apache.org
> Subject: Re: STDCXX-435
> 
...
> > I'll do some manual testing with this patch -- give it a test drive.
> 
> Thanks! All I think we need to do is to rerun the codecvt tests
> with it and check for regressions. If there are none let me know
> and I'll commit both the patch and add the regression test.

I applied the patch and reran the test case and several regression
tests associated with std::codecvt.  It looks good.

Brad.

Re: STDCXX-435

Posted by Martin Sebor <se...@roguewave.com>.

Eric Lemings wrote:
>  
> 
>> -----Original Message-----
>> From: Martin Sebor [mailto:sebor@roguewave.com] 
>> Sent: Wednesday, March 12, 2008 10:52 AM
>> To: dev@stdcxx.apache.org
>> Subject: Re: STDCXX-435
>>
>> Eric Lemings wrote:
>> [...]
>>> Right.  And I believe that limit is
>>>
>>> 	min(__from_end-__from, __to_limit-__to)
>> Suppose MB_CUR_MAX=4, the size of the source sequence is 1 byte
>> (with no terminating NUL), and the byte doesn't form a complete
>> multibyte character. There's no way to tell the function not to
>> attempt to read past the first byte.
> 
> I could be missing something but wouldn't this constitute an invalid
> conversion?

Not invalid -- incomplete. The only way to tell whether it's valid
or not is to look at the next byte. Which is exactly what mbsrtowcs()
will do: read past the first source byte in an effort to determine
what the rest of the multibyte character is. This first byte could
be something like "\xc2" in UTF-8 which doesn't mean anything by
itself but needs one other byte to form a valid character.

Martin

RE: STDCXX-435

Posted by Eric Lemings <Er...@roguewave.com>.

 

> -----Original Message-----
> From: Martin Sebor [mailto:sebor@roguewave.com] 
> Sent: Wednesday, March 12, 2008 10:52 AM
> To: dev@stdcxx.apache.org
> Subject: Re: STDCXX-435
> 
> Eric Lemings wrote:
> [...]
> > Right.  And I believe that limit is
> > 
> > 	min(__from_end-__from, __to_limit-__to)
> 
> Suppose MB_CUR_MAX=4, the size of the source sequence is 1 byte
> (with no terminating NUL), and the byte doesn't form a complete
> multibyte character. There's no way to tell the function not to
> attempt to read past the first byte.

I could be missing something but wouldn't this constitute an invalid
conversion?

Brad.

Re: STDCXX-435

Posted by Martin Sebor <se...@roguewave.com>.

Eric Lemings wrote:
[...]
> Right.  And I believe that limit is
> 
> 	min(__from_end-__from, __to_limit-__to)

Suppose MB_CUR_MAX=4, the size of the source sequence is 1 byte
(with no terminating NUL), and the byte doesn't form a complete
multibyte character. There's no way to tell the function not to
attempt to read past the first byte. I think this might have
been what made me decide not to use the function, although it
still seems that this case could be handled separately and that
we should still be able to call mbsrtowcs() on the longest source
sequence guaranteed not to cause the function to read past the
end of it, even in the absence of a NUL. Maybe I was just plain
lazy to implement the algorithm. It might be worth revisiting
since a few calls to mbsrtowcs() on small number of subsequences
are likely to be faster than one call to mbrtowc() for each
multibyte character.

> 
> possibly adjusted to account for the length in bytes of the appropriate
> character sequence type, e.g. mbrlen(), wcslen().
> 
>> or make sure there is a NUL at the end (by copying the short
> subsequence
>> at the end of the source sequence into a small local buffer and NUL
>> terminating it there.
> 
> Also possible.  It would be safer though more involved.
> 
>> I recall trying to get that approach to work
>> at first and failing. I don't remember why it didn't work anymore,
>> if it was because the code became too complex and inefficient or
>> if it simply wasn't possible to guarantee that it would be correct
>> in all cases.
>>
>> In any event, I came to the conclusion that we can't call mbsrtowcs()
>> to convert whole ranges of characters at once but that we must do the
>> conversion one character at a time instead. That's what the attached
>> patch does (I think). Since I wrote it months ago I don't remember
>> how extensively I tested it, or if it even applies cleanly after so
>> much time has passed. It does appear to fix the bug in the originally
>> submitted test case though :)
> 
> I'll do some manual testing with this patch -- give it a test drive.

Thanks! All I think we need to do is to rerun the codecvt tests
with it and check for regressions. If there are none let me know
and I'll commit both the patch and add the regression test.

Martin

RE: STDCXX-435

Posted by Eric Lemings <Er...@roguewave.com>.

 

> -----Original Message-----
> From: Martin Sebor [mailto:msebor@gmail.com] On Behalf Of Martin Sebor
> Sent: Tuesday, March 11, 2008 5:42 PM
> To: dev@stdcxx.apache.org
> Cc: Eric Lemings; Martin Sebor
> Subject: Re: STDCXX-435
> 
> I don't think it's that simple. IIRC, the problem is that we're using
> mbsrtowcs() on sequences that aren't guaranteed to be NUL-terminated.

Well the sequences in the test case are null terminated.

According to the C99 specs for mbsrtowcs(), "Conversion continues up to
and including a terminating null character, which is also stored.
Conversion stops earlier in two cases: when a sequence of bytes is
encountered  that does not form a valid multibyte character, or (if
dst is not a null pointer) when len wide characers have been stored in
the array pointed to by dst."

So the function actually stops conversion in three cases: 1.) it
converts
the terminating null character, 2.) attempts to convert an invalid
multibyte character, and 3.) converts len number of wide characters.

The problem in the test case is that the src constitutes only one
character
(even though it contains more valid characters) but the function is
called
with a len argument of 2 -- the length of the destination buffer.

> It seems that it should be straightforward to either specify a limit
> to mbsrtowcs() that's small enough so as to prevent the function from
> ever reaching the end of the (non-NUL-terminated) source sequence,

Right.  And I believe that limit is

	min(__from_end-__from, __to_limit-__to)

possibly adjusted to account for the length in bytes of the appropriate
character sequence type, e.g. mbrlen(), wcslen().

> or make sure there is a NUL at the end (by copying the short
subsequence
> at the end of the source sequence into a small local buffer and NUL
> terminating it there.

Also possible.  It would be safer though more involved.

> I recall trying to get that approach to work
> at first and failing. I don't remember why it didn't work anymore,
> if it was because the code became too complex and inefficient or
> if it simply wasn't possible to guarantee that it would be correct
> in all cases.
> 
> In any event, I came to the conclusion that we can't call mbsrtowcs()
> to convert whole ranges of characters at once but that we must do the
> conversion one character at a time instead. That's what the attached
> patch does (I think). Since I wrote it months ago I don't remember
> how extensively I tested it, or if it even applies cleanly after so
> much time has passed. It does appear to fix the bug in the originally
> submitted test case though :)

I'll do some manual testing with this patch -- give it a test drive.

Brad.

Re: STDCXX-435

Posted by Martin Sebor <se...@roguewave.com>.

I don't think it's that simple. IIRC, the problem is that we're using
mbsrtowcs() on sequences that aren't guaranteed to be NUL-terminated.
It seems that it should be straightforward to either specify a limit
to mbsrtowcs() that's small enough so as to prevent the function from
ever reaching the end of the (non-NUL-terminated) source sequence, or
make sure there is a NUL at the end (by copying the short subsequence
at the end of the source sequence into a small local buffer and NUL
terminating it there. I recall trying to get that approach to work
at first and failing. I don't remember why it didn't work anymore,
if it was because the code became too complex and inefficient or
if it simply wasn't possible to guarantee that it would be correct
in all cases.

In any event, I came to the conclusion that we can't call mbsrtowcs()
to convert whole ranges of characters at once but that we must do the
conversion one character at a time instead. That's what the attached
patch does (I think). Since I wrote it months ago I don't remember
how extensively I tested it, or if it even applies cleanly after so
much time has passed. It does appear to fix the bug in the originally
submitted test case though :)

Martin

Eric Lemings wrote:
>  
>>>From Martin's comments in the Jira issue:
>  
> "...The function attempts to convert the source sequence up until the
> terminating NUL (or an invalid byte) or until it has produced the
> requested number of destitation characters. When the destination buffer
> is large enough for more the number of characters in the source sequence
> the function just keeps converting past the end."
>  
> More than that, I think the call to mbsrtowcs() is using the wrong
> length.  It's using dst_len when it should be using src_len which is
> smaller in the test case listed in the issue.
>  
> Some debugging excerpts:
>  
>     __rw_libc_do_in (state=@0x7fff5d880800
> <ma...@0x7fff5d880800> , from=0x7fff5d880820 "abc",
>     from_end=0x7fff5d880821 "bc", from_next=@0x7fff5d8807f8
> <ma...@0x7fff5d8807f8> , to=0x7fff5d880810,
>     to_limit=0x7fff5d880818, to_next=@0x7fff5d8807f0
> <ma...@0x7fff5d8807f0> )
>     at /work/stdcxx/trunk.1/src/wcodecvt.cpp:357
> 
>     372         const _RWSTD_SIZE_T ret = mbsrtowcs (pdst, &psrc,
> dst_len, &state);
>  
>     (gdb) print src_len
>     $6 = 1
>     (gdb) print dst_len
>     $7 = 2
> 
> I think it should actually use min(src_len, dst_len).
>  
> Thoughts?
>  
> Brad.
>  
>