Posted to dev@subversion.apache.org by Mark Phippard <ma...@gmail.com> on 2011/09/08 20:07:03 UTC

[RFC] - Proper encoding for patch file?

This is a JavaHL issue.  See the attached patch which resolves the
problem I face.

If I use the JavaHL diff API to produce a patch it fails if there are
paths in the patch with UTF8 characters in the name.  Here is an
example of the Exception:

    Invalid argument
svn: Can't convert string from 'UTF-8' to native encoding:
svn: Index: ?\230?\181?\139?\232?\175?\149?\230?\150?\135?\228?\187?\182.txt
===================================================================

RA layer request failed
svn: Error reading spooled REPORT request response


The problem seems to be that JavaHL creates the output file for the
patch with the encoding of SVN_APR_LOCALE_CHARSET.  If I change this
to "utf-8" as shown in the patch then the method works.

The command line client from the same system works fine.

How do people feel about this?  Does it make sense that JavaHL should
create the patch file with UTF-8 encoding?  I tend to think it does,
but thought I would raise the question here.
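
For illustration only, here is a minimal Java sketch of the failure mode
described above: a strict conversion of a diff header to a charset that
cannot represent the file name is rejected, while UTF-8 accepts it. This is
not JavaHL code. The class and method names are hypothetical, and ISO-8859-1
merely stands in for whatever the native encoding happens to be.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncodeDemo {
    /** Returns true if every character of text can be encoded in cs without loss. */
    static boolean encodesCleanly(String text, Charset cs) {
        try {
            cs.newEncoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            // An unmappable character was reported instead of being replaced.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical header line with a four-character CJK file name.
        String header = "Index: \u6d4b\u8bd5\u6587\u4ef6.txt";
        System.out.println("UTF-8:      " + encodesCleanly(header, StandardCharsets.UTF_8));
        // ISO-8859-1 is an assumed stand-in for a non-Unicode native encoding.
        System.out.println("ISO-8859-1: " + encodesCleanly(header, StandardCharsets.ISO_8859_1));
    }
}
```

Writing the patch as UTF-8 sidesteps the strict conversion entirely, which is
the effect the proposed change aims for.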

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: [RFC] - Proper encoding for patch file?

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Sep 8, 2011 at 2:30 PM, Mark Phippard <ma...@gmail.com> wrote:
> On Thu, Sep 8, 2011 at 2:27 PM, C. Michael Pilato <cm...@collab.net> wrote:
>> Why does the command-line client work?  Does it not also use the locale
>> encoding for its diff headers?  At any rate, consistency between the
>> behaviors of the relevant Java and C APIs seems like a reasonable goal.
>
> I have not tested exhaustively, but my OSX Terminal says UTF-8 is the
> default encoding.  Maybe that is why I do not see it from command
> line?

Changed Terminal to use MacOS Roman as default encoding.  Now I get this:

$ svn diff
subversion/svn/diff-cmd.c:373: (apr_err=22)
subversion/libsvn_client/diff.c:1989: (apr_err=22)
subversion/libsvn_client/diff.c:1667: (apr_err=22)
subversion/libsvn_wc/diff_local.c:560: (apr_err=22)
subversion/libsvn_wc/status.c:2364: (apr_err=22)
subversion/libsvn_wc/status.c:1171: (apr_err=22)
subversion/libsvn_wc/status.c:1157: (apr_err=22)
subversion/libsvn_wc/diff_local.c:474: (apr_err=22)
subversion/libsvn_wc/diff_local.c:474: (apr_err=22)
subversion/libsvn_wc/diff_local.c:419: (apr_err=22)
subversion/libsvn_client/diff.c:1098: (apr_err=22)
subversion/libsvn_client/diff.c:1012: (apr_err=22)
subversion/libsvn_subr/stream.c:248: (apr_err=22)
subversion/libsvn_subr/utf.c:775: (apr_err=22)
subversion/libsvn_subr/utf.c:580: (apr_err=22)
svn: E000022: Can't convert string from 'UTF-8' to native encoding:
subversion/libsvn_subr/utf.c:578: (apr_err=22)
svn: E000022: Index: Design
Documents/?\230?\181?\139?\232?\175?\149?\230?\150?\135?\228?\187?\182.txt
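
The escaped file name in the error can be turned back into readable text. The
sketch below assumes each "?\ddd" group is one byte written as three decimal
digits, which the groups of three in the message suggest, and that the bytes
are UTF-8. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class FuzzyUnescape {
    /** Converts "?\ddd" groups (assumed decimal byte values) to bytes and decodes them as UTF-8. */
    static String unescape(String escaped) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            if (isEscape(escaped, i)) {
                // Three decimal digits follow the "?\" prefix; only the low 8 bits are kept.
                bytes.write(Integer.parseInt(escaped.substring(i + 2, i + 5)));
                i += 5;
            } else {
                // Everything else is assumed to be plain ASCII and copied as-is.
                bytes.write(escaped.charAt(i));
                i++;
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** True if s has '?', '\', and three ASCII digits starting at index i. */
    private static boolean isEscape(String s, int i) {
        if (i + 5 > s.length() || s.charAt(i) != '?' || s.charAt(i + 1) != '\\') {
            return false;
        }
        for (int k = i + 2; k < i + 5; k++) {
            char c = s.charAt(k);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String escaped =
            "?\\230?\\181?\\139?\\232?\\175?\\149?\\230?\\150?\\135?\\228?\\187?\\182.txt";
        System.out.println(unescape(escaped));
    }
}
```

Under those assumptions the twelve escaped bytes decode to four CJK characters
(U+6D4B U+8BD5 U+6587 U+4EF6) followed by ".txt", none of which exist in
MacOS Roman.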



-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: [RFC] - Proper encoding for patch file?

Posted by Hyrum K Wright <hy...@wandisco.com>.
On Thu, Sep 8, 2011 at 1:49 PM, Mark Phippard <ma...@gmail.com> wrote:
> On Thu, Sep 8, 2011 at 2:30 PM, Mark Phippard <ma...@gmail.com> wrote:
>> On Thu, Sep 8, 2011 at 2:27 PM, C. Michael Pilato <cm...@collab.net> wrote:
>>> Why does the command-line client work?  Does it not also use the locale
>>> encoding for its diff headers?  At any rate, consistency between the
>>> behaviors of the relevant Java and C APIs seems like a reasonable goal.
>>
>> I have not tested exhaustively, but my OSX Terminal says UTF-8 is the
>> default encoding.  Maybe that is why I do not see it from command
>> line?
>
> FWIW, even if I explicitly set LANG=en_US.UTF-8 before launching Java,
> and even if I change all of the JVM properties to make UTF-8 the
> default encoding for files for the JVM, I still get this error.  So
> JavaHL does not seem to pick up the environment in the same way as the
> command line.

FWIW, JavaHL is just using SVN_APR_LOCALE_CHARSET, which is a magic
number inside of APR.  I've no idea what it actually does.

-Hyrum


-- 

uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com/

Re: [RFC] - Proper encoding for patch file?

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Sep 8, 2011 at 2:30 PM, Mark Phippard <ma...@gmail.com> wrote:
> On Thu, Sep 8, 2011 at 2:27 PM, C. Michael Pilato <cm...@collab.net> wrote:
>> Why does the command-line client work?  Does it not also use the locale
>> encoding for its diff headers?  At any rate, consistency between the
>> behaviors of the relevant Java and C APIs seems like a reasonable goal.
>
> I have not tested exhaustively, but my OSX Terminal says UTF-8 is the
> default encoding.  Maybe that is why I do not see it from command
> line?

FWIW, even if I explicitly set LANG=en_US.UTF-8 before launching Java,
and even if I change all of the JVM properties to make UTF-8 the
default encoding for files for the JVM, I still get this error.  So
JavaHL does not seem to pick up the environment in the same way as the
command line.
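
For anyone reproducing this, a small sketch that prints what the JVM itself
reports for the settings mentioned above. The class name is hypothetical, and
"sun.jnu.encoding" is an implementation-specific property that may be absent
on some JVMs. These values only describe the Java side; whether the native
library consults any of them is not established here.

```java
import java.nio.charset.Charset;

class ShowJvmEncoding {
    /** Collects the JVM's view of the locale and default encodings in one line. */
    static String describe() {
        return "LANG=" + System.getenv("LANG")
             + " file.encoding=" + System.getProperty("file.encoding")
             // Implementation-specific property; prints "null" where it is not defined.
             + " sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding")
             + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If this reports UTF-8 everywhere and the diff still fails, the encoding used
for the output file is presumably being determined somewhere other than the
JVM.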

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: [RFC] - Proper encoding for patch file?

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Sep 8, 2011 at 2:27 PM, C. Michael Pilato <cm...@collab.net> wrote:
> Why does the command-line client work?  Does it not also use the locale
> encoding for its diff headers?  At any rate, consistency between the
> behaviors of the relevant Java and C APIs seems like a reasonable goal.

I have not tested exhaustively, but my OSX Terminal says UTF-8 is the
default encoding.  Maybe that is why I do not see it from command
line?

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: [RFC] - Proper encoding for patch file?

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 09/08/2011 02:07 PM, Mark Phippard wrote:
> This is a JavaHL issue.  See the attached patch which resolves the
> problem I face.
> 
> If I use the JavaHL diff API to produce a patch it fails if there are
> paths in the patch with UTF8 characters in the name.  Here is an
> example of the Exception:
> 
>     Invalid argument
> svn: Can't convert string from 'UTF-8' to native encoding:
> svn: Index: ?\230?\181?\139?\232?\175?\149?\230?\150?\135?\228?\187?\182.txt
> ===================================================================
> 
> RA layer request failed
> svn: Error reading spooled REPORT request response
> 
> 
> The problem seems to be that JavaHL creates the output file for the
> patch with the encoding of SVN_APR_LOCALE_CHARSET.  If I change this
> to "utf-8" as shown in the patch then the method works.
> 
> The command line client from the same system works fine.
> 
> How do people feel about this?  Does it make sense that JavaHL should
> create the patch file with UTF-8 encoding?  I tend to think it does,
> but thought I would raise the question here.

Why does the command-line client work?  Does it not also use the locale
encoding for its diff headers?  At any rate, consistency between the
behaviors of the relevant Java and C APIs seems like a reasonable goal.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: [RFC] - Proper encoding for patch file?

Posted by Branko Čibej <br...@xbc.nu>.
On 08.09.2011 20:07, Mark Phippard wrote:
> This is a JavaHL issue.  See the attached patch which resolves the
> problem I face.
>
> If I use the JavaHL diff API to produce a patch it fails if there are
> paths in the patch with UTF8 characters in the name.  Here is an
> example of the Exception:
>
>     Invalid argument
> svn: Can't convert string from 'UTF-8' to native encoding:
> svn: Index: ?\230?\181?\139?\232?\175?\149?\230?\150?\135?\228?\187?\182.txt
> ===================================================================
>
> RA layer request failed
> svn: Error reading spooled REPORT request response
>
>
> The problem seems to be that JavaHL creates the output file for the
> patch with the encoding of SVN_APR_LOCALE_CHARSET.  If I change this
> to "utf-8" as shown in the patch then the method works.
>
> The command line client from the same system works fine.
>
> How do people feel about this?  Does it make sense that JavaHL should
> create the patch file with UTF-8 encoding?  I tend to think it does,
> but thought I would raise the question here.
>

Unfortunately, on Linux (and other *ix), the filename encoding is just a
convention. So there's no guarantee that the filename is in fact UTF-8,
even if the locale says it should be. Therefore, just writing the file
names to the patch file unchanged ("in UTF-8") will not in fact do the
right thing in exactly the kind of corner case that's triggering this error.

The only marginally sane solution is to include complete Unicode
normalization and transliteration libraries in Subversion ... and use
them correctly. I expect that'd mean storing the actual transliterated
filename in the WC database alongside the original UTF-8 value that came
from the repository, because transliteration is in general not reversible.
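
A small Java sketch of both points, using the JDK's java.text.Normalizer. It
is an illustration only, with hypothetical names: the ASCII folding below is a
crude stand-in for real transliteration, not a proposal for how Subversion
should do it.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizationDemo {
    /** Composed form: one code point per accented letter where possible. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Decomposed form: base letter followed by combining marks. */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Lossy ASCII approximation: decompose, then drop every non-ASCII character. */
    static String toAscii(String s) {
        return nfd(s).replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        String composed = "caf\u00e9.txt";
        String decomposed = nfd(composed);
        // Same visible name, different code points and different UTF-8 byte counts.
        System.out.println(composed.equals(decomposed));
        System.out.println(composed.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length);
        // Two distinct names collapse to the same ASCII name.
        System.out.println(toAscii(composed).equals(toAscii("cafe.txt")));
    }
}
```

Because "caf\u00e9.txt" and "cafe.txt" fold to the same ASCII name, the folded
name alone cannot be mapped back to the original, which is why both values
would need to be stored.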

-- Brane

P.S.: As an added bonus, that would allow us to "transliterate"
characters that are invalid on some particular filesystem, if they
happen to appear in names in the repository.

Re: [RFC] - Proper encoding for patch file?

Posted by Stefan Sperling <st...@elego.de>.
On Thu, Sep 08, 2011 at 02:07:03PM -0400, Mark Phippard wrote:
> This is a JavaHL issue.  See the attached patch which resolves the
> problem I face.
> 
> If I use the JavaHL diff API to produce a patch it fails if there are
> paths in the patch with UTF8 characters in the name.  Here is an
> example of the Exception:
> 
>     Invalid argument
> svn: Can't convert string from 'UTF-8' to native encoding:
> svn: Index: ?\230?\181?\139?\232?\175?\149?\230?\150?\135?\228?\187?\182.txt
> ===================================================================

This might be related to the following TODO comment in libsvn_client/patch.c.
In other words, this is a known limitation of the current implementation.

[[[
static svn_error_t *
grab_filename(const char **file_name, const char *line, apr_pool_t *result_pool,
              apr_pool_t *scratch_pool)
{
  const char *utf8_path;
  const char *canon_path;

  /* Grab the filename and encode it in UTF-8. */
  /* TODO: Allow specifying the patch file's encoding.
   *       For now, we assume its encoding is native. */
  /* ### This can fail if the filename cannot be represented in the current
   * ### locale's encoding. */
  SVN_ERR(svn_utf_cstring_to_utf8(&utf8_path,
                                  line,
                                  scratch_pool));

]]]

Re: [RFC] - Proper encoding for patch file?

Posted by Mark Phippard <ma...@gmail.com>.
I should point out this is on OSX.  The results on Windows are more interesting:

1. Unlike OSX, on Windows the API completes without error.

2. However, the paths in the index show ??? in place of the UTF-8
characters.

3. But the content within the patch shows up fine.

So this seems like another data point in favor of just telling SVN to
output as UTF-8 since it seems to only apply to the pathnames.
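
The ??? symptom looks like what a lossy conversion with replacement produces,
as opposed to the strict conversion that errors out on OSX. The Java sketch
below shows that effect in isolation. It is an assumption that this resembles
what happens on Windows, US-ASCII merely stands in for a code page that lacks
the characters, and the names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodeDemo {
    /** Encodes text in cs, replacing unmappable characters, then decodes it back. */
    static String roundTrip(String text, Charset cs) {
        // String.getBytes(Charset) substitutes the charset's replacement byte
        // for anything it cannot represent instead of raising an error.
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String name = "\u6d4b\u8bd5\u6587\u4ef6.txt";
        // US-ASCII is an assumed stand-in for a code page without these characters.
        System.out.println(roundTrip(name, StandardCharsets.US_ASCII));
        // UTF-8 round-trips the name unchanged.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name));
    }
}
```

Either way the path written to the patch is not usable, so the OSX error and
the Windows ??? output would be two faces of the same problem.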

Comments?



On Thu, Sep 8, 2011 at 2:07 PM, Mark Phippard <ma...@gmail.com> wrote:
> This is a JavaHL issue.  See the attached patch which resolves the
> problem I face.
>
> If I use the JavaHL diff API to produce a patch it fails if there are
> paths in the patch with UTF8 characters in the name.  Here is an
> example of the Exception:
>
>    Invalid argument
> svn: Can't convert string from 'UTF-8' to native encoding:
> svn: Index: ?\230?\181?\139?\232?\175?\149?\230?\150?\135?\228?\187?\182.txt
> ===================================================================
>
> RA layer request failed
> svn: Error reading spooled REPORT request response
>
>
> The problem seems to be that JavaHL creates the output file for the
> patch with the encoding of SVN_APR_LOCALE_CHARSET.  If I change this
> to "utf-8" as shown in the patch then the method works.
>
> The command line client from the same system works fine.
>
> How do people feel about this?  Does it make sense that JavaHL should
> create the patch file with UTF-8 encoding?  I tend to think it does,
> but thought I would raise the question here.
>
> --
> Thanks
>
> Mark Phippard
> http://markphip.blogspot.com/
>



-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/