Posted to issues@subversion.apache.org by "Nathan Hartman (Jira)" <ji...@apache.org> on 2019/11/13 03:27:00 UTC

[jira] [Commented] (SVN-807) gracefully degrade from failed charset conversion

    [ https://issues.apache.org/jira/browse/SVN-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972982#comment-16972982 ] 

Nathan Hartman commented on SVN-807:
------------------------------------

Daniel Shahaf tested this issue as follows:
{quote}{color:#172b4d}{{Well, I'm sure there are better ways, but I just did this:}}
{{.}}
{{ % svnadmin create r}}
{{ % vim -b r/db/revprops/0/0}}
{{.}}
{{and manually added an svn:log property with a value that's invalid UTF-8 [svn:* properties must use UTF-8 with LF line endings]:}}
{{.}}
{{ % xxd r/db/revprops/0/0 | vipe}}
{{ 00000000: 4b20 380a 7376 6e3a 6461 7465 0a56 2032 K 8.svn:date.V 2}}
{{ 00000010: 370a 3230 3139 2d31 312d 3131 5431 363a 7.2019-11-11T16:}}
{{ 00000020: 3038 3a30 312e 3334 3437 3434 5a0a 4b20 08:01.344744Z.K }}
{{ 00000030: 370a 7376 6e3a 6c6f 670a 5620 330a ffff 7.svn:log.V 3...}}
{{                                              ^^^^}}
{{ 00000040: ff0a 454e 440a ..END.}}
{{           ^^}}
{{ %}}
{{.}}
{{You can confirm it's invalid:}}
{{.}}
{{ % iconv -f utf8 < r/db/revprops/0/0 > /dev/null}}
{{ iconv: illegal input sequence at position 62}}
{{ zsh: exit 1 iconv -f utf8 < r/db/revprops/0/0 > /dev/null}}
{{.}}
{{'svn log' gives:}}
{{.}}
{{ % svn log file://$PWD/r }}
{{ ------------------------------------------------------------------------}}
{{ r0 | (no author) | 2019-11-11 16:08:01 +0000 (Mon, 11 Nov 2019) | 1 line}}

{{ ?\FF?\FF?\FF}}
{{ ------------------------------------------------------------------------}}
{{ %}}
{{.}}
{{So I think we can close it as "Fixed at some point"?}}{color}{quote}
See the dev@ mailing list thread ["Issue tracker cleanup: SVN-807"|https://mail-archives.apache.org/mod_mbox/subversion-dev/201911.mbox/%3c20191111163006.ewawdd6af5fnvqo7@tarpaulin.shahaf.local2%3e] (11 Nov 2019).
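
For anyone who would rather not hand-edit the file, the same check can be reproduced outside of Subversion. The following is only a sketch: it rebuilds the bytes shown in the xxd dump above (the names revprops and first_bad_offset are my own, and nothing here touches a real repository) and asks Python's UTF-8 decoder where the first invalid byte is.

```python
# Bytes of r/db/revprops/0/0, copied from the xxd dump quoted above.
# The svn:log value is the three invalid bytes ff ff ff.
revprops = (
    b"K 8\nsvn:date\n"
    b"V 27\n2019-11-11T16:08:01.344744Z\n"
    b"K 7\nsvn:log\n"
    b"V 3\n\xff\xff\xff\n"
    b"END\n"
)

try:
    revprops.decode("utf-8")
    first_bad_offset = None          # the whole file is valid UTF-8
except UnicodeDecodeError as exc:
    first_bad_offset = exc.start     # byte offset of the first invalid sequence

print(first_bad_offset)
```

The offset reported here should agree with the position 62 that iconv complains about in Daniel's transcript.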

This issue was fixed in r842879 with the addition of *svn_utf_cstring_from_utf8_fuzzy()*. With this change, bytes that cannot be converted are displayed as escaped hexadecimal codes, such as the ?\FF sequences in the 'svn log' output above. The fix predates Subversion 1.0.0.
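
To illustrate the idea (this is not Subversion's C implementation, only a rough Python sketch of the same kind of escaping), a decoder can replace each byte it cannot interpret as UTF-8 with a "?\XX" escape. The handler name "fuzzy_escape" and the function fuzzy_decode() are made up for this example; the escape format is taken from the 'svn log' output above.

```python
import codecs


def _escape_invalid(err):
    """Error handler: turn each undecodable byte into a ?\\XX escape."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Resume decoding right after the bytes we just escaped.
    return "".join("?\\%02X" % b for b in bad), err.end


# "fuzzy_escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("fuzzy_escape", _escape_invalid)


def fuzzy_decode(data: bytes) -> str:
    """Decode UTF-8, escaping invalid bytes instead of failing."""
    return data.decode("utf-8", errors="fuzzy_escape")


print(fuzzy_decode(b"\xff\xff\xff"))
```

Valid input passes through unchanged, so only the damaged part of a log message is affected. Note that the real function also has to deal with the client's locale, which this sketch ignores.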

Nothing more can be done if the log message contains invalid UTF-8, because such bytes cannot be converted to anything meaningful. This is unlikely to happen under normal circumstances: Subversion checks log messages for bad UTF-8 (and also mismatched line endings) at commit time and aborts the commit if it finds any.
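
As a rough picture of what such a commit-time check involves (a sketch only, with a hypothetical function name, and not the code Subversion actually runs), the two conditions are that the message decodes as UTF-8 and that it uses LF as its only line ending:

```python
def validate_log_message(raw: bytes) -> None:
    """Raise ValueError unless raw is valid UTF-8 with LF-only line endings.

    Hypothetical helper for illustration; Subversion's own check is in C.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(
            "log message is not valid UTF-8 (byte offset %d)" % exc.start
        ) from exc
    # Any carriage return means CRLF or bare-CR line endings are present.
    if b"\r" in raw:
        raise ValueError("log message must use LF line endings only")


validate_log_message(b"Fix typo in README\n")   # passes silently
```

A message like the hand-crafted one in Daniel's test would be rejected by the first condition, which is why it had to be written into the revprops file directly.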

From my reading it appears that this issue was left open because of a desire to move svn_utf_cstring_from_utf8_fuzzy() to Apache Portable Runtime (APR). If there is still interest in doing that, it should be tracked in a separate issue.

We are closing this issue as it has been fixed.

> gracefully degrade from failed charset conversion
> -------------------------------------------------
>
>                 Key: SVN-807
>                 URL: https://issues.apache.org/jira/browse/SVN-807
>             Project: Subversion
>          Issue Type: Bug
>    Affects Versions: all
>            Reporter: Karl Fogel
>            Priority: Minor
>             Fix For: unscheduled
>
>         Attachments: 1_brane-utf-8.mbox, 2_ulrich.mbox
>
>
> {noformat:nopanel=true}
> Right now, if a log message contains characters that cannot be
> represented in the client's locale, that log message will simply show
> up as:
>    "[unconvertible log msg]"
> Graceful degradation would be nice here :-).
> See the dev list thread "Re: converting unconvertible UTF-8 data" for
> discussion of possible solutions.
> My first idea was to write a fuzzy converter function that replaces
> every unconverted byte with an escape sequence representing its
> numerical code ("?\XXX" or somesuch).
> Then Ulrich Drepper pointed out that since this data is mainly for
> human consumption, the "//TRANSLIT" behavior of glibc's iconv and GNU
> libiconv would produce more readable output.  We can at least detect
> when we're using one of those iconv's and append that option to the
> to-charset string where appropriate.  (Marcus Comstedt points out that
> some iconv implementations automatically do transliteration for you,
> and don't even tell you whether or not it's happened, which is sort of
> unnerving.)
> However, if you are on a system that doesn't support this, you'll get
> the result above.
> So there are various non-mutually-exclusive steps to take here:
>    - Write the fuzzy function with the escape codes, use where translit not available.
>    - Meanwhile, get Subversion doing transliteration where possible (Ulrich may do)
>    - Possible early fix: make "svn log" accept --force or --message-encoding, so one
>       can make it output the raw bytes or a specific encoding, respectively.
> {noformat}
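
The original report is about the other direction from the test above: a valid UTF-8 log message containing characters that the client's locale cannot represent. A minimal sketch of the "fuzzy function with escape codes" idea for that case follows. The handler name, the function to_locale_fuzzy() and the exact escape format are assumptions made for illustration, and transliteration is not attempted.

```python
import codecs


def _escape_unencodable(err):
    """Error handler: escape characters the target charset cannot hold."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Show the UTF-8 bytes of each unrepresentable character as ?\XX.
    escaped = "".join("?\\%02X" % b for b in bad.encode("utf-8"))
    return escaped, err.end


# "escape_utf8_bytes" is a hypothetical handler name for this sketch.
codecs.register_error("escape_utf8_bytes", _escape_unencodable)


def to_locale_fuzzy(text: str, charset: str) -> bytes:
    """Encode text for display, escaping what charset cannot represent."""
    return text.encode(charset, errors="escape_utf8_bytes")


print(to_locale_fuzzy("caf\u00e9", "ascii"))
```

With a charset that can represent the whole message (for example latin-1 for the string above), the output contains no escapes at all, so the degradation only shows up where it is needed.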


