You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Julian Foad <ju...@btopenworld.com> on 2014/04/25 18:53:29 UTC

String formatting with APR_INT64_T etc. & gettext localization

Mattias Engdegård mentioned [1] that nobody has run the 'po-update.sh' script in trunk for some time, as it complains about some format strings it can't handle:

../libsvn_fs_fs/index.c:594: ... string ends in the middle of a directive.
../libsvn_fs_fs/index.c:654: ... string ends in the middle of a directive.
../libsvn_fs_fs/cached_data.c:914: ... string ends in the middle of a directive.

The code in each case is similar to:

  svn_error_createf(... _("... %" APR_UINT64_T_FMT "..."),
                        some_uint64_t_value);

The stupid thing is we have no easy and general solution at the moment.

Ideas:

1. Use PRIu64 etc. from <inttypes.h>.
2. Use a helper function to format an integer as a string; insert the result with "%s".
3. Double-formatting, using "...%%%s..." to insert the correct format specifier.
4. Modify 'gettext' to expand or recognize these macros.
5. Pre-process the source to expand these macros.


1. GNU gettext supports the C'99 PRI... macros from <inttypes.h>:

  #include <inttypes.h>
  svn_error_createf(..., _("... %" PRIu64 "..."),
                         some_uint64_t_value);

Limitations:

  * GNU gettext is hard-coded to recognize  the specific types that are 
    defined in <inttypes.h>, so it will not work for other similar 
    definitions of our own or from APR such as APR_OFF_T_FMT.

  * We'd have to arrange for <inttypes.h> or equivalent to be
    available on all platforms. (That would be useful anyway, and is not 
    fundamentally difficult.)

2. Use a helper function. (We do this in some places.)

  #define ui64toa(pool, i) apr_psprintf(pool, "%"APR_UINT64_T_FMT, i)
  svn_error_createf(..., _("... %s..."),
                         ui64toa(pool, some_uint64_t_value));

Limitations:

  * Requires a pool. Some of the current functions don't have one. We 
    could add one, but it's seems a bit stupid to have to do this, when the 
    'printf' code already has access to one.

An alternative:

  #define ui64toa(i) apr_psprintf(svn_pool_create(NULL), "%"APR_UINT64_T_FMT, i)
  svn_error_createf(..., _("... %s..."),
                         ui64toa(some_uint64_t_value));

But that doesn't release the allocated memory.

3. Double-formatting. (We do this in some places.)

  svn_error_createf(...,
                    apr_psprintf(pool, _("... %%%s ..."),APR_INT64_T_FMT),
                    some_uint64_t_value);

Limitations:

  * Prevents compile-time checking of argument types, and not just for the
    arguments that need special handling but for all the arguments.

  * Requires a pool.

  * Verbose.

4. Modify gettext.

This could be the best long-term solution. However, I'm not tackling it now.

5. Pre-process the source.

I tried this and it worked.


We can simplify the source code as in the following examples. (The first hunk fixes one of the present problems.)

[[[

Index: subversion/libsvn_fs_fs/cached_data.c
===================================================================
--- subversion/libsvn_fs_fs/cached_data.c    (revision 1589948)
+++ subversion/libsvn_fs_fs/cached_data.c    (working copy)
@@ -913,6 +913,6 @@
         return svn_error_createf(SVN_ERR_REPOS_CORRUPTED, NULL,
-                                 _("No representation found at offset %s "
-                                   "for item %" APR_UINT64_T_FMT
+                                 _("No representation found at offset %" APR_OFF_T_FMT
+                                   " for item %" APR_UINT64_T_FMT
                                    " in revision %ld"),
-                                 apr_off_t_toa(pool, offset),
+                                 offset,
                                  rep->item_index, rep->revision);
Index: subversion/libsvn_fs_x/noderevs.c
===================================================================
--- subversion/libsvn_fs_x/noderevs.c    (revision 1589948)
+++ subversion/libsvn_fs_x/noderevs.c    (working copy)
@@ -436,6 +436,4 @@
     return svn_error_createf(SVN_ERR_FS_CONTAINER_INDEX, NULL,
-                             apr_psprintf(pool,
-                                          _("Node revision index %%%s"
-                                            " exceeds container size %%d"),
-                                          APR_SIZE_T_FMT),
+                             _("Node revision index %" APR_SIZE_T_FMT
+                               " exceeds container size %d"),
                              idx, container->noderevs->nelts);
]]]

The patch:

[[[
Index: tools/po/po-update.sh
===================================================================
--- tools/po/po-update.sh    (revision 1589948)
+++ tools/po/po-update.sh    (working copy)
@@ -49,2 +49,29 @@

+# Process the content of file "$1" into a new file named "$2".
+preprocess_file()
+{
+  sed < "$1" \
+    -e 's/APR_SSIZE_T_FMT/"ld"/g' \
+    -e 's/APR_SIZE_T_FMT/"lu"/g' \
+    -e 's/APR_OFF_T_FMT/"ld"/g' \
+    -e 's/APR_PID_T_FMT/"d"/g' \
+    -e 's/APR_INT64_T_FMT/PRId64/g' \
+    -e 's/APR_UINT64_T_FMT/PRIu64/g' \
+    -e 's/APR_UINT64_T_HEX_FMT/PRIx64/g' \
+    -e 's/SVN_FILESIZE_T_FMT/PRId64/g' \
+    > "$2"
+}
+
+# For each file name on standard input, process the file's content into a
+# new file and write the new file's name on standard output.
+preprocess_files()
+{
+  while read F; do
+    G="preprocessed/${F#../}"
+    mkdir -p $(dirname "$G")
+    preprocess_file "$F" "$G"
+    echo "$G"
+  done
+}
+
 make_pot()
@@ -54,2 +82,3 @@
     (cd $svn_base/subversion/po && \
+    rm -r preprocessed && \
     find .. \
@@ -61,2 +90,3 @@
     -name "svn_fs_util.h" -print | \
+    preprocess_files | \
     $XGETTEXT --sort-by-file -k_ -kN_ -kQ_:1,2 -kSVN_ERRDEF:3 \
]]]

Thoughts?

- Julian


--
Join WANdisco's free daily demo sessions on Scaling Subversion for the Enterprise
<http://www.wandisco.com/training/webinars>


Re: String formatting with APR_INT64_T etc. & gettext localization

Posted by Julian Foad <ju...@btopenworld.com>.
Branko Čibej wrote:
>>>+    -e 's/APR_SSIZE_T_FMT/"ld"/g' \ 
>>>+    -e 's/APR_SIZE_T_FMT/"lu"/g' \ 
>>>+    -e 's/APR_OFF_T_FMT/"ld"/g' \ 
>> 
>> Surely these must be wrong on IL32P64 platforms?
> 
> It's wrong on all platforms, unless you also explicitly cast the
> values to the declared format types.

Yes, it's wrong to substitute a fixed literal string such as "%ld" before translation. (I'm not considering casting the values.)

Our available options are:

  * Substitute one of the <inttypes.h> tokens (PRId64, etc.) that 'gettext' handles.


  * Do what 'gettext' does for 'PRId64', which is arrange for the substitution to happen *after* translation at build time, as explained below.
So, let me explain how these ones work:

+    -e 's/APR_INT64_T_FMT/PRId64/g' \

The way 'gettext' handles PRId64 (etc.) is:

First, when 'gettext' is extracting translatable strings from the C source files into the .pot (PO template) file, it replaces the source token 'PRId64' with the literal string "<PRId64>". So the translators will see strings like "The number is %<PRId64>". The translators are expected to leave this part unchanged so the translation is for example "Le numéro est %<PRId64>".


Second, after the translations have been prepared in the .po files, 'msgfmt' substitutes "<PRId64>" in any translated string with one of "d", "ld", etc. in the resulting message catalog (.mo) file, according to the system 'msgfmt' is running on at build time (or packaging time).

Therefore the executable that looks up the strings at run time gets a translated string containing "%d" or "%ld" etc. depending on where the message catalogs were built. The idea is that these will match the corresponding strings that were compiled in to the executable directly, of course.

The documentation about that facility [1] is rather slim. I learnt the above from a quick read of the source code and the original patch [2]. As far as I can tell, though, we're safe to use this feature.


[1] <http://www.gnu.org/software/gettext/manual/html_node/Preparing-Strings.html>
[2] <http://www.sourceware.org/ml/libc-alpha/2002-07/msg00202.html>

Whether we can use that for APR_OFF_T etc. depends on whether we can determine, at build time, what recognized <inttypes.h> type to map those to. I haven't confirmed but think we probably can.

An alternative option is to emulate the functionality that's built in to 'gettext', for our own types, by pre-processing 'APR_OFF_T_FMT' to '<APR_OFF_T_FMT>' and then post-process at 'msgfmt' time from that to the correct replacement string. In effect, this approach is exactly the same as hacking the 'gettext' system to support APR_OFF_T_FMT etc., except we could do it as pre-processing and post-processing steps.

Any further thoughts on whether this approach is feasible?

- Julian


Re: String formatting with APR_INT64_T etc. & gettext localization

Posted by Mattias Engdegård <ma...@bredband.net>.
1 maj 2014 kl. 08.13 skrev Dongsheng Song:

> _("The repository trunk version is %"PRId64".")
>
> Will translate to:
> _("The repository trunk version is %lld.")  // ILP32
> _("The repository trunk version is %I64d.") // MSVC
> _("The repository trunk version is %ld.") // LP64

No, gettext recognises standard format macros, such as PRId64, in  
localised strings.
The above string would appear in the message catalogue as

msgid "The repository trunk version is %<PRId64>."

to be translated in the obvious way, keeping the %<PRId64> in the  
string.


Re: String formatting with APR_INT64_T etc. & gettext localization

Posted by Dongsheng Song <do...@gmail.com>.
I'm prefer use helper function to translate numbers to strings.
Why ? Because without this approach, the same source will became to
different translate source on different platforms. e.g.

_("The repository trunk version is %"PRId64".")

Will translate to:
_("The repository trunk version is %lld.")  // ILP32
_("The repository trunk version is %I64d.") // MSVC
_("The repository trunk version is %ld.") // LP64

Then how the translator do ?

Re: String formatting with APR_INT64_T etc. & gettext localization

Posted by Julian Foad <ju...@btopenworld.com>.
Branko Čibej wrote:
>>>+    -e 's/APR_SSIZE_T_FMT/"ld"/g' \ 
>>>+    -e 's/APR_SIZE_T_FMT/"lu"/g' \ 
>>>+    -e 's/APR_OFF_T_FMT/"ld"/g' \ 
>> 
>> Surely these must be wrong on IL32P64 platforms?
> 
> It's wrong on all platforms, unless you also explicitly cast the
> values to the declared format types.

Yes, it's wrong to substitute a fixed literal string such as "%ld" before translation. (I'm not considering casting the values.)

Our available options are:

  * Substitute one of the <inttypes.h> tokens (PRId64, etc.) that 'gettext' handles.


  * Do what 'gettext' does for 'PRId64', which is arrange for the 
substitution to happen *after* translation at build time, as explained 
below.
So, let me explain how these ones work:

+    -e 's/APR_INT64_T_FMT/PRId64/g' \

The way 'gettext' handles PRId64 (etc.) is:

First, when 'gettext' is extracting translatable strings from the C source files into the .pot (PO template) file, it replaces the source token 'PRId64' with the literal string 
"<PRId64>". So the translators will see strings like "The number 
is %<PRId64>". The translators are expected to leave this part 
unchanged so the translation is for example "Le numéro est 
%<PRId64>".


Second, after the translations have been prepared in the .po files, 'msgfmt' 
substitutes "<PRId64>" in any translated string with one of "d", 
"ld", etc. in the resulting message catalog (.mo) file, according to the system 'msgfmt' is running on at build time (or packaging time).

Therefore the executable that looks up the strings at run time gets a translated 
string containing "%d" or "%ld" etc. depending on where the message 
catalogs were built. The idea is that these will match the corresponding strings that were compiled in to the executable directly, of course.

The documentation about that facility [1] is rather slim. I learnt the 
above from a quick read of the source code and the original patch [2]. 
As far as I can tell, though, we're safe to use this feature.


[1] <http://www.gnu.org/software/gettext/manual/html_node/Preparing-Strings.html>
[2] <http://www.sourceware.org/ml/libc-alpha/2002-07/msg00202.html>

Whether we can use that for APR_OFF_T etc. depends on whether we can determine, at build time, what recognized <inttypes.h> type to map those to. I haven't confirmed but think we probably can.

An alternative option is to emulate the functionality that's built in to 
'gettext', for our own types, by pre-processing 'APR_OFF_T_FMT' to 
'<APR_OFF_T_FMT>' and then post-process at 'msgfmt' time from that to the correct replacement string. In effect, this approach is exactly 
the same as hacking the 'gettext' system to support APR_OFF_T_FMT etc., except we could do it as pre-processing and post-processing steps.

Any further thoughts on whether this approach is feasible?

- Julian


Re: String formatting with APR_INT64_T etc. & gettext localization

Posted by Branko Čibej <br...@wandisco.com>.
On 25.04.2014 22:31, Mattias Engdegård wrote:
> 25 apr 2014 kl. 18.53 skrev Julian Foad:
>
>> 1. Use PRIu64 etc. from <inttypes.h>.
>
> Good, if you can get it all to work on Windows.
>
>> 2. Use a helper function to format an integer as a string; insert the
>> result with "%s".
>
> Probably the most practical solution for the time being.

Ack.

>> +    -e 's/APR_SSIZE_T_FMT/"ld"/g' \
>> +    -e 's/APR_SIZE_T_FMT/"lu"/g' \
>> +    -e 's/APR_OFF_T_FMT/"ld"/g' \
>
> Surely these must be wrong on IL32P64 platforms?

It's wrong on all platforms, unless you also explicitly cast the values
to the declared format types.

> (Why not use z?)

Because that's not a standard printf type tag in C90. And even if it
were it'd still be wrong.

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. brane@wandisco.com

Re: String formatting with APR_INT64_T etc. & gettext localization

Posted by Mattias Engdegård <ma...@bredband.net>.
>>>   1. Use PRIu64 etc. from <inttypes.h>.
>>
>> Good, if you can get it all to work on Windows.
>
> No problem with a bit of configury magic. See for example <http://code.google.com/p/msinttypes/ 
> >.

Perhaps – I was mainly thinking about gettext and its runtime. A non- 
standard build of it may be required then. (I have never attempted to  
build gettext on Windows; it could be utterly trivial for all I know.)

> The patch is a proof of concept, not a final delivery. Sorry, I  
> should have said that.

I'm the one who should apologise for not understanding that.

Thank you for pointing out that gettext actually does parse the PRIu64  
macros; I didn't know that. The resulting strings in the message  
catalog are quite readable too: _("Scrooge has %" PRIu64 " sacks of  
gold") becomes "Scrooge has %<PRIu64> sacks of gold", which is fine.

However, preprocessing still seems fragile and messy as well as  
unnecessary. If we are going to rely on inttypes.h, we could just as  
well use those macros in the source directly, can't we?


Re: String formatting with APR_INT64_T etc. & gettext localization

Posted by Julian Foad <ju...@btopenworld.com>.
Mattias Engdegård wrote:
> 25 apr 2014 kl. 18.53 skrev Julian Foad:
> 
>>  1. Use PRIu64 etc. from <inttypes.h>.
> 
> Good, if you can get it all to work on Windows.

No problem with a bit of configury magic. See for example <http://code.google.com/p/msinttypes/>.

>>  2. Use a helper function to format an integer as a string; insert the 
> result with "%s".
> 
> Probably the most practical solution for the time being.

It certainly gives the desired result for the least work in the short term. I'm trying to think a bit longer term.

>>  3. Double-formatting, using "...%%%s..." to insert the correct 
> format specifier.
> 
> In addition to the problems you mentioned with this approach, it yields 
> unnecessarily messy strings for the translators to deal with, and possibly 
> misunderstand.

Good point.

>>  4. Modify 'gettext' to expand or recognize these macros.
>>  5. Pre-process the source to expand these macros.
> 
> Sounds fragile, and again, it needs to be verified on Windows.
> 
>>  +    -e 's/APR_SSIZE_T_FMT/"ld"/g' \
>>  +    -e 's/APR_SIZE_T_FMT/"lu"/g' \
>>  +    -e 's/APR_OFF_T_FMT/"ld"/g' \
> 
> Surely these must be wrong on IL32P64 platforms? (Why not use z?)
> 
>>  +# For each file name on standard input, process the file's content into a
>>  +# new file and write the new file's name on standard output.
> 
> Perhaps I have misunderstood your patch entirely, but will that really produce 
> the right file names in the resulting .pot file?

The patch is a proof of concept, not a final delivery. Sorry, I should have said that.

It needs to use the correct replacement strings, of course, depending on the build platform. This can be done by configury magic. I haven't implemented that yet. And it should put the correct paths to the source file names, as you say.

At this stage, the question is whether this looks feasible and useful, or if any of these issues are insurmountable or undesirable.

- Julian


Re: String formatting with APR_INT64_T etc. & gettext localization

Posted by Mattias Engdegård <ma...@bredband.net>.
25 apr 2014 kl. 18.53 skrev Julian Foad:

> 1. Use PRIu64 etc. from <inttypes.h>.

Good, if you can get it all to work on Windows.

> 2. Use a helper function to format an integer as a string; insert  
> the result with "%s".

Probably the most practical solution for the time being.

> 3. Double-formatting, using "...%%%s..." to insert the correct  
> format specifier.

In addition to the problems you mentioned with this approach, it  
yields unnecessarily messy strings for the translators to deal with,  
and possibly misunderstand.

> 4. Modify 'gettext' to expand or recognize these macros.
> 5. Pre-process the source to expand these macros.

Sounds fragile, and again, it needs to be verified on Windows.

> +    -e 's/APR_SSIZE_T_FMT/"ld"/g' \
> +    -e 's/APR_SIZE_T_FMT/"lu"/g' \
> +    -e 's/APR_OFF_T_FMT/"ld"/g' \

Surely these must be wrong on IL32P64 platforms? (Why not use z?)

> +# For each file name on standard input, process the file's content  
> into a
> +# new file and write the new file's name on standard output.

Perhaps I have misunderstood your patch entirely, but will that really  
produce the right file names in the resulting .pot file?