Posted to dev@subversion.apache.org by Erik Huelsmann <e....@gmx.net> on 2004/05/13 18:13:52 UTC

Stripping 'charset=' from po files [the sequal]

In order to prevent charset conversion by 'smart' gettext implementations
our build system has to strip out the 'charset=UTF-8' string in the
administrative section of po files.  The Makefile based system currently
does this by ripping out the entire 'Content-Type' line using 'sed'.


The Windows (python based) build system does not provide sed.  To work
around that I wrote the general python based po parser included below.  A
separate script does the real stripping.  This also provides the (cleaner)
solution to only examine the admin section.


There are several questions to be answered before proceeding:

1) We don't want to use the same script for the Makefile build (adding a new
dependency), do we?

2)
 a) Do we want the po parser in the Subversion repository?
 b) If so: where?


3) Do you have any comments on either script? (the strip charset script has
to be extended to include plural support before this code can be committed)


bye,

Erik.


start of the parser ===============
import string




class PoSink:
  def __init__(self):
    self.domain = None

  def recv_domain(self, domain):
    self.domain = domain

  def recv_simple_msg(self, pre_comment, msgid, msgstr):
    pass
  def recv_plural_msg(self, pre_comment, msgid, plural, msgstr_order,
                      msgstrs):
    pass
  def finish_parse(self):
    pass




# implement a token-parser
#
# the tokens are defined as (<EOL> = end of line, <EOF> = end of input)
#
# COMMENT : #(.*)(<EOL>|<EOF>)
# STRING  : "<any character including escaped ">*"
# INDEX   : '[' NUMBER ']'
# NUMBER  : [0-9]+
# other   : [a-zA-Z0-9_]+

TOKEN_CHUNK_SIZE = 100 * 1024 # 100kiB
OTHER_TOKEN_CHARS = string.letters + string.digits + '_'

class PoTokens:
  def __init__(self, inp):
    self.inp = inp
    self.buf = inp.read(TOKEN_CHUNK_SIZE)
    self.idx = -1

  def get(self):

    # skip initial whitespace
    while 1:
      self.idx += 1

      while self.idx < len(self.buf) and \
                self.buf[self.idx] in string.whitespace:
        self.idx += 1

      if self.idx == len(self.buf):
        self.buf = self.inp.read(TOKEN_CHUNK_SIZE)
        self.idx = -1
        if self.buf == "":
          del self.buf
          return ""
        # restart the scan at the beginning of the fresh chunk
        continue

      break

    start = self.idx

    # string "token"
    if self.buf[start] == "\"":
      token = ""

      end = self.buf.find('"', start+1)
      while 1:
        while end > -1 and self.buf[end - 1] == "\\":
          end = self.buf.find('"', end + 1)

        if end == -1:
          token += self.buf[start:]

          self.buf = self.inp.read(TOKEN_CHUNK_SIZE)
          if not self.buf:
            raise "Unexpected EOF; unterminated string."

          end = self.buf.find('"')
          start = 0
          continue

        self.idx = end
        return token + self.buf[start:end+1]

    # comment "token"
    if self.buf[start] == "#":
      token = ""

      while 1:
        end = self.buf.find("\n", start+1)

        if end == -1:
          token += self.buf[start:]

          self.buf = self.inp.read(TOKEN_CHUNK_SIZE)
          if not self.buf:
            del self.buf
            return token

          start = 0
          continue

        self.idx = end
        return token + self.buf[start:end]

    # msgstr "[INDEX]" "token"
    if self.buf[start] == '[':
      token = "["
      # step past the '[' so the loops below see the index digits
      self.idx += 1
      start = self.idx

      while 1:
        while self.idx < len(self.buf) and \
                  self.buf[self.idx] in string.whitespace:
          self.idx += 1

        if self.idx == len(self.buf):
          self.buf = self.inp.read(TOKEN_CHUNK_SIZE)

          if not self.buf:
            raise "Unexpected EOF while parsing a msgstr INDEX"

          self.idx = start = 0
          continue

        break

      while 1:
        while self.idx < len(self.buf) and \
                  self.buf[self.idx] in string.digits:
          self.idx += 1

        if self.idx == len(self.buf):
          token += self.buf[start:]
          self.buf = self.inp.read(TOKEN_CHUNK_SIZE)

          if not self.buf:
            raise "Unexpected EOF in msgstr INDEX"

          self.idx = start = 0

        token += self.buf[start:self.idx]
        break

      while 1:
        while self.idx < len(self.buf) and \
                  self.buf[self.idx] in string.whitespace:
          self.idx += 1

        if self.idx == len(self.buf):
          self.buf = self.inp.read(TOKEN_CHUNK_SIZE)

          if not self.buf:
            raise "Unexpected EOF while parsing a msgstr INDEX"

          self.idx = start = 0
          continue

        if self.buf[self.idx] == ']':
          return token + ']'
        else:
          raise "Unexpected character while parsing a msgstr INDEX"

    # character series token
    if self.buf[start] in OTHER_TOKEN_CHARS:
      token = ""

      while 1:
        while self.idx < len(self.buf) and \
                  self.buf[self.idx] in OTHER_TOKEN_CHARS:
          self.idx += 1

        if self.idx == len(self.buf):
          token += self.buf[start:]

          self.buf = self.inp.read(TOKEN_CHUNK_SIZE)

          if not self.buf:
            return token

          self.idx = start = 0
          continue

        token += self.buf[start:self.idx]
        # leave idx on the token's last character; get() advances it first
        self.idx -= 1
        return token

    # unknown token starting character
    raise "Unexpected character in input stream (%s)" % self.buf[start]

  def unget(self, token):
    def reget(self=self, ungot=token):
      del self.get

      return ungot

    self.get = reget




def parse(inp, sink):

  def get_msg_argument(arg_to):
    rv = []
    token = inp.get()
    while token and token[0] == '"':
      rv += [ token ]
      token = inp.get()

    inp.unget(token)

    if len(rv) == 0:
      raise "Expected %s argument found other token instead" % arg_to

    return rv

  comment = []
  while 1:
    token = inp.get()

    if not token: # EOF
      sink.finish_parse()
      return

    if token[0] == '#':
      comment += [ token ]

      continue

    if token.lower() == 'domain':
      token = inp.get()

      if token[0] in string.letters + string.digits + '_':
        sink.recv_domain(token)

      else:
        raise "Invalid token where domain name expected"

      continue

    if token.lower() == 'msgid':
      msgid = get_msg_argument('msgid')
      msgid_plural = []
      msgstr_indices = []
      msgstrs = {}

      token = inp.get()
      if token.lower() == 'msgid_plural':
        msgid_plural = get_msg_argument('msgid_plural')
        token = inp.get()

      if msgid_plural:
        while token.lower() == 'msgstr':
          token = inp.get()

          if token[:1] != '[':
            raise "msgstr INDEX expected when msgid_plural defined"

          msgstr_indices += [ token[1:-1] ]
          msgstrs[token[1:-1]] = get_msg_argument('msgstr[INDEX]')

          token = inp.get()

        if len(msgstr_indices) == 0:
          raise "msgstr expected after msgid_plural"

        inp.unget(token)

        sink.recv_plural_msg(comment, msgid, msgid_plural,
                             msgstr_indices, msgstrs)
        comment = []
        continue

      else: # not msgid_plural
        if not token.lower() == "msgstr":
          raise "Unexpected token where 'msgstr' expected"

        sink.recv_simple_msg(comment, msgid, get_msg_argument('msgstr'))

        comment = []
        continue

    raise "Unknown token (%s)" % token

end of the parser ===============
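
For comparison, the token definitions in the parser's header comment can also be written as a single regular expression. This is only a sketch in present-day Python, not the code posted above: the names TOKEN_RE and tokenize are made up for illustration, and it assumes the whole input is available as one string instead of being read in chunks.

```python
import re

# One alternative per token class from the grammar comment in the parser.
TOKEN_RE = re.compile(r'''
    (?P<comment>\#[^\n]*)               # COMMENT: hash sign up to end of line
  | (?P<string>"(?:\\.|[^"\\])*")       # STRING: quoted, backslash escapes allowed
  | (?P<index>\[\s*[0-9]+\s*\])         # INDEX: bracketed NUMBER
  | (?P<other>[A-Za-z0-9_]+)            # other: keywords such as msgid, msgstr
  | (?P<space>\s+)                      # whitespace between tokens, skipped
''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, token) pairs; raise ValueError on an unexpected character."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("Unexpected character in input (%r)" % text[pos])
        pos = match.end()
        if match.lastgroup != 'space':
            yield match.lastgroup, match.group()
```

A caller would iterate over tokenize(po_text) and dispatch on the kind, much as parse() dispatches on the first character of each token.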

start of the strip script ===============
#!/usr/bin/env python

import sys, poparse
import getopt


class CharsetStrippingSink(poparse.PoSink):
  def __init__(self, out):
    self.out = out


  def recv_simple_msg(self, pre_comment, msgid, msgstr):
    if msgid == [ '""' ]:
      for i in xrange(len(msgstr)):
        msgstr_len = len(msgstr)-1
        # note that the charset could be split over lines
        if msgstr[msgstr_len-i].find("charset=") >= 0:
          del msgstr[msgstr_len-i]
          break

    for l in pre_comment:
      self.out.write("%s\n" % l)

    msg = "msgid "
    for l in msgid:
      self.out.write("%s%s\n" % (msg, l))
      msg = ""

    msg = "msgstr "
    for l in msgstr:
      self.out.write("%s%s\n" % (msg, l))
      msg = ""


  def finish_parse(self):
    pass


def strip_it(infile, outfile):
  poparse.parse(poparse.PoTokens(infile),
                CharsetStrippingSink(outfile))

def main():
  """Docstring to be added"""

  opts, args = getopt.getopt(sys.argv[1:], '', [])

  if len(args) < 1:
    print main.__doc__
    sys.exit(2)

  infile = None
  if args[0] == '-':
    infile = sys.stdin
  else:
    infile = open(args[0],'r')

  outfile = None
  if len(args) < 2 or args[1] == '-':
    outfile = sys.stdout
  else:
    outfile = open(args[1],'w')

  strip_it(infile, outfile)

if __name__ == '__main__':
  main()
end of the strip script ===============



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Stripping 'charset=' from po files [the sequal]

Posted by Branko Čibej <br...@xbc.nu>.
Erik Hülsmann wrote:

>@@ -0,0 +1,23 @@
>+#
>+# strip-po-charset.py
>+#
>+
>+import sys
>+
>+def strip_po_charset(inp, out):
>+
>+    for line in inp.xreadlines():
>  
>
We only require Python 2.0. xreadlines appeared in 2.1.

Just read the whole thing into memory in one go and do a simple 
string.replace.
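
A minimal sketch of that suggestion, written in present-day Python and not in the 2.0 dialect the build required at the time. The names CHARSET_LINE and strip_charset are made up here, and the header line is assumed to be exactly the one the patch searches for.

```python
# The header line to drop, exactly as it appears in a po file: a quoted
# string ending in a literal backslash-n, followed by a real newline.
CHARSET_LINE = '"Content-Type: text/plain; charset=UTF-8\\n"\n'


def strip_charset(content):
    """Return content with every occurrence of CHARSET_LINE removed."""
    return content.replace(CHARSET_LINE, '')


def strip_po_charset(inp, out):
    """Read inp in one go and write the stripped text to out."""
    out.write(strip_charset(inp.read()))
```

Because the whole file is read at once, no line iterator such as xreadlines is needed.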

>+        if line.find("\"Content-Type: text/plain; charset=UTF-8\\n\"") == -1:
>+            out.write(line)
>+
>+
>+def main():
>+
>+    if len(sys.argv) != 3:
>+        print "Unsupported number of arguments; 2 required."
>  
>
Would be nice to say _which_ arguments. "Usage: foo arg1 arg2"...
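
One possible shape for such a message, as a sketch only. The placeholder names INPUT.po and OUTPUT.spo and the helper check_args are assumptions for illustration, not part of the patch.

```python
import sys

# Names the two expected arguments instead of only counting them.
USAGE = "Usage: %s INPUT.po OUTPUT.spo\n"


def check_args(argv):
    """Return (input_path, output_path), or exit with a usage message."""
    if len(argv) != 3:
        sys.stderr.write(USAGE % argv[0])
        sys.exit(1)
    return argv[1], argv[2]
```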

>+        sys.exit(1)
>+
>+    strip_po_charset(open(sys.argv[1],'r'), open(sys.argv[2],'w'))
>+
>+if __name__ == '__main__':
>+    main()
>  
>





Re: Stripping 'charset=' from po files [the sequal]

Posted by Erik Hülsmann <e....@gmx.net>.
>>3) Do you have any comments to either script? (the strip charset script has
>>to be extended to include plural support before this code can be committed)
>>  
>>
>I think this parser is overkill. How about something like this (just 
>typing, not testing):

[ ... ]

>I don't think we need anything more complicated than this sed-like 
>replacement in the Windows build.

Ok. How about the attached patch:

Log:
[[[
Add charset stripping to Windows build.

* build/generator/build_locale.ezt: Add intermediate .spo step
  also used in the Makefile build

* build/generator/gen_win.py (): Pass the base name instead of
  .po and .mo.
  (POFile.__init__): Initialize fields from base parameter.

* build/strip-po-charset.py: New file. Contains the actual sed-like
  script.
]]]

Index: build/generator/build_locale.ezt
===================================================================
--- build/generator/build_locale.ezt	(revision 9884)
+++ build/generator/build_locale.ezt	(working copy)
@@ -2,8 +2,11 @@
 @rem **************************************************************************
 cd ..\..\subversion\po
 [for pofiles]echo Running msgfmt on [pofiles.po]...
-msgfmt.exe -o [pofiles.mo] [pofiles.po]
+python ..\..\build\strip-po-charset.py [pofiles.po] [pofiles.spo]
 if not errorlevel 0 goto err
+msgfmt.exe -o [pofiles.mo] [pofiles.spo]
+if not errorlevel 0 goto err
+del [pofiles.spo]
 [end]
 goto end
 @rem **************************************************************************
Index: build/generator/gen_win.py
===================================================================
--- build/generator/gen_win.py	(revision 9884)
+++ build/generator/gen_win.py	(working copy)
@@ -143,7 +143,7 @@
     if self.enable_nls:
       for po in os.listdir(os.path.join('subversion', 'po')):
         if fnmatch.fnmatch(po, '*.po'):
-          pofiles.append(POFile(po, po[:-2] + 'mo'))
+          pofiles.append(POFile(po[:-3]))
     
     data = {'pofiles': pofiles}
     self.write_with_template(os.path.join('build', 'win32', 'build_locale.bat'),
@@ -706,6 +706,7 @@
 
 class POFile:
   "Item class for holding po file info"
-  def __init__(self, po, mo):
-    self.po = po
-    self.mo = mo
+  def __init__(self, base):
+    self.po = base + '.po'
+    self.spo = base + '.spo'
+    self.mo = base + '.mo'
Index: build/strip-po-charset.py
===================================================================
--- build/strip-po-charset.py	(revision 0)
+++ build/strip-po-charset.py	(revision 0)
@@ -0,0 +1,23 @@
+#
+# strip-po-charset.py
+#
+
+import sys
+
+def strip_po_charset(inp, out):
+
+    for line in inp.xreadlines():
+        if line.find("\"Content-Type: text/plain; charset=UTF-8\\n\"") == -1:
+            out.write(line)
+
+
+def main():
+
+    if len(sys.argv) != 3:
+        print "Unsupported number of arguments; 2 required."
+        sys.exit(1)
+
+    strip_po_charset(open(sys.argv[1],'r'), open(sys.argv[2],'w'))
+
+if __name__ == '__main__':
+    main()


Re: Stripping 'charset=' from po files [the sequal]

Posted by Branko Čibej <br...@xbc.nu>.
Erik Huelsmann wrote:

>In order to prevent charset conversion by 'smart' gettext implementations
>our build system has to strip out the 'charset=UTF-8' string in the
>administrative section of po files.  The Makefile based system currently
>does this by ripping out the entire 'Content-Type' line using 'sed'.
>
>
>The Windows (python based) build system does not provide sed.  To work
>around that I wrote the general python based po parser included below.  A
>separate script does the real stripping.  This also provides the (cleaner)
>solution to only examen the admin section.
>
>
>There are several questions to be answered before proceding:
>
>1) We don't want to use the same script for the Makefile build (adding a new
>dependency), do we?
>  
>
Using sed in the Makefile is fine.

>2)
> a) Do we want the po parser in the Subversion repository?
> b) If so: where?
>
>
>3) Do you have any comments to either script? (the strip charset script has
>to be extended to include plural support before this code can be committed)
>  
>
I think this parser is overkill. How about something like this (just 
typing, not testing):

    import os, string

    podir = 'subversion/po'
    filtered_podir = ...
    for file in os.listdir(podir):
      if file[-3:] != '.po':
        continue
      _filter_charset(podir, file, filtered_podir)

    def _filter_charset(source, file, target):
      f = open(os.path.join(source, file), 'rb')
      content = f.read()
      f.close()
      f = open(os.path.join(target, file), 'wb')
      f.write(string.replace(content, 'Content-Type: ....', ''))
      f.close()

I don't think we need anything more complicated than this sed-like 
replacement in the Windows build.

-- Brane




Re: Stripping 'charset=' from po files [the sequal]

Posted by Branko Čibej <br...@xbc.nu>.
Greg Hudson wrote:

>On Thu, 2004-05-13 at 18:11, Branko Čibej wrote:
>  
>
>>If indeed bind_textdomain_codeset is a GNUism, and if we can avoid the 
>>need to use it by stripping the charset bit from the .po files /and/ 
>>expect all gettext implementations to behave in the same way afterwards 
>>(i.e., not try to translate anything), then let's do that.
>>    
>>
>
>Why?  It sounds like it's harder and more annoying than using
>bind_textdomain_codeset when available.
>  
>
Because AFAIK a gettext implementation can both /not/ have 
bind_textdomain_codeset /and/ automatically translate the messages to 
the local encoding, which would break things for us most annoyingly.

-- Brane




Re: Stripping 'charset=' from po files [the sequal]

Posted by Greg Hudson <gh...@MIT.EDU>.
On Thu, 2004-05-13 at 18:11, Branko Čibej wrote:
> If indeed bind_textdomain_codeset is a GNUism, and if we can avoid the 
> need to use it by stripping the charset bit from the .po files /and/ 
> expect all gettext implementations to behave in the same way afterwards 
> (i.e., not try to translate anything), then let's do that.

Why?  It sounds like it's harder and more annoying than using
bind_textdomain_codeset when available.




Re: Stripping 'charset=' from po files [the sequal]

Posted by Branko Čibej <br...@xbc.nu>.
Erik Huelsmann wrote:

>>On Thu, 2004-05-13 at 14:13, Erik Huelsmann wrote:
>>
>>>In order to prevent charset conversion by 'smart' gettext implementations
>>>our build system has to strip out the 'charset=UTF-8' string in the
>>>administrative section of po files.
>>>
>>So, I recall Nico repeatedly pointing out that all the gettext
>>implementations which perform charset translation also support the
>>function call to turn it off.  I don't recall seeing an answer to this
>>claim.
>>    
>>
>
>I don't think there was one. He stated that he thought they all did.
>
>  
>
>>Given that stripping out the charset directive yields ugly warnings and
>>is presenting a portability problem, why are we doing it when there's a
>>better option?
>>    
>>
>
>It's the only way that I currently know of (given the absence of reactions
>to state that he is correct) to be sure to eliminate the recoding.  OTOH, if
>we are willing to give it a try to use the bind_textdomain_codeset() when
>available and assume a 'dumb' gettext if not until proven wrong that's fine
>by me.
>
>It was Branko who insisted on not having it in our code at all; I think he
>expected problems with the custom built GNU gettext.  Maybe now that he has
>built it he can give us an answer to the question whether we still can't use
>it or not... (Branko, any idea?)
>  
>
I was told on this list that the Solaris gettext didn't have 
bind_textdomain_codeset. The warnings Greg mentions are, as far as I 
know, from GNU msgfmt. But this is all hearsay.

If indeed bind_textdomain_codeset is a GNUism, and if we can avoid the 
need to use it by stripping the charset bit from the .po files /and/ 
expect all gettext implementations to behave in the same way afterwards 
(i.e., not try to translate anything), then let's do that.

-- Brane




Re: Stripping 'charset=' from po files [the sequal]

Posted by Erik Huelsmann <e....@gmx.net>.
> On Thu, 2004-05-13 at 14:13, Erik Huelsmann wrote:
> > In order to prevent charset conversion by 'smart' gettext implementations
> > our build system has to strip out the 'charset=UTF-8' string in the
> > administrative section of po files.
> 
> So, I recall Nico repeatedly pointing out that all the gettext
> implementations which perform charset translation also support the
> function call to turn it off.  I don't recall seeing an answer to this
> claim.

I don't think there was one. He stated that he thought they all did.

> Given that stripping out the charset directive yields ugly warnings and
> is presenting a portability problem, why are we doing it when there's a
> better option?

It's the only way that I currently know of (given the absence of reactions
to state that he is correct) to be sure to eliminate the recoding.  OTOH, if
we are willing to give it a try to use the bind_textdomain_codeset() when
available and assume a 'dumb' gettext if not until proven wrong that's fine
by me.
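
The "use it when available, otherwise assume a 'dumb' gettext" idea can be sketched as follows. Subversion itself is written in C, so this only illustrates the logic through Python's locale module, which exposes the C library's bind_textdomain_codeset() on platforms that provide it. The helper name disable_recoding and the domain name used in the example are assumptions.

```python
import locale


def disable_recoding(domain, codeset='UTF-8'):
    """Pin gettext output for `domain` to `codeset` when that is possible.

    Returns True if bind_textdomain_codeset() was available and called,
    False if the platform's gettext has to be assumed 'dumb' (no recoding).
    """
    bind = getattr(locale, 'bind_textdomain_codeset', None)
    if bind is None:
        return False
    bind(domain, codeset)
    return True
```

With this approach the charset line can stay in the po files; for example, disable_recoding('subversion') would be called once at startup.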

It was Branko who insisted on not having it in our code at all; I think he
expected problems with the custom built GNU gettext.  Maybe now that he has
built it he can give us an answer to the question whether we still can't use
it or not... (Branko, any idea?)


bye,


Erik.




Re: Stripping 'charset=' from po files [the sequal]

Posted by Nicolás Lichtmaier <ni...@reloco.com.ar>.
>>In order to prevent charset conversion by 'smart' gettext implementations
>>our build system has to strip out the 'charset=UTF-8' string in the
>>administrative section of po files.
>>    
>>
>
>So, I recall Nico repeatedly pointing out that all the gettext
>implementations which perform charset translation also support the
>function call to turn it off.  I don't recall seeing an answer to this
>claim.
>  
>

I don't know that for sure, I just think it's very reasonable, because 
it seems that gettext's charset handling as a whole is a GNU invention. If 
you check other gettext manpages you find no mention of any charset 
handling. A reasonable idea would be to assume this is true until the 
first bug report comes, claiming otherwise. There's always time to 
complicate things later. =)



Re: Stripping 'charset=' from po files [the sequal]

Posted by Greg Hudson <gh...@MIT.EDU>.
On Thu, 2004-05-13 at 14:13, Erik Huelsmann wrote:
> In order to prevent charset conversion by 'smart' gettext implementations
> our build system has to strip out the 'charset=UTF-8' string in the
> administrative section of po files.

So, I recall Nico repeatedly pointing out that all the gettext
implementations which perform charset translation also support the
function call to turn it off.  I don't recall seeing an answer to this
claim.

Given that stripping out the charset directive yields ugly warnings and
is presenting a portability problem, why are we doing it when there's a
better option?



Re: Stripping 'charset=' from po files [the sequal]

Posted by Ben Reser <be...@reser.org>.
On Thu, May 13, 2004 at 08:41:16PM +0200, Erik Huelsmann wrote:
> 
> > > 3) Do you have any comments to either script? (the strip charset script has
> > > to be extended to include plural support before this code can be committed)
> > 
> > Uhh isn't that overly complicated?  Can't you do something that is
> > roughly similar to the sed script in the Makefile?  I'd be really
> > surprised if python couldn't do that.  But if it can't you could do it
> > with Perl.  The Windows build already requires it.
> > 
> > For example:
> > perl -pe 's#^"Content-Type: text/plain; charset=UTF-8\\n"\n$##' es.po >
> > es.po.spo
> 
> Yes it can.  I just thought I'd do it exact by parsing the po file so that
> no lines outside the "" entry can be eliminated. (Yes, I know chances are
> slim..)

I guess I don't see the point in going to that effort for the Windows
build if we're not going to do it for the (U|u)(N|n)(I|i)(X|x) build.

-- 
Ben Reser <be...@reser.org>
http://ben.reser.org

"Conscience is the inner voice which warns us somebody may be looking."
- H.L. Mencken


Re: Stripping 'charset=' from po files [the sequal]

Posted by Erik Huelsmann <e....@gmx.net>.
> > 3) Do you have any comments to either script? (the strip charset script has
> > to be extended to include plural support before this code can be committed)
> 
> Uhh isn't that overly complicated?  Can't you do something that is
> roughly similar to the sed script in the Makefile?  I'd be really
> surprised if python couldn't do that.  But if it can't you could do it
> with Perl.  The Windows build already requires it.
> 
> For example:
> perl -pe 's#^"Content-Type: text/plain; charset=UTF-8\\n"\n$##' es.po >
> es.po.spo

Yes it can.  I just thought I'd do it exact by parsing the po file so that
no lines outside the "" entry can be eliminated. (Yes, I know chances are
slim..)
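
A middle ground between the sed-like replacement and a full po parser could be a line-based filter that only drops the Content-Type line while inside the header entry. The sketch below is in present-day Python. The function name and the rules for where the header starts and ends are assumptions: the header is taken to be an empty msgid followed directly by a msgstr, and to end at the first blank line or the next msgid.

```python
def strip_header_charset(lines):
    """Return lines without the Content-Type line of the header entry."""
    result = []
    state = 'body'          # 'body', 'after_empty_msgid' or 'header'
    for line in lines:
        stripped = line.strip()
        if stripped == 'msgid ""':
            state = 'after_empty_msgid'
        elif state == 'after_empty_msgid':
            # Only a msgstr directly after the empty msgid opens the header;
            # anything else means a multi-line msgid of an ordinary entry.
            state = 'header' if stripped.startswith('msgstr') else 'body'
        elif stripped == '' or stripped.startswith('msgid'):
            state = 'body'
        if state == 'header' and stripped.startswith('"Content-Type:'):
            continue        # dropped only inside the header entry
        result.append(line)
    return result
```

An identical line inside a later translation is left alone, which is the case the exact parser was meant to protect.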


bye,


Erik.




Re: Stripping 'charset=' from po files [the sequal]

Posted by Branko Čibej <br...@xbc.nu>.
Ben Reser wrote:

>...But if it can't you could do it
>with Perl.  The Windows build already requires it.
>  
>
Nope.





Re: Stripping 'charset=' from po files [the sequal]

Posted by Ben Reser <be...@reser.org>.
On Thu, May 13, 2004 at 08:13:52PM +0200, Erik Huelsmann wrote:
> 
> In order to prevent charset conversion by 'smart' gettext implementations
> our build system has to strip out the 'charset=UTF-8' string in the
> administrative section of po files.  The Makefile based system currently
> does this by ripping out the entire 'Content-Type' line using 'sed'.
> 
> 
> The Windows (python based) build system does not provide sed.  To work
> around that I wrote the general python based po parser included below.  A
> separate script does the real stripping.  This also provides the (cleaner)
> solution to only examen the admin section.

[snip]

> 3) Do you have any comments to either script? (the strip charset script has
> to be extended to include plural support before this code can be committed)

Uhh isn't that overly complicated?  Can't you do something that is
roughly similar to the sed script in the Makefile?  I'd be really
surprised if python couldn't do that.  But if it can't you could do it
with Perl.  The Windows build already requires it.

For example:
perl -pe 's#^"Content-Type: text/plain; charset=UTF-8\\n"\n$##' es.po > es.po.spo


-- 
Ben Reser <be...@reser.org>
http://ben.reser.org

"Conscience is the inner voice which warns us somebody may be looking."
- H.L. Mencken
