Posted to general@gump.apache.org by Adam Jack <aj...@mric.coop> on 2004/10/05 19:08:39 UTC

Unicode & Python interacting with File Systems

I just noticed a sync (where Gump copies from one directory tree to another)
failure occurring in XOM due to unicode decode issues. I always struggle to
get my head around the magic that is unicode (to/from 'real world') and I'd
appreciate any thoughts/insights here. The file name 'resum\xc3\xa9.xml' is
being extracted from the directory, and when we try to join it to its
parent directory (to get a full path) we are failing in a unicode
conversion.

A few questions:

1) Is "\xC3\xA9" a 'single' unicode character already? Is it being
considered as two by accident?
2) Looking at the 'resum\xc3\xa9.xml' I see no 'u' at the beginning, so
Python doesn't consider it unicode (yet). This seems contrary to the Python
2.3 manual that states "if os.listdir is given a unicode string path it'll
return unicode strings". The path (shown below) is marked 'u'.
3) Any thoughts on where the flaw might be, and what ought to be done to try
to fix this?

regards,

Adam

-----------------------------------------------------------------

ERROR:gump:Unicode Error. Can't copy ['resum\xc3\xa9.xml'] in
[u'/usr/local/gump/test/workspace/cvs/xom/data'] to
[u'/usr/local/gump/test/workspace/xom/data']: ['ascii' codec can't decode
byte 0xc3 in position 6: ordinal not in range(128)]
Traceback (most recent call last):
  File "/usr/local/gump/test/gump/python/gump/utils/sync.py", line 180, in copytree
    srcname = os.path.join(src, name)
  File "/usr/lib/python2.3/posixpath.py", line 65, in join
    path += '/' + b
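
The traceback above can be reproduced in miniature. This is only a sketch, and it assumes Python 3 rather than the Python 2.3 in the log: Python 3 refuses to mix bytes and text, so the ASCII decode that Python 2 attempted implicitly inside `path += '/' + b` is written out explicitly here. The variable names are illustrative.

```python
# Python 3 sketch: make explicit the ASCII decode that Python 2 did implicitly.
name = b"resum\xc3\xa9.xml"   # raw bytes, as os.listdir returned them
joined = b"/" + name          # the right-hand side of `path += '/' + b`

try:
    joined.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failing byte and offset match the logged message.
print(error.reason)               # ordinal not in range(128)
print(hex(joined[error.start]))   # 0xc3
print(error.start)                # 6
```

The offset is 6 rather than 5 because the '/' separator is prepended before the decode happens, which is consistent with "position 6" in the logged error.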


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@gump.apache.org
For additional commands, e-mail: general-help@gump.apache.org


Re: Unicode & Python interacting with File Systems

Posted by "Adam R. B. Jack" <aj...@apache.org>.
Ok, so a quick test program:

----------------------------------------------------------------------
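import sys
import os

# Note: sys.getdefaultencoding() reports the default string encoding;
# sys.getfilesystemencoding() is the call that reports the file system one.
print 'Default File System Encoding: ' + sys.getdefaultencoding()

# Byte-string path: names come back as byte strings.
for name in os.listdir('../../workspace/cvs/xom/data'):
        if name.startswith('r'): print 'Non-Unicode : ' + `name`

# Unicode path: names should come back as unicode strings.
for name in os.listdir(u'../../workspace/cvs/xom/data'):
        if name.startswith('r'): print 'Unicode : ' + `name`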
----------------------------------------------------------------------
Gives:

Default File System Encoding: ascii

Non-Unicode : 'rddltest.html'
Non-Unicode : 'resum\xc3\xa9.xml'
Unicode : u'rddltest.html'
Unicode : 'resum\xc3\xa9.xml'

i.e. listdir is returning unicode strings when passed a unicode directory,
except in this case, where it returns a simple string. As you see above, it
seems that the default file system encoding is ascii, so somehow when this
filename is encountered the logic is stumbling. Can the Linux file system
not cope with unicode characters, or is "ascii" wrong as a default system
encoding? Heck, I can't easily ls this file simply either (on my terminal):

     ls ../../workspace/cvs/xom/data/resumé.xml
    ../../workspace/cvs/xom/data/resum??.xml
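
The two '?' characters in that listing are consistent with the name holding two non-ASCII bytes. A rough sketch in Python 3, assuming the terminal's locale treats every byte above 0x7F as unprintable and that ls substitutes one '?' per such byte; `show_as_ascii` is a hypothetical helper, not anything ls or Gump provides:

```python
import sys

def show_as_ascii(raw_name):
    """Replace every non-ASCII byte with '?' (hypothetical helper)."""
    return "".join(chr(b) if b < 0x80 else "?" for b in raw_name)

raw_name = b"resum\xc3\xa9.xml"
print(show_as_ascii(raw_name))   # resum??.xml

# Two separate settings; the test program above printed the first one.
print("default string encoding:", sys.getdefaultencoding())
print("file system encoding:   ", sys.getfilesystemencoding())
```

If that assumption holds, the file system is storing the name intact and the '??' is only a display matter.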

Is the problem that XOM is (on some platform) encoding this filename,
checking it in to CVS, and when CVS (on Brutus) checks it out, it is
knobbling the directory creation? Do we have a general problem with CVS|SVN
here?

Anybody have suggestions on where I go next with this? I'd like to solve it
[short and/or long term], but I'd also like to understand where the issue
really is, since we might have a general problem here.
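
One possible short-term approach, sketched in Python 3 and not taken from Gump's sync.py: always list the directory as bytes and decode every name through one function, so the results never mix the two string types. The helper names `decode_name` and `list_names`, and the choice of UTF-8 with a Latin-1 fallback, are assumptions for illustration.

```python
import os

def decode_name(raw_name, encodings=("utf-8", "latin-1")):
    """Return text for a raw file name, trying each encoding in turn."""
    for encoding in encodings:
        try:
            return raw_name.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reachable if the caller passes encodings that can all fail:
    # keep going, marking the undecodable bytes.
    return raw_name.decode("ascii", "replace")

def list_names(directory):
    """List a directory as bytes, then decode every entry the same way."""
    raw_dir = os.fsencode(directory)
    return [decode_name(raw) for raw in os.listdir(raw_dir)]

print(repr(decode_name(b"resum\xc3\xa9.xml")))   # 'resumé.xml'
```

The trade-off is that a guessed encoding can be wrong for a given checkout, so a decoded name is only safe to display; copying should keep using the raw bytes.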

[This seems a good question to post to a group like python@apache.org,
except we don't have one. Ok, whine over... ]

regards,

Adam
----- Original Message ----- 
From: "Sam Ruby" <ru...@apache.org>
To: "Gump code and data" <ge...@gump.apache.org>
Sent: Tuesday, October 05, 2004 5:18 PM
Subject: Re: Unicode & Python interacting with File Systems


> Adam Jack wrote:
>
> > 1) Is "\xC3\xA9" a 'single' unicode character already? Is it being
> > considered as two by accident?
>
> unicode("\xC3\xA9","utf-8") == u'\xe9'
>
> http://www.unipad.org/unimap/index.php?page=detail&param_char=00E9
>
> - Sam Ruby




Re: Unicode & Python interacting with File Systems

Posted by Sam Ruby <ru...@apache.org>.
Adam Jack wrote:

> 1) Is "\xC3\xA9" a 'single' unicode character already? Is it being
> considered as two by accident?

unicode("\xC3\xA9","utf-8") == u'\xe9'

http://www.unipad.org/unimap/index.php?page=detail&param_char=00E9

- Sam Ruby
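
For anyone reading this with a later Python, the same check can be sketched in Python 3, where the `unicode()` builtin no longer exists and byte strings are decoded with `bytes.decode`. The variable names are illustrative.

```python
import unicodedata

raw = b"\xc3\xa9"            # the two bytes seen in the file name
text = raw.decode("utf-8")   # Python 3 spelling of unicode(raw, "utf-8")

print(len(raw), len(text))           # 2 1
print(text == "\xe9")                # True
print(unicodedata.name(text))        # LATIN SMALL LETTER E WITH ACUTE
print(text.encode("utf-8") == raw)   # True
```

So the two bytes are the UTF-8 encoding of a single character; they only look like two characters while the name is still an undecoded byte string.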
