Posted to dev@subversion.apache.org by Masaru Tsuchiyama <ts...@asahi-net.email.ne.jp> on 2003/07/15 00:09:59 UTC

cvs2svn.py : about --encoding option

cvs2svn.py accepts only one character set with the --encoding option,
but I have a CVS repository whose log messages are written in two
charsets, euc-jp and shift-jis.

When I run cvs2svn.py with "--encoding euc-jp", the log messages in
euc-jp are converted correctly, but the log messages in sjis can't be
converted. With "--encoding sjis", it is the other way around.

Could you make it possible to specify more than one charset?
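
One way to read this request is as an ordered list of candidate encodings, where the first one that decodes a log message without error is used. A minimal sketch in modern Python follows. `decode_log` and its parameters are hypothetical names chosen for illustration and are not part of cvs2svn.py; the codec names are the ones Python's standard library uses.

```python
def decode_log(raw, encodings):
    """Return (text, encoding) for the first candidate that decodes raw strictly.

    Raises ValueError if none of the candidate encodings fits.
    """
    for name in encodings:
        try:
            # Strict decoding: any byte sequence invalid in this encoding raises.
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the log message")
```

Order matters here: a message can be valid in more than one encoding, in which case the earlier candidate is used even if it is the wrong one.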


-------------------------------
Masaru Tsuchiyama
-------------------------------


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: cvs2svn.py : about --encoding option

Posted by Masaru Tsuchiyama <ts...@asahi-net.email.ne.jp>.
> Greg Stein <gs...@lyra.org> writes:
> > In this case, ShiftJIS and EUC share some codepoints. If a string contains
> > characters *only* in that shared space, then you get a "maybe". If there is
> > *any* character in the string which falls into one of the exclusive
> > codepoint spaces, then you get to say "yes" or "no".
> 
> For 'maybe', just encode it both ways and give both results, separated
> by a dashed line or something :-).
> 
I suggest that cvs2svn.py ask the user which charset to use
if it can't autodetect one.
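
A rough sketch of that fallback, assuming detection is done by strict decoding with Python's standard codecs. The names `choose_encoding`, `ask`, and `ask_on_console` are hypothetical and do not exist in cvs2svn.py. The user is consulted only when the candidates disagree about the text, or when none of them fits.

```python
def choose_encoding(raw, candidates, ask):
    """Pick an encoding for raw, calling ask() only when detection is not decisive."""
    decoded = {}
    for name in candidates:
        try:
            decoded[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # this candidate cannot represent the bytes
    # One fitting candidate, or several that all yield the same text
    # (plain ASCII, for example): there is nothing to ask about.
    if len(set(decoded.values())) == 1:
        return next(iter(decoded))
    # Otherwise the candidates disagree, or none fits, so defer to the user.
    return ask(raw, list(decoded) or list(candidates))


def ask_on_console(raw, options):
    """Hypothetical prompt: show each possible decoding and read the user's choice."""
    for index, name in enumerate(options):
        print(index, name, raw.decode(name, errors="replace"))
    return options[int(input("encoding number: "))]
```

Passing the prompt in as a callable keeps the detection logic usable in non-interactive runs, where `ask` could instead log the message and apply a default.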


I posted this issue to the issue tracker:
http://subversion.tigris.org/issues/show_bug.cgi?id=1420

-------------------------------
Masaru Tsuchiyama
-------------------------------



Re: cvs2svn.py : about --encoding option

Posted by kf...@collab.net.
Greg Stein <gs...@lyra.org> writes:
> In this case, ShiftJIS and EUC share some codepoints. If a string contains
> characters *only* in that shared space, then you get a "maybe". If there is
> *any* character in the string which falls into one of the exclusive
> codepoint spaces, then you get to say "yes" or "no".

For 'maybe', just encode it both ways and give both results, separated
by a dashed line or something :-).


Re: cvs2svn.py : about --encoding option

Posted by Greg Stein <gs...@lyra.org>.
On Tue, Jul 15, 2003 at 02:37:01PM -0700, Jack Repenning wrote:
> At 2:35 PM -0700 7/15/03, Greg Stein wrote:
> >Mike and I are somewhat familiar with kanjilib.py. Great module, but note
> >that its autodetection can return "yes, no, maybe". That third case can be
> >troublesome :-)
> 
> That "maybe" case is inherent in the problem, isn't it?  Some codes 
> just plain are ambiguous.

Yup, definitely.

In this case, ShiftJIS and EUC share some codepoints. If a string contains
characters *only* in that shared space, then you get a "maybe". If there is
*any* character in the string which falls into one of the exclusive
codepoint spaces, then you get to say "yes" or "no".
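
The three-way answer can be illustrated with a small sketch. It does not reproduce kanjilib.py, whose API is not shown in this thread; it substitutes strict decoding with Python's standard codecs as the validity test. `looks_like` is a hypothetical name.

```python
def looks_like(raw, encoding, rival):
    """Answer 'yes', 'no' or 'maybe': is raw written in encoding rather than rival?"""

    def fits(name):
        try:
            raw.decode(name)  # strict: raises on bytes invalid in this encoding
            return True
        except UnicodeDecodeError:
            return False

    if not fits(encoding):
        return "no"
    # Valid in both encodings means the bytes lie entirely in the shared space.
    return "maybe" if fits(rival) else "yes"
```

For instance, the EUC-JP bytes for HIRAGANA LETTER A (0xA4 0xA2) are also two valid single-byte characters in Shift-JIS, so `looks_like(b"\xa4\xa2", "euc_jp", "shift_jis")` gives "maybe", as does any pure-ASCII message.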

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: cvs2svn.py : about --encoding option

Posted by Jack Repenning <jr...@collab.net>.
At 2:35 PM -0700 7/15/03, Greg Stein wrote:
>Mike and I are somewhat familiar with kanjilib.py. Great module, but note
>that its autodetection can return "yes, no, maybe". That third case can be
>troublesome :-)

That "maybe" case is inherent in the problem, isn't it?  Some codes 
just plain are ambiguous.
-- 
-==-
Jack Repenning
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
o: 650.228.2562
c: 408.835-8090


Re: cvs2svn.py : about --encoding option

Posted by Greg Stein <gs...@lyra.org>.
On Tue, Jul 15, 2003 at 02:58:05PM -0500, kfogel@collab.net wrote:
> cmpilato@collab.net writes:
> > > > Could you make it possible to specify more than one charset?
> > > 
> > > How would it know which encoding to use for a given log message?
> > 
> > Perhaps cvs2svn should consider kanjilib.py, autodetect the encoding
> > type used for the log message, and convert to UTF8 ?
> 
> Interesting idea!
> 
> Masaru Tsuchiyama, can you file an 'ENHANCEMENT' request in the issue
> tracker noting this?  Thanks...

Mike and I are somewhat familiar with kanjilib.py. Great module, but note
that its autodetection can return "yes, no, maybe". That third case can be
troublesome :-)

That said, it would be quite nice to try to import the module, and (if
successful) to use it for auto-detection.
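
The "try to import, use it if present" pattern might look like the following in modern Python. This is only a sketch: `load_optional` and `AUTODETECT_AVAILABLE` are hypothetical names, and nothing is assumed about kanjilib's own functions.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# kanjilib is treated as optional; its API is deliberately not used here.
kanjilib = load_optional("kanjilib")
AUTODETECT_AVAILABLE = kanjilib is not None
```

A converter could then check `AUTODETECT_AVAILABLE` and fall back to the explicit --encoding value when the module is missing.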

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: cvs2svn.py : about --encoding option

Posted by kf...@collab.net.
cmpilato@collab.net writes:
> > > Could you make it possible to specify more than one charset?
> > 
> > How would it know which encoding to use for a given log message?
> 
> Perhaps cvs2svn should consider kanjilib.py, autodetect the encoding
> type used for the log message, and convert to UTF8 ?

Interesting idea!

Masaru Tsuchiyama, can you file an 'ENHANCEMENT' request in the issue
tracker noting this?  Thanks...

-Karl


Re: cvs2svn.py : about --encoding option

Posted by cm...@collab.net.
kfogel@collab.net writes:

> Masaru Tsuchiyama <ts...@asahi-net.email.ne.jp> writes:
> > cvs2svn.py accepts only one character set with the --encoding
> > option, but I have a CVS repository whose log messages are written
> > in two charsets, euc-jp and shift-jis.
> >
> > When I run cvs2svn.py with "--encoding euc-jp", the log messages in
> > euc-jp are converted correctly, but the log messages in sjis can't
> > be converted. With "--encoding sjis", it is the other way around.
> >
> > Could you make it possible to specify more than one charset?
> 
> How would it know which encoding to use for a given log message?

Perhaps cvs2svn should consider kanjilib.py, autodetect the encoding
type used for the log message, and convert to UTF8 ?


Re: cvs2svn.py : about --encoding option

Posted by kf...@collab.net.
Masaru Tsuchiyama <ts...@asahi-net.email.ne.jp> writes:
> cvs2svn.py accepts only one character set with the --encoding option,
> but I have a CVS repository whose log messages are written in two
> charsets, euc-jp and shift-jis.
>
> When I run cvs2svn.py with "--encoding euc-jp", the log messages in
> euc-jp are converted correctly, but the log messages in sjis can't be
> converted. With "--encoding sjis", it is the other way around.
>
> Could you make it possible to specify more than one charset?

How would it know which encoding to use for a given log message?

-Karl
