Posted to dev@subversion.apache.org by "Peter N. Lundblad" <pe...@famlundblad.se> on 2005/01/05 15:39:32 UTC

Re: [Issue 2194] Unicde UTF-16 files detected as binary

On Wed, 5 Jan 2005 maxb@tigris.org wrote:

> http://subversion.tigris.org/issues/show_bug.cgi?id=2194
>
>
>
> User maxb changed the following:
>
>                   What    |Old value                 |New value
> ================================================================================
>                     Status|NEW                       |RESOLVED
> --------------------------------------------------------------------------------
>                 Resolution|                          |INVALID
> --------------------------------------------------------------------------------
>
>
>
>
> ------- Additional comments from maxb@tigris.org Wed Jan  5 06:48:02 -0800 2005 -------
> There's some huge red text on the issue tracker front page.
> Please read it.
> Thanks.
>
But don't you agree this would be a good enhancement, i.e. better support
for other Unicode encodings than UTF8?

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by Branko Čibej <br...@xbc.nu>.
Peter N. Lundblad wrote:

>On Thu, 6 Jan 2005, Branko Čibej wrote:
>
>  
>
>>Peter N. Lundblad wrote:
>>
>>    
>>
>>>On Thu, 6 Jan 2005, Branko Čibej wrote:
>>>
>>>
>>>
>>>      
>>>
>>>>It is much more complicated than that. If we're to treat UTF-16 files as
>>>>text, we have to teach libsvn_diff to do diffs and merges correctly on
>>>>such files, and possibly enhance keyword expansion and newline
>>>>conversion, too.
>>>>
>>>>
>>>>
>>>>        
>>>>
>>>Or convert to/from UTF8 as we do with other encodings.
>>>
>>>
>>>      
>>>
>>We don't convert file /contents/ between encodings, we don't even know
>>which encoding they're in.
>>
>>    
>>
>We don't know that *currently*.
>
Exactly. Bravo. And that change is where the can of worms is hidden, 
because it's not only about writing, but also about parsing the files.

> That could change, however. Right now, we
>output UTF8 (I think, or is it native). Still, we just insert stuff in
>files without knowing the encoding.
>
Yes, we do broken things like that.

> So, I think we will want to add an
>encoding property (or support it in the svn:mime-type) someday.
>  
>
Yup. We almost support it already, in the sense that we don't die if
it's there.

>Note that I don't say it is trivial, but it should be doable.
>  
>
Neither did I say it wasn't doable, just that it frobs 90% of the 
client-side code. But I admit this estimate might be a bit pessimistic; 
it's probably closer to 85%. :-p

-- Brane




Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Thu, 6 Jan 2005, Branko Čibej wrote:

> Peter N. Lundblad wrote:
>
> >On Thu, 6 Jan 2005, Branko Čibej wrote:
> >
> >
> >
> >>It is much more complicated than that. If we're to treat UTF-16 files as
> >>text, we have to teach libsvn_diff to do diffs and merges correctly on
> >>such files, and possibly enhance keyword expansion and newline
> >>conversion, too.
> >>
> >>
> >>
> >Or convert to/from UTF8 as we do with other encodings.
> >
> >
> We don't convert file /contents/ between encodings, we don't even know
> which encoding they're in.
>
We don't know that *currently*. That could change, however. Right now, we
output UTF8 (I think, or is it native). Still, we just insert stuff in
files without knowing the encoding. So, I think we will want to add an
encoding property (or support it in the svn:mime-type) someday.

Note that I don't say it is trivial, but it should be doable.

Regards,
//Peter



Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by Branko Čibej <br...@xbc.nu>.
Peter N. Lundblad wrote:

>On Thu, 6 Jan 2005, Branko Čibej wrote:
>
>  
>
>>Peter N. Lundblad wrote:
>>
>>    
>>
>>>Yes, it is more complicated than that, since it is an encoding where a
>>>line break is not one or two bytes, and for some other reasons. Still, I
>>>think we really need to support other Unicode encodings than UTF8, like
>>>we support other 8-bit encodings.
>>>
>>>
>>>      
>>>
>>It is much more complicated than that. If we're to treat UTF-16 files as
>>text, we have to teach libsvn_diff to do diffs and merges correctly on
>>such files, and possibly enhance keyword expansion and newline
>>conversion, too.
>>
>>    
>>
>Or convert to/from UTF8 as we do with other encodings.
>  
>
We don't convert file /contents/ between encodings, we don't even know 
which encoding they're in.

>>In short, it's a whole can of worms that probably affects 90% of the
>>client-side code.
>>    
>>
>I can't believe that.
>  
>
You don't have to take my word for it, the code is right there.

-- Brane



Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Thu, 6 Jan 2005, Branko Čibej wrote:

> Peter N. Lundblad wrote:
>
> >Yes, it is more complicated than that, since it is an encoding where a
> >line break is not one or two bytes, and for some other reasons. Still, I
> >think we really need to support other Unicode encodings than UTF8, like
> >we support other 8-bit encodings.
> >
> >
> It is much more complicated than that. If we're to treat UTF-16 files as
> text, we have to teach libsvn_diff to do diffs and merges correctly on
> such files, and possibly enhance keyword expansion and newline
> conversion, too.
>
Or convert to/from UTF8 as we do with other encodings.

> In short, it's a whole can of worms that probably affects 90% of the
> client-side code.
>
I can't believe that.

Regards,
//Peter



Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by Branko Čibej <br...@xbc.nu>.
Barry Scott wrote:

>
> On Jan 6, 2005, at 01:41, Branko Čibej wrote:
>
>> Peter N. Lundblad wrote:
>>
>>> On Wed, 5 Jan 2005, Max Bowsher wrote:
>>>
>>>
>>>> Peter N. Lundblad wrote:
>>>> I agree with what you are saying, but what 2194 was saying was "UTF-16
>>>> should be detected as textual".
>>>>
>>>>
>>> Yes, it is more complicated than that, since it is an encoding where a
>>> line break is not one or two bytes, and for some other reasons. Still, I
>>> think we really need to support other Unicode encodings than UTF8, like
>>> we support other 8-bit encodings.
>>>
>> It is much more complicated than that. If we're to treat UTF-16 files 
>> as text, we have to teach libsvn_diff to do diffs and merges 
>> correctly on such files, and possibly enhance keyword expansion and 
>> newline conversion, too.
>>
>> In short, it's a whole can of worms that probably affects 90% of the 
>> client-side code.
>
>
> When the rewrite of the client eventually happens, design wide-char
> support in on day 1, then.

This won't help in general. You can only guarantee identical conversions 
between the various Unicode encodings, but if the file is in some other 
encoding, there's not always a valid way to convert the contents to 
Unicode, operate on that, and convert back without changing some of the 
original characters that shouldn't have changed. For example, the 
various ISO-2022 encodings are notorious for not behaving nicely in this 
context, and for that matter so is UTF-7.

The only universally correct way is to find the replaceable strings 
*without* converting the file contents, then only convert the 
replacements once from Unicode to the file's encoding.
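
To make the idea concrete, here is a rough Python sketch of expanding an unexpanded keyword directly in the raw bytes. It is not Subversion code: the function name, its parameters and the "$Rev: 42 $" form are made up for illustration, and it assumes a stateless encoding with a fixed code unit size and a BOM-less codec name. Only the marker and the replacement are ever encoded; the rest of the file is copied through untouched.

```python
def expand_keyword(data: bytes, encoding: str, unit: int,
                   keyword: str, value: str) -> bytes:
    """Replace every aligned "$keyword$" in data with "$keyword: value $".

    Hypothetical helper for illustration. `encoding` must be a BOM-less
    codec name (e.g. "utf-16-le", not "utf-16"), and `unit` is the size
    of one code unit in bytes (1 for 8-bit encodings, 2 for UTF-16).
    """
    # Only these two short strings are converted to the file's encoding.
    marker = f"${keyword}$".encode(encoding)
    expanded = f"${keyword}: {value} $".encode(encoding)

    out = bytearray()
    pos = 0
    while True:
        hit = data.find(marker, pos)
        if hit < 0:
            out += data[pos:]
            return bytes(out)
        if hit % unit == 0:
            # Aligned on a code unit boundary: a real keyword marker.
            out += data[pos:hit] + expanded
            pos = hit + len(marker)
        else:
            # Misaligned match straddling two characters: skip past it.
            out += data[pos:hit + 1]
            pos = hit + 1


# UTF-16 file with a BOM; the BOM and everything else stay byte-identical.
original = "\ufeffrev: $Rev$\n".encode("utf-16-le")
result = expand_keyword(original, "utf-16-le", 2, "Rev", "42")

# Same function on an 8-bit encoding.
latin = expand_keyword("caf\u00e9 $Rev$".encode("latin-1"), "latin-1", 1, "Rev", "42")
```

The alignment check matters for multi-byte code units, where the marker's byte sequence can appear across the boundary of two unrelated characters. Stateful encodings such as the ISO-2022 family would need more than this, since the meaning of a byte depends on earlier escape sequences.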

> I do not expect a quick fix, but this issue should be nagging at svn
> devs.

Not to worry, it's in the issue tracker. :-)

-- Brane



Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by Barry Scott <ba...@barrys-emacs.org>.
On Jan 6, 2005, at 01:41, Branko Čibej wrote:

> Peter N. Lundblad wrote:
>
>> On Wed, 5 Jan 2005, Max Bowsher wrote:
>>
>>
>>> Peter N. Lundblad wrote:
>>> I agree with what you are saying, but what 2194 was saying was "UTF-16
>>> should be detected as textual".
>>>
>>>
>> Yes, it is more complicated than that, since it is an encoding where a
>> line break is not one or two bytes, and for some other reasons. Still, I
>> think we really need to support other Unicode encodings than UTF8, like
>> we support other 8-bit encodings.
>>
> It is much more complicated than that. If we're to treat UTF-16 files 
> as text, we have to teach libsvn_diff to do diffs and merges correctly 
> on such files, and possibly enhance keyword expansion and newline 
> conversion, too.
>
> In short, it's a whole can of worms that probably affects 90% of the 
> client-side code.

When the rewrite of the client eventually happens, design wide-char
support in on day 1, then.

I do not expect a quick fix, but this issue should be nagging at svn
devs.

Barry




Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by Branko Čibej <br...@xbc.nu>.
Peter N. Lundblad wrote:

>On Wed, 5 Jan 2005, Max Bowsher wrote:
>
>  
>
>>Peter N. Lundblad wrote:
>>I agree with what you are saying, but what 2194 was saying was "UTF-16
>>should be detected as textual".
>>
>>    
>>
>Yes, it is more complicated than that, since it is an encoding where a
>line break is not one or two bytes, and for some other reasons. Still, I
>think we really need to support other Unicode encodings than UTF8, like
>we support other 8-bit encodings.
>  
>
It is much more complicated than that. If we're to treat UTF-16 files as 
text, we have to teach libsvn_diff to do diffs and merges correctly on 
such files, and possibly enhance keyword expansion and newline 
conversion, too.

In short, it's a whole can of worms that probably affects 90% of the 
client-side code.

-- Brane




Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Wed, 5 Jan 2005, Max Bowsher wrote:

> Peter N. Lundblad wrote:
> I agree with what you are saying, but what 2194 was saying was "UTF-16
> should be detected as textual".
>
Yes, it is more complicated than that, since it is an encoding where a
line break is not one or two bytes, and for some other reasons. Still, I
think we really need to support other Unicode encodings than UTF8, like
we support other 8-bit encodings.
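
To illustrate the line break point, a small Python sketch (not Subversion code; the variable names are only for this example) shows what a line feed looks like in UTF-16 and what happens to code that splits on the single byte 0x0A:

```python
# Illustration only: how a line feed is represented in different encodings.
text = "a\nb"

utf8 = text.encode("utf-8")         # bytes 61 0A 62
utf16le = text.encode("utf-16-le")  # bytes 61 00 0A 00 62 00
utf16be = text.encode("utf-16-be")  # bytes 00 61 00 0A 00 62

# A byte-oriented splitter that looks for 0x0A cuts a UTF-16 code unit
# in half.
naive_lines = utf16le.split(b"\n")

# The second piece starts with the orphaned high byte of the line feed
# and has an odd length, so it is no longer valid UTF-16.
odd_length = len(naive_lines[1]) % 2 == 1

# The byte 0x0A can also occur inside a character that is not a line
# break at all: U+010A encodes as 0A 01 in UTF-16-LE.
false_hit = b"\n" in "\u010a".encode("utf-16-le")
```

So anything that finds end-of-line by scanning for one or two specific bytes would need to know the code unit size and byte order first.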

> IMO, given the current level of software support for UTF-16, it is more
> binary than text.
>
I don't agree. Will make a comment on this on users@ to give Barry some
support:-)

Regards,
//Peter


Re: [Issue 2194] Unicde UTF-16 files detected as binary

Posted by Max Bowsher <ma...@ukf.net>.
Peter N. Lundblad wrote:
> On Wed, 5 Jan 2005 maxb@tigris.org wrote:
>
>> http://subversion.tigris.org/issues/show_bug.cgi?id=2194
>>
>>
>>
>> User maxb changed the following:
>>
>>                   What    |Old value                 |New value
>> ================================================================================
>>                     Status|NEW                       |RESOLVED
>> --------------------------------------------------------------------------------
>>                 Resolution|                          |INVALID
>> --------------------------------------------------------------------------------
>>
>>
>>
>>
>> ------- Additional comments from maxb@tigris.org Wed Jan  5 06:48:02 -0800 2005 -------
>> There's some huge red text on the issue tracker front page.
>> Please read it.
>> Thanks.
>>
> But don't you agree this would be a good enhancement, i.e. better support
> for other Unicode encodings than UTF8?

I agree with what you are saying, but what 2194 was saying was "UTF-16 
should be detected as textual".

IMO, given the current level of software support for UTF-16, it is more 
binary than text.
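
For anyone wondering why a byte-level sniffing heuristic lands on "binary" for such files, here is a rough Python sketch. It is an illustration only, not the actual detection code; the function name, the 1024-byte window and the 15% threshold are assumptions picked for the example.

```python
def looks_binary(data: bytes, window: int = 1024,
                 threshold: float = 0.15) -> bool:
    """Guess whether data is binary by sniffing its first bytes.

    Hypothetical heuristic: any NUL byte, or too many bytes outside
    printable ASCII and common whitespace, counts as binary.
    """
    sample = data[:window]
    if not sample:
        return False
    if b"\x00" in sample:
        return True
    # Count bytes that are neither printable ASCII nor tab/LF/CR.
    odd = sum(1 for b in sample
              if not (0x20 <= b < 0x7F or b in b"\t\n\r"))
    return odd / len(sample) > threshold


ascii_result = looks_binary(b"hello world\n")
utf16_result = looks_binary("hello world\n".encode("utf-16"))
```

Mostly-ASCII text stored as UTF-16 has a zero byte in every other position, so a check like this flags it immediately.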

Ma.

