Posted to dev@subversion.apache.org by Greg Thomas <th...@omc.bt.co.uk> on 2005/06/06 13:57:04 UTC

Re: Text mime types

On Mon, 6 Jun 2005 12:36:36 +0200, Nicolas Goutte <ni...@snafu.de>
wrote:

>It would be nice if Subversion would have a svn:text property to tweak it 
>independently (even if perhaps its default would be "look at the mime type").

Regardless of the setting of the application/xml or text/xml mime type
discussion held elsewhere, this strikes me as an incredibly sensible
idea (though I'll turn it on its head and suggest svn:binary
instead).

Currently, the determination of whether or not files are binary is a
bit arbitrary - a file is considered binary if it has a svn:mime-type
other than text/*, image/x-xbitmap or image/x-xpixmap. 
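
As a rough illustration of that rule, here is a minimal Python sketch. It
is not Subversion's actual C code: the function name and the way
parameters such as "; charset=..." are stripped are assumptions made for
the example; only the list of exceptions comes from the text above.

```python
# MIME types that are not text/* but are still treated as text,
# as listed in the message above.
TEXT_LIKE_EXCEPTIONS = {"image/x-xbitmap", "image/x-xpixmap"}


def mime_type_is_binary(mime_type: str) -> bool:
    """Return True if this svn:mime-type value would count as binary."""
    # Drop any parameters (e.g. "; charset=utf-8") before comparing.
    essence = mime_type.split(";", 1)[0].strip().lower()
    if essence.startswith("text/"):
        return False
    return essence not in TEXT_LIKE_EXCEPTIONS
```

With this rule, text/xml is text but application/xml is binary, which is
the kind of arbitrariness being discussed.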

A simple svn:binary flag set if needed automatically when a file is
added (cf application/octet-stream) should make the whole thing a lot
simpler - it will also solve the problem of more exceptions being
added to the current list. 

For example, the XML files of OpenOffice documents are application/* -
http://framework.openoffice.org/documentation/mimetypes/mimetypes.html
yet they are XML and therefore (presumably) text. No doubt they too
should be added to the "not svn:mime-type=text/* but still text" list.
This list could grow and grow, but a simple svn:binary flag solves the
problem once.

Greg
[dev list added, as no doubt there are other issues I've missed]
-- 
This post represents the views of the author and does
not necessarily accurately represent the views of BT.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Text mime types

Posted by Greg Thomas <th...@omc.bt.co.uk>.
On Wed, 08 Jun 2005 12:15:34 +0100, Julian Foad
<ju...@btopenworld.com> wrote:

>> A simple svn:binary flag set if needed automatically when a file is
>> added (cf application/octet-stream) should make the whole thing a lot
>> simpler
>
>Make what simpler?

AIUI, commands such as svn blame, svn merge, etc. will use the
'binaryness' of a file to determine whether to proceed. Currently, a file
is treated as binary if it has a mime type other than text/*,
image/x-xbitmap or image/x-xpixmap. I can only see that list of
mime-types growing, so a single "binary or text" flag could make it
simpler. svn add/import can continue to detect binary/text files as it
does currently, and set the flag accordingly.
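
A sketch of how that could look, in Python for brevity. Everything here
is hypothetical: svn:binary is only the property proposed in this thread,
the "*" value merely imitates how svn:executable is stored, and the
NUL-byte test is a stand-in for whatever detection svn add really does.

```python
def detect_binary_on_add(data: bytes) -> bool:
    """Stand-in for the detection done at add/import time."""
    return b"\x00" in data


def props_for_new_file(data: bytes) -> dict:
    """Build the initial property set for a newly added file."""
    props = {}
    if detect_binary_on_add(data):
        props["svn:binary"] = "*"  # hypothetical flag, set once at add time
    return props


def is_binary(props: dict) -> bool:
    """What blame/merge/diff would consult: one flag, no MIME type list."""
    return "svn:binary" in props
```

The point of the design is that later commands never look at a list of
MIME types; the user can also add or delete the flag by hand.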

Greg
-- 
This post represents the views of the author and does
not necessarily accurately represent the views of BT.

Re: Text mime types

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-06-14 15:21:47 +0100, Julian Foad wrote:
> Vincent Lefevre wrote:
> >http://httpd.apache.org/docs-2.0/en/mod/mod_mime.html#contentencoding
> 
> OK, thanks.  That helps to explain how content type is handled in HTML.

Concerning Apache, this would be HTTP more than HTML. But of course,
other protocols, file systems (with metadata) or utilities (e.g.
"file -zi" under Linux) could reuse the same ideas.

> OK.  So you're saying we should move closer to the way content type is 
> notified in HTML*, and add a "content encoding" to the meta-data of 
> Subversion files, as the first indicator of how to handle a file.  If no 
> encoding is specified for a file, then the Subversion client program would 
> look at the MIME type to determine how to handle it.  If an encoding is 
> specified, then we could design Subversion to decode the file before 
> applying operations such as "diff" and "merge" (and it would look at the 
> MIME type to determine what to do after decoding), and encode it 
> afterwards.

For diff, there would be no encode step as this wouldn't make much
sense. However the diff behavior after the possible decode step could
depend on the MIME type (through an option). For instance, this would
allow "true" XML diff (hasn't this been requested before?).
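
To make the "decode, then diff" idea concrete, here is a small Python
sketch. It is only an illustration under assumptions: the function name
and the content_encoding/charset arguments are invented for the example,
and only gzip is handled.

```python
import difflib
import gzip


def diff_decoded(old: bytes, new: bytes, content_encoding=None, charset="utf-8"):
    """Diff two stored blobs after undoing their content encoding."""

    def decode(blob: bytes):
        if content_encoding == "gzip":
            blob = gzip.decompress(blob)
        # No re-encode step: the diff is shown on the decoded text.
        return blob.decode(charset).splitlines(keepends=True)

    return list(difflib.unified_diff(decode(old), decode(new), "old", "new"))
```

A MIME-type-specific diff (such as an XML-aware one) would replace the
difflib call after the decode step.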

For merge, it would depend on the content encoding in the working copy.
Do I need to detail?

> The user could be given the option of not having this decode/encode
> step performed.

Yes.

> You must be implying that we should add this "content encoding" field, 
> because without it there is no point in knowing what MIME type the data 
> would have after decoding.  I must have missed where you said this.

Yes, this kind of thing. This is basically why the "content encoding"
notion is used with HTTP.

I don't know about mail. The application/octet-stream MIME type is
generally (always?) used for gzipped files, and I can say that this
is really annoying when I want to view gzip-compressed files from my
MUA; of course, I could use a handler for application/octet-stream
attachments and guess the real MIME type... just like if MIME types
never existed. So, the "content encoding" system would be a real
benefit.

> This may be a direction that we want to go in. I don't know. I was
> working on the assumption that we had just one field describing the
> file's outermost type, and that therefore that field would say that
> a file was gzipped data, but would not say what kind of data had
> been gzipped.
> 
> I can't help feeling that HTML's two-level scheme (content-type and
> content-encoding) lacks generality: for instance it can't handle
> more than one encoding such as a text file that is gzipped and then
> uuencoded.

But do you really store uuencoded gzipped files in your Subversion
repository?

Anyway this isn't a problem since content codings can be chained.
See <http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11>.

  Content-Encoding  = "Content-Encoding" ":" 1#content-coding

Note: 1# means 1 or more, separated by commas. It also says:

  If multiple encodings have been applied to an entity, the content
  codings MUST be listed in the order in which they were applied.

Note that uuencode isn't defined, since it would never be used as
a content encoding anyway (perhaps just a transfer encoding). See
"3.5 Content Codings" vs "3.6 Transfer Codings" in RFC 2616.
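
For illustration, a Python sketch of how a chained Content-Encoding value
could be undone. The helper names and the table of decoders are
assumptions for the example; the only rule taken from the RFC is that
codings are listed in the order applied, so they are removed in reverse.

```python
import gzip
import zlib

# Content codings this sketch knows how to undo.
DECODERS = {
    "gzip": gzip.decompress,
    "x-gzip": gzip.decompress,
    "deflate": zlib.decompress,  # "deflate" in HTTP means the zlib format
}


def parse_content_encoding(value: str):
    """Split a comma-separated list of content codings."""
    return [token.strip().lower() for token in value.split(",") if token.strip()]


def decode_entity(body: bytes, content_encoding: str) -> bytes:
    """Undo the codings in reverse order of application."""
    for coding in reversed(parse_content_encoding(content_encoding)):
        body = DECODERS[coding](body)  # KeyError for an unknown coding
    return body
```

For example, a body that was gzipped and then deflated would carry
"gzip, deflate" and is decoded by undoing deflate first.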

> In practice this probably isn't much of a problem, but it still
> bothers me. I have for years thought about designing a hierarchical
> content-type description scheme, but haven't got very far.

I think that all is already in HTTP/1.1.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: Text mime types

Posted by Julian Foad <ju...@btopenworld.com>.
Vincent Lefevre wrote:
> On 2005-06-14 14:02:06 +0100, Julian Foad wrote:
>>Vincent Lefevre wrote:
>>>On 2005-06-13 17:56:52 +0100, Julian Foad wrote:
>>>>I think there is some confusion between "file encoding" and "transfer 
>>>>encoding".
>>>
>>>No, this is the same thing (though a file could be compressed before
>>>the transfer...).
>>
>>Sorry, I simply don't understand your point of view no matter how much you 
>>say. We'll just have to disagree until someone else explains it in a 
>> different way to one or both of us.
> 
> Perhaps the Apache documentation?
> 
> http://httpd.apache.org/docs-2.0/en/mod/mod_mime.html#contentencoding

OK, thanks.  That helps to explain how content type is handled in HTML.

> In particular, it says:
> 
>   By using more than one file extension (see section above about
>   multiple file extensions), you can indicate that a file is of a
>   particular type, and also has a particular encoding.
> 
> You can read "file" in the above paragraph, not "transfer". Note that
> here, the compression is not added by Apache, it is part of the file
> on the file system.

OK.  So you're saying we should move closer to the way content type is notified 
in HTML*, and add a "content encoding" to the meta-data of Subversion files, as 
the first indicator of how to handle a file.  If no encoding is specified for a 
file, then the Subversion client program would look at the MIME type to 
determine how to handle it.  If an encoding is specified, then we could design 
Subversion to decode the file before applying operations such as "diff" and 
"merge" (and it would look at the MIME type to determine what to do after 
decoding), and encode it afterwards.  The user could be given the option of not 
having this decode/encode step performed.

You must be implying that we should add this "content encoding" field, because 
without it there is no point in knowing what MIME type the data would have 
after decoding.  I must have missed where you said this.

This may be a direction that we want to go in.  I don't know.  I was working on 
the assumption that we had just one field describing the file's outermost type, 
and that therefore that field would say that a file was gzipped data, but would 
not say what kind of data had been gzipped.

I can't help feeling that HTML's two-level scheme (content-type and 
content-encoding) lacks generality: for instance it can't handle more than one 
encoding such as a text file that is gzipped and then uuencoded.  In practice 
this probably isn't much of a problem, but it still bothers me.  I have for 
years thought about designing a hierarchical content-type description scheme, 
but haven't got very far.

- Julian

[* Apologies if I'm saying "HTML" where I should be saying something more 
general like "HTML family".]

Re: Text mime types

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-06-14 14:02:06 +0100, Julian Foad wrote:
> Vincent Lefevre wrote:
> >On 2005-06-13 17:56:52 +0100, Julian Foad wrote:
> >>I think there is some confusion between "file encoding" and "transfer 
> >>encoding".
> >
> >No, this is the same thing (though a file could be compressed before
> >the transfer...).
> [...]
> 
> Sorry, I simply don't understand your point of view no matter how much you 
> say. We'll just have to disagree until someone else explains it in a 
>  different way to one or both of us.

Perhaps the Apache documentation?

http://httpd.apache.org/docs-2.0/en/mod/mod_mime.html#contentencoding

In particular, it says:

  By using more than one file extension (see section above about
  multiple file extensions), you can indicate that a file is of a
  particular type, and also has a particular encoding.

You can read "file" in the above paragraph, not "transfer". Note that
here, the compression is not added by Apache, it is part of the file
on the file system.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: Text mime types

Posted by Julian Foad <ju...@btopenworld.com>.
Vincent Lefevre wrote:
> On 2005-06-13 17:56:52 +0100, Julian Foad wrote:
>>I think there is some confusion between "file encoding" and "transfer 
>>encoding".
> 
> No, this is the same thing (though a file could be compressed before
> the transfer...).
[...]

Sorry, I simply don't understand your point of view no matter how much you say. 
  We'll just have to disagree until someone else explains it in a different way 
to one or both of us.

- Julian

Re: Text mime types

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-06-13 17:56:52 +0100, Julian Foad wrote:
> Vincent Lefevre wrote:
> >On 2005-06-12 13:15:37 +0100, Julian Foad wrote:
> >>Vincent Lefevre wrote:
> >>>So, svn:mime-type should contain text/plain and there should be a way
> >>>to specify the file encoding (compression scheme). "utf8" is not an
> >>>encoding in the MIME sense.
> 
> I think there is some confusion between "file encoding" and "transfer 
> encoding".

No, this is the same thing (though a file could be compressed before
the transfer...).

For instance, "wget -S http://www.vinc17.org/research/papers/rnc6.ps.gz"
gives:

[...]
10 Content-Type: application/postscript
11 Content-Encoding: x-gzip

and the file that is downloaded is really a .ps.gz file.
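
The same type/encoding split can be seen outside HTTP. As a small
illustration (the wrapper function is just for the example), Python's
standard mimetypes module returns the content type and the content
encoding as two separate values for a file name:

```python
import mimetypes


def describe(filename: str):
    """Return (content type, content encoding) guessed from the name."""
    return mimetypes.guess_type(filename)


content_type, content_encoding = describe("rnc6.ps.gz")
print(content_type, content_encoding)
```

With the module's default tables, the .gz suffix is reported as the
encoding and the .ps suffix determines the type, rather than the whole
file being labelled as gzip data.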

> >>Hmm... I can't see how that would work in general. It could have
> >>made sense in MIME's original context - attachments to email
> >>messages - where the compression was perhaps temporary, to be
> >>automatically undone at the end of the transfer, but I can't see
> >>that making sense where MIME types are used more generally to
> >>describe arbitrary files.
> >
> >The application could still perform the decompression 
> 
> It _could_, but that would be a completely new and different
> feature. We are not talking about that.

Not new. And it's the only way to get more information about the
contents of the file and the application that can handle them.

> >One of the questions one may ask is how MIME types should be used by
> >Subversion. An advantage of considering compression schemes not as
> >MIME types is for svn diff for instance: the diff could be shown on
> >the uncompressed file (as this may be unwanted or may be slow, this
> >could be an option only).
> 
> Ditto.
> 
> I think the comment in Debian's /etc/mime.types is wrong and
> misleading and is assuming that gzip (etc.) is only ever used as a
> content _transfer_ encoding. Where no transfer (with its consequent
> automatic encoding and decoding) is performed, the transfer encoding
> is inapplicable and we just want to state the file content type. In
> this regard, "gzip" (MIME type probably "application/x-gzip") is the
> outermost content type of a gzipped file.

No, when one has a gzipped postscript file, one wants to process it
with a postscript viewer, so that it really makes sense to have an
application/postscript MIME type for this file. Otherwise what would
be the need for MIME types?

Of course, a MIME type could contain information about both the
content type and the compression scheme, such as application/x-gtar
for gzipped tar archives. But an additional property specifying the
content encoding would be a nicer solution in general.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: Text mime types

Posted by Julian Foad <ju...@btopenworld.com>.
Vincent Lefevre wrote:
> On 2005-06-12 13:15:37 +0100, Julian Foad wrote:
>>Vincent Lefevre wrote:
>>>So, svn:mime-type should contain text/plain and there should be a way
>>>to specify the file encoding (compression scheme). "utf8" is not an
>>>encoding in the MIME sense.

I think there is some confusion between "file encoding" and "transfer encoding".

>>Hmm... I can't see how that would work in general. It could have
>>made sense in MIME's original context - attachments to email
>>messages - where the compression was perhaps temporary, to be
>>automatically undone at the end of the transfer, but I can't see
>>that making sense where MIME types are used more generally to
>>describe arbitrary files.
> 
> The application could still perform the decompression 

It _could_, but that would be a completely new and different feature.  We are 
not talking about that.

> One of the questions one may ask is how MIME types should be used by
> Subversion. An advantage of considering compression schemes not as
> MIME types is for svn diff for instance: the diff could be shown on
> the uncompressed file (as this may be unwanted or may be slow, this
> could be an option only).

Ditto.

I think the comment in Debian's /etc/mime.types is wrong and misleading and is 
assuming that gzip (etc.) is only ever used as a content _transfer_ encoding. 
Where no transfer (with its consequent automatic encoding and decoding) is 
performed, the transfer encoding is inapplicable and we just want to state the 
file content type.  In this regard, "gzip" (MIME type probably 
"application/x-gzip") is the outermost content type of a gzipped file.

- Julian

Re: Text mime types

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-06-12 13:15:37 +0100, Julian Foad wrote:
> Vincent Lefevre wrote:
> >gzip is an encoding, but not a MIME type. From /etc/mime.types under
> >Debian:
> >
> >#  Note: Compression schemes like "gzip", "bzip", and "compress" are not
> >#  actually "mime-types".  They are "encodings" and hence must _not_ have
> >#  entries in this file to map their extensions.  The "mime-type" of an
> >#  encoded file refers to the type of data that has been encoded, not the
> >#  type of the encoding.
> >
> >So, svn:mime-type should contain text/plain and there should be a way
> >to specify the file encoding (compression scheme). "utf8" is not an
> >encoding in the MIME sense.
> 
> Hmm... I can't see how that would work in general. It could have
> made sense in MIME's original context - attachments to email
> messages - where the compression was perhaps temporary, to be
> automatically undone at the end of the transfer, but I can't see
> that making sense where MIME types are used more generally to
> describe arbitrary files.

The application could still perform the decompression (many can do
that when it is common to compress the file, e.g. "gv", "xdvi" and
"less"). It could also be the job of the application launcher if
there's one.

> For example, what about a Zip (as in PKZIP) file?

No, a zip file is an archive, so that it has its own MIME type
(application/zip). Ditto for tar (application/x-tar). So, a
.tar.gz file should have MIME type application/x-tar.

Well, some archive files may have other MIME types when it makes
sense (static libraries, Java classes and OpenOffice files are
just archives).

Also, if a zip file has only one file, then it can be seen as a
compression scheme, but this would be up to the user to decide
(at commit time, when setting the properties) how he wants the
file to be regarded.

> That's a combination of compression and multi-file archiving. A
> single MIME type can't represent the content of all the different
> files in the Zip archive, so Zip would have to have its own MIME
> type. Then a single UTF8 file Zipped would have a MIME type of
> "Zip", and a single UTF8 file gzipped would have a MIME type of
> "UTF8 text".

And a UTF8 file encapsulated in HTML (e.g. using the "pre" element)
would have a text/html MIME type. I don't see this as a problem.

> Maybe that's how it is, but that seems awfully ugly to me and I
> don't imagine at the moment that that would be the right thing for
> Subversion to do.
> 
> I'll try to read up on the standards and the current best practices
> on using MIME types to get a better understanding of the issue, but
> I may not get around to it any time soon.

One of the questions one may ask is how MIME types should be used by
Subversion. An advantage of considering compression schemes not as
MIME types is for svn diff for instance: the diff could be shown on
the uncompressed file (as this may be unwanted or may be slow, this
could be an option only).

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: Text mime types

Posted by Julian Foad <ju...@btopenworld.com>.
Vincent Lefevre wrote:
> On 2005-06-11 23:18:53 +0100, Julian Foad wrote:
> 
>>Vincent Lefevre wrote:
>>
>>>We would need a keyword for charset encoding and a keyword for
>>>file encoding. For instance, a file file-in-utf8.txt.gz would
>>>have a text/plain MIME type, a utf8 charset encoding and a gzip
>>>file encoding.
>>
>>No. The example "file-in-utf8.txt.gz" is an example of one encoding
>>within another.
> 
> gzip is an encoding, but not a MIME type. From /etc/mime.types under
> Debian:
> 
> #  Note: Compression schemes like "gzip", "bzip", and "compress" are not
> #  actually "mime-types".  They are "encodings" and hence must _not_ have
> #  entries in this file to map their extensions.  The "mime-type" of an
> #  encoded file refers to the type of data that has been encoded, not the
> #  type of the encoding.
> 
> So, svn:mime-type should contain text/plain and there should be a way
> to specify the file encoding (compression scheme). "utf8" is not an
> encoding in the MIME sense.

Hmm... I can't see how that would work in general.  It could have made sense in 
MIME's original context - attachments to email messages - where the compression 
was perhaps temporary, to be automatically undone at the end of the transfer, 
but I can't see that making sense where MIME types are used more generally to 
describe arbitrary files.  For example, what about a Zip (as in PKZIP) file? 
That's a combination of compression and multi-file archiving.  A single MIME 
type can't represent the content of all the different files in the Zip archive, 
so Zip would have to have its own MIME type.  Then a single UTF8 file Zipped 
would have a MIME type of "Zip", and a single UTF8 file gzipped would have a 
MIME type of "UTF8 text".  Maybe that's how it is, but that seems awfully ugly 
to me and I don't imagine at the moment that that would be the right thing for 
Subversion to do.

I'll try to read up on the standards and the current best practices on using 
MIME types to get a better understanding of the issue, but I may not get around 
to it any time soon.

- Julian


Re: Text mime types

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-06-11 23:18:53 +0100, Julian Foad wrote:
> Vincent Lefevre wrote:
> >We would need a keyword for charset encoding and a keyword for
> >file encoding. For instance, a file file-in-utf8.txt.gz would
> >have a text/plain MIME type, a utf8 charset encoding and a gzip
> >file encoding.
> 
> No. The example "file-in-utf8.txt.gz" is an example of one encoding
> within another.

gzip is an encoding, but not a MIME type. From /etc/mime.types under
Debian:

#  Note: Compression schemes like "gzip", "bzip", and "compress" are not
#  actually "mime-types".  They are "encodings" and hence must _not_ have
#  entries in this file to map their extensions.  The "mime-type" of an
#  encoded file refers to the type of data that has been encoded, not the
#  type of the encoding.

So, svn:mime-type should contain text/plain and there should be a way
to specify the file encoding (compression scheme). "utf8" is not an
encoding in the MIME sense.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: Text mime types

Posted by Julian Foad <ju...@btopenworld.com>.
Vincent Lefevre wrote:
> We would need a keyword for charset encoding and a keyword for
> file encoding. For instance, a file file-in-utf8.txt.gz would
> have a text/plain MIME type, a utf8 charset encoding and a gzip
> file encoding.

No.  The example "file-in-utf8.txt.gz" is an example of one encoding within 
another.  It is only feasible for a tool like Subversion to know about the 
outermost encoding (gzip in this example).  If the user unzips the file and 
wants to store the resulting file "file-in-utf8.txt" in Subversion, THEN is the 
time when the user must tell Subversion the encoding of this new file.

If Subversion was designed to automatically zip and unzip files, then it would 
probably need to know the inner encoding; but it isn't.

- Julian

Re: Text mime types

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2005-06-08 16:27:23 +0200, Nicolas Goutte wrote:
> That is why personally I would prefer (additionally) a keyword
> handling the encoding. So if a svn tool cannot handle that encoding
> it treats it as binary. That is also safe for (future) mixed
> Subversion environments where parts of Subversion could perhaps
> process an encoding and the other part could not, depending on the
> client's version.

We would need a keyword for charset encoding and a keyword for
file encoding. For instance, a file file-in-utf8.txt.gz would
have a text/plain MIME type, a utf8 charset encoding and a gzip
file encoding.
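
A sketch of what the three pieces of metadata would mean for a client, in
Python. The class and field names are hypothetical (no such Subversion
properties exist); the example only shows the order in which they would
be applied: undo the file encoding first, then interpret the bytes with
the charset.

```python
import gzip
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileDescription:
    mime_type: str                       # e.g. "text/plain"
    charset: Optional[str] = None        # e.g. "utf-8"
    file_encoding: Optional[str] = None  # e.g. "gzip"


def read_text(raw: bytes, desc: FileDescription) -> str:
    """Turn stored bytes into text using the three descriptors."""
    if desc.file_encoding == "gzip":
        raw = gzip.decompress(raw)
    elif desc.file_encoding is not None:
        raise ValueError("unsupported file encoding: " + desc.file_encoding)
    if not desc.mime_type.startswith("text/") or desc.charset is None:
        raise ValueError("not described as text with a known charset")
    return raw.decode(desc.charset)
```

For file-in-utf8.txt.gz the description would be
FileDescription("text/plain", "utf-8", "gzip").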

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Re: Text mime types

Posted by Nicolas Goutte <ni...@snafu.de>.
On Wednesday 08 June 2005 13:15, Julian Foad wrote:
> [Replying only to the "dev" list, as we're now discussing the design.]
>
> Greg Thomas wrote:
> > Nicolas Goutte <ni...@snafu.de> wrote:
> >>It would be nice if Subversion would have a svn:text property to tweak it
> >>independently (even if perhaps its default would be "look at the mime
> >> type").
>
> We have to recognise that there is NOT a hard distinction between "text"
> and "binary".  There are different forms and degrees of "textiness". 
> Examples in approximate order of decreasing "textiness": ASCII,
> iso-latin-1, UTF-8, UTF-16; text files with a bit of binary data at the
> beginning, middle or end; binary files with some text embedded.

Just be careful that the "binary data" could be shifts in the specific 
encoding (for example some East-Asian encodings) as far as I have understood 
encodings. In such a case, you cannot simply add something like conflict 
marks despite the file looking like ASCII (or at least like a 
ISO-8859-something file) without that binary data.

The best way would be to really check the encoding but as far as I have 
understood this is far from being obvious and there are pitfalls (for example 
that UTF-8 must be tested for any ISO-8859).
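
A minimal Python sketch of that pitfall. The function name and the
two-candidate list are assumptions for the example; real detection is
much harder. ISO-8859-1 assigns a character to every byte value, so it
"succeeds" on any input and has to be tried last.

```python
def guess_charset(data: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1, testing UTF-8 first."""
    try:
        data.decode("utf-8")  # strict: fails on invalid byte sequences
        return "utf-8"
    except UnicodeDecodeError:
        # Always decodable, so it only serves as a fallback.
        return "iso-8859-1"
```

If the order were reversed, a UTF-8 file would be accepted as ISO-8859-1
and every non-ASCII character would silently turn into two wrong ones.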

>
> Therefore it is wrong to have a flag that says just "This is text".  We
> need to say "This is parseable by Subversion's built-in diff" or "This is
> displayable on the console" or other such precise statements.

That is why personally I would prefer (additionally) a keyword handling the 
encoding. So if a svn tool cannot handle that encoding it treats it as 
binary. That is also safe for (future) mixed Subversion environments where 
parts of Subversion could perhaps process an encoding and the other part 
could not, depending on the client's version.

The only problem with an encoding is that a real binary file, for example an 
executable, has no encoding at all. So such a file must be recognised or 
forced to be treated as a binary. (The svn:executable will not help here, as 
for example an object file is not an executable but it is a binary.)

>
> I don't think adding such flags to a file's properties is the way to go in
> general, because metadata should describe the file's inherent properties,
> not the manner in which it should be treated by certain specific tools.  I

Again an advantage of telling the encoding. The encoding is a file property.

> think we should implement those decisions as a configurable function of
> MIME type.  It might possibly be useful to have such properties to override
> the general configuration in special cases.

Personally I do not mind if such a function is only for overriding a default 
behaviour.

And if you want to add it to the MIME type in svn:mime-type, at least many MIME 
types have a charset extension. But I am not sure if it is the best place to 
put it in.
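
For illustration, this is how a value carrying a charset parameter could
be split into its parts, here with Python's standard email package. The
wrapper function is an assumption for the example; whether svn:mime-type
should carry the charset at all is exactly the open question.

```python
from email.message import Message


def split_mime_type(value: str):
    """Return (type/subtype, charset or None) from a MIME type string."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()
```

A value such as "text/plain; charset=UTF-8" yields the bare type and the
charset separately; a value without the parameter yields None for the
charset.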

>
> > Currently, the determination of whether or not files are binary is a
> > bit arbitrary - a file is considered binary if it has a svn:mime-type
> > other than text/*, image/x-xbitmap or image/x-xpixmap.
>
> That's one part of it.  Another part is looking at some of the bytes to see
> how close they are to ASCII.  Subversion's determination and handling of
> textiness needs a fair bit of enhancement.

As I have written above, it is not linked to nearly ASCII or not (even if this 
could be the current question for the current Subversion). It is more: does 
the svn tool in question support the encoding? If it does not at all, it should 
handle the file as binary.
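
A sketch of that fallback in Python. The function name and the idea of a
per-client set of supported encodings are assumptions for the example;
codecs.lookup is only used to normalise spellings such as "UTF8".

```python
import codecs


def treat_as_text(charset, supported) -> bool:
    """True only if this client knows how to handle the declared charset."""
    if charset is None:
        return False  # no encoding declared: a real binary file
    try:
        name = codecs.lookup(charset).name  # canonical codec name
    except LookupError:
        return False  # unknown encoding: fall back to binary handling
    return name in supported


# What an older client might support (illustrative only).
OLD_CLIENT = {"ascii", "utf-8"}
```

A newer client with a larger set would handle the same file as text
without any change to the stored properties.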

(Perhaps said otherwise, if you choose a strategy for ASCII-likeness today, it 
might be more difficult the day where svn will (need to) handle 
non-ASCII-like encodings.)

>
> > A simple svn:binary flag set if needed automatically when a file is
> > added (cf application/octet-stream) should make the whole thing a lot
> > simpler
>
> Make what simpler?
>
> > - it will also solve the problem of more exceptions being
> > added to the current list.
>
> No, it _moves_ the problem to "svn add" and "svn import".  For users
> affected by the current inextensible determination of textiness, it would
> make life easier by requiring only a one-off tweak rather than a
> work-around each time the file is diffed etc., but it's not really a proper
> solution to the problem.

Yes, of course, if the detection is done automatically when adding or 
importing, that would be great.

>
> - Julian

Have a nice day!


Re: Text mime types

Posted by Julian Foad <ju...@btopenworld.com>.
[Replying only to the "dev" list, as we're now discussing the design.]

Greg Thomas wrote:
> Nicolas Goutte <ni...@snafu.de> wrote:
>>It would be nice if Subversion would have a svn:text property to tweak it 
>>independently (even if perhaps its default would be "look at the mime type").

We have to recognise that there is NOT a hard distinction between "text" and 
"binary".  There are different forms and degrees of "textiness".  Examples in 
approximate order of decreasing "textiness": ASCII, iso-latin-1, UTF-8, UTF-16; 
text files with a bit of binary data at the beginning, middle or end; binary 
files with some text embedded.

Therefore it is wrong to have a flag that says just "This is text".  We need to 
say "This is parseable by Subversion's built-in diff" or "This is displayable 
on the console" or other such precise statements.

I don't think adding such flags to a file's properties is the way to go in 
general, because metadata should describe the file's inherent properties, not 
the manner in which it should be treated by certain specific tools.  I think we 
should implement those decisions as a configurable function of MIME type.  It 
might possibly be useful to have such properties to override the general 
configuration in special cases.
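
A minimal sketch of what "a configurable function of MIME type" with a
per-file override could look like. The rule table, the treatment names
("diffable", "console") and the function name are all hypothetical; nothing
like this exists in Subversion today.

```python
from fnmatch import fnmatchcase

# Hypothetical configuration: ordered (MIME pattern, treatment) pairs.
# First match wins; the final "*" entry is the fallback.
MIME_RULES = [
    ("text/*",          {"diffable": True,  "console": True}),
    ("image/x-xbitmap", {"diffable": True,  "console": True}),
    ("image/x-xpixmap", {"diffable": True,  "console": True}),
    ("application/xml", {"diffable": True,  "console": True}),
    ("*",               {"diffable": False, "console": False}),
]


def treatment(mime_type, overrides=None):
    """Return how a file should be treated, given its MIME type.

    `overrides` stands in for optional per-file properties that override
    the general configuration in special cases (an assumption).
    """
    result = {}
    for pattern, flags in MIME_RULES:
        if fnmatchcase(mime_type.lower(), pattern):
            result = dict(flags)  # copy so the rule table is never mutated
            break
    result.update(overrides or {})
    return result
```

With a table like this, adding another "textual" application/* type would be
a configuration change rather than a code change, and a single odd file
could still be handled through `overrides`.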

> Currently, the determination of whether or not files are binary is a
> bit arbitrary - a file is considered binary if it has a svn:mime-type
> other than text/*, image/x-xbitmap or image/x-xpixmap. 

That's one part of it.  Another part is looking at some of the bytes to see how 
close they are to ASCII.  Subversion's determination and handling of textiness 
needs a fair bit of enhancement.
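
The two parts can be sketched roughly as follows. The MIME rule is the one
quoted above; the byte check is only an illustration of "how close are the
bytes to ASCII", and its sample size and threshold are assumptions, not
Subversion's actual values.

```python
TEXTY_IMAGE_TYPES = {"image/x-xbitmap", "image/x-xpixmap"}


def mime_type_is_binary(mime_type):
    """Rule as described in this thread: anything other than text/*,
    image/x-xbitmap or image/x-xpixmap counts as binary."""
    mime_type = mime_type.strip().lower()
    return not (mime_type.startswith("text/") or mime_type in TEXTY_IMAGE_TYPES)


def looks_binary(data, sample_size=1024, threshold=0.15):
    """Guess from content: sample the leading bytes and measure how far
    they stray from printable ASCII.

    sample_size and threshold are illustrative assumptions.
    """
    sample = data[:sample_size]
    if not sample:
        return False  # an empty file is treated as text in this sketch
    if b"\x00" in sample:
        return True   # NUL bytes do not occur in ASCII-like text
    # Count bytes above 0x7F and control characters other than common whitespace.
    odd = sum(1 for b in sample if b > 0x7F or (b < 0x20 and b not in b"\t\n\r\f"))
    return odd / len(sample) > threshold
```

Note that a check of this kind reports UTF-16 text as binary because of its
NUL bytes, which is one of the limitations under discussion.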

> 
> A simple svn:binary flag set if needed automatically when a file is
> added (cf application/octet-stream) should make the whole thing a lot
> simpler

Make what simpler?

> - it will also solve the problem of more exceptions being
> added to the current list. 

No, it _moves_ the problem to "svn add" and "svn import".  For users affected 
by the current inextensible determination of textiness, it would make life 
easier by requiring only a one-off tweak rather than a work-around each time 
the file is diffed etc., but it's not really a proper solution to the problem.

- Julian


Re: Text mime types

Posted by Greg Thomas <th...@omc.bt.co.uk>.
On Mon, 6 Jun 2005 17:59:17 +0200, Nicolas Goutte <ni...@snafu.de>
wrote:

>No, sorry, by default OOo's files (even in the OASIS file format) are zipped, 
>so they are in fact binaries.

Oops - forgot that!

Greg
-- 
This post represents the views of the author and does
not necessarily accurately represent the views of BT.


Re: Text mime types

Posted by Nicolas Goutte <ni...@snafu.de>.
On Monday 06 June 2005 15:57, Greg Thomas wrote:
> On Mon, 6 Jun 2005 12:36:36 +0200, Nicolas Goutte <ni...@snafu.de>
> wrote:
> >I would be nice if Subversion would have a svn:text property to tweak it
> >independently (even if perhaps its default would be "look at the mime
> >type").
>
> Regardless of the setting of the application/xml or text/xml mime type
> discussion held elsewhere, this strikes me as an incredibly sensible
> idea (though I'll turn it on it's head and suggest svn:binary
> instead).
>
> Currently, the determination of whether or not files are binary is a
> bit arbitrary - a file is considered binary if it has a svn:mime-type
> other than text/*, image/x-xbitmap or image/x-xpixmap.
>
> A simple svn:binary flag set if needed automatically when a file is
> added (cf application/octet-stream) should make the whole thing a lot
> simpler - it will also solve the problem of more exceptions being
> added to the current list.

That is perhaps a good solution for migrating from CVS: just map -kb to
svn:binary. (But that would be a hint for the cvs2svn tool.)
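
As a rough sketch of that mapping (svn:binary is only the flag proposed in
this thread, the "*" value is an arbitrary placeholder, and the function is
hypothetical rather than anything cvs2svn provides):

```python
def props_for_cvs_file(keyword_mode):
    """Map a CVS keyword-substitution mode (the part after -k) to properties.

    "svn:binary" is the flag proposed in this thread, not an existing
    Subversion property.
    """
    props = {}
    if keyword_mode == "b":          # CVS -kb marks a file as binary
        props["svn:binary"] = "*"    # placeholder value (assumption)
    return props
```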

A third possibility is to store the encoding and have a dummy encoding for
binary. That would help Subversion merge in the right encoding (to follow my
example of UTF-16 or UTF-32).
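
A small sketch of how a stored encoding with a dummy "binary" value could be
used. The sentinel name and the function are hypothetical, and `encoding`
stands in for a per-file property that does not exist today.

```python
BINARY = "binary"  # dummy encoding name marking a file as not text (assumption)


def read_lines(data, encoding):
    """Decode file content for a line-based merge, or return None if binary.

    `encoding` stands in for a hypothetical stored per-file property.
    """
    if encoding == BINARY:
        return None  # caller falls back to whole-file (binary) handling
    # Decode first, then split, so line boundaries are found on characters
    # rather than on raw bytes.
    return data.decode(encoding).splitlines(keepends=True)
```

Decoding before splitting is the point here: in UTF-16 a newline occupies
two bytes, so splitting the raw bytes on a single 0x0A byte would leave the
following lines misaligned.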

>
> For example, the XML files of OpenOffice documents are application/* -
> http://framework.openoffice.org/documentation/mimetypes/mimetypes.html
> yet they are XML and therefore (presumably) text. No doubt they too
> should be added to the "not svn:mime-type=text/* but still text" list.
> This list could grow and grow, but a simple svn:binary flag solves the
> problem once.

No, sorry, by default OOo's files (even in the OASIS file format) are zipped, 
so they are in fact binaries.

(Both OOo file formats have a flat version too but it is seldom used.)

>
> Greg
> [dev list added, as no doubt there are other issues I've missed]

Have a nice day!


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org