Posted to dev@subversion.apache.org by cm...@collab.net on 2001/07/31 20:34:04 UTC

Ascii/binary detection.

According to the alpha-checklist.txt file, the problems--no, the
"challenges!"--of text/binary detection, keyword substitution, and
newline translation "will be solved by three separate properties":
mime-type, keyword substitution, and newline translation.  Branko,
this is basically your brainchild, so I hope you're paying attention
to what follows (with your editor's red pen in hand!)

Confused as to why the mime-type was needed, I asked Karl the
reasoning behind this property.  Karl explained that we want to allow
files to be marked as wanting newline conversion and/or keyword
substitution, but allow those attributes to be ignored for binary
files.  That way users can enable those features for, say, '*' in
their working copy directory, and not have to worry about whether or
not each target in that directory is an ascii file.  This is important
because a file's binariness can switch back and forth over the file's
lifetime.

I'm proposing the following:

1.  Develop a heuristic for determining the binariness of a file, say
    svn_io_is_binary_file ()

2.  During `svn add', svn_io_is_binary_file () is called (only on
    files, of course).  If it returns TRUE, the property
    `svn:mime-type' is set on the file with a value of
    `application/octet-stream'.  [NOTE: It'd be really nice at this
    point for the UI to say either "Added binary file foo" or "Added
    text file foo", but then again, it'd be nice if the UI said
    *anything* during an add].

3.  At this point, the user can use `svn propset' (or `svn propdel')
    to change the values of svn:mime-type, svn:line-ending, and
    svn:keywords.  We can also provide convenience subcommands for
    making these special property modifications, too (but don't make
    me pull out any -kkv's or anything!)

Now, a word about the values of these three properties.

`svn:mime-type'

    If this property is present on a given file, its value is used to
    determine the binary-ness of the contents of that file.  Values
    for this can really be just about anything, but some notable ones
    are:

        'application/octet-stream' - Generic binary file.  No keyword
                                     substitution or newline
                                     translation will occur on this
                                     file.  Also, `svn diff' will not
                                     try to display a diff for this file.

        'text/foo'                 - Text file (where 'foo' is some
                                     mime subtype like 'plain' or
                                     'html').  Keyword substitution
                                     and newline translation are
                                     available for this file, and `svn
                                     diff' will actually display diffs
                                     for it. 

`svn:line-ending'

    If this property is present on a given non-binary file, its value
    is used to determine how line-endings should be translated.

    Values for this can be:

        'native'                - Use the line ending mechanism native
                                  to the user's operating system. 

        'dos', 'unix', or 'mac' - Use CRLF, LF, or LFCR, respectively.

    Absence of this property means that no line-ending substitution
    should occur at all.

`svn:keywords'

    If this property is present on a given non-binary file, its value
    is used to determine which keywords will be substituted in that
    file.  The value is expected to be a comma-delimited list of
    keywords from the following set:

        'author'   - replaces the keyword placeholder $Author$
        'date'     - replaces the keyword placeholder $Date$
        'header'   - replaces the keyword placeholder $Header$
        'revision' - replaces the keyword placeholder $Revision$

        ...and maybe some others, depending on whatch'all want.

    Absence of this property means that no keyword substitution should
    occur at all.
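
As a rough illustration of how the three properties above would
interact, here is a minimal Python sketch.  The function name
planned_translations and the dict-based property store are hypothetical;
only the property names and values come from the proposal.  It assumes
that a missing svn:mime-type means text and that any value outside
text/* means binary, which the proposal does not spell out.

```python
def planned_translations(props):
    """Return (is_binary, line_ending, keywords) for a file's property dict."""
    mime = props.get("svn:mime-type")
    # Assumption: no svn:mime-type means text; any non-text/* type means binary.
    is_binary = mime is not None and not mime.startswith("text/")
    if is_binary:
        # Binary files get neither newline translation nor keyword substitution.
        return True, None, []
    # Absence of svn:line-ending means no line-ending translation at all.
    line_ending = props.get("svn:line-ending")
    # svn:keywords is a comma-delimited list; absence means no substitution.
    raw = props.get("svn:keywords", "")
    keywords = [k.strip() for k in raw.split(",") if k.strip()]
    return False, line_ending, keywords
```

With this, a user could set the line-ending and keyword properties on
every file in a directory and the binary ones would simply be skipped.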


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Ascii/binary detection.

Posted by kf...@collab.net.

Branko Čibej <br...@xbc.nu> writes:
> Oh, btw: I'd like to put ra_cvs on the list, handling ':pserver:*' and 
> ':ext:*' not-really-urls. For instance, the Subversion project will need 
> that unless we want to import all of APR into our repository.

Why exactly do we need that?  It's not as if CVS gets uninstalled when
you install Subversion. :-)  

It would be easier to make Subversion try to invoke CVS when it sees a
CVS working copy in a subdirectory.  Or just do nothing, and make sure
people know to update APR separately.

(Not that I've anything against ra_cvs, just think there are more
important things to devote effort to.)

-K


Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

Ben Collins-Sussman wrote:

>>Actually, subversion /will/ be able to read mail -- when we support 
>>client-side plugins, and someone writes a mail-reader plugin. :-)
>>
>
>Don't forget ra_smtp !
>
>I believe that idea has been in the spec for over a year.  ;-)
>
This reminds me: ClearCase MultiSite can do VOB synchronization via 
SMTP. Maybe it's not all so farfetched after all.

Oh, btw: I'd like to put ra_cvs on the list, handling ':pserver:*' and 
':ext:*' not-really-urls. For instance, the Subversion project will need 
that unless we want to import all of APR into our repository.

Yes, I know the suggestion is not entirely politically correct. As well 
as being off-topic for this thread. :-)

-- 
Brane Čibej
    home:   <br...@xbc.nu>             http://www.xbc.nu/brane/
    work:   <br...@hermes.si>   http://www.hermes-softlab.com/
     ACM :   <br...@acm.org>            http://www.acm.org/


Re: Ascii/binary detection.

Posted by Ben Collins-Sussman <su...@collab.net>.

> Actually, subversion /will/ be able to read mail -- when we support 
> client-side plugins, and someone writes a mail-reader plugin. :-)

Don't forget ra_smtp !

I believe that idea has been in the spec for over a year.  ;-)




Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

Branko Čibej wrote:

> kfogel@collab.net wrote:
>
>>> BTW, would it make sense to have a sort of wrappers configuration, 
>>> like CVS does?
>>>
>>
>> How exactly do you envision it working?
>>
> Somewhat like the cvswrappers file. You'd have a config file in the 
> repository area, where you'd define default values for the props, 
> based on file name patterns, or some such.
>
>    [file-types]
>    *.dsp:  --content-type=text/plain --newline=dos
>    *.html: --content-type=text/html --newline=native 
> --keywords=revision,date,author
>    *.o:    --content-type=application/octet-stream
>    s.*:    --content-type=application/x-sccs
>
>
> etc. You get the picture.


I just remembered why this won't work. This config file would have to be 
in the repository, i.e., on the server. But the server shouldn't touch 
the props. So, in general, the client won't have access to this file.

Except if we have an equivalent of CVSROOT, which the client can check out.

Ouch.

    Brane




Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

kfogel@collab.net wrote:

>>BTW, would it make sense to have a sort of wrappers configuration, like 
>>CVS does?
>>
>
>How exactly do you envision it working?
>
Somewhat like the cvswrappers file. You'd have a config file in the 
repository area, where you'd define default values for the props, based 
on file name patterns, or some such.

    [file-types]
    *.dsp:  --content-type=text/plain --newline=dos
    *.html: --content-type=text/html --newline=native --keywords=revision,date,author
    *.o:    --content-type=application/octet-stream
    s.*:    --content-type=application/x-sccs


etc. You get the picture.
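
A minimal Python sketch of how a client might read such a file and
look up defaults for a file name.  Everything beyond the sample format
itself is an assumption made for illustration: the function names, the
"first matching pattern wins" rule, and the use of shell-style globbing
via fnmatch.

```python
import fnmatch

SAMPLE = """\
[file-types]
*.dsp:  --content-type=text/plain --newline=dos
*.html: --content-type=text/html --newline=native --keywords=revision,date,author
*.o:    --content-type=application/octet-stream
s.*:    --content-type=application/x-sccs
"""

def parse_file_types(text):
    """Parse the [file-types] section into a list of (pattern, defaults)."""
    rules = []
    in_section = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("["):
            in_section = (line == "[file-types]")
            continue
        if not in_section:
            continue
        # Split "pattern: --key=value ..." at the first colon.
        pattern, _, options = line.partition(":")
        defaults = {}
        for opt in options.split():
            key, _, value = opt.lstrip("-").partition("=")
            defaults[key] = value
        rules.append((pattern.strip(), defaults))
    return rules

def defaults_for(filename, rules):
    """Return the defaults of the first pattern matching filename."""
    for pattern, defaults in rules:
        if fnmatch.fnmatchcase(filename, pattern):
            return defaults
    return {}
```

For example, defaults_for("index.html", parse_file_types(SAMPLE)) would
yield the text/html content type, native newlines, and the three
keywords listed above.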


    Brane



Re: Ascii/binary detection.

Posted by kf...@collab.net.

Branko Čibej <br...@xbc.nu> writes:
> I wonder what we'll do with a Chinese text in UTF-8? More than 65% of 
> the bytes will be >128.

We'll improve our heuristic. :-)

> BTW, would it make sense to have a sort of wrappers configuration, like 
> CVS does?

How exactly do you envision it working?


Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

kfogel@collab.net wrote:

>Branko Čibej <br...@xbc.nu> writes:
>
>>Right. Note that when we were discussing this (damn, don't have that 
>>archive any more ...), someone pointed out that the "text/*" mime types 
>>actually imply CRLF line endings. But I think we can safely ignore that; 
>>Subversion is not an MUA.
>>
>
>     "Every program attempts to expand until it can read mail.
>      Those programs which cannot so expand are replaced by ones that
>      can."
>                   -- [Third] Law of Software Envelopment
>
>                  [also apparently known as "Zawinski's Law", but jwz
>                  quotes it thus, so don't know if he's the originator] 
>
Actually, subversion /will/ be able to read mail -- when we support 
client-side plugins, and someone writes a mail-reader plugin. :-)


>>>1.  Develop a heuristic for determining the binariness of a file, say
>>>   svn_io_is_binary_file ()
>>>
>>(Two suggestions: a) don't mark the file as binary just because there's 
>>a byte with value >= 128 in it; b) if other tests aren't conclusive, 
>>check for extremely long lines in the file?)
>>
>
>That combination is exactly what we were planning to do, yeah -- some
>combination of a) at least a certain percentage of bytes with the high
>bit set, and b) long lines.
>
I wonder what we'll do with a Chinese text in UTF-8? More than 65% of 
the bytes will be >128.


BTW, would it make sense to have a sort of wrappers configuration, like 
CVS does?


>>This looks good, even if you ignore all my comments.
>>
>
>Even this one?
>
Ha, that's an opinion, not a comment. You can't ignore my opinions. :-)


-- 
Brane Čibej
    home:    <br...@xbc.nu>             http://www.xbc.nu/brane/
    work:    <br...@hermes.si>   http://www.hermes-softlab.com/
      ACM:   <br...@acm.org>            http://www.acm.org/





Re: Ascii/binary detection.

Posted by kf...@collab.net.

Branko Čibej <br...@xbc.nu> writes:
> >Since the whole point of this is to save the user the trouble of
> >explicitly specifying the type of each file, it seems to me that
> >binary detection is a great job for a client-side plugin.  Users
> >generally know what kind of files are typically found in their
> >filesystems.  If someone has a lot of UTF-8 encoded Chinese text, then
> >she'll want a different heuristic than what I'd want.
> >
> +1
> 
> I see it still takes genius to state the obvious! :-)

Also +1.

Stating the oblivious,
-K


Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

Jim Blandy wrote:

>kfogel@collab.net writes:
>
>>>>1.  Develop a heuristic for determining the binariness of a file, say
>>>>   svn_io_is_binary_file ()
>>>>
>>>(Two suggestions: a) don't mark the file as binary just because there's 
>>>a byte with value >= 128 in it; b) if other tests aren't conclusive, 
>>>check for extremely long lines in the file?)
>>>
>>That combination is exactly what we were planning to do, yeah -- some
>>combination of a) at least a certain percentage of bytes with the high
>>bit set, and b) long lines.
>>
>
>Since the whole point of this is to save the user the trouble of
>explicitly specifying the type of each file, it seems to me that
>binary detection is a great job for a client-side plugin.  Users
>generally know what kind of files are typically found in their
>filesystems.  If someone has a lot of UTF-8 encoded Chinese text, then
>she'll want a different heuristic than what I'd want.
>
+1

I see it still takes genius to state the obvious! :-)


    Brane



Re: Ascii/binary detection.

Posted by Jim Blandy <ji...@zwingli.cygnus.com>.

kfogel@collab.net writes:
> > >1.  Develop a heuristic for determining the binariness of a file, say
> > >    svn_io_is_binary_file ()
> > >
> > (Two suggestions: a) don't mark the file as binary just because there's 
> > a byte with value >= 128 in it; b) if other tests aren't conclusive, 
> > check for extremely long lines in the file?)
> 
> That combination is exactly what we were planning to do, yeah -- some
> combination of a) at least a certain percentage of bytes with the high
> bit set, and b) long lines.

Since the whole point of this is to save the user the trouble of
explicitly specifying the type of each file, it seems to me that
binary detection is a great job for a client-side plugin.  Users
generally know what kind of files are typically found in their
filesystems.  If someone has a lot of UTF-8 encoded Chinese text, then
she'll want a different heuristic than what I'd want.


Re: Ascii/binary detection.

Posted by kf...@collab.net.

Branko Čibej <br...@xbc.nu> writes:
> Right. Note that when we were discussing this (damn, don't have that 
> archive any more ...), someone pointed out that the "text/*" mime types 
> actually imply CRLF line endings. But I think we can safely ignore that; 
> Subversion is not an MUA.

     "Every program attempts to expand until it can read mail.
      Those programs which cannot so expand are replaced by ones that
      can."
                   -- [Third] Law of Software Envelopment

                  [also apparently known as "Zawinski's Law", but jwz
                  quotes it thus, so don't know if he's the originator] 

> >1.  Develop a heuristic for determining the binariness of a file, say
> >    svn_io_is_binary_file ()
> >
> (Two suggestions: a) don't mark the file as binary just because there's 
> a byte with value >= 128 in it; b) if other tests aren't conclusive, 
> check for extremely long lines in the file?)

That combination is exactly what we were planning to do, yeah -- some
combination of a) at least a certain percentage of bytes with the high
bit set, and b) long lines.

If there's a library out there which does this already, we should use
it, probably.  Anybody know of one?

> >2.  During `svn add', svn_io_is_binary_file () is called (only on
> >    files, of course).  If it returns TRUE, the property
> >    `svn:mime-type' is set on the file with a value of
> >    `application/octet-stream'.
> >
> What do you think about following the HTTP convention here? Call the 
> property svn:content-type, and encode the character set, too? Not that 
> we'll do anything with that info in 1.0.

Seems like a good idea.

> I agree. Since the heuristic can't be 100% accurate, we definitely have 
> to say what we guess about the file.

Also +1.

> Why not keyword substitution? Just make it off by default. If the user 
> wants keyword substitution in binary files, we can always let him shoot 
> himself in the foot. Besides, it can actually make sense in some kinds 
> of binary formats.

That makes sense.  As long as Subversion doesn't initiate it, it's fine.

> There are (used to be?) systems where lines are delimited from both 
> ends. On VMS, a line started with a LF and ended with a CR, IIRC. How 
> about a more generic approach: the value of this property is a pair of 
> strings, one for the BOL and one for the EOL marker. 'native' would 
> still have the same meaning, while 'dos', 'unix' and 'mac' would be 
> aliases for ':\r\n', ':\n' and ':\n\r' (or whatever), respectively. A 
> VMS guy would make 'native' an alias for '\n:\r'.

Clever, yeah.  +1.  Why not support everything, if it's easy? :-)

> >    Absence of this property means that no line-ending substitution
> >    should occur at all.
> >
> Um. I'd rather use 'none' (':', if you accept the idea outlined above), 
> and make 'native' the default for text files. Oh, and we have to 
> prescribe the repository's native format, so that we can send deltas 
> back and forth.

We can assign `none' (or `:') that meaning, if we choose, but we still
have to handle the case where the property is simply absent, and the
appropriate behavior in that case is, obviously, no conversion.  So
our code for reporting the newline conversion status to the user (for
example) would still have to special-case the property's absence, at
least if we have any reporting mechanism more fancy than the user
simply doing a proplist/propget.

> This looks good, even if you ignore all my comments.

Even this one?

:-)

-K


Re: Ascii/binary detection.

Posted by Charles Wilson <pa...@acm.org>.

>>1.  Develop a heuristic for determining the binariness of a file, say
>>    svn_io_is_binary_file ()
>>
>(Two suggestions: a) don't mark the file as binary just because 
>there's a byte with value >= 128 in it; b) if other tests aren't 
>conclusive, check for extremely long lines in the file?)

A good check for binary is a 0x00 (or 0x0000 depending on the 
encoding) in the file.

>>`svn:line-ending'
>>
>>    If this property is present on a given non-binary file, its value
>>    is used to determine how line-endings should be translated.
>>
>>    Values for this can be:
>>
>>        'native'                - Use the line ending mechanism native
>>                                  to the user's operating system.
>>        'dos', 'unix', or 'mac' - Use CRLF, LF, or LFCR, respectively.
>>
>I'm not sure what the correct 'mac' line ending is. Have to check that.

Macintosh <CR>
VMS <LF><CR>

It should be noted that due to historic screwups with implementations 
of C development environments on the Macintosh, '\n' and '\r' aren't 
always mapped correctly. It is necessary to define all of the 
following sequences:

CHR_LF, CHR_CR, CHR_NL
STR_LF, STR_CR, STR_NL

You may also want to consider inclusion of an arbitrary byte sequence 
for future use and to speed up development of new 'official' values.
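
A minimal Python sketch of line-ending translation using the values
given in this message (Macintosh = CR) together with CRLF for 'dos' and
LF for 'unix' from the original proposal.  The table name, the function
name, and the mapping of 'native' to os.linesep are assumptions made
for illustration.  The VMS <LF><CR> form and arbitrary byte sequences
are not handled here.

```python
import os
import re

# Byte sequences per style; spelled out explicitly rather than relying
# on how a given C environment maps '\n' and '\r'.
EOL_BYTES = {
    "dos": b"\r\n",
    "unix": b"\n",
    "mac": b"\r",
    "native": os.linesep.encode("ascii"),  # assumption: platform default
}

# Match CRLF before lone CR or LF so a CRLF pair counts as one line ending.
_ANY_EOL = re.compile(rb"\r\n|\r|\n")

def translate_eol(data, style):
    """Rewrite every line ending in data (bytes) to the given style."""
    eol = EOL_BYTES[style]
    return _ANY_EOL.sub(lambda match: eol, data)
```

Supporting an arbitrary sequence would amount to letting the caller
pass the replacement bytes directly instead of a style name.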

-- 
- charles


Re: Ascii/binary detection.

Posted by kf...@collab.net.

"Rick Price" <rp...@opentext.com> writes:
> Why not just ask the user to explicitly specify the type if there is any
> confusion?
> 
> This is what cvsgui does, and it works well for me.

We can (and should); but there still has to be an automated guesser
for situations where asking the user for input is not appropriate.

Large imports can sometimes be such a situation, even when the user is
present, just because of the sheer number of files and the user's
desire to do other stuff that day.

-K


Re: Ascii/binary detection.

Posted by Rick Price <rp...@opentext.com>.

Why not just ask the user to explicitly specify the type if there is any
confusion?

This is what cvsgui does, and it works well for me.

Rick

----- Original Message -----
From: "Branko Čibej" <br...@xbc.nu>
To: "Alan Shutko" <at...@acm.org>
Cc: <cm...@collab.net>; <de...@subversion.tigris.org>
Sent: Wednesday, August 01, 2001 4:27 PM
Subject: Re: Ascii/binary detection.


> Alan Shutko wrote:
>
> >cmpilato@collab.net writes:
> >
> >>The point is that some work needs to be done to create the
> >>Subversion Binariness Heuristic, and your suggestions are good ones.
> >>
> >
> >I don't have any specific suggestions here, but it would be good when
> >deciding the heuristic to take a bunch of non-english text files
> >(esp. CJK in various encodings) and try to make sure they aren't
> >always seen as binary.
> >
> I agree. But if we can't be sure, we have to default to binary, so that
> even if we're wrong, the file doesn't get munged during commit.
>
>
> >Extremely long lines also needs some judgement, because of people who
> >create text files where they use a newline per paragraph.
> >
> True, that's why I suggested checking line length as a last resort. If
> we're cautious and mark the file binary unless we're sure it's not,
> there's no harm done.
>
>
>     Brane
>
> --
> Brane Čibej
>     home:   <br...@xbc.nu>             http://www.xbc.nu/brane/
>     work:   <br...@hermes.si>   http://www.hermes-softlab.com/
>      ACM :   <br...@acm.org>            http://www.acm.org/
>

Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

Alan Shutko wrote:

>cmpilato@collab.net writes:
>
>>The point is that some work needs to be done to create the
>>Subversion Binariness Heuristic, and your suggestions are good ones.
>>
>
>I don't have any specific suggestions here, but it would be good when
>deciding the heuristic to take a bunch of non-english text files
>(esp. CJK in various encodings) and try to make sure they aren't
>always seen as binary.  
>
I agree. But if we can't be sure, we have to default to binary, so that 
even if we're wrong, the file doesn't get munged during commit.


>Extremely long lines also needs some judgement, because of people who
>create text files where they use a newline per paragraph.
>
True, that's why I suggested checking line length as a last resort. If 
we're cautious and mark the file binary unless we're sure it's not, 
there's no harm done.


    Brane

-- 
Brane Čibej
    home:   <br...@xbc.nu>             http://www.xbc.nu/brane/
    work:   <br...@hermes.si>   http://www.hermes-softlab.com/
     ACM :   <br...@acm.org>            http://www.acm.org/





Re: Ascii/binary detection.

Posted by Alan Shutko <at...@acm.org>.

cmpilato@collab.net writes:

> The point is that some work needs to be done to create the
> Subversion Binariness Heuristic, and your suggestions are good ones.

I don't have any specific suggestions here, but it would be good when
deciding the heuristic to take a bunch of non-english text files
(esp. CJK in various encodings) and try to make sure they aren't
always seen as binary.  

Extremely long lines also needs some judgement, because of people who
create text files where they use a newline per paragraph.

Just a couple things to keep aware of.

-- 
Alan Shutko <at...@acm.org> - In a variety of flavors!
Sinner: A stupid person who gets found out.


Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

Charles Wilson wrote:

> What about those evil programmers who embed EOL sequences in their 
> strings?
>
You mean, what would our heuristics show for an object file containing 
such strings? Not to worry, I think. First, the likelihood that the 
percentage of string constants in an object file is high enough to 
confuse the heuristics is pretty small. And you always get a healthy 
dose of NUL bytes in such files.

    Brane

-- 
Brane Čibej
    home:   <br...@xbc.nu>             http://www.xbc.nu/brane/
    work:   <br...@hermes.si>   http://www.hermes-softlab.com/
     ACM :   <br...@acm.org>            http://www.acm.org/





Re: Ascii/binary detection.

Posted by Charles Wilson <pa...@acm.org>.

What about those evil programmers who embed EOL sequences in their strings?

-- 
- charles


Re: Ascii/binary detection.

Posted by cm...@collab.net.

Branko Čibej <br...@xbc.nu> writes:

> Got a big, fat, orange fluorescent marker here, hur hur hur!

Yeah-heh!

> Right. Note that when we were discussing this (damn, don't have that 
> archive any more ...), someone pointed out that the "text/*" mime types 
> actually imply CRLF line endings. But I think we can safely ignore that; 
> Subversion is not an MUA.

Indeed, and I confirmed the CRLF thing yesterday reading RFCs.  Our
plan, I think, is to encourage folks to use text/unknown for textfiles
whose line endings are unimportant.  Whatcha think?

> (Two suggestions: a) don't mark the file as binary just because there's 
> a byte with value >= 128 in it; b) if other tests aren't conclusive, 
> check for extremely long lines in the file?)

Good suggestions.  I've done no research into common heuristics on
this.  I think one used by some text/hex editors is to see if some
percentage of the bytes is >= 128 or 0.  35% seems to ring a bell, but
whatever.  The point is that some work needs to be done to create the
Subversion Binariness Heuristic, and your suggestions are good ones.
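
To make the discussion concrete, here is a minimal Python sketch of one
possible heuristic.  It is not the Subversion Binariness Heuristic,
just a combination of the ideas floated in this thread: any NUL byte,
a 35% share of bytes >= 128, and (as a last resort) extremely long
lines.  The function name looks_binary and the 1000-byte line limit are
assumptions.

```python
def looks_binary(data, high_ratio=0.35, max_line=1000):
    """Guess whether data (bytes) is binary; thresholds are illustrative."""
    if not data:
        return False
    # A NUL byte is treated as a strong sign of binary content.
    if b"\x00" in data:
        return True
    # Share of bytes with the high bit set.
    high = sum(1 for byte in data if byte >= 128)
    if high / len(data) >= high_ratio:
        return True
    # Last resort: an extremely long line.
    longest = max((len(line) for line in data.splitlines()), default=0)
    return longest > max_line
```

Note that this sketch flags UTF-8 encoded CJK text as binary, since
nearly all of its bytes are >= 128, and it also flags text files that
use one newline per very long paragraph.  Both are exactly the cases a
real heuristic would need to handle better.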

> >2.  During `svn add', svn_io_is_binary_file () is called (only on
> >    files, of course).  If it returns TRUE, the property
> >    `svn:mime-type' is set on the file with a value of
> >    `application/octet-stream'.
> >
> What do you think about following the HTTP convention here? Call the 
> property svn:content-type, and encode the character set, too? Not that 
> we'll do anything with that info in 1.0.

I definitely thought about that while reading RFCs, and am certainly
not opposed to it.

> Just have 'svn add' accept --text and --binary, and possibly 
> --content-type=..., --end-of-line=...

Ah...yes...

> Why not keyword substitution? Just make it off by default. If the user 
> wants keyword substitution in binary files, we can always let him shoot 
> himself in the foot. Besides, it can actually make sense in some kinds 
> of binary formats.

Hm.  While I personally agree with you, I'm sure Karl "Ambassador For
the Little User Guy" Fogel will have something to say about this.  But
not until Thursday when he's back at work, so let's implement it now! :-)

> Um. I'd rather use 'none' (':', if you accept the idea outlined above), 
> and make 'native' the default for text files. Oh, and we have to 
> prescribe the repository's native format, so that we can send deltas 
> back and forth.

I'd like to stick with the 'none' being implied by absence of the
property, but I agree that 'native' should be the default for
textfiles.

Anyway, we have some time on this (it's not an M3 requirement).  Any
other feedback (or, uh, code contributions...) are welcome.


Re: Ascii/binary detection. - Unicode

Posted by Markus Scherer <ma...@jtcsv.com>.

Branko Čibej wrote:
> Unicode would be, for instance, "svn:content-type = text/plain;
> charset=UCS-2". Not that we'll support Unicode (or ISO10646) directly in 1.0

No, at least not if text/* means anything related to MIME.

UCS-2 (by the way, deprecated and superseded by UTF-16) is a 16-bit encoding, which means that its byte-serialized encoding scheme uses all byte values including 0. This is incompatible with MIME text/* types.

text/* must be byte-oriented, avoid control codes and 0, use CRLF line endings (you said earlier that you don't care about this restriction in svn), and are (must be? not sure) ASCII-based.

However, "svn:content-type = text/plain; charset=UTF-8" would work.
With UTF-8, you do support Unicode already, right? :-)
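
The zero-byte point can be checked with a few lines of Python. This is only an illustrative sketch: the helper name contains_nul is made up here, and Python's "utf-16-be" codec stands in for UCS-2 (the two agree for characters in the Basic Multilingual Plane).

```python
def contains_nul(text: str, encoding: str) -> bool:
    """Return True if encoding `text` yields at least one zero byte."""
    return b"\x00" in text.encode(encoding)


sample = "Hello, svn\n"

# Every ASCII character becomes a 00 xx byte pair in a 16-bit encoding.
print(contains_nul(sample, "utf-16-be"))  # True

# UTF-8 encodes ASCII as single non-zero bytes, and multi-byte sequences
# never contain a zero byte either.
print(contains_nul(sample, "utf-8"))      # False
```

So a byte-oriented consumer that treats 0 as "not text" would reject the 16-bit form but accept the UTF-8 form.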

markus

Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

Alexander Mueller wrote:

>What about Unicode? This definitely is (in my eyes) text, not binary.
>But handling Unicode seems to be a pretty complex thing to me.
>But important as well. One wants to be able to text-diff Unicode files
>and to use keyword substitution....
>
Unicode would be, for instance, "svn:content-type = text/plain; 
charset=UCS-2". Not that we'll support Unicode (or ISO10646) directly in 1.0


    Brane

-- 
Brane Čibej
    home:   <br...@xbc.nu>             http://www.xbc.nu/brane/
    work:   <br...@hermes.si>   http://www.hermes-softlab.com/
     ACM :   <br...@acm.org>            http://www.acm.org/




Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

Jacek Prucia wrote:

>I think Subversion should consider only the Content-Type group, and safely
>ignore everything after the slash, so:
>
>svn:mime-type = "application/*" -> binary file
>svn:mime-type = "image/*"		-> binary file
>svn:mime-type = "text/*"		-> text file
>
I don't think so. What I mean is, later on we'll add client-side plugins 
that will be triggered off of the full content type, so we have to store 
the full type in the props.

Of course, right now all we can say is: "text/*" is text, anything else 
is binary.


>svn:mime-type = unknown|missing	-> set svn:mime-type to
>"application/octet-stream" -> binary file just to be safe :)
>
Absolutely.
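
The rule agreed on so far ("text/*" is text, anything else is binary, and a missing or unknown type is treated as application/octet-stream) could be sketched roughly as follows. The function names and the dict-of-properties representation are assumptions made for illustration, not Subversion APIs; the full type is kept in the property, and only the text/binary decision looks at the group.

```python
DEFAULT_TYPE = "application/octet-stream"


def effective_type(props: dict) -> str:
    """Return the stored content type, or the safe binary default.

    `props` is a hypothetical mapping of property names to values.
    """
    value = props.get("svn:mime-type", "").strip()
    if not value or value.lower() == "unknown":
        return DEFAULT_TYPE
    return value


def is_text(props: dict) -> bool:
    """Only "text/*" counts as text; everything else is binary for now."""
    # Drop parameters such as "; charset=UTF-8" before looking at the group.
    media_type = effective_type(props).split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")


print(is_text({"svn:mime-type": "text/plain; charset=UTF-8"}))  # True
print(is_text({"svn:mime-type": "image/png"}))                  # False
print(is_text({}))                                              # False
```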


>Back to discussion on mime-types:
>
>Branko noted some time ago that if we want to rely on some mime-types
>file, we have to be sure that server and client use the same file. This
>seems a little bit complicated and not that easy to implement. However,
>there seems to be another solution.
>
>During initial checkin some files might have svn:mime-type set, some not.
>Subversion client will transfer all files that don't have svn:mime-type set
>(or that have unknown value) as application/octet-stream, but *without*
>touching svn:mime-type property. Server upon receiving file checks its
>svn:mime-type. If it's still not set, then server fills it up with some
>value (using mime-types file), so during first checkout all files have
>svn:mime-type set to some value. If user isn't happy with what server just
>did - he can use svn propset.
>
>That way we have central mime-types management, and users can set
>svn:mime-type if they want to *or* (in case of large initial import) leave
>this decision to Subversion server.
>

I disagree. As a matter of principle, the server should never touch the 
contents of a file or its properties.

    Brane

-- 
Brane Čibej
    home:   <br...@xbc.nu>             http://www.xbc.nu/brane/
    work:   <br...@hermes.si>   http://www.hermes-softlab.com/
     ACM :   <br...@acm.org>            http://www.acm.org/




Re: Ascii/binary detection.

Posted by Branko Čibej <br...@xbc.nu>.

cmpilato@collab.net wrote:

>According to the alpha-checklist.txt file, the problems--no, the
>"challenges!"--of text/binary detection, keyword substitution, and
>newline translation "will be solved by three separate properties":
>mime-type, keyword substitution, and newline translation.  Branko,
>this is basically your brainchild, so I hope you're paying attention
>to what follows (with your editor's red pen in hand!)
>
Got a big, fat, orange fluorescent marker here, hur hur hur!


>Confused as to why the mime-type was needed, I asked Karl the
>reasoning behind this property.  Karl explained that we want to allow
>files to be marked as wanting newline conversion and/or keyword
>substitution, but allow those attributes to be ignored for binary
>files.  That way users can enable those features for, say, '*' in
>their working copy directory, and not have to worry about whether or
>not each target in that directory is an ascii file.  This is important
>because a file's binariness can switch back and forth over the file's
>lifetime.
>
Right. Note that when we were discussing this (damn, don't have that 
archive any more ...), someone pointed out that the "text/*" mime types 
actually imply CRLF line endings. But I think we can safely ignore that; 
Subversion is not an MUA.


>I'm proposing the following:
>
>1.  Develop a heuristic for determining the binariness of a file, say
>    svn_io_is_binary_file ()
>
(Two suggestions: a) don't mark the file as binary just because there's 
a byte with value >= 128 in it; b) if other tests aren't conclusive, 
check for extremely long lines in the file?)
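
A rough sketch of such a heuristic, written in Python for brevity rather than as the proposed svn_io_is_binary_file (): the function name, the 5% control-character threshold and the 4096-byte line limit are all assumptions chosen for illustration, and the NUL-byte and control-character tests are common additions not spelled out above.

```python
def looks_binary(data: bytes, max_line_len: int = 4096) -> bool:
    """Guess whether a sample of file content is binary."""
    if not data:
        return False  # treat an empty file as text
    if b"\x00" in data:
        return True   # NUL bytes almost never occur in text files

    # Bytes >= 128 are deliberately NOT counted against the file:
    # Latin-1, UTF-8 and friends are still text (suggestion a).
    allowed_controls = {9, 10, 12, 13}  # tab, LF, FF, CR
    controls = sum(1 for b in data if b < 32 and b not in allowed_controls)
    if controls / len(data) > 0.05:
        return True

    # Inconclusive so far: fall back to the long-line check (suggestion b).
    longest = max(len(line) for line in data.splitlines() or [b""])
    return longest > max_line_len


print(looks_binary(b"hello\nworld\n"))             # False
print(looks_binary(b"\x89PNG\r\n\x1a\n\x00\x00"))  # True
```

Since the guess can be wrong either way, the result is best treated as a default that the user can override.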


>2.  During `svn add', svn_io_is_binary_file () is called (only on
>    files, of course).  If it returns TRUE, the property
>    `svn:mime-type' is set on the file with a value of
>    `application/octet-stream'.
>
What do you think about following the HTTP convention here? Call the 
property svn:content-type, and encode the character set, too? Not that 
we'll do anything with that info in 1.0.


>  [NOTE: It'd be really nice at this
>    point for the UI to say either "Added binary file foo" or "Added
>    text file foo", but then again, it'd be nice if the UI said
>    *anything* during an add].
>
I agree. Since the heuristic can't be 100% accurate, we definitely have 
to say what we guess about the file.


>3.  At this point, the user can use `svn propset' (or `svn propdel')
>    to change the values of svn:mime-type, svn:line-ending, and
>    svn:keywords.  We can also provide convenience subcommands for
>    making these special property modifications, too (but don't make
>    me pull out any -kkv's or anything!)
>
Just have 'svn add' accept --text and --binary, and possibly 
--content-type=..., --end-of-line=...


>Now, a word about the values of these three properties.
>
>`svn:mime-type'
>
>    If this property is present on a given file, its value is used to
>    determine the binary-ness of the contents of that file.  Values
>    for this can really be just about anything, but some notable ones
>    are:
>
>        'application/octet-stream' - Generic binary file.  No keyword
>                                     substitution or newline
>                                     translation will occur on this
>                                     file.  Also, `svn diff' will not
>                                     try to display a diff for this file.
>
Why not keyword substitution? Just make it off by default. If the user 
wants keyword substitution in binary files, we can always let him shoot 
himself in the foot. Besides, it can actually make sense in some kinds 
of binary formats.

>
>        'text/foo'                 - Text file (where 'foo' is some
>                                     mime subtype like 'plain' or
>                                     'html').  Keyword substitution
>                                     and newline translation are
>                                     available for this file, and `svn
>                                     diff' will actually display diffs
>                                     for it. 
>
          'anything/else'            - Treat as generic binary, for now. Later
                                       on we can hang custom-diff hooks and
                                       other nice stuff on that.

>
>
>`svn:line-ending'
>
>    If this property is present on a given non-binary file, its value
>    is used to determine how line-endings should be translated.
>
>    Values for this can be:
>
>        'native'                - Use the line ending mechanism native
>                                  to the user's operating system. 
>
>        'dos', 'unix', or 'mac' - Use CRLF, LF, or LFCR, respectively.
>
I'm not sure what the correct 'mac' line ending is. Have to check that.

There are (used to be?) systems where lines are delimited from both 
ends. On VMS, a line started with a LF and ended with a CR, IIRC. How 
about a more generic approach: the value of this property is a pair of 
strings, one for the BOL and one for the EOL marker. 'native' would 
still have the same meaning, while 'dos', 'unix' and 'mac' would be 
aliases for ':\r\n', ':\n' and ':\n\r' (or whatever), respectively. A 
VMS guy would make 'native' an alias for '\n:\r'.

(And someone porting SVN to the ZX Spectrum will define 'native' as 
':\r' -- then run out of memory when compiling neon :-)
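
To make the BOL:EOL idea concrete, here is a small Python sketch. The parsing rules (a single ':' separator, only \r and \n as escapes) and the function names are assumptions; the alias table copies the values from the text above, including the unverified 'mac' entry (classic Mac OS actually ended lines with a bare CR). 'native' is left out because it depends on the platform.

```python
# Aliases exactly as proposed above; the 'mac' value is unverified there.
ALIASES = {"dos": r":\r\n", "unix": r":\n", "mac": r":\n\r"}
ESCAPES = {r"\r": "\r", r"\n": "\n"}


def parse_marker_pair(value: str) -> tuple:
    """Turn 'BOL:EOL' (or an alias) into a (bol, eol) pair of strings."""
    value = ALIASES.get(value, value)
    bol, sep, eol = value.partition(":")
    if not sep:
        raise ValueError(f"expected 'BOL:EOL', got {value!r}")
    for escape, char in ESCAPES.items():
        bol = bol.replace(escape, char)
        eol = eol.replace(escape, char)
    return bol, eol


def render_lines(lines: list, value: str) -> str:
    """Join logical lines using the begin/end markers named by `value`."""
    bol, eol = parse_marker_pair(value)
    return "".join(bol + line + eol for line in lines)


print(repr(render_lines(["a", "b"], "unix")))     # 'a\nb\n'
print(repr(render_lines(["a", "b"], r"\n:\r")))   # '\na\r\nb\r'
```

With this shape, ':' on its own parses to two empty markers, which matches using it for "no translation".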


>    Absence of this property means that no line-ending substitution
>    should occur at all.
>
Um. I'd rather use 'none' (':', if you accept the idea outlined above), 
and make 'native' the default for text files. Oh, and we have to 
prescribe the repository's native format, so that we can send deltas 
back and forth.


>`svn:keywords'
>
>    If this property is present on a given non-binary file, its value
>    is used to determine which keywords will be substituted in that
>    file.  The value is expected to be a comma-delimited list of
>    keywords from the following set:
>
>        'author'   - replaces the keyword placeholder $Author$
>        'date'     - replaces the keyword placeholder $Date$
>        'header'   - replaces the keyword placeholder $Header$
>        'revision' - replaces the keyword placeholder $Revision$
>
>        ...and maybe some others, depending on whatch'all want.
>
>    Absence of this property means that no keyword substitution should
>    occur at all.
>
Hum. Can't find a nit here. How sad. :-)
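
For what it's worth, the substitution described above might look something like this sketch. The proposal does not say what an expanded keyword looks like, so the "$Name: value $" form (borrowed from RCS/CVS), the function name, and the `values` mapping are all assumptions.

```python
import re

# Property keyword -> placeholder name, from the list in the proposal.
KNOWN = {"author": "Author", "date": "Date",
         "header": "Header", "revision": "Revision"}


def expand_keywords(text: str, prop_value: str, values: dict) -> str:
    """Expand only the keywords named in the comma-delimited property."""
    enabled = [k.strip().lower() for k in prop_value.split(",") if k.strip()]
    for key in enabled:
        if key not in KNOWN or key not in values:
            continue  # unknown keyword, or nothing to substitute
        name = KNOWN[key]
        # Match both the bare "$Name$" and an already expanded "$Name: ... $".
        pattern = re.compile(r"\$" + name + r"(?::[^$\n]*)?\$")
        text = pattern.sub(
            lambda m, n=name, v=values[key]: f"${n}: {v} $", text)
    return text


print(expand_keywords("$Author$ $Date$", "author",
                      {"author": "cmpilato", "date": "2001-07-31"}))
# $Author: cmpilato $ $Date$
```

Keywords that are not listed in the property are left alone, and an empty property value leaves the text unchanged.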


This looks good, even if you ignore all my comments.

    Brane



Re: Ascii/binary detection.

Posted by su...@thewrittenword.com.

On Tue, Jul 31, 2001 at 03:34:04PM -0500, cmpilato@collab.net wrote:
> `svn:line-ending'
> 
>     If this property is present on a given non-binary file, its value
>     is used to determine how line-endings should be translated.
> 
>     Values for this can be:
> 
>         'native'                - Use the line ending mechanism native
>                                   to the user's operating system. 
> 
>         'dos', 'unix', or 'mac' - Use CRLF, LF, or LFCR, respectively.

Why not make the values 'crlf', 'lf', and 'lfcr'. This way, if new
operating systems are introduced which use an existing line
termination value, a new value for the OS does not need to exist.

The problem with this is that users would then need to know the line
termination character of their OS. One alternative is to advocate
'crlf', 'lf', and 'lfcr' but have aliases for the "popular" OS's that
translate to the correct value.
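
As a sketch of that alternative (Python, with made-up names): the property stores the terminator by name, and the OS names are just aliases that resolve to it. The byte values follow the CRLF/LF/LFCR mapping quoted above.

```python
# Canonical names describe the terminator itself.
EOL_BYTES = {"crlf": b"\r\n", "lf": b"\n", "lfcr": b"\n\r"}

# Convenience aliases for "popular" operating systems.
OS_ALIASES = {"dos": "crlf", "unix": "lf", "mac": "lfcr"}


def eol_for(value: str) -> bytes:
    """Resolve a property value (canonical name or OS alias) to bytes."""
    name = value.strip().lower()
    name = OS_ALIASES.get(name, name)
    if name not in EOL_BYTES:
        raise ValueError(f"unknown line-ending value: {value!r}")
    return EOL_BYTES[name]


print(eol_for("dos") == eol_for("crlf"))  # True
```

A new OS that reuses an existing terminator then only needs a new alias entry, not a new value.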

-- 
albert chin (china@thewrittenword.com)
