Posted to dev@subversion.apache.org by Navrotskiy Artem <bo...@ya.ru> on 2014/10/03 09:26:32 UTC

Subversion binary file detection is look like broken

Hello,

The Subversion console client tries to detect binary files with this algorithm:

 1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why
    not check whether the first N bytes are correct UTF-8?);
 2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
    uniform distribution of bytes, the chance of no ZERO byte is
    (1 - 1/256) ^ 1024 = ~1.8%);
 3. File is BINARY if over 85% of the first 1024 bytes are not in the
    ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary" byte
    values, ~60%).

This algorithm looks broken.

For example:
 1. File "text.txt":
    This file contains a block of text from Wikipedia about Subversion in
    UTF-8 (https://ru.wikipedia.org/wiki/Subversion) and unfortunately
    contains too many Cyrillic characters (one character = 2 "binary" bytes).
 2. File "binary.txt" detected as "text":
    It was created by "dd if=/dev/urandom of=binary.txt count=1 bs=2048" and
    unfortunately does not contain a ZERO byte in the first 1024 bytes.

Re: Subversion binary file detection is look like broken

Posted by Branko Čibej <br...@wandisco.com>.
On 03.10.2014 22:30, Barry Scott wrote:
> If you do look at this, you might want to fix the problem of .svg files
> being classed as binary, whereas they are XML. I'm guessing that the
> mime type is used, on the assumption that an image/* file cannot be text.

But do you seriously propose that "svn diff" and "svn merge" should just
work for SVG files? That's actually what "not binary" means to
Subversion. I can't imagine how that wouldn't break 90% of the time if
the algorithm isn't aware of the SVG schema (*not* just aware of XML
syntax).

-- Brane


Re: Subversion binary file detection is look like broken

Posted by Barry Scott <ba...@barrys-emacs.org>.
On 3 Oct 2014, at 13:15, Julian Foad <ju...@btopenworld.com> wrote:

> Stefan Sperling wrote:
>> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>>     The Subversion console client tries to detect binary files with this algorithm:
>>> 
>>>      1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why
>>>         not check whether the first N bytes are correct UTF-8?);
>>>      2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
>>>         uniform distribution of bytes, the chance of no ZERO byte is
>>>         (1 - 1/256) ^ 1024 = ~1.8%);
>>>      3. File is BINARY if over 85% of the first 1024 bytes are not in the
>>>         ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary" byte
>>>         values, ~60%).
>>> 
>>>     This algorithm looks broken.
> 
> The requirement (3) for >85% non-ASCII* bytes => binary, was a historical accident. The 
> original intention was >15% non-ASCII bytes => binary, or in other words >85% ASCII bytes => text. Quoting from libsvn_subr/io.c:svn_io_is_binary_data():
> 
>      NOTE:  Originally, I intended to target 85% of the bytes being in
>      the specified ranges, but I flubbed the condition.  At any rate,
>      folks aren't complaining, so I'm not sure that it's worth
>      adjusting this retroactively now.
> 
> Perhaps now is the time to change that to match the original intent.
> 
> * I use the term ASCII loosely to mean "bytes in those two ranges".
> 
> 
>> Can you suggest a better algorithm?
>> 
>>> For example:
>>>      1. File "text.txt":
>>> This file contains a block of text from Wikipedia about Subversion in UTF-8
>>> (https://ru.wikipedia.org/wiki/Subversion) and unfortunately contains too
>>> many Cyrillic characters (one character = 2 "binary" bytes).
>>>      2. File "binary.txt" detected as "text":
>>> It was created by "dd if=/dev/urandom of=binary.txt count=1 bs=2048" and
>>> unfortunately does not contain a ZERO byte in the first 1024 bytes.
> 
> Changing the 85% condition would fix example 2. However it would make example 1 occur more often, unless we also make valid UTF-8 be detected as text.
> 
> It does sound like a good idea to make valid UTF-8 be detected as text.

If you do look at this, you might want to fix the problem of .svg files being classed as binary, whereas they are XML. I'm guessing that the mime type is used, on the assumption that an image/* file cannot be text.

Barry


Re: Subversion binary file detection is look like broken

Posted by Julian Foad <ju...@btopenworld.com>.
Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>    The Subversion console client tries to detect binary files with this algorithm:
>> 
>>     1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why
>>        not check whether the first N bytes are correct UTF-8?);
>>     2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
>>        uniform distribution of bytes, the chance of no ZERO byte is
>>        (1 - 1/256) ^ 1024 = ~1.8%);
>>     3. File is BINARY if over 85% of the first 1024 bytes are not in the
>>        ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary" byte
>>        values, ~60%).
>> 
>>    This algorithm looks broken.

The requirement (3) for >85% non-ASCII* bytes => binary, was a historical accident. The 
original intention was >15% non-ASCII bytes => binary, or in other words >85% ASCII bytes => text. Quoting from libsvn_subr/io.c:svn_io_is_binary_data():

     NOTE:  Originally, I intended to target 85% of the bytes being in
     the specified ranges, but I flubbed the condition.  At any rate,
     folks aren't complaining, so I'm not sure that it's worth
     adjusting this retroactively now.

Perhaps now is the time to change that to match the original intent.

* I use the term ASCII loosely to mean "bytes in those two ranges".
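
To make the difference between the two conditions concrete, here is a minimal Python sketch of the heuristic as it is described in this thread. It is not the Subversion source: the names `looks_binary` and `max_other_percent` are made up for illustration, and treating an empty block as text is an assumption.

```python
# Sketch of the heuristic as described in this thread (not the actual
# libsvn_subr/io.c code). All names here are hypothetical.
UTF8_BOM = b"\xef\xbb\xbf"


def is_textish(byte: int) -> bool:
    """True for bytes in the two ranges 0x07-0x0D and 0x20-0x7F."""
    return 0x07 <= byte <= 0x0D or 0x20 <= byte <= 0x7F


def looks_binary(block: bytes, max_other_percent: int = 85) -> bool:
    """Classify a block, e.g. the first 1024 bytes of a file.

    max_other_percent=85 mirrors the condition as it ended up;
    max_other_percent=15 mirrors the originally intended condition.
    """
    if block == UTF8_BOM:   # a file holding only the BOM counts as text
        return False
    if not block:           # assumption: an empty block counts as text
        return False
    if 0 in block:          # any zero byte means binary
        return True
    other = sum(1 for b in block if not is_textish(b))
    return other * 100 > max_other_percent * len(block)
```

With data that cycles evenly through all non-zero byte values, about 60% of the bytes fall outside the two ranges, so the block passes as text at 85 but is flagged as binary at 15.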


> Can you suggest a better algorithm?
> 
>> For example:
>>     1. File "text.txt":
>> This file contains a block of text from Wikipedia about Subversion in UTF-8
>> (https://ru.wikipedia.org/wiki/Subversion) and unfortunately contains too
>> many Cyrillic characters (one character = 2 "binary" bytes).
>>     2. File "binary.txt" detected as "text":
>> It was created by "dd if=/dev/urandom of=binary.txt count=1 bs=2048" and
>> unfortunately does not contain a ZERO byte in the first 1024 bytes.

Changing the 85% condition would fix example 2. However it would make example 1 occur more often, unless we also make valid UTF-8 be detected as text.
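
The figures behind both examples can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the Cyrillic sample string is a made-up stand-in for the Wikipedia text.

```python
# Rough check of the numbers quoted in this thread.

def is_textish(byte: int) -> bool:
    """True for bytes in the two ranges 0x07-0x0D and 0x20-0x7F."""
    return 0x07 <= byte <= 0x0D or 0x20 <= byte <= 0x7F


# Byte values outside both ranges, and their share of all 256 values.
other_values = sum(1 for b in range(256) if not is_textish(b))
other_share = other_values / 256

# Chance that 1024 uniformly random bytes contain no zero byte.
p_no_zero = (1 - 1 / 256) ** 1024

# Example 1: UTF-8 Cyrillic text uses two non-ASCII bytes per letter.
sample = "Привет мир".encode("utf-8")  # hypothetical sample text
sample_share = sum(1 for b in sample if not is_textish(b)) / len(sample)

print(f"values outside the ranges: {other_values} ({other_share:.1%})")
print(f"chance of no zero byte in 1024 random bytes: {p_no_zero:.2%}")
print(f"share of such bytes in the Cyrillic sample: {sample_share:.1%}")
```

Random data sits near 60%, which is below 85% but well above 15%, so the intended threshold catches example 2. The Cyrillic sample is above both thresholds, which is why a threshold change alone does not help example 1.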

It does sound like a good idea to make valid UTF-8 be detected as text.
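
One way such a check could look, as a sketch only: the function name is hypothetical, and the block is assumed to be a prefix of the file, so a multi-byte sequence cut off at the end of the block is tolerated rather than treated as an error.

```python
import codecs


def is_valid_utf8_prefix(block: bytes) -> bool:
    """True if block decodes as UTF-8.

    A multi-byte sequence truncated at the very end of the block is
    accepted, because the block may be only the start of a larger file.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False keeps an incomplete trailing sequence buffered
        # instead of raising an error for it.
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Note that a zero byte is itself valid UTF-8, so a check like this would have to run after the zero-byte test, not instead of it.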

- Julian

Re: Subversion binary file detection is look like broken

Posted by "Artem V. Navrotskiy" <bo...@yandex.ru>.
Hello,

On 03.10.2014 15:35, Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>     Hello,
>>
>>
>>
>>     The Subversion console client tries to detect binary files with this algorithm:
>>
>>      1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why
>>         not check whether the first N bytes are correct UTF-8?);
>>      2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
>>         uniform distribution of bytes, the chance of no ZERO byte is
>>         (1 - 1/256) ^ 1024 = ~1.8%);
>>      3. File is BINARY if over 85% of the first 1024 bytes are not in the
>>         ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary" byte
>>         values, ~60%).
>>
>>     This algorithm looks broken.
>>
> Can you suggest a better algorithm?
About false positives:

 1. If a text file is detected as binary:
      * with "svn:auto-props = '*.txt = svn:eol-style=native'" the svn
        client blocks adding this file: svn:eol-style and
        svn:mime-type=application/octet-stream can't be set
        simultaneously;
        There is a workaround:
          o create an empty file;
          o run svn add for the empty file;
          o replace the empty file with the real data;
          o commit.
      * you can't diff or merge this file (Cannot display: file marked
        as a binary type.).
        You can't fix it, because you can't remove the svn:mime-type
        property in the last modified revision.
 2. If a binary file is detected as text:
      * svn diff and merge display unusable output.
        You can fix it in the current revision by setting the
        svn:mime-type property.

I think the false positive where a text file is detected as binary is the more annoying one.


About file type detection:

 1. The file detection algorithm must be as simple as possible.
 2. If the first N bytes contain a ZERO byte, the file is binary.
 3. If the file is valid UTF-8, the file is text.
 4. If the file contains too many binary characters, the file is binary.
    I think the definitely binary characters are: 0x00-0x08, 0x0B, 0x0C,
    0x0E-0x1F, 0x7F (30 byte values, ~11.7%).
    These characters are very rarely used in text files. Characters from
    the range 0x80-0xFF can be letters in some encodings.
    The comparison threshold should be significantly lower than the
    share these characters would have in uniformly random data.
    For example, if more than about 2.5% of the first N bytes are "binary"
    characters, the file is binary.
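
The four rules above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names are hypothetical, the caller is expected to pass the first N bytes of the file, and the 2.5% threshold is the example value from the text. As listed, the set of control bytes has 30 members when 0x00 is included.

```python
import codecs

# Byte values listed above as "definitely binary".
CONTROL_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B, 0x0C] + list(range(0x0E, 0x20)) + [0x7F]
)


def is_valid_utf8_prefix(block: bytes) -> bool:
    """True if block decodes as UTF-8, allowing a truncated final sequence."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return False
    return True


def proposed_is_binary(block: bytes, threshold: float = 0.025) -> bool:
    """Apply rules 2-4 to the first N bytes of a file (hypothetical sketch)."""
    if 0 in block:                       # rule 2: zero byte means binary
        return True
    if is_valid_utf8_prefix(block):      # rule 3: valid UTF-8 means text
        return False
    control = sum(1 for b in block if b in CONTROL_BYTES)
    return control > threshold * len(block)   # rule 4: too many control bytes
```

Under this sketch, text in a legacy 8-bit encoding is still treated as text as long as it contains few control bytes, because bytes in 0x80-0xFF are not counted against it.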


Overall, the following implementations look workable to me:

 1. Git autodetection: if the first 8000 bytes contain a ZERO byte, the file
    is binary.
    + As simple as possible;
    + Can't detect text files as binary;
    - Can detect some binary files as text;
 2. Byte range autodetection: if the first N bytes contain a byte from the
    range 0x00-0x08 or 0x0E-0x1F, the file is binary.
    + Still simple;
    - Can detect some short binary files as text;
 3. Byte range autodetection: if more than a threshold percentage of the
    first N bytes are from 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F, the file
    is binary.
    - Not so simple;
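
For comparison, the three options can be written down as short Python functions. This is a sketch, not code from Git or Subversion: the function names are hypothetical, N = 1024 is an assumed default for options 2 and 3, and the 2.5% threshold for option 3 is an assumed example value.

```python
# Byte values treated as binary in option 2 (0x00-0x08 and 0x0E-0x1F).
LOW_CONTROL = frozenset(range(0x00, 0x09)) | frozenset(range(0x0E, 0x20))
# Option 3 additionally counts 0x0B, 0x0C and 0x7F.
CONTROL = LOW_CONTROL | {0x0B, 0x0C, 0x7F}


def option1_zero_byte(data: bytes, limit: int = 8000) -> bool:
    """Option 1: binary if the first `limit` bytes contain a zero byte."""
    return 0 in data[:limit]


def option2_any_control(data: bytes, limit: int = 1024) -> bool:
    """Option 2: binary if any of the first `limit` bytes is a low control byte."""
    return any(b in LOW_CONTROL for b in data[:limit])


def option3_control_ratio(
    data: bytes, limit: int = 1024, threshold: float = 0.025
) -> bool:
    """Option 3: binary if control bytes exceed `threshold` of the block."""
    block = data[:limit]
    if not block:
        return False
    hits = sum(1 for b in block if b in CONTROL)
    return hits > threshold * len(block)
```

One consequence worth noting: a text file with terminal escape sequences (byte 0x1B) passes option 1 as text but is flagged as binary by option 2, and by option 3 if the escapes are dense enough.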



Best regards,
Navrotskiy Artem.

Re: Subversion binary file detection is look like broken

Posted by Stefan Sperling <st...@elego.de>.
On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>    Hello,
> 
> 
> 
>    The Subversion console client tries to detect binary files with this algorithm:
> 
>     1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why
>        not check whether the first N bytes are correct UTF-8?);
>     2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
>        uniform distribution of bytes, the chance of no ZERO byte is
>        (1 - 1/256) ^ 1024 = ~1.8%);
>     3. File is BINARY if over 85% of the first 1024 bytes are not in the
>        ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary" byte
>        values, ~60%).
> 
>    This algorithm looks broken.
> 

Can you suggest a better algorithm?