You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Stefan Kueng <to...@gmail.com> on 2019/01/05 15:58:00 UTC

extending the blame callback

Hi,

When running blame on an utf16 text file, the lines can not be used 
since the blame callback only passes a 'char *' parameter which means it 
ends at the first zero char. But actually, svn knows if the line has 
more content.
So I'd like to propose the following patch which extends the blame 
callback with a 'bytelength' parameter indicating how many bytes the 
line consists of. That way, clients can themselves determine whether to 
show the full line (e.g. using hex display) or maybe try to convert the 
line string from different encodings.
Without the length parameter, clients can not exceed beyond the first 
null char to determine what the line really consists of without risking 
an access violation.

I've already tested this with TSVN and now I can properly blame utf16 
files. While it's not straight forward to figure out how to get the real 
uft16 line, it's possible. One problem is that every line except the 
first has a zero char at the beginning (the leftover zero char after the 
detected LF from the previous line), but I can work around that.

So, with this patch I'm able to do a blame on utf16 files, so I'd like 
to commit this if there are no objections.

[[[
Extend the blame callback with a string length parameter.

* subversion/incluce/svn_client.h
* subversion/libsvn_client/blame.c
   (svn_client_blame_receiver4_t): typedef for new callback
   (svn_client_blame6): new API using the svn_client_blame_receiver4_t 
callback
* subversion/libsvn_client/deprecated.c
   (svn_client_blame5): moved API there, calling svn_client_blame6 using a
                        callback shim
   (blame_wrapper_receiver3): callback shim for svn_client_blame5

]]]

Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 05.01.2019 18:05, Mark Phippard wrote:
> 
>> On Jan 5, 2019, at 10:58 AM, Stefan Kueng <to...@gmail.com> wrote:
>>
>> So, with this patch I'm able to do a blame on utf16 files, so I'd like to commit this if there are no objections.
>>
> 
> I have no objections, just a question. I thought SVN treated utf16 files as binary.  How does blame work at all?  I thought diff does not even work for utf16.  Does this require setting svn:mime-type to text/plain or something like that as well?

In default mode, yes.
But the --force flag makes it run on every kind of file.

Stefan

Re: extending the blame callback

Posted by Mark Phippard <ma...@gmail.com>.
> On Jan 5, 2019, at 10:58 AM, Stefan Kueng <to...@gmail.com> wrote:
> 
> So, with this patch I'm able to do a blame on utf16 files, so I'd like to commit this if there are no objections.
> 

I have no objections, just a question. I thought SVN treated utf16 files as binary.  How does blame work at all?  I thought diff does not even work for utf16.  Does this require setting svn:mime-type to text/plain or something like that as well?

Mark

Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 05.01.2019 17:53, Julian Foad wrote:
> Stefan Kueng wrote:
>> When running blame on an utf16 text file, the lines can not be used
>> since the blame callback only passes a 'char *' parameter which means it
>> ends at the first zero char. But actually, svn knows if the line has
>> more content.
> 
> +1 on doing something to fix the problem.
> 
> Just briefly:
> * we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;

Ok, I'll change that and send a new patch.

> * splitting the lines on a bare LF byte, and having a left-over zero byte, sounds just so wrong... it can't be right, can it? Can't we do something better? Use a callback for deciding how to split lines, or something? I haven't looked at the code, so not sure what exactly.

Well, without doing several checks for what kind of encoding the line 
has this won't be possible. I'm just glad that the function as is does 
not stop at the zero char but only on 'lf' - so even if the file is 
really binary it "works".

Stefan

Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 06.01.2019 20:10, Stefan Kueng wrote:
>
>
> On 06.01.2019 19:37, Branko Čibej wrote:
>
>> Windows default is UTF-16-LE, at least on x86(_64) and other
>> little-endian architectures. I'm not sure what they do on ARM but I'd be
>> surprised if Windows doesn't put it in little-endian mode, given that
>> decades of legacy software assume little-endian.
>>
>> A simple check would be:
>>
>>    * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
>>      UTF-16-LE linefeed;
>>    * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
>>      then it's a UTF-16-BE linefeed;
>>    * otherwise just hope it's a linefeed and move on.
>
> looks good for proper files. But I have two files right here that are
> utf8 encoded, but have also null bytes in it: the stupid app that
> writes those uses the null byte to separate lines, and lf to separate
> paragraphs. So for these files the check would fail.

But not any more drastically than it fails now, surely? Since it returns
whole paragraphs instead of lines. You might lose an empty line in the
blame output here and there, hardly noticeable. :)


>> You're right about that. I wouldn't dream of supporting such things
>> within the blame callback itself. However it would still be nice to at
>> least document what's happening.
>
> I'll add some more info to the doc string and resend the patch tomorrow.
>
> The advantage if this is done in an UI client: if the detection of the
> encoding is wrong, I can just add a button/combobox/whatever so the
> user can choose the right encoding right there when showing the blame
> output.
> If the detection is done in the svn lib and is wrong, then an UI
> client could not do that.

OK, I guess, for now ... until we figure out how to do this right. For
example, when someone finally decides to properly handle Unicode
representations in file contents for diff, patch and blame, we'd also
have to support U+2028 (line separator) and U+2029 (paragraph separator).

-- Brane


Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 06.01.2019 19:37, Branko Čibej wrote:

> Windows default is UTF-16-LE, at least on x86(_64) and other
> little-endian architectures. I'm not sure what they do on ARM but I'd be
> surprised if Windows doesn't put it in little-endian mode, given that
> decades of legacy software assume little-endian.
> 
> A simple check would be:
> 
>    * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
>      UTF-16-LE linefeed;
>    * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
>      then it's a UTF-16-BE linefeed;
>    * otherwise just hope it's a linefeed and move on.

looks good for proper files. But I have two files right here that are 
utf8 encoded, but have also null bytes in it: the stupid app that writes 
those uses the null byte to separate lines, and lf to separate 
paragraphs. So for these files the check would fail.

> 
> You're right about that. I wouldn't dream of supporting such things
> within the blame callback itself. However it would still be nice to at
> least document what's happening.

I'll add some more info to the doc string and resend the patch tomorrow.

The advantage if this is done in an UI client: if the detection of the 
encoding is wrong, I can just add a button/combobox/whatever so the user 
can choose the right encoding right there when showing the blame output.
If the detection is done in the svn lib and is wrong, then an UI client 
could not do that.

Stefan

Re: extending the blame callback

Posted by Peter Samuelson <pe...@p12n.org>.
[Daniel Shahaf]
> The current patch's docstring implies the LF byte is necessarily part
> of a line terminator, which is true for UTF-8/16/32 but not
> necessarily true in arbitrary encodings.

Nitpick: It is true in UTF-8, but not -16 or -32.  There are about 70
characters in the BMP which, in UTF-16LE (and -32LE), begin with 0A:

    $ grep '^..0A;' /usr/share/misc/unicode.gz | head
    000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
    010A;LATIN CAPITAL LETTER C WITH DOT ABOVE;Lu;0;L;0043 0307;;;;N;LATIN CAPITAL LETTER C DOT;;;010B;
    020A;LATIN CAPITAL LETTER I WITH INVERTED BREVE;Lu;0;L;0049 0311;;;;N;;;;020B;
    030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
    040A;CYRILLIC CAPITAL LETTER NJE;Lu;0;L;;;;;N;;;;045A;
    050A;CYRILLIC CAPITAL LETTER KOMI NJE;Lu;0;L;;;;;N;;;;050B;
    060A;ARABIC-INDIC PER TEN THOUSAND SIGN;Po;0;ET;;;;;N;;;;;
    070A;SYRIAC CONTRACTION;Po;0;AL;;;;;N;;;;;
    080A;SAMARITAN LETTER KAAF;Lo;0;R;;;;;N;;;;;
    090A;DEVANAGARI LETTER UU;Lo;0;L;;;;;N;;;;;

Re: extending the blame callback

Posted by Julian Foad <ju...@apache.org>.
Stefan Kueng wrote:
> On 14.01.2019 14:25, Branko Čibej wrote:
> > I started on that then decided that svn_client_blame6 is far too narrow
> > scope for that. In order to do this right, we have to introduce a line
> > splitter callback to svn_subst_stream_translated.
> > 
> > My idea was to do something like this:
> > 
> >    * If the line-splitter is NULL, use the current default and check MIME
> >      type for text/*
> >    * If it's not NULL, ignore MIME type and provide the callback with
> >      file props so it can use that to determine encoding if it wants to.
> >    * Remove the ignore_mime_type flag and use the presence of a line
> >      splitter to convey the same intent; consequently,
> >    * Expose the "default" line splitter as a public function.
> 
> one problem I encountered so far is that the encoding detection requires 
> data to work properly. But if e.g. the first few lines of a file have 
> only a few chars, then that's not enough.
> Not sure how this could be done, but since the blame function already 
> has the file loaded into memory, could this be passed to the client in 
> any way? Maybe the first time the callback is called? Or on every call?
> What I mean is instead of passing the svn_string_t for only a line, pass 
> it for the whole (unmodified/untranslated) file too.

Hi, Stefan. It wouldn't make sense to try to do this detection in the "blame line" callback: that is far too late. That's why we're discussing introducing a new "stream translation" callback, which would not be specific to "blame" but used for all textual operations.

There was some more discussion about it on #svn-dev [1], but not any good conclusions yet.

A stream translation callback could act as a stream filter, before the line splitting occurs, and could be responsible for determining how to translate both the character encoding and the keywords and EOL style. It could be free to buffer and examine as much of the file content as it wants to, before it starts to output the translated file content.

I'd be very glad to see any prototypes of such ideas.

-- 
- Julian


[1] 2019-01-14 Discussion of line-splitter callback ideas on #svn-dev, archived at https://colabti.org/irclogger/irclogger_log/svn-dev?date=2019-01-14#l50 and on Matrix: https://matrix.to/#/!hFVJYCuSnqoHmiVQYA:matrix.org/$15474728952387759JdjHl:matrix.org?via=matrix.org&via=foad.me.uk

Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 14.01.2019 14:25, Branko Čibej wrote:
> On 14.01.2019 13:36, Julian Foad wrote:
>> Stefan, thanks for taking account of the feedback and updating the doc string in r1851197.
>>
>> I took a look and thought to rewrite the part about encoding and line splitting like this:
>>
>>   * Character Encoding and Line Splitting:
>>   *
>>   * It is up to the client to determine the character encoding. The @a line
>>   * content is delivered without any encoding conversion. The line splitting
>>   * is designed to work with ASCII-compatible encodings including UTF-8. Any
>>   * of the byte sequences LF ("\n"), CR ("\n"), CR LF ("\r\n") ends a line
>>   * and is not included in @a line. The @a line content can include all other
>>   * byte values including zero (ASCII NUL).
>>
>> I dropped the reference to svn_subst_stream_translated() because it wasn't much use without saying what parameters it is given, and instead I was able to say exactly what happens overall.
>>
>> Problem 1: Using this blame function on a 16-bit character encoding is still really ugly: the receiver cannot know which byte sequences were stripped out.
>>
>> We should address this issue properly by passing a "line splitter" function in to svn_client_blame6().
> 
> 
> I started on that then decided that svn_client_blame6 is far too narrow
> scope for that. In order to do this right, we have to introduce a line
> splitter callback to svn_subst_stream_translated.
> 
> My idea was to do something like this:
> 
>    * If the line-splitter is NULL, use the current default and check MIME
>      type for text/*
>    * If it's not NULL, ignore MIME type and provide the callback with
>      file props so it can use that to determine encoding if it wants to.
>    * Remove the ignore_mime_type flag and use the presence of a line
>      splitter to convey the same intent; consequently,
>    * Expose the "default" line splitter as a public function.

one problem I encountered so far is that the encoding detection requires 
data to work properly. But if e.g. the first few lines of a file have 
only a few chars, then that's not enough.
Not sure how this could be done, but since the blame function already 
has the file loaded into memory, could this be passed to the client in 
any way? Maybe the first time the callback is called? Or on every call?
What I mean is instead of passing the svn_string_t for only a line, pass 
it for the whole (unmodified/untranslated) file too.

Stefan


Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 14.01.2019 15:37, Julian Foad wrote:
> Branko Čibej wrote:
>> On 14.01.2019 13:36, Julian Foad wrote:
> [...]
>
> Thanks, Brane; you gave a good initial summary of the scope of the problem. As you said on IRC there's even more to it, such as the relationship between line splitting and keywords, meaning that all has to be addressed together.
>
>>> Can we re-write this properly?
>> As I said, this is not limited to blame. The current change in the blame
>> API at least makes some sense and is a localised change, that's why I
>> support it. It doesn't break anything that wasn't broken before.
>>
>> Changing the way we handle text-like files for diff, blame and patch, on
>> the other hand, is quite a bit more involved and I'm afraid it'll touch
>> a lot of code. I wouldn't dream of rejecting svn_client_blame6 just
>> because it doesn't solve the larger problem.
> Ack. I didn't mean to reject this current upgrade: I agree that providing an svn_string_t is a useful and self-contained upgrade in itself, and I support it. I meant can we next, sometime, address the bigger problem that you discussed above. (We should open an issue in the tracker, and start a new thread for it.)

+1


Re: extending the blame callback

Posted by Julian Foad <ju...@apache.org>.
Branko Čibej wrote:
> On 14.01.2019 13:36, Julian Foad wrote:
[...]

Thanks, Brane; you gave a good initial summary of the scope of the problem. As you said on IRC there's even more to it, such as the relationship between line splitting and keywords, meaning that all has to be addressed together.

> > Can we re-write this properly?
> 
> As I said, this is not limited to blame. The current change in the blame
> API at least makes some sense and is a localised change, that's why I
> support it. It doesn't break anything that wasn't broken before.
> 
> Changing the way we handle text-like files for diff, blame and patch, on
> the other hand, is quite a bit more involved and I'm afraid it'll touch
> a lot of code. I wouldn't dream of rejecting svn_client_blame6 just
> because it doesn't solve the larger problem.

Ack. I didn't mean to reject this current upgrade: I agree that providing an svn_string_t is a useful and self-contained upgrade in itself, and I support it. I meant can we next, sometime, address the bigger problem that you discussed above. (We should open an issue in the tracker, and start a new thread for it.)

-- 
- Julian

Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 14.01.2019 13:36, Julian Foad wrote:
> Stefan, thanks for taking account of the feedback and updating the doc string in r1851197.
>
> I took a look and thought to rewrite the part about encoding and line splitting like this:
>
>  * Character Encoding and Line Splitting:
>  *
>  * It is up to the client to determine the character encoding. The @a line
>  * content is delivered without any encoding conversion. The line splitting
>  * is designed to work with ASCII-compatible encodings including UTF-8. Any
>  * of the byte sequences LF ("\n"), CR ("\n"), CR LF ("\r\n") ends a line
>  * and is not included in @a line. The @a line content can include all other
>  * byte values including zero (ASCII NUL).
>
> I dropped the reference to svn_subst_stream_translated() because it wasn't much use without saying what parameters it is given, and instead I was able to say exactly what happens overall.
>
> Problem 1: Using this blame function on a 16-bit character encoding is still really ugly: the receiver cannot know which byte sequences were stripped out.
>
> We should address this issue properly by passing a "line splitter" function in to svn_client_blame6().


I started on that then decided that svn_client_blame6 is far too narrow
scope for that. In order to do this right, we have to introduce a line
splitter callback to svn_subst_stream_translated.

My idea was to do something like this:

  * If the line-splitter is NULL, use the current default and check MIME
    type for text/*
  * If it's not NULL, ignore MIME type and provide the callback with
    file props so it can use that to determine encoding if it wants to.
  * Remove the ignore_mime_type flag and use the presence of a line
    splitter to convey the same intent; consequently,
  * Expose the "default" line splitter as a public function.



> Problem 2: Then I noticed that where this splitting algorithm is used, it is in the second pass over the file data, where we associate each "struct blame" item with a line of text.
>
> It looks like a different algorithm may be used in the first pass, when calculating the differences and constructing the blame list. It diffs all repository revisions of the file using svn_diff_file_diff_2() which splits lines according to the "ignore_eol_style" option from svn_client_blame6(diff_options.ignore_eol_style). The effect of "ignore_eol_style" is not entirely clear to me; it is used in svn_diff__normalize_buffer(). In addition, if svn_client_blame6() ends up processing a local WC revision of the file, that is processed through svn_client__get_normalized_stream(normalize_eols=(svn:eol-style == native)) before being diffed.
>
> This is horrible. Surely we should use a consistent "line splitter" everywhere.


Yes, we should introduce the line splitter concept everywhere in the
diff/patch/blame group and really anywhere where we're expected to read
"lines" from files (svn_stream_readline comes to mind). This is why I
stopped working on a splitter for blame ... the problem is much larger
than that and "fixing" blame would really not help by itself.


> I would expect this means it is possible for blame output to go wrong, with revision numbers assigned to the wrong lines.


This has always been possible due to the imprecise nature of diff, which
blame relies on. I wouldn't worry too much about it.


>  I have been unable to construct such a wrong output, so far, after trying for half an hour or so. Possibly there is no code path that produces such a wrong output. But surely this is the sort of issue that leads to problems down the road.
>
> Can we re-write this properly?

As I said, this is not limited to blame. The current change in the blame
API at least makes some sense and is a localised change, that's why I
support it. It doesn't break anything that wasn't broken before.

Changing the way we handle text-like files for diff, blame and patch, on
the other hand, is quite a bit more involved and I'm afraid it'll touch
a lot of code. I wouldn't dream of rejecting svn_client_blame6 just
because it doesn't solve the larger problem.

-- Brane

Re: extending the blame callback

Posted by Julian Foad <ju...@apache.org>.
Stefan, thanks for taking account of the feedback and updating the doc string in r1851197.

I took a look and thought to rewrite the part about encoding and line splitting like this:

 * Character Encoding and Line Splitting:
 *
 * It is up to the client to determine the character encoding. The @a line
 * content is delivered without any encoding conversion. The line splitting
 * is designed to work with ASCII-compatible encodings including UTF-8. Any
 * of the byte sequences LF ("\n"), CR ("\n"), CR LF ("\r\n") ends a line
 * and is not included in @a line. The @a line content can include all other
 * byte values including zero (ASCII NUL).

I dropped the reference to svn_subst_stream_translated() because it wasn't much use without saying what parameters it is given, and instead I was able to say exactly what happens overall.

Problem 1: Using this blame function on a 16-bit character encoding is still really ugly: the receiver cannot know which byte sequences were stripped out.

We should address this issue properly by passing a "line splitter" function in to svn_client_blame6().

Problem 2: Then I noticed that where this splitting algorithm is used, it is in the second pass over the file data, where we associate each "struct blame" item with a line of text.

It looks like a different algorithm may be used in the first pass, when calculating the differences and constructing the blame list. It diffs all repository revisions of the file using svn_diff_file_diff_2() which splits lines according to the "ignore_eol_style" option from svn_client_blame6(diff_options.ignore_eol_style). The effect of "ignore_eol_style" is not entirely clear to me; it is used in svn_diff__normalize_buffer(). In addition, if svn_client_blame6() ends up processing a local WC revision of the file, that is processed through svn_client__get_normalized_stream(normalize_eols=(svn:eol-style == native)) before being diffed.

This is horrible. Surely we should use a consistent "line splitter" everywhere.

I would expect this means it is possible for blame output to go wrong, with revision numbers assigned to the wrong lines. I have been unable to construct such a wrong output, so far, after trying for half an hour or so. Possibly there is no code path that produces such a wrong output. But surely this is the sort of issue that leads to problems down the road.

Can we re-write this properly?

-- 
- Julian

Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 11.01.2019 17:11, Stefan Kueng wrote:
>
>
> On 10.01.2019 23:46, Branko Čibej wrote:
>> On 10.01.2019 19:13, Stefan Kueng wrote:
>>>
>>>
>>> On 10.01.2019 06:58, Branko Čibej wrote:
>>>> On 10.01.2019 04:58, Branko Čibej wrote:
>>>>> On 07.01.2019 20:57, Stefan Kueng wrote:
>>>>>> @@ -758,6 +759,33 @@
>>>>>>     * will be true if the reason there is no blame information is
>>>>>> that the line
>>>>>>     * was modified locally. In all other cases @a local_change will
>>>>>> be false.
>>>>>>     *
>>>>>> + * @note the line is split on LF characters. Clients must be aware
>>>>>> of this
>>>>>> + * when dealing with different encodings of the file/line.
>>>>>> + * Blaming non ASCII/UTF-8 files requires the @a force flag to be
>>>>>> set when
>>>>>> + * calling the svn_client_blame6 function.
>>>>>
>>>>> I just noticed that svn_client_blame6 does not, of course, have a
>>>>> parameter called 'force'. But it does have a parameter called
>>>>> 'ignore_mime_type'.
>>>>
>>>>
>>>> Also the assertion that "lines are split on LF" turns out to be wrong
>>>> and misleading. Line endings are translated first, through
>>>> svn_subst_stream_translated(), and this happens regardless of the MIME
>>>> type. These parts of the new docstrings should be fixed before the
>>>> next
>>>> release.
>>>
>>> How about this:
>>>   * @note the line is split on newline bytes. Clients must be aware of
>>> this
>>>   * when dealing with different encodings of the file/line.
>>>   * Blaming non ASCII/UTF-8 files requires the @a ignore_mime_type flag
>>> to be
>>>   * set to true when calling the svn_client_blame6 function.
>>>
>>>
>>> mentioning that the split is done on newline *bytes* should be clear
>>> enough?
>>> Of course, better ideas are always welcome.
>>
>> I'd just write something about the contents being preprocessed by '@ref
>> svn_subst_stream_translated' to convert newlines, as that has its own,
>> quite extensive docstring. There's no need to second-guess that or
>> duplicate information.
>>
>
> how about this:
>
>  * @note the line contents are processed by @ref
> svn_subst_stream_translated
>  * to convert newlines. The lines are then split on newlines.
>  * Clients must be aware of this when dealing with different encodings of
>  * the file/line.
>  * Blaming non ASCII/UTF-8 files requires the @a ignore_mime_type flag
> to be
>  * set to true when calling the svn_client_blame6 function.

"Blaming files that have <tt>svn:mime-type</tt> set to something other
than <tt>text/...</tt>" would be more correct, otherwise +1. It's quite
possible to have a plain ASCII text file, then set the MIME type to,
e.g., 'application/xml'.

-- Brane


Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 10.01.2019 23:46, Branko Čibej wrote:
> On 10.01.2019 19:13, Stefan Kueng wrote:
>>
>>
>> On 10.01.2019 06:58, Branko Čibej wrote:
>>> On 10.01.2019 04:58, Branko Čibej wrote:
>>>> On 07.01.2019 20:57, Stefan Kueng wrote:
>>>>> @@ -758,6 +759,33 @@
>>>>>     * will be true if the reason there is no blame information is
>>>>> that the line
>>>>>     * was modified locally. In all other cases @a local_change will
>>>>> be false.
>>>>>     *
>>>>> + * @note the line is split on LF characters. Clients must be aware
>>>>> of this
>>>>> + * when dealing with different encodings of the file/line.
>>>>> + * Blaming non ASCII/UTF-8 files requires the @a force flag to be
>>>>> set when
>>>>> + * calling the svn_client_blame6 function.
>>>>
>>>> I just noticed that svn_client_blame6 does not, of course, have a
>>>> parameter called 'force'. But it does have a parameter called
>>>> 'ignore_mime_type'.
>>>
>>>
>>> Also the assertion that "lines are split on LF" turns out to be wrong
>>> and misleading. Line endings are translated first, through
>>> svn_subst_stream_translated(), and this happens regardless of the MIME
>>> type. These parts of the new docstrings should be fixed before the next
>>> release.
>>
>> How about this:
>>   * @note the line is split on newline bytes. Clients must be aware of
>> this
>>   * when dealing with different encodings of the file/line.
>>   * Blaming non ASCII/UTF-8 files requires the @a ignore_mime_type flag
>> to be
>>   * set to true when calling the svn_client_blame6 function.
>>
>>
>> mentioning that the split is done on newline *bytes* should be clear
>> enough?
>> Of course, better ideas are always welcome.
> 
> I'd just write something about the contents being preprocessed by '@ref
> svn_subst_stream_translated' to convert newlines, as that has its own,
> quite extensive docstring. There's no need to second-guess that or
> duplicate information.
> 

how about this:

  * @note the line contents are processed by @ref 
svn_subst_stream_translated
  * to convert newlines. The lines are then split on newlines.
  * Clients must be aware of this when dealing with different encodings of
  * the file/line.
  * Blaming non ASCII/UTF-8 files requires the @a ignore_mime_type flag 
to be
  * set to true when calling the svn_client_blame6 function.



> Incidentally, this is probably why you don't see any CR bytes. An UTF-16
> sequence of 'CR 00 LF 00' will be translated to 'LF 00 LF 00' and you'll
> get two callbacks out of that instead of one ... I think.

Yes, that's exactly why.

Stefan


Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 10.01.2019 19:13, Stefan Kueng wrote:
>
>
> On 10.01.2019 06:58, Branko Čibej wrote:
>> On 10.01.2019 04:58, Branko Čibej wrote:
>>> On 07.01.2019 20:57, Stefan Kueng wrote:
>>>> @@ -758,6 +759,33 @@
>>>>    * will be true if the reason there is no blame information is
>>>> that the line
>>>>    * was modified locally. In all other cases @a local_change will
>>>> be false.
>>>>    *
>>>> + * @note the line is split on LF characters. Clients must be aware
>>>> of this
>>>> + * when dealing with different encodings of the file/line.
>>>> + * Blaming non ASCII/UTF-8 files requires the @a force flag to be
>>>> set when
>>>> + * calling the svn_client_blame6 function.
>>>
>>> I just noticed that svn_client_blame6 does not, of course, have a
>>> parameter called 'force'. But it does have a parameter called
>>> 'ignore_mime_type'.
>>
>>
>> Also the assertion that "lines are split on LF" turns out to be wrong
>> and misleading. Line endings are translated first, through
>> svn_subst_stream_translated(), and this happens regardless of the MIME
>> type. These parts of the new docstrings should be fixed before the next
>> release.
>
> How about this:
>  * @note the line is split on newline bytes. Clients must be aware of
> this
>  * when dealing with different encodings of the file/line.
>  * Blaming non ASCII/UTF-8 files requires the @a ignore_mime_type flag
> to be
>  * set to true when calling the svn_client_blame6 function.
>
>
> mentioning that the split is done on newline *bytes* should be clear
> enough?
> Of course, better ideas are always welcome.

I'd just write something about the contents being preprocessed by '@ref
svn_subst_stream_translated' to convert newlines, as that has its own,
quite extensive docstring. There's no need to second-guess that or
duplicate information.

Incidentally, this is probably why you don't see any CR bytes. An UTF-16
sequence of 'CR 00 LF 00' will be translated to 'LF 00 LF 00' and you'll
get two callbacks out of that instead of one ... I think.

-- Brane


Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 10.01.2019 06:58, Branko Čibej wrote:
> On 10.01.2019 04:58, Branko Čibej wrote:
>> On 07.01.2019 20:57, Stefan Kueng wrote:
>>> @@ -758,6 +759,33 @@
>>>    * will be true if the reason there is no blame information is that the line
>>>    * was modified locally. In all other cases @a local_change will be false.
>>>    *
>>> + * @note the line is split on LF characters. Clients must be aware of this
>>> + * when dealing with different encodings of the file/line.
>>> + * Blaming non ASCII/UTF-8 files requires the @a force flag to be set when
>>> + * calling the svn_client_blame6 function.
>>
>> I just noticed that svn_client_blame6 does not, of course, have a
>> parameter called 'force'. But it does have a parameter called
>> 'ignore_mime_type'.
> 
> 
> Also the assertion that "lines are split on LF" turns out to be wrong
> and misleading. Line endings are translated first, through
> svn_subst_stream_translated(), and this happens regardless of the MIME
> type. These parts of the new docstrings should be fixed before the next
> release.

How about this:
  * @note the line is split on newline bytes. Clients must be aware of this
  * when dealing with different encodings of the file/line.
  * Blaming non ASCII/UTF-8 files requires the @a ignore_mime_type flag 
to be
  * set to true when calling the svn_client_blame6 function.


mentioning that the split is done on newline *bytes* should be clear enough?
Of course, better ideas are always welcome.

Stefan

Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 10.01.2019 04:58, Branko Čibej wrote:
> On 07.01.2019 20:57, Stefan Kueng wrote:
>> @@ -758,6 +759,33 @@
>>   * will be true if the reason there is no blame information is that the line
>>   * was modified locally. In all other cases @a local_change will be false.
>>   *
>> + * @note the line is split on LF characters. Clients must be aware of this
>> + * when dealing with different encodings of the file/line.
>> + * Blaming non ASCII/UTF-8 files requires the @a force flag to be set when
>> + * calling the svn_client_blame6 function.
>
> I just noticed that svn_client_blame6 does not, of course, have a
> parameter called 'force'. But it does have a parameter called
> 'ignore_mime_type'.


Also the assertion that "lines are split on LF" turns out to be wrong
and misleading. Line endings are translated first, through
svn_subst_stream_translated(), and this happens regardless of the MIME
type. These parts of the new docstrings should be fixed before the next
release.

-- Brane


Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 07.01.2019 20:57, Stefan Kueng wrote:
> @@ -758,6 +759,33 @@
>   * will be true if the reason there is no blame information is that the line
>   * was modified locally. In all other cases @a local_change will be false.
>   *
> + * @note the line is split on LF characters. Clients must be aware of this
> + * when dealing with different encodings of the file/line.
> + * Blaming non ASCII/UTF-8 files requires the @a force flag to be set when
> + * calling the svn_client_blame6 function.


I just noticed that svn_client_blame6 does not, of course, have a
parameter called 'force'. But it does have a parameter called
'ignore_mime_type'.

-- Brane


Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 07.01.2019 19:58, Daniel Shahaf wrote:
> Stefan Kueng wrote on Mon, 07 Jan 2019 19:30 +0100:
>> On 06.01.2019 21:09, Daniel Shahaf wrote:
>>> Stefan Kueng wrote on Sun, Jan 06, 2019 at 20:40:28 +0100:
>>>> @@ -758,6 +759,33 @@
>>>>     * will be true if the reason there is no blame information is that the line
>>>>     * was modified locally. In all other cases @a local_change will be false.
>>>>     *
>>>> + * @note if the text encoding of the file is not ASCII or utf8, the end-of-line
>>>> + * detection may lead to lines having a one byte offset. It is up to the client
>>>
>>> "One byte offset" is not true in general; it is true for UTF-16 but
>>> there are other encodings in the world.  Besides, I would sooner point
>>> out that if the file isn't in UTF-8 (including ASCII), the end of line
>>> detection may be *wrong* since it looks for the byte 0x0A, which may not
>>> even be part of a (possibly multibyte) newline character.
>>>
>>> It's fine to give specific details about UTF-16, but we should give the
>>> more generally-applicable information first.
>>
>> The wording is "*may*", but I've reworded it slightly. I hope it's better.
> 
> It _is_ better, thank you, but I agree with Julian's last post where he wrote that
> the docstring should just say that the line is split on LF bytes.  The current
> patch's docstring implies the LF byte is necessarily part of a line terminator,
> which is true for UTF-8/16/32 but not necessarily true in arbitrary encodings.

next patch attached. I think this is better now.

Stefan


Re: extending the blame callback

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stefan Kueng wrote on Mon, 07 Jan 2019 19:30 +0100:
> On 06.01.2019 21:09, Daniel Shahaf wrote:
> > Stefan Kueng wrote on Sun, Jan 06, 2019 at 20:40:28 +0100:
> >> @@ -758,6 +759,33 @@
> >>    * will be true if the reason there is no blame information is that the line
> >>    * was modified locally. In all other cases @a local_change will be false.
> >>    *
> >> + * @note if the text encoding of the file is not ASCII or utf8, the end-of-line
> >> + * detection may lead to lines having a one byte offset. It is up to the client
> > 
> > "One byte offset" is not true in general; it is true for UTF-16 but
> > there are other encodings in the world.  Besides, I would sooner point
> > out that if the file isn't in UTF-8 (including ASCII), the end of line
> > detection may be *wrong* since it looks for the byte 0x0A, which may not
> > even be part of a (possibly multibyte) newline character.
> > 
> > It's fine to give specific details about UTF-16, but we should give the
> > more generally-applicable information first.
> 
> The wording is "*may*", but I've reworded it slightly. I hope it's better.

It _is_ better, thank you, but I agree with Julian's last post where he wrote that
the docstring should just say that the line is split on LF bytes.  The current
patch's docstring implies the LF byte is necessarily part of a line terminator,
which is true for UTF-8/16/32 but not necessarily true in arbitrary encodings.

Snipped the rest — thanks for making those changes.

Cheers,

Daniel

Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 07.01.2019 19:56, Branko Čibej wrote:
> On Mon, 7 Jan 2019, 19:30 Stefan Kueng <tortoisesvn@gmail.com 
> <ma...@gmail.com> wrote:
> 
> 
>     I think for proper C, all variables need to be declared at the start of
>     the function. Unless of course svn allows newer C versions?
> 
> 
> Not "the start of the function" but "before any statements in the 
> enclosing block." It has been that way since the beginning.

So should I change it?

Stefan


Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On Mon, 7 Jan 2019, 19:30 Stefan Kueng <tortoisesvn@gmail.com wrote:

>
> I think for proper C, all variables need to be declared at the start of
> the function. Unless of course svn allows newer C versions?
>

Not "the start of the function" but "before any statements in the enclosing
block." It has been that way since the beginning.

-- Brane

>

Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.
the attached patch has some corrections to the doc strings.
If there are no objections, I'll commit this tomorrow.

Also see inline comments:

On 06.01.2019 21:09, Daniel Shahaf wrote:
> Stefan Kueng wrote on Sun, Jan 06, 2019 at 20:40:28 +0100:
>> +++ subversion/include/svn_client.h	(working copy)
>> @@ -736,10 +736,11 @@
>>    * @{
>>    */
>>   
>> -/** Callback type used by svn_client_blame5() to notify the caller
>> +/** Callback type used by svn_client_blame6() to notify the caller
>>    * that line @a line_no of the blamed file was last changed in @a revision
>>    * which has the revision properties @a rev_props, and that the contents were
>> - * @a line.
>> + * @a line. The @a line content is delivered as is, it is up to the client to
>> + * determine the encoding.
> 
> s/,/;/
> 
> Shouldn't the docstring explicitly say whether or not the value of @a
> line includes the terminating newline?  (Preëxisting issue)

Done.

> 
>> @@ -758,6 +759,33 @@
>>    * will be true if the reason there is no blame information is that the line
>>    * was modified locally. In all other cases @a local_change will be false.
>>    *
>> + * @note if the text encoding of the file is not ASCII or utf8, the end-of-line
>> + * detection may lead to lines having a one byte offset. It is up to the client
> 
> "One byte offset" is not true in general; it is true for UTF-16 but
> there are other encodings in the world.  Besides, I would sooner point
> out that if the file isn't in UTF-8 (including ASCII), the end of line
> detection may be *wrong* since it looks for the byte 0x0A, which may not
> even be part of a (possibly multibyte) newline character.
> 
> It's fine to give specific details about UTF-16, but we should give the
> more generally-applicable information first.

The wording is "*may*", but I've reworded it slightly. I hope it's better.

> 
>> + * to handle these situations. Blaming non ASCII/utf8 files requires the
> 
> s/utf8/UTF-8/ (two instances)

Done.

> 
>> + * @a force flag to be set when calling the blame function.
> 
> s/the blame function/svn_client_blame6/

Done.

> 
>> @@ -941,18 +942,20 @@
>>               SVN_ERR(ctx->cancel_func(ctx->cancel_baton));
>>             if (!eof || sb->len)
>>               {
>> +              line.data = sb->data;
>> +              line.len = sb->len;
> 
> Reduce the scope of LINE to just this block?

I think for proper C, all variables need to be declared at the start of 
the function. Unless of course svn allows newer C versions?


Re: extending the blame callback

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stefan Kueng wrote on Sun, Jan 06, 2019 at 20:40:28 +0100:
> +++ subversion/include/svn_client.h	(working copy)
> @@ -736,10 +736,11 @@
>   * @{
>   */
>  
> -/** Callback type used by svn_client_blame5() to notify the caller
> +/** Callback type used by svn_client_blame6() to notify the caller
>   * that line @a line_no of the blamed file was last changed in @a revision
>   * which has the revision properties @a rev_props, and that the contents were
> - * @a line.
> + * @a line. The @a line content is delivered as is, it is up to the client to
> + * determine the encoding.

s/,/;/

Shouldn't the docstring explicitly say whether or not the value of @a
line includes the terminating newline?  (Preëxisting issue)

> @@ -758,6 +759,33 @@
>   * will be true if the reason there is no blame information is that the line
>   * was modified locally. In all other cases @a local_change will be false.
>   *
> + * @note if the text encoding of the file is not ASCII or utf8, the end-of-line
> + * detection may lead to lines having a one byte offset. It is up to the client

"One byte offset" is not true in general; it is true for UTF-16 but
there are other encodings in the world.  Besides, I would sooner point
out that if the file isn't in UTF-8 (including ASCII), the end of line
detection may be *wrong* since it looks for the byte 0x0A, which may not
even be part of a (possibly multibyte) newline character.

It's fine to give specific details about UTF-16, but we should give the
more generally-applicable information first.

> + * to handle these situations. Blaming non ASCII/utf8 files requires the

s/utf8/UTF-8/ (two instances)

> + * @a force flag to be set when calling the blame function.

s/the blame function/svn_client_blame6/

> + * @since New in 1.12.
> + */
> +typedef svn_error_t *(*svn_client_blame_receiver4_t)(
> +  void *baton,
> +  svn_revnum_t start_revnum,
> +  svn_revnum_t end_revnum,
> +  apr_int64_t line_no,
> +  svn_revnum_t revision,
> +  apr_hash_t *rev_props,
> +  svn_revnum_t merged_revision,
> +  apr_hash_t *merged_rev_props,
> +  const char *merged_path,
> +  const svn_string_t *line,
> +  svn_boolean_t local_change,
> +  apr_pool_t *pool);

> +++ subversion/libsvn_client/blame.c	(working copy)
> @@ -676,6 +676,7 @@ svn_client_blame6(const char *target,
>    svn_stream_t *last_stream;
>    svn_stream_t *stream;
>    const char *target_abspath_or_url;
> +  svn_string_t line;
>  
>    if (start->kind == svn_opt_revision_unspecified
>        || end->kind == svn_opt_revision_unspecified)
> @@ -941,18 +942,20 @@
>              SVN_ERR(ctx->cancel_func(ctx->cancel_baton));
>            if (!eof || sb->len)
>              {
> +              line.data = sb->data;
> +              line.len = sb->len;

Reduce the scope of LINE to just this block?

>                if (walk->rev)
>                  SVN_ERR(receiver(receiver_baton, start_revnum, end_revnum,
>                                   line_no, walk->rev->revision,
>                                   walk->rev->rev_props, merged_rev,
>                                   merged_rev_props, merged_path,
> -                                 sb->data, FALSE, iterpool));
> +                                 &line, FALSE, iterpool));
>                else
>                  SVN_ERR(receiver(receiver_baton, start_revnum, end_revnum,
>                                   line_no, SVN_INVALID_REVNUM,
>                                   NULL, SVN_INVALID_REVNUM,
>                                   NULL, NULL,
> -                                 sb->data, TRUE, iterpool));
> +                                 &line, TRUE, iterpool));
>              }
>            if (eof) break;
>          }

Cheers,

Daniel

Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 06.01.2019 19:54, Daniel Shahaf wrote:
> The encoding may also be set explicitly via a svn:mime-type="text/foo;
> charset=utf-16-le" property.  (We even parse that in mod_dav_svn, I think?)

Setting the mime-type to text/.. with the utf16 charset helps only that 
the --force flag doesn't need to be passed to commands, but that all 
(client-side) as far as I know.

Here's another patch with some more doc strings. I guess that can be 
improved, but I'm really not good at writing docs...

Stefan


Re: extending the blame callback

Posted by Julian Foad <ju...@apache.org>.
I reviewed the patch 'blame6_5.patch'. It all looks fine, functionally. I think it provides a good upgrade for any user of the API, even though I don't really think using it to process UTF-16 data is a very good idea.

The only part of the patch I'd like to see changed is the comment about "the end-of-line detection may lead to lines having a one byte offset", because this "offset" is purely an idea that belongs to the UTF-16-aware caller. It would be more accurate just to say here that the EOL detection splits lines at a LF byte (which it removes).

-- 
- Julian

Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 07.01.2019 06:17, Daniel Shahaf wrote:
> Branko Čibej wrote on Mon, 07 Jan 2019 05:55 +0100:
>> On 06.01.2019 19:54, Daniel Shahaf wrote:
>>> Branko Čibej wrote on Sun, 06 Jan 2019 19:37 +0100:
>>>> A simple check would be:
>>>>
>>>>   * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
>>>>     UTF-16-LE linefeed;
>>>>   * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
>>>>     then it's a UTF-16-BE linefeed;
>>> Would would happen if it were an ASCII/UTF-8 file that happened to
>>> have a literal NUL byte next to an LF byte?  I have seen/used
>>> some of those.
>> Yes, well, in that case the NUL just might randomly "move" from one line
>> to the next, depending on changes in the file. Nothing we can do
>> short-term will fix that ... and as far as I'm concerned that's not a
>> valid text file, so it won't disturb me too much if blame results are a
>> bit fuzzy in such cases.
> You don't actually give any reason why we shouldn't support text files
> with embedded NULs.


I didn't say we shouldn't. Clearly we do, if inadvertently, or we
wouldn't be discussing Stefan's patch in this thread.


>>>   (We even parse that in mod_dav_svn, I think?)
>> Even if we do, it's irrelevant to this discussion. We set the value of
>> the Content-Type header based on svn:mime-type for a simple GET request,
>> but don't interpret it otherwise on the server side.
> The relevance is that the property may be already set in users' repositories.

It's still irrelevant to *this* discussion. I don't think anyone is
proposing to propagate pieces of svn:mime-type deep into the difflib as
part of this patch.

It will be relevant to some future discussion where we talk about
supporting (and detecting) Unicode representations as text throughout
our code.

-- Brane


Re: extending the blame callback

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Branko Čibej wrote on Mon, 07 Jan 2019 05:55 +0100:
> On 06.01.2019 19:54, Daniel Shahaf wrote:
> > Branko Čibej wrote on Sun, 06 Jan 2019 19:37 +0100:
> >> A simple check would be:
> >>
> >>   * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
> >>     UTF-16-LE linefeed;
> >>   * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
> >>     then it's a UTF-16-BE linefeed;
> > Would would happen if it were an ASCII/UTF-8 file that happened to
> > have a literal NUL byte next to an LF byte?  I have seen/used
> > some of those.
> 
> Yes, well, in that case the NUL just might randomly "move" from one line
> to the next, depending on changes in the file. Nothing we can do
> short-term will fix that ... and as far as I'm concerned that's not a
> valid text file, so it won't disturb me too much if blame results are a
> bit fuzzy in such cases.

You don't actually give any reason why we shouldn't support text files
with embedded NULs.

> >   (We even parse that in mod_dav_svn, I think?)
> 
> Even if we do, it's irrelevant to this discussion. We set the value of
> the Content-Type header based on svn:mime-type for a simple GET request,
> but don't interpret it otherwise on the server side.

The relevance is that the property may be already set in users' repositories.

Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 06.01.2019 19:54, Daniel Shahaf wrote:
> Branko Čibej wrote on Sun, 06 Jan 2019 19:37 +0100:
>> A simple check would be:
>>
>>   * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
>>     UTF-16-LE linefeed;
>>   * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
>>     then it's a UTF-16-BE linefeed;
> Would would happen if it were an ASCII/UTF-8 file that happened to
> have a literal NUL byte next to an LF byte?  I have seen/used
> some of those.

Yes, well, in that case the NUL just might randomly "move" from one line
to the next, depending on changes in the file. Nothing we can do
short-term will fix that ... and as far as I'm concerned that's not a
valid text file, so it won't disturb me too much if blame results are a
bit fuzzy in such cases.



>>   * otherwise just hope it's a linefeed and move on.
> The encoding may also be set explicitly via a svn:mime-type="text/foo;
> charset=utf-16-le" property.


Yes, but supporting _that_ in our code is a much, much bigger task than
what we're talking about in this thread. What I proposed here would be a
simple, localised hack to avoid splitting lines in the middle of a code
sequence. If we do decide to implement something like that for line
breaks, we should extend it to both variants of UTF-32 as well.

(Also, note the abuse of the 'charset' attribute in Content-Type ... the
"character set" is Unicode, it should be Content-Encoding: UTFsomething,
but that's used for something different again :\ )


>   (We even parse that in mod_dav_svn, I think?)

Even if we do, it's irrelevant to this discussion. We set the value of
the Content-Type header based on svn:mime-type for a simple GET request,
but don't interpret it otherwise on the server side.

-- Brane

Re: extending the blame callback

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Branko Čibej wrote on Sun, 06 Jan 2019 19:37 +0100:
> A simple check would be:
> 
>   * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
>     UTF-16-LE linefeed;
>   * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
>     then it's a UTF-16-BE linefeed;

Would would happen if it were an ASCII/UTF-8 file that happened to
have a literal NUL byte next to an LF byte?  I have seen/used
some of those.

>   * otherwise just hope it's a linefeed and move on.

The encoding may also be set explicitly via a svn:mime-type="text/foo;
charset=utf-16-le" property.  (We even parse that in mod_dav_svn, I think?)

Cheers,

Daniel

Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 06.01.2019 17:16, Stefan Kueng wrote:
>
>
> On 06.01.2019 15:05, Branko Čibej wrote:
>
>> Sorry about getting into this late, but as Julian said:
>>
>>> * we already have a (char*, len) wrapper, called svn_string_t, so I
>>> would expect it would be neatest to use that;
>>
>> but the patch has:
>>
>>> @@ -758,6 +758,28 @@
>>> ·*·will·be·true·if·the·reason·there·is·no·blame·information·is·that·the·line
>>>
>>>
>>> ·*·was·modified·locally.·In·all·other·cases·@a·local_change·will·be·false.
>>>
>>>
>>> ·*
>>> +·*·@since·New·in·1.12.
>>> +·*/
>>> +typedef·svn_error_t·*(*svn_client_blame_receiver4_t)(
>>> +··void·*baton,
>>> +··svn_revnum_t·start_revnum,
>>> +··svn_revnum_t·end_revnum,
>>> +··apr_int64_t·line_no,
>>> +··svn_revnum_t·revision,
>>> +··apr_hash_t·*rev_props,
>>> +··svn_revnum_t·merged_revision,
>>> +··apr_hash_t·*merged_rev_props,
>>> +··const·char·*merged_path,
>>> +··svn_stringbuf_t·*line,
>>> +··svn_boolean_t·local_change,
>>> +··apr_pool_t·*pool);
>>
>>
>> The svn_stringbuf_t struct is not appropriate here; please use
>> svn_string_t. The former is used as a buffer to construct larger
>> strings, that's why it contains a reference to a pool and a blocksize.
>> the blame receiver gets data that are already allocated from a pool and
>> should be immutable to the receiver; the appropriate declaration here is
>>
>>      const svn_string_t *line;
>
> good point.
>
>> as you only need the data pointer and the length, and very much do _not_
>> need the pool and blocksize, nor do you want the data or length to be
>> modifiable by the callback.
>>
>> That will make the changes in libsvn_client/blame.c a bit more complex
>> since you'll have to create a local svn_string_t object (on stack)
>> before calling the receiver, but it makes the receiver's interface
>> cleaner. Something like this would work:
>
> see attached patch.
>
>> Other than that, I think it's a bad idea to leave the trailing null byte
>> from the UTF-16-LE representation of U+000a at the beginning of the next
>> "line". If we're making this change in a *public* API, we should not
>> expect all users to hack around such a misfeature.
>
> problem is: if we just skip any null char after an lf, then we would
> screw up utf16be encodings (or le? I always get those two confused).


Windows default is UTF-16-LE, at least on x86(_64) and other
little-endian architectures. I'm not sure what they do on ARM but I'd be
surprised if Windows doesn't put it in little-endian mode, given that
decades of legacy software assume little-endian.

A simple check would be:

  * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
    UTF-16-LE linefeed;
  * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
    then it's a UTF-16-BE linefeed;
  * otherwise just hope it's a linefeed and move on.


But given your example below of mixing ASCII/UTF-8/UTF-16, it's better
to just leave well enough alone. :(

>> Until we have better support for other Unicode representations, we
>> should detect that null byte and include it as part of the reported
>> blame line.
>
> another situation which unfortunately is not that uncommon: files that
> have different encodings in different lines or even inside the same
> line. I've seen that especially in log files where the logger first
> prints timestamp in ascii, but then the log-text is in utf8

... which isn't a problem since ASCII and UTF-8 are compatible ...

> or even utf16. 

... but this is a rather huge problem.

> Such situations I think would require too much detection logic for the
> blame function and should be left to clients - they might not be able
> to even show such lines so why bother trying to do the detection right
> for that.


You're right about that. I wouldn't dream of supporting such things
within the blame callback itself. However it would still be nice to at
least document what's happening.

The patch itself looks OK to me now.

-- Brane


Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 06.01.2019 15:05, Branko Čibej wrote:

> Sorry about getting into this late, but as Julian said:
> 
>> * we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;
> 
> but the patch has:
> 
>> @@ -758,6 +758,28 @@
>> ·*·will·be·true·if·the·reason·there·is·no·blame·information·is·that·the·line
>>
>> ·*·was·modified·locally.·In·all·other·cases·@a·local_change·will·be·false.
>>
>> ·*
>> +·*·@since·New·in·1.12.
>> +·*/
>> +typedef·svn_error_t·*(*svn_client_blame_receiver4_t)(
>> +··void·*baton,
>> +··svn_revnum_t·start_revnum,
>> +··svn_revnum_t·end_revnum,
>> +··apr_int64_t·line_no,
>> +··svn_revnum_t·revision,
>> +··apr_hash_t·*rev_props,
>> +··svn_revnum_t·merged_revision,
>> +··apr_hash_t·*merged_rev_props,
>> +··const·char·*merged_path,
>> +··svn_stringbuf_t·*line,
>> +··svn_boolean_t·local_change,
>> +··apr_pool_t·*pool);
> 
> 
> The svn_stringbuf_t struct is not appropriate here; please use
> svn_string_t. The former is used as a buffer to construct larger
> strings, that's why it contains a reference to a pool and a blocksize.
> the blame receiver gets data that are already allocated from a pool and
> should be immutable to the receiver; the appropriate declaration here is
> 
>      const svn_string_t *line;

good point.

> as you only need the data pointer and the length, and very much do _not_
> need the pool and blocksize, nor do you want the data or length to be
> modifiable by the callback.
> 
> That will make the changes in libsvn_client/blame.c a bit more complex
> since you'll have to create a local svn_string_t object (on stack)
> before calling the receiver, but it makes the receiver's interface
> cleaner. Something like this would work:

see attached patch.

> Other than that, I think it's a bad idea to leave the trailing null byte
> from the UTF-16-LE representation of U+000a at the beginning of the next
> "line". If we're making this change in a *public* API, we should not
> expect all users to hack around such a misfeature.

problem is: if we just skip any null char after an lf, then we would 
screw up utf16be encodings (or le? I always get those two confused).

> Until we have better support for other Unicode representations, we
> should detect that null byte and include it as part of the reported
> blame line.

another situation which unfortunately is not that uncommon: files that 
have different encodings in different lines or even inside the same 
line. I've seen that especially in log files where the logger first 
prints timestamp in ascii, but then the log-text is in utf8 or even 
utf16. Such situations I think would require too much detection logic 
for the blame function and should be left to clients - they might not be 
able to even show such lines so why bother trying to do the detection 
right for that.

Stefan

Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 06.01.2019 15:22, Julian Foad wrote:
> Branko Čibej wrote:
>> +              const svn_string_t line = {sb->data, sb->len};
> 
> In plain old C, we can't write that, but we have a helper function: svn_stringbuf__morph_into_string().
> 

but that destroys the original stringbuf, but that one is still needed. 
So we can't use that helper function.

Stefan


Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 06.01.2019 15:22, Julian Foad wrote:
> Branko Čibej wrote:
>> +              const svn_string_t line = {sb->data, sb->len};
> In plain old C, we can't write that, but we have a helper function: svn_stringbuf__morph_into_string().

I've noticed that clang is a bit deficient when it comes to pedantic C90
warnings. Sorry.

-- Brane


Re: extending the blame callback

Posted by Julian Foad <ju...@apache.org>.
Branko Čibej wrote:
> +              const svn_string_t line = {sb->data, sb->len};

In plain old C, we can't write that, but we have a helper function: svn_stringbuf__morph_into_string().
-- 
- Julian

Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 06.01.2019 15:05, Branko Čibej wrote:
> Until we have better support for other Unicode representations, we
> should detect that null byte and include it as part of the reported
> blame line.

That might also imply detecting that the file encoding is not UTF-16-BE
instead ... which could very likely also have a null byte after the 0x0a
that does belong to the next line. I'd suggest checking for odd/even
offsets of the newline byte, since both UTF-16 representations must have
an even number of bytes in each line. It would be nice to avoid
treating, e.g., U+010A as a line break ...

-- Brane


Re: extending the blame callback

Posted by Branko Čibej <br...@apache.org>.
On 06.01.2019 14:10, Stefan Kueng wrote:
>
>
> On 05.01.2019 22:35, Daniel Shahaf wrote:
>> Stefan Kueng wrote on Sat, 05 Jan 2019 21:15 +0100:
>>> here's a patch using svn_stringbuf_t for review.
>>
>> Change "Provided for backwards compatibility with the 1.6 API" to
>> "Provided for backwards compatibility with the 1.11 API" in
>> svn_client_blame5()'s docstring.
>>
>
> done. See attached patch.

Sorry about getting into this late, but as Julian said:

> * we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;

but the patch has:

> @@ -758,6 +758,28 @@
> ·*·will·be·true·if·the·reason·there·is·no·blame·information·is·that·the·line
>
> ·*·was·modified·locally.·In·all·other·cases·@a·local_change·will·be·false.
>
> ·*
> +·*·@since·New·in·1.12.
> +·*/
> +typedef·svn_error_t·*(*svn_client_blame_receiver4_t)(
> +··void·*baton,
> +··svn_revnum_t·start_revnum,
> +··svn_revnum_t·end_revnum,
> +··apr_int64_t·line_no,
> +··svn_revnum_t·revision,
> +··apr_hash_t·*rev_props,
> +··svn_revnum_t·merged_revision,
> +··apr_hash_t·*merged_rev_props,
> +··const·char·*merged_path,
> +··svn_stringbuf_t·*line,
> +··svn_boolean_t·local_change,
> +··apr_pool_t·*pool);


The svn_stringbuf_t struct is not appropriate here; please use
svn_string_t. The former is used as a buffer to construct larger
strings, that's why it contains a reference to a pool and a blocksize.
the blame receiver gets data that are already allocated from a pool and
should be immutable to the receiver; the appropriate declaration here is

    const svn_string_t *line;


as you only need the data pointer and the length, and very much do _not_
need the pool and blocksize, nor do you want the data or length to be
modifiable by the callback.

That will make the changes in libsvn_client/blame.c a bit more complex
since you'll have to create a local svn_string_t object (on stack)
before calling the receiver, but it makes the receiver's interface
cleaner. Something like this would work:

@@ -941,18 +941,20 @@ svn_client_blame5(const char *target,
             SVN_ERR(ctx->cancel_func(ctx->cancel_baton));
           if (!eof || sb->len)
             {
+              const svn_string_t line = {sb->data, sb->len};
+
               if (walk->rev)
                 SVN_ERR(receiver(receiver_baton, start_revnum, end_revnum,
                                  line_no, walk->rev->revision,
                                  walk->rev->rev_props, merged_rev,
                                  merged_rev_props, merged_path,
-                                 sb->data, FALSE, iterpool));
+                                 &line, FALSE, iterpool));
               else
                 SVN_ERR(receiver(receiver_baton, start_revnum, end_revnum,
                                  line_no, SVN_INVALID_REVNUM,
                                  NULL, SVN_INVALID_REVNUM,
                                  NULL, NULL,
-                                 sb->data, TRUE, iterpool));
+                                 &line, TRUE, iterpool));
             }
           if (eof) break;
         }



Other than that, I think it's a bad idea to leave the trailing null byte
from the UTF-16-LE representation of U+000a at the beginning of the next
"line". If we're making this change in a *public* API, we should not
expect all users to hack around such a misfeature.

Until we have better support for other Unicode representations, we
should detect that null byte and include it as part of the reported
blame line.

-- Brane


Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 05.01.2019 22:35, Daniel Shahaf wrote:
> Stefan Kueng wrote on Sat, 05 Jan 2019 21:15 +0100:
>> here's a patch using svn_stringbuf_t for review.
> 
> Change "Provided for backwards compatibility with the 1.6 API" to
> "Provided for backwards compatibility with the 1.11 API" in
> svn_client_blame5()'s docstring.
> 

done. See attached patch.

Stefan

Re: extending the blame callback

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stefan Kueng wrote on Sat, 05 Jan 2019 21:15 +0100:
> here's a patch using svn_stringbuf_t for review.

Change "Provided for backwards compatibility with the 1.6 API" to
"Provided for backwards compatibility with the 1.11 API" in
svn_client_blame5()'s docstring.

Re: extending the blame callback

Posted by Stefan Kueng <to...@gmail.com>.

On 05.01.2019 17:53, Julian Foad wrote:
> Stefan Kueng wrote:
>> When running blame on an utf16 text file, the lines can not be used
>> since the blame callback only passes a 'char *' parameter which means it
>> ends at the first zero char. But actually, svn knows if the line has
>> more content.
> 
> +1 on doing something to fix the problem.
> 
> Just briefly:
> * we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;

here's a patch using svn_stringbuf_t for review.

Stefan

Re: extending the blame callback

Posted by Julian Foad <ju...@apache.org>.
Stefan Kueng wrote:
> When running blame on an utf16 text file, the lines can not be used 
> since the blame callback only passes a 'char *' parameter which means it 
> ends at the first zero char. But actually, svn knows if the line has 
> more content.

+1 on doing something to fix the problem.

Just briefly:
* we already have a (char*, len) wrapper, called svn_string_t, so I would expect it would be neatest to use that;
* splitting the lines on a bare LF byte, and having a left-over zero byte, sounds just so wrong... it can't be right, can it? Can't we do something better? Use a callback for deciding how to split lines, or something? I haven't looked at the code, so not sure what exactly.

(I can't review the patch in more detail just now.)

-- 
- Julian