Posted to modperl@perl.apache.org by "André Warnier (tomcat/perl)" <aw...@ice-sa.com> on 2019/11/13 18:12:10 UTC

Output filters, data encoding

Hi.
I'm writing a new PerlOutputFilter, stream version.
I have written several working ones before, so I know the general scheme.
But in this latest filter, I have a problem with the data encoding, which I did not 
encounter previously.
I did not find an answer in the on-line mod_perl documentation, which seems silent about 
any encoding of the data that one might read in a filter.
(I also tried the WWW, without any more success).

To read the response data coming from upstream, I use the standard $f->read(), as in

	while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
..

and then I need to pass this data to another module for processing (Template::Toolkit).
To make a long story short, Template::Toolkit misinterprets the data I'm sending to it, 
because this data /is/ actually UTF-8, but apparently not marked so internally by the 
$f->read(). So TT2 re-encodes it, leading to double UTF-8 encoding.

My question is : can I - and how -, set the filehandle that corresponds to the $f->read(), 
to a UTF-8 layer ?
I have tried

line 155: binmode($f,'encoding:(UTF-8)');

and that triggers an error :
  Not a GLOB reference at (my filter) line 155.\n
)

Or do I need to read the data 'as is', and separately do an

  $decoded_buffer = decode('UTF-8', $buffer);

?

Note 1 : I can of course do the above decode(), but it seems more elegant to just insert 
the I/O layer.

Note 2 : I know that the pre-filter data is UTF-8 (or at least I choose to believe that it 
is), because I check $f->r->content_type(), which returns "text/html;charset=UTF-8".
Also, the filter otherwise works as expected, and I get the expected output in the 
browser, except that e.g. "München" becomes "MÃ¼nchen".

Thanks in advance for any tips
André
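
A minimal sketch of the double encoding described above, assuming the filter receives raw UTF-8 octets and the downstream module encodes whatever it is given. The sample string is made up for illustration; only Encode's documented encode() and decode() are used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as an output filter would receive them: UTF-8 octets, not yet decoded.
my $octets = "M\xC3\xBCnchen";

# Encoding the octets as if they were characters doubles the encoding:
# each of the two high bytes becomes a two-byte sequence of its own.
my $double = encode('UTF-8', $octets);

# Decoding first yields real characters; encoding those gives the original octets back.
my $chars = decode('UTF-8', $octets);
my $once  = encode('UTF-8', $chars);

printf "%d octets in, %d after re-encoding, %d after decode + encode\n",
    length $octets, length $double, length $once;
```

The extra octets in $double are what a browser renders as "MÃ¼nchen".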

Re: Output filters, data encoding

Posted by "André Warnier (tomcat/perl)" <aw...@ice-sa.com>.

On 13.11.2019 19:37, Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
>> 	while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
>> ..
>>
>> and then I need to pass this data to another module for processing (Template::Toolkit).
>> To make a long story short, Template::Toolkit misinterprets the data I'm
>> sending to it, because this data /is/ actually UTF-8, but apparently not
>> marked so internally by the $f->read(). So TT2 re-encodes it, leading to
>> double UTF-8 encoding.
>>
>> My question is : can I - and how -, set the filehandle that corresponds to
>> the $f->read(), to a UTF-8 layer ?
>> I have tried
>>
>> line 155: binmode($f,'encoding:(UTF-8)');
>>
>> and that triggers an error :
>>   Not a GLOB reference at (my filter) line 155.\n
>> )
>>
>> Or do I need to read the data 'as is', and separately do an
>>
>>   $decoded_buffer = decode('UTF-8', $buffer);
>
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
>
>         If CHECK is set to "Encode::FB_QUIET", encoding and decoding
>         immediately return the portion of the data that has been processed so
>         far when an error occurs. The data argument is overwritten with
>         everything after that point; that is, the unprocessed portion of the
>         data.  This is handy when you have to call "decode" repeatedly in the
>         case where your source data may contain partial multi-byte character
>         sequences, (that is, you are reading with a fixed-width buffer). Here's
>         some sample code to do exactly that:
>
>             my($buffer, $string) = ("", "");
>             while (read($fh, $buffer, 256, length($buffer))) {
>                 $string .= decode($encoding, $buffer, Encode::FB_QUIET);
>                 # $buffer now contains the unprocessed partial character
>             }
>
> Looks exactly like your case.
>
Thanks for the response and the tip.

My idea of adding a UTF-8 layer to the filehandle through which Apache2::Filter reads the 
incoming data was probably wrong anyway : it cannot do that, because it gets this data 
originally in chunks, as "bucket brigades" from Apache httpd.  And there is no guarantee 
that such a bucket brigade would always end in "complete" UTF-8 character sequences.
At the very least, this would probably complicate the code underlying $f->read() quite a bit.
It is clearer to handle that in the filter itself.

The Encode::FB_QUIET flag above, with the incremental buffer read, is really smart.
Unfortunately, the Apache2::Filter read() method does not allow as many arguments, and all 
one has is something like this :

	my $accumulated_content = "";
   	while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
		$accumulated_content .= $buffer;
	}

Luckily, in this case, I have to accumulate the complete response content anyway, before I 
can decide to call Template::Toolkit on it or not. So I can do a single decode() on 
$accumulated_content. Not the most efficient memory-wise, but good enough in this case.
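
As a rough sketch of that approach outside of mod_perl: the @chunks array below is only a stand-in for successive $f->read() results (an assumption for illustration), split on purpose in the middle of a multi-byte character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for successive $f->read() results: "München" cut mid-character.
my @chunks = ("M\xC3", "\xBCnchen");

# Accumulate the raw octets first, exactly as the filter loop does.
my $accumulated_content = '';
$accumulated_content .= $_ for @chunks;

# One decode over the complete content, so no character is ever cut in half.
my $decoded_content = decode('UTF-8', $accumulated_content);

printf "%d octets accumulated, %d characters decoded\n",
    length $accumulated_content, length $decoded_content;
```

Because decoding happens only once, after the last chunk, chunk boundaries cannot split a character.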

Re: Output filters, data encoding

Posted by Damyan Ivanov <dm...@debian.org>.

-=| pali@cpan.org, 14.11.2019 09:51:20 +0100 |=-
> On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> >            my($buffer, $string) = ("", "");
> >            while (read($fh, $buffer, 256, length($buffer))) {
> >                $string .= decode($encoding, $buffer, Encode::FB_QUIET);
> >                # $buffer now contains the unprocessed partial character
> >            }
> 
> This code is dangerous: it can enter an endless loop. Once you read an
> invalid UTF-8 sequence, the loop above never finishes. So if the input is
> under user/attacker control, you introduce a DoS issue.

Sure. A check to prevent that would be in order. I must admit that 
I was very happy to find a solution to the problem that was even in 
the official documentation.

> Instead of FB_QUIET, you should use the Encode::STOP_AT_PARTIAL flag.
> Encode::decode then stops decoding when a valid UTF-8 sequence is not
> complete and needs more bytes. By default, invalid UTF-8 sequences are
> mapped to the Unicode replacement character.
> 
> Btw, PerlIO::encoding also uses the Encode::STOP_AT_PARTIAL flag to handle
> this situation.
> 
> PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but only because
> nobody has found time to write documentation for it. It is part of the
> Encode API and ready to use...

That would be https://rt.cpan.org/Public/Bug/Display.html?id=67065 
(filed 8 years ago, still open).
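
One possible shape for such a check, as a sketch rather than a vetted fix. It relies on the fact that a UTF-8 character is at most 4 bytes long, so a genuinely incomplete trailing character can leave at most 3 bytes behind. The helper name decode_chunks_checked and the sample chunks are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode byte chunks incrementally with FB_QUIET and
# die as soon as the leftover cannot be just an incomplete character.
sub decode_chunks_checked {
    my @chunks = @_;
    my ($buffer, $string) = ('', '');
    for my $chunk (@chunks) {
        $buffer .= $chunk;
        # FB_QUIET returns what could be decoded and leaves the rest in $buffer.
        $string .= decode('UTF-8', $buffer, Encode::FB_QUIET);
        # An incomplete UTF-8 character is at most 3 bytes long.
        die "invalid UTF-8 in input\n" if length($buffer) > 3;
    }
    die "truncated or invalid UTF-8 at end of input\n" if length $buffer;
    return $string;
}

my $ok  = decode_chunks_checked("M\xC3", "\xBCnchen");
my $err = eval { decode_chunks_checked("M\xFF", "nchen"); 1 } ? '' : $@;

printf "decoded %d characters; bad input gave: %s", length $ok, $err;
```

The length test keeps the buffer from growing without bound when the input is malformed.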

Re: Output filters, data encoding

Posted by pa...@cpan.org.

On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> > 	while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> > ..
> > 
> > and then I need to pass this data to another module for processing (Template::Toolkit).
> > To make a long story short, Template::Toolkit misinterprets the data I'm
> > sending to it, because this data /is/ actually UTF-8, but apparently not
> > marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> > double UTF-8 encoding.
> > 
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> > 
> > line 155: binmode($f,'encoding:(UTF-8)');
> > 
> > and that triggers an error :
> >  Not a GLOB reference at (my filter) line 155.\n
> > )
> > 
> > Or do I need to read the data 'as is', and separately do an
> > 
> >  $decoded_buffer = decode('UTF-8', $buffer);
> 
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
> 
>        If CHECK is set to "Encode::FB_QUIET", encoding and decoding
>        immediately return the portion of the data that has been processed so
>        far when an error occurs. The data argument is overwritten with
>        everything after that point; that is, the unprocessed portion of the
>        data.  This is handy when you have to call "decode" repeatedly in the
>        case where your source data may contain partial multi-byte character
>        sequences, (that is, you are reading with a fixed-width buffer). Here's
>        some sample code to do exactly that:
> 
>            my($buffer, $string) = ("", "");
>            while (read($fh, $buffer, 256, length($buffer))) {
>                $string .= decode($encoding, $buffer, Encode::FB_QUIET);
>                # $buffer now contains the unprocessed partial character
>            }

This code is dangerous: it can enter an endless loop. Once you read an
invalid UTF-8 sequence, the loop above never finishes. So if the input is
under user/attacker control, you introduce a DoS issue.

Instead of FB_QUIET, you should use the Encode::STOP_AT_PARTIAL flag.
Encode::decode then stops decoding when a valid UTF-8 sequence is not
complete and needs more bytes. By default, invalid UTF-8 sequences are
mapped to the Unicode replacement character.

Btw, PerlIO::encoding also uses the Encode::STOP_AT_PARTIAL flag to handle
this situation.

PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but only because
nobody has found time to write documentation for it. It is part of the
Encode API and ready to use...

> 
> Looks exactly like your case.
> 
> 
> -- Damyan
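
A small sketch of the STOP_AT_PARTIAL variant, assuming a reasonably recent Encode release in which the flag is honoured by a plain decode() call. The @chunks array is a made-up stand-in for data arriving in pieces.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up input: "München" arriving in two pieces, split mid-character.
my @chunks = ("M\xC3", "\xBCnchen");

my ($buffer, $string) = ('', '');
for my $chunk (@chunks) {
    $buffer .= $chunk;
    # Complete characters are decoded and removed from $buffer;
    # an incomplete trailing sequence stays there for the next round.
    $string .= decode('UTF-8', $buffer, Encode::STOP_AT_PARTIAL);
}

# Anything still in $buffer here would be a truncated character at end of input.
my $leftover = length $buffer;

printf "%d characters decoded, %d bytes left over\n", length $string, $leftover;
```

Since no error-returning flag is set, malformed sequences are replaced rather than left in the buffer, so the buffer does not keep growing.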

Re: Output filters, data encoding

Posted by Damyan Ivanov <dm...@debian.org>.

-=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> 	while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> ..
> 
> and then I need to pass this data to another module for processing (Template::Toolkit).
> To make a long story short, Template::Toolkit misinterprets the data I'm
> sending to it, because this data /is/ actually UTF-8, but apparently not
> marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> double UTF-8 encoding.
> 
> My question is : can I - and how -, set the filehandle that corresponds to
> the $f->read(), to a UTF-8 layer ?
> I have tried
> 
> line 155: binmode($f,'encoding:(UTF-8)');
> 
> and that triggers an error :
>  Not a GLOB reference at (my filter) line 155.\n
> )
> 
> Or do I need to read the data 'as is', and separately do an
> 
>  $decoded_buffer = decode('UTF-8', $buffer);

There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:

       If CHECK is set to "Encode::FB_QUIET", encoding and decoding
       immediately return the portion of the data that has been processed so
       far when an error occurs. The data argument is overwritten with
       everything after that point; that is, the unprocessed portion of the
       data.  This is handy when you have to call "decode" repeatedly in the
       case where your source data may contain partial multi-byte character
       sequences, (that is, you are reading with a fixed-width buffer). Here's
       some sample code to do exactly that:

           my($buffer, $string) = ("", "");
           while (read($fh, $buffer, 256, length($buffer))) {
               $string .= decode($encoding, $buffer, Encode::FB_QUIET);
               # $buffer now contains the unprocessed partial character
           }

Looks exactly like your case.


-- Damyan

Re: Output filters, data encoding

Posted by Vincent Veyron <vv...@wanadoo.fr>.

On Wed, 13 Nov 2019 19:12:10 +0100
André Warnier (tomcat/perl) <aw...@ice-sa.com> wrote:
> 

I also found that calls to binmode in output filters generate a double encoding.

Here is a paste of the code of an output filter that adds a menu, some headers and closing tags to the html pages generated by previous modules; it reads from STDOUT, not from a file:

https://pastebin.com/trhjfDxX

It uses this :

>   # we have reached the end of the content
>    if ($f->seen_eos) {
>       
>        $content = Encode::decode_utf8( $content ) . '</div>' ;

Never had a problem with it.

I have handlers, not output filters, that read from files and use this :

    open DOCUMENT_CONTENT, "<:encoding(UTF-8)", "$document_content" or die "can't open $document_content : $!\n" ;

-- 

					Best regards, Vincent Veyron 

https://compta.libremen.com
Free software for double-entry general accounting

Re: Output filters, data encoding

Posted by pa...@cpan.org.

On Wednesday 13 November 2019 20:10:07 André Warnier (tomcat/perl) wrote:
> On 13.11.2019 19:53, pali@cpan.org wrote:
> > On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> > > On 13.11.2019 19:17, pali@cpan.org wrote:
> > > > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > > > My question is : can I - and how -, set the filehandle that corresponds to
> > > > > the $f->read(), to a UTF-8 layer ?
> > > > > I have tried
> > > > > 
> > > > > line 155: binmode($f,'encoding:(UTF-8)');
> > > > 
> > > > Hi André! When specifying a PerlIO layer for a filehandle, you need
> > > > to write a colon before the layer name. So the correct binmode call is:
> > > > 
> > > >     binmode($f, ':encoding(UTF-8)');
> > > > 
> > > > > and that triggers an error :
> > > > >    Not a GLOB reference at (my filter) line 155.\n
> > > > > )
> > > 
> > > Thanks. Ooops, that was a typo (also in my filter, not only in the list message).
> > > But correcting it, does not change the GLOB error message.
> > 
> > OK. What is $f? Is it an object, or what kind of scalar?
> > 
> It is the Apache2::Filter object.
> See : http://perl.apache.org/docs/2.0/api/Apache2/Filter.html
> Configured in httpd as :       PerlOutputFilterHandler MyFilter
> See also :  http://perl.apache.org/docs/2.0/user/handlers/filters.html
> 
> My (hopeful) thinking was that considering the
> $f->read()
> the Apache2::Filter object may also be a FileHandle, hence the attempt at
> binmode($f,..)
> But that seems to be incorrect.
> (And I don't see any (documented) method of Apache2::Filter that would
> return the underlying FileHandle either)

Sorry, then I do not know :-(

Re: Output filters, data encoding

Posted by "André Warnier (tomcat/perl)" <aw...@ice-sa.com>.

On 13.11.2019 19:53, pali@cpan.org wrote:
> On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
>> On 13.11.2019 19:17, pali@cpan.org wrote:
>>> On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
>>>> My question is : can I - and how -, set the filehandle that corresponds to
>>>> the $f->read(), to a UTF-8 layer ?
>>>> I have tried
>>>>
>>>> line 155: binmode($f,'encoding:(UTF-8)');
>>>
>>> Hi André! When specifying a PerlIO layer for a filehandle, you need
>>> to write a colon before the layer name. So the correct binmode call is:
>>>
>>>     binmode($f, ':encoding(UTF-8)');
>>>
>>>> and that triggers an error :
>>>>    Not a GLOB reference at (my filter) line 155.\n
>>>> )
>>
>> Thanks. Ooops, that was a typo (also in my filter, not only in the list message).
>> But correcting it, does not change the GLOB error message.
>
> OK. What is $f? Is it an object, or what kind of scalar?
>
It is the Apache2::Filter object.
See : http://perl.apache.org/docs/2.0/api/Apache2/Filter.html
Configured in httpd as :       PerlOutputFilterHandler MyFilter
See also :  http://perl.apache.org/docs/2.0/user/handlers/filters.html

My (hopeful) thinking was that considering the
$f->read()
the Apache2::Filter object may also be a FileHandle, hence the attempt at
binmode($f,..)
But that seems to be incorrect.
(And I don't see any (documented) method of Apache2::Filter that would return the 
underlying FileHandle either)
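
For comparison, an I/O layer can be applied once the bytes are in an ordinary scalar, because core Perl can open an in-memory filehandle on a scalar reference. This is only a sketch of that general Perl feature, not something Apache2::Filter provides; the variable contents are made up.

```perl
use strict;
use warnings;

# Made-up content, standing in for bytes already read from the filter.
my $accumulated_content = "M\xC3\xBCnchen\n";

# An in-memory filehandle is a real handle, so a PerlIO layer can be attached.
open my $fh, '<:encoding(UTF-8)', \$accumulated_content
    or die "cannot open in-memory handle: $!";
my $line = <$fh>;
close $fh;
chomp $line;

printf "%d characters read through the layer\n", length $line;
```

This only helps once the whole content is in memory, so it offers nothing over a single decode() call beyond the filehandle interface.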

Re: Output filters, data encoding

Posted by pa...@cpan.org.

On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> On 13.11.2019 19:17, pali@cpan.org wrote:
> > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > My question is : can I - and how -, set the filehandle that corresponds to
> > > the $f->read(), to a UTF-8 layer ?
> > > I have tried
> > > 
> > > line 155: binmode($f,'encoding:(UTF-8)');
> > 
> > Hi André! When specifying a PerlIO layer for a filehandle, you need
> > to write a colon before the layer name. So the correct binmode call is:
> > 
> >    binmode($f, ':encoding(UTF-8)');
> > 
> > > and that triggers an error :
> > >   Not a GLOB reference at (my filter) line 155.\n
> > > )
> 
> Thanks. Ooops, that was a typo (also in my filter, not only in the list message).
> But correcting it, does not change the GLOB error message.

OK. What is $f? Is it an object, or what kind of scalar?

Re: Output filters, data encoding

Posted by "André Warnier (tomcat/perl)" <aw...@ice-sa.com>.

On 13.11.2019 19:17, pali@cpan.org wrote:
> On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
>> My question is : can I - and how -, set the filehandle that corresponds to
>> the $f->read(), to a UTF-8 layer ?
>> I have tried
>>
>> line 155: binmode($f,'encoding:(UTF-8)');
>
> Hi André! When specifying a PerlIO layer for a filehandle, you need
> to write a colon before the layer name. So the correct binmode call is:
>
>    binmode($f, ':encoding(UTF-8)');
>
>> and that triggers an error :
>>   Not a GLOB reference at (my filter) line 155.\n
>> )

Thanks. Ooops, that was a typo (also in my filter, not only in the list message).
But correcting it, does not change the GLOB error message.

Re: Output filters, data encoding

Posted by pa...@cpan.org.

On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> My question is : can I - and how -, set the filehandle that corresponds to
> the $f->read(), to a UTF-8 layer ?
> I have tried
> 
> line 155: binmode($f,'encoding:(UTF-8)');

Hi André! When specifying a PerlIO layer for a filehandle, you need
to write a colon before the layer name. So the correct binmode call is:

  binmode($f, ':encoding(UTF-8)');

> and that triggers an error :
>  Not a GLOB reference at (my filter) line 155.\n
> )

Re: Output filters, data encoding

Posted by "André Warnier (tomcat/perl)" <aw...@ice-sa.com>.

On 14.11.2019 01:09, Hua, Yong wrote:
> Hi
>
> on 2019/11/14 2:12, André Warnier (tomcat/perl) wrote:
>> I'm writing a new PerlOutputFilter, stream version.
>
> Can you give a more general introduction to what "stream version" means?
>
> Thank you.
>
You should read the pages I referred to previously; they explain this better than I 
could:
1) http://perl.apache.org/docs/2.0/user/handlers/filters.html
2) http://perl.apache.org/docs/2.0/api/Apache2/Filter.html

See in particular here :
http://perl.apache.org/docs/2.0/user/handlers/filters.html#Two_Methods_for_Manipulating_Data

Re: Output filters, data encoding

Posted by "Hua, Yong" <hu...@mein.gmx>.

Hi

on 2019/11/14 2:12, André Warnier (tomcat/perl) wrote:
> I'm writing a new PerlOutputFilter, stream version.

Can you give a more general introduction to what "stream version" means?

Thank you.