Posted to user@jmeter.apache.org by Vincent Partington <vp...@xebia.com> on 2003/12/23 13:45:39 UTC

jorphan.io.TextFile always uses default encoding, ResultCollector always uses UTF-*

Hi,

The class jorphan.io.TextFile always uses the default encoding to read and
write files:
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-jmeter/src/jorphan/org/apache/jorphan/io/TextFile.java?content-type=text%2Fplain&rev=1.4

In my case the default encoding is ISO-8859-1 (Windows XP US). However,
other parts of the JMeter code explicitly use UTF-8:
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-jmeter/src/core/org/apache/jmeter/reporters/ResultCollector.java?content-type=text%2Fplain&rev=1.29

This causes the result XML file to say UTF-8 in its XML header, but the
content is actually ISO-8859-1. If funny characters are written to the
result XML, the file will not be accepted by the XSLT processor.

I fixed the problem by hardcoding UTF-8 in jorphan.io.TextFile, but I
don't know whether that will affect other code. Any thoughts?
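
A minimal sketch of how the header and the content can be kept in step, assuming a modern JDK. XmlHeaderWriter is a hypothetical helper and not JMeter code: it takes a single Charset and uses it both for the encoding named in the XML declaration and for the writer that produces the bytes.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of JMeter: the declared encoding and the
// encoding actually used for the bytes come from the same Charset object.
final class XmlHeaderWriter {
    static void write(Path target, String body, Charset charset) throws IOException {
        // The writer encodes with the explicit charset, never the platform default.
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(target), charset)) {
            out.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n");
            out.write(body);
        }
    }
}
```

Any code that later reopens the file has to use the same charset, otherwise the mismatch described above comes back.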

Regards, Vincent.








---------------------------------------------------------------------
To unsubscribe, e-mail: jmeter-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: jmeter-user-help@jakarta.apache.org


Re: jorphan.io.TextFile always uses default encoding, ResultCollector always uses UTF-*

Posted by Jordi Salvat i Alabart <js...@atg.com>.
Now I understand. I thought iso-8859-1 reading & writing would not 
change anything. I was obviously wrong.

In any case, it looks like the root problem is that XML is the wrong 
format for that content. One could argue that it can still be used where 
the response is text, but what about cases in which the response is also 
XML? A <![CDATA[ will probably not help, since the response may contain 
CDATA sections too, and the first ]]> string closes the outer CDATA 
section (nested CDATA sections are not allowed).

So we're left with two choices:
- Use separate files and only add references to them in the XML file.
- Use base64 encoding for all content.
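
The base64 option can be sketched as follows, assuming Java 8 or later for java.util.Base64 (not available when this thread was written). ResponseEncoding and its method names are hypothetical. The breaksCdata check illustrates the CDATA limitation mentioned above: any payload containing "]]>" would end a CDATA section early.

```java
import java.util.Base64;

// Hypothetical helper, not part of JMeter.
final class ResponseEncoding {
    // True if the payload contains the CDATA terminator, so wrapping it in a
    // single CDATA section would end that section early.
    static boolean breaksCdata(String payload) {
        return payload.contains("]]>");
    }

    // Encodes arbitrary bytes using only A-Z, a-z, 0-9, '+', '/' and '=',
    // all of which are safe as XML character data.
    static String toBase64(byte[] responseData) {
        return Base64.getEncoder().encodeToString(responseData);
    }

    // Restores the original bytes exactly.
    static byte[] fromBase64(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

The cost is a size increase of roughly one third and content that is no longer readable in the result file without decoding.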

What do you (Vincent and others) think?

-- 
Salut,

Jordi.





Re: jorphan.io.TextFile always uses default encoding, ResultCollector always uses UTF-*

Posted by Vincent Partington <vp...@xebia.com>.
Jordi Salvat i Alabart wrote:
> I've fixed the issue (in current CVS code) by enabling TextFile to use 
> an encoding of choice and having ResultCollector use UTF-8. It's 
> suboptimal, but it will do until someone has the time and energy to 
> write a ResultCollector that dumps content into files or uses base64 
> encoding or whatever.

Hi Jordi,

Thank you for implementing that fix. That will at least solve some of my 
problems. There is still the issue of the &#1; characters appearing.

The problem will be solved even better when result data is not written 
to the results XML file. Currently this only happens when a response 
assertion sets error="true":
     result.setFailureMessage(
            new String((byte[]) response.getResponseData()));

A more logical approach would be to set the failure message to something 
like "Cannot assert response when sample was not successful", or just not 
to check whether the sample was successful at all.

Regards, Vincent.






Re: jorphan.io.TextFile always uses default encoding, ResultCollector always uses UTF-*

Posted by Sebastian Bazley <Se...@london.sema.slb.com>.
I wrote ResultSaver recently - this saves the responseData as a file.
It uses write() on a FileOutputStream, which does not do any conversion, AFAIK.

It was originally intended for functional testing, and can certainly create GIF and JPG files - perhaps it would work here as well?

If not, perhaps it could be extended as needed.

Note that it is implemented as a Post-Processor - just add it after any sampler (or group of samplers, at any level) and the sample
data will be stored in a file.
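
A minimal sketch of the byte-for-byte approach described here. It is not the actual ResultSaver source; RawResponseSaver and its save method are hypothetical names. The point is that no Reader or Writer is involved, so no character encoding is applied.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Path;

// Hypothetical stand-in, not the real ResultSaver.
final class RawResponseSaver {
    // Writes the response bytes unchanged; an OutputStream does no
    // character conversion, so binary data such as GIF or JPG survives.
    static void save(Path target, byte[] responseData) throws IOException {
        try (OutputStream out = new FileOutputStream(target.toFile())) {
            out.write(responseData);
        }
    }
}
```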

S.
----- Original Message ----- 
From: "Jordi Salvat i Alabart" <js...@atg.com>
To: "JMeter Users List" <jm...@jakarta.apache.org>
Sent: Wednesday, December 24, 2003 2:03 PM
Subject: Re: jorphan.io.TextFile always uses default encoding, ResultCollector always uses UTF-*


I've fixed the issue (in current CVS code) by enabling TextFile to use
an encoding of choice and having ResultCollector use UTF-8. It's
suboptimal, but it will do until someone has the time and energy to
write a ResultCollector that dumps content into files or uses base64
encoding or whatever.

-- 
Salut,

Jordi.





Re: jorphan.io.TextFile always uses default encoding, ResultCollector always uses UTF-*

Posted by Jordi Salvat i Alabart <js...@atg.com>.
I've fixed the issue (in current CVS code) by enabling TextFile to use 
an encoding of choice and having ResultCollector use UTF-8. It's 
suboptimal, but it will do until someone has the time and energy to 
write a ResultCollector that dumps content into files or uses base64 
encoding or whatever.

-- 
Salut,

Jordi.





Re: jorphan.io.TextFile always uses default encoding, ResultCollector always uses UTF-*

Posted by Vincent Partington <vp...@xebia.com>.
Jordi Salvat i Alabart wrote:
> You're indeed pointing to a whole collection of bugs... but none of them
>  seems to be the one affecting you :-)
>
> [...]
> - Last line trim in ResultCollector -- but only last line trim! It
> should not make any difference whether you use UTF-8 or ISO-Latin-1
> here, as far as you use the same one for reading and writing. Still, it
> could fail if the platform encoding is one for which the UTF-8
> representation of some character used in the file is not a valid
> character representation. (Sorry for the very clear statement -- that's
> about as good as my English can be.) ISO-Latin-1 is pretty safe, but
> other platforms will of course use others...

Well, actually this _is_ the use of jorphan.io.TextFile that is causing my
problems. Reading a UTF-8 file with an ISO-8859-1 encoding and then
writing it out with that same ISO-8859-1 encoding causes a few UTF-8
sequences to be altered.
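
A small sketch for checking whether a decode-then-encode round trip preserves a file's bytes. EncodingRoundTrip is a hypothetical name. In Java a strict ISO-8859-1 round trip maps every byte to a character and back, so one possible explanation for the altered sequences, not confirmed in this thread, is that the platform default was a charset that leaves some byte values unmapped (possibly windows-1252). US-ASCII is used below only as a stand-in for such a charset.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper, not part of JMeter.
final class EncodingRoundTrip {
    // Decodes the bytes with the given charset and encodes them again,
    // which is what reading and rewriting a text file amounts to.
    static byte[] roundTrip(byte[] fileBytes, Charset charset) {
        String decoded = new String(fileBytes, charset);
        return decoded.getBytes(charset);
    }

    // True if the round trip returns exactly the original bytes.
    static boolean survives(byte[] fileBytes, Charset charset) {
        return Arrays.equals(fileBytes, roundTrip(fileBytes, charset));
    }
}
```

Calling survives on the UTF-8 bytes of a non-ASCII character gives true for ISO-8859-1 and UTF-8 and false for US-ASCII, because unmappable bytes are replaced during decoding and cannot be restored when encoding.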

I modified jorphan.io.TextFile to keep a backup copy of the file it was
writing over and I could see a difference between the first part of the
two files in a few UTF-8 sequences.

> [...]
> Whether TextFile should use a given encoding or just the platform
> default can be discussed, but it certainly should be documented.

Either it should read the file as binary or an encoding should be passed
by its caller, so that when it is used to trim the last line of a result
file the encoding can be set to UTF-8.
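
The second option, an encoding passed by the caller, could look roughly like this. EncodedTextFile is a hypothetical class written for illustration; it is not the jorphan TextFile API, and the getText/setText names are assumptions.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class, not org.apache.jorphan.io.TextFile.
final class EncodedTextFile {
    private final Path path;
    private final Charset charset;

    // The caller decides the encoding; there is no platform-default fallback.
    EncodedTextFile(Path path, Charset charset) {
        this.path = path;
        this.charset = charset;
    }

    // Reads the whole file, decoding with the caller's charset.
    String getText() throws IOException {
        return new String(Files.readAllBytes(path), charset);
    }

    // Replaces the file content, encoding with the same charset.
    void setText(String text) throws IOException {
        Files.write(path, text.getBytes(charset));
    }
}
```

A caller that trims the last line of a UTF-8 result file would construct it with StandardCharsets.UTF_8.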

> Also, it's quite obvious that the ResultCollector should not handle
> response data as character data, since in many cases it's binary stuff,
> and any character encoding (UTF-8 or whatever) will be wrong. Actually,
> XML is a bad format for binary data: we should either store that in
> separate files or encode it base-64 or alike.

I agree completely. The binary data not only contains "funny" UTF-8
characters that cause problems, it also contains XML entities such as &#1;
which our XSLT processor can't handle.
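
For background, XML 1.0 does not allow U+0001 at all, not even as the character reference &#1;, so a conforming parser has to reject it. The sketch below checks code points against the XML 1.0 Char production and drops the ones that are not allowed. XmlChars is a hypothetical helper, and stripping is only a workaround that loses data; it is not what JMeter does.

```java
// Hypothetical helper, not part of JMeter.
final class XmlChars {
    // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    // | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes every code point that XML 1.0 cannot represent.
    static String stripIllegal(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlChars::isXml10Char)
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```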

> [...]
> In any case, as I said, I can't see how you can end up with a result XML
> file with ISO-8859-1 content. Are you sure about that?

The content is not real ISO-8859-1 content, but corrupted UTF-8 content. I
changed jorphan.io.TextFile to read and write the file with UTF-8 encoding
and then the problem disappeared.

Regards, Vincent.







Re: jorphan.io.TextFile always uses default encoding, ResultCollector always uses UTF-*

Posted by Jordi Salvat i Alabart <js...@atg.com>.
You're indeed pointing to a whole collection of bugs... but none of them 
seems to be the one affecting you :-)

Talking about current CVS code, TextFile is used in very few places:

- A unit test in AnchorModifier -- we could live with any side effects 
on this.

- Last line trim in ResultCollector -- but only last line trim! It 
should not make any difference whether you use UTF-8 or ISO-Latin-1 
here, as long as you use the same one for reading and writing. Still, it 
could fail if the platform encoding is one for which the UTF-8 
representation of some character used in the file is not a valid 
character representation. (Sorry for the very clear statement -- that's 
about as good as my English can be.) ISO-Latin-1 is pretty safe, but 
other platforms will of course use others...

- Retrieving XML data files in WebServiceSampler. I think it's incorrect 
to use TextFile here, since XML file encoding is either assumed (in 
which case a TextFile using the platform default encoding is a correct 
solution, although probably not optimal) or found inside the XML file 
itself -- in which case the solution is plain wrong.
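
For the XML data file case, one way to respect an encoding declared inside the file is to hand the parser raw bytes rather than pre-decoded text. This is a sketch using the standard JAXP API, not the WebServiceSampler code; XmlByteLoader is a hypothetical name.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of JMeter.
final class XmlByteLoader {
    // Given a byte stream, the parser reads the encoding from the XML
    // declaration (or a byte order mark) and decodes the content itself.
    static Document parse(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
    }
}
```

Reading the file into a String first, with any fixed charset, would bypass that detection.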

Whether TextFile should use a given encoding or just the platform 
default can be discussed, but it certainly should be documented.

Also, it's quite obvious that the ResultCollector should not handle 
response data as character data, since in many cases it's binary stuff, 
and any character encoding (UTF-8 or whatever) will be wrong. Actually, 
XML is a bad format for binary data: we should either store that in 
separate files or encode it as base64 or the like.

And, you're right, there are just too many places where we treat response 
data as character data. If all this causes is some gibberish on the 
screen, that's a minor problem, but sometimes it can be worse...

In any case, as I said, I can't see how you can end up with a result XML 
file with ISO-8859-1 content. Are you sure about that?

-- 
Salut,

Jordi.


