Posted to dev@hc.apache.org by "Ian Beaumont (Created) (JIRA)" <ji...@apache.org> on 2011/12/01 12:42:39 UTC
[jira] [Created] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
EntityUtils.toString should detect Byte order mark (BOM) and remove it if present
---------------------------------------------------------------------------------
Key: HTTPCLIENT-1149
URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1149
Project: HttpComponents HttpClient
Issue Type: Bug
Components: HttpClient
Affects Versions: 4.1.2
Environment: Windows
Reporter: Ian Beaumont
Priority: Minor
Labels: BOM, EntityUtils
The byte order mark (BOM) at the start of the input stream should be detected and removed by EntityUtils.toString; otherwise, stray unwanted characters are left at the start of the returned string.
This link lists the possible byte order marks: http://en.wikipedia.org/wiki/Byte_order_mark
I'm not sure whether EntityUtils.toString uses the BOM to try to detect the encoding, but if it doesn't, then it should.
An example URL that causes this issue is the Microsoft Virtual Earth WSDL file:
HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://dev.virtualearth.net/webservices/v1/searchservice/searchservice.svc?wsdl");
HttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
String textContents = EntityUtils.toString(entity);
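The symptom can be reproduced without any network access. The following is a minimal sketch using only the JDK; the class name, method name and the payload bytes are made up for illustration and are not part of HttpClient. It assumes the response body is UTF-8 and is decoded with a plain InputStreamReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HttpClient.
final class BomSymptomDemo {

    // Decodes the whole stream as UTF-8 through a plain InputStreamReader.
    static String readAllUtf8(InputStream in) {
        try {
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up payload: UTF-8 BOM (EF BB BF) followed by the text "<?xml".
        byte[] body = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        String text = readAllUtf8(new ByteArrayInputStream(body));
        // The BOM survives decoding as the character U+FEFF at index 0.
        System.out.println(text.length());
        System.out.println(text.charAt(0) == '\uFEFF');
    }
}
```

The leading U+FEFF is invisible in most consoles, but it is enough to make an XML parser reject a document that starts with it.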
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
For additional commands, e-mail: dev-help@hc.apache.org
[jira] [Commented] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Sebb (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161273#comment-13161273 ]
Sebb commented on HTTPCLIENT-1149:
----------------------------------
The following code works for me:
HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://dev.virtualearth.net/webservices/v1/searchservice/searchservice.svc?wsdl");
HttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    InputStream instream = entity.getContent();
    BOMInputStream bis = new BOMInputStream(instream);
    if (bis.hasBOM()) {
        System.out.println(bis.getBOMCharsetName());
    }
    // Now read the bis stream
}
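For readers who cannot add Commons IO, the same idea can be sketched with the JDK alone. The class below is a hypothetical stand-in, not the Commons IO implementation: it peeks at the first three bytes through a PushbackInputStream, recognises only the UTF-8, UTF-16BE and UTF-16LE marks, and pushes back whatever was not part of a BOM. It does not distinguish UTF-32LE (FF FE 00 00) from UTF-16LE.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

// Hypothetical helper; a simplified stand-in for Commons IO's BOMInputStream.
final class BomSniffer {

    // Charset name implied by the BOM (null if none) plus a stream positioned after it.
    static final class Result {
        final String charsetName;
        final InputStream stream;

        Result(String charsetName, InputStream stream) {
            this.charsetName = charsetName;
            this.stream = stream;
        }
    }

    static Result sniff(InputStream in) {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        try {
            int n = readUpTo(pb, head);
            String name = null;
            int bomLen = 0;
            if (n >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
                name = "UTF-8";
                bomLen = 3;
            } else if (n >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
                name = "UTF-16BE";
                bomLen = 2;
            } else if (n >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
                name = "UTF-16LE";
                bomLen = 2;
            }
            // Push back every byte that was read but is not part of a BOM.
            if (n > bomLen) {
                pb.unread(head, bomLen, n - bomLen);
            }
            return new Result(name, pb);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Treats a byte as an unsigned value in the range 0..255.
    private static int u(byte b) {
        return b & 0xFF;
    }

    // Reads up to buf.length bytes, tolerating short reads; returns the count read.
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r == -1) {
                break;
            }
            total += r;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Made-up payload: UTF-8 BOM followed by "<?xml".
        byte[] body = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        Result r = sniff(new ByteArrayInputStream(body));
        System.out.println(r.charsetName);
        System.out.println((char) r.stream.read());
    }
}
```

In the snippet above, entity.getContent() would take the place of the ByteArrayInputStream, and the returned stream would then be handed to an InputStreamReader using the detected charset or, if none was detected, the one from the Content-Type header.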
[jira] [Commented] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Ian Beaumont (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161500#comment-13161500 ]
Ian Beaumont commented on HTTPCLIENT-1149:
------------------------------------------
Thanks for the comments guys. Just as a footnote (I'm not asking it to be re-opened):
I came across this
http://msdn.microsoft.com/en-us/library/cc295463.aspx
So having BOMs seems to be something Microsoft encourages on web pages generated by some of their tools (although that is an old version of their product).
[jira] [Commented] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Sebb (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160962#comment-13160962 ]
Sebb commented on HTTPCLIENT-1149:
----------------------------------
RFC-3629 section 6. Byte order mark (BOM) says:
"o A protocol SHOULD also forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol provides
character encoding identification mechanisms, when it is expected
that implementations of the protocol will be in a position to
always use the mechanisms properly."
The Wikipedia article footnote [3] says "Use of a BOM is neither required nor recommended for UTF-8, ..."
The document itself includes the encoding.
This all suggests that it is wrong for the server to send the BOM.
Note: there is already a BOM decoder class in Apache Commons IO:
http://commons.apache.org/io/api-release/org/apache/commons/io/input/BOMInputStream.html
[jira] [Commented] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Oleg Kalnichevski (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160925#comment-13160925 ]
Oleg Kalnichevski commented on HTTPCLIENT-1149:
-----------------------------------------------
Ian
I am not sure I agree. BOM detection and handling is a responsibility of the CharsetDecoder, not that of HttpClient.
Oleg
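How the JDK's own decoders treat a BOM depends on the charset, which is relevant to where the handling belongs. The sketch below uses only java.nio.charset; the class and method names are made up for illustration. It contrasts the "UTF-16" charset, whose decoder interprets and consumes a leading BOM, with "UTF-8", whose decoder passes it through as the character U+FEFF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing JDK decoder behaviour for a leading BOM.
final class DecoderBomBehaviour {

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // "A" preceded by a UTF-8 BOM (EF BB BF).
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        // "A" preceded by a big-endian UTF-16 BOM (FE FF).
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 'A'};

        // UTF-8: the BOM is kept as U+FEFF, so the string has two chars.
        System.out.println(decodeUtf8(utf8).length());
        // UTF-16: the BOM selects the byte order and is dropped, leaving one char.
        System.out.println(decodeUtf16(utf16).length());
    }
}
```

So for UTF-8, the case in this issue, the standard decoder does not remove the mark, and something above or below it has to.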
> EntityUtils.toString should detect Byte order mark (BOM) and remove it if present
> ---------------------------------------------------------------------------------
>
> Key: HTTPCLIENT-1149
> URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1149
> Project: HttpComponents HttpClient
> Issue Type: Bug
> Components: HttpClient
> Affects Versions: 4.1.2
> Environment: Windows
> Reporter: Ian Beaumont
> Priority: Minor
> Labels: BOM, EntityUtils
>
> The Byte order mark at the start of the input stream should be detected and removed by EntityUtils.toString, otherwise strange unwanted characters are left at the start.
> This link lists possible Byte order markings http://en.wikipedia.org/wiki/Byte_order_mark
> I'm not sure if EntityUtils.toString using the BOM to try to detect the encoding, but if it doesn't then it should.
> Example URL that is causing this issue is mircosoft virtual earth WSDL file:
> HttpClient httpclient = new DefaultHttpClient();
> HttpGet httpget = new HttpGet("http://dev.virtualearth.net/webservices/v1/searchservice/searchservice.svc?wsdl");
> HttpResponse response = httpclient.execute(httpget);
> HttpEntity entity = response.getEntity();
> String textContents = EntityUtils.toString(entity);
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
For additional commands, e-mail: dev-help@hc.apache.org
[jira] [Commented] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Sebb (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160978#comment-13160978 ]
Sebb commented on HTTPCLIENT-1149:
----------------------------------
I saw that next para, but I don't think it applies, given that the charset can be provided as Content-Type and is present in the file itself.
Don't know about wrapping the stream.
[jira] [Commented] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Ian Beaumont (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160939#comment-13160939 ]
Ian Beaumont commented on HTTPCLIENT-1149:
------------------------------------------
I sort of understand your point, but I'm not sure I agree.
1. Would you expect EntityUtils.toString to return the string without the BOM (no matter whether it is done by the decoder in this method or by the EntityUtils.toString method itself)?
I would.
2. Looking at the implementation of EntityUtils.toString, I don't really see that the Java API of InputStreamReader should be changed to deal with it. It seems to me we should be using a different InputStreamReader, such as the one outlined here...
http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java
or here
http://www.velocityreviews.com/forums/t123963-textstreamreader-with-transparent-unicode-bom-support.html
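The "different reader" idea can be sketched at the character level, after decoding. The helper below is hypothetical and uses only the JDK: it wraps any Reader in a PushbackReader and drops a single leading U+FEFF, and offers the same operation for a String that has already been produced, for example by EntityUtils.toString. It assumes the charset is already known and correct; it does not use the BOM to pick one.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper; not part of HttpClient or Commons IO.
final class BomSkippingReaders {

    private static final char BOM = '\uFEFF';

    // Returns a Reader positioned after a single leading U+FEFF, if one is present.
    static Reader skipLeadingBom(Reader in) throws IOException {
        PushbackReader pr = new PushbackReader(in, 1);
        int first = pr.read();
        if (first != -1 && first != BOM) {
            // Not a BOM: put the character back so the caller sees it.
            pr.unread(first);
        }
        return pr;
    }

    // Same operation for text that has already been decoded.
    static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) throws IOException {
        Reader r = skipLeadingBom(new StringReader("\uFEFF<?xml"));
        System.out.println((char) r.read());
        System.out.println(stripLeadingBom("\uFEFFabc"));
    }
}
```

Working on characters rather than bytes keeps the helper independent of the encoding, at the cost of not being able to detect a mismatch between the BOM and the declared charset.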
[jira] [Resolved] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Oleg Kalnichevski (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oleg Kalnichevski resolved HTTPCLIENT-1149.
-------------------------------------------
Resolution: Won't Fix
At any rate, anyone using EntityUtils#toString in production code must be mad. Closing as WONTFIX. Please use Commons IO to work around the problem.
Oleg
[jira] [Commented] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Ian Beaumont (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160971#comment-13160971 ]
Ian Beaumont commented on HTTPCLIENT-1149:
------------------------------------------
Looking at RFC-3629...the very next paragraph to the one you quote is...
o A protocol SHOULD NOT forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol does not
provide character encoding identification mechanisms, when a ban
would be unenforceable, or when it is expected that
implementations of the protocol will not be in a position to
always use the mechanisms properly. The latter two cases are
likely to occur with larger protocol elements such as MIME
entities, especially when implementations of the protocol will
obtain such entities from file systems, from protocols that do not
have encoding identification mechanisms for payloads (such as FTP)
or from other protocols that do not guarantee proper
identification of character encoding (such as HTTP).
Isn't that more relevant?
Would it be an issue to wrap the inputstream in the BOMInputStream?
[jira] [Commented] (HTTPCLIENT-1149) EntityUtils.toString should
detect Byte order mark (BOM) and remove it if present
Posted by "Oleg Kalnichevski (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161544#comment-13161544 ]
Oleg Kalnichevski commented on HTTPCLIENT-1149:
-----------------------------------------------
Ian
I am not saying the issue is not real. All I am trying to say is that EntityUtils#toString may not be the best place to address it.
Oleg