Posted to httpclient-users@hc.apache.org by "droidin.net" <dr...@droidin.net> on 2009/08/18 20:55:15 UTC
Weird characters in the stream
I'm trying to read part of the data from an HTML file, so I have this code
that returns an InputSource for my SAX parser:
InputSource is = null;
HttpEntity entity = response.getEntity();
if (entity == null) {
    final String body = new BasicResponseHandler().handleResponse(response);
    is = new InputSource(new StringReader(body));
} else {
    is = new InputSource(new InputStreamReader(entity.getContent(), "utf-8"));
}
And now comes the problem:
1. is = new InputSource(new StringReader(body)); // this always works
2. If I save the HTML to a file and then create the InputSource from it using
is = new InputSource(new InputStreamReader(ParserUtils.class.getResourceAsStream(testFile), "utf-8"));
this also works.
3. However, if I do
is = new InputSource(new InputStreamReader(entity.getContent(), "utf-8"));
then my SAX parser chokes with an ArrayIndexOutOfBoundsException (Attempt to
access illegal array index), and when I look at the buffer it's full of
garbage chars that show up as little blank squares with a char numeric value
of -1. Wrapping the InputStreamReader in a BufferedReader does not help.
The original HTML doc specifies
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
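One way to narrow this down, sketched under assumptions: drain the response stream into memory first, decode it as UTF-8, and hand the parser a StringReader, which is the variant reported above as always working. The class and method names below (BufferedSourceSketch, readFully, toInputSource) are hypothetical, and a ByteArrayInputStream stands in for entity.getContent() so the sketch runs with the JDK alone. Note that Reader.read() returns -1 at end of stream, so a -1 in a parser buffer may point to a read result that was stored without being checked; that is a guess, not something the post confirms.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: buffers a whole stream before the SAX parser sees it.
class BufferedSourceSketch {

    // Read until read() reports end of stream (-1), then decode the bytes as UTF-8.
    static String readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n); // copy only the bytes actually read
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Wrap the fully buffered text in an InputSource backed by a StringReader.
    static InputSource toInputSource(InputStream in) throws IOException {
        return new InputSource(new StringReader(readFully(in)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for entity.getContent(): a small well-formed document with a non-ASCII char.
        byte[] sample = "<html><body>caf\u00e9</body></html>".getBytes(StandardCharsets.UTF_8);
        InputSource is = toInputSource(new ByteArrayInputStream(sample));
        SAXParserFactory.newInstance().newSAXParser().parse(is, new DefaultHandler());
        System.out.println("parsed without error");
    }
}
```

If the buffered variant parses cleanly while the streaming one fails, the bytes on the wire are probably fine and the difference lies in how the parser consumes a stream that delivers data in pieces.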
--
View this message in context: http://www.nabble.com/Weird-characters-in-the-stream-tp25031327p25031327.html
Sent from the HttpClient-User mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org
Re: Weird characters in the stream
Posted by Oleg Kalnichevski <ol...@apache.org>.
On Tue, Aug 18, 2009 at 11:55:15AM -0700, droidin.net wrote:
>
> I'm trying to read part of the data from an HTML file, so I have this code
> that returns an InputSource for my SAX parser:
> InputSource is = null;
> HttpEntity entity = response.getEntity();
> if (entity == null) {
>     final String body = new BasicResponseHandler().handleResponse(response);
>     is = new InputSource(new StringReader(body));
> } else {
>     is = new InputSource(new InputStreamReader(entity.getContent(), "utf-8"));
> }
>
> And now comes the problem:
> 1. is = new InputSource(new StringReader(body)); // this always works
> 2. If I save the HTML to a file and then create the InputSource from it using
> is = new InputSource(new InputStreamReader(ParserUtils.class.getResourceAsStream(testFile), "utf-8"));
> this also works.
> 3. However, if I do
> is = new InputSource(new InputStreamReader(entity.getContent(), "utf-8"));
> then my SAX parser chokes with an ArrayIndexOutOfBoundsException (Attempt to
> access illegal array index), and when I look at the buffer it's full of
> garbage chars that show up as little blank squares with a char numeric value
> of -1. Wrapping the InputStreamReader in a BufferedReader does not help.
>
Sounds like an issue with the SAX parser. Turn on wire logging and see what
gets transferred across the wire:
http://hc.apache.org/httpcomponents-client/logging.html
Oleg
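For reference, a minimal sketch of switching wire logging on from code, assuming HttpClient 4.x is logging through Commons Logging with its SimpleLog backend, as described on the logging page linked above. The class name WireLoggingSketch is hypothetical; the property names follow that page's -D examples, and they need to be set before the first HttpClient class creates its logger.

```java
// Hypothetical helper: sets the Commons Logging system properties for wire logging.
class WireLoggingSketch {

    // Equivalent to passing the same keys as -D options on the command line.
    static void enableWireLogging() {
        // Route Commons Logging to its built-in SimpleLog implementation.
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.SimpleLog");
        System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
        // Context logging for HttpClient, plus the raw bytes sent and received.
        System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http", "DEBUG");
        System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire", "DEBUG");
    }

    public static void main(String[] args) {
        enableWireLogging();
        System.out.println("wire log level: "
                + System.getProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire"));
    }
}
```

With the wire log on, the response body can be compared byte for byte against the saved file that parses correctly.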
> The original HTML doc specifies
> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>