Posted to httpclient-users@hc.apache.org by Chris Fellows <cf...@quinstreet.com> on 2005/05/31 19:29:05 UTC

HTTPClient 3.0-rc2 returning corrupt data through popular Proxy

Hello,

I've recently integrated HTTPClient 3.0-rc2 into an application that
uses a popular proxy service, Anonymizer.com. Unfortunately, HTTPClient
sometimes returns web pages with portions of the page corrupted. If I
use a web browser (IE, Firefox, Opera) I never see the same corruption
in the same pages. I was originally using plain sockets and java.net to
find, connect to, and retrieve these pages, but when I switched to
Anonymizer I ran into problems parsing chunked content. I had written a
regular expression to find and discard the chunk-size identifiers (as
opposed to reading the page based on them), but it would occasionally
miss some of the hex identifiers. I cannot find anything wrong with the
expression itself, so I suspect the proxy was not returning data per
the RFC. Regardless, I decided to switch to HTTPClient for several
reasons, one of which is its transparent reading of chunked data.
Still, after implementing it (and I hope I followed the tutorials,
docs, and sample code as closely as possible), I'm getting corrupt
data. I've looked through the user list and most of the dev mailing
list and have not found a similar problem reported.
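
For what it's worth, here is roughly what reading by the chunk lengths
would have looked like, which is what I understand HTTPClient now does
for me internally. This is only a sketch under my own assumptions (the
names are mine, and real code would also have to handle trailer headers
after the last chunk), but it shows why pattern-matching the body is
fragile: the hex sizes have to be consumed by position and length,
since any hex-looking token inside the page itself can fool a regex.

    // Sketch: decode a chunked entity; 'in' is positioned at the first
    // chunk-size line. Illustration only, not the HTTPClient internals.
    private static byte[] readChunkedBody(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readAsciiLine(in);   // e.g. "1a3f" or "1a3f;ext=val"
            int semi = sizeLine.indexOf(';');      // chunk extensions are legal
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16);
            if (size == 0) break;                  // zero-size chunk ends the body
            for (int i = 0; i < size; i++) {       // copy exactly 'size' bytes
                int b = in.read();
                if (b == -1) throw new IOException("truncated chunk");
                body.write(b);
            }
            readAsciiLine(in);                     // consume the CRLF after the data
        }
        return body.toByteArray();
    }

    // Helper: read up to and including CRLF, returning the line without it.
    private static String readAsciiLine(InputStream in) throws IOException {
        StringBuffer line = new StringBuffer();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') line.append((char) b);
        }
        return line.toString();
    }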

So my questions, if anyone can help, are:

1) Should HTTPClient 3.0 return data as well as any web browser?

2) Has anyone run into similar problems with proxy services?

3) Are there any fine-tuning tips for using proxies?

4) Or tips for reading chunked data?

Below is a snip of the code that connects to the proxy and retrieves
data. Note that I did not follow the sample proxy code found in the
rc2 src, because I sometimes need to connect to Google and Overture,
both of which return 502 pages when I connect that way. Instead I
opted for the tutorial method of connecting through proxies.

Also attached is one of the corrupted pages that came back. Check the
page source; about 70 lines down you'll start seeing the corrupted
characters.

Thanks in advance for any pointers or responses.

Chris

    private HttpConnectionEngine(String pHost, int pPort,
            HttpConnectionEngineParams pConnEngineParams) {
        HttpConnectionManagerParams connManagerParams =
                new HttpConnectionManagerParams();
        connManagerParams.setDefaultMaxConnectionsPerHost(pConnEngineParams
                .getMaxConnectionsPerHost());
        // note: total pool size is currently set from the per-host limit
        connManagerParams.setMaxTotalConnections(pConnEngineParams
                .getMaxConnectionsPerHost());
        connManagerParams.setStaleCheckingEnabled(pConnEngineParams
                .isConnectionStaleCheckingEnabled());
        // note: connect timeout is currently set from the idle timeout
        connManagerParams.setConnectionTimeout((int) pConnEngineParams
                .getIdleConnectionTimeout());

        cConnManager = new MultiThreadedHttpConnectionManager();
        cConnManager.setParams(connManagerParams);
        cConnManager.closeIdleConnections(pConnEngineParams
                .getIdleConnectionTimeout());
        cConnEngineParams = pConnEngineParams;

        cHostConfig = new HostConfiguration();
        // example host that sometimes returns a corrupt page
        cHostConfig.setHost("www.google.com");
        // proxy host
        cHostConfig.setProxy("quinstreet.anonymizer.com", 80);
    }

    public void readFromServer(String pRequest, StringBuffer pWebPage)
            throws InvalidArgumentException {

        final String METHOD = "readFromServer()";

        int status = -1;
        int lCharsRead = -1;
        int lBufSize = 4 * 1024;
        char[] lHtmlBuf = new char[lBufSize];
        GetMethod lRequestMethod = null;
        InputStreamReader lIn = null;

        HttpClient lClient = new HttpClient();
        lClient.setHttpConnectionManager(cConnManager);
        lClient.setHostConfiguration(cHostConfig);
        // set number of retries on bad connect
        lClient.getParams().setParameter(
                HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler(
                        cConnEngineParams.getNumOfRetryOnBadHttpStatus(),
                        cConnEngineParams.isRequestSentRetryEnabled()));

        log.log(Level.INFO, "Request: " + pRequest);
        lRequestMethod = new GetMethod(pRequest);

        // add headers to request ('enum' renamed: it is a keyword in Java 5)
        Properties lReqHeaderProps = cConnEngineParams.getReqHeaderProps();
        Enumeration lHeaderNames = lReqHeaderProps.keys();
        while (lHeaderNames.hasMoreElements()) {
            String key = (String) lHeaderNames.nextElement();
            lRequestMethod.addRequestHeader(key,
                    lReqHeaderProps.getProperty(key));
        }

        // clean StringBuffer
        pWebPage.delete(0, pWebPage.length());

        // execute request
        try {
            status = lClient.executeMethod(lRequestMethod);
            lIn = new InputStreamReader(
                    lRequestMethod.getResponseBodyAsStream(),
                    lRequestMethod.getResponseCharSet());
            // read() may fill only part of the buffer, so append only the
            // characters actually read; appending the whole buffer would
            // splice stale characters from earlier reads into the page
            while ((lCharsRead = lIn.read(lHtmlBuf)) != -1) {
                pWebPage.append(lHtmlBuf, 0, lCharsRead);
            }
        } catch (HttpException he) {
            throw new InvalidArgumentException(
                    "HttpException executing GetMethod on request: "
                            + pRequest + ", with: " + he.getMessage());
        } catch (IOException ioe) {
            throw new InvalidArgumentException(
                    "IOException executing request or reading response on request: "
                            + pRequest + ", with: " + ioe.getMessage());
        } finally {
            // clean resources
            if (lIn != null) {
                try {
                    lIn.close();
                } catch (IOException ignored) {
                    // nothing useful to do on close failure
                }
            }
            // NOTE: release (don't close) the connection with HTTP/1.1
            lRequestMethod.releaseConnection();
            lRequestMethod = null;
            lClient = null;
            // check the status for logging
            if (status != HttpStatus.SC_OK)
                log.logp(Level.INFO, CLASS, METHOD,
                        "Bad request, status: " + status);
            else
                log.logp(Level.FINE, CLASS, METHOD,
                        "OK status, webpage-length: " + pWebPage.length());
        }
    }
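
(For completeness: as I understand it, getResponseBodyAsString() would
collapse the stream handling and charset selection above into one
call, at the cost of buffering the whole entity in memory, e.g.:

    // alternative to the read loop above; buffers the entire page
    String lPage = lRequestMethod.getResponseBodyAsString();

I stayed with the streaming read in case the pages get large.)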

Re: HTTPClient 3.0-rc2 returning corrupt data through popular Proxy

Posted by Oleg Kalnichevski <ol...@apache.org>.
> 1) Should HTTPClient 3.0 return data as well as any web browser?
> 

Absolutely

> 2) Has anyone run into similar problems with proxy services?
> 

Not that we know of. I _suppose_ that if there were such a fundamental
problem with HttpClient, we would have known.


> 3) Are there any fine-tuning tips for using proxies?
> 

There are not many, and none of them is in any way related to data
corruption.

> 4) Or tips for reading chunked data?
> 

Chunk-encoding is FULLY transparent to the end user: the stream you get
from getResponseBodyAsStream() is already decoded, so chunk-size lines
should never show up in the body.

I looked at the attached source code and from a cursory inspection
nothing struck me as obviously wrong, but since the code is not
compilable there is no way of telling for sure.

(1) Is the problem reliably reproducible (hitting the same URL always
produces corrupted HTML) or does it appear random?

(2) Does the problem occur if you hit the URLs directly, bypassing the
Anonymizer.com proxy?
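
A quick way to check (2) that takes your engine class out of the
picture entirely would be something along these lines (an untested
sketch, plain HttpClient with no proxy configured; wrap it in your own
try/catch for the checked exceptions):

    HttpClient client = new HttpClient();    // defaults, no proxy
    GetMethod get = new GetMethod("http://www.google.com/");
    try {
        int status = client.executeMethod(get);
        byte[] body = get.getResponseBody(); // buffers in memory; fine for a test
        System.out.println("status=" + status + ", bytes=" + body.length);
    } finally {
        get.releaseConnection();
    }

If the page is always clean when fetched directly but corrupted through
Anonymizer, the proxy becomes the prime suspect; if it is corrupted
either way, the problem is on your side of the wire.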

Oleg


---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: httpclient-user-help@jakarta.apache.org