Posted to taglibs-dev@jakarta.apache.org by "Meltsner, Kenneth" <Ke...@ca.com> on 2001/07/20 02:53:58 UTC
possible fix for Scrape: doesn't respect http.proxy settings, redirects, etc.
Rich suggested I send this to taglibs-dev -- it's a fix (I think) for problems with proxy use and following redirects with the scrape taglib. This is my first suggestion to an open source project; be kind.
Ken
--
It seemed easier to modify your code slightly to cast (perhaps a bad idea) the result of URL.openConnection instead of subclassing java.net.HttpURLConnection. Here's the bit I changed (marked with KJM):
(from jakarta-taglibs\scrape\src\org\apache\taglibs\scrape; you would also remove HttpConnection.java from that directory)
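The cast works because, for an http URL, URL.openConnection() hands back a JDK subclass of java.net.HttpURLConnection. A minimal standalone sketch of the idea (the helper name is mine, not part of the patch; note that openConnection() only creates the object and does no network I/O until connect() is called):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public class CastDemo {
    // Hypothetical helper (not in the patch): obtain an HttpURLConnection
    // for the given URL string, or null if the protocol handler returned
    // some other connection type or the URL was bad.
    static HttpURLConnection open(String spec) {
        try {
            URLConnection raw = new URL(spec).openConnection();
            return (raw instanceof HttpURLConnection)
                    ? (HttpURLConnection) raw : null;
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // For an http URL the JDK returns an HttpURLConnection subclass.
        System.out.println(open("http://example.com/") != null);  // true
        // For a non-http URL (e.g. file:) the unchecked cast in the patch
        // would have thrown a ClassCastException.
        System.out.println(open("file:///tmp/x") != null);        // false
    }
}
```

The taglib only ever builds http URLs, which is why the bare cast in the patch below is safe in practice.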
[...]
/**
 * Creates an http request for the specified URL: checks whether enough
 * time has elapsed, and if so issues a HEAD request, checks the page's
 * Last-Modified header, and only then makes the full GET request.
 */
class Page extends Thread {
    private java.net.HttpURLConnection connection; // object that makes the http request
    private long lastmodified;   // time the page was last modified
    private long expires;        // http header: time the page expires
    private URL url;             // url of the page to be scraped
    private PageData pagedata;   // pagedata object that holds data on this url
    // char array to hold the source page from the http response
    private char source[];
    // max size of the buffer that the http response is read into
    private final long MAX_BUFFER_SIZE = 50000;
    // pagecontext the servlet resides in, used for logging to the server
    private PageContext pageContext;

    Page(URL url, PageData page, PageContext pc) {
        this.url = url;
        pagedata = page;
        pageContext = pc;
    }

    public void run() {
        long current = new Date().getTime(); // get current time
        // make http connection to url
        try {
            // create new java.net.HttpURLConnection --KJM
            connection = (java.net.HttpURLConnection) url.openConnection();
            connection.setRequestMethod("HEAD");
            connection.connect();
            // set the time of the last scrape to the current time
            pagedata.setLastScrape(current);
            // check the response status code; a 2xx code indicates a
            // successful request
            if (connection.getResponseCode() >= 300) {
                pageContext.getServletContext().
                    log("Error occurred: " + connection.getResponseMessage());
            } else {
                // get the Expires header; getExpiration returns 0 if the
                // header does not exist
                if ((expires = connection.getExpiration()) == 0)
                    expires = current - 1;
                // scrape if this page is new, has changed, or its Expires
                // time has passed
                if ((current > expires) || pagedata.getnewFlag() ||
                        pagedata.getChangeFlag()) {
                    // get the Last-Modified header; getLastModified
                    // returns 0 if the header does not exist
                    if ((lastmodified = connection.getLastModified()) == 0)
                        lastmodified = pagedata.getLastScrape() - 1;
                    // a java.net.HttpURLConnection cannot be reused after
                    // connect(), so disconnect and open a fresh connection
                    // below to issue a GET instead of a HEAD
                    connection.disconnect();
                    // scrape if this page is new, has changed, or its
                    // Last-Modified time is later than the last scrape
                    if ((pagedata.getLastScrape() < lastmodified) ||
                            pagedata.getnewFlag() || pagedata.getChangeFlag()) {
                        // open a new connection and set the request
                        // method to GET
                        connection = (java.net.HttpURLConnection) url.openConnection();
                        connection.setRequestMethod("GET");
                        // make the connection
                        connection.connect();
                        // check the response code from the connection
                        if (connection.getResponseCode() >= 300) {
                            pageContext.getServletContext().
                                log("Error occurred: " +
                                    connection.getResponseMessage());
                            // the request failed; return cached data
                            return;
                        }
                        // read the http response into the buffer; the
                        // return value is false if an error occurred
                        if (streamtochararray(connection.getInputStream())) {
                            // perform the scrapes on this page
                            scrape();
                        }
                    }
                }
            }
        } catch (IOException ee) {
            pageContext.getServletContext().log(ee.toString());
        }
    }
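The payoff of using the stock java.net.HttpURLConnection is that it honors the standard JVM proxy system properties, so proxy support needs no taglib-specific code at all. A minimal sketch of how a deployment would configure it (the host and port are placeholder values, and the helper is mine, not part of the patch):

```java
public class ProxySetup {
    // Hypothetical helper: set the standard networking properties that
    // java.net.HttpURLConnection consults when opening http connections.
    // Returns the effective proxy host for confirmation.
    static String configureProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
        // Hosts matching http.nonProxyHosts bypass the proxy entirely.
        System.setProperty("http.nonProxyHosts", "localhost");
        return System.getProperty("http.proxyHost");
    }

    public static void main(String[] args) {
        // Placeholder proxy address for illustration.
        System.out.println(configureProxy("proxy.example.com", 8080));
        // prints: proxy.example.com
    }
}
```

In a servlet container these would more typically be passed as -D flags on the JVM command line rather than set programmatically.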
-----Original Message-----
From: Rich Catlett [mailto:rich@more.net]
Sent: Monday, July 16, 2001 11:34 AM
To: taglibs-user@jakarta.apache.org
Subject: [Fwd: Scrape: doesn't respect http.proxy settings, redirects, etc;] (fwd)
It does use java.net.HttpURLConnection; that is the superclass. The
connect and disconnect methods are abstract and have to be written. As
far as redirects go, there is a setFollowRedirects method that I didn't
bother with, since automatically following redirects is supposed to be the
default behavior. As far as using a proxy goes, there is another abstract
method, usingProxy, that I believe would have to be fleshed out. Currently
it simply returns false. I imagine that to use a proxy, an attribute
would have to be added to the page tag to determine whether a proxy is to be
used, and then the usingProxy method called. I am not very strong in this
area and I currently have no place to test it, so if you would
like to flesh out the usingProxy method and submit the fix to the
taglibs-dev list, I would be happy to add the change to cvs.
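For reference, the redirect behavior Rich describes can be confirmed without any subclassing: following redirects is indeed the documented JVM-wide default for java.net.HttpURLConnection, and it can be toggled via a static method. A small sketch:

```java
import java.net.HttpURLConnection;

public class RedirectPolicy {
    public static void main(String[] args) {
        // Following redirects is the default for the stock
        // java.net.HttpURLConnection...
        System.out.println(HttpURLConnection.getFollowRedirects()); // true
        // ...but it can be disabled (or re-enabled) JVM-wide.
        HttpURLConnection.setFollowRedirects(false);
        System.out.println(HttpURLConnection.getFollowRedirects()); // false
        HttpURLConnection.setFollowRedirects(true);
    }
}
```

This is why simply casting to the stock class, as in the patch above, picks up redirect handling for free.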
---------------------------------------------------------------------
Rich Catlett rich@more.net | Confuscious say "Man who stand |
Programmer | on toilet, high on pot!" |
| |
---------------------------------------------------------------------
-------- Original Message --------
Subject: Scrape: doesn't respect http.proxy settings, redirects, etc;
Date: Thu, 12 Jul 2001 12:07:33 -0400
From: "Meltsner, Kenneth" <Ke...@ca.com>
Reply-To: taglibs-user@jakarta.apache.org
To: taglibs-user@jakarta.apache.org
I figured this one out: Scrape doesn't follow redirects, use system proxy settings, etc.,
because it has its own implementation of HttpURLConnection. Was there a reason not to use
the standard object from java.net? If not, it'd be relatively simple to fix...
Ken
Ken Meltsner
Computer Associates
Senior Architect, Portal TAG Team