Posted to taglibs-dev@jakarta.apache.org by "Meltsner, Kenneth" <Ke...@ca.com> on 2001/07/20 02:53:58 UTC

possible fix for Scrape: doesn't respect http.proxy settings, redirects, etc.

Rich suggested I send this to taglibs-dev -- it's a fix (I think) for problems with proxy use and following redirects with the scrape taglib.  This is my first suggestion to an open source project; be kind.

Ken

--
It seemed easier to modify your code slightly to cast (perhaps a bad idea) the result of URL.openConnection instead of subclassing java.net.HttpURLConnection.  Here's the bit I changed (marked with KJM):
(from jakarta-taglibs\scrape\src\org\apache\taglibs\scrape; you would also remove HttpConnection.java from that directory)
[...]
 /**
  *  Create an HTTP request for the specified URL and check whether enough
  *  time has elapsed; if so, check the Last-Modified header of the page
  *  and, if necessary, fetch it again.
  */
class Page extends Thread {

    private java.net.HttpURLConnection connection; // object to create an http request
    private long lastmodified;         // time the page was last modified
    private long expires;              // http header = time the page expires
    private URL url;                   // url of the page to be scraped
    private PageData pagedata;         // pagedata object that holds data on this url
    // char array to hold the source page from the http request
    private char source[];
    // max size of the buffer that the http request is read into
    private final long MAX_BUFFER_SIZE = 50000;
    // pagecontext that the servlet resides in, used for logging to the server
    private PageContext pageContext;

    Page(URL url, PageData page, PageContext pc) {
        this.url = url;
        pagedata = page;
        pageContext = pc;
    }

    public void run() {
        long current = new Date().getTime();  // get current time

        // make http connection to url
        try {
            // create new HttpURLConnection --KJM
            connection = (java.net.HttpURLConnection) url.openConnection();
            connection.setRequestMethod("HEAD");
            connection.connect();

            // set current time to time of last scrape
            pagedata.setLastScrape(current);

            // check the response status code; anything below 300 is a
            // successful connection
            if (connection.getResponseCode() >= 300) {
                pageContext.getServletContext().
                    log("Error occurred: " + connection.getResponseMessage());
            } else {
                // get the Expires header; getExpiration returns 0 if the
                // header does not exist
                if ((expires = connection.getExpiration()) == 0)
                    expires = current - 1;

                // scrape if this page is new or changed, or if its Expires
                // time has passed
                if ((current > expires) || pagedata.getnewFlag() ||
                        pagedata.getChangeFlag()) {

                    // get the Last-Modified header; getLastModified returns
                    // 0 if the header does not exist
                    if ((lastmodified = connection.getLastModified()) == 0)
                        lastmodified = pagedata.getLastScrape() - 1;

                    // a java.net.HttpURLConnection cannot be reused once it
                    // has connected, so drop this one and open a fresh
                    // connection below for the GET
                    connection.disconnect();

                    // scrape if this page is new or changed, or if it has
                    // been modified since the last scrape
                    if ((pagedata.getLastScrape() < lastmodified) ||
                            pagedata.getnewFlag() || pagedata.getChangeFlag()) {

                        // open a new connection with the request method set
                        // to GET; calling setRequestMethod on the old,
                        // already-connected object would throw
                        // IllegalStateException
                        connection = (java.net.HttpURLConnection) url.openConnection();
                        connection.setRequestMethod("GET");
                        // make the connection
                        connection.connect();

                        // check response code from connection
                        if (connection.getResponseCode() >= 300) {
                            pageContext.getServletContext().
                                log("Error occurred: " +
                                    connection.getResponseMessage());
                            // the connection did not succeed; keep cached data
                            return;
                        }

                        // read the http response into the buffer; the return
                        // value is false if an error occurred
                        if (streamtochararray(connection.getInputStream())) {
                            // perform the scrapes on this page
                            scrape();
                        }
                    }
                }
            }
        }
        catch (IOException ee) {
            pageContext.getServletContext().
                log(ee.toString());
        }
    }
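
The nice side effect of using the stock class is that proxies and redirects come along for free: the JDK's HttpURLConnection follows redirects by default and honors the standard http.proxyHost/http.proxyPort system properties, whether set on the command line with -Dhttp.proxyHost=... or in code. A quick sketch of what I mean -- the proxy host, port, and URL below are just placeholders, not real settings:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ProxyCheck {
        public static void main(String[] args) throws Exception {
            // java.net.HttpURLConnection reads these standard properties at
            // connect time; no subclassing or taglib changes needed.  Both
            // values here are placeholders.
            System.setProperty("http.proxyHost", "proxy.example.com");
            System.setProperty("http.proxyPort", "8080");

            // following redirects is the default, but it can be set explicitly
            HttpURLConnection.setFollowRedirects(true);

            URL url = new URL("http://jakarta.apache.org/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("HEAD");
            conn.connect();
            System.out.println(conn.getResponseCode() + " (proxied: "
                               + conn.usingProxy() + ")");
            conn.disconnect();
        }
    }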

-----Original Message-----
From: Rich Catlett [mailto:rich@more.net]
Sent: Monday, July 16, 2001 11:34 AM
To: taglibs-user@jakarta.apache.org
Subject: [Fwd: Scrape: doesn't respect http.proxy settings, redirects,
etc;] (fwd)


It does use java.net.HttpURLConnection; that is the superclass.  The
connect and disconnect methods are abstract and have to be written.  As
far as redirects go, there is a setFollowRedirects method that I didn't
bother with, since automatically following redirects is supposed to be the
default behavior.  As far as using a proxy, there is another abstract
method, usingProxy, that I believe would have to be fleshed out.  Currently
it simply returns false.  I imagine that to use a proxy, an attribute
would have to be added to the page tag to determine if a proxy is to be
used, and then the usingProxy method called.  I am not very strong in this
area and I would have no place to test it currently, so if maybe you would
like to flesh out the usingProxy method and submit the fix to the
taglibs-dev list, I would be happy to add the change to cvs.
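
For reference, the three abstract pieces such a subclass has to supply look
roughly like this; it's a bare-bones sketch (with a made-up class name), not
the actual scrape HttpConnection code:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Bare-bones sketch of the subclassing approach; the method bodies
    // are placeholders, not the real scrape implementation.
    class MyHttpConnection extends HttpURLConnection {

        MyHttpConnection(URL url) {
            super(url);
        }

        // abstract in java.net.URLConnection
        public void connect() throws IOException {
            // open the socket and send the request headers here
            connected = true;
        }

        // abstract in java.net.HttpURLConnection
        public void disconnect() {
            // close the socket here
            connected = false;
        }

        // abstract in java.net.HttpURLConnection; hard-wiring false is
        // exactly why the current code never uses a proxy
        public boolean usingProxy() {
            return false;
        }
    }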

---------------------------------------------------------------------
Rich Catlett        rich@more.net |  Confucius say "Man who stand   |
Programmer                        |   on toilet, high on pot!"      |
                                  |                                 |
---------------------------------------------------------------------

-------- Original Message --------
Subject: Scrape: doesn't respect http.proxy settings, redirects, etc;
Date: Thu, 12 Jul 2001 12:07:33 -0400
From: "Meltsner, Kenneth" <Ke...@ca.com>
Reply-To: taglibs-user@jakarta.apache.org
To: taglibs-user@jakarta.apache.org

I figured this one out: Scrape doesn't follow redirects, use system proxy settings, etc.,
because it has its own implementation of HttpURLConnection.  Was there a reason not to use
the standard object from java.net?  If not, it'd be relatively simple to fix...

Ken


Ken Meltsner
Computer Associates
Senior Architect, Portal TAG Team