Posted to users@cocoon.apache.org by Gustavo Nalle Fernandes <gn...@ig.com.br> on 2004/03/29 01:59:29 UTC

Cache and HTMLGenerator

Hi, I am using HTMLGenerator to obtain content from remote sites. Using a
network sniffer, I noticed that the HTMLGenerator always makes a GET request
to the remote site to obtain the content. Is it possible to cache this
content? For example, by specifying a period of time during which the
HTMLGenerator uses a cached version instead of making the same request over
and over? This would drastically improve performance, no matter what type of
generator is used.


Gustavo


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


RES: RES: RES: Cache and HTMLGenerator

Posted by Gustavo Nalle Fernandes <gn...@ig.com.br>.
  Well, this solves the question of transferring big documents only to know
the last-modified date, and always making a GET request, which is nice.
  But if I understand correctly, the remote server must send the
Last-Modified header for this whole scheme to work.

  Gustavo

-----Original Message-----
From: Miles Elam [mailto:miles@pcextremist.com]
Sent: Tuesday, March 30, 2004 11:12
To: users@cocoon.apache.org
Subject: Re: RES: RES: Cache and HTMLGenerator





Re: RES: RES: Cache and HTMLGenerator

Posted by Miles Elam <mi...@pcextremist.com>.
Gustavo Nalle Fernandes wrote:

> Thanks for the code! It is indeed very simple! That's why I like Cocoon :)
> Regarding the Last-Modified header, getLastModified() does work for a GET
> request, but the GET request also brings the whole document and not just
> the headers. That's why I was observing the whole document being
> transferred all the time. So what is the best scenario for the
> HTMLGenerator?
>
>
FYI: Web browsers always send GETs.  The difference is that they also
send the header

If-Modified-Since: *** some timestamp value ***

with the timestamp value being the one they received on the first
uncached request.  If the page has not been modified, the server sends
back a 304 status code instead of a 200, and no content.  If the page has
been modified since the specified timestamp, it sends back the normal
200 status with the page content.

The same should work with any timestamp.  I've just only ever seen
it used with values previously sent by the server.  This should solve
your "two requests" problem.
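
As a rough illustration of the conditional GET described above, here is a
minimal plain-Java sketch using java.net.HttpURLConnection. It is not Cocoon
code: the class name ConditionalGet and the method fetchIfModified are made
up for this example, and readAllBytes() assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Cocoon.
final class ConditionalGet {

    /**
     * Sends a GET with an If-Modified-Since header.
     * Returns the body on a normal response, or null when the
     * server answers 304 (Not Modified) and sends no content.
     */
    static byte[] fetchIfModified(URL url, long lastSeenMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // A non-zero value makes the connection send If-Modified-Since.
        conn.setIfModifiedSince(lastSeenMillis);
        try {
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return null; // cached copy is still good, nothing was transferred
            }
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Passing 0 as the timestamp results in an ordinary, unconditional GET, so the
same method can be used for the very first request.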

- Miles Elam




Re: Cache and HTMLGenerator

Posted by Joerg Heinicke <jo...@gmx.de>.
On 30.03.2004 02:41, Gustavo Nalle Fernandes wrote:
>  Thanks for the code! It is indeed very simple! That's why I like Cocoon :)
>   Regarding the Last-Modified header, getLastModified() does work for a GET
> request, but the GET request also brings the whole document and not just
> the headers. That's why I was observing the whole document being
> transferred all the time.

Ah, of course. Now it's obvious :) getLastModified() is only used for
Cocoon's pipeline caching, as it is assumed that the pipeline processing
is the most time-consuming part. Of course this changes quickly if you
fetch the content from a remote site.

> So what is the best scenario for the
> HTMLGenerator? Always do a HEAD request to see if the remote document is
> modified and, if it is, make a subsequent GET request, OR always make a GET on
> every request? It depends on the size of the document and the modification
> frequency. If the remote document is too large, it is inefficient to make a
> GET all the time, as the HTMLGenerator does today. On the other hand, if the
> document is modified frequently, it would be inefficient to make HEAD and
> GET requests, since it means making two connections to the remote site. Using
> a sitemap parameter specifying the interval at which the HTMLGenerator would
> fetch data would address both issues. Do you think it is worthwhile to change
> the current HTMLGenerator to include this extra parameter?

Definitely not, as this problem is not HTMLGenerator-specific but
URLSource-specific. So I will also raise this question on the dev list;
maybe someone has a clever proposal for this.

For the devs with clever ideas here's the thread (unfortunately RES 
breaks the thread view at marc.theaimsgroup.com, so switching to gmane.org):
http://thread.gmane.org/gmane.text.xml.cocoon.user/34445

Joerg



RES: RES: Cache and HTMLGenerator

Posted by Gustavo Nalle Fernandes <gn...@ig.com.br>.
 Thanks for the code! It is indeed very simple! That's why I like Cocoon :)
  Regarding the Last-Modified header, getLastModified() does work for a GET
request, but the GET request also brings the whole document and not just the
headers. That's why I was observing the whole document being transferred all
the time. So what is the best scenario for the HTMLGenerator? Always do a HEAD
request to see if the remote document is modified and, if it is, make a
subsequent GET request, OR always make a GET on every request? It depends on
the size of the document and the modification frequency. If the remote
document is too large, it is inefficient to make a GET all the time, as the
HTMLGenerator does today. On the other hand, if the document is modified
frequently, it would be inefficient to make HEAD and GET requests, since it
means making two connections to the remote site. Using a sitemap parameter
specifying the interval at which the HTMLGenerator would fetch data would
address both issues. Do you think it is worthwhile to change the current
HTMLGenerator to include this extra parameter?
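
To make the interval idea concrete, here is a small plain-Java sketch of a
time-based cache. It is independent of Cocoon and only illustrates the
principle: TimedCache, fetcher, ttlMillis and clock are hypothetical names
chosen for this example, and the clock is passed in so the behaviour can be
checked without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not part of Cocoon.
// Keeps one fetched value and refetches only after ttlMillis has elapsed.
final class TimedCache<T> {

    private final Supplier<T> fetcher;   // performs the expensive remote fetch
    private final long ttlMillis;        // how long a fetched value stays fresh
    private final LongSupplier clock;    // current time in milliseconds

    private T cached;
    private long fetchedAt;
    private boolean loaded;

    TimedCache(Supplier<T> fetcher, long ttlMillis, LongSupplier clock) {
        this.fetcher = fetcher;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    synchronized T get() {
        long now = clock.getAsLong();
        // Fetch on first use, or when the cached value is older than the TTL.
        if (!loaded || now - fetchedAt >= ttlMillis) {
            cached = fetcher.get();
            fetchedAt = now;
            loaded = true;
        }
        return cached;
    }
}
```

In real use the clock would typically be System::currentTimeMillis, and the
fetcher would wrap whatever performs the GET. Within the interval no request
is made at all, at the price of possibly serving content that is up to
ttlMillis out of date.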

Gustavo

-----Original Message-----
From: Joerg Heinicke [mailto:joerg.heinicke@gmx.de]
Sent: Monday, March 29, 2004 21:16
To: users@cocoon.apache.org
Subject: Re: RES: Cache and HTMLGenerator





Re: RES: Cache and HTMLGenerator

Posted by Joerg Heinicke <jo...@gmx.de>.
On 30.03.2004 01:54, Gustavo Nalle Fernandes wrote:

>  Interesting class, Joerg. A couple of observations:
>
>  1) The remote site DOES have a Last-Modified header, and Cocoon is issuing
>  a GET request instead of a HEAD request to obtain the header value. This is
>  a common mistake made when using the java.net.HttpURLConnection class. If
>  you want to make a HEAD request, you must call setRequestMethod("HEAD") on
>  your HttpURLConnection instance before calling the method to obtain the
>  header value.

Unfortunately this is beyond my knowledge. You mean getLastModified()
would not work on a GET request? But aren't the headers also sent on a GET
request?

>  2) Regarding your implementation, I found the idea of creating a subclassed
>  HTMLGenerator that lets us control the cache timeout very promising. I am
>  kind of a newbie to the Cocoon source code, so could you give me general
>  guidelines on how I could create an external sitemap parameter to manage
>  this time interval? It would replace the hard-coded value "10000" in your
>  code.

Also easy :) The HTMLGenerator has some examples for this, e.g. the xpath
parameter:   xpath = par.getParameter("xpath", null);
The first argument is the parameter name, the second the default value.


The class could then look like:

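// Package names as found in Cocoon 2.1; they may differ in other versions.
import java.io.IOException;
import java.util.Map;

import org.apache.avalon.framework.parameters.Parameters;
import org.apache.cocoon.ProcessingException;
import org.apache.cocoon.components.source.impl.DelayedRefreshSourceWrapper;
import org.apache.cocoon.environment.SourceResolver;
import org.apache.cocoon.generation.HTMLGenerator;
import org.xml.sax.SAXException;

public class DelayedHTMLGenerator extends HTMLGenerator {

    // Refresh delay taken from the "delay" sitemap parameter.
    protected int delay;

    public void setup(SourceResolver resolver, Map objectModel,
                      String src, Parameters par)
    throws ProcessingException, SAXException, IOException {
      super.setup(resolver, objectModel, src, par);

      // Falls back to 1000 when the sitemap does not set "delay".
      delay = par.getParameterAsInteger("delay", 1000);

      // Wrap the source so its last-modified check is repeated only
      // after the delay has passed.
      this.inputSource =
             new DelayedRefreshSourceWrapper(this.inputSource, delay);
    }
}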

You would specify it in the sitemap like the following:

<map:generate type="delayhtml" src="url">
   <map:parameter name="delay" value="10000"/>
</map:generate>

Joerg



RES: Cache and HTMLGenerator

Posted by Gustavo Nalle Fernandes <gn...@ig.com.br>.
 Interesting class, Joerg. A couple of observations:

 1) The remote site DOES have a Last-Modified header, and Cocoon is issuing a
 GET request instead of a HEAD request to obtain the header value. This is a
 common mistake made when using the java.net.HttpURLConnection class. If you
 want to make a HEAD request, you must call setRequestMethod("HEAD") on your
 HttpURLConnection instance before calling the method to obtain the header
 value.

 2) Regarding your implementation, I found the idea of creating a subclassed
 HTMLGenerator that lets us control the cache timeout very promising. I am
 kind of a newbie to the Cocoon source code, so could you give me general
 guidelines on how I could create an external sitemap parameter to manage
 this time interval? It would replace the hard-coded value "10000" in your
 code.
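
 To illustrate point 1, here is a minimal plain-Java sketch of reading the
 Last-Modified value with a HEAD request. It is not taken from Cocoon; the
 class name HeadLastModified and the method lastModifiedViaHead are made up
 for this example.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Cocoon.
final class HeadLastModified {

    /**
     * Returns the Last-Modified value in milliseconds since the epoch,
     * or 0 if the server did not send the header.
     */
    static long lastModifiedViaHead(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Must be set before the connection is used; the default is GET.
        conn.setRequestMethod("HEAD");
        try {
            // Reading a header triggers the request; no body is transferred.
            return conn.getLastModified();
        } finally {
            conn.disconnect();
        }
    }
}
```

 Without the setRequestMethod("HEAD") call, the same getLastModified() call
 would be answered by a GET, which is the behaviour seen in the sniffer.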

 Thanks in advance,
 Gustavo


-----Original Message-----
From: Joerg Heinicke [mailto:joerg.heinicke@gmx.de]
Sent: Monday, March 29, 2004 18:46
To: users@cocoon.apache.org
Subject: Re: Cache and HTMLGenerator





Re: Cache and HTMLGenerator

Posted by Joerg Heinicke <jo...@gmx.de>.
On 29.03.2004 01:59, Gustavo Nalle Fernandes wrote:

> Hi, I am using HTMLGenerator to obtain content from remote sites. Using a
> network sniffer, I noticed that the HTMLGenerator always makes a GET request
> to the remote site to obtain the content. Is it possible to cache this
> content? For example, by specifying a period of time during which the
> HTMLGenerator uses a cached version instead of making the same request over
> and over? This would drastically improve performance, no matter what type of
> generator is used.

A look into HTMLGenerator's source shows that it is cacheable, but it
delegates the caching to the Source implementation, in your case
probably URLSource, which tries to read getLastModified() from
the URLConnection. If you have control over the remote site, you probably
have to set a Last-Modified header there.
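
Why the header matters can be shown with a tiny plain-Java sketch of a
timestamp comparison. This is only an illustration under assumptions, not
Cocoon's actual validity code: the class name LastModifiedValidity and the
method isCacheStillValid are made up here. The one documented fact it relies
on is that URLConnection.getLastModified() returns 0 when the value is not
known.

```java
// Hypothetical helper, not part of Cocoon.
final class LastModifiedValidity {

    /**
     * Decides whether a cached result may be reused, given the timestamp
     * stored with the cache entry and the timestamp reported now.
     * A value of 0 means "unknown" (no Last-Modified header was sent).
     */
    static boolean isCacheStillValid(long cachedLastModified, long currentLastModified) {
        if (cachedLastModified == 0L || currentLastModified == 0L) {
            return false; // nothing to compare against, so regenerate
        }
        return cachedLastModified == currentLastModified;
    }
}
```

With a remote site that sends no Last-Modified header, both values are 0 and
a check like this can never report the cached content as valid.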

For cleverer handling on the Cocoon side, as you described above,
you have to subclass HTMLGenerator and wrap the URLSource in a
DelayedRefreshSourceWrapper. The code might look like:

public class DelayedHTMLGenerator extends HTMLGenerator {

   public void setup(SourceResolver resolver, Map objectModel,
                     String src, Parameters par)
   throws ProcessingException, SAXException, IOException {
     super.setup(resolver, objectModel, src, par);
     // Wrap the source so its last-modified check is repeated only
     // after the given delay (hard-coded here) has passed.
     this.inputSource =
            new DelayedRefreshSourceWrapper(this.inputSource, 10000);
   }
}

(not tested)

The rest is inherited and does not need to be implemented.

Joerg
