Posted to dev@cocoon.apache.org by Ovidiu Predescu <ov...@cup.hp.com> on 2001/12/10 10:23:11 UTC

initial checkin of the Scheme code

Hi,

I've just checked in the initial code for the Scheme integration. It
is in scratchpad/schecoon/ (for lack of a better name) in the main
trunk.

As it is right now, it doesn't do much. There are two servlets,
REPLServlet and REPLEvalServlet. The first one is meant to be the
main entry point for the new Scheme-driven Cocoon. Eventually it will
invoke the Scheme function that executes the sitemap.

The second servlet is for development only. It is mounted at
/schecoon/eval under the servlet context, and it accepts POST requests
containing Scheme code to evaluate. Because this servlet is a security
risk, you need to enable it with a flag in web.xml. Once this flag is
set to true, the servlet is set up to use basic authentication. To
configure this with Tomcat 3.3, for example, add the following entry to
conf/users/global-users.xml:

<user name="cocoon" password="your password" roles="cocoon_admin"/>
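
For reference, the basic-authentication protection described above would
be declared in web.xml along these lines. This is only a sketch following
the standard Servlet deployment descriptor; the exact url-pattern and
resource name depend on the actual schecoon web.xml:

```xml
<!-- Sketch only: restrict the eval servlet to the cocoon_admin role. -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>Scheme eval servlet</web-resource-name>
    <url-pattern>/eval</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>cocoon_admin</role-name>
  </auth-constraint>
</security-constraint>
<login-config>
  <auth-method>BASIC</auth-method>
</login-config>
```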

There is a command-line driver for the interpreter in the util
package. It reads Scheme expressions from the command line in a loop,
POSTs each one to the servlet, and prints the response. This
application is intended to be used with Emacs; see the README file in
the emacs/ directory. From within Emacs you can edit your code, and
when you want to test it, a few keystrokes send it to the servlet's
Scheme engine to be evaluated.

The SISC jar file is built from the latest sources in CVS. With the
simple servlet I've just described, it adds less than 2 MB to the
runtime memory footprint.


Greetings,
-- 
Ovidiu Predescu <ov...@cup.hp.com>
http://orion.nsr.hp.com/ (inside HP's firewall only)
http://sourceforge.net/users/ovidiu/ (my SourceForge page)
http://www.geocities.com/SiliconValley/Monitor/7464/ (GNU, Emacs, other stuff)

---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Stefano Mazzocchi <st...@apache.org>.
Berin Loritsch wrote:
> 
> Berin Loritsch wrote:
> 
> > Stefano Mazzocchi wrote:
> >
> >> Talking about placing too many irons in the fire :)
> >>
> >
> >
> > Stefano, please forgive my ignorance, as I have an Associates Degree in
> > Music Recording, and not a traditional CS degree.  I haven't had the
> > higher math classes you and many others have had here.

Oops, sorry about that: having had *many* of those higher math classes,
I consider this very basic stuff, and I tend to forget that it is not
basic for everyone, even programmers (well, the concepts indeed are, as
you know).

> > I understand Algebra and Trigonometry ok, so if we can simplify these
> > algorithms into a "pseudo-code" so I can wrap my head around them, I
> > will be happy to jump in the discussion.  As it is now, It's going
> > over my head.

Thanks to Leo for doing that.

> > I understand the Calculus ideas of functions (i.e. V(r,t)).  What throws
> > me are the Sigmas (I studied Ancient Greek, so I know what the letters
> > are).
> > I know they are meant to represent sums, or statistical data.  I don't know
> > what the letter above the sigma or below the sigma is supposed to signify.

They mark where to start and where to stop. In fact, after centuries of
use, math has an incredibly good syntax (unfortunately, it requires full
2D vector graphics to display and doesn't render very well as ASCII
art) :(

> > If you don't feel up to giving me a quick math lesson so I can grok the
> > concepts fully, can you point me to a resource where I can teach myself?

Oh, no problem in explaining what this means to everyone that asks.

> >> Thus, given
> >>
> >>  r        := resource
> >>  t        := time of the request
> >>  V(r,t)   := validity estimation cost
> >>  P(r,t)   := production cost when caching is not performed
> >>  Pc(r,t)  := production cost when caching is performed
> >>  l(r,t)   := lookup cost
> >>  s(r,t)   := storage cost
> >>
> >> and
> >>
> >>  I := sequence of invalid requests time [1,n]
> >>  V := sequence of valid request time [1,m]
> 
> Also, I have to ask which aspects of the resource we are calculating.  Time
> changed?  Size in bytes?

That's not defined at this point, we would say "pluggable" or "abstract"
in code terms.

These equations are sort of a math framework where you can plug in your
own cost function and the whole thing still works.

[I admit it's probably because of these *intensive* high math studies of
mine that I have such a framework-oriented mindset]

> t = relative millis since beginning of adaptation, time since last request, or absolute millis?

Again, not defined. It's the time axis. It's up to you to define what
the resolution is and where the zero of the axis is. It's not the
equation's concern to define that (at least, at this point).

> Also, what is the sample window that we have to look at?  It is impractical
> to look at all samples for the life of the cache, so how many do we need to track?

I didn't define this either; anyway, I expect it *not* to be impractical
to consider all the samples, as long as only the statistical information
(average, standard deviation) is stored instead of the entire array of
values.
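
To make the 'store only the statistics' point concrete, here is a small
sketch (not code from the repository) of how average and standard
deviation can be maintained incrementally, without keeping the array of
samples, using Welford's online algorithm:

```java
// Running mean / standard deviation without storing the samples
// (Welford's online algorithm). Class and method names are illustrative.
public class RunningStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the mean

    public void add(double sample) {
        n++;
        double delta = sample - mean;
        mean += delta / n;
        m2 += delta * (sample - mean);
    }

    public double mean() { return mean; }

    public double standardDeviation() {
        // sample standard deviation; 0 until we have at least two samples
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
    }

    public long count() { return n; }
}
```

Memory use is three numbers per resource, regardless of how many
samples have been seen.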
 
> It also helps to know the units in which these functions are measured, such as
> Validity Estimation Cost (my assumption is time in millis).

The dimension for a function of cost is [cost]. :)

If you define what [cost] is at this point, you are throwing away the
ability to plug in different cost functions.

For example: in a system where you have all the disk you want, disk cost
is zero, thus you don't take it into account, only CPU time is your
scarce resource, thus this is included into the cost function.

On a hypothetical embedded system, the scarce resource could be 'thermal
dissipation' while disk and CPU time might be less important.

How could you estimate, by hand, which resources are better cached or
not depending on how much 'heat' they generate inside the system?

This is the power of my adaptive caching scheme with a pluggable cost
function: it's up to you to define what you think is scarce on your
system and to provide a way to measure that scarce resource; the caching
behavior is handled by the exact same algorithms.
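
In code terms, the pluggable framework might be sketched like this. All
names are illustrative, not from the Cocoon code base; each cost
function returns whatever [cost] you decided is scarce:

```java
// A sketch of the pluggable-cost idea: the math framework only sees an
// abstract cost, the deployer decides what it actually measures.
interface CostFunction {
    double cost(String resource, long time); // returns [cost], whatever that is
}

class CacheCostModel {
    private final CostFunction validity;   // V(r,t)
    private final CostFunction production; // Pc(r,t)
    private final CostFunction storage;    // s(r,t)
    private final CostFunction lookup;     // l(r,t)

    CacheCostModel(CostFunction v, CostFunction p,
                   CostFunction s, CostFunction l) {
        this.validity = v; this.production = p;
        this.storage = s; this.lookup = l;
    }

    // Invalid(r): sum over the invalid request times I[1..n]
    double invalid(String r, long[] invalidTimes) {
        double sum = 0.0;
        for (long t : invalidTimes)
            sum += validity.cost(r, t) + production.cost(r, t) + storage.cost(r, t);
        return sum;
    }

    // Valid(r): sum over the valid request times V[1..m]
    double valid(String r, long[] validTimes) {
        double sum = 0.0;
        for (long t : validTimes)
            sum += validity.cost(r, t) + lookup.cost(r, t);
        return sum;
    }
}
```

Plugging in a different CostFunction (elapsed millis, bytes, 'heat')
changes what is optimized without touching the summing algorithms.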

This also shows how limiting the assumption is that cost's dimension is
time, made only because you are used to thinking that the only scarce
resource on a server system is processing speed.
 
> Also note that storage and lookup costs can be reduced with asynchronous file IO.
> It is not simple to program in those terms, but such a mechanism can speed up the
> scalability of such operations.

Absolutely, but once you have introduced a way for the system to
'measure' those costs, any improvement on the architecture will be
balanced by an adaptation of the cache system, without even telling it
that you changed something.

> You also have to consider the fact that an effective cache uses both memory and
> file IO, so the storage costs between them vary greatly.

Absolutely, this is why the cost function is not defined and the math
framework is kept abstract.

Hope this helps.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Gerhard Froehlich <g-...@gmx.de>.
Berin,
>-----Original Message-----
>From: Berin Loritsch [mailto:bloritsch@apache.org]
>Sent: Thursday, December 13, 2001 4:43 PM
>To: cocoon-dev@xml.apache.org
>Subject: Re: Adaptive Caching [was Re: initial checkin of the Scheme
>code]
>
>
>Berin Loritsch wrote:
>
>> Stefano Mazzocchi wrote:
>> 
>>> Talking about placing too many irons in the fire :)
>>>
>> 
>> 
>>I haven't had the higher math classes you 
>>and many others have had here.

YOU LUCKY!! ;-)

Cheers
Gerhard


 
-----------------------------------------
All my life I wanted to be someone;
I guess I should have been more specific.
(Jane Wagner)
-----------------------------------------




Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Berin Loritsch <bl...@apache.org>.
Berin Loritsch wrote:

> Stefano Mazzocchi wrote:
> 
>> Talking about placing too many irons in the fire :)
>>
> 
> 
> Stefano, please forgive my ignorance, as I have an Associates Degree in
> Music Recording, and not a traditional CS degree.  I haven't had the
> higher math classes you and many others have had here.
> 
> I understand Algebra and Trigonometry ok, so if we can simplify these
> algorithms into a "pseudo-code" so I can wrap my head around them, I
> will be happy to jump in the discussion.  As it is now, It's going
> over my head.
> 
> I understand the Calculus ideas of functions (i.e. V(r,t)).  What throws
> me are the Sigmas (I studied Ancient Greek, so I know what the letters 
> are).
> I know they are meant to represent sums, or statistical data.  I don't know
> what the letter above the sigma or below the sigma is supposed to signify.
> 
> If you don't feel up to giving me a quick math lesson so I can grok the
> concepts fully, can you point me to a resource where I can teach myself?
> 
>> Thus, given
>>
>>  r        := resource
>>  t        := time of the request
>>  V(r,t)   := validity estimation cost
>>  P(r,t)   := production cost when caching is not performed
>>  Pc(r,t)  := production cost when caching is performed
>>  l(r,t)   := lookup cost
>>  s(r,t)   := storage cost
>>
>> and
>>
>>  I := sequence of invalid requests time [1,n]
>>  V := sequence of valid request time [1,m]


Also, I have to ask which aspects of the resource we are calculating.  Time
changed?  Size in bytes?

t = relative millis since beginning of adaptation, time since last request, or absolute millis?

Also, what is the sample window that we have to look at?  It is impractical
to look at all samples for the life of the cache, so how many do we need to track?

It also helps to know the units in which these functions are measured, such as
Validity Estimation Cost (my assumption is time in millis).

Also note that storage and lookup costs can be reduced with asynchronous file IO.
It is not simple to program in those terms, but such a mechanism can speed up the
scalability of such operations.

You also have to consider the fact that an effective cache uses both memory and
file IO, so the storage costs between them vary greatly.



>> we have that
>>
>>                n
>>                --
>>  Invalid(r) =  \  (V(r,I ) + Pc(r,I ) + s(r,I ))
>>                /        i          i         i
>>                --
>>               i = 1
>>
>> and
>>
>>              m
>>              --
>>  Valid(r) =  \  (V(r,I ) + L(r,I ))
>>              /        j         j
>>              --
>>            j = 1
>>
>> and
>>
>>             n                m
>>             --               --
>>  Ptot(r) =  \  P(r,I )   +   \  P(r,V )
>>             /       i        /       j
>>             --               --
>>           i = 1             j = 1



-- 

"They that give up essential liberty to obtain a little temporary safety
  deserve neither liberty nor safety."
                 - Benjamin Franklin




RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Gerhard Froehlich <g-...@gmx.de>.
Hi,

   n
  --
  \  x  = x  + x  + ... + x
  /   i    1    2          n
  --
i = 1

n = a constant or outer parameter
i = an index running over the whole numbers (1, 2, 3, ...)

Example:
x = 10
 1 
x = 12
 2
x = 14
 3

   3 
  -- 
  \  x  = x  + x  + x  = 10 + 12 + 14 = 36
  /   i    1    2    3
  --
i = 1

  Gerhard

--------------------------------
Beam me up... arrgh, no carrier!
--------------------------------


>-----Original Message-----
>From: Berin Loritsch [mailto:bloritsch@apache.org]
>Sent: Thursday, December 13, 2001 4:34 PM
>To: cocoon-dev@xml.apache.org
>Subject: Re: Adaptive Caching [was Re: initial checkin of the Scheme
>code]
>
>
>Stefano Mazzocchi wrote:
>
>> Talking about placing too many irons in the fire :)
>> 
>
>
>Stefano, please forgive my ignorance, as I have an Associates Degree in
>Music Recording, and not a traditional CS degree.  I haven't had the
>higher math classes you and many others have had here.
>
>I understand Algebra and Trigonometry ok, so if we can simplify these
>algorithms into a "pseudo-code" so I can wrap my head around them, I
>will be happy to jump in the discussion.  As it is now, It's going
>over my head.
>
>I understand the Calculus ideas of functions (i.e. V(r,t)).  What throws
>me are the Sigmas (I studied Ancient Greek, so I know what the letters are).
>I know they are meant to represent sums, or statistical data.  I don't know
>what the letter above the sigma or below the sigma is supposed to signify.
>
>If you don't feel up to giving me a quick math lesson so I can grok the
>concepts fully, can you point me to a resource where I can teach myself?
>
>> Thus, given
>> 
>>  r        := resource
>>  t        := time of the request
>>  V(r,t)   := validity estimation cost
>>  P(r,t)   := production cost when caching is not performed
>>  Pc(r,t)  := production cost when caching is performed
>>  l(r,t)   := lookup cost
>>  s(r,t)   := storage cost
>> 
>> and
>> 
>>  I := sequence of invalid requests time [1,n]
>>  V := sequence of valid request time [1,m]
>> 
>> we have that
>> 
>>                n
>>                --
>>  Invalid(r) =  \  (V(r,I ) + Pc(r,I ) + s(r,I ))
>>                /        i          i         i 
>>                --
>>               i = 1
>> 
>> and
>> 
>>              m
>>              --
>>  Valid(r) =  \  (V(r,I ) + L(r,I ))
>>              /        j         j
>>              --
>>            j = 1
>> 
>> and
>> 
>>             n                m
>>             --               --
>>  Ptot(r) =  \  P(r,I )   +   \  P(r,V )
>>             /       i        /       j
>>             --               --
>>           i = 1             j = 1
>> 
>
>
>
>
>-- 
>
>"They that give up essential liberty to obtain a little temporary safety
>  deserve neither liberty nor safety."
>                 - Benjamin Franklin
>
>
>
>




Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Berin Loritsch <bl...@apache.org>.
Leo Sutic wrote:

> 
>>From: Berin Loritsch [mailto:bloritsch@apache.org]
>>Stefano Mazzocchi wrote:
>>
>>I
>>don't know
>>what the letter above the sigma or below the sigma is supposed to signify.
>>
> 
> The sigma is like a summing for-loop. Example:
> 
> 
>>>               n
>>>               --
>>> Invalid(r) =  \  (V(r,I ) + Pc(r,I ) + s(r,I ))
>>>               /        i          i         i
>>>               --
>>>              i = 1
>>>
>>>
> 
> would be like (with an assumption of data types):
> 
> float Invalid (float r) {
>   float sum = 0.0;
>   for (int i = 1; i <= n; i++) {
>     sum += (V(r,I[i]) + Pc(r,I[i]) + s(r,I[i]))
>   }
>   return sum;
> }
> 
> So, you can read the sigma as "sum of <expression after sigma> for all
> integer values of <whatever variable is below the sigma> between <the
> variable below the sigma's initial value> and <the value or variable on top
> of the sigma>". For the above: "sum of (V(r,Ii) + Pc(r,Ii) + s(r,Ii)) for
> all integer values of i between 1 and n".


Thanks Leo!  That helps _a lot_.

So, how do we decide what "n" and "m" are?




-- 

"They that give up essential liberty to obtain a little temporary safety
  deserve neither liberty nor safety."
                 - Benjamin Franklin




RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Leo Sutic <le...@inspireinfrastructure.com>.

> From: Berin Loritsch [mailto:bloritsch@apache.org]
> Stefano Mazzocchi wrote:
>
> I
> don't know
> what the letter above the sigma or below the sigma is supposed to signify.

The sigma is like a summing for-loop. Example:

> >
> >                n
> >                --
> >  Invalid(r) =  \  (V(r,I ) + Pc(r,I ) + s(r,I ))
> >                /        i          i         i
> >                --
> >               i = 1
> >

would be like (with an assumption of data types):

float Invalid (float r) {
  float sum = 0.0;
  for (int i = 1; i <= n; i++) {
    sum += (V(r,I[i]) + Pc(r,I[i]) + s(r,I[i]))
  }
  return sum;
}

So, you can read the sigma as "sum of <expression after sigma> for all
integer values of <whatever variable is below the sigma> between <the
variable below the sigma's initial value> and <the value or variable on top
of the sigma>". For the above: "sum of (V(r,Ii) + Pc(r,Ii) + s(r,Ii)) for
all integer values of i between 1 and n".

/LS




Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Berin Loritsch <bl...@apache.org>.
Stefano Mazzocchi wrote:

> Talking about placing too many irons in the fire :)
> 


Stefano, please forgive my ignorance, as I have an Associates Degree in
Music Recording, and not a traditional CS degree.  I haven't had the
higher math classes you and many others have had here.

I understand Algebra and Trigonometry ok, so if we can simplify these
algorithms into a "pseudo-code" so I can wrap my head around them, I
will be happy to jump in the discussion.  As it is now, It's going
over my head.

I understand the Calculus ideas of functions (i.e. V(r,t)).  What throws
me are the Sigmas (I studied Ancient Greek, so I know what the letters are).
I know they are meant to represent sums, or statistical data.  I don't know
what the letter above the sigma or below the sigma is supposed to signify.

If you don't feel up to giving me a quick math lesson so I can grok the
concepts fully, can you point me to a resource where I can teach myself?

> Thus, given
> 
>  r        := resource
>  t        := time of the request
>  V(r,t)   := validity estimation cost
>  P(r,t)   := production cost when caching is not performed
>  Pc(r,t)  := production cost when caching is performed
>  l(r,t)   := lookup cost
>  s(r,t)   := storage cost
> 
> and
> 
>  I := sequence of invalid requests time [1,n]
>  V := sequence of valid request time [1,m]
> 
> we have that
> 
>                n
>                --
>  Invalid(r) =  \  (V(r,I ) + Pc(r,I ) + s(r,I ))
>                /        i          i         i 
>                --
>               i = 1
> 
> and
> 
>              m
>              --
>  Valid(r) =  \  (V(r,I ) + L(r,I ))
>              /        j         j
>              --
>            j = 1
> 
> and
> 
>             n                m
>             --               --
>  Ptot(r) =  \  P(r,I )   +   \  P(r,V )
>             /       i        /       j
>             --               --
>           i = 1             j = 1
> 




-- 

"They that give up essential liberty to obtain a little temporary safety
  deserve neither liberty nor safety."
                 - Benjamin Franklin




Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Stefano Mazzocchi <st...@apache.org>.
Antti Koivunen wrote:

> > Every time a request comes, I have the information on the 'estimated'
> > time the resource will need to be generated on the different routes.
> > Once the decision is taken, I have the information on how much it took
> > to get it and I can compare it with the "assumed" time that would have
> > taken on the other path. Then I know how much time I *saved* with this
> > choice.
> 
> I really like the idea of having two options to choose from based on
> a single variable (the time it takes to produce the resource). 

Well, that's one possibility. In general, the choice falls on the
'least expensive' path, where you can define 'cost' as you like
(normally, the time to produce the resource).

> It might
> seem overly simplified, but it actually encapsulates the problem well.
> 
> In fact, with one extra level of abstraction stating that "cache is just
> another place to get the same thing", this could be further simplified
> to something like:
> 
> Source soc = new SourceOfChoice(Source a, Source b);
> 
> Here, SourceOfChoice (or SoC, if you will ;) is responsible only for
> tracking the time it takes to finish a certain operation, such as the
> retrieval of some resource, performing the load balancing calculations,
> and delegating requests accordingly.

Uh, that's a good suggestion, didn't think about adding load balancing
to the picture...

> This design enforces SoC (in
> its real meaning) and allows SourceOfChoice to focus on a very small
> problem. ('Source' here, of course, isn't referring to Cocoon or TrAX
> Sources.)
> 
> In complex applications this could result in a tree that automatically
> optimizes the lookup paths (on a node level). 

Yes, that's the idea.

> I really don't know Cocoon
> that well, but for 'some server application' the structure might look
> something like:
> 
>                        <Pipeline>
>                         /      \
>         [completely cached]  [dynamic]
>                              /       \
>                      <Generator>  <Transformers>
>                        /     \
>                  [cached]  [dynamic]
>                             /     \
>                         <File>  <DBQuery>
> 
> Of course, the cost of making the <decisions> must be justified by the
> overall performance gain from adaptability. The idea might also be
> better applied to more complex applications (can't think of many,
> though ;)

Well, one example would be a transformer stylesheet dynamically generated
out of aggregated dynamic pipelines. It might be big-time FS, I know,
but it's possible with today's architecture and the cache adapts to it
very well.
 
> > But we don't stop here: we also have a way to measure the efficiency of
> > the cache itself between cache hits and cache misses.
> 
> I've used an extremely simple, but quite powerful adaptive cache
> implementation that periodically adjusts its size according to the
> number of cache hits and misses. It supports pluggable capacity
> adjustment algorithms and the default one works as follows:
> 
>    if cacheMisses > maxMisses and currentCapacity < maxCapacity
> 
>      increase capacity by cacheMisses / maxMisses * capacityIncrement
> 
>    else if cacheMisses < minMisses and currentCapacity > minCapacity
> 
>      decrease capacity by capacityDecrement
> 
> This simple approach is quite effective as it allows the capacity to
> grow rapidly under high load conditions. I'm not sure if any of this is
> really useful for Cocoon, but you never know ;)

Ah, BTW, in my caching algorithms there is no notion of 'floating
storage capacity', since I assume a given fixed amount of memory. This
will need to be fixed (I didn't have time to complete the survey back
then and forgot about it until now, when I saw your pseudocode).

Thanks for bringing this up.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Antti Koivunen <an...@users.sourceforge.net>.
I haven't yet had time to read your essay on caching (I will), but a few
thoughts...

Stefano Mazzocchi wrote:

>

<skip/>

>
> Every time a request comes, I have the information on the 'estimated'
> time the resource will need to be generated on the different routes.
> Once the decision is taken, I have the information on how much it took
> to get it and I can compare it with the "assumed" time that would have
> taken on the other path. Then I know how much time I *saved* with this
> choice.


I really like the idea of having two options to choose from based on
a single variable (the time it takes to produce the resource). It might
seem overly simplified, but it actually encapsulates the problem well.

In fact, with one extra level of abstraction stating that "cache is just
another place to get the same thing", this could be further simplified
to something like:

Source soc = new SourceOfChoice(Source a, Source b);

Here, SourceOfChoice (or SoC, if you will ;) is responsible only for
tracking the time it takes to finish a certain operation, such as the
retrieval of some resource, performing the load-balancing calculations,
and delegating requests accordingly. This design enforces SoC (in
its real meaning) and allows SourceOfChoice to focus on a very small
problem. ('Source' here, of course, isn't referring to Cocoon or TrAX
Sources.)
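
A minimal sketch of this idea (hypothetical names; 'Source' is modeled
here as a plain Supplier, per the disclaimer above) could keep a running
average of the elapsed time per alternative and delegate to the
currently cheaper one:

```java
import java.util.function.Supplier;

// Sketch of the SourceOfChoice idea: track how long each alternative
// takes and delegate to whichever currently has the lower average.
class SourceOfChoice<T> {
    private final Supplier<T> a, b;
    private double avgA = 0, avgB = 0; // running average elapsed nanos
    private long nA = 0, nB = 0;       // how many times each was used

    SourceOfChoice(Supplier<T> a, Supplier<T> b) { this.a = a; this.b = b; }

    public T get() {
        // Untried alternatives get a chance first; otherwise prefer
        // the one with the lower observed average.
        boolean useA = nA == 0 || (nB != 0 && avgA <= avgB);
        long start = System.nanoTime();
        T result = useA ? a.get() : b.get();
        double elapsed = System.nanoTime() - start;
        if (useA) { nA++; avgA += (elapsed - avgA) / nA; }
        else      { nB++; avgB += (elapsed - avgB) / nB; }
        return result;
    }
}
```

The first two calls exercise each alternative once; after that the
running averages drive the choice, so the object adapts as the relative
costs change.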

In complex applications this could result in a tree that automatically
optimizes the lookup paths (on a node level). I really don't know Cocoon
that well, but for 'some server application' the structure might look
something like:

                       <Pipeline>
                        /      \
        [completely cached]  [dynamic]
                             /       \
                     <Generator>  <Transformers>
                       /     \
                 [cached]  [dynamic]
                            /     \
                        <File>  <DBQuery>


Of course, the cost of making the <decisions> must be justified by the
overall performance gain from adaptability. The idea might also be
better applied to more complex applications (can't think of many,
though ;)


> But we don't stop here: we also have a way to measure the efficiency of
> the cache itself between cache hits and cache misses.


I've used an extremely simple, but quite powerful adaptive cache
implementation that periodically adjusts its size according to the
number of cache hits and misses. It supports pluggable capacity
adjustment algorithms and the default one works as follows:

   if cacheMisses > maxMisses and currentCapacity < maxCapacity

     increase capacity by cacheMisses / maxMisses * capacityIncrement

   else if cacheMisses < minMisses and currentCapacity > minCapacity

     decrease capacity by capacityDecrement

This simple approach is quite effective as it allows the capacity to
grow rapidly under high load conditions. I'm not sure if any of this is
really useful for Cocoon, but you never know ;)
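
The pseudocode above could be made concrete roughly as follows (a
sketch; the thresholds and increments are illustrative parameters, not
values from any real implementation):

```java
// Runnable version of the capacity-adjustment pseudocode above.
class AdaptiveCapacity {
    private int currentCapacity;
    private final int minCapacity, maxCapacity;
    private final int minMisses, maxMisses;
    private final int capacityIncrement, capacityDecrement;

    AdaptiveCapacity(int current, int min, int max,
                     int minMisses, int maxMisses, int inc, int dec) {
        this.currentCapacity = current;
        this.minCapacity = min; this.maxCapacity = max;
        this.minMisses = minMisses; this.maxMisses = maxMisses;
        this.capacityIncrement = inc; this.capacityDecrement = dec;
    }

    /** Called periodically with the miss count for the last period. */
    void adjust(int cacheMisses) {
        if (cacheMisses > maxMisses && currentCapacity < maxCapacity) {
            // grow faster the worse the miss rate is
            currentCapacity = Math.min(maxCapacity,
                currentCapacity + cacheMisses / maxMisses * capacityIncrement);
        } else if (cacheMisses < minMisses && currentCapacity > minCapacity) {
            currentCapacity = Math.max(minCapacity,
                currentCapacity - capacityDecrement);
        }
    }

    int capacity() { return currentCapacity; }
}
```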

(: Anrie ;)





RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Paulo Gaspar <pa...@krankikom.de>.
Answer inline:

> -----Original Message-----
> From: Antti Koivunen [mailto:anryoshi@users.sourceforge.net]
> Sent: Saturday, December 15, 2001 5:24 PM
>
> ...
>
>  > However, as you also mention, there is the cost of sampling. If you
>  > have a processing time expensive document "A" with a maximum cache
>  > lifetime of 24 hours that is usually requested 100 times a day...
>  > and then you sample how much time it takes to get it 100 times a
>  > day, the accuracy gets better but the cost of the sampling is as
>  > big as the cost of not caching at all.
> 
> I'm not sure that I understand this. I think Stefano's idea was to use 
> normal requests for sampling. If the document is served a lot faster 
> from the cache, very few requests would cause it to be regenerated 
> (unless it's modified). The actual sampling calculations are very cheap 
> compared to I/O or XSLT operations.

Maybe you did understand it. It is just as stupid as it sounds.
=;o)

I was just saying that we (obviously) should not fetch a requested and
already cached resource again from its origin (as if there were no cache)
several times during a sampling period just to have more sampling data.

Then we would have more accurate data (and hence better cache tuning) but
we would not be using the cache all those times.


>  >
>  > But usually you have families of documents that take a similar time
>  > to process, like:
>  >  - Articles with 4000 words or less without pictures stored in XML
>  >    files;
>  >  - Product records from a product catalog stored in a database;
>  >  - Invoices with less than 10 items from a database.
>  >
>  > If your time measurements are made per family, you will usually
>  > end up with a much wider set of sample data and hence much more
>  > representative results. The system use will generate the repeated
>  > samples and their distribution along the time (and along load peaks
>  > and low load periods) will tend to be much more representative than
>  > any other mechanism we could come up with.
> 
> I understand the issue, but I'd like to see the process completely 
> automated. I think we might get enough information just by sampling the 
> time it takes to produce a resource (and perhaps the frequency of the 
> requests).

IMHO, completely automated means losing a lot of performance. I think
you will not get much more than what you already get from a non-adaptive
cache.

>...
>
> BTW, the Colt project ("Open Source Libraries for High Performance
> Scientific and Technical Computing in Java") provides several nice
> implementations of various collection types (and much more). Also the
> javadocs are some of the best I've ever seen :) And here's the link:
> 
> http://tilde-hoschek.home.cern.ch/~hoschek/colt/
> 
> (: Anrie ;)

I already knew that one but had lost track of it. Some stuff looks 
interesting!
=:o)


Have fun,
Paulo Gaspar

http://www.krankikom.de
http://www.ruhronline.de
 



Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Antti Koivunen <an...@users.sourceforge.net>.
Good points, Paulo. A few short comments...

Paulo Gaspar wrote:

 > Hi Stefano and other caching maniacs,
 >
 >
 > I am just going to make a brief post now, since I am overloaded at
 > the moment (hence my delay on replying).
 >
 > This is very interesting stuff! And I like very much the hotspot
 > analogy. I was also following the initial cache discussion but I
 > missed the document you attached.
 >
 >
 > I am only going to focus on the major obstacle I see:
 >   - sampling accuracy.
 >
 >
 > In most cases (e.g. the typical web site use) we can consider that the
 > cost function is related to processing time, with storage volume as a
 > restriction. Of course, under this perspective we would also
 > have to consider serialization time, especially when using a 2
 > level cache, but it is better to ignore that at the moment.
 >
 > You also mention something like this in your text:
 >
 >
 >>The easiest cost function is "elapsed time": this indicates that the
 >>most expensive resource is computing power and speed is the most
 >>valuable property of functional operation.
 >>
 >
 > Then you scratch the problem:
 >
 >
 >>A pure-Java implementation is possible with a time granularity of
 >>milliseconds using System.currentTimeMillis(). Smaller granularities
 >>require native hooks either to the system or to the JVM using JNI or the
 >>JVM Native Profiling Interface. For most serving environments,
 >>millisecond granularity is sufficient.
 >>
 >
 > Actually, I think the typical (near 10 ms) granularity of Java is
 > enough because there is another factor that affects the accuracy
 > of this kind of measurement much more... and you scratch that one
 > too:
 >
 >
 >>A pure-Java implementation of such "time + memory" cost function is
 >>almost impossible due to concurrency: there is no way for a particular
 >>thread to know how much memory it has used since the JVM gives
 >>information only about the system free memory and doesn't allow more
 >>granularity than that.

And because of this, different caches might have to be 'weighted', 
perhaps according to the type of component they hold.

 >
 > Concurrency also affects A LOT (especially on production systems)
 > the accuracy of elapsed time measurement, by much more than the
 > above mentioned (near 10 ms) granularity.

True. I've also noticed that even though fast, the calls to
System.currentTimeMillis() can be about 10 times slower than e.g.
simple getters. That's why it sometimes makes sense to have a
'LazyClock' that monitors the system time by periodic calls to
System.currentTimeMillis(). This is of course just a minor
implementation detail.
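A minimal sketch of such a 'LazyClock' might look like the following (the class name, update interval, and method names are illustrative, not taken from any actual Cocoon code):

```java
// Sketch of a "lazy" clock: a daemon thread refreshes a cached
// timestamp periodically, so hot code paths read a volatile field
// instead of calling System.currentTimeMillis() on every request.
public class LazyClock {
    private volatile long now = System.currentTimeMillis();

    public LazyClock(final long intervalMillis) {
        Thread updater = new Thread(() -> {
            while (true) {
                now = System.currentTimeMillis();
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    return; // stop updating when interrupted
                }
            }
        });
        updater.setDaemon(true); // don't keep the JVM alive
        updater.start();
    }

    /** Cached time, accurate to roughly one update interval. */
    public long currentTimeMillis() {
        return now;
    }
}
```

The trade-off is accuracy: readers may see a value up to one interval stale, which is fine when sampling costs measured in tens of milliseconds anyway.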

 >
 > This happens not only because of the load in the machine where the
 > system is running but also because many other factors like the load
 > on the database server or the network you use to access the
 > database and other external resources. In fact, it is impossible to
 > know how much "pure" processing time it takes to obtain a resource
 > since a lot of waiting (for other threads or resource availability)
 > is always involved.
 >
 > The only hope one can have of getting some good sampling is through
 > quantity: repeated random sampling. Otherwise the cache can end up
 > "thinking" that a given document was very expensive just because it
 > happened to be requested when the system was going through a load
 > peak, and it might take the place of a much more "expensive"
 > document which was always requested when the system was under lower
 > loads. If these documents have long "cache lives" a lot of capacity
 > can be lost for long periods of time due to such accidents.
 >
 > However, as you also mention, there is the cost of sampling. If you
 > have a processing time expensive document "A" with a maximum cache
 > lifetime of 24 hours that is usually requested 100 times a day...
 > and then you sample how much time it takes to get it 100 times a
 > day, the accuracy gets better but the cost of the sampling is as
 > big as the cost of not caching at all.

I'm not sure that I understand this. I think Stefano's idea was to use 
normal requests for sampling. If the document is served a lot faster 
from the cache, very few requests would cause it to be regenerated 
(unless it's modified). The actual sampling calculations are very cheap 
compared to I/O or XSLT operations.

 >
 > But usually you have families of documents that take a similar time
 > to process, like:
 >  - Articles with 4000 words or less without pictures stored in XML
 >    files;
 >  - Product records from a product catalog stored in a database;
 >  - Invoices with less than 10 items from a database.
 >
 > If your time measurements are made per family, you will usually
 > end up with a much wider set of sample data and hence much more
 > representative results. The system use will generate the repeated
 > samples and their distribution along the time (and along load peaks
 > and low load periods) will tend to be much more representative than
 > any other mechanism we could come up with.

I understand the issue, but I'd like to see the process completely 
automated. I think we might get enough information just by sampling the 
time it takes to produce a resource (and perhaps the frequency of the 
requests).

 >
 > Besides, your sampling data aggregated per family will take less
 > storage space, hence leaving more room for caching.
 > =:o)
 >
 >
 > Now, you only mention one key generation method in your document:
 >
 >
 >>     | Result #2:                                               |
 >>     |                                                          |
 >>     | Each cacheable producer must generate the unique key of  |
 >>     | the resource given all the environment information at    |
 >>     | request time                                             |
 >>     |                                                          |
 >>     |   long generateKey(Environment e);                       |
 >>
 >
 > Is this key per instance (the caching key) or per family/type of
 > resource?
 >
 >
 > The conclusion from all I wrote above is that a resource producer
 > should probably produce two keys: the cache key and a "sampling
 > family" key that would be used to aggregate cost sampling data.
 > From your document it is not clear if this is what you propose.
 >
 > I would also prefer string keys since I do not see a meaningful
 > performance gain in using longs and I see much easier use with
 > strings, but this is just my opinion.

I tend to agree. I once did a benchmark on this and found out that
unless custom long->Object maps are used, the key object instantiation
cost is higher than any such (barely noticeable) performance gain. As
Cocoon uses custom cache key objects (that internally use strings), I
can't really see any advantage in using longs (I might of course be
missing something).
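A rough illustration of why the difference rarely matters with the standard collections: with java.util.HashMap the primitive long key is autoboxed into a Long object on every access anyway (a hypothetical micro-sketch, not a rigorous benchmark):

```java
import java.util.HashMap;
import java.util.Map;

// With java.util.HashMap, a primitive long key is autoboxed into a
// Long object on every put/get, so the expected win over String keys
// largely disappears unless a custom long->Object map is used.
public class KeyCost {
    static long timeLongKeys(int n) {
        Map<Long, Integer> map = new HashMap<>();
        long start = System.nanoTime();
        for (long i = 0; i < n; i++) {
            map.put(i, 0);          // i is boxed here...
            map.get(i);             // ...and boxed again here
        }
        return System.nanoTime() - start;
    }

    static long timeStringKeys(int n) {
        Map<String, Integer> map = new HashMap<>();
        long start = System.nanoTime();
        for (long i = 0; i < n; i++) {
            String key = Long.toString(i);
            map.put(key, 0);
            map.get(key);
        }
        return System.nanoTime() - start;
    }
}
```

Results of such micro-benchmarks vary with JVM and warm-up, which is why the instantiation cost Antti mentions tends to dominate any primitive-key advantage.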

BTW, the Colt project ("Open Source Libraries for High Performance
Scientific and Technical Computing in Java") provides several nice
implementations of various collection types (and much more). Also the
javadocs are some of the best I've ever seen :) And here's the link:

http://tilde-hoschek.home.cern.ch/~hoschek/colt/

(: Anrie ;)





Re: Adaptive Caching

Posted by Stefano Mazzocchi <st...@apache.org>.
Judson Lester wrote:
> 
> Please forgive me if I'm being a buttinsky, but...

no prob whatsoever...

[skipped great explanation]

>  Stefano, does this make more sense?

Absolutely. It's a very good point indeed.

> <aside type="RT">
> This is the (arguably correct) behavioral inverse of the focus of this
> adaptive caching policy.  It's been said that if the cost of using the cache
> is lower, it is more likely to be used.  However, it's also correct that a
> more costly caching operation will be used less often.
> 
> Of course, this presents the additional complexity (although with effort it
> might become 'sophistication" :-O ) of group membership.  For instance, it's
> intuitively obvious that there is an age-based partition that could be made
> on the invoice generator group, and that "new" invoices have a different key
> than "old" invoices, and that new invoices would partake of the new-invoice
> key and old invoices of the old-invoice key.  Finally, what if those
> partitions were fuzzy, and any invoice could be .8 new and .2 old?  I don't
> think that complicates the math unduly.  (Can you tell I studied with a fuzzy
> logic prof?)
> 
> The natural implementation of this would be for each node of a pipeline to
> have a key, but the generator be able to provide a method to specify
> partitions for the response to a particular request and their unit-sum weight.
> Thus, sample keys are the pipe-path(?) plus an optional partition, and a
> specific request might partake and contribute to sampling multiple sampling
> keys.
> 
> This would be an extension to the adaptive caching with sampling groups, and
> would be backwards compatible.  I wonder about its utility...  But I think
> Paulo's sampling groups idea has merit.
> </aside>

I totally agree.... I'll let your suggestions percolate through my neurons
and see how they influence my neural network... I'll come back when I
have something to say on the topic (right now, I'm concentrating on
Forrest!)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Paulo Gaspar <pa...@krankikom.de>.
Hi Judson,

> -----Original Message-----
> From: Judson Lester [mailto:judson@irev2.com]
> Sent: Wednesday, December 19, 2001 10:24 PM
>
> Please forgive me if I'm being a buttinsky, but...

No, you made what I said much clearer. Maybe I wouldn't have written that
long post yesterday if I had seen yours before.
=:o)

> > > What I do not understand is: having an XML fragment produced
> by a given
> > > generator (that gets an invoice from the database and "translates" it
> > > into XML), do you _always_ track costs per each invoice
> instance coming
> > > from that generator (one entry with the average cost of generating XML
> > > for invoice #12456 and another one for invoice #12122) or
> could you keep
> > > ONE cost entry aggregating the cost of all invoices coming from that
> > > generator (only one entry with the average cost of generating XML for
> > > any invoice, #12456's cost added together with #12122's).
> >
> > Sorry, I didn't understand what you were saying.
>
> As I understand Paulo, say you have a situation where there's an invoice
> report in the web app.  The user can request a listing of invoice
> by number,
> and the invoice is looked up and a report is generated.  This process is
> assumed to be very similar in a consumption-of-resources way, regardless of
> which invoice the user selects.  However, the data produced for a request
> for invoice X is completely different than invoice Y, so, obviously,
> returning the cached version of X if Y is requested is utterly wrong.

Correct.


> So, each invoice request should have its own cache request, but
> all cache
> requests might share a single sampling key.  The advantage,
> supposedly, is
> not storage of sampling data (although this would also be so),
> but to enhance
> the quality of the sampling data for the entire class of invoice requests.

Exactly. That is the point I wanted to make.


> Compare:
> 	a> Every invoice uses its own sampling data.  The vast
> majority of the tens
> of thousands of invoices are requested incredibly infrequently,
> while some
> (the most recent) will be requested several times, and then
> their frequency
> of request will drop off precipitously.  On some level, the same caching
> lesson is learned over and over again by the caching system.
>
> to
> 	b>All invoice reports share a single sampling key.  The
> vast majority are
> still only requested very infrequently, with the most recent
> being requested
> more frequently.  Now, though, the system can adapt to the best
> caching for
> invoice reports as a whole (which will probably be to almost always
> regenerate), which leaves the most recent to be regenerated more
> frequently than
> they might need to be.
>
> Paulo, have I represented your ideas correctly?

Very correctly indeed!
=:o)


> Stefano, does this make more sense?
>
>
> <aside type="RT">
> This is the (arguably correct) behavioral inverse of the focus of this
> adaptive caching policy.  It's been said that if the cost of
> using the cache
> is lower, it is more likely to be used.  However, it's also
> correct that a
> more costly caching operation will be used less often.
>
> Of course, this presents the additional complexity (although with
> effort it
> might become 'sophistication" :-O ) of group membership.

Of course, but the additional complexity is very low and can be
completely optional.

Only those that want its (potentially interesting) profit will
have to pay its (low) cost.
=:o)


> For instance, it's
> intuitively obvious that there is an age-based partition that
> could be made
> on the invoice generator group, and that "new" invoices have a
> different key
> than "old" invoices, and that new invoices would partake of the
> new-invoice
> key and old invoices of the old-invoice key.  Finally, what if those
> partitions were fuzzy, and any invoice could be .8 new and .2
> old?  I don't
> think that complicates the math unduly.  (Can you tell I studied
> with a fuzzy
> logic prof?)

Interesting ideas.


> The natural implementation of this would be for each node of a
> pipeline to
> have a key, but the generator be able to provide a method to specify
> partitions for the response to a particular request and their
> unit-sum weight.
> Thus, sample keys are the pipe-path(?) plus an optional partition, and a
> specific request might partake and contribute to sampling
> multiple sampling
> keys.
>
> This would be an extension to the adaptive caching with sampling
> groups, and
> would be backwards compatible.  I wonder about its utility...

Yes, in these kinds of things we can only quantify the potential advantage
of one solution over the other by testing both.

I just think it might work well and that it is not much extra coding work.

Probably it will take less extra coding time than the time we took writing
about it.
=:o)


> But I think Paulo's sampling groups idea has merit.

Thanks!

> </aside>
>
> Judson


Have fun,
Paulo

http://www.krankikom.de
http://www.ruhronline.de




Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Judson Lester <ju...@irev2.com>.
Please forgive me if I'm being a buttinsky, but...

> > What I do not understand is: having an XML fragment produced by a given
> > generator (that gets an invoice from the database and "translates" it
> > into XML), do you _always_ track costs per each invoice instance coming
> > from that generator (one entry with the average cost of generating XML
> > for invoice #12456 and another one for invoice #12122) or could you keep
> > ONE cost entry aggregating the cost of all invoices coming from that
> > generator (only one entry with the average cost of generating XML for
> > any invoice, #12456's cost added together with #12122's).
>
> Sorry, I didn't understand what you were saying.

As I understand Paulo, say you have a situation where there's an invoice 
report in the web app.  The user can request a listing of invoice by number, 
and the invoice is looked up and a report is generated.  This process is 
assumed to be very similar in a consumption-of-resources way, regardless of 
which invoice the user selects.  However, the data produced for a request for 
invoice X is completely different than invoice Y, so, obviously, returning 
the cached version of X if Y is requested is utterly wrong.  

So, each invoice request should have its own cache request, but all cache 
requests might share a single sampling key.  The advantage, supposedly, is 
not storage of sampling data (although this would also be so), but to enhance 
the quality of the sampling data for the entire class of invoice requests.

Compare:
	a> Every invoice uses its own sampling data.  The vast majority of the tens 
of thousands of invoices are requested incredibly infrequently, while some 
(the most recent) will be requested several times, and then their frequency 
of request will drop off precipitously.  On some level, the same caching 
lesson is learned over and over again by the caching system.

to
	b>All invoice reports share a single sampling key.  The vast majority are 
still only requested very infrequently, with the most recent being requested 
more frequently.  Now, though, the system can adapt to the best caching for 
invoice reports as a whole (which will probably be to almost always 
regenerate), which leaves the most recent to be regenerated more frequently than 
they might need to be.  

Paulo, have I represented your ideas correctly?  Stefano, does this make more 
sense?


<aside type="RT">
This is the (arguably correct) behavioral inverse of the focus of this 
adaptive caching policy.  It's been said that if the cost of using the cache 
is lower, it is more likely to be used.  However, it's also correct that a 
more costly caching operation will be used less often.  

Of course, this presents the additional complexity (although with effort it 
might become 'sophistication" :-O ) of group membership.  For instance, it's 
intuitively obvious that there is an age-based partition that could be made 
on the invoice generator group, and that "new" invoices have a different key 
than "old" invoices, and that new invoices would partake of the new-invoice 
key and old invoices of the old-invoice key.  Finally, what if those 
partitions were fuzzy, and any invoice could be .8 new and .2 old?  I don't 
think that complicates the math unduly.  (Can you tell I studied with a fuzzy 
logic prof?)

The natural implementation of this would be for each node of a pipeline to 
have a key, but the generator be able to provide a method to specify 
partitions for the response to a particular request and their unit-sum weight. 
Thus, sample keys are the pipe-path(?) plus an optional partition, and a 
specific request might partake and contribute to sampling multiple sampling 
keys.  

This would be an extension to the adaptive caching with sampling groups, and 
would be backwards compatible.  I wonder about its utility...  But I think 
Paulo's sampling groups idea has merit.
</aside>
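The fuzzy, unit-sum membership Judson describes could be sketched roughly as follows (all class and method names are invented for illustration; this is not an existing Cocoon API): a request splits its measured cost across several sampling keys in proportion to its membership weights.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: each sampling key keeps a weighted cost accumulator and a
// weight counter; a request splits its cost across keys by fuzzy
// membership weights summing to 1 (e.g. an invoice that is 0.8 "new"
// and 0.2 "old").
public class FuzzySampler {
    // key -> { weighted cost sum, total weight seen }
    private final Map<String, double[]> stats = new HashMap<>();

    public void record(Map<String, Double> membership, double costMillis) {
        for (Map.Entry<String, Double> e : membership.entrySet()) {
            double[] s = stats.computeIfAbsent(e.getKey(), k -> new double[2]);
            s[0] += e.getValue() * costMillis; // weighted cost
            s[1] += e.getValue();              // accumulated weight
        }
    }

    /** Weighted average cost for a sampling key, or 0 if unseen. */
    public double averageCost(String key) {
        double[] s = stats.get(key);
        return (s == null || s[1] == 0) ? 0.0 : s[0] / s[1];
    }
}
```

With hard (0/1) weights this degenerates to Paulo's plain sampling groups, so the extension is indeed backwards compatible as Judson notes.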

Judson



RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Paulo Gaspar <pa...@krankikom.de>.
> I think I got your idea in full detail now and I agree it's a very good
> suggestion.

I was too lazy when I tried to explain it first. Sorry.


> There are some details to think about but I think it's a great addition
> to my model. Thanks for sharing it with us. :)

Nothing to thank for. We are just scratching common itches.

I have a totally selfish interest in having a great Adaptive Cache with 
this feature one of these days.
=;o)


Have fun,
Paulo Gaspar

http://www.krankikom.de
http://www.ruhronline.de


> -----Original Message-----
> From: Stefano Mazzocchi [mailto:stefano@apache.org]
> Sent: Friday, December 21, 2001 12:48 AM
> To: cocoon-dev@xml.apache.org
> Subject: Re: Adaptive Caching [was Re: initial checkin of the Scheme
> code]
> 
> 
> Paulo Gaspar wrote:
> 
> > Notice that my focus is just cache efficiency. And, unlike what 
> I have been
> able to transmit, I am interested in significant improvements.
> 
> Then we share the same concern :)
> 
> [skipped a good example of the need for a sampling key for resource
> families]
> 
> I think I got your idea in full detail now and I agree it's a very good
> suggestion.
> 
> The idea is to have the cacheable component logic create both the
> caching key for the handled request *AND* a (possibly different)
> sampling key.
> 
> So, for example, asking for the resource:
> 
>  cocoon://reviews/movies/harry-potter
> 
> might generate a caching key that hashes the entire URI, while it might
> generate a sampling key that hashes only the general part
> 
>  cocoon://review/movies/
> 
> thus, every request in this category will be collected in the same
> statistical data pool, converging sooner and taking less memory.
> 
> It's a great idea and, yes, I totally missed it in my proposal.
> 
> There are some details to think about but I think it's a great addition
> to my model. Thanks for sharing it with us. :)
> 
> -- 
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <st...@apache.org>                             Friedrich Nietzsche
> --------------------------------------------------------------------
> 
> 
> 
> 



Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Stefano Mazzocchi <st...@apache.org>.
Paulo Gaspar wrote:

> Notice that my focus is just cache efficiency. And, unlike what I have been
able to transmit, I am interested in significant improvements.

Then we share the same concern :)

[skipped a good example of the need for a sampling key for resource
families]

I think I got your idea in full detail now and I agree it's a very good
suggestion.

The idea is to have the cacheable component logic create both the
caching key for the handled request *AND* a (possibly different)
sampling key.

So, for example, asking for the resource:

 cocoon://reviews/movies/harry-potter

might generate a caching key that hashes the entire URI, while it might
generate a sampling key that hashes only the general part

 cocoon://review/movies/

thus, every request in this category will be collected in the same
statistical data pool, converging sooner and taking less memory.

It's a great idea and, yes, I totally missed it in my proposal.

There are some details to think about but I think it's a great addition
to my model. Thanks for sharing it with us. :)
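As a rough sketch of the scheme above (class and method names are hypothetical, not from any Cocoon code), the component could derive both keys from the request URI, with the sampling key keeping only the general part:

```java
// Sketch: the cache key identifies the exact resource, while the
// sampling key strips the last path segment so that all resources in
// the same "family" pool their cost statistics.
public class KeyGenerator {
    public static String cacheKey(String uri) {
        return uri; // one cache entry per concrete resource
    }

    public static String samplingKey(String uri) {
        int slash = uri.lastIndexOf('/');
        // fall back to the full URI when there is no general part
        return (slash > 0) ? uri.substring(0, slash + 1) : uri;
    }
}
```

So "cocoon://reviews/movies/harry-potter" would be cached under its full URI but sampled under "cocoon://reviews/movies/", pooling its cost data with every other movie review.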

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Paulo Gaspar <pa...@krankikom.de>.
> -----Original Message-----
> From: Stefano Mazzocchi [mailto:stefano@apache.org]
> Sent: Wednesday, December 19, 2001 5:39 PM
>
> Paulo Gaspar wrote:
>
> > > So, please, try to see the entire system and not the single page.
> >
> > Of course. I am NOT saying "we must have no errors", I am saying "maybe
> > we could have less errors".
>
> Absolutely, if you have ideas on how to do a better sampling than what I
> explained, I'll be very happy to hear them.
>
> > > A "resource", in this case, is every document fragment at all levels,
> > > the output of each pipeline component.
> >
> > Yes, I understand that the Cache can work in terms of fragments at all
> > levels and so.
> >
> > What I do not understand is: having an XML fragment produced by a given
> > generator (that gets an invoice from the database and "translates" it
> > into XML), do you _always_ track costs per each invoice instance comming
> > from that generator (one entry with the average cost of generating XML
> > for invoice #12456 and another one for invoice #12122) or could you keep
> > ONE cost entry aggregating the cost of all invoices comming from that
> > generator (only one entry with the average cost of generating XML for
> > any invoice, #12456's cost added together with #12122's).
>
> Sorry, I didn't understand what you were saying.
>
> Please, allow me to restate: the 'cost' is only one: how much it takes
> to process an entire request.

I think I have a much clearer case now, based on real life and all.

Notice that my focus is just cache efficiency. And, unlike what I have been
able to transmit, I am interested in significant improvements.

Believe me, I do not care about little things like the time cost of
measuring the time cost. I know that System.currentTimeMillis() takes less
than a microsecond on a modern PC and accessing a hash table to store that
is not sooo muuuch slower.
=:o)


Ok, one of the things I work with is a search engine for movies and other
cultural events (I find a lot of culture on a Jet Li movie! (o:= ). We have
data for the movies running on a set of German cities.

In one of our simplest systems people can ask things like "What are the
movies playing today in *Berlin*?". (Search criteria is city and date
range, but lets say it is just "today".)

Then they get a list of movie sessions and can see a detail page about each
session's movie or cinema.

Now, one can say that generating the page for a movie detail always has the
same cost. It is always reading 1 database record for the movie, 4 to 6 for
the credits (director and actors), the same template.

Building the page for "Romeo must die" should take about the same time as
for "Harry Potter". Probably the difference is less than 5% or so.

On the other hand, the list of movies running today in Berlin is much longer
and more expensive than the list of movies running in Duisburg - Berlin is a
much bigger city with many more cinemas.

So, let's consider these cost issues:
 - The cost of producing the page "Movies Playing Today on City X" varies
   a lot depending on X (the city);
 - The page "Movie Z Details" is always (almost) the same, whatever Z is;
 - The "Movies Playing Today" page takes much more processing time to
   produce than the "Movie Details" page.

Let's also consider that:
 - The page "Movie Z Details" can stay in cache for several days because
   that data never changes;
 - The page "Movies Playing Today" should only be kept a couple of hours
   because sometimes that data is updated.

The typical error we talked about, of "cheap (fast) stuff" taking valuable
cache space from "expensive (slow) stuff", would happen when the Movie Detail
for "Harry Potter" gets first requested during a load peak and takes much
more time than even the "Movies Playing Today" list for Berlin.

Since a Movie Detail can stay cached for several days, errors like this can
accumulate a lot.

One solution is to shorten the cache life of the Movie Detail, of course.
But it is a bit artificial.


Besides, with the cost-per-resource scheme the cache never "knows" very
well what is the average cost of generating a Movie Detail page during the
day and how that compares to the average cost of generating a "Movies
Playing Today" for Berlin.

The cache just knows how the cost for producing the "Harry Potter" detail
compares to the cost of producing the "Movies Playing Today in Berlin" and
to the cost of the "Romeo Must Die" detail. And it even might think that
 - "Harry Potter" is much more expensive than "Romeo Must Die" because
   "Harry" is a new movie which page was only once generated during a
   load peak while Romeo was already generated several times during low
   traffic hours;
 - As above, that "Harry Potter" costs more than "Berlin Movies Today".
=:o(


Now, if the cache has a Cost Sampling Key, all this changes.

Then the developer can tell that all "Movie Details" have the same
Sampling Key while each "Movies Playing Today" has a Cost Key per city.
So, our Smart(er) cache will know that:
 - All "Movie Details" have the same cost;
 - And "Movies Playing Today" will still have a cost per (city) instance.

Then, the cost of generating a Movie Detail page will be the average cost
of generating ANY Movie Detail page. Any Movie Detail generation will add
its cost to one single accumulator, increment one single counter and
result in a single average.

Movies Playing Today will stay per (city) instance.

And then the cache system will know that:
 - "Harry Potter" is so expensive to generate as "Romeo Must Die", even
   if it was only generated once during a load peak;
 - "Harry Potter" is a lot cheaper then the "Berlin Movies Today" list;
 - "Berlin Movies Today" still has a different cost (more expensive)
   than "Duisburg Movies Today".

And, of course, the larger sampling (more frequent measurements) for
Movie Details also means a more accurate cost evaluation for this family
of resources.
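The bookkeeping this requires is tiny: one accumulator and one counter per sampling key. A hedged sketch (all names invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: per-family cost sampling. Every generation of a "Movie
// Details" page adds to the single "movie-detail" accumulator, while
// each "Movies Playing Today" list keeps a per-city entry.
public class CostSampler {
    // samplingKey -> { total cost in ms, number of samples }
    private final Map<String, long[]> samples = new HashMap<>();

    public void record(String samplingKey, long costMillis) {
        long[] s = samples.computeIfAbsent(samplingKey, k -> new long[2]);
        s[0] += costMillis;
        s[1] += 1;
    }

    /** Average observed cost for the family, or 0 if never sampled. */
    public double averageCost(String samplingKey) {
        long[] s = samples.get(samplingKey);
        return (s == null || s[1] == 0) ? 0.0 : (double) s[0] / s[1];
    }
}
```

A "Harry Potter" page generated once during a load peak then contributes only one sample among many to the shared "movie-detail" average, instead of defining that page's cost on its own.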


> > If in the case above all the invoices "produced" are very similar, the
> > cache could measure their cost together, using a single sampling key for
> > all the invoices produced by the above generator (although each invoice
> > would still have a different cache storage key, of course).
>
> well, if you still need a different cache storage key, why would you need
> a different sampling key?
>
> Keep in mind that when the cache content is invalid, the single fragment
> must be regenerated and this can be used for sampling.
>
> In real situations, I really don't see sampling as an expensive
> operation.

No, I was talking about memory cost.

Keeping measurements for resources that are not currently stored in cache
helps in having more accurate sampling data about what should get into or
out of the cache. If this is what you plan to do, then you can end up
keeping measurements of 100000 (one hundred thousand) resources although
only caching 5000 at a time.

100000 measurements must take space! Spooling them to disk or losing them is
not so nice either. But if from these 100000 you have only 10000 (ten
thousand) without a "cost family" and the remaining 90000 can be divided
by some 50 families, you get to only track 10050 costs and this will take
only 10.05% of the memory space of the "no cost families" one hundred
thousand costs.

> ...


Thanks for your attention and have fun,
Paulo Gaspar


http://www.krankikom.de
http://www.ruhronline.de




Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Stefano Mazzocchi <st...@apache.org>.
Paulo Gaspar wrote:

> > So, please, try to see the entire system and not the single page.
> 
> Of course. I am NOT saying "we must have no errors", I am saying "maybe
> we could have less errors".

Absolutely, if you have ideas on how to do a better sampling than what I
explained, I'll be very happy to hear them.

> > A "resource", in this case, is every document fragment at all levels,
> > the output of each pipeline component.
> 
> Yes, I understand that the Cache can work in terms of fragments at all
> levels and so.
> 
> What I do not understand is: having an XML fragment produced by a given
> generator (that gets an invoice from the database and "translates" it
> into XML), do you _always_ track costs per each invoice instance coming
> from that generator (one entry with the average cost of generating XML
> for invoice #12456 and another one for invoice #12122), or could you keep
> ONE cost entry aggregating the cost of all invoices coming from that
> generator (only one entry with the average cost of generating XML for
> any invoice, #12456's cost added together with #12122's)?

Sorry, I didn't understand what you were saying.

Please, allow me to restate: the 'cost' is only one: how much it takes
to process an entire request.

> If in the case above all the invoices "produced" are very similar, the
> cache could measure their cost together, using a single sampling key for
> all the invoices produced by the above generator (although each invoice
> would still have a different cache storage key, of course).

well, if you still need a different cache storage key, why would you need
a different sampling key?

Keep in mind that when the cache content is invalid, the single fragment
must be regenerated and this can be used for sampling.

In real situations, I really don't see sampling as an expensive
operation.
 
> In such situations, if a Cocoon developer could _optionally_ supply a
> "cost family" key the cache could have much better cost data (more
> frequent measurements, since any invoice generated contributes to the
> same measurement pool) and this cost data would take less space (same
> cost data for all the invoice instances).

Yeah, it could be a good way to reduce memory consumption for cache data
(which is, admittedly, lots of data anyway).
 
> Maybe this is already what you have in mind, but we are having a
> communication problem around this one.

No, thanks for telling me.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Paulo Gaspar <pa...@krankikom.de>.
Hi Stefano,

Answer inline:

> -----Original Message-----
> From: Stefano Mazzocchi [mailto:stefano@apache.org]
> Sent: Monday, December 17, 2001 12:50 PM
> 
> Paulo Gaspar wrote:
> 
> > What I mean is that the cache could accumulate quite some wrong cost
> > measurements for long periods of time if they would happen with
> > resources having a long "cache life".
> 
> Correct, but there is no such thing as a perfect cache. Keep this in
> mind. We are dealing with stochastic properties and what we can do is
> 'converge' toward the optimal solution for the entire system. Single
> resources will not be perfectly cached, but the entire system load will
> (if enough statistics are available).
> 
> So, please, try to see the entire system and not the single page.

Of course. I am NOT saying "we must have no errors", I am saying "maybe
we could have less errors".
 
> [...]
>  
> > What I say is that this kind of error can happen a lot. That will
> > have a different cost depending on the characteristics of the system
> > (e.g.: longer cache lives => higher cost).
> 
> Again, correct. In my paper I clearly wrote that the algorithms
> described work best on systems that exhibit time-local behavior, that
> is, don't exhibit high frequencies of cost variations.
> 
> Like any feedback control system, it cannot possibly have the same
> behavior for all frequencies. The control system I described has a
> 'low-pass' behavior: it works best on slowly moving costs... the problem is
> that I couldn't come up with a way to tune the system for higher
> frequencies without creating an impossible overhead on the system :/
> 
> But if you have a suggestion, I'm more than happy to follow it.

I am just trying to figure out if you are considering each instance of a
"document fragment" or each "family". Sorry but I still do not understand
that.

> > What is a resource? Is it "Invoice detail view" or is it the "detail
> > view of invoice number 5678665"?
> 
> A "resource", in this case, is every document fragment at all levels,
> the output of each pipeline component.

Yes, I understand that the Cache can work in terms of fragments at all
levels and so.

What I do not understand is: having an XML fragment produced by a given
generator (that gets an invoice from the database and "translates" it
into XML), do you _always_ track costs per each invoice instance coming
from that generator (one entry with the average cost of generating XML
for invoice #12456 and another one for invoice #12122), or could you keep
ONE cost entry aggregating the cost of all invoices coming from that
generator (only one entry with the average cost of generating XML for
any invoice, #12456's cost added together with #12122's)?

If in the case above all the invoices "produced" are very similar, the 
cache could measure their cost together, using a single sampling key for
all the invoices produced by the above generator (although each invoice 
would still have a different cache storage key, of course).

In such situations, if a Cocoon developer could _optionally_ supply a 
"cost family" key the cache could have much better cost data (more 
frequent measurements, since any invoice generated contributes to the
same measurement pool) and this cost data would take less space (same 
cost data for all the invoice instances).

Maybe this is already what you have in mind, but we are having a 
communication problem around this one.


Have fun,
Paulo Gaspar

http://www.krankikom.de
http://www.ruhronline.de




Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Stefano Mazzocchi <st...@apache.org>.
Paulo Gaspar wrote:

> What I mean is that the cache could accumulate quite some wrong cost
> measurements for long periods of time if they would happen with
> resources having a long "cache life".

Correct, but there is no such thing as a perfect cache. Keep this in
mind. We are dealing with stochastic properties and what we can do is
'converge' toward the optimal solution for the entire system. Single
resources will not be perfectly cached, but the entire system load will
(if enough statistics are available).

So, please, try to see the entire system and not the single page.

[...]
 
> What I say is that this kind of error can happen a lot. That will
> have a different cost depending on the characteristics of the system
> (e.g.: longer cache lives => higher cost).

Again, correct. In my paper I clearly wrote that the algorithms
described work best on systems that exhibit time-local behavior, that
is, don't exhibit high frequencies of cost variations.

Like any feedback control system, it cannot possibly have the same
behavior for all frequencies. The control system I described has a
'low-pass' behavior: it works best on slowly moving costs... the problem is
that I couldn't come up with a way to tune the system for higher
frequencies without creating an impossible overhead on the system :/

But if you have a suggestion, I'm more than happy to follow it.

> What is a resource? Is it "Invoice detail view" or is it the "detail
> view of invoice number 5678665"?

A "resource", in this case, is every document fragment at all levels,
the output of each pipeline component.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Paulo Gaspar <pa...@krankikom.de>.
Answer inline:

> -----Original Message-----
> From: Stefano Mazzocchi [mailto:stefano@apache.org]
> Sent: Sunday, December 16, 2001 1:15 AM
> 
> Paulo Gaspar wrote:
> 
> ...
>
> > Otherwise the cache can end up
> > "thinking" that a given document was very expensive just because it
> > happened to be requested when the system was going through a load
> > peak, and it might take the place of a much more "expensive"
> > document which was always requested when the system was under lower
> > loads. If these documents have long "cache lives" a lot of capacity
> > can be lost for long periods of time due to such accidents.
> 
> True, but the user doesn't give a damn about "what" influenced the
> slowness of the document, neither does the cache: it's the actual result
> which is sampled, so maybe the document could take 10ms to generate
> without load (and the cache might take 12ms) but under load it takes
> 25ms and the cache 23ms.
> 
> ...

All you say is true. The problem is with what I wrote - I did not get
my message through.

What I mean is that the cache could accumulate quite some wrong cost 
measurements for long periods of time if they would happen with 
resources having a long "cache life".

Let's say you have resources A and B and that, under low load, A takes
100 ms to build and B takes 15 seconds. Let's also say that both have a
cache life of 24 hours.

If, when A is first requested, the system is under heavy load, it might
end up taking 30" to get (sh*t like this happens with database
resources). Half an hour later B is requested for the first time and
the system is already under low load, so it takes those usual 15".

For the next (almost) 24 hours, the cache will think that A is more
expensive than B.

What I say is that this kind of error can happen a lot. That will 
have a different cost depending on the characteristics of the system
(e.g.: longer cache lives => higher cost).

  
> > However, as you also mention, there is the cost of sampling. If you
> > have a processing time expensive document "A" with a maximum cache
> > lifetime of 24 hours that is usually requested 100 times a day...
> > and then you sample how much time it takes to get it 100 times a
> > day, the accuracy gets better but the cost of the sampling is as
> > big as the cost of not caching at all.
> 
> Yes, but the frequency of sampling a resource is inversely
> proportional to the difference between the costs of the two choices.
>
> ...

Again, all you say is true. The problem is with what I wrote - I did
not get my message through. Really, what I wrote is open to different
interpretations.

I was talking about how silly the possibility is of having more
sample data just by not using stuff that is already cached.

In this silly scenario, the system would try to get more data about how
expensive a resource with a long "cache life" is to obtain by, at "some"
requests "along the day", getting the resource again from its origin and
ignoring its cached version.

 
> > But usually you have families of documents that take a similar time
> > to process, like:
> >  - Articles with 4000 words or less without pictures stored in XML
> >    files;
> >  - Product records from a product catalog stored in a database;
> >  - Invoices with less than 10 items from a database.
> > 
> > If your time measurements are made per family, you will usually
> > end up with a much wider set of sample data and hence much more
> > representative results. 
> 
> how do you envision the cache estimating what a 'family of resources'
> is?

You have to tell it, as follows later in the text.


> > The system use will generate the repeated
> > samples and their distribution along the time (and along load peaks
> > and low load periods) will tend to be much more representative than
> > any other mechanism we could come up with.
> > 
> > Besides, your sampling data aggregated per family will take less
> > storage space, hence leaving more room for caching.
> > =:o)
> 
> good point.
 
It is the same thing you (better) described before in your reply with:

  True, but the user doesn't give a damn about "what" influenced the
  slowness of the document, neither does the cache: it's the actual result
  which is sampled, so maybe the document could take 10ms to generate
  without load (and the cache might take 12ms) but under load it takes
  25ms and the cache 23ms.

  This shows that the cost function is not CPU time or 'single document
  processing time', but it's a more global 'production time for that
  resource at this very time' and *MUST* include everything even your load
  peaks.


> > Now, you only mention one key generation method in your document:
> > 
> > >      | Result #2:                                               |
> > >      |                                                          |
> > >      | Each cacheable producer must generate the unique key of  |
> > >      | the resource given all the environment information at    |
> > >      | request time                                             |
> > >      |                                                          |
> > >      |   long generateKey(Environment e);                       |
> > 
> > Is this key per instance (the caching key) or per family/type of
> > resource?
> 
> per resource.

Maybe we have a vocabulary problem here:
  I think I do not understand what a resource is in Cocoon.
=:o/

Lets take my previous example:
> >  - Invoices with less than 10 items from a database.

What is a resource? Is it "Invoice detail view" or is it the "detail 
view of invoice number 5678665"?


> > The conclusion from all I wrote above is that a resource producer
> > should probably produce two keys: the cache key and a "sampling
> > family" key that would be used to aggregate cost sampling data.
> > From your document it is not clear if this is what you propose.
> 
> I can't think of a way to come up with 'resource families', but I'm open
> to suggestions.

Not sure, first I need to understand what "resource" really means!
=:o)


> > I would also prefer string keys since I do not see a meaningful
> > performance gain on using longs and I see much easier use with
> > strings, but this is just my opinion.
> 
> Well, I do: new String() is the single most expensive operation in the
> java.lang package.
> 
> The java golden rule for performance is: don't use strings if you can
> avoid it.

OTOH, if each request takes 10000 (ten thousand) times the time of
creating a String, and if creating the string really makes things much
easier...


Have fun,
Paulo Gaspar



Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Stefano Mazzocchi <st...@apache.org>.
Paulo Gaspar wrote:

> This is very interesting stuff! And I like very much the hotspot
> analogy. I was also following the initial cache discussion but I
> missed the document you attached.
> 
> I am only going to focus on the major obstacle I see:
>   - sampling accuracy.

ok

> In most cases (e.g.: the typical web site use) we can consider the
> cost function to be related to processing time, with storage volume as a
> restriction. Of course, under this perspective we would also
> have to consider serialization time, especially when using a 2
> level cache, but it is better to ignore that at the moment.
> 
> You also mention something like this in your text:
> 
> > The easiest cost function is "elapsed time": this indicates that the
> > most expensive resource is computing power and speed is the most
> > valuable property of functional operation.
> 
> Then you scratch the problem:
> 
> > A pure-Java implementation is possible with a time granularity of
> > milliseconds using System.currentTimeMillis(). Smaller granularities
> > require native hooks either to the system or to the JVM using JNI or the
> > JVM Native Profiling Interface. For most serving environments,
> > millisecond granularity is sufficient.
> 
> Actually, I think the typical (near 10 ms) granularity of Java is
> enough because there is another factor that affects the accuracy
> of this kind of measurement much more... and you scratch that one
> too:
> 
> > A pure-Java implementation of such "time + memory" cost function is
> > almost impossible due to concurrency: there is no way for a particular
> > thread to know how much memory it has used since the JVM gives
> > information only about the system free memory and doesn't allow more
> > granularity than that.
> 
> Concurrency also affects A LOT (especially on production systems)
> the accuracy of elapsed time measurement, by much more than the
> above mentioned (near 10 ms) granularity.
> 
> This happens not only because of the load on the machine where the
> system is running but also because of many other factors, like the load
> on the database server or the network you use to access the
> database and other external resources. In fact, it is impossible to
> know how much "pure" processing time it takes to obtain a resource
> since a lot of waiting (for other threads or for resource availability)
> is always involved.

I totally agree.

> The only hope one can have of getting some good sampling is through
> quantity: repeated random sampling.

Exactly. It can be shown easily that if the external properties that
affect the sampling are totally stochastic (i.e. random: their mean value
is zero), they aren't taken into consideration by the adaptive cache; on
the other hand, if those affecting properties are *NOT* stochastic, only
their average value is taken into consideration, but this is the behavior
we want.

> Otherwise the cache can end up
> "thinking" that a given document was very expensive just because it
> happened to be requested when the system was going through a load
> peak, and it might take the place of a much more "expensive"
> document which was always requested when the system was under lower
> loads. If these documents have long "cache lives" a lot of capacity
> can be lost for long periods of time due to such accidents.

True, but the user doesn't give a damn about "what" influenced the
slowness of the document, neither does the cache: it's the actual result
which is sampled, so maybe the document could take 10ms to generate
without load (and the cache might take 12ms) but under load it takes
25ms and the cache 23ms.

This shows that the cost function is not CPU time or 'single document
processing time', but it's a more global 'production time for that
resource at this very time' and *MUST* include everything even your load
peaks.

At the same time, adaptation means 'stability'... in fact in my
algorithms there is a value (the 'k' factor in the hyperbolic tangent)
that indicates the 'slope' of the sampling function; this is a tuning
factor that indicates how "fast" the cache system should adapt to
variations of the resource costs.

I also thought about making this automatic by performing a Fourier
Transform of the collected data, but that's probably too much of an
overhead for the system. Probably there are other ways to estimate that
'speed of adaptation', but this is definitely a second order tuning.
 
> However, as you also mention, there is the cost of sampling. If you
> have a processing time expensive document "A" with a maximum cache
> lifetime of 24 hours that is usually requested 100 times a day...
> and then you sample how much time it takes to get it 100 times a
> day, the accuracy gets better but the cost of the sampling is as
> big as the cost of not caching at all.

Yes, but the frequency of sampling a resource is inversely
proportional to the difference between the costs of the two choices.

So, for example, if your document takes 4 minutes to generate and 20ms to
cache, it's unlikely that this resource will soon end up being processed
faster than the cache, thus maybe only one request over 100000
is redirected to sampling.

On the other hand, if I have a resource that takes 20ms and caching
takes 15ms, I'd sample one over 20, so that I still take advantage of
the cache, but I can adapt faster if the processing time or the caching
time changes (which might be the case under load).

The 'k' factor I wrote about above indicates the 'steepness' of this function
that maps cost differences to sampling frequency: a small k will
increase adaptability but decrease overall caching efficiency, since more
requests are used for sampling; a big k will decrease adaptability but
improve overall caching efficiency (since fewer requests are used for
sampling).
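
A minimal sketch of such a mapping from cost difference to sampling
probability, assuming the hyperbolic tangent form mentioned above (the exact
formula and the `SamplingPolicy` name are my guesses, not code from the
proposal):

```java
// Sketch of a cost-difference -> sampling-probability mapping. The exact
// formula is a guess: the text only says a 'k' factor sets the slope of a
// hyperbolic tangent, and that a bigger cost gap means rarer sampling.
public class SamplingPolicy {

    private final double k; // steepness: small k adapts fast, big k caches more

    public SamplingPolicy(double k) {
        this.k = k;
    }

    // cachedCost and directCost are the estimated times (ms) of the two
    // paths; returns the probability of sampling the currently worse path.
    public double samplingProbability(double cachedCost, double directCost) {
        double diff = Math.abs(directCost - cachedCost);
        // tanh(diff / k) approaches 1 as the gap grows, so the sampling
        // probability shrinks toward 0; equal costs give a 50/50 split.
        return 0.5 * (1.0 - Math.tanh(diff / k));
    }
}
```

With a 4-minute generation cost against a 20ms cache cost, the probability is
effectively zero; with a 20ms-vs-15ms gap it stays high, matching the
one-over-100000 versus one-over-20 behavior described earlier in the thread.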

As I said, since tuning this value will easily become 'black art',
having frequency information (thus the need for an FFT) can allow us to
estimate the 'variability' of the production system and then tune the k
factor on this.

I have to think more about it.

> But usually you have families of documents that take a similar time
> to process, like:
>  - Articles with 4000 words or less without pictures stored in XML
>    files;
>  - Product records from a product catalog stored in a database;
>  - Invoices with less than 10 items from a database.
> 
> If your time measurements are made per family, you will usually
> end up with a much wider set of sample data and hence much more
> representative results. 

how do you envision the cache estimating what a 'family of resources'
is?

> The system use will generate the repeated
> samples and their distribution along the time (and along load peaks
> and low load periods) will tend to be much more representative than
> any other mechanism we could come up with.
> 
> Besides, your sampling data aggregated per family will take less
> storage space, hence leaving more room for caching.
> =:o)

good point.

> Now, you only mention one key generation method in your document:
> 
> >      | Result #2:                                               |
> >      |                                                          |
> >      | Each cacheable producer must generate the unique key of  |
> >      | the resource given all the environment information at    |
> >      | request time                                             |
> >      |                                                          |
> >      |   long generateKey(Environment e);                       |
> 
> Is this key per instance (the caching key) or per family/type of
> resource?

per resource.

> The conclusion from all I wrote above is that a resource producer
> should probably produce two keys: the cache key and a "sampling
> family" key that would be used to aggregate cost sampling data.
> From your document it is not clear if this is what you propose.

I can't think of a way to come up with 'resource families', but I'm open
to suggestions.
 
> I would also prefer string keys since I do not see a meaningful
> performance gain on using longs and I see much easier use with
> strings, but this is just my opinion.

Well, I do: new String() is the single most expensive operation in the
java.lang package.

The java golden rule for performance is: don't use strings if you can
avoid it.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Paulo Gaspar <pa...@krankikom.de>.
Hi Stefano and other caching maniacs,


I am just going to make a brief post now, since I am overloaded at 
the moment (hence my delay on replying).

This is very interesting stuff! And I like very much the hotspot 
analogy. I was also following the initial cache discussion but I 
missed the document you attached.


I am only going to focus on the major obstacle I see: 
  - sampling accuracy.


In most cases (e.g.: the typical web site use) we can consider the
cost function to be related to processing time, with storage volume as a
restriction. Of course, under this perspective we would also
have to consider serialization time, especially when using a 2
level cache, but it is better to ignore that at the moment.

You also mention something like this in your text:

> The easiest cost function is "elapsed time": this indicates that the
> most expensive resource is computing power and speed is the most
> valuable property of functional operation.

Then you scratch the problem:

> A pure-Java implementation is possible with a time granularity of
> milliseconds using System.currentTimeMillis(). Smaller granularities
> require native hooks either to the system or to the JVM using JNI or the
> JVM Native Profiling Interface. For most serving environments,
> millisecond granularity is sufficient.

Actually, I think the typical (near 10 ms) granularity of Java is
enough because there is another factor that affects the accuracy
of this kind of measurement much more... and you scratch that one
too:

> A pure-Java implementation of such "time + memory" cost function is
> almost impossible due to concurrency: there is no way for a particular
> thread to know how much memory it has used since the JVM gives
> information only about the system free memory and doesn't allow more
> granularity than that.

Concurrency also affects A LOT (especially on production systems) 
the accuracy of elapsed time measurement, by much more than the
above mentioned (near 10 ms) granularity. 

This happens not only because of the load on the machine where the
system is running but also because of many other factors, like the load
on the database server or the network you use to access the
database and other external resources. In fact, it is impossible to
know how much "pure" processing time it takes to obtain a resource
since a lot of waiting (for other threads or for resource availability)
is always involved.

The only hope one can have of getting some good sampling is through
quantity: repeated random sampling. Otherwise the cache can end up
"thinking" that a given document was very expensive just because it
happened to be requested when the system was going through a load
peak, and it might take the place of a much more "expensive"
document which was always requested when the system was under lower
loads. If these documents have long "cache lives" a lot of capacity
can be lost for long periods of time due to such accidents.
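
A sketch of how repeated samples could be smoothed so that one load-peak
measurement does not dominate the estimate; the exponential moving average
here is an illustrative choice of mine, not a mechanism specified in this
thread:

```java
// Sketch: smoothing repeated elapsed-time samples so a single request
// served during a load peak does not dominate the cost estimate. The
// exponential moving average is an illustrative choice, not a mechanism
// specified in the thread; names are invented.
public class SmoothedCost {

    private final double alpha;   // weight of the newest sample, 0 < alpha <= 1
    private double estimateMillis = -1.0;

    public SmoothedCost(double alpha) {
        this.alpha = alpha;
    }

    public void sample(double elapsedMillis) {
        if (estimateMillis < 0) {
            estimateMillis = elapsedMillis;          // first observation
        } else {
            estimateMillis = alpha * elapsedMillis
                           + (1.0 - alpha) * estimateMillis;
        }
    }

    public double estimate() {
        return estimateMillis;
    }
}
```

A single 30-second outlier on a 100 ms resource then shifts the estimate by
only a fraction `alpha` of the difference, instead of sticking for the next
24 hours.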

However, as you also mention, there is the cost of sampling. If you 
have a processing time expensive document "A" with a maximum cache 
lifetime of 24 hours that is usually requested 100 times a day... 
and then you sample how much time it takes to get it 100 times a 
day, the accuracy gets better but the cost of the sampling is as 
big as the cost of not caching at all.

But usually you have families of documents that take a similar time 
to process, like:
 - Articles with 4000 words or less without pictures stored in XML
   files;
 - Product records from a product catalog stored in a database;
 - Invoices with less than 10 items from a database.

If your time measurements are made per family, you will usually
end up with a much wider set of sample data and hence much more 
representative results. The system use will generate the repeated
samples and their distribution along the time (and along load peaks
and low load periods) will tend to be much more representative than
any other mechanism we could come up with.

Besides, your sampling data aggregated per family will take less 
storage space, hence leaving more room for caching.
=:o)


Now, you only mention one key generation method in your document:

>      | Result #2:                                               |
>      |                                                          |
>      | Each cacheable producer must generate the unique key of  |
>      | the resource given all the environment information at    |
>      | request time                                             |
>      |                                                          |
>      |   long generateKey(Environment e);                       |

Is this key per instance (the caching key) or per family/type of 
resource?


The conclusion from all I wrote above is that a resource producer
should probably produce two keys: the cache key and a "sampling
family" key that would be used to aggregate cost sampling data.
From your document it is not clear if this is what you propose.
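
The two-key idea could look something like this hypothetical interface (all
names are invented for illustration; this is not the actual Cocoon producer
API):

```java
// Hypothetical sketch of a two-key producer: one key identifies the cached
// instance, the other the "cost family" whose samples are aggregated.
// Interface and names are invented for illustration, not Cocoon API.
public interface CacheableProducer {

    // Unique per resource instance, e.g. "invoice-view/12456".
    String generateCacheKey(String resourceId);

    // Shared by all similar resources, e.g. "invoice-view".
    String generateSamplingFamilyKey(String resourceId);
}

// A trivial producer where every invoice shares one sampling family.
class InvoiceProducer implements CacheableProducer {

    public String generateCacheKey(String resourceId) {
        return "invoice-view/" + resourceId;
    }

    public String generateSamplingFamilyKey(String resourceId) {
        return "invoice-view";
    }
}
```

Here invoices #12456 and #12122 get distinct cache keys but the same sampling
family key, so their cost measurements land in one shared pool.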

I would also prefer string keys since I do not see a meaningful 
performance gain on using longs and I see much easier use with 
strings, but this is just my opinion.


Thanks for the great food for thought you always provide.


Have fun,
Paulo Gaspar	

http://www.krankikom.de
http://www.ruhronline.de
 

> -----Original Message-----
> From: Stefano Mazzocchi [mailto:stefano@apache.org]
> Sent: Wednesday, December 12, 2001 7:17 PM
> 
> 
> Talking about placing too many irons in the fire :)
> 
> Paulo Gaspar wrote:
> 
> > The "adaptive caching" idea just arrived too early but I hope it is not
> > forgotten.
> 
> You can bet your ass it's not :) [excuse my oxford english]
> 
> The concept I proposed (more than 6 months ago) about adaptive caching
> is difficult to understand because it's very different from the usual
> caching approach.
> 
> Let me try to explain it:
> 
> when a request arrives to a resource (might also be a pipeline
> fragment), you have two choices: ask the cache if the resource is still
> valid or go right ahead and regenerate it.
> 
> And then, after it's regenerated, should I save it for further use? if
> so, where? disk or memory? and if the memory is full, what should I
> throw away?
> 
> There is a very simple answer to all of this: do whatever gives you the
> best performance.
> 
> Period.
> 
> The real problem is: once I've taken a decision (cache or don't cache)
> how do I know what would have happened performance-wise if I had taken
> the other choice?
> 
> This is the key problem.
> 
> I propose a simple solution: add 'selection noise' and use incoming
> requests as sampling events on the system to test.
> 
> It's a trick: the objective is to reduce the time it takes to handle
> those requests, but I use them to obtain time information on the system
> and I superimpose a stochastic sampling on the decision-making process.
> 
> The problem is that sampling uses user requests, so we must reduce the
> 'noisy' behavior of these requests: my solution is to make this
> 'selection-noise' a function of the difference between the two paths.
> So, if going through the cache system is, on average, 3 times as fast, I
> use 1 request over 100. Otherwise, if the cache yields 20% improvement
> (1.2 times as fast), I sample 1 over 5 and so on.
> 
> This guarantees that:
> 
>  1) users don't perceive this since the faster one path is, the less
> frequently the other path is sampled.
> 
>  2) the system remains stable and adaptive: if sampling the other path
> reduces the difference, the frequency of sampling increases, thus
> ensuring a way to 'steer' the decision making following the actual
> system performance.
> 
> Sampling sacrifices a small peak performance for adaptability, but
> allows the cache system to transparently adapt even to those cases where
> caching makes it "slower" than regenerating the resource and avoiding
> storing it into the cache system (which also has a cost).
> 
> Another almost magic effect of this system is that we have a direct
> measure of the efficiency of the cache: assuming time-locality of
> actions, I have a way to measure directly the amount of CPU time the
> cache system saved.
> 
> How so?
> 
> Every time a request comes in, I have the information on the 'estimated'
> time the resource will need to be generated via the different routes.
> Once the decision is taken, I have the information on how much it actually
> took, and I can compare it with the "assumed" time it would have
> taken on the other path. Then I know how much time I *saved* with this
> choice.
> 
> But we don't stop here: we also have a way to measure the efficiency of
> the cache itself between cache hits and cache misses.
> 
> This information might be used to estimate the RAM a server would
> require to obtain near-maximum efficiency.
> 
> And all this without even touching a single configuration!!!
> 
> A wonderful example of what adaptivity adds to caching is that
> you get immediate feedback (with numbers!!!) on how much a change,
> say, in your database monitoring code, influences caching efficiency.
> 
> This is a completely innovative approach: normally, the decision whether or
> not to cache something is estimated a priori by system administrators,
> but in complex systems, very few people can make this decision and tune
> it for every change in the system.
> 
> Consider this as a hotspot caching machine.
> 
> Find attached my original message (WARNING: it's more than 20 printed
> pages!), maybe this time more people will understand it. :)
> 
> -- 
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <st...@apache.org>                             Friedrich Nietzsche
> --------------------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Berin Loritsch <bl...@apache.org>.
Stefano Mazzocchi wrote:

> Berin Loritsch wrote:
> 
> 
>>However, what I see as more practical is that the specific files cached
>>might change by the time of day.  For instance, web administrators usually
>>have log analyzers that will easily determine which parts of the site get
>>hit heaviest at which time of day.  This is very important.  For instance,
>>we may want the cache to artificially lengthen the ergodic period for specific
>>resources during their peak load in order to scale more gracefully as more
>>users view those resources.
>>
> 
> hmmm...


Got you thinking huh? :)


>>Such an adaptive cache would
>>be _more_ concerned with slowly degrading performance by serving stale data
>>rather than ensuring the data retrieved from the cache is the most up to date.
>>
> 
> Good point.
> 
> I see two ways of achieving this transparently:
> 
>  1) tune your Cacheable behavior for those resources that need this
>  2) add a site-wide Cacheable behavior that is summed to the other
> Cacheable events
> 
> of course, we must have a way to measure *machine load* from Java that is
> both fast and reliable (probably a sampling thread asking for 'uptime' or
> similar and storing the result someplace?)


This is true.  It is something that I have given some thought about, but
haven't done any research into yet.



>>This would be an adaptive cache that *could* possibly have pluggable or adaptable
>>policies regarding stale data.  For instance, a site like Slashdot can get away
>>with marginally stale data as the average user is not constantly using it
>>in day to day work.  However, such a policy would be very detrimental to a
>>corporate web site that managed the daily workflow of information that is
>>core to the company's needs.  It is also detrimental to web sites that require
>>that users cannot see each other's data, where privacy and security are
>>chief concerns.
>>
>>However, for sites like what D-Haven (if I get the time) is supposed to be,
>>stale data would be acceptable because the content would have a naturally
>>longer ergodic period to begin with.
>>
> 
> If you already know your site is not going to change for a given period
> of time, well, batch generate the site and be happy (like we do for
> *.apache.org). If parts are dynamic and parts are not, use both. If
> everything is dynamic (yes, there are cases where *everything* needs to
> be dynamic), you have to tune your cache.


And that is the point.  Right *now* D-Haven would be better off batch
generated.  However, when it gets to full potential, almost everything
will need to be dynamic.  You also have to consider the cost of batch
generating in a mixed environment.  For example, if only one or two
pages are batch generated, and the rest of a large site is dynamic,
it would be cheaper and easier for the one or two pages to remain
dynamic.


> 
> By having pluggable Cacheable logic, you can even say that, if load is
> higher than a certain amount, the cache is still valid, no matter what
> happened to the real data.


Exactly.


> 
> The IoC cache design allows you to do what you want, and adaptive
> caching can even turn caching off for those resources where even 'trying
> to cache' is more expensive than not trying at all.
> 
> But even adaptive caching will not remove your need to optimize and
> design a system in order to scale. No matter how smart it is.


All we need now is one that is not only adaptive, but makes predictions
based on probability ;P

(Sorry, I have been reading up on artificial intelligence, and probability
theory is pretty interesting...)




-- 

"They that give up essential liberty to obtain a little temporary safety
  deserve neither liberty nor safety."
                 - Benjamin Franklin


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Stefano Mazzocchi <st...@apache.org>.
Berin Loritsch wrote:

> However, what I see as more practical is that the specific files cached
> might change by the time of day.  For instance, web administrators usually
> have log analyzers that will easily determine which parts of the site get
> hit heaviest at which time of day.  This is very important.  For instance,
> we may want the cache to artificially lengthen the ergodic period for specific
> resources during their peak load in order to scale more gracefully as more
> users view those resources.

hmmm...

> In effect, this is similar to the approach that Slashdot uses in its server
> clustering.  They have a couple of dynamic servers with full-fledged capability,
> however the bulk of the site's use is reading the front page or possibly the
> extended story and comments (most users do not post).  This allows the
> Slashdot team to have a few servers that only have static pages--that are
> updated every 20-30 minutes.  That is quite an ergodic period for a news
> site.  As load increases to the point that the dynamic servers cannot sustain,
> additional static page servers are brought into the cluster.

Ah, ok, good example.

> This is a macro scale of what I was referring to, but it is similar to the
> concept of dropping packets to allow the server to slowly degrade as load
> increases instead of coming to a screeching halt.

Yes, good parallel.

> Such an adaptive cache would
> be _more_ concerned with slowly degrading performance by serving stale data
> rather than ensuring the data retrieved from the cache is the most up to date.

Good point.

I see two ways of achieving this transparently:

 1) tune your Cacheable behavior for those resources that need this
 2) add a site-wide Cacheable behavior that is summed to the other
Cacheable events

of course, we must have a way to measure *machine load* from Java that is
both fast and reliable (probably a sampling thread asking for 'uptime' or
similar and storing the result someplace?)
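A minimal sketch of such a sampling thread, assuming a reasonably recent JVM where `java.lang.management` exposes the one-minute load average (older JVMs would indeed have to shell out to 'uptime'); the class and its names are illustrative, not Cocoon code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Background thread that periodically samples the machine load average and
// stores it in a volatile field, so cache logic can read it cheaply.
class LoadSampler implements Runnable {
    private volatile double load = -1.0; // -1.0 means "no sample yet / unavailable"
    private final long intervalMillis;

    LoadSampler(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Latest sampled load average; negative if not available. */
    double currentLoad() {
        return load;
    }

    /** Takes one sample; called repeatedly by run(). */
    void sampleOnce() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        load = os.getSystemLoadAverage(); // 1-minute average, or -1.0 on some platforms
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            sampleOnce();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop sampling on shutdown
            }
        }
    }
}
```

A single daemon thread running this keeps the per-request cost down to one volatile read.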
 
> This would be an adaptive cache that *could* possibly have pluggable or adaptable
> policies regarding stale data.  For instance, a site like Slashdot can get away
> with marginally stale data as the average user is not constantly using it
> in day to day work.  However, such a policy would be very detrimental to a
> corporate web site that managed the daily workflow of information that is
> core to the company's needs.  It is also detrimental to web sites that require
> that users cannot see each other's data, where privacy and security are
> chief concerns.
> 
> However, for sites like what D-Haven (if I get the time) is supposed to be,
> stale data would be acceptable because the content would have a naturally
> longer ergodic period to begin with.

If you already know your site is not going to change for a given period
of time, well, batch generate the site and be happy (like we do for
*.apache.org). If parts are dynamic and parts are not, use both. If
everything is dynamic (yes, there are cases where *everything* needs to
be dynamic), you have to tune your cache.

By having pluggable Cacheable logic, you can even say that, if load is
higher than a certain amount, the cache is still valid, no matter what
happened to the real data.
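As a sketch of that idea (the `Cacheable` and `LoadSource` interfaces below are hypothetical stand-ins, not the real Cocoon API), a load-aware validity wrapper might look like:

```java
// Wraps any validity check so that, above a load threshold, the cached copy
// is reported as still valid no matter what happened to the real data.
class LoadSheddingValidity {
    interface Cacheable {
        boolean isStillValid();
    }

    /** Source of the current machine load; indirection keeps the sketch testable. */
    interface LoadSource {
        double currentLoad();
    }

    static Cacheable underLoadLimit(final Cacheable real,
                                    final double loadThreshold,
                                    final LoadSource loadSource) {
        return new Cacheable() {
            public boolean isStillValid() {
                if (loadSource.currentLoad() > loadThreshold) {
                    return true; // overloaded: serve stale data, degrade gracefully
                }
                return real.isStillValid(); // normal load: ask the real check
            }
        };
    }
}
```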

The IoC cache design allows you to do what you want, and adaptive
caching can even turn caching off for those resources where even 'trying
to cache' is more expensive than not trying at all.

But even adaptive caching will not remove your need to optimize and
design a system in order to scale. No matter how smart it is.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Berin Loritsch <bl...@apache.org>.
Gerhard Froehlich wrote:

>>From: Stefano Mazzocchi [mailto:stefano@apache.org]
>>
>>
> 
> <skip/>
> 
> Berin, you introduced some days ago a Profiler in Avalon Excalibur. Is that
> Component designed to handle such things?
> 
>   Gerhard (still trying to get the big picture of caching)


It was yesterday and today.  The "Profiler" is really a method of Instrumenting
Avalon based projects (like cocoon).  All that is really defined is a series
of interfaces that define the contracts involved in such a system.

However, if you were to have ProfilePoints that take a Resource and perform
your cost calculations on it, it can feed the adaptive nature of such a Cache.
In fact the CacheController would be a ProfileReport as well, receiving all the
samples from the ProfilePoints.  It can easily perform the adaptive cache
mechanism.

However, keep in mind that once the JVM is recycled, all the hard work and
statistical gathering is reset.  It might be worth considering a persistence
mechanism for the results of the adaptive tests.  There might even come a
point where the results of cache adaptation do not change, and themselves
become superfluous.  At what point this occurs, I do not know.  However,
if the results of the adaptation were stored persistently, the Cache decisions
could be made by a much simpler Cache implementation that merely followed
the instructions from the adaptation algorithm.
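A minimal sketch of such a persistence mechanism, assuming nothing more than a `java.util.Properties` file keyed by resource URI (the class and file format are illustrative, not an actual Avalon or Cocoon component):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

// Stores the learned cache/don't-cache decision per resource, so a simpler
// cache can replay them after a JVM restart instead of re-learning them.
class DecisionStore {
    private final Properties decisions = new Properties();

    void record(String resourceUri, boolean shouldCache) {
        decisions.setProperty(resourceUri, String.valueOf(shouldCache));
    }

    /** Defaults to caching for resources that were never measured. */
    boolean shouldCache(String resourceUri) {
        return !"false".equals(decisions.getProperty(resourceUri, "true"));
    }

    void save(File file) throws IOException {
        OutputStream out = new FileOutputStream(file);
        try {
            decisions.store(out, "adaptive cache decisions");
        } finally {
            out.close();
        }
    }

    void load(File file) throws IOException {
        InputStream in = new FileInputStream(file);
        try {
            decisions.load(in);
        } finally {
            in.close();
        }
    }
}
```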

In that case, the adaptive cache would only be useful during development.

However, what I see as more practical is that the specific files cached
might change by the time of day.  For instance, web administrators usually
have log analyzers that will easily determine which parts of the site get
hit heaviest at which time of day.  This is very important.  For instance,
we may want the cache to artificially lengthen the ergodic period for specific
resources during their peak load in order to scale more gracefully as more
users view those resources.

In effect, this is similar to the approach that Slashdot uses in its server
clustering.  They have a couple of dynamic servers with full-fledged capability,
however the bulk of the site's use is reading the front page or possibly the
extended story and comments (most users do not post).  This allows the
Slashdot team to have a few servers that only have static pages--that are
updated every 20-30 minutes.  That is quite an ergodic period for a news
site.  As load increases to the point that the dynamic servers cannot sustain,
additional static page servers are brought into the cluster.

This is a macro scale of what I was referring to, but it is similar to the
concept of dropping packets to allow the server to slowly degrade as load
increases instead of coming to a screeching halt.  Such an adaptive cache would
be _more_ concerned with slowly degrading performance by serving stale data
rather than ensuring the data retrieved from the cache is the most up to date.

This would be an adaptive cache that *could* possibly have pluggable or adaptable
policies regarding stale data.  For instance, a site like Slashdot can get away
with marginally stale data as the average user is not constantly using it
in day to day work.  However, such a policy would be very detrimental to a
corporate web site that managed the daily workflow of information that is
core to the company's needs.  It is also detrimental to web sites that require
that users cannot see each other's data, where privacy and security are
chief concerns.

However, for sites like what D-Haven (if I get the time) is supposed to be,
stale data would be acceptable because the content would have a naturally
longer ergodic period to begin with.

-- 

"They that give up essential liberty to obtain a little temporary safety
  deserve neither liberty nor safety."
                 - Benjamin Franklin


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


RE: Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Gerhard Froehlich <g-...@gmx.de>.
>From: Stefano Mazzocchi [mailto:stefano@apache.org]
>

<skip/>

>
>Every time a request comes in, I have the information on the 'estimated'
>time the resource will need to be generated on the different routes.
>Once the decision is taken, I have the information on how much it took
>to get it and I can compare it with the "assumed" time that would have
>taken on the other path. Then I know how much time I *saved* with this
>choice.
>
>But we don't stop here: we also have a way to measure the efficiency of
>the cache itself between cache hits and cache misses.
>
>This information might be used to estimate the RAM a server would
>require to obtain near-maximum efficiency.
>
>And all this without even touching a single configuration!!!
>
>A wonderful example of the advancement of adaptivity on caching is that
>you have an immediate feedback (with numbers!!!) on how much a change,
>say, in your database monitoring code, influences caching efficiency.
>
>This is a completely innovative approach because the decision whether or
>not to cache something is estimated a-priori by system administrators,
>but in complex systems, very few people can make this decision and tune
>it for every change in the system.
>
>Consider this as a hotspot caching machine.

Berin, you introduced some days ago a Profiler in Avalon Excalibur. Is that
Component designed to handle such things?

  Gerhard (still trying to get the big picture of caching)

-------------------------------------------
Beer is God's way of showing us he loves us 
and wants us to be happy...
-------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Adaptive Caching [was Re: initial checkin of the Scheme code]

Posted by Stefano Mazzocchi <st...@apache.org>.
Talking about placing too many irons in the fire :)

Paulo Gaspar wrote:

> The "adaptive caching" idea just arrived too early but I hope it is not
> forgotten.

You can bet your ass it's not :) [excuse my oxford english]

The concept I proposed (more than 6 months ago) about adaptive caching
is difficult to understand because it's very different from the usual
caching approach.

Let me try to explain it:

when a request arrives to a resource (might also be a pipeline
fragment), you have two choices: ask the cache if the resource is still
valid or go right ahead and regenerate it.

And then, after it's regenerated, should I save it for further use? if
so, where? disk or memory? and if the memory is full, what should I
throw away?

There is a very simple answer for all of this: do whatever gives you the
best performance.

Period.

The real problem is: once I've taken a decision (cache or don't cache)
how do I know what would have happened performance-wise if I had taken
the other choice?

This is the key problem.

I propose a simple solution: add 'selection noise' and use incoming
requests as sampling events on the system to test.

It's a trick: the objective is to reduce the time it takes to handle
those requests, but I use them to obtain timing information on the system
and I superimpose stochastic sampling on the decision-making process.

The problem is that sampling uses user requests, so we must reduce the
'noisy' behavior of these requests: my solution is to make this
'selection noise' a function of the difference between the two paths.
So, if going through the cache system is, on average, 3 times as fast, I
sample 1 request out of 100. Otherwise, if the cache yields a 20% improvement
(1.2 times as fast), I sample 1 out of 5, and so on.
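Those two anchor points can be connected by any monotonically decreasing curve; the one below is my own assumption, chosen only so that a bigger measured advantage means rarer sampling (it gives exactly 1/5 at 1.2x and about 1/400 at 3x, in the spirit of the figures above):

```java
import java.util.Random;

// 'Selection noise': occasionally take the path believed to be slower, with
// a probability that shrinks as the measured speed advantage grows.
class SelectionNoise {
    private final Random random = new Random();

    /**
     * Probability of deliberately sampling the slower path, given how many
     * times faster the preferred path currently measures.
     */
    static double samplingProbability(double speedRatio) {
        if (speedRatio <= 1.0) {
            return 0.5; // no measured winner yet: explore both paths equally
        }
        // Decays with the square of the advantage over 1.0x.
        double advantage = speedRatio - 1.0;
        return 1.0 / (1.0 + 100.0 * advantage * advantage);
    }

    /** Decide, per request, whether this request should sample the other path. */
    boolean sampleOtherPath(double speedRatio) {
        return random.nextDouble() < samplingProbability(speedRatio);
    }
}
```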

This guarantees that:

 1) users don't perceive this, since the faster one path is, the less
frequently the other path is sampled.

 2) the system remains stable and adaptive: if sampling the other path
reduces the difference, the frequency of sampling increases, thus
ensuring a way to 'steer' the decision making to follow the actual
system performance.

Sampling sacrifices a little peak performance for adaptability, but
allows the cache system to transparently adapt even to those cases where
caching makes a resource "slower" than regenerating it without
storing it in the cache system (storing also has a cost).

Another almost magic effect of this system is that we have a direct
measure of the efficiency of the cache: assuming time-locality of
actions, I have a way to measure directly the amount of CPU time the
cache system saved.

How so?

Every time a request comes in, I have the information on the 'estimated'
time the resource will need to be generated via the different routes.
Once the decision is taken, I have the information on how much it actually
took, and I can compare it with the "assumed" time it would have
taken on the other path. Then I know how much time I *saved* with this
choice.
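That bookkeeping can be sketched per resource as follows; the exponential smoothing and every name here are my assumptions, not actual Cocoon code:

```java
// Keeps a smoothed cost estimate for each path and credits (or debits) the
// time saved on every request versus the estimate for the path not taken.
class PathStats {
    private static final double ALPHA = 0.1; // smoothing factor (assumption)

    private double cachedEstimate = 0.0; // estimated ms going through the cache
    private double directEstimate = 0.0; // estimated ms regenerating directly
    private double totalSavedMillis = 0.0;

    /** Call when the request was served through the cache path. */
    void recordCachedRequest(double actualMillis) {
        if (directEstimate > 0.0) {
            // Saved = what regenerating would presumably have cost, minus reality.
            totalSavedMillis += directEstimate - actualMillis;
        }
        cachedEstimate = ema(cachedEstimate, actualMillis);
    }

    /** Call when the request regenerated the resource directly. */
    void recordDirectRequest(double actualMillis) {
        if (cachedEstimate > 0.0) {
            totalSavedMillis += cachedEstimate - actualMillis; // may be negative
        }
        directEstimate = ema(directEstimate, actualMillis);
    }

    private static double ema(double oldValue, double sample) {
        return oldValue == 0.0 ? sample : (1.0 - ALPHA) * oldValue + ALPHA * sample;
    }

    double totalSavedMillis() {
        return totalSavedMillis;
    }
}
```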

But we don't stop here: we also have a way to measure the efficiency of
the cache itself between cache hits and cache misses.

This information might be used to estimate the RAM a server would
require to obtain near-maximum efficiency.

And all this without even touching a single configuration!!!

A wonderful example of what adaptivity adds to caching is that
you get immediate feedback (with numbers!!!) on how much a change,
say, in your database monitoring code, influences caching efficiency.

This is a completely innovative approach: normally, the decision whether or
not to cache something is estimated a priori by system administrators,
but in complex systems, very few people can make this decision and tune
it for every change in the system.

Consider this as a hotspot caching machine.

Find attached my original message (WARNING: it's more than 20 printed
pages!), maybe this time more people will understand it. :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------

RE: initial checkin of the Scheme code

Posted by Paulo Gaspar <pa...@krankikom.de>.
Extremely interesting exchange of ideas.

Just to add something on the "adaptive caching" issue: it is possibly 
the only way to get really efficient caching. I do not think that it is 
"probably too difficult to understand", I just think that we do not know
enough and have to crawl before we walk and walk before we run.

Caching is so new to web applications that we are still fighting to make
it just work (caching what it should and not caching what it should not)
and easy enough to use.

We still have not gotten far enough in building a deep understanding of
which caching scenarios are most common and how a cache behaves in each
of those scenarios with the simple LRU/expiration policies we are
attempting. While we don't even understand that, it is hard to improve
on it.

After you get some real experience with the current caching mechanisms 
at Cocoon you will understand what can be improved and how. When Cocoon
is also so much evolved that you are not focusing on something else, you
will probably go back to this issue and you will probably find a way to
do it.

The "adaptive caching" idea just arrived too early but I hope it is not
forgotten.

Besides, there must be a lot of literature on this issue to explore out 
there. Something already in your collection Stefano?


Have fun,
Paulo Gaspar

http://www.krankikom.de
http://www.ruhronline.de


> -----Original Message-----
> From: Stefano Mazzocchi [mailto:stefano@apache.org]
> Sent: Wednesday, December 12, 2001 12:37 PM
> 
> Jason Foster wrote:
> 
> .......
>
> > This is where my "pseudo-postmodern" instincts start to twitch.  The
> > whole notion of a "best strategy" strikes me as being a little too
> > simplistic.  I'm thinking back now to your RT on caching from so long
> > ago.  As I understood your proposal, you were advocating a cache that
> > was in large part self-tuning.  You also allowed for developers (and
> > users) to choose which optimization function to use.  One could argue
> > that allowing different optimization functions was FS.
> 
> Absolutely. This is what RT are for: gather consensus and share visions
> on what is needed and what is pure esthetic abstraction.
> 
> If you and I have different FS alarms but both don't ring on a solution,
> I personally have the impression that this solution is better than one
> that makes one or the other alarm ringing, don't you think?
> 
> Some 5 years ago, I thought that the more freedom you give to a program, the
> better.
> 
> Then I met the design patterns metapattern: there are guidelines that
> 'guide' you and distill the experience of many brilliant people before
> you. By using them, you are not limiting your freedom, you are avoiding
> those rocky roads that would slow you down anyway.
> 
> I believe that semantics are patterns: I'd rather spend time improving
> the patterns and distill more and more working knowledge into them,
> rather than making it easier to people to try new ones out.
>  
> > BTW, it was a *really cool* approach to caching.
> 
> Thanks, but it was probably too difficult to understand, which is the reason
> why only a third of it was implemented.
>  


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: initial checkin of the Scheme code

Posted by Stefano Mazzocchi <st...@apache.org>.
Jason Foster wrote:

> > The golden rule to avoid FS is: give users all the freedom they need,
> > but no more than that.
> >
> > This is what we did for the sitemap and even it's entirely possible for
> > you to write your sitemap by hand in java, I don't know of anybody who
> > did it.
> 
> I would hazard that the rest of the Cocoon plumbing made that
> infeasibly difficult.  It's hard to know where the sitemap stops and
> Cocoon starts (and what are EventPipelines anyways!?).

What makes you think that a flowmap would be less hard to write by hand?
or with a strategy that was not built ad-hoc for it?

> > Don't get me wrong: you know how much I like modularity,
> > componentization and polymorphism, but I don't like to give users more
> > freedom than what they need because, otherwise, we'll end up having to
> > maintain and support all that freedom, wasting precious resources to
> > improve what we find out to be the best strategy.
> 
> This is where my "pseudo-postmodern" instincts start to twitch.  The
> whole notion of a "best strategy" strikes me as being a little too
> simplistic.  I'm thinking back now to your RT on caching from so long
> ago.  As I understood your proposal, you were advocating a cache that
> was in large part self-tuning.  You also allowed for developers (and
> users) to choose which optimization function to use.  One could argue
> that allowing different optimization functions was FS.

Absolutely. This is what RT are for: gather consensus and share visions
on what is needed and what is pure esthetic abstraction.

If you and I have different FS alarms but both don't ring on a solution,
I personally have the impression that this solution is better than one
that makes one or the other alarm ringing, don't you think?

Some 5 years ago, I thought that the more freedom you give to a program, the
better.

Then I met the design patterns metapattern: there are guidelines that
'guide' you and distill the experience of many brilliant people before
you. By using them, you are not limiting your freedom, you are avoiding
those rocky roads that would slow you down anyway.

I believe that semantics are patterns: I'd rather spend time improving
the patterns and distill more and more working knowledge into them,
rather than making it easier to people to try new ones out.
 
> BTW, it was a *really cool* approach to caching.

Thanks, but it was probably too difficult to understand, which is the reason
why only a third of it was implemented.
 
> Anyways, the notion of "best strategy" for modeling a webapp depends
> on a whole ton of factors like:
> 
> - what model are developers familiar with (if adoption is the key)
> - what model is most straightforward (to who?)
> - what model is the least verbose (if size matters)
> - what model makes the fewest demands on the programmer/architect/...
> - what model is most maintainable
> - what model requires the most supporting tools
> 
> I'm concerned that as we satisfice (which, don't get me wrong, is a
> good thing) we are going to prevent Cocoon from become the "Avalon of
> the Web Publishing World".  Avalon, as I understand it, doesn't
> enforce much of a conceptual model on the developer.  It does have
> one, but it is nice and generic and as such doesn't constrain
> developers.

:) Here is where you miss my entire point: Avalon is *very* strict. It
forces you to follow a behaviorally-based model based on IoC and SoC.

If you think this framework doesn't limit you, this means that we, the
Avalon architects, succeeded: in fact it does, but it does so in such a
way that you would go in that direction anyway (in fact, once they have
learned Avalon, people tend to use it all over their projects!)

If you don't have this impression with Cocoon, this doesn't mean that
the 'pattern' of limiting your freedom to the strictly needed is a bad
one, but simply that Cocoon is not yet architecturally advanced enough
for your needs.

Avalon stands to Java (for server programming) like Cocoon stands to
servlets (for web publishing).

You have expressed your vision on Cocoon indicating that you don't see
the sitemap as a strong enough concept to be included into those "core"
patterns that Cocoon proposes.

I tend to agree (even if I think it's a giant step in the right
direction).

I have the impression than once the sitemap is coupled with the flowmap,
you might think the opposite and find yourself in the same good
perception that Avalon gives you: by placing roadblocks before cliffs so
that you don't fall off, and not in the middle of the road.

> What sucks is that I agree with much of what you say regarding
> flexibility syndrome.  I'm grappling with SVG right now because there
> are so many ways to accomplish what appears visually to be the same
> thing.  On the other hand I don't want to alienate all of the
> [FSM|Prolog|WebObjects|...] developers who have a really good
> understanding of what have been shown to be pretty reasonable models
> of web development.

Believe me, we share the exact same concern, but as I don't feel sorry
for those programmers that dislike Cocoon just because it's written in
Java, we can't possibly avoid taking some freedom-limiting decisions.

At the same time, as others very well expressed, we must find the best
way to do something, even if this requires changing our mind a little
bit (and learn concepts like Scheme continuations).

Keep in mind, though, that the result is always a meritocratic vote,
thus a well balanced situation.

> Unless of course what we come up with really is better on all axes,
> in which case full steam ahead!

That's exactly my point: we have generators/transformers/serializers.
They are SAX-based components. This is a *big* limitation. Should we
create a new strategy for pipelines for those who don't care about
performance and want to use DOM?

Here is an example where abstraction is FS, until somebody indicates a
need for this.

On the other hand, the Cocoon Environment is abstracted: but this is
required to have an elegant way to support both CLI and servlet behavior
(and possibly, Avalon-component behavior). But, believe me, I tried so
hard to avoid this because it imposed a big architectural design change.
We couldn't find a better way and we abstracted it.

Look at the discussion on the sitemap engine: if pluggable, people might
add their own, but then we'll have different versions, possibly
incompatible, different docs, different behaviors... is this freedom
helping the users at the end or hurting them?

I can go on forever.

I finish with the last example, the one that led me to think of Avalon:
unlike C++, Java restricts multiple inheritance for classes. This is a
pretty serious freedom limitation for programmers. But it forced me to
think at a way to decomponse my components by projecting them into
different multi-dimensional concern hyperplanes. (of course, having
idendependently came up with this I didn't know the SoC terminology when
I started, I was calling this "separation of contexts")

Result: if Java had multiple inheritance, we wouldn't have Avalon, nor
would I have understood the importance of SoC as a metric for system
architectures (thus we wouldn't have Cocoon).

Sure, it probably took decades of mistakes with multiple inheritance to
lead the Java inventors to the *big* decision to remove that feature.

But that results in the best Java feature being the *lack* of something.
Clearly something that doesn't show off on the cover of a book, but it
does tell you something about the dangers of FS.

> Jason Foster
> 
> P.S.  How about using session URLs of the form...
> 
> http://session:identifier@host/cocoon/webapp
> 
> In a lot of browsers (but not all) if you follow one of these URLs
> the browser remembers the credentials for the duration of the
> session.  Presto!  No URL rewriting, no cookies.  Might have problems
> with multiple windows though...

Hmmm, surely worth investigating further.... thanks.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: initial checkin of the Scheme code

Posted by Jason Foster <ja...@uwaterloo.ca>.
>> An (IMHO) equally valid "theory" is that we could also make the 
>> semantics
>> pluggable.
>
> I disagree here, since it looks like FS from every angle I look at it.

<grumble/>  You're right, but...

> Well, your statement could easily be extended to say that one 
> concept is
> a Turing machine, everything else is implementation.
>
> And you'd be totally right even in this case.
>
> But as I wouldn't want to implement my webapps in assembly, I don't
> think that this level of generality is going to be of any help for the
> project or the users.

Fair enough.  By the same token though, I am skeptical that you 
believe that all programming should be done using the imperative 
model.  I'm all for abstraction, but am leery of a single conceptual 
framework in which I must maneuver.

<snip/>

>>  I can't wait to see how the "implicit"
>>   approach to defining a webapp works (Prolog! Prolog!), but I 
>> would also
>> like to make sure that we have a common vision of what webapps, flow,
>> resources, etc. actually are and what we're trying to accomplish.
>
> Ok, good point. I'll post a summary real soon.

Thanks!

<snip/>

> The golden rule to avoid FS is: give users all the freedom they need,
> but no more than that.
>
> This is what we did for the sitemap, and even though it's entirely
> possible to write your sitemap by hand in Java, I don't know of anybody
> who did it.

I would hazard that the rest of the Cocoon plumbing made that 
infeasibly difficult.  It's hard to know where the sitemap stops and 
Cocoon starts (and what are EventPipelines anyways!?).

<snip/>

> Don't get me wrong: you know how much I like modularity,
> componentization and polymorphism, but I don't like to give users more
> freedom than what they need because, otherwise, we'll end up having to
> maintain and support all that freedom, wasting precious resources to
> improve what we find out to be the best strategy.

This is where my "pseudo-postmodern" instincts start to twitch.  The 
whole notion of a "best strategy" strikes me as being a little too 
simplistic.  I'm thinking back now to your RT on caching from so long 
ago.  As I understood your proposal, you were advocating a cache that 
was in large part self-tuning.  You also allowed for developers (and 
users) to choose which optimization function to use.  One could argue 
that allowing different optimization functions was FS.

BTW, it was a *really cool* approach to caching.

Anyways, the notion of "best strategy" for modeling a webapp depends 
on a whole ton of factors like:

- what model are developers familiar with (if adoption is the key)
- what model is most straightforward (to who?)
- what model is the least verbose (if size matters)
- what model makes the fewest demands on the programmer/architect/...
- what model is most maintainable
- what model requires the most supporting tools

I'm concerned that as we satisfice (which, don't get me wrong, is a 
good thing) we are going to prevent Cocoon from becoming the "Avalon of 
the Web Publishing World".  Avalon, as I understand it, doesn't 
enforce much of a conceptual model on the developer.  It does have 
one, but it is nice and generic and as such doesn't constrain 
developers.

What sucks is that I agree with much of what you say regarding 
flexibility syndrome.  I'm grappling with SVG right now because there 
are so many ways to accomplish what appears visually to be the same 
thing.  On the other hand I don't want to alienate all of the 
[FSM|Prolog|WebObjects|...] developers who have a really good 
understanding of what have been shown to be pretty reasonable models 
of web development.

Unless of course what we come up with really is better on all axes, 
in which case full steam ahead!

Jason Foster

P.S.  How about using session URLs of the form...

http://session:identifier@host/cocoon/webapp

In a lot of browsers (but not all) if you follow one of these URLs 
the browser remembers the credentials for the duration of the 
session.  Presto!  No URL rewriting, no cookies.  Might have problems 
with multiple windows though...




Re: initial checkin of the Scheme code

Posted by Stefano Mazzocchi <st...@apache.org>.
Jason Foster wrote:
> 
> > Moreover, now that we clearly identified the SoC between
> > syntax/semantics and many examples showed that the same semantics could
> > be equally expressed with different syntaxes, it's good that we now
> > concentrate on the semantics and let the syntax be the last of our
> > concerns.
> >
> > In theory, if one likes, we could make the syntax wrapper pluggable, so
> > that you could write your flow-driving code with the syntax you like the
> > most (or it best fits your favorite IDE)
> 
> An (IMHO) equally valid "theory" is that we could also make the semantics
> pluggable.

I disagree here, since it looks like FS from every angle I look at it.

> A little while back I sent in a message that tried to raise the issue of
> what aspects of Cocoon should be rigidly specified and what aspects should
> be open.  So far there have been no replies, but I'd like to re-raise the
> issue anyways.

Yes, I forgot to reply to that, I'll do it now.
 
> As things stand Cocoon appears to define the following conceptual
> invariants:
> 
> - Requests and Responses
> - Pipelines
> - Generators
> - Transformers
> - Serializers
> 
> The sitemap (which I consider separate from the core aspects of Cocoon, but

Ok, totally. I agree with you.

> please don't hold that against me) also defines at least one other concept:
> 
> - Views
> 
> Everything else is implementation ;)

Well, your statement could easily be extended to say that one concept is
a Turing machine, everything else is implementation.

And you'd be totally right even in this case.

But as I wouldn't want to implement my webapps in assembly, I don't
think that this level of generality is going to be of any help for the
project or the users.
 
> The current discussion of flows and resources is important (and fun!) but
> it seems to be moving very quickly towards implementation without having
> resolved some key conceptual issues.

Hmmm, I don't resonate with this view of the problem.

>  I can't wait to see how the "implicit"
>   approach to defining a webapp works (Prolog! Prolog!), but I would also
> like to make sure that we have a common vision of what webapps, flow,
> resources, etc. actually are and what we're trying to accomplish.

Ok, good point. I'll post a summary real soon.

> For example a webapp could be seen as:
> 
> - a "standard" event-driven, object-oriented application where web requests
> are translated to method invocations (WebObjects)
> - a continuation-based application where the choice of continuation is
> based on a request parameter
> - a FSM
> - a set of web pages hooked together manually by a developer who manages
> session and persistence aspects manually
> - a test for provability given axioms and theorems
> 
> Should Cocoon provide a formal definition of a webapp?  Personally I don't
> think so.  My (not terribly well informed) opinion is that Cocoon should
> add one more concept, the "Controller", which is an implementation of the
> GoF strategy pattern, and that is responsible for controlling the response
> to requests.  Different controllers would embody different models of
> webapps.

My FS alarms are still ringing.

The golden rule to avoid FS is: give users all the freedom they need,
but no more than that.

This is what we did for the sitemap, and even though it's entirely
possible to write your sitemap by hand in Java, I don't know of anybody
who did it.

For the same reason, I'd rather investigate which one of these
strategies fits best with the SoC concepts we based Cocoon on and
concentrate on that.

If (and only *if*) people ask for a different strategy, we'll consider
implementing a pluggable system; but, for now, I consider this elegant
only on paper, and very dangerous for both the project and the
community because it would spread the usage over different strategies
instead of evolving one into the best.

Don't get me wrong: you know how much I like modularity,
componentization and polymorphism, but I don't like to give users more
freedom than what they need because, otherwise, we'll end up having to
maintain and support all that freedom, wasting precious resources to
improve what we find out to be the best strategy.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





Re: initial checkin of the Scheme code

Posted by Jason Foster <ja...@uwaterloo.ca>.
> Moreover, now that we clearly identified the SoC between
> syntax/semantics and many examples showed that the same semantics could
> be equally expressed with different syntaxes, it's good that we now
> concentrate on the semantics and let the syntax be the last of our
> concerns.
>
> In theory, if one likes, we could make the syntax wrapper pluggable, so
> that you could write your flow-driving code with the syntax you like the
> most (or it best fits your favorite IDE)

An (IMHO) equally valid "theory" is that we could also make the semantics 
pluggable.

A little while back I sent in a message that tried to raise the issue of 
what aspects of Cocoon should be rigidly specified and what aspects should 
be open.  So far there have been no replies, but I'd like to re-raise the 
issue anyways.

As things stand Cocoon appears to define the following conceptual 
invariants:

- Requests and Responses
- Pipelines
- Generators
- Transformers
- Serializers

The sitemap (which I consider separate from the core aspects of Cocoon, but 
please don't hold that against me) also defines at least one other concept:

- Views

Everything else is implementation ;)

The current discussion of flows and resources is important (and fun!) but 
it seems to be moving very quickly towards implementation without having 
resolved some key conceptual issues.  I can't wait to see how the
"implicit" approach to defining a webapp works (Prolog! Prolog!), but I would also 
like to make sure that we have a common vision of what webapps, flow, 
resources, etc. actually are and what we're trying to accomplish.

For example a webapp could be seen as:

- a "standard" event-driven, object-oriented application where web requests 
are translated to method invocations (WebObjects)
- a continuation-based application where the choice of continuation is 
based on a request parameter
- a FSM
- a set of web pages hooked together manually by a developer who manages 
session and persistence aspects manually
- a test for provability given axioms and theorems

Should Cocoon provide a formal definition of a webapp?  Personally I don't 
think so.  My (not terribly well informed) opinion is that Cocoon should 
add one more concept, the "Controller", which is an implementation of the 
GoF strategy pattern, and that is responsible for controlling the response 
to requests.  Different controllers would embody different models of 
webapps.
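For what it's worth, the Controller idea can be sketched as a plain GoF Strategy. All names below are hypothetical, not existing Cocoon classes; the point is only that different webapp models (FSM, continuations, ...) would each be one implementation of the same narrow interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Controller strategy: given the request parameters, decide
// which resource/pipeline should produce the response.
interface Controller {
    String handle(Map<String, String> requestParams);
}

// One possible model: a finite-state machine keyed on a request parameter.
// A continuation-based or Prolog-style controller would be another
// implementation of the same interface.
class FsmController implements Controller {
    private final Map<String, String> transitions = new HashMap<>();

    FsmController() {
        transitions.put("start", "login-page");
        transitions.put("login", "welcome-page");
    }

    public String handle(Map<String, String> requestParams) {
        String state = requestParams.getOrDefault("state", "start");
        return transitions.getOrDefault(state, "error-page");
    }
}

public class ControllerDemo {
    public static void main(String[] args) {
        Controller fsm = new FsmController();
        System.out.println(fsm.handle(Map.of("state", "login")));  // welcome-page
        System.out.println(fsm.handle(Map.of()));                  // login-page
    }
}
```

The rest of Cocoon (pipelines, generators, serializers) would stay unchanged; only the choice of Controller implementation would vary per webapp.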

Comments?  I don't think I'm out to lunch, but you never know...

Jason




Re: initial checkin of the Scheme code

Posted by Stefano Mazzocchi <st...@apache.org>.
just some over-the-head comments from a community-leadership
perspective:

Ovidiu Predescu wrote:
> 
> Hi,
> 
> I've just checked in the initial code for the Scheme integration. It
> is in scratchpad/schecoon/ (for lack of a better name) in the main
> trunk.

Let me state this loud and clear: we should all work together to avoid
having forking friction developing inside Cocoon.

I love the scratchpad idea, where committers can present their ideas and
work on them in public. I think it's a great thing and should not be
regulated at all (meaning you can add your own stuff over there as you
like), but you need a formal vote to move things from there to the main
trunk.

I state this even though it's obvious, so that everybody knows it and
it's crystal clear.

Mixing XML+Java+Scheme has not been attempted before, AFAIK, and as
we've seen, it's a potentially explosive mix that can hurt feelings.

At the same time, I think all Cocoon 2.0 users understand that if you
find it natural to program a publishing system with an XML file (the
sitemap), then programming the flow of your application with a scripting
language is not that distant from the same concept.

Moreover, now that we clearly identified the SoC between
syntax/semantics and many examples showed that the same semantics could
be equally expressed with different syntaxes, it's good that we now
concentrate on the semantics and let the syntax be the last of our
concerns.

In theory, if one likes, we could make the syntax wrapper pluggable, so
that you could write your flow-driving code with the syntax you like
most (or the one that best fits your favorite IDE).

This said, I'm *very* interested in the 'semantics' that Ovidiu is
playing with, because I think it would give Cocoon a potentially giant
expressive power compared to all other server-side Java-based solutions,
and I'll keep a neutral (and tolerant) position on the syntax that he
will use to prove these concepts useful.

I kindly suggest that all of you do the same.

If the concept proves successful from a semantic point of view, we'll
debate which syntax is better, or we'll make it possible for everybody
to use their preferred syntax; but this is not something we should
debate until the concepts are fully expressed, outlined and tested.
 
> As it is right now, it doesn't do much. There are two servlets,
> REPLServlet and REPLEvalServlet. The first one is supposed to be the
> main entry point for the new Scheme-driven Cocoon. Eventually it will
> execute the Scheme function that will execute the sitemap.
> 
> The second servlet is used for development only. It is mounted as
> /schecoon/eval under the servlet, and it accepts POST messages
> containing Scheme code to evaluate. Because this servlet is a security
> risk, you need to configure a flag in web.xml to enable it. Once this
> flag is set to true, the servlet is setup to use basic
> authentication. To set this up with Tomcat 3.3 for example, add the
> following entry in conf/users/global-users.xml:
> 
> <user name="cocoon" password="your password" roles="cocoon_admin"/>
> 
> There is a command line driver for the interpreter in the util
> package. This will read in an infinite loop a Scheme expression from
> the command line, POST it to the servlet and print out the
> response. This application is intended to be used with Emacs, read the
> README file in the emacs/ directory. From within Emacs, you can edit
> your code, and when you want to test it, you type a few keys which will
> send the code to the servlet Scheme engine to be evaluated.
>
> The SISC jar file is built from the latest sources in CVS. With the
> simple servlet I've just described above, it adds less than 2Mb to the
> runtime memory.
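[Editor's note: for readers trying to reproduce the setup Ovidiu describes, the web.xml wiring might look roughly like the sketch below. This is a guess for illustration only: the servlet class name, the init-param name of the enabling flag, and the URL pattern are assumptions, not taken from the schecoon code; only basic authentication and the cocoon_admin role are described above.]

```xml
<!-- Hypothetical sketch of the eval servlet's web.xml entries. -->
<servlet>
  <servlet-name>eval</servlet-name>
  <!-- assumed class name -->
  <servlet-class>org.apache.cocoon.scheme.servlet.REPLEvalServlet</servlet-class>
  <init-param>
    <!-- assumed flag name: the eval servlet stays disabled unless true -->
    <param-name>enable-eval</param-name>
    <param-value>true</param-value>
  </init-param>
</servlet>

<!-- Require basic auth with the cocoon_admin role for the eval URL -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>Scheme eval</web-resource-name>
    <url-pattern>/eval</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>cocoon_admin</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Schecoon Eval</realm-name>
</login-config>
```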

I'll dive into that tomorrow.

Many thanks for doing this :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------


