Posted to dev@cocoon.apache.org by Upayavira <uv...@upaya.co.uk> on 2003/03/31 21:11:51 UTC

CLI ideas (long)

Dear All,

Below is a summary of a brief exchange with Nicola Ken 
regarding CLI ideas I'd like to implement. He has encouraged 
me to 'go public', which I am now doing.  

My aim in what follows is twofold: make the CLI into something 
that is useful to a project I am working on, and also to make 
the CLI into something that people would prefer to use as 
opposed to something like wget. [Confession: I'm afraid I 
still use wget myself.]

--ModifiableSources--

The Bean's Destination objects will be replaced with
ModifiableSources.

A destination can be configured with a uri which identifies a
ModifiableSource via its protocol. So instead of specifying a
Destination, you specify a uri, and a sourceResolver identifies the
ModifiableSource.

Then all that remains is how to work out the actual uri of a
particular output file, based upon the destination uri and the
source uri. 

This can be done by specifying the source uri as a base
and a path. The base is used to request a page, but is not appended
to the output uri. When the source path should be inserted into the
destination URI, the insertion point can be marked with a *, e.g.

ftp://ftp.host.com/htdocs/*
zip://path/*@foo.zip

If no * is present, and the destination URI ends with a /, the
source path is appended to the output uri. If it does not end with a
/, the output URI only is used, and the source path is discarded.
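
The three rules above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name DestinationUriMapper and the method resolve are made up for this example and are not part of Cocoon, only the first * is treated as the insertion point, and the source path is assumed to carry no leading slash.

```java
// Hypothetical helper, not part of Cocoon: applies the three rules described above.
final class DestinationUriMapper {

    private DestinationUriMapper() {}

    /**
     * @param destinationUri destination URI as configured, e.g. "ftp://ftp.host.com/htdocs/*"
     * @param sourcePath     source path without its base, e.g. "docs/index.html"
     * @return the URI the generated page would be written to
     */
    static String resolve(String destinationUri, String sourcePath) {
        int star = destinationUri.indexOf('*');
        if (star >= 0) {
            // Rule 1: a '*' marks the insertion point for the source path.
            return destinationUri.substring(0, star)
                    + sourcePath
                    + destinationUri.substring(star + 1);
        }
        if (destinationUri.endsWith("/")) {
            // Rule 2: no '*' and a trailing '/': the source path is appended.
            return destinationUri + sourcePath;
        }
        // Rule 3: otherwise the destination URI is used alone; the path is discarded.
        return destinationUri;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ftp://ftp.host.com/htdocs/*", "docs/index.html"));
        System.out.println(resolve("zip://path/*@foo.zip", "docs/index.html"));
        System.out.println(resolve("file:///tmp/out/", "docs/index.html"));
        System.out.println(resolve("file:///tmp/single.html", "docs/index.html"));
    }
}
```

Note that under rule 3 every page sent to such a destination ends up at the same URI, so that form only makes sense for a single page.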
 
The source uri is identified by combining the base & the path
(separated by a file separator).

Exactly how these will be supplied within the xconf, I've still to
work out, but each page will need a source base, source path and a
destination URI. 
 
The Bean gets a ComponentManager from its Cocoon object, and uses
this to get a SourceResolver for its own use.

Basically, this allows maximum configuration of resulting URIs. At present, for a 
URI of /site/page.html, this will be put into $dest/site/page.html. But what if you 
want that page to be at the root of your site, i.e. you don't want 'site' at the 
beginning? Well, in this case, you'd specify '/site' as the base and 'page.html' as 
the path. If page.html contained a link to something/anotherpage.html, this would 
have /site as its base and something/anotherpage.html as its path.

So any pages that are linked to will inherit the destination URI and source base 
from the linking page.
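
As a rough sketch of that inheritance, assuming a small value type that the Bean might keep per page: the name PageTarget and its methods are hypothetical, '/' is assumed as the separator between base and path, and the linked path is assumed to be already expressed relative to the base.

```java
// Hypothetical value type, not an existing Cocoon class.
final class PageTarget {
    final String sourceBase;      // e.g. "/site": used to request the page, never part of the output URI
    final String sourcePath;      // e.g. "page.html"
    final String destinationUri;  // e.g. "ftp://ftp.host.com/htdocs/*"

    PageTarget(String sourceBase, String sourcePath, String destinationUri) {
        this.sourceBase = sourceBase;
        this.sourcePath = sourcePath;
        this.destinationUri = destinationUri;
    }

    /** The URI requested from Cocoon: base and path joined by exactly one '/'. */
    String sourceUri() {
        String base = sourceBase.endsWith("/")
                ? sourceBase.substring(0, sourceBase.length() - 1)
                : sourceBase;
        String path = sourcePath.startsWith("/") ? sourcePath.substring(1) : sourcePath;
        return base + "/" + path;
    }

    /** A linked page inherits the source base and destination URI of the linking page. */
    PageTarget forLink(String linkedPath) {
        return new PageTarget(sourceBase, linkedPath, destinationUri);
    }

    public static void main(String[] args) {
        PageTarget page = new PageTarget("/site", "page.html", "ftp://ftp.host.com/htdocs/*");
        PageTarget linked = page.forLink("something/anotherpage.html");
        System.out.println(page.sourceUri());    // /site/page.html
        System.out.println(linked.sourceUri());  // /site/something/anotherpage.html
    }
}
```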
 
---FTPSource------
I need a writable FTPSource for a project I'm working on. Nicola Ken suggested 
looking into a VFSSource, which I will do. It shouldn't be hard to produce a 
ModifiableSource for it. I presume, to make it work, you'd configure multiple 
protocols in cocoon.xconf to point to the same code, e.g. ftp, zip, smb, etc.

[Caching thoughts removed - return to that some time later]

---Configuring a ModifiableSource---
From all of the Sources I've seen, I haven't seen ways to pass configuration 
parameters into them. For example, how might one tell an FTPSource to use 
passive as opposed to active FTP? Any ideas?

---Source Caching------
> Cocoon probably has a lot of code for caching sources. There are two sides to caching: improving processing time by reducing workload, and reducing writing time by not updating pages that have not changed.

The former is already managed by Cocoon. The latter would require the pipeline to 
report if the serializer output was read from the cache. If so, content isn't written. 

[Nicola Ken - I didn't understand this bit of your reply:]
> Actually even the former is managed by Cocoon, I don't remember where but
> IIRC the Environment has such an info, only that in the current
> implementation of the CLI environments it's unimplemented.

As Nicola Ken pointed out, the links of every page would need to be cached, because 
when a page is found to be already on disk and up to date, you still need its 
links for crawling. Hmm. 
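
One way to picture the link cache is the sketch below. It is an assumption-laden illustration, not a design: LinkCachingCrawler, PageGenerator and FreshnessCheck are invented names, the cache lives only in memory (a real one would have to persist between CLI runs), and the up-to-date check is left abstract.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical crawler that remembers each page's links so that
// up-to-date pages can be skipped without losing the crawl.
final class LinkCachingCrawler {

    /** Stand-in for generating and writing a page; returns the links found in it. */
    interface PageGenerator {
        List<String> generateAndWrite(String uri);
    }

    /** Stand-in for the "is the copy at the destination up to date?" check. */
    interface FreshnessCheck {
        boolean isUpToDate(String uri);
    }

    private final Map<String, List<String>> cachedLinks = new HashMap<>();

    /** Crawls from startUri and returns every URI that was visited. */
    Set<String> crawl(String startUri, PageGenerator generator, FreshnessCheck freshness) {
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUri);
        while (!queue.isEmpty()) {
            String uri = queue.poll();
            if (!visited.add(uri)) {
                continue; // already handled in this run
            }
            List<String> links;
            if (freshness.isUpToDate(uri) && cachedLinks.containsKey(uri)) {
                links = cachedLinks.get(uri); // skip regeneration, reuse remembered links
            } else {
                links = generator.generateAndWrite(uri);
                cachedLinks.put(uri, links);
            }
            queue.addAll(links);
        }
        return visited;
    }
}
```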

---Threading---
Threading needs reworking as the ThreadedDestination would become
deprecated. 

The bean either needs to have threaded processing built in, or I need to create 
something like a 'threadable source', using something like threaded:ftp://blah. 
Don't know which yet.

There are two possible forms of threading: generation and dispatch
threading. 
 
In generation threading, multiple pages are generated
simultaneously. The benefit of this is that pages are likely to
appear more synchronously at the destination. The downside is that
the processor needs to be switching between multiple threads 
(assuming a single-processor machine).
 
Dispatch threading involves sequential page generation and then
handing the generated content to a thread-pool to handle dispatching
the content to its destination. The benefit of this is greater speed of
delivery when delivery takes place over a slow network connection.

This kind of threading is important for a system that I want to use the CLI for. 
The pages bear no relevance to each other, and speed of delivery is 
important.  (I don't plan to implement generation threading ATM).
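
A minimal sketch of dispatch threading, assuming a modern JDK with java.util.concurrent: pages are generated one after another on the calling thread, and only the delivery is handed to a thread pool. DispatchThreadingSketch, Dispatcher and generate are hypothetical names; generate stands in for running the Cocoon pipeline and Dispatcher for writing to a ModifiableSource.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class DispatchThreadingSketch {

    /** Hypothetical stand-in for writing content to its destination (FTP, file, ...). */
    interface Dispatcher {
        void dispatch(String destinationUri, byte[] content) throws Exception;
    }

    /** Generates pages sequentially and dispatches them on a pool of worker threads. */
    static void publish(List<String> pageUris, Dispatcher dispatcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (String uri : pageUris) {
                byte[] content = generate(uri); // sequential, on the calling thread
                pool.submit(() -> {
                    try {
                        dispatcher.dispatch(uri, content); // possibly slow network write
                    } catch (Exception e) {
                        System.err.println("Failed to dispatch " + uri + ": " + e);
                    }
                });
            }
        } finally {
            pool.shutdown(); // accept no new tasks; queued dispatches still run
        }
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Placeholder for page generation. */
    static byte[] generate(String uri) {
        return ("<html>" + uri + "</html>").getBytes(StandardCharsets.UTF_8);
    }
}
```

Because generation stays single-threaded, the pipeline itself needs no changes; only the Dispatcher implementation has to be safe to call from several threads.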

Threading is either configured once for an xconf file, or as a part of a threaded
source URL. Default would be no threading.

Final comments from Nicola Ken:
> What about a publish-subscribe model, with complete decoupling from
> the publishing and the handling?

Can you explain more what you mean by this?

> As points that are important, I would say in order:
> 
>   1) make Cocoon *not* output the pages that have an error
>   2) make cocoon output xxxpagename.error.txt with the errors
>      of the 'xxxpagename' page (configurable)
>   3) make the report on broken links in XML so that it can be
>      added to the site (where to put it configurable)
>   4) make the content not regenerated if uptodate (very important
>      from a user perspective POV)
>   5) use ModifiableSource instead of Destination
>   6) others
> 
> Feel free to do whatever in whatever order you prefer, this is just
> what IMVHO is the priority. 1+2 are needed BTW so that crawlers see
> broken links correctly, otherwise the site seems ok but instead the
> broken links are there.

Do you have ideas as to how to do these (i.e. 1-4)? 5 is of greatest importance to 
me, but if I can understand what is involved in the others, then I can always have a 
go.

Regards, Upayavira


Re: CLI ideas (long)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Upayavira wrote, On 02/04/2003 0.25:
>Nicola Ken wrote
...
>>In the Environment there is
>>
>>     boolean isResponseModified(long lastModified);
>>     void setResponseIsNotModified();
>>
>>But it's never implemented. In AbstractEnvironment:
>>
>>     public boolean isResponseModified(long lastModified) {
>>         return true; // always modified
>>     }
>>
>>     public void setResponseIsNotModified() {
>>         // does nothing
>>     }
>>
>>So it means that the above has to be first implemented, then used when
>>writing to disk.
> 
> Okay. So how easy is it to code this? Any caching gurus out there?

...

>>>As Nicola Ken pointed out, links of every page would need to be
>>>cached, because when a page will be found to be already on disk and
>>>uptodate, you still need the links for crawling. Hmm. 
> 
> I'll have to think more about this one. Not straightforward.

Leave it out for now then, if it slows you down.

>>>>As points that are important, I would say in order:
>>>>
>>>> 1) make Cocoon *not* output the pages that have an error
>>>> 2) make cocoon output xxxpagename.error.txt with the errors
>>>>    of the 'xxxpagename' page (configurable)
>>>> 3) make the report on broken links in XML so that it can be
>>>>    added to the site (where to put it configurable)
>>>> 4) make the content not regenerated if uptodate (very important
>>>>    from a user perspective POV)
>>>> 5) use ModifiableSource instead of Destination
>>>> 6) others
>>>>
>>
>>1 is about not making error pages be printed out... for one thing IIUC
>>it needs resourceUnavailable() to be configurable (write out or not),
>>but I don't know if maybe there are other errors that write directly.
> 
> Sounds easy enough.

Then just do this, it's quite important.

>>4 is quite important from a user perspective, but maybe it takes some
>>time to do.
>>
>>Feel really free in doing what you need/prefer, especially if other
>>things take you too much time.
> 
> Progress will be slow, but I'll keep you posted.

Thanks :-)

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: CLI ideas (long)

Posted by Upayavira <uv...@upaya.co.uk>.
> > [Nicola Ken - I didn't understand this bit of your reply:]
> > 
> >>Actually even the former is managed by Cocoon, I don't remember
> >>where but IIRC the Environment has such an info, only that in the
> >>current implementation of the CLI environments it's unimplemented.
> 
> I mean that the hooks are already there, you just have to fill in the
> implementation.
> 
> In the Environment there is
> 
>      boolean isResponseModified(long lastModified);
>      void setResponseIsNotModified();
> 
> But it's never implemented. In AbstractEnvironment:
> 
>      public boolean isResponseModified(long lastModified) {
>          return true; // always modified
>      }
> 
>      public void setResponseIsNotModified() {
>          // does nothing
>      }
> 
> So it means that the above has to be first implemented, then used when
> writing to disk.

Okay. So how easy is it to code this? Any caching gurus out there?
 
> > As Nicola Ken pointed out, links of every page would need to be
> > cached, because when a page will be found to be already on disk and
> > uptodate, you still need the links for crawling. Hmm. 

I'll have to think more about this one. Not straightforward.

> >>As points that are important, I would say in order:
> >>
> >>  1) make Cocoon *not* output the pages that have an error
> >>  2) make cocoon output xxxpagename.error.txt with the errors
> >>     of the 'xxxpagename' page (configurable)
> >>  3) make the report on broken links in XML so that it can be
> >>     added to the site (where to put it configurable)
> >>  4) make the content not regenerated if uptodate (very important
> >>     from a user perspective POV)
> >>  5) use ModifiableSource instead of Destination
> >>  6) others
> >>
> 1 is about not making error pages be printed out... for one thing IIUC
> it needs resourceUnavailable() to be configurable (write out or not),
> but I don't know if maybe there are other errors that write directly.

Sounds easy enough.
 
> 4 is quite important from a user perspective, but maybe it takes some
> time to do.
> 
> Feel really free in doing what you need/prefer, especially if other
> things take you too much time.

Progress will be slow, but I'll keep you posted.

Upayavira


Re: CLI ideas (long)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Upayavira wrote, On 31/03/2003 21.11:
> Dear All,
>
> Below is a summary of a brief exchange with Nicola Ken 
> regarding CLI ideas I'd like to implement. He has encouraged 
> me to 'go public', which I am now doing.  

Hey, nobody wants to comment on the CLI changes?
Or is it that we are doing it too well? ;-)

> My aim in the below is twofold: make the CLI into something 
> that is useful to a project I am working on, and also to make 
> the CLI into something that people would prefer to use as 
> opposed to something like wget. [Confession: I'm afraid I 
> still use wget myself.]

:-))

<snip/>
> [Nicola Ken - I didn't understand this bit of your reply:]
> 
>>Actually even the former is managed by Cocoon, I don't remember where but
>>IIRC the Environment has such an info, only that in the current
>>implementation of the CLI environments it's unimplemented.

I mean that the hooks are already there, you just have to fill in the 
implementation.

In the Environment there is

     boolean isResponseModified(long lastModified);
     void setResponseIsNotModified();

But it's never implemented. In AbstractEnvironment:

     public boolean isResponseModified(long lastModified) {
         return true; // always modified
     }

     public void setResponseIsNotModified() {
         // does nothing
     }

So it means that the above has to be first implemented, then used when 
writing to disk.
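
For illustration only, here is a standalone sketch of what filling in those two hooks might look like. It mirrors the two method signatures quoted above but deliberately does not extend Cocoon's AbstractEnvironment; the class name, the constructor argument and shouldWrite() are assumptions made for this example, as is the convention that 0 means "nothing at the destination yet".

```java
// Standalone sketch: same two method signatures as quoted above,
// everything else (names, fields, semantics of 0) is assumed.
final class UpToDateTrackingEnvironment {

    private final long destinationLastModified; // 0 if nothing exists at the destination yet
    private boolean notModified = false;

    UpToDateTrackingEnvironment(long destinationLastModified) {
        this.destinationLastModified = destinationLastModified;
    }

    /** True if content last changed at lastModified is newer than the copy at the destination. */
    public boolean isResponseModified(long lastModified) {
        return destinationLastModified == 0 || lastModified > destinationLastModified;
    }

    /** Records that the pipeline found the existing copy still valid. */
    public void setResponseIsNotModified() {
        notModified = true;
    }

    /** The CLI would consult this before writing to disk. */
    boolean shouldWrite() {
        return !notModified;
    }

    public static void main(String[] args) {
        UpToDateTrackingEnvironment env = new UpToDateTrackingEnvironment(2000L);
        long pipelineLastModified = 1000L; // older than what is already on disk
        if (!env.isResponseModified(pipelineLastModified)) {
            env.setResponseIsNotModified();
        }
        System.out.println(env.shouldWrite()); // false
    }
}
```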

> As Nicola Ken pointed out, links of every page would need to be cached, because 
> when a page will be found to be already on disk and uptodate, you still need the 
> links for crawling. Hmm. 

Yup.

> ---Threading---
> Threading needs reworking as the ThreadedDestination would become
> deprecated. 
...
> There are two possible forms of threading: generation and dispatch
> threading. 
...
> This kind of threading is important for a system that I want to use it for. 
> The pages bear no relevance to each other, and speed of delivery is 
> important.  (I don't plan to implement generation threading ATM).
...
> Final comments from Nicola Ken:
> 
>>What about a publish-subscribe model, with complete decoupling from
>>the publishing and the handling?
> 
> Can you explain more what you mean by this?

I was thinking of a messaging system, like JMS for example, but it's 
overkill.

Go ahead with your needs.

>>As points that are important, I would say in order:
>>
>>  1) make Cocoon *not* output the pages that have an error
>>  2) make cocoon output xxxpagename.error.txt with the errors
>>     of the 'xxxpagename' page (configurable)
>>  3) make the report on broken links in XML so that it can be
>>     added to the site (where to put it configurable)
>>  4) make the content not regenerated if uptodate (very important
>>     from a user perspective POV)
>>  5) use ModifiableSource instead of Destination
>>  6) others
>>
>>Feel free to do whatever in whatever order you prefer, this is just
>>what IMVHO is the priority. 1+2 are needed BTW so that crawlers see
>>broken links correctly, otherwise the site seems ok but instead the
>>broken links are there.
> 
> 
> Do you have ideas as to how to do these (i.e. 1-4)? 5 is of greatest importance to 
> me, but if I can understand what is involved in the others, then I can always have a 
> go.

Leave 2 and 3 out then for now.

1 is about not writing out error pages... for one thing, IIUC, 
it needs resourceUnavailable() to be configurable (write out or not), 
but I don't know whether there are other errors that are written out directly.
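
A rough sketch of the "write out or not" switch, with everything in it assumed: ErrorAwareWriter and its methods are invented for illustration and say nothing about the real signature of resourceUnavailable(); an in-memory map stands in for the destination.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical writer with a configurable switch for error pages.
final class ErrorAwareWriter {

    private final boolean writeErrorPages; // hypothetical configuration option
    private final Map<String, String> written = new LinkedHashMap<>(); // stands in for the destination

    ErrorAwareWriter(boolean writeErrorPages) {
        this.writeErrorPages = writeErrorPages;
    }

    /** @return true if the page was written to the destination */
    boolean write(String uri, String content, boolean generationFailed) {
        if (generationFailed && !writeErrorPages) {
            return false; // leave the page missing, so crawlers see the broken link
        }
        written.put(uri, content);
        return true;
    }

    /** Pages written so far, keyed by URI. */
    Map<String, String> written() {
        return written;
    }
}
```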

4 is quite important from a user perspective, but maybe it takes some 
time to do.

Feel really free in doing what you need/prefer, especially if other 
things take you too much time.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)


Re: CLI ideas

Posted by Upayavira <uv...@upaya.co.uk>.
> Do you know the Ant mapper
> (http://ant.apache.org/manual/CoreTypes/mapper.html)?
> Perhaps a similar facility can be used.

No, it looks interesting. I'll look into it.

Upayavira

Re: CLI ideas (long)

Posted by Stephan Michels <st...@apache.org>.
On Mon, 31 Mar 2003, Upayavira wrote:

> A destination can be configured with a uri which identifies a
> ModifiableSource via its protocol. So instead of specifying a
> Destination, you specify a uri, and a sourceResolver identifies the
> ModifiableSource.
>
> Then all that remains is how to work out the actual uri of a
> particular output file, based upon the destination uri and the
> source uri.
>
> This can be done by specifying the source uri as a base
> and a path. The base is used to request a page, but is not appended
> to the output uri. When the source path should be inserted into the
> destination URI, the insertion point can be marked with a *, e.g.
>
> ftp://ftp.host.com/htdocs/*
> zip://path/*@foo.zip
>
> If no * is present, and the destination URI ends with a /, the
> source path is appended to the output uri. If it does not end with a
> /, the output URI only is used, and the source path is discarded.
>
> The source uri is identified by combining the base & the path
> (separated by a file separator).

Do you know the Ant mapper
(http://ant.apache.org/manual/CoreTypes/mapper.html)?
Perhaps a similar facility can be used.

Stephan.