Posted to dev@cocoon.apache.org by Upayavira <uv...@upaya.co.uk> on 2003/08/27 21:29:33 UTC

Re: Cocoon CLI: excluding URIs

Switching from Forrest-dev...

Jeff Turner wrote (on forrest-dev):

>On Wed, Aug 27, 2003 at 10:42:36AM +0100, Upayavira wrote:
>  
>
>>Jeff Turner wrote:
>>
>>    
>>
>>>On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
>>>
>>>>I rebuilt my local Forrest doco today but i get all these strange
>>>>error messages about "site:" and "ext:" URLs being broken.
>>>>Here is one example...
>>>>------
>>>>...
>>>>* [0] your-project.pdf
>>>>X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
>>>>* [48] cap.html
>>>>...
>>>>------
>>>>
>>>>On the other hand, i have a project site that builds with no such
>>>>problems. So i do not know what is going on. Any clues?
>>>>
>>>The problem is with the new CLI: we have no way to exclude certain URLs
>>>from being traversed.  The Forrest site gives these broken links because
>>>sitemap-ref.xml deliberately references some raw XML (index.xml), which
>>>contains refs to untranslated links like 'site:contrib'.  It's just an
>>>annoyance really -- doesn't harm the actual output.
>>>
>>>If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
>>>onto the Cocoon CLI so we can do a long-overdue 0.5 release.
>>>
>>>      
>>>
>>Jeff,
>>
>>Are you saying that the CLI is holding back a Forrest release?
>>    
>>
>
>A bit ;)  0.4 and previous versions have all had a mechanism to exclude
>certain URIs from being traversed.  Forrest's own site gives errors if
>some URLs aren't excluded.
>
>>Is there a timescale for it?
>>    
>>
>No particular timescale.  It's been 6 months since 0.4 though, so a
>release soon would be nice.
>  
>
>>A few points:
>>
>>1) If you switch back to link view, would that enable you to achieve 
>>your 'excludes' requirement?
>>    
>>
>Yes, but I've gotten used to the CLI speeding along, and wouldn't like to
>go back.
>  
>
Okay.

>>2) The LinkGatherer doesn't currently work, as a recent fix to caching 
>>broke it. It assumes that the LinkGatherer component isn't cached, as 
>>its 'gathering' side effect isn't cached.
>>    
>>
>Strange thing is, I haven't been able to replicate this in Forrest, after
>updating locally to CVS Cocoon.  CLI rendering works fine, both on
>initial and subsequent renderings.  I thought perhaps we have the buggy
>cache impl, but in my tests I'm using the same excalibur-store as in
>Cocoon, so I don't know what's going on.
>
Interesting. I think I know. Whilst hacking around, I added a 
getValidity() method to the LinkGatherer, thinking that that was what 
was breaking the cache. But I didn't commit it. I have been working from 
a broken, caching LinkGatherer, whilst you're working with the working, 
non-caching LinkGatherer from CVS. So this is good news.

What it means is that link gathering works, but that, if you use link 
gathering, you can't take advantage of the new ability to write to files 
only if a page has changed. To get that working, I've got to get the 
links gathered by the LinkGatherer into the cache somehow.
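
To make the problem concrete, here is a minimal Python sketch. It is not Cocoon code: `PAGES`, `render`, `SideEffectCache` and `LinkAwareCache` are all hypothetical names invented for illustration. It assumes link gathering is a side effect of running the pipeline, so a cache hit that skips the pipeline gathers nothing unless the links are stored next to the cached content. Storing them there is only one possible approach.

```python
# Hypothetical site: each page URI maps to the links found on that page.
PAGES = {"index.html": ["a.html", "b.html"]}


def render(uri, gathered):
    """Stand-in for a pipeline: returns content and, as a side effect,
    appends the links it finds to `gathered`."""
    gathered.extend(PAGES[uri])
    return "<html>%s</html>" % uri


class SideEffectCache:
    """Caches content only. On a hit the pipeline does not run,
    so no links are gathered."""

    def __init__(self):
        self.store = {}

    def get(self, uri, gathered):
        if uri not in self.store:
            self.store[uri] = render(uri, gathered)
        return self.store[uri]


class LinkAwareCache:
    """Caches the gathered links next to the content and replays
    them on every request, hit or miss."""

    def __init__(self):
        self.store = {}

    def get(self, uri, gathered):
        if uri not in self.store:
            found = []
            content = render(uri, found)
            self.store[uri] = (content, tuple(found))
        content, links = self.store[uri]
        gathered.extend(links)
        return content
```

With `SideEffectCache`, a second request for the same page leaves the caller's link list empty, so the crawl would stop there; with `LinkAwareCache` the same links come back on every request.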

>>3) I think I might be able to fix that (just rebuilding my Eclipse 
>>environment...), by setting the LinkGatherer to return null in response 
>>to getValidity().
>>4) I just started thinking about your excludes code (assuming that link 
>>gathering does start working again). Basically, there are a number of 
>>things one can exclude upon: source URI, source prefix, full source URI 
>>(prefix and URI), final destination URI. How about something like:
>>
>><exclude type="regexp | wildcard" src="source-uri | source-prefix | 
>>full-source-uri | dest-uri" match="<pattern>"/>
>><include type="regexp | wildcard" src="source-uri | source-prefix | 
>>full-source-uri | dest-uri" match="<pattern>"/>
>>    
>>
>I'd be happy with a simple 'ignore this link', but wildcards would be
>great.
>
>I'm a bit confused by all the @src types though.  Is 'dest-uri' the final
>filesystem destination?  Is there anything possible with src="dest-uri"
>that isn't possible otherwise?  Does 'src-prefix' mean "ignore URIs
>starting with this prefix"?  If so, why not just use a wildcard?
>  
>
The thing is, you might want to exclude a certain URL from going to one 
destination but not another, so you'd need to specify a wildcard on 
either source or destination. However, given that a wildcard can be used 
to deal with prefixes, we don't need to specifically worry about 
prefixes. So, I propose:

<exclude-source match="<wildcard pattern>"/>
<exclude-destination match="<wildcard pattern>"/>
<include-source match="<wildcard pattern>"/>
<include-destination match="<wildcard pattern>"/>

I don't want to use <exclude type="source" ...> as I want to reserve the 
type attribute for specifying whether to use a wildcard or regexp matcher.
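
A rough Python sketch of how that split could look in code, purely for illustration: the element name decides whether the source or the destination URI is tested, and a separate matcher kind decides how the pattern is read. `ExcludeRule`, `is_excluded` and the example patterns are hypothetical, and Python's `fnmatch` stands in for a wildcard matcher (its `*` also crosses `/`, which may differ from Cocoon's wildcard semantics).

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ExcludeRule:
    target: str                # "source" or "destination", from the element name
    match: str                 # the pattern text from the match attribute
    matcher: str = "wildcard"  # "wildcard" or "regexp", from the type attribute

    def applies(self, source_uri, dest_uri):
        # Pick the URI this rule is about, then test it with the chosen matcher.
        subject = source_uri if self.target == "source" else dest_uri
        if self.matcher == "regexp":
            return re.fullmatch(self.match, subject) is not None
        return fnmatchcase(subject, self.match)


def is_excluded(rules, source_uri, dest_uri):
    """A URI pair is excluded if any rule applies to it."""
    return any(rule.applies(source_uri, dest_uri) for rule in rules)


rules = [
    ExcludeRule("source", "site:*"),
    ExcludeRule("destination", r"build/pdf/.*\.pdf", matcher="regexp"),
]
print(is_excluded(rules, "site:contrib", "build/site/contrib"))
print(is_excluded(rules, "index.html", "build/site/index.html"))
```

Keeping the target in the element name means the same URL can be excluded for one destination and still be written to another.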

Thoughts?

I've got some basic code in place to do includes/excludes - I'll keep 
you posted.

>>With include, you can have only a very narrow part of your site
>>crawled.
>>
>>Note: I think the xconf format needs some serious rethinking, so this 
>>would be a temporary extension.
>>    
>>
>I agree, the format isn't something that can be decided up-front.  I
>wouldn't worry too much about keeping backwards-compat.
>  
>
>>What do you think?
>>
>>I'm struggling to fit a number of projects into limited time (1 1/2 
>>hours per day; I want to do Cocoon stuff, but need to work on some other 
>>sites), but I'm keen to get Cocoon working for you.
>>    
>>
>
>Thanks very much :)  I'm in the same boat, working on Forrest in the
>evenings.  No rush -- there's plenty of other stuff to keep us busy
>before a release.
>
I've just managed to shove one burning project two weeks into the 
future, so I'm back on for Cocoon for a while!

>PS: in your CLI experiments, have you ever encountered a bug where the
>last link in a page isn't crawled?  I'll try to come up with a decent
>replicable example, but thought I'd mention it anyway.
>
To be honest, I haven't. Give me an example, and I'll look into it.

Regards, Upayavira



Re: Cocoon CLI: excluding URIs

Posted by Upayavira <uv...@upaya.co.uk>.
Upayavira wrote:

...      

>>> 4) I just started thinking about your excludes code (assuming that 
>>> link gathering does start working again). Basically, there are a 
>>> number of things one can exclude upon: source URI, source prefix, 
>>> full source URI (prefix and URI), final destination URI. How about 
>>> something like:
>>>
>>> <exclude type="regexp | wildcard" src="source-uri | source-prefix | 
>>> full-source-uri | dest-uri" match="<pattern>"/>
>>> <include type="regexp | wildcard" src="source-uri | source-prefix | 
>>> full-source-uri | dest-uri" match="<pattern>"/>
>>
>> I'd be happy with a simple 'ignore this link', but wildcards would be
>> great.
>>
>> I'm a bit confused by all the @src types though.  Is 'dest-uri' the
>> final filesystem destination?  Is there anything possible with
>> src="dest-uri" that isn't possible otherwise?  Does 'src-prefix' mean
>> "ignore URIs starting with this prefix"?  If so, why not just use a
>> wildcard?
>
> The thing is, you might want to exclude a certain URL from going to 
> one destination but not another, so you'd need to specify a wildcard 
> on either source or destination. However, given that a wildcard can be 
> used to deal with prefixes, we don't need to specifically worry about 
> prefixes. So, I propose:
>
> <exclude-source match="<wildcard pattern>"/>
> <exclude-destination match="<wildcard pattern>"/>
> <include-source match="<wildcard pattern>"/>
> <include-destination match="<wildcard pattern>"/>
>
> I don't want to use <exclude type="source" ...> as I want to reserve 
> the type attribute for specifying whether to use a wildcard or regexp 
> matcher.
>
> Thoughts?
>
> I've got some basic code in place to do includes/excludes - I'll keep 
> you posted.

I've just run my code through, and it seems to have worked. I'll give it 
a bit more testing later today and commit it either this evening or 
tomorrow.

It is very simple: I haven't yet implemented 'destination' excludes, 
and have only done wildcard excludes. Matching is done against the 
'absolute' URL, i.e. including any source prefix. I've also implemented 
includes in the same way, but have not yet tested them.

I've made it so that if both includes and excludes are present, a URL is 
first checked to see if it should be included, and then to see if it 
should be excluded. So you might say <include pattern="subsite/**"/> and 
<exclude pattern="subsite/images/**"/>.
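
As a rough illustration of that ordering, sketched in Python rather than the CLI's Java: the function names are hypothetical, and two things are assumptions on my part, namely that `*` matches within one path segment while `**` also crosses `/`, and that an empty include list means "include everything".

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regex.
    Assumed semantics: '**' matches any characters including '/',
    '*' matches any characters except '/'."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    # \Z anchors the end; re.match anchors the start.
    return re.compile("".join(parts) + r"\Z")


def should_process(uri, includes, excludes):
    """Check includes first, then excludes, as described above."""
    # Assumption: with no include patterns, every URI counts as included.
    included = not includes or any(
        wildcard_to_regex(p).match(uri) for p in includes
    )
    excluded = any(wildcard_to_regex(p).match(uri) for p in excludes)
    return included and not excluded


includes = ["subsite/**"]
excludes = ["subsite/images/**"]
for uri in ["subsite/index.html", "subsite/images/logo.png", "other/page.html"]:
    print(uri, should_process(uri, includes, excludes))
```

Here only `subsite/index.html` passes: the image is included but then excluded, and `other/page.html` never matches an include.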

I have tested the following, which generates the docs, but without any 
images.

   <exclude pattern="**/images/**"/>
   <uri type="append" src-prefix="docs/" src="index.html"
        dest="build/dest/" />
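
For what it's worth, the effect of matching against the 'absolute' URL can be sketched in a few lines of Python. This is only an illustration: `absolute_uri`, `crawlable` and the link list are made up, and `fnmatch` is a stand-in whose `*` also crosses `/`, so it is not necessarily the same as the CLI's wildcard matcher.

```python
from fnmatch import fnmatchcase


def absolute_uri(src_prefix, src):
    """Join the source prefix and the source URI into the 'absolute' URL."""
    return src_prefix + src


def crawlable(src_prefix, links, exclude_patterns):
    """Return the absolute URIs that no exclude pattern matches."""
    kept = []
    for link in links:
        uri = absolute_uri(src_prefix, link)
        if not any(fnmatchcase(uri, p) for p in exclude_patterns):
            kept.append(uri)
    return kept


# Hypothetical links found while crawling from docs/index.html.
links = [
    "index.html",
    "images/logo.png",
    "guide/images/fig1.png",
    "guide/intro.html",
]
print(crawlable("docs/", links, ["**/images/**"]))
```

In this sketch the prefix matters: `docs/images/logo.png` matches `**/images/**`, whereas the bare `images/logo.png` would not, because the pattern expects a `/` before `images`.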

Regards, Upayavira