Posted to dev@forrest.apache.org by Upayavira <uv...@upaya.co.uk> on 2003/08/27 21:29:46 UTC

Re: Cocoon CLI: excluding URIs

Jeff,

I've replied fully on cocoon-dev, as this discussion should really be 
happening there.

Upayavira


Jeff Turner wrote:

>On Wed, Aug 27, 2003 at 10:42:36AM +0100, Upayavira wrote:
>  
>
>>Jeff Turner wrote:
>>
>>    
>>
>>>On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
>>>
>>>
>>>      
>>>
>>>>I rebuilt my local Forrest doco today but i get all these strange
>>>>error messages about "site:" and "ext:" URLs being broken.
>>>>Here is one example...
>>>>------
>>>>...
>>>>* [0] your-project.pdf
>>>>X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
>>>>* [48] cap.html
>>>>...
>>>>------
>>>>
>>>>On the other hand, i have a project site that builds with no such
>>>>problems. So i do not know what is going on. Any clues?
>>>>  
>>>>
>>>>        
>>>>
>>>The problem is with the new CLI: we have no way to exclude certain URLs
>>>from being traversed.  The Forrest site gives these broken links because
>>>sitemap-ref.xml deliberately references some raw XML (index.xml), which
>>>contains refs to untranslated links like 'site:contrib'.  It's just an
>>>annoyance really -- doesn't harm the actual output.
>>>
>>>If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
>>>onto the Cocoon CLI so we can do a long-overdue 0.5 release.
>>>
>>>      
>>>
>>Jeff,
>>
>>Are you saying that the CLI is holding back a Forrest release?
>>    
>>
>
>A bit ;)  0.4 and previous versions have all had a mechanism to exclude
>certain URIs from being traversed.  Forrest's own site gives errors if
>some URLs aren't excluded.
>
>  
>
>>Is there a timescale for it?
>>    
>>
>
>No particular timescale.  It's been 6 months since 0.4 though, so a
>release soon would be nice.
>
>  
>
>>A few points:
>>
>>1) If you switch back to link view, would that enable you to achieve 
>>your 'excludes' requirement?
>>    
>>
>
>Yes, but I've gotten used to the CLI speeding along, and wouldn't like to
>go back.
>
>  
>
>>2) The LinkGatherer doesn't currently work, as a recent fix to caching 
>>broke it. It assumes that the LinkGatherer component isn't cached, as 
>>its 'gathering' side effect isn't cached.
>>    
>>
>
>Strange thing is, I haven't been able to replicate this in Forrest, after
>updating locally to CVS Cocoon.  CLI rendering works fine, both on
>initial and subsequent renderings.  I thought perhaps we have the buggy
>cache impl, but in my tests I'm using the same excalibur-store as in
>Cocoon, so I don't know what's going on.
>
>  
>
>>3) I think I might be able to fix that (just rebuilding my Eclipse 
>>environment...), by setting the LinkGatherer to return null in response 
>>to getValidity()
>>4) I just started thinking about your excludes code (assuming that link 
>>gathering does start working again). Basically, there's a number of 
>>things one can exclude upon - source URI, source prefix, full source URI 
>>(prefix and URI), final destination URI. How about something like:
>>
>><exclude type="regexp | wildcard" src="source-uri | source-prefix | 
>>full-source-uri | dest-uri" match="<pattern>"/>
>><include type="regexp | wildcard" src="source-uri | source-prefix | 
>>full-source-uri | dest-uri" match="<pattern>"/>
>>    
>>
>
>I'd be happy with a simple 'ignore this link', but wildcards would be
>great.
>
>I'm a bit confused by all the @src types though.  Is 'dest-uri' the final
>filesystem destination?  Is there anything possible with src="dest-uri"
>that isn't possible otherwise?  Does 'source-prefix' mean "ignore URIs
>starting with this prefix"?  If so, why not just use a wildcard?
>
>  
>
>>With include, you can have only a very narrow part of your site
>>crawled.
>>
>>Note: I think the xconf format needs some serious rethinking, so this 
>>would be a temporary extension.
>>    
>>
>
>I agree, the format isn't something that can be decided up-front.  I
>wouldn't worry too much about keeping backwards-compat.
>
>  
>
>>What do you think?
>>
>>I'm struggling to fit a number of projects into limited time (1 1/2 
>>hours per day; I want to do Cocoon stuff, but need to work on some other 
>>sites), but I'm keen to get Cocoon working for you.
>>    
>>
>
>Thanks very much :)  I'm in the same boat, working on Forrest in the
>evenings.  No rush -- there's plenty of other stuff to keep us busy
>before a release.
>
>
>--Jeff
>
>PS: in your CLI experiments, have you ever encountered a bug where the
>last link in a page isn't crawled?  I'll try to come up with a decent
>replicable example, but thought I'd mention it anyway.
>
>
>  
>
>>Regards, Upayavira
>>
>>    
>>
>
>  
>
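
For illustration only, here is a minimal sketch of how the proposed
<exclude>/<include> rules with type="wildcard" or type="regexp" might be
evaluated against a link before it is crawled. This is not Cocoon CLI code:
the function names (make_rule, should_crawl) are hypothetical, the 'src'
attribute is ignored (every rule is matched against the link URI as written),
and the precedence (an exclude always wins; if any include exists, a URI must
match one of them) is an assumption, since the thread does not settle it.

```python
import re
from fnmatch import fnmatchcase


def make_rule(action, kind, pattern):
    """Build one (action, predicate) pair.

    action: "include" or "exclude" (mirrors the proposed element names).
    kind:   "wildcard" or "regexp" (mirrors the proposed 'type' attribute).
    """
    if action not in ("include", "exclude"):
        raise ValueError("unknown action: %r" % action)
    if kind == "regexp":
        compiled = re.compile(pattern)
        # The whole URI must match the expression, not just a prefix.
        return action, lambda uri: compiled.fullmatch(uri) is not None
    if kind == "wildcard":
        # Shell-style wildcards: '*' matches any run of characters.
        return action, lambda uri: fnmatchcase(uri, pattern)
    raise ValueError("unknown rule type: %r" % kind)


def should_crawl(uri, rules):
    """Decide whether the crawler should follow `uri`.

    Assumed precedence: any matching exclude rejects the URI; otherwise,
    if include rules exist, at least one must match; with no include
    rules, everything not excluded is crawled.
    """
    excludes = [match for action, match in rules if action == "exclude"]
    includes = [match for action, match in rules if action == "include"]
    if any(match(uri) for match in excludes):
        return False
    if includes:
        return any(match(uri) for match in includes)
    return True


# Skip the untranslated 'site:' and 'ext:' links from the build log above.
rules = [
    make_rule("exclude", "wildcard", "site:*"),
    make_rule("exclude", "regexp", r"ext:.*"),
]
links = ["your-project.pdf", "site:contrib", "cap.html"]
to_crawl = [uri for uri in links if should_crawl(uri, rules)]
```

With these two rules, 'site:contrib' is filtered out while
'your-project.pdf' and 'cap.html' are still crawled. A single
wildcard-based include rule would narrow the crawl to one part of a site,
as described above.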