Posted to dev@forrest.apache.org by David Crossley <cr...@indexgeo.com.au> on 2003/08/26 10:27:08 UTC

broken links to "site:" URLs

I rebuilt my local Forrest doco today but i get all these strange
error messages about "site:" and "ext:" URLs being broken.
Here is one example...
------
...
* [0] your-project.pdf
X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
* [48] cap.html
...
------

On the other hand, i have a project site that builds with no such
problems. So i do not know what is going on. Any clues?

--David


Re: Cocoon CLI: excluding URIs

Posted by Upayavira <uv...@upaya.co.uk>.
Upayavira wrote:

...      

>>> 4) I just started thinking about your excludes code (assuming that 
>>> link gathering does start working again). Basically, there's a 
>>> number of things one can exclude upon - source URI, source prefix, 
>>> full source URI (prefix and URI), final destination URI . How about 
>>> something like:
>>>
>>> <exclude type="regexp| wildcard" src="source-uri | source-prefix | 
>>> full-source-uri | dest-uri" match="<pattern>"/>
>>> <include type="regexp| wildcard" src="source-uri | source-prefix | 
>>> full-source-uri | dest-uri" match="<pattern>"/>   
>>
>> I'd be happy with a simple 'ignore this link', but wildcards would be
>> great.
>>
>> I'm a bit confused by all the @src types though.  Is 'dest-uri' the 
>> final
>> filesystem destination?  Is there anything possible with src="dest-uri"
>> that isn't possible otherwise?  Does 'src-prefix' mean "ignore URIs
>> starting with this prefix"?  If so, why not just use a wildcard?
>
> The thing is, you might want to exclude a certain URL from going to 
> one destination but not another, so you'd need to specify a wildcard 
> on either source or destination. However, given that a wildcard can be 
> used to deal with prefixes, we don't need to specifically worry about 
> prefixes. So, I propose:
>
> <exclude-source match="<wildcard pattern>"/>
> <exclude-destination match="<wildcard pattern>"/>
> <include-source match="<wildcard pattern>"/>
> <include-destination match="<wildcard pattern>"/>
>
> I don't want to use <exclude type="source" ...> as I want to reserve 
> the type attribute for specifying whether to use a wildcard or regexp 
> matcher.
>
> Thoughts?
>
> I've got some basic code in place to do includes/excludes - I'll keep 
> you posted.

I've just run my code through, and it seems to have worked. I'll give it 
a bit more testing later today and commit it either this evening or 
tomorrow.

It is very simple - I haven't yet implemented 'destination' excludes, 
and have only done wildcard excludes. And the matching happens with the 
'absolute' url, i.e. including any source prefix. I've also implemented 
includes in the same way, but have not yet tested it.

I've made it so that if both includes and excludes are present, a URL is 
first checked to see if it should be included, and then to see if it 
should be excluded. So you might say <include pattern="subsite/**"/> and 
<exclude pattern="subsite/images/**"/>.

I have tested the following, which generates the docs, but without any 
images.

   <exclude pattern="**/images/**"/>
   <uri type="append" src-prefix="docs/" src="index.html" 
dest="build/dest/" />
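
The include-then-exclude order described above can be sketched roughly as follows. This is only an illustration in Python, not the Cocoon code: the function name should_process is made up, and fnmatch stands in for the real wildcard matcher (in fnmatch a single '*' also crosses '/', which is enough for these patterns). The assumption that an empty include list lets everything through is mine.

```python
from fnmatch import fnmatchcase

def should_process(src_prefix, src, includes, excludes):
    """Apply include patterns first, then exclude patterns, to the absolute URL."""
    url = src_prefix + src  # matching uses the URL including any source prefix
    # Assumption of this sketch: with no include patterns, every URL is a candidate.
    if includes and not any(fnmatchcase(url, p) for p in includes):
        return False
    return not any(fnmatchcase(url, p) for p in excludes)

includes = ["subsite/**"]
excludes = ["subsite/images/**"]
candidates = ["index.html", "images/logo.png"]
kept = [s for s in candidates if should_process("subsite/", s, includes, excludes)]
print(kept)
```

With these patterns only index.html survives: the image is included by the first pattern and then dropped by the second.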

Regards, Upayavira



Re: Cocoon CLI: excluding URIs

Posted by Upayavira <uv...@upaya.co.uk>.
Switching from Forrest-dev...

Jeff Turner wrote (on forrest-dev):

>On Wed, Aug 27, 2003 at 10:42:36AM +0100, Upayavira wrote:
>  
>
>>Jeff Turner wrote:
>>
>>    
>>
>>>On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
>>>
>>>
>>>      
>>>
>>>>I rebuilt my local Forrest doco today but i get all these strange
>>>>error messages about "site:" and "ext:" URLs being broken.
>>>>Here is one example...
>>>>------
>>>>...
>>>>* [0] your-project.pdf
>>>>X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
>>>>* [48] cap.html
>>>>...
>>>>------
>>>>
>>>>On the other hand, i have a project site that builds with no such
>>>>problems. So i do not know what is going on. Any clues?
>>>>  
>>>>
>>>>        
>>>>
>>>The problem is with the new CLI: we have no way to exclude certain URLs
>>>from being traversed.  The Forrest site gives these broken links because
>>>sitemap-ref.xml deliberately references some raw XML (index.xml), which
>>>contains refs to untranslated links like 'site:contrib'.  It is just an
>>>annoyance really -- doesn't harm the actual output.
>>>
>>>If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
>>>onto the Cocoon CLI so we can do a long-overdue 0.5 release.
>>>
>>>      
>>>
>>Jeff,
>>
>>Are you saying that the CLI is holding back a Forrest release?
>>    
>>
>
>A bit ;)  0.4 and previous versions have all had a mechanism to exclude
>certain URIs from being traversed.  Forrest's own site gives errors if
>some URLs aren't excluded.
>
>>Is there a timescale for it?
>>    
>>
>No particular timescale.  It's been 6 months since 0.4 though, so a
>release soon would be nice.
>  
>
>>A few points:
>>
>>1) If you switch back to link view, would that enable you to achieve 
>>your 'excludes' requirement?
>>    
>>
>Yes, but I've gotten used to the CLI speeding along, and wouldn't like to
>go back.
>  
>
Okay.

>>2) The LinkGatherer doesn't currently work, as a recent fix to caching 
>>broke it. It assumes that the LinkGatherer component isn't cached, as 
>>its 'gathering' side effect isn't cached.
>>    
>>
>Strange thing is, I haven't been able to replicate this in Forrest, after
>updating locally to CVS Cocoon.  CLI rendering works fine, both on
>initial and subsequent renderings.  I thought perhaps we have the buggy
>cache impl, but in my tests I'm using the same excalibur-store as in
>Cocoon, so I don't know what's going on.
>
Interesting. I think I know. Whilst hacking around, I added a 
getValidity() method to the LinkGatherer, thinking that its absence was 
what was breaking the cache. But I didn't commit it. So I have been 
working from a broken, caching LinkGatherer, whilst you're working with 
the working, non-caching LinkGatherer from CVS. So this is good news.

What it means is that link gathering works, but that, if you use link 
gathering, you can't take advantage of the new ability to write to files 
only if a page has changed. To get that working, I've got to get the 
links gathered by the LinkGatherer into the cache somehow.
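
To make the problem concrete, here is a toy sketch of what "getting the links into the cache" could look like. It is an assumption about one possible design, not how Cocoon does it: the class name, the integer version used as a validity check, and the sources dictionary are all invented for the example. The point is that a cache hit skips the pipeline (and so the gathering side effect), so the links have to come back out of the cache entry instead.

```python
class CachingCrawler:
    """Toy crawler that stores gathered links alongside each cached page."""

    def __init__(self, sources):
        self.sources = sources   # uri -> (version, links); stands in for the pipelines
        self.cache = {}          # uri -> (version, links)
        self.pipeline_runs = 0   # how often a page was really rendered
        self.written = []        # pages written during the latest crawl

    def _run_pipeline(self, uri):
        # Rendering gathers the links as a side effect.
        self.pipeline_runs += 1
        version, links = self.sources[uri]
        return version, list(links)

    def process(self, uri):
        version = self.sources[uri][0]
        entry = self.cache.get(uri)
        if entry is not None and entry[0] == version:
            # Cache hit: no pipeline run and no write, links come from the cache.
            return entry[1]
        version, links = self._run_pipeline(uri)
        self.cache[uri] = (version, links)
        self.written.append(uri)
        return links

    def crawl(self, start):
        self.written = []
        seen, queue = set(), [start]
        while queue:
            uri = queue.pop(0)
            if uri in seen or uri not in self.sources:
                continue
            seen.add(uri)
            queue.extend(self.process(uri))
        return seen

crawler = CachingCrawler({"index.html": (1, ["a.html"]), "a.html": (1, [])})
crawler.crawl("index.html")   # first run renders and writes both pages
crawler.crawl("index.html")   # second run still reaches a.html, writes nothing
```

If the links were not stored in the cache entry, the second crawl would stop at index.html, because nothing would report the link to a.html.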

>>3) I think I might be able to fix that (just rebuilding my Eclipse 
>>environment...), by setting the LinkGatherer to return null in response 
>>to getValidity()
>>4) I just started thinking about your excludes code (assuming that link 
>>gathering does start working again). Basically, there's a number of 
>>things one can exclude upon - source URI, source prefix, full source URI 
>>(prefix and URI), final destination URI . How about something like:
>>
>><exclude type="regexp| wildcard" src="source-uri | source-prefix | 
>>full-source-uri | dest-uri" match="<pattern>"/>
>><include type="regexp| wildcard" src="source-uri | source-prefix | 
>>full-source-uri | dest-uri" match="<pattern>"/>
>>    
>>
>I'd be happy with a simple 'ignore this link', but wildcards would be
>great.
>
>I'm a bit confused by all the @src types though.  Is 'dest-uri' the final
>filesystem destination?  Is there anything possible with src="dest-uri"
>that isn't possible otherwise?  Does 'src-prefix' mean "ignore URIs
>starting with this prefix"?  If so, why not just use a wildcard?
>  
>
The thing is, you might want to exclude a certain URL from going to one 
destination but not another, so you'd need to specify a wildcard on 
either source or destination. However, given that a wildcard can be used 
to deal with prefixes, we don't need to specifically worry about 
prefixes. So, I propose:

<exclude-source match="<wildcard pattern>"/>
<exclude-destination match="<wildcard pattern>"/>
<include-source match="<wildcard pattern>"/>
<include-destination match="<wildcard pattern>"/>

I don't want to use <exclude type="source" ...> as I want to reserve the 
type attribute for specifying whether to use a wildcard or regexp matcher.
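
As a rough illustration of why source and destination need separate patterns, here is a Python sketch. The function name, the build/site and build/offline paths, and the use of fnmatch as the wildcard matcher are all assumptions for the example; only the exclude-source / exclude-destination split comes from the proposal.

```python
from fnmatch import fnmatchcase

def is_excluded(source_uri, dest_uri, exclude_source, exclude_destination):
    """True if the source matches a source pattern or the destination a destination pattern."""
    if any(fnmatchcase(source_uri, p) for p in exclude_source):
        return True
    return any(fnmatchcase(dest_uri, p) for p in exclude_destination)

# One source rendered to two hypothetical destinations.
targets = [
    ("docs/api/index.html", "build/site/docs/api/index.html"),
    ("docs/api/index.html", "build/offline/docs/api/index.html"),
]
exclude_destination = ["build/offline/**"]
written = [d for s, d in targets if not is_excluded(s, d, [], exclude_destination)]
print(written)
```

A source pattern alone could not express this case, since both targets share the same source URI.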

Thoughts?

I've got some basic code in place to do includes/excludes - I'll keep 
you posted.

>>With include, you can have only a very narrow part of your site
>>crawled.
>>
>>Note: I think the xconf format needs some serious rethinking, so this 
>>would be a temporary extension.
>>    
>>
>I agree, the format isn't something that can be decided up-front.  I
>wouldn't worry too much about keeping backwards-compat.
>  
>
>>What do you think?
>>
>>I'm struggling to fit a number of projects into limited time (1 1/2 
>>hours per day - want to do Cocoon stuff, but need to work on some other 
>>sites), but I'm keen to get Cocoon working for you.
>>    
>>
>
>Thanks very much :)  I'm in the same boat, working on Forrest in the
>evenings.  No rush -- there's plenty of other stuff to keep us busy
>before a release.
>
I've just managed to shove one burning project two weeks into the 
future, so I'm back on for Cocoon for a while!

>PS: in your CLI experiments, have you ever encountered a bug where the
>last link in a page isn't crawled?  I'll try to come up with a decent
>replicable example, but thought I'd mention it anyway.
>
To be honest, I haven't. Give me an example, and I'll look into it.

Regards, Upayavira



Re: Cocoon CLI: excluding URIs

Posted by Upayavira <uv...@upaya.co.uk>.
Jeff,

I've replied fully on cocoon-dev, as this discussion should really be 
happening there.

Upayavira





Re: Cocoon CLI: excluding URIs

Posted by Upayavira <uv...@upaya.co.uk>.
Jeff Turner wrote:

>On Thu, Aug 28, 2003 at 08:19:25PM +0100, Upayavira wrote:
>  
>
>>Jeff,
>>
>>In Cocoon CVS you'll now find code to do simple includes and excludes. 
>>Stick something like the following somewhere in your cli.xconf.
>>
>><include pattern="docs/*"/>
>><exclude pattern="**/images/**"/>
>>    
>>
>
>Wohoo :)  Works nicely, and just in time for some weekend hacking. Thanks
>very much.
>
Glad you like it!

>A small thing: do you think we could get rid of the timestamp until such
>a time as it can be printed on the same line as the rest of the text?
>Doubling the length of the output is quite a heavy price to pay.
>  
>
I've done this in CVS. I'll re-add it later using the proper 
BeanListener interface.

>Forrest ppl: if there's no objections, I'll commit an updated Cocoon to
>Forrest CVS some time this w/end.
>  
>
Not a Forrest person, but I'd be up for it! Always good to have people 
doing my testing ;-)

Regards, Upayavira





Re: Cocoon CLI: excluding URIs

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Upayavira wrote, On 30/08/2003 11.31:
> Nicola Ken Barozzi wrote:
...
>> Just note that it's possible that this mechanism shall be superseded, 
>> so we should probably not publicise it as a Forrest feature for our 
>> users.
>> But I'm fine anyways.
> 
> I was thinking about this, and I kind'a think that <exclude 
> pattern="**/api/**"/> is a lot easier to say than writing an XSLT to 
> exclude links. So, in terms of simple filtering, I think there's scope 
> for both link-view filtering and xconf includes/excludes.
> 
> Just a thought...

I agree. IMHO link gathering is something that should be done by 
Cocoon rather than "with" Cocoon. We'll see on cocoon-dev; keeping 
both mechanisms would probably be better.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------



Re: Cocoon CLI: excluding URIs

Posted by Upayavira <uv...@upaya.co.uk>.
Nicola Ken Barozzi wrote:

>
> Jeff Turner wrote, On 29/08/2003 16.01:
>
> ...
>
>> Forrest ppl: if there's no objections, I'll commit an updated Cocoon to
>> Forrest CVS some time this w/end.
>
>
> No objections.
>
> Just note that it's possible that this mechanism shall be superseded, 
> so we should probably not publicise it as a Forrest feature for our 
> users.
> But I'm fine anyways.

I was thinking about this, and I kind'a think that <exclude 
pattern="**/api/**"/> is a lot easier to say than writing an XSLT to 
exclude links. So, in terms of simple filtering, I think there's scope 
for both link-view filtering and xconf includes/excludes.

Just a thought...

Regards, Upayavira




Re: Cocoon CLI: excluding URIs

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Jeff Turner wrote, On 29/08/2003 16.01:

...
> Forrest ppl: if there's no objections, I'll commit an updated Cocoon to
> Forrest CVS some time this w/end.

No objections.

Just note that it's possible that this mechanism shall be superseded, so 
we should probably not publicise it as a Forrest feature for our users.
But I'm fine anyways.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------



Re: Cocoon CLI: excluding URIs

Posted by Jeff Turner <je...@apache.org>.
On Thu, Aug 28, 2003 at 08:19:25PM +0100, Upayavira wrote:
> Jeff,
> 
> In Cocoon CVS you'll now find code to do simple includes and excludes. 
> Stick something like the following somewhere in your cli.xconf.
> 
> <include pattern="docs/*"/>
> <exclude pattern="**/images/**"/>

Wohoo :)  Works nicely, and just in time for some weekend hacking. Thanks
very much.

A small thing: do you think we could get rid of the timestamp until such
a time as it can be printed on the same line as the rest of the text?
Doubling the length of the output is quite a heavy price to pay.

Forrest ppl: if there's no objections, I'll commit an updated Cocoon to
Forrest CVS some time this w/end.


--Jeff

> Hope that helps. And the coding was delightfully easy (given the 
> availability of the o.a.c.matching.helpers.WildcardHelper class).
> 
> Regards, Upayavira
> 

Re: Cocoon CLI: excluding URIs

Posted by Upayavira <uv...@upaya.co.uk>.
Jeff,

In Cocoon CVS you'll now find code to do simple includes and excludes. 
Stick something like the following somewhere in your cli.xconf.

<include pattern="docs/*"/>
<exclude pattern="**/images/**"/>

Hope that helps. And the coding was delightfully easy (given the 
availability of the o.a.c.matching.helpers.WildcardHelper class).
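
For anyone unsure how the two patterns above differ, here is a small Python sketch of the wildcard semantics as I understand them: '**' matches any run of characters including '/', while a single '*' stops at '/'. This is a stand-alone approximation written for illustration; it does not call or reproduce WildcardHelper, and the helper names are made up.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a path wildcard into a compiled regular expression.

    Assumed semantics: '**' matches any characters including '/',
    a single '*' matches any characters except '/'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def matches(pattern, uri):
    """True if the whole uri matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(uri) is not None

print(matches("docs/*", "docs/index.html"))
print(matches("docs/*", "docs/api/index.html"))
print(matches("**/images/**", "docs/images/logo.png"))
```

Under these assumptions "docs/*" only covers pages directly under docs/, and "**/images/**" needs at least one directory level in front of images/.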

Regards, Upayavira




Cocoon CLI: excluding URIs (was: Re: broken links to "site:" URLs)

Posted by Jeff Turner <je...@apache.org>.
On Wed, Aug 27, 2003 at 10:42:36AM +0100, Upayavira wrote:
> Jeff Turner wrote:
> 
> >On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
> > 
> >
> >>I rebuilt my local Forrest doco today but i get all these strange
> >>error messages about "site:" and "ext:" URLs being broken.
> >>Here is one example...
> >>------
> >>...
> >>* [0] your-project.pdf
> >>X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
> >>* [48] cap.html
> >>...
> >>------
> >>
> >>On the other hand, i have a project site that builds with no such
> >>problems. So i do not know what is going on. Any clues?
> >>   
> >>
> >
> >The problem is with the new CLI: we have no way to exclude certain URLs
> >from being traversed.  The Forrest site gives these broken links because
> >sitemap-ref.xml deliberately references some raw XML (index.xml), which
> >contains refs to untranslated links like 'site:contrib'.  It is just an
> >annoyance really -- doesn't harm the actual output.
> >
> >If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
> >onto the Cocoon CLI so we can do a long-overdue 0.5 release.
> >
> Jeff,
> 
> Are you saying that the CLI is holding back a Forrest release?

A bit ;)  0.4 and previous versions have all had a mechanism to exclude
certain URIs from being traversed.  Forrest's own site gives errors if
some URLs aren't excluded.

> Is there a timescale for it?

No particular timescale.  It's been 6 months since 0.4 though, so a
release soon would be nice.

> A few points:
> 
> 1) If you switch back to link view, would that enable you to achieve 
> your 'excludes' requirement?

Yes, but I've gotten used to the CLI speeding along, and wouldn't like to
go back.

> 2) The LinkGatherer doesn't currently work, as a recent fix to caching 
> broke it. It assumes that the LinkGatherer component isn't cached, as 
> its 'gathering' side effect isn't cached.

Strange thing is, I haven't been able to replicate this in Forrest, after
updating locally to CVS Cocoon.  CLI rendering works fine, both on
initial and subsequent renderings.  I thought perhaps we have the buggy
cache impl, but in my tests I'm using the same excalibur-store as in
Cocoon, so I don't know what's going on.

> 3) I think I might be able to fix that (just rebuilding my Eclipse 
> environment...), by setting the LinkGatherer to return null in response 
> to getValidity()
> 4) I just started thinking about your excludes code (assuming that link 
> gathering does start working again). Basically, there's a number of 
> things one can exclude upon - source URI, source prefix, full source URI 
> (prefix and URI), final destination URI . How about something like:
> 
> <exclude type="regexp| wildcard" src="source-uri | source-prefix | 
> full-source-uri | dest-uri" match="<pattern>"/>
> <include type="regexp| wildcard" src="source-uri | source-prefix | 
> full-source-uri | dest-uri" match="<pattern>"/>

I'd be happy with a simple 'ignore this link', but wildcards would be
great.

I'm a bit confused by all the @src types though.  Is 'dest-uri' the final
filesystem destination?  Is there anything possible with src="dest-uri"
that isn't possible otherwise?  Does 'src-prefix' mean "ignore URIs
starting with this prefix"?  If so, why not just use a wildcard?

> With include, you can have only a very narrow part of your site
> crawled.
> 
> Note: I think the xconf format needs some serious rethinking, so this 
> would be a temporary extension.

I agree, the format isn't something that can be decided up-front.  I
wouldn't worry too much about keeping backwards-compat.

> What do you think?
> 
> I'm struggling to fit a number of projects into limited time (1 1/2 
> hours per day - want to do Cocoon stuff, but need to work on some other 
> sites), but I'm keen to get Cocoon working for you.

Thanks very much :)  I'm in the same boat, working on Forrest in the
evenings.  No rush -- there's plenty of other stuff to keep us busy
before a release.


--Jeff

PS: in your CLI experiments, have you ever encountered a bug where the
last link in a page isn't crawled?  I'll try to come up with a decent
replicable example, but thought I'd mention it anyway.


> Regards, Upayavira
> 

Re: broken links to "site:" URLs

Posted by Upayavira <uv...@upaya.co.uk>.
Jeff Turner wrote:

>On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
>  
>
>>I rebuilt my local Forrest doco today but i get all these strange
>>error messages about "site:" and "ext:" URLs being broken.
>>Here is one example...
>>------
>>...
>>* [0] your-project.pdf
>>X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
>>* [48] cap.html
>>...
>>------
>>
>>On the other hand, i have a project site that builds with no such
>>problems. So i do not know what is going on. Any clues?
>>    
>>
>
>The problem is with the new CLI: we have no way to exclude certain URLs
>from being traversed.  The Forrest site gives these broken links because
>sitemap-ref.xml deliberately references some raw XML (index.xml), which
>contains refs to untranslated links like 'site:contrib'.  It is just an
>annoyance really -- doesn't harm the actual output.
>
>If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
>onto the Cocoon CLI so we can do a long-overdue 0.5 release.
>
Jeff,

Are you saying that the CLI is holding back a Forrest release? Is there a 
timescale for it?

A few points:

1) If you switch back to link view, would that enable you to achieve 
your 'excludes' requirement?
2) The LinkGatherer doesn't currently work, as a recent fix to caching 
broke it. It assumes that the LinkGatherer component isn't cached, as 
its 'gathering' side effect isn't cached.
3) I think I might be able to fix that (just rebuilding my Eclipse 
environment...), by setting the LinkGatherer to return null in response 
to getValidity()
4) I just started thinking about your excludes code (assuming that link 
gathering does start working again). Basically, there's a number of 
things one can exclude upon - source URI, source prefix, full source URI 
(prefix and URI), final destination URI . How about something like:

<exclude type="regexp| wildcard" src="source-uri | source-prefix | 
full-source-uri | dest-uri" match="<pattern>"/>
<include type="regexp| wildcard" src="source-uri | source-prefix | 
full-source-uri | dest-uri" match="<pattern>"/>

With include, you can have only a very narrow part of your site crawled.
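
To show how the type attribute in the proposal above might pick the matcher, here is a rough Python sketch. It is purely illustrative: the <crawl> wrapper element, both patterns, and the function names are made up, the src attribute is left out for brevity, and fnmatch stands in for a real wildcard matcher.

```python
import re
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

# Hypothetical fragment; the <crawl> wrapper and both patterns are made up.
CONFIG = r"""
<crawl>
  <include type="wildcard" match="docs/**"/>
  <exclude type="regexp" match=".*\.pdf"/>
</crawl>
"""

def load_rules(xml_text):
    """Return a list of (tag, matcher) pairs, one per include/exclude element."""
    rules = []
    for element in ET.fromstring(xml_text):
        if element.tag not in ("include", "exclude"):
            continue
        pattern = element.get("match")
        if element.get("type", "wildcard") == "regexp":
            compiled = re.compile(pattern)
            matcher = lambda uri, c=compiled: c.fullmatch(uri) is not None
        else:
            # fnmatch stands in for a real wildcard matcher here
            matcher = lambda uri, p=pattern: fnmatchcase(uri, p)
        rules.append((element.tag, matcher))
    return rules

def matching_rules(uri, rules):
    """List the tags of every rule whose pattern matches uri."""
    return [tag for tag, matcher in rules if matcher(uri)]

rules = load_rules(CONFIG)
print(matching_rules("docs/your-project.pdf", rules))
print(matching_rules("docs/index.html", rules))
```

How matching include and exclude rules are then combined is a separate decision that this sketch leaves open.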

Note: I think the xconf format needs some serious rethinking, so this 
would be a temporary extension.

What do you think?

I'm struggling to fit a number of projects into limited time (1 1/2 
hours per day - want to do Cocoon stuff, but need to work on some other 
sites), but I'm keen to get Cocoon working for you.

Regards, Upayavira



Re: broken links to "site:" URLs

Posted by Jeff Turner <je...@apache.org>.
On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
> I rebuilt my local Forrest doco today but i get all these strange
> error messages about "site:" and "ext:" URLs being broken.
> Here is one example...
> ------
> ...
> * [0] your-project.pdf
> X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
> * [48] cap.html
> ...
> ------
> 
> On the other hand, i have a project site that builds with no such
> problems. So i do not know what is going on. Any clues?

The problem is with the new CLI: we have no way to exclude certain URLs
from being traversed.  The Forrest site gives these broken links because
sitemap-ref.xml deliberately references some raw XML (index.xml), which
contains refs to untranslated links like 'site:contrib'.  It is just an
annoyance really -- doesn't harm the actual output.

If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
onto the Cocoon CLI so we can do a long-overdue 0.5 release.
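
A toy model of what happens, in Python, may help. It is only a sketch under simplifying assumptions: the site dictionary stands in for the sitemap pipelines, the link structure is reduced to the three URIs mentioned above, and fnmatch is used for the exclude patterns. It is not the CLI's crawler.

```python
from fnmatch import fnmatchcase

def crawl(site, start, excludes=()):
    """Traverse links from start and return a list of (status, uri) pairs."""
    report, seen, queue = [], set(), [start]
    while queue:
        uri = queue.pop(0)
        if uri in seen or any(fnmatchcase(uri, p) for p in excludes):
            continue
        seen.add(uri)
        if uri not in site:
            # Nothing can serve this URI, so it is reported as broken.
            report.append(("BROKEN", uri))
            continue
        report.append(("OK", uri))
        queue.extend(site[uri])
    return report

site = {
    "sitemap-ref.html": ["index.xml"],
    "index.xml": ["site:contrib"],   # raw XML whose links were never translated
}

print(crawl(site, "sitemap-ref.html"))
print(crawl(site, "sitemap-ref.html", excludes=["site:*"]))
```

Without an exclude the crawler follows the raw XML and trips over site:contrib; with one, the untranslated link is simply never requested.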


--Jeff

> --David
>