Posted to dev@forrest.apache.org by Upayavira <uv...@upaya.co.uk> on 2003/08/09 07:33:31 UTC

CLI Reporting

Dear Forresters,

I asked this of cocoon-dev and got no response, so I'll ask here.

I'm in the process of completing a significant rewrite of the Cocoon CLI, which I 
hope the Cocoon and Forrest communities will accept. It supports most of the 
existing functionality, but the code is much easier to follow, debug and enhance.

One consequence of this is that I can report a lot more of what is going on. I've got 
it reporting (to stdout) for each page:
	* the number of links per page
	* the number of as yet unvisited links per page
	* the time taken to generate the page
	* the actual links found in a page
	* whether those links are broken 
	* whether those links have already been added to the crawler's link list
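
For illustration only, the per-page figures listed above could be held in a small record like the following Java sketch. The class, enum and method names (PageReport, LinkState, summary) are hypothetical and are not taken from the Cocoon CLI code; the output format is likewise just an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names and output format are illustrative, not Cocoon CLI code.
class PageReport {
    // What is known about a link at the moment it is found on the page.
    enum LinkState { NEW, ALREADY_QUEUED, BROKEN }

    private final String uri;
    private final long generationMillis;
    // Link target -> state, kept in the order the links were found.
    private final Map<String, LinkState> links = new LinkedHashMap<>();

    PageReport(String uri, long generationMillis) {
        this.uri = uri;
        this.generationMillis = generationMillis;
    }

    void addLink(String target, LinkState state) {
        links.put(target, state);
    }

    int linkCount() {
        return links.size();
    }

    // Links that were not yet on the crawler's list when this page was generated.
    int unvisitedCount() {
        int count = 0;
        for (LinkState state : links.values()) {
            if (state == LinkState.NEW) {
                count++;
            }
        }
        return count;
    }

    // One summary line per page, suitable for stdout.
    String summary() {
        return uri + ": " + linkCount() + " links, " + unvisitedCount()
                + " unvisited, " + generationMillis + " ms";
    }
}
```

A caller would create one PageReport per generated page, call addLink for each link found, and print summary() once the page is done.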

I'll no doubt think of more things that can be reported. So, I have two questions:

1) Are there other things you'd like to know, to give the process greater visibility?

2) How should I report this information? There are three possibilities:
	* to the screen (results in a lot of info scrolling by)
	* to an XML file (extending the broken links xml file idea)
	* to the standard Cocoon log files (don't support structured data)

Any thoughts?

Regards, Upayavira

Re: CLI Reporting

Posted by Upayavira <uv...@upaya.co.uk>.
Jeff,

> > > X [0] site:changes      BROKEN: No pipeline matched request:
> > > site:changes from page sitemap-ref.xml
> > 
> > It'd take some thinking, but it should be doable. Partly because
> > that link might be used in more than one place, so you'd need to
> > report a broken link and all its linking pages, which is kinda the
> > other way around from what I've been planning.
> 
> Each broken link could be reported as it is encountered:

Yes, but that means remembering the 'parent' whenever you add a link to be crawled. 
And as you find more links to uncrawled pages, you can add another parent to the 
page. You only find out that a page is broken when you try spidering it, and you might 
have seen ten links to it already. Do you show just one, or all ten?
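
The bookkeeping described above can be sketched roughly as follows. This is a minimal Java sketch under stated assumptions, not the actual crawler: LinkRegistry and its methods are hypothetical names, and the report line simply imitates the format quoted in this thread. It records every parent of a link as the link is seen, and only builds the report once crawling the target has failed, so it can show one parent or all of them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: none of these names come from the Cocoon CLI.
class LinkRegistry {
    // Link target -> every page seen linking to it, in discovery order.
    private final Map<String, Set<String>> parents = new LinkedHashMap<>();
    // Targets still waiting to be crawled.
    private final Queue<String> pending = new ArrayDeque<>();

    // Record that 'parent' links to 'target'; queue the target only the first time it is seen.
    void addLink(String target, String parent) {
        Set<String> known = parents.get(target);
        if (known == null) {
            known = new LinkedHashSet<>();
            parents.put(target, known);
            pending.add(target);
        }
        known.add(parent);
    }

    // Next target to crawl, or null when the queue is empty.
    String nextToCrawl() {
        return pending.poll();
    }

    // Called after crawling 'target' has failed: one report line per linking page.
    List<String> brokenReport(String target, String reason) {
        List<String> lines = new ArrayList<>();
        Set<String> known = parents.getOrDefault(target, Collections.<String>emptySet());
        for (String parent : known) {
            lines.add("X [0] " + target + "\tBROKEN: " + reason + " from page " + parent);
        }
        return lines;
    }
}
```

Showing just one parent would mean taking the first line of brokenReport; showing all ten means printing the whole list. The cost is one extra set per link target held for the duration of the crawl.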

> X [0] site:changes      BROKEN: No pipeline matched request: site:changes from page index.xml
> ....
> X [0] site:changes      BROKEN: No pipeline matched request: site:changes from page sitemap-ref.xml
> ....
> 
> Meaning only one link, to the page's parent, need be recorded when the
> link sampler encounters it.

But the broken link may be found when it is crawled, not when the link sampler sees 
it.
 
> > > Or even better, 
> > > 
> > > X [0] site:changes      BROKEN: No pipeline matched request:
> > > site:changes from page sitemap-ref.xml line 102
> > 
> > Given the way that links are gathered, it won't be possible to
> > calculate line numbers (i.e. in a SAX pipeline).
> 
> Well there's the org.xml.sax.Locator object, but I don't know if
> Cocoon does much with it.

I don't think so, as I remember others talking about that problem.
 
> > > > 2) How should I report this information? There are three
> > > > possibilities:
> > > > 	* to the screen (results in a lot of info scrolling by)
> > > > 	* to an XML file (extending the broken links xml file idea)
> > > > 	* to the standard Cocoon log files (don't support structured data)
> > > 
> > > Perhaps real-time text, as currently done, with full XML logged at
> > > the same time?  Then one day we could have a web interface for
> > > Forrest with a "render this site to disk" button.   Once the CLI
> > > is done, we could transform the output to HTML.
> > 
> > Okay. So we have minimal output to stdout, and XML generated to log
> > what's been going on. And I'll use SAX for creating that XML rather
> > than DOM so that it'll be ready for a decent cocoon based web
> > interface (such as Unico's publishingService).
> > 
> > Thanks for this. I'll see what I can get going, and then put what
> > I've got into the Cocoon scratchpad. I hope you'll be willing to
> > give it a go.
> 
> Certainly will.  Thanks!

Great. I'll get on and code something tomorrow, and will just come up with something 
that you can comment upon.

Regards, Upayavira


Re: CLI Reporting

Posted by Jeff Turner <je...@apache.org>.
On Sat, Aug 09, 2003 at 01:37:28PM +0100, Upayavira wrote:
> On 9 Aug 2003 at 19:42, Jeff Turner wrote:
...
> > X [0] site:changes      BROKEN: No pipeline matched request:
> > site:changes from page sitemap-ref.xml
> 
> It'd take some thinking, but it should be doable. Partly because that link might be 
> used in more than one place, so you'd need to report a broken link and all its linking 
> pages, which is kinda the other way around from what I've been planning.

Each broken link could be reported as it is encountered:

X [0] site:changes      BROKEN: No pipeline matched request: site:changes from page index.xml
....
X [0] site:changes      BROKEN: No pipeline matched request: site:changes from page sitemap-ref.xml
....

Meaning only one link, to the page's parent, need be recorded when the
link sampler encounters it.

> > Or even better, 
> > 
> > X [0] site:changes      BROKEN: No pipeline matched request:
> > site:changes from page sitemap-ref.xml line 102
> 
> Given the way that links are gathered, it won't be possible to calculate line numbers 
> (i.e. in a SAX pipeline).

Well there's the org.xml.sax.Locator object, but I don't know if Cocoon
does much with it.
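
For reference, this is roughly how org.xml.sax.Locator is used when a handler receives events straight from a parser. It is a minimal Java sketch, assuming links are carried in an "href" attribute; LinkLineCollector is a hypothetical name. It says nothing about whether Cocoon passes a Locator through its pipelines, which is the open question here, and line numbers only describe the document the parser actually read.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects href values together with the parser-reported line.
class LinkLineCollector extends DefaultHandler {
    private Locator locator;
    final List<String> found = new ArrayList<>();

    // A parser that supports locators calls this before startDocument().
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String href = atts.getValue("href");
        if (href != null) {
            // -1 when no locator was supplied or the line is unknown.
            int line = (locator != null) ? locator.getLineNumber() : -1;
            found.add(href + " line " + line);
        }
    }

    // Parse an XML string and return "href line N" entries in document order.
    static List<String> collect(String xml) {
        try {
            LinkLineCollector handler = new LinkLineCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.found;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

Once events have passed through a transformer, any position reported no longer refers to the source file, so this would only help if links were sampled directly after parsing.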

> > > 2) How should I report this information? There are three
> > > possibilities:
> > > 	* to the screen (results in a lot of info scrolling by)
> > > 	* to an XML file (extending the broken links xml file idea)
> > > 	* to the standard Cocoon log files (don't support structured data)
> > 
> > Perhaps real-time text, as currently done, with full XML logged at the
> > same time?  Then one day we could have a web interface for Forrest
> > with a "render this site to disk" button.   Once the CLI is done, we
> > could transform the output to HTML.
> 
> Okay. So we have minimal output to stdout, and XML generated to log what's been 
> going on. And I'll use SAX for creating that XML rather than DOM so that it'll be ready 
> for a decent cocoon based web interface (such as Unico's publishingService).
> 
> Thanks for this. I'll see what I can get going, and then put what I've got into the 
> Cocoon scratchpad. I hope you'll be willing to give it a go.

Certainly will.  Thanks!

--Jeff

> Regards, Upayavira
> 

Re: CLI Reporting

Posted by Upayavira <uv...@upaya.co.uk>.
On 9 Aug 2003 at 19:42, Jeff Turner wrote:

> > I'm in the process of completing a significant rewrite of the Cocoon
> > CLI, which I hope the Cocoon and Forrest communities will accept. It
> > supports most of the existing functionality, but the code is much
> > easier to follow, debug and enhance.
> 
> Cool :) We'll ship Forrest 0.5 with pretty much whatever you come up
> with that has ignore-these-links support ;)  

Keep saying that and I'll get to it! I haven't yet reworked the xconf format. I'm sure I'll 
add excludes easily enough when I get around to that.

> > One consequence of this is that I can report a lot more of what is
> > going on. I've got it reporting (to stdout) for each page:
> > 	* the number of links per page
> > 	* the number of as yet unvisited links per page
> > 	* the time taken to generate the page
> > 	* the actual links found in a page
> > 	* whether those links are broken
> > 	* whether those links have already been added to the crawler's link list
> > 
> > I'll no doubt think of more things that can be reported. So, I have
> > two questions:
> > 
> > 1) Are there other things you'd like to know, to give the process
> > greater visibility?
> 
> IMO the current minimal output is fine.  If you'd like to report more
> ('time taken' would be useful), that's also fine.

Okay, so I'll add time taken to the screen output.

> What I'd *love* to see is better error messages when something breaks.
> Specifically, when there is a broken link, I'd like to know which page
> the link was in.  Currently there is no way to tell.  One just gets
> errors like:
> 
> X [0] site:changes      BROKEN: No pipeline matched request:
> site:changes
> 
> Ideally one would get:
> 
> X [0] site:changes      BROKEN: No pipeline matched request:
> site:changes from page sitemap-ref.xml

It'd take some thinking, but it should be doable. Partly because that link might be 
used in more than one place, so you'd need to report a broken link and all its linking 
pages, which is kinda the other way around from what I've been planning.
 
> Or even better, 
> 
> X [0] site:changes      BROKEN: No pipeline matched request:
> site:changes from page sitemap-ref.xml line 102

Given the way that links are gathered, it won't be possible to calculate line numbers 
(i.e. in a SAX pipeline).
 
> > 2) How should I report this information? There are three
> > possibilities:
> > 	* to the screen (results in a lot of info scrolling by)
> > 	* to an XML file (extending the broken links xml file idea)
> > 	* to the standard Cocoon log files (don't support structured data)
> 
> Perhaps real-time text, as currently done, with full XML logged at the
> same time?  Then one day we could have a web interface for Forrest
> with a "render this site to disk" button.   Once the CLI is done, we
> could transform the output to HTML.

Okay. So we have minimal output to stdout, and XML generated to log what's been 
going on. And I'll use SAX for creating that XML rather than DOM so that it'll be ready 
for a decent cocoon based web interface (such as Unico's publishingService).
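
As a rough illustration of writing the report as SAX events rather than building a DOM, here is a minimal Java sketch using the standard JAXP TransformerHandler as the serializer. The element and attribute names (report, page, uri, status) and the class name SaxReportSketch are assumptions made for the example; no report format has been settled in this thread.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: streams one <page> element per crawled page as SAX events.
class SaxReportSketch {
    // Each entry of 'pages' is { uri, status }.
    static String write(String[][] pages) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            // With no stylesheet, the handler acts as an identity serializer.
            TransformerHandler out = factory.newTransformerHandler();
            out.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter buffer = new StringWriter();
            out.setResult(new StreamResult(buffer));

            out.startDocument();
            out.startElement("", "report", "report", new AttributesImpl());
            for (String[] page : pages) {
                AttributesImpl atts = new AttributesImpl();
                atts.addAttribute("", "uri", "uri", "CDATA", page[0]);
                atts.addAttribute("", "status", "status", "CDATA", page[1]);
                out.startElement("", "page", "page", atts);
                out.endElement("", "page", "page");
            }
            out.endElement("", "report", "report");
            out.endDocument();
            return buffer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not write report", e);
        }
    }
}
```

Because the producer only emits events, the same calls could feed any other SAX ContentHandler instead of a serializer, which is the property that makes this approach convenient for a later pipeline-based interface.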

Thanks for this. I'll see what I can get going, and then put what I've got into the 
Cocoon scratchpad. I hope you'll be willing to give it a go.

Regards, Upayavira


Re: CLI Reporting

Posted by Jeff Turner <je...@apache.org>.
On Sat, Aug 09, 2003 at 06:33:31AM +0100, Upayavira wrote:
> Dear Forresters,
> 
> I asked this of cocoon-dev and got no response, so I'll ask here.
> 
> I'm in the process of completing a significant rewrite of the Cocoon CLI, which I 
> hope the Cocoon and Forrest communities will accept. It supports most of the 
> existing functionality, but the code is much easier to follow, debug and enhance.

Cool :) We'll ship Forrest 0.5 with pretty much whatever you come up with
that has ignore-these-links support ;)  

> One consequence of this is that I can report a lot more of what is going on. I've got 
> it reporting (to stdout) for each page:
> 	* the number of links per page
> 	* the number of as yet unvisited links per page
> 	* the time taken to generate the page
> 	* the actual links found in a page
> 	* whether those links are broken 
> 	* whether those links have already been added to the crawler's link list
> 
> I'll no doubt think of more things that can be reported. So, I have two questions:
> 
> 1) Are there other things you'd like to know, to give the process greater visibility?

IMO the current minimal output is fine.  If you'd like to report more
('time taken' would be useful), that's also fine.

What I'd *love* to see is better error messages when something breaks.
Specifically, when there is a broken link, I'd like to know which page
the link was in.  Currently there is no way to tell.  One just gets
errors like:

X [0] site:changes      BROKEN: No pipeline matched request: site:changes

Ideally one would get:

X [0] site:changes      BROKEN: No pipeline matched request: site:changes from page sitemap-ref.xml

Or even better, 

X [0] site:changes      BROKEN: No pipeline matched request: site:changes from page sitemap-ref.xml line 102

> 2) How should I report this information? There are three possibilities:
> 	* to the screen (results in a lot of info scrolling by)
> 	* to an XML file (extending the broken links xml file idea)
> 	* to the standard Cocoon log files (don't support structured data)

Perhaps real-time text, as currently done, with full XML logged at the
same time?  Then one day we could have a web interface for Forrest with a
"render this site to disk" button.   Once the CLI is done, we could
transform the output to HTML.


--Jeff

> 
> Any thoughts?
> 
> Regards, Upayavira