Posted to dev@cocoon.apache.org by Nicola Ken Barozzi <ni...@apache.org> on 2002/12/17 15:52:38 UTC

[RT] New Cocoon Site Crawler Environment

Of all these discussions, one thing sticks out: we must 
rewrite/fix/enhance/whatever the Cocoon crawler.

Reasons:

  - speed
  - correct link gathering

but mostly

  - speed

Why is it so slow?
Mostly because it generates each source three times.

* to get the links.
* for each link to get the mime/type.
* to get the page itself

To do this it uses two environments, the FileSavingEnvironment and the 
LinkSamplingEnvironment.
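
To make the cost concrete, here is a very rough sketch of what the CLI 
does per URI (simplified Java, not the actual Main.java; the private 
methods just stand in for the calls made through the two environments):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class CrawlSketch {

        public void crawl(String uri) throws Exception {
            // 1. LinkSamplingEnvironment: process the URI once just to
            //    harvest its links.
            List links = getLinks(uri);

            // 2. Process every link again only to learn its mime type,
            //    so a suitable local file name/extension can be chosen.
            for (Iterator i = links.iterator(); i.hasNext();) {
                String link = (String) i.next();
                String type = getType(link);
                // ... use "type" to pick a file extension for "link" ...
            }

            // 3. FileSavingEnvironment: process the URI a third time,
            //    with the translated links, and save the result to disk.
            save(uri);
        }

        // Stubs standing in for the real environment calls.
        private List getLinks(String uri) { return new ArrayList(); }
        private String getType(String uri) { return "text/html"; }
        private void save(String uri) { }
    }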


                          {~}


I've taken a look at the crawler project in Lucene sandbox, but its 
objectives are totally different from ours. We could in the future add a 
plugin to it to be able to index a Cocoon site using the link view, but 
it does indexing, not saving a site locally.
So our option is to do the work in Cocoon.


                          {~}


The three calls to Cocoon can be reduced quite easily to two, by making 
the call to the FileSavingEnvironment return both things at the same 
time and using those. Or by caching the result as the proposed Ant task 
in Cocoon scratchpad does.

The problem arises with the LinkSamplingEnvironment, because it uses a 
Cocoon view to get the links. Thus we need to ask Cocoon two things, the 
links and the contents.

Let's leave aside the view concept for now, and think about how to 
sample links from a content being produced.

We can use a LinkSamplingPipeline.
Yes, a pipeline that introduces a connector just after the 
"content"-tagged sitemap component and saves the links found in the 
environment.

Thus after the call we would have in the environment the result, the 
type and the links, all in one call.

In essence, we are creating a non-blocking view that runs in parallel 
with the main pipeline and reports the results to the environment.
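
As a sketch of the connector idea - the class name is invented and I use 
plain SAX instead of the real Cocoon transformer interfaces - something 
like this could sit just after the "content"-tagged component, record 
the links, and pass the events on untouched:

    import java.util.ArrayList;
    import java.util.List;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;

    public class LinkSamplingConnector extends XMLFilterImpl {

        private final List links = new ArrayList();

        public void startElement(String uri, String localName,
                                 String qName, Attributes atts)
                throws SAXException {
            // Harvest anything that looks like a link...
            String href = atts.getValue("href");
            if (href == null) {
                href = atts.getValue("http://www.w3.org/1999/xlink", "href");
            }
            if (href != null) {
                links.add(href);
            }
            // ...then forward the event unchanged, so the main pipeline
            // output is not affected.
            super.startElement(uri, localName, qName, atts);
        }

        // The environment would collect these after processing.
        public List getLinks() {
            return links;
        }
    }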

This is how views are managed in the interpreted sitemap, in a transformer:


         // Check view
         if (this.views != null) {

             // Inform the pipeline that we have a branch point
             context.getProcessingPipeline().informBranchPoint();

             String cocoonView = env.getView();
             if (cocoonView != null) {

                 // Get the view node, if a view of that name is declared
                 ProcessingNode viewNode =
                    (ProcessingNode) this.views.get(cocoonView);

                 if (viewNode != null) {
                     if (getLogger().isInfoEnabled()) {
                         getLogger().info("Jumping to view "
                            + cocoonView + " from transformer at "
                            + this.getLocation());
                     }
                     // Branch: from here on, only the view is processed
                     return viewNode.invoke(env, context);
                 }
             }
         }

         // Return false to continue sitemap invocation
         return false;
     }

It effectively branches and continues only with the view.

Wait, this means that when the CLI recreates a site it doesn't save the 
views, right?
Correct, views are simply ignored by the CLI and not created on disk. 
This is also due to how views are invoked in Cocoon, with a request 
parameter (?cocoon-view=...), so they cannot be saved to disk with the 
correct URL.

But even if I don't save it, I may need it for internal Cocoon 
processing, as is the case with the crawler.

I don't know if it's best to use a special pipeline, to cache the views, 
or something else, but we need to find a solution.

Any idea?

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)


Re: [RT] New Cocoon Site Crawler Environment

Posted by Vadim Gritsenko <va...@verizon.net>.
Nicola Ken Barozzi wrote:

>
> Vadim Gritsenko wrote:
>
>> Nicola Ken Barozzi wrote:
>> ...
>>
>>> Why is it so slow?
>>> Mostly because it generates each source three times.
>>

[...]

>>> Thus after the call we would have in the environment the result, the 
>>> type and the links, all in one call.
>>
>>
>> Type and links - yes, I agree. Content - no, we won't get correct 
>> content because links will not be translated in this content. And 
>> produced content is impossible to "re-link" because it can be any 
>> binary format supporting links (MS Excel, PDF, MS Word, ...)
>
>
> Ok, you are correct.
>
> Please add here the results we have come to in our fast AIM 
> discussion, I have to run now.


Ok, here is the thing. It is possible to get everything in one call (and 
- this remark goes to Berin - without an increase in resource 
consumption), if we (re)move the translateURI functionality from Main. 
The problem is that the getType() method is used for only one purpose - 
to decide on a good name for the resulting file, i.e. on a good 
extension according to the MIMEUtils settings. The other problem is that 
getLinks() is used only to collect this information (about good names) 
and deliver it to the LinkTranslator transformer, which does the actual 
work of replacing links.

So, if we remove link translation from Main.java, where can it go and 
how should it be done? There are several options.

1) Do not change names.
This works for everything except URIs ending with "/" - and for such 
URIs we can use the existing solution: add Constants.INDEX_URI to the end.
Points in favor of this method:
    * The generated site stays close to the live site with regard to 
file names.
    * In Main.java only one call is needed.
2) Change names according to a translation table supplied to Main by 
the user.
This solution provides some flexibility (maybe too much of it).
Points in favor of this method:
    * Flexibility.
    * Same as above.
3) Change names as we did before - by utilizing MIMEUtils.
Points in favor of this method:
    * This is the backward-compatible way.
    * We still have to know the types of all links to do the 
translation, which means an extra getType() call on every link 
(excluding duplicates - that information is cached). Hm, this one, 
actually, is not in favor...

And this name translation can happen in the LinkTranslator transformer, 
which currently does the link translation magic. If we move all the URI 
translation logic there, whatever it ends up being (see points 1-3 
above), it will be possible to implement Main in one step instead of 
three.

The exception is case (3), where complexity will be added to 
LinkTranslator, but even then we reduce the calls from 3 (per link) to 2.
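
To make it a bit more concrete, this is roughly what the translation 
inside LinkTranslator could look like for options (1) and (3) combined 
(the method names are made up, this is not the actual MIMEUtils API):

    public class NameTranslationSketch {

        /** Map a link URI and its mime type to a local file name. */
        public static String translate(String uri, String mimeType) {
            // Option (1): URIs ending in "/" get an index name appended.
            if (uri.endsWith("/")) {
                uri = uri + "index";
            }
            // Option (3): pick an extension from the mime type,
            // the way a MIMEUtils-style lookup would.
            String ext = defaultExtension(mimeType);
            return uri.endsWith(ext) ? uri : uri + ext;
        }

        // Stand-in for a MIMEUtils-style table; unknown types keep their name.
        private static String defaultExtension(String mimeType) {
            if ("text/html".equals(mimeType)) return ".html";
            if ("application/pdf".equals(mimeType)) return ".pdf";
            return "";
        }
    }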


> Thanks :-) 


You are welcome. I hope the story was understandable enough.


>> But, there is hope to get it all in one call - if the 
>> LinkSamplingTransformer will also be a LinkTranslatingTransformer and 
>> will call Main back on every new link (recursive processing - as 
>> opposed to the iterative processing in the current implementation of 
>> Main). The drawback of the recursive approach is increased memory 
>> consumption.
>
>
> NAO = not an option


Yes, it was a totally wrong idea on my side.


> It doesn't scale, you are right.


And it never did. Amen.

Vadim





Re: [RT] New Cocoon Site Crawler Environment

Posted by Berin Loritsch <bl...@apache.org>.
Nicola Ken Barozzi wrote:
> 
> 
> Berin Loritsch wrote:
> 
>> For instance, part of the issue resides in the fact that any client
>> (i.e. CLI environment or Servlet) can only access one view at a time
>> for any resource.
>>
>> Why not allow a client to access all views to a resource that it
>> needs/wants simultaneously?  That will allow things like the all
>> important profiling information to be appended after HTML pages
>> are rendered.
> 
> 
> I thought of this too. But in practice?...
> 
>> Cocoon is so entrenched in the single path of execution mentality that
>> environments that need the extra complexity can't have it.
>>
>> Each resource should only need to be rendered once, and only
>> once.  Each view to the resource should be accessible by a client.
>>
>> For instance, the CLI client wants the Link/Mime-Type information
>> and the content itself.  The Link/Mime-Type information is accessed
>> via the LinkSamplingEnvironment.  In reality, that is a poor name
>> for what you are really wanting to represent.  It should be the
>> LinkSamplingView.  That view caches information that can be incorporated
>> back into the list of links we are resolving.
> 
> 
> Ok, but in practice, how does the client request the view results?
> I kinda like this non-blocking view concept, but fail to see clearly the 
>  practical implementation.

I think this gets back to the whole Multi-Path pipelining thread a while
back (i.e. allowing the multiplexing of a pipeline to serialize results
to the disk while sending the results to the user at the same time).

In the end, I don't think that "Views" as they are defined currently are
what we are really after.  What we want is a way of siphoning
information from our pipeline so that we can use it as we see fit.

As such, what we really need is something like a "publish/subscribe"
mechanism similar to the way Avalon Instrument works.  I.e. if we don't
need it, we don't waste precious resources--but if we do need it, then
it is available to us.

I am not sure of the mechanism, but perhaps we can put our heads
together to come up with the best solution.
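
Just to make the idea a bit more tangible, here is a minimal sketch of 
what such a subscription could look like - none of these types exist, 
the point is only that unused information costs (almost) nothing:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // A pipeline publishes what it knows; only registered subscribers pay.
    interface PipelineInfoListener {
        void linkFound(String href);
        void typeResolved(String mimeType);
    }

    class PipelineInfoPublisher {

        private final List listeners = new ArrayList();

        public void subscribe(PipelineInfoListener listener) {
            listeners.add(listener);
        }

        public void publishLink(String href) {
            // With no subscribers (the normal servlet case) this is a no-op.
            for (Iterator i = listeners.iterator(); i.hasNext();) {
                ((PipelineInfoListener) i.next()).linkFound(href);
            }
        }

        public void publishType(String mimeType) {
            for (Iterator i = listeners.iterator(); i.hasNext();) {
                ((PipelineInfoListener) i.next()).typeResolved(mimeType);
            }
        }
    }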



Re: [RT] New Cocoon Site Crawler Environment

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Berin Loritsch wrote:
> Nicola Ken Barozzi wrote:
> 
>>
>> Vadim Gritsenko wrote:
>>
>>> Nicola Ken Barozzi wrote:
>>> ...
>>>
>>>> Why is it so slow?
>>>> Mostly because it generates each source three times.
>>>
>>>
>>
>> [...]
>>
>>> Note: It gets the page with all the links translated using data 
>>> gathered on previous step.
>>
>>
>>
>> [...]
>>
>>> We can combine getType and getLinks calls into one, see below.
> 
> 
> 
> If it does not scale the way things are now--and I agree generating
> the source three times is two times too many--then we may have to
> change things a bit more deeply.
> 
> For instance, part of the issue resides in the fact that any client
> (i.e. CLI environment or Servlet) can only access one view at a time
> for any resource.
> 
> Why not allow a client to access all views to a resource that it
> needs/wants simultaneously?  That will allow things like the all
> important profiling information to be appended after HTML pages
> are rendered.

I thought of this too. But in practice?...

> Cocoon is so entrenched in the single path of execution mentality that
> environments that need the extra complexity can't have it.
> 
> Each resource should only need to be rendered once, and only
> once.  Each view to the resource should be accessible by a client.
> 
> For instance, the CLI client wants the Link/Mime-Type information
> and the content itself.  The Link/Mime-Type information is accessed
> via the LinkSamplingEnvironment.  In reality, that is a poor name
> for what you are really wanting to represent.  It should be the
> LinkSamplingView.  That view caches information that can be incorporated
> back into the list of links we are resolving.

Ok, but in practice, how does the client request the view results?
I kinda like this non-blocking view concept, but fail to see clearly the 
  practical implementation.

> Another issue I have that is related to link crawling, but not to
> the multi-view access.  It is the error page generation.  It is not
> *always* an error if a link is not handled by Cocoon.
> 
> A common example is the fact that JavaDocs are generated outside of
> Cocoon, and the error page that screws up the link to the JavaDocs
> is a *bad* thing.

This is not really a CLI error, but comes from the fact that Cocoon 
(wrongly IMHO) doesn't handle that part of the URI space. We are dealing 
with this concept in Forrest, where we have seen that complete sub-URI 
spaces can be dealt with without link crawling, and so it's feasible to 
have Cocoon serve all those javadocs and not break.

Anyway, there is a way of keeping a link from being crawled, by setting 
the xlink attribute.

> Perhaps we should allow for known exclusions, or turn off the error
> page generation for the missing links--recording them to a file like
> we do now.

Yup, should be settable +1

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)


Re: [RT] New Cocoon Site Crawler Environment

Posted by Berin Loritsch <bl...@apache.org>.
Nicola Ken Barozzi wrote:
> 
> Vadim Gritsenko wrote:
> 
>> Nicola Ken Barozzi wrote:
>> ...
>>
>>> Why is it so slow?
>>> Mostly because it generates each source three times.
>>
> 
> [...]
> 
>> Note: It gets the page with all the links translated using data 
>> gathered on previous step.
> 
> 
> [...]
> 
>> We can combine getType and getLinks calls into one, see below.


If it does not scale the way things are now--and I agree generating
the source three times is two times too many--then we may have to
change things a bit more deeply.

For instance, part of the issue resides in the fact that any client
(i.e. CLI environment or Servlet) can only access one view at a time
for any resource.

Why not allow a client to access all views to a resource that it
needs/wants simultaneously?  That will allow things like the all
important profiling information to be appended after HTML pages
are rendered.

Cocoon is so entrenched in the single path of execution mentality that
environments that need the extra complexity can't have it.

Each resource should only need to be rendered once, and only
once.  Each view to the resource should be accessible by a client.

For instance, the CLI client wants the Link/Mime-Type information
and the content itself.  The Link/Mime-Type information is accessed
via the LinkSamplingEnvironment.  In reality, that is a poor name
for what you are really wanting to represent.  It should be the
LinkSamplingView.  That view caches information that can be incorporated
back into the list of links we are resolving.

Another issue I have that is related to link crawling, but not to
the multi-view access.  It is the error page generation.  It is not
*always* an error if a link is not handled by Cocoon.

A common example is the fact that JavaDocs are generated outside of
Cocoon, and the error page that screws up the link to the JavaDocs
is a *bad* thing.

Perhaps we should allow for known exclusions, or turn off the error
page generation for the missing links--recording them to a file like
we do now.

Just some food for thought.



Re: [RT] New Cocoon Site Crawler Environment

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Vadim Gritsenko wrote:
> Nicola Ken Barozzi wrote:
> ...
> 
>> Why is it so slow?
>> Mostly because it generates each source three times.

[...]

> Note: It gets the page with all the links translated using data gathered 
> on previous step.

[...]

> We can combine getType and getLinks calls into one, see below.
> 
> 
>> Let's leave aside the view concept for now, and think about how to 
>> sample links from a content being produced.
>>
>> We can use a LinkSamplingPipeline.
>> Yes, a pipeline that introduces a connector just after the 
>> "content"-tagged sitemap component and saves the links found in the 
>> environment. 
> 
> 
> Mmmm... Correction: a pipeline that introduces a LinkSamplingTransformer 
> right before the serializer. You can't get links from the content view 
> because it might (will) have none yet. Links must be sampled right 
> before the serializer, as the links view does.

The link view can be set to kick in at any point of the pipeline; it's 
always SAX.
It's up to the sitemap editor to say which step is the semantically 
rich one. It can be the first, in the middle, or right before the serializer.

>> Thus after the call we would have in the environment the result, the 
>> type and the links, all in one call.
> 
> Type and links - yes, I agree. Content - no, we won't get correct 
> content because links will not be translated in this content. And 
> produced content is impossible to "re-link" because it can be any binary 
> format supporting links (MS Excel, PDF, MS Word, ...)

Ok, you are correct.

Please add here the results we have come to in our fast AIM discussion, 
I have to run now.

Thanks :-)

> But, there is hope to get it all in one call - if the 
> LinkSamplingTransformer will also be a LinkTranslatingTransformer and 
> will call Main back on every new link (recursive processing - as opposed 
> to the iterative processing in the current implementation of Main). The 
> drawback of the recursive approach is increased memory consumption.

NAO = not an option

It doesn't scale, you are right.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)


Re: [RT] New Cocoon Site Crawler Environment

Posted by Vadim Gritsenko <va...@verizon.net>.
Nicola Ken Barozzi wrote:
...

> Why is it so slow?
> Mostly because it generates each source three times.
>
> * to get the links. 


* to get the mime type

> * for each link 


...whose mime type is not known yet...

> to get the mime/type.
> * to get the page itself 


Note: It gets the page with all the links translated using data gathered 
on previous step.


> To do this it uses two environments, the FileSavingEnvironment and the 
> LinkSamplingEnvironment.


...

> The three calls to Cocoon can be reduced quite easily to two, by 
> making the call to the FileSavingEnvironment return both things at the 
> same time and using those.


Clarify: what two things?


> Or by caching the result as the proposed Ant task in Cocoon scratchpad 
> does.
>
> The problem arises with the LinkSamplingEnvironment, because it uses a 
> Cocoon view to get the links. Thus we need to ask Cocoon two things, 
> the links and the contents. 


We can combine getType and getLinks calls into one, see below.


> Let's leave aside the view concept for now, and think about how to 
> sample links from a content being produced.
>
> We can use a LinkSamplingPipeline.
> Yes, a pipeline that introduces a connector just after the 
> "content"-tagged sitemap component and saves the links found in the 
> environment. 


Mmmm... Correction: a pipeline that introduces a LinkSamplingTransformer 
right before the serializer. You can't get links from the content view 
because it might (will) have none yet. Links must be sampled right 
before the serializer, as the links view does.


> Thus after the call we would have in the environment the result, the 
> type and the links, all in one call.


Type and links - yes, I agree. Content - no, we won't get correct 
content because links will not be translated in this content. And 
produced content is impossible to "re-link" because it can be any binary 
format supporting links (MS Excel, PDF, MS Word, ...)

But, there is hope to get it all in one call - if the 
LinkSamplingTransformer will also be a LinkTranslatingTransformer and 
will call Main back on every new link (recursive processing - as opposed 
to the iterative processing in the current implementation of Main). The 
drawback of the recursive approach is increased memory consumption.


<snip/>

Vadim






Re: [RT] New Cocoon Site Crawler Environment

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Bernhard Huber wrote:
> Hi,

Hi :-)

> <big snip/>
> 
> ask Cocoon two things, make a Generator/Transformer to do the two things,
> 
> I am now playing around with a SourceLinkStatusGenerator, which is like 
> the StatusGenerator but does not request the links of a page via an 
> http: call; it uses a processor.process() call, and it does it 
> recursively. So you ask the SourceLinkStatusGenerator to give you all 
> the outbound links of index.html, and it will return an xml document 
> with all links of the pages reachable from index.html.
> 
> You ask Cocoon: give me the content of page index.html plus its 
> outbound links.
> 
> The only problem I see is that if you ask Cocoon this question you will 
> not get a text/html response but a text/html+application/x-cocoon-links 
> response - taking the index.html example from above.
> 
> Moreover you might have to adapt the sitemap, say with a
> <map:match pattern="crawling">, and ask Cocoon the right question
> within this map:match?

Actually I'd ask the question to the Environment, because link harvesting 
has to be plugged into the pipelines or the views in a non-intrusive manner.

> Hmm, if you rely on links, you might want the LinkTransformer not to 
> throw away the page content, but to harvest the links non-destructively.

Yes.

> Hmm, that would be the best: no big sitemap changes, just another
> transforming step - the new LinkAndContentTransformer step instead of
> type="xslt" src="linkstatus.xslt" - but the content-type issue stays.

We could do away with it, and get the file as-is.

> btw, thanks for starting this RT, i don't have the passion to initiate 
> this, but it is necessary, and i appreciate it.

Wasn't it you who did the Ant stuff? Where do you think I got inspiration?

Thank *you* :-)

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)


Re: [RT] New Cocoon Site Crawler Environment

Posted by Bernhard Huber <be...@a1.net>.
Hi,

Nicola Ken Barozzi wrote:
> 
> Of all these discussions, one thing sticks out: we must 
> rewrite/fix/enhance/whatever the Cocoon crawler.
> 
> Reasons:
> 
>  - speed
>  - correct link gathering
> 
> but mostly
> 
>  - speed
> 
> Why is it so slow?
> Mostly because it generates each source three times.
> 
> * to get the links.
> * for each link to get the mime/type.
> * to get the page itself
> 
> To do this it uses two environments, the FileSavingEnvironment and the 
> LinkSamplingEnvironment.
> 
> 
>                          {~}
> 
> 
> I've taken a look at the crawler project in Lucene sandbox, but its 
> objectives are totally different from ours. We could in the future add a 
> plugin to it to be able to index a Cocoon site using the link view, but 
> it does indexing, not saving a site locally.
> So our option is to do the work in Cocoon.
> 
> 
>                          {~}
> 
> 
> The three calls to Cocoon can be reduced quite easily to two, by making 
> the call to the FileSavingEnvironment return both things at the same 
> time and using those. Or by caching the result as the proposed Ant task 
> in Cocoon scratchpad does.
> 
yup,

> The problem arises with the LinkSamplingEnvironment, because it uses a 
> Cocoon view to get the links. Thus we need to ask Cocoon two things, the 
> links and the contents.
<big snip/>

ask Cocoon two things, make a Generator/Transformer to do the two things,

I am now playing around with a SourceLinkStatusGenerator, which is like 
the StatusGenerator but does not request the links of a page via an 
http: call; it uses a processor.process() call, and it does it 
recursively. So you ask the SourceLinkStatusGenerator to give you all 
the outbound links of index.html, and it will return an xml document 
with all links of the pages reachable from index.html.

You ask Cocoon: give me the content of page index.html plus its 
outbound links.
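
A rough sketch of the recursive gathering (linksOf() is a hypothetical 
stand-in for the processor.process() call against a link-sampling 
environment; only the traversal logic is shown):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Set;

    public class SourceLinkCollector {

        /** Returns all pages reachable from startUri. */
        public List collect(String startUri) throws Exception {
            Set visited = new HashSet();
            List result = new ArrayList();
            crawl(startUri, visited, result);
            return result;
        }

        private void crawl(String uri, Set visited, List result)
                throws Exception {
            if (!visited.add(uri)) {
                return;              // already seen: avoid loops and duplicates
            }
            result.add(uri);
            for (Iterator i = linksOf(uri).iterator(); i.hasNext();) {
                crawl((String) i.next(), visited, result);
            }
        }

        // Hypothetical: process the URI with a link-sampling environment
        // (e.g. via processor.process()) and return the links it emitted.
        private List linksOf(String uri) throws Exception {
            return new ArrayList();
        }
    }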

The only problem I see is that if you ask Cocoon this question you will 
not get a text/html response but a text/html+application/x-cocoon-links 
response - taking the index.html example from above.

Moreover you might have to adapt the sitemap, say with a
<map:match pattern="crawling">, and ask Cocoon the right question
within this map:match?

Hmm, if you rely on links, you might want LinkTransformer, not to throw 
away the page content, but to harvest the links content-no-destructive.
Hmm, that would be the best no big sitemap changes, just another
transforming step, instead of type="xslt" src="linkstatus.xslt"
the new LinkAndContentTransformer step, but the content-type issue stays.

btw, thanks for starting this RT, i don't have the passion to initiate 
this, but it is necessary, and i appreciate it.

bye bernhard


