Posted to dev@forrest.apache.org by Nicola Ken Barozzi <ni...@apache.org> on 2002/12/17 15:52:38 UTC
[RT] New Cocoon Site Crawler Environment
Of all these discussions, one thing sticks out: we must
rewrite/fix/enhance/whatever the Cocoon crawler.
Reasons:
- speed
- correct link gathering
but mostly
- speed
Why is it so slow?
Mostly because it generates each source three times.
* to get the links.
* for each link, to get the MIME type.
* to get the page itself.
To do this it uses two environments, the FileSavingEnvironment and the
LinkSamplingEnvironment.
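To make the cost concrete, here is a toy sketch of the current crawl loop; method names like getLinks/getType/getPage are illustrative stand-ins for the real Environment calls, not Cocoon API:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the current crawler behavior: every call below
// forces a full generation of a source, which is why crawling is slow.
public class ThreePassCrawl {
    // Stand-in "site": each URI maps to its outbound links.
    static final Map<String, List<String>> SITE = Map.of(
        "index.html", List.of("a.html", "b.html"),
        "a.html", List.of(),
        "b.html", List.of());

    static int generations = 0; // counts how often a source is generated

    static List<String> getLinks(String uri) { generations++; return SITE.get(uri); }
    static String getType(String uri)       { generations++; return "text/html"; }
    static String getPage(String uri)       { generations++; return "<html/>"; }

    // One crawled page costs: 1 (links) + 1 per link (type) + 1 (content).
    static void crawl(String uri) {
        for (String link : getLinks(uri)) getType(link);
        getPage(uri);
    }
}
```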
{~}
I've taken a look at the crawler project in Lucene sandbox, but its
objectives are totally different from ours. We could in the future add a
plugin to it to be able to index a Cocoon site using the link view, but
it does indexing, not saving a site locally.
So our option is to do the work in Cocoon.
{~}
The three calls to Cocoon can be reduced quite easily to two, by making
the call to the FileSavingEnvironment return both things at the same
time and using those. Or by caching the result as the proposed Ant task
in Cocoon scratchpad does.
The problem arises with the LinkSamplingEnvironment, because it uses a
Cocoon view to get the links. Thus we need to ask Cocoon two things, the
links and the contents.
Let's leave aside the view concept for now, and think about how to
sample links from a content being produced.
We can use a LinkSamplingPipeline.
Yes, a pipeline that introduces a connector just after the
"content"-tagged sitemap component and saves the links found in the
environment.
Thus after the call we would have in the environment the result, the
type and the links, all in one call.
In essence, we are creating a non-blocking view that runs in parallel with
the main pipeline and reports its results to the environment.
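A minimal sketch of such a connector, using a plain JDK SAX filter. LinkSamplingFilter is a hypothetical name; the real component would report into the Cocoon Environment rather than a list:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// A link-sampling "connector": forwards every SAX event unchanged to the
// rest of the pipeline, and as a side effect records href attributes.
public class LinkSamplingFilter extends XMLFilterImpl {
    public final List<String> sampledLinks = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String href = atts.getValue("href");
        if (href != null) sampledLinks.add(href);
        super.startElement(uri, localName, qName, atts); // pass through untouched
    }

    // Parse a document through the filter and return the sampled links.
    public static List<String> sample(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
        LinkSamplingFilter filter = new LinkSamplingFilter();
        filter.setParent(reader);
        filter.parse(new InputSource(new StringReader(xml)));
        return filter.sampledLinks;
    }
}
```

Because the filter only observes the stream, the main pipeline still sees exactly the events it would have seen without it.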
This is how views are managed in the interpreted sitemap, in a transformer:
// Check view
if (this.views != null) {
    // inform the pipeline that we have a branch point
    context.getProcessingPipeline().informBranchPoint();

    String cocoonView = env.getView();
    if (cocoonView != null) {
        // Get view node
        ProcessingNode viewNode = (ProcessingNode) this.views.get(cocoonView);
        if (viewNode != null) {
            if (getLogger().isInfoEnabled()) {
                getLogger().info("Jumping to view " + cocoonView
                    + " from transformer at " + this.getLocation());
            }
            return viewNode.invoke(env, context);
        }
    }
}

// Return false to continue sitemap invocation
return false;
It effectively branches and continues only with the view.
Wait, this means that when the CLI recreates a site it doesn't save the
views, right?
Correct, views are simply ignored by the CLI and not created on disk.
This is also due to how views are invoked in Cocoon, with a ? parameter,
so they cannot be saved to disk with the correct URL.
But even if I don't save it, I may need it for internal Cocoon
processing, as is the case with the crawler.
I don't know if it's best to use a special pipeline, to cache the views,
or what, but we need to find a solution.
Any idea?
--
Nicola Ken Barozzi nicolaken@apache.org
- verba volant, scripta manent -
(discussions get forgotten, just code remains)
---------------------------------------------------------------------
Re: [RT] New Cocoon Site Crawler Environment
Posted by Vadim Gritsenko <va...@verizon.net>.
Nicola Ken Barozzi wrote:
>
> Vadim Gritsenko wrote:
>
>> Nicola Ken Barozzi wrote:
>> ...
>>
>>> Why is it so slow?
>>> Mostly because it generates each source three times.
>>
[...]
>>> Thus after the call we would have in the environment the result, the
>>> type and the links, all in one call.
>>
>>
>> Type and links - yes, I agree. Content - no, we won't get correct
>> content because links will not be translated in this content. And
>> produced content is impossible to "re-link" because it can be any
>> binary format supporting links (MS Excel, PDF, MS Word, ...)
>
>
> Ok, you are correct.
>
> Please add here the results we have come to in our fast AIM
> discussion, I have to run now.
Ok, here is the thing. It is possible to get everything in one call (and
- this remark goes to Berin - without an increase in resource
consumption), if we (re)move the translateURI functionality from the
Main. The problem is that the getType() method is used for only one
purpose - to decide on a good name for the resulting file, i.e. on a
good extension according to the MIMEUtils settings. And the other
problem is that getLinks is used only to collect this information (about
good names) and deliver it to the LinkTranslator transformer, which does
the actual work of replacing links.
So, if we remove link translation from the Main.java, where it can go
and how it should be done? There are several options.
1) Do not change names.
This works for everything except URIs ending with "/" - and for such
URIs, we can use the existing solution: add Constants.INDEX_URI to the end.
Points in favor of this method:
* the generated site will be close to the live site with regard to file
names.
* in Main.java, only one call is needed.
2) Change names according to the translation table supplied to the Main
by the user.
This solution provides some flexibility (may be too much of it).
Points in favor of this method:
* Flexibility.
* Same as above.
3) Change names as we did before - by utilizing MIMEUtils.
Points in favor of this method:
* This is backward compatible way.
* We still have to know the types of all links to do the translation,
which means an extra getType() call on every link (excluding duplicates -
that information is cached). Hm, this one, actually, is not in favor...
And this name translation can happen in the LinkTranslator transformer,
which currently does the link translation magic. If we move all the URI
translation logic there, whatever it will be (see points 1-3 above), it
will be possible to implement Main in one step instead of three.
The exception is case (3), where complexity will be added to
LinkTranslator; but still, we would reduce the calls from 3 (per link) to 2.
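A toy sketch of naming options 1 and 3; the LinkNaming class and its tiny extension table are illustrative stand-ins for Constants.INDEX_URI and MIMEUtils, not the real Cocoon code:

```java
import java.util.Map;

// Hypothetical sketch of the two naming strategies discussed above.
public class LinkNaming {
    // Tiny stand-in for the MIMEUtils extension settings.
    static final Map<String, String> EXT = Map.of(
        "text/html", ".html",
        "application/pdf", ".pdf");

    // Option 1: keep the URI as-is, appending an index file name
    // (Constants.INDEX_URI in Cocoon) for URIs ending with "/".
    // Needs no knowledge of the MIME type, so: one call per page.
    static String option1(String uri) {
        return uri.endsWith("/") ? uri + "index.html" : uri;
    }

    // Option 3: derive the extension from the MIME type, which requires
    // knowing the type of every link - the extra getType() call per link.
    static String option3(String uri, String mimeType) {
        String ext = EXT.getOrDefault(mimeType, "");
        return uri.endsWith("/") ? uri + "index" + ext
                                 : (uri.contains(".") ? uri : uri + ext);
    }
}
```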
> Thanks :-)
You are welcome. I hope the story is understandable.
>> But, there is hope to get it all at once - if the LinkSamplingTransformer
>> will also be a LinkTranslatingTransformer and will call Main back on
>> every new link (recursive processing - as opposed to the iterative
>> processing in the current implementation of the Main). The drawback of
>> the recursive approach is increased memory consumption.
>
>
> NAO = not an option
Yes, it was totally wrong idea from my side.
> It doesn't scale, you are right.
And it never did. Amen.
Vadim
---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org
Re: [RT] New Cocoon Site Crawler Environment
Posted by Berin Loritsch <bl...@apache.org>.
Nicola Ken Barozzi wrote:
>
>
> Berin Loritsch wrote:
>
>> For instance, part of the issue resides in the fact that any client
>> (i.e. CLI environment or Servlet) can only access one view at a time
>> for any resource.
>>
>> Why not allow a client to access all views to a resource that it
>> needs/wants simultaneously? That will allow things like the all
>> important profiling information to be appended after HTML pages
>> are rendered.
>
>
> I thought of this too. But in practice?...
>
>> Cocoon is so entrenched in the single path of execution mentality that
>> environments that need the extra complexity can't have it.
>>
>> Each resource should only need to be rendered once, and only
>> once. Each view to the resource should be accessible by a client.
>>
>> For instance, the CLI client wants the Link/Mime-Type information
>> and the content itself. The Link/Mime-Type information is accessed
>> via the LinkSamplingEnvironment. In reality, that is a poor name
>> for what you are really wanting to represent. It should be the
>> LinkSamplingView. That view caches information that can be incorporated
>> back into the list of links we are resolving.
>
>
> Ok, but in practice, how does the client request the view results?
> I kinda like this non-blocking view concept, but fail to see clearly the
> practical implementation.
I think this gets back to the whole Multi-Path pipelining thread a while
back (i.e. allowing the multiplexing of a pipeline to serialize results
to disk while sending the results to the user at the same time).
In the end, I don't think that "Views" as they are currently defined are
what we are really after. What we want is a way of siphoning
information from our pipeline so that we can use it as we see fit.
As such, what we really need is something like a "publish/subscribe"
mechanism similar to the way Avalon Instrument works. I.e. if we don't
need it, we don't waste precious resources--but if we do need it, then
it is available to us.
I am not sure of the mechanism, but perhaps we can butt our heads
together to come up with the best solution.
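A minimal sketch of what such a publish/subscribe siphon could look like; all names here are illustrative, nothing below is Avalon or Cocoon API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Pipeline stages publish named facts ("links", "mime-type", ...);
// only topics somebody actually subscribed to cost anything.
public class PipelineBus {
    private final Map<String, List<Consumer<Object>>> subscribers = new HashMap<>();

    public void subscribe(String topic, Consumer<Object> consumer) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(consumer);
    }

    // A stage can check this first and skip producing the value at all -
    // "if we don't need it, we don't waste precious resources".
    public boolean wanted(String topic) {
        return subscribers.containsKey(topic);
    }

    public void publish(String topic, Object value) {
        for (Consumer<Object> c : subscribers.getOrDefault(topic, List.of()))
            c.accept(value);
    }
}
```

A CLI-like client would subscribe to "links" and "mime-type" before processing; a servlet client would subscribe to nothing and pay nothing.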
---------------------------------------------------------------------
Re: [RT] New Cocoon Site Crawler Environment
Posted by Nicola Ken Barozzi <ni...@apache.org>.
Berin Loritsch wrote:
> Nicola Ken Barozzi wrote:
>
>>
>> Vadim Gritsenko wrote:
>>
>>> Nicola Ken Barozzi wrote:
>>> ...
>>>
>>>> Why is it so slow?
>>>> Mostly because it generates each source three times.
>>>
>>>
>>
>> [...]
>>
>>> Note: It gets the page with all the links translated using data
>>> gathered on previous step.
>>
>>
>>
>> [...]
>>
>>> We can combine getType and getLinks calls into one, see below.
>
>
>
> If it does not scale the way things are now--and I agree generating
> the source three times is two times too many--then we may have to
> change things a bit more deeply.
>
> For instance, part of the issue resides in the fact that any client
> (i.e. CLI environment or Servlet) can only access one view at a time
> for any resource.
>
> Why not allow a client to access all views to a resource that it
> needs/wants simultaneously? That will allow things like the all
> important profiling information to be appended after HTML pages
> are rendered.
I thought of this too. But in practice?...
> Cocoon is so entrenched in the single path of execution mentality that
> environments that need the extra complexity can't have it.
>
> Each resource should only need to be rendered once, and only
> once. Each view to the resource should be accessible by a client.
>
> For instance, the CLI client wants the Link/Mime-Type information
> and the content itself. The Link/Mime-Type information is accessed
> via the LinkSamplingEnvironment. In reality, that is a poor name
> for what you are really wanting to represent. It should be the
> LinkSamplingView. That view caches information that can be incorporated
> back into the list of links we are resolving.
Ok, but in practice, how does the client request the view results?
I kinda like this non-blocking view concept, but fail to see clearly the
practical implementation.
> Another issue I have that is related to link crawling, but not to
> the multi-view access. It is the error page generation. It is not
> *always* an error if a link is not handled by Cocoon.
>
> A common example is the fact that JavaDocs are generated outside of
> Cocoon, and the error page that screws up the link to the JavaDocs
> is a *bad* thing.
This is not really a CLI error, but the fact that Cocoon (wrongly IMHO)
doesn't handle that part of the URI space. We are dealing with this
concept in Forrest, where we have seen that complete sub-URI spaces can
be dealt with without link crawling, so it's feasible to have Cocoon
serve all those javadocs and not break.
Anyway, there is a way to keep a link from being crawled: setting the
xlink attribute.
> Perhaps we should allow for known exclusions, or turn off the error
> page generation for the missing links--recording them to a file like
> we do now.
Yup, should be settable +1
--
Nicola Ken Barozzi nicolaken@apache.org
- verba volant, scripta manent -
(discussions get forgotten, just code remains)
---------------------------------------------------------------------
Re: [RT] New Cocoon Site Crawler Environment
Posted by Berin Loritsch <bl...@apache.org>.
Nicola Ken Barozzi wrote:
>
> Vadim Gritsenko wrote:
>
>> Nicola Ken Barozzi wrote:
>> ...
>>
>>> Why is it so slow?
>>> Mostly because it generates each source three times.
>>
>
> [...]
>
>> Note: It gets the page with all the links translated using data
>> gathered on previous step.
>
>
> [...]
>
>> We can combine getType and getLinks calls into one, see below.
If it does not scale the way things are now--and I agree generating
the source three times is two times too many--then we may have to
change things a bit more deeply.
For instance, part of the issue resides in the fact that any client
(i.e. CLI environment or Servlet) can only access one view at a time
for any resource.
Why not allow a client to access all views to a resource that it
needs/wants simultaneously? That will allow things like the all
important profiling information to be appended after HTML pages
are rendered.
Cocoon is so entrenched in the single path of execution mentality that
environments that need the extra complexity can't have it.
Each resource should only need to be rendered once, and only
once. Each view to the resource should be accessible by a client.
For instance, the CLI client wants the Link/Mime-Type information
and the content itself. The Link/Mime-Type information is accessed
via the LinkSamplingEnvironment. In reality, that is a poor name
for what you are really wanting to represent. It should be the
LinkSamplingView. That view caches information that can be incorporated
back into the list of links we are resolving.
Another issue I have that is related to link crawling, but not to
the multi-view access. It is the error page generation. It is not
*always* an error if a link is not handled by Cocoon.
A common example is the fact that JavaDocs are generated outside of
Cocoon, and the error page that screws up the link to the JavaDocs
is a *bad* thing.
Perhaps we should allow for known exclusions, or turn off the error
page generation for the missing links--recording them to a file like
we do now.
Just some food for thought.
---------------------------------------------------------------------
Re: [RT] New Cocoon Site Crawler Environment
Posted by Nicola Ken Barozzi <ni...@apache.org>.
Vadim Gritsenko wrote:
> Nicola Ken Barozzi wrote:
> ...
>
>> Why is it so slow?
>> Mostly because it generates each source three times.
[...]
> Note: It gets the page with all the links translated using data gathered
> on previous step.
[...]
> We can combine getType and getLinks calls into one, see below.
>
>
>> Let's leave aside the view concept for now, and think about how to
>> sample links from a content being produced.
>>
>> We can use a LinkSamplingPipeline.
>> Yes, a pipeline that introduces a connector just after the
>> "content"-tagged sitemap component and saves the links found in the
>> environment.
>
>
> Mmmm... Correction: a pipeline that introduces a LinkSamplingTransformer
> right before the serializer. You can't get links from the content view
> because it might (will) have none yet. Links must be sampled right
> before the serializer, as the links view does.
The link view can be set to kick in at any point of the pipeline; it's
always SAX.
It's up to the sitemap editor to say which step is the semantically rich
one. It can be the first, in the middle, or right before the serializer.
>> Thus after the call we would have in the environment the result, the
>> type and the links, all in one call.
>
> Type and links - yes, I agree. Content - no, we won't get correct
> content because links will not be translated in this content. And
> produced content is impossible to "re-link" because it can be any binary
> format supporting links (MS Excel, PDF, MS Word, ...)
Ok, you are correct.
Please add here the results we have come to in our fast AIM discussion,
I have to run now.
Thanks :-)
> But, there is hope to get it all at once - if the LinkSamplingTransformer
> will also be a LinkTranslatingTransformer and will call Main back on
> every new link (recursive processing - as opposed to the iterative
> processing in the current implementation of the Main). The drawback of
> the recursive approach is increased memory consumption.
NAO = not an option
It doesn't scale, you are right.
--
Nicola Ken Barozzi nicolaken@apache.org
- verba volant, scripta manent -
(discussions get forgotten, just code remains)
---------------------------------------------------------------------
Re: [RT] New Cocoon Site Crawler Environment
Posted by Vadim Gritsenko <va...@verizon.net>.
Nicola Ken Barozzi wrote:
...
> Why is it so slow?
> Mostly because it generates each source three times.
>
> * to get the links.
* to get the mime type
> * for each link
...whose mime type is not known yet...
> to get the mime/type.
> * to get the page itself
Note: It gets the page with all the links translated using data gathered
on previous step.
> To do this it uses two environments, the FileSavingEnvironment and the
> LinkSamplingEnvironment.
...
> The three calls to Cocoon can be reduced quite easily to two, by
> making the call to the FileSavingEnvironment return both things at the
> same time and using those.
Clarify: which two things?
> Or by caching the result as the proposed Ant task in Cocoon scratchpad
> does.
>
> The problem arises with the LinkSamplingEnvironment, because it uses a
> Cocoon view to get the links. Thus we need to ask Cocoon two things,
> the links and the contents.
We can combine getType and getLinks calls into one, see below.
> Let's leave aside the view concept for now, and think about how to
> sample links from a content being produced.
>
> We can use a LinkSamplingPipeline.
> Yes, a pipeline that introduces a connector just after the
> "content"-tagged sitemap component and saves the links found in the
> environment.
Mmmm... Correction: a pipeline that introduces a LinkSamplingTransformer
right before the serializer. You can't get links from the content view
because it might (will) have none yet. Links must be sampled right
before the serializer, as the links view does.
> Thus after the call we would have in the environment the result, the
> type and the links, all in one call.
Type and links - yes, I agree. Content - no, we won't get correct
content because links will not be translated in this content. And
produced content is impossible to "re-link" because it can be any binary
format supporting links (MS Excel, PDF, MS Word, ...)
But, there is hope to get it all at once - if the LinkSamplingTransformer
will also be a LinkTranslatingTransformer and will call Main back on
every new link (recursive processing - as opposed to the iterative
processing in the current implementation of the Main). The drawback of
the recursive approach is increased memory consumption.
<snip/>
Vadim
---------------------------------------------------------------------
Re: [RT] New Cocoon Site Crawler Environment
Posted by Nicola Ken Barozzi <ni...@apache.org>.
Bernhard Huber wrote:
> Hi,
Hi :-)
> <big snip/>
>
> ask Cocoon two things: make a Generator/Transformer that does both.
>
> I'm now playing around with a SourceLinkStatusGenerator, which is like
> StatusGenerator but requests the links of a page via a
> processor.process() call instead of an http: call. It does this
> recursively: you ask SourceLinkStatusGenerator for all the outbound
> links of index.html, and it returns an XML document with all the links
> of the pages reachable from index.html.
>
> You ask Cocoon: give me the content of page index.html plus its
> outbound links.
>
> The only problem I see is that if you ask Cocoon this question you will
> not get a text/html response but a text/html+application/x-cocoon-links
> response - taking the index.html example from above.
>
> Moreover, you might have to adapt the sitemap, say with
> <map:match pattern="crawling">, and ask Cocoon the right question
> within this map:match.
Actually I'd ask the question to the Environment, because link harvesting
has to be plugged into the pipelines or the views in a non-intrusive manner.
> Hmm, if you rely on links, you might want the LinkTransformer not to
> throw away the page content, but to harvest the links non-destructively.
Yes.
> Hmm, that would be best: no big sitemap changes, just another
> transforming step - the new LinkAndContentTransformer instead of
> type="xslt" src="linkstatus.xslt" - but the content-type issue stays.
We could do away with it, and get the file as-is.
> btw, thanks for starting this RT; I don't have the passion to initiate
> this, but it is necessary, and I appreciate it.
Wasn't it you who did the Ant stuff? Where do you think I got inspiration?
Thank *you* :-)
--
Nicola Ken Barozzi nicolaken@apache.org
- verba volant, scripta manent -
(discussions get forgotten, just code remains)
---------------------------------------------------------------------
Re: [RT] New Cocoon Site Crawler Environment
Posted by Bernhard Huber <be...@a1.net>.
Hi,
Nicola Ken Barozzi wrote:
>
> Of all these discussions, one thing sticks out: we must
> rewrite/fix/enhance/whatever the Cocoon crawler.
>
> Reasons:
>
> - speed
> - correct link gathering
>
> but mostly
>
> - speed
>
> Why is it so slow?
> Mostly because it generates each source three times.
>
> * to get the links.
> * for each link to get the mime/type.
> * to get the page itself
>
> To do this it uses two environments, the FileSavingEnvironment and the
> LinkSamplingEnvironment.
>
>
> {~}
>
>
> I've taken a look at the crawler project in Lucene sandbox, but its
> objectives are totally different from ours. We could in the future add a
> plugin to it to be able to index a Cocoon site using the link view, but
> it does indexing, not saving a site locally.
> So our option is to do the work in Cocoon.
>
>
> {~}
>
>
> The three calls to Cocoon can be reduced quite easily to two, by making
> the call to the FileSavingEnvironment return both things at the same
> time and using those. Or by caching the result as the proposed Ant task
> in Cocoon scratchpad does.
>
yup,
> The problem arises with the LinkSamplingEnvironment, because it uses a
> Cocoon view to get the links. Thus we need to ask Cocoon two things, the
> links and the contents.
<big snip/>
ask Cocoon two things: make a Generator/Transformer that does both.
I'm now playing around with a SourceLinkStatusGenerator, which is like
StatusGenerator but requests the links of a page via a
processor.process() call instead of an http: call. It does this
recursively: you ask SourceLinkStatusGenerator for all the outbound
links of index.html, and it returns an XML document with all the links
of the pages reachable from index.html.
You ask Cocoon: give me the content of page index.html plus its
outbound links.
The only problem I see is that if you ask Cocoon this question you will
not get a text/html response but a text/html+application/x-cocoon-links
response - taking the index.html example from above.
Moreover, you might have to adapt the sitemap, say with
<map:match pattern="crawling">, and ask Cocoon the right question
within this map:match.
Hmm, if you rely on links, you might want the LinkTransformer not to
throw away the page content, but to harvest the links non-destructively.
Hmm, that would be best: no big sitemap changes, just another
transforming step - the new LinkAndContentTransformer instead of
type="xslt" src="linkstatus.xslt" - but the content-type issue stays.
btw, thanks for starting this RT; I don't have the passion to initiate
this, but it is necessary, and I appreciate it.
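The recursive gathering could be sketched like this; linksOf is a stand-in for asking Cocoon for a page's outbound links via processor.process(), and all names are illustrative:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Collect every page reachable from a start page by recursively
// following links, with a visited set so cycles don't loop forever.
public class LinkStatus {
    public static Set<String> reachable(String start,
                                        Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        visit(start, linksOf, visited);
        return visited;
    }

    private static void visit(String page,
                              Function<String, List<String>> linksOf,
                              Set<String> visited) {
        if (!visited.add(page)) return; // already seen - avoids loops
        for (String link : linksOf.apply(page))
            visit(link, linksOf, visited);
    }
}
```

This is the recursive shape of the idea; each recursion frame holds a page in flight, which is the memory-consumption drawback Vadim raises elsewhere in the thread.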
bye bernhard
---------------------------------------------------------------------