Posted to users@cocoon.apache.org by Nils Kaiser <Ni...@gmx.net> on 2006/09/04 12:13:17 UTC

Crawling over web pages with cocoon (Running a pipeline per page)

Hello,

I have a use case where I need to crawl a web site to migrate its 
content to another system. Since I have already used some of the 
components needed for the migration with Cocoon, it would be great if I 
could reuse the pipeline. So the question is: how do I crawl the site 
automatically, or if that is not possible, what is the best way to 
achieve similar behavior?

Has anyone used Cocoon for something similar and can share their experience?

Thx,

Nils

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: Re: Crawling over web pages with cocoon (Running a pipeline per page)

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9/5/06, Nils Kaiser <Ni...@gmx.net> wrote:

> So you mean you would get a dump of the site and call Cocoon pipelines
> for the conversion...

No, I meant using wget to crawl the converted output of Cocoon, scenario:

1. wget gets an HTML starting page from Cocoon, with links to the
converted pages (which can be XML or anything). Cocoon gets its input
either from HTTP resources or from files.

2. wget crawls that page and saves all the converted files

So I wouldn't use Cocoon's crawling features at all, only the
transformation pipelines.

But of course, it's also a good idea to start by crawling the original
site and saving its pages; that will be faster if you need to do several
conversion runs.
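As a concrete sketch of that scenario, the wget call could look roughly like the following. The entry URL, output directory, and crawl depth are all assumptions; the command is only echoed (a dry run) so it can be checked before actually crawling:

```shell
# Dry-run sketch of steps 1 and 2: crawl the Cocoon entry page and
# save everything it links to. ENTRY and OUTDIR are assumptions --
# point them at your own Cocoon pipeline URL and target directory.
ENTRY="http://localhost:8888/convert/toc.html"
OUTDIR="converted"

CMD="wget --recursive --level=1 --no-parent --directory-prefix=$OUTDIR $ENTRY"

# Echo instead of executing; replace 'echo' with '$CMD' to really crawl.
echo "$CMD"
```

--no-parent keeps wget from wandering above the entry page's directory, so only the converted output gets saved.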

> ...An additional point would be the ability to generate some useful logs
> about which page was converted and where... do you do something similar?..

I sometimes use sitemap actions to log that kind of info.

-Bertrand



Re: Crawling over web pages with cocoon (Running a pipeline per page)

Posted by Nils Kaiser <Ni...@gmx.net>.
So you mean you would get a dump of the site and call Cocoon pipelines 
for the conversion. I like the idea of doing this in two steps, as it 
allows us to check everything and remove unneeded pages before 
converting. Maybe a list of the URLs (collected by wget or something 
else) would be enough, and I could fetch the content of each page from 
inside the pipeline (HTMLGenerator).
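As a rough sketch of that URL-list idea (the sample page and file names below are made up for illustration), the list can be scraped out of a saved start page with standard shell tools, one URL per line:

```shell
# Write a tiny sample of a crawled start page (a stand-in for the
# real dump wget would produce).
cat > page.html <<'EOF'
<html><body>
<a href="http://example.com/docs/a.html">A</a>
<a href="http://example.com/docs/b.html">B</a>
</body></html>
EOF

# Pull out the href targets and de-duplicate them into a flat list;
# each line of urls.txt can then drive one pipeline call.
grep -o 'href="[^"]*"' page.html \
  | sed 's/^href="//; s/"$//' \
  | sort -u > urls.txt

cat urls.txt
```

From there, each line of urls.txt could be handed to the pipeline so the HTMLGenerator does the actual fetching, keeping the conversion itself inside Cocoon.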

The question is, how would I call Cocoon then? Using the CLI or the 
Cocoon bean?

I saw there is some crawler functionality alongside the bean but could 
not find any info about how to use it.

An additional point would be the ability to generate some useful logs 
about which page was converted and where... do you do something similar?

Thx,

Nils
>
> For a one-time job of converting a collection of webpages, I'd use an
> external crawler like wget, and create Cocoon pipelines to do the
> format conversion.
>
> You'll need a "table of contents" page with (at least
> indirect) links to all the other pages, and use that page as the entry
> point for wget.
>
> You could of course do the whole thing in Cocoon, but it's probably
> faster to implement and test with this combination of tools.
>
> -Bertrand
>




Re: Crawling over web pages with cocoon (Running a pipeline per page)

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9/4/06, Nils Kaiser <Ni...@gmx.net> wrote:

> ...So the question is: how do I crawl the site
> automatically, or if that is not possible, what is the best way to achieve
> similar behavior?...

For a one-time job of converting a collection of webpages, I'd use an
external crawler like wget, and create Cocoon pipelines to do the
format conversion.

You'll need a "table of contents" page with (at least
indirect) links to all the other pages, and use that page as the entry
point for wget.

You could of course do the whole thing in Cocoon, but it's probably
faster to implement and test with this combination of tools.

-Bertrand



RE: Crawling over web pages with cocoon (Running a pipeline per page)

Posted by Warrell <wa...@iquo.co.uk>.
Hi Nils,

Have you looked at the LinkStatusGenerator? You could probably modify it to
generate an XML representation of your target content.

That's where I would look; if that isn't suitable, think about wrapping
a simple Java spider as a custom Transformer.

Hope this helps,

Regards

Warrell

-----Original Message-----
From: Nils Kaiser [mailto:NilsKaiser@gmx.net] 
Sent: 04 September 2006 11:13
To: users@cocoon.apache.org
Subject: Crawling over web pages with cocoon (Running a pipeline per page)

Hello,

I have a use case where I need to crawl a web site to migrate its 
content to another system. Since I have already used some of the 
components needed for the migration with Cocoon, it would be great if I 
could reuse the pipeline. So the question is: how do I crawl the site 
automatically, or if that is not possible, what is the best way to 
achieve similar behavior?

Has anyone used Cocoon for something similar and can share their experience?

Thx,

Nils



