Posted to user@nutch.apache.org by Joydeep Banerjee <jb...@gmail.com> on 2009/01/08 21:09:20 UTC

Crawling dynamic pages using Nutch

Hi all,

I am trying to crawl dynamic pages using Nutch, but the crawler doesn't
seem to go deep enough. I have tried setting depth=1000, but the crawler seems
to fetch the same pages over and over again.

For example, a good scenario would be to search for exact flights matching
specified search criteria (e.g., source, destination, departuredate,
returndate) on a travel site (e.g., Expedia).

How does Nutch crawl database-generated pages or pages that are behind
forms?

Any suggestions would be highly appreciated.

Thanks.
Joydeep.

Re: Crawling dynamic pages using Nutch

Posted by Joydeep Banerjee <jb...@gmail.com>.
I haven't received any response yet.
Please help; I am stuck on my project because of this issue.

Thanks.
Joydeep.



Re: Crawling dynamic pages using Nutch

Posted by Doğacan Güney <do...@gmail.com>.
On Thu, Jan 8, 2009 at 10:09 PM, Joydeep Banerjee <jb...@gmail.com> wrote:
> Hi all,
>
> I am trying to crawl dynamic pages using Nutch, but the crawler doesn't
> seem to go deep enough. I have tried setting depth=1000, but the crawler seems
> to fetch the same pages over and over again.
>
> For example, a good scenario would be to search for exact flights matching
> specified search criteria (e.g., source, destination, departuredate,
> returndate) on a travel site (e.g., Expedia).
>
> How does Nutch crawl database-generated pages or pages that are behind
> forms?
>

Nutch just follows the links it finds on a page. It doesn't have support for
fetching a page behind a form.

You can write an HTML parse filter plugin that adds new links (these
links being the URLs of the pages behind the forms).
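
To make the idea concrete, here is a rough sketch of the link-generation
part only. It is plain Java with no Nutch classes, and it assumes the form
submits with GET, so every combination of field values maps to an ordinary
URL. The class name FormLinkGenerator, the method expand, the host
travel.example.com and the field values are all made up for illustration;
inside a real plugin the resulting URLs would be added to the page's
outlinks.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: expands a GET form into plain URLs.
class FormLinkGenerator {

    // Returns one URL per combination of the candidate values of each field.
    static List<String> expand(String action, Map<String, List<String>> fields) {
        List<String> queries = new ArrayList<>();
        queries.add(""); // start with a single empty query string
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            List<String> next = new ArrayList<>();
            for (String prefix : queries) {
                for (String value : field.getValue()) {
                    String pair = encode(field.getKey()) + "=" + encode(value);
                    next.add(prefix.isEmpty() ? pair : prefix + "&" + pair);
                }
            }
            queries = next; // every old query extended by every value
        }
        List<String> urls = new ArrayList<>();
        for (String query : queries) {
            urls.add(query.isEmpty() ? action : action + "?" + query);
        }
        return urls;
    }

    // Form-encodes a name or value (a space becomes '+').
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // Assumed field names and values; a real form defines its own.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("source", Arrays.asList("SFO", "LAX"));
        fields.put("destination", Arrays.asList("JFK"));
        for (String url : expand("http://travel.example.com/search", fields)) {
            System.out.println(url);
        }
    }
}
```

Keep in mind that the number of URLs is the product of the number of
values per field, so the candidate values need to be limited. This approach
also does not cover forms that submit with POST.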

> Any suggestions would be highly appreciated.
>
> Thanks.
> Joydeep.
>



-- 
Doğacan Güney