Posted to user@nutch.apache.org by Deepa Jayaveer <de...@tcs.com> on 2014/03/14 10:10:39 UTC
reg pagination
Hi
I am using Nutch 2.1 with MySQL. The requirement is to crawl all the
paginated web pages.
Say, for example, I give the first page (page no. 1) of some website
(http://x.com?num=1) as the seed URL and supply an appropriate regular
expression through the URL filter so that Nutch crawls the pages matching
the "num" pattern. Nutch is then able to crawl URLs such as:
http://x.com?num=2
http://x.com?num=3 ...
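
A minimal sketch of the URL-filter idea described above, written in Python purely for illustration. It assumes the "+regex" / "-regex" line format of Nutch's regex-urlfilter.txt, where the first matching rule decides whether a URL is kept. The patterns and the `accepts` helper are hypothetical and are not taken from the actual configuration.

```python
import re

# Hypothetical rules mirroring the "+regex" / "-regex" line format of
# Nutch's regex-urlfilter.txt; the patterns themselves are assumptions.
RULES = [
    ("+", r"^http://x\.com/?\?num=\d+$"),  # accept paginated URLs
    ("-", r"."),                            # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches `url` is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is rejected


if __name__ == "__main__":
    for candidate in ("http://x.com?num=2", "http://x.com/about"):
        print(candidate, accepts(candidate))
```

Rule order matters in this model: the accept rule has to come before the catch-all reject rule, otherwise the paginated URLs would be dropped as well.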
Nutch crawls successfully when the pagination URL is given in an
anchor tag (a href).
I face an issue when the web pages use a JavaScript function for
pagination, for example by calling a function like onPaginationSubmit().
Nutch is not able to crawl those pages. Can anyone help with a
solution for crawling those paginated pages?
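
The difference between the two cases can be illustrated with a simplified link extractor. This is only a sketch in Python, not Nutch's actual parser: the `HrefCollector` class, the `extract_links` helper, and the sample markup are all hypothetical. It shows that a link carrying a real href yields a URL to follow, while a link whose target exists only inside an onclick handler yields nothing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip empty targets, bare fragments, and javascript: pseudo-URLs.
        if href and not href.startswith(("#", "javascript:")):
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the crawlable links found in `html`, resolved against `base_url`."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page: one plain link, one JavaScript-driven pagination link.
PAGE = """
<a href="?num=2">2</a>
<a href="#" onclick="onPaginationSubmit(3)">3</a>
"""

if __name__ == "__main__":
    print(extract_links(PAGE, "http://x.com/"))
```

In this model the second link never produces a URL, because the page number is only known once the script runs in a browser.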
Thanks and Regards
Deepa Devi
=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain
confidential or privileged information. If you are
not the intended recipient, any dissemination, use,
review, distribution, printing or copying of the
information contained in this e-mail message
and/or attachments to it are strictly prohibited. If
you have received this communication in error,
please notify us by reply e-mail or telephone and
immediately and permanently delete the message
and any attachments. Thank you
RE: reg pagination
Posted by Markus Jelsma <ma...@openindex.io>.
No, Nutch won't be able to crawl any JavaScript-generated URL unless you make some very heavy customizations, such as using Selenium or a JavaScript runtime with an embedded DOM environment. Nutch can, however, crawl AJAX web pages the way Google does.
https://issues.apache.org/jira/browse/NUTCH-1323
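
For context, "like Google does" presumably refers to the AJAX crawling scheme Google published, in which a URL containing a "#!" fragment is rewritten into an "_escaped_fragment_" query parameter that the server answers with a static snapshot. The Python sketch below shows only that URL mapping, under that assumption. The `to_escaped_fragment` helper and the example URLs are hypothetical, and the encoding is deliberately conservative, so it may escape more characters than the scheme strictly requires.

```python
from urllib.parse import quote

# Punctuation left unescaped in the fragment value. Space, '#', '%', '&'
# and '+' are absent from this set, so quote() percent-encodes them.
SAFE_CHARS = "!$'()*,/:;=?@"


def to_escaped_fragment(url):
    """Map a '#!' URL to its '_escaped_fragment_' form; other URLs are unchanged."""
    base, marker, fragment = url.partition("#!")
    if not marker:
        return url  # no hash-bang: nothing to rewrite
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + quote(fragment, safe=SAFE_CHARS)


if __name__ == "__main__":
    print(to_escaped_fragment("http://x.com/list#!num=2"))
```

This only helps when the site itself exposes "#!" URLs and serves snapshots for them. It would not, on its own, resolve pagination that is triggered by a handler such as onPaginationSubmit().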