Posted to user@nutch.apache.org by Deepa Jayaveer <de...@tcs.com> on 2014/03/14 10:10:39 UTC

reg pagination

Hi,
I am using Nutch 2.1 with MySQL. The requirement is to crawl all the
paginated web pages.

For example, suppose the seed URL is the first page (page no. 1) of some
website (http://x.com?num=1), and the URL filter has an appropriate regular
expression so that Nutch crawls pages matching the "num" pattern.
Nutch is then able to crawl URLs such as
http://x.com?num=2
http://x.com?num=3 ...
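
To illustrate the kind of filter meant here, the sketch below checks candidate URLs against a pattern using java.util.regex. Everything in it is an assumption made for illustration: the pattern itself, the class name PaginationFilterCheck and the method accepts() are hypothetical and only mirror the x.com example, they are not taken from an actual Nutch configuration.

```java
import java.util.regex.Pattern;

public class PaginationFilterCheck {
    // Hypothetical pattern for the example site: accepts http://x.com?num=<digits>,
    // with or without a slash before the query string.
    public static final Pattern PAGINATED =
            Pattern.compile("^http://x\\.com/?\\?num=\\d+$");

    // Returns true when the URL looks like one of the paginated pages.
    public static boolean accepts(String url) {
        return PAGINATED.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://x.com?num=2",
            "http://x.com?num=3",
            "http://x.com/about"
        };
        // Print "+" for accepted and "-" for rejected URLs.
        for (String url : candidates) {
            System.out.println((accepts(url) ? "+ " : "- ") + url);
        }
    }
}
```

A check like this only confirms that the expression matches the intended URLs. It does not help Nutch discover links that never appear in the fetched HTML.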

Nutch crawls successfully when the pagination URL is given in an
anchor tag (a href).

I ran into an issue when the web pages used a JavaScript function for
pagination, calling a function like onPaginationSubmit().

Nutch was not able to crawl those pages. Can anyone suggest how to
crawl those paginated pages?




Thanks and Regards
Deepa Devi 
=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you



RE: reg pagination

Posted by Markus Jelsma <ma...@openindex.io>.
No, Nutch won't be able to crawl any JavaScript-generated URL unless you make some very heavy customizations, such as using Selenium or a JavaScript runtime with an embedded DOM environment. Nutch can, however, crawl AJAX web pages the way Google does.

https://issues.apache.org/jira/browse/NUTCH-1323
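
For context, "like google does" presumably refers to the AJAX crawling scheme Google published, in which a "#!" URL is rewritten to an "_escaped_fragment_" query parameter and the server answers with an HTML snapshot. The following is only a rough sketch of that URL mapping, not the code from the issue above. The class name EscapedFragment and its methods are hypothetical, and whether a given site supports the scheme at all is an assumption.

```java
import java.nio.charset.StandardCharsets;

public class EscapedFragment {
    // Rewrites "base#!state" to "base?_escaped_fragment_=state", or uses
    // "&_escaped_fragment_=" when base already carries a query string.
    // URLs without "#!" are returned unchanged.
    public static String toCrawlable(String url) {
        int idx = url.indexOf("#!");
        if (idx < 0) {
            return url;
        }
        String base = url.substring(0, idx);
        String state = url.substring(idx + 2);
        String sep = base.contains("?") ? "&" : "?";
        return base + sep + "_escaped_fragment_=" + escape(state);
    }

    // Percent-escapes the bytes the scheme singles out:
    // 0x00-0x20, '#', '%', '&', '+', and 0x7F-0xFF (UTF-8 encoded).
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            boolean special = b <= 0x20 || b >= 0x7F
                    || b == '#' || b == '%' || b == '&' || b == '+';
            if (special) {
                sb.append(String.format("%%%02X", b));
            } else {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCrawlable("http://x.com/list#!num=2"));
    }
}
```

This only helps when the site itself exposes "#!" URLs and serves snapshots for them. Pagination driven purely by a function such as onPaginationSubmit() would still need the heavier customizations mentioned above.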
 