Posted to user@nutch.apache.org by Imtiaz Shakil Siddique <sh...@gmail.com> on 2015/06/05 13:16:52 UTC

How to Collect dynamically created anchors from a page

Hi,

I am using apache-nutch-1.9. My configuration ignores external links.

I have some URLs in my seed file. The problem is that the Nutch crawler doesn't
find the links in those pages because the site populates its content using AJAX
calls. I've removed all possible regex filters inside Nutch's conf folder.

How can I collect those links? Any advice?
Thanks in advance.

Re: How to Collect dynamically created anchors from a page

Posted by Michael Joyce <jo...@apache.org>.

Have you tried using the protocol-selenium plugin? I've had luck using it to
fetch pages with dynamically loaded content.

https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-selenium
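
To illustrate why a plain HTTP fetch finds no links on such a page, here is a
minimal sketch using only Python's standard library. It is not Nutch code: the
sample pages and the names AnchorCollector, extract_links, STATIC_HTML and
RENDERED_HTML are hypothetical, made up for this example. It only shows that
anchors injected by JavaScript are absent from the HTML as served, and present
once a browser (for example one driven by Selenium) has rendered the page.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags, roughly what a static parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all anchor hrefs found in the given HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page as served over HTTP: the anchors do not exist yet,
# because a script adds them after an AJAX call.
STATIC_HTML = """
<html><body>
<div id="content"></div>
<script>
  loadItems('/api/items');
</script>
</body></html>
"""

# Hypothetical DOM after a real browser has executed the script.
RENDERED_HTML = """
<html><body>
<div id="content">
<a href="/items/1">Item 1</a>
<a href="/items/2">Item 2</a>
</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(STATIC_HTML))    # no anchors in the served HTML
    print(extract_links(RENDERED_HTML))  # anchors appear after rendering
```

A fetcher that renders the page in a browser hands the parser something like
the second document, which is why the links become visible to the crawler.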

-- Jimmy
