
Posted to dev@nutch.apache.org by Taichi Ho <he...@gmail.com> on 2015/10/03 05:53:41 UTC

Integrating Selenium with Nutch

Hi, all.

I have been experimenting with Selenium and Nutch following the link:
https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-interactiveselenium

I have been able to post a form using my custom handler, but the URL that
the form redirects to after posting doesn't seem to enter Nutch's crawldb.
Is this the expected behavior?

Also, it seems really slow to open and close Firefox for each URL that is
crawled. Is it possible to do this with multiple threads? I googled and
didn't find any promising answers. Are there any workarounds?

Thank you all.

Re: Integrating Selenium with Nutch

Posted by Michael Joyce <jo...@apache.org>.

Regarding your first question:
A handler represents a single set of interactions with a page from which
content should be extracted. Once the handler returns, the content of the
page is read out of the body and returned under the original URL, along with
the content from all the other handlers that are run. So the redirected
page will not be saved, since it's not in the content that is parsed. If
you need it to be, you can simply append the URL to the body
post-redirect in the handler.
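
As a rough illustration of "append the URL to the body", here is a minimal
Java sketch that works on plain strings only. The class RedirectLinkAppender
and its methods are hypothetical helpers, not part of Nutch or Selenium. In a
real handler you would presumably read the post-redirect URL with WebDriver's
getCurrentUrl() and inject the anchor into the live page (for example via
JavascriptExecutor) before the handler returns; that wiring is left out here
because it needs a running browser.

```java
public class RedirectLinkAppender {

    // Hypothetical helper: escape characters that are unsafe inside an HTML attribute.
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Find the last case-insensitive occurrence of needle, or -1 if absent.
    static int lastIndexOfIgnoreCase(String haystack, String needle) {
        for (int i = haystack.length() - needle.length(); i >= 0; i--) {
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Insert an anchor pointing at redirectedUrl just before the closing body
    // tag, or at the very end when the markup has no closing body tag.
    static String appendLink(String bodyHtml, String redirectedUrl) {
        String anchor = "<a href=\"" + escapeAttr(redirectedUrl) + "\">redirect target</a>";
        int idx = lastIndexOfIgnoreCase(bodyHtml, "</body>");
        if (idx < 0) {
            return bodyHtml + anchor;
        }
        return bodyHtml.substring(0, idx) + anchor + bodyHtml.substring(idx);
    }

    public static void main(String[] args) {
        // Assumed inputs, for illustration only.
        String body = "<body><p>Form submitted</p></body>";
        String redirected = "http://example.com/result?id=1&ok=true";
        System.out.println(appendLink(body, redirected));
    }
}
```

The idea is that once the redirected URL appears as an ordinary link in the
body content returned under the original URL, the parser has a chance to pick
it up like any other link on the page.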

Unfortunately, the slowness is part of the trouble with Selenium. Try
Selenium Grid if you want to parallelize it a bit:
https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-selenium#b-setting-up-a-selenium-grid

-- Jimmy
