Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/08/28 21:48:46 UTC

[jira] [Commented] (NUTCH-2088) Add Optional Execution to Interactive Selenium Handlers

    [ https://issues.apache.org/jira/browse/NUTCH-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720465#comment-14720465 ] 

ASF GitHub Bot commented on NUTCH-2088:
---------------------------------------

GitHub user MJJoyce opened a pull request:

    https://github.com/apache/nutch/pull/53

    NUTCH-2088 - Add URL Processing Check to Interactive Selenium Handlers

    - Add shouldProcessURL to Handler interface. Handlers may now check URLs to determine if they should interact with them prior to loading the necessary WebDriver.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MJJoyce/nutch NUTCH-2088

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/53.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #53
    
----
commit f56ad050a3fff01f99bc9aca44f179b2665d80b8
Author: Michael Joyce <ml...@gmail.com>
Date:   2015-08-28T19:45:39Z

    NUTCH-2088 - Add URL Processing Check to Interactive Selenium Handlers

----


> Add Optional Execution to Interactive Selenium Handlers
> -------------------------------------------------------
>
>                 Key: NUTCH-2088
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2088
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>             Fix For: 1.11
>
>
> At the moment, all Handlers run for every URL when using the interactive-selenium plugin. Often, when doing a deep crawl of a site, you'll want to handle various subdomains and paths/files differently. You can effectively filter inside the handlers today, but only after the WebDriver has been loaded and its overhead incurred. It would be much nicer if the handler interface allowed this check to occur before the request to retrieve page content.
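
A rough sketch of the idea described above, not the actual patch: each handler is asked whether it wants a URL before any driver is created, and the driver is only built when at least one handler says yes. Only the name shouldProcessURL comes from the pull request. The FakeDriver type (a stand-in so the example compiles without Selenium), the processDriver signature, DocsHandler, the "/docs/" path, and the fetch helper are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class HandlerSketch {

    /** Stand-in for Selenium's WebDriver (assumption, keeps the sketch dependency-free). */
    public interface FakeDriver {
        void get(String url);
    }

    /** Hypothetical handler interface; only shouldProcessURL is named in the PR. */
    public interface Handler {
        /** Cheap check performed before any driver is created. */
        boolean shouldProcessURL(String url);

        /** Page interaction, performed only if the check above passed. */
        void processDriver(FakeDriver driver);
    }

    /** Example handler that only cares about a hypothetical /docs/ path. */
    public static class DocsHandler implements Handler {
        @Override
        public boolean shouldProcessURL(String url) {
            return url.contains("/docs/");
        }

        @Override
        public void processDriver(FakeDriver driver) {
            // A real handler would click, scroll, or fill forms here.
        }
    }

    /**
     * Runs the interested handlers against the URL and returns how many ran.
     * The driver factory is invoked only when at least one handler wants the URL.
     */
    public static int fetch(String url, List<Handler> handlers,
                            Supplier<FakeDriver> driverFactory) {
        List<Handler> interested = new ArrayList<>();
        for (Handler h : handlers) {
            if (h.shouldProcessURL(url)) {
                interested.add(h);
            }
        }
        if (interested.isEmpty()) {
            return 0; // skip driver creation and its overhead entirely
        }
        FakeDriver driver = driverFactory.get();
        driver.get(url);
        for (Handler h : interested) {
            h.processDriver(driver);
        }
        return interested.size();
    }

    public static void main(String[] args) {
        List<Handler> handlers = new ArrayList<>();
        handlers.add(new DocsHandler());
        Supplier<FakeDriver> factory = () -> url -> { /* pretend to load the page */ };
        System.out.println(fetch("http://example.com/docs/intro", handlers, factory));
        System.out.println(fetch("http://example.com/blog/post", handlers, factory));
    }
}
```

With this shape, a URL that no handler claims never pays the driver start-up cost, which is the overhead the issue description is concerned with. Whether the real plugin skips driver creation entirely or only skips the handler call is not stated in this message.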



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)