Posted to dev@nutch.apache.org by "Mo Omer (JIRA)" <ji...@apache.org> on 2015/03/09 17:50:38 UTC

[jira] [Commented] (NUTCH-1948) Make the Selenium remote web driver specification, configuration and selection available via a Factory-type mechanism

    [ https://issues.apache.org/jira/browse/NUTCH-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353191#comment-14353191 ] 

Mo Omer commented on NUTCH-1948:
--------------------------------

Yo Lewis, 

In addition to being able to configure which driver is used, it would be nice to be able to configure which content is fetched. I was chatting with xautlx on Github (he wrote the nutch-htmlunit plugin, and is now working on a sort-of duplicate of nutch-selenium called nutch-ajax), and he mentioned that the nutch-selenium implementation is strictly HTML-tag specific, which is true. There's another Selenium driver method, `getPageSource()`, which retrieves the entire page source and may be preferable in most use cases (since it relies on the existing Tika etc. toolchain for all parsing). However, some users may only want to retrieve one or a few specific elements from a page, so driver.innerHTML should still be an option. Would something like https://github.com/xautlx/nutch-ajax/issues/2#issuecomment-77892177 be useful?
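
To make the idea a bit more concrete, here is a rough sketch of what a config-driven choice between the two could look like. The property names `selenium.content.mode` and `selenium.content.element` are made up for illustration, and the `Page` interface is only a stand-in for the Selenium driver so that the snippet runs without Selenium on the classpath. In the plugin, `pageSource()` would presumably map to `getPageSource()` and `innerHtml(...)` to the current per-tag innerHTML lookup.

```java
import java.util.Locale;
import java.util.Map;

public class ContentModeSketch {

    /** Stand-in for the Selenium driver; not a real Selenium or Nutch type. */
    public interface Page {
        String pageSource();               // plays the role of WebDriver.getPageSource()
        String innerHtml(String tagName);  // plays the role of reading innerHTML from one element
    }

    /** Hypothetical values for the hypothetical "selenium.content.mode" property. */
    public enum ContentMode { PAGE_SOURCE, ELEMENT }

    /** Returns either the whole page source or one element's inner HTML, depending on config. */
    public static String fetchContent(Page page, Map<String, String> conf) {
        // Default to the full page source so the existing parsers see the whole document.
        String raw = conf.getOrDefault("selenium.content.mode", "page_source");
        ContentMode mode = ContentMode.valueOf(raw.toUpperCase(Locale.ROOT));
        if (mode == ContentMode.ELEMENT) {
            String tag = conf.getOrDefault("selenium.content.element", "body");
            return page.innerHtml(tag);
        }
        return page.pageSource();
    }

    /** In-memory page used only to exercise the sketch. */
    public static Page fixedPage(String source, String bodyHtml) {
        return new Page() {
            @Override public String pageSource() { return source; }
            @Override public String innerHtml(String tagName) { return bodyHtml; }
        };
    }

    public static void main(String[] args) {
        Page page = fixedPage("<html><body><p>hi</p></body></html>", "<p>hi</p>");
        System.out.println(fetchContent(page, Map.of()));
        System.out.println(fetchContent(page, Map.of("selenium.content.mode", "element")));
    }
}
```

Defaulting to the full page source keeps the common case simple, while the element mode preserves the current tag-specific behaviour for anyone who relies on it.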

> Make the Selenium remote web driver specification, configuration and selection available via a Factory-type mechanism
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1948
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1948
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, protocol
>    Affects Versions: 1.10
>            Reporter: Lewis John McGibbney
>             Fix For: 1.11
>
>
> Right now we only use the FirefoxDriver; however, we could also use any of the following:
>  * ChromeDriver <https://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/chrome/ChromeDriver.html>, 
>  * FirefoxDriver <https://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/firefox/FirefoxDriver.html>, 
>  * InternetExplorerDriver <https://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/ie/InternetExplorerDriver.html>, and 
>  * SafariDriver <https://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/safari/SafariDriver.html> 
> They could be made available via a Factory-type mechanism, which would even allow us to define the driver within nutch-site.xml.
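
For what it's worth, a factory along those lines might look roughly like the sketch below. It is only an illustration under a few assumptions: the `selenium.driver` property name is hypothetical, the `Map` stands in for the Nutch configuration, and `Driver` is a stub in place of Selenium's `WebDriver` so the snippet runs without Selenium installed. In real code each registry entry would construct the matching Selenium driver instead of a stub.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

public class DriverFactorySketch {

    /** Stub standing in for org.openqa.selenium.WebDriver. */
    public interface Driver {
        String name();
    }

    // Maps a config value to a supplier that builds the corresponding driver.
    private static final Map<String, Supplier<Driver>> REGISTRY = new LinkedHashMap<>();

    static {
        // With Selenium available these would be e.g. FirefoxDriver::new, ChromeDriver::new.
        REGISTRY.put("firefox", () -> stub("FirefoxDriver"));
        REGISTRY.put("chrome", () -> stub("ChromeDriver"));
        REGISTRY.put("ie", () -> stub("InternetExplorerDriver"));
        REGISTRY.put("safari", () -> stub("SafariDriver"));
    }

    private static Driver stub(String name) {
        return () -> name;
    }

    /** Creates the driver named by the hypothetical "selenium.driver" property. */
    public static Driver create(Map<String, String> conf) {
        // Fall back to Firefox, since that is the only driver used today.
        String key = conf.getOrDefault("selenium.driver", "firefox").toLowerCase(Locale.ROOT);
        Supplier<Driver> supplier = REGISTRY.get(key);
        if (supplier == null) {
            throw new IllegalArgumentException(
                "Unknown selenium.driver '" + key + "', expected one of " + REGISTRY.keySet());
        }
        return supplier.get();
    }

    public static void main(String[] args) {
        System.out.println(create(Map.of()).name());
        System.out.println(create(Map.of("selenium.driver", "chrome")).name());
    }
}
```

Keeping the mapping in one registry means adding a driver is a one-line change, and an unknown value fails fast with the list of supported names.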



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)