Posted to commits@nutch.apache.org by sn...@apache.org on 2019/02/23 23:06:50 UTC
[nutch] branch master updated: NUTCH-2676 Update to the latest
selenium and add code to use chrome and firefox headless mode with the
remote web driver NUTCH-2460 use the headless option of firefox and chrome
in protocol-selenium - upgrade of Selenium plugin related packages - added
the use of headless mode when using Selenium nodes(chrome & firefox) -
obsolete code for Selenium plugin removed - fix of a bug occurring during
the build of the Nutch docker container - added possibility to use a
Selenium Hub orchestrator [...]
This is an automated email from the ASF dual-hosted git repository.
snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git
The following commit(s) were added to refs/heads/master by this push:
new 8f421a4 NUTCH-2676 Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver NUTCH-2460 use the headless option of firefox and chrome in protocol-selenium - upgrade of Selenium plugin related packages - added the use of headless mode when using Selenium nodes(chrome & firefox) - obsolete code for Selenium plugin removed - fix of a bug occurring during the build of the Nutch docker container - added possibility to use a Seleniu [...]
new dfd8602 Merge pull request #430 from sbatururimi/NUTCH-2676
8f421a4 is described below
commit 8f421a4114f2d3e5be8726ca735766c6b9b19dbb
Author: Stas Batururimi <s....@gmail.com>
AuthorDate: Thu Nov 15 12:12:58 2018 +0000
NUTCH-2676 Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
NUTCH-2460 use the headless option of firefox and chrome in protocol-selenium
- upgrade of Selenium plugin related packages
- added the use of headless mode when using Selenium nodes(chrome & firefox)
- obsolete code for Selenium plugin removed
- fix of a bug occurring during the build of the Nutch docker container
- added possibility to use a Selenium Hub orchestrator in multi-containers docker mode
- added several examples of using Nutch+Solr+Selenium Hub+Selenium Nodes in a network of Docker containers
---
.gitignore | 1 +
conf/nutch-default.xml | 26 +-
src/plugin/lib-selenium/README.md | 13 +
src/plugin/lib-selenium/build-ivy.xml | 2 +-
src/plugin/lib-selenium/ivy.xml | 11 +-
src/plugin/lib-selenium/plugin.xml | 120 ++-----
.../nutch/protocol/selenium/HttpWebClient.java | 352 +++++++++++++--------
7 files changed, 286 insertions(+), 239 deletions(-)
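Note on the headless switch: in the HttpWebClient.java diff below, createChromeWebDriver always passes "--no-sandbox" and "--disable-extensions" to Chrome and adds "--headless" only when selenium.enable.headless is true. The following is a minimal sketch of that mapping, not the committed code. It assumes nothing beyond the JDK: the class name ChromeArgumentsSketch and the method chromeArguments are hypothetical, and a plain list stands in for Selenium's ChromeOptions so the example runs without Selenium on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Selenium: it only mirrors the
// argument list that the commit hands to ChromeOptions.addArguments(...).
class ChromeArgumentsSketch {

  static List<String> chromeArguments(boolean enableHeadlessMode) {
    List<String> arguments = new ArrayList<>();
    // Always set in createChromeWebDriver in the diff.
    arguments.add("--no-sandbox");
    arguments.add("--disable-extensions");
    // Only set when selenium.enable.headless is true.
    if (enableHeadlessMode) {
      arguments.add("--headless");
    }
    return arguments;
  }

  public static void main(String[] args) {
    System.out.println(chromeArguments(true));
    System.out.println(chromeArguments(false));
  }
}
```

On a server with no display attached, the headless variant is presumably the one to use, which matches the code comment in the diff.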
diff --git a/.gitignore b/.gitignore
index 732ca05..61e42e0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -13,3 +13,4 @@ ivy/ivy-2.3.0.jar
ivy/ivy-2.4.0.jar
ivy/ivy-2.5.0-rc1.jar
naivebayes-model
+.gitconfig
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 97e1801..dadf30d 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -2525,10 +2525,11 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
<description>
A String value representing the flavour of Selenium
WebDriver() to use. Currently the following options
- exist - 'firefox', 'chrome', 'safari', 'opera', 'phantomjs' and 'remote'.
+ exist - 'firefox', 'chrome', 'safari', 'opera' and 'remote'.
If 'remote' is used it is essential to also set correct properties for
'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host',
- 'selenium.hub.protocol', 'selenium.grid.driver' and 'selenium.grid.binary'.
+ 'selenium.hub.protocol', 'selenium.grid.driver', 'selenium.grid.binary'
+ and 'selenium.enable.headless'.
</description>
</property>
@@ -2560,8 +2561,9 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
<name>selenium.grid.driver</name>
<value>firefox</value>
<description>A String value representing the flavour of Selenium
- WebDriver() used on the selenium grid. Currently the following options
- exist - 'firefox', 'phantomjs' </description>
+ WebDriver() used on the selenium grid. We must set `selenium.driver` to `remote` first.
+ Currently the following options
+ exist - 'firefox', 'chrome', 'random' </description>
</property>
<property>
@@ -2572,6 +2574,14 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
</description>
</property>
+<!-- headless options for Firefox and Chrome-->
+<property>
+ <name>selenium.enable.headless</name>
+ <value>false</value>
+ <description>A Boolean value representing the headless option
+ for Firefox and Chrome drivers
+ </description>
+</property>
<!-- selenium firefox configuration;
applies to protocol-selenium and protocol-interactiveselenium plugins -->
<property>
@@ -2622,6 +2632,14 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
Currently this option exist for - 'firefox' </description>
</property>
+<!-- selenium chrome configurations -->
+<property>
+ <name>webdriver.chrome.driver</name>
+ <value>/root/chromedriver</value>
+ <description>The path to the ChromeDriver binary</description>
+</property>
+<!-- end of selenium chrome configurations -->
+
<!-- protocol-interactiveselenium configuration -->
<property>
<name>interactiveselenium.handlers</name>
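Note on the 'remote' driver: the selenium.hub.protocol, selenium.hub.host, selenium.hub.port and selenium.hub.path properties named in the description above are combined into a single hub URL by HttpWebClient. The sketch below illustrates that assembly with the same defaults the Java code in this commit uses ("http", "localhost", "4444", "/wd/hub"). It is an illustration under stated assumptions: java.util.Properties is a stand-in for Hadoop's Configuration, and the class name HubUrlSketch, the method buildHubUrl and the host name "selenium-hub" are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Properties;

// Hypothetical stand-in: java.util.Properties replaces Hadoop's Configuration
// so that the sketch runs without Nutch or Selenium on the classpath.
class HubUrlSketch {

  static URL buildHubUrl(Properties conf) {
    // Defaults copied from the getDriverForPage method in this commit.
    String host = conf.getProperty("selenium.hub.host", "localhost");
    int port = Integer.parseInt(conf.getProperty("selenium.hub.port", "4444"));
    String path = conf.getProperty("selenium.hub.path", "/wd/hub");
    String protocol = conf.getProperty("selenium.hub.protocol", "http");
    try {
      return new URL(protocol, host, port, path);
    } catch (MalformedURLException e) {
      // Surface a bad protocol value as an unchecked configuration error.
      throw new IllegalArgumentException("Invalid Selenium hub settings", e);
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("selenium.hub.host", "selenium-hub"); // hypothetical host name
    System.out.println(buildHubUrl(conf)); // prints http://selenium-hub:4444/wd/hub
  }
}
```

With no properties set, the sketch yields http://localhost:4444/wd/hub, so only the values that differ from these defaults should need overriding in nutch-site.xml.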
diff --git a/src/plugin/lib-selenium/README.md b/src/plugin/lib-selenium/README.md
new file mode 100644
index 0000000..1c6b37c
--- /dev/null
+++ b/src/plugin/lib-selenium/README.md
@@ -0,0 +1,13 @@
+# Updates
+* The use of phantomjs has been deprecated. Check [Wikipedia](https://en.wikipedia.org/wiki/PhantomJS) for more info.
+* The updated code for the Safari webdriver is under development: starting with Safari 10 on OS X El Capitan and macOS Sierra, Safari comes bundled with a new driver implementation.
+* OperaDriver is now based on ChromeDriver, adapted by Opera to enable programmatic automation of Chromium-based Opera products, but it hasn't been updated since April 5, 2017. We have suspended its support and removed it from the code. ([link](https://github.com/operasoftware/operachromiumdriver))
+* Headless mode has been added for Chrome and Firefox. Set `selenium.enable.headless` to `true` in nutch-default.xml or nutch-site.xml to use it.
+
+
+You can run Nutch in Docker. Check some examples at https://github.com/sbatururimi/nutch-test.
+Don't forget to update the Dockerfile to point to the original Nutch repository when updated.
+
+# Contributors
+Stas Batururimi [s.batururimi@gmail.com]
+
diff --git a/src/plugin/lib-selenium/build-ivy.xml b/src/plugin/lib-selenium/build-ivy.xml
index 3abcf6d..fe919e5 100644
--- a/src/plugin/lib-selenium/build-ivy.xml
+++ b/src/plugin/lib-selenium/build-ivy.xml
@@ -17,7 +17,7 @@
-->
<project name="lib-selenium" default="deps-jar" xmlns:ivy="antlib:org.apache.ivy.ant">
- <property name="ivy.install.version" value="2.1.0" />
+ <property name="ivy.install.version" value="2.4.0" />
<condition property="ivy.home" value="${env.IVY_HOME}">
<isset property="env.IVY_HOME" />
</condition>
diff --git a/src/plugin/lib-selenium/ivy.xml b/src/plugin/lib-selenium/ivy.xml
index 701b725..d70dfaf 100644
--- a/src/plugin/lib-selenium/ivy.xml
+++ b/src/plugin/lib-selenium/ivy.xml
@@ -37,16 +37,13 @@
<dependencies>
<!-- begin selenium dependencies -->
- <dependency org="org.seleniumhq.selenium" name="selenium-java" rev="2.48.2" />
-
+ <dependency org="org.seleniumhq.selenium" name="selenium-java" rev="3.141.5" />
+ <!--
<dependency org="com.opera" name="operadriver" rev="1.5">
<exclude org="org.seleniumhq.selenium" name="selenium-remote-driver" />
</dependency>
- <dependency org="com.codeborne" name="phantomjsdriver" rev="1.2.1" >
- <exclude org="org.seleniumhq.selenium" name="selenium-remote-driver" />
- <exclude org="org.seleniumhq.selenium" name="selenium-java" />
- </dependency>
+ -->
<!-- end selenium dependencies -->
</dependencies>
-
+
</ivy-module>
diff --git a/src/plugin/lib-selenium/plugin.xml b/src/plugin/lib-selenium/plugin.xml
index a86d665..bf50ca0 100644
--- a/src/plugin/lib-selenium/plugin.xml
+++ b/src/plugin/lib-selenium/plugin.xml
@@ -29,147 +29,65 @@
<export name="*"/>
</library>
<!-- all classes from dependent libraries are exported -->
- <library name="cglib-nodep-2.1_3.jar">
+ <library name="animal-sniffer-annotations-1.14.jar">
<export name="*"/>
</library>
- <library name="commons-codec-1.10.jar">
+ <library name="byte-buddy-1.8.15.jar">
<export name="*"/>
</library>
- <library name="commons-collections-3.2.1.jar">
+ <library name="checker-compat-qual-2.0.0.jar">
<export name="*"/>
</library>
<library name="commons-exec-1.3.jar">
<export name="*"/>
</library>
- <library name="commons-io-2.4.jar">
+ <library name="error_prone_annotations-2.1.3.jar">
<export name="*"/>
</library>
- <library name="commons-jxpath-1.3.jar">
+ <library name="guava-25.0-jre.jar">
<export name="*"/>
</library>
- <library name="commons-lang3-3.4.jar">
+ <library name="j2objc-annotations-1.1.jar">
<export name="*"/>
</library>
- <library name="commons-logging-1.2.jar">
+ <library name="jsr305-1.3.9.jar">
<export name="*"/>
</library>
- <library name="cssparser-0.9.16.jar">
+ <library name="okhttp-3.11.0.jar">
<export name="*"/>
</library>
- <library name="gson-2.3.1.jar">
+ <library name="okio-1.14.0.jar">
<export name="*"/>
</library>
- <library name="guava-18.0.jar">
+ <library name="selenium-api-3.141.5.jar">
<export name="*"/>
</library>
- <library name="htmlunit-2.18.jar">
+ <library name="selenium-chrome-driver-3.141.5.jar">
<export name="*"/>
</library>
- <library name="htmlunit-core-js-2.17.jar">
+ <library name="selenium-edge-driver-3.141.5.jar">
<export name="*"/>
</library>
- <library name="httpclient-4.5.1.jar">
+ <library name="selenium-firefox-driver-3.141.5.jar">
<export name="*"/>
</library>
- <library name="httpcore-4.4.3.jar">
+ <library name="selenium-ie-driver-3.141.5.jar">
<export name="*"/>
</library>
- <library name="httpmime-4.5.jar">
+ <library name="selenium-java-3.141.5.jar">
<export name="*"/>
</library>
- <library name="ini4j-0.5.2.jar">
+ <library name="selenium-opera-driver-3.141.5.jar">
<export name="*"/>
</library>
- <library name="jetty-io-9.2.12.v20150709.jar">
+ <library name="selenium-remote-driver-3.141.5.jar">
<export name="*"/>
</library>
- <library name="jetty-util-9.2.12.v20150709.jar">
+ <library name="selenium-safari-driver-3.141.5.jar">
<export name="*"/>
</library>
- <library name="jna-4.1.0.jar">
- <export name="*"/>
- </library>
- <library name="jna-platform-4.1.0.jar">
- <export name="*"/>
- </library>
- <library name="nekohtml-1.9.22.jar">
- <export name="*"/>
- </library>
- <library name="netty-3.5.2.Final.jar">
- <export name="*"/>
- </library>
- <library name="operadriver-1.5.jar">
- <export name="*"/>
- </library>
- <library name="operalaunchers-1.1.jar">
- <export name="*"/>
- </library>
- <library name="phantomjsdriver-1.2.1.jar">
- <export name="*"/>
- </library>
- <library name="protobuf-java-2.4.1.jar">
- <export name="*"/>
- </library>
- <library name="sac-1.3.jar">
- <export name="*"/>
- </library>
- <library name="selenium-api-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-chrome-driver-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-edge-driver-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-firefox-driver-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-htmlunit-driver-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-ie-driver-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-java-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-leg-rc-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-remote-driver-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-safari-driver-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="selenium-support-2.48.2.jar">
- <export name="*"/>
- </library>
- <library name="serializer-2.7.2.jar">
- <export name="*"/>
- </library>
- <library name="webbit-0.4.14.jar">
- <export name="*"/>
- </library>
- <library name="websocket-api-9.2.12.v20150709.jar">
- <export name="*"/>
- </library>
- <library name="websocket-client-9.2.12.v20150709.jar">
- <export name="*"/>
- </library>
- <library name="websocket-common-9.2.12.v20150709.jar">
- <export name="*"/>
- </library>
- <library name="xalan-2.7.2.jar">
- <export name="*"/>
- </library>
- <library name="xercesImpl-2.11.0.jar">
- <export name="*"/>
- </library>
- <library name="xml-apis-1.4.01.jar">
+ <library name="selenium-support-3.141.5.jar">
<export name="*"/>
</library>
</runtime>
-
</plugin>
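Note on the 'random' grid driver: the createRandomRemoteWebDriver method in the HttpWebClient.java diff below picks between Firefox and Chrome with r.nextInt((max - min) + 1) + min, where min is 0 and max is hardcoded to 1, and its comments say the choice should later be driven by configuration. One possible way to generalise it is a table-driven pick, sketched here under stated assumptions: the class name RandomBrowserSketch, the BROWSERS array and the method pickBrowser are hypothetical and not part of the commit, and the sketch returns a browser name instead of creating a RemoteWebDriver so that it runs with the JDK alone.

```java
import java.util.Random;

// Hypothetical sketch, not the committed code: selects a browser name at
// random from a list instead of branching on a hardcoded maximum.
class RandomBrowserSketch {

  // Browser types assumed to be available on the grid; extend to add more.
  static final String[] BROWSERS = {"firefox", "chrome"};

  static String pickBrowser(Random random) {
    // nextInt(n) returns a value in [0, n), so every entry can be selected.
    // For two entries this matches nextInt((1 - 0) + 1) + 0 in the commit.
    return BROWSERS[random.nextInt(BROWSERS.length)];
  }

  public static void main(String[] args) {
    Random random = new Random();
    for (int i = 0; i < 5; i++) {
      System.out.println(pickBrowser(random));
    }
  }
}
```

Adding a browser then means adding one array entry (or, eventually, one configuration value) rather than editing an if-clause.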
diff --git a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
index 6e137f9..6af20b0 100644
--- a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
+++ b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
@@ -24,182 +24,274 @@ import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.util.concurrent.TimeUnit;
+import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
+
import org.openqa.selenium.By;
+import org.openqa.selenium.Capabilities;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.TimeoutException;
import org.openqa.selenium.WebDriver;
+
import org.openqa.selenium.chrome.ChromeDriver;
-import org.openqa.selenium.firefox.FirefoxBinary;
+import org.openqa.selenium.chrome.ChromeOptions;
+
+//import org.openqa.selenium.firefox.FirefoxBinary;
import org.openqa.selenium.firefox.FirefoxDriver;
-import org.openqa.selenium.firefox.FirefoxProfile;
+//import org.openqa.selenium.firefox.FirefoxProfile;
+import org.openqa.selenium.firefox.FirefoxOptions;
+
import org.openqa.selenium.io.TemporaryFilesystem;
+
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
-import org.openqa.selenium.safari.SafariDriver;
-import org.openqa.selenium.phantomjs.PhantomJSDriver;
-import org.openqa.selenium.phantomjs.PhantomJSDriverService;
+
+//import org.openqa.selenium.safari.SafariDriver;
+
+//import org.openqa.selenium.phantomjs.PhantomJSDriver;
+//import org.openqa.selenium.phantomjs.PhantomJSDriverService;
+
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import com.opera.core.systems.OperaDriver;
+import org.openqa.selenium.opera.OperaOptions;
+import org.openqa.selenium.opera.OperaDriver;
+//import com.opera.core.systems.OperaDriver;
public class HttpWebClient {
private static final Logger LOG = LoggerFactory
.getLogger(MethodHandles.lookup().lookupClass());
- public static ThreadLocal<WebDriver> threadWebDriver = new ThreadLocal<WebDriver>() {
-
- @Override
- protected WebDriver initialValue()
- {
- FirefoxProfile profile = new FirefoxProfile();
- profile.setPreference("permissions.default.stylesheet", 2);
- profile.setPreference("permissions.default.image", 2);
- profile.setPreference("dom.ipc.plugins.enabled.libflashplayer.so", "false");
- profile.setPreference(FirefoxProfile.ALLOWED_HOSTS_PREFERENCE, "localhost");
- WebDriver driver = new FirefoxDriver(profile);
- return driver;
- };
- };
-
public static WebDriver getDriverForPage(String url, Configuration conf) {
- WebDriver driver = null;
- DesiredCapabilities capabilities = null;
- long pageLoadWait = conf.getLong("page.load.delay", 3);
+ WebDriver driver = null;
+ long pageLoadWait = conf.getLong("page.load.delay", 3);
- try {
- String driverType = conf.get("selenium.driver", "firefox");
- switch (driverType) {
- case "firefox":
- String allowedHost = conf.get("selenium.firefox.allowed.hosts", "localhost");
- long firefoxBinaryTimeout = conf.getLong("selenium.firefox.binary.timeout", 45);
- boolean enableFlashPlayer = conf.getBoolean("selenium.firefox.enable.flash", false);
- int loadImage = conf.getInt("selenium.firefox.load.image", 1);
- int loadStylesheet = conf.getInt("selenium.firefox.load.stylesheet", 1);
- FirefoxProfile profile = new FirefoxProfile();
- FirefoxBinary binary = new FirefoxBinary();
- profile.setPreference(FirefoxProfile.ALLOWED_HOSTS_PREFERENCE, allowedHost);
- profile.setPreference("dom.ipc.plugins.enabled.libflashplayer.so", enableFlashPlayer);
- profile.setPreference("permissions.default.stylesheet", loadStylesheet);
- profile.setPreference("permissions.default.image", loadImage);
- binary.setTimeout(TimeUnit.SECONDS.toMillis(firefoxBinaryTimeout));
- driver = new FirefoxDriver(binary, profile);
- break;
- case "chrome":
- driver = new ChromeDriver();
- break;
- case "safari":
- driver = new SafariDriver();
- break;
- case "opera":
- driver = new OperaDriver();
- break;
- case "phantomjs":
- driver = new PhantomJSDriver();
- break;
- case "remote":
- String seleniumHubHost = conf.get("selenium.hub.host", "localhost");
- int seleniumHubPort = Integer.parseInt(conf.get("selenium.hub.port", "4444"));
- String seleniumHubPath = conf.get("selenium.hub.path", "/wd/hub");
- String seleniumHubProtocol = conf.get("selenium.hub.protocol", "http");
- String seleniumGridDriver = conf.get("selenium.grid.driver","firefox");
- String seleniumGridBinary = conf.get("selenium.grid.binary");
-
- switch (seleniumGridDriver){
- case "firefox":
- capabilities = DesiredCapabilities.firefox();
- capabilities.setBrowserName("firefox");
- capabilities.setJavascriptEnabled(true);
- capabilities.setCapability("firefox_binary",seleniumGridBinary);
- System.setProperty("webdriver.reap_profile", "false");
- driver = new RemoteWebDriver(new URL(seleniumHubProtocol, seleniumHubHost, seleniumHubPort, seleniumHubPath), capabilities);
- break;
- case "phantomjs":
- capabilities = DesiredCapabilities.phantomjs();
- capabilities.setBrowserName("phantomjs");
- capabilities.setJavascriptEnabled(true);
- capabilities.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,seleniumGridBinary);
- driver = new RemoteWebDriver(new URL(seleniumHubProtocol, seleniumHubHost, seleniumHubPort, seleniumHubPath), capabilities);
- break;
- default:
- LOG.error("The Selenium Grid WebDriver choice {} is not available... defaulting to FirefoxDriver().", driverType);
- driver = new RemoteWebDriver(new URL(seleniumHubProtocol, seleniumHubHost, seleniumHubPort, seleniumHubPath), DesiredCapabilities.firefox());
- break;
- }
- break;
- default:
- LOG.error("The Selenium WebDriver choice {} is not available... defaulting to FirefoxDriver().", driverType);
- driver = new FirefoxDriver();
- break;
+ try {
+ String driverType = conf.get("selenium.driver", "firefox");
+ boolean enableHeadlessMode = conf.getBoolean("selenium.enable.headless",
+ false);
+
+ switch (driverType) {
+ case "firefox":
+ String geckoDriverPath = conf.get("selenium.grid.binary",
+ "/root/geckodriver");
+ driver = createFirefoxWebDriver(geckoDriverPath, enableHeadlessMode);
+ break;
+ case "chrome":
+ String chromeDriverPath = conf.get("selenium.grid.binary",
+ "/root/chromedriver");
+ driver = createChromeWebDriver(chromeDriverPath, enableHeadlessMode);
+ break;
+ // case "opera":
+ // // This class is provided as a convenience for easily testing the
+ // Chrome browser.
+ // String operaDriverPath = conf.get("selenium.grid.binary",
+ // "/root/operadriver");
+ // driver = createOperaWebDriver(operaDriverPath, enableHeadlessMode);
+ // break;
+ case "remote":
+ String seleniumHubHost = conf.get("selenium.hub.host", "localhost");
+ int seleniumHubPort = Integer
+ .parseInt(conf.get("selenium.hub.port", "4444"));
+ String seleniumHubPath = conf.get("selenium.hub.path", "/wd/hub");
+ String seleniumHubProtocol = conf.get("selenium.hub.protocol", "http");
+ URL seleniumHubUrl = new URL(seleniumHubProtocol, seleniumHubHost,
+ seleniumHubPort, seleniumHubPath);
+
+ String seleniumGridDriver = conf.get("selenium.grid.driver", "firefox");
+
+ switch (seleniumGridDriver) {
+ case "firefox":
+ driver = createFirefoxRemoteWebDriver(seleniumHubUrl,
+ enableHeadlessMode);
+ break;
+ case "chrome":
+ driver = createChromeRemoteWebDriver(seleniumHubUrl,
+ enableHeadlessMode);
+ break;
+ case "random":
+ driver = createRandomRemoteWebDriver(seleniumHubUrl,
+ enableHeadlessMode);
+ break;
+ default:
+ LOG.error(
+ "The Selenium Grid WebDriver choice {} is not available... defaulting to FirefoxDriver().",
+ seleniumGridDriver);
+ driver = createDefaultRemoteWebDriver(seleniumHubUrl,
+ enableHeadlessMode);
+ break;
}
- LOG.debug("Selenium {} WebDriver selected.", driverType);
-
- driver.manage().timeouts().pageLoadTimeout(pageLoadWait, TimeUnit.SECONDS);
- driver.get(url);
- } catch (Exception e) {
- if(e instanceof TimeoutException) {
- LOG.debug("Selenium WebDriver: Timeout Exception: Capturing whatever loaded so far...");
- return driver;
- }
- cleanUpDriver(driver);
- throw new RuntimeException(e);
- }
-
- return driver;
- }
+ break;
+ default:
+ LOG.error(
+ "The Selenium WebDriver choice {} is not available... defaulting to FirefoxDriver().",
+ driverType);
+ FirefoxOptions options = new FirefoxOptions();
+ driver = new FirefoxDriver(options);
+ break;
+ }
+ LOG.debug("Selenium {} WebDriver selected.", driverType);
- public static String getHTMLContent(WebDriver driver, Configuration conf) {
- if (conf.getBoolean("take.screenshot", false)) {
- takeScreenshot(driver, conf);
+ driver.manage().timeouts().pageLoadTimeout(pageLoadWait,
+ TimeUnit.SECONDS);
+ driver.get(url);
+ } catch (Exception e) {
+ if (e instanceof TimeoutException) {
+ LOG.error(
+ "Selenium WebDriver: Timeout Exception: Capturing whatever loaded so far...");
+ return driver;
+ } else {
+ LOG.error(e.toString());
}
+ cleanUpDriver(driver);
+ throw new RuntimeException(e);
+ }
+
+ return driver;
+ }
+
+ public static WebDriver createFirefoxWebDriver(String firefoxDriverPath,
+ boolean enableHeadlessMode) {
+ System.setProperty("webdriver.gecko.driver", firefoxDriverPath);
+ FirefoxOptions firefoxOptions = new FirefoxOptions();
+ if (enableHeadlessMode) {
+ firefoxOptions.addArguments("--headless");
+ }
+ WebDriver driver = new FirefoxDriver(firefoxOptions);
+ return driver;
+ }
- return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
+ public static WebDriver createChromeWebDriver(String chromeDriverPath,
+ boolean enableHeadlessMode) {
+ // if not specified, WebDriver will search your path for chromedriver
+ System.setProperty("webdriver.chrome.driver", chromeDriverPath);
+ ChromeOptions chromeOptions = new ChromeOptions();
+ chromeOptions.addArguments("--no-sandbox");
+ chromeOptions.addArguments("--disable-extensions");
+ // be sure to set selenium.enable.headless to true if no monitor attached
+ // to your server
+ if (enableHeadlessMode) {
+ chromeOptions.addArguments("--headless");
+ }
+ WebDriver driver = new ChromeDriver(chromeOptions);
+ return driver;
+ }
+
+ public static WebDriver createOperaWebDriver(String operaDriverPath,
+ boolean enableHeadlessMode) {
+ // if not specified, WebDriver will search your path for operadriver
+ System.setProperty("webdriver.opera.driver", operaDriverPath);
+ OperaOptions operaOptions = new OperaOptions();
+ // operaOptions.setBinary("/usr/bin/opera");
+ operaOptions.addArguments("--no-sandbox");
+ operaOptions.addArguments("--disable-extensions");
+ // be sure to set selenium.enable.headless to true if no monitor attached
+ // to your server
+ if (enableHeadlessMode) {
+ operaOptions.addArguments("--headless");
+ }
+ WebDriver driver = new OperaDriver(operaOptions);
+ return driver;
+ }
+
+ public static RemoteWebDriver createFirefoxRemoteWebDriver(URL seleniumHubUrl,
+ boolean enableHeadlessMode) {
+ FirefoxOptions firefoxOptions = new FirefoxOptions();
+ if (enableHeadlessMode) {
+ firefoxOptions.setHeadless(true);
+ }
+ RemoteWebDriver driver = new RemoteWebDriver(seleniumHubUrl,
+ firefoxOptions);
+ return driver;
+ }
+
+ public static RemoteWebDriver createChromeRemoteWebDriver(URL seleniumHubUrl,
+ boolean enableHeadlessMode) {
+ ChromeOptions chromeOptions = new ChromeOptions();
+ if (enableHeadlessMode) {
+ chromeOptions.setHeadless(true);
+ }
+ RemoteWebDriver driver = new RemoteWebDriver(seleniumHubUrl, chromeOptions);
+ return driver;
+ }
+
+ public static RemoteWebDriver createRandomRemoteWebDriver(URL seleniumHubUrl,
+ boolean enableHeadlessMode) {
+ // Only two browser types can currently be chosen: Firefox and Chrome.
+ Random r = new Random();
+ int min = 0;
+ // The maximum index is hardcoded for now. It should later move to the
+ // configuration file so that more browser types (e.g. Edge, Opera, Safari)
+ // can be chosen at random.
+ int max = 1; // for 3 types, change to 2 and update the if-clause
+ int num = r.nextInt((max - min) + 1) + min;
+ if (num == 0) {
+ return createFirefoxRemoteWebDriver(seleniumHubUrl, enableHeadlessMode);
+ }
+
+ return createChromeRemoteWebDriver(seleniumHubUrl, enableHeadlessMode);
+ }
+
+ public static RemoteWebDriver createDefaultRemoteWebDriver(URL seleniumHubUrl,
+ boolean enableHeadlessMode) {
+ return createFirefoxRemoteWebDriver(seleniumHubUrl, enableHeadlessMode);
}
public static void cleanUpDriver(WebDriver driver) {
if (driver != null) {
try {
- driver.close();
+ // driver.close();
driver.quit();
TemporaryFilesystem.getDefaultTmpFS().deleteTemporaryFiles();
} catch (Exception e) {
- throw new RuntimeException(e);
+ LOG.error(e.toString());
+ // throw new RuntimeException(e);
}
}
}
/**
- * Function for obtaining the HTML BODY using the selected
- * <a href='https://seleniumhq.github.io/selenium/docs/api/java/org/openqa/selenium/WebDriver.html'>selenium webdriver</a>
- * There are a number of configuration properties within
- * <code>nutch-site.xml</code> which determine whether to
- * take screenshots of the rendered pages and persist them
- * as timestamped .png's into HDFS.
- * @param url the URL to fetch and render
- * @param conf the {@link org.apache.hadoop.conf.Configuration}
+ * Function for obtaining the HTML BODY using the selected <a href=
+ * 'https://seleniumhq.github.io/selenium/docs/api/java/org/openqa/selenium/WebDriver.html'>selenium
+ * webdriver</a> There are a number of configuration properties within
+ * <code>nutch-site.xml</code> which determine whether to take screenshots of
+ * the rendered pages and persist them as timestamped .png's into HDFS.
+ *
+ * @param url
+ * the URL to fetch and render
+ * @param conf
+ * the {@link org.apache.hadoop.conf.Configuration}
* @return the rendered inner HTML page
*/
public static String getHtmlPage(String url, Configuration conf) {
WebDriver driver = getDriverForPage(url, conf);
-
+
try {
if (conf.getBoolean("take.screenshot", false)) {
takeScreenshot(driver, conf);
}
- String innerHtml = driver.findElement(By.tagName("body")).getAttribute("innerHTML");
+ String innerHtml = driver.findElement(By.tagName("body"))
+ .getAttribute("innerHTML");
return innerHtml;
- // I'm sure this catch statement is a code smell ; borrowing it from lib-htmlunit
+ // I'm sure this catch statement is a code smell ; borrowing it from
+ // lib-htmlunit
} catch (Exception e) {
TemporaryFilesystem.getDefaultTmpFS().deleteTemporaryFiles();
+ // throw new RuntimeException(e);
+ LOG.error("getHtmlPage(url, conf): " + e.toString());
throw new RuntimeException(e);
} finally {
cleanUpDriver(driver);
@@ -213,24 +305,32 @@ public class HttpWebClient {
private static void takeScreenshot(WebDriver driver, Configuration conf) {
try {
String url = driver.getCurrentUrl();
- File srcFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
+ File srcFile = ((TakesScreenshot) driver)
+ .getScreenshotAs(OutputType.FILE);
LOG.debug("In-memory screenshot taken of: {}", url);
FileSystem fs = FileSystem.get(conf);
if (conf.get("screenshot.location") != null) {
- Path screenshotPath = new Path(conf.get("screenshot.location") + "/" + srcFile.getName());
+ Path screenshotPath = new Path(
+ conf.get("screenshot.location") + "/" + srcFile.getName());
OutputStream os = null;
if (!fs.exists(screenshotPath)) {
- LOG.debug("No existing screenshot already exists... creating new file at {} {}.", screenshotPath, srcFile.getName());
+ LOG.debug(
+ "No existing screenshot already exists... creating new file at {} {}.",
+ screenshotPath, srcFile.getName());
os = fs.create(screenshotPath);
}
InputStream is = new BufferedInputStream(new FileInputStream(srcFile));
IOUtils.copyBytes(is, os, conf);
- LOG.debug("Screenshot for {} successfully saved to: {} {}", url, screenshotPath, srcFile.getName());
+ LOG.debug("Screenshot for {} successfully saved to: {} {}", url,
+ screenshotPath, srcFile.getName());
} else {
- LOG.warn("Screenshot for {} not saved to HDFS (subsequently disgarded) as value for "
- + "'screenshot.location' is absent from nutch-site.xml.", url);
+ LOG.warn(
+ "Screenshot for {} not saved to HDFS (subsequently discarded) as value for "
+ + "'screenshot.location' is absent from nutch-site.xml.",
+ url);
}
} catch (Exception e) {
+ LOG.error("Error taking screenshot: ", e);
cleanUpDriver(driver);
throw new RuntimeException(e);
}