Posted to user@nutch.apache.org by sw...@quandatics.com on 2021/10/28 09:45:30 UTC
javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin
Hi there,
Good day!
We would like to crawl web data by running Nutch with the Selenium
plugin, using the following command:
$ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http \
    https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
However, it failed with the following error message:
2021-10-26 19:07:53,961 INFO selenium.Http - http.proxy.host = xxx.xx.xx.xx
2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.port = xxxx
2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.exception.list = true
2021-10-26 19:07:53,962 INFO selenium.Http - http.timeout = 10000
2021-10-26 19:07:53,962 INFO selenium.Http - http.content.limit = 1048576
2021-10-26 19:07:53,962 INFO selenium.Http - http.agent = Apache Nutch Test/Nutch-1.18
2021-10-26 19:07:53,962 INFO selenium.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2021-10-26 19:07:53,962 INFO selenium.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2021-10-26 19:07:53,962 INFO selenium.Http - http.enable.cookie.header = true
2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
    at sun.security.ssl.SSL
FYI, we have tried the following approaches, but the issue persists:
1. Setting http.tls.certificates.check to false
2. Importing the website's certificates into our Java truststores
Note that our Nutch is configured to use a proxy.
Kindly advise. Thanks in advance!
Best Regards,
Shi Wei
Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin
Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
The issue is now tracked in
https://issues.apache.org/jira/browse/NUTCH-2907
Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin
Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Shi Wei,
Sorry, but it looks like the Selenium protocol plugin has never been
used with a proxy over HTTPS. At first glance, two points need
rework:
1. The protocol tries to establish a TLS/SSL connection to the proxy if
the URL to be crawled is an https:// URL. There might be some proxies
which can do this, but the proxies I'm aware of expect an HTTP CONNECT
[1] for HTTPS proxying.
2. The browser / driver probably also needs to be configured to
use the same proxy. As far as I can see, this isn't done, but it is
a requirement if the proxy is required for accessing web content.
However, it might be possible by setting environment variables.
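To illustrate the environment-variable idea in point 2, here is a minimal sketch using only the JDK. It prepares a child process whose environment carries the conventional http_proxy / https_proxy variables. The class name DriverProxyEnv, the method withProxyEnv, the command "geckodriver" and the proxy address are all hypothetical placeholders, and whether a particular browser or driver actually honors these variables is an assumption that would need to be checked. This is not code from the Selenium plugin.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class DriverProxyEnv {

    /**
     * Prepares (but does not start) a process for the given command with the
     * conventional proxy environment variables set. Assumption: the launched
     * program reads http_proxy / https_proxy; not every browser or driver does.
     */
    static ProcessBuilder withProxyEnv(List<String> command, String proxyHost, int proxyPort) {
        ProcessBuilder pb = new ProcessBuilder(command);
        String proxyUrl = "http://" + proxyHost + ":" + proxyPort;
        Map<String, String> env = pb.environment();
        // Set both spellings, since tools differ in which one they read.
        env.put("http_proxy", proxyUrl);
        env.put("https_proxy", proxyUrl);
        env.put("HTTP_PROXY", proxyUrl);
        env.put("HTTPS_PROXY", proxyUrl);
        return pb;
    }

    public static void main(String[] args) {
        // Hypothetical driver command and placeholder proxy address.
        ProcessBuilder pb = withProxyEnv(Arrays.asList("geckodriver"), "proxy.example.org", 3128);
        System.out.println(pb.environment().get("https_proxy"));
    }
}
```

The process is only prepared here, not started, so the sketch can be run without any driver installed.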
Sorry again. Feel free to open a Jira issue to get this fixed.
Best,
Sebastian
[1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method
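For readers unfamiliar with the CONNECT method referenced in [1], the following is a rough sketch of the exchange using plain JDK sockets: the client talks to the proxy in clear text, asks it to open a tunnel to the target host, and only then starts the TLS handshake through that tunnel. The class ConnectTunnel and its method names are hypothetical, proxy authentication is not handled, and this is neither the plugin's code nor the fix tracked in NUTCH-2907.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class ConnectTunnel {

    /** Builds the plain-text CONNECT request that is sent to the proxy. */
    static String buildConnectRequest(String targetHost, int targetPort) {
        String authority = targetHost + ":" + targetPort;
        return "CONNECT " + authority + " HTTP/1.1\r\n"
             + "Host: " + authority + "\r\n"
             + "\r\n";
    }

    /** Returns true if the proxy's status line carries a 2xx status code. */
    static boolean isTunnelEstablished(String statusLine) {
        if (statusLine == null) {
            return false;
        }
        String[] parts = statusLine.trim().split("\\s+");
        return parts.length >= 2
            && parts[0].startsWith("HTTP/")
            && parts[1].length() == 3
            && parts[1].charAt(0) == '2';
    }

    /** Reads one line byte by byte so no bytes beyond the line are consumed. */
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    /** Opens a tunnel through the proxy, then performs TLS with the target host. */
    static SSLSocket open(String proxyHost, int proxyPort,
                          String targetHost, int targetPort) throws IOException {
        Socket plain = new Socket(proxyHost, proxyPort); // clear-text connection to the proxy
        try {
            OutputStream out = plain.getOutputStream();
            out.write(buildConnectRequest(targetHost, targetPort)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = plain.getInputStream();
            String statusLine = readLine(in);
            if (!isTunnelEstablished(statusLine)) {
                throw new IOException("Proxy refused CONNECT: " + statusLine);
            }
            // Skip the remaining response headers up to the empty line.
            while (!readLine(in).isEmpty()) {
                // header content is ignored in this sketch
            }

            // Layer TLS over the established tunnel; the handshake is with the target host.
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
            SSLSocket tls = (SSLSocket) factory.createSocket(plain, targetHost, targetPort, true);
            tls.startHandshake();
            return tls;
        } catch (IOException e) {
            plain.close();
            throw e;
        }
    }
}
```

Starting the TLS handshake directly against a proxy that expects this clear-text exchange would be consistent with the "Remote host closed connection during handshake" error reported in the original message, although that diagnosis is only the hypothesis given in the reply.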