Posted to user@nutch.apache.org by sw...@quandatics.com on 2021/10/28 09:45:30 UTC

javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

Hi there,

Good day!

We would like to crawl web data by running Nutch with the Selenium
plugin, using the following command:

$ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial

However, it failed with the following error message:

2021-10-26 19:07:53,961 INFO  selenium.Http - http.proxy.host = xxx.xx.xx.xx
2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.port = xxxx
2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.exception.list = true
2021-10-26 19:07:53,962 INFO  selenium.Http - http.timeout = 10000
2021-10-26 19:07:53,962 INFO  selenium.Http - http.content.limit = 1048576
2021-10-26 19:07:53,962 INFO  selenium.Http - http.agent = Apache Nutch Test/Nutch-1.18
2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2021-10-26 19:07:53,962 INFO  selenium.Http - http.enable.cookie.header = true
2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
        at sun.security.ssl.SSL

FYI, we have tried the following approaches, but the issue persisted:

1. Set http.tls.certificates.check to false
2. Imported the website's certificates into our Java truststores
3. Our Nutch is configured with a proxy

Kindly advise. Thanks in advance!

Best Regards,
Shi Wei


Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
The issue is now tracked in
  https://issues.apache.org/jira/browse/NUTCH-2907


Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Shi Wei,

Sorry, but it looks like the Selenium protocol plugin has never been
used with a proxy over HTTPS. At first glance, two points need
rework:

1. The protocol tries to establish a TLS/SSL connection to the proxy if
the URL to be crawled is an https:// URL. There might be some proxies
which can do this, but the proxies I'm aware of expect an HTTP CONNECT
[1] for HTTPS proxying.

2. The browser / driver probably also needs to be configured to
use the same proxy. As far as I can see, this isn't done, but it is a
requirement if the proxy is needed to access web content. However, it
might be possible by setting environment variables.

Sorry again. Feel free to open a Jira issue to get this fixed.

Best,
Sebastian

[1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method
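
To illustrate point 1, here is a minimal Java sketch of the HTTP CONNECT
approach: talk plain HTTP to the proxy first, ask it to open a tunnel to
the target host, and only then start the TLS handshake over that tunnel.
The class ProxyTunnelSketch and all of its method names are hypothetical;
this is not code from Nutch or from the protocol-selenium plugin. It
assumes a proxy that accepts CONNECT without authentication.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical helper for illustration; not part of Nutch or protocol-selenium. */
final class ProxyTunnelSketch {

    /** Builds the plain-text CONNECT request that is sent to the proxy. */
    static String buildConnectRequest(String targetHost, int targetPort) {
        String authority = targetHost + ":" + targetPort;
        return "CONNECT " + authority + " HTTP/1.1\r\n"
             + "Host: " + authority + "\r\n"
             + "\r\n";
    }

    /** Returns true if the proxy's status line carries a 2xx status code. */
    static boolean isTunnelEstablished(String statusLine) {
        if (statusLine == null) {
            return false;
        }
        String[] parts = statusLine.trim().split("\\s+");
        return parts.length >= 2
            && parts[0].startsWith("HTTP/")
            && parts[1].length() == 3
            && parts[1].charAt(0) == '2';
    }

    /**
     * Reads the proxy's response head one byte at a time, stopping at the
     * blank line, so that no bytes belonging to the TLS stream are consumed.
     */
    static String readResponseHead(InputStream in) throws IOException {
        StringBuilder head = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            head.append((char) b);
            int n = head.length();
            if (n >= 4 && head.substring(n - 4).equals("\r\n\r\n")) {
                break;
            }
        }
        return head.toString();
    }

    /**
     * Opens a plain socket to the proxy, requests a tunnel, then layers TLS
     * on top of it. Assumption: the proxy needs no authentication.
     */
    static SSLSocket openTunnel(String proxyHost, int proxyPort,
                                String targetHost, int targetPort) throws IOException {
        Socket plain = new Socket(proxyHost, proxyPort);
        try {
            OutputStream out = plain.getOutputStream();
            out.write(buildConnectRequest(targetHost, targetPort)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();

            String statusLine = readResponseHead(plain.getInputStream()).split("\r\n", 2)[0];
            if (!isTunnelEstablished(statusLine)) {
                throw new IOException("Proxy refused CONNECT: " + statusLine);
            }

            // The TLS handshake is done with the target host, not with the proxy.
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
            SSLSocket tls = (SSLSocket) factory.createSocket(plain, targetHost, targetPort, true);
            tls.startHandshake();
            return tls;
        } catch (IOException e) {
            plain.close();
            throw e;
        }
    }

    public static void main(String[] args) {
        // Prints the request only; no network connection is opened here.
        System.out.print(buildConnectRequest("cwiki.apache.org", 443));
    }
}
```

The sketch covers point 1 only. Point 2 (passing the proxy to the
browser / driver) is not covered by it.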

