Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com.INVALID> on 2021/11/18 15:45:58 UTC

Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

The issue is now tracked in
  https://issues.apache.org/jira/browse/NUTCH-2907

On 10/28/21 15:31, Sebastian Nagel wrote:
> Hi Shi Wei,
> 
> Sorry, but it looks like the Selenium protocol plugin has never been
> used with a proxy over HTTPS. At first glance, two points need
> rework:
> 
> 1. The protocol plugin tries to establish a TLS/SSL connection to the
> proxy if the URL to be crawled is an https:// URL. Some proxies might
> support this, but the proxies I'm aware of expect an HTTP CONNECT
> [1] for HTTPS proxying.
> 
> 2. The browser / driver probably also needs to be configured to
> use the same proxy. As far as I can see, this isn't done, but it is a
> requirement if the proxy is needed to access web content. However, it
> might be possible by setting environment variables.
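
For illustration, point 1 could be sketched in plain Java roughly as follows. This is only a minimal sketch using JDK classes, not the plugin's actual code or the eventual fix: the class ConnectTunnelSketch and its methods are hypothetical names. It sends a plain-text CONNECT request to the proxy, checks for a 2xx status, and only then starts the TLS handshake with the target host over the tunnelled socket.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical helper (not part of Nutch): TLS through an HTTP proxy via CONNECT. */
final class ConnectTunnelSketch {

  /** Builds the plain-text CONNECT request that is sent to the proxy. */
  static String buildConnectRequest(String targetHost, int targetPort) {
    String authority = targetHost + ":" + targetPort;
    return "CONNECT " + authority + " HTTP/1.1\r\n"
        + "Host: " + authority + "\r\n"
        + "\r\n";
  }

  /** Returns true if the proxy's status line carries a 2xx status code. */
  static boolean isTunnelEstablished(String statusLine) {
    if (statusLine == null) {
      return false;
    }
    String[] parts = statusLine.trim().split("\\s+");
    return parts.length >= 2
        && parts[0].startsWith("HTTP/")
        && parts[1].length() == 3
        && parts[1].charAt(0) == '2';
  }

  /** Reads byte by byte up to the blank line that ends the response headers. */
  static String readHeaders(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = in.read()) != -1) {
      sb.append((char) b);
      if (sb.length() >= 4 && sb.substring(sb.length() - 4).equals("\r\n\r\n")) {
        break;
      }
    }
    return sb.toString();
  }

  /** Opens a plain socket to the proxy, requests a tunnel, then layers TLS on top. */
  static SSLSocket openTunnel(String proxyHost, int proxyPort,
                              String targetHost, int targetPort) throws IOException {
    Socket plain = new Socket(proxyHost, proxyPort);
    try {
      // Talk plain HTTP to the proxy first: no TLS handshake with the proxy itself.
      OutputStream out = plain.getOutputStream();
      out.write(buildConnectRequest(targetHost, targetPort)
          .getBytes(StandardCharsets.US_ASCII));
      out.flush();

      String statusLine = readHeaders(plain.getInputStream()).split("\r\n", 2)[0];
      if (!isTunnelEstablished(statusLine)) {
        throw new IOException("Proxy refused CONNECT: " + statusLine);
      }

      // The TLS handshake now runs end-to-end with the target host through the tunnel.
      SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
      SSLSocket tls = (SSLSocket) factory.createSocket(plain, targetHost, targetPort, true);
      tls.startHandshake();
      return tls;
    } catch (IOException e) {
      plain.close();
      throw e;
    }
  }
}
```

Proxy authentication (a 407 response) and socket timeouts are left out of this sketch.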
> 
> Sorry again. Feel free to open a Jira issue to get this fixed.
> 
> Best,
> Sebastian
> 
> [1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method
> 
> 
> On 10/28/21 11:45, sw.ling@quandatics.com wrote:
>> Hi there,
>>
>> Good day!
>>
>> We would like to crawl web data by running Nutch with the Selenium
>> plugin, using the following command:
>>
>> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
>>
>> However, it failed with the following error message:
>>
>> 2021-10-26 19:07:53,961 INFO  selenium.Http - http.proxy.host = xxx.xx.xx.xx
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.port = xxxx
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.exception.list = true
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.timeout = 10000
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.content.limit = 1048576
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.agent = Apache Nutch Test/Nutch-1.18
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.enable.cookie.header = true
>> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
>> javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
>>         at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
>>         at sun.security.ssl.SSL
>>
>> FYI, we have tried the following approaches, but the issue persisted:
>>
>> 1. Set http.tls.certificates.check to false
>> 2. Imported the website's certificates into our Java truststores
>> 3. Our Nutch is configured with a proxy
>>
>> Kindly advise. Thanks in advance!
>>
>> Best Regards,
>> Shi Wei