Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com.INVALID> on 2021/11/18 15:45:58 UTC
Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin
The issue is now tracked in
https://issues.apache.org/jira/browse/NUTCH-2907
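
For readers unfamiliar with the HTTP CONNECT handshake mentioned in point 1 of the quoted message below, here is a minimal sketch of how a client usually tunnels HTTPS through an HTTP proxy: connect to the proxy in plain text, send CONNECT, wait for a 2xx response, and only then start TLS with the target host over the same socket. This is not the code of the Selenium protocol plugin or the fix tracked in the Jira issue. The class and method names (ConnectTunnelSketch, buildConnectRequest, isTunnelEstablished, readStatusLine, openTunnel) are hypothetical, and the sketch assumes a proxy without authentication and the JVM's default SSLSocketFactory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical illustration only; not part of Apache Nutch.
class ConnectTunnelSketch {

    /** Builds the plain-text CONNECT request that is sent to the proxy. */
    static String buildConnectRequest(String targetHost, int targetPort) {
        String authority = targetHost + ":" + targetPort;
        return "CONNECT " + authority + " HTTP/1.1\r\n"
             + "Host: " + authority + "\r\n"
             + "\r\n";
    }

    /** Returns true if the proxy's status line carries a 2xx status code. */
    static boolean isTunnelEstablished(String statusLine) {
        if (statusLine == null) {
            return false;
        }
        String[] parts = statusLine.trim().split("\\s+");
        return parts.length >= 2
            && parts[0].startsWith("HTTP/")
            && parts[1].length() == 3
            && parts[1].charAt(0) == '2';
    }

    /**
     * Reads the proxy's response header byte by byte up to the empty line,
     * so that no bytes belonging to the later TLS handshake are consumed,
     * and returns the first (status) line.
     */
    static String readStatusLine(InputStream in) throws IOException {
        StringBuilder head = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            head.append((char) b);
            int n = head.length();
            if (n >= 4 && head.substring(n - 4).equals("\r\n\r\n")) {
                break;
            }
        }
        int eol = head.indexOf("\r\n");
        return eol >= 0 ? head.substring(0, eol) : head.toString();
    }

    /** Opens a TLS connection to the target host through an HTTP proxy. */
    static SSLSocket openTunnel(String proxyHost, int proxyPort,
                                String targetHost, int targetPort) throws IOException {
        // Step 1: plain TCP connection to the proxy (no TLS yet).
        Socket plain = new Socket(proxyHost, proxyPort);
        try {
            // Step 2: ask the proxy to open a tunnel to the target.
            OutputStream out = plain.getOutputStream();
            out.write(buildConnectRequest(targetHost, targetPort)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            String status = readStatusLine(plain.getInputStream());
            if (!isTunnelEstablished(status)) {
                throw new IOException("Proxy refused CONNECT: " + status);
            }
            // Step 3: layer TLS on top of the tunnelled socket; the
            // handshake now happens with the target host, not the proxy.
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
            SSLSocket tls = (SSLSocket) factory.createSocket(plain, targetHost, targetPort, true);
            tls.startHandshake();
            return tls;
        } catch (IOException e) {
            plain.close();
            throw e;
        }
    }
}
```

Starting the TLS handshake directly against the proxy port, without the CONNECT step, is what a proxy that only speaks plain HTTP would typically answer by closing the connection, which would be consistent with the SSLHandshakeException quoted below.
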
On 10/28/21 15:31, Sebastian Nagel wrote:
> Hi Shi Wei,
>
> Sorry, but it looks like the Selenium protocol plugin has never been
> used with a proxy over HTTPS. At first glance, two points need
> rework:
>
> 1. The protocol plugin tries to establish a TLS/SSL connection to the
> proxy if the URL to be crawled is an https:// URL. There might be some
> proxies which can do this, but the proxies I'm aware of expect an HTTP
> CONNECT [1] for HTTPS proxying.
>
> 2. The browser / driver probably also needs to be configured to
> use the same proxy. AFAICS, this isn't done, but it is a requirement
> if the proxy is needed to access web content. However, it
> might be possible by setting environment variables.
>
> Sorry again. Feel free to open a Jira issue to get this fixed.
>
> Best,
> Sebastian
>
> [1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method
>
>
> On 10/28/21 11:45, sw.ling@quandatics.com wrote:
>> Hi there,
>>
>> Good day!
>>
>> We would like to crawl web data by running Nutch with the Selenium
>> plugin, using the following command:
>>
>> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
>>
>> However, it failed with the following error message:
>>
>> 2021-10-26 19:07:53,961 INFO selenium.Http - http.proxy.host = xxx.xx.xx.xx
>> 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.port = xxxx
>> 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.exception.list = true
>> 2021-10-26 19:07:53,962 INFO selenium.Http - http.timeout = 10000
>> 2021-10-26 19:07:53,962 INFO selenium.Http - http.content.limit = 1048576
>> 2021-10-26 19:07:53,962 INFO selenium.Http - http.agent = Apache Nutch Test/Nutch-1.18
>> 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> 2021-10-26 19:07:53,962 INFO selenium.Http - http.enable.cookie.header = true
>> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
>> javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
>>     at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
>>     at sun.security.ssl.SSL
>>
>> FYI, we have tried the following, but the issue persisted:
>>
>> 1. Set http.tls.certificates.check to false
>> 2. Imported the website's certificates into our Java truststores
>> 3. Our Nutch is configured with a proxy
>>
>> Kindly advise. Thanks in advance!
>>
>> Best Regards,
>> Shi Wei