Posted to user@nutch.apache.org by Sadiki Latty <sl...@uottawa.ca> on 2017/11/28 16:08:28 UTC

Certificates

Hey all,

I have a question regarding self-signed certs. I will be using Nutch to crawl http and https sites, as well as to index to https Solr servers that use self-signed certificates. I managed to add certificates to Solr and it fixed their inter-node communication, but I have yet to find where in Nutch I can do a similar configuration. I have seen articles saying that the protocol-httpclient plugin should be able to do it with some code modifications, but the caveat is that httpclient may have underlying bugs, so protocol-http is recommended. These articles were also almost 3 years old, so the options may have changed since. Can someone provide some insight into what my next steps should be? Essentially, here are my questions:

1. Should I use protocol-http, protocol-httpclient or something else?

2. Is there somewhere in a config file where I can tell Nutch to use a Java keystore file, similar to Solr?

Thanks

Sid

RE: [MASSMAIL]Certificates

Posted by Sadiki Latty <sl...@uottawa.ca>.
Hey Roannel,

This is what I needed. I kept Googling for things related to certificates and Nutch, but I guess I should have searched for Java and certificates instead. Works like a charm now.


Thanks

Sid

Re: [MASSMAIL]Certificates

Posted by Roannel Fernández Hernández <ro...@uci.cu>.
Hi Sadiki:

You need to add your Solr server's certificate to cacerts, the default keystore of your Java distribution. On Linux you can find out where your cacerts file is with:

echo $(readlink -f /usr/bin/java | sed "s:bin/java::")lib/security/cacerts

as described at https://stackoverflow.com/questions/11936685/how-to-obtain-the-location-of-cacerts-of-the-default-java-installation
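
A minimal sketch of the import step itself, which the message above does not spell out. It assumes the JDK's keytool is on the PATH, that cacerts still has its stock password "changeit", and that the Solr certificate has already been exported to a file. The function names, the certificate file name and the alias are hypothetical.

```shell
# Derive the cacerts location from the resolved path of the java binary.
# Same substitution as the sed command above, anchored to the end of the path.
cacerts_path() {
    # $1: real path of the java binary, e.g. the output of `readlink -f /usr/bin/java`
    printf '%slib/security/cacerts\n' "$(printf '%s' "$1" | sed 's:bin/java$::')"
}

# Import a certificate file into that keystore under the given alias.
import_solr_cert() {
    # $1: certificate file (assumption: PEM or DER), $2: alias of your choosing
    cacerts=$(cacerts_path "$(readlink -f /usr/bin/java)")
    keytool -importcert -noprompt -alias "$2" -file "$1" \
        -keystore "$cacerts" -storepass changeit
}
```

Writing to cacerts usually requires root, so a call such as `import_solr_cert solr.pem solr-selfsigned` would typically be run with elevated privileges. Nutch jobs started afterwards with the same Java installation should then accept the certificate.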

Regards.

RE: [MASSMAIL]Certificates

Posted by Sadiki Latty <sl...@uottawa.ca>.
Hey Eyeris,

Thanks for the response. My issue isn't with the http/https crawling but rather with the indexing to Solr. My Solr instances use self-signed certificates, and when Nutch tries to index what it found, it fails because it doesn't accept the cert that Solr made. I had the same issue with Solr talking to other Solr instances, and the solution was to manually add the cert and point Solr to the keystore file. I was hoping I could find a similar solution for Nutch, where I could add the Solr cert to the Nutch keystore, but:
	1. I don't know if Nutch can do that.
	2. If Nutch has this feature, I don't know where the keystore file is.
	3. Your suggestion of using Portecle may be suitable for what I need, but I still need to know where Nutch keeps this keystore file and/or how to tell Nutch to use it.

I am also willing to use protocol-httpclient, but without extra configuration it still doesn't work for me. I'm fairly new to Nutch, so forgive me if I'm missing something obvious.

Thanks

Sid
	

Re: [MASSMAIL]Certificates

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Hello Sid.
I am using protocol-httpclient because, in my humble opinion, it handles https websites better than protocol-http.
Since Java 1.7, using protocol-httpclient and Nutch 1.12, my problems with self-signed certificates have disappeared.
But if you have problems with websites that have self-signed certificates, you may need to insert their certificates into the Java keystore using the Portecle tool,
which you can download here: https://sourceforge.net/projects/portecle/
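
Portecle is a GUI, so before it can import anything you need the certificate in a file. A minimal command-line sketch of fetching it and of checking afterwards that it landed in the keystore, assuming openssl and the JDK's keytool are on the PATH and that the keystore still has the stock password "changeit". The function names, URLs, file names and aliases are hypothetical.

```shell
# Reduce a URL to host:port, defaulting to port 443 when none is given.
host_port_of() {
    hp=$(printf '%s\n' "$1" | sed -e 's|^[a-z]*://||' -e 's|/.*$||')
    case $hp in
        *:*) printf '%s\n' "$hp" ;;
        *)   printf '%s:443\n' "$hp" ;;
    esac
}

# Save the certificate a server presents, in PEM form.
fetch_cert() {
    # $1: URL of the https site or Solr server, $2: output file
    hp=$(host_port_of "$1")
    openssl s_client -connect "$hp" -servername "${hp%%:*}" </dev/null 2>/dev/null \
        | openssl x509 -outform PEM > "$2"
}

# Exit status 0 if the keystore already holds an entry under this alias.
is_trusted() {
    # $1: keystore path, $2: alias
    keytool -list -keystore "$1" -storepass changeit -alias "$2" >/dev/null 2>&1
}
```

For example, `fetch_cert https://solr.example.org:8983/solr solr.pem` would write the certificate to solr.pem, ready to be imported with Portecle or keytool.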

Best regards.



**********************
Text below is autogenerated by my email supplier.
The @universidad_uci is Fidel: 15 years connected to the future... connected to the Revolution
2002-2017