Posted to user@nutch.apache.org by Eyeris Rodríguez Rueda <er...@uci.cu> on 2014/12/04 16:02:27 UTC

questions about nutch 1.9

Hello.
I want to use Nutch 1.9, but there are some things I don't understand, because I was using Nutch 1.5.1 before and some things have changed in Nutch 1.9.
Sorry if these are basic questions.
Some questions:

1- How can I run a crawl without the Solr parameter, as in Nutch 1.5.1, where the spider skips this step if I don't set the Solr parameter?

2- Is it possible to use topN or a similar parameter in Nutch 1.9, or does every round include all links in the crawldb?

3- I have activated the httpclient plugin, and when I crawl a website that uses the HTTPS protocol I get this error in the output console:
*********************************
fetch of https://dragones.uci.cu/ failed with: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

The parsechecker tool throws a similar error.

Any suggestion or advice will be appreciated.


---------------------------------------------------
XII Anniversary of the founding of the Universidad de las Ciencias Informáticas. 12 years of history alongside Fidel. December 12, 2014.

Re: questions about nutch 1.9

Posted by Eyeris Rodríguez Rueda <er...@uci.cu>.
All websites in my university's domain, uci.cu, are freely accessible and I don't need a proxy to reach them; I only need a proxy for websites outside this domain. I am testing Nutch 1.9 inside my university, and using the parsechecker tool it looks like the problem only happens with HTTPS connections.
Could you give me some advice?

I was looking at the DummyX509TrustManager class of protocol-httpclient, but I don't understand it very well. Maybe a block needs to be added so that websites are trusted by default.
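A trust-all manager along those lines could look like the sketch below. This is only an illustration of the general idea (the class name here is hypothetical, not Nutch's actual code), and trusting every certificate weakens security, so it should only be used for crawling known internal sites:

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical sketch: an X509TrustManager that accepts every certificate,
// similar in spirit to protocol-httpclient's DummyX509TrustManager.
public class TrustAllSketch {

    static class TrustAllManager implements X509TrustManager {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // never throws: any client certificate is accepted
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // never throws: any server certificate is accepted,
            // including self-signed or expired ones
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0]; // no preferred issuers
        }
    }

    public static void main(String[] args) throws Exception {
        // Install the trust-all manager into an SSLContext; sockets created
        // from this context will not fail with PKIX path building errors.
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { new TrustAllManager() }, new SecureRandom());
        System.out.println("trust-all SSLContext ready: " + ctx.getProtocol());
    }
}
```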

  


RE: questions about nutch 1.9

Posted by Markus Jelsma <ma...@openindex.io>.
Hmm - then maybe you can access it through a proxy that doesn't have this problem? Then connect Nutch to the proxy.
Markus
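For reference, Nutch reads its proxy settings from conf/nutch-site.xml; a minimal sketch (host and port are placeholders) would be:

```xml
<!-- conf/nutch-site.xml: route HTTP fetches through a proxy -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.org</value> <!-- placeholder host -->
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value> <!-- placeholder port -->
</property>
```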
 

Re: questions about nutch 1.9

Posted by Eyeris Rodríguez Rueda <er...@uci.cu>.
Thanks Markus and Jonathan for your answers.
I have tried with protocol-http only, but the problem persists. Maybe the solution is a configuration that trusts websites with certificate problems.
This is very important for me, because some of my websites use HTTPS and this is a limitation on using Nutch 1.9 in my university.





RE: questions about nutch 1.9

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - can you try the protocol-http plugin instead? It has some support for TLS.
 

Re: questions about nutch 1.9

Posted by Jonathan Cooper-Ellis <jc...@ziftr.com>.
Hi Eyeris,

For topN, check out the sizeFetchlist parameter in bin/crawl (line 62).
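In the 1.9 crawl script the generate step is capped by a computed fetch list size, roughly along these lines (a simplified sketch; the variable values are illustrative, not necessarily the script's defaults):

```shell
# Sketch of the relevant bin/crawl logic: the generator is passed
# -topN $sizeFetchlist, so each round fetches at most that many URLs.
numSlaves=1                                  # number of fetcher machines
sizeFetchlist=$(expr $numSlaves \* 50000)    # per-round URL cap
echo "each round is limited to $sizeFetchlist URLs"
# bin/nutch generate "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments \
#     -topN $sizeFetchlist ...   # (the actual generate call in the script)
```

Raising or lowering sizeFetchlist in bin/crawl gives the same effect as the old -topN option.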

-- 
Jonathan Cooper-Ellis
*Data Engineer*
myVBO, LLC dba Ziftr

Re: question about robots.txt

Posted by Patrick Kirsch <pk...@zscho.de>.
On 13.12.2014 at 10:27, Shane Wood wrote:
> I am asking a few websites to allow me to index their sites. What should
> they add to their robots.txt, and where do I get the exact name of my crawler?
In case you are using Nutch 1.9, there is a file
conf/nutch-site.xml.
In this config file several properties are defined, like
  <name>http.agent.name</name>
and the following ones.

These are used to identify your crawler.

Did you already set this property, and your crawler has not used it?
Regards,
 Patrick


Re: question about robots.txt

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Shane,

They get it from the http.agent.* properties in your nutch-default.xml
or your nutch-site.xml. You give your crawler its identifying
name, description, URL, email and version.
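A sketch of those properties in conf/nutch-site.xml (all values are placeholders):

```xml
<!-- conf/nutch-site.xml: identify the crawler to webmasters -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value> <!-- placeholder: robots.txt rules match this token -->
</property>
<property>
  <name>http.agent.description</name>
  <value>Experimental research crawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.org/crawler</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler-admin@example.org</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>Nutch-1.9</value>
</property>
```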

Cheers!

Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







question about robots.txt

Posted by Shane Wood <sh...@cbm8bit.com>.
I am asking a few websites to allow me to index their sites. What should
they add to their robots.txt, and where do I get the exact name of my crawler?

Cheers.
Shane

Re: questions about nutch 1.9

Posted by Eyeris Rodríguez Rueda <er...@uci.cu>.
Thanks Sebastian.
About the second question, maybe I didn't explain it well: in the Nutch 1.9 crawl script the last parameter is the number of rounds, and I think this is equivalent to the depth parameter in Nutch 1.5.1. But I can't find the topN parameter in Nutch 1.9. It is very useful for making limited crawls, because otherwise every round has no limit.

About the third question, you are right that dragones.uci.cu is only available inside the university, but you can try with any website that uses HTTPS and check. I have an idea about what the problem might be, but I don't know how to solve it.
When I try to access dragones.uci.cu with Firefox, I need to add an exception because the certificate has a problem, and Nutch doesn't know how to handle this error. It would be great if I could configure an option that trusts websites with such errors. I was looking at the httpclient plugin code, but I can't find the code that handles this problem. Please, could you help me or give me advice?







Re: questions about nutch 1.9

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Eyeris,

> 1- How can I run a crawl without the Solr parameter, as in Nutch 1.5.1, where the spider
> skips this step if I don't set the Solr parameter?

Yes, that's possible in the recent trunk of 1.x, see NUTCH-1832
(if in doubt, it should be possible to update/replace only bin/crawl):
just pass an empty Solr URL.
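The idea is that the script only runs the indexing step when a Solr URL was given; a simplified sketch of that conditional (not the literal bin/crawl code):

```shell
# bin/crawl <seedDir> <crawlDir> <solrURL> <numRounds>:
# pass an empty string as the Solr URL to skip indexing.
SOLRURL="$1"
if [ -z "$SOLRURL" ]; then
  echo "skipping solrindex/dedup: no Solr URL given"
else
  echo "indexing to $SOLRURL"
  # bin/nutch solrindex "$SOLRURL" ...   # (actual call in the script)
fi
```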


> 2- Is it possible to use topN or a similar parameter in Nutch 1.9, or does every round
> include all links in the crawldb?

On this point, I don't know about any differences between 1.9 and 1.5.1.


> 3- I have activated the httpclient plugin, and when I crawl a website that uses HTTPS
> I get this error in the output console

Sorry, I remember you asked this question a month ago, and I didn't find the time to
continue the thread. Can you try without httpclient:
- use 1.9 or trunk
- and remove protocol-httpclient from plugin.includes
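Concretely, that means a plugin.includes value in conf/nutch-site.xml that names protocol-http instead of protocol-httpclient; a sketch (the exact plugin list depends on your setup):

```xml
<property>
  <name>plugin.includes</name>
  <!-- protocol-http instead of protocol-httpclient -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```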

I'm not able to test/reproduce the problem because I cannot resolve
the host dragones.uci.cu. Is this host only reachable within the
university network?


Best,
Sebastian




Re: questions about nutch 1.9

Posted by Eyeris Rodríguez Rueda <er...@uci.cu>.
Please, any help?

