Posted to user@nutch.apache.org by Alex Basa <al...@yahoo.com> on 2009/01/21 00:40:46 UTC
fetching https documents
I searched for patches and couldn't find one. Does anyone know if nutch 0.9 supports crawling https websites? If so, can someone point me to the patch?
Thanks in advance,
Alex
Re: AW: fetching https documents
Posted by Alex Basa <al...@yahoo.com>.
Thanks Vimal.
I switched the plugin to protocol-httpclient, set http.useHttp11 to true, and replaced commons-httpclient-3.0.1.jar with commons-httpclient-3.1.jar. It seems fine now.
--- On Wed, 1/21/09, Vimal Varghese <vi...@tcs.com> wrote:
> From: Vimal Varghese <vi...@tcs.com>
> Subject: Re: AW: fetching https documents
> To: nutch-user@lucene.apache.org
> Cc: "alex_basa@yahoo.com" <al...@yahoo.com>, "nutch-user@lucene.apache.org" <nu...@lucene.apache.org>
> Date: Wednesday, January 21, 2009, 10:45 PM
> Hi Alex,
>
> If it's not fetching https, you can try adding this https
> line to your crawl-urlfilter.txt file:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/
> +^https://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/
>
> After adding this line, it will fetch all the https URLs.
>
> But I am still getting these exceptions for the https URLs:
>
> javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
>
> org.apache.nutch.protocol.http.api.HttpException:
> java.net.UnknownHostException: secure.americanexpress.com
>
> Vimal Varghese
Re: AW: fetching https documents
Posted by Vimal Varghese <vi...@tcs.com>.
Hi Alex,
If it's not fetching https, you can try adding this https line to your crawl-urlfilter.txt file:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/
+^https://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/
After adding this line, it will fetch all the https URLs.
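To see what those two filter lines do, here is a minimal Java sketch. It assumes the filter works the way the file format suggests: each line is a '+' (accept) or '-' (reject) followed by a regular expression, and the first line whose pattern is found in a URL decides. The class UrlFilterSketch, its methods, the example.com and example.org domains (standing in for DOMAIN1 and DOMAIN2), and the final catch-all "-." rule are illustrative assumptions, not Nutch code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Simplified stand-in for a sign-prefixed regex URL filter; not Nutch's own code.
class UrlFilterSketch {

    // One filter line: '+' means accept, '-' means reject, the rest is the regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Hypothetical rule set: example.com and example.org stand in for DOMAIN1 and DOMAIN2.
    static List<Rule> exampleRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("+^http://([a-z0-9]*\\.)*(example\\.com|example\\.org)/"));
        rules.add(new Rule("+^https://([a-z0-9]*\\.)*(example\\.com|example\\.org)/"));
        rules.add(new Rule("-.")); // assumed catch-all: reject everything else
        return rules;
    }

    // The first rule whose regex is found in the URL decides; no match means reject.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        String[] urls = {
            "http://www.example.org/index.html",
            "https://secure.example.com/login",
            "https://my-host.example.com/",
            "ftp://example.com/file"
        };
        for (String url : urls) {
            System.out.println((accepts(rules, url) ? "accept " : "reject ") + url);
        }
    }
}
```

One thing the sketch makes visible: the character class [a-z0-9]* contains no hyphen, so a host such as my-host.example.com would not match either '+' line and would fall through to the reject rule.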
But I am still getting these exceptions for the https URLs:
javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: secure.americanexpress.com
Vimal Varghese
From: Koch Martina <Ko...@huberverlag.de>
Date: 21-01-09 04:05 PM
Please respond to: nutch-user@lucene.apache.org
To: "nutch-user@lucene.apache.org" <nu...@lucene.apache.org>, "alex_basa@yahoo.com" <al...@yahoo.com>
Subject: AW: fetching https documents
Hi Alex,
https pages can be fetched with the protocol-httpclient plugin.
Kind regards,
Martina
-----Original Message-----
From: Alex Basa [mailto:alex_basa@yahoo.com]
Sent: Wednesday, 21 January 2009 00:41
To: nutch-user@lucene.apache.org
Subject: fetching https documents
I searched for patches and couldn't find one. Does anyone know if nutch
0.9 supports crawling https websites? If so, can someone point me to the
patch?
Thanks in advance,
Alex
AW: fetching https documents
Posted by Koch Martina <Ko...@huberverlag.de>.
Hi Alex,
https pages can be fetched with the protocol-httpclient plugin.
Kind regards,
Martina
-----Original Message-----
From: Alex Basa [mailto:alex_basa@yahoo.com]
Sent: Wednesday, 21 January 2009 00:41
To: nutch-user@lucene.apache.org
Subject: fetching https documents
I searched for patches and couldn't find one. Does anyone know if nutch 0.9 supports crawling https websites? If so, can someone point me to the patch?
Thanks in advance,
Alex