Posted to user@nutch.apache.org by Christopher Gross <co...@gmail.com> on 2012/03/07 21:22:56 UTC

Crawling with Certs

Is there any good documentation for setting up Nutch to crawl HTTPS
sites using a certificate? I've poked around on the wiki and tried
some Google searches without much luck.

I'm using Nutch 1.4.
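For context, the standard JDK pieces involved are keytool (to build a
truststore) and the javax.net.ssl system properties (to point the
crawling JVM at it); there is nothing Nutch-specific about them. A
minimal sketch, with placeholder paths, alias and passwords, and with
the assumption that the options reach the fetching JVM via
JAVA_TOOL_OPTIONS:

# import the server's certificate into a truststore (keytool ships with the JDK)
keytool -importcert -alias crawl-target -file site.cer \
        -keystore /path/to/truststore.jks -storepass changeit

# JAVA_TOOL_OPTIONS is read by the JVM itself, so the properties reach
# the process that actually performs the fetch
export JAVA_TOOL_OPTIONS="-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=changeit"
./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html

The truststore only covers trusting the server's certificate; a site
that demands a client certificate additionally needs the
javax.net.ssl.keyStore properties, sketched further down in the thread.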

Thanks!

-- Chris

Re: Crawling with Certs

Posted by Christopher Gross <co...@gmail.com>.
No, I never had any luck with it. After trying for a few days I gave up and
moved on to other things. I even tried Nutch 2.x, but still wasn't able to
get to a cert-protected site.

I'm going to look into Apache Droids (http://incubator.apache.org/droids/)
and see if their crawler can crawl with a cert, just haven't had time yet.

-- Chris


On Thu, Jan 31, 2013 at 9:21 PM, ClaudeZhong <lv...@gmail.com> wrote:

> Hi Chris,
>
> I've got the same problem. I have a .cer certificate; have you solved it?
>
> I have a username and password and set them in httpclient-auth.xml as
> HttpAuthenticationSchemes
> <http://wiki.apache.org/nutch/HttpAuthenticationSchemes> describes, but it
> didn't work. The log file says the crawl finished, but I got nothing at
> all.
>
> Thanks,
> Claude
>

Re: Crawling with Certs

Posted by ClaudeZhong <lv...@gmail.com>.
Hi Chris,

I've got the same problem. I have a .cer certificate; have you solved it?

I have a username and password and set them in httpclient-auth.xml as
HttpAuthenticationSchemes
<http://wiki.apache.org/nutch/HttpAuthenticationSchemes> describes, but it
didn't work. The log file says the crawl finished, but I got nothing at
all.
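
For reference, what that wiki page documents is a username/password
mapping in conf/httpclient-auth.xml, roughly of the shape sketched below
(host, port, realm and credentials are placeholders, and the element
names should be verified against the page itself). It only covers HTTP
authentication schemes such as Basic, Digest and NTLM; there is no
element for a client certificate, which is the gap this thread keeps
hitting. The protocol-httpclient plugin also has to be enabled in
plugin.includes for the file to be used at all.

<auth-configuration>
  <credentials username="myuser" password="mypassword">
    <!-- restrict these credentials to a given host/port/realm -->
    <authscope host="example.com" port="443" realm="myrealm"/>
  </credentials>
</auth-configuration>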

Thanks,
Claude




Re: Crawling with Certs

Posted by Christopher Gross <co...@gmail.com>.
Snippet from logfile:

2012-03-08 14:23:32,090 DEBUG wire.header - >> "CONNECT localhost:443 HTTP/1.1"
2012-03-08 14:23:32,090 DEBUG httpclient.HttpMethodBase - Adding Host request header
2012-03-08 14:23:32,090 DEBUG wire.header - >> "User-Agent: Jakarta Commons-HttpClient/3.1[\r][\n]"
2012-03-08 14:23:32,090 DEBUG wire.header - >> "Host: linsrchcddadev1o[\r][\n]"
2012-03-08 14:23:32,090 DEBUG wire.header - >> "Proxy-Connection: Keep-Alive[\r][\n]"
2012-03-08 14:23:32,090 DEBUG wire.header - >> "[\r][\n]"
2012-03-08 14:23:32,098 DEBUG wire.header - << "HTTP/1.0 400 Bad Request[\r][\n]"
2012-03-08 14:23:32,098 DEBUG wire.header - << "HTTP/1.0 400 Bad Request[\r][\n]"
2012-03-08 14:23:32,099 DEBUG wire.header - << "Content-Type: text/html[\r][\n]"
2012-03-08 14:23:32,099 DEBUG wire.header - << "Connection: close[\r][\n]"
2012-03-08 14:23:32,099 DEBUG wire.header - << "[\r][\n]"
2012-03-08 14:23:32,109 DEBUG httpclient.ConnectMethod - CONNECT status code 400
2012-03-08 14:23:32,109 DEBUG httpclient.HttpMethodDirector - CONNECT failed, fake the response for the original method

I'm guessing that something needs to go in the httpclient-auth.xml
file -- but all the examples are for login/password combos, and not a
cert.  Any ideas on where to look for that?  The Authentication
Schemes page (http://wiki.apache.org/nutch/HttpAuthenticationSchemes)
fails to mention certs.
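
One thing the wire log above does show is that the request is going out
through a proxy (the CONNECT and Proxy-Connection lines) and that the
400 comes back on that CONNECT, before any TLS or certificate exchange
with the target site happens. If a proxy is in play, the Nutch-side
settings live in conf/nutch-site.xml; a minimal sketch with placeholder
values (http.proxy.host and http.proxy.port are standard properties
from nutch-default.xml):

<property>
  <name>http.proxy.host</name>
  <!-- placeholder; leave empty if no proxy should be used -->
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>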

-- Chris



On Thu, Mar 8, 2012 at 8:16 AM, Christopher Gross <co...@gmail.com> wrote:
> The page text is pretty small -- I just made a few quick pages with
> some text and links.  http.content.limit is set to 65536 (default
> value, I think), so that should be OK.
>
> Well, when I run my script to do the crawl, I'm seeing 403 errors.
> When I use the 2 checker tools (parse & index) then  I don't see an
> error code, I just get blank content.  I'm not sure what to make of
> that.
>
> Perhaps the close is because the site needs to have a cert to work,
> and Nutch isn't providing one, so it gets no content back?
>
> -- Chris
>
>
>
> On Thu, Mar 8, 2012 at 5:59 AM, Lewis John Mcgibbney
> <le...@gmail.com> wrote:
>> Hi Christopher,
>>
>> It appears that the page is being fetched successfully. What is not
>> successful is the parser obtaining the page content... those fields appear
>> to be returning empty values even though, as you have stated, the page is
>> not empty.
>>
>> How large is the page content? Does your http.content.limit accommodate it?
>>
>> Also, you ARE getting Connection=close back in the content metadata! Maybe
>> some other credentials need to be supplied for crawling
>> certificate-authenticated sites... I really don't know.
>>
>> On Wed, Mar 7, 2012 at 9:28 PM, Christopher Gross <co...@gmail.com> wrote:
>>
>>> Here's the parse checker output -- the page does have text (and 3
>>> links) but it's not showing it with the dumpText option.  I'd expect
>>> there to be some kind of error, since a fetch fails on it when i run
>>> that....
>>>
>>> ParseChecker output:
>>>
>>> ./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html
>>>
>>> fetching: https://localhost/crawldocs/index.html
>>> parsing: https://localhost/crawldocs/index.html
>>> contentType: text/html
>>> ---------
>>> Url
>>> ---------------
>>> https://localhost/crawldocs/index.html
>>> ---------
>>> ParseData
>>> ---------
>>> Version: 5
>>> Status: success(1,0)
>>> Title:
>>> Outlinks: 0
>>> Content Metadata: Connection=close Content-Type=text/html
>>> Parse Metadata: CharEncodingForConversion=windows-1252
>>> OriginalCharEncoding=windows-1252
>>> ---------
>>> ParseText
>>> ---------
>>>
>>> -- Chris
>>>
>>>
>>>
>>> On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross <co...@gmail.com>
>>> wrote:
>>> > Well, NTLM is a windows thing with a username and password.
>>> >
>>> > I have a certificate.  No username/password.  The debug stuff would be
>>> > helpful once I can get a bit farther...I don't know how to tell Nutch
>>> > to crawl with the cert.  I'm getting a 403 error -- it is not (using?
>>> > finding?) the certs that I have passed in via -D arguments.
>>> >
>>> > I appreciate you trying to help -- but I need knowledge on getting
>>> > Nutch to use a cert.
>>> >
>>> > -- Chris
>>> >
>>> >
>>> >
>>> > On Wed, Mar 7, 2012 at 4:14 PM, remi tassing <ta...@gmail.com>
>>> wrote:
>>> >> There are many debugging tips on the bottom of that page, did you try
>>> them?
>>> >>
>>> >> E.g. ParserChecker, debug-level log info, ...
>>> >>
>>> >> BTW, which authentication scheme is required by your site? For NTLMv2 is
>>> >> poorly supported
>>> >>
>>> >> Remi
>>> >>
>>> >> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com>
>>> wrote:
>>> >>> I have protocol-httpclient set.
>>> >>>
>>> >>> I can't see how I'm supposed to do the certs.  I can't seem to get
>>> >>> them to work by passing them in via -D args when I call the nutch
>>> >>> script (-Djavax.net.ssl.trustStore=xxxx
>>> >>> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc).  Is there something
>>> >>> for them in the AuthenticationSchemes
>>> >>> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
>>> >>> shown on the page?
>>> >>>
>>> >>> If you have a specific page that could help please send that.
>>> >>>
>>> >>> -- Chris
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <ta...@gmail.com>
>>> >> wrote:
>>> >>>> Try googling for Nutch+httpclient
>>> >>>>
>>> >>>> Remi
>>> >>>>
>>> >>>> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com>
>>> wrote:
>>> >>>>> Is there any good documentation for setting up Nutch to crawl HTTPS
>>> >>>>> sites using a certificate?  I've poked around on the wiki and tried
>>> >>>>> some google searches without much luck.
>>> >>>>>
>>> >>>>> I'm using Nutch 1.4.
>>> >>>>>
>>> >>>>> Thanks!
>>> >>>>>
>>> >>>>> -- Chris
>>> >>>>>
>>> >>>
>>>
>>
>>
>>
>> --
>> *Lewis*

Re: Crawling with Certs

Posted by Christopher Gross <co...@gmail.com>.
The page text is pretty small -- I just made a few quick pages with
some text and links.  http.content.limit is set to 65536 (default
value, I think), so that should be OK.

Well, when I run my script to do the crawl, I'm seeing 403 errors.
When I use the two checker tools (parse & index) I don't see an error
code, I just get blank content. I'm not sure what to make of that.

Perhaps the connection is closed because the site requires a cert,
Nutch isn't providing one, and so it gets no content back?

-- Chris



On Thu, Mar 8, 2012 at 5:59 AM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> Hi Christopher,
>
> It appears that the page is being fetched successfully. What is not
> successful is the parser obtaining the page content... those fields appear
> to be returning empty values even though, as you have stated, the page is
> not empty.
>
> How large is the page content? Does your http.content.limit accommodate it?
>
> Also, you ARE getting Connection=close back in the content metadata! Maybe
> some other credentials need to be supplied for crawling
> certificate-authenticated sites... I really don't know.
>
> On Wed, Mar 7, 2012 at 9:28 PM, Christopher Gross <co...@gmail.com> wrote:
>
>> Here's the parse checker output -- the page does have text (and 3
>> links) but it's not showing it with the dumpText option.  I'd expect
>> there to be some kind of error, since a fetch fails on it when i run
>> that....
>>
>> ParseChecker output:
>>
>> ./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html
>>
>> fetching: https://localhost/crawldocs/index.html
>> parsing: https://localhost/crawldocs/index.html
>> contentType: text/html
>> ---------
>> Url
>> ---------------
>> https://localhost/crawldocs/index.html
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title:
>> Outlinks: 0
>> Content Metadata: Connection=close Content-Type=text/html
>> Parse Metadata: CharEncodingForConversion=windows-1252
>> OriginalCharEncoding=windows-1252
>> ---------
>> ParseText
>> ---------
>>
>> -- Chris
>>
>>
>>
>> On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross <co...@gmail.com>
>> wrote:
>> > Well, NTLM is a windows thing with a username and password.
>> >
>> > I have a certificate.  No username/password.  The debug stuff would be
>> > helpful once I can get a bit farther...I don't know how to tell Nutch
>> > to crawl with the cert.  I'm getting a 403 error -- it is not (using?
>> > finding?) the certs that I have passed in via -D arguments.
>> >
>> > I appreciate you trying to help -- but I need knowledge on getting
>> > Nutch to use a cert.
>> >
>> > -- Chris
>> >
>> >
>> >
>> > On Wed, Mar 7, 2012 at 4:14 PM, remi tassing <ta...@gmail.com>
>> wrote:
>> >> There are many debugging tips on the bottom of that page, did you try
>> them?
>> >>
>> >> E.g. ParserChecker, debug-level log info, ...
>> >>
>> >> BTW, which authentication scheme is required by your site? For NTLMv2 is
>> >> poorly supported
>> >>
>> >> Remi
>> >>
>> >> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com>
>> wrote:
>> >>> I have protocol-httpclient set.
>> >>>
>> >>> I can't see how I'm supposed to do the certs.  I can't seem to get
>> >>> them to work by passing them in via -D args when I call the nutch
>> >>> script (-Djavax.net.ssl.trustStore=xxxx
>> >>> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc).  Is there something
>> >>> for them in the AuthenticationSchemes
>> >>> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
>> >>> shown on the page?
>> >>>
>> >>> If you have a specific page that could help please send that.
>> >>>
>> >>> -- Chris
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <ta...@gmail.com>
>> >> wrote:
>> >>>> Try googling for Nutch+httpclient
>> >>>>
>> >>>> Remi
>> >>>>
>> >>>> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com>
>> wrote:
>> >>>>> Is there any good documentation for setting up Nutch to crawl HTTPS
>> >>>>> sites using a certificate?  I've poked around on the wiki and tried
>> >>>>> some google searches without much luck.
>> >>>>>
>> >>>>> I'm using Nutch 1.4.
>> >>>>>
>> >>>>> Thanks!
>> >>>>>
>> >>>>> -- Chris
>> >>>>>
>> >>>
>>
>
>
>
> --
> *Lewis*

Re: Crawling with Certs

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Christopher,

It appears that the page is being fetched successfully. What is not
successful is the parser obtaining the page content... those fields appear
to be returning empty values even though, as you have stated, the page is
not empty.

How large is the page content? Does your http.content.limit accommodate it?
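
If it needs raising, http.content.limit is set in conf/nutch-site.xml;
a minimal sketch (the value here is illustrative, and -1 disables
truncation altogether):

<property>
  <name>http.content.limit</name>
  <!-- default is 65536 bytes; -1 means no truncation -->
  <value>-1</value>
</property>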

Also, you ARE getting Connection=close back in the content metadata! Maybe
some other credentials need to be supplied for crawling
certificate-authenticated sites... I really don't know.

On Wed, Mar 7, 2012 at 9:28 PM, Christopher Gross <co...@gmail.com> wrote:

> Here's the parse checker output -- the page does have text (and 3
> links) but it's not showing it with the dumpText option.  I'd expect
> there to be some kind of error, since a fetch fails on it when i run
> that....
>
> ParseChecker output:
>
> ./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html
>
> fetching: https://localhost/crawldocs/index.html
> parsing: https://localhost/crawldocs/index.html
> contentType: text/html
> ---------
> Url
> ---------------
> https://localhost/crawldocs/index.html
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: Connection=close Content-Type=text/html
> Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
> ---------
> ParseText
> ---------
>
> -- Chris
>
>
>
> On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross <co...@gmail.com>
> wrote:
> > Well, NTLM is a windows thing with a username and password.
> >
> > I have a certificate.  No username/password.  The debug stuff would be
> > helpful once I can get a bit farther...I don't know how to tell Nutch
> > to crawl with the cert.  I'm getting a 403 error -- it is not (using?
> > finding?) the certs that I have passed in via -D arguments.
> >
> > I appreciate you trying to help -- but I need knowledge on getting
> > Nutch to use a cert.
> >
> > -- Chris
> >
> >
> >
> > On Wed, Mar 7, 2012 at 4:14 PM, remi tassing <ta...@gmail.com>
> wrote:
> >> There are many debugging tips on the bottom of that page, did you try
> them?
> >>
> >> E.g. ParserChecker, debug-level log info, ...
> >>
> >> BTW, which authentication scheme is required by your site? For NTLMv2 is
> >> poorly supported
> >>
> >> Remi
> >>
> >> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com>
> wrote:
> >>> I have protocol-httpclient set.
> >>>
> >>> I can't see how I'm supposed to do the certs.  I can't seem to get
> >>> them to work by passing them in via -D args when I call the nutch
> >>> script (-Djavax.net.ssl.trustStore=xxxx
> >>> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc).  Is there something
> >>> for them in the AuthenticationSchemes
> >>> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
> >>> shown on the page?
> >>>
> >>> If you have a specific page that could help please send that.
> >>>
> >>> -- Chris
> >>>
> >>>
> >>>
> >>> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <ta...@gmail.com>
> >> wrote:
> >>>> Try googling for Nutch+httpclient
> >>>>
> >>>> Remi
> >>>>
> >>>> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com>
> wrote:
> >>>>> Is there any good documentation for setting up Nutch to crawl HTTPS
> >>>>> sites using a certificate?  I've poked around on the wiki and tried
> >>>>> some google searches without much luck.
> >>>>>
> >>>>> I'm using Nutch 1.4.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> -- Chris
> >>>>>
> >>>
>



-- 
*Lewis*

Re: Crawling with Certs

Posted by Christopher Gross <co...@gmail.com>.
Here's the parse checker output -- the page does have text (and 3
links), but it's not showing up with the dumpText option. I'd expect
there to be some kind of error, since a fetch fails on it when I run
that....

ParseChecker output:

./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html

fetching: https://localhost/crawldocs/index.html
parsing: https://localhost/crawldocs/index.html
contentType: text/html
---------
Url
---------------
https://localhost/crawldocs/index.html
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: Connection=close Content-Type=text/html
Parse Metadata: CharEncodingForConversion=windows-1252
OriginalCharEncoding=windows-1252
---------
ParseText
---------

-- Chris



On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross <co...@gmail.com> wrote:
> Well, NTLM is a windows thing with a username and password.
>
> I have a certificate.  No username/password.  The debug stuff would be
> helpful once I can get a bit farther...I don't know how to tell Nutch
> to crawl with the cert.  I'm getting a 403 error -- it is not (using?
> finding?) the certs that I have passed in via -D arguments.
>
> I appreciate you trying to help -- but I need knowledge on getting
> Nutch to use a cert.
>
> -- Chris
>
>
>
> On Wed, Mar 7, 2012 at 4:14 PM, remi tassing <ta...@gmail.com> wrote:
>> There are many debugging tips on the bottom of that page, did you try them?
>>
>> E.g. ParserChecker, debug-level log info, ...
>>
>> BTW, which authentication scheme is required by your site? For NTLMv2 is
>> poorly supported
>>
>> Remi
>>
>> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com> wrote:
>>> I have protocol-httpclient set.
>>>
>>> I can't see how I'm supposed to do the certs.  I can't seem to get
>>> them to work by passing them in via -D args when I call the nutch
>>> script (-Djavax.net.ssl.trustStore=xxxx
>>> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc).  Is there something
>>> for them in the AuthenticationSchemes
>>> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
>>> shown on the page?
>>>
>>> If you have a specific page that could help please send that.
>>>
>>> -- Chris
>>>
>>>
>>>
>>> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <ta...@gmail.com>
>> wrote:
>>>> Try googling for Nutch+httpclient
>>>>
>>>> Remi
>>>>
>>>> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com> wrote:
>>>>> Is there any good documentation for setting up Nutch to crawl HTTPS
>>>>> sites using a certificate?  I've poked around on the wiki and tried
>>>>> some google searches without much luck.
>>>>>
>>>>> I'm using Nutch 1.4.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -- Chris
>>>>>
>>>

Re: Crawling with Certs

Posted by Christopher Gross <co...@gmail.com>.
Well, NTLM is a Windows thing with a username and password.

I have a certificate, no username/password. The debug stuff would be
helpful once I can get a bit farther... I don't know how to tell Nutch
to crawl with the cert. I'm getting a 403 error -- it is not (using?
finding?) the certs that I have passed in via -D arguments.

I appreciate you trying to help -- but I need knowledge on getting
Nutch to use a cert.

-- Chris



On Wed, Mar 7, 2012 at 4:14 PM, remi tassing <ta...@gmail.com> wrote:
> There are many debugging tips on the bottom of that page, did you try them?
>
> E.g. ParserChecker, debug-level log info, ...
>
> BTW, which authentication scheme is required by your site? For NTLMv2 is
> poorly supported
>
> Remi
>
> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com> wrote:
>> I have protocol-httpclient set.
>>
>> I can't see how I'm supposed to do the certs.  I can't seem to get
>> them to work by passing them in via -D args when I call the nutch
>> script (-Djavax.net.ssl.trustStore=xxxx
>> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc).  Is there something
>> for them in the AuthenticationSchemes
>> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
>> shown on the page?
>>
>> If you have a specific page that could help please send that.
>>
>> -- Chris
>>
>>
>>
>> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <ta...@gmail.com>
> wrote:
>>> Try googling for Nutch+httpclient
>>>
>>> Remi
>>>
>>> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com> wrote:
>>>> Is there any good documentation for setting up Nutch to crawl HTTPS
>>>> sites using a certificate?  I've poked around on the wiki and tried
>>>> some google searches without much luck.
>>>>
>>>> I'm using Nutch 1.4.
>>>>
>>>> Thanks!
>>>>
>>>> -- Chris
>>>>
>>

Re: Crawling with Certs

Posted by remi tassing <ta...@gmail.com>.
There are many debugging tips on the bottom of that page, did you try them?

E.g. ParserChecker, debug-level log info, ...

BTW, which authentication scheme is required by your site? NTLMv2, for
instance, is poorly supported.
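
The debug-level logging referred to here is switched on in
conf/log4j.properties; the logger names below are the standard Commons
HttpClient 3.x ones (wire.content is very verbose, so enable it only
while debugging):

log4j.logger.org.apache.commons.httpclient=DEBUG
log4j.logger.httpclient.wire.header=DEBUG
log4j.logger.httpclient.wire.content=DEBUG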

Remi

On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com> wrote:
> I have protocol-httpclient set.
>
> I can't see how I'm supposed to do the certs.  I can't seem to get
> them to work by passing them in via -D args when I call the nutch
> script (-Djavax.net.ssl.trustStore=xxxx
> -Djavax.net.ssl.trustStorePassword=xxxxx ...etc).  Is there something
> for them in the AuthenticationSchemes
> (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
> shown on the page?
>
> If you have a specific page that could help please send that.
>
> -- Chris
>
>
>
> On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <ta...@gmail.com>
wrote:
>> Try googling for Nutch+httpclient
>>
>> Remi
>>
>> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com> wrote:
>>> Is there any good documentation for setting up Nutch to crawl HTTPS
>>> sites using a certificate?  I've poked around on the wiki and tried
>>> some google searches without much luck.
>>>
>>> I'm using Nutch 1.4.
>>>
>>> Thanks!
>>>
>>> -- Chris
>>>
>

Re: Crawling with Certs

Posted by Christopher Gross <co...@gmail.com>.
I have protocol-httpclient set.

I can't see how I'm supposed to handle the certs. I can't seem to get
them to work by passing them in via -D args when I call the nutch
script (-Djavax.net.ssl.trustStore=xxxx
-Djavax.net.ssl.trustStorePassword=xxxxx, etc.). Is there something
for them on the AuthenticationSchemes page
(http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not
shown?
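
For what it's worth, javax.net.ssl.trustStore only tells the JVM which
server certificates to trust; a site that demands a client certificate
also needs the keystore side, and the properties have to reach the JVM
that runs the fetch as system properties. A sketch with placeholder
paths and passwords, assuming the client cert sits in a PKCS12 file --
whether protocol-httpclient actually honors the JVM's default SSL
settings this way is exactly what would need verifying:

export JAVA_TOOL_OPTIONS="-Djavax.net.ssl.keyStore=/path/to/client-cert.p12 \
  -Djavax.net.ssl.keyStoreType=PKCS12 \
  -Djavax.net.ssl.keyStorePassword=secret \
  -Djavax.net.ssl.trustStore=/path/to/truststore.jks \
  -Djavax.net.ssl.trustStorePassword=changeit"
./bin/nutch parsechecker -dumpText https://localhost/crawldocs/index.html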

If you have a specific page that could help please send that.

-- Chris



On Wed, Mar 7, 2012 at 3:40 PM, remi tassing <ta...@gmail.com> wrote:
> Try googling for Nutch+httpclient
>
> Remi
>
> On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com> wrote:
>> Is there any good documentation for setting up Nutch to crawl HTTPS
>> sites using a certificate?  I've poked around on the wiki and tried
>> some google searches without much luck.
>>
>> I'm using Nutch 1.4.
>>
>> Thanks!
>>
>> -- Chris
>>

Re: Crawling with Certs

Posted by remi tassing <ta...@gmail.com>.
Try googling for Nutch+httpclient

Remi

On Wednesday, March 7, 2012, Christopher Gross <co...@gmail.com> wrote:
> Is there any good documentation for setting up Nutch to crawl HTTPS
> sites using a certificate?  I've poked around on the wiki and tried
> some google searches without much luck.
>
> I'm using Nutch 1.4.
>
> Thanks!
>
> -- Chris
>