Posted to user@nutch.apache.org by "Larry.Santello" <la...@uline.com> on 2019/04/25 13:28:37 UTC

Nutch NTLM to IIS 8.5 - issues!

All -

I've tried several 1.x versions of Nutch and a variety of configurations and
simply can NOT get NTLM authentication working with Nutch. I need help
desperately!

Here are the relevant configuration points:
Note: "user", "password", and "ntdomain" are, of course, fillers for real
values

httpclient-auth.xml:
<credentials username="user" password="password" >
	<default realm="ntdomain" /> 
</credentials>
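
For reference, a fuller httpclient-auth.xml might look like the sketch below. It assumes the file's usual <auth-configuration> root element and adds an explicit <authscope> entry for the host and port that appear in the log (url.com, port 80), with the scheme set to ntlm. The authscope line is an illustration only and was not tried in this thread; "user", "password", and "ntdomain" remain fillers.

```xml
<?xml version="1.0"?>
<!-- Sketch only: same filler credentials as in the snippet above. -->
<auth-configuration>
  <credentials username="user" password="password">
    <!-- Fallback used when no authscope entry matches the request -->
    <default realm="ntdomain"/>
    <!-- Assumption: explicit scope for the test host; not verified in this thread -->
    <authscope host="url.com" port="80" realm="ntdomain" scheme="ntlm"/>
  </credentials>
</auth-configuration>
```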

nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description> </description>
</property>

logged problem (note that, yes, this is from 1.5.1, but 1.15 produces
similar results):
2019-04-25 07:38:47,641 INFO  parse.ParserChecker - fetching:
http://url.com/crawltest.html
2019-04-25 07:38:47,650 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch\apache-nutch-1.5.1\plugins
2019-04-25 07:38:47,728 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository - Registered Plugins:
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository - 	Html Parse Plug-in
(parse-html)
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository - 	HTTP Framework
(lib-http)
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository - 	Http / Https
Protocol Plug-in (protocol-httpclient)
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository - 	Regex URL Filter
(urlfilter-regex)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	the nutch core
extension points (nutch-extensionpoints)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Basic Indexing
Filter (index-basic)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Anchor Indexing
Filter (index-anchor)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Tika Parser Plug-in
(parse-tika)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Basic URL
Normalizer (urlnormalizer-basic)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Regex URL Filter
Framework (lib-regex-filter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Regex URL
Normalizer (urlnormalizer-regex)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	URL Validator
(urlfilter-validator)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	CyberNeko HTML
Parser (lib-nekohtml)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Pass-through URL
Normalizer (urlnormalizer-pass)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	OPIC Scoring
Plug-in (scoring-opic)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Http Protocol
Plug-in (protocol-http)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - Registered
Extension-Points:
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Nutch Content
Parser (org.apache.nutch.parse.Parser)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Nutch Segment Merge
Filter (org.apache.nutch.segment.SegmentMergeFilter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - 	Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2019-04-25 07:38:47,761 INFO  httpclient.Http - http.proxy.host = null
2019-04-25 07:38:47,762 INFO  httpclient.Http - http.proxy.port = 8080
2019-04-25 07:38:47,763 INFO  httpclient.Http - http.timeout = 10000
2019-04-25 07:38:47,763 INFO  httpclient.Http - http.content.limit = -1
2019-04-25 07:38:47,763 INFO  httpclient.Http - http.agent = Ulinenet
Spider/Nutch-1.5.1
2019-04-25 07:38:47,764 INFO  httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2019-04-25 07:38:47,764 INFO  httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2019-04-25 07:38:47,835 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest, basic]
2019-04-25 07:38:47,836 INFO  auth.AuthChallengeProcessor - ntlm
authentication scheme selected
2019-04-25 07:38:47,837 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:47,837 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:47,847 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:47,847 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:48,335 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:48,336 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:48,337 INFO  httpclient.HttpMethodDirector - Failure
authenticating with NTLM <any realm>@url.com:80
2019-04-25 07:38:48,507 INFO  crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2019-04-25 07:38:48,509 INFO  parse.ParserChecker - parsing:
http://url.com/crawltest.html
2019-04-25 07:38:48,509 INFO  parse.ParserChecker - contentType:
application/xhtml+xml
2019-04-25 07:38:48,510 INFO  parse.ParserChecker - signature:
495abb7f991fb4dd6a056f748908a2d9

The way I'm testing:
bin/nutch parsechecker http://url.com/crawltest.html

Finally, I should note that the following curl command DOES work:
curl --ntlm --user user:password http://url.com/crawltest.html






--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html

Re: Nutch NTLM to IIS 8.5 - issues!

Posted by "Larry.Santello" <la...@uline.com>.
One last reply... I figured out how to do this.

In short, NTLM support in Nutch doesn't seem to work. You do, in fact, have
to use a proxy that supports it. The proxy I ended up using was cntlm:
http://cntlm.sourceforge.net/

Don't put any authentication in Nutch. Just have it go through cntlm, which
should run on port 3128 locally.
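
A minimal sketch of the Nutch side of this setup, assuming cntlm listens on port 3128 of the same machine. http.proxy.host and http.proxy.port are the property names that show up in the fetch log earlier in this thread; the value localhost is an assumption about where cntlm runs.

```xml
<!-- nutch-site.xml: send all fetches through the local cntlm proxy -->
<property>
  <name>http.proxy.host</name>
  <!-- Assumption: cntlm runs on the same machine as Nutch -->
  <value>localhost</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
</property>
```

With this in place, Nutch itself holds no credentials for the site; cntlm performs the NTLM handshake on its behalf.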

In CNTLM, the "gotcha" is the proxy setting. For some reason, it absolutely
requires a secondary (parent) proxy to be configured, even if you don't have
one running. What I did, which seemed to work, was set it to my machine
name at port 80 (this could just be doing a pass-through, since port 80 is
open locally with IIS). Alternatively, you could fire up Squid and have it go
through that.

Since I have no proxy, I set "NoProxy" in CNTLM to *. As for the
rest, you have to set the Username, Domain, Password, and Workstation
settings to match your environment.

I seem to be crawling now!




Re: Nutch NTLM to IIS 8.5 - issues!

Posted by "Larry.Santello" <la...@uline.com>.
For clarification, I tried (and am now actively working with) v1.15 and it
didn't work there either. 

1.15 uses httpclient 4.5.5, so whatever the issue is wasn't resolved with
that either. 




Re: Nutch NTLM to IIS 8.5 - issues!

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Michael,

can you provide a patch or pull request for the upgrade?

There has been an issue open for a long time [1], but the available
patches are reported to raise further issues (see issue comments).
The challenge is indeed to test all the authentication options
supported by protocol-httpclient, including form authentication.

Cheers,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-1086

On 4/26/19 4:50 PM, Michael Portnoy wrote:
> Nutch 1.14 is using HttpClient 3.x which does not work with NTLM2. Not sure
> if that's your case. To get auth to work, we've had to migrate the
> httpclient plugin to use HttpClient 4.x
> 
> This may have been done in Nutch 1.15
> 
> On Fri., Apr. 26, 2019, 10:24 a.m. Larry.Santello, <la...@uline.com>
> wrote:
> 
>> Been reading a bit more and it sounds like an option may be to use an ntlm
>> proxy. You have Nutch set up for the proxy, and it's the proxy that sends
>> ntlm credentials. Ntlmaps seems like the product of choice for that proxy.
>> I guess I'll give that a shot on Monday.
>>
>>
>>
>>
> 


Re: Nutch NTLM to IIS 8.5 - issues!

Posted by Michael Portnoy <2m...@gmail.com>.
Nutch 1.14 is using HttpClient 3.x which does not work with NTLM2. Not sure
if that's your case. To get auth to work, we've had to migrate the
httpclient plugin to use HttpClient 4.x

This may have been done in Nutch 1.15

On Fri., Apr. 26, 2019, 10:24 a.m. Larry.Santello, <la...@uline.com>
wrote:

> Been reading a bit more and it sounds like an option may be to use an ntlm
> proxy. You have Nutch set up for the proxy, and it's the proxy that sends
> ntlm credentials. Ntlmaps seems like the product of choice for that proxy.
> I guess I'll give that a shot on Monday.
>
>
>
>

Re: Nutch NTLM to IIS 8.5 - issues!

Posted by "Larry.Santello" <la...@uline.com>.
Been reading a bit more and it sounds like an option may be to use an ntlm
proxy. You have Nutch set up for the proxy, and it's the proxy that sends
ntlm credentials. Ntlmaps seems like the product of choice for that proxy. I
guess I'll give that a shot on Monday.




RE: Nutch NTLM to IIS 8.5 - issues!

Posted by "Larry.Santello" <la...@uline.com>.
Thanks for responding! 

I've hit it again with TRACE logging. Here are the results:

2019-04-25 08:53:10,261 INFO  parse.ParserChecker - fetching:
http://url.com/crawltest.html
2019-04-25 08:53:10,268 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch\apache-nutch-1.5.1\plugins
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - Registered Plugins:
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Html Parse Plug-in
(parse-html)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	HTTP Framework
(lib-http)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Http / Https
Protocol Plug-in (protocol-httpclient)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Regex URL Filter
(urlfilter-regex)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	the nutch core
extension points (nutch-extensionpoints)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Basic Indexing
Filter (index-basic)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Anchor Indexing
Filter (index-anchor)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Tika Parser Plug-in
(parse-tika)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Basic URL
Normalizer (urlnormalizer-basic)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Regex URL Filter
Framework (lib-regex-filter)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Regex URL
Normalizer (urlnormalizer-regex)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	URL Validator
(urlfilter-validator)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	CyberNeko HTML
Parser (lib-nekohtml)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Pass-through URL
Normalizer (urlnormalizer-pass)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	OPIC Scoring
Plug-in (scoring-opic)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Http Protocol
Plug-in (protocol-http)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - Registered
Extension-Points:
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Nutch Content
Parser (org.apache.nutch.parse.Parser)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Nutch Segment Merge
Filter (org.apache.nutch.segment.SegmentMergeFilter)
2019-04-25 08:53:10,350 INFO  plugin.PluginRepository - 	Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2019-04-25 08:53:10,377 INFO  httpclient.Http - http.proxy.host = null
2019-04-25 08:53:10,377 INFO  httpclient.Http - http.proxy.port = 8080
2019-04-25 08:53:10,378 INFO  httpclient.Http - http.timeout = 10000
2019-04-25 08:53:10,379 INFO  httpclient.Http - http.content.limit = -1
2019-04-25 08:53:10,379 INFO  httpclient.Http - http.agent =
Spider/Nutch-1.5.1
2019-04-25 08:53:10,379 INFO  httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2019-04-25 08:53:10,380 INFO  httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2019-04-25 08:53:10,385 TRACE httpclient.Http - Credentials - username:
user; set as default for realm: ntdomain; scheme: 
2019-04-25 08:53:10,392 TRACE httpclient.Http - Pre-configured credentials
with scope -  host: url.com; port: 80; not found for url:
http://url.com/crawltest.html
2019-04-25 08:53:10,449 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest, basic]
2019-04-25 08:53:10,449 INFO  auth.AuthChallengeProcessor - ntlm
authentication scheme selected
2019-04-25 08:53:10,450 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 08:53:10,450 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 08:53:10,452 TRACE auth.NTLMScheme - enter
NTLMScheme.authenticate(Credentials, HttpMethod)
2019-04-25 08:53:10,460 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 08:53:10,460 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 08:53:10,461 TRACE auth.NTLMScheme - enter
NTLMScheme.authenticate(Credentials, HttpMethod)
2019-04-25 08:53:10,952 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 08:53:10,953 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 08:53:10,955 INFO  httpclient.HttpMethodDirector - Failure
authenticating with NTLM <any realm>@url.com:80
2019-04-25 08:53:10,959 TRACE httpclient.Http - url:
http://url.com/crawltest.html; status code: 401; bytes received: 6322;
Content-Length: 6322
2019-04-25 08:53:11,033 TRACE httpclient.Http - 401 Authentication Required
2019-04-25 08:53:11,133 INFO  crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2019-04-25 08:53:11,135 INFO  parse.ParserChecker - parsing:
http://url.com/crawltest.html
2019-04-25 08:53:11,135 INFO  parse.ParserChecker - contentType:
application/xhtml+xml
2019-04-25 08:53:11,138 INFO  parse.ParserChecker - signature:
495abb7f991fb4dd6a056f748908a2d9

Regarding what's in the server's security events, a couple of interesting
things:
1. It sees the attempt, but the failure reason is "Unknown user name or bad
password". The user and password being sent from httpclient-auth.xml are
exactly the same as what I'm sending with the curl command.
2. Unlike with the curl command, the Account Name being sent over is all upper
case! I have a suspicion that this has something to do with it. Again,
though, the username in httpclient-auth.xml is NOT all in upper case.




