Posted to user@nutch.apache.org by "Larry.Santello" <la...@uline.com> on 2019/04/25 13:28:37 UTC
Nutch NTLM to IIS 8.5 - issues!
All -
I've tried several 1.x versions of Nutch and a variety of configurations and
simply can NOT get NTLM authentication working with Nutch. I need help
desperately!
Here are the relevant configuration points:
Note: "user", "password", and "ntdomain" are, of course, placeholders for the
real values.
httpclient-auth.xml:
<credentials username="user" password="password" >
<default realm="ntdomain" />
</credentials>
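For reference, a fuller httpclient-auth.xml might look like the sketch below.
It assumes the layout documented for protocol-httpclient, with an
auth-configuration root element and an optional authscope entry. The host,
port, and scheme values are illustrative placeholders, and nothing in this
thread confirms that adding a scope resolves the NTLM failure.

```xml
<?xml version="1.0"?>
<!-- Sketch only: "user", "password", "ntdomain", and "url.com" are placeholders -->
<auth-configuration>
  <credentials username="user" password="password">
    <!-- Fallback credentials used when no authscope matches the URL -->
    <default realm="ntdomain" />
    <!-- Assumption: an explicit scope for the target server; values are illustrative -->
    <authscope host="url.com" port="80" realm="ntdomain" scheme="ntlm" />
  </credentials>
</auth-configuration>
```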
nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description> </description>
</property>
Logged output (note that, yes, this is from 1.5.1, but 1.15 produces
similar results):
2019-04-25 07:38:47,641 INFO parse.ParserChecker - fetching:
http://url.com/crawltest.html
2019-04-25 07:38:47,650 INFO plugin.PluginRepository - Plugins: looking in:
C:\nutch\apache-nutch-1.5.1\plugins
2019-04-25 07:38:47,728 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2019-04-25 07:38:47,729 INFO plugin.PluginRepository - Registered Plugins:
2019-04-25 07:38:47,729 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2019-04-25 07:38:47,729 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2019-04-25 07:38:47,729 INFO plugin.PluginRepository - Http / Https
Protocol Plug-in (protocol-httpclient)
2019-04-25 07:38:47,729 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Tika Parser Plug-in
(parse-tika)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - URL Validator
(urlfilter-validator)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Registered
Extension-Points:
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Nutch Segment Merge
Filter (org.apache.nutch.segment.SegmentMergeFilter)
2019-04-25 07:38:47,733 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2019-04-25 07:38:47,761 INFO httpclient.Http - http.proxy.host = null
2019-04-25 07:38:47,762 INFO httpclient.Http - http.proxy.port = 8080
2019-04-25 07:38:47,763 INFO httpclient.Http - http.timeout = 10000
2019-04-25 07:38:47,763 INFO httpclient.Http - http.content.limit = -1
2019-04-25 07:38:47,763 INFO httpclient.Http - http.agent = Ulinenet
Spider/Nutch-1.5.1
2019-04-25 07:38:47,764 INFO httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2019-04-25 07:38:47,764 INFO httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2019-04-25 07:38:47,835 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest, basic]
2019-04-25 07:38:47,836 INFO auth.AuthChallengeProcessor - ntlm
authentication scheme selected
2019-04-25 07:38:47,837 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:47,837 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:47,847 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:47,847 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:48,335 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:48,336 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:48,337 INFO httpclient.HttpMethodDirector - Failure
authenticating with NTLM <any realm>@url.com:80
2019-04-25 07:38:48,507 INFO crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2019-04-25 07:38:48,509 INFO parse.ParserChecker - parsing:
http://url.com/crawltest.html
2019-04-25 07:38:48,509 INFO parse.ParserChecker - contentType:
application/xhtml+xml
2019-04-25 07:38:48,510 INFO parse.ParserChecker - signature:
495abb7f991fb4dd6a056f748908a2d9
The way I'm testing:
bin/nutch parsechecker http://url.com/crawltest.html
Finally, I should note that the following curl command DOES work:
curl --ntlm --user user:password http://url.com/crawltest.html
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
Re: Nutch NTLM to IIS 8.5 - issues!
Posted by "Larry.Santello" <la...@uline.com>.
One last reply... I figured out how to do this.
In short, NTLM support in Nutch doesn't seem to work. You do, in fact, have
to use a proxy that supports it. The proxy I ended up using was cntlm, at
http://cntlm.sourceforge.net/
Don't put any authentication in Nutch. Just have it go through cntlm, which
should run on port 3128 locally.
In cntlm, the "gotcha" is the Proxy setting: for some reason, cntlm
absolutely requires a secondary proxy to be configured, even if you don't
have one running. What I did, which seemed to work, was to set it to my
machine name at port 80 (this could just be acting as a pass-through, since
port 80 is open locally with IIS). Alternatively, you could fire up Squid and
have it go through that.
Since I have no proxy, I set "NoProxy" in cntlm to *. As for the rest, you
have to set the Username, Domain, Password, and Workstation settings to match
your environment.
I seem to be crawling now!
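As a sketch of the Nutch side of that setup: the proxy is set through the
http.proxy.host and http.proxy.port properties in nutch-site.xml (both show up
in the log output elsewhere in this thread). The values below assume cntlm is
listening on localhost port 3128, as described; adjust them to your
environment.

```xml
<!-- nutch-site.xml: route fetches through the local cntlm proxy -->
<!-- Assumption: cntlm listens on localhost:3128 -->
<property>
  <name>http.proxy.host</name>
  <value>localhost</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
</property>
```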
Re: Nutch NTLM to IIS 8.5 - issues!
Posted by "Larry.Santello" <la...@uline.com>.
For clarification, I tried (and am now actively working with) v1.15 and it
didn't work there either.
1.15 uses httpclient 4.5.5, so whatever the issue is wasn't resolved with
that either.
Re: Nutch NTLM to IIS 8.5 - issues!
Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Michael,
can you provide a patch or pull request for the upgrade?
An issue for this has been open for a long time [1], but the available
patches are reported to raise further issues (see the issue comments).
The challenge is indeed to test all the authentication options
supported by protocol-httpclient, including form authentication.
Cheers,
Sebastian
[1] https://issues.apache.org/jira/browse/NUTCH-1086
On 4/26/19 4:50 PM, Michael Portnoy wrote:
> Nutch 1.14 is using HttpClient 3.x which does not work with NTLM2. Not sure
> if that's your case. To get auth to work, we've had to migrate the
> httpclient plugin to use HttpClient 4.x
>
> This may have been done in Nutch 1.15
Re: Nutch NTLM to IIS 8.5 - issues!
Posted by Michael Portnoy <2m...@gmail.com>.
Nutch 1.14 is using HttpClient 3.x which does not work with NTLM2. Not sure
if that's your case. To get auth to work, we've had to migrate the
httpclient plugin to use HttpClient 4.x
This may have been done in Nutch 1.15
Re: Nutch NTLM to IIS 8.5 - issues!
Posted by "Larry.Santello" <la...@uline.com>.
Been reading a bit more and it sounds like an option may be to use an ntlm
proxy. You have Nutch set up for the proxy, and it's the proxy that sends
ntlm credentials. Ntlmaps seems like the product of choice for that proxy. I
guess I'll give that a shot on Monday.
RE: Nutch NTLM to IIS 8.5 - issues!
Posted by "Larry.Santello" <la...@uline.com>.
Thanks for responding!
I've hit it again with TRACE logging... here are the results of that:
2019-04-25 08:53:10,261 INFO parse.ParserChecker - fetching:
http://url.com/crawltest.html
2019-04-25 08:53:10,268 INFO plugin.PluginRepository - Plugins: looking in:
C:\nutch\apache-nutch-1.5.1\plugins
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Registered Plugins:
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Http / Https
Protocol Plug-in (protocol-httpclient)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Tika Parser Plug-in
(parse-tika)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - URL Validator
(urlfilter-validator)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Registered
Extension-Points:
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Nutch Segment Merge
Filter (org.apache.nutch.segment.SegmentMergeFilter)
2019-04-25 08:53:10,350 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2019-04-25 08:53:10,377 INFO httpclient.Http - http.proxy.host = null
2019-04-25 08:53:10,377 INFO httpclient.Http - http.proxy.port = 8080
2019-04-25 08:53:10,378 INFO httpclient.Http - http.timeout = 10000
2019-04-25 08:53:10,379 INFO httpclient.Http - http.content.limit = -1
2019-04-25 08:53:10,379 INFO httpclient.Http - http.agent =
Spider/Nutch-1.5.1
2019-04-25 08:53:10,379 INFO httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2019-04-25 08:53:10,380 INFO httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2019-04-25 08:53:10,385 TRACE httpclient.Http - Credentials - username:
user; set as default for realm: ntdomain; scheme:
2019-04-25 08:53:10,392 TRACE httpclient.Http - Pre-configured credentials
with scope - host: url.com; port: 80; not found for url:
http://url.com/crawltest.html
2019-04-25 08:53:10,449 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest, basic]
2019-04-25 08:53:10,449 INFO auth.AuthChallengeProcessor - ntlm
authentication scheme selected
2019-04-25 08:53:10,450 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 08:53:10,450 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 08:53:10,452 TRACE auth.NTLMScheme - enter
NTLMScheme.authenticate(Credentials, HttpMethod)
2019-04-25 08:53:10,460 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 08:53:10,460 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 08:53:10,461 TRACE auth.NTLMScheme - enter
NTLMScheme.authenticate(Credentials, HttpMethod)
2019-04-25 08:53:10,952 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 08:53:10,953 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 08:53:10,955 INFO httpclient.HttpMethodDirector - Failure
authenticating with NTLM <any realm>@url.com:80
2019-04-25 08:53:10,959 TRACE httpclient.Http - url:
http://url.com/crawltest.html; status code: 401; bytes received: 6322;
Content-Length: 6322
2019-04-25 08:53:11,033 TRACE httpclient.Http - 401 Authentication Required
2019-04-25 08:53:11,133 INFO crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2019-04-25 08:53:11,135 INFO parse.ParserChecker - parsing:
http://url.com/crawltest.html
2019-04-25 08:53:11,135 INFO parse.ParserChecker - contentType:
application/xhtml+xml
2019-04-25 08:53:11,138 INFO parse.ParserChecker - signature:
495abb7f991fb4dd6a056f748908a2d9
Regarding what's in the server's security events, a couple of interesting
things:
1. The server sees the attempt, but the failure reason is "Unknown user name
or bad password". The user and password sent from httpclient-auth.xml are
exactly the same as what I'm sending with the curl command.
2. Unlike with the curl command, the Account Name being sent over is all
upper case! I suspect that this has something to do with it. Again,
though, the username in httpclient-auth.xml is NOT all in upper case.