Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2013/10/28 15:13:40 UTC

RE: double slash in path normalized away by Nutch 1.7

Hi,

That double slash is normalized by your regex-normalizer. Check the configuration file and either remove the normalization rule or change it so that it does not normalize a double slash that comes right after http or https.
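
As a rough illustration only, here is a small Python sketch of the idea. It assumes the duplicate-slash rule has the form (?<!:)/{2,} replaced by a single slash; check your own configuration file for the actual pattern. The second pattern is a hypothetical variant that also skips a double slash following a percent-encoded colon (%3A), which is the case in your URL. The names and the shortened URL are made up for the example.

```python
import re

# Assumed form of the duplicate-slash rule (verify against your own config):
# collapse runs of slashes unless they directly follow a literal colon.
COLLAPSE_UNLESS_LITERAL_COLON = re.compile(r"(?<!:)/{2,}")

# Hypothetical variant: additionally leave the run alone when it follows a
# percent-encoded colon (%3A or %3a).
COLLAPSE_UNLESS_ANY_COLON = re.compile(r"(?<!:)(?<!%3[Aa])/{2,}")


def normalize(url, rule):
    """Apply one slash-collapsing rule to a URL string."""
    return rule.sub("/", url)


# Shortened version of the URL from the original message.
seed = ("https://www.pay.gov/paygov/forms/formInstance.html"
        "?userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html")

altered = normalize(seed, COLLAPSE_UNLESS_LITERAL_COLON)
preserved = normalize(seed, COLLAPSE_UNLESS_ANY_COLON)

print(altered)    # the double slash after %3A is collapsed
print(preserved)  # the URL is left as it was
```

If you carry a pattern like this over to an XML configuration file, remember that the "<" in the lookbehind has to be written as &lt; there.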

Cheers,
Markus

 
 
-----Original message-----
> From:Steve Newcomb <sr...@coolheads.com>
> Sent: Monday 28th October 2013 15:10
> To: user@nutch.apache.org
> Subject: double slash in path normalized away by Nutch 1.7
> 
> I think maybe Nutch is not working correctly with respect to URLs whose
> path portions contain double slashes.  I'm using Nutch 1.7 (with the
> protocol-httpclient plugin) to validate a carefully-maintained list of
> URLs, so I'm paying unusually close attention, I guess, to what's
> happening to every one of them.
> 
> In Firefox, the following URL works:
> 
> https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0
> 
> Note the double slash after "https%3A" in the path portion of the URL.
> 
> After using Nutch to check this URL along with many others, the segment
> dump does not report this URL.  Instead, it reports another URL -- one
> in which the double slash in the path portion of the URL has been
> changed to a single slash.
> 
> The altered URL reported in the Nutch dump is evidently incorrect.  When
> I try the Nutch-reported URL in Firefox, I see that the server at
> www.pay.gov can't resolve it successfully.
> 
> The dump record for the altered URL reveals "robots denied", which is
> useful information for me, and it may be *correct* information, too: the
> URL is a form for users to fill out.  (I do not know what would happen
> if robots were allowed by the server.  I suspect Nutch would report that
> the resource does not exist, which would be incorrect for the URL I used
> as a seed, and correct for the URL that Nutch reported.)
> 
> But how can I find this information in the segment dump, since the
> information appears under a *different* URL than the one I was
> attempting to validate?  My current workaround is to normalize the path
> portion of the URL I'm looking for in the same apparently-incorrect
> fashion as Nutch does.  Not pretty.
> 
> 
> Steve Newcomb
> 
>