Posted to user@nutch.apache.org by Arthur Yarwood <ar...@fubaby.com> on 2016/03/04 20:50:23 UTC

http vs https duplicate fetches - host-urlnormalize?

I have recently discovered that my crawl had fetched a number of sites in 
duplicate: once over http, and again over https. One can add a host to the 
host-urlnormalize file to avoid a similar issue with www.example.com vs 
example.com urls - is there a comparable tactic to address 
http vs https?

Ideally it would always favour http over https (for efficiency), but 
without discounting https entirely when an entire host is set up to always 
serve over https, i.e. I don't really want to block all https hosts via 
a regex-urlfilter.

I have worked around it to some degree via specific regex-urlfilters, 
but it would be nice if there were a global option, rather than having to 
tweak the config every time I discover duplicate content in my crawl.

-- 
Arthur Yarwood


Re: http vs https duplicate fetches - host-urlnormalize?

Posted by Arthur Yarwood <ar...@fubaby.com>.
Ah good stuff. I'll keep an eye out for that 1.12 release.
Many thanks!

Arthur

On 05/03/2016 20:48, Sebastian Nagel wrote:
> Hi Arthur,
>
> this problem has been recently discussed in
>    https://issues.apache.org/jira/browse/NUTCH-2065
> and addressed by urlnormalizer-protocol
>    https://issues.apache.org/jira/browse/NUTCH-2190
>
> Of course, you have to decide for every host
> which protocol shall be used.
>
> Cheers,
> Sebastian
-- 
Arthur Yarwood


Re: http vs https duplicate fetches - host-urlnormalize?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Arthur,

this problem has been recently discussed in
  https://issues.apache.org/jira/browse/NUTCH-2065
and addressed by urlnormalizer-protocol
  https://issues.apache.org/jira/browse/NUTCH-2190

Of course, you have to decide for every host
which protocol shall be used.
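
To illustrate what "deciding per host" amounts to, here is a rough sketch in
Python. It is not the plugin's code or its configuration format; the mapping,
the host names and the function name are all made up for the example. The
idea is simply that each listed host gets one preferred protocol, and any
URL on that host is rewritten to it, so both variants collapse to one URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical per-host choices; the host names are placeholders.
PREFERRED_PROTOCOL = {
    "www.example.com": "http",
    "secure.example.org": "https",
}


def normalize_protocol(url, preferred=PREFERRED_PROTOCOL):
    """Rewrite http/https to the protocol chosen for the URL's host."""
    parts = urlsplit(url)
    # Only http and https URLs are candidates for rewriting.
    if parts.scheme not in ("http", "https"):
        return url
    # Leave URLs with an explicit port alone: the port may be tied
    # to the protocol it was written with.
    if parts.port is not None:
        return url
    wanted = preferred.get(parts.hostname)
    # Hosts without an entry, or already on the preferred protocol,
    # pass through unchanged.
    if wanted is None or wanted == parts.scheme:
        return url
    # Swap the scheme and keep netloc, path, query and fragment.
    return urlunsplit((wanted,) + tuple(parts[1:]))
```

With such a mapping, https://www.example.com/page and
http://www.example.com/page both end up as the http URL, while hosts that
are not listed are left untouched.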

Cheers,
Sebastian

