Posted to user@nutch.apache.org by Arthur Yarwood <ar...@fubaby.com> on 2016/03/04 20:50:23 UTC
http vs https duplicate fetches - host-urlnormalize?
I have recently discovered my crawl had fetched a number of sites in
duplicate - once over http, and again over https. In a similar manner
one can add a host to the host-urlnormalize file to avoid the same issue
with www.example.com vs example.com urls - is there a tactic to address
http vs https?
Ideally always favouring http over https (for efficiency), but not
discounting https entirely, if an entire host is set up to always
serve over https. I.e. I don't really want to block all https hosts via
a regex-urlfilter.
I have worked around it to some degree via specific regex-urlfilters,
but it would be nice if there were a global option, rather than having to
tweak config every time I discover duplicate content in my crawl.
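[For reference, the per-host workaround described above can be sketched as regex-urlfilter.txt entries - example.com is a placeholder host, and the exact rules are an illustration rather than the poster's actual config:

```
# Reject the https variant of a host known to serve identical content over http
-^https://(www\.)?example\.com/
# Accept everything else
+.
```

Rules are applied top to bottom; the first matching pattern wins, so the reject line must precede the catch-all accept. The drawback, as noted above, is that every duplicated host needs its own rule.]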
--
Arthur Yarwood
Re: http vs https duplicate fetches - host-urlnormalize?
Posted by Arthur Yarwood <ar...@fubaby.com>.
Ah good stuff. I'll keep an eye out for that 1.12 release.
Many thanks!
Arthur
--
Arthur Yarwood
Re: http vs https duplicate fetches - host-urlnormalize?
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Arthur,
this problem has been recently discussed in
https://issues.apache.org/jira/browse/NUTCH-2065
and addressed by urlnormalizer-protocol
https://issues.apache.org/jira/browse/NUTCH-2190
Of course, you have to decide for every host
which protocol shall be used.
Cheers,
Sebastian
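[As a sketch of how the urlnormalizer-protocol plugin from NUTCH-2190 is configured: it reads a per-host mapping of preferred protocols from a config file. The file name, tab-separated format, and the hosts below are assumptions for illustration - check the plugin's bundled documentation for the exact syntax in your Nutch version:

```
# conf/protocols.txt - one "host <TAB> protocol" pair per line (hypothetical entries)
example.com	http
secure.example.org	https
```

With the plugin enabled via the plugin.includes property, URLs for a listed host are normalized to the chosen protocol, so the http and https variants collapse into a single CrawlDb record - which addresses the duplicate-fetch problem without filtering https out globally.]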