You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@nutch.apache.org by lewis john mcgibbney <le...@apache.org> on 2022/11/08 14:30:19 UTC

Re: user Digest 8 Nov 2022 10:16:05 -0000 Issue 3169

Hi Mike,

Yes it is possible to extend the TLD list. In fact, when the TLD lost was
compiled the author left a note explicitly stating that it may not be
complete.
https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template
Please submit a PR if you wish to make any changes or additions. You can
use the parser checker tool to validate your change before creating the PR.
Thanks
lewismc

On Tue, Nov 8, 2022 at 02:16 <us...@nutch.apache.org> wrote:

>
> ---------- Forwarded message ----------
> From: Mike <mz...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 8 Nov 2022 11:15:51 +0100
> Subject: Incomplete TLD List
> Hi!
> Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
> the TLD list?
>
>         "url":"https://about.google/intl/en_FR/how-our-business-works/",
>         "tstamp":"2022-11-06T17:22:14.808Z",
>         "domain":"google",
>         "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>
>
> Thanks
>
> Mike
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc