Posted to user@nutch.apache.org by Mike <mz...@gmail.com> on 2022/11/08 10:15:51 UTC

Incomplete TLD List

Hi!
Some of the new TLDs are wrongly indexed by Nutch. Is it possible to extend
the TLD list?

        "url":"https://about.google/intl/en_FR/how-our-business-works/",
        "tstamp":"2022-11-06T17:22:14.808Z",
        "domain":"google",
        "digest":"3b9a23d42f200392d12a697bbb8d4d87",


Thanks

Mike

Re: Incomplete TLD List

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Mike, hi Markus,

There's also
   https://issues.apache.org/jira/browse/NUTCH-1806
which would make it much easier to keep up to date with the public suffix list.

That is, because crawler-commons loads the public suffix list
(named "effective_tld_names.dat" for historical reasons) from the class path,
it would be quite easy to update the list by simply placing it in the
Nutch conf folder.

@Mike: please, let us know whether this is an option (for the long term). You 
may also upvote the Jira issue. Thanks!

Best,
Sebastian

On 11/8/22 11:45, Markus Jelsma wrote:
> Hello Mike,
> 
> You can try adding the TLD to conf/domain-suffixes.xml and see if it works.
> 
> Regards,
> Markus

Re: Incomplete TLD List

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Mike,

You can try adding the TLD to conf/domain-suffixes.xml and see if it works.
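
To illustrate why a missing TLD can produce "domain":"google", here is a
rough Python sketch of a suffix-list lookup. It is not Nutch's actual code:
the function name domain_name, the tiny suffix set, and the lookup strategy
(strip leading labels until the remainder is a known suffix) are all
assumptions made for illustration.

```python
def domain_name(host, suffixes):
    """Hypothetical helper: return the part of `host` that sits directly
    in front of a known suffix. If no suffix is known, the loop falls
    through to the last label."""
    candidate = host.rstrip(".").lower()
    while "." in candidate:
        # Everything after the first dot.
        rest = candidate.split(".", 1)[1]
        if rest in suffixes:
            return candidate
        candidate = rest
    return candidate


# Assumed, deliberately tiny suffix list without the new "google" TLD.
known = {"com", "org", "co.uk"}

print(domain_name("about.google", known))               # google
print(domain_name("about.google", known | {"google"}))  # about.google
print(domain_name("www.example.co.uk", known))          # example.co.uk
```

Under these assumptions, the lookup returns only the last label for an
unknown TLD, and returns the expected domain once the TLD is in the list.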

Regards,
Markus
