Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2017/10/20 08:14:16 UTC
Re: inject deletes urls from crawldb
Hi Michael,
that's actually due to a bug introduced with Nutch 1.12 and already fixed for Nutch 1.14, see
https://issues.apache.org/jira/browse/NUTCH-2335
Thanks,
Sebastian
On 09/28/2017 07:26 PM, Michael Coffey wrote:
> If the Inject command does filtering, then the documentation should say so. The page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention any filtering or normalization. I find it very counter-intuitive that an injection operation would delete existing data.
>
> Should I edit that page? Can I?
>
>
> From: Markus Jelsma <ma...@openindex.io>
> To: "user@nutch.apache.org" <us...@nutch.apache.org>; User <us...@nutch.apache.org>
> Sent: Thursday, September 28, 2017 2:06 AM
> Subject: RE: inject deletes urls from crawldb
>
> filters and/or normalizers come to mind!
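[Editor's note: for context, URL filter rules in Nutch are usually configured in conf/regex-urlfilter.txt. The rules below are illustrative examples, not taken from this thread; in the affected versions the inject job applies these rules to existing CrawlDb entries as well as to the new seeds, so any existing URL matching a "-" rule is silently dropped.]

```
# skip URLs ending in common static-resource extensions
-\.(gif|jpg|png|css|js|zip)$

# skip URLs containing query-string characters (an aggressive rule
# that can easily match URLs already stored in the CrawlDb)
-[?*!@=]

# accept anything else
+.
```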
>
>
>
> -----Original message-----
>> From:Michael Coffey <mc...@yahoo.com.INVALID>
>> Sent: Thursday 28th September 2017 4:40
>> To: User <us...@nutch.apache.org>
>> Subject: inject deletes urls from crawldb
>>
>> Perhaps my strangest question yet!
>> Why does Inject delete URLs from the crawldb and how can I prevent it?
>> I was trying to add 2 new sites to an existing crawldb. According to readdb stats, about 10% of my URLs disappeared in the process.
>>
>> (before injecting)
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls: 24849
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 20047
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3465
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone): 402
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 779
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 65
>> (after injecting)
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls: 22405
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 19014
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched): 3187
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone): 36
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 28
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 91
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 49
>> My command line is like this:
>> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 -D db.fetch.interval.default=3600 /crawls/$crawlspace/data/crawldb /crawls/$crawlspace/seeds_nbcnews.txt
>> Does it apply urlfilters as it injects?
>>
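[Editor's note: yes — inject does run the configured URL filters and normalizers, and applying them to existing CrawlDb entries is the bug tracked in NUTCH-2335. One way to see which URLs the filter chain would reject is the filterchecker tool shipped with bin/nutch; the sketch below is illustrative, and the exact flags vary between Nutch versions (some take a -stdin option), so check `bin/nutch filterchecker` usage output for your release.]

```
# Feed URLs through the configured URL filter chain; output is one line
# per URL, prefixed "+" (accepted) or "-" (rejected by some filter).
cat /crawls/$crawlspace/seeds_nbcnews.txt | $NUTCH_HOME/runtime/local/bin/nutch filterchecker
```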