Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2017/10/20 08:14:16 UTC

Re: inject deletes urls from crawldb

Hi Michael,

that's actually due to a bug introduced with Nutch 1.12 and already fixed for Nutch 1.14, see
  https://issues.apache.org/jira/browse/NUTCH-2335
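For context: on affected versions the Injector runs the configured URL filters and
normalizers not only over the seed list but also over the existing CrawlDb entries,
so any entry rejected by a filter is silently dropped. With default rules along
these lines in conf/regex-urlfilter.txt (an illustrative excerpt, not necessarily
your configuration), URLs containing query characters would disappear:

```
# Illustrative excerpt of a typical conf/regex-urlfilter.txt
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# skip URLs containing certain characters, likely queries or session ids
-[?*!@=]
# accept anything else
+.
```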

Thanks,
Sebastian
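
P.S. To see which URLs your configured filter chain would reject, you can feed
them to the URLFilterChecker tool on stdin. A sketch, assuming a standard 1.x
local runtime; $NUTCH_HOME and the sample URL are placeholders:

```shell
# Prints '+' for accepted and '-' for rejected URLs,
# using all activated filter plugins combined.
echo "http://www.nbcnews.com/news?id=123" | \
  $NUTCH_HOME/runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```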

On 09/28/2017 07:26 PM, Michael Coffey wrote:
> If the Inject command does filtering, then the documentation should say so. The page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention any filtering or normalization. I find it very counter-intuitive that an injection operation would delete existing data.
> 
> Should I edit that page? Can I?
> 
> 
>       From: Markus Jelsma <ma...@openindex.io>
>  To: "user@nutch.apache.org" <us...@nutch.apache.org>; User <us...@nutch.apache.org> 
>  Sent: Thursday, September 28, 2017 2:06 AM
>  Subject: RE: inject deletes urls from crawldb
>    
> filters and/or normalizers come to mind!
> 
>  
>  
> -----Original message-----
>> From: Michael Coffey <mc...@yahoo.com.INVALID>
>> Sent: Thursday 28th September 2017 4:40
>> To: User <us...@nutch.apache.org>
>> Subject: inject deletes urls from crawldb
>>
>> Perhaps my strangest question yet!
>> Why does Inject delete URLs from the crawldb and how can I prevent it?
>> I was trying to add 2 new sites to an existing crawldb. According to readdb stats, about 10% of my URLs disappeared in the process.
>>
>> (before injecting)
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls: 24849
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    20047
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched):      3465
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone):         402
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   779
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   91
>> 17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate):    65
>> (after injecting)
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls: 22405
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    19014
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched):      3187
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone):         36
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   28
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   91
>> 17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate):    49
>> My command line is like this:
>> $NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 -D db.fetch.interval.default=3600 /crawls/$crawlspace/data/crawldb /crawls/$crawlspace/seeds_nbcnews.txt
>> Does it apply urlfilters as it injects?
>>
> 