Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/10/02 19:50:20 UTC

deletions from index

With my new news crawl, I would like to keep web pages in the index even after they have disappeared from the web, so I can continue using them in machine-learning processes. I thought I could achieve this by not running cleaning jobs. However, I still see an increasing number of deletions in my Solr index.
When and why does Nutch tell the indexer to delete documents, other than during cleaningJob?
For example, Solr recently told me that numDocs is about 189,000 and deletedDocs is about 96,000. Even if I assume that some of the "deleted" docs have just been replaced by newer content, I am not ready to believe that has happened to so many of them.
Should I use a different indexer, or different settings, or something other than an indexer for this purpose?

Re: deletions from index

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.

So, I had these numbers in my index:
Num Docs: 189550
Max Docs: 285531
Deleted Docs: 95981

Then I did a crawl and index, which told me:
indexed (add/update): 13,423

And now I have these numbers in my index:

Num Docs: 190785
Max Docs: 223339
Deleted Docs: 32554

So, I am completely confused. I don't use "-deleteGone" but I get massive numbers of deletions.

Is it your theory that Solr's report of deleted docs really just means that docs were replaced by newer content?
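
The numbers quoted above can be checked with a little arithmetic. The following is a minimal sketch, not anything Nutch or Solr runs. It assumes that Solr's "Deleted Docs" is simply Max Docs minus Num Docs, that an update to an existing document marks the old copy as deleted until a segment merge purges it, and that the run issued no explicit deletes. The function and variable names are hypothetical and only serve the illustration.

```python
# Sketch under stated assumptions: reconcile the Solr statistics from this thread.
# All names here are hypothetical; the figures are the ones quoted in the message.

def deleted_docs(max_doc: int, num_docs: int) -> int:
    """Deleted-but-not-yet-merged documents, assuming maxDoc = numDocs + deletedDocs."""
    return max_doc - num_docs

before = {"num_docs": 189550, "max_doc": 285531, "deleted": 95981}
after = {"num_docs": 190785, "max_doc": 223339, "deleted": 32554}
indexed_add_update = 13423  # "indexed (add/update)" reported by the indexing job

# The reported "Deleted Docs" matches maxDoc - numDocs in both snapshots.
for stats in (before, after):
    assert deleted_docs(stats["max_doc"], stats["num_docs"]) == stats["deleted"]

# Net growth of live documents between the two snapshots.
net_new = after["num_docs"] - before["num_docs"]

# Assumption: no explicit deletes were sent, so every add/update that did not
# grow the index replaced an existing document and left a deleted old copy.
replaced = indexed_add_update - net_new

# Deleted count expected if no segment merge had happened in between.
expected_without_merge = before["deleted"] + replaced

# The observed count is lower, so merges must have purged at least this many.
purged_by_merges = expected_without_merge - after["deleted"]

print("net new live docs:", net_new)
print("replaced existing docs:", replaced)
print("purged by segment merges (at least):", purged_by_merges)
```

Under these assumptions the figures are consistent with updates plus segment merging alone: the deleted count rises when existing documents are re-indexed and falls when merges reclaim the space, without Nutch having requested any deletions.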


      From: Markus Jelsma <ma...@openindex.io>
 To: "user@nutch.apache.org" <us...@nutch.apache.org>; User <us...@nutch.apache.org> 
 Sent: Monday, October 2, 2017 1:19 PM
 Subject: RE: deletions from index
   
You can check the Hadoop job's counters to see how many documents are being deleted. If some are, then -deleteGone is on in your case. Documents are deleted only when that setting is on.

 
 

RE: deletions from index

Posted by Markus Jelsma <ma...@openindex.io>.

You can check the Hadoop job's counters to see how many documents are being deleted. If some are, then -deleteGone is on in your case. Documents are deleted only when that setting is on.

 
 