Posted to solr-user@lucene.apache.org by "kunhu0404@gmail.com" <ku...@gmail.com> on 2018/08/30 06:48:00 UTC

Solr Stale pages

Hello All,

I would like to know how Solr handles stale pages. For example, there
are 30 documents indexed for the domain abc.com, and in the second collection I
have only 27 documents for the same abc.com domain, which need to be
indexed in Solr.
So how will Solr handle the old pages already indexed? Will it delete the
stale pages on every new collection update?
Thank you





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr Stale pages

Posted by Cassandra Targett <ct...@apache.org>.
As Jan pointed out, unless your client sends Solr some instructions for
what to do with those documents specifically, Solr doesn't do anything.

In your example, Nutch crawls 30 documents at first, and 30 documents are
sent to Solr and added to the index. On next crawl, it finds 27 documents,
and 27 documents are sent to Solr. If these documents have the same unique
keys (IDs) as 27 documents already in the index, the documents in the index
will be updated (someone can correct me on this, but I believe these documents
get overwritten even if the content itself has not changed).
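
To make the counting concrete, here is a rough sketch of the overwrite-by-unique-key behaviour described above. It is only a toy model: a plain Python dict stands in for the index, and the `apply_update` helper, the `crawl` field and the `http://abc.com/pageN` IDs are made up for illustration, not anything Solr or Nutch provides.

```python
# Toy model only: a dict keyed by unique ID stands in for the Solr index.
def apply_update(index, docs, key="id"):
    """Add each doc, replacing any existing doc that has the same unique key."""
    for doc in docs:
        index[doc[key]] = doc
    return index

index = {}
# Hypothetical IDs: the first crawl finds 30 pages, the second only 27 of them.
first_crawl = [{"id": f"http://abc.com/page{i}", "crawl": 1} for i in range(30)]
second_crawl = [{"id": f"http://abc.com/page{i}", "crawl": 2} for i in range(27)]

apply_update(index, first_crawl)
apply_update(index, second_crawl)

# Documents that the second update never touched are still in the index.
untouched = sorted(k for k, d in index.items() if d["crawl"] == 1)
print(len(index), len(untouched))  # 30 3
```

The 27 resent documents are replaced in place, and the other 3 simply stay where they are.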

Unless Nutch (or any other client) specifically tells Solr to do something
with the 3 documents that were not sent as part of this second update, Solr
does nothing with regard to those documents. That makes sense: you don't
want Solr deleting documents just because you didn't happen to update them
with every indexing request.

Solr maintains no record of where a document came from, what client sent
it, nor whether subsequent updates from the same client update or do not
update the same set of documents as previous requests from the same client.
It is up to the client process itself to keep track of this, and send Solr
details of what to do with subsequent update requests. In this case, what
you want is for Nutch to send Solr a delete by ID request for those 3
documents so they are removed. I'm not sure if Nutch is capable of doing
that, however.
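
For a client that does keep track of what it sent, the bookkeeping could look roughly like the sketch below. It assumes the client remembers the IDs from its previous run; the function names, the example IDs and the collection URL in the comment are hypothetical. The request body uses the delete-by-ID form of Solr's JSON update format, and `send_delete` is defined but not called here because it needs a running Solr.

```python
import json
import urllib.request

def stale_ids(previous_ids, current_ids):
    """IDs that were sent in the previous run but are absent from the current one."""
    return sorted(set(previous_ids) - set(current_ids))

def delete_by_id_payload(ids):
    """Build a JSON update body that asks Solr to delete the given IDs."""
    return json.dumps({"delete": list(ids)})

def send_delete(update_url, ids):
    """POST the delete request. update_url is an assumption, for example
    http://localhost:8983/solr/mycollection/update?commit=true"""
    req = urllib.request.Request(
        update_url,
        data=delete_by_id_payload(ids).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Hypothetical IDs: 30 pages seen in the first crawl, 27 in the second.
previous = [f"http://abc.com/page{i}" for i in range(30)]
current = previous[:27]

gone = stale_ids(previous, current)
payload = delete_by_id_payload(gone)
print(payload)
```

The key point is that the set difference is computed on the client side; Solr only receives the resulting list of IDs to remove.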

On Thu, Aug 30, 2018 at 7:00 AM kunhu0404@gmail.com <ku...@gmail.com>
wrote:

> Thanks for the update
>
> I'm using Nutch 1.14, Solr 6.6.3 and ZooKeeper 3.4.12. We are running two
> Solr instances configured as SolrCloud. Please let me know if anything is
> missing.

Re: Solr Stale pages

Posted by "kunhu0404@gmail.com" <ku...@gmail.com>.
Thanks for the update

I'm using Nutch 1.14, Solr 6.6.3 and ZooKeeper 3.4.12. We are running two
Solr instances configured as SolrCloud. Please let me know if anything is missing.




Re: Solr Stale pages

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi

Please give us more context. You can start by telling us which crawler you are using and more about your architecture.
It is NOT Solr's responsibility to add or delete documents on its own. It is the client (crawler) that has to know when a document is stale or gone from the source, and the crawler then needs to explicitly send a delete request for that document.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 30 Aug 2018, at 08:48, kunhu0404@gmail.com wrote:
> 
> Hello All,
> 
> I would like to know how Solr handles stale pages. For example, there
> are 30 documents indexed for the domain abc.com, and in the second collection I
> have only 27 documents for the same abc.com domain, which need to be
> indexed in Solr.
> So how will Solr handle the old pages already indexed? Will it delete the
> stale pages on every new collection update?
> Thank you