Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2014/03/04 20:51:28 UTC

[jira] [Updated] (SOLR-5795) Option to periodically delete docs based on an expiration field -- or ttl specified when indexed.

     [ https://issues.apache.org/jira/browse/SOLR-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-5795:
---------------------------

    Attachment: SOLR-5795.patch

bq. Content should never be removed without explicit external interactions and this will lead to so many "where did my content go" type problems.. Especially since once its gone from the index debugging what went wrong is not going to be easy.. writing a script to send a query delete periodically is really not that complex and then it becomes the responsibility of the content owner/developer to delete content..

I'm not sure I follow your reasoning there -- "where did my content go" type situations can already exist via any {{deleteByQuery}} (not to mention really subtle things like {{SignatureUpdateProcessorFactory}}).  If anything, the approach I'm suggesting should be _more_ obvious than an external script -- because it would need to be configured right there in the {{solrconfig.xml}} where it's obvious and easy to see, as opposed to "where did my content go? ... time to wade through days of logs looking for deleteByQuery requests that could be coming from anywhere, at any interval of time."

The bottom line is that someone with the ability to edit {{solrconfig.xml}} already has the ability to trump & manipulate & block & mess with content sent from remote clients by content owners/developers -- this would in fact be another way to do that, but I don't think that's a bad thing.  It would just be a simpler, self-contained way for Solr admins to say "I want to have a way to automatically expire content that people put in my index."

bq. I would suggest that if this does go in some sort of "audit" output be produced (eg X docs delete automatically or a list of ids)

That would be really nice in general with any sort of {{deleteByQuery}} -- but it's not currently possible to get that info back from the IndexWriter.  The best we can do is explicitly log when/why we are triggering the automatic {{deleteByQuery}} commands.

----

I'm attaching a patch with a really rough proof of concept for the design outlined above ... still a lot of nocommits & error checking & tests needed, but it gives you something to try out to see what I had in mind.

With this patch applied, you can start up the example and load docs along the lines of this...

{noformat}
java -Durl="http://localhost:8983/solr/collection1/update?update.chain=nocommit" -Ddata=args -jar post.jar '<add><doc><field name="id">EXP</field><field name="_expire_at_">NOW+8MINUTES</field></doc><doc><field name="id">SAFE</field></doc><doc><field name="id">TTL</field><field name="_ttl_">+3MINUTES</field></doc></add>'
{noformat}

* Every 5 minutes, a thread will wake up and delete expired docs.
* {{EXP}} has an explicit value in the {{\_expire\_at\_}} field of 8 minutes after it was indexed -- if you index the docs immediately after starting up Solr, it should be deleted ~10 minutes after startup (the first wake-up after the 8 minute mark).
* {{TTL}} has a {{\_ttl\_}} value of 3 minutes, relative to when it was indexed, which the processor converts to an absolute value and puts in the {{\_expire\_at\_}} field -- if you index the docs immediately after starting up Solr, it should be deleted ~5 minutes after startup.
* {{SAFE}} will never be deleted, because nothing gives it a value in the {{\_expire\_at\_}} field.
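
The ~10 minute and ~5 minute figures above follow from rounding each expiration up to the next wake-up of the delete thread. A minimal sketch of that arithmetic, assuming the docs are indexed at startup and the first wake-up happens one full period later; {{ExpirySweepTiming}} and {{firstSweepAtOrAfter}} are hypothetical names for illustration and are not part of the patch:

```java
import java.time.Duration;

/** Illustrative only: when does a periodic sweep actually remove an expired doc? */
final class ExpirySweepTiming {

    /**
     * Returns the time since startup of the first sweep that runs at or after
     * the given expiration, assuming sweeps fire at period, 2*period, 3*period, ...
     */
    static Duration firstSweepAtOrAfter(Duration expiresAt, Duration sweepPeriod) {
        long periodMs = sweepPeriod.toMillis();
        long expiryMs = expiresAt.toMillis();
        // Ceiling division; there is no sweep at time zero, so at least one period passes.
        long sweeps = Math.max(1, (expiryMs + periodMs - 1) / periodMs);
        return Duration.ofMillis(sweeps * periodMs);
    }

    public static void main(String[] args) {
        Duration period = Duration.ofMinutes(5);
        // EXP: _expire_at_ = NOW+8MINUTES
        System.out.println("EXP deleted ~"
            + firstSweepAtOrAfter(Duration.ofMinutes(8), period).toMinutes() + " min after startup"); // 10
        // TTL: _ttl_ = +3MINUTES
        System.out.println("TTL deleted ~"
            + firstSweepAtOrAfter(Duration.ofMinutes(3), period).toMinutes() + " min after startup"); // 5
    }
}
```

In practice the numbers are approximate, since they shift with how long after startup the docs are actually indexed.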

One note where we definitely have to deviate from what I described initially: having the scheduled task use the factory to access the chain to trigger the delete didn't pan out, because I wasn't thinking clearly enough about what that existing API looks like -- the factory doesn't know what chain it's in, or what processor(s) should be "next"; that's input to the {{getInstance()}} method on the factory from the chain.  So instead the configuration requires you to specify the name of a chain (can be the same chain you are in) and that chain is used to execute the delete.
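
To make the "look the chain up by name" idea concrete, here is a rough, self-contained sketch. It does not use Solr's classes: {{DeleteChain}}, {{ExpirationSweeper}}, the registry map, and the chain name {{"expire-chain"}} are all hypothetical stand-ins, and the range query {{\_expire\_at\_:\[* TO NOW\]}} is an assumption about what the delete would look like, not something taken from the patch.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an update chain that can execute a delete. */
interface DeleteChain {
    void deleteByQuery(String query);
}

/** Illustrative only: a periodic task that resolves its chain by configured name. */
final class ExpirationSweeper {
    private final Map<String, DeleteChain> chainsByName; // hypothetical chain registry
    private final String deleteChainName;                // comes from configuration
    private final String expireField;

    ExpirationSweeper(Map<String, DeleteChain> chainsByName, String deleteChainName, String expireField) {
        this.chainsByName = chainsByName;
        this.deleteChainName = deleteChainName;
        this.expireField = expireField;
    }

    /** Resolves the chain at sweep time, so the name may refer to a chain registered later. */
    void sweepOnce() {
        DeleteChain chain = chainsByName.get(deleteChainName);
        if (chain == null) {
            throw new IllegalStateException("No chain named '" + deleteChainName + "'");
        }
        // Assumed query shape: everything whose expiration is at or before now.
        chain.deleteByQuery(expireField + ":[* TO NOW]");
    }

    /** Schedules sweepOnce() every periodSeconds; the caller shuts the executor down. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(this::sweepOnce, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }

    public static void main(String[] args) {
        Map<String, DeleteChain> chains = Map.of("expire-chain", q -> System.out.println("delete: " + q));
        new ExpirationSweeper(chains, "expire-chain", "_expire_at_").sweepOnce();
    }
}
```

Resolving the name lazily, at sweep time rather than when the sweeper is constructed, is one way to let the configured name point back at the chain that contains the processor itself.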

(The trickiest part of all of this will be writing the tests.)

> Option to periodically delete docs based on an expiration field -- or ttl specified when indexed.
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5795
>                 URL: https://issues.apache.org/jira/browse/SOLR-5795
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>         Attachments: SOLR-5795.patch
>
>
> A question I get periodically from people is how to automatically remove documents from a collection at a certain time (or after a certain amount of time).  
> Excluding from search results using a filter query on a date field is trivial, but you still have to periodically send a deleteByQuery to clean up those older "expired" documents.  And in the case where you want all documents to auto-expire some fixed amount of time after they were indexed, you still have to set up a simple UpdateProcessor to set that expiration date.  So I've been thinking it would be nice if there was a simple way to configure Solr to do it all for you.


