Posted to solr-user@lucene.apache.org by mikr00 <kr...@gmail.com> on 2011/11/12 22:08:53 UTC

Delete by Query with limited number of rows

I have the following problem and can't seem to find a solution:

I'm building a frequently updated Solr index. To cope with limited
resources, I would like to cap the total number of documents in the
index. In other words, I would like to declare that no more than (for
example) 1,000,000 documents should be in the index. Whenever new documents
are added (or better: when newly added documents are committed), I
would like to:

- check whether the limit is exceeded
- delete as many of the oldest documents from the index as necessary, so
that the limit is no longer exceeded.

Similar to a first-in, first-out list. The problem is: It's easy to check the
limit, but how can I delete the oldest documents to get back below the
limit? Can I do it with a delete by query request? In that case, I would
probably have to limit the number of rows? But I can't seem to find a way to
do that. Or would you see a different solution (maybe there is a way to
configure the Solr core such that it automatically behaves as described)?

I would very much appreciate any help!

Thanks in Advance.

Cheers

Michael

--
View this message in context: http://lucene.472066.n3.nabble.com/Delete-by-Query-with-limited-number-of-rows-tp3503094p3503094.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Delete by Query with limited number of rows

Posted by mikr00 <kr...@gmail.com>.
Hi Erick, hi Yury,

thanks to your input I found a perfect solution for my case. Even though
this is not a solr-only solution, I will just briefly describe how it works
since it might be of interest to others:

I have set up a MySQL database holding two tables. The first has only an
auto-increment primary key and nothing else. The second has a primary key
without auto-increment, plus fields for the content I store in Solr.

Now, before I add something to the Solr core, I add an entry to the first
MySQL table. After the insertion, I get the primary key for that entry. I
check whether it is above my document limit. If so, I empty the first
MySQL table and reset the auto-increment to zero. I then insert an
entry into the second table using the primary key taken from the first table
(if the primary key already exists, I do not add an entry but update the
existing one). Finally, I have a Solr core which holds my searchable data and
has a uniqueKey field. Into this core I add a new document, using the
primary key from the first MySQL table as the uniqueKey value.

The solution has two main benefits for me:

- I can precisely control the number of documents in my Solr core.
- I now also have a backup of my data in MySQL.

Thank you very much for your help!




Re: Delete by Query with limited number of rows

Posted by Erick Erickson <er...@gmail.com>.
There's nothing built into Solr that lets you do this automatically. About
the best you can do is probably a delete by query going back some fixed
time interval. So rather than keeping the last N documents, you keep
documents that are, say, no more than 1 month old (or whatever
interval you determine lets you keep roughly 1M docs
around).

Then you'd have to monitor this on a running system to see how close
you are to your target numbers.
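
A rough sketch of such a time-based delete by query, in Python. The field name "timestamp" matches the field mentioned elsewhere in this thread; the helper names and the update URL in the usage note are assumptions for illustration, not something from the thread.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_delete_older_than(field, age):
    # Solr date math: NOW-1MONTH means "one month before now".
    query = f"{field}:[* TO NOW-{age}]"
    return f"<delete><query>{escape(query)}</query></delete>"


def post_update(update_url, xml_payload):
    # Send an XML update message to Solr's update handler via HTTP POST.
    request = urllib.request.Request(
        update_url,
        data=xml_payload.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage might look like post_update("http://localhost:8983/solr/update", build_delete_older_than("timestamp", "1MONTH")), followed by posting "<commit/>" to the same handler; the host and path depend on your installation.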

Best
Erick

On Sun, Nov 13, 2011 at 1:23 PM, mikr00 <kr...@gmail.com> wrote:
> Hi Yury,
>
> thank you very much for your quick reply. Currently I have a timestamp field
> (solr.DateField) and every time I add a document I use "NOW" for the
> timestamp field. I only commit documents on the core every four hours. This
> works fine with the timestamp since I can use "NOW". However, I couldn't
> figure out how to define some kind of auto-increment for a particular
> field. I think I can't handle this from outside since I can have several
> adds in parallel from different clients. So I was wondering whether there
> could be a field type that automatically increases its value
> for each added (committed) document, so that I could use a placeholder like
> "NOW" in the case of the DateField to indicate that I would like to
> auto-increment the field.
>
> Cheers
>
> Michael

Re: Delete by Query with limited number of rows

Posted by mikr00 <kr...@gmail.com>.
Hi Yury,

thank you very much for your quick reply. Currently I have a timestamp field
(solr.DateField) and every time I add a document I use "NOW" for the
timestamp field. I only commit documents on the core every four hours. This
works fine with the timestamp since I can use "NOW". However, I couldn't
figure out how to define some kind of auto-increment for a particular
field. I think I can't handle this from outside since I can have several
adds in parallel from different clients. So I was wondering whether there
could be a field type that automatically increases its value
for each added (committed) document, so that I could use a placeholder like
"NOW" in the case of the DateField to indicate that I would like to
auto-increment the field.

Cheers

Michael


Re: Delete by Query with limited number of rows

Posted by Yury Kats <yu...@yahoo.com>.
On 11/12/2011 4:08 PM, mikr00 wrote:
> Similar to a first-in, first-out list. The problem is: It's easy to check the
> limit, but how can I delete the oldest documents to get back below the
> limit? Can I do it with a delete by query request? In that case, I would
> probably have to limit the number of rows? But I can't seem to find a way to
> do that. Or would you see a different solution (maybe there is a way to
> configure the Solr core such that it automatically behaves as described)?

You can certainly delete a set of documents using "delete by query",
but you need to somehow identify which documents you want deleted.
For that, you'd need a field such as a sequence number or a timestamp
recording when the document was added.
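
As an illustration of the sequence-number variant, here is a small Python sketch that builds the delete message needed to trim the index back to a fixed size. The field name "seq" and the helper name are hypothetical. It assumes sequence numbers start at 1 and have no gaps, and that the field is indexed as a numeric type so the range compares numerically.

```python
def build_trim_query(field, newest_seq, limit):
    # Everything at or below the cutoff is older than the newest `limit` docs.
    cutoff = newest_seq - limit
    if cutoff < 1:
        return None  # limit not exceeded, nothing to delete
    return f"<delete><query>{field}:[* TO {cutoff}]</query></delete>"
```

For example, with a newest sequence number of 1,000,250 and a limit of 1,000,000, the query removes sequence numbers up to 250 and keeps 251 through 1,000,250.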

Alternatively, if you can control the uniqueKey field when adding documents,
you can just cycle it between 1 and 1,000,000. When you reach 1,000,000,
set the uniqueKey back to 1 and keep adding. The new document will automatically
replace the old document with the key of "1".
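
The cycling itself can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes a single writer hands out the keys; parallel clients would need some shared counter.

```python
def next_unique_key(previous_key, limit=1_000_000):
    # Wrap back to 1 once the limit has been reached.
    return 1 if previous_key >= limit else previous_key + 1


def key_sequence(count, limit):
    # Keys assigned to `count` consecutive adds, starting from an empty index.
    keys, key = [], 0
    for _ in range(count):
        key = next_unique_key(key, limit)
        keys.append(key)
    return keys
```

With a limit of 3, five consecutive adds receive the keys 1, 2, 3, 1, 2, so the fourth and fifth documents overwrite the first and second.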