Posted to solr-user@lucene.apache.org by Renee Sun <re...@mcafee.com> on 2015/09/02 21:30:33 UTC

is there any way to tell delete by query actually deleted anything?

I run this curl command to delete some messages:

curl \
  'http://localhost:8080/solr/mycore/update?commit=true&stream.body=<delete><id>abacd</id></delete>' \
  | xmllint --format -

or

curl \
  'http://localhost:8080/solr/mycore/update?commit=true&stream.body=<delete><query>myfield:mycriteria</query></delete>' \
  | xmllint --format -

The result I get looks like this:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
148   148    0   148    0     0  11402      0 --:--:-- --:--:-- --:--:-- 14800
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">10</int>
  </lst>
</response>

Is there an easy way for me to get the number of documents actually deleted?
I mean, if the query did not hit any documents, I want to know that nothing
got deleted. But if it did hit documents, I would like to know how many were
deleted...

thanks
Renee



--
View this message in context: http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: is there any way to tell delete by query actually deleted anything?

Posted by Renee Sun <re...@mcafee.com>.
thanks Shawn...

On the other side, I have just created a thin webapp layer that I deploy
alongside Solr in Tomcat. This webapp provides a RESTful API that lets all
kinds of clients in our system request a commit on a given core on that Solr
server.

I put it in with the idea of having a central, final place to control commits
on the cores of the local Solr server.

So far it works by cutting down on arbitrary requests. For example, I do not
allow two commit requests from different clients on the same core to happen
too close to each other: I disregard the second request if the first one
completed less than 5 minutes ago.
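
The throttling rule described above could be sketched roughly as follows, in
Python. This is only an illustration, not the actual webapp: the class and
method names are hypothetical, the clock is injectable so the rule can be
exercised without waiting, and the sketch records the time a commit was
accepted rather than the time it finished.

```python
import time


class CommitThrottle:
    """Drops commit requests that arrive too soon after the previous one."""

    def __init__(self, min_interval_seconds=300, clock=time.monotonic):
        # 300 seconds matches the 5-minute window mentioned in the message.
        self.min_interval = min_interval_seconds
        self.clock = clock
        self.last_commit = {}  # core name -> time of the last accepted commit

    def should_commit(self, core):
        """Return True and record the time if a commit is allowed now."""
        now = self.clock()
        last = self.last_commit.get(core)
        if last is not None and now - last < self.min_interval:
            return False  # too close to the previous commit on this core
        self.last_commit[core] = now
        return True
```

Each core is tracked separately, so a recent commit on one core does not block
a commit on another.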

I am thinking of enhancing this webapp to check the timestamp of the physical
index directory and drop the request if the core has not changed since the
last commit. This would prevent a client from blindly committing on all cold
cores when only one of them was actually updated.

What I mean to ask is: is there any Solr admin metadata I can fetch through a
RESTful API, such as the time the index was last updated, or something like
that?




Re: is there any way to tell delete by query actually deleted anything?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/2/2015 3:32 PM, Renee Sun wrote:
> I think we have similar structure where we use frontier/back instead of
> hot/cold :-)
>
> so yes we will probably have to do the same.
>
> since we have large customers and some of them may have tera bytes data and
> end up with hundreds of cold cores.... the blind delete broadcasting to all
> of them is a performance kill.
>
> I am thinking of adding a in-memory inventory of coreID : docID  so I can
> identify which core the document is in efficiently... what do you think
> about it?

I could write code for the deleteByQuery method to figure out where to
send the requests.  Performance hasn't become a problem with the "send
to all shards" method.  If it does, then I know exactly what to do:

If the ID value that we use for sharding is larger than X, it goes to
the hot shard.  If not, then I would CRC32 hash the ID, mod the hash
value by the number of cold shards, and send it to the shard number (0
through 5 for our indexes) that comes out.
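
A minimal sketch of that routing rule, in Python. The function name, the
threshold parameter (the "X" above), and the choice to hash the decimal string
form of the ID are assumptions made for illustration; the message does not say
which byte representation is hashed.

```python
import zlib


def pick_shard(shard_id, hot_threshold, cold_shard_count=6):
    """Return "hot", or a cold shard number in 0..cold_shard_count-1."""
    if shard_id > hot_threshold:
        return "hot"
    # Assumption: hash the decimal string form of the numeric ID.
    digest = zlib.crc32(str(shard_id).encode("ascii"))
    return digest % cold_shard_count
```

Because the result depends only on the ID, the threshold, and the shard count,
the same ID always maps to the same shard until the layout changes.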

Our sharding ID field is actually not our uniqueKey field for Solr,
although it is the autoincrement primary key on the source MySQL
database.  Another way to think about this field is as the "delete id". 
Our Solr uniqueKey is a different field that has a unique-enforcing
index in MySQL.

If you want good performance with sharding operations, then you need a
sharding algorithm that is completely deterministic based on the key
value and the current shard layout.  If the shard layout changes then it
should not change frequently.  Our layout changes only once a day, at
which time the oldest documents are moved from the hot shard to the cold
shards.

Thanks,
Shawn


Re: is there any way to tell delete by query actually deleted anything?

Posted by Renee Sun <re...@mcafee.com>.
Hi Shawn,
I think we have a similar structure, where we use frontier/back instead of
hot/cold :-)

So yes, we will probably have to do the same.

Since we have large customers, some of whom may have terabytes of data and
end up with hundreds of cold cores, blindly broadcasting the delete to all of
them is a performance killer.

I am thinking of adding an in-memory inventory of coreID : docID so I can
efficiently identify which core a document is in... what do you think about
it?
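
One possible shape for such an inventory, sketched in Python. All names here
are hypothetical. The sketch maps each document ID to a set of cores rather
than a single core, on the assumption that duplicates of one document may live
in more than one core. Memory footprint and rebuilding the map after a restart
are not addressed.

```python
from collections import defaultdict


class DocInventory:
    """Maps each document ID to the cores known to hold a copy of it."""

    def __init__(self):
        self._cores_by_doc = defaultdict(set)

    def record_indexed(self, core_id, doc_id):
        """Remember that doc_id was indexed into core_id."""
        self._cores_by_doc[doc_id].add(core_id)

    def cores_for(self, doc_id):
        """Cores to send a delete to; empty if the document is unknown."""
        return set(self._cores_by_doc.get(doc_id, ()))

    def record_deleted(self, doc_id):
        """Forget a document once it has been deleted everywhere."""
        self._cores_by_doc.pop(doc_id, None)
```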

thanks
Renee




Re: is there any way to tell delete by query actually deleted anything?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/2/2015 2:24 PM, Renee Sun wrote:
> I have a sharded index. When I re-index a document (vs new index, which is
> different process), I need to delete the old one first to avoid dup. We all
> know that if there is only one core, the newly added document will replace
> the old one, but with multiple core indexes, we will have to issue delete
> command first to ALL shards since we do NOT know/remember which core the old
> document was indexed to ... 
>
> I also wanted to know if there is a better way handling this efficiently.
>
> Anyways, we are sending delete to all cores of this customer, one of them
> hit , others did not.
>
> But consequently, when I need to decide about commit, I do NOT want blindly
> commit to all cores, I want to know which one actually had the old doc so I
> only send commit to that core.
>
> I could alternatively use query first and skip if it did not hit, but delete
> if it does, and I can't short circuit since we have dups :-( based on a
> historical reason. 
>
> any suggestion how to make this more efficiently?

I have a sharded index too.  It is a more complicated sharding mechanism
than you would get in a default SolrCloud install (and my servers are
NOT running in cloud mode).  It's a hot/cold shard system, with one hot
shard and six cold shards.  Even though the shard that contains any
given document is *always* something that can be calculated according to
a configuration that changes at most once a day, I send all deletes to
every shard like you do.  Each batch of documents in the delete list
(currently set to a batch size of 500) is sent to each shard.

The deleteByQuery method on my Core class (this is a java program)
queries the Solr core to see if any documents are found.  If they are,
then the delete request is sent to Solr.  Any successful Solr update
operation (add, delete, etc) will set a "commit" flag in the class
instance, which is checked by the commit method.  When a commit is
requested on the Core class, if the flag is true, a commit is sent to
Solr.  If the commit succeeds, the flag is cleared.
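
The flag logic described above might look roughly like this, sketched in
Python rather than Java. The client object and its count, delete, and commit
methods are hypothetical stand-ins for whatever talks to one Solr core; this
is not the actual class.

```python
class Core:
    """Tracks whether one Solr core has changes that still need a commit."""

    def __init__(self, client):
        # client: hypothetical object exposing count(query), delete(query)
        # and commit(), each raising an exception on failure.
        self.client = client
        self.needs_commit = False

    def delete_by_query(self, query):
        """Query first; send the delete only if something matches."""
        found = self.client.count(query)
        if found > 0:
            self.client.delete(query)  # on failure the flag is left unchanged
            self.needs_commit = True
        return found

    def commit(self):
        """Send a commit only if an update succeeded since the last one."""
        if not self.needs_commit:
            return False
        self.client.commit()  # on failure the flag stays set for a retry
        self.needs_commit = False
        return True
```

With this shape, calling commit() on every core is cheap: cores that saw no
successful update skip the request entirely.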

Thanks,
Shawn


Re: is there any way to tell delete by query actually deleted anything?

Posted by Renee Sun <re...@mcafee.com>.
Hi Erick... as Shawn pointed out, I am not using SolrCloud; I am using a more
complicated, home-grown sharding scheme...

thanks for your response :-)
Renee




Re: is there any way to tell delete by query actually deleted anything?

Posted by Erick Erickson <er...@gmail.com>.
bq: I have a sharded index. When I re-index a document (vs new index, which is
different process), I need to delete the old one first to avoid dup

No, you do not need to issue the delete in a sharded collection
_assuming_ that the doc has the same <uniqueKey>. Why
do you think you do? If it's in some doc somewhere we need
to fix it.

Docs are routed by a hash on the <uniqueKey> in the default
case. So since it goes to the same shard, the fact that it's a
new version will be detected and it'll replace the old version.
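
A toy model of why this works, sketched in Python. It is not SolrCloud code:
each shard is just a dictionary keyed by the uniqueKey value, and CRC32 is a
stand-in hash chosen for illustration, not claimed to be the hash SolrCloud
uses. All names are hypothetical.

```python
import zlib


class ToyCollection:
    """A toy sharded collection; each shard is a dict keyed by uniqueKey."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, unique_key):
        # Stand-in hash for illustration only.
        return zlib.crc32(unique_key.encode("utf-8")) % len(self.shards)

    def add(self, unique_key, doc):
        # Same key -> same shard, so the old version is overwritten in place.
        self.shards[self.shard_for(unique_key)][unique_key] = doc

    def total_docs(self):
        return sum(len(shard) for shard in self.shards)
```

Adding the same key twice leaves one document in the collection, with no
explicit delete needed.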

Are you seeing anything different?

Best,
Erick

On Wed, Sep 2, 2015 at 1:24 PM, Renee Sun <re...@mcafee.com> wrote:
> Shawn,
> thanks for the reply.
>
> I have a sharded index. When I re-index a document (vs new index, which is
> different process), I need to delete the old one first to avoid dup. We all
> know that if there is only one core, the newly added document will replace
> the old one, but with multiple core indexes, we will have to issue delete
> command first to ALL shards since we do NOT know/remember which core the old
> document was indexed to ...
>
> I also wanted to know if there is a better way handling this efficiently.
>
> Anyways, we are sending delete to all cores of this customer, one of them
> hit , others did not.
>
> But consequently, when I need to decide about commit, I do NOT want blindly
> commit to all cores, I want to know which one actually had the old doc so I
> only send commit to that core.
>
> I could alternatively use query first and skip if it did not hit, but delete
> if it does, and I can't short circuit since we have dups :-( based on a
> historical reason.
>
> any suggestion how to make this more efficiently?
>
> thanks!
>
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776p4226788.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: is there any way to tell delete by query actually deleted anything?

Posted by Renee Sun <re...@mcafee.com>.
Shawn,
thanks for the reply.

I have a sharded index. When I re-index a document (as opposed to indexing a
new one, which is a different process), I need to delete the old one first to
avoid duplicates. We all know that with only one core, the newly added
document replaces the old one, but with multi-core indexes we have to issue
the delete command to ALL shards first, since we do NOT know or remember
which core the old document was indexed to...

I also wanted to know if there is a better way to handle this efficiently.

Anyway, we send the delete to all cores of this customer; one of them hits,
the others do not.

Consequently, when I need to decide about the commit, I do NOT want to
blindly commit to all cores. I want to know which one actually had the old
doc so I only send the commit to that core.

Alternatively, I could query first, skip the delete if there is no hit and
delete if there is one, but I can't short-circuit since we have dups :-( for
historical reasons.

Any suggestions on how to make this more efficient?
 
thanks!







Re: is there any way to tell delete by query actually deleted anything?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/2/2015 1:30 PM, Renee Sun wrote:
> Is there an easy way for me to get the actually deleted document number? I
> mean if the query did not hit any documents, I want to know that nothing got
> deleted. But if it did hit documents, i would like to know how many were
> delete...

I do this by issuing the same query that I plan to use for the delete,
before doing the delete.  If numFound is zero, I don't do the delete. 
Either way I know how many docs are getting deleted.  Since the program
that does this is the only thing updating the index, I know that the
info is completely accurate.
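
A minimal sketch of this query-then-delete approach, in Python. It assumes the
core exposes the usual /select and /update handlers and that /select can
return JSON via wt=json; the function names and the injectable fetch and post
callables are hypothetical, added so the logic can be exercised without a
running server. No commit is sent here, and the count is only trustworthy
under the condition stated above, that nothing else updates the index between
the query and the delete.

```python
import json
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def http_post_xml(url, xml_body):
    """POST an XML update message and return the response body as text."""
    req = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def count_matches(core_url, query, fetch=http_get):
    """Return numFound for the query without retrieving any documents."""
    params = urllib.parse.urlencode({"q": query, "rows": 0, "wt": "json"})
    body = fetch(f"{core_url}/select?{params}")
    return json.loads(body)["response"]["numFound"]


def delete_if_found(core_url, query, fetch=http_get, post=http_post_xml):
    """Delete by query only when the query matches; return the match count."""
    found = count_matches(core_url, query, fetch)
    if found > 0:
        # Escape the query so characters like & and < stay valid XML.
        xml_body = f"<delete><query>{escape(query)}</query></delete>"
        post(f"{core_url}/update", xml_body)
    return found
```

For example, delete_if_found("http://localhost:8080/solr/mycore",
"myfield:mycriteria") would return the number of documents that matched just
before the delete was sent, or 0 if the delete was skipped.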

Thanks,
Shawn


Re: is there any way to tell delete by query actually deleted anything?

Posted by Mark Ehle <ma...@gmail.com>.
Do a search with the same criteria before and after?

On Wed, Sep 2, 2015 at 3:30 PM, Renee Sun <re...@mcafee.com> wrote:

> I run this curl trying to delete some messages :
>
> curl
> 'http://localhost:8080/solr/mycore/update?commit=true&stream.body=
> <delete><id>abacd</id></delete>'
> | xmllint --format -
>
> or
>
> curl
> 'http://localhost:8080/solr/mycore/update?commit=true&stream.body=
> <delete><query>myfield:mycriteria</query></delete>'
> | xmllint --format -
>
> the results I got is like:
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time
> Current
>                                  Dload  Upload   Total   Spent    Left
> Speed
> 148   148    0   148    0     0  11402      0 --:--:-- --:--:-- --:--:--
> 14800
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">10</int>
>   </lst>
> </response>
>
> Is there an easy way for me to get the actually deleted document number? I
> mean if the query did not hit any documents, I want to know that nothing
> got
> deleted. But if it did hit documents, i would like to know how many were
> delete...
>
> thanks
> Renee
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>