Posted to solr-user@lucene.apache.org by Bharath Kumar <bh...@gmail.com> on 2016/08/09 04:14:46 UTC

Solr DeleteByQuery vs DeleteById

Hi All,

We are using Solr 6.1 and I wanted to know which is better to use:
deleteById or deleteByQuery?

We have a program that deletes 100,000 documents from Solr every 5 minutes,
in batches of 200. For that we currently call
deleteById(List<String> ids, 10000).
I wanted to know whether changing it to deleteByQuery(query, 10000), where
the query looks like (id:1 OR id:2 OR id:3 OR id:4), would have a
performance impact.
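
To make the two options concrete, here is a minimal sketch of the request bodies each approach would send through Solr's JSON update syntax. It is an illustration only, not the code from our program: the helper names (batches, delete_by_id_body, delete_by_query_body) are made up for this example, and it only builds the payloads. Posting them to the collection's update handler is left out.

```python
import json


def batches(ids, size=200):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def delete_by_id_body(ids):
    # In the JSON update syntax, a list under "delete" is a list of document ids.
    return json.dumps({"delete": [str(i) for i in ids]})


def delete_by_query_body(ids, commit_within_ms=10000):
    # Builds the OR-chain from the question: (id:1 OR id:2 OR ...).
    query = "(" + " OR ".join(f"id:{i}" for i in ids) + ")"
    return json.dumps({"delete": {"query": query, "commitWithin": commit_within_ms}})


# 100,000 ids in batches of 200, as described above.
all_ids = list(range(1, 100001))
chunks = list(batches(all_ids))

print(len(chunks))                         # number of delete requests per run
print(delete_by_id_body(chunks[0][:4]))    # delete-by-id payload
print(delete_by_query_body([1, 2, 3, 4]))  # delete-by-query payload
```

With batches of 200, each run issues 500 requests either way. The difference is in what Solr has to do with each one: the id list is applied directly, while the OR-chain has to be parsed and executed as a query.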

We use SolrCloud with 3 Solr nodes in the cluster, we have a similar setup
on the target site, and we use Cross Data Center Replication (CDCR) to
replicate from the main site.

Can you please let me know if using deleteByQuery will have any impact? I
see it opens a real-time searcher on all the nodes in the cluster.

-- 
Thanks & Regards,
Bharath MV Kumar

"Life is short, enjoy every moment of it"

Re: Solr DeleteByQuery vs DeleteById

Posted by Bharath Kumar <bh...@gmail.com>.
Hi Danny and Daniel,

Thank you so much for your inputs.

Actually we use deleteById, but because we need the CDCR solution to work
for us, we are having issues with it. deleteById logs a transaction in the
transaction log, and when that transaction is passed over to the target
site, the CDCR update processor is not able to process it.
The issue occurs when the unique key field "id" is of type long. If it is
of type "string", there are no problems. But we already have data in
production, and changing the schema would mean re-indexing. That is one of
the reasons we are thinking of using delete by query.

I opened a ticket in JIRA - https://issues.apache.org/jira/browse/SOLR-9394
as well.

On Tue, Aug 9, 2016 at 8:58 AM, Daniel Collins <da...@gmail.com>
wrote:

> Seconding that point, we currently do DBQ to "tidy" some of our collections
> and time-bound them (so running "delete anything older than X").  They have
> similar issues with reordering and blocking from time to time.

Re: Solr DeleteByQuery vs DeleteById

Posted by Daniel Collins <da...@gmail.com>.
Seconding that point, we currently do DBQ to "tidy" some of our collections
and time-bound them (so running "delete anything older than X").  They have
similar issues with reordering and blocking from time to time.
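
A time-bound cleanup like this could, in principle, follow the delete-by-id advice in this thread: first fetch the ids of the expired documents, then delete them by id. The sketch below is an assumption about how that might look, not something anyone in the thread reports running. It only builds the select parameters and the delete payload; the field name "timestamp", the cutoff date, and the helper names are hypothetical.

```python
import json
from datetime import datetime, timezone


def expired_ids_params(field, cutoff, rows=200, cursor="*"):
    """Select parameters that return only the ids of documents older than `cutoff`."""
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "q": f"{field}:[* TO {stamp}]",  # range query: everything up to the cutoff
        "fl": "id",                      # only the unique key is needed
        "sort": "id asc",                # cursor paging needs a sort on the unique key
        "rows": rows,
        "cursorMark": cursor,            # "*" starts a new cursor
    }


def delete_by_id_body(ids):
    # JSON update syntax: a list under "delete" is a list of document ids.
    return json.dumps({"delete": [str(i) for i in ids]})


# Hypothetical cutoff and field name, for illustration only.
cutoff = datetime(2016, 7, 10, tzinfo=timezone.utc)
params = expired_ids_params("timestamp", cutoff)
print(params["q"])
print(delete_by_id_body(["doc-1", "doc-2"]))
```

The trade-off is an extra read per page of ids in exchange for never sending a DBQ to the replicas.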

On 9 August 2016 at 14:20, danny teichthal <da...@gmail.com> wrote:

> Hi Bharath,
> I'm no expert, but we had some major problems because of deleteByQuery (DBQ
> for short).
> We ended up replacing all of our DBQs with deletes by id.
>
> My suggestion is that if you don't really need it, don't use it.
> Especially in your case: since you already know the set of ids, it is
> redundant to query for them.
>
> I don't know how CDCR works, but we have a replication factor of 2 on our
> SolrCloud cluster.
> Since Solr 5.x, DBQs were getting stuck for a long while on the replicas,
> blocking all updates.
> It appears that on the replica side there is an overhead of reordering and
> re-executing the same DBQ over and over again, for consistency reasons.
> The replica ends up buffering many delete-by-queries, which blocks all
> updates.
> In addition, there is another defect related to DBQ slowness: LUCENE-7049

Re: Solr DeleteByQuery vs DeleteById

Posted by danny teichthal <da...@gmail.com>.
Hi Bharath,
I'm no expert, but we had some major problems because of deleteByQuery (DBQ
for short).
We ended up replacing all of our DBQs with deletes by id.

My suggestion is that if you don't really need it, don't use it.
Especially in your case: since you already know the set of ids, it is
redundant to query for them.

I don't know how CDCR works, but we have a replication factor of 2 on our
SolrCloud cluster.
Since Solr 5.x, DBQs were getting stuck for a long while on the replicas,
blocking all updates.
It appears that on the replica side there is an overhead of reordering and
re-executing the same DBQ over and over again, for consistency reasons.
The replica ends up buffering many delete-by-queries, which blocks all
updates.
In addition, there is another defect related to DBQ slowness: LUCENE-7049




