Posted to solr-user@lucene.apache.org by RAUNAK AGRAWAL <ag...@gmail.com> on 2018/09/27 22:52:29 UTC

Solr Streaming Queries Performance Issues [v7.2.1]

Hi Guys,

Just to give you some context: we were using JSON Facets for analytical
queries in Solr, but they were slow, so we migrated our application to
Solr streaming facet queries.

For the last few days, however, we have observed that the streaming facet
response is slower than JSON facets. We have also increased the number of
documents in the collection (by 30%).

So I have a couple of questions:

1. When should I use JSON Facets and when should I use Solr streaming facets?
2. Solr streaming also comes with rollup. How is it different from
streaming facets?
3. Is there a way to debug queries in streaming mode? I tried
debug=true but it does not work with streaming queries.
4. When I don't specify a number of workers for streaming queries, do
all the shards of a collection become workers?

Looking forward to your reply.


Thanks and regards
Raunak
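
For readers comparing the two expression types in question 2, here is a
minimal sketch, not taken from the thread, that builds both forms as strings.
It assumes the field names from the sample query posted later in the thread
(week, sales, amount, days) and that those fields have docValues so the
/export handler can stream them. The helper names facet_expr and rollup_expr
are made up for illustration. As far as I know, facet() pushes the aggregation
down into the search engine, while rollup() aggregates a stream of tuples that
must already be sorted on the bucket field.

```python
def facet_expr(collection, query, bucket, metrics, limit=200):
    """Build a facet() expression: aggregation happens inside the search engine."""
    metric_list = ",".join(metrics)
    return (
        f'facet({collection},q="{query}",buckets="{bucket}",'
        f'bucketSorts="{bucket} desc",bucketSizeLimit={limit},{metric_list})'
    )


def rollup_expr(collection, query, bucket, fields, metrics):
    """Build a rollup() expression: it wraps a stream sorted on the bucket
    field (here an /export search) and aggregates tuples as they arrive."""
    field_list = ",".join([bucket] + fields)
    metric_list = ",".join(metrics)
    inner = (
        f'search({collection},q="{query}",fl="{field_list}",'
        f'sort="{bucket} desc",qt="/export")'
    )
    return f'rollup({inner},over="{bucket}",{metric_list})'


metrics = ["sum(sales)", "sum(amount)", "sum(days)"]
print(facet_expr("collection_name", "id:953", "week", metrics))
print(rollup_expr("collection_name", "id:953", "week",
                  ["sales", "amount", "days"], metrics))
```

Either string can be sent as the expr parameter to the collection's /stream
endpoint. Which one is faster depends on the data, so it is worth timing both.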

Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by RAUNAK AGRAWAL <ag...@gmail.com>.
Thanks a lot Erick for the documentation. I will go through it and get back
to you in case of any queries.

Regards,
Raunak


Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by Erick Erickson <er...@gmail.com>.
It Depends (tm). The behavior changed with Solr 7.5. Here are all the
gory details:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

and for 7.5+
https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/

Best,
Erick

Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by RAUNAK AGRAWAL <ag...@gmail.com>.
Thank you Joel. Looking forward to the latest version of solr.

Thanks


Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by Joel Bernstein <jo...@gmail.com>.
The facet expression is currently not as expressive as the JSON facet API.
So for very demanding use cases you can create a more highly tuned JSON facet
API call.

The good news is that we are working on this. We are also working on other
expressions that can be wrapped around the facet expression to implement
parallelism and scaling. We hope to have this ready for Solr 8, which is just
around the corner.



Joel Bernstein
http://joelsolr.blogspot.com/
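
As an illustration of the JSON facet alternative mentioned above, here is a
rough sketch, not from the thread, of a JSON Facet API request body that
approximates the facet expression posted in this thread. The function name
json_facet_request and the labels by_bucket and sum_<field> are made up for
the example. It assumes week is a field that a terms facet can bucket on.

```python
import json


def json_facet_request(query, bucket, numeric_fields, limit=200):
    """Build a JSON Request API body with a terms facet and sum() sub-facets."""
    # One aggregation per numeric field, e.g. {"sum_sales": "sum(sales)"}.
    sub_facets = {f"sum_{name}": f"sum({name})" for name in numeric_fields}
    return {
        "query": query,
        "limit": 0,  # only the facet buckets are wanted, not the documents
        "facet": {
            "by_bucket": {
                "type": "terms",
                "field": bucket,
                "limit": limit,
                "sort": "index desc",  # order buckets by value, descending
                "facet": sub_facets,
            }
        },
    }


body = json_facet_request("id:953", "week", ["sales", "amount", "days"])
print(json.dumps(body, indent=2))
```

The printed JSON would be posted with Content-Type application/json to the
collection's search handler. Further tuning is left out because it depends on
the schema and data.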



Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by RAUNAK AGRAWAL <ag...@gmail.com>.
Thanks a lot Toke. I will get back to you soon regarding the patch after
discussing it with the team.

Thanks & Regards



Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by Toke Eskildsen <to...@kb.dk>.
RAUNAK AGRAWAL <ag...@gmail.com> wrote:

> curl http://localhost:8983/solr/collection_name/stream -d 
> 'expr=facet(collection_name,q="id:953",bucketSorts="week 
> desc",buckets="week",bucketSizeLimit=200,sum(sales),
> sum(amount),sum(days))'

Stats on numeric fields then.

> Also in my collection, I have almost 10 Billion documents
> with many deletions (close to 40%).

Quite a lot of documents, and in this case deletions count, as the internal structures for the deleted documents still need to be iterated. In scale this looks somewhat like our 18 billion document setup, with the addendum that we use quite large segments (900GB).

The performance regressions we encountered with Solr 7 led to https://issues.apache.org/jira/browse/LUCENE-8374 which helped a lot (performance testing has not finished). If you have or can easily create a test server where your shard(s) are the same size as your production shards, I'd be happy to port the patch to Solr 7.2.1 to see if it helps. I am looking for independent verification, so it is no bother.

> I was planning to run optimise to merge the segments but
> spoke to the admin team and the Lucidworks guys and they were
> against it, saying that it will make a very large segment file.

If your bottleneck is the same as ours, the large segment would mean worse performance (with Solr 7).

> Is it true that optimise in solr should not be used, as it comes with other issues?

No simple answer there. If you have an index that you update very rarely, it can save memory and processing power. If you have a live index where you add and delete documents, it will probably be a bad idea. One strategy used with time series data is to have old and immutable data in dedicated collections, which can then be optimized.

- Toke Eskildsen
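
Since deleted documents matter here, this is a small sketch, not from the
thread, of one way to check the deleted fraction of a single core. It reads
numDocs and maxDoc from the Luke request handler (/admin/luke). The function
names are made up, and the core URL in the usage note is a placeholder. Luke
reports on one core only, so in a sharded collection the check would have to
be repeated per shard replica.

```python
import json
from urllib.request import urlopen


def deleted_fraction(num_docs, max_doc):
    """Share of index slots held by deleted documents.

    maxDoc counts live plus deleted documents, numDocs counts live ones only.
    """
    if max_doc == 0:
        return 0.0
    return (max_doc - num_docs) / max_doc


def core_deleted_fraction(core_url):
    """Fetch index statistics for one core from the Luke handler.

    core_url is assumed to look like http://host:8983/solr/<core_name>.
    """
    with urlopen(f"{core_url}/admin/luke?numTerms=0&wt=json") as resp:
        index = json.load(resp)["index"]
    return deleted_fraction(index["numDocs"], index["maxDoc"])
```

For example, core_deleted_fraction("http://localhost:8983/solr/core_name")
returns a value between 0 and 1. A core with 6 live documents out of a maxDoc
of 10 gives 0.4.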

Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by RAUNAK AGRAWAL <ag...@gmail.com>.
Hey Guys,

This is the sample query I am making:


curl http://localhost:8983/solr/collection_name/stream \
  -d 'expr=facet(collection_name,q="id:953",bucketSorts="week desc",buckets="week",bucketSizeLimit=200,sum(sales),sum(amount),sum(days))'


Also, my collection has almost 10 billion documents with many
deletions (close to 40%). I was planning to run optimise to merge the
segments, but I spoke to the admin team and the Lucidworks guys and they were
against it, saying that it will make a very large segment file. Is it true
that optimise in Solr should not be used, as it comes with other issues?

Thanks


Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by Toke Eskildsen <to...@kb.dk>.
On Thu, 2018-09-27 at 15:52 -0700, RAUNAK AGRAWAL wrote:
> But for the last few days, we have observed that the streaming facet
> response is slower than JSON facets. We have also increased the
> number of documents in the collection (by 30%).

Export performance goes down when segment size goes way up, so I would
expect streaming to do the same. I would not expect a 30% increase to
cause something serious on that account though. How many documents in
your index?

- Toke Eskildsen, Royal Danish Library


Re: Solr Streaming Queries Performance Issues [v7.2.1]

Posted by Joel Bernstein <jo...@gmail.com>.
Please post the Streaming Expression that you are using.


Joel Bernstein
http://joelsolr.blogspot.com/

