Posted to solr-user@lucene.apache.org by thiaga rajan <ec...@yahoo.co.in.INVALID> on 2017/06/01 22:04:49 UTC

Performance Issue in Streaming Expressions

We are working on a proposal, and we think the streaming API together with the export handler is the best fit for our use cases. We already have a structure in Solr on which we use graph queries to produce a hierarchy. We now need to join that structure with a few more collections.

We have 5 collections:

Collection 1 - 800k records
Collection 2 - 200k records
Collection 3 - 7k records
Collection 4 - 6 million records
Collection 5 - 150k records

We are using the following strategy:

innerJoin(intersect(innerJoin(collection 1, collection 2), innerJoin(collection 3, collection 4)), collection 5)

Performance becomes too slow once collection 4 is involved. With only collections 1, 2 and 5, the results come back in 2 seconds. As soon as I include collection 4 in the query, I see a performance impact. I believe exporting the large result set from collection 4 is causing the issue.

Currently I am using a single-shard collection with no replicas. My first option would be to increase memory, since processing docValues needs more memory. If that does not work, I can try parallel streams or sharding. Please advise: is there anything else I am missing?

Re: Performance Issue in Streaming Expressions

Posted by Joel Bernstein <jo...@gmail.com>.

Once you've scaled up the export from collection4 you can test the
performance of the join by moving the NullStream around the join.

parallel(null(innerJoin(collection 3, collection4)))

Again you'll want to test with different numbers of workers and replicas to
see where you max out performance of the join.
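
A fuller version of that join test might look like the sketch below. The field names (id, joinKey), the worker collection name (workerCollection), the worker count, and the zkHost address are placeholders and do not come from this thread. It assumes both collections share a joinKey field with docValues enabled, that both searches go through the /export handler, and that both sides are sorted and partitioned on the join key so that each worker joins matching slices.

```
parallel(workerCollection,
         null(innerJoin(search(collection3,
                               q="*:*",
                               fl="id,joinKey",
                               sort="joinKey asc",
                               qt="/export",
                               partitionKeys="joinKey"),
                        search(collection4,
                               q="*:*",
                               fl="id,joinKey",
                               sort="joinKey asc",
                               qt="/export",
                               partitionKeys="joinKey"),
                        on="joinKey")),
         workers="4",
         zkHost="localhost:9983",
         sort="nullCount desc")
```

Rerunning the same expression with different values of workers, and with more replicas of collection4, should show where the join stops getting faster.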


Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 2, 2017 at 10:25 AM, Joel Bernstein <jo...@gmail.com> wrote:

> innerJoin(intersect(innerJoin(collection1, collection2),
>                                innerJoin(collection 3, collection4)),
>                 collection5)
>
> Let's focus on:
>
> innerJoin(collection 3, collection4)
>
> The first thing to focus on is how fast is the export from collection4.
> You can test this with the NullStream with the following construct:
>
> null(search(collection4))
>
> The null stream will eat all the tuples and report back timing
> information. This will isolate the performance of the export from
> collection4.
>
> Once you have a baseline for how fast you can export from a single node,
> you can test with parallel export from a single node:
>
> parallel(null(search(collection4)))
>
> Then you can add replicas for collection4 and increase workers.

Re: Performance Issue in Streaming Expressions

Posted by Joel Bernstein <jo...@gmail.com>.

innerJoin(intersect(innerJoin(collection1, collection2),
                    innerJoin(collection3, collection4)),
          collection5)

Let's focus on:

innerJoin(collection3, collection4)

The first thing to look at is how fast the export from collection4 is. You
can test this with the NullStream using the following construct:

null(search(collection4))

The null stream will eat all the tuples and report back timing information.
This will isolate the performance of the export from collection4.

Once you have a baseline for how fast you can export from a single node,
you can test with parallel export from a single node:

parallel(null(search(collection4)))

Then you can add replicas for collection4 and increase workers.
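
With parameters filled in, the two export tests above might look roughly like the sketch below. The field names (id, joinKey), the worker collection name (workerCollection), the worker count, and the zkHost address are placeholders, not values from this thread. The first expression measures a plain export; the second adds partitionKeys so that each worker pulls its own slice of the tuples.

```
null(search(collection4,
            q="*:*",
            fl="id,joinKey",
            sort="joinKey asc",
            qt="/export"))

parallel(workerCollection,
         null(search(collection4,
                     q="*:*",
                     fl="id,joinKey",
                     sort="joinKey asc",
                     qt="/export",
                     partitionKeys="joinKey")),
         workers="4",
         zkHost="localhost:9983",
         sort="nullCount desc")
```

The fields listed in fl and sort need docValues for the /export handler to stream them.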

Joel Bernstein
http://joelsolr.blogspot.com/

Re: Performance Issue in Streaming Expressions

Posted by Susmit Shukla <sh...@gmail.com>.

Hi,

Which version of Solr are you on?
Increasing memory may not help, as the streaming API does not keep data in
memory (except maybe for hash joins).
Increasing replicas (not sharding) and pushing the join computation onto a
worker Solr cluster with #workers > 1 would definitely make things faster.
Are you limiting your results at some cutoff? If yes, then SOLR-10698
<https://issues.apache.org/jira/browse/SOLR-10698> can be a useful fix. Also,
the binary response format for streaming would be faster (probably available
in 6.5).


