Posted to users@solr.apache.org by Sergio García Maroto <ma...@gmail.com> on 2023/05/09 09:11:42 UTC

streaming expressions - sharding memory usage

Hi,

I am currently working on implementing sharding on our Solr Cloud
cluster. The main idea is to be able to scale horizontally.

At the moment, without sharding, we have all collections sitting on all
servers. We also have some pretty heavy streaming expressions returning
many ids, on average about 300,000 ids to join.

After doing sharding I see a huge increase in CPU and memory usage,
making queries way slower compared to not sharding.

I guess that's expected because the joins need to send data across servers
over the network.

Any thoughts on best practices here? I guess a possible approach is to
split into more shards.

Regards
Sergio

Re: streaming expressions - sharding memory usage

Posted by Joel Bernstein <jo...@gmail.com>.
Unfortunately Solr right now doesn't have a great answer for the kind of
bulk extract with scoring that you're doing. The export handler is designed
for bulk extract but doesn't score. The select handler is designed for top
N retrieval with scoring. I'm surprised that single shard bulk extract with
scoring performs as well as it does.
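The tradeoff between the two handlers can be sketched in Python (toy data and function names, not Solr internals): a select-style request keeps only the score-ordered top N, while an export-style request streams the entire sorted result set and never computes scores.

```python
import heapq

# Toy corpus of (doc_id, score) pairs, standing in for a scored query result.
docs = [("d%03d" % i, ((i * 37) % 100) / 100.0) for i in range(1000)]

def select_style(docs, rows):
    """Top-N retrieval with scoring: result size is bounded by `rows`."""
    return heapq.nlargest(rows, docs, key=lambda d: d[1])

def export_style(docs, sort_key):
    """Bulk extract in a docValues sort order: the whole result set
    streams out, but nothing is scored or ranked by relevance."""
    return sorted(docs, key=sort_key)  # stand-in for the streaming sort

top10 = select_style(docs, rows=10)
full = export_style(docs, sort_key=lambda d: d[0])
print(len(top10), len(full))  # 10 vs the full 1000-doc result set
```

Bulk extract *with* scoring needs both properties at once, which is exactly the gap described above.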


Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, May 10, 2023 at 10:34 AM Sergio García Maroto <ma...@gmail.com>
wrote:


Re: streaming expressions - sharding memory usage

Posted by Sergio García Maroto <ma...@gmail.com>.
Thanks Joel for your answer.

Actually what I need is to return scores from different collections and
make some calculations on those scores to retrieve people at the end.
Let me show you a more complex sample. This is really fast with all
collections on the same servers but very slow once sharding takes
place.

select(
  select(
    rollup(
      innerJoin(
        select(
          search(person, q="((((SmartSearchS:"madrid [$CU] [$PRJ] [$REC] "~100)^4 OR (SmartSearchS:"madrid [$CU] [$PRJ] [$RECL] "~100)^3 OR (SmartSearchS:"madrid [$CU] [$PRJ] "~100)^2) OR (((SmartSearchS:(madrid*))) OR ((SmartSearchS:("madrid")))^3)) AND ((*:* -StatusSFD:("\*\*\*System Delete\*\*\*")) AND type_level:(parent)))", fl="PersonIDDoc,score", sort="PersonIDDoc desc,score desc", rows="2147483647"),
          PersonIDDoc,
          score as PersonScore),
        select(
          rollup(
            search(contactupdate, q="(((UpdateTextS:(cto)) AND IsAttachToPerson:(true)) AND (((AssignConfidentialS:(false) OR (AssignAuthorisedEmplIdsS:((cc)))))))", fl="PersonIDDoc,score", sort="PersonIDDoc desc,score desc", rows="2147483647"),
            over=PersonIDDoc, sum(score), sum(PersonIDDoc)),
          PersonIDDoc,
          add(0,0) as DocumetScore,
          sum(score) as ContactUpdateScore),
        on="PersonIDDoc"),
      over=PersonIDDoc, sum(PersonScore), sum(ContactUpdateScore), count(PersonIDDoc)),
    PersonIDDoc,
    sum(ContactUpdateScore) as ContactUpdateScore,
    sum(PersonScore) as PersonScoreTotal,
    count(PersonIDDoc) as TokenCount),
  PersonIDDoc, TokenCount,
  add(mult(PersonScoreTotal,0.6), mult(ContactUpdateScore,0.4)) as CalculatedScore)


I know it's a pretty complex query, but it takes about 3 seconds on a
single shard and basically explodes with sharding.
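The innerJoin here is a merge join over two streams pre-sorted on the join key, which is why both sides sort on PersonIDDoc: each side is consumed in order and the join advances whichever stream is behind. A rough single-process Python sketch of that merge step (toy tuples, ascending order for simplicity, duplicate keys not handled; not Solr's actual implementation), which with sharding has to run over partial streams shipped across the network:

```python
def merge_join(left, right, key=lambda t: t[0]):
    """Inner-join two streams already sorted ascending on the join key,
    advancing whichever side holds the smaller key."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    out = []
    while l is not None and r is not None:
        if key(l) < key(r):
            l = next(left, None)
        elif key(l) > key(r):
            r = next(right, None)
        else:  # keys match: emit the joined tuple, advance both sides
            out.append((key(l),) + l[1:] + r[1:])
            l, r = next(left, None), next(right, None)
    return out

person = [(1, 0.9), (2, 0.5), (4, 0.7)]    # (PersonIDDoc, PersonScore)
contact = [(2, 1.2), (3, 0.1), (4, 0.3)]   # (PersonIDDoc, ContactUpdateScore)
print(merge_join(person, contact))         # [(2, 0.5, 1.2), (4, 0.7, 0.3)]
```

Every id that might match has to reach the join, which is why a 300,000-id stream per side gets expensive once it no longer stays on one server.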

Does this mean I can't achieve this behaviour with sharded collections?

Regards



On Wed, 10 May 2023 at 16:17, Joel Bernstein <jo...@gmail.com> wrote:


Re: streaming expressions - sharding memory usage

Posted by Joel Bernstein <jo...@gmail.com>.
So the first thing I see is that you're doing a search using the select
handler, which is required to sort by score. So in this scenario you will
run into deep paging issues as you increase the number of rows. This will
affect both memory and performance. A search using the export handler will
improve throughput as you add shards, without any memory penalty, but it
doesn't support scoring.
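The deep-paging penalty can be sketched with toy numbers (an illustration, not Solr's actual merge code): for a scored top-N, every shard has to return its own top `rows` candidates, and the coordinator holds and re-sorts roughly rows × shards entries, so the cost grows with both the row count and the shard count.

```python
import heapq

def distributed_top_n(shards, rows):
    """Each shard returns its local top-`rows` by score; the coordinator
    merges all the candidates to pick the global top-`rows`."""
    candidates = []
    for shard in shards:
        candidates.extend(heapq.nlargest(rows, shard))
    merged_count = len(candidates)  # what the coordinator must hold
    return heapq.nlargest(rows, candidates), merged_count

shard_a = [0.91, 0.15, 0.64, 0.33]   # per-doc scores on shard A
shard_b = [0.72, 0.88, 0.05, 0.41]   # per-doc scores on shard B
top, merged = distributed_top_n([shard_a, shard_b], rows=3)
print(top, merged)  # global top-3 built by merging 6 candidates
```

With rows="2147483647", as in the bigger expression, that merged candidate set is the whole result set from every shard at once.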

In this case 1000 rows is not that many docs, so I'm surprised the penalty
is so high. But you will definitely run into large memory and performance
penalties if you start pulling larger result sets.

Can you describe the exact use case you need to accomplish? For example, do
you need to extract a large number of documents by joining streams of
scored data? Or can you display just the top N documents of the joined
streams?




Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, May 10, 2023 at 6:47 AM Sergio García Maroto <ma...@gmail.com>
wrote:


Re: streaming expressions - sharding memory usage

Posted by Sergio García Maroto <ma...@gmail.com>.
Sure. Let's start by the simplest stream expression.
This one only targets person collection.

*Stream Expression:*
search(person,
  q="((((SmartSearchS:"france [$CU] [$PRJ] [$REC] "~100)^4 OR (SmartSearchS:"france [$CU] [$PRJ] [$RECL] "~100)^3 OR (SmartSearchS:"france [$CU] [$PRJ] "~100)^2) OR (((SmartSearchS:(france*))) OR ((SmartSearchS:("france")))^3)) AND ((*:* -StatusSFD:("\*\*\*System Delete\*\*\*")) AND type_level:(parent)))",
  fl="PersonIDDoc,score",
  sort="score desc,PersonIDDoc desc",
  rows="1000")

*Schema*
<field name="PersonIDDoc" type="string" indexed="true" stored="true"
docValues="true" />

*No sharding*
*1 shard, 45.38GB, with 64,348,740 docs*
stream expression time: 660 ms

*Sharding*
*2 shards, 23GB each*
stream expression time: 4000 ms
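For reproducing timings like these outside an application, an expression can be sent straight to the collection's /stream handler. A minimal Python sketch that only builds the request (host and collection are placeholders, and the query is a cut-down version of the one above); actually sending it requires a running cluster:

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"   # placeholder host
collection = "person"

# Cut-down version of the expression above, just for illustration.
expr = ('search(person, q="SmartSearchS:france", '
        'fl="PersonIDDoc,score", sort="score desc,PersonIDDoc desc", '
        'rows="1000")')

url = f"{SOLR}/{collection}/stream"   # the streaming-expression endpoint
body = urlencode({"expr": expr})      # POST body: expr=<urlencoded expression>
print(url)
```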



On Wed, 10 May 2023 at 04:45, Joel Bernstein <jo...@gmail.com> wrote:


Re: streaming expressions - sharding memory usage

Posted by Joel Bernstein <jo...@gmail.com>.
Can you share the expressions? Then we can discuss where the sharding comes
into play.


Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, May 9, 2023 at 1:17 PM Sergio García Maroto <ma...@gmail.com>
wrote:
