Posted to users@solr.apache.org by Jens Viebig <je...@vitec.com> on 2021/04/28 17:12:28 UTC

join with big 2nd collection

Hi List,

We have a join performance issue and are not sure in which direction we should look to solve it.
We currently only have a single-node setup.

We have 2 collections that we join in queries, on a "primary key" string field contentId_s.
Each dataset for a single contentId_s consists of multiple timecode-based documents in both indexes, which makes this a many-to-many join.

collection1 - contains generic metadata and timecode-based content (think timecode-based comments):
Documents: 382,872
Unique contentId_s: 16,715
~160 MB size
single shard

collection2 - contains timecode-based GPS data (GPS position, field of view, etc.; its timecodes are not related to the timecodes in collection1, so flattening the structure would blow up the number of documents enormously):
Documents: 695,887,875
Unique contentId_s: 10,199
~300 GB size
single shard

Hardware is an HP DL360 with 32 GB of RAM (we also tried a machine with 64 GB without much improvement) and a 1 TB SSD for the index.

In our use case there is a lot of indexing/deletion traffic on both indexes and only a few queries fired against the server.

We are constantly indexing new content and deleting old documents. This was already getting problematic with HDDs, so we switched to SSDs;
indexing speed is fine for now (we might also need to scale this up in the future to allow more throughput).

But search speed suffers when we need to join with the big collection2 (up to 30 seconds for the query to complete). We had some success experimenting with score join queries when the collection2 side only returns a few unique IDs, but we can't predict that this is always the case, and if a lot of documents are hit in collection2,
performance is 10x worse than with the original normal join.

Sample queries look like this (simplified, but more complex queries are not much slower):

Sample 1:
query: coll1field:someval OR {!join from=contentId_s to=contentId_s fromIndex=collection2 v='coll2field:someval'}
filter: {!collapse field=contentId_s min=timecode_f}

Sample 2:
query: coll1field:someval
filter: {!join from=contentId_s to=contentId_s fromIndex=collection2 v='coll2field:otherval'}
filter: {!collapse field=contentId_s min=timecode_f}
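
For reference, the score join mentioned above uses the same join qparser with a score local parameter added; a sketch based on Sample 1 (field names as above; score=none selects the score-join implementation without contributing to ranking):

query: coll1field:someval OR {!join score=none from=contentId_s to=contentId_s fromIndex=collection2 v='coll2field:someval'}
filter: {!collapse field=contentId_s min=timecode_f}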


I experimented with first running the query on collection2 alone, only to get the number of documents (collapsing on contentId_s), to see how many results we get so we could choose the right join query; but with many hits in collection2 this takes almost the same time as doing the join, so slow queries would get even slower.
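
That pre-count probe against collection2 looked roughly like this (a sketch; the exact parameters are an assumption). With the collapse filter in place, numFound reports the number of distinct contentId_s values hit:

query: coll2field:someval
filter: {!collapse field=contentId_s}
rows: 0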

Caches also don't seem to help much, since almost every query fired is different and the index usually changes between requests anyway.

We are open to anything: adding nodes/hardware/shards, changing the index structure...
Currently we don't know how to get around the big join.

Any advice on which direction we should look?

Re: join with big 2nd collection

Posted by Joel Bernstein <jo...@gmail.com>.
Your understanding of the join is correct, it's all done locally in each
shard. Adding more servers should improve performance, as each local join
will be smaller. The memory footprint of each join is small, but there is a
fixed overhead for caches. Adding replicas will not affect memory and will
lighten the load, as multiple queries can be spread across more replicas.
You'll also want to warm the join field by using a static warming query
following each commit and new searcher. You can warm the caches by faceting
on the join field in the static warming query. The warming query can simply
match a single document; the important part is that the facet code runs,
which loads the cache. Having more shards and dedicated hardware will make
the warming go faster as well, because fewer documents will be involved in
loading the caches.
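
A sketch of what such a static warming query could look like in solrconfig.xml (the single-document query term is hypothetical; the important bit is the facet on the join field):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- hypothetical: any query that matches a single document -->
      <str name="q">id:some-known-doc-id</str>
      <str name="rows">0</str>
      <!-- faceting on the join field forces its structures to load -->
      <str name="facet">true</str>
      <str name="facet.field">contentId_s</str>
    </lst>
  </arr>
</listener>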

Joel Bernstein
http://joelsolr.blogspot.com/


Re: join with big 2nd collection

Posted by Jens Viebig <je...@vitec.com>.
Hi Joel,

Thanks for the hint. I can confirm that joins are much faster with this optimization.
I also tried the same collection with 8 shards vs. a single shard, all on a single Solr node.
Sharding also seems to improve search speed in our scenario, even with a single node.
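
For anyone wanting to try the same, a hypothetical Collections API call for an 8-shard collection on a single node could look like this (the collection name is made up; on Solr 8.x, maxShardsPerNode must be raised so all shards can land on one node):

http://localhost:8983/solr/admin/collections?action=CREATE&name=combined&numShards=8&replicationFactor=1&maxShardsPerNode=8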

I think we will experiment with adding more Solr nodes with replicas next.

Does anyone have insights into how the join is done internally?
From my understanding, every shard does the join on its own and returns joined results that are then combined into the overall result; is this correct?
Under this assumption, can we expect that adding more nodes/servers will improve query speed?
We also found that we needed to add more heap memory, otherwise we got an OOM in some cases.
Will memory requirements change for a single node when we add more nodes with replicas?

Best Regards
Jens

Re: join with big 2nd collection

Posted by Joel Bernstein <jo...@gmail.com>.
Here is the jira that describes it:

https://issues.apache.org/jira/browse/SOLR-15049

The description and comments don't convey the full performance impact
of this optimization. When you have a large number of join keys, the impact
of this optimization is massive.


Joel Bernstein
http://joelsolr.blogspot.com/


Re: join with big 2nd collection

Posted by Joel Bernstein <jo...@gmail.com>.
If you are using the latest version of Solr, there is a new optimized self
join which is very fast. To use it you must join within the same core and use
the same join field (the to and from field must be the same). The
optimization should kick in on its own with the join qparser plugin and
will be orders of magnitude faster for large joins.
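
Applied to the queries from earlier in the thread, a same-core join over the single combined collection (with the doctype_s marker described in the April 30 message below) might look roughly like this; the optimization requires from and to to be the same field, as here:

query: doctype_s:col1 AND coll1field:someval
filter: {!join from=contentId_s to=contentId_s v='doctype_s:col2 AND coll2field:otherval'}
filter: {!collapse field=contentId_s min=timecode_f}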



Joel Bernstein
http://joelsolr.blogspot.com/


Re: join with big 2nd collection

Posted by Jens Viebig <je...@vitec.com>.
Tried something today which seems promising.

I have put all documents from both cores into the same collection, sharded into 8 shards, routing the documents so that all documents with the same contentId_s end up on the same shard.
To distinguish between document types we use a string field with an identifier: doctype_s:col1 / doctype_s:col2.
(Btw., what would be the best data type for a doc-type identifier that is fast to filter on?)
It seems the join inside the same core is a) much more efficient and b) works with a sharded index.
We are currently still running this on a single instance and get reasonable response times (~1 sec), which would be OK for us and is a big improvement over the old state.
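
One way to get this per-contentId co-location (the routing approach actually used here isn't stated, and the ids below are made up) is the default compositeId router, which hashes on the part of the document id before the "!" separator:

id: ABC123!comment-0001    contentId_s: ABC123    doctype_s: col1
id: ABC123!gps-0000001     contentId_s: ABC123    doctype_s: col2

All documents sharing the "ABC123!" prefix are placed on the same shard.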

- Why is the join so much faster? Is it because of the sharding, or also because of the same core?
- How can we expect this to scale when adding more documents (probably by adding Solr instances/shards/replicas on additional servers)? Doubling/tripling/... the number of docs.
- Would you expect query times to improve with additional servers and Solr instances?
- What would be the best data type for a doc-type identifier that is fast to filter on, to distinguish between different document types in the same collection?

What I don't like about this solution is that we lose the ability to completely reindex a single "document type". For example, collection1 was pretty fast to completely reindex and its schema possibly changes more often, while collection2 is "index once / delete after x days" and is heavy to reindex.

Best Regards
Jens
