Posted to users@solr.apache.org by Matt Kuiper <ku...@gmail.com> on 2021/06/30 17:59:23 UTC

Aligning Shards from different Collections on the same Solr server based on Date Range

Hi Solr Group,

I am not sure the following is a viable use case, so I welcome input and any
implementation recommendations.

I would like to perform joins over two sharded collections, where docs are
routed to specific shards based on a date range and the date ranges are the
same for the corresponding shards in each collection.

I understand that this means that the replicas from each collection that
hold data to be joined need to be co-located on the same Solr server.  I
have read solutions that use ADDREPLICA to add a Collection B replica to
all Solr servers, assuming Collection B has only one shard.  For my use case
I need Collection B to have multiple shards.

Collection A     Collection B     Solr server
Shard1_2020      Shard1_2020      172.33.0.1:8983_solr
Shard2_2021      Shard2_2021      172.33.0.2:8983_solr
Shard3_2022      Shard3_2022      172.33.0.3:8983_solr

I think my question comes down to this: how do I partition shards by date
range, and do it in a way that both Collections A and B are defined by the
same date ranges?  If I could reliably partition shards by date, and know the
date range of each shard, I think I could use the ADDREPLICA API to align them.

I am not sure a compositeId routing approach would work, and I suspect the
implicit router may be hard to manage over time.
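
To make the implicit-router idea concrete, here is a minimal sketch in Python
that only builds Collections API URLs; it does not call Solr. It assumes both
collections are created with router.name=implicit, the same named shards, and
createNodeSet=EMPTY, and that each shard is then placed with ADDREPLICA. The
collection names, the shard_label routing field, and the base URL are
hypothetical; the shard and node names come from the table above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; any live node in the cluster would do.
SOLR = "http://172.33.0.1:8983/solr"

# Shard name -> node that should host it (from the table above).
PLACEMENT = {
    "Shard1_2020": "172.33.0.1:8983_solr",
    "Shard2_2021": "172.33.0.2:8983_solr",
    "Shard3_2022": "172.33.0.3:8983_solr",
}


def create_url(collection, route_field="shard_label"):
    """CREATE a collection with named shards and no replicas yet.

    route_field is a hypothetical document field holding the target
    shard name, which is what the implicit router uses for routing.
    """
    params = {
        "action": "CREATE",
        "name": collection,
        "router.name": "implicit",
        "router.field": route_field,
        "shards": ",".join(PLACEMENT),
        "createNodeSet": "EMPTY",  # place replicas explicitly afterwards
    }
    return f"{SOLR}/admin/collections?{urlencode(params)}"


def add_replica_urls(collection):
    """One ADDREPLICA call per shard, pinned to the node in PLACEMENT."""
    return [
        f"{SOLR}/admin/collections?"
        + urlencode({"action": "ADDREPLICA", "collection": collection,
                     "shard": shard, "node": node})
        for shard, node in PLACEMENT.items()
    ]


for coll in ("collectionA", "collectionB"):
    print(create_url(coll))
    for url in add_replica_urls(coll):
        print(url)
```

Because both collections reuse the same PLACEMENT map, the shard for a given
year ends up on the same node in each collection. The maintenance cost is that
a new shard has to be added to both collections by hand for each new date
range.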

Is an approach like this viable?  I am a bit concerned about maintenance.
Are there other ideas to support this join?

Note: I am considering this within Time series collections...

Matt

Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Posted by Matt Kuiper <ku...@gmail.com>.

Thanks Joel!  I will give this a try.  That is quite a performance boost.

Matt

On Tue, Jul 13, 2021 at 9:14 AM Joel Bernstein <jo...@gmail.com> wrote:

> The optimized join was added in Solr 8.8:
> https://issues.apache.org/jira/browse/SOLR-15049
>
> It kicks in when you use the join qparser plugin in the following scenario:
>
> 1) Do not specify a fromIndex. This is because the to and from index are
> the same.
> 2) The to and from fields are the same.
> 3) The join method is topLevelDV.
>
> {!join to=store_id from=store_id method=topLevelDV}
>
> If you do this with Solr 8.8+ you get the effect of SOLR-15049. It is a
> massive performance improvement. In my testing it was 7000 times faster
> than the standard join parser plugin for larger joins.

Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Posted by Joel Bernstein <jo...@gmail.com>.

The optimized join was added in Solr 8.8:
https://issues.apache.org/jira/browse/SOLR-15049

It kicks in when you use the join qparser plugin in the following scenario:

1) Do not specify a fromIndex. This is because the to and from index are
the same.
2) The to and from fields are the same.
3) The join method is topLevelDV.

{!join to=store_id from=store_id method=topLevelDV}
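
As a rough illustration of the three conditions above, the sketch below builds
request parameters for such a self join in Python. Only the {!join ...} local
params come from this thread; the rec_type values ("store", "sale") and the
region field are hypothetical, and nothing is sent to Solr.

```python
from urllib.parse import urlencode


def self_join_params(join_field, from_query, main_query="*:*"):
    """Build /select parameters for a same-collection join.

    No fromIndex is given, `from` and `to` name the same field, and
    method=topLevelDV is requested: the three conditions listed above.
    """
    join = (
        f"{{!join from={join_field} to={join_field} method=topLevelDV}}"
        f"{from_query}"
    )
    return {"q": main_query, "fq": join}


# Hypothetical example: sale records whose store is in the "west" region.
params = self_join_params(
    "store_id",
    from_query="rec_type:store AND region:west",
    main_query="rec_type:sale",
)
print(params["fq"])
print("/select?" + urlencode(params))
```

Putting the join in fq rather than q is a design choice in this sketch, not
something the thread prescribes.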

If you do this with Solr 8.8+ you get the effect of SOLR-15049. It is a
massive performance improvement. In my testing it was 7000 times faster
than the standard join parser plugin for larger joins.

Joel Bernstein
http://joelsolr.blogspot.com/


On Mon, Jul 12, 2021 at 1:34 PM Matt Kuiper <ku...@gmail.com> wrote:

> Hi Joel,
>
> I reviewed a few options with my team, and your recommendation is at the
> top of the list.  I believe it will work for our use case.
>
> You mentioned that if this approach worked, you would be willing to share
> more details on an "optimized self join."
>
> I would enjoy hearing more.
>
> Thanks,
> Matt

Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Posted by Matt Kuiper <ku...@gmail.com>.

Hi Joel,

I reviewed a few options with my team, and your recommendation is at the
top of the list.  I believe it will work for our use case.

You mentioned that if this approach worked, you would be willing to share
more details on an "optimized self join."

I would enjoy hearing more.

Thanks,
Matt

On Fri, Jul 9, 2021 at 9:36 AM Joel Bernstein <jo...@gmail.com> wrote:

> Block join is another option. If that works for you, from an indexing
> standpoint, it's the most performant query time join.
>
> If block indexing doesn't work for you then the optimized self join is
> almost as fast.

Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Posted by Joel Bernstein <jo...@gmail.com>.

Block join is another option. If it works for you from an indexing
standpoint, it's the most performant query-time join.

If block indexing doesn't work for you then the optimized self join is
almost as fast.
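
For comparison, a minimal sketch of what block indexing plus a block join
parent query might look like, written in Python. It assumes the
_childDocuments_ JSON form for nested documents; the field names (rec_type,
region, amount) and id scheme are hypothetical, and the code only builds the
payload and query string without contacting Solr.

```python
import json


def nested_store(store_id, region, sales):
    """Return one parent document with its child documents attached.

    Parent and children are sent together in one update so that they
    are indexed as a single block.
    """
    return {
        "id": f"store-{store_id}",
        "rec_type": "store",
        "region": region,
        "_childDocuments_": [
            {"id": f"store-{store_id}-sale-{n}", "rec_type": "sale", "amount": amt}
            for n, amt in enumerate(sales, start=1)
        ],
    }


doc = nested_store(7, "west", [19.99, 5.00])
print(json.dumps([doc], indent=2))

# Block join parent query: stores that have at least one sale of 10 or more.
# `which` should match all parent documents and only parent documents.
parent_query = "{!parent which=rec_type:store}amount:[10 TO *]"
print(parent_query)
```

The indexing-side cost is visible here: a store and all of its sales have to
be assembled into one nested document before the update is sent.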


Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Jul 9, 2021 at 11:31 AM Matt Kuiper <ku...@gmail.com> wrote:

> Thanks Joel!
>
> On my list is to investigate Block Joins and Nested Child docs.
>
>
> https://solr.apache.org/guide/8_8/other-parsers.html#block-join-query-parsers
>
>
> https://solr.apache.org/guide/8_8/indexing-nested-documents.html#indexing-nested-documents
>
> However, it looks like you are not suggesting using nested docs, but
> specifying a type field to differentiate between types of docs and then a
> join field.  Not having to build nested docs prior to updates would be an
> advantage.  And it makes sense that the join field would allow for reliable
> routing to the appropriate shard for both doc types.
>
> I will take a further look and see if this approach will work, and get back
> if more info is needed on the optimized self join.
>
> Thanks again,
> Matt

Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Posted by Matt Kuiper <ku...@gmail.com>.

Thanks Joel!

On my list is to investigate Block Joins and Nested Child docs.

https://solr.apache.org/guide/8_8/other-parsers.html#block-join-query-parsers

https://solr.apache.org/guide/8_8/indexing-nested-documents.html#indexing-nested-documents

However, it looks like you are not suggesting using nested docs, but
specifying a type field to differentiate between types of docs and then a
join field.  Not having to build nested docs prior to updates would be an
advantage.  And it makes sense that the join field would allow for reliable
routing to the appropriate shard for both doc types.

I will take a further look and see if this approach will work, and get back
if more info is needed on the optimized self join.

Thanks again,
Matt


On Fri, Jul 9, 2021 at 7:01 AM Joel Bernstein <jo...@gmail.com> wrote:

> Can you solve this problem by adding all documents into the same collection
> and performing self joins? You could add a field called rec_type to
> differentiate between the records.
>
> There are two good reasons for wanting to do this.
>
> 1) This allows you to route by the join key and easily co-locate records.
>
> 2) There is an optimized self join which is extremely fast that you could
> take advantage of if you did this.
>
> Let me know if this might be an option for you and we can discuss the
> optimized self join in more detail.
>
> Joel

Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Posted by Joel Bernstein <jo...@gmail.com>.

Can you solve this problem by adding all documents into the same collection
and performing self joins? You could add a field called rec_type to
differentiate between the records.

There are two good reasons for wanting to do this.

1) This allows you to route by the join key and easily co-locate records.

2) There is an optimized self join which is extremely fast that you could
take advantage of if you did this.
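
Point 1 could look roughly like the following sketch, assuming the collection
uses the compositeId router, where the part of the id before "!" is hashed to
pick the shard. The store_id join key, the rec_type values, and the id scheme
are hypothetical.

```python
def routed_id(join_key, rec_type, local_id):
    """Build a compositeId-style id: '<route key>!<unique part>'."""
    return f"{join_key}!{rec_type}-{local_id}"


def make_doc(join_key, rec_type, local_id, **fields):
    """Both record types carry the same join field and route prefix."""
    return {
        "id": routed_id(join_key, rec_type, local_id),
        "store_id": join_key,
        "rec_type": rec_type,
        **fields,
    }


docs = [
    make_doc("store42", "store", 1, region="west"),
    make_doc("store42", "sale", 1001, amount=19.99),
    make_doc("store42", "sale", 1002, amount=5.00),
]
for d in docs:
    print(d["id"])
```

Since every record for store42 shares the same route prefix, the store record
and its sale records land on the same shard, which is what a per-shard self
join needs.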

Let me know if this might be an option for you and we can discuss the
optimized self join in more detail.

Joel

Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Jul 2, 2021 at 6:28 PM Matt Kuiper <ku...@gmail.com> wrote:

> After some research, it appears the following approach may help in this
> situation and relieve the requirement of co-locating indexes for joins.  It
> appears one drawback may be the types of fields supported for the join
> field.
>
> https://solr.apache.org/guide/8_8/other-parsers.html#cross-collection-join
>
> Matt

Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Posted by Matt Kuiper <ku...@gmail.com>.

After some research, it appears the following approach may help in this
situation and relieve the requirement of co-locating indexes for joins.  It
appears one drawback may be the types of fields supported for the join field.

https://solr.apache.org/guide/8_8/other-parsers.html#cross-collection-join
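
A minimal sketch of how such a query might be assembled in Python, assuming
the method="crossCollection" form of the join parser described on that page.
The collection names, the join_key field, and the year filter are hypothetical,
and the code only builds the filter string.

```python
def cross_collection_join(from_index, from_field, to_field, from_query="*:*"):
    """Return an fq string that joins against another collection.

    from_query is evaluated in from_index; matching from_field values
    are used to filter to_field in the collection being queried.
    """
    return (
        f'{{!join method="crossCollection" fromIndex="{from_index}" '
        f'from="{from_field}" to="{to_field}"}}{from_query}'
    )


# Hypothetical example: query collectionA, restricted to join keys that
# match 2021 documents in collectionB.
fq = cross_collection_join("collectionB", "join_key", "join_key", "year:2021")
params = {"q": "*:*", "fq": fq}
print(params["fq"])
```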

Matt
