
Posted to solr-user@lucene.apache.org by Nicolas Paris <ni...@riseup.net> on 2019/10/15 22:58:09 UTC

Solr-Cloud, join and collection collocation

Hi,

I have several large collections that cannot fit in a standalone Solr
instance. They are split over multiple shards in SolrCloud mode.

Those collections are supposed to be joined to another collection to
retrieve a subset. Because I am using distributed collections, I am not
able to use the Solr join feature.

For this reason, I denormalize the information by adding the joined
collection's data to every collection. Naturally, when I want to update
the joined collection, I have to update every one of the distributed
collections.

In standalone mode, I would only have to update the joined collection.

I wonder if there is a way to overcome this limitation, for example by
replicating the joined collection to every shard, or by some other
method I am not aware of.

Any thoughts?
-- 
nicolas

Re: Solr-Cloud, join and collection collocation

Posted by Nicolas Paris <ni...@riseup.net>.

> Note: adding score=none as a local param switches to another
> algorithm, driven by the from side of the join.

Indeed, with the score=none local param, the query time is correlated
with the size of the joined collection subset. For a subset of 100k
documents, the query time is 1 second; for 1M it is 4 seconds; and I
get a client timeout (15 sec) for anything above 5M.

On this basis, I guess some redesign will be necessary to find a good
balance between normalization and denormalization for the
insertion/selection speed trade-off.

Thanks



-- 
nicolas
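
The timings reported above can be turned into a rough throughput figure. The sketch below uses only the two measured points from this message (100k documents in 1 second, 1M documents in 4 seconds) and the 15-second client timeout. The estimate for 5M documents is an extrapolation that assumes the throughput at 5M is no better than at 1M; it is not a measurement.

```python
# Measured points from the message: (subset size in documents, query time in seconds).
measurements = [(100_000, 1.0), (1_000_000, 4.0)]
client_timeout_s = 15.0

# Throughput at each measured point.
for subset_size, seconds in measurements:
    print(subset_size, "docs:", subset_size / seconds, "docs/s")

# Assumption: the best observed throughput also holds at 5M documents.
best_rate = max(size / secs for size, secs in measurements)
estimate_5m = 5_000_000 / best_rate
print("estimated seconds for 5M:", estimate_5m)
print("exceeds timeout:", estimate_5m > client_timeout_s)
```

Under that assumption the estimate lands above the client timeout, which is consistent with the timeouts reported for subsets larger than 5M.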

Re: Solr-Cloud, join and collection collocation

Posted by Mikhail Khludnev <mk...@apache.org>.

Note: adding score=none as a local param switches to another algorithm,
driven by the from side of the join.



-- 
Sincerely yours
Mikhail Khludnev
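
As a minimal sketch of what adding the local param looks like, the snippet below builds a join filter query with and without score=none. The collection name `joined`, the field names `id` and `joined_id`, and the filter `codes:42` are hypothetical placeholders, not taken from this thread; only the `{!join ...}` local-param syntax with `from`, `to`, `fromIndex` and `score` is Solr's own.

```python
from urllib.parse import urlencode

def join_filter(from_field, to_field, from_index, from_query, score=None):
    """Return a Solr join filter query string, optionally with a score local param."""
    local_params = ["from=" + from_field, "fromIndex=" + from_index, "to=" + to_field]
    if score is not None:
        local_params.append("score=" + score)
    return "{!join " + " ".join(local_params) + "}" + from_query

# Hypothetical names: "joined" collection, "id"/"joined_id" fields, "codes:42" filter.
default_fq = join_filter("id", "joined_id", "joined", "codes:42")
score_none_fq = join_filter("id", "joined_id", "joined", "codes:42", score="none")

print(default_fq)
print(score_none_fq)
# URL-encoded parameters, ready to append to a /select request.
print(urlencode({"q": "*:*", "fq": score_none_fq}))
```

Timing both variants against the same subset sizes would be one way to see which algorithm suits a given data set.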

Re: Solr-Cloud, join and collection collocation

Posted by Nicolas Paris <ni...@riseup.net>.

Sadly, the join performance is poor.
The joined collection has 12M documents, and queries take 6k ms
versus 60 ms with the denormalized field.

Apparently, the performance does not change when the filter on the
joined collection is changed. It is still 6k ms whether the subset is
12M documents or 1 document in size. So the join performance looks
correlated with the size of the joined collection and not with the
kind of filter applied to it.

I will explore streaming expressions.


-- 
nicolas

Re: Solr-Cloud, join and collection collocation

Posted by Nicolas Paris <ni...@riseup.net>.

> You can certainly replicate the joined collection to every shard. It
> must fit in one shard and a replica of that shard must be co-located
> with every replica of the “to” collection.

Yes, I found this in the documentation, with a clear example, just after
this mail. I will test it today. I also read your blog post about join
performance [1], and I suspect the performance impact of joins will be
huge because the joined collection is about 10M documents (only two
fields: a unique id and an array of longs, with a filter applied to the
array; the join key is 10M unique IDs).

> Have you looked at streaming and “streaming expressions”? It does not
> have the same problem, although it does have its own limitations.

I have never tested them, and I am not very comfortable yet with how to
test them. Is it possible to mix query parsers and streaming expressions
in the client call via HTTP parameters, or can streaming expressions
only be applied programmatically?

[1] https://lucidworks.com/post/solr-and-joins/


-- 
nicolas
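
On the HTTP question: a streaming expression is plain text passed in the `expr` parameter of a collection's /stream request handler, so it can be sent by any HTTP client. The sketch below only builds such a URL and does not contact a server. The host, the collection names `big_collection` and `joined`, the field names and the filter are hypothetical. The expression assumes both sides are sorted on the join key and that the fields are usable with the /export handler.

```python
from urllib.parse import urlencode

def build_stream_url(base_url, collection, expr):
    """Return the URL that submits a streaming expression to a collection's /stream handler."""
    return base_url.rstrip("/") + "/" + collection + "/stream?" + urlencode({"expr": expr})

# Hypothetical collections and fields; innerJoin expects both streams sorted on the join key.
expr = (
    'innerJoin('
    'search(big_collection, q="*:*", fl="id,joined_id", sort="joined_id asc", qt="/export"),'
    'search(joined, q="codes:42", fl="id", sort="id asc", qt="/export"),'
    'on="joined_id=id")'
)

url = build_stream_url("http://localhost:8983/solr", "big_collection", expr)
print(url)
```

The `q` parameters inside each `search(...)` accept ordinary query syntax, which is where a query parser can be combined with the expression.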

Re: Solr-Cloud, join and collection collocation

Posted by Erick Erickson <er...@gmail.com>.

You can certainly replicate the joined collection to every shard. It must fit in one shard, and a replica of that shard must be co-located with every replica of the “to” collection.

Have you looked at streaming and “streaming expressions”? It does not have the same problem, although it does have its own limitations.

Best,
Erick
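
The co-location requirement described above can be checked mechanically. The sketch below works on a hand-written description of which collections have a replica on which node; the node and collection names are hypothetical, and in practice this layout would come from the cluster state rather than a literal dictionary.

```python
def nodes_missing_joined_replica(layout, to_collection, joined_collection):
    """Return nodes that host the 'to' collection but have no replica of the joined collection."""
    missing = []
    for node, collections in layout.items():
        if to_collection in collections and joined_collection not in collections:
            missing.append(node)
    return sorted(missing)

# Hypothetical layout: node name -> set of collections with a replica on that node.
layout = {
    "node1": {"big_collection", "joined"},
    "node2": {"big_collection"},
    "node3": {"joined"},
}

# node2 hosts the "to" collection without the joined one, so a join there would fail.
print(nodes_missing_joined_replica(layout, "big_collection", "joined"))  # ['node2']
```

An empty result means every node that serves the "to" collection also holds a replica of the single-shard joined collection.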
