Posted to solr-user@lucene.apache.org by mganeshs <mg...@live.in> on 2017/06/29 14:44:32 UTC

Allow Join over two sharded collection

All,

Any idea when this ticket <https://issues.apache.org/jira/browse/SOLR-8297>
will be addressed?

https://issues.apache.org/jira/browse/SOLR-8297

One of the comments says it will be done by Solr 7.0. Can we expect it by 7.0?

Regards,



--
View this message in context: http://lucene.472066.n3.nabble.com/Allow-Join-over-two-sharded-collection-tp4343443.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Allow Join over two sharded collection

Posted by Damien Kamerman <da...@gmail.com>.

Joins will work with shards as long as the docs you're joining from/to are
in the same shard. Why not use compositeId routing (either ID=uniqueKey!docId
or router.field)? Is there no 'uniqueKey' that will distribute randomly? You
may need to put the same ACL docs in all shards, depending on your use case.
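
A minimal sketch of the compositeId idea described above, assuming a
hypothetical prefix ("tenantA") and hypothetical field names ("type",
"doc_ref", "acl_principals") that do not come from this thread. With the
compositeId router, documents whose IDs share the part before "!" are routed
to the same shard, which is what a join needs.

```python
def composite_id(shard_key: str, doc_id: str) -> str:
    """Build a compositeId-style uniqueKey of the form '<shard_key>!<doc_id>'."""
    if "!" in shard_key:
        raise ValueError("shard key must not contain '!'")
    return f"{shard_key}!{doc_id}"


def shard_prefix(unique_key: str) -> str:
    """Return the routing prefix (the part before the first '!')."""
    return unique_key.split("!", 1)[0]


# Hypothetical documents: the real document and its ACL document use the
# same prefix, so the compositeId router co-locates them.
real_doc = {
    "id": composite_id("tenantA", "doc42"),
    "type": "content",
}
acl_doc = {
    "id": composite_id("tenantA", "acl42"),
    "type": "acl",
    "doc_ref": real_doc["id"],            # assumed join field
    "acl_principals": ["user7", "groupX"],  # assumed multi-valued field
}

print(real_doc["id"], acl_doc["id"])
```

How evenly this spreads the indexing load depends on how many distinct
prefixes there are and how documents are distributed among them.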

On 30 June 2017 at 12:57, mganeshs <mg...@live.in> wrote:

> Hi Erick,
>
> Initially I also thought of using Streaming for Joins. But looks like Joins
> with Streaming is not for heavy QPS sort of queries and that's my use case.
> Currently things are working fine with normal join for us as we have only
> one shard. But in coming days number of documents to be indexed is going to
> be increased drastically. So we need to split shards. The time I split
> shards I can't use Joins.
>
> We thought of going with Implict routing for sharding. But if we go with
> Implicit routing, all indexing will not be distributed and so one shard
> could be getting more load which we don't want.
> So we badly looking for default Join.
> As I have posted in different questions in this forum itself and you too
> have replied.... our joins are between real documents and it's ACL
> documents. ACL document has multi value field whose value would be user or
> groups. Why we want to keep ACL separately instead of keeping it in same
> real document itself. It's because that our ACL can grow till 1L of users
> or
> even more. and for every change in ACL or its permission we don't want to
> re-index the real document as well.
>
> Do you think is there any better alternative ? or the way we have kept ACLs
> are wrong ?
>
> Regards,
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Allow-Join-over-two-sharded-collection-tp4343443p4343582.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Allow Join over two sharded collection

Posted by mganeshs <mg...@live.in>.

Hi Erick,

Initially I also thought of using Streaming for joins. But it looks like joins
with Streaming are not meant for high-QPS queries, and that is my use case.
Currently the normal join works fine for us because we have only one shard.
But in the coming days the number of documents to be indexed is going to
increase drastically, so we need to split shards. Once I split shards, I
can't use joins.

We thought of going with implicit routing for sharding. But with implicit
routing, indexing will not be distributed evenly, so one shard could get
more load than the others, which we don't want.
So we badly need the join to work with default routing.
As I have posted in other questions on this forum (and you have replied),
our joins are between real documents and their ACL documents. An ACL document
has a multi-valued field whose values are users or groups. Why do we keep the
ACL separately instead of in the real document itself? Because our ACL can
grow to 1 lakh (100,000) users or even more, and we don't want to re-index
the real document for every change in the ACL or its permissions.

Do you think there is any better alternative? Or is the way we have kept the
ACLs wrong?

Regards,



--
View this message in context: http://lucene.472066.n3.nabble.com/Allow-Join-over-two-sharded-collection-tp4343443p4343582.html
Sent from the Solr - User mailing list archive at Nabble.com.
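
The ACL layout described in the message above can be sketched as follows.
This is an illustration under assumptions, not the poster's actual schema:
the field names "doc_ref" and "acl_principals" are hypothetical. The filter
uses Solr's join query parser ({!join from=... to=...}), and the update uses
Solr's atomic-update "add" modifier, which touches only the ACL document and
leaves the real document alone.

```python
import json


def join_filter(principal: str) -> str:
    """Filter query: keep real docs whose 'id' is referenced by an ACL doc
    that lists the given user or group (field names are assumptions)."""
    return f"{{!join from=doc_ref to=id}}acl_principals:{principal}"


def grant_access_update(acl_doc_id: str, principal: str) -> str:
    """JSON body for an atomic update that adds one principal to the ACL
    document only; the real document is not re-indexed."""
    return json.dumps(
        [{"id": acl_doc_id, "acl_principals": {"add": [principal]}}]
    )


print(join_filter("user7"))
print(grant_access_update("acl42", "user9"))
```

The filter string would be passed as an fq parameter and the JSON body posted
to the collection's update handler. Atomic updates have their own schema
requirements, and principals containing special characters would need
escaping, which this sketch omits.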

Re: Allow Join over two sharded collection

Posted by Erick Erickson <er...@gmail.com>.

Probably won't be in 7.0. In fact it appears to have lost momentum so
I don't know if it'll ever be committed. Don't know that it _won't_,
but there's no way to say.

There's been a lot of work in the Solr Streaming world to do joins and
it's quite possible that that'll do what you need.

Best,
Erick
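
For readers unfamiliar with the streaming joins mentioned above, here is a
rough sketch of what an innerJoin streaming expression might look like for an
ACL-style join. The collection name, field names, and queries are all
hypothetical. Both inner streams are sorted on the join key, as innerJoin
expects, and use the /export handler.

```python
def acl_join_expression(collection: str, principal: str) -> str:
    """Build an innerJoin streaming expression (all names are assumptions)."""
    docs = (
        f'search({collection}, q="type:content", '
        f'fl="id,title", sort="id asc", qt="/export")'
    )
    acls = (
        f'search({collection}, q="acl_principals:{principal}", '
        f'fl="doc_ref", sort="doc_ref asc", qt="/export")'
    )
    # Left field = right field: real doc 'id' matches ACL doc 'doc_ref'.
    return f'innerJoin({docs}, {acls}, on="id=doc_ref")'


print(acl_join_expression("docs", "user7"))
```

The resulting string would be sent as the expr parameter to the collection's
/stream handler. Whether this performs acceptably at high query rates is a
separate question that the thread itself raises.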

On Thu, Jun 29, 2017 at 7:44 AM, mganeshs <mg...@live.in> wrote:
> All,
>
> Any idea when this  ticket <https://issues.apache.org/jira/browse/SOLR-8297>
> will be addressed.
>
> https://issues.apache.org/jira/browse/SOLR-8297
>
> One of the comments says by SOLR 7.0. Can we expect that by 7.0 ?
>
> Regards,
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Allow-Join-over-two-sharded-collection-tp4343443.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Allow Join over two sharded collection

Posted by "Glick, David" <Da...@ibi.com>.

Unsubscribe 

Sent from my iPhone

> On Jul 1, 2017, at 8:02 PM, Susheel Kumar <su...@gmail.com> wrote:
> 
> Depending on your use case people also use collection aliasing for time
> series data.  See below
> 
> https://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
> 
>> On Sat, Jul 1, 2017 at 7:13 PM, Susheel Kumar <su...@gmail.com> wrote:
>> 
>> As Erick said, 1M docs/month isn't a big deal.  I have 45+ million docs in one
>> shard but YMMV depending on other factors.
>> 
>> Also there is lot of confusion in the terminology. The default routing is
>> compositeID routing.  The implicit routing which Eric mentioned is the
>> manual routing.  https://issues.apache.org/jira/browse/SOLR-6630
>> 
>> Which routing you are suggesting to use? Can you clarify again.  Also
>> what's your exact use case.  Do you query old aged documents or you don't
>> need to and most or all of your queries are supposed to go to shard with
>> newer documents.
>> 
>> Thanks,
>> Susheel
>> 
>> On Sat, Jul 1, 2017 at 12:14 PM, Erick Erickson <er...@gmail.com>
>> wrote:
>> 
>>> 1M docs/month shouldn't make Solr break a sweat. If it really worries
>>> you and you're indexing in a big batch, index during off hours. At
>>> very worst, if you're ingesting them all at once you might have to
>>> throttle the indexing a bit.
>>> 
>>> Frankly, most of the time acquiring the documents from the system of
>>> record is where the bottleneck is and Solr easily handles the indexing
>>> load.
>>> 
>>> The other advantage is that if you use implicit routing rather than a
>>> composite ID, you can add shards to your collection one at a time as
>>> required, for time-series data that's an elegant way to "age out" old
>>> documents.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Sat, Jul 1, 2017 at 8:57 AM, mganeshs <mg...@live.in> wrote:
>>>> Hi Susheel,
>>>> 
>>>> Currently we have around 20M documents already and we are expecting now
>>> on
>>>> that every month 1M of documents.
>>>> The reason why don't want to for time based implicit routing is that,
>>> all
>>>> documents will end up with recent shard and so indexing will be heavy
>>> for
>>>> the new shard, where as older shards will be used just for query
>>> purpose.
>>>> If we have default sharding, then load for indexing is distributed
>>> across
>>>> all the shards. That's the reason we would like to stick to default
>>>> sharding. But Join is the issue over here when default sharding is used
>>> :-(
>>>> 
>>>> 
>>>> 
>>>> --
>>>> View this message in context: http://lucene.472066.n3.nabble
>>> .com/Allow-Join-over-two-sharded-collection-tp4343443p4343803.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>> 
>> 
>>

Re: Allow Join over two sharded collection

Posted by Susheel Kumar <su...@gmail.com>.

Depending on your use case, people also use collection aliasing for
time-series data. See below:

https://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
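
As a rough illustration of collection aliasing, the sketch below builds a
Collections API CREATEALIAS request. The base URL, alias name, and per-month
collection names are hypothetical. Queries sent to the alias are served by
all the collections it lists.

```python
from urllib.parse import urlencode


def create_alias_url(base_url: str, alias: str, collections: list[str]) -> str:
    """URL for the Collections API CREATEALIAS action."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical per-month collections behind one alias.
print(create_alias_url(
    "http://localhost:8983/solr",
    "docs_recent",
    ["docs_2017_06", "docs_2017_07"],
))
```

Re-issuing CREATEALIAS with a new collection list redefines the alias, which
is how a time window can be rolled forward.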

On Sat, Jul 1, 2017 at 7:13 PM, Susheel Kumar <su...@gmail.com> wrote:

> As Erick said, 1M docs/month isn't a big deal.  I have 45+ million docs in one
> shard but YMMV depending on other factors.
>
> Also there is lot of confusion in the terminology. The default routing is
> compositeID routing.  The implicit routing which Eric mentioned is the
> manual routing.  https://issues.apache.org/jira/browse/SOLR-6630
>
> Which routing you are suggesting to use? Can you clarify again.  Also
> what's your exact use case.  Do you query old aged documents or you don't
> need to and most or all of your queries are supposed to go to shard with
> newer documents.
>
> Thanks,
> Susheel
>
> On Sat, Jul 1, 2017 at 12:14 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> 1M docs/month shouldn't make Solr break a sweat. If it really worries
>> you and you're indexing in a big batch, index during off hours. At
>> very worst, if you're ingesting them all at once you might have to
>> throttle the indexing a bit.
>>
>> Frankly, most of the time acquiring the documents from the system of
>> record is where the bottleneck is and Solr easily handles the indexing
>> load.
>>
>> The other advantage is that if you use implicit routing rather than a
>> composite ID, you can add shards to your collection one at a time as
>> required, for time-series data that's an elegant way to "age out" old
>> documents.
>>
>> Best,
>> Erick
>>
>> On Sat, Jul 1, 2017 at 8:57 AM, mganeshs <mg...@live.in> wrote:
>> > Hi Susheel,
>> >
>> > Currently we have around 20M documents already and we are expecting now
>> on
>> > that every month 1M of documents.
>> > The reason why don't want to for time based implicit routing is that,
>> all
>> > documents will end up with recent shard and so indexing will be heavy
>> for
>> > the new shard, where as older shards will be used just for query
>> purpose.
>> > If we have default sharding, then load for indexing is distributed
>> across
>> > all the shards. That's the reason we would like to stick to default
>> > sharding. But Join is the issue over here when default sharding is used
>> :-(
>> >
>> >
>> >
>> > --
>> > View this message in context: http://lucene.472066.n3.nabble
>> .com/Allow-Join-over-two-sharded-collection-tp4343443p4343803.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>

Re: Allow Join over two sharded collection

Posted by Erick Erickson <er...@gmail.com>.

This doesn't appear to be being actively pursued, so it's anybody's guess.

Depending on your use-case, the streaming capabilities may be an
OOB solution.

Best,
Erick

On Wed, Feb 6, 2019 at 1:22 AM mganeshs <mg...@live.in> wrote:
>
> All,
>
> Any idea, whether this will be taken care or addressed in near future ?
>
> https://issues.apache.org/jira/browse/SOLR-8297
>
> Regards,
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Allow Join over two sharded collection

Posted by mganeshs <mg...@live.in>.

All, 

Any idea whether this will be addressed in the near future?

https://issues.apache.org/jira/browse/SOLR-8297

Regards,




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Allow Join over two sharded collection

Posted by Susheel Kumar <su...@gmail.com>.

How are you planning to route manually? What key(s) are you thinking of using?

Second, the link I shared was about collection aliasing, and if you use that,
you will end up with multiple collections. I just want to clarify, since you
said above "...manual routing and creating alias".

Again, until the join feature is available across shards, you can
continue with one shard (and replicas if needed). 20M + 1M per month
shouldn't be a big deal.

Thanks,
Susheel
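
A quick back-of-the-envelope check of the growth figures quoted above. The
45M target is simply the single-shard size Susheel reports elsewhere in this
thread; it is an anecdotal data point, not a Solr limit.

```python
def months_until(current_docs: int, monthly_growth: int, target_docs: int) -> int:
    """Whole months until the index reaches target_docs at a constant rate."""
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    remaining = max(0, target_docs - current_docs)
    return -(-remaining // monthly_growth)  # ceiling division


# 20M docs today, 1M new docs per month, compared against a 45M-doc shard.
print(months_until(20_000_000, 1_000_000, 45_000_000))  # 25
```

At the stated rate, a single shard would take roughly two years to reach that
size, assuming growth stays constant.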

On Mon, Jul 3, 2017 at 11:16 PM, mganeshs <mg...@live.in> wrote:

> Hi Susheel,
>
> To make use of Joins only option is I should go for manual routing. If I go
> for manual routing based on time, we miss the power of distributing the
> load
> while indexing. It will end up with all indexing happens in newly created
> shard, which we feel this will not be efficient approach and degrades the
> performance of indexing as we have lot of jvms running, but still all
> indexing going to one single shard for indexing and we are also expecting
> 1M+ docs per month in coming days.
>
> For your question on whether we will query old aged document... ? Mostly we
> won't query old aged documents. With querying pattern, it's clear we should
> go for manual routing and creating alias. But when it comes to indexing, in
> order to distribute the load of indexing, we felt default routing is the
> best option, but Join will not work. And that's the reason for asking when
> this feature will be in place ?
>
> Regards,
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Allow-Join-over-two-sharded-collection-tp4343443p4344098.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Allow Join over two sharded collection

Posted by mganeshs <mg...@live.in>.

Hi Susheel,

To make use of joins, my only option is manual routing. If I go
for manual routing based on time, we lose the benefit of distributing the load
while indexing. All indexing would happen in the newly created
shard, which we feel is not an efficient approach and would degrade indexing
performance: we have a lot of JVMs running, yet all
indexing would go to one single shard, and we are also expecting
1M+ docs per month in the coming days.

As for your question on whether we will query older documents: mostly we
won't. Given the query pattern, it's clear we should
go for manual routing and create an alias. But when it comes to indexing, we
felt default routing is the best option for distributing the load, yet the
join will not work with it. That's the reason for asking when
this feature will be in place.

Regards,



--
View this message in context: http://lucene.472066.n3.nabble.com/Allow-Join-over-two-sharded-collection-tp4343443p4344098.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Allow Join over two sharded collection

Posted by Susheel Kumar <su...@gmail.com>.

As Erick said, 1M docs/month isn't a big deal.  I have 45+ million docs in one
shard, but YMMV depending on other factors.

Also, there is a lot of confusion in the terminology. The default routing is
compositeId routing.  The implicit routing that Erick mentioned is the
manual routing.  https://issues.apache.org/jira/browse/SOLR-6630

Which routing are you suggesting to use? Can you clarify again?  Also,
what's your exact use case?  Do you query older documents, or do most or all
of your queries go to the shard with newer documents?

Thanks,
Susheel

On Sat, Jul 1, 2017 at 12:14 PM, Erick Erickson <er...@gmail.com>
wrote:

> 1M docs/month shouldn't make Solr break a sweat. If it really worries
> you and you're indexing in a big batch, index during off hours. At
> very worst, if you're ingesting them all at once you might have to
> throttle the indexing a bit.
>
> Frankly, most of the time acquiring the documents from the system of
> record is where the bottleneck is and Solr easily handles the indexing
> load.
>
> The other advantage is that if you use implicit routing rather than a
> composite ID, you can add shards to your collection one at a time as
> required, for time-series data that's an elegant way to "age out" old
> documents.
>
> Best,
> Erick
>
> On Sat, Jul 1, 2017 at 8:57 AM, mganeshs <mg...@live.in> wrote:
> > Hi Susheel,
> >
> > Currently we have around 20M documents already and we are expecting now
> on
> > that every month 1M of documents.
> > The reason why don't want to for time based implicit routing is that, all
> > documents will end up with recent shard and so indexing will be heavy for
> > the new shard, where as older shards will be used just for query purpose.
> > If we have default sharding, then load for indexing is distributed across
> > all the shards. That's the reason we would like to stick to default
> > sharding. But Join is the issue over here when default sharding is used
> :-(
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.
> nabble.com/Allow-Join-over-two-sharded-collection-tp4343443p4343803.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Allow Join over two sharded collection

Posted by Erick Erickson <er...@gmail.com>.

1M docs/month shouldn't make Solr break a sweat. If it really worries
you and you're indexing in a big batch, index during off hours. At
very worst, if you're ingesting them all at once you might have to
throttle the indexing a bit.

Frankly, most of the time acquiring the documents from the system of
record is where the bottleneck is and Solr easily handles the indexing
load.

The other advantage is that if you use implicit routing rather than a
composite ID, you can add shards to your collection one at a time as
required. For time-series data that's an elegant way to "age out" old
documents.

Best,
Erick
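
The "add shards one at a time" approach can be sketched with the Collections
API. The actions CREATE, CREATESHARD, and DELETESHARD are real; the base URL,
collection name, shard names, and routing field below are hypothetical, and a
real CREATE call may need further parameters (for example a configset name).

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"  # assumed Solr base URL


def collections_api(params: dict) -> str:
    """Build a Collections API URL from a parameter dict."""
    return f"{BASE}/admin/collections?{urlencode(params)}"


def create_implicit_collection(name: str, shards: list[str], route_field: str) -> str:
    """Create a collection that uses the implicit (manual) router."""
    return collections_api({
        "action": "CREATE",
        "name": name,
        "router.name": "implicit",
        "shards": ",".join(shards),
        "router.field": route_field,
    })


def create_shard(collection: str, shard: str) -> str:
    """Add a new named shard, e.g. one per month."""
    return collections_api(
        {"action": "CREATESHARD", "collection": collection, "shard": shard}
    )


def delete_shard(collection: str, shard: str) -> str:
    """Drop an old shard to age out its documents."""
    return collections_api(
        {"action": "DELETESHARD", "collection": collection, "shard": shard}
    )


print(create_implicit_collection("docs", ["2017_07"], "month_s"))
print(create_shard("docs", "2017_08"))
print(delete_shard("docs", "2016_07"))
```

With router.field set, each document is routed to the shard named by the
value of that field, so new documents land only in the newest shard. That is
exactly the indexing hot spot the original poster is worried about.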

On Sat, Jul 1, 2017 at 8:57 AM, mganeshs <mg...@live.in> wrote:
> Hi Susheel,
>
> Currently we have around 20M documents already and we are expecting now on
> that every month 1M of documents.
> The reason why don't want to for time based implicit routing is that, all
> documents will end up with recent shard and so indexing will be heavy for
> the new shard, where as older shards will be used just for query purpose.
> If we have default sharding, then load for indexing is distributed across
> all the shards. That's the reason we would like to stick to default
> sharding. But Join is the issue over here when default sharding is used :-(
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Allow-Join-over-two-sharded-collection-tp4343443p4343803.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Allow Join over two sharded collection

Posted by mganeshs <mg...@live.in>.

Hi Susheel,

Currently we have around 20M documents, and from now on we are expecting
1M documents every month.
The reason we don't want to go for time-based implicit routing is that all
new documents will end up in the most recent shard, so indexing will be heavy
for the new shard, whereas older shards will be used just for queries.
With default sharding, the indexing load is distributed across
all the shards. That's the reason we would like to stick to default
sharding. But the join is the issue here when default sharding is used :-(



--
View this message in context: http://lucene.472066.n3.nabble.com/Allow-Join-over-two-sharded-collection-tp4343443p4343803.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Allow Join over two sharded collection

Posted by Susheel Kumar <su...@gmail.com>.

How many documents do you have currently, and how many will there be after
growing drastically?

Either you can add hardware and keep one shard until the joins are fully
available,
or
you can shard and distribute using the compositeId router. Even though one or
more shards may get a higher load, that's still better than having
just one single shard/node taking all the load, right?

On Fri, Jun 30, 2017 at 2:19 AM, Mikhail Khludnev <mk...@apache.org> wrote:

> probably in November or December.
>
> On Thu, Jun 29, 2017 at 5:44 PM, mganeshs <mg...@live.in> wrote:
>
> > All,
> >
> > Any idea when this  ticket <https://issues.apache.org/
> > jira/browse/SOLR-8297>
> > will be addressed.
> >
> > https://issues.apache.org/jira/browse/SOLR-8297
> >
> > One of the comments says by SOLR 7.0. Can we expect that by 7.0 ?
> >
> > Regards,
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.
> > nabble.com/Allow-Join-over-two-sharded-collection-tp4343443.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>

Re: Allow Join over two sharded collection

Posted by Mikhail Khludnev <mk...@apache.org>.

probably in November or December.

On Thu, Jun 29, 2017 at 5:44 PM, mganeshs <mg...@live.in> wrote:

> All,
>
> Any idea when this  ticket <https://issues.apache.org/
> jira/browse/SOLR-8297>
> will be addressed.
>
> https://issues.apache.org/jira/browse/SOLR-8297
>
> One of the comments says by SOLR 7.0. Can we expect that by 7.0 ?
>
> Regards,
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Allow-Join-over-two-sharded-collection-tp4343443.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev