Posted to solr-user@lucene.apache.org by S G <sg...@gmail.com> on 2019/07/07 07:09:05 UTC
Re: Discuss: virtual nodes in Solr

It may be a matter of perspective, but going from N shards to N+k shards
has really just one benefit, and IMO it is a huge one.

You need not double your hardware when you have to expand your cluster
"without doing a full re-ingestion".
When you have several terabytes of data on a performance-saturated cluster
and you want to scale the cluster for the next 1 TB of data, it is quite
costly to either:
1. Go from N shards to 2N shards, OR
2. Go from N shards to N+k shards with a full re-ingestion of the data,
which can take more than a week.

Cassandra-style data stores have solved this problem very nicely by
allowing incremental addition of hardware.
- You are not forced to double your hardware.
- Nor are you forced to reload all your data.
(I know the theory that Solr is not a primary data source and that users
should be ready to reload, etc., but it is time we began adding some caveats
to that theory and stopped applying it to "all" contexts, since reloading
TBs of data is slow and very painful.)
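
To make the incremental-addition point concrete, here is a minimal sketch of a consistent-hash ring with virtual nodes. It is only an illustration of the general idea: it is not how Solr routes documents today and not Cassandra's actual implementation. All names (VnodeRing, token, moved_fraction_vnodes, moved_fraction_modulo), the MD5-based hash, and the choice of 64 vnodes per node are assumptions made up for this example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class VnodeRing:
    """Hypothetical consistent-hash ring; each node owns many small ranges."""

    def __init__(self, vnodes_per_node: int = 64):
        self.vnodes_per_node = vnodes_per_node
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name

    def add_node(self, node: str) -> None:
        # Place several tokens per node so a new node takes a small slice
        # from many existing nodes instead of splitting one node in half.
        for i in range(self.vnodes_per_node):
            t = token(f"{node}#vnode{i}")
            if t in self._owner:
                continue  # skip the (extremely unlikely) token collision
            bisect.insort(self._tokens, t)
            self._owner[t] = node

    def node_for(self, doc_id: str) -> str:
        # A document belongs to the first token clockwise from its hash.
        idx = bisect.bisect_right(self._tokens, token(doc_id)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


def moved_fraction_vnodes(n_before: int, n_added: int, n_docs: int = 20000) -> float:
    """Fraction of documents that change owner when nodes are added to the ring."""
    ring = VnodeRing()
    for i in range(n_before):
        ring.add_node(f"node{i}")
    docs = [f"doc{i}" for i in range(n_docs)]
    before = {d: ring.node_for(d) for d in docs}
    for i in range(n_before, n_before + n_added):
        ring.add_node(f"node{i}")
    return sum(1 for d in docs if ring.node_for(d) != before[d]) / n_docs


def moved_fraction_modulo(n_before: int, n_added: int, n_docs: int = 20000) -> float:
    """Same measurement for naive 'hash mod shard count' placement, for contrast."""
    docs = [f"doc{i}" for i in range(n_docs)]
    after = n_before + n_added
    return sum(1 for d in docs if token(d) % n_before != token(d) % after) / n_docs


if __name__ == "__main__":
    print("vnodes, 10 -> 11 nodes:", moved_fraction_vnodes(10, 1))
    print("modulo, 10 -> 11 nodes:", moved_fraction_modulo(10, 1))
```

With the ring, adding one node to ten should move only about 1/11 of the documents in expectation, and all of them move to the new node. With naive modulo placement most documents change shard, which is effectively the full reload described above.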

So the one benefit of this feature is that it saves both money (on
hardware) and time (by avoiding reloading).
Users will be able to scale for every additional TB of data by adding just
a few shards, which is very economical.

Hosting more than one shard on "some" nodes is not good either, because
those nodes will then not perform very well.
(Note that the problem we had was scaling a performance-saturated cluster
for the next unit of data, such as a TB.)

Another great and similar benefit is that it becomes easy to scale for a
burst in data.
Let us say there is a July sales or Black Friday event and we expect much
more data during those two months.
The ability to scale shards up during such events and back down afterwards
would again save a lot of money on hardware, and time.

Cheers,
SG


On Sat, Jun 29, 2019 at 11:30 AM Erick Erickson <er...@gmail.com>
wrote:

> Offhand I suspect this would be an enormous effort, not worth the work.
>
> I agree that double-or-nothing is not terribly convenient, but that said
> since multiple replicas can be hosted on the same node and moved to other
> hardware as needed (oversharding, even for existing collections) there are
> ways to deal with this currently.
>
> There would have to be extraordinary benefits to interest me. And the
> stated benefit so far of being able to expand gradually rather than
> doubling shards isn’t an extraordinary benefit. That effort would come at
> the expense of a lot of other work.
>
> Another way of saying it is that the burden of proof for the benefits is
> on you ;).
>
> Best,
> Erick
>
> > On Jun 28, 2019, at 8:51 PM, Will Martin <wm...@outlook.com> wrote:
> >
> > From: S G <sg...@gmail.com>
> > Subject: Discuss: virtual nodes in Solr
> > Date: June 28, 2019 at 8:04:44 PM EDT
> > To: solr-user@lucene.apache.org
> > Reply-To: solr-user@lucene.apache.org
> >
> >
> > Hi,
> >
> > Has Solr tried to use the vnodes concept, like Cassandra:
> > https://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
> >
> >
> > If this can be implemented carefully, we need not live with
> > shard-splitting alone, which can only double the number of shards.
> > With vnodes, shards can be increased incrementally as the need arises.
> > What's more, shards can be decreased too when the doc-count/traffic
> > decreases.
> >
> > -SG
> >
> > +1
> >
> > Carefully? Deliberate would be a better word with this community; imho.
> > How about an incubation epic story PMC?
> >
> >
> >
>
>