Posted to dev@solr.apache.org by Ilan Ginzburg <il...@gmail.com> on 2023/04/28 17:33:29 UTC

SolrCloud separating compute from storage

Hi,

This is a long message, apologies. If responses are positive, there will
likely be plenty of other opportunities to discuss the topics mentioned
here.

I'm trying to see if the community would be interested in a contribution
allowing SolrCloud nodes to be totally stateless with persistent storage
done in S3 (or similar).

Salesforce has been working for a while on separating compute from storage
in SolrCloud; see the Activate 2019 presentation SolrCloud in Public Cloud:
Scaling Compute Independently from Storage <https://youtu.be/6fE5KvOfb6A>.
In a nutshell, the idea is that all SolrCloud nodes are stateless, have a
local disk cache of the cores they're hosting but no persistent volumes (no
persistent indexes nor transaction logs), and shard level persistence is
done on S3.

This is different from classic per-node remote storage:
- No "per node" transaction log (therefore no need to wait for down nodes
to come back up to recover their transaction log data)
- A single per shard copy on S3 (not a copy per replica)
- Local non-persistent node volumes (SSDs, for example) act as a cache and
are used to serve queries and do indexing (so the local cache is limited by
disk size, not by memory as it is in HdfsDirectory for example)
- When adding a node and replicas, the index content for the replicas is
fetched from S3, not from other (possibly overloaded already) nodes

The current state of our code is:
- It's a fork (it was shared as a branch for a while but eventually removed)
- We introduced a new replica type that is explicitly based on remote Blob
storage (and otherwise similar to TLOG) and that guards against concurrent
writes to S3 (when shard leadership changes in SolrCloud, there can be an
overlap period with two nodes acting as leaders which can cause corruption
with a naive implementation of remote storage at shard level)
- Currently we only push constructed segments to S3 which forces committing
every batch before acknowledging it to the client (we can't rely on the
transaction log on a node given it would be lost upon node restart, the
node having no persistent storage).

We're considering improving this approach by making the transaction log a
shard-level abstraction (rather than a replica/node abstraction) and storing
it in S3 as well, with one transaction log per shard, not per replica.
This would allow indexing to not commit on every batch, speed up /update
requests, push the constructed segments asynchronously to S3, and guarantee
data durability while still allowing nodes to be stateless (so they can be
shut down at any time, in any number, without data loss and without having
to restart these nodes to recover data only they can access).
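
To make this concrete, here is a very rough sketch of the indexing path we
have in mind (all types and method names below are hypothetical, none of
this is existing Solr code):

import java.io.IOException;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical shard-level write-ahead log living in the shared repository.
interface ShardTransactionLog {
  // Durably append a batch of updates for this shard (e.g. as one object on
  // S3); returns a log version once the batch is persisted remotely.
  long append(List<SolrInputDocument> batch) throws IOException;

  // Discard entries already covered by segments pushed to the shared store.
  void truncateUpTo(long version) throws IOException;
}

class SharedStoreUpdateFlow {
  private final ShardTransactionLog shardTlog; // hypothetical, see above

  SharedStoreUpdateFlow(ShardTransactionLog shardTlog) {
    this.shardTlog = shardTlog;
  }

  // /update path: the batch is durable once it is in the shard tlog, so the
  // client can be acknowledged without forcing a Lucene commit.
  long processBatch(List<SolrInputDocument> batch) throws IOException {
    long version = shardTlog.append(batch);
    // also index the batch into the local IndexWriter, but do not commit here
    return version;
  }

  // Background task: once segments covering lastPushedVersion have been
  // pushed to the shared store, the corresponding tlog entries can be dropped.
  void onSegmentsPushed(long lastPushedVersion) throws IOException {
    shardTlog.truncateUpTo(lastPushedVersion);
  }
}

The point is that durability moves from "commit on every batch" to "append
to a per-shard log on the shared store".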

If there is interest in such a contribution to SolrCloud then we might
carry the next phase of the work upstream (initially in a branch, with the
intention to eventually merge).

If there is no interest, it's easier for us to keep working on our fork,
which allows taking shortcuts and ignoring features of SolrCloud we do not
use (for example, replacing the existing transaction log with a shard-level
transaction log rather than maintaining both).

Feedback welcome! I can present the idea in more detail in a future
committer/contributor meeting if there's interest.

Thanks,
Ilan

Re: SolrCloud separating compute from storage

Posted by Jason Gerlowski <ge...@gmail.com>.
+1 - sounds very promising!

On Sat, Apr 29, 2023 at 1:06 PM Wei <we...@gmail.com> wrote:
>
> This is an awesome feature for solr cloud! Currently for our read
> heavy/write heavy use case, we exclude all query requests from the leader
> in each shard to avoid becoming the load bottleneck. Also each solr cloud
> has its own pipeline for NRT updates.  With stateless replica and
> persistent storage support, can one small cluster be dedicated to handle
> the updates, while many serving clouds all pull the updated segments from
> central persistent storage? It would be significant resource savings.
>
> Thanks,
> Wei
>
> On Sat, Apr 29, 2023 at 2:30 AM Ilan Ginzburg <il...@gmail.com> wrote:
>
> > The changing/overlapping leaders was the main challenge in the
> > implementation.
> > Logic such as:
> > If (iAmLeader()) {
> >    doThings();
> > }
> > Can have multiple participants doThings() at the same time as iAmLeader()
> > could change just after it was checked. The only way out in such an
> > approach is to do barriers (the old leader explicitly giving up leadership
> > before the new one takes over… sounds familiar? 😉). This is complicated.
> > What if the old leader is considered gone and can’t explicitly give up but
> > is not gone? Not being seen by the quorum does not automatically imply not
> > being able to write to S3.
> >
> > We solved it by having the writing to S3 (or indeed any storage, we added
> > an abstraction layer) use random file names (Solr file name + random
> > suffix) so that two concurrent nodes would not overwrite each other even if
> > they were writing similarly named segments/files.
> >
> > Then we used a conditional update in ZooKeeper (one per write of one or more
> > segments, not one per file) to have one of the two nodes “win” the write to
> > S3. The data written by the losing node is ignored and not part of the S3
> > image of the shard.
> >
> > Indeed running Solr from the local disk is essential (the cache aspect).
> > Two orders of magnitude more space than in memory, more or less.
> >
> > And we run smaller shard sizes indeed!
> >
> > Thanks everybody for the feedback so far.
> >
> > Ilan
> >
> > On Sat 29 Apr 2023 at 07:08, Shawn Heisey <ap...@elyograg.org> wrote:
> >
> > > On 4/28/23 11:33, Ilan Ginzburg wrote:
> > > > Salesforce has been working for a while on separating compute from
> > > storage
> > > > in SolrCloud,  see presentation at Activate 2019 SolrCloud in Public
> > > Cloud:
> > > > Scaling Compute Independently from Storage <
> > https://youtu.be/6fE5KvOfb6A
> > > >.
> > > > In a nutshell, the idea is that all SolrCloud nodes are stateless,
> > have a
> > > > local disk cache of the cores they're hosting but no persistent volumes
> > > (no
> > > > persistent indexes nor transaction logs), and shard level persistence
> > is
> > > > done on S3.
> > >
> > > This is a very intriguing idea!  I think it would be particularly useful
> > > for containerized setups that can add or remove nodes to meet changing
> > > demands.
> > >
> > > My primary concern when I first looked at this was that with
> > > network-based storage there would be little opportunity for caching, and
> > > caching is SUPER critical for Solr performance.  Then when I began
> > > writing this reply, I saw above that you're talking about a local disk
> > > cache... so maybe that is not something to worry about.
> > >
> > > Bandwidth and latency limitations to/from the shared storage are another
> > > concern, especially with big indexes that have segments up to 5GB.
> > > Increasing the merge tier sizes and reducing the max segment size is
> > > probably a very good idea.
> > >
> > > Another challenge:  Ensuring that switching leaders happens reasonably
> > > quickly while making sure that there cannot be multiple replicas
> > > thinking they are leader at the same time.  Making the leader fencing
> > > bulletproof is a critical piece of this.  I suspect that the existing
> > > leader fencing could use some work, affecting SolrCloud in general.
> > >
> > > I don't want to get too deep in technical weeds, mostly because I do not
> > > understand all the details ... but I am curious about something that
> > > might affect this:  Are ephemeral znodes created by one Solr node
> > > visible to other Solr nodes?  If they are, then I think ZK would provide
> > > all the fencing needed, and could also keep track of the segments that
> > > exist in the remote storage so follower replicas can quickly keep up
> > > with the leader.
> > >
> > > There could also be implementations for more mundane shared storage like
> > > SMB or NFS.
> > >
> > > Thanks,
> > > Shawn
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
> > > For additional commands, e-mail: dev-help@solr.apache.org
> > >
> > >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
For additional commands, e-mail: dev-help@solr.apache.org


Re: SolrCloud separating compute from storage

Posted by Wei <we...@gmail.com>.
This is an awesome feature for SolrCloud! Currently for our read
heavy/write heavy use case, we exclude all query requests from the leader
in each shard to avoid it becoming the load bottleneck. Also each SolrCloud
cluster has its own pipeline for NRT updates. With stateless replicas and
persistent storage support, can one small cluster be dedicated to handling
the updates, while many serving clouds all pull the updated segments from
central persistent storage? It would be significant resource savings.

Thanks,
Wei

On Sat, Apr 29, 2023 at 2:30 AM Ilan Ginzburg <il...@gmail.com> wrote:

> The changing/overlapping leaders was the main challenge in the
> implementation.
> Logic such as:
> If (iAmLeader()) {
>    doThings();
> }
> Can have multiple participants doThings() at the same time as iAmLeader()
> could change just after it was checked. The only way out in such an
> approach is to do barriers (the old leader explicitly giving up leadership
> before the new one takes over… sounds familiar? 😉). This is complicated.
> What if the old leader is considered gone and can’t explicitly give up but
> is not gone? Not being seen by the quorum does not automatically imply not
> being able to write to S3.
>
> We solved it by having the writing to S3 (or indeed any storage, we added
> an abstraction layer) use random file names (Solr file name + random
> suffix) so that two concurrent nodes would not overwrite each other even if
> they were writing similarly named segments/files.
>
> Then we used a conditional update in ZooKeeper (one per write of one or more
> segments, not one per file) to have one of the two nodes “win” the write to
> S3. The data written by the losing node is ignored and not part of the S3
> image of the shard.
>
> Indeed running Solr from the local disk is essential (the cache aspect).
> Two orders of magnitude more space than in memory, more or less.
>
> And we run smaller shard sizes indeed!
>
> Thanks everybody for the feedback so far.
>
> Ilan
>
> On Sat 29 Apr 2023 at 07:08, Shawn Heisey <ap...@elyograg.org> wrote:
>
> > On 4/28/23 11:33, Ilan Ginzburg wrote:
> > > Salesforce has been working for a while on separating compute from
> > storage
> > > in SolrCloud,  see presentation at Activate 2019 SolrCloud in Public
> > Cloud:
> > > Scaling Compute Independently from Storage <
> https://youtu.be/6fE5KvOfb6A
> > >.
> > > In a nutshell, the idea is that all SolrCloud nodes are stateless,
> have a
> > > local disk cache of the cores they're hosting but no persistent volumes
> > (no
> > > persistent indexes nor transaction logs), and shard level persistence
> is
> > > done on S3.
> >
> > This is a very intriguing idea!  I think it would be particularly useful
> > for containerized setups that can add or remove nodes to meet changing
> > demands.
> >
> > My primary concern when I first looked at this was that with
> > network-based storage there would be little opportunity for caching, and
> > caching is SUPER critical for Solr performance.  Then when I began
> > writing this reply, I saw above that you're talking about a local disk
> > cache... so maybe that is not something to worry about.
> >
> > Bandwidth and latency limitations to/from the shared storage are another
> > concern, especially with big indexes that have segments up to 5GB.
> > Increasing the merge tier sizes and reducing the max segment size is
> > probably a very good idea.
> >
> > Another challenge:  Ensuring that switching leaders happens reasonably
> > quickly while making sure that there cannot be multiple replicas
> > thinking they are leader at the same time.  Making the leader fencing
> > bulletproof is a critical piece of this.  I suspect that the existing
> > leader fencing could use some work, affecting SolrCloud in general.
> >
> > I don't want to get too deep in technical weeds, mostly because I do not
> > understand all the details ... but I am curious about something that
> > might affect this:  Are ephemeral znodes created by one Solr node
> > visible to other Solr nodes?  If they are, then I think ZK would provide
> > all the fencing needed, and could also keep track of the segments that
> > exist in the remote storage so follower replicas can quickly keep up
> > with the leader.
> >
> > There could also be implementations for more mundane shared storage like
> > SMB or NFS.
> >
> > Thanks,
> > Shawn
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
> > For additional commands, e-mail: dev-help@solr.apache.org
> >
> >
>

Re: SolrCloud separating compute from storage

Posted by Ilan Ginzburg <il...@gmail.com>.
The changing/overlapping leaders was the main challenge in the
implementation.
Logic such as:
if (iAmLeader()) {
   doThings();
}
can have multiple participants calling doThings() at the same time, since
iAmLeader() could change just after it was checked. The only way out in such an
approach is to do barriers (the old leader explicitly giving up leadership
before the new one takes over… sounds familiar? 😉). This is complicated.
What if the old leader is considered gone and can’t explicitly give up but
is not gone? Not being seen by the quorum does not automatically imply not
being able to write to S3.

We solved it by having writes to S3 (or indeed any storage; we added
an abstraction layer) use random file names (Solr file name + random
suffix) so that two concurrent nodes would not overwrite each other even if
they were writing similarly named segments/files.

Then we used a conditional update in ZooKeeper (one per write of one or more
segments, not one per file) to have one of the two nodes “win” the write to
S3. The data written by the losing node is ignored and is not part of the S3
image of the shard.
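
In code, the per-push fencing is roughly the following (a simplified
sketch, not our actual classes: BlobStore, LocalFile and ShardMetadata are
made-up abstractions, and it uses the plain ZooKeeper client rather than
Solr's wrappers):

import java.util.List;
import java.util.UUID;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class SharedStorePusher {
  private final ZooKeeper zk;
  private final BlobStore blobStore; // hypothetical abstraction over S3/GCS/...

  SharedStorePusher(ZooKeeper zk, BlobStore blobStore) {
    this.zk = zk;
    this.blobStore = blobStore;
  }

  // Push a set of segment files for one shard; at most one concurrent pusher "wins".
  void push(String shardMetadataZkPath, List<LocalFile> files) throws Exception {
    // 1. Read the current shard metadata and remember its ZooKeeper version.
    Stat stat = new Stat();
    ShardMetadata metadata =
        ShardMetadata.deserialize(zk.getData(shardMetadataZkPath, false, stat));

    // 2. Write the files under random names so two overlapping leaders never
    //    overwrite each other, even for identically named Lucene files.
    for (LocalFile file : files) {
      String blobName = file.solrFileName() + "." + UUID.randomUUID();
      blobStore.write(blobName, file.content());
      metadata = metadata.withFile(file.solrFileName(), blobName);
    }

    // 3. Conditional update: succeeds only if nobody else updated the shard
    //    metadata since step 1. The loser's blobs are simply never referenced.
    try {
      zk.setData(shardMetadataZkPath, metadata.serialize(), stat.getVersion());
    } catch (KeeperException.BadVersionException e) {
      throw new IllegalStateException(
          "Another node pushed for this shard; this push is discarded", e);
    }
  }
}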

Indeed running Solr from the local disk is essential (the cache aspect).
Two orders of magnitude more space than in memory, more or less.

And we run smaller shard sizes indeed!

Thanks everybody for the feedback so far.

Ilan

On Sat 29 Apr 2023 at 07:08, Shawn Heisey <ap...@elyograg.org> wrote:

> On 4/28/23 11:33, Ilan Ginzburg wrote:
> > Salesforce has been working for a while on separating compute from
> storage
> > in SolrCloud,  see presentation at Activate 2019 SolrCloud in Public
> Cloud:
> > Scaling Compute Independently from Storage <https://youtu.be/6fE5KvOfb6A
> >.
> > In a nutshell, the idea is that all SolrCloud nodes are stateless, have a
> > local disk cache of the cores they're hosting but no persistent volumes
> (no
> > persistent indexes nor transaction logs), and shard level persistence is
> > done on S3.
>
> This is a very intriguing idea!  I think it would be particularly useful
> for containerized setups that can add or remove nodes to meet changing
> demands.
>
> My primary concern when I first looked at this was that with
> network-based storage there would be little opportunity for caching, and
> caching is SUPER critical for Solr performance.  Then when I began
> writing this reply, I saw above that you're talking about a local disk
> cache... so maybe that is not something to worry about.
>
> Bandwidth and latency limitations to/from the shared storage are another
> concern, especially with big indexes that have segments up to 5GB.
> Increasing the merge tier sizes and reducing the max segment size is
> probably a very good idea.
>
> Another challenge:  Ensuring that switching leaders happens reasonably
> quickly while making sure that there cannot be multiple replicas
> thinking they are leader at the same time.  Making the leader fencing
> bulletproof is a critical piece of this.  I suspect that the existing
> leader fencing could use some work, affecting SolrCloud in general.
>
> I don't want to get too deep in technical weeds, mostly because I do not
> understand all the details ... but I am curious about something that
> might affect this:  Are ephemeral znodes created by one Solr node
> visible to other Solr nodes?  If they are, then I think ZK would provide
> all the fencing needed, and could also keep track of the segments that
> exist in the remote storage so follower replicas can quickly keep up
> with the leader.
>
> There could also be implementations for more mundane shared storage like
> SMB or NFS.
>
> Thanks,
> Shawn
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
> For additional commands, e-mail: dev-help@solr.apache.org
>
>

Re: SolrCloud separating compute from storage

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/28/23 11:33, Ilan Ginzburg wrote:
> Salesforce has been working for a while on separating compute from storage
> in SolrCloud,  see presentation at Activate 2019 SolrCloud in Public Cloud:
> Scaling Compute Independently from Storage <https://youtu.be/6fE5KvOfb6A>.
> In a nutshell, the idea is that all SolrCloud nodes are stateless, have a
> local disk cache of the cores they're hosting but no persistent volumes (no
> persistent indexes nor transaction logs), and shard level persistence is
> done on S3.

This is a very intriguing idea!  I think it would be particularly useful 
for containerized setups that can add or remove nodes to meet changing 
demands.

My primary concern when I first looked at this was that with 
network-based storage there would be little opportunity for caching, and 
caching is SUPER critical for Solr performance.  Then when I began 
writing this reply, I saw above that you're talking about a local disk 
cache... so maybe that is not something to worry about.

Bandwidth and latency limitations to/from the shared storage are another 
concern, especially with big indexes that have segments up to 5GB. 
Increasing the merge tier sizes and reducing the max segment size is 
probably a very good idea.

Another challenge:  Ensuring that switching leaders happens reasonably 
quickly while making sure that there cannot be multiple replicas 
thinking they are leader at the same time.  Making the leader fencing 
bulletproof is a critical piece of this.  I suspect that the existing 
leader fencing could use some work, affecting SolrCloud in general.

I don't want to get too deep in technical weeds, mostly because I do not 
understand all the details ... but I am curious about something that 
might affect this:  Are ephemeral znodes created by one Solr node 
visible to other Solr nodes?  If they are, then I think ZK would provide 
all the fencing needed, and could also keep track of the segments that 
exist in the remote storage so follower replicas can quickly keep up 
with the leader.

There could also be implementations for more mundane shared storage like 
SMB or NFS.

Thanks,
Shawn

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
For additional commands, e-mail: dev-help@solr.apache.org


Re: SolrCloud separating compute from storage

Posted by Mikhail Khludnev <mk...@apache.org>.
Keeping the from-scratch speculation mode on...

On Wed, Jul 12, 2023 at 3:45 PM Ilan Ginzburg <il...@gmail.com> wrote:

> I think mandating the use of Kafka to switch to this mode of separating
> compute and storage would make adoption harder. One would also need a
> deployment of Kafka that is resilient to an AZ going down. We get this "for
> free" by using S3/GCS or similar.
>
It sounds like a reasonable requirement for a vendor-agnostic abstraction
like we have for S3. In that case managed offerings for Kafka (AWS MSK) and
analogs (GCP Pub/Sub) simplify adoption.

Moreover, the transaction log will be per shard but we can't afford a Kafka
> topic per shard (in our use case we have thousands of shards on the
> cluster). Addressing all this will likely complexify a design that will
> lose its initial beauty and appeal.
>
Good point. But can't they be sharded transparently? If we run N
IndexWriters, each of them can consume 1 (or optionally up to N) partitions
of a topic?
Presumably, we could take Solr Cross DC from solr-sandbox and turn its
transaction log off.
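
To make the sharding idea concrete, a minimal sketch (topic name, partition
numbers and serializers are made up; it is just the standard Kafka consumer
API with manual partition assignment):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

class ShardUpdateConsumer {
  // One node consumes the partitions of the shards it hosts,
  // instead of needing one topic per shard.
  static void consume(String bootstrapServers, List<Integer> myPartitions) {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
      consumer.assign(myPartitions.stream()
          .map(p -> new TopicPartition("solr-updates", p))
          .collect(Collectors.toList()));
      while (true) {
        ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, byte[]> record : records) {
          // record.key() could be the shard id, record.value() the serialized
          // update; hand it to the IndexWriter that owns that shard.
        }
      }
    }
  }
}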


>
> I hope to be able to present this topic in 5 minutes during the
> upcoming community virtual
> meetup
> <https://cwiki.apache.org/confluence/display/SOLR/2023-07-19+Meeting+notes
> >
> (July
> 19th).
>

Looking forward to it!


>
> Ilan
>
>
> On Wed, Jul 12, 2023 at 10:54 AM Mikhail Khludnev <mk...@apache.org> wrote:
>
> > Hello Ilan,
> > Late comment, though.
> >
> > On Fri, Apr 28, 2023 at 8:33 PM Ilan Ginzburg <il...@gmail.com>
> wrote:
> >
> > > ...
> > > We're considering improving this approach by making the transaction
> log a
> > > shard level abstraction (rather than a replica/node abstraction), and
> > store
> > > it in S3 as well with a transaction log per shard, not per replica.
> > > This would allow indexing to not commit on every batch, speed up
> /update
> > > requests, push the constructed segments asynchronously to S3, guarantee
> > > data durability while still allowing nodes to be stateless (so can be
> > shut
> > > down at any time in any number without data loss and without having to
> > > restart these nodes to recover data only they can access).
> > > ...
> > > Thanks,
> > > Ilan
> > >
> >
> > When discussing these (pretty cool) architectures I'm missing the point
> of
> > implementing transaction log in Solr codebase.
> > I think Kafka is the best fit for such a pre-indexer buffer. WYDT?
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>


-- 
Sincerely yours
Mikhail Khludnev

Re: SolrCloud separating compute from storage

Posted by Ilan Ginzburg <il...@gmail.com>.
Thanks for your feedback Mikhail. Your comment makes a lot of sense.

When discussing these (pretty cool) architectures I'm missing the point of
> implementing transaction log in Solr codebase.
> I think Kafka is the best fit for such a pre-indexer buffer. WYDT?


If we were starting from scratch with a design for such a write-ahead log,
there would likely be multiple options.
But here we're considering an integration into an existing system
(SolrCloud) that already has existing logic and implementation around
transaction logs and that currently does not have a dependency on Kafka.

I think mandating the use of Kafka to switch to this mode of separating
compute and storage would make adoption harder. One would also need a
deployment of Kafka that is resilient to an AZ going down. We get this "for
free" by using S3/GCS or similar.
Moreover, the transaction log will be per shard but we can't afford a Kafka
topic per shard (in our use case we have thousands of shards on the
cluster). Addressing all this will likely complexify a design that will
lose its initial beauty and appeal.

In any case, work on the transaction log side of things has not started,
and our (Salesforce) plan is to start that work once the existing state of
the separation of compute and storage has been shared upstream (in a
branch). There will be plenty of opportunities to discuss and decide on the
best approach (and work on it if anybody is interested). Maybe we'll even
decide not to decide: use the abstraction internally and allow multiple
implementations (the existing local file system, S3/GCS-based as suggested,
Kafka-based, or others).
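
For illustration only, the kind of extension point I have in mind (purely
hypothetical names, nothing below exists in Solr today):

import java.io.IOException;

// Hypothetical per-shard transaction log.
interface ShardTransactionLog { /* append(...), truncateUpTo(...), see the flow sketch in my first message */ }

// The pluggable part: where a shard's transaction log lives, chosen via configuration.
interface ShardTransactionLogFactory {
  ShardTransactionLog open(String collection, String shard) throws IOException;
}

// Conceivable implementations behind the same abstraction:
//  - a local file system factory (essentially today's per-replica update log),
//  - an S3/GCS factory (one log per shard, as discussed in this thread),
//  - a Kafka factory (one topic or partition per shard, as Mikhail suggests).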

I hope to present this topic in 5 minutes during the upcoming community
virtual meetup
<https://cwiki.apache.org/confluence/display/SOLR/2023-07-19+Meeting+notes>
(July 19th).

Ilan


On Wed, Jul 12, 2023 at 10:54 AM Mikhail Khludnev <mk...@apache.org> wrote:

> Hello Ilan,
> Late comment, though.
>
> On Fri, Apr 28, 2023 at 8:33 PM Ilan Ginzburg <il...@gmail.com> wrote:
>
> > ...
> > We're considering improving this approach by making the transaction log a
> > shard level abstraction (rather than a replica/node abstraction), and
> store
> > it in S3 as well with a transaction log per shard, not per replica.
> > This would allow indexing to not commit on every batch, speed up /update
> > requests, push the constructed segments asynchronously to S3, guarantee
> > data durability while still allowing nodes to be stateless (so can be
> shut
> > down at any time in any number without data loss and without having to
> > restart these nodes to recover data only they can access).
> > ...
> > Thanks,
> > Ilan
> >
>
> When discussing these (pretty cool) architectures I'm missing the point of
> implementing transaction log in Solr codebase.
> I think Kafka is the best fit for such a pre-indexer buffer. WYDT?
>
> --
> Sincerely yours
> Mikhail Khludnev
>

Re: SolrCloud separating compute from storage

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello Ilan,
Late comment, though.

On Fri, Apr 28, 2023 at 8:33 PM Ilan Ginzburg <il...@gmail.com> wrote:

> ...
> We're considering improving this approach by making the transaction log a
> shard level abstraction (rather than a replica/node abstraction), and store
> it in S3 as well with a transaction log per shard, not per replica.
> This would allow indexing to not commit on every batch, speed up /update
> requests, push the constructed segments asynchronously to S3, guarantee
> data durability while still allowing nodes to be stateless (so can be shut
> down at any time in any number without data loss and without having to
> restart these nodes to recover data only they can access).
> ...
> Thanks,
> Ilan
>

When discussing these (pretty cool) architectures I'm missing the point of
implementing transaction log in Solr codebase.
I think Kafka is the best fit for such a pre-indexer buffer. WYDT?

-- 
Sincerely yours
Mikhail Khludnev

Re: SolrCloud separating compute from storage

Posted by Justin Sweeney <ju...@gmail.com>.
This definitely sounds very interesting, and if we could abstract it away
from AWS specifically, even better. I think there are a lot of advantages
to an approach like this, as you've mentioned. At FullStory we are planning
to experiment with GCP Local SSDs and Google Cloud Storage, so a very
similar concept. This also would have been a popular idea in other
implementations I've seen. This is one of those ideas that, if it were
relatively easy for users to adopt, I think would get a lot of adoption.

I fully support this concept as a contribution and would be happy to help
make that a reality.

On Fri, Apr 28, 2023 at 4:27 PM Joel Bernstein <jo...@gmail.com> wrote:

> I mentioned NIO providers for S3, GCS and Azure in a different email
> thread. This could be used to abstract away the S3 specific code and
> provide support for GCS and Azure without much more effort. This would make
> unit tests much easier to write because you can simply unit test to local
> disk by changing the URL scheme.
>
> The current design as I understand it only needs sequential reads and
> writes, as it's just pushing segments to remote storage. So I think NIO
> providers should support the basic needs.
>
> I think the design sounds really interesting and is a nice step in
> modernizing SolrCloud.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Apr 28, 2023 at 4:03 PM David Smiley <ds...@apache.org> wrote:
>
> > To clarify the point to everyone: "separation of compute from storage"
> > allows infrastructure cost savings for when you have both large scale
> (many
> > shards in the cluster) and highly diverse collection/index utilization.
> > The vision of our contribution is that an unused shard can scale down to
> as
> > little as zero replicas if you are willing for a future request to be
> > delayed for S3 hydration. The implementation also has SolrCloud
> robustness
> > advantages compared to NRT/TLOG replicas.  This is "the vision"; we
> haven't
> > achieved zero replicas yet and maybe partially realized stability
> > advantages, and of course we've got some issues to address.  Nevertheless
> > with our continued investment in this approach (probable but not
> certain),
> > that's where we'll be.
> >
> > I'm sure a number of users/businesses would find these advantages
> > desirable.  Maybe not most users but some.  Small data sets or ones that
> > are constantly being used might not see the primary advantage.
> >
> > Sadly, we did not look at HdfsDirectory with an S3 backend because we
> > completely missed the possibility of doing so.  I wish this was promoted
> > more, like in the documentation / ref guide.  HdfsDirectory might achieve
> > the aforementioned goals albeit with varying pros/cons to the proposed
> > contribution.  Also, statements about what the contribution does and
> could
> > do is a moving target; meanwhile HdfsDirectory has not seen investment in
> > features/optimizations, but I can easily imagine some!
> >
> > With my Apache hat on, my main concern with the proposed contribution is
> > hearing opinions on how well it fits into SolrCloud / the codebase and
> the
> > resulting maintenance as code owners.  That might be hard for anyone to
> > comment on right now because you can't see it but if there are
> preliminary
> > thoughts on what Ilan has written in this regard, then, please share
> them.
> > If we don't ultimately contribute the aforementioned functionality, I
> > wonder if the developer community here might agree with the replica type
> > enum moving to an actual abstraction (i.e. an interface) to allow for the
> > possibility of other replica types for Solr hackers like us, and perhaps
> > it'd improve SolrCloud code in so doing.
> >
> > A more concrete concern I have is how to limit dependencies.  The
> > contribution adds AWS S3 dependencies to solr-core but that's not at all
> > okay in Apache Solr.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Fri, Apr 28, 2023 at 1:33 PM Ilan Ginzburg <il...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > This is a long message, apologies. If responses are positive, there
> will
> > > likely be plenty of other opportunities to discuss the topics mentioned
> > > here.
> > >
> > > I'm trying to see if the community would be interested in a
> contribution
> > > allowing SolrCloud nodes to be totally stateless with persistent
> storage
> > > done in S3 (or similar).
> > >
> > > Salesforce has been working for a while on separating compute from
> > storage
> > > in SolrCloud,  see presentation at Activate 2019 SolrCloud in Public
> > Cloud:
> > > Scaling Compute Independently from Storage <
> https://youtu.be/6fE5KvOfb6A
> > >.
> > > In a nutshell, the idea is that all SolrCloud nodes are stateless,
> have a
> > > local disk cache of the cores they're hosting but no persistent volumes
> > (no
> > > persistent indexes nor transaction logs), and shard level persistence
> is
> > > done on S3.
> > >
> > > This is different from a classic node remote storage:
> > > - No "per node" transaction log (therefore no need to wait for down
> nodes
> > > to come back up to recover their transaction log data)
> > > - A single per shard copy on S3 (not a copy per replica)
> > > - Local non persistent node volumes (SSD's for example) are a cache and
> > are
> > > used to serve queries and do indexing (therefore local cache is not
> > limited
> > > by memory like it is in HdfsDirectory for example but by disk size)
> > > - When adding a node and replicas, the index content for the replicas
> is
> > > fetched from S3, not from other (possibly overloaded already) nodes
> > >
> > > The current state of our code is:
> > > - It's a fork (it was shared as a branch for a while but eventually
> > > removed)
> > > - We introduced a new replica type that is explicitly based on remote
> > Blob
> > > storage (and otherwise similar to TLOG) and that guards against
> > concurrent
> > > writes to S3 (when shard leadership changes in SolrCloud, there can be
> an
> > > overlap period with two nodes acting as leaders which can cause
> > corruption
> > > with a naive implementation of remote storage at shard level)
> > > - Currently we only push constructed segments to S3 which forces
> > committing
> > > every batch before acknowledging it to the client (we can't rely on the
> > > transaction log on a node given it would be lost upon node restart, the
> > > node having no persistent storage).
> > >
> > > We're considering improving this approach by making the transaction
> log a
> > > shard level abstraction (rather than a replica/node abstraction), and
> > store
> > > it in S3 as well with a transaction log per shard, not per replica.
> > > This would allow indexing to not commit on every batch, speed up
> /update
> > > requests, push the constructed segments asynchronously to S3, guarantee
> > > data durability while still allowing nodes to be stateless (so can be
> > shut
> > > down at any time in any number without data loss and without having to
> > > restart these nodes to recover data only they can access).
> > >
> > > If there is interest in such a contribution to SolrCloud then we might
> > > carry the next phase of the work upstream (initially in a branch, with
> > the
> > > intention to eventually merge).
> > >
> > > If there is no interest, it's easier for us to keep working on our
> > > fork that allows taking shortcuts and ignoring features of SolrCloud we
> > do
> > > not use (for example replacing existing transaction log with a shard
> > level
> > > transaction log rather than maintaining both).
> > >
> > > Feedback welcome! I can present the idea in more detail in a future
> > > committer/contributor meeting if there's interest.
> > >
> > > Thanks,
> > > Ilan
> > >
> >
>

Re: SolrCloud separating compute from storage

Posted by Joel Bernstein <jo...@gmail.com>.
I mentioned NIO providers for S3, GCS and Azure in a different email
thread. This could be used to abstract away the S3 specific code and
provide support for GCS and Azure without much more effort. This would make
unit tests much easier to write because you can simply unit test to local
disk by changing the URL scheme.

The current design as I understand it only needs sequential reads and
writes, as it's just pushing segments to remote storage. So I think NIO
providers should support the basic needs.
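
Something like this would keep the segment-pushing code storage agnostic (a
rough sketch; it assumes an NIO FileSystemProvider for the chosen scheme,
e.g. a third-party S3 provider, is on the classpath, and the bucket and
paths are made up):

import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class NioSegmentPusher {
  private final Path remoteShardDir;

  // e.g. URI.create("file:///tmp/shared-store/shard1/") in a unit test, or
  // URI.create("s3://my-bucket/collection1/shard1/") in production (assuming
  // a provider for the "s3" scheme is installed; some providers require
  // FileSystems.newFileSystem(...) to be called first).
  NioSegmentPusher(URI sharedStoreBase) {
    this.remoteShardDir = Paths.get(sharedStoreBase);
  }

  // Sequential copy of a locally constructed segment file to the shared store.
  void push(Path localSegmentFile) throws Exception {
    Path target = remoteShardDir.resolve(localSegmentFile.getFileName().toString());
    Files.copy(localSegmentFile, target, StandardCopyOption.REPLACE_EXISTING);
  }
}

Unit tests then just pass a file:// base URI and never touch a real bucket.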

I think the design sounds really interesting and is a nice step in
modernizing SolrCloud.


Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Apr 28, 2023 at 4:03 PM David Smiley <ds...@apache.org> wrote:

> To clarify the point to everyone: "separation of compute from storage"
> allows infrastructure cost savings for when you have both large scale (many
> shards in the cluster) and highly diverse collection/index utilization.
> The vision of our contribution is that an unused shard can scale down to as
> little as zero replicas if you are willing for a future request to be
> delayed for S3 hydration. The implementation also has SolrCloud robustness
> advantages compared to NRT/TLOG replicas.  This is "the vision"; we haven't
> achieved zero replicas yet and maybe partially realized stability
> advantages, and of course we've got some issues to address.  Nevertheless
> with our continued investment in this approach (probable but not certain),
> that's where we'll be.
>
> I'm sure a number of users/businesses would find these advantages
> desirable.  Maybe not most users but some.  Small data sets or ones that
> are constantly being used might not see the primary advantage.
>
> Sadly, we did not look at HdfsDirectory with an S3 backend because we
> completely missed the possibility of doing so.  I wish this was promoted
> more, like in the documentation / ref guide.  HdfsDirectory might achieve
> the aforementioned goals albeit with varying pros/cons to the proposed
> contribution.  Also, statements about what the contribution does and could
> do is a moving target; meanwhile HdfsDirectory has not seen investment in
> features/optimizations, but I can easily imagine some!
>
> With my Apache hat on, my main concern with the proposed contribution is
> hearing opinions on how well it fits into SolrCloud / the codebase and the
> resulting maintenance as code owners.  That might be hard for anyone to
> comment on right now because you can't see it but if there are preliminary
> thoughts on what Ilan has written in this regard, then, please share them.
> If we don't ultimately contribute the aforementioned functionality, I
> wonder if the developer community here might agree with the replica type
> enum moving to an actual abstraction (i.e. an interface) to allow for the
> possibility of other replica types for Solr hackers like us, and perhaps
> it'd improve SolrCloud code in so doing.
>
> A more concrete concern I have is how to limit dependencies.  The
> contribution adds AWS S3 dependencies to solr-core but that's not at all
> okay in Apache Solr.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Fri, Apr 28, 2023 at 1:33 PM Ilan Ginzburg <il...@gmail.com> wrote:
>
> > Hi,
> >
> > This is a long message, apologies. If responses are positive, there will
> > likely be plenty of other opportunities to discuss the topics mentioned
> > here.
> >
> > I'm trying to see if the community would be interested in a contribution
> > allowing SolrCloud nodes to be totally stateless with persistent storage
> > done in S3 (or similar).
> >
> > Salesforce has been working for a while on separating compute from
> storage
> > in SolrCloud,  see presentation at Activate 2019 SolrCloud in Public
> Cloud:
> > Scaling Compute Independently from Storage <https://youtu.be/6fE5KvOfb6A
> >.
> > In a nutshell, the idea is that all SolrCloud nodes are stateless, have a
> > local disk cache of the cores they're hosting but no persistent volumes
> (no
> > persistent indexes nor transaction logs), and shard level persistence is
> > done on S3.
> >
> > This is different from a classic node remote storage:
> > - No "per node" transaction log (therefore no need to wait for down nodes
> > to come back up to recover their transaction log data)
> > - A single per shard copy on S3 (not a copy per replica)
> > - Local non persistent node volumes (SSD's for example) are a cache and
> are
> > used to serve queries and do indexing (therefore local cache is not
> limited
> > by memory like it is in HdfsDirectory for example but by disk size)
> > - When adding a node and replicas, the index content for the replicas is
> > fetched from S3, not from other (possibly overloaded already) nodes
> >
> > The current state of our code is:
> > - It's a fork (it was shared as a branch for a while but eventually
> > removed)
> > - We introduced a new replica type that is explicitly based on remote
> Blob
> > storage (and otherwise similar to TLOG) and that guards against
> concurrent
> > writes to S3 (when shard leadership changes in SolrCloud, there can be an
> > overlap period with two nodes acting as leaders which can cause
> corruption
> > with a naive implementation of remote storage at shard level)
> > - Currently we only push constructed segments to S3 which forces
> committing
> > every batch before acknowledging it to the client (we can't rely on the
> > transaction log on a node given it would be lost upon node restart, the
> > node having no persistent storage).
> >
> > We're considering improving this approach by making the transaction log a
> > shard level abstraction (rather than a replica/node abstraction), and
> store
> > it in S3 as well with a transaction log per shard, not per replica.
> > This would allow indexing to not commit on every batch, speed up /update
> > requests, push the constructed segments asynchronously to S3, guarantee
> > data durability while still allowing nodes to be stateless (so can be
> shut
> > down at any time in any number without data loss and without having to
> > restart these nodes to recover data only they can access).
> >
> > If there is interest in such a contribution to SolrCloud then we might
> > carry the next phase of the work upstream (initially in a branch, with
> the
> > intention to eventually merge).
> >
> > If there is no interest, it's easier for us to keep working on our
> > fork that allows taking shortcuts and ignoring features of SolrCloud we
> do
> > not use (for example replacing existing transaction log with a shard
> level
> > transaction log rather than maintaining both).
> >
> > Feedback welcome! I can present the idea in more detail in a future
> > committer/contributor meeting if there's interest.
> >
> > Thanks,
> > Ilan
> >
>

Re: SolrCloud separating compute from storage

Posted by David Smiley <ds...@apache.org>.
To clarify the point to everyone: "separation of compute from storage"
allows infrastructure cost savings for when you have both large scale (many
shards in the cluster) and highly diverse collection/index utilization.
The vision of our contribution is that an unused shard can scale down to as
little as zero replicas if you are willing for a future request to be
delayed for S3 hydration. The implementation also has SolrCloud robustness
advantages compared to NRT/TLOG replicas.  This is "the vision"; we haven't
achieved zero replicas yet and maybe partially realized stability
advantages, and of course we've got some issues to address.  Nevertheless
with our continued investment in this approach (probable but not certain),
that's where we'll be.

I'm sure a number of users/businesses would find these advantages
desirable.  Maybe not most users but some.  Small data sets or ones that
are constantly being used might not see the primary advantage.

Sadly, we did not look at HdfsDirectory with an S3 backend because we
completely missed the possibility of doing so.  I wish this was promoted
more, like in the documentation / ref guide.  HdfsDirectory might achieve
the aforementioned goals albeit with varying pros/cons to the proposed
contribution.  Also, statements about what the contribution does and could
do is a moving target; meanwhile HdfsDirectory has not seen investment in
features/optimizations, but I can easily imagine some!

With my Apache hat on, my main concern with the proposed contribution is
hearing opinions on how well it fits into SolrCloud / the codebase and the
resulting maintenance as code owners.  That might be hard for anyone to
comment on right now because you can't see it, but if there are preliminary
thoughts on what Ilan has written in this regard, then please share them.
If we don't ultimately contribute the aforementioned functionality, I
wonder if the developer community here might agree with the replica type
enum moving to an actual abstraction (i.e. an interface) to allow for the
possibility of other replica types for Solr hackers like us, and perhaps
it'd improve SolrCloud code in so doing.
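
For instance, something along these lines, purely illustrative and not a
concrete proposal (Replica.Type is the real enum today; the interface, the
enum and their methods below are all made up):

// Hypothetical: replica behavior as an extension point instead of the
// Replica.Type enum, so new types like a shared-storage replica can plug in.
enum RecoverySource { PEER_REPLICA, SHARED_STORAGE } // hypothetical

interface ReplicaBehavior {
  // Short name used in cluster state and APIs, e.g. "nrt", "tlog", "pull", "shared".
  String typeName();

  // Whether this replica type keeps a local transaction log.
  boolean usesLocalTransactionLog();

  // Whether this replica type can be elected shard leader.
  boolean leaderEligible();

  // Where this replica recovers its index from: a peer replica or shared storage.
  RecoverySource recoverySource();
}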

A more concrete concern I have is how to limit dependencies.  The
contribution adds AWS S3 dependencies to solr-core but that's not at all
okay in Apache Solr.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Apr 28, 2023 at 1:33 PM Ilan Ginzburg <il...@gmail.com> wrote:

> Hi,
>
> This is a long message, apologies. If responses are positive, there will
> likely be plenty of other opportunities to discuss the topics mentioned
> here.
>
> I'm trying to see if the community would be interested in a contribution
> allowing SolrCloud nodes to be totally stateless with persistent storage
> done in S3 (or similar).
>
> Salesforce has been working for a while on separating compute from storage
> in SolrCloud,  see presentation at Activate 2019 SolrCloud in Public Cloud:
> Scaling Compute Independently from Storage <https://youtu.be/6fE5KvOfb6A>.
> In a nutshell, the idea is that all SolrCloud nodes are stateless, have a
> local disk cache of the cores they're hosting but no persistent volumes (no
> persistent indexes nor transaction logs), and shard level persistence is
> done on S3.
>
> This is different from a classic node remote storage:
> - No "per node" transaction log (therefore no need to wait for down nodes
> to come back up to recover their transaction log data)
> - A single per shard copy on S3 (not a copy per replica)
> - Local non persistent node volumes (SSD's for example) are a cache and are
> used to serve queries and do indexing (therefore local cache is not limited
> by memory like it is in HdfsDirectory for example but by disk size)
> - When adding a node and replicas, the index content for the replicas is
> fetched from S3, not from other (possibly overloaded already) nodes
>
> The current state of our code is:
> - It's a fork (it was shared as a branch for a while but eventually
> removed)
> - We introduced a new replica type that is explicitly based on remote Blob
> storage (and otherwise similar to TLOG) and that guards against concurrent
> writes to S3 (when shard leadership changes in SolrCloud, there can be an
> overlap period with two nodes acting as leaders which can cause corruption
> with a naive implementation of remote storage at shard level)
> - Currently we only push constructed segments to S3 which forces committing
> every batch before acknowledging it to the client (we can't rely on the
> transaction log on a node given it would be lost upon node restart, the
> node having no persistent storage).
>
> We're considering improving this approach by making the transaction log a
> shard level abstraction (rather than a replica/node abstraction), and store
> it in S3 as well with a transaction log per shard, not per replica.
> This would allow indexing to not commit on every batch, speed up /update
> requests, push the constructed segments asynchronously to S3, guarantee
> data durability while still allowing nodes to be stateless (so can be shut
> down at any time in any number without data loss and without having to
> restart these nodes to recover data only they can access).
>
> If there is interest in such a contribution to SolrCloud then we might
> carry the next phase of the work upstream (initially in a branch, with the
> intention to eventually merge).
>
> If there is no interest, it's easier for us to keep working on our
> fork that allows taking shortcuts and ignoring features of SolrCloud we do
> not use (for example replacing existing transaction log with a shard level
> transaction log rather than maintaining both).
>
> Feedback welcome! I can present the idea in more detail in a future
> committer/contributor meeting if there's interest.
>
> Thanks,
> Ilan
>