Posted to dev@ignite.apache.org by Yakov Zhdanov <ya...@gmail.com> on 2020/10/10 23:36:26 UTC

Re: Apache Ignite 3.0

Hi!
I am back!

Here are several ideas on top of my mind for Ignite 3.0
1. Client nodes should take the config from servers. Basically it should be
enough to provide some cluster identifier or any known IP address to start
a client.

2. Thread per partition. Again. I strongly recommend taking a look at how
ScyllaDB operates. I think it has the best distributed database threading
model and it can be a perfect fit for Ignite. The main idea is the "share
nothing" approach - they keep thread switches to the necessary minimum:
messages reading or updating data are processed within the thread that reads
them (of course, the sender should properly route the message, i.e. send it
to the correct socket). Blocking operations such as fsync happen outside of
the worker (shard) threads.
This will require splitting indexes per partition, which should be quite OK
for most use cases in my view. Edge cases with high-selectivity indexes
selecting 1-2 rows for a value can be sped up with hash indexes or even by a
secondary index backed by another cache.

3. Replicate physical updates instead of logical ones. This will reduce the
logic running on backups to zero: read a page sent by the primary node and
apply it locally. Most probably this change will require the pure
thread-per-partition model described above.

4. Merge (all?) network components (communication, disco, REST, etc.?) and
listen on a single port.

5. Revisit transaction concurrency and isolation settings. Currently some
of the combinations do not work as users may expect, and some look just
weird.

Ready to discuss the above and other points.

Thanks!
Yakov

Re: Apache Ignite 3.0

Posted by Yakov Zhdanov <ya...@gmail.com>.
Hey Valentin!

Any design docs/wiki pages for points 1, 4 and 5 so far?

Yakov Zhdanov

Re: Apache Ignite 3.0

Posted by Valentin Kulichenko <va...@gmail.com>.
Hi Yakov,

Great to have you here! :) Thanks for the input. My comments are below.

*Alexey*, could you please comment on points 3 and 4?

-Val

On Sat, Oct 10, 2020 at 4:36 PM Yakov Zhdanov <ya...@gmail.com>
wrote:

> Hi!
> I am back!
>
> Here are several ideas on top of my mind for Ignite 3.0
> 1. Client nodes should take the config from servers. Basically it should be
> enough to provide some cluster identifier or any known IP address to start
> a client.
>

[Val] We will have a meta-store that will store the cluster-wide
configuration (among other things). Whether it's a server or a client, any
new node will be able to fetch the required information from there. In
practice, it will work how you describe - just provide an IP address and
you are good to go.
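For reference, the existing 2.x thin client already comes close to this on the
connection side - a single known address is all the configuration a client
needs, and the rest is negotiated with the cluster. A minimal sketch (the
address, port and cache name below are placeholders):

import org.apache.ignite.Ignition;
import org.apache.ignite.client.ClientCache;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class ThinClientSketch {
    public static void main(String[] args) throws Exception {
        // The only piece of configuration is a known address of the cluster.
        ClientConfiguration cfg = new ClientConfiguration().setAddresses("127.0.0.1:10800");

        try (IgniteClient client = Ignition.startClient(cfg)) {
            // Everything else is obtained from the cluster after the connection.
            ClientCache<Integer, String> cache = client.getOrCreateCache("my-cache");

            cache.put(1, "one");
            System.out.println(cache.get(1));
        }
    }
}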


>
> 2. Thread per partition. Again. I strongly recommend taking a look at how
> Scylla DB operates. I think this is the best distributed database threading
> model and can be a perfect fit for Ignite. Main idea is in "share nothing"
> approach - they keep thread switches to the necessary minimum - messages
> reading or updating data are processed within the thread that reads them
> (of course, the sender should properly route the message, i.e. send it to
> the correct socket). Blocking operations such as fsync happen outside of
> worker (shard) threads.
> This will require to split indexes per partition which should be quite ok
> for most of the use cases in my view. Edge cases with high-selectivity
> indexes selecting 1-2 rows for a value can be sped up with hash indexes or
> even by secondary index backed by another cache.
>
> 3. Replicate physical updates instead of logical. This will simplify logic
> running on backups to zero. Read a page sent by the primary node and apply
> it locally. Most probably this change will require pure thread per
> partition described above.


> 4. Merge (all?) network components (communication, disco, REST, etc?) and
> listen to one port.
>

[Val] This is in the plans, at least for communication and discovery. We'll
see how it goes for the other components.


>
> 5. Revisit transaction concurrency and isolation settings. Currently some
> of the combinations do not work as users may expect them to and some look
> just weird.
>

[Val] Also planned.


>
> Ready to discuss the above and other points.
>
> Thanks!
> Yakov
>

Re: Apache Ignite 3.0

Posted by Ivan Daschinsky <iv...@gmail.com>.
Hi!
Alexey,
> If we want to support etcd as a metastorage - let's do this as a concrete
configuration option, a
> first-class citizen of the system rather than an SPI implementation with
a rigid interface.
On the one hand, this is quite reasonable. But on the other hand, if someone
wants to adopt, for example, Apache ZooKeeper or some other proprietary
external lock service, we could provide basic interfaces to do the job.

> Thus, by default, they will be mixed which will significantly simplify
cluster setup and usability.
According to the Raft spec, the leader processes all requests from clients.
The leader's response latency is crucial for the stability of the whole
cluster.
Cluster setup simplicity is a matter of documentation, scripts and so on;
e.g. starting Kafka is quite easy.

Also, if we use the mixed approach, a service discovery protocol will have to
be implemented. This is necessary because we first need to discover the nodes
in order to choose a finite subset of them for the Raft ensemble.
For example, Consul by HashiCorp uses a gossip protocol to do the job (nodes
participating in Raft are called servers) [1].

If we use the separated approach, we could use the service discovery pattern
that is common for ZooKeeper or etcd: a data node creates a record with a TTL
and keeps renewing it (the EPHEMERAL node approach for ZooKeeper), while the
other data nodes watch for new records.
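A rough sketch of that ephemeral-node pattern with the plain ZooKeeper client
(the connect string and znode paths are illustrative, parent znodes are
assumed to exist, and a real implementation would wait for the session to be
established before registering):

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDiscoverySketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });

        // A data node registers itself with an ephemeral znode: the record
        // disappears automatically when the node's session dies.
        zk.create("/ignite/nodes/node-1",
            "host:port".getBytes(StandardCharsets.UTF_8),
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL);

        // Other nodes list the registered members and set a watch to be
        // notified about joins and failures.
        List<String> members = zk.getChildren("/ignite/nodes",
            event -> System.out.println("Membership changed: " + event));

        System.out.println("Current members: " + members);
    }
}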

Some words about PacificA: the article [2] gives just a brief description of
the ideas. Alexey, is there any formal specification of this protocol,
preferably in TLA+?


[1] -- https://www.consul.io/docs/architecture/gossip
[2] --
https://www.microsoft.com/en-us/research/wp-content/uploads/2008/02/tr-2008-25.pdf




Fri, Oct 23, 2020 at 13:05, Alexey Goncharuk <al...@gmail.com>:

> Hello Ivan,
>
> Thanks for the feedback, see my comments inline:
>
> Thu, Oct 22, 2020 at 17:59, Ivan Daschinsky <iv...@gmail.com>:
>
> > Hi!
> > Alexey, your proposal looks great. Can I ask you some questions?
> > 1. Is nodes, that take part of metastorage replication group (raft
> > candidates and leader) are expected to also bear cache data and
> participate
> > in cache transactions?
> >    As for me, it seems quite dangerous to mix roles. For example, heavy
> > load from users can cause long GC pauses on leader of replication group
> and
> > therefore failure, new leader election, etc.
> >
> I think both ways should be possible. The set of nodes that hold
> metastorage should be defined declaratively in runtime, as well as the set
> of nodes holding table data. Thus, by default, they will be mixed which
> will significantly simplify cluster setup and usability, but when needed,
> this should be easily adjusted in runtime by the cluster administrator.
>
>
> > 2. If previous statement is true, other question arises. If one of
> > candidates or leader fails, how will a insufficient node will be chosen
> > from regular nodes to form full ensemble? Random one?
> >
> Similarly - by default, a 'best' node will be chosen from the available
> ones, but the administrator can override this.
>
>
> > 3. Do you think, that this metastorage implementation can be pluggable?
> it
> > can be implemented on top of etcd, for example.
>
> I think the metastorage abstraction must be clearly separated so it is
> possible to change the implementation. Moreover, I was thinking that we may
> use etcd to speed up the development of other system components while we
> are working on our own protocol implementation. However, I do not think we
> should expose it as a pluggable public API. If we want to support etcd as a
> metastorage - let's do this as a concrete configuration option, a
> first-class citizen of the system rather than an SPI implementation with a
> rigid interface.
>
> WDYT?
>


-- 
Sincerely yours, Ivan Daschinskiy

Re: Apache Ignite 3.0

Posted by Alexey Goncharuk <al...@gmail.com>.
Hello Ivan,

Thanks for the feedback, see my comments inline:

Thu, Oct 22, 2020 at 17:59, Ivan Daschinsky <iv...@gmail.com>:

> Hi!
> Alexey, your proposal looks great. Can I ask you some questions?
> 1. Is nodes, that take part of metastorage replication group (raft
> candidates and leader) are expected to also bear cache data and participate
> in cache transactions?
>    As for me, it seems quite dangerous to mix roles. For example, heavy
> load from users can cause long GC pauses on leader of replication group and
> therefore failure, new leader election, etc.
>
I think both ways should be possible. The set of nodes that hold the
metastorage should be defined declaratively at runtime, as well as the set
of nodes holding table data. Thus, by default, they will be mixed, which
will significantly simplify cluster setup and usability, but when needed,
this should be easy for the cluster administrator to adjust at runtime.


> 2. If previous statement is true, other question arises. If one of
> candidates or leader fails, how will a insufficient node will be chosen
> from regular nodes to form full ensemble? Random one?
>
Similarly - by default, a 'best' node will be chosen from the available
ones, but the administrator can override this.


> 3. Do you think, that this metastorage implementation can be pluggable? it
> can be implemented on top of etcd, for example.

I think the metastorage abstraction must be clearly separated so it is
possible to change the implementation. Moreover, I was thinking that we may
use etcd to speed up the development of other system components while we
are working on our own protocol implementation. However, I do not think we
should expose it as a pluggable public API. If we want to support etcd as a
metastorage - let's do this as a concrete configuration option, a
first-class citizen of the system rather than an SPI implementation with a
rigid interface.

WDYT?
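A purely hypothetical illustration of the separation described above - a
narrow internal metastorage interface that the rest of the system codes
against, with the consensus-based and etcd-backed implementations selected by
configuration rather than exposed as a public SPI (all names below are made
up):

import java.util.concurrent.CompletableFuture;

// Hypothetical internal seam, not an actual Ignite interface.
interface MetaStorage {
    CompletableFuture<byte[]> get(String key);

    CompletableFuture<Void> put(String key, byte[] value);

    // Watch a key prefix and get notified on every update.
    AutoCloseable watch(String prefix, MetaStorageListener listener);
}

interface MetaStorageListener {
    void onUpdate(String key, byte[] newValue);
}

// Concrete implementations (e.g. a built-in Raft-based one, or an etcd-backed
// one used while the native protocol is under development) would be picked by
// cluster configuration, not plugged in by users.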

Re: Apache Ignite 3.0

Posted by Ivan Daschinsky <iv...@gmail.com>.
Hi!
Alexey, your proposal looks great. Can I ask you some questions?
1. Are the nodes that take part in the metastorage replication group (Raft
candidates and the leader) also expected to bear cache data and participate
in cache transactions?
   As for me, it seems quite dangerous to mix roles. For example, heavy
load from users can cause long GC pauses on the leader of the replication
group and therefore its failure, a new leader election, etc.

2. If the previous statement is true, another question arises. If one of the
candidates or the leader fails, how will a replacement node be chosen from
the regular nodes to form a full ensemble? A random one?
3. Do you think that this metastorage implementation can be pluggable? It
could be implemented on top of etcd, for example.


Thu, Oct 22, 2020 at 13:04, Alexey Goncharuk <al...@gmail.com>:

> Hello Yakov,
>
> Glad to see you back!
>
> Hi!
> > I am back!
> >
> > Here are several ideas on top of my mind for Ignite 3.0
> > 1. Client nodes should take the config from servers. Basically it should
> be
> > enough to provide some cluster identifier or any known IP address to
> start
> > a client.
> >
> This totally makes sense and should be covered by the distributed
> metastorage approach described in [1]. A client can read and watch updates
> for the cluster configuration and run solely based on that config.
>
>
> > 2. Thread per partition. Again. I strongly recommend taking a look at how
> > Scylla DB operates. I think this is the best distributed database
> threading
> > model and can be a perfect fit for Ignite. Main idea is in "share
> nothing"
> > approach - they keep thread switches to the necessary minimum - messages
> > reading or updating data are processed within the thread that reads them
> > (of course, the sender should properly route the message, i.e. send it to
> > the correct socket). Blocking operations such as fsync happen outside of
> > worker (shard) threads.
> > This will require to split indexes per partition which should be quite ok
> > for most of the use cases in my view. Edge cases with high-selectivity
> > indexes selecting 1-2 rows for a value can be sped up with hash indexes
> or
> > even by secondary index backed by another cache.
> >
> Generally agree, and again this is what we will naturally have when we
> implement any log-based replication protocol [1]. However, I think it makes
> sense to separate the operation replication, which can be done in one
> thread, and actual command execution. The command execution can be done
> asynchronously thus reducing the latency of any operation to a single log
> append + fsync.
>
>
> > 3. Replicate physical updates instead of logical. This will simplify
> logic
> > running on backups to zero. Read a page sent by the primary node and
> apply
> > it locally. Most probably this change will require pure thread per
> > partition described above.
> >
> Not sure about this, I think this approach has the following disadvantages:
>  * We will not be able to replicate a single page update as such an update
> usually leaves storage in a corrupted state. Therefore, we will still have
> to group pages into batches which must be applied atomically, thus
> complicating the protocol
>  * Physical replication will result in significant network traffic
> amplification, especially for cases with large inline indexes. The same
> goes for EntryProcessors - we will have to replicate huge values while a
> small operation modifying a large object could have been replicated
>  * Physical replication complicates a road for Ignite to support a rolling
> upgrade in the future. If we choose to change the local storage format, we
> will have to somehow convert a new binary format to the old at replication
> time when sending from new to old nodes, and additionally support forward
> conversion on new nodes if an old node is a replication group leader
>  * Finally, this approach closes the road to having different storage
> formats on different nodes (for example, have one of the nodes in the
> replication group keep data in columnar format, as it is done in TiDB).
> This allows us to route analytical queries to separate dedicated nodes
> without affecting the performance properties of the whole replication
> group.
>
>
> > 4. Merge (all?) network components (communication, disco, REST, etc?) and
> > listen to one port.
> >
> Makes sense. There is no clear design for this though, we did not even
> discuss this in detail. I am not even sure we need separate discovery and
> communication services, as they are very dependent; on the other hand, a
> lot of discovery functions are moved to the distributed metastore.
>
>
>
> > 5. Revisit transaction concurrency and isolation settings. Currently some
> > of the combinations do not work as users may expect them to and some look
> > just weird.
> >
> Agree, looks like we can cut more than half of the transaction modes
> without sacrificing functionality at all. However, similarly to the single
> port point, we did not discuss how exactly the transactional protocol will
> look like yet, so this is an open question. Once I put my thoughts
> together, I will create an IEP for this (unless somebody does it earlier,
> of course).
>
> --AG
>
> [1]
>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-61%3A+Common+Replication+Infrastructure
>


-- 
Sincerely yours, Ivan Daschinskiy

Re: Apache Ignite 3.0

Posted by Yakov Zhdanov <ya...@gmail.com>.
Alexey,
Thanks for the details!

The common replication infra suggestion looks great!
I agree with your points regarding per-page replication, but I still have a
feeling that this protocol can be made compact enough, e.g. by sending only
deltas. As for entry processors, we can decide what to send: if the
serialized processor is smaller, we can send it instead of the delta. This
may sound like an overcomplication, but it is still worth measuring.
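To make the entry-processor part concrete, here is what such a small logical
operation looks like with the existing cache invoke API (the UserProfile type,
its field sizes and the cache name are made up for illustration). The
serialized processor is a few bytes, while the value it touches - or the pages
it dirties - can be orders of magnitude larger:

import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheEntryProcessor;

// A tiny logical operation applied to a potentially large value.
class IncrementVisits implements CacheEntryProcessor<Integer, UserProfile, Void> {
    @Override
    public Void process(MutableEntry<Integer, UserProfile> entry, Object... args)
            throws EntryProcessorException {
        UserProfile profile = entry.getValue();     // possibly a large object
        if (profile == null)
            profile = new UserProfile();

        profile.visits++;                           // the actual change is one field
        entry.setValue(profile);                    // marks the entry as updated
        return null;
    }
}

class UserProfile implements java.io.Serializable {
    long visits;
    byte[] avatar = new byte[64 * 1024];            // large payload, untouched by the op
}

class EntryProcessorUsage {
    static void run(Ignite ignite) {
        IgniteCache<Integer, UserProfile> profiles = ignite.getOrCreateCache("profiles");
        profiles.invoke(42, new IncrementVisits());
    }
}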

Regards,
Yakov

Re: Apache Ignite 3.0

Posted by Alexey Goncharuk <al...@gmail.com>.
Hello Yakov,

Glad to see you back!

Hi!
> I am back!
>
> Here are several ideas on top of my mind for Ignite 3.0
> 1. Client nodes should take the config from servers. Basically it should be
> enough to provide some cluster identifier or any known IP address to start
> a client.
>
This totally makes sense and should be covered by the distributed
metastorage approach described in [1]. A client can read and watch updates
for the cluster configuration and run solely based on that config.


> 2. Thread per partition. Again. I strongly recommend taking a look at how
> Scylla DB operates. I think this is the best distributed database threading
> model and can be a perfect fit for Ignite. Main idea is in "share nothing"
> approach - they keep thread switches to the necessary minimum - messages
> reading or updating data are processed within the thread that reads them
> (of course, the sender should properly route the message, i.e. send it to
> the correct socket). Blocking operations such as fsync happen outside of
> worker (shard) threads.
> This will require to split indexes per partition which should be quite ok
> for most of the use cases in my view. Edge cases with high-selectivity
> indexes selecting 1-2 rows for a value can be sped up with hash indexes or
> even by secondary index backed by another cache.
>
Generally agree, and again this is what we will naturally get when we
implement any log-based replication protocol [1]. However, I think it makes
sense to separate operation replication, which can be done in one thread,
from the actual command execution. Command execution can be done
asynchronously, thus reducing the latency of any operation to a single log
append + fsync.
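A simplified sketch of that split, assuming a per-partition log file and a
single-threaded apply worker; the names and structure are illustrative, not
the actual IEP-61 design:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: the replication thread appends and fsyncs the command,
// acknowledges it, and hands actual execution to a partition worker.
class PartitionReplica implements AutoCloseable {
    private final FileChannel log;
    private final ExecutorService applyWorker = Executors.newSingleThreadExecutor();

    PartitionReplica(Path logFile) throws IOException {
        log = FileChannel.open(logFile,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    /** Called from the single replication thread of this partition. */
    void replicate(byte[] command) throws IOException {
        log.write(ByteBuffer.wrap(command));
        log.force(false);   // operation latency = a single log append + fsync

        // The command is durable and can be acknowledged at this point; the
        // state machine applies it asynchronously, off the replication thread.
        applyWorker.execute(() -> applyToStateMachine(command));
    }

    private void applyToStateMachine(byte[] command) {
        // Deserialize and execute the command against partition data (omitted).
    }

    @Override public void close() throws IOException {
        applyWorker.shutdown();
        log.close();
    }
}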


> 3. Replicate physical updates instead of logical. This will simplify logic
> running on backups to zero. Read a page sent by the primary node and apply
> it locally. Most probably this change will require pure thread per
> partition described above.
>
Not sure about this one - I think this approach has the following disadvantages:
 * We will not be able to replicate a single page update, as such an update
usually leaves the storage in a corrupted state. Therefore, we will still
have to group pages into batches that must be applied atomically, thus
complicating the protocol.
 * Physical replication will result in significant network traffic
amplification, especially for cases with large inline indexes. The same
goes for EntryProcessors: we would have to replicate huge values, while with
logical replication a small operation modifying a large object could have
been replicated instead.
 * Physical replication complicates the road for Ignite to support rolling
upgrades in the future. If we choose to change the local storage format, we
will have to somehow convert the new binary format to the old one at
replication time when sending from new to old nodes, and additionally support
forward conversion on new nodes if an old node is the replication group leader.
 * Finally, this approach closes the road to having different storage
formats on different nodes (for example, having one of the nodes in the
replication group keep data in a columnar format, as is done in TiDB). Such
flexibility would allow us to route analytical queries to separate dedicated
nodes without affecting the performance properties of the whole replication group.
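As a purely illustrative comparison of the two granularities (neither type
exists in Ignite; the field layout and page size are arbitrary):

// Logical replication: ship the operation itself.
final class LogicalUpdateRecord implements java.io.Serializable {
    final int cacheId;
    final byte[] key;         // affected key
    final byte[] operation;   // e.g. a serialized EntryProcessor or the new field value

    LogicalUpdateRecord(int cacheId, byte[] key, byte[] operation) {
        this.cacheId = cacheId;
        this.key = key;
        this.operation = operation;
    }
}

// Physical replication: ship every page the operation dirtied (data pages,
// index pages, free-list pages, ...), grouped into a batch that has to be
// applied atomically on the backup.
final class PhysicalPageBatch implements java.io.Serializable {
    static final int PAGE_SIZE = 4096;   // arbitrary page size for the example

    final long[] pageIds;
    final byte[][] pageContents;         // PAGE_SIZE bytes per dirtied page

    PhysicalPageBatch(long[] pageIds, byte[][] pageContents) {
        this.pageIds = pageIds;
        this.pageContents = pageContents;
    }
}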


> 4. Merge (all?) network components (communication, disco, REST, etc?) and
> listen to one port.
>
Makes sense. There is no clear design for this yet though - we have not even
discussed it in detail. I am not even sure we need separate discovery and
communication services, as they are closely dependent on each other; on the
other hand, a lot of discovery functions are being moved to the distributed
metastore.
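A toy sketch of what a single listening port could look like, with a one-byte
frame header used to route payloads to the different subsystems; this is
entirely hypothetical, not a proposed wire format:

import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical single-port listener: the first byte of every frame tells
// which subsystem the payload belongs to.
class SinglePortListener {
    static final byte DISCOVERY = 1, COMMUNICATION = 2, REST = 3;

    void listen(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                Socket socket = server.accept();
                new Thread(() -> handle(socket)).start();
            }
        }
    }

    private void handle(Socket socket) {
        try (DataInputStream in = new DataInputStream(socket.getInputStream())) {
            while (true) {
                byte kind = in.readByte();
                int length = in.readInt();
                byte[] payload = new byte[length];
                in.readFully(payload);

                switch (kind) {
                    case DISCOVERY:     /* hand off to membership logic */ break;
                    case COMMUNICATION: /* hand off to messaging logic */  break;
                    case REST:          /* hand off to the REST handler */ break;
                    default: throw new IOException("Unknown frame kind: " + kind);
                }
            }
        }
        catch (IOException ignored) {
            // Connection closed or broken; simply drop it in this toy example.
        }
    }
}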



> 5. Revisit transaction concurrency and isolation settings. Currently some
> of the combinations do not work as users may expect them to and some look
> just weird.
>
Agree, it looks like we can cut more than half of the transaction modes
without sacrificing functionality at all. However, similarly to the single-port
point, we have not yet discussed how exactly the transactional protocol will
look, so this is an open question. Once I put my thoughts together, I will
create an IEP for this (unless somebody does it earlier, of course).
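For context, this is the combination matrix in question: the current 2.x API
lets every transaction pick one of two concurrency modes and three isolation
levels, as in the sketch below (the cache name and amounts are placeholders):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

class TxModesExample {
    static void transfer(Ignite ignite) {
        CacheConfiguration<Integer, Double> ccfg =
            new CacheConfiguration<Integer, Double>("accounts")
                .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);

        IgniteCache<Integer, Double> accounts = ignite.getOrCreateCache(ccfg);
        accounts.putIfAbsent(1, 1_000d);
        accounts.putIfAbsent(2, 1_000d);

        // 2 concurrency modes x 3 isolation levels = 6 combinations today,
        // and not all of them behave the way users tend to expect.
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
            double from = accounts.get(1);
            double to = accounts.get(2);

            accounts.put(1, from - 100);
            accounts.put(2, to + 100);

            tx.commit();
        }
    }
}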

--AG

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-61%3A+Common+Replication+Infrastructure