Hi!
Alexey, your proposal looks great. Can I ask you some questions?

1. Are the nodes that take part in the metastorage replication group (Raft candidates and leader) also expected to bear cache data and participate in cache transactions? It seems quite dangerous to me to mix these roles. For example, heavy user load can cause long GC pauses on the leader of the replication group and therefore its failure, a new leader election, etc.

2. If the previous statement is true, another question arises: if one of the candidates or the leader fails, how will a replacement node be chosen from the regular nodes to form a full ensemble? A random one?

3. Do you think this metastorage implementation can be pluggable? It could be implemented on top of etcd, for example.

Thu, 22 Oct 2020 at 13:04, Alexey Goncharuk <[hidden email]>:

> Hello Yakov,
>
> Glad to see you back!
>
> > Hi!
> > I am back!
> >
> > Here are several ideas on top of my mind for Ignite 3.0
> >
> > 1. Client nodes should take the config from servers. Basically it should
> > be enough to provide some cluster identifier or any known IP address to
> > start a client.
>
> This totally makes sense and should be covered by the distributed
> metastorage approach described in [1]. A client can read and watch updates
> for the cluster configuration and run solely based on that config.
>
> > 2. Thread per partition. Again. I strongly recommend taking a look at how
> > Scylla DB operates. I think this is the best distributed database
> > threading model and can be a perfect fit for Ignite. The main idea is the
> > "share nothing" approach - they keep thread switches to the necessary
> > minimum: messages reading or updating data are processed within the
> > thread that reads them (of course, the sender should properly route the
> > message, i.e. send it to the correct socket). Blocking operations such as
> > fsync happen outside of the worker (shard) threads.
> > This will require splitting indexes per partition, which should be quite
> > OK for most of the use cases in my view. Edge cases with high-selectivity
> > indexes selecting 1-2 rows for a value can be sped up with hash indexes
> > or even with a secondary index backed by another cache.
>
> Generally agree, and again this is what we will naturally have when we
> implement any log-based replication protocol [1]. However, I think it makes
> sense to separate the operation replication, which can be done in one
> thread, from the actual command execution. The command execution can be
> done asynchronously, thus reducing the latency of any operation to a single
> log append + fsync.
>
> > 3. Replicate physical updates instead of logical ones. This will simplify
> > the logic running on backups to zero: read a page sent by the primary
> > node and apply it locally. Most probably this change will require the
> > pure thread-per-partition model described above.
>
> Not sure about this; I think this approach has the following disadvantages:
> * We will not be able to replicate a single page update, as such an update
> usually leaves the storage in a corrupted state. Therefore, we will still
> have to group pages into batches which must be applied atomically, thus
> complicating the protocol.
> * Physical replication will result in significant network traffic
> amplification, especially for cases with large inline indexes. The same
> goes for EntryProcessors - we will have to replicate huge values, while
> only a small operation modifying a large object could have been replicated
> instead.
> * Physical replication complicates the road for Ignite to support rolling
> upgrades in the future. If we choose to change the local storage format, we
> will have to somehow convert the new binary format to the old one at
> replication time when sending from new to old nodes, and additionally
> support forward conversion on new nodes if an old node is a replication
> group leader.
> * Finally, this approach closes the road to having different storage
> formats on different nodes (for example, having one of the nodes in the
> replication group keep data in a columnar format, as is done in TiDB). This
> would allow us to route analytical queries to separate dedicated nodes
> without affecting the performance properties of the whole replication
> group.
>
> > 4. Merge (all?) network components (communication, disco, REST, etc?) and
> > listen on one port.
>
> Makes sense. There is no clear design for this though; we have not even
> discussed it in detail. I am not even sure we need separate discovery and
> communication services, as they are closely dependent on each other; on the
> other hand, a lot of discovery functions are moved to the distributed
> metastore.
>
> > 5. Revisit transaction concurrency and isolation settings. Currently some
> > of the combinations do not work as users may expect them to, and some
> > look just weird.
>
> Agree, it looks like we can cut more than half of the transaction modes
> without sacrificing functionality at all. However, similarly to the single
> port point, we have not yet discussed how exactly the transactional
> protocol will look, so this is an open question. Once I put my thoughts
> together, I will create an IEP for this (unless somebody does it earlier,
> of course).
>
> --AG
>
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-61%3A+Common+Replication+Infrastructure

--
Sincerely yours, Ivan Daschinskiy
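To make the thread-per-partition discussion above a bit more concrete, here is a minimal Java sketch of a share-nothing striped executor. It is purely illustrative and not Ignite code: all class and method names are hypothetical. It only shows the idea of routing every operation on a partition to the single thread that owns it, while blocking work such as fsync stays off the shard threads.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/**
 * Minimal sketch of a share-nothing, thread-per-partition executor.
 * Every partition is owned by exactly one single-threaded executor, so
 * partition-local state needs no locking; blocking work (e.g. fsync) is
 * pushed to a separate pool. Names and structure are illustrative only.
 */
public class PartitionedExecutor {
    private final ExecutorService[] stripes;
    private final ExecutorService ioPool = Executors.newCachedThreadPool();

    public PartitionedExecutor(int partitions) {
        stripes = new ExecutorService[partitions];
        for (int i = 0; i < partitions; i++)
            stripes[i] = Executors.newSingleThreadExecutor();
    }

    /** Routes an operation to the single thread that owns the key's partition. */
    public <T> CompletableFuture<T> submit(Object key, Supplier<T> op) {
        int part = Math.floorMod(key.hashCode(), stripes.length);
        return CompletableFuture.supplyAsync(op, stripes[part]);
    }

    /** Blocking work such as fsync is kept off the partition (shard) threads. */
    public CompletableFuture<Void> submitBlocking(Runnable io) {
        return CompletableFuture.runAsync(io, ioPool);
    }
}
```

A caller would then chain, for example, `submit(key, () -> applyUpdate(key, value))` with `submitBlocking(wal::fsync)`, so the partition thread never blocks on disk I/O (again, `applyUpdate` and `wal` are placeholders).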
Hello Ivan,
Thanks for the feedback, see my comments inline:

Thu, 22 Oct 2020 at 17:59, Ivan Daschinsky <[hidden email]>:

> Hi!
> Alexey, your proposal looks great. Can I ask you some questions?
>
> 1. Are the nodes that take part in the metastorage replication group (Raft
> candidates and leader) also expected to bear cache data and participate in
> cache transactions? It seems quite dangerous to mix these roles. For
> example, heavy user load can cause long GC pauses on the leader of the
> replication group and therefore its failure, a new leader election, etc.

I think both ways should be possible. The set of nodes that hold
metastorage should be defined declaratively in runtime, as well as the set
of nodes holding table data. Thus, by default, they will be mixed, which
will significantly simplify cluster setup and usability, but when needed,
this can be easily adjusted in runtime by the cluster administrator.

> 2. If the previous statement is true, another question arises: if one of
> the candidates or the leader fails, how will a replacement node be chosen
> from the regular nodes to form a full ensemble? A random one?

Similarly - by default, a 'best' node will be chosen from the available
ones, but the administrator can override this.

> 3. Do you think this metastorage implementation can be pluggable? It could
> be implemented on top of etcd, for example.

I think the metastorage abstraction must be clearly separated so that it is
possible to change the implementation. Moreover, I was thinking that we may
use etcd to speed up the development of other system components while we
are working on our own protocol implementation. However, I do not think we
should expose it as a pluggable public API. If we want to support etcd as a
metastorage - let's do this as a concrete configuration option, a
first-class citizen of the system, rather than an SPI implementation with a
rigid interface.

WDYT?
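As an illustration of the "clearly separated metastorage abstraction" point, a minimal interface might look like the sketch below. This is a hypothetical design, not an actual Ignite 3.0 API: it only captures the operations the discussion relies on (reads, writes, conditional updates and revision-based watches), so that either a built-in Raft-based implementation or an etcd-backed one could sit behind it as a configuration option.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/**
 * Hypothetical minimal metastorage abstraction: a replicated, watchable
 * key-value store with monotonically growing revisions. Both a built-in
 * Raft-based implementation and an etcd-backed one could implement it;
 * which one is used would be a concrete configuration option, not a
 * public SPI.
 */
public interface MetaStorage {
    /** Returns the value and revision for the key, or a completed null if absent. */
    CompletableFuture<Entry> get(String key);

    /** Writes the value and returns the revision assigned to the update. */
    CompletableFuture<Long> put(String key, byte[] value);

    /** Conditional update, usable as a building block for locks, leases and leader election. */
    CompletableFuture<Boolean> compareAndPut(String key, long expectedRevision, byte[] value);

    /** Invokes the listener for every update of keys under the prefix, starting from the given revision. */
    AutoCloseable watch(String keyPrefix, long fromRevision, Consumer<Entry> listener);

    /** A key-value pair together with the revision that produced it. */
    record Entry(String key, byte[] value, long revision) {}
}
```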
Hi!
Alexey,

> If we want to support etcd as a metastorage - let's do this as a concrete
> configuration option, a first-class citizen of the system rather than an
> SPI implementation with a rigid interface.

On one side, this is quite reasonable. But on the other side, if someone
wants to adopt, for example, Apache Zookeeper or some other proprietary
external lock service, we could provide basic interfaces to do the job.

> Thus, by default, they will be mixed, which will significantly simplify
> cluster setup and usability.

According to the Raft spec, the leader processes all requests from clients,
and the leader's response latency is crucial for the stability of the whole
cluster. Cluster setup simplicity is a matter of documentation, scripts and
so on - starting Kafka, for instance, is quite easy.

Also, if we use the mixed approach, a service discovery protocol has to be
implemented. This is necessary because we first need to discover nodes in
order to choose a finite subset of them for the Raft ensemble. For example,
Consul by HashiCorp uses a gossip protocol to do the job (nodes
participating in Raft are called servers) [1].

If we use the separated approach, we could use the service discovery
pattern that is common for ZooKeeper or etcd: a data node creates a record
with a TTL and keeps renewing it (the EPHEMERAL node approach in ZooKeeper),
while the other data nodes watch for new records.

Some words about PacificA: the article [2] is just a brief description of
the ideas. Alexey, is there any formal specification of this protocol,
preferably in TLA+?

[1] https://www.consul.io/docs/architecture/gossip
[2] https://www.microsoft.com/en-us/research/wp-content/uploads/2008/02/tr-2008-25.pdf

--
Sincerely yours, Ivan Daschinskiy
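For the separated approach Ivan describes, the TTL/ephemeral-record discovery pattern could be sketched roughly as below. This builds on the hypothetical MetaStorage interface from the earlier sketch (assumed to be in the same package) and assumes the store, or a separate sweeper, expires records that are not renewed within the TTL - the way etcd leases or ZooKeeper ephemeral nodes would. Key layout and timings are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of TTL-based discovery on top of the hypothetical MetaStorage
 * interface: each data node periodically re-publishes a record under a
 * well-known prefix, and records that stop being renewed are treated as
 * departed nodes (expiry itself is assumed to be handled by the store).
 */
public class TtlDiscovery implements AutoCloseable {
    private static final String PREFIX = "/discovery/nodes/";

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final MetaStorage metaStorage;
    private final String nodeId;

    public TtlDiscovery(MetaStorage metaStorage, String nodeId) {
        this.metaStorage = metaStorage;
        this.nodeId = nodeId;
    }

    /** Registers this node and renews its record at a fraction of the TTL. */
    public void start(long ttlMillis, byte[] nodeInfo) {
        Runnable renew = () -> metaStorage.put(PREFIX + nodeId, nodeInfo);
        scheduler.scheduleAtFixedRate(renew, 0, Math.max(1, ttlMillis / 3), TimeUnit.MILLISECONDS);
    }

    /** Watches the prefix so other nodes learn about joining or updated members. */
    public AutoCloseable watchMembers() {
        return metaStorage.watch(PREFIX, 0, entry ->
            System.out.println("Discovered/updated node: " + entry.key()
                + " -> " + new String(entry.value(), StandardCharsets.UTF_8)));
    }

    @Override public void close() {
        scheduler.shutdownNow();
    }
}
```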
In reply to this post by Alexey Goncharuk
Alexey,
Thanks for the details! The common replication infra suggestion looks great!

Agree with your points regarding per-page replication, but I still have a feeling that this protocol can be made compact enough, e.g. by sending only deltas. As for entry processors, we can decide what to send: if the serialized processor is smaller, we can send it instead of the delta. This may sound like an overcomplication, but it is still worth measuring.

Regards,
Yakov
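Yakov's "send whichever encoding is smaller" idea fits in a few lines; the sketch below is a toy illustration with placeholder types, assuming the caller has already produced both the delta and the serialized entry processor.

```java
/**
 * Toy illustration of choosing the more compact replication payload:
 * replicate the serialized entry processor if it is smaller than the
 * resulting delta, otherwise replicate the delta. All names are
 * placeholders, not part of any real replication protocol.
 */
public final class ReplicationPayloads {
    /** What the replica should apply. */
    public enum Kind { DELTA, PROCESSOR }

    /** The chosen encoding together with its kind. */
    public record Payload(Kind kind, byte[] bytes) {}

    /** Picks the smaller of the two encodings produced by the caller. */
    public static Payload choose(byte[] delta, byte[] serializedProcessor) {
        return serializedProcessor.length < delta.length
            ? new Payload(Kind.PROCESSOR, serializedProcessor)
            : new Payload(Kind.DELTA, delta);
    }

    private ReplicationPayloads() {
        // Utility class; no instances.
    }
}
```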