MVCC version format

Vladimir Ozerov
Igniters,

As you probably know, we are working on the MVCC feature [1]. When MVCC is
enabled, every cache operation/tx requires one unique version for read and
possibly one version for write. Versions are assigned by a single node
called the *coordinator*.

Requirements for the version:
1) Must be increasing and comparable, so that we understand which operation
occurred earlier.
2) Support coordinator failover.
3) Be as compact as possible.

Currently we have implemented a naive approach where the version is
represented as two long values (16 bytes total):
1) Coordinator version (a kind of mix of a timestamp and topology version).
2) Local coordinator counter.

Other vendors typically assign 4-byte (int) or 8-byte (long) versions,
which is definitely better as they are faster to compare and require less
space, which can be especially important for small values, such as indexes.

I thought a bit and came to the conclusion that we could fit into a 64-bit
version (8 bytes) as follows (a small packing sketch follows the layout):
[0..47] - local coordinator counter; this gives a capacity of ~10^14, which
should be enough for any sensible workload
[48..60] - coordinator version, which is incremented on every coordinator
change (relatively rare) or local counter wraparound (very rare); this
gives 8192 versions
[61] - coordinator epoch
[62] - delete marker, set on entry removal, cleared on resurrection (e.g.
TX rollback, or index change A->B->A)
[63] - freeze marker
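
Just to make the layout concrete, here is a rough packing/unpacking sketch
(class and method names are mine, purely illustrative):

    // Illustrative sketch of the proposed 64-bit layout; names are
    // hypothetical and not taken from any Ignite code.
    final class MvccVersion64 {
        private static final long CNTR_MASK    = (1L << 48) - 1;         // bits 0..47
        private static final long CRD_VER_MASK = ((1L << 13) - 1) << 48; // bits 48..60
        private static final long EPOCH_BIT    = 1L << 61;
        private static final long DELETE_BIT   = 1L << 62;
        private static final long FREEZE_BIT   = 1L << 63;

        static long pack(long crdVer, long cntr, int epoch, boolean deleted, boolean frozen) {
            assert cntr >= 0 && cntr <= CNTR_MASK;   // up to ~2.8 * 10^14 values
            assert crdVer >= 0 && crdVer < 8192;     // 13 bits

            long v = cntr | (crdVer << 48);

            if (epoch == 1)
                v |= EPOCH_BIT;
            if (deleted)
                v |= DELETE_BIT;
            if (frozen)
                v |= FREEZE_BIT;

            return v;
        }

        static long counter(long v)    { return v & CNTR_MASK; }
        static long crdVersion(long v) { return (v & CRD_VER_MASK) >>> 48; }
        static int epoch(long v)       { return (v & EPOCH_BIT) != 0 ? 1 : 0; }
        static boolean deleted(long v) { return (v & DELETE_BIT) != 0; }
        static boolean frozen(long v)  { return (v & FREEZE_BIT) != 0; }
    }

Note that within one epoch the coordinator version sits above the counter, so
comparing bits 0..60 as a single number gives the right order directly, which
is the main attraction of the 8-byte format.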

Mechanics:
1) Every coordinator change increments the coordinator version and shares it
via discovery.
2) If persistence is enabled, the coordinator version is persisted and
possibly WAL-logged.
3) The epoch is switched between 0 and 1 on every coordinator version
wraparound (an extremely rare event).
4) The epoch is a cluster-wide property; every node knows the current epoch.
5) As the epoch switches, nodes start marking entries from the previous epoch
as "frozen". This is done in the background in a very light-weight mode, as
there is no rush.
6) Epoch 0 cannot be switched to 1 while there are non-frozen entries from
the previous epoch 1, and vice versa. This way version comparison is always
possible (see the comparison sketch after this list).
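
To illustrate why (6) keeps comparison well-defined, here is roughly how I
picture comparing two packed versions (again just a sketch of my
interpretation, reusing the helpers above; whether we even need fine-grained
ordering among frozen versions is an open question):

    // Sketch only: under invariant (6), at any moment all non-frozen
    // versions belong to the current epoch, so a frozen version can be
    // treated as older than any non-frozen one.
    static int compare(long v1, long v2) {
        boolean f1 = MvccVersion64.frozen(v1);
        boolean f2 = MvccVersion64.frozen(v2);

        if (f1 != f2)
            return f1 ? -1 : 1; // frozen => previous epoch => older

        // Both frozen or both current: coordinator version and counter are
        // monotonic within an epoch, compare them lexicographically.
        int cmp = Long.compare(MvccVersion64.crdVersion(v1), MvccVersion64.crdVersion(v2));

        return cmp != 0 ? cmp : Long.compare(MvccVersion64.counter(v1), MvccVersion64.counter(v2));
    }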

IMO this should do the trick so that we have an 8-byte version with virtually
zero runtime overhead, though a little distributed magic would be needed.

Thoughts?

P.S.: The idea is borrowed from PostgreSQL's counter wraparound handling.

[1] https://issues.apache.org/jira/browse/IGNITE-3478

Re: MVCC version format

Alexey Goncharuk
Vladimir,

Although at first glance this seems to be OK as a local solution, I have a
few concerns with respect to distributed version behavior:

1) I cannot say that an epoch switch in this approach will be an extremely
rare event. Let's say a user regularly deploys a new version of application
code to a large Ignite cluster and restarts Ignite nodes for such an
upgrade. If the version coordinator is not dedicated, then on average the
version coordinator will switch N/2 times during such a restart (N is the
size of the cluster); in the worst case the switch will happen N times. At
this rate, an epoch switch may occur maybe once a month, depending on
cluster size? The user would need to take care of the node restart order
(I put some rough numbers on this below).
2) What happens if a node goes offline, the epoch then switches twice, and
the node then gets back to the cluster (a classic ABA problem)? The only
solution I see here is to track epoch switches and permanently drop nodes
that were offline during an epoch switch. This may cause problems if a user
has partition loss policy = READ_WRITE_SAFE. Let's say a partition was
offline, but updates to other partitions were allowed. If an epoch switch
happens at that moment, the partition is lost forever.
3) How fast will we be able to flip the frozen bit for the whole dataset if
we are looking at terabytes per node? Given (1), we may be in trouble here.
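
To put rough numbers on (1), assuming the 13-bit coordinator version field
from the proposal, N = 2000 nodes and the N/2 estimate above (all figures
illustrative):

    8192 coordinator versions per epoch
    ~1000 coordinator switches per rolling restart (N/2, N = 2000)
    8192 / 1000 ≈ 8 rolling restarts per epoch switch

so with rolling upgrades every few days an epoch switch lands roughly once a
month or two, and proportionally more often on larger clusters or with more
frequent restarts.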


Re: MVCC version format

Vladimir Ozerov
Hi Alex,

Good points.

1) So far we've seen clusters up to ~2K nodes in size, and the typical size
is much smaller, so this doesn't appear to be a big problem in the common
case. But what if at some point we see, say, a 10,000-node deployment? In
that case the proposed scheme may virtually stop working at all :-) On the
other hand, frequent wraparounds typically mean that not that many changes
happened in between, so we should be able to cope with the freeze fast
enough. Also, big deployments would have to take coordinator location into
account anyway, because on a coordinator change we cannot start assigning
versions immediately: we need to collect the full list of active
transactions from other nodes first, which is painful on large topologies.
I.e. our current approach would require careful administration regardless of
whether we take the proposed approach or not.

2) I think we could simply mark such a partition as "definitely frozen",
because when the offline partition comes back any active transaction would
definitely see its data. Would that work?

3) This heavily depends on the final implementation. We should do that
constantly in the background and track progress at the per-partition and
possibly per-block (visibility map?) level. Given the large time frames
involved, we should be able to cope with it (a toy sketch below).
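
To make (3) a bit more concrete, a toy sketch of per-partition progress
tracking over in-memory arrays (names and structure are made up, nothing
like the real storage layer; it just shows the batched, resumable scan I
have in mind, reusing the hypothetical MvccVersion64 helpers from the first
message):

    // Hypothetical sketch, not Ignite code: freeze entries left over from
    // the previous epoch in the background, remembering per-partition
    // progress so the scan can be paced and resumed.
    final class BackgroundFreezer {
        private final long[][] parts;    // parts[p][i] = packed 64-bit version of entry i in partition p
        private final int[] frozenUpTo;  // per-partition index of the next entry to check

        BackgroundFreezer(long[][] parts) {
            this.parts = parts;
            this.frozenUpTo = new int[parts.length];
        }

        /** Freezes at most batchSize stale entries; returns true while work may remain. */
        boolean freezeBatch(int curEpoch, int batchSize) {
            int done = 0;

            for (int p = 0; p < parts.length; p++) {
                int i = frozenUpTo[p];

                for (; i < parts[p].length && done < batchSize; i++) {
                    long ver = parts[p][i];

                    if (MvccVersion64.epoch(ver) != curEpoch && !MvccVersion64.frozen(ver)) {
                        parts[p][i] = ver | (1L << 63); // set the freeze bit
                        done++;
                    }
                }

                frozenUpTo[p] = i;

                if (done >= batchSize)
                    return true; // budget exhausted, continue on the next background pass
            }

            return false; // nothing left from the previous epoch, the epoch is free to switch
        }
    }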
