Rework locking architecture for MVCC and transactional SQL


Vladimir Ozerov
Igniters,

As you probably know, we are actively working on the MVCC [1] and transactional SQL
[2] features, which could be treated as a single huge improvement. We face a
number of challenges, and one of them is locking.

At the moment information about all locks is kept in memory on a per-entry
basis (see GridCacheMvccManager). For every locked key we maintain the current
lock owner (XID) and the list of would-be-owner transactions. When a
transaction is about to lock an entry, two scenarios are possible:
1) If the entry is not locked, we obtain the lock immediately.
2) If the entry is locked, we add the current transaction to the wait list and jump
to the next entry to be locked. Once the first entry is released by the
conflicting transaction, the current transaction becomes the owner of the first
entry and tries to promote itself for subsequent entries.

Once all required locks are obtained, a response is sent to the caller.
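
To make the cost concrete, here is a deliberately simplified, hypothetical sketch of
this heap-based bookkeeping. The real logic lives in GridCacheMvccManager and is
considerably more involved; all names below are made up for illustration.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    class EntryLockState {
        private UUID ownerXid;                                   // current owner, null if free
        private final Deque<UUID> waiters = new ArrayDeque<>();  // would-be-owner transactions

        synchronized boolean tryLock(UUID xid) {
            if (ownerXid == null || ownerXid.equals(xid)) {
                ownerXid = xid;                                  // case 1: not locked - lock immediately
                return true;
            }
            waiters.addLast(xid);                                // case 2: enqueue and move to the next entry
            return false;
        }

        synchronized UUID unlock() {
            ownerXid = waiters.pollFirst();                      // promote the next waiter, if any
            return ownerXid;
        }
    }

    class HeapLockManager {
        // One EntryLockState per locked key -> heap usage grows with the number of locked keys.
        private final ConcurrentHashMap<Object, EntryLockState> locks = new ConcurrentHashMap<>();

        boolean lock(Object key, UUID xid) {
            return locks.computeIfAbsent(key, k -> new EntryLockState()).tryLock(xid);
        }
    }

Every locked key needs such an object plus queue links and map-entry overhead, which
is exactly what makes very large transactions expensive.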

This approach doesn't work well for transactional SQL - if we update
millions of rows in a single transaction, we will simply run out of memory.
To mitigate the problem, other database vendors keep information about locks
inside the tuples. I propose to apply a similar design, as follows:

1) No per-entry lock information is stored in memory anymore.
2) The list of active transactions is still maintained in memory.
3) When a TX locks an entry, it sets a special marker in the tuple [3].
4) When a TX meets an already locked entry, it enlists itself in the wait queue of
the conflicting transaction and suspends.
5) When the first transaction releases the conflicting lock, it notifies and wakes
up the suspended transactions, so they resume locking.
6) Entry lock data is cleared on transaction commit.
7) Entry lock data is not cleared on rollback or node restart; instead, we
could use the active transactions list to identify invalid locks and
overwrite them as needed.
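
In pseudo-Java the intended flow could look roughly like this. All types below are
hypothetical placeholders, and the exact marker format depends on the final MVCC
design [3] - this is only a sketch of points 3-7 above.

    import java.util.UUID;

    interface DataRow { UUID lockXid(); void writeLockXid(UUID xid); }
    interface ActiveTxRegistry { boolean isActive(UUID xid); }
    interface WaitQueues { void enlist(UUID blockerXid, UUID waiterXid); }

    final class TupleLocking {
        /** Returns null if the lock was acquired, or the XID of the blocking tx to suspend on. */
        static UUID lockTuple(DataRow row, UUID myXid, ActiveTxRegistry activeTxs, WaitQueues waitQueues) {
            UUID lockXid = row.lockXid();                       // point 3: marker stored in the tuple

            if (lockXid == null || lockXid.equals(myXid) || !activeTxs.isActive(lockXid)) {
                // Free, already ours, or a stale marker left behind by rollback/restart
                // (point 7): simply overwrite it - no eager cleanup required.
                row.writeLockXid(myXid);
                return null;
            }

            // Point 4: enlist in the wait queue of the conflicting tx; the caller suspends
            // until that tx releases the lock and wakes it up (point 5).
            waitQueues.enlist(lockXid, myXid);
            return lockXid;
        }
    }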

Also, we could try employing a tiered approach:
1) Try to keep everything in memory to minimize writes to blocks.
2) Fall back to persistent lock data if a certain threshold is reached.
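
A minimal sketch of what such a tiered lock table could look like, assuming some
pluggable spill-over store (all names are hypothetical, and the size check is
intentionally naive):

    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical spill-over target: could be a dedicated data region, a temp file, etc.
    interface PersistentLockStore {
        void put(Object key, UUID xid);
        UUID get(Object key);
    }

    final class TieredLockTable {
        private final int maxInMemoryLocks;   // tier-1 threshold
        private final ConcurrentHashMap<Object, UUID> heapTier = new ConcurrentHashMap<>();
        private final PersistentLockStore diskTier;

        TieredLockTable(int maxInMemoryLocks, PersistentLockStore diskTier) {
            this.maxInMemoryLocks = maxInMemoryLocks;
            this.diskTier = diskTier;
        }

        void recordLock(Object key, UUID xid) {
            if (heapTier.size() < maxInMemoryLocks)
                heapTier.put(key, xid);       // 1) keep in memory, no block writes
            else
                diskTier.put(key, xid);       // 2) fall back to persistent lock data
        }

        UUID lockOwner(Object key) {
            UUID xid = heapTier.get(key);
            return xid != null ? xid : diskTier.get(key);
        }
    }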

Thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-3478
[2] https://issues.apache.org/jira/browse/IGNITE-4191
[3] Depends on final MVCC design - it could be per-tuple XID, undo vectors,
per-block transaction lists, etc..

Vladimir.

Re: Rework locking architecture for MVCC and transactional SQL

dmagda
Vladimir,

Thanks for a thorough overview and proposal.
 
> Also we could try employing tiered approach
> 1) Try to keep everything in-memory to minimize writes to blocks
> 2) Fallback to persistent lock data if certain threshold is reached.

What are the benefits of the backed-by-persistence approach compared to the one based on tuples? Specifically:
- will the persistence approach work for both 3rd party and Ignite persistence?
- any performance impacts depending on a chosen method?
- what’s faster to implement?


Denis

Re: Rework locking architecture for MVCC and transactional SQL

Vladimir Ozerov
Denis,

Sorry, maybe I was not clear enough - the "tuple approach" and the "persistent
approach" are the same thing. By "tuple" I mean a row stored inside a data block.
Currently we store lock information in the Java heap, and the proposal is to move it
to data blocks. The main driver is memory - if there are many rows to be
locked, we will either run out of memory or produce serious memory
pressure. For example, currently an update of 1M entries will consume ~500 MB
of heap. With the proposed approach it will consume almost nothing. The
drawback is an increased number of dirty data pages, but it should not be a
problem, because in the final implementation we will update data rows before
the prepare phase anyway, so I do not expect any write amplification in the usual
case.
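
(Back-of-envelope: ~500 MB for 1M locked entries works out to roughly 500 bytes of
heap bookkeeping per locked key - lock candidate object, wait-queue links and
map-entry overhead combined.)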

This approach is only applicable to Ignite persistence.

Re: Rework locking architecture for MVCC and transactional SQL

dmagda
Vladimir,

No, it’s crystal clear, thanks.

If this approach works only for Ignite-persistence-based deployments, how will we handle locking for the pure in-memory and 3rd-party-database caching scenarios? As I understand it, the tuples will still be stored in page memory, while there won’t be any opportunity to fall back to disk if memory usage exceeds some threshold.


Denis

Re: Rework locking architecture for MVCC and transactional SQL

Alexey Goncharuk
Vladimir,

What about moving the entire locking mechanism to a separate off-heap
memory region which is volatile with respect to restarts, but still supports
off-load to disk? In the current architecture this means that we will need
to allocate a separate DataRegion with no WAL and no crash recovery - locks
are meaningless after a restart, so we will automatically drop them. It
would be interesting to prototype this, because I think we may be on par
with on-heap lock placement, as we already proved for in-memory caches.
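
For illustration only, here is a rough sketch of what such a region could look like
with the existing configuration API, assuming a swap-backed, non-persistent data
region approximates "volatile with disk off-load". The region name and sizes are
made up, and the actual lock storage would need dedicated wiring on top of it.

    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    final class LockRegionConfigSketch {
        static IgniteConfiguration configure() {
            // Volatile wrt restarts: no WAL, no checkpoints, contents dropped on node start.
            DataRegionConfiguration lockRegion = new DataRegionConfiguration()
                .setName("txLocks")                           // hypothetical region name
                .setPersistenceEnabled(false)
                .setInitialSize(256L * 1024 * 1024)
                .setMaxSize(1024L * 1024 * 1024)
                .setSwapPath("/tmp/ignite/tx-locks-swap");    // off-load to disk without WAL logging

            return new IgniteConfiguration().setDataStorageConfiguration(
                new DataStorageConfiguration().setDataRegionConfigurations(lockRegion));
        }
    }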

Re: Rework locking architecture for MVCC and transactional SQL

Vladimir Ozerov
Alex,

That might be a very good idea. In fact, what you describe closely resembles
TempDB in MS SQL Server [1]. It is also offloaded to disk, minimally logged,
and purged on restart. Ideally we could try designing this component in a
generic way, so that it could store a lot of different temporary data:
1) Locks
2) UNDO data
3) Sort/join data (for SELECT and CREATE INDEX, statistics, whatsoever)
4) If needed - visibility info (e.g. for covering indexes and purge/vacuum)
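
Roughly, I imagine an interface along these lines - purely a sketch, all names and
signatures are illustrative, and the Kind values mirror the list above:

    import java.nio.ByteBuffer;
    import java.util.UUID;

    // Hypothetical shape of a generic temporary store (in the spirit of TempDB):
    // off-loadable to disk, minimally logged, fully purged on restart.
    interface TemporaryStore extends AutoCloseable {
        enum Kind { LOCKS, UNDO, SORT_JOIN, VISIBILITY }

        /** Appends a record of the given kind on behalf of a transaction; may spill to disk. */
        long append(Kind kind, UUID xid, ByteBuffer payload);

        /** Reads a previously appended record back (e.g. undo data or a wait-queue entry). */
        ByteBuffer read(Kind kind, long recordId);

        /** Discards everything owned by a transaction, e.g. on commit or rollback. */
        void purgeTx(UUID xid);

        /** On node restart the whole store is simply dropped - no crash recovery. */
        @Override void close();
    }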

WDYT?

Vladimir.

[1]
https://docs.microsoft.com/en-us/sql/relational-databases/databases/tempdb-database

Re: Rework locking architecture for MVCC and transactional SQL

Vladimir Ozerov
Well, there is a completely different approach to handle the problem - lock
escalation. E.g. SQL Server typically requires 96 bytes per lock. Much less
than Ignite, but still. And as soon as some query requires more than 5,000
row locks, it is escalated to an exclusive table lock. Not only does this
eliminate the memory consumption problem, it also increases the performance of
massive updates dramatically.
On the other hand, this approach is more prone to deadlocks, since
transactions updating disjoint data sets now have shared resources to be
locked.
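
For illustration, a hypothetical sketch of the escalation bookkeeping - the 5,000
threshold mirrors the SQL Server default mentioned above, everything else is made up:

    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    final class LockEscalation {
        private static final int ESCALATION_THRESHOLD = 5_000;   // SQL Server-like default

        // (xid, tableId) -> number of row locks currently held by that tx on that table.
        private final ConcurrentHashMap<String, AtomicInteger> rowLockCounts = new ConcurrentHashMap<>();

        /** Returns true when the caller should drop its row locks and take an exclusive table lock. */
        boolean onRowLockAcquired(UUID xid, int tableId) {
            String key = xid + ":" + tableId;
            int held = rowLockCounts.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();

            if (held < ESCALATION_THRESHOLD)
                return false;

            rowLockCounts.remove(key);   // row-level bookkeeping is discarded...
            return true;                 // ...and replaced by one coarse table lock
        }
    }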

There is no silver bullet, apparently.

Re: Rework locking architecture for MVCC and transactional SQL

Alexey Goncharuk
That would be great, actually. We have a lot of data structures whose size
heavily depends on the topology size and the number of caches, and all of those
data structures are placed on-heap. I am looking forward to moving that stuff to a
limited memory region with disk offload.

The DataRegion and disk offload are fairly easy to implement - we have
almost all the mechanics in the current page memory implementation. We only
need to get rid of the copy-on-write and checkpointing logic and alter the page
eviction mechanics a little bit.

Another side note - we could also look into range lock mechanics. It will be a
greater effort to implement them, but I think it is worth it.
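
To show what I mean by range locks, a minimal illustration of a key-interval lock
record and its conflict check - purely a sketch, a real implementation would have to
be index-aware and handle lock modes, gaps, and so on:

    final class RangeLock {
        final long lower;        // inclusive lower key bound
        final long upper;        // inclusive upper key bound
        final boolean exclusive; // shared (S) vs exclusive (X) mode

        RangeLock(long lower, long upper, boolean exclusive) {
            this.lower = lower;
            this.upper = upper;
            this.exclusive = exclusive;
        }

        /** Two range locks conflict if their intervals overlap and at least one of them is exclusive. */
        boolean conflictsWith(RangeLock other) {
            boolean overlap = lower <= other.upper && other.lower <= upper;
            return overlap && (exclusive || other.exclusive);
        }
    }

One such record can guard many rows at once, which is the memory win, at the cost of
coarser conflict detection.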
