Igniters,
As you probably know, we are actively working on MVCC [1] and transactional SQL [2] features, which can be treated as a single huge improvement. We face a number of challenges, and one of them is locking.

At the moment, information about all locks is kept in memory on a per-entry basis (see GridCacheMvccManager). For every locked key we maintain the current lock owner (XID) and the list of would-be-owner transactions. When a transaction is about to lock an entry, two scenarios are possible:
1) If the entry is not locked, we obtain the lock immediately.
2) If the entry is locked, we add the current transaction to the wait list and jump to the next entry to be locked. Once the first entry is released by the conflicting transaction, the current transaction becomes the owner of the first entry and tries to promote itself for subsequent entries.

Once all required locks are obtained, a response is sent to the caller.

This approach doesn't work well for transactional SQL - if we update millions of rows in a single transaction, we will simply run out of memory. To mitigate the problem, other database vendors keep information about locks inside the tuples. I propose to apply a similar design, as follows (a rough sketch follows below):

1) No per-entry lock information is stored in memory anymore.
2) The list of active transactions is still maintained in memory.
3) When a TX locks an entry, it sets a special marker in the tuple [3].
4) When a TX meets an already locked entry, it enlists itself in the wait queue of the conflicting transaction and suspends.
5) When the first transaction releases the conflicting lock, it notifies and wakes up the suspended transactions, so they resume locking.
6) Entry lock data is cleared on transaction commit.
7) Entry lock data is not cleared on rollback or node restart; instead, we could use the active transactions list to identify invalid locks and overwrite them as needed.

Also, we could try employing a tiered approach:
1) Try to keep everything in memory to minimize writes to blocks.
2) Fall back to persistent lock data if a certain threshold is reached.

Thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-3478
[2] https://issues.apache.org/jira/browse/IGNITE-4191
[3] Depends on the final MVCC design - it could be a per-tuple XID, undo vectors, per-block transaction lists, etc.

Vladimir.
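A minimal sketch of how steps 1-7 above could fit together, assuming hypothetical names (TupleLockSketch, TxContext, Tuple) that do not correspond to actual Ignite internals; a real implementation would mark rows inside data blocks rather than Java objects, but the control flow would be similar:

import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class TupleLockSketch {
    /** Step 2: the list of active transactions is still kept in memory. */
    private final Map<UUID, TxContext> activeTxs = new ConcurrentHashMap<>();

    /** Step 4: transactions suspended while waiting for a lock owner to finish. */
    private final Map<UUID, Queue<TxContext>> waiters = new ConcurrentHashMap<>();

    /** Per-transaction state; 'resume' is used to suspend and wake the locking thread. */
    public static class TxContext {
        final UUID xid = UUID.randomUUID();
        volatile CountDownLatch resume;
    }

    /** Stand-in for a row in a data block; the lock marker lives in the tuple itself (step 3). */
    public static class Tuple {
        final AtomicReference<UUID> lockXid = new AtomicReference<>();
        Object value;
    }

    public void begin(TxContext tx) {
        activeTxs.put(tx.xid, tx);
    }

    /** Locks a single tuple on behalf of the given transaction. */
    public void lock(Tuple tuple, TxContext tx) throws InterruptedException {
        while (true) {
            UUID owner = tuple.lockXid.get();

            // Step 7: a marker left by a rolled-back or crashed transaction is treated
            // as invalid (its XID is not in the active list) and simply overwritten.
            if (owner == null || !activeTxs.containsKey(owner)) {
                if (tuple.lockXid.compareAndSet(owner, tx.xid))
                    return; // step 3: lock marker written into the tuple
                continue;   // lost the race - re-read the marker
            }

            // Step 4: enlist in the conflicting transaction's wait queue and suspend.
            tx.resume = new CountDownLatch(1);
            waiters.computeIfAbsent(owner, k -> new ConcurrentLinkedQueue<>()).add(tx);

            if (activeTxs.containsKey(owner))
                tx.resume.await(); // suspended until the owner commits or rolls back
            // Step 5: woken up (or the owner is already gone) - retry the lock.
        }
    }

    /** Step 6: clear lock markers on commit and wake up waiters (step 5). */
    public void commit(TxContext tx, Iterable<Tuple> lockedTuples) {
        for (Tuple t : lockedTuples)
            t.lockXid.compareAndSet(tx.xid, null);

        activeTxs.remove(tx.xid); // must happen before draining waiters (see lock())

        Queue<TxContext> q = waiters.remove(tx.xid);
        if (q != null)
            q.forEach(w -> w.resume.countDown());
    }
}

The key property is that per-lock state lives in the tuple itself, so heap usage scales with the number of waiting transactions rather than with the number of locked rows.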
Vladimir,
Thanks for a thorough overview and proposal.

> Also, we could try employing a tiered approach:
> 1) Try to keep everything in memory to minimize writes to blocks.
> 2) Fall back to persistent lock data if a certain threshold is reached.

What are the benefits of the backed-by-persistence approach compared to the one based on tuples? Specifically:
- will the persistence approach work for both 3rd party and Ignite persistence?
- any performance impacts depending on the chosen method?
- what's faster to implement?

—
Denis
Denis,
Sorry, maybe I was not clear enough - the "tuple approach" and the "persistent approach" are the same thing. By "tuple" I mean a row stored inside a data block. Currently we store lock information in the Java heap, and the proposal is to move it to data blocks.

The main driver is memory - if there are many rows to be locked, we will either run out of memory or produce serious memory pressure. For example, currently an update of 1M entries will consume ~500 MB of heap. With the proposed approach it will consume almost nothing. The drawback is an increased number of dirty data pages, but that should not be a problem, because in the final implementation we will update data rows before the prepare phase anyway, so I do not expect any write amplification in the usual case.

This approach is only applicable to Ignite persistence.
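To put the figure in perspective (taking the ~500 MB for 1M entries estimate at face value), that is roughly 500 bytes of heap per locked key spent on lock candidates and bookkeeping, whereas a per-tuple marker such as an XID occupies only a handful of bytes inside a data page that is allocated anyway.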
Vladimir,
No, it's crystal clear, thanks.

If this approach works only for Ignite-persistence-based deployments, how will we handle locking for the pure in-memory and 3rd party database caching scenarios? As I understand it, the tuples will still be stored in page memory, while there won't be any opportunity to fall back to disk if memory usage exceeds some threshold.

—
Denis
Vladimir,
What about moving the entire locking mechanism to a separate off-heap memory region which is volatile with respect to restarts, but still supports offload to disk? In the current architecture, this means we would need to allocate a separate DataRegion with no WAL and no crash recovery - locks are meaningless after a restart, so we would automatically drop them. It would be interesting to prototype this, because I think we may be on par with on-heap lock placement, as we already proved for in-memory caches.
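For illustration only, the closest public analogy to such a region is a non-persistent data region with a swap path; a minimal configuration sketch, assuming the region name, sizes, and path are arbitrary and that an actual lock region would be internal rather than user-configured:

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class VolatileLockRegionSketch {
    public static IgniteConfiguration configure() {
        // Illustrative only: a region that does not survive restarts (no WAL,
        // no checkpoints) but can still spill pages to a swap file on disk.
        DataRegionConfiguration lockRegion = new DataRegionConfiguration()
            .setName("tx-lock-region")             // hypothetical region name
            .setPersistenceEnabled(false)          // no WAL / crash recovery
            .setInitialSize(256L * 1024 * 1024)
            .setMaxSize(1024L * 1024 * 1024)
            .setSwapPath("/tmp/ignite/lock-swap"); // offload to disk past the memory limit

        return new IgniteConfiguration()
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDataRegionConfigurations(lockRegion));
    }
}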
Alex,
That might be a very good idea. In fact, what you describe closely resembles TempDB in MS SQL Server [1]: it is also offloaded to disk, minimally logged, and purged on restart. Ideally, we could try designing this component in a generic way, so that it could store a lot of different temporary data:
1) Locks
2) UNDO data
3) Sort/join data (for SELECT and CREATE INDEX, statistics, and so on)
4) If needed - visibility info (e.g. for covering indexes and purge/vacuum)

WDYT?

Vladimir.

[1] https://docs.microsoft.com/en-us/sql/relational-databases/databases/tempdb-database
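Just to make the idea concrete, a hypothetical interface sketch of such a shared component (every name here is made up; nothing like this exists in Ignite today):

/** Hypothetical contract for a generic temporary store; all names are illustrative. */
public interface TempStore {
    /** Kinds of transient data the component could host. */
    enum Kind { LOCKS, UNDO, SORT_JOIN, VISIBILITY }

    /** Allocates a transient segment; contents are minimally logged and dropped on restart. */
    long allocate(Kind kind, long initialSizeBytes);

    /** Appends a record to a segment, spilling to disk once an in-memory threshold is exceeded. */
    void append(long segmentId, byte[] record);

    /** Releases a segment, e.g. on transaction commit or query completion. */
    void release(long segmentId);
}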
Well, there is a completely different approach to handling the problem - lock escalation. E.g. SQL Server typically requires 96 bytes per lock - much less than Ignite, but still significant. As soon as a query requires more than 5000 row locks, it is escalated to an exclusive table lock. Not only does this eliminate the memory consumption problem, it also dramatically increases the performance of massive updates. On the other hand, this approach is more prone to deadlocks, since transactions updating disjoint data sets now have shared resources to lock.

There is no silver bullet, apparently.
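A hedged sketch of the escalation rule itself, with the threshold and all names purely illustrative (loosely following the SQL Server behaviour described above):

import java.util.HashSet;
import java.util.Set;

/** Illustrative per-transaction, per-table lock bookkeeping with escalation to a table lock. */
public class LockEscalationSketch {
    private static final int ESCALATION_THRESHOLD = 5_000; // row locks per table, as in SQL Server

    private final Set<Long> rowLocks = new HashSet<>();    // row ids locked within one table
    private boolean tableLocked;                           // exclusive table lock held?

    /** Requests a row lock; escalates once the per-table threshold is crossed. */
    public void lockRow(long rowId) {
        if (tableLocked)
            return; // the whole table is already locked - individual row locks are redundant

        rowLocks.add(rowId);

        if (rowLocks.size() > ESCALATION_THRESHOLD) {
            acquireExclusiveTableLock();
            rowLocks.clear(); // thousands of per-row entries collapse into a single table lock
            tableLocked = true;
        }
    }

    private void acquireExclusiveTableLock() {
        // Placeholder: a real engine would block here until other readers/writers of the
        // table drain, which is exactly where the extra deadlock risk comes from.
    }
}

The trade-off is visible in the sketch: once escalated, two transactions touching disjoint row sets of the same table contend on a single resource.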
That would be great, actually. We have a lot of data structures whose size heavily depends on the topology size and the number of caches, and all of those data structures are placed on-heap. I am looking forward to moving that stuff to a limited memory region with disk offload. The DataRegion and disk offload are fairly easy to implement - we have almost all the mechanics in the current page memory implementation. We only need to get rid of the copy-on-write and checkpointing logic and alter the page eviction mechanics a little bit.

Another side note - we could look into range lock mechanics. It will be a greater effort to implement them, but I think it is worth it.
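On the range-lock note, a minimal single-index sketch of the overlap check, purely illustrative (a real implementation would sit in the index layer, deal with next-key semantics, and use something more efficient than a linear scan):

import java.util.TreeMap;
import java.util.UUID;

/** Toy range-lock table for a single index: ranges are [lo, hi), keyed by their lower bound. */
public class RangeLockSketch {
    private static final class Range {
        final long lo, hi;
        final UUID owner;
        Range(long lo, long hi, UUID owner) { this.lo = lo; this.hi = hi; this.owner = owner; }
    }

    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    /** Tries to lock [lo, hi) for the given transaction; fails if another owner's range overlaps. */
    public synchronized boolean tryLockRange(long lo, long hi, UUID xid) {
        // Linear scan over ranges starting before 'hi' - kept simple for the sketch.
        for (Range r : ranges.headMap(hi, false).values()) {
            if (r.hi > lo && !r.owner.equals(xid))
                return false; // conflicting range held by another transaction
        }

        Range existing = ranges.get(lo);
        if (existing == null || existing.hi < hi)
            ranges.put(lo, new Range(lo, hi, xid)); // keep the widest range per lower bound

        return true;
    }

    /** Releases all ranges held by the transaction, e.g. on commit or rollback. */
    public synchronized void releaseAll(UUID xid) {
        ranges.values().removeIf(r -> r.owner.equals(xid));
    }
}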