Hello, everyone!
Currently, the property DataStorageConfiguration#maxWalArchiveSize does not work as users expect. We can easily go beyond this limit and fill up the disk, which leads to errors and a crash of the node. I propose to fix this behavior and not let the WAL archive overflow.

The suggestion is not to add a segment to the archive if doing so could exceed DataStorageConfiguration#maxWalArchiveSize, and instead to wait until space becomes available.

However, this may produce a deadlock:
Get checkpointReadLock -> write to WAL -> need to roll over WAL segment -> need to clean WAL archive -> need to complete checkpoint (impossible because checkpointReadLock is taken).

To avoid such situations, I suggest adding a custom heuristic: do not grant IgniteCacheDatabaseSharedManager#checkpointReadLock if only a few (default 1) segments are left.
But this will not let us completely avoid archive overflow situations. Therefore, I suggest failing the node via the failure handler (FH) when a deadlock is detected, since the outcome would be the same if there were no disk space left.
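To make the proposed heuristic concrete, here is a minimal sketch in plain Java. It is not Ignite code: the class WalArchiveGuard and its methods are hypothetical names, and it assumes that "segments left" means the number of whole WAL segments that still fit under maxWalArchiveSize.

```java
/**
 * Hypothetical helper illustrating the heuristic; not part of Apache Ignite.
 * Assumption: "segments left" = whole segments that still fit under the limit.
 */
public final class WalArchiveGuard {
    /** Archive size limit in bytes (the role played by maxWalArchiveSize). */
    private final long maxArchiveSize;

    /** Size of one WAL segment in bytes. */
    private final long segmentSize;

    /** Refuse the lock when this many segments or fewer are left (default 1 in the proposal). */
    private final int minFreeSegments;

    public WalArchiveGuard(long maxArchiveSize, long segmentSize, int minFreeSegments) {
        if (maxArchiveSize <= 0 || segmentSize <= 0 || minFreeSegments < 0)
            throw new IllegalArgumentException("Sizes must be positive, threshold non-negative");

        this.maxArchiveSize = maxArchiveSize;
        this.segmentSize = segmentSize;
        this.minFreeSegments = minFreeSegments;
    }

    /** Returns how many whole segments can still be archived without exceeding the limit. */
    public long freeSegments(long currentArchiveSize) {
        long freeBytes = maxArchiveSize - currentArchiveSize;

        return freeBytes <= 0 ? 0 : freeBytes / segmentSize;
    }

    /** Returns true if a new checkpoint read lock may be granted. */
    public boolean mayGrantReadLock(long currentArchiveSize) {
        return freeSegments(currentArchiveSize) > minFreeSegments;
    }
}
```

A caller would check mayGrantReadLock(...) before handing out the lock and make the requesting thread wait otherwise. As noted above, this only narrows the window for the deadlock and does not close it.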
Hi,
Is this related to the issue seen in IGNITE-13912? I hit IGNITE-13912 when I was using the Ignite 2.9 release. I have yet to try my use case with the fix provided as part of IGNITE-13912.

Regards,
Vishwas

In reply to the message from Kirill Tkalenko of Tue, 26 Jan 2021, 21:18.
In reply to this post by Kirill Tkalenko
Hello!
This is unclear to me: nothing you have described explains why the node works improperly or why the FH could fail this node. Can you explain?
As for me, the correct approach is to trigger a checkpoint when we get too close to the WAL archive size limit. The main purpose of this mechanism is to provide durability, so we should think not about failing the node or voluntarily deleting data, but about preventing possible data loss.

In reply to the message from Zhenya Stanilovsky of Tue, 26 Jan 2021, 19:13.

--
Sincerely yours, Ivan Daschinskiy
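The idea of triggering a checkpoint near the limit could look roughly like the sketch below. This is only an illustration in plain Java, not Ignite code: the class CheckpointTrigger, the percentage threshold, and the callback names are all assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical sketch: request a checkpoint once the WAL archive
 * crosses a share of its size limit. Not part of Apache Ignite.
 */
public final class CheckpointTrigger {
    /** Archive size in bytes at which a checkpoint is requested. */
    private final long thresholdBytes;

    /** Action that starts a checkpoint (supplied by the caller). */
    private final Runnable checkpoint;

    /** Set while a requested checkpoint has not finished yet. */
    private final AtomicBoolean requested = new AtomicBoolean();

    public CheckpointTrigger(long maxArchiveSize, int thresholdPercent, Runnable checkpoint) {
        if (maxArchiveSize <= 0 || thresholdPercent < 1 || thresholdPercent > 100)
            throw new IllegalArgumentException("Invalid limit or threshold");

        // Divide first to avoid overflow for very large limits.
        this.thresholdBytes = maxArchiveSize / 100 * thresholdPercent;
        this.checkpoint = checkpoint;
    }

    /**
     * Called after a segment is archived.
     * Returns true if this call requested a checkpoint.
     */
    public boolean onSegmentArchived(long currentArchiveSize) {
        if (currentArchiveSize >= thresholdBytes && requested.compareAndSet(false, true)) {
            checkpoint.run();

            return true;
        }

        return false;
    }

    /** Called when the checkpoint completes, so that a later one can be requested. */
    public void onCheckpointFinished() {
        requested.set(false);
    }
}
```

The flag keeps repeated archive events from requesting a new checkpoint while one is already pending. Whether an early checkpoint frees archive space in time depends on how quickly old segments can then be released, which this sketch does not model.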
In reply to this post by vbm
Hi!
No, we basically have a problem with the growth of the WAL archive.

In reply to the message from Vishwas Bm of 26.01.2021, 19:06.