Apache Ignite Developers - Legacy Mail Archive

Hard limit WAL archive size

Kirill Tkalenko

Hard limit WAL archive size

Hello, everyone!

Currently, the DataStorageConfiguration#maxWalArchiveSize property does not work the way users expect. The WAL archive can easily grow beyond this limit and fill the disk, which leads to errors and a node crash. I propose fixing this behavior so that the WAL archive cannot overflow.

The suggestion is not to add a segment to the archive if doing so would exceed DataStorageConfiguration#maxWalArchiveSize, and instead to wait until space becomes available.
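
The rule above can be sketched in isolation as follows. This is a minimal model, not Ignite code: the class `BoundedWalArchive` and all of its methods are hypothetical names, and segment sizes are plain numbers rather than files. It only shows the shape of "wait until space becomes available" next to a non-blocking variant.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a size-capped WAL archive; not part of Ignite. */
final class BoundedWalArchive {
    /** Stands in for DataStorageConfiguration#maxWalArchiveSize (assumed to be in bytes). */
    private final long maxArchiveSize;

    /** Sizes of the archived segments, oldest first. */
    private final Deque<Long> segments = new ArrayDeque<>();

    private long totalSize;

    BoundedWalArchive(long maxArchiveSize) {
        this.maxArchiveSize = maxArchiveSize;
    }

    /**
     * Blocks until the segment fits under the limit, then archives it.
     * A segment larger than the whole limit would wait forever in this sketch.
     */
    synchronized void archive(long segmentSize) throws InterruptedException {
        while (totalSize + segmentSize > maxArchiveSize)
            wait(); // Woken up by removeOldest().

        segments.addLast(segmentSize);
        totalSize += segmentSize;
    }

    /** Non-blocking variant: returns false instead of waiting. */
    synchronized boolean tryArchive(long segmentSize) {
        if (totalSize + segmentSize > maxArchiveSize)
            return false;

        segments.addLast(segmentSize);
        totalSize += segmentSize;

        return true;
    }

    /** Frees space by dropping the oldest segment; returns false if the archive is empty. */
    synchronized boolean removeOldest() {
        Long size = segments.pollFirst();

        if (size == null)
            return false;

        totalSize -= size;
        notifyAll(); // Let waiting archive() callers re-check the limit.

        return true;
    }

    synchronized long totalSize() {
        return totalSize;
    }
}
```

With a limit of 128 and segments of size 64, two calls to `tryArchive(64)` succeed and a third is refused until `removeOldest()` frees space. The blocking `archive` call is the part that creates the waiting discussed next.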

However, this can lead to a deadlock:
Acquire checkpointReadLock -> write to WAL -> need to roll over the WAL segment -> need to clean the WAL archive -> need to complete a checkpoint (impossible because checkpointReadLock is held).
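
The last step of this chain can be illustrated with a plain `java.util.concurrent` read-write lock. This is only a model under an assumption: that a checkpoint needs the write side of the same lock whose read side the WAL writer holds. The class and method names (`CheckpointLockCycle`, `checkpointCanStart`) are hypothetical, and a timed `tryLock` replaces a real indefinite wait so that the example terminates.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical illustration of the lock cycle; not Ignite code. */
final class CheckpointLockCycle {
    /**
     * Models a writer thread (the caller) and a checkpointer thread.
     *
     * @param writerHoldsReadLock whether the writer keeps the read lock while it waits
     * @param timeoutMs how long the checkpointer tries to take the write lock
     * @return true if the checkpointer got the write lock, i.e. a checkpoint could start
     */
    static boolean checkpointCanStart(boolean writerHoldsReadLock, long timeoutMs) {
        ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();
        boolean[] started = {false};

        if (writerHoldsReadLock)
            checkpointLock.readLock().lock(); // "Acquire checkpointReadLock".

        try {
            Thread checkpointer = new Thread(() -> {
                try {
                    // The checkpoint needs the write lock to begin.
                    if (checkpointLock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                        started[0] = true;
                        checkpointLock.writeLock().unlock();
                    }
                }
                catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            checkpointer.start();

            // The writer waits for the checkpoint (which would free archive space)
            // without releasing its read lock.
            checkpointer.join();
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        finally {
            if (writerHoldsReadLock)
                checkpointLock.readLock().unlock();
        }

        return started[0];
    }
}
```

In this model `checkpointCanStart(true, 100)` reports that the checkpoint never starts, while `checkpointCanStart(false, 100)` reports that it does. With an untimed wait in place of the timeout, the first case would hang, which is the deadlock described above.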

To avoid such situations, I suggest adding a heuristic: do not grant IgniteCacheDatabaseSharedManager#checkpointReadLock if only a few (default 1) segments are left.
This will not let us avoid archive overflow situations completely, however. Therefore, I suggest failing the node through the failure handler (FH) when a deadlock is detected, since the outcome would be the same if the disk ran out of space.
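
One possible reading of the heuristic, as a sketch only: "segments left" is taken here to mean how many more fixed-size segments fit under the archive limit, which is an assumption and not something the proposal spells out. `CheckpointReadLockGate` and its methods are hypothetical names, not Ignite API.

```java
/** Hypothetical gate in front of the checkpoint read lock; not Ignite code. */
final class CheckpointReadLockGate {
    /** The "few (default 1)" threshold from the proposal. */
    private final int minSegmentsLeft;

    CheckpointReadLockGate(int minSegmentsLeft) {
        if (minSegmentsLeft < 0)
            throw new IllegalArgumentException("minSegmentsLeft must be >= 0");

        this.minSegmentsLeft = minSegmentsLeft;
    }

    /** How many more segments of the given size fit under the limit (never negative). */
    static long segmentsLeft(long maxArchiveSize, long archiveSize, long segmentSize) {
        if (segmentSize <= 0)
            throw new IllegalArgumentException("segmentSize must be > 0");

        return Math.max(0, (maxArchiveSize - archiveSize) / segmentSize);
    }

    /** The lock is granted only while more than the threshold number of segments is left. */
    boolean mayGrantReadLock(long maxArchiveSize, long archiveSize, long segmentSize) {
        return segmentsLeft(maxArchiveSize, archiveSize, segmentSize) > minSegmentsLeft;
    }
}
```

With a limit of 640, a segment size of 64 and the default threshold of 1, the lock is granted at an archive size of 512 (two segments left) and refused at 576 (one segment left). A caller that is refused would have to wait or back off, and that waiting is where the remaining overflow and deadlock cases come from.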

vbm

Re: Hard limit WAL archive size

Hi,

Is this related to the issue seen in IGNITE-13912?

I hit IGNITE-13912 when I was using the Ignite 2.9 release.
I have yet to try my use case with the fix provided as part of IGNITE-13912.

Regards,
Vishwas

In reply to the post by Kirill Tkalenko <[hidden email]> of Tue, 26 Jan 2021, 21:18.

Zhenya Stanilovsky

Re: Hard limit WAL archive size

In reply to this post by Kirill Tkalenko

Hello!
This is unclear to me: nothing you described explains why the node works improperly or why the FH could fail this node. Can you explain?

Ivan Daschinsky

Re: Hard limit WAL archive size

As for me, the correct approach is to trigger a checkpoint when we get too close
to the WAL archive size limit.
The main purpose of this mechanism is to provide durability, so we
should think not about failing the node or deliberately deleting data,
but about preventing possible data loss.
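
A minimal sketch of such a trigger, assuming "too close" is expressed as a fraction of the configured limit. The name `WalArchiveWatermark` and the `highWatermark` parameter are hypothetical; the thread does not define a concrete threshold.

```java
/** Hypothetical high-watermark check for the WAL archive; not Ignite code. */
final class WalArchiveWatermark {
    /**
     * @param archiveSize current WAL archive size
     * @param maxArchiveSize configured limit (stands in for maxWalArchiveSize)
     * @param highWatermark fraction of the limit, in (0, 1], at which a checkpoint is requested
     * @return true if a checkpoint should be triggered now
     */
    static boolean shouldTriggerCheckpoint(long archiveSize, long maxArchiveSize, double highWatermark) {
        if (maxArchiveSize <= 0)
            throw new IllegalArgumentException("maxArchiveSize must be > 0");

        if (highWatermark <= 0.0 || highWatermark > 1.0)
            throw new IllegalArgumentException("highWatermark must be in (0, 1]");

        long threshold = (long) (maxArchiveSize * highWatermark);

        return archiveSize >= threshold;
    }
}
```

With a limit of 1000 and a watermark of 0.75, an archive size of 749 does not trigger a checkpoint and 750 does. The point of triggering early is that a completed checkpoint lets old segments be removed before the limit is reached, rather than after.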

In reply to the post by Zhenya Stanilovsky <[hidden email]> of Tue, 26 Jan 2021, 19:13.

--
Sincerely yours, Ivan Daschinskiy

Kirill Tkalenko

Re: Hard limit WAL archive size

In reply to this post by vbm

Hi!
No, it is not related. Our basic problem is the growth of the WAL archive.

In reply to the post by Vishwas Bm <[hidden email]> of 26 Jan 2021, 19:06.