Igniters,
We recently experienced some issues with TTL when persistence is enabled; those issues were related to persistence implementation details. However, when we were adding tests to cover more cases, we found more failures which, I think, reveal some fundamental issues with the expiry mechanism.

In short, the root cause of the issue is that we expire entries on primary and backup nodes independently, which means:
1) Partition sizes may differ between nodes at a given moment, which will trigger false-negative results from the recently added partition checks on partition map exchange.
2) More importantly, this may lead to inconsistent primary and backup values when an EntryProcessor is used, because an entry processor may observe a non-null value on one node and a null value on another node.

In my opinion, the second issue is critical, and we must change the expiry mechanics to run expiry in a distributed mode, with cache remove semantics for the entry removal.

Thoughts?
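For illustration, here is a minimal sketch of the EntryProcessor pattern that can diverge under independent per-node expiry (the cache name and key are hypothetical):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.cache.CacheEntryProcessor;

    class ExpiryRaceSketch {
        // Hypothetical cache name and key, for illustration only.
        static void increment(Ignite ignite) {
            IgniteCache<Integer, Long> cache = ignite.cache("ttlCache");

            cache.invoke(42, (CacheEntryProcessor<Integer, Long, Void>) (entry, args) -> {
                // The processor runs on the primary and on each backup node.
                // With independent local expiry, getValue() may return the
                // old value on one node and null on another, so the branches
                // below diverge and the copies become inconsistent.
                Long val = entry.getValue();
                entry.setValue(val == null ? 1L : val + 1);
                return null;
            });
        }
    }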
Huge +1.
On Tue, Apr 24, 2018 at 4:05 PM, Alexey Goncharuk <[hidden email]> wrote:
In reply to this post by Alexey Goncharuk
Alexey,
Distributed expiry will result in serious performance overhead, mostly at the network level.

I think the main use case for TTL is in-memory caches that accelerate access to a slower third-party data source. In that case nothing is broken if data is missing, and strong consistency guarantees are not needed. That is why I think we should keep "local expiration" at least for in-memory caches. Our in-memory page eviction works the same way.

Best Regards,
Ivan Rakov

On 24.04.2018 16:05, Alexey Goncharuk wrote:
Ivan,
Agree about the use case where we have a read-write-through store. However, we allow using Ignite in-memory caches even without a third-party store, and in that case the same issue is still present. Maybe we can keep local expiry for read-through caches and have strongly consistent expiry for the other modes?

2018-04-24 16:51 GMT+03:00 Ivan Rakov <[hidden email]>:
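For instance, a read-through cache along these lines would arguably be safe with local expiry, since a locally expired entry is simply reloaded from the store (the store class is a hypothetical stub):

    import java.util.concurrent.TimeUnit;
    import javax.cache.Cache;
    import javax.cache.configuration.FactoryBuilder;
    import javax.cache.expiry.CreatedExpiryPolicy;
    import javax.cache.expiry.Duration;
    import org.apache.ignite.cache.store.CacheStoreAdapter;
    import org.apache.ignite.configuration.CacheConfiguration;

    /** Hypothetical store stub standing in for a slower third-party source. */
    class MyCacheStore extends CacheStoreAdapter<Integer, Long> {
        @Override public Long load(Integer key) { return 0L; } // would query the real source
        @Override public void write(Cache.Entry<? extends Integer, ? extends Long> e) { }
        @Override public void delete(Object key) { }
    }

    class ReadThroughSketch {
        static CacheConfiguration<Integer, Long> config() {
            CacheConfiguration<Integer, Long> cfg = new CacheConfiguration<>("readThroughCache");

            // Entries live for 5 minutes after creation; a miss (including a
            // locally expired entry) falls through to the underlying store.
            cfg.setReadThrough(true);
            cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(MyCacheStore.class));
            cfg.setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.MINUTES, 5)));

            return cfg;
        }
    }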
Alexey,
What if a user touches a backup entry via readFromBackup=true? Should we start a distributed operation (e.g. a TTL update or expiration) in that case?

On Tue, Apr 24, 2018 at 5:02 PM, Alexey Goncharuk <[hidden email]> wrote:

--
Best regards,
Andrey V. Mashenkov
Andrey,
No, in this case the entry must not be evicted; it must be kept as-is, because only the primary node can decide when an entry must be expired. The read in this case should return null, though. I understand that we can get non-monotonic reads, but that is always the case when readFromBackup is true.

2018-04-24 17:15 GMT+03:00 Andrey Mashenkov <[hidden email]>:
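A minimal sketch of that semantics from the user's point of view (cache name and key are hypothetical):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.configuration.CacheConfiguration;

    class BackupReadSketch {
        static void read(Ignite ignite) {
            CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("backupReadCache");
            cfg.setBackups(1);
            cfg.setReadFromBackup(true); // gets may be served from a backup copy

            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);

            // Under the proposed rule: if the backup copy is past its expire
            // time, this returns null while the entry itself stays in place
            // until the primary drives the distributed remove. A later get
            // answered by the primary could still see the value, hence the
            // non-monotonic reads mentioned above.
            String valueOrNull = cache.get(42);
            System.out.println(valueOrNull);
        }
    }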
Alexey,
Actually, there are two cases with readFromBackup=true, which is enabled by default for REPLICATED caches:
- the user touches an expired entry on a backup node: we can just return null and keep the entry as-is, in the hope that the primary will remove it;
- the user touches an alive entry on a backup node: the TTL should somehow be updated on the primary to prevent eviction of frequently used entries.

On Tue, Apr 24, 2018 at 5:18 PM, Alexey Goncharuk <[hidden email]> wrote:

--
Best regards,
Andrey V. Mashenkov
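The second case is where the question really bites; a sketch of a touch-based policy in which exactly that situation arises (the cache and key are hypothetical):

    import java.util.concurrent.TimeUnit;
    import javax.cache.expiry.Duration;
    import javax.cache.expiry.TouchedExpiryPolicy;
    import org.apache.ignite.IgniteCache;

    class TouchTtlSketch {
        static void touch(IgniteCache<Integer, String> cache) {
            // TouchedExpiryPolicy resets the TTL on every read or write. If
            // this get is served from a backup (readFromBackup=true), the
            // refreshed TTL has to reach the primary somehow; otherwise the
            // primary may expire an entry that clients are actively using.
            IgniteCache<Integer, String> touching =
                cache.withExpiryPolicy(new TouchedExpiryPolicy(new Duration(TimeUnit.MINUTES, 1)));

            touching.get(42); // the "touch" that should extend the entry's lifetime
        }
    }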
In reply to this post by Alexey Goncharuk
I think it would be fairer and simpler to configure distributed expiration as a flag in the cache configuration. By the way, we still have to store an ordered set of expirable entries on every node. Once https://issues.apache.org/jira/browse/IGNITE-5874 is merged, we can do the following: if distributed expiration is enabled, the primary node will scan the PendingEntriesTree and generate remove requests; if it is disabled, every node will clear its own PendingEntriesTree. This will also allow the user to switch distributed expiration on or off after a grid restart.

Best Regards,
Ivan Rakov

On 24.04.2018 17:02, Alexey Goncharuk wrote:
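To make the proposal concrete, here is a toy model of the switch; every name in it is hypothetical, and the real logic would live inside Ignite's internal TTL machinery:

    import java.util.List;

    /** Toy model of the proposed flag; all names here are hypothetical. */
    class ExpirationModeSketch {
        interface PendingRow { Object key(); }

        List<PendingRow> expiredRows;   // would be read from PendingEntriesTree
        boolean distributedExpiration;  // the proposed cache-configuration flag

        void expireEntries() {
            for (PendingRow row : expiredRows) {
                if (distributedExpiration) {
                    // Only the primary acts: it generates a distributed remove
                    // request, so primary and backup copies stay consistent.
                    if (isPrimary(row.key()))
                        sendRemoveRequest(row.key());
                }
                else {
                    // Legacy mode: each node clears its own pending tree locally.
                    removeLocally(row.key());
                }
            }
        }

        boolean isPrimary(Object key) { return true; }  // stub
        void sendRemoveRequest(Object key) { }          // stub
        void removeLocally(Object key) { }              // stub
    }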