Apache Ignite Developers - Legacy Mail Archive

IGNITE-4534 - Ignite 2.0 eviction design

Classic

List

Threaded

4 messages Options

Ivan Rakov

IGNITE-4534 - Ignite 2.0 eviction design

Hello Igniters,

I want to share some thoughts about supporting eviction on new page
memory storage.

I think, we should reconsider eviction policies in their current state.
Current pluggable eviction policy mechanism allows to determine which
key will be evicted from the cache and when eviction should be
performed. There are two issues about it:

1) Such mechanism tends to make off-heap storage highly fragmented:
batch of small evicted entries can accumulate needed total amount of
free space in FreeList, but it still would be impossible to store one
big entry.

2) Growth of cache size implies creating O(n) heap objects and links to
maintain eviction policy, which causes linear growth of GC load. That
doesn't suit us: getting rid of GC overhead was one of the main reasons
for keeping cache data in off-heap memory.

I propose to use "whole page eviction, off-heap eviction metadata" as
default mode for eviction. It's almost non-deterministic, because actual
eviction result depends on how data is packed into pages, but is more
efficient, and shouldn't cause long-term performance degradation due to
fragmentation.

I can suggest two ways to implement it:

1) *K-random-LRU
*The idea is to allocate big off-heap array, to track timestamp of last
usage for each data page.
When data page on address X is accessed, we store current timestamp in X
/ PAGE_SIZE array position.
When there's a need for eviction, we randomly choose K indices of array
(if some of indices point to non-data pages, re-choose them) and evict
data page with minimum timestamp. In case K=5, we'll evict page from 17%
least recently used set with 50% probability
<http://math.stackexchange.com/questions/786392/expectation-of-minimum-of-n-i-i-d-uniform-random-variables>.
Non-data pages can be simply recognized by having zero in corresponding
position of timestamp array.
If entry is too big and occupies more than one page, only first page
will be tracked, and other will be considered as non-data pages.
*K-random-LRU-2* is perspective variant of this approach. We'll evict
page with minimum time of penultimate access. Implementation is mostly
the same, the difference is that we should store two timestamps for each
data page. LRU-2 outperforms
<http://www.cs.cmu.edu/%7Echristos/courses/721-resources/p297-o_neil.pdf>
LRU by resolving "one-hit wonder" problem: if page is accessed very
rarely, but accidentally accessed once, it's "protected" from eviction
for long time.
Memory overhead: timestamp can be packed into 4 bytes. 100GB off-heap
cache with standard 2KB pages requires 200MB of additional memory (400MB
in case of LRU-2 strategy).*
*

2)*Adaptation of CLOCK-Pro
*CLOCK-Pro (article
<https://www.usenix.org/legacy/event/usenix05/tech/general/full_papers/jiang/jiang_html/html.html>,
presentation <http://www.slideshare.net/huliang64/clockpro>) is modern
page replacement algorithm, used in actual OS kernels. It is an
approximation of LIRS
<http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-02-6.pdf>
with significant advantage for our case: instead of doubly linked list,
which is hard to maintain in off-heap memory, CLOCK-Pro uses only
circular list and three pointers. However, we can't use CLOCK-Pro as is:
there's difference between OS paged memory and Ignite off-heap storage.
In OS, if memory page is swapped out, it still contains the same data,
with same relevance and patterns of access. That's why keeping metadata
for swapped-out page (see Non-resident Cold Pages set) works. In Ignite,
in case we evict page, all data is discarded; entries with same keys may
appear in the cache again, but they will be evenly distributed across
the whole storage.
I propose to use simplified version of CLOCK-Pro with only two pointers
and two page sets: hot and cold. Non-resident pages won't be tracked.
To implement it, we should allocate big off-heap array to track flags
for each page. We should store (hot, cold) bit, (accessed, not accessed)
bit and one bit indicating that the page is data page, totally 3 bits
per page. 100GB off-heap cache with standard 2KB pages requires ~20MB of
additional memory, or 50MB in case we don't want to shrink metadata for
multiple pages in one byte.
Memory overhead is really small, but we can't predict average and
maximum time for one eviction, it has to be discovered during an
experiment (in hypothetical worst case, clock hand can travel through
the whole array). Also, hot/cold ratio can affect performance
dramatically and should be tuned well.

Note about classic "key" eviction: fair per-entry policies (FIFO, LRU,
Sorted) may be useful for some users, and we may want to support them as
well. It's no problem at all to implement them on-heap (more than,
current implementations of EvictionPolicy will work in if we make minor
changes in the code). In off-heap, FIFO is easy to implement (circular
buffer), but LRU (and more effective LIRS, if someone wants it) require
doubly linked lists. The only option I see is using our BPlusTree with
some kind of implicit key, but this implies O(log n) overhead on each
entry read access.

Another note: current GridCacheEvictionManager have eviction backup
synchronization mechanism. Do we want to support it for off-heap storage
as well?

Please let me know if my understanding of any algorithm or part of the
system is wrong.

--
Best Regards,
Ivan Rakov

dsetrakyan

Re: IGNITE-4534 - Ignite 2.0 eviction design

Ivan, thanks for the detailed explanation. My preference would be to
support all of the suggested eviction policies and control them from the
configuration. However, it does sound the K-random-LRU or K-randome-LRU-2
will be much easier to support than LIRs.

Don't we already have something like K-random-LRU in place?

D.

On Tue, Feb 14, 2017 at 9:13 AM, Ivan Rakov <[hidden email]> wrote:

> Hello Igniters,
>
> I want to share some thoughts about supporting eviction on new page memory
> storage.
>
> I think, we should reconsider eviction policies in their current state.
> Current pluggable eviction policy mechanism allows to determine which key
> will be evicted from the cache and when eviction should be performed. There
> are two issues about it:
>
> 1) Such mechanism tends to make off-heap storage highly fragmented: batch
> of small evicted entries can accumulate needed total amount of free space
> in FreeList, but it still would be impossible to store one big entry.
>
> 2) Growth of cache size implies creating O(n) heap objects and links to
> maintain eviction policy, which causes linear growth of GC load. That
> doesn't suit us: getting rid of GC overhead was one of the main reasons for
> keeping cache data in off-heap memory.
>
> I propose to use "whole page eviction, off-heap eviction metadata" as
> default mode for eviction. It's almost non-deterministic, because actual
> eviction result depends on how data is packed into pages, but is more
> efficient, and shouldn't cause long-term performance degradation due to
> fragmentation.
>
> I can suggest two ways to implement it:
>
> 1) *K-random-LRU
> *The idea is to allocate big off-heap array, to track timestamp of last
> usage for each data page.
> When data page on address X is accessed, we store current timestamp in X /
> PAGE_SIZE array position.
> When there's a need for eviction, we randomly choose K indices of array
> (if some of indices point to non-data pages, re-choose them) and evict data
> page with minimum timestamp. In case K=5, we'll evict page from 17% least
> recently used set with 50% probability <http://math.stackexchange.com
> /questions/786392/expectation-of-minimum-of-n-i-i-d-uniform-
> random-variables>.
> Non-data pages can be simply recognized by having zero in corresponding
> position of timestamp array.
> If entry is too big and occupies more than one page, only first page will
> be tracked, and other will be considered as non-data pages.
> *K-random-LRU-2* is perspective variant of this approach. We'll evict page
> with minimum time of penultimate access. Implementation is mostly the same,
> the difference is that we should store two timestamps for each data page.
> LRU-2 outperforms <http://www.cs.cmu.edu/%7Echri
> stos/courses/721-resources/p297-o_neil.pdf> LRU by resolving "one-hit
> wonder" problem: if page is accessed very rarely, but accidentally accessed
> once, it's "protected" from eviction for long time.
> Memory overhead: timestamp can be packed into 4 bytes. 100GB off-heap
> cache with standard 2KB pages requires 200MB of additional memory (400MB in
> case of LRU-2 strategy).*
> *
>
> 2)*Adaptation of CLOCK-Pro
> *CLOCK-Pro (article <https://www.usenix.org/legacy
> /event/usenix05/tech/general/full_papers/jiang/jiang_html/html.html>,
> presentation <http://www.slideshare.net/huliang64/clockpro>) is modern
> page replacement algorithm, used in actual OS kernels. It is an
> approximation of LIRS <http://web.cse.ohio-state.edu
> /hpcs/WWW/HTML/publications/papers/TR-02-6.pdf> with significant
> advantage for our case: instead of doubly linked list, which is hard to
> maintain in off-heap memory, CLOCK-Pro uses only circular list and three
> pointers. However, we can't use CLOCK-Pro as is: there's difference between
> OS paged memory and Ignite off-heap storage. In OS, if memory page is
> swapped out, it still contains the same data, with same relevance and
> patterns of access. That's why keeping metadata for swapped-out page (see
> Non-resident Cold Pages set) works. In Ignite, in case we evict page, all
> data is discarded; entries with same keys may appear in the cache again,
> but they will be evenly distributed across the whole storage.
> I propose to use simplified version of CLOCK-Pro with only two pointers
> and two page sets: hot and cold. Non-resident pages won't be tracked.
> To implement it, we should allocate big off-heap array to track flags for
> each page. We should store (hot, cold) bit, (accessed, not accessed) bit
> and one bit indicating that the page is data page, totally 3 bits per page.
> 100GB off-heap cache with standard 2KB pages requires ~20MB of additional
> memory, or 50MB in case we don't want to shrink metadata for multiple pages
> in one byte.
> Memory overhead is really small, but we can't predict average and maximum
> time for one eviction, it has to be discovered during an experiment (in
> hypothetical worst case, clock hand can travel through the whole array).
> Also, hot/cold ratio can affect performance dramatically and should be
> tuned well.
>
> Note about classic "key" eviction: fair per-entry policies (FIFO, LRU,
> Sorted) may be useful for some users, and we may want to support them as
> well. It's no problem at all to implement them on-heap (more than, current
> implementations of EvictionPolicy will work in if we make minor changes in
> the code). In off-heap, FIFO is easy to implement (circular buffer), but
> LRU (and more effective LIRS, if someone wants it) require doubly linked
> lists. The only option I see is using our BPlusTree with some kind of
> implicit key, but this implies O(log n) overhead on each entry read access.
>
> Another note: current GridCacheEvictionManager have eviction backup
> synchronization mechanism. Do we want to support it for off-heap storage as
> well?
>
> Please let me know if my understanding of any algorithm or part of the
> system is wrong.
>
> --
> Best Regards,
> Ivan Rakov
>
>

Alexey Goncharuk

Re: IGNITE-4534 - Ignite 2.0 eviction design

Dmitriy,

For the current off-heap approach, we only have a per-entry LRU eviction
policy which does not fit well page memory architecture. I like the
adaptation of clock pro algorithm because it requires a significantly
smaller amount of memory to track page updates and is scan-resistant.

I want to back Ivan here and discuss whether we need synchronized evictions
at all. Given that current implementation has flaws and correct
implementation should work with the same protocol guarantees as a cache
update (meaning two-phase commit for transactional caches on every
operation, including reads). I would rather drop synchronous evictions at
all.

Thoughts?

2017-02-15 0:59 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:

> Ivan, thanks for the detailed explanation. My preference would be to
> support all of the suggested eviction policies and control them from the
> configuration. However, it does sound the K-random-LRU or K-randome-LRU-2
> will be much easier to support than LIRs.
>
> Don't we already have something like K-random-LRU in place?
>
> D.
>
> On Tue, Feb 14, 2017 at 9:13 AM, Ivan Rakov <[hidden email]> wrote:
>
> > Hello Igniters,
> >
> > I want to share some thoughts about supporting eviction on new page
> memory
> > storage.
> >
> > I think, we should reconsider eviction policies in their current state.
> > Current pluggable eviction policy mechanism allows to determine which key
> > will be evicted from the cache and when eviction should be performed.
> There
> > are two issues about it:
> >
> > 1) Such mechanism tends to make off-heap storage highly fragmented: batch
> > of small evicted entries can accumulate needed total amount of free space
> > in FreeList, but it still would be impossible to store one big entry.
> >
> > 2) Growth of cache size implies creating O(n) heap objects and links to
> > maintain eviction policy, which causes linear growth of GC load. That
> > doesn't suit us: getting rid of GC overhead was one of the main reasons
> for
> > keeping cache data in off-heap memory.
> >
> > I propose to use "whole page eviction, off-heap eviction metadata" as
> > default mode for eviction. It's almost non-deterministic, because actual
> > eviction result depends on how data is packed into pages, but is more
> > efficient, and shouldn't cause long-term performance degradation due to
> > fragmentation.
> >
> > I can suggest two ways to implement it:
> >
> > 1) *K-random-LRU
> > *The idea is to allocate big off-heap array, to track timestamp of last
> > usage for each data page.
> > When data page on address X is accessed, we store current timestamp in X
> /
> > PAGE_SIZE array position.
> > When there's a need for eviction, we randomly choose K indices of array
> > (if some of indices point to non-data pages, re-choose them) and evict
> data
> > page with minimum timestamp. In case K=5, we'll evict page from 17% least
> > recently used set with 50% probability <http://math.stackexchange.com
> > /questions/786392/expectation-of-minimum-of-n-i-i-d-uniform-
> > random-variables>.
> > Non-data pages can be simply recognized by having zero in corresponding
> > position of timestamp array.
> > If entry is too big and occupies more than one page, only first page will
> > be tracked, and other will be considered as non-data pages.
> > *K-random-LRU-2* is perspective variant of this approach. We'll evict
> page
> > with minimum time of penultimate access. Implementation is mostly the
> same,
> > the difference is that we should store two timestamps for each data page.
> > LRU-2 outperforms <http://www.cs.cmu.edu/%7Echri
> > stos/courses/721-resources/p297-o_neil.pdf> LRU by resolving "one-hit
> > wonder" problem: if page is accessed very rarely, but accidentally
> accessed
> > once, it's "protected" from eviction for long time.
> > Memory overhead: timestamp can be packed into 4 bytes. 100GB off-heap
> > cache with standard 2KB pages requires 200MB of additional memory (400MB
> in
> > case of LRU-2 strategy).*
> > *
> >
> > 2)*Adaptation of CLOCK-Pro
> > *CLOCK-Pro (article <https://www.usenix.org/legacy
> > /event/usenix05/tech/general/full_papers/jiang/jiang_html/html.html>,
> > presentation <http://www.slideshare.net/huliang64/clockpro>) is modern
> > page replacement algorithm, used in actual OS kernels. It is an
> > approximation of LIRS <http://web.cse.ohio-state.edu
> > /hpcs/WWW/HTML/publications/papers/TR-02-6.pdf> with significant
> > advantage for our case: instead of doubly linked list, which is hard to
> > maintain in off-heap memory, CLOCK-Pro uses only circular list and three
> > pointers. However, we can't use CLOCK-Pro as is: there's difference
> between
> > OS paged memory and Ignite off-heap storage. In OS, if memory page is
> > swapped out, it still contains the same data, with same relevance and
> > patterns of access. That's why keeping metadata for swapped-out page (see
> > Non-resident Cold Pages set) works. In Ignite, in case we evict page, all
> > data is discarded; entries with same keys may appear in the cache again,
> > but they will be evenly distributed across the whole storage.
> > I propose to use simplified version of CLOCK-Pro with only two pointers
> > and two page sets: hot and cold. Non-resident pages won't be tracked.
> > To implement it, we should allocate big off-heap array to track flags for
> > each page. We should store (hot, cold) bit, (accessed, not accessed) bit
> > and one bit indicating that the page is data page, totally 3 bits per
> page.
> > 100GB off-heap cache with standard 2KB pages requires ~20MB of additional
> > memory, or 50MB in case we don't want to shrink metadata for multiple
> pages
> > in one byte.
> > Memory overhead is really small, but we can't predict average and maximum
> > time for one eviction, it has to be discovered during an experiment (in
> > hypothetical worst case, clock hand can travel through the whole array).
> > Also, hot/cold ratio can affect performance dramatically and should be
> > tuned well.
> >
> > Note about classic "key" eviction: fair per-entry policies (FIFO, LRU,
> > Sorted) may be useful for some users, and we may want to support them as
> > well. It's no problem at all to implement them on-heap (more than,
> current
> > implementations of EvictionPolicy will work in if we make minor changes
> in
> > the code). In off-heap, FIFO is easy to implement (circular buffer), but
> > LRU (and more effective LIRS, if someone wants it) require doubly linked
> > lists. The only option I see is using our BPlusTree with some kind of
> > implicit key, but this implies O(log n) overhead on each entry read
> access.
> >
> > Another note: current GridCacheEvictionManager have eviction backup
> > synchronization mechanism. Do we want to support it for off-heap storage
> as
> > well?
> >
> > Please let me know if my understanding of any algorithm or part of the
> > system is wrong.
> >
> > --
> > Best Regards,
> > Ivan Rakov
> >
> >
>

dsetrakyan

Re: IGNITE-4534 - Ignite 2.0 eviction design

On Wed, Feb 15, 2017 at 9:06 AM, Alexey Goncharuk <
[hidden email]> wrote:

> Dmitriy,
>
> For the current off-heap approach, we only have a per-entry LRU eviction
> policy which does not fit well page memory architecture. I like the
> adaptation of clock pro algorithm because it requires a significantly
> smaller amount of memory to track page updates and is scan-resistant.
>

Agree.

>
> I want to back Ivan here and discuss whether we need synchronized evictions
> at all. Given that current implementation has flaws and correct
> implementation should work with the same protocol guarantees as a cache
> update (meaning two-phase commit for transactional caches on every
> operation, including reads). I would rather drop synchronous evictions at
> all.
>
> Thoughts?
>

Agree again. Synchronous evictions only make sense in pure in-memory
approach, where there is no database at all. AFAIK, most of the Ignite
deployments are backed by a database. If not, then users should control the
evictions themselves.

However, I am always concerned when removing a certain feature. If we start
getting complaints, we may have to add it in 2.0.

>
> 2017-02-15 0:59 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:
>
> > Ivan, thanks for the detailed explanation. My preference would be to
> > support all of the suggested eviction policies and control them from the
> > configuration. However, it does sound the K-random-LRU or K-randome-LRU-2
> > will be much easier to support than LIRs.
> >
> > Don't we already have something like K-random-LRU in place?
> >
> > D.
> >
> > On Tue, Feb 14, 2017 at 9:13 AM, Ivan Rakov <[hidden email]>
> wrote:
> >
> > > Hello Igniters,
> > >
> > > I want to share some thoughts about supporting eviction on new page
> > memory
> > > storage.
> > >
> > > I think, we should reconsider eviction policies in their current state.
> > > Current pluggable eviction policy mechanism allows to determine which
> key
> > > will be evicted from the cache and when eviction should be performed.
> > There
> > > are two issues about it:
> > >
> > > 1) Such mechanism tends to make off-heap storage highly fragmented:
> batch
> > > of small evicted entries can accumulate needed total amount of free
> space
> > > in FreeList, but it still would be impossible to store one big entry.
> > >
> > > 2) Growth of cache size implies creating O(n) heap objects and links to
> > > maintain eviction policy, which causes linear growth of GC load. That
> > > doesn't suit us: getting rid of GC overhead was one of the main reasons
> > for
> > > keeping cache data in off-heap memory.
> > >
> > > I propose to use "whole page eviction, off-heap eviction metadata" as
> > > default mode for eviction. It's almost non-deterministic, because
> actual
> > > eviction result depends on how data is packed into pages, but is more
> > > efficient, and shouldn't cause long-term performance degradation due to
> > > fragmentation.
> > >
> > > I can suggest two ways to implement it:
> > >
> > > 1) *K-random-LRU
> > > *The idea is to allocate big off-heap array, to track timestamp of last
> > > usage for each data page.
> > > When data page on address X is accessed, we store current timestamp in
> X
> > /
> > > PAGE_SIZE array position.
> > > When there's a need for eviction, we randomly choose K indices of array
> > > (if some of indices point to non-data pages, re-choose them) and evict
> > data
> > > page with minimum timestamp. In case K=5, we'll evict page from 17%
> least
> > > recently used set with 50% probability <http://math.stackexchange.com
> > > /questions/786392/expectation-of-minimum-of-n-i-i-d-uniform-
> > > random-variables>.
> > > Non-data pages can be simply recognized by having zero in corresponding
> > > position of timestamp array.
> > > If entry is too big and occupies more than one page, only first page
> will
> > > be tracked, and other will be considered as non-data pages.
> > > *K-random-LRU-2* is perspective variant of this approach. We'll evict
> > page
> > > with minimum time of penultimate access. Implementation is mostly the
> > same,
> > > the difference is that we should store two timestamps for each data
> page.
> > > LRU-2 outperforms <http://www.cs.cmu.edu/%7Echri
> > > stos/courses/721-resources/p297-o_neil.pdf> LRU by resolving "one-hit
> > > wonder" problem: if page is accessed very rarely, but accidentally
> > accessed
> > > once, it's "protected" from eviction for long time.
> > > Memory overhead: timestamp can be packed into 4 bytes. 100GB off-heap
> > > cache with standard 2KB pages requires 200MB of additional memory
> (400MB
> > in
> > > case of LRU-2 strategy).*
> > > *
> > >
> > > 2)*Adaptation of CLOCK-Pro
> > > *CLOCK-Pro (article <https://www.usenix.org/legacy
> > > /event/usenix05/tech/general/full_papers/jiang/jiang_html/html.html>,
> > > presentation <http://www.slideshare.net/huliang64/clockpro>) is modern
> > > page replacement algorithm, used in actual OS kernels. It is an
> > > approximation of LIRS <http://web.cse.ohio-state.edu
> > > /hpcs/WWW/HTML/publications/papers/TR-02-6.pdf> with significant
> > > advantage for our case: instead of doubly linked list, which is hard to
> > > maintain in off-heap memory, CLOCK-Pro uses only circular list and
> three
> > > pointers. However, we can't use CLOCK-Pro as is: there's difference
> > between
> > > OS paged memory and Ignite off-heap storage. In OS, if memory page is
> > > swapped out, it still contains the same data, with same relevance and
> > > patterns of access. That's why keeping metadata for swapped-out page
> (see
> > > Non-resident Cold Pages set) works. In Ignite, in case we evict page,
> all
> > > data is discarded; entries with same keys may appear in the cache
> again,
> > > but they will be evenly distributed across the whole storage.
> > > I propose to use simplified version of CLOCK-Pro with only two pointers
> > > and two page sets: hot and cold. Non-resident pages won't be tracked.
> > > To implement it, we should allocate big off-heap array to track flags
> for
> > > each page. We should store (hot, cold) bit, (accessed, not accessed)
> bit
> > > and one bit indicating that the page is data page, totally 3 bits per
> > page.
> > > 100GB off-heap cache with standard 2KB pages requires ~20MB of
> additional
> > > memory, or 50MB in case we don't want to shrink metadata for multiple
> > pages
> > > in one byte.
> > > Memory overhead is really small, but we can't predict average and
> maximum
> > > time for one eviction, it has to be discovered during an experiment (in
> > > hypothetical worst case, clock hand can travel through the whole
> array).
> > > Also, hot/cold ratio can affect performance dramatically and should be
> > > tuned well.
> > >
> > > Note about classic "key" eviction: fair per-entry policies (FIFO, LRU,
> > > Sorted) may be useful for some users, and we may want to support them
> as
> > > well. It's no problem at all to implement them on-heap (more than,
> > current
> > > implementations of EvictionPolicy will work in if we make minor changes
> > in
> > > the code). In off-heap, FIFO is easy to implement (circular buffer),
> but
> > > LRU (and more effective LIRS, if someone wants it) require doubly
> linked
> > > lists. The only option I see is using our BPlusTree with some kind of
> > > implicit key, but this implies O(log n) overhead on each entry read
> > access.
> > >
> > > Another note: current GridCacheEvictionManager have eviction backup
> > > synchronization mechanism. Do we want to support it for off-heap
> storage
> > as
> > > well?
> > >
> > > Please let me know if my understanding of any algorithm or part of the
> > > system is wrong.
> > >
> > > --
> > > Best Regards,
> > > Ivan Rakov
> > >
> > >
> >
>