I tend to agree with Mitchell that the cluster should not crash. If the crash is unavoidable given the current architecture, then the message should be more descriptive.

Ignite persistence experts, could you please join the conversation and shed more light on the reported behavior?

-
Denis

On Wed, Dec 11, 2019 at 3:25 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[hidden email]> wrote:

> 2 GB is not a reasonable off-heap memory size for our use case. In general, even if off-heap is very low, performance should just degrade and calls should become blocking; I don't think that we should crash. Either way, the issue seems to be with putAll, not concurrent updates of different caches in the same data region. If I use Ignite's DataStreamer API instead of putAll, I get much better performance and no OOM exception. Any insight into why this might be would be appreciated.
>
> From: [hidden email] At: 12/10/19 11:24:35
> To: Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[hidden email]>, [hidden email]
> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with persistence enabled
>
> Hello!
>
> 10M is a very, very low-ball value for testing disk performance, considering how Ignite's WAL/checkpoints are structured. As already mentioned, it does not even work properly.
>
> I recommend using a 2G value instead. Just load enough data so that you can observe constant checkpoints.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Wed, 4 Dec 2019 at 03:16, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[hidden email]> wrote:
>
>> For the requested full Ignite log, where would this be found if we are running in LOCAL mode? We are not explicitly running a separate Ignite node, and our WorkDirectory does not seem to have any logs.
>>
>> From: [hidden email] At: 12/03/19 19:00:18
>> To: [hidden email]
>> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with persistence enabled
>>
>> For our configuration properties, our DataRegion initialSize and maxSize were set to 11 MB and persistence was enabled. For DataStorage, our pageSize was set to 8192 instead of 4096. For Cache, write-behind is disabled, on-heap cache is disabled, and atomicity mode is ATOMIC.
>>
>> From: [hidden email] At: 12/03/19 13:40:32
>> To: [hidden email]
>> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with persistence enabled
>>
>> Hi Mitchell,
>>
>> It looks like this can be easily reproduced with low off-heap sizes; I tried with simple puts and got the same exception:
>>
>> class org.apache.ignite.internal.mem.IgniteOutOfMemoryException: Failed to find a page for eviction [segmentCapacity=1580, loaded=619, maxDirtyPages=465, dirtyPages=619, cpPages=0, pinnedInSegment=0, failedToPrepare=620]
>> Out of memory in data region [name=Default_Region, initSize=10.0 MiB, maxSize=10.0 MiB, persistenceEnabled=true] Try the following:
>>   ^-- Increase maximum off-heap memory size (DataRegionConfiguration.maxSize)
>>   ^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
>>   ^-- Enable eviction or expiration policies
>>
>> It looks like Ignite must issue a proper warning in this case, and a couple of issues must be filed against Ignite JIRA.
>>
>> Check out this article on the persistent store available in the Ignite Confluence as well:
>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-Checkpointing
>>
>> I've managed to make a somewhat similar example work with a 20 MB region with a bit of tuning, by adding the following properties to org.apache.ignite.configuration.DataStorageConfiguration:
>>
>> <property name="checkpointFrequency" value="1500"/>
>> <property name="writeThrottlingEnabled" value="true"/>
>>
>> The whole idea behind this is to trigger checkpoints on a timeout rather than on the "too many dirty pages" percentage threshold. The checkpoint page buffer size may not exceed the data region size, which is 10 MB, and it might also overflow during a checkpoint.
>>
>> I assume that a checkpoint is never triggered in this case because of per-partition overhead: Ignite writes some metadata per partition, and it looks like at least one meta page is used for each partition, which results in some amount of off-heap devoured by these meta pages. With the lowest possible region size, this might consume more than 3 MB for a cache with 1k partitions, so the 70% dirty data pages threshold would never be reached.
>>
>> However, I found another issue where it is not possible to save a meta page at checkpoint begin; this reproduces on a 10 MB data region with the mentioned storage configuration options.
>>
>> Could you please describe your configuration if you have anything different from the defaults (page size, WAL mode, partition count) and the types of key/value that you use? And if possible, could you please attach the full Ignite log from the node that suffered the IOOM?
>>
>> As for the data region/cache question, in reality you also have cache groups playing a role here. But generally I would recommend going with one data region for all caches unless you have a particular reason to have multiple regions. For example, if you have some really important data in a cache that always needs to be available in durable off-heap memory, then you should have a separate data region for that cache, as I'm not aware of any way to disallow evicting pages for a specific cache.
>>
>> Cache groups documentation link:
>> https://apacheignite.readme.io/docs/cache-groups
>>
>> By default (when a cache doesn't have the cacheGroup property defined) each cache has its own cache group with the very same name, which lives in the specified data region or in the default data region. You might use cache groups or not, depending on your goal: use them when you want to reduce meta overhead/checkpoints/partition exchanges and share internal structures to save a bit of space, or do not use them if you want to speed up inserts/lookups by giving each cache its own dedicated partition maps and B+ trees.
>>
>> Best regards,
>> Anton
>>
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
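As a side note on Anton's tuning advice above: the same storage settings can also be expressed through Ignite's programmatic Java configuration. The sketch below is only an illustration for Ignite 2.x, not a verified reproduction of anyone's setup in this thread: the 11 MB region and 8192-byte page size mirror the values Mitchell reported, the checkpointFrequency/writeThrottlingEnabled values are the ones Anton mentioned, and the region name is a placeholder. Anton's per-partition estimate is also easy to sanity-check: with 1,024 partitions and at least one meta page per partition, meta pages alone take roughly 1,024 x 4 KiB = 4 MiB at the default page size (about 8 MiB with 8 KiB pages), which is a large share of a 10-11 MB region.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class SmallRegionConfigSketch {
        public static void main(String[] args) {
            // Small persistent data region, sized like the one reported in this thread (11 MB).
            DataRegionConfiguration region = new DataRegionConfiguration()
                .setName("Default_Region")
                .setInitialSize(11L * 1024 * 1024)
                .setMaxSize(11L * 1024 * 1024)
                .setPersistenceEnabled(true);

            DataStorageConfiguration storage = new DataStorageConfiguration()
                .setPageSize(8192)                 // 8 KiB pages instead of the 4 KiB default
                .setDefaultDataRegionConfiguration(region)
                .setCheckpointFrequency(1500)      // checkpoint on timeout, per Anton's suggestion
                .setWriteThrottlingEnabled(true);  // throttle writes instead of overrunning the region

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(storage);

            try (Ignite ignite = Ignition.start(cfg)) {
                ignite.cluster().active(true);     // a node with persistence must be activated
                // create caches and load data here
            }
        }
    }

The equivalent Spring XML simply nests the same values under the dataStorageConfiguration bean, as in the two-property snippet Anton quoted.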
Hi Mitchell,
I believe that the research done by Anton is correct, and the root cause of the OOME is the proportion of memory occupied by meta pages in the data region. Each cache started in a data region allocates one or more meta pages per initialized partition, so when you run your test with only one cache this is not a problem, but when a second cache is added it results in OOME.

I don't think there is an easy way to prevent this exception in general, but I agree that we need to provide a more descriptive error message and/or an early warning for the user that the configuration of caches and data regions may lead to such an exception. I'll file a ticket for this improvement soon.

Best regards,
Sergey

On Thu, Dec 12, 2019 at 1:27 AM Denis Magda <[hidden email]> wrote:

> I tend to agree with Mitchell that the cluster should not crash. If the crash is unavoidable given the current architecture, then the message should be more descriptive.
>
> Ignite persistence experts, could you please join the conversation and shed more light on the reported behavior?
>
> -
> Denis
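On Mitchell's observation that IgniteDataStreamer handled his load much better than putAll: below is a minimal sketch of the two load paths side by side, assuming an already started Ignite 2.x node and an existing cache. The cache name "myCache" and the Integer/String key and value types are placeholders, not the types from Mitchell's application, and the sketch does not claim to explain why the streamer avoided the OOM here; it only shows the API difference. The streamer buffers entries and applies them in batches through its own update path, whereas putAll applies one large bulk operation.

    import java.util.Map;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;

    public class LoadPathSketch {
        // Bulk load via putAll: one large cache operation.
        static void loadWithPutAll(Ignite ignite, Map<Integer, String> batch) {
            ignite.cache("myCache").putAll(batch);
        }

        // Bulk load via IgniteDataStreamer: entries are buffered and flushed in batches.
        static void loadWithStreamer(Ignite ignite, Map<Integer, String> batch) {
            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("myCache")) {
                streamer.allowOverwrite(true);  // update existing keys, like put() would
                for (Map.Entry<Integer, String> e : batch.entrySet())
                    streamer.addData(e.getKey(), e.getValue());
            }  // close() flushes any remaining buffered entries
        }
    }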