[DISCUSSION] IEP-59: CDC - Capture Data Change


Nikolay Izhikov-2
Hello, Igniters.

I want to start a discussion of a new feature [1].

CDC - capture data change. The feature allows the consumer to receive online notifications about data record changes.

It can be used in the following scenarios:
        * Export data into some warehouse, full-text search, or distributed log system.
        * Online statistics and analytics.
        * Wait and respond to some specific events or data changes.

I propose to implement a new IgniteCDC application as follows:
        * It runs on the server node host.
        * It watches for the appearance of WAL archive segments.
        * It iterates over each segment using the existing WALIterator and notifies the consumer of each record in the segment.

IgniteCDC features:
        * Independence from the server node process (JVM) - issues and failures of the consumer will not lead to server node instability.
        * Notification guarantees and failover - i.e. CDC tracks and saves a pointer to the last consumed record and continues notification from this pointer in case of a restart.
        * Resilience for the consumer - it's not an issue if a consumer temporarily consumes more slowly than data appears.
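The watch-and-iterate loop above can be sketched in plain Java. Note that the segment-naming scheme, the ".wal" extension, and polling instead of filesystem events are my assumptions for illustration; the real tool would use Ignite's WALIterator where the comment indicates:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

/** Sketch: poll the WAL archive directory and process new segments in index order. */
class CdcWatcherSketch {
    /** Index of the last fully processed segment (persisted to disk in the real tool). */
    private long lastProcessed = -1;

    /** Scans the archive dir once and "processes" any not-yet-seen segments, in order. */
    public List<String> pollOnce(Path walArchiveDir) throws IOException {
        List<Path> newSegments;
        try (Stream<Path> files = Files.list(walArchiveDir)) {
            newSegments = files
                .filter(p -> p.getFileName().toString().endsWith(".wal"))
                .filter(p -> segmentIndex(p) > lastProcessed)
                .sorted(Comparator.comparingLong(CdcWatcherSketch::segmentIndex))
                .collect(Collectors.toList());
        }
        List<String> processed = new ArrayList<>();
        for (Path segment : newSegments) {
            // The real IgniteCDC would open the segment with WALIterator here
            // and notify the consumer of each record in it.
            processed.add(segment.getFileName().toString());
            lastProcessed = segmentIndex(segment);
        }
        return processed;
    }

    /** Assumed naming scheme: "<index>.wal", e.g. "0000000003.wal". */
    static long segmentIndex(Path segment) {
        String name = segment.getFileName().toString();
        return Long.parseLong(name.substring(0, name.indexOf('.')));
    }
}
```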

WDYT?

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-59+CDC+-+Capture+Data+Change

Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

Pavel Kovalenko
Hi Nikolay,

The idea is good. But what do you think about integrating these ideas into
the WAL-G project?
https://github.com/wal-g/wal-g
It's a well-known tool that is already used to stream WAL for PostgreSQL,
MySQL, and MongoDB.
The advantages are out-of-the-box integration with S3, GCP, and Azure,
plus encryption and compression.



Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

Pavel Kovalenko
This tool can also be used to store snapshots in an external warehouse.



Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

Nikolay Izhikov-2
Hello, Pavel.

Thanks for the feedback.

> But what do you think about integrating these ideas into the WAL-G project?

I looked into the WAL-G description, and it looks like its use cases are more restricted than CDC itself.
Its self-description: "WAL-G is an archival restoration tool for Postgres (beta for MySQL, MariaDB, and MongoDB)".

Restoration is only one of the CDC applications.
I mentioned other use cases in the first letter.

Anyway, integration with well-known tools is good.
I will try to contact the WAL-G community.



Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

ezhuravl
In reply to this post by Pavel Kovalenko
Hi,

>On the segment archiving, utility iterates it using existing WALIterator
>Wait and respond to some specific events or data changes.
It seems like this solution will have an unpredictable synchronization
delay when handling events.

Why can't we just implement a Debezium connector for Ignite, for example?
https://debezium.io/documentation/reference/1.3/index.html. It is a pretty
popular product that uses Kafka underneath.

Evgenii



Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

Nikolay Izhikov-2
Hello, Evgenii.

> It seems like this solution will have an unpredictable synchronization delay when handling events.

It's true that the CDC solution doesn't have strict boundaries for the notification delay because of its asynchronous nature.
But I assume that we will introduce a WAL rollover timeout for CDC cases.
Please, take a look at the ticket [1].

The same approach is used by Oracle and other databases that implement CDC.

Anyway, I treat the notification delay and the decoupling of an event from its consumption as an advantage of CDC, not a downside :)

> Why can't we just implement a Debezium connector for Ignite, for example?

I think we can.
But, AFAIK, the Debezium connectors developed for other databases use CDC implementations similar to the one proposed here.

[1] https://issues.apache.org/jira/browse/IGNITE-13582?src=confmacro




Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

agura
Hi,

I definitely agree with Pavel and Evgenii ideas and comments.

From my point of view, the proposal is not about Apache Ignite
features. The described functionality could be implemented outside of the
Apache Ignite project. Perhaps a Debezium connector or a WAL-G module is
the best candidate for the proposed CDC.

A CDC API will require additional support from the Apache Ignite community,
while there is no need for any public API for the described cases. It is a
really good idea to implement the CDC tool as part of the Debezium or WAL-G
projects. For other cases there is WALIterator.

Also I have some comments about IEP-59. I use `>` for quotes from the IEP.

> Many use-cases build on observation and processing changed records.

Yes. But there is a significant difference between Apache Ignite and
an RDBMS like MySQL or PostgreSQL. Apache Ignite is a middleware-class
product, so the required logic could be implemented as part of the business
logic, while an RDBMS doesn't provide such a possibility (you could implement
an optional module for CDC purposes, but *usually* you can't implement
CDC-like functionality as a stored procedure).

> Disadvantages of the CQ in described scenarios:
>
> CQ requires data to be sent over the network

In a normal case, the CDC tool will also send data over the network. I
doubt that hosting an Apache Ignite node on the same server as, for
example, the analytical storage (consumer) is a good idea.

> CQ parts (filter, transformer) live inside server node JVM so issues in it may affect server node stability.

It may or may not affect it. It depends on the filter/transformer
implementation - just implement these parts properly. We have many
dangerous pits in Apache Ignite represented by entities like filters,
transformers, and listeners. But we do not invent something to protect
developers from blocking in all this stuff, except for failure handling
and blocked-thread detection, of course :) Many concurrency models are
based on similar concepts, and all of them are sensitive to blocking.

> Slow CQ listener leads to increasing of the memory consumption of the server node.

Yes. But CQ can't be free. Otherwise, this point mostly restates the
previous one.

> Fails of the CQ listener lead to the loss of the events.

What kind of failures? The CQ listener is the developer's responsibility.

> The convenient solution should be:
>
> Independence from the server node process (JVM) - issues and failures of the consumer shouldn't lead to server node instability.

It's a very good point. And it is the best argument for implementing the
CDC tool as an external tool.

> Notification guarantees and failover - i.e. track and save a pointer to the last consumed record. Continue notification from this pointer in case of restart.

Notification is a superfluous word here. There is no need for
notifications. All you need is WAL segments.



Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

Alexey Goncharuk
In reply to this post by Nikolay Izhikov-2
Hello Nikolay,

Thanks for the suggestion; it definitely may be a good feature. However, I
do not see any significant value that it currently adds over the already
existing WAL iterator. I think the following issues should be addressed;
otherwise, no regular user will be able to use CDC reliably:

   - The interface exposes WALRecord which is a private API
   - There is no way to start capturing changes from a certain point (a
   watermark for already processed data). Users can configure a large size for
   WAL archive to sustain long node downtime for historical rebalance. If a
   CDC agent is restarted, it will have to start from scratch. I see that it
   is present in the IEP as a design choice, but I think this is a major
   usability issue
   - If a CDC reader does not keep up with the WAL write rate (e.g. there
   is a short-term write burst and the WAL archive is small), the Ignite node
   will delete WAL segments while the consumer is still reading them. Since
   the consumer runs out-of-process, we need to specify some sort of
   synchronization protocol between the node and the consumer
   - If Ignite node crashes, gets restarted and initiates full rebalance,
   the consumer will lose some updates
   - Usually, it makes sense for the CDC consumer to read updates only on
   primary nodes (otherwise, multiple agents will be doing duplicate work). In
   the current design, the consumer will not be able to differentiate
   primary/backup updates. Moreover, even if we wrote such flags to WAL, the
   consumer would need to process backup records anyway because it is unknown
   whether the primary consumer is alive. In other words, how would an end
   user organize the CDC failover minimizing the duplicate work?



Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

Pavel Kovalenko
Alexey,

>> If a CDC agent is restarted, it will have to start from scratch
>> If a CDC reader does not keep up with the WAL write rate (e.g. there
>> is a short-term write burst and WAL archive is small), the Ignite node
>> will delete WAL segments while the consumer is still reading it.

I think these cases can be resolved with the following approach:
PostgreSQL can be configured to execute a shell command after a WAL segment
is archived. We can do the same thing for Ignite.
The command can create a hardlink to the WAL segment in a specified
directory, so that it is not lost after deletion by Ignite, and notify a CDC
process (or another kind of process) about this segment.
That would form a filesystem queue; after a restart, CDC can process only the
segments located in this directory, so there is no need to start from scratch.
When a WAL segment has been processed by CDC, its hardlink is deleted from
the queue directory.
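This filesystem-queue idea can be illustrated with plain java.nio hard links. The directory names and the enqueue/dequeue helpers below are illustrative, not Ignite's actual layout:

```java
import java.io.IOException;
import java.nio.file.*;

/** Sketch of the filesystem queue: hard-link an archived segment into a
 *  CDC queue directory so its data survives deletion by the node. */
class HardlinkQueueSketch {
    /** Called when a segment is archived: link it into the queue dir. */
    public static Path enqueue(Path segment, Path queueDir) throws IOException {
        Files.createDirectories(queueDir);
        // A hard link shares the same inode: deleting the original file name
        // does not delete the data while this link still exists.
        return Files.createLink(queueDir.resolve(segment.getFileName()), segment);
    }

    /** Called by the CDC process after the segment is fully consumed. */
    public static void dequeue(Path linkedSegment) throws IOException {
        Files.delete(linkedSegment);
    }
}
```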




Re: [DISCUSSION] IEP-59: CDC - Capture Data Change

Nikolay Izhikov-2
Hello, Alexey.

Sorry for the late answer.

>   - The interface exposes WALRecord which is a private API

Now it's fixed.
A CDC consumer now uses a public API to get notifications about data changes.
This API can be found in the IEP [1] and PR [2]:

```
@IgniteExperimental
public interface DataChangeListener<K, V> {
    String id();
    void start(IgniteConfiguration configuration, IgniteLogger log);
    boolean keepBinary();
    boolean onChange(Iterable<EntryEvent<K, V>> events);
    void stop();
}

@IgniteExperimental
public interface EntryEvent<K, V> {
    public K key();
    public V value();
    public boolean primary();
    EntryEventType operation();
    long cacheId();
    long expireTime();
}
```
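For illustration, a consumer's onChange could look roughly like this. The EntryEvent interface below is a simplified local stand-in (only key/value/primary) so the sketch is self-contained; the real interfaces live in the PR above:

```java
import java.util.*;

/** Simplified stand-in for the proposed API, so this sketch is self-contained. */
interface EntryEvent<K, V> {
    K key();
    V value();
    boolean primary();
}

/** Sketch of a consumer: apply primary events, then acknowledge the batch. */
class CountingConsumer<K, V> {
    final Map<K, V> applied = new HashMap<>();

    /** Returning true tells CDC the batch is consumed and the offset may be committed. */
    boolean onChange(Iterable<EntryEvent<K, V>> events) {
        for (EntryEvent<K, V> evt : events) {
            if (evt.primary())  // skip backup copies to avoid duplicate work
                applied.put(evt.key(), evt.value());
        }
        return true;
    }
}
```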

> There is no way to start capturing changes from a certain point
> If a CDC agent is restarted, it will have to start from scratch.

There is a way :)

CDC stores the processed offset in a special file.
In case of a CDC restart, changes will be captured starting from the last committed offset.
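The commit-and-resume logic could be sketched like this (the file name and text format are assumptions for illustration, not Ignite's actual on-disk state format):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

/** Sketch of committing the last-consumed WAL pointer to a state file. */
class CdcOffsetStoreSketch {
    private final Path stateFile;

    public CdcOffsetStoreSketch(Path dir) {
        this.stateFile = dir.resolve("cdc-offset");
    }

    /** Write-then-atomic-rename, so a crash never leaves a half-written offset. */
    public void commit(long segmentIdx, long recordOffset) throws IOException {
        Path tmp = stateFile.resolveSibling("cdc-offset.tmp");
        Files.write(tmp, (segmentIdx + ":" + recordOffset).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, stateFile, StandardCopyOption.ATOMIC_MOVE,
            StandardCopyOption.REPLACE_EXISTING);
    }

    /** On restart: resume from the committed pointer, or from scratch if none. */
    public long[] read() throws IOException {
        if (!Files.exists(stateFile))
            return new long[] {0, 0};
        String[] parts = new String(Files.readAllBytes(stateFile), StandardCharsets.UTF_8).split(":");
        return new long[] {Long.parseLong(parts[0]), Long.parseLong(parts[1])};
    }
}
```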

> Users can configure a large size for WAL archive to sustain long node downtime for historical rebalance

To fix this issue, I propose to introduce a timeout that forces WAL segment archiving.
And yes, the event-time gap and big WAL segments are tradeoffs in real-world deployments.

> If a CDC reader does not keep up with the WAL write rate (e.g. there is a
> short-term write burst and WAL archive is small), the Ignite node will
> delete WAL segments while the consumer is still reading it.

This is fixed now.
I implemented Pavel's proposal: on WAL rollover, if CDC is enabled, a hard link to the archived segment is created in a special folder.
So CDC can process a segment independently of the main Ignite process and delete it when finished.
Note that the segment data will be removed from disk only after both CDC and Ignite have removed their hard links to it.

> If Ignite node crashes, gets restarted and initiates full rebalance, the consumer will lose some updates

I expect that consumers will be started on each cluster node.
So, no event loss here.

>  Usually, it makes sense for the CDC consumer to read updates only on  primary nodes

Makes sense, thanks.
I've added a `primary` flag to the DataEntry WAL record.
Take a look at PR [3].

>  the consumer would need to process backup records anyway because it is unknown whether the primary consumer is alive.

If CDC on some node is down, it will deliver the updates after a restart.

I want to restrict the CDC scope to "deliver local WAL events to the consumer".
CDC itself is not responsible for the distributed consumer state.
It's up to the consumer to implement some kind of failover scenario to keep all CDC instances up and running.

And yes, it's expected that if CDC is down, the event lag grows.
To prevent this, the user can process all events, not only the primary ones.

> In other words, how would an end-user organize the CDC failover minimizing the duplicate work?

1. To recover from a CDC app failure, a simple restart will work.
2. The user can now distinguish between primary and backup DataEntry records.
This allows the user to prevent duplicate work and to recover from an Ignite node failure while the OS and the CDC app are still up.
3. To keep a small event gap in case of a full server failure (OS, Ignite node, and CDC app are down),
it's required to process changes for backup nodes and do some duplicate processing.

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-59+CDC+-+Capture+Data+Change
[2] https://github.com/apache/ignite/pull/8360
[3] https://github.com/apache/ignite/pull/8377


> 16 окт. 2020 г., в 14:19, Pavel Kovalenko <[hidden email]> написал(а):
>
> Alexey,
>
>>> If a CDC agent is restarted, it will have to start from scratch
>>> If a CDC reader does not keep up with the WAL write rate (e.g. there
>   is a short-term write burst and WAL archive is small), the Ignite node
> will
>   delete WAL segments while the consumer is still reading it.
>
> I think these cases can be resolved with the following approach:
> PostgreSQL can be configured to execute a shell command after WAL segment
> is archived. The same thing we can do for Ignite as well.
> A command can create a hard link for such a WAL segment in a specified
> directory so as not to lose it after deletion by Ignite, and notify a CDC (or
> another kind of process) about this segment.
> That will be a filesystem queue, and after restart CDC may process only the
> segments located in this directory, so there is no need to start from scratch.
> When a WAL segment is processed by CDC, its hard link in the queue directory
> is deleted.
>
>
>
> пт, 16 окт. 2020 г. в 13:42, Alexey Goncharuk <[hidden email]>:
>
>> Hello Nikolay,
>>
>> Thanks for the suggestion, it definitely may be a good feature, however, I
>> do not see any significant value that it currently adds to the already
>> existing WAL Iterator. I think the following issues should be addressed,
>> otherwise, no regular user will be able to use the CDC reliably:
>>
>>   - The interface exposes WALRecord which is a private API
>>   - There is no way to start capturing changes from a certain point (a
>>   watermark for already processed data). Users can configure a large size
>> for
>>   WAL archive to sustain long node downtime for historical rebalance. If a
>>   CDC agent is restarted, it will have to start from scratch. I see that
>> it
>>   is present in the IEP as a design choice, but I think this is a major
>>   usability issue
>>   - If a CDC reader does not keep up with the WAL write rate (e.g. there
>>   is a short-term write burst and WAL archive is small), the Ignite node
>> will
>>   delete WAL segments while the consumer is still reading it. Since the
>>   consumer is running out-of-process, we need to specify some sort of
>>   synchronization protocol between the node and the consumer
>>   - If Ignite node crashes, gets restarted and initiates full rebalance,
>>   the consumer will lose some updates
>>   - Usually, it makes sense for the CDC consumer to read updates only on
>>   primary nodes (otherwise, multiple agents will be doing duplicate
>> work). In
>>   the current design, the consumer will not be able to differentiate
>>   primary/backup updates. Moreover, even if we wrote such flags to WAL,
>> the
>>   consumer would need to process backup records anyway because it is
>> unknown
>>   whether the primary consumer is alive. In other words, how would an end
>>   user organize the CDC failover minimizing the duplicate work?
>>
>>
>> ср, 14 окт. 2020 г. в 14:21, Nikolay Izhikov <[hidden email]>:
>>
>>> Hello, Igniters.
>>>
>>> I want to start a discussion of the new feature [1]
>>>
>>> CDC - capture data change. The feature allows the consumer to receive
>>> online notifications about data record changes.
>>>
>>> It can be used in the following scenarios:
>>>        * Export data into some warehouse, full-text search, or
>>> distributed log system.
>>>        * Online statistics and analytics.
>>>        * Wait and respond to some specific events or data changes.
>>>
>>> Propose to implement new IgniteCDC application as follows:
>>>        * Run on the server node host.
>>>        * Watches for the appearance of the WAL archive segments.
>>>        * Iterates it using existing WALIterator and notifies consumer of
>>> each record from the segment.
>>>
>>> IgniteCDC features:
>>>        * Independence from the server node process (JVM) - issues and
>>> failures of the consumer will not lead to server node instability.
>>>        * Notification guarantees and failover - i.e. CDC track and save
>>> the pointer to the last consumed record. Continue notification from this
>>> pointer in case of restart.
>>>        * Resilience for the consumer - it's not an issue when a consumer
>>> temporarily consumes slower than data appear.
>>>
>>> WDYT?
>>>
>>> [1]
>>>
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-59+CDC+-+Capture+Data+Change
>>