HDFS iNotify

Michael André Pearce
IGFS caches HDFS. As with any cache, if the underlying store changes you can end up with dirty reads or an inconsistent view, or you have to poll the original source. The same challenge applies if you want to pre-cache new data added to the underlying store.

This has already been noted as a key issue for other tools, such as indexers and Oozie, so a solution called iNotify has already been implemented in HDFS under https://issues.apache.org/jira/browse/HDFS-6634

The proposal is to extend IGFS to support updates from the underlying secondary file system, starting with the Hadoop file system via HDFS iNotify, so that IGFS stays up to date with changes made to the underlying file system. A future idea is to make it configurable to pre-cache new files in certain directories, such as newly ingested data.
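
To make the idea concrete, here is a minimal sketch of how change notifications might keep a cache consistent and drive pre-caching. Everything in it is an assumption for illustration: `IgfsSyncSketch`, `Change`, `ChangeType`, `onChange` and the pre-cache directory list are hypothetical names, not Ignite or Hadoop APIs, and plain in-memory maps stand in for both IGFS and HDFS.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only: none of these types exist in Ignite or Hadoop. */
final class IgfsSyncSketch {
    /** Kinds of change a secondary file system might report. */
    enum ChangeType { CREATED, MODIFIED, DELETED }

    /** One change notification, e.g. translated from an iNotify event. */
    static final class Change {
        final ChangeType type;
        final String path;

        Change(ChangeType type, String path) {
            this.type = type;
            this.path = path;
        }
    }

    /** Stand-in for the underlying store (HDFS in the proposal). */
    private final Map<String, String> secondaryFs;
    /** Stand-in for the caching layer. */
    private final Map<String, String> cache = new HashMap<>();
    /** Directories whose new files should be cached eagerly. */
    private final List<String> preCacheDirs;

    IgfsSyncSketch(Map<String, String> secondaryFs, List<String> preCacheDirs) {
        this.secondaryFs = secondaryFs;
        this.preCacheDirs = preCacheDirs;
    }

    /** Read-through: serve from the cache, loading from the store on a miss. */
    String read(String path) {
        return cache.computeIfAbsent(path, secondaryFs::get);
    }

    /** Apply one notification: drop the stale entry, then pre-cache if configured. */
    void onChange(Change change) {
        cache.remove(change.path); // whatever was cached is now suspect
        if (change.type == ChangeType.CREATED && shouldPreCache(change.path)) {
            String data = secondaryFs.get(change.path);
            if (data != null) {
                cache.put(change.path, data);
            }
        }
    }

    boolean isCached(String path) {
        return cache.containsKey(path);
    }

    private boolean shouldPreCache(String path) {
        for (String dir : preCacheDirs) {
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Without the `onChange` call, a `read` after an external modification keeps returning the stale cached value, which is the dirty-read problem described above.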

Re: HDFS iNotify

Vladimir Ozerov
Hi Mike,

I think that in general this is a very good idea. For now IGFS has a
restriction that all file system operations must go through it. Otherwise
IGFS will end up in an inconsistent state. iNotify could remove this
limitation. However, I think this is not an easy task, for several reasons:
1) iNotify is a relatively new feature and, as far as I can see from the
aforementioned tickets in the HDFS JIRA, it is still evolving. So it looks
like we cannot easily integrate with it directly, because that could make
our modules incompatible with older Hadoop versions. This could be resolved
with a carefully designed pluggable callback interface for IGFS and several
implementations of the proposed iNotify module, each targeting a different
version of the iNotify interface.
2) We will have to revisit the secondary file system logic. When the file
system is updated from both IGFS and HDFS simultaneously, we need to be able
to resolve these conflicts somehow, and have a single synchronization point
to prevent inconsistencies.
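
A rough sketch of how the two points above might fit together, under stated assumptions. `SecondaryFsListener`, `NotificationSource`, `SyncPoint` and `FixedSource` are hypothetical names, not existing Ignite or Hadoop types. The sketch also assumes that each notification carries a monotonically increasing id and that the secondary file system wins on conflict; both are illustrative choices, not a settled design.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only: none of these types exist in Ignite or Hadoop. */
final class SecondaryFsSyncSketch {
    /** Version-neutral callback that IGFS could expose. */
    interface SecondaryFsListener {
        void onPathChanged(String path, long externalTxId);
    }

    /**
     * Pluggable source of notifications. The idea is one implementation per
     * supported Hadoop version, each translating that version's native events
     * into the neutral callback above.
     */
    interface NotificationSource {
        void start(SecondaryFsListener listener);
    }

    /** Single place through which both IGFS writes and external changes pass. */
    static final class SyncPoint implements SecondaryFsListener {
        private final Object lock = new Object();
        private final Map<String, String> cache = new HashMap<>();
        private long lastExternalTxId = -1;

        /** A write issued through IGFS itself. */
        void writeThroughIgfs(String path, String data) {
            synchronized (lock) {
                cache.put(path, data);
            }
        }

        /** A change made directly in the secondary file system. */
        @Override
        public void onPathChanged(String path, long externalTxId) {
            synchronized (lock) {
                if (externalTxId <= lastExternalTxId) {
                    return; // duplicate or replayed notification
                }
                lastExternalTxId = externalTxId;
                cache.remove(path); // assumed policy: secondary file system wins
            }
        }

        String cached(String path) {
            synchronized (lock) {
                return cache.get(path);
            }
        }
    }

    /** Toy source that replays a fixed list of paths with ids 0, 1, 2, ... */
    static final class FixedSource implements NotificationSource {
        private final String[] paths;

        FixedSource(String... paths) {
            this.paths = paths;
        }

        @Override
        public void start(SecondaryFsListener listener) {
            for (int i = 0; i < paths.length; i++) {
                listener.onPathChanged(paths[i], i);
            }
        }
    }
}
```

Because both entry points take the same lock, an IGFS write and an external notification for the same path are always applied one after the other, never interleaved. A real implementation would still need a proper conflict policy, which this sketch does not attempt.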

I will create a ticket for this.

Vladimir.


On Thu, Mar 24, 2016 at 2:05 AM, Michael André Pearce <
[hidden email]> wrote:

> IGFS caches HDFS. As with any cache, if the underlying store changes you
> can end up with dirty reads or an inconsistent view, or you have to poll
> the original source. The same challenge applies if you want to pre-cache
> new data added to the underlying store.
>
> This has already been noted as a key issue for other tools, such as
> indexers and Oozie, so a solution called iNotify has already been
> implemented in HDFS under https://issues.apache.org/jira/browse/HDFS-6634
>
> The proposal is to extend IGFS to support updates from the underlying
> secondary file system, starting with the Hadoop file system via HDFS
> iNotify, so that IGFS stays up to date with changes made to the underlying
> file system. A future idea is to make it configurable to pre-cache new
> files in certain directories, such as newly ingested data.