IGFS caches HDFS. As with any cache, if the underlying store changes you can end up with dirty reads or an inconsistent view, or you have to poll the original source. The same challenge applies if you want to pre-cache new data added to the underlying store.
This has already been noted as a key issue for other tools such as indexers and Oozie, so a solution called iNotify has already been implemented in HDFS under https://issues.apache.org/jira/browse/HDFS-6634. The proposal here is that IGFS be extended to support updates made directly to the underlying secondary file system, initially the Hadoop file system via HDFS iNotify, so that IGFS can be kept up to date with underlying file system changes. A future idea is to make it configurable to pre-cache new files in certain directories, such as newly ingested data.
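To make the proposal a bit more concrete, here is a rough sketch of the kind of hook I have in mind. Everything in it is hypothetical: `SecondaryFsListener`, `SecondaryFsReader` and `CachingLayer` are made-up names for illustration, not existing IGFS or Hadoop APIs, and the in-memory map is only a stand-in for the real cache. A version-specific notifier (for example one built on HDFS iNotify) would be the thing calling the listener methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical callback invoked when the secondary file system changes outside the cache. */
interface SecondaryFsListener {
    void onCreated(String path);
    void onModified(String path);
    void onDeleted(String path);
}

/** Hypothetical reader that loads a file's bytes from the secondary file system. */
interface SecondaryFsReader {
    byte[] read(String path);
}

/** Toy stand-in for the caching layer; not the real IGFS implementation. */
class CachingLayer implements SecondaryFsListener {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final SecondaryFsReader reader;
    private final List<String> preCacheDirs;

    CachingLayer(SecondaryFsReader reader, List<String> preCacheDirs) {
        this.reader = reader;
        this.preCacheDirs = new ArrayList<>(preCacheDirs);
    }

    /** Read-through: load from the secondary file system on a cache miss. */
    byte[] get(String path) {
        return cache.computeIfAbsent(path, reader::read);
    }

    boolean isCached(String path) {
        return cache.containsKey(path);
    }

    @Override
    public void onCreated(String path) {
        // Eagerly load new files only if they live under a configured directory.
        if (shouldPreCache(path)) {
            byte[] data = reader.read(path);
            if (data != null) {
                cache.put(path, data);
            }
        }
    }

    @Override
    public void onModified(String path) {
        // Drop the stale copy; the next get() reloads it from the secondary file system.
        cache.remove(path);
    }

    @Override
    public void onDeleted(String path) {
        cache.remove(path);
    }

    private boolean shouldPreCache(String path) {
        for (String dir : preCacheDirs) {
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With something like this, a file created under a configured directory such as `/ingest` would be cached as soon as the create event arrives, while a modify or delete event for any path simply evicts the stale entry. This sketch deliberately ignores concurrent writes through the cache itself, which a real design would have to handle.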
Hi Mike,
I think in general this is a very good idea. For now IGFS has a restriction that all file system operations must go through it; otherwise IGFS will end up in an inconsistent state. iNotify could resolve this limitation.

However, I think this is not an easy task, for several reasons:

1) iNotify is a relatively new feature and, as I can see from the aforementioned tickets in the HDFS JIRA, it is still evolving. So it looks like we cannot easily integrate with it directly, because that could make our modules incompatible with older Hadoop versions. This could be resolved with a carefully designed pluggable callback interface for IGFS and several implementations of the proposed iNotify module, targeting different versions of this interface.

2) We will have to revisit the secondary file system logic. When the file system is updated from both IGFS and HDFS simultaneously, we need to be able to resolve these conflicts somehow and have a single synchronization point to prevent inconsistencies.

I will create a ticket for this.

Vladimir.

On Thu, Mar 24, 2016 at 2:05 AM, Michael André Pearce <[hidden email]> wrote:

> IGFS cache’s HDFS, as like any caching if the underlying store changes you
> can end up with a dirty read/inconsistent view, or you end up having to
> poll the original source, also if you want to pre-cache new data added to
> the underlying the same challenges applies.
>
> This has already been noted a key issue for other tools such as indexers,
> oozie as such a solution has been already implemented in HDFS called
> iNotify under https://issues.apache.org/jira/browse/HDFS-6634
>
> The idea/proposal here is that IGFS extended to be able to support
> underlying secondary file system updates, with the intent to first support
> Hadoop File system, HDFS iNotify and being able to keep IGFS up to date to
> underlying file system changes and future idea of being able to configure
> to pre-cache new files in certain dirs, such as newly ingested data.