Why do we load Hadoop classes in a separate classloader?

Why do we load Hadoop classes in a separate classloader?

Vladimir Ozerov
Folks,

In the current Hadoop Accelerator design we always process user jobs in a
separate classloader called HadoopClassLoader. It is somewhat special
because it always loads Hadoop classes from scratch.

This leads to at least two serious problems:
1) Very high permgen/metaspace load. Workaround: more permgen.
2) Native Hadoop libraries cannot be used. There are quite a few native
methods in Hadoop, and the corresponding dll/so files are loaded in static
class initializers. As each HadoopClassLoader loads classes over and over
again, the libraries are loaded several times as well. But Java does not
allow the same native library to be loaded from different classloaders.
The result is JNI linkage errors (a small sketch below illustrates this).
For instance, this affects the Snappy compress/decompress library, which is
pretty important in the Hadoop ecosystem.
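For illustration, here is a minimal, self-contained sketch of the JNI
restriction. This is not Ignite code: the jar path is a placeholder, and
NativeUser is a hypothetical stand-in for any Hadoop class (e.g. a Snappy
codec wrapper) that calls System.loadLibrary in a static initializer.

// NativeUser.java - hypothetical stand-in for a class with native methods.
public class NativeUser {
    static {
        System.loadLibrary("snappy"); // binds libsnappy to this class's classloader
    }
}

// Demo.java - loads NativeUser from two isolated classloaders, mimicking what
// happens with per-job HadoopClassLoader instances (parent == null, so each
// loader defines the class from scratch).
import java.net.URL;
import java.net.URLClassLoader;

public class Demo {
    public static void main(String[] args) throws Exception {
        URL[] cp = { new URL("file:/path/to/native-user.jar") }; // placeholder path

        ClassLoader ldr1 = new URLClassLoader(cp, null);
        ClassLoader ldr2 = new URLClassLoader(cp, null);

        Class.forName("NativeUser", true, ldr1); // ok: libsnappy is loaded here
        Class.forName("NativeUser", true, ldr2); // UnsatisfiedLinkError:
                                                 // "Native Library ... already loaded in another classloader"
    }
}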

Clearly, this isolation with a custom classloader was done on purpose, and I
understand why it is important, for example, for user-defined classes.

But why do we load Hadoop classes (e.g. org.apache.hadoop.fs.FileSystem)
multiple times? Does anyone have a clue?
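To make the question concrete, the alternative I have in mind looks roughly
like this (just a sketch, not the actual HadoopClassLoader; the class name
and constructor are illustrative): keep child-first loading for user job
classes, but delegate shared Hadoop classes to the parent so they are
defined only once per JVM.

import java.net.URL;
import java.net.URLClassLoader;

public class DelegatingJobClassLoader extends URLClassLoader {
    public DelegatingJobClassLoader(URL[] jobJars, ClassLoader parent) {
        super(jobJars, parent);
    }

    @Override protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);

            if (c == null) {
                if (name.startsWith("org.apache.hadoop."))
                    c = getParent().loadClass(name); // shared: defined once per JVM
                else {
                    try {
                        c = findClass(name); // child-first for user job classes
                    }
                    catch (ClassNotFoundException e) {
                        c = getParent().loadClass(name);
                    }
                }
            }

            if (resolve)
                resolveClass(c);

            return c;
        }
    }
}

With something like this, org.apache.hadoop.fs.FileSystem would be defined
exactly once, and its native libraries would be loaded only once as well.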

Vladimir.

Re: Why do we load Hadoop classes in a separate classloader?

Sergi
Initially this was done to support a multithreaded model of running tasks,
as opposed to Hadoop's original multiprocess model.
And as far as I remember, there were attempts to reuse Hadoop classes, but
they failed.

As for the high permgen load, it should not actually be that high: the
number of classloaders should be only slightly higher than the number of
concurrently running tasks, since task classloaders are pooled and reused.
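Roughly, the pooling works like this (illustrative sketch only, not the
Ignite implementation; class and method names are made up):

import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.ConcurrentLinkedDeque;

public class TaskClassLoaderPool {
    private final ConcurrentLinkedDeque<URLClassLoader> pool = new ConcurrentLinkedDeque<>();
    private final URL[] hadoopClasspath;

    public TaskClassLoaderPool(URL[] hadoopClasspath) {
        this.hadoopClasspath = hadoopClasspath;
    }

    /** A task borrows a loader; a new one is created only when none is free. */
    public URLClassLoader borrow() {
        URLClassLoader ldr = pool.pollFirst();

        return ldr != null ? ldr : new URLClassLoader(hadoopClasspath, null);
    }

    /** Returned loaders are reused by later tasks instead of being discarded. */
    public void release(URLClassLoader ldr) {
        pool.offerFirst(ldr);
    }
}

So the number of live classloaders tracks peak task concurrency rather than
the total number of tasks executed.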

As for native code, I'm not sure what can be done here. I think
environments like OSGi (Eclipse, etc.) have the same issue; maybe we can
look at what they do in the case of native dependencies in bundles?

Sergi