Posted to user@flink.apache.org by Conrad Crampton <co...@SecData.com> on 2017/08/11 13:46:43 UTC

Re: Classloader and removal of native libraries

Hi Aljoscha,
“Hope that helps”…
ABSOLUTELY!!!

I have dug through the javacpp source code to find out how the Loader class uses the temp cache location for the native libraries, and in the open method of my RichMapFunction I am now setting the javacpp cache-location system property to a random location, so if the job restarts and calls open again it uses a fresh temp location, as in your example.
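
For anyone hitting the same thing later, the change is roughly this (a minimal sketch, not my exact code; it assumes javacpp reads its cache location from the org.bytedeco.javacpp.cachedir system property and that the model is restored via DL4J's ModelSerializer – the class name, generic types and model resource below are just placeholders):

import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.util.ModelSerializer;

public class ModelLoadingMapFunction extends RichMapFunction<String, String> {

    private transient MultiLayerNetwork model;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Point javacpp at a fresh, random cache directory *before* anything
        // triggers native library extraction. On a restart, open() runs again in a
        // new classloader, and a new cache dir means the JVM never tries to re-load
        // a .so that is still pinned by the previous classloader.
        Path cacheDir = Files.createTempDirectory("javacpp-cache-");
        System.setProperty("org.bytedeco.javacpp.cachedir", cacheDir.toString());

        // Restoring the DL4J model is what pulls in the ND4J/javacpp natives.
        this.model = ModelSerializer.restoreMultiLayerNetwork(
                getClass().getResourceAsStream("/dga-model.zip"));
    }

    @Override
    public String map(String value) throws Exception {
        // ... run inference with this.model ...
        return value;
    }
}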

So thank you so much, that got me on the right path. Now I’ve got a problem with out-of-memory errors, arrghh (I will start another thread on this as I don’t want to soil this one with a different topic).

Thanks again
Conrad

From: Aljoscha Krettek <al...@apache.org>
Date: Thursday, 10 August 2017 at 15:57
To: Conrad Crampton <co...@SecData.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: Classloader and removal of native libraries

Hi Conrad,

I'm afraid you're running into the same problem that we already encountered when loading the native RocksDB library: https://github.com/apache/flink/blob/219ae33d36e67e3e74f493cf4956a290bc966a5d/flink-contrib/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateBackend.java#L519

The code section has a good description of what is going on. We're using NativeLibraryLoader [1], which comes with RocksDB, to try to load the native library from a different temporary location if loading it the normal way fails. (The relocation of the lib to a temp dir happens in the NativeLibraryLoader, not on our side; we're just providing a temp path for NativeLibraryLoader to work with.)
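
Schematically, the pattern looks something like this (a simplified sketch, not the actual implementation linked above):

import java.io.File;
import java.util.UUID;

import org.rocksdb.NativeLibraryLoader;

public final class RocksDbNativeLibSketch {

    // Retry loading the RocksDB native library, handing the loader a fresh,
    // uniquely named temp directory on every attempt so a copy that is still
    // pinned by an older classloader does not get in the way.
    public static void ensureRocksDbIsLoaded(String tempDirectory) throws Exception {
        Throwable lastException = null;
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                File tmpDir = new File(tempDirectory, "rocksdb-lib-" + UUID.randomUUID());
                tmpDir.mkdirs();
                // NativeLibraryLoader extracts the bundled librocksdbjni into the
                // given directory and loads it from that path.
                NativeLibraryLoader.getInstance().loadLibrary(tmpDir.getAbsolutePath());
                return;
            } catch (Throwable t) {
                lastException = t;
            }
        }
        throw new Exception("Could not load the native RocksDB library", lastException);
    }
}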

Hope that helps,
Aljoscha

[1] https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/NativeLibraryLoader.java

On 10. Aug 2017, at 15:36, Conrad Crampton <co...@SecData.com> wrote:

Hi,
First time posting here so ‘hi’.
I have been using Flink (1.3.1 now) for a couple of months and loving it. My deployment is a JobManager running as a long-running session on YARN.
I have a problem with a Flink streaming job that involves loading native libraries in one of the mappers (inheriting from RichMapFunction), which loads (in the open method) a previously trained machine learning model (using Deeplearning4j). The catch is that when loading the model, it also loads some native libraries via the javacpp Loader class (which, from looking at the source code, determines a location for the native libraries from a hierarchy of: a system property, the user's home dir (under .javacpp), or the temp dir).
Anyway, the actual problem is that if an exception is thrown in the Flink job, the JobManager tries to restart it; however, it appears that when it failed in the first place, references to the objects (and therefore the classes) aren’t released by the classloader, as I get this error:

java.lang.ExceptionInInitializerError
        at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.<init>(NativeOpExecutioner.java:43)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at java.lang.Class.newInstance(Class.java:442)
        at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:5784)
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5694)
        at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:184)
        at org.deeplearning4j.util.ModelSerializer.restoreMultiLayerNetwork(ModelSerializer.java:188)
        at org.deeplearning4j.util.ModelSerializer.restoreMultiLayerNetwork(ModelSerializer.java:138)
        at com.secdata.gi.detect.map.DGAClassifierMapFunction.open(DGAClassifierMapFunction.java:54)
        at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
        at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:111)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:376)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: ND4J is probably missing dependencies. For more information, please refer to: http://nd4j.org/getstarted.html
        at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:51)
        at org.nd4j.nativeblas.NativeOpsHolder.<clinit>(NativeOpsHolder.java:19)
        ... 18 more
Caused by: java.lang.UnsatisfiedLinkError: no jnind4jcpu in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1865)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:963)
        at org.bytedeco.javacpp.Loader.load(Loader.java:764)
        at org.bytedeco.javacpp.Loader.load(Loader.java:671)
        at org.nd4j.nativeblas.Nd4jCpu.<clinit>(Nd4jCpu.java:10)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.bytedeco.javacpp.Loader.load(Loader.java:726)
        at org.bytedeco.javacpp.Loader.load(Loader.java:671)
        at org.nd4j.nativeblas.Nd4jCpu$NativeOps.<clinit>(Nd4jCpu.java:62)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.nd4j.nativeblas.NativeOpsHolder.<init>(NativeOpsHolder.java:29)
        ... 19 more
Caused by: java.lang.UnsatisfiedLinkError: Native Library /home/yarn/.javacpp/cache/blob_a87e49f9a475a9dc4296f6afbc3ae171dc821d19/org/nd4j/nativeblas/linux-x86_64/libjnind4jcpu.so already loaded in another classloader
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1903)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1822)
        at java.lang.Runtime.load0(Runtime.java:809)
        at java.lang.System.load(System.java:1086)
        at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:943)

I’m no expert in class loading at all, but have tried a number of ways to overcome this based on the docs on dynamic classloading:

1. Putting the jars that contain the native libraries in the flink/lib directory so they are always loaded on JobManager startup, and removing them from the deployed Flink job
2. Some code in the open/close methods of my mapper to try to dispose of the native library:
// JNA handle to the javacpp/nd4j native library extracted under the cache dir
NativeLibrary lib = NativeLibrary.getInstance(
        Loader.getCacheDir() + "/cache/nd4j-native-0.8.0-" + Loader.getPlatform()
                + ".jar/org/nd4j/nativeblas/" + Loader.getPlatform() + "/libjnind4jcpu.so");
this.model = null;   // drop the model reference as well

if (null != lib) {
    System.out.println("found lib - removing");
    lib.dispose();
    System.gc();
}
This sort of works in finding the library (on my Mac running locally), but as you can see from the stack trace above, on deployment to YARN some random value is placed after the /cache dir, which I don’t know how to get a handle on in order to construct the correct library location (and just using the library name, as the JNI docs suggest for NativeLibrary.getInstance, fails to find the library).
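
If it helps to see it concretely, something along these lines is what I was aiming for – a sketch only, using javacpp's Loader.getCacheDir() and a search over the cache dir instead of the hard-coded path that breaks on YARN (the helper class and method are hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

import com.sun.jna.NativeLibrary;
import org.bytedeco.javacpp.Loader;

public final class NativeLibDisposeSketch {

    // Walk javacpp's cache dir and dispose of the extracted library if present,
    // instead of hard-coding the random blob_... segment of the path.
    public static void disposeIfPresent(String libFileName) throws IOException {
        Path cacheRoot = Loader.getCacheDir().toPath();
        try (Stream<Path> paths = Files.walk(cacheRoot)) {
            Optional<Path> libPath = paths
                    .filter(p -> p.getFileName().toString().equals(libFileName))
                    .findFirst();
            if (libPath.isPresent()) {
                System.out.println("found lib - removing: " + libPath.get());
                NativeLibrary.getInstance(libPath.get().toString()).dispose();
                System.gc();
            }
        }
    }
}

// e.g. NativeLibDisposeSketch.disposeIfPresent("libjnind4jcpu.so");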

Neither of the above approaches works, so when my job fails I can’t cancel and resubmit, as it just fails with the same stack trace. The only way I can get it to run again is to cancel all other jobs running on the JobManager, kill the JM YARN session, create a new YARN session JM and resubmit the Flink job – which is a real pain. Ideally I would like to stop the exception in the first place, but as I can’t figure out how to get logging appearing in my YARN logs either (for debugging), I’m at a bit of a loss!

Any pointers, suggestions please??

Many thanks
Conrad




Re: Classloader and removal of native libraries

Posted by Martin Eden <ma...@gmail.com>.
Hi,

I'm reviving this thread because I am probably hitting a similar issue with
loading a native library. However I am not able to resolve it with the
suggestions here.

I am using Flink 1.3.2 and the Jep library to call CPython from a
RichFlatMapFunction with a parallelism of 10. I am instantiating the Jep
class in the open method. The Jep class has a static block where it loads
the native library:
static {
  System.loadLibrary("jep");
}

I am running Flink in standalone mode in a high availability setup with
zookeeper.

Initially I am setting the java.library.path on the task manager JVM. The
code works; however, when I test JobManager recovery by killing and
restarting the JobManager, the redeployed job starts endlessly failing
with:

java.lang.UnsatisfiedLinkError: Native Library
/Library/Python/2.7/site-packages/jep/libjep.jnilib already loaded in
another classloader

Removing the java.library.path from the task manager JVM and instead
setting it programmatically with a randomized path each time the
RichFlatMapFunction is initialised (i.e. each time the job is redeployed)
causes the task manager JVM to fail upon JobManager recovery.

Conrad, can you give more details about exactly what paths you set and how?

Any suggestions from anyone would be much appreciated.

Thanks,
M



