Posted to user@spark.apache.org by bhayes <Br...@informatica.com> on 2017/02/13 13:59:35 UTC

Does Spark support heavy-duty third-party libraries?

I have a rather heavyweight native shared library (written in C++) which,
among other options, can also be accessed via a Java/JNI wrapper JAR. The
library needs up to 1000 external files, which in total can exceed 50 GB in
size. All of this data has to be read by the library at init time, so
initialization can take a while. The memory required is in the same range,
say 64 GB. The library itself would then be accessed from a mapping function
passed to Spark. My question is whether Spark (or, failing that, Hadoop)
supports this kind of library, specifically:

1. Can it be configured to initialize this library only once, at startup
time of the cluster or job?
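
For illustration, this is the kind of once-per-executor initialization I
have in mind: a JVM singleton holding the JNI wrapper, touched lazily from
mapPartitions. All names below (NativeLib, HeavyLib, the data path) are
hypothetical stand-ins for the real library:

import org.apache.spark.sql.SparkSession

// Stand-in for the real JNI wrapper class; methods are stubbed out.
class NativeLib {
  def init(dataDir: String): Unit = ()       // stand-in for the JNI init call
  def process(input: String): String = input // stand-in for a JNI lookup call
}

object HeavyLib {
  // A lazy val is initialized at most once per JVM, i.e. once per executor,
  // not once per task or per record.
  lazy val instance: NativeLib = {
    val lib = new NativeLib
    lib.init("/data/heavylib")
    lib
  }
}

object InitOnceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("init-once-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))
    val results = rdd.mapPartitions { iter =>
      val lib = HeavyLib.instance // first touch on an executor triggers init
      iter.map(lib.process)
    }
    results.collect().foreach(println)
    spark.stop()
  }
}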

2. Can it be configured so that the library is available on only a few
nodes (a non-uniform cluster)?
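
Spark itself doesn't track per-node library availability, but on YARN one
way I could imagine doing this is with node labels, assuming the cluster
admin has labeled the machines that have the library installed (the label
name below is made up):

import org.apache.spark.sql.SparkSession

object LabeledNodesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("labeled-nodes-demo")
      // Ask YARN to place executors only on nodes carrying this label.
      .config("spark.yarn.executor.nodeLabelExpression", "heavylib")
      .getOrCreate()

    // ... the job then runs only on executors on the labeled nodes ...
    spark.stop()
  }
}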

3. Can it distribute the library, its connector JAR, its config files, and
its data files throughout the cluster? (Otherwise, its main data files could
be held in a cluster file system.)
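
For the smaller artifacts (wrapper JAR, config files) I imagine something
like SparkContext.addJar/addFile, which corresponds to the --jars/--files
flags of spark-submit; the ~50 GB of data files would presumably stay in
HDFS or another cluster file system rather than being shipped this way. A
sketch, with made-up paths:

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object DistributeFilesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("distribute-files-demo").getOrCreate()
    val sc = spark.sparkContext

    // Small artifacts are shipped to every executor; equivalent to the
    // --jars / --files flags of spark-submit. Paths are illustrative.
    sc.addJar("hdfs:///libs/heavylib-wrapper.jar")
    sc.addFile("hdfs:///libs/heavylib.conf")

    val localPaths = sc.parallelize(1 to 2).map { _ =>
      // Resolve the executor-local copy of a distributed file by file name.
      SparkFiles.get("heavylib.conf")
    }
    localPaths.collect().foreach(println)
    spark.stop()
  }
}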

4. The library can in fact also use multi-threading internally to process an
array of inputs, so the questions here are:
4a. Does Spark support passing an array of inputs (via partitioning?)?
4b. Can Spark be made aware of the library-internal multi-threading?
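
What I have in mind for 4a/4b is roughly the following: mapPartitions to
materialize a whole partition into an array for the library's batch entry
point, and spark.task.cpus to make the scheduler reserve cores for the
library's internal threads (the batch method and the value 4 are made up):

import org.apache.spark.sql.SparkSession

object BatchedCallsDemo {
  def main(args: Array[String]): Unit = {
    // spark.task.cpus reserves several cores per task, so the scheduler
    // accounts for the library's internal threads; 4 is an arbitrary example.
    val spark = SparkSession.builder
      .appName("batched-calls-demo")
      .config("spark.task.cpus", "4")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c", "d"), 2)

    val out = rdd.mapPartitions { iter =>
      // A whole partition becomes one array, which could be handed to the
      // library's (hypothetical) multi-threaded batch entry point.
      val batch = iter.toArray
      val results = batch.map(_.toUpperCase) // stand-in for lib.processBatch(batch)
      results.iterator
    }
    out.collect().foreach(println)
    spark.stop()
  }
}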





