You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by "jpivarski@gmail.com" <jp...@gmail.com> on 2016/04/28 21:56:26 UTC

Re: Tungsten off heap memory access for C++ libraries

Hi,

I'm coming from the particle physics community and I'm also very interested
in the development of this project. We have a huge C++ codebase and would
like to start using the higher-level abstractions of Spark in our data
analyses. To this end, I've been developing code that copies data from our
C++ framework, ROOT, into Scala:

https://github.com/diana-hep/rootconverter/tree/master/scaroot-reader
<https://github.com/diana-hep/rootconverter/tree/master/scaroot-reader>  

(Worth noting: the ROOT file format is too complex for a complete rewrite in
Java or Scala to be feasible. ROOT readers in Java and even Javascript
exist, but they only handle simple cases.)

I have a variety of options for how to lay out the bytes during this
transfer, and in all cases fill the constructor arguments of Scala classes
using macros. When I learned that you're moving the Spark data off-heap (at
the same time as I'm struggling to move it on-heap), I realized that you
must have chosen a serialization format for that data, and I should be using
/that/ serialization format.

Even though it's early, do you have any designs for that serialization
format? Have you picked a standard one? Most of the options, such as Avro,
don't make a lot of sense because they pack integers to minimize number of
bytes, rather than lay them out for efficient access (including any
byte-alignment considerations).

Also, are there any plans for an API that /fills/ an RDD or DataSet from the
C++ side, as I'm trying to do?

Thanks,
-- Jim


P.S. Concerning Java/C++ bindings, there are many. I tried JNI, JNA, BridJ,
and JavaCPP personally, but in the end picked JNA because of its
(comparatively) large user base. If Spark will be using Djinni, that could
be a symmetry-breaking consideration and I'll start using it for
consistency, maybe even interoperability.




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p17387.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Tungsten off heap memory access for C++ libraries

Posted by "jpivarski@gmail.com" <jp...@gmail.com>.
jpivarski@gmail.com wrote
> P.S. Concerning Java/C++ bindings, there are many. I tried JNI, JNA,
> BridJ, and JavaCPP personally, but in the end picked JNA because of its
> (comparatively) large user base. If Spark will be using Djinni, that could
> be a symmetry-breaking consideration and I'll start using it for
> consistency, maybe even interoperability.

I think I misunderstood what Djinni is. JNA, BridJ, and JavaCPP provide
access to untyped bytes (except for common cases like java.lang.String), but
it looks like Djinni goes further and provides a type mapping--- exactly the
"serialization format" or "layout of bytes" that I was asking about.

Is it safe to say that when Spark has off-heap caching, that it will be in
the format specified by Djinni? If I work to integrate ROOT with Djinni,
will this be a major step toward integrating it with Spark 2.0?

Even if the above answers my first question, I'd still like to know if the
new Spark API will allow RDDs to be /filled/ from the C++ side, as a data
source, rather than a derived dataset.




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p17388.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org