Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/09/01 11:11:12 UTC

Re: Tungsten off heap memory access for C++ libraries

Please do. Thanks.

On Mon, Aug 31, 2015 at 5:00 AM, Paul Weiss <pa...@gmail.com> wrote:

> Sounds good, want me to create a jira and link it to SPARK-9697? Will put
> down some ideas to start.
> On Aug 31, 2015 4:14 AM, "Reynold Xin" <rx...@databricks.com> wrote:
>
>> BTW if you are interested in this, we could definitely get some help in
>> terms of prototyping the feasibility, i.e. how we can have a native (e.g.
>> C++) API for data access shipped with Spark. There are a lot of questions
>> (e.g. build, portability) that need to be answered.
>>
>> On Mon, Aug 31, 2015 at 1:12 AM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>>
>>> On Sun, Aug 30, 2015 at 5:58 AM, Paul Weiss <pa...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Also, is this work being done on a branch I could look into further and
>>>> try out?
>>>>
>>>>
>>> We don't have a branch yet -- because there is no code or design for
>>> this yet. As I said, it is one of the motivations behind Tungsten, but it
>>> is fairly early and we don't have anything yet. When we start doing it, I
>>> will shoot the dev list an email.
>>>
>>>
>>>
>>

Re: Tungsten off heap memory access for C++ libraries

Posted by "jpivarski@gmail.com" <jp...@gmail.com>.
jpivarski@gmail.com wrote
> P.S. Concerning Java/C++ bindings, there are many. I tried JNI, JNA,
> BridJ, and JavaCPP personally, but in the end picked JNA because of its
> (comparatively) large user base. If Spark will be using Djinni, that could
> be a symmetry-breaking consideration and I'll start using it for
> consistency, maybe even interoperability.

I think I misunderstood what Djinni is. JNA, BridJ, and JavaCPP provide
access to untyped bytes (except for common cases like java.lang.String), but
it looks like Djinni goes further and provides a type mapping: exactly the
"serialization format" or "layout of bytes" that I was asking about.

Is it safe to say that when Spark has off-heap caching, it will be in the
format specified by Djinni? If I work to integrate ROOT with Djinni,
will this be a major step toward integrating it with Spark 2.0?

Even if the above answers my first question, I'd still like to know if the
new Spark API will allow RDDs to be /filled/ from the C++ side, as a data
source, rather than a derived dataset.




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p17388.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Tungsten off heap memory access for C++ libraries

Posted by "jpivarski@gmail.com" <jp...@gmail.com>.
Hi,

I'm coming from the particle physics community and I'm also very interested
in the development of this project. We have a huge C++ codebase and would
like to start using the higher-level abstractions of Spark in our data
analyses. To this end, I've been developing code that copies data from our
C++ framework, ROOT, into Scala:

https://github.com/diana-hep/rootconverter/tree/master/scaroot-reader

(Worth noting: the ROOT file format is too complex for a complete rewrite in
Java or Scala to be feasible. ROOT readers in Java and even JavaScript
exist, but they only handle simple cases.)

I have a variety of options for how to lay out the bytes during this
transfer, and in all cases fill the constructor arguments of Scala classes
using macros. When I learned that you're moving the Spark data off-heap (at
the same time as I'm struggling to move it on-heap), I realized that you
must have chosen a serialization format for that data, and I should be using
/that/ serialization format.

Even though it's early, do you have any designs for that serialization
format? Have you picked a standard one? Most of the options, such as Avro,
don't make a lot of sense because they pack integers to minimize the number
of bytes rather than laying them out for efficient access (including any
byte-alignment considerations).
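To make the layout concern concrete, here is a minimal sketch of a fixed-offset record in a direct ByteBuffer (the field names and layout are invented for illustration; this is not Tungsten's actual format, which was undecided at the time). Every field sits at a known, aligned offset, so a Java or C++ reader can seek straight to a field without decoding variable-length integers first:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed-layout record: fields at known, aligned offsets.
// layout: [0..7] long id, [8..15] double value, [16..19] int flags
public class FixedLayoutDemo {
    static final int RECORD_SIZE = 24;  // padded to an 8-byte boundary

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(RECORD_SIZE)
                                   .order(ByteOrder.nativeOrder());
        buf.putLong(0, 42L);
        buf.putDouble(8, 3.14);
        buf.putInt(16, 7);

        // Random access by offset -- no sequential decode required,
        // unlike varint-packed formats such as Avro.
        System.out.println(buf.getLong(0));   // 42
        System.out.println(buf.getInt(16));   // 7
    }
}
```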

Also, are there any plans for an API that /fills/ an RDD or DataSet from the
C++ side, as I'm trying to do?

Thanks,
-- Jim


P.S. Concerning Java/C++ bindings, there are many. I tried JNI, JNA, BridJ,
and JavaCPP personally, but in the end picked JNA because of its
(comparatively) large user base. If Spark will be using Djinni, that could
be a symmetry-breaking consideration and I'll start using it for
consistency, maybe even interoperability.




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p17387.html


Re: Tungsten off heap memory access for C++ libraries

Posted by Paul Wais <pa...@gmail.com>.
Update for those who are still interested: djinni is a nice tool for
generating Java/C++ bindings.  Before today djinni's Java support was only
aimed at Android, but now djinni works with (at least) Debian, Ubuntu, and
CentOS.

djinni will help you run C++ code in-process with the caveat that djinni
only supports deep-copies of on-JVM-heap data (and no special off-heap
features yet).  However, you can in theory use Unsafe to get pointers to
off-heap memory and pass those addresses (as longs) to native code.
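That Unsafe route can be sketched roughly as follows. Note that sun.misc.Unsafe is an internal, unsupported JDK API obtained here via reflection, so treat this as a proof of concept rather than a recommended pattern:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: allocate off-heap memory with Unsafe and obtain a raw address
// that could be handed (as a long) to native code through a binding layer.
public class OffHeapSketch {
    public static void main(String[] args) throws Exception {
        // sun.misc.Unsafe has no public constructor; grab the singleton.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long addr = unsafe.allocateMemory(16);   // raw off-heap pointer
        unsafe.putLong(addr, 123L);              // JVM side writes...
        // ...and 'addr' could be passed to C++, which sees the same bytes.
        System.out.println(unsafe.getLong(addr)); // 123
        unsafe.freeMemory(addr);
    }
}
```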

So if you need a solution *today*,  try checking out a small demo:
https://github.com/dropbox/djinni/tree/master/example/localhost

For the long deets, see:
 https://github.com/dropbox/djinni/pull/140



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p14427.html


Re: Tungsten off heap memory access for C++ libraries

Posted by Paul Weiss <pa...@gmail.com>.
https://issues.apache.org/jira/browse/SPARK-10399
is the JIRA to track this.
On Sep 1, 2015 5:32 PM, "Paul Wais" <pa...@gmail.com> wrote:

> Paul: I've worked on running C++ code on Spark at scale before (via JNA,
> ~200
> cores) and am working on something more contribution-oriented now (via
> JNI).
> A few comments:
>  * If you need something *today*, try JNA.  It can be slow (e.g. a short
> native function in a tight loop) but works if you have an existing C
> library.
>  * If you want true zero-copy nested data structures (with explicit
> schema),
> you probably want to look at Google FlatBuffers or Cap'n Proto.  Protobuf
> does copies; not sure about Avro.  However, if instances of your nested
> messages fit completely in CPU cache, there might not be much benefit to
> zero-copy.
>  * Tungsten numeric arrays and UTF-8 strings should be portable but likely
> need some special handling.  (A major benefit of Protobuf, Avro,
> Flatbuffers, Capnp, etc., is these libraries already handle endianness and
> UTF8 for C++).
>  * NB: Don't try to dive into messing with (standard) Java String <->
> std::string using JNI.  It's a very messy problem :)
>
> Was there indeed a JIRA started to track this issue?  Can't find it at the
> moment ...
>

Re: Tungsten off heap memory access for C++ libraries

Posted by Paul Wais <pa...@gmail.com>.
Paul: I've worked on running C++ code on Spark at scale before (via JNA, ~200
cores) and am working on something more contribution-oriented now (via JNI). 
A few comments:
 * If you need something *today*, try JNA.  It can be slow (e.g. a short
native function in a tight loop) but works if you have an existing C
library.
 * If you want true zero-copy nested data structures (with explicit schema),
you probably want to look at Google FlatBuffers or Cap'n Proto.  Protobuf
does copies; not sure about Avro.  However, if instances of your nested
messages fit completely in CPU cache, there might not be much benefit to
zero-copy.
 * Tungsten numeric arrays and UTF-8 strings should be portable but will
likely need some special handling.  (A major benefit of Protobuf, Avro,
FlatBuffers, Cap'n Proto, etc., is that these libraries already handle
endianness and UTF-8 for C++.)
 * NB: Don't try diving into (standard) Java String <-> std::string
conversion using JNI.  It's a very messy problem :)
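The String <-> std::string pitfall can be seen without touching JNI at all: JNI's GetStringUTFChars emits *modified* UTF-8 (supplementary characters become a 6-byte surrogate-pair encoding, and embedded NUL becomes 0xC0 0x80), which a C++ library expecting standard UTF-8 will misread. A small pure-Java illustration of the underlying encoding mismatch:

```java
import java.nio.charset.StandardCharsets;

// Why Java String <-> std::string is messy: a character outside the BMP
// is 2 UTF-16 code units in a Java String and 4 bytes in standard UTF-8,
// but JNI's modified UTF-8 would encode it as 6 bytes instead.
public class Utf8Demo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600)); // supplementary char
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(s.length());        // 2 (UTF-16 code units)
        System.out.println(standard.length);   // 4 (standard UTF-8 bytes)
        // GetStringUTFChars would hand C++ 6 bytes here, not these 4.
    }
}
```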

Was there indeed a JIRA started to track this issue?  Can't find it at the
moment ...



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p13929.html