Posted to dev@spark.apache.org by "jpivarski@gmail.com" <jp...@gmail.com> on 2016/05/26 21:46:46 UTC

How to access the off-heap representation of cached data in Spark 2.0

Following up on an earlier thread
<http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html>,
I would like to access the off-heap representation of cached data in Spark
2.0 in order to see how Spark might be linked to physics software written in
C and C++.
I'm willing to do exploration on my own, but could somebody point me to a
place to start? I have downloaded the 2.0 preview and created a persisted
Dataset:

import scala.util.Random

case class Muon(px: Double, py: Double) {
  def pt = Math.sqrt(px*px + py*py)
}

val rdd = sc.parallelize(0 until 10000 map {x =>
    Muon(Random.nextGaussian, Random.nextGaussian)
  }, 10)
val df = rdd.toDF
val ds = df.as[Muon]
ds.persist()

So I have a Dataset in memory, and if I understand the blog articles
<https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html>
correctly, it's in off-heap memory (sun.misc.Unsafe). Is there any way I
could get a pointer to that data that I could explore with BridJ? Any hints
on how it's stored? Like, could I get started through some Djinni calls or
something?
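
For concreteness, here is a minimal sketch of what reading raw memory from the
JVM side looks like with sun.misc.Unsafe. Nothing in it is Spark-specific, and
the buffer is allocated only so there is a valid address to read:

import sun.misc.Unsafe

// Grab the Unsafe instance reflectively (the usual trick, since its
// constructor is private).
val unsafe = {
  val f = classOf[Unsafe].getDeclaredField("theUnsafe")
  f.setAccessible(true)
  f.get(null).asInstanceOf[Unsafe]
}

// Allocate a small off-heap buffer so there is a valid raw address,
// write a double to it, and read it back.
val addr = unsafe.allocateMemory(8)
unsafe.putDouble(addr, 3.14)
val x = unsafe.getDouble(addr)
unsafe.freeMemory(addr)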
Thanks!
-- Jim





Re: How to access the off-heap representation of cached data in Spark 2.0

Posted by "jpivarski@gmail.com" <jp...@gmail.com>.
Okay, that makes sense.





Re: How to access the off-heap representation of cached data in Spark 2.0

Posted by Jacek Laskowski <ja...@japila.pl>.
On Sun, May 29, 2016 at 5:30 PM, jpivarski@gmail.com
<jp...@gmail.com> wrote:

> If I find a way to provide
> access by modifying Spark source, can I just submit a pull request, or do I
> need to be a recognized Spark developer? If so, is there a process for
> becoming one?

Start a discussion here (or on user@spark) and *iff* your idea attracts
enough Spark committers to support it, you'd file a JIRA issue and
send a pull request afterwards.

Jacek



Re: How to access the off-heap representation of cached data in Spark 2.0

Posted by "jpivarski@gmail.com" <jp...@gmail.com>.
Thanks Jacek and Kazuaki,

I guess I got the wrong impression about the C++ API from somewhere--- I think
I read it in a JIRA wish list. If the byte array is accessed through
sun.misc.Unsafe, that's what I mean by off-heap. I found the Platform class,
which provides access to the bytes in Unsafe in a uniform way.
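
For reference, a minimal sketch of the uniform access pattern that
org.apache.spark.unsafe.Platform (in the spark-unsafe module) provides: the
same get/put calls take an (object, offset) pair, where the object is a byte[]
plus Platform.BYTE_ARRAY_OFFSET for on-heap data, or null plus a raw address
for off-heap data. This is only an illustration against scratch buffers, not
against Spark's cached data:

import org.apache.spark.unsafe.Platform

// On-heap: the base object is the array, offsets start at BYTE_ARRAY_OFFSET.
val bytes = new Array[Byte](8)
Platform.putDouble(bytes, Platform.BYTE_ARRAY_OFFSET, 42.0)
val onHeap = Platform.getDouble(bytes, Platform.BYTE_ARRAY_OFFSET)

// Off-heap: the base object is null, the offset is a raw address.
val addr = Platform.allocateMemory(8)
Platform.putDouble(null, addr, 42.0)
val offHeap = Platform.getDouble(null, addr)
Platform.freeMemory(addr)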

Thanks for pointing me to CachedBatch (private). If I find a way to provide
access by modifying Spark source, can I just submit a pull request, or do I
need to be a recognized Spark developer? If so, is there a process for
becoming one?

-- Jim






Re: How to access the off-heap representation of cached data in Spark 2.0

Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Hi,
According to my understanding, the contents of df.cache() are currently on the
Java heap as a set of byte arrays; see
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L58
The data is accessed using sun.misc.Unsafe APIs, and it may sometimes be
compressed. CachedBatch is private, and this representation may be changed
in the future.

In general, it is not easy to access this data from a C/C++ API.
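
As a rough illustration of what is available at the public configuration level
(this does not expose the bytes; it only controls how the columnar cache is
built), compression and batching are governed by the
spark.sql.inMemoryColumnarStorage.* settings:

// Assuming `spark` is the SparkSession (e.g. in spark-shell).
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")  // column compression (default: true)
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")  // rows per cached batch (default: 10000)

val cached = spark.range(100000).cache()
cached.count()  // forces the cached column batches to be built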

Regards,
Kazuaki Ishizaki



From:   Jacek Laskowski <ja...@japila.pl>
To:     "jpivarski@gmail.com" <jp...@gmail.com>
Cc:     dev <de...@spark.apache.org>
Date:   2016/05/29 08:18
Subject:        Re: How to access the off-heap representation of cached 
data in Spark 2.0



Hi Jim,

There's no C++ API in Spark to access the off-heap data. Moreover, I
think "off-heap" has an overloaded meaning in Spark: it is used both for
Tungsten and for persisting your data off-heap (it's all about memory, but
for different purposes and with client-facing vs. internal APIs).

That's my limited understanding of things (and I'm not even sure how
trustworthy it is). Use it with extreme caution.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, May 28, 2016 at 5:29 PM, jpivarski@gmail.com
<jp...@gmail.com> wrote:
> Is this not the place to ask such questions? Where can I get a hint as to how
> to access the new off-heap cache, or C++ API, if it exists? I'm willing to
> do my own research, but I have to have a place to start. (In fact, this is
> the first step in that research.)
>
> Thanks,
> -- Jim
>
>
>
>





Re: How to access the off-heap representation of cached data in Spark 2.0

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi Jim,

There's no C++ API in Spark to access the off-heap data. Moreover, I
think "off-heap" has an overloaded meaning in Spark: it is used both for
Tungsten and for persisting your data off-heap (it's all about memory, but
for different purposes and with client-facing vs. internal APIs).

That's my limited understanding of things (and I'm not even sure how
trustworthy it is). Use it with extreme caution.
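
To make the two meanings concrete anyway, a sketch of the two separate knobs
(standard configuration keys and a storage level; whether either one gives a
stable byte layout to hand to C++ is exactly the open question here):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Meaning 1: let Tungsten's execution memory live off-heap.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("off-heap-meanings")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "512m")
  .getOrCreate()

// Meaning 2: persist a Dataset with an off-heap storage level.
val ds = spark.range(1000000L)
ds.persist(StorageLevel.OFF_HEAP)
ds.count()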

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, May 28, 2016 at 5:29 PM, jpivarski@gmail.com
<jp...@gmail.com> wrote:
> Is this not the place to ask such questions? Where can I get a hint as to how
> to access the new off-heap cache, or C++ API, if it exists? I'm willing to
> do my own research, but I have to have a place to start. (In fact, this is
> the first step in that research.)
>
> Thanks,
> -- Jim
>
>
>
>


Re: How to access the off-heap representation of cached data in Spark 2.0

Posted by "jpivarski@gmail.com" <jp...@gmail.com>.
Is this not the place to ask such questions? Where can I get a hint as to how
to access the new off-heap cache, or C++ API, if it exists? I'm willing to
do my own research, but I have to have a place to start. (In fact, this is
the first step in that research.)

Thanks,
-- Jim



