Posted to user@spark.apache.org by Jean Georges Perrin <jg...@jgp.net> on 2017/06/20 17:46:36 UTC

"Sharing" dataframes...

Hey,

Here is my need: program A does something on a set of data and produces results, program B does the same on another set, and finally program C combines the data of A and B. Of course, the easy way is to dump everything to disk after A and B are done, but I wanted to avoid this.

I was thinking of creating a temp view, but I do not really like the temp aspect of it ;). Any ideas? (They are all worth sharing.)

jg
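
For concreteness, a minimal Scala sketch of the temp-view route, assuming
Spark 2.1+ (where global temporary views exist) and made-up view names.
A global temp view is visible to every SparkSession in the same
application and lives until the application stops, so it only helps if
A, B and C run inside one long-lived application:

import org.apache.spark.sql.SparkSession

object ShareViaGlobalTempView {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("share-demo").getOrCreate()

    // "Program" A: produce a result and publish it application-wide.
    spark.range(0, 100).toDF("id").createGlobalTempView("results_a")

    // "Program" B, possibly running in its own session of the same app.
    val sessionB = spark.newSession()
    sessionB.range(100, 200).toDF("id").createGlobalTempView("results_b")

    // "Program" C: combine A and B through the global_temp database.
    spark.sql(
      "SELECT * FROM global_temp.results_a " +
      "UNION ALL SELECT * FROM global_temp.results_b").show()

    spark.stop()
  }
}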



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: "Sharing" dataframes...

Posted by Pierce Lamb <ri...@gmail.com>.
Hi Jean,

Since many in this thread have mentioned datastores from what I would
call the "Spark datastore ecosystem", I thought I would link you to a
Stack Overflow answer I posted a while back that tries to capture the
majority of this ecosystem. Most of them would claim to let you do
something like what you describe in your original email once connected
to Spark:

https://stackoverflow.com/questions/39650298/how-to-save-insert-each-dstream-into-a-permanent-table/39753976#39753976

Regarding Rick Moritz's reply, SnappyData
<https://github.com/SnappyDataInc/snappydata>, a member of this ecosystem,
avoids the latency-intensive serialization steps he describes by
integrating the database and Spark so that they share the same JVM/block
manager (you can think of it as an in-memory SQL database replacing
Spark's native cache).

Hope this helps,

Pierce


Re: "Sharing" dataframes...

Posted by Gene Pang <ge...@gmail.com>.
Hi Jean,

As others have mentioned, you can use Alluxio with Spark DataFrames
<https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio> to
keep the data in memory, so that other jobs can read it back from memory
again.

Hope this helps,
Gene
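
For what it is worth, a rough sketch of that pattern, assuming the
Alluxio client jar is on Spark's classpath and an Alluxio master is
reachable at alluxio://alluxio-master:19998 (the host and paths here are
made up). Spark treats alluxio:// as just another Hadoop-compatible
filesystem, so the intermediate data can sit in Alluxio's memory tier
(depending on how its tiered storage is configured) instead of on local
disk:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alluxio-share").getOrCreate()

// Program A (or B): park its result in Alluxio instead of HDFS.
val resultsA = spark.read.parquet("hdfs:///input/a").groupBy("key").count()
resultsA.write.mode("overwrite")
  .parquet("alluxio://alluxio-master:19998/tmp/results_a")

// Program C, possibly a separate Spark application, reads it back.
spark.read.parquet("alluxio://alluxio-master:19998/tmp/results_a").show()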


Re: "Sharing" dataframes...

Posted by Jean Georges Perrin <jg...@jgp.net>.
I have looked at Livy in the (very recent) past and it will not do the trick for me. It seems pretty greedy in terms of resources (or at least that was our experience). I will investigate how job-server could do the trick.

(On a side note, I tried to find a paper on the memory lifecycle within Spark but was not very successful; maybe someone has a link to spare.)

My need is to keep one/several dataframes in memory (well, within Spark) so it/they can be reused at a later time, without persisting it/them to disk (unless Spark wants to, of course).
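
For reference, a minimal sketch of that idea, assuming everything runs
inside one long-lived SparkSession (spark below is that session, as in
spark-shell) and a made-up input path. MEMORY_ONLY never spills to disk;
partitions that do not fit are simply recomputed:

import org.apache.spark.storage.StorageLevel

// Keep the intermediate result in the block manager, memory only.
val resultsA = spark.read.parquet("hdfs:///input/a")
  .persist(StorageLevel.MEMORY_ONLY)
resultsA.count()   // force materialization of the cache

// Later code in the same application reuses it without rereading the source.
resultsA.createOrReplaceTempView("results_a")
spark.sql("SELECT COUNT(*) FROM results_a").show()

// Release the memory when done.
resultsA.unpersist()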





Re: "Sharing" dataframes...

Posted by Michael Mior <mm...@uwaterloo.ca>.
This is a puzzling suggestion to me. It's unclear what features the OP
needs, so it's really hard to say whether Livy or job-server are
insufficient. It's true that neither is particularly mature, but they're
much more mature than a homemade project which hasn't started yet.

That said, I'm not very familiar with either project, so perhaps there are
some big concerns I'm not aware of.

--
Michael Mior
mmior@apache.org


Re: "Sharing" dataframes...

Posted by Rick Moritz <ra...@gmail.com>.
Keeping it inside the same program/SparkContext is the most performant
solution, since you can avoid serialization and deserialization.
In-memory persistence between jobs involves a memory copy, uses a lot of
RAM and invokes serialization and deserialization. Technologies that can
help you do that easily are Ignite (as mentioned), but also Alluxio,
Cassandra with in-memory tables, and a memory-backed HDFS directory (see
tiered storage).
Although Livy and job-server can provide a single SparkContext to
multiple programs, I would recommend you build your own framework for
integrating different jobs, since many features you may need aren't
present yet, while others may cause issues due to lack of maturity.
Artificially splitting jobs is in general a bad idea, since it breaks the
DAG and thus prevents some potential push-down optimizations.
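
To make the serialization trade-off concrete, a small sketch, assuming a
running SparkSession named spark and a made-up input path. Within one
SparkContext a cached dataset is reused straight from the block manager;
handing it to another process always costs at least one serialization,
one copy and one deserialization:

import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.textFile("hdfs:///input/a")

// Kept as deserialized JVM objects: fastest to reuse, largest footprint.
rdd.persist(StorageLevel.MEMORY_ONLY)

// Alternative: kept serialized in memory (smaller, but every reuse pays
// the deserialization cost an external store would also impose).
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)

rdd.count()   // materialize; later jobs in this SparkContext reuse the cache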


Re: "Sharing" dataframes...

Posted by Jean Georges Perrin <jg...@jgp.net>.
Thanks Vadim & Jörn... I will look into those.

jg



Re: "Sharing" dataframes...

Posted by Vadim Semenov <va...@datadoghq.com>.
You can launch one permanent Spark context and then execute your jobs
within that context. Since they'll all be running in the same context,
they can share data easily.

These two projects provide the functionality that you need:
https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
https://github.com/cloudera/livy#post-sessions
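
For illustration only (this is not the job-server or Livy API, just the
idea both wrap behind a REST interface), a sketch of what sharing inside
one permanent context boils down to: a single long-lived SparkSession,
with A, B and C as units of work submitted against it. All names are
made up:

import org.apache.spark.sql.{DataFrame, SparkSession}

object PermanentContextSketch {
  def main(args: Array[String]): Unit = {
    // One long-lived session; "jobs" are just work submitted against it.
    val spark = SparkSession.builder().appName("permanent-context").getOrCreate()

    def jobA(): DataFrame = spark.range(0, 1000).toDF("id").cache()
    def jobB(): DataFrame = spark.range(1000, 2000).toDF("id").cache()
    def jobC(a: DataFrame, b: DataFrame): DataFrame = a.union(b)

    // C sees A's and B's cached data directly; nothing is written to
    // external storage between the steps.
    jobC(jobA(), jobB()).show()

    spark.stop()
  }
}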


Re: "Sharing" dataframes...

Posted by Jörn Franke <jo...@gmail.com>.
You could express it all in one program; alternatively, use the Ignite in-memory file system or the Ignite shared RDD (not sure if DataFrame is supported).
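
A rough sketch of the shared-RDD route, based on the ignite-spark
module's IgniteContext; exact constructor and method names should be
checked against the Ignite release in use, spark is assumed to be a
running SparkSession, and the cache name and data are made up. The
Ignite cache outlives the Spark job that wrote it, so a second
application attached to the same Ignite cluster can pick the pairs up
again:

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext

val igniteContext = new IgniteContext(spark.sparkContext,
  () => new IgniteConfiguration())

// Program A (or B): publish results into a named Ignite cache.
val sharedRDD = igniteContext.fromCache[Long, Double]("results_a")
sharedRDD.savePairs(
  spark.sparkContext.parallelize(1L to 100L).map(i => (i, i * 2.0)))

// Program C, pointed at the same Ignite cluster, re-attaches and reads back.
println(igniteContext.fromCache[Long, Double]("results_a").count())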


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org