Posted to user@spark.apache.org by johndesuv <de...@gmail.com> on 2017/02/28 16:02:51 UTC

DataFrame from in memory datasets in multiple JVMs

Hi,

I have an application that runs on a series of JVMs that each contain a
subset of a large dataset in memory.  I'd like to use this data in Spark and
am looking at ways to use it as a data source in Spark without writing the
data to disk as a handoff.

Parallelize doesn't work for me since I need to use the data across all the
JVMs as one DataFrame.

The only option I've come up with so far is to write a custom DataSource
that then transmits the data from each of the JVMs over the network.  This
seems like overkill though.
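
To make the "transmit the data from each JVM over the network" idea concrete,
here is a minimal sketch in plain Python. It is not Spark's DataSource API and
not a proposed implementation: the names serve_subset, read_partition, and
demo, the JSON-lines wire format, and the localhost endpoints are all
assumptions made for illustration. Each server stands in for one JVM holding a
subset, and each read_partition call stands in for one reader task pulling
exactly one subset, so no process ever has to write its data to disk.

```python
import json
import socket
import socketserver
import threading


class SubsetHandler(socketserver.StreamRequestHandler):
    """Streams the owning server's in-memory rows as JSON lines, then closes."""

    def handle(self):
        for row in self.server.rows:
            self.wfile.write((json.dumps(row) + "\n").encode("utf-8"))


def serve_subset(rows):
    """Hypothetical helper: expose one in-memory subset on an ephemeral port."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), SubsetHandler)
    server.rows = rows  # stands in for the data held by one JVM
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address


def read_partition(address):
    """Hypothetical reader task: pull every row from a single endpoint."""
    with socket.create_connection(address) as conn, \
            conn.makefile("r", encoding="utf-8") as lines:
        return [json.loads(line) for line in lines]


def demo():
    # Two made-up subsets, one per simulated JVM.
    subsets = [
        [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
        [{"id": 3, "v": "c"}],
    ]
    servers, addresses = [], []
    for rows in subsets:
        server, address = serve_subset(rows)
        servers.append(server)
        addresses.append(address)
    try:
        # One partition per endpoint; together they form the whole dataset.
        return [read_partition(address) for address in addresses]
    finally:
        for server in servers:
            server.shutdown()
            server.server_close()
```

The point of the shape is that the partitioning follows the existing layout:
one partition per source process, read where the data already lives.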

Is there a simpler solution for getting this data into a DataFrame?

Thanks,
John



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-from-in-memory-datasets-in-multiple-JVMs-tp28438.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: DataFrame from in memory datasets in multiple JVMs

Posted by John Desuvio <de...@gmail.com>.

Since the data is in multiple JVMs, only one of them can be the driver.  So
I can parallelize the data from one of the JVMs but don't have a way to do
the same for the others.  Or am I missing something?

On Tue, Feb 28, 2017 at 3:53 PM, ayan guha <gu...@gmail.com> wrote:

> How about parallelizing each subset and then unioning them all into one
> DataFrame?
>
> On Wed, 1 Mar 2017 at 3:07 am, Sean Owen <so...@cloudera.com> wrote:
>
>> Broadcasts let you send one copy of read-only data to each executor.
>> That's not the same as a DataFrame, and its nature means it doesn't make
>> sense to think of them as not distributed. But consider things like
>> broadcast hash joins, which may be what you are looking for if you really
>> mean to join on a small DF efficiently.
>>
>> --
> Best Regards,
> Ayan Guha
>

Re: DataFrame from in memory datasets in multiple JVMs

Posted by ayan guha <gu...@gmail.com>.

How about parallelizing each subset and then unioning them all into one
DataFrame?
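
A toy model of this suggestion, in plain Python rather than Spark, may help
show both what it does and where it runs into the objection raised elsewhere
in the thread. The functions parallelize, union, and collect below are
hypothetical stand-ins written for this sketch, not Spark's API, and the
slicing rule is an assumption.

```python
def parallelize(rows, num_slices):
    """Toy stand-in: split a driver-local list into num_slices partitions."""
    rows = list(rows)  # the whole input must already be in this process
    n = len(rows)
    return [
        rows[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


def union(*datasets):
    """Toy stand-in: keep every partition of every input, in order."""
    return [partition for dataset in datasets for partition in dataset]


def collect(dataset):
    """Flatten the partitions back into one list of rows."""
    return [row for partition in dataset for row in partition]


# Made-up subsets. In the real setup each one lives in a different JVM,
# so something would first have to bring them to the driver.
subset_a = [1, 2, 3]
subset_b = [4, 5]

combined = union(parallelize(subset_a, 2), parallelize(subset_b, 2))
```

In this model the union itself is cheap because it only concatenates
partition lists. The cost is in the step before it: every subset has to be
present in the driver process before it can be parallelized.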

On Wed, 1 Mar 2017 at 3:07 am, Sean Owen <so...@cloudera.com> wrote:

> Broadcasts let you send one copy of read-only data to each executor.
> That's not the same as a DataFrame, and its nature means it doesn't make
> sense to think of them as not distributed. But consider things like
> broadcast hash joins, which may be what you are looking for if you really
> mean to join on a small DF efficiently.
>
> --
Best Regards,
Ayan Guha

Re: DataFrame from in memory datasets in multiple JVMs

Posted by Sean Owen <so...@cloudera.com>.

Broadcasts let you send one copy of read-only data to each executor. That's
not the same as a DataFrame, and its nature means it doesn't make sense
to think of them as not distributed. But consider things like broadcast
hash joins, which may be what you are looking for if you really mean to join
on a small DF efficiently.
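
For readers unfamiliar with the term, the idea behind a broadcast hash join
can be sketched in plain Python. This is a conceptual model only, not Spark's
implementation or API: the function name broadcast_hash_join, the
list-of-partitions layout, and the sample rows are assumptions made for
illustration. The small side is turned into one hash table (standing in for
the copy each executor would receive), and every partition of the large side
is joined against it locally, so the large side never has to be moved.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a small lookup."""
    # Build phase: index the small side by join key. This dict plays the
    # role of the read-only copy sent once to each executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: each partition is processed independently, using only
    # its own rows plus the shared lookup table.
    joined = []
    for partition in large_partitions:
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**match, **row})
        joined.append(out)
    return joined


# Made-up data: two partitions of a large dataset and one small table.
large = [
    [{"id": 1, "x": 10}, {"id": 2, "x": 20}],
    [{"id": 1, "x": 30}, {"id": 3, "x": 40}],
]
small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

result = broadcast_hash_join(large, small, "id")
```

The output keeps the partitioning of the large side, and the row with id 3 is
dropped because it has no match in the small table.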
