Posted to user@spark.apache.org by Everett Anderson <ev...@nuna.com.INVALID> on 2016/06/27 16:14:28 UTC

Best practice for handing tables between pipeline components

Hi,

We have a pipeline of components strung together via Airflow running on
AWS. Some of them are implemented in Spark, but some aren't. Generally they
can all talk to a JDBC/ODBC endpoint or read/write files from S3.

Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS or
S3 and reading it back in again in every component, if it could instead stay
cached in memory in a Spark cluster.

Our current investigation is leading us to explore whether the following
are possible:

   - Using a Hive metastore with S3 as its backing data store to keep a
   mapping from table name to files on S3 (not sure if one can cache a
   Hive table in Spark across contexts, though); see the first sketch
   after this list
   - Using something like the spark-jobserver to keep a Spark SQLContext
   open across Spark components so they could avoid file I/O for cached
   tables; see the second sketch after this list
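
For the first idea, here is a minimal sketch (Scala, Spark 1.6-era
HiveContext) of registering an S3-backed external table in a shared Hive
metastore so other jobs can find it by name. The bucket, paths, and table
name are placeholders, and it assumes the cluster already has a
hive-site.xml pointing at the shared metastore plus S3 credentials
configured:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object RegisterS3Table {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("register-s3-table"))
        val hiveContext = new HiveContext(sc)

        // External table: only metadata goes to the metastore, data stays on S3.
        hiveContext.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
            |STORED AS PARQUET
            |LOCATION 's3a://my-bucket/warehouse/events/'""".stripMargin)

        // Any Spark job pointed at the same metastore can now read it by name.
        hiveContext.table("events").show(10)

        // Caching keeps it in memory, but only for the lifetime of this
        // application's SQLContext, not across separate Spark contexts.
        hiveContext.sql("CACHE TABLE events")

        sc.stop()
      }
    }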
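
For the second idea, a rough sketch of a spark-jobserver job that reuses
one long-lived SQLContext, so a table cached by an earlier job submission
is still in memory for later ones. The SparkJob trait and
SparkJobValidation types are recalled from the 2016-era spark-jobserver
API and should be checked against its docs; the table name and S3 path
are made up:

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

    object CachedTableJob extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        SparkJobValid

      override def runJob(sc: SparkContext, config: Config): Any = {
        // getOrCreate returns the same SQLContext for a given SparkContext, so a
        // table cached by an earlier job in this long-lived context is reused here.
        val sqlContext = SQLContext.getOrCreate(sc)

        if (!sqlContext.tableNames().contains("events")) {
          sqlContext.read.parquet("s3a://my-bucket/warehouse/events/")
            .registerTempTable("events")
          sqlContext.cacheTable("events")
        }

        sqlContext.sql("SELECT COUNT(*) FROM events").collect().head.getLong(0)
      }
    }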

What's the best practice for handing tables between Spark programs? What
about between Spark and non-Spark programs?

Thanks!

- Everett

Re: Best practice for handing tables between pipeline components

Posted by Chanh Le <gi...@gmail.com>.
Hi Everett,
We have been using Alluxio for the last 2 months. We use Alluxio to share data between Spark jobs, keeping Spark isolated to the processing layer and Alluxio as the storage layer.
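
For anyone curious what that looks like in practice, a minimal sketch of
the pattern (Scala, DataFrames): one Spark application writes its output
to Alluxio, and a separate application reads the same path, so the
hand-off never touches S3/HDFS. The alluxio:// host/port and all paths
are placeholders, and it assumes the Alluxio client jar is on Spark's
classpath as the Alluxio docs describe:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object HandOffViaAlluxio {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hand-off-via-alluxio"))
        val sqlContext = new SQLContext(sc)

        // Producer job: write the intermediate table to Alluxio instead of S3.
        val events = sqlContext.read.parquet("s3a://my-bucket/raw/events/")
        events.write.parquet("alluxio://alluxio-master:19998/shared/events")

        // A downstream Spark application would then read the same path, e.g.
        //   sqlContext.read.parquet("alluxio://alluxio-master:19998/shared/events")
        // without re-reading from S3.

        sc.stop()
      }
    }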




Re: Best practice for handing tables between pipeline components

Posted by Everett Anderson <ev...@nuna.com.INVALID>.
Thanks! Alluxio looks quite promising, but also quite new.

What did people do before?


Re: Best practice for handing tables between pipeline components

Posted by Gene Pang <ge...@gmail.com>.
Yes, Alluxio (http://www.alluxio.org/) can be used to store data in-memory
between stages in a pipeline.

Here is more information about running Spark with Alluxio:
http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html

Hope that helps,
Gene
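
A small sketch of the RDD-level usage that doc describes: once the
Alluxio client jar is on Spark's classpath, alluxio:// paths behave like
any other Hadoop-compatible filesystem. Host, port, and file names below
are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object AlluxioTextFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("alluxio-text-files"))

        // Read a file that another job (Spark or not) placed in Alluxio.
        val lines = sc.textFile("alluxio://alluxio-master:19998/input/events.csv")

        // Write results back to Alluxio for the next pipeline stage.
        lines.filter(_.nonEmpty)
          .saveAsTextFile("alluxio://alluxio-master:19998/output/events-clean")

        sc.stop()
      }
    }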


Re: Best practice for handing tables between pipeline components

Posted by Sathish Kumaran Vairavelu <vs...@gmail.com>.
Alluxio's off-heap memory would help to share cached objects.
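
One way to read the off-heap suggestion, sketched below for Spark 1.x:
persisting with StorageLevel.OFF_HEAP puts the blocks in the external
block store (Tachyon/Alluxio) rather than on the JVM heap. The
spark.externalBlockStore.url setting and the alluxio host/port are
best-effort recollections of the 1.6-era configuration and should be
verified; note the cached blocks are still scoped to the owning
application, so cross-application sharing is done by writing alluxio://
paths as in the sketches above:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object OffHeapCache {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("off-heap-cache")
          // Assumed setting name/value; check the Spark 1.6 configuration docs.
          .set("spark.externalBlockStore.url", "alluxio://alluxio-master:19998")
        val sc = new SparkContext(conf)

        val events = sc.textFile("s3a://my-bucket/raw/events/")
        events.persist(StorageLevel.OFF_HEAP)
        println(events.count())

        sc.stop()
      }
    }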