Posted to user@spark.apache.org by JG Perrin <jp...@lumeris.com> on 2017/11/02 13:44:55 UTC

Re: share datasets across multiple spark-streaming applications for lookup

Or Databricks Delta (announced at Spark Summit) or IBM Event Store, depending on the use case.

On Oct 31, 2017, at 14:30, Joseph Pride <jo...@versanalytics.com> wrote:

Folks:

SnappyData.

I’m fairly new to working with it myself, but it looks pretty promising. It marries Spark with a co-located in-memory GemFire (or something gem-related) database. So you can access the data with SQL, JDBC, ODBC (if you wanna go Enterprise instead of open-source) or natively as mutable RDDs and DataFrames.

You can run it so the storage and Spark compute are co-located in the same JVM on each machine, so you get data locality instead of a bottleneck between load, save, and compute. The data is supposed to persist across applications and cluster restarts, and to stay available when multiple applications are working on it at the same time.

I hope it works for what I’m doing and isn’t too buggy. But it looks pretty good.

—Joe Pride

On Oct 31, 2017, at 11:14 AM, Gene Pang <ge...@gmail.com> wrote:

Hi,

Alluxio enables sharing dataframes across different applications. This blog post (https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio) talks about dataframes and Alluxio, and this Spark Summit presentation (https://spark-summit.org/2017/events/best-practices-for-using-alluxio-with-apache-spark/) has additional information.

Thanks,
Gene

On Tue, Oct 31, 2017 at 6:04 PM, Revin Chalil <rc...@expedia.com> wrote:
Any info on the below will be really appreciated.

I read about Alluxio and Ignite. Has anybody used any of them? Do they work well with multiple Apps doing lookups simultaneously? Are there better options? Thank you.

From: roshan joe <im...@gmail.com>
Date: Monday, October 30, 2017 at 7:53 PM
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: share datasets across multiple spark-streaming applications for lookup

Hi,

What is the recommended way to share datasets across multiple spark-streaming applications, so that the incoming data can be looked up against this shared dataset?

The shared dataset is also incrementally refreshed and stored on S3. Below is the scenario.

Streaming App-1 consumes data from Source-1 and writes to DS-1 in S3.
Streaming App-2 consumes data from Source-2 and writes to DS-2 in S3.


Streaming App-3 consumes data from Source-3, needs to look up against DS-1 and DS-2, and writes to DS-3 in S3.
Streaming App-4 consumes data from Source-4, needs to look up against DS-1 and DS-2, and writes to DS-3 in S3.
Streaming App-n consumes data from Source-n, needs to look up against DS-1 and DS-2, and writes to DS-n in S3.

So DS-1 and DS-2 ideally should be shared for lookup across multiple streaming apps. Any input is appreciated. Thank you!
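
Whichever storage layer ends up holding DS-1 and DS-2, one way to picture the lookup side of the scenario above is that each streaming app keeps an in-memory copy of the shared dataset and reloads it once the copy is older than some threshold, since the data is refreshed incrementally. The sketch below is plain Python with no Spark dependency, and it is only an illustration, not something proposed in this thread. `RefreshingLookup`, `enrich`, `loader`, `ttl_seconds` and the `"key"` field are all hypothetical names; in a real app `loader` would read the current snapshot from S3, Alluxio, or whichever store is chosen.

```python
import time
from typing import Callable, Dict, List, Optional


class RefreshingLookup:
    """In-memory lookup table that reloads itself once it is older than ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, dict]],  # hypothetical: returns the latest snapshot
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        # Reload on first use, or when the cached copy has reached its TTL.
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()
            self._loaded_at = now

    def get(self, key: str) -> Optional[dict]:
        self._refresh_if_stale()
        return self._data.get(key)


def enrich(batch: List[dict], ds1: RefreshingLookup) -> List[dict]:
    """Attach the matching DS-1 row (or None) to every record in a micro-batch."""
    return [{**record, "ds1": ds1.get(record["key"])} for record in batch]
```

Each of App-3 through App-n would hold its own `RefreshingLookup` for DS-1 and another for DS-2 and call something like `enrich` once per micro-batch. The trade-off is staleness: a record can be matched against a snapshot that is up to `ttl_seconds` old, so the threshold has to be chosen against how often App-1 and App-2 update the data.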