Posted to user@spark.apache.org by Mike Trienis <mi...@orcsol.com> on 2015/03/01 20:06:55 UTC

Pushing data from AWS Kinesis -> Spark Streaming -> AWS Redshift

Hi All,

I am looking at integrating a data stream from AWS Kinesis to AWS Redshift
and since I am already ingesting the data through Spark Streaming, it seems
convenient to also push that data to AWS Redshift at the same time.

I have taken a look at the AWS Kinesis connector, although I am not sure it
was designed to integrate with Apache Spark. It seems more like a
standalone approach:

   - https://github.com/awslabs/amazon-kinesis-connectors

There is also a Spark Redshift integration library; however, it looks like
it was intended for pulling data rather than pushing data to AWS Redshift:

   - https://github.com/databricks/spark-redshift

I finally took a look at a general Scala library that integrates with AWS
Redshift:

   - http://scalikejdbc.org/

Does anyone have any experience pushing data from Spark Streaming to AWS
Redshift? Does it make sense conceptually, or does it make more sense to
push data from AWS Kinesis to AWS Redshift via another standalone approach
such as the AWS Kinesis connectors?

Thanks, Mike.

Re: Pushing data from AWS Kinesis -> Spark Streaming -> AWS Redshift

Posted by Chris Fregly <ch...@fregly.com>.
Hey Mike-

Great to see you're using the AWS stack to its fullest!

I've already created the Kinesis-Spark Streaming connector with examples,
documentation, tests, and everything.  You'll need to build Spark from
source with the -Pkinesis-asl profile, otherwise the Kinesis classes won't
be included in the build.  This is due to the Amazon Software License (ASL).

Here's the link to the Kinesis-Spark Streaming integration guide:
http://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html.
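
To give a rough idea of the receiver setup from that guide, here's a minimal
sketch (assuming Spark 1.2.x built with -Pkinesis-asl; the app name, stream
name, endpoint, and intervals are placeholders you'd swap for your own):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val sparkConf = new SparkConf().setAppName("KinesisToRedshift")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// DStream of raw Kinesis records (Array[Byte])
val kinesisStream = KinesisUtils.createStream(
  ssc,
  "myKinesisStream",                          // Kinesis stream name (placeholder)
  "https://kinesis.us-east-1.amazonaws.com",  // region endpoint (placeholder)
  Seconds(2),                                 // KCL checkpoint interval
  InitialPositionInStream.LATEST,             // start from the latest records
  StorageLevel.MEMORY_AND_DISK_2)             // storage level for received blocks

kinesisStream is what you'd call foreachRDD on in the snippet further down.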

Here's a link to the source:
https://github.com/apache/spark/tree/master/extras/kinesis-asl/src/main/scala/org/apache/spark

Here's the original jira as well as some follow-up jiras that I'm working
on now:

   - https://issues.apache.org/jira/browse/SPARK-1981
   - https://issues.apache.org/jira/browse/SPARK-5959
   - https://issues.apache.org/jira/browse/SPARK-5960

I met a lot of folks at the Strata San Jose conference a couple weeks ago
who are using this Kinesis connector with good results, so you should be OK
there.

Regarding Redshift, the current Redshift connector only pulls data from
Redshift - it does not write to Redshift.

And actually, it only reads files that have been UNLOADed from Redshift
into S3, so it's not pulling from Redshift directly.

This is confusing, I know.  Here's the jira for this work:
https://issues.apache.org/jira/browse/SPARK-3205

For now, you'd have to use the AWS Java SDK to write to Redshift directly
from within your Spark Streaming worker as batches of data come in from the
Kinesis Stream.

This is roughly how your Spark Streaming app would look:

dstream.foreachRDD { rdd =>
  // everything within here runs on the Driver

  rdd.foreachPartition { partitionOfRecords =>
    // everything within here runs on the Worker and operates on a
    // partition of records

    // RedshiftConnectionPool is a static, lazily initialized singleton
    // pool of connections that runs within the Worker JVM

    // retrieve a connection from the pool
    val connection = RedshiftConnectionPool.getConnection()

    // perform the application logic here - parse and write to Redshift
    // using the connection
    partitionOfRecords.foreach(record => connection.send(record))

    // return to the pool for future reuse
    RedshiftConnectionPool.returnConnection(connection)
  }
}

Note that you would need to write the RedshiftConnectionPool singleton class
yourself using the AWS Java SDK, as I mentioned.
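
In case it helps, here's a rough sketch of what that singleton could look
like.  To keep it short I'm assuming plain JDBC against Redshift (so the
Redshift or PostgreSQL JDBC driver would need to be on the Worker classpath)
rather than spelling out the AWS SDK details; RedshiftConnectionPool, the
JDBC URL, and the credentials are all placeholders:

import java.sql.{Connection, DriverManager}
import scala.collection.mutable

// Hypothetical helper - not an existing library class.
// A minimal, lazily initialized pool of JDBC connections to Redshift,
// one instance per Worker JVM (it's an object, so it's created on first use).
object RedshiftConnectionPool {
  // placeholders - substitute your cluster's JDBC URL and credentials
  private val jdbcUrl  = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb"
  private val user     = "myuser"
  private val password = "mypassword"

  private val pool = mutable.Queue[Connection]()

  def getConnection(): Connection = synchronized {
    if (pool.nonEmpty) pool.dequeue()
    else DriverManager.getConnection(jdbcUrl, user, password)
  }

  def returnConnection(conn: Connection): Unit = synchronized {
    pool.enqueue(conn)
  }
}

Also note that connection.send(record) in the snippet above is pseudocode -
in practice you'd prepare an INSERT statement on the connection and execute
it for each record (or, better, in batches) within the partition.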

There is a relatively new Spark SQL DataSources API that supports reading
and writing to these data sources.

An example Avro implementation is here:
http://spark-packages.org/package/databricks/spark-avro. I think that's a
read-only impl, but you get the idea.
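
To make the shape of that API concrete, here's a bare-bones read-only
skeleton against the org.apache.spark.sql.sources interfaces (Spark 1.3-era
package names; the class names are made up and the schema is hard-coded to a
single string column just to keep the sketch short):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point that Spark SQL resolves by package name
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new UnloadedFilesRelation(parameters("path"))(sqlContext)
}

// Hypothetical relation that scans files previously UNLOADed to S3
class UnloadedFilesRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // single-column schema, just for the sketch
  override def schema: StructType = StructType(StructField("line", StringType) :: Nil)

  // produce one Row per line of the UNLOADed files
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.textFile(path).map(Row(_))
}

A real SPARK-6102 implementation would obviously need a proper schema,
UNLOAD/COPY handling, and a write path, but the RelationProvider/BaseRelation
plumbing is roughly what's involved.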

I've created 2 jiras for creating libraries for DynamoDB and Redshift that
implement this DataSources API.

Here are the jiras:

   - https://issues.apache.org/jira/browse/SPARK-6101 (DynamoDB)
   - https://issues.apache.org/jira/browse/SPARK-6102 (Redshift)

Perhaps you can take a stab at SPARK-6102?  I should be done with the
DynamoDB impl in the next few weeks - scheduled for Spark 1.4.0.

Hope that helps!  :)

-Chris

