Posted to dev@spark.apache.org by Travis Crawford <tr...@gmail.com> on 2016/04/13 16:45:35 UTC

DynamoDB data source questions

Hi Spark gurus,

At Medium we're using Spark for an ETL job that scans DynamoDB tables and
loads into Redshift. Currently I use a parallel scanner implementation that
writes files to local disk, then have Spark read them as a DataFrame.

Ideally we could read the DynamoDB table directly as a DataFrame, so I
started putting together a data source at
https://github.com/traviscrawford/spark-dynamodb

A few questions:

* What's the best way to incrementally build the RDD[Row] returned by
"buildScan"? Currently I make an RDD[Row] from each page of results, and
union them together. Does this approach seem reasonable? Any suggestions
for a better way?

* Currently my stand-alone scanner creates separate threads for each scan
segment. I could use that same approach and create threads in the Spark
driver, though ideally each scan segment would run in an executor. Any tips
on how to get the segment scanners to run on Spark executors?

Thanks,
Travis

Re: DynamoDB data source questions

Posted by Travis Crawford <tr...@gmail.com>.
Hi Reynold,

Thanks for the tips. I made some changes based on your suggestion, and now
the table scan happens on executors.
https://github.com/traviscrawford/spark-dynamodb/blob/master/src/main/scala/com/github/traviscrawford/spark/dynamodb/DynamoDBRelation.scala

sqlContext.sparkContext
    .parallelize(scanConfigs, scanConfigs.length)
    .flatMap(DynamoDBRelation.scan)

When scanning a DynamoDB table, you create one or more scan requests, which
I'll call "segments." Each segment provides an iterator over pages, and
each page contains a collection of items. The actual network transfer
happens when getting the next page. At that point you can iterate over
items in memory.

Based on your feedback I now parallelize a collection of configs that
describe each segment to scan, then in flatMap create the scanner and fetch
all of its items.
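
For anyone following along, here's roughly what the per-segment scan
function can look like with the AWS Java SDK (a simplified sketch, not the
exact code in the repo; the ScanConfig fields and the DynamoDBScanner name
are just illustrative, and converting the attribute maps into Rows is left
out):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, ScanRequest}
import scala.collection.JavaConverters._

// Hypothetical config describing one scan segment.
case class ScanConfig(tableName: String, segment: Int, totalSegments: Int)

object DynamoDBScanner {
  // Runs entirely on an executor: each task owns one segment and pages
  // through it until DynamoDB stops returning a lastEvaluatedKey.
  def scan(config: ScanConfig): Iterator[Map[String, AttributeValue]] = {
    val client = new AmazonDynamoDBClient()
    val items = scala.collection.mutable.ArrayBuffer.empty[Map[String, AttributeValue]]
    var lastKey: java.util.Map[String, AttributeValue] = null
    var done = false
    while (!done) {
      val request = new ScanRequest()
        .withTableName(config.tableName)
        .withSegment(config.segment)
        .withTotalSegments(config.totalSegments)
        .withExclusiveStartKey(lastKey)
      val result = client.scan(request) // the network call: fetches one page
      items ++= result.getItems.asScala.map(_.asScala.toMap)
      lastKey = result.getLastEvaluatedKey
      done = lastKey == null
    }
    items.iterator
  }
}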

I'm pointed enough in the right direction to finish this up.

Thanks,
Travis




On Wed, Apr 13, 2016 at 10:40 AM Reynold Xin <rx...@databricks.com> wrote:

> Responses inline
>
> On Wed, Apr 13, 2016 at 7:45 AM, Travis Crawford <traviscrawford@gmail.com>
> wrote:
>
>> Hi Spark gurus,
>>
>> At Medium we're using Spark for an ETL job that scans DynamoDB tables and
>> loads into Redshift. Currently I use a parallel scanner implementation that
>> writes files to local disk, then have Spark read them as a DataFrame.
>>
>> Ideally we could read the DynamoDB table directly as a DataFrame, so I
>> started putting together a data source at
>> https://github.com/traviscrawford/spark-dynamodb
>>
>> A few questions:
>>
>> * What's the best way to incrementally build the RDD[Row] returned by
>> "buildScan"? Currently I make an RDD[Row] from each page of results, and
>> union them together. Does this approach seem reasonable? Any suggestions
>> for a better way?
>>
>
> If the number of pages can be high (e.g. > 100), it is best to avoid using
> union. The simpler way is ...
>
> val pages = ...
> sc.parallelize(pages, pages.size).flatMap { page =>
>   ...
> }
>
> The above creates a task per page.
>
> Looking at your code, you are relying on Spark's JSON inference to read
> the JSON data. You would need a different approach there in order to
> parallelize this. Right now you are bringing all the data into the driver
> and then sending it back out.
>
>
>
>>
>> * Currently my stand-alone scanner creates separate threads for each scan
>> segment. I could use that same approach and create threads in the Spark
>> driver, though ideally each scan segment would run in an executor. Any tips
>> on how to get the segment scanners to run on Spark executors?
>>
>
> I'm not too familiar with Dynamo. Is a segment different from the page above?
>
>
>
>>
>> Thanks,
>> Travis
>>
>
>

Re: DynamoDB data source questions

Posted by Reynold Xin <rx...@databricks.com>.
Responses inline

On Wed, Apr 13, 2016 at 7:45 AM, Travis Crawford <tr...@gmail.com>
wrote:

> Hi Spark gurus,
>
> At Medium we're using Spark for an ETL job that scans DynamoDB tables and
> loads into Redshift. Currently I use a parallel scanner implementation that
> writes files to local disk, then have Spark read them as a DataFrame.
>
> Ideally we could read the DynamoDB table directly as a DataFrame, so I
> started putting together a data source at
> https://github.com/traviscrawford/spark-dynamodb
>
> A few questions:
>
> * What's the best way to incrementally build the RDD[Row] returned by
> "buildScan"? Currently I make an RDD[Row] from each page of results, and
> union them together. Does this approach seem reasonable? Any suggestions
> for a better way?
>

If the number of pages can be high (e.g. > 100), it is best to avoid using
union. The simpler way is ...

val pages = ...
sc.parallelize(pages, pages.size).flatMap { page =>
  ...
}

The above creates a task per page.
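
In buildScan that looks roughly like the following (a sketch only;
ScanConfig and SegmentScanner.scan are placeholders for whatever fetches
and converts one slice of the table on the executor side):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Placeholder describing one unit of work.
case class ScanConfig(tableName: String, segment: Int, totalSegments: Int)

object SegmentScanner {
  // Keep this in an object so the closure shipped to executors does not
  // drag the whole relation along with it. Placeholder body: the real
  // version would issue the scan and convert items to Rows.
  def scan(config: ScanConfig): Iterator[Row] = Iterator.empty
}

class DynamoDBScanRelation(
    val sqlContext: SQLContext,
    val schema: StructType,
    configs: Seq[ScanConfig])
  extends BaseRelation with TableScan {

  // One task per config; each task fetches and converts its own slice,
  // so there is no union of per-page RDDs on the driver.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext
      .parallelize(configs, configs.length)
      .flatMap(SegmentScanner.scan)
}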

Looking at your code, you are relying on Spark's JSON inference to read the
JSON data. You would need a different approach there in order to parallelize
this. Right now you are bringing all the data into the driver and then
sending it back out.
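
One way to keep the parsing on the executors is to have each scan task emit
its items as JSON strings and hand Spark an explicit schema, so no inference
pass over collected data is needed (a sketch; itemJson and the column names
are assumptions):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ItemJsonToDataFrame {
  // itemJson: one JSON document per DynamoDB item, produced inside the
  // executor-side scan tasks rather than collected to the driver.
  def toDataFrame(sqlContext: SQLContext, itemJson: RDD[String]): DataFrame = {
    val schema = StructType(Seq(
      StructField("id", StringType),      // hypothetical columns
      StructField("payload", StringType)
    ))
    // With the schema supplied up front, the JSON parsing runs as a
    // distributed job on the executors.
    sqlContext.read.schema(schema).json(itemJson)
  }
}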



>
> * Currently my stand-alone scanner creates separate threads for each scan
> segment. I could use that same approach and create threads in the Spark
> driver, though ideally each scan segment would run in an executor. Any tips
> on how to get the segment scanners to run on Spark executors?
>

I'm not too familiar with Dynamo. Is a segment different from the page above?



>
> Thanks,
> Travis
>