Posted to user@spark.apache.org by Gary Malouf <ma...@gmail.com> on 2014/11/14 23:19:28 UTC

Sourcing data from Redshift

We have a bunch of data in Redshift tables that we'd like to pull into
Spark during job runs.  What is the path/URL format one uses to pull data
from there?  (This is in reference to using
https://github.com/mengxr/redshift-input-format)

Re: Sourcing data from Redshift

Posted by Gary Malouf <ma...@gmail.com>.
Hi guys,

We ultimately needed to add 8 EC2 xlarge instances to get better
performance.  As was suspected, we could not fit all the data into RAM.

This worked great with the files of around 100-350MB that our initial
export task produced.  Unfortunately, with the partition settings we were
able to get working with GraphX (we were unable to change the parallelism
due to a bug), we cannot keep writing files at this size - our output ends
up being closer to 1GB per file.

As a result, our job seems to struggle working with 100GB worth of these
files.  We are in a rough spot: upgrading Spark right now is not
reasonable for us, but this bug prevents us from solving the issue.

On Fri, Nov 14, 2014 at 9:29 PM, Gary Malouf <ma...@gmail.com> wrote:

> I'll try this out and follow up with what I find.
>
> On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng <me...@databricks.com>
> wrote:
>
>> For each node, if the CSV reader is implemented efficiently, you should
>> be able to hit at least half of the theoretical network bandwidth, which
>> is about 60MB/second/node. So if you just do counting, the expected time
>> should be within 3 minutes (12 nodes * 60MB/s = 720MB/s in aggregate, so
>> scanning 100GB takes roughly 140 seconds).
>>
>> Note that your cluster has 15GB * 12 = 180GB of RAM in total. If you use
>> the default spark.storage.memoryFraction, it can barely cache 100GB of
>> data, not counting the overhead. So if your operation needs to cache the
>> data to be efficient, you may need a larger cluster or change the storage
>> level to MEMORY_AND_DISK.
>>
>> -Xiangrui
>>
>> On Nov 14, 2014, at 5:32 PM, Gary Malouf <ma...@gmail.com> wrote:
>>
>> Hmm, we actually read the CSV data from S3 now and were looking to avoid
>> that.  Unfortunately, we've experienced dreadful performance reading
>> 100GB of text data for a job directly from S3 - our hope had been that
>> connecting directly to Redshift would provide some boost.
>>
>> We had been using 12 m3.xlarges, but increasing the default parallelism
>> (to 2x the number of CPUs across the cluster) and increasing the number
>> of partitions during reading did not seem to help.
>>
>> On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng <me...@databricks.com>
>> wrote:
>>
>>> Michael is correct. Using a direct connection to dump the data would be
>>> slow because there is only a single connection. Please use UNLOAD with
>>> the ESCAPE option to dump the table to S3. See instructions at
>>>
>>> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>>>
>>> And then load them back using the Redshift input format we wrote:
>>> https://github.com/databricks/spark-redshift (we moved the
>>> implementation to github/databricks). Right now all columns are loaded
>>> as string columns, and you need to do the type casting manually. We plan
>>> to add a parser that can translate Redshift table schemas directly to
>>> Spark SQL schemas, but there is no ETA yet.
>>>
>>> -Xiangrui
>>>
>>> On Nov 14, 2014, at 3:46 PM, Michael Armbrust <mi...@databricks.com>
>>> wrote:
>>>
>>> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
>>> command used to produce the data.  Xiangrui can correct me if I'm wrong
>>> though.
>>>
>>> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf <ma...@gmail.com>
>>> wrote:
>>>
>>>> We have a bunch of data in Redshift tables that we'd like to pull into
>>>> Spark during job runs.  What is the path/URL format one uses to pull
>>>> data from there?  (This is in reference to using
>>>> https://github.com/mengxr/redshift-input-format)
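
A minimal sketch of the knob in question: repartitioning before the write
controls the output file size (the RDD name, partition count, and path
below are hypothetical; 100GB spread over 400 partitions works out to
roughly 250MB per file):

    import org.apache.spark.rdd.RDD

    // results stands in for whatever RDD the job produces (hypothetical).
    // ~400 partitions keeps each output file near 250MB for ~100GB of
    // data; this is the parallelism setting the GraphX bug blocked.
    def writeExport(results: RDD[String]): Unit = {
      results
        .repartition(400)
        .saveAsTextFile("s3n://my-bucket/export")  // hypothetical path
    }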

Re: Sourcing data from Redshift

Posted by Gary Malouf <ma...@gmail.com>.
I'll try this out and follow up with what I find.

On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng <me...@databricks.com> wrote:

> For each node, if the CSV reader is implemented efficiently, you should be
> able to hit at least half of the theoretical network bandwidth, which is
> about 60MB/second/node. So if you just do counting, the expected time
> should be within 3 minutes (12 nodes * 60MB/s = 720MB/s in aggregate, so
> scanning 100GB takes roughly 140 seconds).
>
> Note that your cluster has 15GB * 12 = 180GB of RAM in total. If you use
> the default spark.storage.memoryFraction, it can barely cache 100GB of
> data, not counting the overhead. So if your operation needs to cache the
> data to be efficient, you may need a larger cluster or change the storage
> level to MEMORY_AND_DISK.
>
> -Xiangrui
>
> On Nov 14, 2014, at 5:32 PM, Gary Malouf <ma...@gmail.com> wrote:
>
> Hmm, we actually read the CSV data from S3 now and were looking to avoid
> that.  Unfortunately, we've experienced dreadful performance reading 100GB
> of text data for a job directly from S3 - our hope had been that
> connecting directly to Redshift would provide some boost.
>
> We had been using 12 m3.xlarges, but increasing the default parallelism
> (to 2x the number of CPUs across the cluster) and increasing the number of
> partitions during reading did not seem to help.
>
> On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng <me...@databricks.com>
> wrote:
>
>> Michael is correct. Using a direct connection to dump the data would be
>> slow because there is only a single connection. Please use UNLOAD with
>> the ESCAPE option to dump the table to S3. See instructions at
>>
>> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>>
>> And then load them back using the Redshift input format we wrote:
>> https://github.com/databricks/spark-redshift (we moved the
>> implementation to github/databricks). Right now all columns are loaded
>> as string columns, and you need to do the type casting manually. We plan
>> to add a parser that can translate Redshift table schemas directly to
>> Spark SQL schemas, but there is no ETA yet.
>>
>> -Xiangrui
>>
>> On Nov 14, 2014, at 3:46 PM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
>> command used to produce the data.  Xiangrui can correct me if I'm wrong
>> though.
>>
>> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf <ma...@gmail.com>
>> wrote:
>>
>>> We have a bunch of data in Redshift tables that we'd like to pull into
>>> Spark during job runs.  What is the path/URL format one uses to pull
>>> data from there?  (This is in reference to using
>>> https://github.com/mengxr/redshift-input-format)
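
Reading Xiangrui's sizing advice concretely: spark.storage.memoryFraction
defaulted to 0.6 in this era, so 180GB of cluster RAM leaves only about
108GB for the cache. A hedged sketch of the storage-level change he
suggests (the path is a placeholder):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    // With ~108GB of cache space for ~100GB of raw data plus overhead,
    // the default MEMORY_ONLY level would drop partitions and recompute
    // them later; MEMORY_AND_DISK spills the overflow to local disk.
    def loadCached(sc: SparkContext) =
      sc.textFile("s3n://my-bucket/unload-prefix")  // hypothetical path
        .persist(StorageLevel.MEMORY_AND_DISK)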

Re: Sourcing data from Redshift

Posted by Gary Malouf <ma...@gmail.com>.
Hmm, we actually read the CSV data from S3 now and were looking to avoid
that.  Unfortunately, we've experienced dreadful performance reading 100GB
of text data for a job directly from S3 - our hope had been that
connecting directly to Redshift would provide some boost.

We had been using 12 m3.xlarges, but increasing the default parallelism
(to 2x the number of CPUs across the cluster) and increasing the number of
partitions during reading did not seem to help.

On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng <me...@databricks.com> wrote:

> Michael is correct. Using a direct connection to dump the data would be
> slow because there is only a single connection. Please use UNLOAD with
> the ESCAPE option to dump the table to S3. See instructions at
>
> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>
> And then load them back using the Redshift input format we wrote:
> https://github.com/databricks/spark-redshift (we moved the implementation
> to github/databricks). Right now all columns are loaded as string columns,
> and you need to do the type casting manually. We plan to add a parser that
> can translate Redshift table schemas directly to Spark SQL schemas, but
> there is no ETA yet.
>
> -Xiangrui
>
> On Nov 14, 2014, at 3:46 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
> command used to produce the data.  Xiangrui can correct me if I'm wrong
> though.
>
> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf <ma...@gmail.com>
> wrote:
>
>> We have a bunch of data in Redshift tables that we'd like to pull into
>> Spark during job runs.  What is the path/URL format one uses to pull
>> data from there?  (This is in reference to using
>> https://github.com/mengxr/redshift-input-format)
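
Putting the two steps together, a hedged sketch (the JDBC URL, credentials,
paths, table schema, and the exact input-format class name are all
assumptions; see the AWS UNLOAD docs and the spark-redshift README for the
authoritative versions):

    import java.sql.DriverManager
    import org.apache.spark.SparkContext
    // Assumed package/class name; check the spark-redshift README.
    import com.databricks.spark.redshift.RedshiftInputFormat

    def unloadAndLoad(sc: SparkContext): Unit = {
      // Step 1: run UNLOAD with ESCAPE over a plain JDBC connection
      // (Redshift speaks the PostgreSQL protocol). Values hypothetical.
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://mycluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
        "username", "password")
      conn.createStatement().execute(
        """UNLOAD ('select * from my_table')
          |TO 's3://my-bucket/unload-prefix/'
          |CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
          |ESCAPE""".stripMargin)
      conn.close()

      // Step 2: read the escaped files back. Every field arrives as a
      // string and must be cast by hand for now.
      val records = sc.newAPIHadoopFile(
        "s3n://my-bucket/unload-prefix",
        classOf[RedshiftInputFormat],
        classOf[java.lang.Long],     // byte offset of each record
        classOf[Array[String]])      // one string per column
      val typed = records.map { case (_, fields) =>
        (fields(0).toLong, fields(1))  // e.g. (id, name); schema is made up
      }
      println(typed.count())
    }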

Re: Sourcing data from Redshift

Posted by Michael Armbrust <mi...@databricks.com>.
I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
command used to produce the data.  Xiangrui can correct me if I'm wrong
though.

On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf <ma...@gmail.com> wrote:

> We have a bunch of data in Redshift tables that we'd like to pull into
> Spark during job runs.  What is the path/URL format one uses to pull
> data from there?  (This is in reference to using
> https://github.com/mengxr/redshift-input-format)
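
For completeness, a hedged sketch of that path format in use, given an
existing SparkContext sc (keys and bucket are placeholders; the
credentials can also live in the Hadoop configuration instead of the URL,
which keeps them out of logs and sidesteps secret keys containing '/'):

    // Hypothetical values throughout.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY")

    // Either embed the keys in the URL...
    val withKeysInUrl = "s3n://ACCESS_KEY:SECRET_KEY@my-bucket/unload-prefix"
    // ...or rely on the configuration set above.
    val path = "s3n://my-bucket/unload-prefix"
    println(sc.textFile(path).count())  // quick sanity check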