Posted to user@spark.apache.org by Qiang Li <ql...@appannie.com> on 2016/09/17 02:34:42 UTC

Spark output data to S3 is very slow

Hi,


I ran some jobs with Spark 2.0 on YARN. All tasks finished very quickly,
but in the last step Spark spends a lot of time renaming or moving data
from the S3 temporary directory to the final directory. I then tried setting

spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
or
spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter

But neither works; it looks like Spark 2.0 removed these configs. How can
I make Spark write output directly, without a temporary directory?

-- 
*This email may contain or reference confidential information and is 
intended only for the individual to whom it is addressed.  Please refrain 
from distributing, disclosing or copying this email and the information 
contained within unless you are the intended recipient.  If you received 
this email in error, please notify us at legal@appannie.com 
<le...@appannie.com>** immediately and remove it from your system.*

Re: Spark output data to S3 is very slow

Posted by Qiang Li <ql...@appannie.com>.

I tried several times and it is as slow as before. As a temporary solution,
I will have Spark write the data to HDFS and then sync it to S3.

Thank you.
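
For anyone who wants to try the same workaround, here is a rough sketch of
the "sync" step only. It assumes the job has already written its output to
an HDFS directory, that the `hadoop` CLI is on the PATH, and that the S3
connector (s3a here) is configured with credentials. The paths, the bucket
name, and the helper name `build_sync_command` are made up for illustration;
only `hadoop distcp <source> <destination>` itself is the standard tool.

```python
import shlex
from typing import List


def build_sync_command(hdfs_dir: str, s3_dir: str) -> List[str]:
    """Build the argv for copying a finished HDFS output directory to S3.

    Hypothetical helper: it only assembles a `hadoop distcp` command line
    and does not run anything.
    """
    if not hdfs_dir.startswith("hdfs://"):
        raise ValueError("source must be an hdfs:// URI: %r" % hdfs_dir)
    if not s3_dir.startswith(("s3a://", "s3n://", "s3://")):
        raise ValueError("destination must be an S3 URI: %r" % s3_dir)
    # distcp copies the directory as a distributed job, so the slow S3
    # writes happen outside the Spark job's commit phase.
    return ["hadoop", "distcp", hdfs_dir, s3_dir]


if __name__ == "__main__":
    # Example paths are assumptions, not taken from the original job.
    cmd = build_sync_command("hdfs:///tmp/job_output", "s3a://my-bucket/job_output")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once
the Spark job has finished. This does not remove the cost of writing to S3;
it only moves it out of the Spark job's final rename step.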

On Sat, Sep 17, 2016 at 10:43 AM, Takeshi Yamamuro <li...@gmail.com>
wrote:

> Hi,
>
> Have you seen the previous thread?
> https://www.mail-archive.com/user@spark.apache.org/msg56791.html
>
> // maropu
>
>
> On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li <ql...@appannie.com> wrote:
>
>> Hi,
>>
>>
>> I ran some jobs with Spark 2.0 on YARN. All tasks finished very quickly,
>> but in the last step Spark spends a lot of time renaming or moving data
>> from the S3 temporary directory to the final directory. I then tried setting
>>
>> spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
>> or
>> spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter
>>
>> But neither works; it looks like Spark 2.0 removed these configs. How
>> can I make Spark write output directly, without a temporary directory?
>>
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>

Re: Spark output data to S3 is very slow

Posted by Takeshi Yamamuro <li...@gmail.com>.

Hi,

Have you seen the previous thread?
https://www.mail-archive.com/user@spark.apache.org/msg56791.html

// maropu


On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li <ql...@appannie.com> wrote:

> Hi,
>
>
> I ran some jobs with Spark 2.0 on YARN. All tasks finished very quickly,
> but in the last step Spark spends a lot of time renaming or moving data
> from the S3 temporary directory to the final directory. I then tried setting
>
> spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
> or
> spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter
>
> But neither works; it looks like Spark 2.0 removed these configs. How can
> I make Spark write output directly, without a temporary directory?




-- 
---
Takeshi Yamamuro