Posted to user@spark.apache.org by Chetan Khatri <ck...@gmail.com> on 2016/10/21 20:47:03 UTC

Writing to Parquet: job turns to wait mode even after completion

Hello Spark Users,

I am writing around 10 GB of processed data to Parquet on a Google Cloud
machine with 1 TB of HDD, 102 GB of RAM, and 16 vCores.

Every time I write to Parquet, the Spark UI shows the stages as succeeded,
but the spark shell holds the context in a wait state for almost 10 minutes
before it clears the broadcast and accumulator shared variables.

Can we speed this up?

Thanks.

-- 
Yours Aye,
Chetan Khatri.
M.+91 76666 80574
Data Science Researcher
INDIA

​​Statement of Confidentiality
————————————————————————————
The contents of this e-mail message and any attachments are confidential
and are intended solely for addressee. The information may also be legally
privileged. This transmission is sent in trust, for the sole purpose of
delivery to the intended recipient. If you have received this transmission
in error, any use, reproduction or dissemination of this transmission is
strictly prohibited. If you are not the intended recipient, please
immediately notify the sender by reply e-mail or phone and delete this
message and its attachments, if any.​​

Re: Writing to Parquet: job turns to wait mode even after completion

Posted by Chetan Khatri <ck...@gmail.com>.
Thank you, everyone. The original question was: "Every time I write to
Parquet, the Spark UI shows the stages as succeeded, but the spark shell
holds the context in a wait state for almost 10 minutes before it clears
the broadcast and accumulator shared variables."

I don't think stopping the context resolves the current issue.

It takes more time to clear broadcasts, accumulators, etc.

Can we tune this on the Spark 1.6.1 MapR distribution?
On Oct 27, 2016 2:34 PM, "Mehrez Alachheb" <la...@gmail.com> wrote:

> I think you should just shut down your SparkContext at the end.
> sc.stop()
>
> 2016-10-21 22:47 GMT+02:00 Chetan Khatri <ck...@gmail.com>:
>
>> Hello Spark Users,
>>
>> I am writing around 10 GB of processed data to Parquet on a Google Cloud
>> machine with 1 TB of HDD, 102 GB of RAM, and 16 vCores.
>>
>> Every time I write to Parquet, the Spark UI shows the stages as succeeded,
>> but the spark shell holds the context in a wait state for almost 10
>> minutes before it clears the broadcast and accumulator shared variables.
>>
>> Can we speed this up?
>>
>> Thanks.
>>

Re: Writing to Parquet: job turns to wait mode even after completion

Posted by Mehrez Alachheb <la...@gmail.com>.
I think you should just shut down your SparkContext at the end.
sc.stop()
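
For illustration, a minimal sketch of that pattern from the spark shell (the
DataFrame and output path here are placeholders, not from this thread):

    // write the processed data, then stop the context so executors,
    // broadcasts and accumulators are released promptly
    df.write.parquet("/data/out")
    sc.stop()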

2016-10-21 22:47 GMT+02:00 Chetan Khatri <ck...@gmail.com>:

> Hello Spark Users,
>
> I am writing around 10 GB of processed data to Parquet on a Google Cloud
> machine with 1 TB of HDD, 102 GB of RAM, and 16 vCores.
>
> Every time I write to Parquet, the Spark UI shows the stages as succeeded,
> but the spark shell holds the context in a wait state for almost 10 minutes
> before it clears the broadcast and accumulator shared variables.
>
> Can we speed this up?
>
> Thanks.
>

Re: Writing to Parquet: job turns to wait mode even after completion

Posted by Steve Loughran <st...@hortonworks.com>.
On 24 Oct 2016, at 20:32, Cheng Lian <li...@gmail.com> wrote:



On 10/22/16 6:18 AM, Steve Loughran wrote:

...
On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <li...@gmail.com> wrote:

What version of Spark are you using, and how many output files does the job write out?

By default, Spark versions before 1.6 (exclusive) write Parquet summary files when committing the job. This process reads footers from all Parquet files in the destination directory and merges them together. This can be particularly bad if you are appending a small amount of data to a large existing Parquet dataset.

If that's the case, you may disable Parquet summary files by setting the Hadoop configuration "parquet.enable.summary-metadata" to false.


Now I'm a bit mixed up. Should that be spark.sql.parquet.enable.summary-metadata = false?
No, "parquet.enable.summary-metadata" is a Hadoop configuration option introduced by Parquet. In Spark 2.0, you can simply set it using spark.conf.set(), and Spark will propagate it properly.


OK, chased it down to a feature that ryanb @ netflix made optional, presumably for their S3 work (PARQUET-107).

Here is what I'd say makes a good set of options for S3A & Parquet:

spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false

While for ORC, you want:


spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.orc.filterPushdown true

And:

spark.sql.hive.metastorePartitionPruning true

along with commit settings via:

spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true



For when people get to play with the Hadoop S3A phase II binaries, they'll also be wanting

spark.hadoop.fs.s3a.readahead.range 157810688

// faster backward seek for ORC and Parquet input
spark.hadoop.fs.s3a.experimental.input.fadvise random

// PUT blocks in separate threads
spark.hadoop.fs.s3a.fast.output.enabled true


The fadvise one is *really* good when working with ORC/Parquet; without it, column filtering and predicate pushdown are somewhat crippled.
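
For anyone wanting to try this combination, a minimal sketch of wiring such
settings in up front (the app name is a placeholder; trim the list to your
storage and formats):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.hadoop.* entries are copied into the Hadoop Configuration,
    // so the Parquet summary-metadata switch reaches the output committer
    val conf = new SparkConf()
      .setAppName("parquet-s3a-tuning-sketch")
      .set("spark.sql.parquet.filterPushdown", "true")
      .set("spark.sql.parquet.mergeSchema", "false")
      .set("spark.hadoop.parquet.enable.summary-metadata", "false")
      .set("spark.speculation", "false")
      .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    val sc = new SparkContext(conf)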


Re: Writing to Parquet: job turns to wait mode even after completion

Posted by Cheng Lian <li...@gmail.com>.

On 10/22/16 6:18 AM, Steve Loughran wrote:

...
> On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>>
>>     What version of Spark are you using, and how many output files
>>     does the job write out?
>>
>>     By default, Spark versions before 1.6 (exclusive) write
>>     Parquet summary files when committing the job. This process reads
>>     footers from all Parquet files in the destination directory and
>>     merges them together. This can be particularly bad if you are
>>     appending a small amount of data to a large existing Parquet dataset.
>>
>>     If that's the case, you may disable Parquet summary files by
>>     setting the Hadoop configuration "parquet.enable.summary-metadata"
>>     to false.
>>
>>
>
> Now I'm a bit mixed up. Should that be
> spark.sql.parquet.enable.summary-metadata = false?
No, "parquet.enable.summary-metadata" is a Hadoop configuration option
introduced by Parquet. In Spark 2.0, you can simply set it using
spark.conf.set(), and Spark will propagate it properly.
>
>>     We've disabled it by default since 1.6.0
>>
>>     Cheng
>>
>>
>>     On 10/21/16 1:47 PM, Chetan Khatri wrote:
>>>     Hello Spark Users,
>>>
>>>     I am writing around 10 GB of processed data to Parquet on a
>>>     Google Cloud machine with 1 TB of HDD, 102 GB of RAM, and 16
>>>     vCores.
>>>
>>>     Every time I write to Parquet, the Spark UI shows the stages as
>>>     succeeded, but the spark shell holds the context in a wait state
>>>     for almost 10 minutes before it clears the broadcast and
>>>     accumulator shared variables.
>>>
>>>     Can we speed this up?
>>>
>>>     Thanks.
>>>


Re: Writing to Parquet: job turns to wait mode even after completion

Posted by Steve Loughran <st...@hortonworks.com>.
On 22 Oct 2016, at 00:48, Chetan Khatri <ck...@gmail.com> wrote:

Hello Cheng,

Thank you for response.

I am using Spark 1.6.1, writing around 350 gzipped Parquet part files for a single table, having processed around 180 GB of data using Spark.

Are you writing to GCS storage or to the local HDD?

Regarding options to set for performant reads against object-store-hosted Parquet data, also go for:

spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
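
As a hedged sketch of applying these in a Spark 1.6 shell (the table path is
a placeholder), schema merging can also be disabled for a single read:

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

    // or override merging per read via the data source option
    val df = sqlContext.read.option("mergeSchema", "false")
      .parquet("gs://bucket/table")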

On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <li...@gmail.com> wrote:

What version of Spark are you using, and how many output files does the job write out?

By default, Spark versions before 1.6 (exclusive) write Parquet summary files when committing the job. This process reads footers from all Parquet files in the destination directory and merges them together. This can be particularly bad if you are appending a small amount of data to a large existing Parquet dataset.

If that's the case, you may disable Parquet summary files by setting the Hadoop configuration "parquet.enable.summary-metadata" to false.


Now I'm a bit mixed up. Should that be spark.sql.parquet.enable.summary-metadata = false?


We've disabled it by default since 1.6.0

Cheng

On 10/21/16 1:47 PM, Chetan Khatri wrote:
Hello Spark Users,

I am writing around 10 GB of processed data to Parquet on a Google Cloud machine with 1 TB of HDD, 102 GB of RAM, and 16 vCores.

Every time I write to Parquet, the Spark UI shows the stages as succeeded, but the spark shell holds the context in a wait state for almost 10 minutes before it clears the broadcast and accumulator shared variables.

Can we speed this up?

Thanks.



Re: Writing to Parquet: job turns to wait mode even after completion

Posted by Chetan Khatri <ck...@gmail.com>.
Hello Cheng,

Thank you for response.

I am using Spark 1.6.1, writing around 350 gzipped Parquet part files for a
single table, having processed around 180 GB of data using Spark.



On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <li...@gmail.com> wrote:

> What version of Spark are you using, and how many output files does the job
> write out?
>
> By default, Spark versions before 1.6 (exclusive) write Parquet
> summary files when committing the job. This process reads footers from all
> Parquet files in the destination directory and merges them together. This
> can be particularly bad if you are appending a small amount of data to a
> large existing Parquet dataset.
>
> If that's the case, you may disable Parquet summary files by setting the
> Hadoop configuration "parquet.enable.summary-metadata" to false.
>
> We've disabled it by default since 1.6.0
>
> Cheng
>
> On 10/21/16 1:47 PM, Chetan Khatri wrote:
>
> Hello Spark Users,
>
> I am writing around 10 GB of processed data to Parquet on a Google Cloud
> machine with 1 TB of HDD, 102 GB of RAM, and 16 vCores.
>
> Every time I write to Parquet, the Spark UI shows the stages as succeeded,
> but the spark shell holds the context in a wait state for almost 10 minutes
> before it clears the broadcast and accumulator shared variables.
>
> Can we speed this up?
>
> Thanks.
>



Re: Writing to Parquet: job turns to wait mode even after completion

Posted by Cheng Lian <li...@gmail.com>.
What version of Spark are you using, and how many output files does the
job write out?

By default, Spark versions before 1.6 (exclusive) write Parquet
summary files when committing the job. This process reads footers from
all Parquet files in the destination directory and merges them together.
This can be particularly bad if you are appending a small amount of data
to a large existing Parquet dataset.

If that's the case, you may disable Parquet summary files by setting the
Hadoop configuration "parquet.enable.summary-metadata" to false.
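
For readers still on pre-1.6 versions, a minimal sketch of doing that from
the shell before a write (the DataFrame and output path are placeholders):

    // this is a Hadoop/Parquet option, so it goes on the Hadoop
    // configuration rather than through spark.sql.* settings
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
    df.write.parquet("/data/out")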

We've disabled it by default since 1.6.0

Cheng


On 10/21/16 1:47 PM, Chetan Khatri wrote:
> Hello Spark Users,
>
> I am writing around 10 GB of processed data to Parquet on a Google
> Cloud machine with 1 TB of HDD, 102 GB of RAM, and 16 vCores.
>
> Every time I write to Parquet, the Spark UI shows the stages as
> succeeded, but the spark shell holds the context in a wait state for
> almost 10 minutes before it clears the broadcast and accumulator
> shared variables.
>
> Can we speed this up?
>
> Thanks.
>