Posted to user@spark.apache.org by Gavin Yue <yu...@gmail.com> on 2016/01/11 07:12:52 UTC

parquet repartitions and parquet.enable.summary-metadata does not work

Hey,

I am trying to convert a bunch of JSON files into Parquet, which would
output over 7000 Parquet files. But there are too many files, so I want
to repartition based on id down to 3000.
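
The conversion looks roughly like this (paths are placeholders, and it
assumes the Spark 1.6 DataFrame API):

    // Sketch only: real paths and schema differ.
    val df = sqlContext.read.json("/path/to/json")
    // Repartition on the id column into 3000 partitions, then write.
    df.repartition(3000, df("id")).write.parquet("/path/to/parquet")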

But I ran into a GC error like the one in this thread:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=L6xQ9+9SjL3wGiiSpZyFH2xmThyg@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call
write.parquet, I still see the 3000 jobs run after the write, and they
fail due to GC.

Basically, repartition has never succeeded for me. Are there any other
settings that could be tuned?

Thanks,
Gavin

Re: parquet repartitions and parquet.enable.summary-metadata does not work

Posted by Cheng Lian <li...@gmail.com>.
I see. So there are actually 3000 tasks rather than 3000 jobs, right?

Would you mind providing the full stack trace of the GC issue? At first
I thought it was identical to the _metadata one in the mail thread you
mentioned.

Cheng

On 1/11/16 5:30 PM, Gavin Yue wrote:
> Here is how I set the conf: 
> sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
>
> This actually works; I do not see the _metadata file anymore.
>
> I think I made a mistake.  The 3000 jobs are coming from 
> repartition("id").
>
> I have 7600 JSON files and want to save them as Parquet.
>
> So if I use df.write.parquet(path), it generates 7600 Parquet
> files with 7600 partitions, which has no problem.
>
> But if I use repartition to change the partition number, say:
> df.repartition(3000).write.parquet(path)
>
> This would generate 7600 + 3000 tasks. The 3000 tasks always fail
> due to the GC problem.
>
> Best,
> Gavin
>
>
>
> On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>     Hey Gavin,
>
>     Could you please provide a snippet of your code to show how you
>     disabled "parquet.enable.summary-metadata" and wrote the files?
>     In particular, you mentioned that you saw "3000 jobs" fail. Were
>     you writing each Parquet file with an individual job? (Usually
>     people use write.partitionBy(...).parquet(...) to write multiple
>     Parquet files.)
>
>     Cheng
>
>
>     On 1/10/16 10:12 PM, Gavin Yue wrote:
>
>         Hey,
>
>         I am trying to convert a bunch of JSON files into Parquet,
>         which would output over 7000 Parquet files. But there are
>         too many files, so I want to repartition based on id down
>         to 3000.
>
>         But I ran into a GC error like the one in this thread:
>         https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=L6xQ9+9SjL3wGiiSpZyFH2xmThyg@mail.gmail.com%3E#archives
>
>         So I set parquet.enable.summary-metadata to false. But when
>         I call write.parquet, I still see the 3000 jobs run after
>         the write, and they fail due to GC.
>
>         Basically, repartition has never succeeded for me. Are there
>         any other settings that could be tuned?
>
>         Thanks,
>         Gavin
>
>
>


Re: parquet repartitions and parquet.enable.summary-metadata does not work

Posted by Gavin Yue <yu...@gmail.com>.
Here is how I set the conf:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

This actually works; I do not see the _metadata file anymore.

I think I made a mistake.  The 3000 jobs are coming from repartition("id").

I have 7600 JSON files and want to save them as Parquet.

So if I use df.write.parquet(path), it generates 7600 Parquet files
with 7600 partitions, which has no problem.

But if I use repartition to change the partition number, say:
df.repartition(3000).write.parquet(path)

This would generate 7600 + 3000 tasks. The 3000 tasks always fail due
to the GC problem.
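
Side by side, the two runs look like this (path is a placeholder):

    // Works: no shuffle, one write task per input partition -> 7600 files.
    df.write.parquet(path)

    // Fails with GC errors: repartition inserts a shuffle, so the job
    // runs the 7600 map-side tasks plus 3000 shuffle-read tasks that
    // do the actual Parquet writing.
    df.repartition(3000).write.parquet(path)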

Best,
Gavin



On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian <li...@gmail.com> wrote:

> Hey Gavin,
>
> Could you please provide a snippet of your code to show how you disabled
> "parquet.enable.summary-metadata" and wrote the files? In particular,
> you mentioned that you saw "3000 jobs" fail. Were you writing each Parquet
> file with an individual job? (Usually people use
> write.partitionBy(...).parquet(...) to write multiple Parquet files.)
>
> Cheng
>
>
> On 1/10/16 10:12 PM, Gavin Yue wrote:
>
>> Hey,
>>
>> I am trying to convert a bunch of JSON files into Parquet, which would
>> output over 7000 Parquet files. But there are too many files, so I want
>> to repartition based on id down to 3000.
>>
>> But I ran into a GC error like the one in this thread:
>> https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=L6xQ9+9SjL3wGiiSpZyFH2xmThyg@mail.gmail.com%3E#archives
>>
>> So I set parquet.enable.summary-metadata to false. But when I call
>> write.parquet, I still see the 3000 jobs run after the write, and
>> they fail due to GC.
>>
>> Basically, repartition has never succeeded for me. Are there any
>> other settings that could be tuned?
>>
>> Thanks,
>> Gavin
>>
>
>

Re: parquet repartitions and parquet.enable.summary-metadata does not work

Posted by Cheng Lian <li...@gmail.com>.
Hey Gavin,

Could you please provide a snippet of your code to show how you
disabled "parquet.enable.summary-metadata" and wrote the files?
In particular, you mentioned that you saw "3000 jobs" fail. Were you writing
each Parquet file with an individual job? (Usually people use 
write.partitionBy(...).parquet(...) to write multiple Parquet files.)
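
For example, something along these lines (the "date" column is only an
illustration, not a column I know you have):

    // Writes one output directory per distinct value of "date",
    // e.g. /path/to/output/date=2016-01-10/part-*.parquet
    df.write.partitionBy("date").parquet("/path/to/output")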

Cheng

On 1/10/16 10:12 PM, Gavin Yue wrote:
> Hey,
>
> I am trying to convert a bunch of JSON files into Parquet, which would
> output over 7000 Parquet files. But there are too many files, so I
> want to repartition based on id down to 3000.
>
> But I ran into a GC error like the one in this thread:
> https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=L6xQ9+9SjL3wGiiSpZyFH2xmThyg@mail.gmail.com%3E#archives
>
> So I set parquet.enable.summary-metadata to false. But when I call
> write.parquet, I still see the 3000 jobs run after the write, and
> they fail due to GC.
>
> Basically, repartition has never succeeded for me. Are there any
> other settings that could be tuned?
>
> Thanks,
> Gavin


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org