Posted to user@spark.apache.org by Pavel Plotnikov <pa...@team.wrike.com> on 2016/01/19 13:43:04 UTC

Parquet write optimization by row group size config

Hello,
I'm using Spark on several machines in standalone mode; data storage is
mounted on these machines via NFS. I have an input data stream, and when I
try to store all the data for an hour in Parquet, the job executes mostly on
one core and the hourly data takes 40-50 minutes to write. It is very slow!
And it is not an I/O problem. After researching how a Parquet file works, I
found that it can be parallelized at the row group abstraction level.
I think the row group for my files is too large; how can I change it?
When I create a big DataFrame, it divides into parts very well and writes
quickly!

Thanks,
Pavel
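
For reference, the Parquet writer's row group size is controlled by the
parquet.block.size setting of the Parquet-Hadoop output format. A minimal
PySpark sketch (the 32MB value and the df/out_file names are only
illustrative, not from this thread) could look like this:
----
# Set the target row group size (in bytes) before writing Parquet.
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 32 * 1024 * 1024)
df.write.parquet(out_file, mode='overwrite')
----
Note, however, that each output partition is still written by a single task,
so a smaller row group by itself does not parallelize the write; row groups
mainly give readers finer-grained splits. The repartitioning suggested later
in the thread is what spreads the write across cores.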

Re: Parquet write optimization by row group size config

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
It would be good if you could share the code; someone here or I can guide
you better once you post the snippet.

Thanks
Best Regards

On Wed, Jan 20, 2016 at 10:54 PM, Pavel Plotnikov <
pavel.plotnikov@team.wrike.com> wrote:

> Thanks, Akhil! It helps, but the jobs are still not fast enough; maybe I
> missed something.
>
> Regards,
> Pavel
>
> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Did you try re-partitioning the data before doing the write?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
>> pavel.plotnikov@team.wrike.com> wrote:
>>
>>> Hello,
>>> I'm using Spark on several machines in standalone mode; data storage is
>>> mounted on these machines via NFS. I have an input data stream, and when I
>>> try to store all the data for an hour in Parquet, the job executes mostly
>>> on one core and the hourly data takes 40-50 minutes to write. It is very
>>> slow! And it is not an I/O problem. After researching how a Parquet file
>>> works, I found that it can be parallelized at the row group abstraction
>>> level.
>>> I think the row group for my files is too large; how can I change it?
>>> When I create a big DataFrame, it divides into parts very well and writes
>>> quickly!
>>>
>>> Thanks,
>>> Pavel
>>>
>>
>>

Re: Parquet write optimization by row group size config

Posted by Pavel Plotnikov <pa...@team.wrike.com>.
I get about 25 separate gzipped log files per hour. The file sizes vary a
lot, from 10MB to 50MB of gzipped JSON data. So I convert this data to
Parquet each hour. The code is very simple, in Python:
----
import json

# flatting_events and specific_keys_types_wrapper are our own helpers that
# flatten and normalize the raw events; the JSON sits in the 3rd tab field.
text_file = sc.textFile(src_file)
df = sqlCtx.jsonRDD(text_file.map(lambda x: x.split('\t')[2]).map(json.loads)
                    .flatMap(flatting_events)
                    .map(specific_keys_types_wrapper).map(json.dumps))
df.write.parquet(out_file, mode='overwrite')
----

The JSON in the log files is not clean, and I need to do some preparation
via RDDs. The output Parquet files are very small, about 35MB for the largest
source files. The source log files are converted one by one. It is nice that
all the conversion transformations run quickly on many machine cores, but
when I run htop on my machines I see that mostly only one core is used. That
is very strange.
First thought: create a separate Spark context for each input file (or group
of files) and give each one only 2 cores, so that all the servers' power gets
used. But this solution looks ugly and it eliminates all the beauty of Spark;
maybe this case is simply not for Spark.
I found that in the first seconds the job uses all available cores but then
works on only one, and it is not an I/O problem (the files are too small for
a RAID over SSD to be the bottleneck). So, second thought: the problem is in
the Parquet files. After reading some docs, I understand that Parquet has a
lot of levels of parallelism, and I should look for a solution there.
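
One possible variation (not something proposed in the thread; the glob path
and partition count are only illustrative) would be to read the whole hour of
logs as a single job and, following the repartitioning suggestion quoted
below, repartition before the write, so that both the conversion and the
Parquet write use all cores:
----
import json

# sc.textFile accepts globs and comma-separated paths, so one job can cover
# the whole hour; each .gz file still becomes one read partition.
raw = sc.textFile('/data/logs/2016-01-19/13/*.gz')
events = (raw.map(lambda x: x.split('\t')[2]).map(json.loads)
             .flatMap(flatting_events)
             .map(specific_keys_types_wrapper).map(json.dumps))
# Spread the data over enough partitions that the write runs on all cores.
sqlCtx.jsonRDD(events).repartition(32).write.parquet(out_file, mode='overwrite')
----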

On Thu, Jan 21, 2016 at 10:35 AM Jörn Franke <jo...@gmail.com> wrote:

> What are your data size, your algorithm, and the expected time?
> Depending on this, the group can recommend optimizations or tell you
> that the expectations are wrong.
>
> On 20 Jan 2016, at 18:24, Pavel Plotnikov <pa...@team.wrike.com>
> wrote:
>
> Thanks, Akhil! It helps, but the jobs are still not fast enough; maybe I
> missed something.
>
> Regards,
> Pavel
>
> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Did you try re-partitioning the data before doing the write?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
>> pavel.plotnikov@team.wrike.com> wrote:
>>
>>> Hello,
>>> I'm using Spark on several machines in standalone mode; data storage is
>>> mounted on these machines via NFS. I have an input data stream, and when I
>>> try to store all the data for an hour in Parquet, the job executes mostly
>>> on one core and the hourly data takes 40-50 minutes to write. It is very
>>> slow! And it is not an I/O problem. After researching how a Parquet file
>>> works, I found that it can be parallelized at the row group abstraction
>>> level.
>>> I think the row group for my files is too large; how can I change it?
>>> When I create a big DataFrame, it divides into parts very well and writes
>>> quickly!
>>>
>>> Thanks,
>>> Pavel
>>>
>>
>>

Re: Parquet write optimization by row group size config

Posted by Jörn Franke <jo...@gmail.com>.
What are your data size, your algorithm, and the expected time?
Depending on this, the group can recommend optimizations or tell you that the expectations are wrong.

> On 20 Jan 2016, at 18:24, Pavel Plotnikov <pa...@team.wrike.com> wrote:
> 
> Thanks, Akhil! It helps, but the jobs are still not fast enough; maybe I missed something.
> 
> Regards,
> Pavel
> 
>> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com> wrote:
>> Did you try re-partitioning the data before doing the write?
>> 
>> Thanks
>> Best Regards
>> 
>>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <pa...@team.wrike.com> wrote:
>>> Hello, 
>>> I'm using Spark on several machines in standalone mode; data storage is mounted on these machines via NFS. I have an input data stream, and when I try to store all the data for an hour in Parquet, the job executes mostly on one core and the hourly data takes 40-50 minutes to write. It is very slow! And it is not an I/O problem. After researching how a Parquet file works, I found that it can be parallelized at the row group abstraction level.
>>> I think the row group for my files is too large; how can I change it?
>>> When I create a big DataFrame, it divides into parts very well and writes quickly!
>>> 
>>> Thanks,
>>> Pavel

Re: Parquet write optimization by row group size config

Posted by Pavel Plotnikov <pa...@team.wrike.com>.
Thanks, Akhil! It helps, but the jobs are still not fast enough; maybe I
missed something.

Regards,
Pavel

On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Did you try re-partitioning the data before doing the write?
>
> Thanks
> Best Regards
>
> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
> pavel.plotnikov@team.wrike.com> wrote:
>
>> Hello,
>> I'm using Spark on several machines in standalone mode; data storage is
>> mounted on these machines via NFS. I have an input data stream, and when I
>> try to store all the data for an hour in Parquet, the job executes mostly
>> on one core and the hourly data takes 40-50 minutes to write. It is very
>> slow! And it is not an I/O problem. After researching how a Parquet file
>> works, I found that it can be parallelized at the row group abstraction
>> level.
>> I think the row group for my files is too large; how can I change it?
>> When I create a big DataFrame, it divides into parts very well and writes
>> quickly!
>>
>> Thanks,
>> Pavel
>>
>
>

Re: Parquet write optimization by row group size config

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Did you try re-partitioning the data before doing the write?

Thanks
Best Regards
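
As a minimal sketch of that suggestion (the partition count of 16 is only an
illustrative value, roughly the total number of executor cores):
----
# Repartition before the write so each partition becomes its own Parquet file
# written by a separate task, instead of one task doing most of the work.
df.repartition(16).write.parquet(out_file, mode='overwrite')
----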

On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
pavel.plotnikov@team.wrike.com> wrote:

> Hello,
> I'm using Spark on several machines in standalone mode; data storage is
> mounted on these machines via NFS. I have an input data stream, and when I
> try to store all the data for an hour in Parquet, the job executes mostly
> on one core and the hourly data takes 40-50 minutes to write. It is very
> slow! And it is not an I/O problem. After researching how a Parquet file
> works, I found that it can be parallelized at the row group abstraction
> level.
> I think the row group for my files is too large; how can I change it?
> When I create a big DataFrame, it divides into parts very well and writes
> quickly!
>
> Thanks,
> Pavel
>