Posted to dev@parquet.apache.org by Eric Owhadi <er...@esgyn.com> on 2017/12/07 17:06:30 UTC

Rowgroup to hdfs block mapping / data locality

Hello Parquet-eers,
I am studying Parquet's behavior in terms of row group to HDFS block mapping, and I found some unexpected behavior (at least I did not expect it ☺).
Here is a printout of the layout of a Parquet file with 12 row groups on HDFS, with a block size of 134217728 and the row group size set to 134217728 at write time using Hive.

offset        RG size      offset + RG size          end of hdfs block
4             141389243    141389247                 134217728
141389247     129560117    270949364                 268435456
270949364     137647948    408597312                 402653184
408597312     136785886    545383198                 536870912
545383198     124824992    670208190                 671088640
671088640     139463692    810552332 -> alignment    805306368
810552332     137161048    947713380                 939524096
947713380     128972798    1076686178                1073741824
1076686178    138875458    1215561636                1207959552
1215561636    128142960    1343704596                1342177280
1343704596    138192915    1481897511                1476395008
1481897511    1149147      1483046658                1610612736
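
(For reference, a listing like the one above can be produced by reading the file footer with parquet-mr. The sketch below is approximate; it assumes the ParquetFileReader/BlockMetaData APIs and derives the "end of hdfs block" column from the block size reported by the file system.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class PrintRowGroupLayout {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);
    long hdfsBlockSize = fs.getFileStatus(path).getBlockSize();

    ParquetMetadata footer = ParquetFileReader.readFooter(conf, path);
    System.out.println("offset\tRG size\toffset + RG size\tend of hdfs block");
    for (BlockMetaData rg : footer.getBlocks()) {
      long offset = rg.getStartingPos();    // first byte of the row group
      long size = rg.getCompressedSize();   // compressed size on disk
      // end of the HDFS block in which the row group starts
      long blockEnd = (offset / hdfsBlockSize + 1) * hdfsBlockSize;
      System.out.println(offset + "\t" + size + "\t" + (offset + size) + "\t" + blockEnd);
    }
  }
}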

Ideally, we would want each row group to land on one and only one HDFS block. So I was expecting each row group to be a little less than 134217728 bytes in size, fit into a single HDFS block, and then be padded to the end of the block so that the next row group starts on the next block.
But what I see is that many row groups are actually bigger than 134217728, and there is only one instance of padding to realign a row group to an HDFS block boundary (see where I tagged "alignment" above).
Even after this realignment, the next row group is still larger than 134217728, which again makes the following row group sit on 2 blocks. So basically, in this example, all row groups are sitting on 2 blocks, even though the user (me) intends to have each row group on one HDFS block (hence making the row group size and HDFS block size equal).

So the question: is there any attempt in Parquet to achieve row group to HDFS block optimization, so that each row group sits in one and only one HDFS block (like the ORC stripe padding implemented in Hive 0.12)?
If yes, am I configuring something wrong to get the desired behavior?

Thanks in advance for the help,
Eric Owhadi


Re: Rowgroup to hdfs block mapping / data locality

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Sounds like a reasonable explanation to me. The min and max for record
count checks are also hard-coded, so you could be hitting the min. We've
seen this occasionally with data that doesn't compress well, but we
generally don't mind because S3 doesn't need block alignment. I think
there's a lot of improvement that could be done here to make sure row
groups are correctly sized for HDFS. It would be great to hear your
thoughts on it.

rb

On Thu, Dec 7, 2017 at 10:13 AM, Eric Owhadi <er...@esgyn.com> wrote:

> Very interesting, thanks Zoltan.
> Now knowing what you mention, I am wondering if the code implements a
> mechanism to minimize the impact of the occasional miss where we generate
> a row group larger than the block size:
> Do we propagate the error to the next row group, so that the maximum row
> group size we target just for the next row group is equal to
> max_row_group - spill_over from the previous row group, so that we try to
> catch up with a block boundary? If not, the effect of one overshoot
> cascades through pretty much the whole file. Maybe this is what is
> happening in my case?
> Regards
> Eric
>
> -----Original Message-----
> From: Zoltan Ivanfi [mailto:zi@cloudera.com]
> Sent: Thursday, December 7, 2017 11:38 AM
> To: dev@parquet.apache.org
> Subject: Re: Rowgroup to hdfs block mapping / data locality
>
> Hi Eric,
>
> The row group size is supposed to be an upper bound, but occasionally may
> be exceeded, because the checks for reaching the row group size only happen
> every once in a while. Based on the first few records the code makes an
> estimation for how much uncompressed data will result in the desired
> compressed size and schedules the next check to be halfway through the rest
> of the estimated remaining part. This may cause problems with skewed data
> distribution. This logic is located in checkBlockSizeReached().
> <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L135>
>
> Parquet also has a limit for the padding size, called
> "parquet.writer.max-padding" in the configuration. Its default value is 0
> in Parquet 1.9.0 and 8MB in the latest (unreleased) master.
>
> Br,
>
> Zoltan
>
> On Thu, Dec 7, 2017 at 6:18 PM Eric Owhadi <er...@esgyn.com> wrote:
>
> > Hello Parquet-eers,
> > I am studying Parquet's behavior in terms of row group to HDFS block
> > mapping, and I found some unexpected behavior (at least I did not
> > expect it ☺).
> > Here is a printout of the layout of a Parquet file with 12 row groups
> > on HDFS, with a block size of 134217728 and the row group size set to
> > 134217728 at write time using Hive.
> >
> > offset        RG size      offset + RG size          end of hdfs block
> > 4             141389243    141389247                 134217728
> > 141389247     129560117    270949364                 268435456
> > 270949364     137647948    408597312                 402653184
> > 408597312     136785886    545383198                 536870912
> > 545383198     124824992    670208190                 671088640
> > 671088640     139463692    810552332 -> alignment    805306368
> > 810552332     137161048    947713380                 939524096
> > 947713380     128972798    1076686178                1073741824
> > 1076686178    138875458    1215561636                1207959552
> > 1215561636    128142960    1343704596                1342177280
> > 1343704596    138192915    1481897511                1476395008
> > 1481897511    1149147      1483046658                1610612736
> >
> > Ideally, we would want each row group to land on one and only one HDFS
> > block. So I was expecting each row group to be a little less than
> > 134217728 bytes in size, fit into a single HDFS block, and then be
> > padded to the end of the block so that the next row group starts on the
> > next block.
> > But what I see is that many row groups are actually bigger than
> > 134217728, and there is only one instance of padding to realign a row
> > group to an HDFS block boundary (see where I tagged "alignment" above).
> > Even after this realignment, the next row group is still larger than
> > 134217728, which again makes the following row group sit on 2 blocks.
> > So basically, in this example, all row groups are sitting on 2 blocks,
> > even though the user (me) intends to have each row group on one HDFS
> > block (hence making the row group size and HDFS block size equal).
> >
> > So the question: is there any attempt in Parquet to achieve row group
> > to HDFS block optimization, so that each row group sits in one and only
> > one HDFS block (like the ORC stripe padding implemented in Hive 0.12)?
> > If yes, am I configuring something wrong to get the desired behavior?
> >
> > Thanks in advance for the help,
> > Eric Owhadi
> >
> >
>



-- 
Ryan Blue
Software Engineer
Netflix

RE: Rowgroup to hdfs block mapping / data locality

Posted by Eric Owhadi <er...@esgyn.com>.
Very interesting, thanks Zoltan.
Now knowing what you mention, I am wondering if the code implements a mechanism to minimize the impact of the occasional miss where we generate a row group larger than the block size:
Do we propagate the error to the next row group, so that the maximum row group size we target just for the next row group is equal to max_row_group - spill_over from the previous row group, so that we try to catch up with a block boundary? If not, the effect of one overshoot cascades through pretty much the whole file. Maybe this is what is happening in my case?
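
(To make the carry-over idea concrete, here is a hypothetical helper; it illustrates the proposal only, it is not something parquet-mr does today, and the names are made up.)

// Shrink the next row group target by the previous group's spill-over so that
// the next row group can end on an HDFS block boundary again.
static long nextRowGroupTarget(long previousRowGroupEnd, long hdfsBlockSize, long maxRowGroupSize) {
  long spillOver = previousRowGroupEnd % hdfsBlockSize;  // bytes written past the last block boundary
  return spillOver == 0 ? maxRowGroupSize : maxRowGroupSize - spillOver;
}

// With the file above: nextRowGroupTarget(141389247, 134217728, 134217728) = 127046209,
// so the second row group would be asked to stop about 127MB in and end exactly at
// offset 268435456, back on a block boundary.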
Regards
Eric 

-----Original Message-----
From: Zoltan Ivanfi [mailto:zi@cloudera.com] 
Sent: Thursday, December 7, 2017 11:38 AM
To: dev@parquet.apache.org
Subject: Re: Rowgroup to hdfs block mapping / data locality

Hi Eric,

The row group size is supposed to be an upper bound, but occasionally may be exceeded, because the checks for reaching the row group size only happen every once in a while. Based on the first few records the code makes an estimation for how much uncompressed data will result in the desired compressed size and schedules the next check to be halfway through the rest of the estimated remaining part. This may cause problems with skewed data distribution. This logic is located in checkBlockSizeReached().
<https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L135>

Parquet also has a limit for the padding size, called "parquet.writer.max-padding" in the configuration. Its default value is 0 in Parquet 1.9.0 and 8MB in the latest (unreleased) master.

Br,

Zoltan

On Thu, Dec 7, 2017 at 6:18 PM Eric Owhadi <er...@esgyn.com> wrote:

> Hello Parquet-eers,
> I am studying Parquet's behavior in terms of row group to HDFS block
> mapping, and I found some unexpected behavior (at least I did not expect it ☺).
> Here is a printout of the layout of a Parquet file with 12 row groups on
> HDFS, with a block size of 134217728 and the row group size set to 134217728
> at write time using Hive.
>
> offset        RG size      offset + RG size          end of hdfs block
> 4             141389243    141389247                 134217728
> 141389247     129560117    270949364                 268435456
> 270949364     137647948    408597312                 402653184
> 408597312     136785886    545383198                 536870912
> 545383198     124824992    670208190                 671088640
> 671088640     139463692    810552332 -> alignment    805306368
> 810552332     137161048    947713380                 939524096
> 947713380     128972798    1076686178                1073741824
> 1076686178    138875458    1215561636                1207959552
> 1215561636    128142960    1343704596                1342177280
> 1343704596    138192915    1481897511                1476395008
> 1481897511    1149147      1483046658                1610612736
>
> Ideally, we would want each row group to land on one and only one HDFS block.
> So I was expecting each row group to be a little less than 134217728 bytes in
> size, fit into a single HDFS block, and then be padded to the end of the block
> so that the next row group starts on the next block.
> But what I see is that many row groups are actually bigger than 134217728,
> and there is only one instance of padding to realign a row group to an HDFS
> block boundary (see where I tagged "alignment" above).
> Even after this realignment, the next row group is still larger than
> 134217728, which again makes the following row group sit on 2 blocks. So
> basically, in this example, all row groups are sitting on 2 blocks, even
> though the user (me) intends to have each row group on one HDFS block (hence
> making the row group size and HDFS block size equal).
>
> So the question: is there any attempt in Parquet to achieve row group to
> HDFS block optimization, so that each row group sits in one and only one
> HDFS block (like the ORC stripe padding implemented in Hive 0.12)?
> If yes, am I configuring something wrong to get the desired behavior?
>
> Thanks in advance for the help,
> Eric Owhadi
>
>

Re: Rowgroup to hdfs block mapping / data locality

Posted by Zoltan Ivanfi <zi...@cloudera.com>.
Hi Eric,

The row group size is supposed to be an upper bound, but occasionally may
be exceeded, because the checks for reaching the row group size only happen
every once in a while. Based on the first few records the code makes an
estimation for how much uncompressed data will result in the desired
compressed size and schedules the next check to be halfway through the rest
of the estimated remaining part. This may cause problems with skewed data
distribution. This logic is located in checkBlockSizeReached().
<https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L135>
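
(A simplified sketch of that check-scheduling estimate, for illustration; the names and the hard-coded bounds are approximations of the linked InternalParquetRecordWriter code, not an exact copy.)

// The writer only measures its buffered size every once in a while. After a check
// that does NOT trigger a flush, it estimates the total record count at which the
// target row group size will be reached and schedules the next check halfway there,
// clamped by hard-coded record counts.
static long nextSizeCheck(long recordCount, long bufferedSize, long targetRowGroupSize) {
  final long MIN_RECORDS_FOR_CHECK = 100;    // hard-coded lower bound (approximate value)
  final long MAX_RECORDS_FOR_CHECK = 10000;  // hard-coded upper bound (approximate value)
  long avgRecordSize = bufferedSize / recordCount;
  long recordsToFillTarget = targetRowGroupSize / avgRecordSize;  // estimated total records at the target
  return Math.min(
      Math.max(MIN_RECORDS_FOR_CHECK, (recordCount + recordsToFillTarget) / 2),
      recordCount + MAX_RECORDS_FOR_CHECK);
}

// Everything written between two such checks can overshoot the target, which is how a
// row group ends up several MB larger than the configured 134217728.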

Parquet also has a limit for the padding size, called
"parquet.writer.max-padding" in the configuration. Its default value is 0
in Parquet 1.9.0 and 8MB in the latest (unreleased) master.
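
(A minimal sketch of setting these knobs through the Hadoop Configuration; "parquet.block.size" is the row group target in parquet-mr, and whether a given Hive version forwards these properties to the Parquet writer is an assumption to verify.)

import org.apache.hadoop.conf.Configuration;

public class ParquetAlignmentConfig {
  public static Configuration configure(Configuration conf) {
    conf.setLong("dfs.blocksize", 134217728L);           // HDFS block size for the output file
    conf.setInt("parquet.block.size", 134217728);        // row group target (an upper bound that can be overshot)
    conf.setInt("parquet.writer.max-padding", 8388608);  // allow up to 8MB of padding to realign to a block boundary
    return conf;
  }
}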

Br,

Zoltan

On Thu, Dec 7, 2017 at 6:18 PM Eric Owhadi <er...@esgyn.com> wrote:

> Hello Parquet-eers,
> I am studying Parquet's behavior in terms of row group to HDFS block mapping,
> and I found some unexpected behavior (at least I did not expect it ☺).
> Here is a printout of the layout of a Parquet file with 12 row groups on HDFS,
> with a block size of 134217728 and the row group size set to 134217728 at
> write time using Hive.
>
> offset        RG size      offset + RG size          end of hdfs block
> 4             141389243    141389247                 134217728
> 141389247     129560117    270949364                 268435456
> 270949364     137647948    408597312                 402653184
> 408597312     136785886    545383198                 536870912
> 545383198     124824992    670208190                 671088640
> 671088640     139463692    810552332 -> alignment    805306368
> 810552332     137161048    947713380                 939524096
> 947713380     128972798    1076686178                1073741824
> 1076686178    138875458    1215561636                1207959552
> 1215561636    128142960    1343704596                1342177280
> 1343704596    138192915    1481897511                1476395008
> 1481897511    1149147      1483046658                1610612736
>
> Ideally, we would want each row group to land on one and only one HDFS block.
> So I was expecting each row group to be a little less than 134217728 bytes in
> size, fit into a single HDFS block, and then be padded to the end of the block
> so that the next row group starts on the next block.
> But what I see is that many row groups are actually bigger than 134217728,
> and there is only one instance of padding to realign a row group to an HDFS
> block boundary (see where I tagged "alignment" above).
> Even after this realignment, the next row group is still larger than
> 134217728, which again makes the following row group sit on 2 blocks. So
> basically, in this example, all row groups are sitting on 2 blocks, even
> though the user (me) intends to have each row group on one HDFS block (hence
> making the row group size and HDFS block size equal).
>
> So the question: is there any attempt in Parquet to achieve row group to
> HDFS block optimization, so that each row group sits in one and only one
> HDFS block (like the ORC stripe padding implemented in Hive 0.12)?
> If yes, am I configuring something wrong to get the desired behavior?
>
> Thanks in advance for the help,
> Eric Owhadi
>
>