Posted to dev@parquet.apache.org by Brian Bowman <Br...@sas.com> on 2019/05/22 15:40:26 UTC

Parquet File Naming Convention Standards

All,

Here is an example .parquet data set saved using pySpark, where the following files are members of the directory “foo.parquet”:

-rw-r--r--    1 sasbpb  r&d        8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r--    1 sasbpb  r&d    25632 Mar 26 12:10 .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    25356 Mar 26 12:10 .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    26300 Mar 26 12:10 .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    23728 Mar 26 12:10 .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d        0 Mar 26 12:10 _SUCCESS
-rw-r--r--    1 sasbpb  r&d  3279617 Mar 26 12:10 part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3244105 Mar 26 12:10 part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3365039 Mar 26 12:10 part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3035960 Mar 26 12:10 part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet


Questions:

  1.  Is this the “standard” for creating/saving a .parquet data set?
  2.  It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a UUID.  Is the format:
     part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc
an established convention?  Is this documented somewhere?
  3.  Is there a C++ class to create the CRC?


Thanks,


Brian

RE: Parquet File Naming Convention Standards

Posted by "Lee, David" <Da...@blackrock.com>.
I've tried one row group per parquet file / block, and I ran into a couple of problems. Some observations:

1. A single row group would contain 30 million rows x 10 columns of data, which requires a lot more memory to write the file. Saving 10 row groups one at a time into a single parquet file cuts the maximum in-memory footprint down to 3 million rows.

2. Dictionary encoding only works if the dictionary values do not exceed the reserved space in a parquet file. Each row group has its own reserved space for dictionary values. Once you exceed the reserved space, dictionary encoding isn't used, which can lead to slower query performance and increase the overall storage needed by 10% or more.

3. I generally try to store 30 million cells of data per row group: 3 million rows x 10 columns, or 10 million rows x 3 columns, etc.

-----Original Message-----
From: Tim Armstrong <ta...@cloudera.com.INVALID> 
Sent: Wednesday, May 22, 2019 12:27 PM
To: Parquet Dev <de...@parquet.apache.org>
Subject: Re: Parquet File Naming Convention Standards

External Email: Use caution with links and attachments


Not reusing file names is generally a good idea - there are a bunch of interesting consistency issues, particularly on object stores, if you reuse file paths. This has come up for us with things like INSERT OVERWRITE in Hive, which tends to generate the same file names.

I think there's an interesting set of discussions to be had around best practices for file sizes and row group sizes.

One point is that a lot of big data frameworks schedule parallel work based on filesystem metadata only (i.e. file sizes and block sizes, if the filesystem has a concept of a block). If you have arbitrary parquet files this can break down in various ways - e.g. if you have a 1GB file, you have to guess what a good way to divide up the processing is. If there are fewer row groups than expected, you'll get skew, and if there are more, you'll lose out on parallelism. HDFS blocks were often a good way to do this, since a lot of writers aim for one row group per block, but Parquet files often come from a variety of sources and get munged in different ways, so the heuristic falls over in various ways in some applications. It's somewhat worse on object stores like S3, where there isn't a concept of a block, just whatever the writer and reader have configured - you really ideally want reader and writer block sizes to line up, but coordinating can be difficult for some workflows.

Working on Impala, I'm a bit biased towards larger blocks, because of the scheduling problems and also because of the extra overhead added with row groups - we end up needing to do extra I/O operations per row group, adding overhead (some of the overhead is inherent because the data you're reading is more fragmented; some of it is just our implementation).

On Wed, May 22, 2019 at 11:55 AM Brian Bowman <Br...@sas.com> wrote:

>  Thanks for the info!
>
> HDFS is only one of many storage platforms (distributed or otherwise) 
> that SAS supports.  In general larger physical files (e.g. 100MB to 
> 1GB) with multiple RowGroups are also a good thing for our usage 
> cases.  I'm working to get our Parquet (C to C++ via libparquet.so) writer to do this.
>
> -Brian
>
> On 5/22/19, 1:21 PM, "Lee, David" <Da...@blackrock.com> wrote:
>
>     EXTERNAL
>
>     I'm not a big fan of this convention which is a Spark convention..
>
>     A. The files should have at least "foo" in the name. Using PyArrow 
> I would create these files as foo.1.parquet, foo.2.parquet, etc..
>     B. These files are around 3 megs each. For HDFS storage, files 
> should be sized to match the HDFS blocksize which is usually set at 
> 128 megs
> (default) or 256 megs, 512 megs, 1 gig, etc..
>
>     https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>
>     I usually take small parquet files and save them as parquet row 
> groups in a larger parquet file to match the HDFS blocksize.
>
>     -----Original Message-----
>     From: Brian Bowman <Br...@sas.com>
>     Sent: Wednesday, May 22, 2019 8:40 AM
>     To: dev@parquet.apache.org
>     Subject: Parquet File Naming Convention Standards
>
>     External Email: Use caution with links and attachments
>
>
>     All,
>
>     Here is an example .parquet data set saved using pySpark where the 
> following files are members of directory: “foo.parquet”:
>
>     -rw-r--r--    1 sasbpb  r&d        8 Mar 26 12:10 ._SUCCESS.crc
>     -rw-r--r--    1 sasbpb  r&d    25632 Mar 26 12:10
> .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    25356 Mar 26 12:10
> .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    26300 Mar 26 12:10
> .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    23728 Mar 26 12:10
> .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d        0 Mar 26 12:10 _SUCCESS
>     -rw-r--r--    1 sasbpb  r&d  3279617 Mar 26 12:10
> part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3244105 Mar 26 12:10
> part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3365039 Mar 26 12:10
> part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3035960 Mar 26 12:10
> part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>
>
>     Questions:
>
>       1.  Is this the “standard” for creating/saving a .parquet data set?
>       2.  It appears that “84abe50-a92b-4b2b-b011-30990891fb83” is a 
> UUID.  Is the format:
>          part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc 
> an established convention?  Is this documented somewhere?
>       3.  Is there a C++ class to create the CRC?
>
>
>     Thanks,
>
>
>     Brian
>
>
>     This message may contain information that is confidential or 
> privileged. If you are not the intended recipient, please advise the 
> sender immediately and delete this message. See 
> http://www.blackrock.com/corporate/compliance/email-disclaimers for 
> further information.  Please refer to 
> http://www.blackrock.com/corporate/compliance/privacy-policy for more 
> information about BlackRock’s Privacy Policy.
>
>     For a list of BlackRock's office addresses worldwide, see 
> http://www.blackrock.com/corporate/about-us/contacts-locations.
>
>     © 2019 BlackRock, Inc. All rights reserved.
>
>
>

Re: Parquet File Naming Convention Standards

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
Not reusing file names is generally a good idea - there are a bunch of
interesting consistency issues, particularly on object stores, if you reuse
file paths. This has come up for us with things like INSERT OVERWRITE in
Hive, which tends to generate the same file names.

I think there's an interesting set of discussions to be had around best
practices for file sizes and row group sizes.

One point is that a lot of big data frameworks schedule parallel work based
on filesystem metadata only (i.e. file sizes and block sizes, if the
filesystem has a concept of a block). If you have arbitrary parquet files
this can break down in various ways - e.g. if you have a 1GB file, you have
to guess what a good way to divide up the processing is. If there are fewer
row groups than expected, you'll get skew and if there are more you'll lose
out on parallelism. HDFS blocks were often a good way to do this, since a
lot of writers aim for one row group per block, but Parquet files often
come from a variety of sources and get munged in different ways, so the
heuristic falls over in various ways in some applications. It's somewhat
worse on object stores like S3, where there isn't a concept of a block,
just whatever the writer and reader have configured - you really ideally
want reader and writer block sizes to line up, but coordinating can be
difficult for some workflows.

Working on Impala, I'm a bit biased towards larger blocks, because of the
scheduling problems and also because of the extra overhead added with row
groups - we end up needing to do extra I/O operations per row group, adding
overhead (some of the overhead is inherent because the data you're reading
is more fragmented; some of it is just our implementation).

On Wed, May 22, 2019 at 11:55 AM Brian Bowman <Br...@sas.com> wrote:

>  Thanks for the info!
>
> HDFS is only one of many storage platforms (distributed or otherwise) that
> SAS supports.  In general larger physical files (e.g. 100MB to 1GB) with
> multiple RowGroups are also a good thing for our usage cases.  I'm working
> to get our Parquet (C to C++ via libparquet.so) writer to do this.
>
> -Brian
>
> On 5/22/19, 1:21 PM, "Lee, David" <Da...@blackrock.com> wrote:
>
>     EXTERNAL
>
>     I'm not a big fan of this convention which is a Spark convention..
>
>     A. The files should have at least "foo" in the name. Using PyArrow I
> would create these files as foo.1.parquet, foo.2.parquet, etc..
>     B. These files are around 3 megs each. For HDFS storage, files should
> be sized to match the HDFS blocksize which is usually set at 128 megs
> (default) or 256 megs, 512 megs, 1 gig, etc..
>
>     https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>
>     I usually take small parquet files and save them as parquet row groups
> in a larger parquet file to match the HDFS blocksize.
>
>     -----Original Message-----
>     From: Brian Bowman <Br...@sas.com>
>     Sent: Wednesday, May 22, 2019 8:40 AM
>     To: dev@parquet.apache.org
>     Subject: Parquet File Naming Convention Standards
>
>     External Email: Use caution with links and attachments
>
>
>     All,
>
>     Here is an example .parquet data set saved using pySpark where the
> following files are members of directory: “foo.parquet”:
>
>     -rw-r--r--    1 sasbpb  r&d        8 Mar 26 12:10 ._SUCCESS.crc
>     -rw-r--r--    1 sasbpb  r&d    25632 Mar 26 12:10
> .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    25356 Mar 26 12:10
> .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    26300 Mar 26 12:10
> .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d    23728 Mar 26 12:10
> .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
>     -rw-r--r--    1 sasbpb  r&d        0 Mar 26 12:10 _SUCCESS
>     -rw-r--r--    1 sasbpb  r&d  3279617 Mar 26 12:10
> part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3244105 Mar 26 12:10
> part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3365039 Mar 26 12:10
> part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>     -rw-r--r--    1 sasbpb  r&d  3035960 Mar 26 12:10
> part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
>
>
>     Questions:
>
>       1.  Is this the “standard” for creating/saving a .parquet data set?
>       2.  It appears that “84abe50-a92b-4b2b-b011-30990891fb83” is a
> UUID.  Is the format:
>          part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an
> established convention?  Is this documented somewhere?
>       3.  Is there a C++ class to create the CRC?
>
>
>     Thanks,
>
>
>     Brian
>
>
>
>
>

Re: Parquet File Naming Convention Standards

Posted by Brian Bowman <Br...@sas.com>.
 Thanks for the info!

HDFS is only one of many storage platforms (distributed or otherwise) that SAS supports.  In general, larger physical files (e.g. 100MB to 1GB) with multiple RowGroups are also a good thing for our use cases.  I'm working to get our Parquet (C to C++ via libparquet.so) writer to do this.

-Brian

On 5/22/19, 1:21 PM, "Lee, David" <Da...@blackrock.com> wrote:

    EXTERNAL
    
    I'm not a big fan of this convention which is a Spark convention..
    
    A. The files should have at least "foo" in the name. Using PyArrow I would create these files as foo.1.parquet, foo.2.parquet, etc..
    B. These files are around 3 megs each. For HDFS storage, files should be sized to match the HDFS blocksize which is usually set at 128 megs (default) or 256 megs, 512 megs, 1 gig, etc..
    
    https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
    
    I usually take small parquet files and save them as parquet row groups in a larger parquet file to match the HDFS blocksize.
    
    -----Original Message-----
    From: Brian Bowman <Br...@sas.com>
    Sent: Wednesday, May 22, 2019 8:40 AM
    To: dev@parquet.apache.org
    Subject: Parquet File Naming Convention Standards
    
    External Email: Use caution with links and attachments
    
    
    All,
    
    Here is an example .parquet data set saved using pySpark where the following files are members of directory: “foo.parquet”:
    
    -rw-r--r--    1 sasbpb  r&d        8 Mar 26 12:10 ._SUCCESS.crc
    -rw-r--r--    1 sasbpb  r&d    25632 Mar 26 12:10 .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
    -rw-r--r--    1 sasbpb  r&d    25356 Mar 26 12:10 .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
    -rw-r--r--    1 sasbpb  r&d    26300 Mar 26 12:10 .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
    -rw-r--r--    1 sasbpb  r&d    23728 Mar 26 12:10 .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
    -rw-r--r--    1 sasbpb  r&d        0 Mar 26 12:10 _SUCCESS
    -rw-r--r--    1 sasbpb  r&d  3279617 Mar 26 12:10 part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
    -rw-r--r--    1 sasbpb  r&d  3244105 Mar 26 12:10 part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
    -rw-r--r--    1 sasbpb  r&d  3365039 Mar 26 12:10 part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
    -rw-r--r--    1 sasbpb  r&d  3035960 Mar 26 12:10 part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
    
    
    Questions:
    
      1.  Is this the “standard” for creating/saving a .parquet data set?
      2.  It appears that “84abe50-a92b-4b2b-b011-30990891fb83” is a UUID.  Is the format:
         part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an established convention?  Is this documented somewhere?
      3.  Is there a C++ class to create the CRC?
    
    
    Thanks,
    
    
    Brian
    
    
    


RE: Parquet File Naming Convention Standards

Posted by "Lee, David" <Da...@blackrock.com>.
I'm not a big fan of this convention, which is a Spark convention.

A. The files should have at least "foo" in the name. Using PyArrow I would create these files as foo.1.parquet, foo.2.parquet, etc.
B. These files are around 3 megs each. For HDFS storage, files should be sized to match the HDFS block size, which is usually set at 128 megs (default) or 256 megs, 512 megs, 1 gig, etc.

https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

I usually take small parquet files and save them as parquet row groups in a larger parquet file to match the HDFS blocksize.

-----Original Message-----
From: Brian Bowman <Br...@sas.com> 
Sent: Wednesday, May 22, 2019 8:40 AM
To: dev@parquet.apache.org
Subject: Parquet File Naming Convention Standards 

External Email: Use caution with links and attachments


All,

Here is an example .parquet data set saved using pySpark where the following files are members of directory: “foo.parquet”:

-rw-r--r--    1 sasbpb  r&d        8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r--    1 sasbpb  r&d    25632 Mar 26 12:10 .part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    25356 Mar 26 12:10 .part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    26300 Mar 26 12:10 .part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d    23728 Mar 26 12:10 .part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--    1 sasbpb  r&d        0 Mar 26 12:10 _SUCCESS
-rw-r--r--    1 sasbpb  r&d  3279617 Mar 26 12:10 part-00000-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3244105 Mar 26 12:10 part-00001-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3365039 Mar 26 12:10 part-00002-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--    1 sasbpb  r&d  3035960 Mar 26 12:10 part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet


Questions:

  1.  Is this the “standard” for creating/saving a .parquet data set?
  2.  It appears that “84abe50-a92b-4b2b-b011-30990891fb83” is a UUID.  Is the format:
     part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an established convention?  Is this documented somewhere?
  3.  Is there a C++ class to create the CRC?


Thanks,


Brian



Re: Parquet File Naming Convention Standards

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Replies inline.

On Wed, May 22, 2019 at 8:40 AM Brian Bowman <Br...@sas.com> wrote:

Questions:
>   1.  Is this the “standard” for creating/saving a .parquet data set?
>
File names are specific to the application that creates them. Iceberg, for
example, adds the task attempt number to ensure that no attempts try to
write to the same location. Some engines like Spark also include bucket
information in the file name.
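For illustration only, the Spark-style name seen in the directory listing could be assembled as below; `part_file_name` is a hypothetical helper, not an API from Spark or Parquet, and engines vary in what they append:

```python
# Hypothetical sketch of the Spark-style part-file name pattern:
#   part-<task#>-<job UUID>-c<counter>.parquet
# A fresh UUID per job means re-runs never reuse a file path.
import re
import uuid

def part_file_name(task_index: int, job_uuid: uuid.UUID, counter: int = 0) -> str:
    return f"part-{task_index:05d}-{job_uuid}-c{counter:03d}.parquet"

name = part_file_name(3, uuid.UUID("b84abe50-a92b-4b2b-b011-30990891fb83"))
print(name)  # part-00003-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet

assert re.fullmatch(r"part-\d{5}-[0-9a-f-]{36}-c\d{3}\.parquet", name)
```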

  2.  It appears that “84abe50-a92b-4b2b-b011-30990891fb83” is a UUID.  Is
> the format:
>      part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc
> an established convention?  Is this documented somewhere?
>
The .crc file is created by the ChecksumFileSystem. Its name is always
.(data-file-name).crc
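A sketch of that naming rule; the `crc_sidecar_name` helper is hypothetical, and the per-chunk CRC-32 below is only illustrative. Hadoop's actual .crc file layout (header, bytes-per-checksum field, checksum kind) is not reproduced here:

```python
# Sketch of the ChecksumFileSystem naming rule: the checksum sidecar for
# <name> lives in the same directory and is named .<name>.crc.
import os
import zlib

def crc_sidecar_name(path: str) -> str:
    directory, base = os.path.split(path)
    return os.path.join(directory, f".{base}.crc")

def chunk_crcs(data: bytes, bytes_per_checksum: int = 512) -> list:
    # One CRC-32 per fixed-size chunk; Hadoop checksums chunks similarly,
    # but its on-disk format and defaults differ.
    return [zlib.crc32(data[i:i + bytes_per_checksum])
            for i in range(0, len(data), bytes_per_checksum)]

print(crc_sidecar_name("part-00000.parquet"))  # .part-00000.parquet.crc
```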

  3.  Is there a C++ class to create the CRC?
>
There is a C++ implementation of HDFS, but I don’t know if there is a local
FS that supports .crc files in C++.
-- 
Ryan Blue
Software Engineer
Netflix