Posted to dev@parquet.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2019/05/22 15:57:58 UTC

Re: Parquet File Naming Convention Standards

Replies inline.

On Wed, May 22, 2019 at 8:40 AM Brian Bowman <Brian.Bowman@sas.com> wrote:

> Questions:
>   1.  Is this the “standard” for creating/saving a .parquet data set?
>
File names are specific to the application that creates them. Iceberg, for
example, adds the task attempt number to ensure that no attempts try to
write to the same location. Some engines like Spark also include bucket
information in the file name.
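
To make that concrete, here is a minimal sketch of how a writer might build a
collision-free name. The layout (part-<task>-<attempt>-<job id>.parquet) and
the function and parameter names are hypothetical, chosen only for
illustration; this is not the actual format used by Iceberg or Spark.

```python
import uuid


def data_file_name(task_id: int, attempt: int, job_id: str) -> str:
    """Build a data file name (hypothetical scheme, for illustration only)."""
    # The job ID keeps files from separate jobs apart; the attempt number
    # keeps a retried task from writing to the same location as the
    # earlier attempt.
    return f"part-{task_id:05d}-{attempt}-{job_id}.parquet"


job_id = str(uuid.uuid4())  # one ID shared by every task in the job
first_try = data_file_name(3, 0, job_id)
retry = data_file_name(3, 1, job_id)
```

Here first_try and retry differ only in the attempt field, so a retried task
never overwrites the output of the earlier attempt.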

>   2.  It appears that “84abe50-a92b-4b2b-b011-30990891fb83” is a UUID.  Is
> the format:
>      part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc
> an established convention?  Is this documented somewhere?
>
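The .crc file is created by Hadoop's ChecksumFileSystem. It sits in the same
directory as the data file, and its name is always the data file's name with
a leading dot and a .crc suffix: .(data-file-name).crc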
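
The naming rule can be sketched in a few lines. This only derives the
checksum file's path from the data file's path; it does not compute or verify
a checksum, and the function name is made up for this example.

```python
from pathlib import PurePosixPath


def checksum_file_for(data_file: str) -> PurePosixPath:
    """Return the .(data-file-name).crc path that sits next to data_file."""
    path = PurePosixPath(data_file)
    # Same directory, with a leading dot and a ".crc" suffix on the name.
    return path.with_name(f".{path.name}.crc")


crc_path = checksum_file_for("/data/out/part-00000.parquet")
```

With that input, crc_path is /data/out/.part-00000.parquet.crc.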

>   3.  Is there a C++ class to create the CRC?
>
There is a C++ implementation of HDFS, but I don’t know if there is a local
FS that supports .crc files in C++.
-- 
Ryan Blue
Software Engineer
Netflix