You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by "Hailu, Andreas" <An...@gs.com> on 2019/07/01 20:15:07 UTC

File Naming Pattern from HadoopOutputFormat

Hello Flink team,

I'm writing Avro and Parquet files to HDFS, and I've would like to include a UUID as a part of the file name.

Our files in HDFS currently follow this pattern:

tmp-r-00001.snappy.parquet
tmp-r-00002.snappy.parquet
...

I'm using a custom output format which extends a RichOutputFormat - is this something which is natively supported? If so, could you please recommend how this could be done, or share the relevant document?

Best,
Andreas

________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>

RE: Re:RE: Re:Re: File Naming Pattern from HadoopOutputFormat

Posted by "Hailu, Andreas" <An...@gs.com>.

Very well - thank you both.

// ah

From: Haibo Sun <su...@163.com>
Sent: Wednesday, July 3, 2019 9:37 PM
To: Hailu, Andreas [Tech] <An...@ny.email.gs.com>
Cc: Yitzchak Lieberman <yi...@sentinelone.com>; user@flink.apache.org
Subject: Re:RE: Re:Re: File Naming Pattern from HadoopOutputFormat

Hi, Andreas

I'm glad you have had a solution. If you're interested in option 2 I talked about, you can follow up on the progress of the issue (https://issues.apache.org/jira/browse/FLINK-12573<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D12573&d=DwQGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=PjxKTfWKbjbcxTz3PnGvrR-7MEYlxraUaI1tTTQFqEw&s=pYKp2r7d5fGK3otU5dfUAmTaZf2cjeuVUMCmjupz8Ik&e=>) that Yitzchak said by watching it.

Best,
Haibo

At 2019-07-03 21:11:44, "Hailu, Andreas" <An...@gs.com>> wrote:

Hi Haibo, Yitzchak, thanks for getting back to me.

The pattern I chose to use which worked was to extend the HadoopOutputFormat class, override the open() method, and modify the "mapreduce.output.basename" configuration property to match my desired file naming structure.

// ah

From: Haibo Sun <su...@163.com>>
Sent: Tuesday, July 2, 2019 5:57 AM
To: Yitzchak Lieberman <yi...@sentinelone.com>>
Cc: Hailu, Andreas [Tech] <An...@ny.email.gs.com>>; user@flink.apache.org<ma...@flink.apache.org>
Subject: Re:Re: File Naming Pattern from HadoopOutputFormat

Hi, Andreas

You are right. To meet this requirement, Flink should need to expose a interface to allow customizing the filename.

Best,
Haibo

At 2019-07-02 16:33:44, "Yitzchak Lieberman" <yi...@sentinelone.com>> wrote:
regarding option 2 for parquet:
implementing bucket assigner won't set the file name as getBucketId() defined the directory for the files in case of partitioning the data, for example:
<root dir>/day=20190101/part-1-1
there is an open issue for that: https://issues.apache.org/jira/browse/FLINK-12573<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D12573&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=zGL975-dtGwduiOtS--MtzcwNbM6ti3ziA85_ki-Ql8&e=>

On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun <su...@163.com>> wrote:
Hi, Andreas

I think the following things may be what you want.

1. For writing Avro, I think you can extend AvroOutputFormat and override the  getDirectoryFileName() method to customize a file name, as shown below.
The javadoc of AvroOutputFormat: https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html<https://urldefense.proofpoint.com/v2/url?u=https-3A__ci.apache.org_projects_flink_flink-2Ddocs-2Drelease-2D1.8_api_java_org_apache_flink_formats_avro_AvroOutputFormat.html&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=tgLhwjHX0wHWgUMSSmNEOUmpatTF5N7JYfUEtYzQyf4&e=>

          public static class CustomAvroOutputFormat extends AvroOutputFormat {

                              public CustomAvroOutputFormat(Path filePath, Class type) {

                                                   super(filePath, type);

                              }

                              public CustomAvroOutputFormat(Class type) {

                                                   super(type);

                              }

                              @Override

                              public void open(int taskNumber, int numTasks) throws IOException {

                                                   this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);

                                                   super.open(taskNumber, numTasks);

                              }

                              @Override

                              protected String getDirectoryFileName(int taskNumber) {

                                                   // returns a custom filename

                                                   return null;

                              }

          }

2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a class that implements the BucketAssigner interface and return a custom file name in the getBucketId() method (the value returned by getBucketId() will be treated as the file name).

ParquetStreamingFileSinkITCase:  https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_flink_blob_master_flink-2Dformats_flink-2Dparquet_src_test_java_org_apache_flink_formats_parquet_avro_ParquetStreamingFileSinkITCase.java&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=8d-ErhDksiRSGYq3JQhnEERmXTnKMk99N7KfWs08Hho&e=>

StreamingFileSink#forBulkFormat: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_flink_blob_master_flink-2Dstreaming-2Djava_src_main_java_org_apache_flink_streaming_api_functions_sink_filesystem_StreamingFileSink.java&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=-d96QSEVUNN702t_ejhD2TmXhc4fi8534hoQDbUrzAs&e=>

DateTimeBucketAssigner: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_flink_blob_master_flink-2Dstreaming-2Djava_src_main_java_org_apache_flink_streaming_api_functions_sink_filesystem_bucketassigners_DateTimeBucketAssigner.java&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=u7c_I1rxZZYWSTUSW-JG8-6dk5MNLXMdvXXTr1Z0nAw&e=>

Best,
Haibo

At 2019-07-02 04:15:07, "Hailu, Andreas" <An...@gs.com>> wrote:
Hello Flink team,

I'm writing Avro and Parquet files to HDFS, and I've would like to include a UUID as a part of the file name.

Our files in HDFS currently follow this pattern:

tmp-r-00001.snappy.parquet
tmp-r-00002.snappy.parquet
...

I'm using a custom output format which extends a RichOutputFormat - is this something which is natively supported? If so, could you please recommend how this could be done, or share the relevant document?

Best,
Andreas

________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>

________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>

________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>

Re:RE: Re:Re: File Naming Pattern from HadoopOutputFormat

Posted by Haibo Sun <su...@163.com>.

Hi, Andreas  


I'm glad you have had a solution. If you're interested in option 2 I talked about, you can follow up on the progress of the issue (https://issues.apache.org/jira/browse/FLINK-12573) that Yitzchak said by watching it.


Best,
Haibo

At 2019-07-03 21:11:44, "Hailu, Andreas" <An...@gs.com> wrote:


Hi Haibo, Yitzchak, thanks for getting back to me.

 

The pattern I chose to use which worked was to extend the HadoopOutputFormat class, override the open() method, and modify the “mapreduce.output.basename” configuration property to match my desired file naming structure.

 

// ah

 

From: Haibo Sun <su...@163.com>
Sent: Tuesday, July 2, 2019 5:57 AM
To: Yitzchak Lieberman <yi...@sentinelone.com>
Cc: Hailu, Andreas [Tech] <An...@ny.email.gs.com>; user@flink.apache.org
Subject: Re:Re: File Naming Pattern from HadoopOutputFormat

 


Hi, Andreas 

 

You are right. To meet this requirement, Flink should need to expose a interface to allow customizing the filename.

 

Best,

Haibo


At 2019-07-02 16:33:44, "Yitzchak Lieberman" <yi...@sentinelone.com> wrote:



regarding option 2 for parquet:

implementing bucket assigner won't set the file name as getBucketId() defined the directory for the files in case of partitioning the data, for example:

<root dir>/day=20190101/part-1-1

there is an open issue for that: https://issues.apache.org/jira/browse/FLINK-12573

 

On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun <su...@163.com> wrote:

Hi, Andreas

 

I think the following things may be what you want.

 

1. For writing Avro, I think you can extend AvroOutputFormat and override the  getDirectoryFileName() method to customize a file name, as shown below.

The javadoc of AvroOutputFormat: https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html

 

          public static class CustomAvroOutputFormat extends AvroOutputFormat {
                              public CustomAvroOutputFormat(Path filePath, Class type) {
                                                   super(filePath, type);
                              }
 
                              public CustomAvroOutputFormat(Class type) {
                                                   super(type);
                              }
 
                              @Override
                              public void open(int taskNumber, int numTasks) throws IOException {
                                                   this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
                                                   super.open(taskNumber, numTasks);
                              }
 
                              @Override
                              protected String getDirectoryFileName(int taskNumber) {
                                                   // returns a custom filename
                                                   return null;
                              }
          }

 

2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a class that implements the BucketAssigner interface and return a custom file name in the getBucketId() method (the value returned by getBucketId() will be treated as the file name).

 

ParquetStreamingFileSinkITCase:  https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java

 

StreamingFileSink#forBulkFormat: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java

 

DateTimeBucketAssigner: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java

 

 

Best,

Haibo


At 2019-07-02 04:15:07, "Hailu, Andreas" <An...@gs.com> wrote:



Hello Flink team,

 

I’m writing Avro and Parquet files to HDFS, and I’ve would like to include a UUID as a part of the file name.

 

Our files in HDFS currently follow this pattern:

 

tmp-r-00001.snappy.parquet

tmp-r-00002.snappy.parquet

...

 

I’m using a custom output format which extends a RichOutputFormat - is this something which is natively supported? If so, could you please recommend how this could be done, or share the relevant document?

 

Best,

Andreas

 


Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices

RE: Re:Re: File Naming Pattern from HadoopOutputFormat

Posted by "Hailu, Andreas" <An...@gs.com>.

Hi Haibo, Yitzchak, thanks for getting back to me.

The pattern I chose to use which worked was to extend the HadoopOutputFormat class, override the open() method, and modify the "mapreduce.output.basename" configuration property to match my desired file naming structure.

// ah

From: Haibo Sun <su...@163.com>
Sent: Tuesday, July 2, 2019 5:57 AM
To: Yitzchak Lieberman <yi...@sentinelone.com>
Cc: Hailu, Andreas [Tech] <An...@ny.email.gs.com>; user@flink.apache.org
Subject: Re:Re: File Naming Pattern from HadoopOutputFormat

Hi, Andreas

You are right. To meet this requirement, Flink should need to expose a interface to allow customizing the filename.

Best,
Haibo

At 2019-07-02 16:33:44, "Yitzchak Lieberman" <yi...@sentinelone.com>> wrote:

regarding option 2 for parquet:
implementing bucket assigner won't set the file name as getBucketId() defined the directory for the files in case of partitioning the data, for example:
<root dir>/day=20190101/part-1-1
there is an open issue for that: https://issues.apache.org/jira/browse/FLINK-12573<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D12573&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=zGL975-dtGwduiOtS--MtzcwNbM6ti3ziA85_ki-Ql8&e=>

On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun <su...@163.com>> wrote:
Hi, Andreas

I think the following things may be what you want.

1. For writing Avro, I think you can extend AvroOutputFormat and override the  getDirectoryFileName() method to customize a file name, as shown below.
The javadoc of AvroOutputFormat: https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html<https://urldefense.proofpoint.com/v2/url?u=https-3A__ci.apache.org_projects_flink_flink-2Ddocs-2Drelease-2D1.8_api_java_org_apache_flink_formats_avro_AvroOutputFormat.html&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=tgLhwjHX0wHWgUMSSmNEOUmpatTF5N7JYfUEtYzQyf4&e=>

          public static class CustomAvroOutputFormat extends AvroOutputFormat {

                              public CustomAvroOutputFormat(Path filePath, Class type) {

                                                   super(filePath, type);

                              }

                              public CustomAvroOutputFormat(Class type) {

                                                   super(type);

                              }

                              @Override

                              public void open(int taskNumber, int numTasks) throws IOException {

                                                   this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);

                                                   super.open(taskNumber, numTasks);

                              }

                              @Override

                              protected String getDirectoryFileName(int taskNumber) {

                                                   // returns a custom filename

                                                   return null;

                              }

          }

2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a class that implements the BucketAssigner interface and return a custom file name in the getBucketId() method (the value returned by getBucketId() will be treated as the file name).

ParquetStreamingFileSinkITCase:  https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_flink_blob_master_flink-2Dformats_flink-2Dparquet_src_test_java_org_apache_flink_formats_parquet_avro_ParquetStreamingFileSinkITCase.java&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=8d-ErhDksiRSGYq3JQhnEERmXTnKMk99N7KfWs08Hho&e=>

StreamingFileSink#forBulkFormat: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_flink_blob_master_flink-2Dstreaming-2Djava_src_main_java_org_apache_flink_streaming_api_functions_sink_filesystem_StreamingFileSink.java&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=-d96QSEVUNN702t_ejhD2TmXhc4fi8534hoQDbUrzAs&e=>

DateTimeBucketAssigner: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_flink_blob_master_flink-2Dstreaming-2Djava_src_main_java_org_apache_flink_streaming_api_functions_sink_filesystem_bucketassigners_DateTimeBucketAssigner.java&d=DwMGbg&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM&m=KTuJ_S9VUApSpfhp2WtgGeinhaMG-qZuTb59kFHm3Z8&s=u7c_I1rxZZYWSTUSW-JG8-6dk5MNLXMdvXXTr1Z0nAw&e=>

Best,
Haibo

At 2019-07-02 04:15:07, "Hailu, Andreas" <An...@gs.com>> wrote:

Hello Flink team,

I'm writing Avro and Parquet files to HDFS, and I've would like to include a UUID as a part of the file name.

Our files in HDFS currently follow this pattern:

tmp-r-00001.snappy.parquet
tmp-r-00002.snappy.parquet
...

I'm using a custom output format which extends a RichOutputFormat - is this something which is natively supported? If so, could you please recommend how this could be done, or share the relevant document?

Best,
Andreas

________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>

________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>

Re:Re: File Naming Pattern from HadoopOutputFormat

Posted by Haibo Sun <su...@163.com>.

Hi, Andreas 


You are right. To meet this requirement, Flink should need to expose a interface to allow customizing the filename.
 

Best,
Haibo

At 2019-07-02 16:33:44, "Yitzchak Lieberman" <yi...@sentinelone.com> wrote:

regarding option 2 for parquet:
implementing bucket assigner won't set the file name as getBucketId() defined the directory for the files in case of partitioning the data, for example:
<root dir>/day=20190101/part-1-1
there is an open issue for that: https://issues.apache.org/jira/browse/FLINK-12573


On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun <su...@163.com> wrote:

Hi, Andreas


I think the following things may be what you want.


1. For writing Avro, I think you can extend AvroOutputFormat and override the  getDirectoryFileName() method to customize a file name, as shown below.
The javadoc of AvroOutputFormat: https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html


	public static class CustomAvroOutputFormat extends AvroOutputFormat {
		public CustomAvroOutputFormat(Path filePath, Class type) {
			super(filePath, type);
		}

		public CustomAvroOutputFormat(Class type) {
			super(type);
		}

		@Override
		public void open(int taskNumber, int numTasks) throws IOException {
			this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
			super.open(taskNumber, numTasks);
		}

		@Override
		protected String getDirectoryFileName(int taskNumber) {
			// returns a custom filename
			return null;
		}
	}


2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a class that implements the BucketAssigner interface and return a custom file name in the getBucketId() method (the value returned by getBucketId() will be treated as the file name).


ParquetStreamingFileSinkITCase:  https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java


StreamingFileSink#forBulkFormat: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java


DateTimeBucketAssigner: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java




Best,
Haibo

At 2019-07-02 04:15:07, "Hailu, Andreas" <An...@gs.com> wrote:


Hello Flink team,

 

I’m writing Avro and Parquet files to HDFS, and I’ve would like to include a UUID as a part of the file name.

 

Our files in HDFS currently follow this pattern:

 

tmp-r-00001.snappy.parquet

tmp-r-00002.snappy.parquet

...

 

I’m using a custom output format which extends a RichOutputFormat - is this something which is natively supported? If so, could you please recommend how this could be done, or share the relevant document?

 

Best,

Andreas




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices

Re: File Naming Pattern from HadoopOutputFormat

Posted by Yitzchak Lieberman <yi...@sentinelone.com>.

regarding option 2 for parquet:
implementing bucket assigner won't set the file name as getBucketId()
defined the directory for the files in case of partitioning the data,
for example:
<root dir>/day=20190101/part-1-1
there is an open issue for that:
https://issues.apache.org/jira/browse/FLINK-12573

On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun <su...@163.com> wrote:

> Hi, Andreas
>
> I think the following things may be what you want.
>
> 1. For writing Avro, I think you can extend AvroOutputFormat and override
> the  getDirectoryFileName() method to customize a file name, as shown below.
> The javadoc of AvroOutputFormat:
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html
>
> 	public static class CustomAvroOutputFormat extends AvroOutputFormat {
> 		public CustomAvroOutputFormat(Path filePath, Class type) {
> 			super(filePath, type);
> 		}
>
> 		public CustomAvroOutputFormat(Class type) {
> 			super(type);
> 		}
>
> 		@Override
> 		public void open(int taskNumber, int numTasks) throws IOException {
> 			this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
> 			super.open(taskNumber, numTasks);
> 		}
>
> 		@Override
> 		protected String getDirectoryFileName(int taskNumber) {
> 			// returns a custom filename
> 			return null;
> 		}
> 	}
>
>
> 2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase,
> StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create
> a class that implements the BucketAssigner interface and return a custom
> file name in the getBucketId() method (the value returned by getBucketId()
> will be treated as the file name).
>
> ParquetStreamingFileSinkITCase:
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java
>
> StreamingFileSink#forBulkFormat:
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java
>
> DateTimeBucketAssigner:
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java
>
>
> Best,
> Haibo
>
> At 2019-07-02 04:15:07, "Hailu, Andreas" <An...@gs.com> wrote:
>
> Hello Flink team,
>
>
>
> I’m writing Avro and Parquet files to HDFS, and I’ve would like to include
> a UUID as a part of the file name.
>
>
>
> Our files in HDFS currently follow this pattern:
>
>
>
> *tmp-r-00001.snappy.parquet*
>
> *tmp-r-00002.snappy.parquet*
>
> *...*
>
>
>
> I’m using a custom output format which extends a RichOutputFormat - is
> this something which is natively supported? If so, could you please
> recommend how this could be done, or share the relevant document?
>
>
>
> Best,
>
> Andreas
>
> ------------------------------
>
> Your Personal Data: We may collect and process information about you that
> may be subject to data protection laws. For more information about how we
> use and disclose your personal data, how we protect your information, our
> legal basis to use your information, your rights and who you can contact,
> please refer to: www.gs.com/privacy-notices
>
>

Re:File Naming Pattern from HadoopOutputFormat

Posted by Haibo Sun <su...@163.com>.

Hi, Andreas


I think the following things may be what you want.


1. For writing Avro, I think you can extend AvroOutputFormat and override the  getDirectoryFileName() method to customize a file name, as shown below.
The javadoc of AvroOutputFormat: https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html


	public static class CustomAvroOutputFormat extends AvroOutputFormat {
		public CustomAvroOutputFormat(Path filePath, Class type) {
			super(filePath, type);
		}

		public CustomAvroOutputFormat(Class type) {
			super(type);
		}

		@Override
		public void open(int taskNumber, int numTasks) throws IOException {
			this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
			super.open(taskNumber, numTasks);
		}

		@Override
		protected String getDirectoryFileName(int taskNumber) {
			// returns a custom filename
			return null;
		}
	}


2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a class that implements the BucketAssigner interface and return a custom file name in the getBucketId() method (the value returned by getBucketId() will be treated as the file name).


ParquetStreamingFileSinkITCase:  https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java


StreamingFileSink#forBulkFormat: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java


DateTimeBucketAssigner: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBucketAssigner.java




Best,
Haibo

At 2019-07-02 04:15:07, "Hailu, Andreas" <An...@gs.com> wrote:


Hello Flink team,

 

I’m writing Avro and Parquet files to HDFS, and I’ve would like to include a UUID as a part of the file name.

 

Our files in HDFS currently follow this pattern:

 

tmp-r-00001.snappy.parquet

tmp-r-00002.snappy.parquet

...

 

I’m using a custom output format which extends a RichOutputFormat - is this something which is natively supported? If so, could you please recommend how this could be done, or share the relevant document?

 

Best,

Andreas




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices