You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by lan tran <in...@gmail.com> on 2022/03/18 05:03:58 UTC

[File partition Flink]

Hi team, I have some questions about the format when I process the files  
  
In-progress / Pending: part-<uid>-<partFileIndex>.inprogress.uid  
Finished: part-<uid>-<partFileIndex>  
  
Can you explain more about the partFileIndex since the format of the files is
quite weird. It produces two files (I wonder it related to the parallelism
which we have set is 2).  
part-6a13c70e-638d-4b10-820c-d7577e949e89-0-191  
part-6a13c70e-638d-4b10-820c-d7577e949e89-1-190  
  
If so, what happens if that our data is huge but the the commit time
(checkpoint around 1000ms) is small. Does it write into another files just
like this part-6a13c70e-638d-4b10-820c-d7577e949e89-0-192  by increasing the
final number ? Or they have the diffirent format.  
  
Another question is that since we are using the table API, does any option
that we can have to limit the files size or time that the files should closed
as it stated in the doc since I see that there is the option called
‘sink.rolling-policy.file-size’ and ‘sink.partition-commit.delay’. Does this
relevant to what we want ?  
  
One more question, although we have set 3 parallelism but at the end is still
be 1. Can you explain a bit about this case for me ?  
  
![](cid:image003.png@01D83AC0.3FBBE540)  
  
Thanks team.  
  
Best,  
Quynh.





Sent from [Mail](https://go.microsoft.com/fwlink/?LinkId=550986) for Windows

Re: [File partition Flink]

Posted by Yun Gao <yu...@aliyun.com>.

Hi Ian,

I think you are using the Table file / hive sinks. In this sink, it also include
a UUID in the prefix to ensures it behaves as appending instead of overwritting
if we run the same sql to write into the same directory / table partition multiple times.
Thus the file name pattern is part-<uuid>-<subtask index>-<sequence no of each subtask>.

Thus for the same run, if the data is huge and the checkpoint interval is small, the next file name
would indeed be like part-6a13c70e-638d-4b10-820c-d7577e949e89-0-192 , which increases the
sequence no of each task.

Regarding limiting the file size, I think sink.rolling-policy.file-size[1] should be the parameter.

Regarding the PartitionCommitter, it is mainly used to commit some metadatas, like writing
_SUCCESS file to the finished bucket or insert a record of the finished partition into the Hive metastore,
thus its parallelism is set to 1 to be able to have a summarized of all the writers' status to decide when
to write this piece of metadata. The writer should be contained in the precedent tasks, which has a parallelism
of 3 as expected.

Best,
Yun Gao

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/filesystem/#sink-rolling-policy-file-size
------------------------------------------------------------------
From:lan tran <in...@gmail.com>
Send Time:2022 Mar. 18 (Fri.) 13:04
To:user@flink.apache.org <us...@flink.apache.org>
Subject:[File partition Flink]

Hi team, I have some questions about the format when I process the files

In-progress / Pending: part-<uid>-<partFileIndex>.inprogress.uid
Finished: part-<uid>-<partFileIndex>

Can you explain more about the partFileIndex since the format of the files is quite weird. It produces two files (I wonder it related to the parallelism which we have set is 2).
part-6a13c70e-638d-4b10-820c-d7577e949e89-0-191
part-6a13c70e-638d-4b10-820c-d7577e949e89-1-190

If so, what happens if that our data is huge but the the commit time (checkpoint around 1000ms) is small. Does it write into another files just like this part-6a13c70e-638d-4b10-820c-d7577e949e89-0-192 by increasing the final number ? Or they have the diffirent format.

Another question is that since we are using the table API, does any option that we can have to limit the files size or time that the files should closed as it stated in the doc since I see that there is the option called ‘sink.rolling-policy.file-size’ and ‘sink.partition-commit.delay’. Does this relevant to what we want ?

One more question, although we have set 3 parallelism but at the end is still be 1. Can you explain a bit about this case for me ?

Thanks team.

Best,
Quynh.
Sent from Mail for Windows