Posted to dev@flink.apache.org by "Yik San Chan (Jira)" <ji...@apache.org> on 2021/08/20 07:43:00 UTC
[jira] [Created] (FLINK-23891) Support
'sink.partition-commit.policy.kind'='metastore,success-file' config for
batch Hive sink
Yik San Chan created FLINK-23891:
------------------------------------
Summary: Support 'sink.partition-commit.policy.kind'='metastore,success-file' config for batch Hive sink
Key: FLINK-23891
URL: https://issues.apache.org/jira/browse/FLINK-23891
Project: Flink
Issue Type: Improvement
Reporter: Yik San Chan
According to the [docs](https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/filesystem/#partition-commit-policy), if I create a Hive table with the config 'sink.partition-commit.policy.kind'='metastore,success-file', then once the write to the **streaming** Hive sink finishes:
- The HDFS directory is registered with the Hive metastore.
- A _SUCCESS file is written to the directory when the job finishes.
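For reference, a minimal sketch of how such a table definition might be assembled. The helper name `hive_sink_ddl`, the column names, and the PARQUET storage format are illustrative assumptions and are not taken from this issue; only the table name, the partition column, and the 'sink.partition-commit.policy.kind' property come from the report above. The resulting statement is meant to be executed under Flink's Hive SQL dialect, and a streaming job would typically set further partition-commit options (such as a trigger and delay) that are omitted here.

```python
from typing import Dict, Sequence


def hive_sink_ddl(
    table: str,
    columns: Dict[str, str],
    partition_column: str,
    policies: Sequence[str] = ("metastore", "success-file"),
) -> str:
    """Build a Hive-dialect CREATE TABLE statement with a partition commit policy.

    Hypothetical helper: it only formats a string and does not talk to Flink.
    """
    # The commit policy kinds are passed as one comma-separated property value.
    props = {"sink.partition-commit.policy.kind": ",".join(policies)}
    cols_sql = ",\n  ".join(f"{name} {type_}" for name, type_ in columns.items())
    props_sql = ",\n  ".join(f"'{key}'='{value}'" for key, value in props.items())
    return (
        f"CREATE TABLE {table} (\n  {cols_sql}\n) "
        f"PARTITIONED BY ({partition_column} STRING) "
        "STORED AS PARQUET "  # assumed format; the issue does not state one
        f"TBLPROPERTIES (\n  {props_sql}\n)"
    )


if __name__ == "__main__":
    # Column names below are placeholders, not the reporter's schema.
    print(hive_sink_ddl("user_loss_predictions",
                        {"user_id": "STRING", "score": "DOUBLE"},
                        "p_day"))
```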
An example result directory on HDFS looks like this:
[10.106.11.21:serv@cn-hz-wl-prod-data-stat00:~]$ hdfs dfs -ls /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819
Found 9 items
-rw-r----- 2 basedata aiinfra 0 2021-08-20 08:56 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/_SUCCESS
-rw-r----- 2 basedata aiinfra 10684668 2021-08-20 08:49 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/part-3ee91bc0-a5f6-44c9-b2e5-3d50ee882028-0-0
-rw-r----- 2 basedata aiinfra 10712792 2021-08-20 08:48 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/part-3ee91bc0-a5f6-44c9-b2e5-3d50ee882028-1-0
-rw-r----- 2 basedata aiinfra 10759066 2021-08-20 08:46 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/part-3ee91bc0-a5f6-44c9-b2e5-3d50ee882028-2-0
-rw-r----- 2 basedata aiinfra 10754886 2021-08-20 08:46 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/part-3ee91bc0-a5f6-44c9-b2e5-3d50ee882028-3-0
-rw-r----- 2 basedata aiinfra 10681155 2021-08-20 08:45 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/part-3ee91bc0-a5f6-44c9-b2e5-3d50ee882028-4-0
-rw-r----- 2 basedata aiinfra 10725101 2021-08-20 08:46 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/part-3ee91bc0-a5f6-44c9-b2e5-3d50ee882028-5-0
-rw-r----- 2 basedata aiinfra 10717976 2021-08-20 08:56 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/part-3ee91bc0-a5f6-44c9-b2e5-3d50ee882028-6-0
-rw-r----- 2 basedata aiinfra 10585453 2021-08-20 08:45 /user/hive/warehouse/aiinfra.db/user_loss_predictions/p_day=20210819/part-3ee91bc0-a5f6-44c9-b2e5-3d50ee882028-7-0
There are 8 part-* files because I set the flink run parallelism to 8. After all part-* files are written, a _SUCCESS file is added (see its timestamp, 08:56, which is no earlier than that of any part-* file).
However, this is not supported by the **batch** Hive sink. It would be great to add this support, because downstream jobs usually rely on an accurate completion signal before they proceed.
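To illustrate why the signal matters, here is a rough sketch of the kind of check a downstream job might run before reading a partition. It uses a local directory as a stand-in for the HDFS partition path, and the function names `partition_is_complete` and `ready_part_files` are hypothetical; a real consumer would go through an HDFS client instead of `pathlib`.

```python
from pathlib import Path
from typing import List

# Marker file name as seen in the directory listing above.
SUCCESS_MARKER = "_SUCCESS"


def partition_is_complete(partition_dir: Path, marker: str = SUCCESS_MARKER) -> bool:
    """Return True only if the success marker exists in the partition directory."""
    return (partition_dir / marker).is_file()


def ready_part_files(partition_dir: Path, marker: str = SUCCESS_MARKER) -> List[Path]:
    """Return the part-* files of a partition, or an empty list if it is not committed yet."""
    if not partition_is_complete(partition_dir, marker):
        # Without the marker there is no way to tell whether more files will appear.
        return []
    return sorted(p for p in partition_dir.iterdir() if p.name.startswith("part-"))
```

Without a _SUCCESS file from the batch sink, a check like this has nothing reliable to wait on and has to fall back on guesses such as file counts or timestamps.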