Posted to commits@hudi.apache.org by "Yanjia Gary Li (Jira)" <ji...@apache.org> on 2020/04/10 23:22:00 UTC
[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2
[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081030#comment-17081030 ]
Yanjia Gary Li commented on HUDI-773:
-------------------------------------
Surprisingly easy. I ran the following test on a Spark 2.4 HDInsight cluster backed by Azure Data Lake Storage Gen2. Hudi worked out of the box; no extra configuration was needed.
{code:scala}
import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions}
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Initial batch: bulk-insert four records across two partitions
val outputPath = "/Test/HudiWrite"
val df1 = Seq(
  ("0", "year=2019", "test1", "pass", "201901"),
  ("1", "year=2019", "test1", "pass", "201901"),
  ("2", "year=2020", "test1", "pass", "201901"),
  ("3", "year=2020", "test1", "pass", "201901")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
val bulk_insert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
df1.write.format("org.apache.hudi").options(bulk_insert_ops).mode(SaveMode.Overwrite).save(outputPath)

// Upsert: same record keys with a newer TIMESTAMP, so existing rows
// are updated in place rather than appended
val upsert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
val df2 = Seq(
  ("0", "year=2019", "test1", "pass", "201910"),
  ("1", "year=2019", "test1", "pass", "201910"),
  ("2", "year=2020", "test1", "pass", "201910"),
  ("3", "year=2020", "test1", "pass", "201910")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
df2.write.format("org.apache.hudi").options(upsert_ops).mode(SaveMode.Append).save(outputPath)

// Read back as a Hudi snapshot query; the upsert must not add rows
val df_read = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load(outputPath)
assert(df_read.count() == 4)
{code}
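For reference, the test above uses a path relative to the cluster's default filesystem, which on this HDInsight cluster resolves to the ADLS Gen2 account. Outside HDInsight, one would typically address the account explicitly with the abfss:// scheme and supply credentials via the Hadoop configuration. A rough sketch (the container name, storage account name, and key placeholder below are hypothetical, not values from the test above):

{code:scala}
// Hypothetical fully qualified ADLS Gen2 path; "mycontainer" and
// "mystorageaccount" are placeholder names.
val adlsPath = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/Test/HudiWrite"

// Shared-key auth can be supplied through the Hadoop configuration
// (HDInsight clusters generally have this pre-configured already).
spark.conf.set(
  "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
  "<storage-account-access-key>")
{code}

With that set, the same write/read calls as above work unchanged against adlsPath.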
> Hudi On Azure Data Lake Storage V2
> ----------------------------------
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Components: Usability
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)