Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2021/06/16 02:46:00 UTC
[jira] [Commented] (HUDI-2025) Bring parity between row writer bulk_insert and rdd based bulk_insert
[ https://issues.apache.org/jira/browse/HUDI-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364021#comment-17364021 ]
sivabalan narayanan commented on HUDI-2025:
-------------------------------------------
Trying to pin down the differences between the two flows.
I inspected the metadata that gets attached to the parquet files written by each path.
// Listing just the keys in the extra metadata.
sn$ grep "extra" /tmp/regular_bulk_insert_meta.out| cut -d"=" -f1
extra: org.apache.hudi.bloomfilter
extra: hoodie_min_record_key
extra: parquet.avro.schema
extra: writer.model.name
extra: hoodie_max_record_key
sn$ grep "extra" /tmp/rowWriter_bulk_insert_meta.out| cut -d"=" -f1
extra: org.apache.spark.version
extra: org.apache.hudi.bloomfilter
extra: hoodie_min_record_key
extra: org.apache.spark.sql.parquet.row.metadata
extra: hoodie_max_record_key
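
To make the gap easier to see, here is a minimal sketch that diffs the two key sets. The keys are copied by hand from the listings above; the variable names and the diff_keys helper are made up for illustration and are not part of Hudi.

```python
# Keys copied from the two grep listings above (hand-transcribed, not read from the files).
regular_keys = {
    "org.apache.hudi.bloomfilter",
    "hoodie_min_record_key",
    "parquet.avro.schema",
    "writer.model.name",
    "hoodie_max_record_key",
}
row_writer_keys = {
    "org.apache.spark.version",
    "org.apache.hudi.bloomfilter",
    "hoodie_min_record_key",
    "org.apache.spark.sql.parquet.row.metadata",
    "hoodie_max_record_key",
}


def diff_keys(left, right):
    """Return (keys only in left, keys only in right), each sorted."""
    return sorted(left - right), sorted(right - left)


only_regular, only_row_writer = diff_keys(regular_keys, row_writer_keys)
common = sorted(regular_keys & row_writer_keys)

print("only in regular bulk_insert:", only_regular)
print("only in row writer bulk_insert:", only_row_writer)
print("in both:", common)
```

Going by the listings, the rdd based path carries parquet.avro.schema and writer.model.name, the row writer path carries org.apache.spark.version and org.apache.spark.sql.parquet.row.metadata, and both carry the bloom filter and the min/max record keys.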
> Bring parity between row writer bulk_insert and rdd based bulk_insert
> ---------------------------------------------------------------------
>
> Key: HUDI-2025
> URL: https://issues.apache.org/jira/browse/HUDI-2025
> Project: Apache Hudi
> Issue Type: Task
> Reporter: sivabalan narayanan
> Priority: Major
>
> Bring parity between row writer bulk_insert and rdd based bulk_insert
--
This message was sent by Atlassian Jira
(v8.3.4#803005)