Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/08 12:32:21 UTC

[GitHub] [hudi] Limess opened a new issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

Limess opened a new issue #3945:
URL: https://github.com/apache/hudi/issues/3945


   **Describe the problem you faced**
   
   We're running a deltastreamer job into a new Hudi table.
   
   We have a partition column, `story_published_partition_date`, and we set `hoodie.datasource.write.drop.partition.columns=true`.
   
   When the execution completes, we observe that the partition column is still present in the parquet file, along with its data:
   
   ```shell
   parquet-tools show s3://<bucket>/articles_hudi_copy_on_write_drop_partition_column_test/story_published_partition_date=2021-01-07/b4eec094-ea1e-4b95-839f-648592eddb08-0_18-26-3681_20211108115747.parquet --head 1 --columns story_published_partition_date --awsprofile signal-prod
   ℹ s3://<bucket>/articles_hudi_copy_on_write_drop_partition_column_test/story_published_partition_date=2021-01-07/b4eec094-ea1e-4b95-839f-648592eddb08-0_18-26-3681_20211108115747.parquet => /var/folders/lx/83dtr4vx0cs87l55pwnwk7600000gq/T/tmp0ypmpxcw/f39767b2-3218-4ebe-9396-9549d6998c02.parquet
   
   +----------------------------------+
   | story_published_partition_date   |
   |----------------------------------|
   | 2021-01-07T09:00:00Z             |
   +----------------------------------+
   ```
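The output above shows the raw timestamp stored in the file, while the directory name carries only the formatted date. As a rough illustration of that relationship, here is a minimal Python sketch that approximates the key generator settings from the configuration that follows (input patterns, `yyyy-MM-dd` output, UTC). It is not Hudi's implementation: the helper name `partition_dir` and the `strptime` patterns are assumptions chosen to mimic those settings.

```python
from datetime import datetime, timezone

# Rough stand-ins for the two configured input patterns
# (yyyy-MM-dd'T'HH:mm:ssZ and yyyy-MM-dd'T'HH:mm:ss.SSSZ).
INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]


def partition_dir(field, raw_value):
    """Build a hive-style partition directory name (hypothetical helper)."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(raw_value, fmt)
        except ValueError:
            continue  # try the next input pattern
        # Convert to UTC and keep only the date part for the directory name.
        day = parsed.astimezone(timezone.utc).strftime("%Y-%m-%d")
        return f"{field}={day}"
    raise ValueError(f"unparseable timestamp: {raw_value!r}")


print(partition_dir("story_published_partition_date", "2021-01-07T09:00:00Z"))
```

With the option working as expected, only the directory name would carry this value and the column would be absent from the file itself.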
   
   Configuration:
   
   ```
           "Args": [
               "spark-submit",
               "--deploy-mode",
               "cluster",
               "--class",
               "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
               "--jars",
               "/home/hadoop/extra-jars/hudi-utilities-bundle_2.12-0.9.0.jar,/home/hadoop/extra-jars/spark-avro_2.12-3.0.1.jar,/home/hadoop/extra-jars/hudi-spark3-bundle_2.12-0.9.0.jar",
   
               "/home/hadoop/extra-jars/hudi-utilities-bundle_2.12-0.9.0.jar",
               "--props",
               "/etc/hudi/conf/hudi-base.properties",
               "--table-type",
               "COPY_ON_WRITE",
               "--op",
               "UPSERT ",
               "--source-ordering-field",
               "version",
               "--source-class",
               "org.apache.hudi.utilities.sources.ParquetDFSSource",
               "--transformer-class",
               "org.apache.hudi.utilities.transform.SqlFileBasedTransformer",
               "--target-base-path",
               "s3://<bucket>/articles_hudi_copy_on_write_drop_partition_column_test/",
               "--target-table",
               "articles_hudi_copy_on_write_drop_partition_column_test",
               "--enable-hive-sync",
               "--hoodie-conf",
               "hoodie.table.name=articles_hudi_copy_on_write_drop_partition_column_test",
               "--hoodie-conf",
               "hoodie.deltastreamer.transformer.sql.file=/etc/hudi/conf/schema/documents_schema.sql",
               "--hoodie-conf",
               "hoodie.datasource.write.recordkey.field=id",
               "--hoodie-conf",
               "hoodie.datasource.write.precombine.field=version",
               "--hoodie-conf",
               "hoodie.bloom.index.prune.by.ranges=false",
               "--hoodie-conf",
               "hoodie.datasource.write.partitionpath.field=story_published_partition_date",
               "--hoodie-conf",
               "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator",
               "--hoodie-conf",
               "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING",
               "--hoodie-conf",
               "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ",
               "--hoodie-conf",
               "hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=,",
               "--hoodie-conf",
               "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd",
               "--hoodie-conf",
               "hoodie.deltastreamer.keygen.timebased.output.timezone=UTC",
               "--hoodie-conf",
               "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor",
               "--hoodie-conf",
               "hoodie.datasource.write.hive_style_partitioning=true",
               "--hoodie-conf",
               "hoodie.datasource.write.drop.partition.columns=true",
               "--hoodie-conf",
               "hoodie.datasource.write.reconcile.schema=true",
               "--hoodie-conf",
               "hoodie.datasource.hive_sync.enable=true",
               "--hoodie-conf",
               "hoodie.datasource.hive_sync.database=articles",
               "--hoodie-conf",
               "hoodie.datasource.hive_sync.table=articles_hudi_copy_on_write_drop_partition_column_test",
               "--hoodie-conf",
               "hoodie.datasource.hive_sync.partition_fields=story_published_partition_date",
               "--hoodie-conf",
               "hoodie.deltastreamer.source.dfs.root=s3://<input_bucket>/firehose_received_date=2021-11-08/"
           ]
   ```
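With this many `--hoodie-conf` pairs, it can help to collect them into a dictionary and confirm which options were actually passed. The following is a minimal Python sketch, assuming the argument list is available as a list of strings. The helper name `hoodie_conf` is hypothetical and the list is abbreviated from the one above. It only shows what was passed on the command line, not whether DeltaStreamer honours it.

```python
def hoodie_conf(args):
    """Collect the key=value pairs that follow each --hoodie-conf flag."""
    conf = {}
    for flag, value in zip(args, args[1:]):
        if flag == "--hoodie-conf":
            # Split on the first "=" only: values such as S3 paths may contain "=".
            key, _, val = value.partition("=")
            conf[key] = val
    return conf


# Abbreviated copy of the argument list from the report.
args = [
    "--op", "UPSERT ",
    "--hoodie-conf", "hoodie.datasource.write.partitionpath.field=story_published_partition_date",
    "--hoodie-conf", "hoodie.datasource.write.drop.partition.columns=true",
    "--hoodie-conf", "hoodie.deltastreamer.source.dfs.root=s3://<input_bucket>/firehose_received_date=2021-11-08/",
]

conf = hoodie_conf(args)
print(conf["hoodie.datasource.write.drop.partition.columns"])
print(conf["hoodie.deltastreamer.source.dfs.root"])
```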
   
   **Expected behavior**
   
   The column does not exist in the parquet file.
   
   **Environment Description**
   
   EMR 6.4.0
   
   * Hudi version: 0.9.0
   * Spark version: 3.1.2
   * Hive version: Hive 3.1.2
   * Hadoop version: Amazon 3.2.1
   * Storage (HDFS/S3/GCS..): S3
   * Running on Docker? (yes/no): no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3945:
URL: https://github.com/apache/hudi/issues/3945#issuecomment-989455007


   My bad. I will look at it then.





[GitHub] [hudi] nsivabalan edited a comment on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #3945:
URL: https://github.com/apache/hudi/issues/3945#issuecomment-989457674


   Looks like support for this option was never added to deltastreamer specifically. I have filed a tracking ticket [here](https://issues.apache.org/jira/browse/HUDI-2967). If either of you is interested in working on it, let me know and I can guide you. We can get it in for 0.11.
   Since we have a tracking Jira, I will close the GitHub issue.





[GitHub] [hudi] fireking77 commented on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

Posted by GitBox <gi...@apache.org>.
fireking77 commented on issue #3945:
URL: https://github.com/apache/hudi/issues/3945#issuecomment-985311538


   Hi Guys!
   
   +1 here
   
   I have the same issue with much the same setup (EMR 6.4), but with Hudi version 0.8, which is built into that EMR version.
   
   Thanks,
    Darvi





[GitHub] [hudi] nsivabalan commented on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3945:
URL: https://github.com/apache/hudi/issues/3945#issuecomment-989202571


   I guess the feature itself was only added in 0.9.0: https://github.com/apache/hudi/commit/968927801470953f137368cf146778a7f01aa63f
   Please let us know if you are facing issues with 0.9.0.





[GitHub] [hudi] nsivabalan commented on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3945:
URL: https://github.com/apache/hudi/issues/3945#issuecomment-989457674


   Looks like support for this option was never added to deltastreamer specifically. I have filed a tracking ticket [here](https://issues.apache.org/jira/browse/HUDI-2967). If you are interested in working on it, let me know and I can guide you. We can get it in for 0.11.
   Since we have a tracking Jira, I will close the GitHub issue.





[GitHub] [hudi] nsivabalan closed issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3945:
URL: https://github.com/apache/hudi/issues/3945


   





[GitHub] [hudi] Limess commented on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

Posted by GitBox <gi...@apache.org>.
Limess commented on issue #3945:
URL: https://github.com/apache/hudi/issues/3945#issuecomment-989203139


   This was using 0.9.0




