Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/03 19:39:20 UTC

[GitHub] [hudi] mithalee commented on issue #3336: [SUPPORT] Delete not functioning with deltastreamer

mithalee commented on issue #3336:
URL: https://github.com/apache/hudi/issues/3336#issuecomment-892112742


   @codope Hi, I did try HoodieDeltaStreamer on the below version of EMR:
   Release label: emr-6.3.0
   Hadoop distribution: Amazon 3.2.1
   Applications: Tez 0.9.2, Spark 3.1.1, Hive 3.1.2, JupyterHub 1.2.0, Sqoop 1.4.7, Zeppelin 0.9.0, Hue 4.9.0, Presto 0.245.1
   HUDI Utility: https://mvnrepository.com/artifact/org.apache.hudi/hudi-utilities-bundle_2.12/0.8.0
   On EMR, HoodieDeltaStreamer works as expected (upserts/deletes). But the same data set and the same spark-submit configuration do not work with the Spark-on-K8s binaries:
   https://spark.apache.org/docs/3.1.1/submitting-applications.html
   I can perform the initial insert into the Hudi table through the spark-submit below, but the upserts/deletes throw an error. Can you confirm whether upserts/deletes work using the DeltaStreamer with Spark on K8s?
   
   **SPARK SUBMIT**
   spark-submit --master k8s://https://...sk1.us-west-1.eks.amazonaws.com \
   --deploy-mode cluster \
   --name spark-hudi \
   --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   --conf spark.executor.instances=1 \
   --conf spark.kubernetes.container.image=d.../spark:spark-hudi-0.2 \
   --conf spark.kubernetes.namespace=spark-k8 \
   --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
   --conf spark.hadoop.fs.s3a.access.key='A........' \
   --conf spark.hadoop.fs.s3a.secret.key='L.........' \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   s3a://l...v/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
   --table-type COPY_ON_WRITE --source-ordering-field one \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-base-path s3a://......./hudi-root/transformed-tables/hudi_writer_mm7/ \
   --target-table test_table --base-file-format PARQUET \
   --hoodie-conf hoodie.datasource.write.recordkey.field=one \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://....../jen/example_4.parquet
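
   For reference, a minimal PySpark datasource-write sketch of the same upsert (not part of my original run; bucket names are placeholders, and it assumes a pyspark session with the Hudi 0.8.0 bundle and spark-avro on the classpath) could help isolate whether the failure is specific to the DeltaStreamer or to the K8s image:

   # Minimal sketch, assuming a pyspark session with the Hudi 0.8.0 bundle;
   # "<bucket>" below is a placeholder, not the real path from this issue.
   df = spark.read.parquet("s3a://<bucket>/jen/example_4.parquet")
   (df.write.format("hudi")
      .option("hoodie.table.name", "test_table")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "one")
      .option("hoodie.datasource.write.partitionpath.field", "two")
      .mode("append")
      .save("s3a://<bucket>/hudi-root/transformed-tables/hudi_writer_mm7/"))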
   
   **Parquet file example_4.parquet:**
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Three records; '_hoodie_is_deleted' is False everywhere,
   # so all rows are plain inserts/upserts.
   df = pd.DataFrame({'one': [-1, 3, 2.5],
                      'two': [100, 200, 300],
                      'three': [True, True, True],
                      '_hoodie_is_deleted': [False, False, False]},
                     index=list('abc'))
   table = pa.Table.from_pandas(df)
   print(table)
   pq.write_table(table, 'example_4.parquet')
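
   As a quick sanity check (an addition, not from my original run), reading the file back should show '_hoodie_is_deleted' as a plain boolean column, which is what Hudi keys soft deletes on:

   import pyarrow.parquet as pq

   # Hudi expects '_hoodie_is_deleted' to be a boolean field.
   schema = pq.read_schema('example_4.parquet')
   print(schema.field('_hoodie_is_deleted').type)  # expect: bool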
   
   **Delete file example_delete4.parquet:**
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Same records, but '_hoodie_is_deleted' is True for the first row,
   # which should delete that record on the next upsert.
   df = pd.DataFrame({'one': [-1, 3, 2.5],
                      'two': [100, 200, 300],
                      'three': [True, True, True],
                      '_hoodie_is_deleted': [True, False, False]},
                     index=list('abc'))
   table = pa.Table.from_pandas(df)
   print(table)
   pq.write_table(table, 'example_delete4.parquet')
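
   The delete pass re-runs the same spark-submit with hoodie.deltastreamer.source.dfs.root pointed at example_delete4.parquet. Afterwards, a quick read of the table should show the record with one = -1 gone (a sketch with a placeholder path, assuming a pyspark session with the Hudi bundle):

   # Sketch: verify the soft delete landed; the path is a placeholder.
   # Note: some older Hudi versions need a "/*" glob suffix on the base path.
   hudi_df = spark.read.format("hudi").load(
       "s3a://<bucket>/hudi-root/transformed-tables/hudi_writer_mm7/")
   hudi_df.select("one", "two", "three").show()
   # The row with one = -1 should no longer appear.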

