Posted to commits@hudi.apache.org by "soumilshah1995 (via GitHub)" <gi...@apache.org> on 2023/03/28 13:22:51 UTC

[GitHub] [hudi] soumilshah1995 opened a new issue, #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for YouTube Content for Community

soumilshah1995 opened a new issue, #8309:
URL: https://github.com/apache/hudi/issues/8309

   Hello All,
   First of all, thank you very much for all the help from the community. I want to mention that I am new to Delta Streamer; I have worked a lot with Glue jobs and want to experiment with Delta Streamer so I can make videos and teach the community.
   
   I have set up a complete pipeline from AWS Aurora Postgres > DMS > S3, and I have an EMR 6.9 cluster with Spark 3.
   
   Attaching links to sample Parquet files and a sample JSON showing what the data looks like:
   ![image](https://user-images.githubusercontent.com/39345855/228249922-ac19cf34-9112-40ff-b465-db1e9006eb43.png)
   
   Link to data files https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link
   
   
   Here is how I submit the job:
   ```
       spark-submit
       --master yarn
       --deploy-mode cluster
       --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
       --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar
       --table-type COPY_ON_WRITE
       --source-ordering-field replicadmstimestamp
       --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
       --target-base-path s3://sql-server-dms-demo/hudi/public/sales
       --target-table invoice
       --payload-class org.apache.hudi.common.model.AWSDmsAvroPayload
       --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
       --hoodie-conf hoodie.datasource.write.recordkey.field=invoiceid
       --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://sql-server-dms-demo/raw/public/sales/
   ```
   
   # Error I get
   ```
   23/03/28 13:08:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   23/03/28 13:08:49 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-172-32-147-4.ec2.internal/172.32.147.4:8032
   23/03/28 13:08:50 INFO Configuration: resource-types.xml not found
   23/03/28 13:08:50 INFO ResourceUtils: Unable to find 'resource-types.xml'.
   23/03/28 13:08:50 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
   23/03/28 13:08:50 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
   23/03/28 13:08:50 INFO Client: Setting up container launch context for our AM
   23/03/28 13:08:50 INFO Client: Setting up the launch environment for our AM container
   23/03/28 13:08:50 INFO Client: Preparing resources for our AM container
   23/03/28 13:08:50 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
   23/03/28 13:08:52 INFO Client: Uploading resource file:/mnt/tmp/spark-7f0fabb5-07de-43c3-8a26-a2325d5be63a/__spark_libs__363124573059127100.zip -> hdfs://ip-172-32-147-4.ec2.internal:8020/user/hadoop/.sparkStaging/application_1680007316515_0003/__spark_libs__363124573059127100.zip
   23/03/28 13:08:53 INFO Client: Uploading resource file:/usr/lib/hudi/hudi-utilities-bundle_2.12-0.12.1-amzn-0.jar -> hdfs://ip-172-32-147-4.ec2.internal:8020/user/hadoop/.sparkStaging/application_1680007316515_0003/hudi-utilities-bundle_2.12-0.12.1-amzn-0.jar
   23/03/28 13:08:53 INFO Client: Uploading resource file:/etc/spark/conf.dist/hive-site.xml -> hdfs://ip-172-32-147-4.ec2.internal:8020/user/hadoop/.sparkStaging/application_1680007316515_0003/hive-site.xml
   23/03/28 13:08:54 INFO Client: Uploading resource file:/etc/hudi/conf.dist/hudi-defaults.conf -> hdfs://ip-172-32-147-4.ec2.internal:8020/user/hadoop/.sparkStaging/application_1680007316515_0003/hudi-defaults.conf
   23/03/28 13:08:54 INFO Client: Uploading resource file:/mnt/tmp/spark-7f0fabb5-07de-43c3-8a26-a2325d5be63a/__spark_conf__2001263387666561545.zip -> hdfs://ip-172-32-147-4.ec2.internal:8020/user/hadoop/.sparkStaging/application_1680007316515_0003/__spark_conf__.zip
   23/03/28 13:08:54 INFO SecurityManager: Changing view acls to: hadoop
   23/03/28 13:08:54 INFO SecurityManager: Changing modify acls to: hadoop
   23/03/28 13:08:54 INFO SecurityManager: Changing view acls groups to: 
   23/03/28 13:08:54 INFO SecurityManager: Changing modify acls groups to: 
   23/03/28 13:08:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
   23/03/28 13:08:54 INFO Client: Submitting application application_1680007316515_0003 to ResourceManager
   23/03/28 13:08:54 INFO YarnClientImpl: Submitted application application_1680007316515_0003
   23/03/28 13:08:55 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:08:55 INFO Client: 
   	 client token: N/A
   	 diagnostics: AM container is launched, waiting for AM container to Register with RM
   	 ApplicationMaster host: N/A
   	 ApplicationMaster RPC port: -1
   	 queue: default
   	 start time: 1680008934287
   	 final status: UNDEFINED
   	 tracking URL: http://ip-172-32-147-4.ec2.internal:20888/proxy/application_1680007316515_0003/
   	 user: hadoop
   23/03/28 13:08:56 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:08:57 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:08:58 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:08:59 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:09:00 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:09:01 INFO Client: Application report for application_1680007316515_0003 (state: RUNNING)
   23/03/28 13:09:01 INFO Client: 
   	 client token: N/A
   	 diagnostics: N/A
   	 ApplicationMaster host: ip-172-32-12-21.ec2.internal
   	 ApplicationMaster RPC port: 34367
   	 queue: default
   	 start time: 1680008934287
   	 final status: UNDEFINED
   	 tracking URL: http://ip-172-32-147-4.ec2.internal:20888/proxy/application_1680007316515_0003/
   	 user: hadoop
   23/03/28 13:09:02 INFO Client: Application report for application_1680007316515_0003 (state: RUNNING)
   23/03/28 13:09:03 INFO Client: Application report for application_1680007316515_0003 (state: RUNNING)
   23/03/28 13:09:04 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:09:04 INFO Client: 
   	 client token: N/A
   	 diagnostics: AM container is launched, waiting for AM container to Register with RM
   	 ApplicationMaster host: N/A
   	 ApplicationMaster RPC port: -1
   	 queue: default
   	 start time: 1680008934287
   	 final status: UNDEFINED
   	 tracking URL: http://ip-172-32-147-4.ec2.internal:20888/proxy/application_1680007316515_0003/
   	 user: hadoop
   23/03/28 13:09:05 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:09:06 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:09:07 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:09:08 INFO Client: Application report for application_1680007316515_0003 (state: ACCEPTED)
   23/03/28 13:09:09 INFO Client: Application report for application_1680007316515_0003 (state: RUNNING)
   23/03/28 13:09:09 INFO Client: 
   	 client token: N/A
   	 diagnostics: N/A
   	 ApplicationMaster host: ip-172-32-12-21.ec2.internal
   	 ApplicationMaster RPC port: 42741
   	 queue: default
   	 start time: 1680008934287
   	 final status: UNDEFINED
   	 tracking URL: http://ip-172-32-147-4.ec2.internal:20888/proxy/application_1680007316515_0003/
   	 user: hadoop
   23/03/28 13:09:10 INFO Client: Application report for application_1680007316515_0003 (state: RUNNING)
   23/03/28 13:09:11 INFO Client: Application report for application_1680007316515_0003 (state: FINISHED)
   23/03/28 13:09:11 INFO Client: 
   	 client token: N/A
   	 diagnostics: User class threw exception: java.io.IOException: Could not load key generator class org.apache.hudi.keygen.SimpleKeyGenerator
   	at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:74)
   	at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:235)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:675)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:146)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:119)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:571)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:742)
   Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator
   	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:91)
   	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:118)
   	at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:72)
   	... 10 more
   Caused by: java.lang.reflect.InvocationTargetException
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
   	... 12 more
   Caused by: java.lang.IllegalArgumentException: Property hoodie.datasource.write.partitionpath.field not found
   	at org.apache.hudi.common.config.TypedProperties.checkKey(TypedProperties.java:67)
   	at org.apache.hudi.common.config.TypedProperties.getString(TypedProperties.java:72)
   	at org.apache.hudi.keygen.SimpleKeyGenerator.<init>(SimpleKeyGenerator.java:41)
   	... 17 more
   
   	 ApplicationMaster host: ip-172-32-12-21.ec2.internal
   	 ApplicationMaster RPC port: 42741
   	 queue: default
   	 start time: 1680008934287
   	 final status: FAILED
   	 tracking URL: http://ip-172-32-147-4.ec2.internal:20888/proxy/application_1680007316515_0003/
   	 user: hadoop
   23/03/28 13:09:11 ERROR Client: Application diagnostics message: User class threw exception: java.io.IOException: Could not load key generator class org.apache.hudi.keygen.SimpleKeyGenerator
   	at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:74)
   	at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:235)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:675)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:146)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:119)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:571)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:742)
   Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator
   	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:91)
   	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:118)
   	at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:72)
   	... 10 more
   Caused by: java.lang.reflect.InvocationTargetException
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
   	... 12 more
   Caused by: java.lang.IllegalArgumentException: Property hoodie.datasource.write.partitionpath.field not found
   	at org.apache.hudi.common.config.TypedProperties.checkKey(TypedProperties.java:67)
   	at org.apache.hudi.common.config.TypedProperties.getString(TypedProperties.java:72)
   	at org.apache.hudi.keygen.SimpleKeyGenerator.<init>(SimpleKeyGenerator.java:41)
   	... 17 more
   
   Exception in thread "main" org.apache.spark.SparkException: Application application_1680007316515_0003 finished with failed status
   	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1354)
   	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1776)
   	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1006)
   	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
   	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
   	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
   	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1095)
   	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1104)
   	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   23/03/28 13:09:11 INFO ShutdownHookManager: Shutdown hook called
   23/03/28 13:09:11 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dbd4de25-f7b1-4875-b7fd-c599028ae4e0
   23/03/28 13:09:11 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-7f0fabb5-07de-43c3-8a26-a2325d5be63a
   Command exiting with ret '1'
   ```
   
   * Any advice or feedback pointing out what I am doing wrong would be great.




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1489447371

   ![image](https://user-images.githubusercontent.com/39345855/228686460-1e13b50a-db8b-435d-94d7-868dd833919b.png)
   
   




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1489325072

   # Step 1
   ![image](https://user-images.githubusercontent.com/39345855/228667774-e68ea5cc-a90a-42fa-bc72-f1a65e1568d5.png)
   
   # Step 2: 
   ![image](https://user-images.githubusercontent.com/39345855/228667855-adaa8831-c4a3-4a6e-af2b-3a2bc35e5d63.png)
   
   Even after running for 10 minutes, I don't see base files.




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1491028265

   These are the configs that worked for me:
   
   ```
     spark-submit \
       --class                 org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
       --conf                  spark.serializer=org.apache.spark.serializer.KryoSerializer \
       --conf                  spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension  \
       --conf                  spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
       --conf                  spark.sql.hive.convertMetastoreParquet=false \
       --conf                  spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
       --master                yarn \
       --deploy-mode           cluster \
       --executor-memory       1g \
        /usr/lib/hudi/hudi-utilities-bundle.jar \
       --table-type            COPY_ON_WRITE \
       --op                    UPSERT \
       --enable-sync \
       --source-ordering-field replicadmstimestamp  \
       --source-class          org.apache.hudi.utilities.sources.ParquetDFSSource \
       --target-base-path      s3://delta-streamer-demo-hudi/raw/public/sales \
       --target-table          invoice \
       --payload-class         org.apache.hudi.common.model.AWSDmsAvroPayload \
       --hoodie-conf           hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
       --hoodie-conf           hoodie.datasource.write.recordkey.field=invoiceid \
       --hoodie-conf           hoodie.datasource.write.partitionpath.field=destinationstate \
       --hoodie-conf           hoodie.deltastreamer.source.dfs.root=s3://delta-streamer-demo-hudi/raw/public/sales \
       --hoodie-conf           hoodie.datasource.write.precombine.field=replicadmstimestamp \
       --hoodie-conf           hoodie.database.name=hudidb_raw  \
       --hoodie-conf           hoodie.datasource.hive_sync.enable=true \
       --hoodie-conf           hoodie.datasource.hive_sync.database=hudidb_raw \
       --hoodie-conf           hoodie.datasource.hive_sync.table=tbl_invoices \
       --hoodie-conf           hoodie.datasource.hive_sync.partition_fields=destinationstate
   ```




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1491027553

   > Do you mind sharing what was the issue? @soumilshah1995
   
   Sure, I think I was missing some configs.




[GitHub] [hudi] pratyakshsharma commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "pratyakshsharma (via GitHub)" <gi...@apache.org>.
pratyakshsharma commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1488080257

   Do you see any exceptions in the logs? Also, are you writing to this base path for the first time? Is your .hoodie folder empty? @soumilshah1995
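   
   A quick way to check, assuming the base path from the command above (an empty timeline means no commits have completed yet):
   ```
   # Sketch: list the Hudi timeline; completed writes show up as *.commit files
   aws s3 ls s3://sql-server-dms-demo/hudi/public/sales/.hoodie/
   ```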




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1487766708

   Does this look okay to you, @bvaradar @yihua?
   ```
     spark-submit
       --master yarn
       --deploy-mode cluster
       --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
       --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
       --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
       --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar
       --table-type COPY_ON_WRITE
       --source-ordering-field replicadmstimestamp
       --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
       --target-base-path s3://sql-server-dms-demo/hudi/public/sales
       --target-table invoice
       --payload-class org.apache.hudi.common.model.AWSDmsAvroPayload
       --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
       --hoodie-conf hoodie.datasource.write.recordkey.field=invoiceid
       --hoodie-conf hoodie.datasource.write.partitionpath.field=destinationstate
       --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://sql-server-dms-demo/raw/public/sales/
   ```




[GitHub] [hudi] yihua commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1487764451

   Hi @soumilshah1995 The bottom exception complains that `Property hoodie.datasource.write.partitionpath.field not found`.  You need to specify the partition path field with `--hoodie-conf hoodie.datasource.write.partitionpath.field=<partition_column>`.
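   
   For this sales table, a minimal sketch of the key-generation flags (column names are taken from the sample data; adjust them to your schema):
   ```
   # Sketch: key-generation flags to add to the spark-submit above
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=invoiceid \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=destinationstate \
   ```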




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1488494792

   @yihua @pratyakshsharma 
   First of all, I want to thank you for taking time out of your busy schedule to assist me with this matter. 
   Yes, I cancelled the job to avoid charges, since I am running this lab from my AWS account; the job had been in the running state for the past 10 minutes and I could not see any base files. Yes, I am writing to the base path for the first time.
   
   If you can tell me what I am missing or how to resolve the error, that would be great. Please let me know if you need any further details; I am happy to provide them. Attaching a screenshot of S3 which shows the hudi folder was created but no base files were created.
   ![image](https://user-images.githubusercontent.com/39345855/228532580-2ef831a8-4e90-4e87-9abf-2dda2f9942e2.png)
   
   
   Links to the Parquet files can be found at https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link
   
   
   
   




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1491069489

   You are right, I was missing that.
   
   




[GitHub] [hudi] yihua commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "yihua (via GitHub)" <gi...@apache.org>.
yihua commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1491068636

   @soumilshah1995 Just got to this again.  So it looks like the preCombine field config (`hoodie.datasource.write.precombine.field`) was missing in your original configs, causing the job failure.  If you have the driver logs, you should see an exception because of this.
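   
   For this table that would look something like the following (sketch, reusing the DMS timestamp column already used as the source-ordering field):
   ```
   # Sketch: flags to add to the spark-submit, using the same column for both
   --source-ordering-field replicadmstimestamp \
   --hoodie-conf hoodie.datasource.write.precombine.field=replicadmstimestamp \
   ```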




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1487781478

   @yihua @bvaradar 
   I just fired the job:
   ![image](https://user-images.githubusercontent.com/39345855/228396434-8370a708-6765-4fa9-b9e5-e2fd70b1787b.png)
   
   As I was describing, even after 10 minutes there are still only two files on S3 and no generated base files. I do, however, see the metadata folder that was created by Hudi.
   
   ![image](https://user-images.githubusercontent.com/39345855/228396600-3c9fdf30-a069-490a-8d28-313a8d271f56.png)
   
   




[GitHub] [hudi] soumilshah1995 closed issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 closed issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video 
URL: https://github.com/apache/hudi/issues/8309




[GitHub] [hudi] pratyakshsharma commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "pratyakshsharma (via GitHub)" <gi...@apache.org>.
pratyakshsharma commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1489780346

   Do you mind sharing what was the issue? @soumilshah1995 




[GitHub] [hudi] bvaradar commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "bvaradar (via GitHub)" <gi...@apache.org>.
bvaradar commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1487761185

   @soumilshah1995: This looks like the correct package is not added to the classpath. Can you check whether org.apache.hudi.keygen.SimpleKeyGenerator is present in the jar being passed?
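   
   One way to verify, assuming the bundle path used in the spark-submit command:
   ```
   # Sketch: check that the key generator class is packaged in the bundle
   jar tf /usr/lib/hudi/hudi-utilities-bundle.jar | grep SimpleKeyGenerator
   # or, if the JDK's jar tool is not on the PATH:
   unzip -l /usr/lib/hudi/hudi-utilities-bundle.jar | grep SimpleKeyGenerator
   ```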




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1489442038

   ```
     spark-submit \
       --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
       --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
       --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension  \
       --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
       --conf spark.sql.hive.convertMetastoreParquet=false \
       --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
       --master yarn \
       --deploy-mode cluster \
       --executor-memory 1g \
       --driver-memory 2g \
        /usr/lib/hudi/hudi-utilities-bundle.jar \
       --table-type COPY_ON_WRITE \
       --op UPSERT \
       --source-ordering-field replicadmstimestamp  \
       --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
       --target-base-path s3://sql-server-dms-demo/hudi/public/sales \
       --target-table invoice \
       --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
       --hoodie-conf hoodie.datasource.write.recordkey.field=invoiceid \
       --hoodie-conf hoodie.datasource.write.partitionpath.field=destinationstate \
       --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://sql-server-dms-demo/raw/public/sales \
       --hoodie-conf  hoodie.datasource.write.precombine.field=replicadmstimestamp
   ```




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1489441867

   The issue has been resolved. Daniel Ford was very kind to help on Slack; I will post a detailed video about Delta Streamer on my YouTube channel.




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1491738206

   
   # Project: Using Apache Hudi DeltaStreamer and AWS DMS Hands-on Labs
   ![image](https://user-images.githubusercontent.com/39345855/228927370-f7264d4a-f026-4014-9df4-b063f000f377.png)
   
   
   
   ------------------------------------------------------------------
   Video Tutorials 
   * Part 1: Project Overview : https://www.youtube.com/watch?v=D9L0NLSqC1s
   * Part 2: Aurora Setup : https://youtu.be/HR5A6iGb4LE
   * Part 3: https://youtu.be/rnyj5gkIPKA
   * Part 4: https://youtu.be/J1xvPIcDIaQ
   
   ------------------------------------------------------------------
   # Steps 
   ### Step 1: Create the Aurora source database and update the settings to enable CDC on Postgres
   * Read more: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
   * Create a new parameter group and make sure to edit these two settings, as shown in video part 2:
   ```
   rds.logical_replication  1
   wal_sender_timeout   300000
   ```
   Once done, apply the parameter group to the database and reboot it.
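   
   To confirm the parameters took effect after the reboot, a sketch using psql (the endpoint is a placeholder for your Aurora writer endpoint):
   ```
   # Sketch: verify the CDC-related parameters after the reboot
   psql -h <aurora-writer-endpoint> -U postgres -d postgres \
        -c "SHOW rds.logical_replication;" \
        -c "SHOW wal_sender_timeout;"
   ```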
   
   
   ### Step 2: Run the Python file to create a table called sales in the public schema in Aurora and populate it with some data
   
   Run `python run.py`:
   ```
   
   try:
       import os
       import logging
   
       from functools import wraps
       from abc import ABC, abstractmethod
       from enum import Enum
       from logging import StreamHandler
   
       import uuid
       from datetime import datetime, timezone
       from random import randint
       import datetime
   
       import sqlalchemy as db
       from faker import Faker
       import random
       import psycopg2
       import psycopg2.extras as extras
       from dotenv import load_dotenv
   
       load_dotenv(".env")
   except Exception as e:
       raise Exception("Error: {} ".format(e))
   
   
   class Logging:
       """
       This class is used for logging data to datadog an to the console.
       """
   
       def __init__(self, service_name, ddsource, logger_name="demoapp"):
   
           self.service_name = service_name
           self.ddsource = ddsource
           self.logger_name = logger_name
   
           format = "[%(asctime)s] %(name)s %(levelname)s %(message)s"
           self.logger = logging.getLogger(self.logger_name)
           formatter = logging.Formatter(format, )
   
           if logging.getLogger().hasHandlers():
               logging.getLogger().setLevel(logging.INFO)
           else:
               logging.basicConfig(level=logging.INFO)
   
   
   global logger
   logger = Logging(service_name="database-common-module", ddsource="database-common-module",
                    logger_name="database-common-module")
   
   
   def error_handling_with_logging(argument=None):
       def real_decorator(function):
           @wraps(function)
           def wrapper(self, *args, **kwargs):
               function_name = function.__name__
               response = None
               try:
                   if kwargs == {}:
                       response = function(self)
                   else:
                       response = function(self, **kwargs)
               except Exception as e:
                   response = {
                       "status": -1,
                       "error": {"message": str(e), "function_name": function.__name__},
                   }
                   logger.logger.info(response)
               return response
   
           return wrapper
   
       return real_decorator
   
   
   class DatabaseInterface(ABC):
       @abstractmethod
       def get_data(self, query):
           """
           For given query fetch the data
           :param query: Str
           :return: Dict
           """
   
       def execute_non_query(self, query):
           """
           Inserts data into SQL Server
           :param query:  Str
           :return: Dict
           """
   
       def insert_many(self, query, data):
           """
           Insert Many items into database
           :param query: str
           :param data: tuple
           :return: Dict
           """
   
       def get_data_batch(self, batch_size=10, query=""):
           """
           Gets data into batches
           :param batch_size: INT
           :param query: STR
           :return: DICT
           """
   
       def get_table(self, table_name=""):
           """
           Gets the table from database
           :param table_name: STR
           :return: OBJECT
           """
   
   
   class Settings(object):
       """settings class"""
   
       def __init__(
               self,
               port="",
               server="",
               username="",
               password="",
               timeout=100,
               database_name="",
               connection_string="",
               collection_name="",
               **kwargs,
       ):
           self.port = port
           self.server = server
           self.username = username
           self.password = password
           self.timeout = timeout
           self.database_name = database_name
           self.connection_string = connection_string
           self.collection_name = collection_name
   
   
   class DatabaseAurora(DatabaseInterface):
       """Aurora database class"""
   
       def __init__(self, data_base_settings):
           self.data_base_settings = data_base_settings
           self.client = db.create_engine(
               "postgresql://{username}:{password}@{server}:{port}/{database}".format(
                   username=self.data_base_settings.username,
                   password=self.data_base_settings.password,
                   server=self.data_base_settings.server,
                   port=self.data_base_settings.port,
                   database=self.data_base_settings.database_name
               )
           )
           self.metadata = db.MetaData()
           logger.logger.info("Auroradb connection established successfully.")
   
       @error_handling_with_logging()
       def get_data(self, query):
           self.query = query
           cursor = self.client.connect()
           response = cursor.execute(self.query)
           result = response.fetchall()
           columns = response.keys()._keys
           data = [dict(zip(columns, item)) for item in result]
           cursor.close()
           return {"statusCode": 200, "data": data}
   
       @error_handling_with_logging()
       def execute_non_query(self, query):
           self.query = query
           cursor = self.client.connect()
           cursor.execute(self.query)
           cursor.close()
           return {"statusCode": 200, "data": True}
   
       @error_handling_with_logging()
       def insert_many(self, query, data):
           self.query = query
           print(data)
           cursor = self.client.connect()
           cursor.execute(self.query, data)
           cursor.close()
           return {"statusCode": 200, "data": True}
   
       @error_handling_with_logging()
       def get_data_batch(self, batch_size=10, query=""):
           self.query = query
           cursor = self.client.connect()
           response = cursor.execute(self.query)
           columns = response.keys()._keys
           while True:
               result = response.fetchmany(batch_size)
               if not result:
                   break
               else:
                   items = [dict(zip(columns, data)) for data in result]
                   yield items
   
       @error_handling_with_logging()
       def get_table(self, table_name=""):
           table = db.Table(table_name, self.metadata,
                            autoload=True,
                            autoload_with=self.client)
   
           return {"statusCode": 200, "table": table}
   
   
   class DatabaseAuroraPycopg(DatabaseInterface):
       """Aurora database class"""
   
       def __init__(self, data_base_settings):
           self.data_base_settings = data_base_settings
           self.client = psycopg2.connect(
               host=self.data_base_settings.server,
               port=self.data_base_settings.port,
               database=self.data_base_settings.database_name,
               user=self.data_base_settings.username,
               password=self.data_base_settings.password,
           )
   
       @error_handling_with_logging()
       def get_data(self, query):
           self.query = query
           cursor = self.client.cursor()
           cursor.execute(self.query)
           result = cursor.fetchall()
           columns = [column[0] for column in cursor.description]
           data = [dict(zip(columns, item)) for item in result]
           cursor.close()
           _ = {"statusCode": 200, "data": data}
   
           return _
   
       @error_handling_with_logging()
       def execute(self, query, data):
           self.query = query
           cursor = self.client.cursor()
           cursor.execute(self.query, data)
           self.client.commit()
           cursor.close()
           return {"statusCode": 200, "data": True}
   
       @error_handling_with_logging()
       def get_data_batch(self, batch_size=10, query=""):
           self.query = query
           cursor = self.client.cursor()
           cursor.execute(self.query)
           columns = [column[0] for column in cursor.description]
           while True:
               result = cursor.fetchmany(batch_size)
               if not result:
                   break
               else:
                   items = [dict(zip(columns, data)) for data in result]
                   yield items
   
       @error_handling_with_logging()
       def insert_many(self, query, data):
           self.query = query
           cursor = self.client.cursor()
           extras.execute_batch(cursor, self.query, data)
           self.client.commit()
           cursor.close()
           return {"statusCode": 200, "data": True}
   
   
   class Connector(Enum):
   
       ON_AURORA_PYCOPG = DatabaseAurora(
           data_base_settings=Settings(
               port="5432",
               server="XXXXXXXXX",
               username="postgres",
               password="postgres",
               database_name="postgres",
           )
       )
   
   
   def main():
       helper = Connector.ON_AURORA_PYCOPG.value
       import time
   
       states = ("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN",
                 "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
                 "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA",
                 "WA", "WV", "WI", "WY")
       shipping_types = ("Free", "3-Day", "2-Day")
   
       product_categories = ("Garden", "Kitchen", "Office", "Household")
       referrals = ("Other", "Friend/Colleague", "Repeat Customer", "Online Ad")
   
       try:
           query = """
   CREATE TABLE public.sales (
     invoiceid INTEGER,
     itemid INTEGER,
     category TEXT,
     price NUMERIC(10,2),
     quantity INTEGER,
     orderdate DATE,
     destinationstate TEXT,
     shippingtype TEXT,
     referral TEXT
   );
           """
           helper.execute_non_query(query=query,)
           time.sleep(2)
       except Exception as e:
           print("Error",e)
   
       try:
           query = """
                ALTER TABLE public.sales REPLICA IDENTITY FULL
           """
            helper.execute_non_query(query=query)
           time.sleep(2)
       except Exception as e:
           pass
   
       for i in range(0, 100):
   
           item_id = random.randint(1, 100)
           state = states[random.randint(0, len(states) - 1)]
           shipping_type = shipping_types[random.randint(0, len(shipping_types) - 1)]
           product_category = product_categories[random.randint(0, len(product_categories) - 1)]
           quantity = random.randint(1, 4)
           referral = referrals[random.randint(0, len(referrals) - 1)]
           price = random.randint(1, 100)
           order_date = datetime.date(2016, random.randint(1, 12), random.randint(1, 28)).isoformat()
           invoiceid = random.randint(1, 20000)
   
           data_order = (invoiceid, item_id, product_category, price, quantity, order_date, state, shipping_type, referral)
   
           query = """INSERT INTO public.sales
                                               (
                                               invoiceid, itemid, category, price, quantity, orderdate, destinationstate,shippingtype, referral
                                               )
                                           VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"""
   
           helper.insert_many(query=query, data=data_order)
   
   
   main()
   ```
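   
   A quick sanity check after the script finishes, assuming the same placeholder endpoint and credentials as above:
   ```
   # Sketch: confirm the table was created and rows were inserted
   psql -h <aurora-writer-endpoint> -U postgres -d postgres \
        -c "SELECT COUNT(*) FROM public.sales;"
   ```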
   
   
   ### Step 3: Create a DMS replication instance, an S3 bucket, and IAM roles (refer to video part 3)
   
   ### Create source and target endpoints in DMS
   * When you create the target endpoint, add the following settings:
   ```
   
   {
       "CsvRowDelimiter": "\\n",
       "CsvDelimiter": ",",
       "BucketFolder": "raw",
       "BucketName": "XXXXXXXXXXXXX",
       "CompressionType": "NONE",
       "DataFormat": "parquet",
       "EnableStatistics": true,
       "IncludeOpForFullLoad": true,
       "TimestampColumnName": "replicadmstimestamp",
       "DatePartitionEnabled": false
   }
   ```
   #### Note: also add the following in the Extra connection attributes
   ![image](https://user-images.githubusercontent.com/39345855/228972148-10726c19-678b-4d77-a607-77fd7eebb105.png)
   ```
   parquetVersion=PARQUET_2_0;
   ```
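   
   If you prefer scripting over the console, the same target endpoint can be created roughly like this (sketch; the identifier, role ARN, and bucket name are placeholders):
   ```
   # Sketch: create the S3 target endpoint with the settings above via the AWS CLI
   aws dms create-endpoint \
       --endpoint-identifier sales-s3-target \
       --endpoint-type target \
       --engine-name s3 \
       --extra-connection-attributes "parquetVersion=PARQUET_2_0;" \
       --s3-settings '{"ServiceAccessRoleArn":"<dms-s3-role-arn>","BucketName":"<bucket>","BucketFolder":"raw","DataFormat":"parquet","IncludeOpForFullLoad":true,"TimestampColumnName":"replicadmstimestamp","EnableStatistics":true,"DatePartitionEnabled":false}'
   ```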
   
   ### Step 4: Create the task and add the following table mappings
   ```
   {
       "rules": [
           {
               "rule-type": "selection",
               "rule-id": "861743510",
               "rule-name": "861743510",
               "object-locator": {
                   "schema-name": "public",
                   "table-name": "sales"
               },
               "rule-action": "include",
               "filters": []
           }
       ]
   }
   ```
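   
   The console works fine for this, but for reference the equivalent CLI call looks roughly like the following (sketch; the ARNs are placeholders and the mapping JSON above is saved as table-mappings.json):
   ```
   # Sketch: create a full-load-and-CDC task with the table mapping above
   aws dms create-replication-task \
       --replication-task-identifier sales-to-s3 \
       --source-endpoint-arn <source-endpoint-arn> \
       --target-endpoint-arn <target-endpoint-arn> \
       --replication-instance-arn <replication-instance-arn> \
       --migration-type full-load-and-cdc \
       --table-mappings file://table-mappings.json
   ```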
   
   
   # Create the EMR cluster and fire the Delta Streamer job
   * Note: you can download the sample Parquet files from https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link
   * These are sample data files generated by DMS; for learning purposes you can copy them directly into your S3 bucket (a copy-command sketch follows below).
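   
   Staging the downloaded samples is a one-liner (sketch; the local folder name is whatever you downloaded the files into, and the bucket matches the configs below):
   ```
   # Sketch: copy the sample DMS output under the raw prefix the Delta Streamer reads from
   aws s3 cp ./sales-sample-parquet/ s3://delta-streamer-demo-hudi/raw/public/sales/ --recursive
   ```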
   
   ```
     spark-submit \
       --class                 org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
       --conf                  spark.serializer=org.apache.spark.serializer.KryoSerializer \
       --conf                  spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension  \
       --conf                  spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
       --conf                  spark.sql.hive.convertMetastoreParquet=false \
       --conf                  spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
       --master                yarn \
       --deploy-mode           cluster \
       --executor-memory       1g \
        /usr/lib/hudi/hudi-utilities-bundle.jar \
       --table-type            COPY_ON_WRITE \
       --op                    UPSERT \
       --enable-sync \
       --source-ordering-field replicadmstimestamp  \
       --source-class          org.apache.hudi.utilities.sources.ParquetDFSSource \
       --target-base-path      s3://delta-streamer-demo-hudi/raw/public/sales \
       --target-table          invoice \
       --payload-class         org.apache.hudi.common.model.AWSDmsAvroPayload \
       --hoodie-conf           hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
       --hoodie-conf           hoodie.datasource.write.recordkey.field=invoiceid \
       --hoodie-conf           hoodie.datasource.write.partitionpath.field=destinationstate \
       --hoodie-conf           hoodie.deltastreamer.source.dfs.root=s3://delta-streamer-demo-hudi/raw/public/sales \
       --hoodie-conf           hoodie.datasource.write.precombine.field=replicadmstimestamp \
       --hoodie-conf           hoodie.database.name=hudidb_raw  \
       --hoodie-conf           hoodie.datasource.hive_sync.enable=true \
       --hoodie-conf           hoodie.datasource.hive_sync.database=hudidb_raw \
       --hoodie-conf           hoodie.datasource.hive_sync.table=tbl_invoices \
       --hoodie-conf           hoodie.datasource.hive_sync.partition_fields=destinationstate
   ```
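   
   Once the job finishes, a couple of quick checks (sketch; paths and table names follow the configs above, and the last query assumes the Glue/Hive sync succeeded):
   ```
   # Sketch: completed commits and base files under the target path
   aws s3 ls s3://delta-streamer-demo-hudi/raw/public/sales/.hoodie/
   aws s3 ls s3://delta-streamer-demo-hudi/raw/public/sales/ --recursive | grep parquet
   
   # Sketch: query the synced table from spark-sql on the EMR primary node
   spark-sql \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
     -e "SELECT invoiceid, destinationstate, price FROM hudidb_raw.tbl_invoices LIMIT 10"
   ```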
   




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for YouTube Content for Community

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1487702678

   @n3nash @bvaradar  
   
   




[GitHub] [hudi] soumilshah1995 commented on issue #8309: [SUPPORT] Need Assistance with Hudi Delta Streamer for Community Video

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8309:
URL: https://github.com/apache/hudi/issues/8309#issuecomment-1487770185

   Please correct me if I'm doing something incorrectly. I've attached the sample files for your reference. I see Hudi folders being created, but I don't see any base files (Parquet files) being created. Any idea why?

