You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/01 23:11:16 UTC

[GitHub] [hudi] fzong76 opened a new issue, #5735: No hudi dataset was saved to s3

fzong76 opened a new issue, #5735:
URL: https://github.com/apache/hudi/issues/5735

   I am using deltastreamer to load files uploaded to s3 bucket. Everything works fine with `--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer`, but failed with `--class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer` when trying to load multiple tables.
   
   Here is part of the log. Seems it processed the data, but I don't see any dataset created.
   
   ```
   22/06/01 22:57:29 INFO TaskSetManager: Starting task 197.0 in stage 23.0 (TID 1802) (ip-172-31-20-205.ec2.internal, executor 2, partition 197, PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 189.0 in stage 23.0 (TID 1794) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (190/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 190.0 in stage 23.0 (TID 1795) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (191/200)
   22/06/01 22:57:29 INFO TaskSetManager: Starting task 198.0 in stage 23.0 (TID 1803) (ip-172-31-20-205.ec2.internal, executor 2, partition 198, PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
   22/06/01 22:57:29 INFO TaskSetManager: Starting task 199.0 in stage 23.0 (TID 1804) (ip-172-31-30-20.ec2.internal, executor 1, partition 199, PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 192.0 in stage 23.0 (TID 1797) in 7 ms on ip-172-31-30-20.ec2.internal (executor 1) (192/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 191.0 in stage 23.0 (TID 1796) in 7 ms on ip-172-31-30-20.ec2.internal (executor 1) (193/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 196.0 in stage 23.0 (TID 1801) in 4 ms on ip-172-31-30-20.ec2.internal (executor 1) (194/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 195.0 in stage 23.0 (TID 1800) in 5 ms on ip-172-31-30-20.ec2.internal (executor 1) (195/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 194.0 in stage 23.0 (TID 1799) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (196/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 199.0 in stage 23.0 (TID 1804) in 3 ms on ip-172-31-30-20.ec2.internal (executor 1) (197/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 197.0 in stage 23.0 (TID 1802) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (198/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 198.0 in stage 23.0 (TID 1803) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (199/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 193.0 in stage 23.0 (TID 1798) in 11 ms on ip-172-31-20-205.ec2.internal (executor 2) (200/200)
   22/06/01 22:57:29 INFO YarnScheduler: Removed TaskSet 23.0, whose tasks have all completed, from pool
   22/06/01 22:57:29 INFO DAGScheduler: ResultStage 23 (collect at SparkRejectUpdateStrategy.java:52) finished in 0.221 s
   22/06/01 22:57:29 INFO DAGScheduler: Job 8 is finished. Cancelling potential speculative or zombie tasks for this job
   22/06/01 22:57:29 INFO YarnScheduler: Killing all running tasks in stage 23: Stage finished
   22/06/01 22:57:29 INFO DAGScheduler: Job 8 finished: collect at SparkRejectUpdateStrategy.java:52, took 0.468092 s
   22/06/01 22:57:29 INFO SparkContext: Starting job: sum at DeltaSync.java:557
   22/06/01 22:57:29 INFO DAGScheduler: Job 9 finished: sum at DeltaSync.java:557, took 0.000174 s
   22/06/01 22:57:29 INFO SparkContext: Starting job: sum at DeltaSync.java:558
   22/06/01 22:57:29 INFO DAGScheduler: Job 10 finished: sum at DeltaSync.java:558, took 0.000157 s
   22/06/01 22:57:29 INFO SparkContext: Starting job: collect at SparkRDDWriteClient.java:124
   22/06/01 22:57:29 INFO DAGScheduler: Job 11 finished: collect at SparkRDDWriteClient.java:124, took 0.000179 s
   22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:29 INFO MultipartUploadOutputStream: close closed:false s3://fei-hudi-demo/s3_hudi_table/.hoodie/20220601225723633.commit
   22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:30 INFO MapPartitionsRDD: Removing RDD 73 from persistence list
   22/06/01 22:57:30 INFO BlockManager: Removing RDD 73
   22/06/01 22:57:30 INFO MapPartitionsRDD: Removing RDD 60 from persistence list
   22/06/01 22:57:30 INFO BlockManager: Removing RDD 60
   22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_hudi_table/.hoodie/20220601225723633.commit' for reading
   22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 's3://fei-hudi-demo/s3_meta_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:30 WARN S3EventsHoodieIncrSource: Already caught up. Begin Checkpoint was :20220601225531374
   ```
   
   Here is the command I used
   ```
   spark-submit \
   --jars "/usr/lib/spark/external/lib/spark-avro.jar,s3://fei-hudi-demo/aws-java-sdk-sqs-1.12.220.jar" \
   --master yarn --deploy-mode client \
   --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
   --table-type COPY_ON_WRITE \
   --enable-hive-sync \
   --source-ordering-field metricvalue --base-path-prefix s3://fei-hudi-demo/s3_hudi_table \
   --target-table dummy_table --continuous --min-sync-interval-seconds 10 \
   --props s3://fei-hudi-demo/hudi-ingestion-config/source.properties \
   --config-folder s3://fei-hudi-demo/hudi-ingestion-config \
   --source-class org.apache.hudi.utilities.sources.S3EventsHoodieIncrSource 
   ```
   
   source.properties
   ```
   hoodie.deltastreamer.ingestion.tablesToBeIngested=fei_hudi_test.table1
   hoodie.deltastreamer.ingestion.fei_hudi_test.table1.configFile=s3://fei-hudi-demo/hudi-ingestion-config/t1.properties
   
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
   hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
   hoodie.datasource.write.hive_style_partitioning=true 
   hoodie.datasource.hive_sync.database=fei_hudi_test 
   hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true 
   hoodie.deltastreamer.source.s3incr.key.prefix=""
   ```
   
   t1.properties
   ```
   hoodie.datasource.write.recordkey.field=launchId
   hoodie.datasource.write.partitionpath.field=appPlatform:simple,appVersion:simple
   hoodie.datasource.hive_sync.partition_fields=appPlatform,appVersion
   hoodie.deltastreamer.source.hoodieincr.path=s3://fei-hudi-demo/s3_meta_table
   ```
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] pratyakshsharma commented on issue #5735: No hudi dataset was saved to s3

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on issue #5735:
URL: https://github.com/apache/hudi/issues/5735#issuecomment-1145794222

   @fzong76 Do you see any exception in the logs?
   
   > but failed with --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer when trying to load multiple tables.
   
   I see you are only trying to load a single table `fei_hudi_test.table1` as mentioned in the config file but you mentioned "trying to load multiple tables". Am I missing something here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #5735: No hudi dataset was saved to s3

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #5735:
URL: https://github.com/apache/hudi/issues/5735#issuecomment-1289936131

   @fzong76 : gentle ping. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] pratyakshsharma commented on issue #5735: No hudi dataset was saved to s3

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on issue #5735:
URL: https://github.com/apache/hudi/issues/5735#issuecomment-1216530544

   @fzong76 can you share the error that you see in the logs? The logs that you shared already do not contain any proper error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] xushiyan commented on issue #5735: No hudi dataset was saved to s3

Posted by GitBox <gi...@apache.org>.

xushiyan commented on issue #5735:
URL: https://github.com/apache/hudi/issues/5735#issuecomment-1296317288

   close due to inactivity


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #5735: No hudi dataset was saved to s3

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #5735:
URL: https://github.com/apache/hudi/issues/5735#issuecomment-1216143296

   @pratyakshsharma : can you follow up here please. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #5735: No hudi dataset was saved to s3

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #5735:
URL: https://github.com/apache/hudi/issues/5735#issuecomment-1289936247

   feel free to close if you got the issue resolved. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] xushiyan closed issue #5735: No hudi dataset was saved to s3

Posted by GitBox <gi...@apache.org>.

xushiyan closed issue #5735: No hudi dataset was saved to s3
URL: https://github.com/apache/hudi/issues/5735


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] fzong76 commented on issue #5735: No hudi dataset was saved to s3

Posted by GitBox <gi...@apache.org>.

fzong76 commented on issue #5735:
URL: https://github.com/apache/hudi/issues/5735#issuecomment-1146083416

   Yes. I just started with one table since it's the first time I tried HoodieMultiTableDeltaStreamer. There is no exception in the logs. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org