Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/05/26 15:49:21 UTC
[GitHub] [hudi] PavelPetukhov edited a comment on issue #2959: No data stored after migrating to Hudi 0.8.0

PavelPetukhov edited a comment on issue #2959:
URL: https://github.com/apache/hudi/issues/2959#issuecomment-848885930


   The .hoodie directory structure is as follows:
   hdfs dfs -ls /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie
   Found 7 items
   drwxr-xr-x   - hdfs hadoop          0 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/.aux
   drwxr-xr-x   - hdfs hadoop          0 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/.temp
   -rw-r--r--   3 hdfs hadoop       1201 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/20210526183328.deltacommit
   -rw-r--r--   3 hdfs hadoop        518 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/20210526183328.deltacommit.inflight
   -rw-r--r--   3 hdfs hadoop          0 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/20210526183328.deltacommit.requested
   drwxr-xr-x   - hdfs hadoop          0 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/archived
   -rw-r--r--   3 hdfs hadoop        391 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/hoodie.properties
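
For readers less familiar with the timeline layout, the listing above can be summarized with a small script. This is a minimal sketch, assuming the 0.8.0 naming convention `<instant>.<action>[.<state>]`, where a file with no state suffix marks a completed instant. The function name `instant_states` and the hard-coded `listing` are hypothetical and only mirror the file names shown above; this is not part of Hudi.

```python
from collections import defaultdict

# Assumed ordering of timeline states, earliest to latest.
STATE_ORDER = ["requested", "inflight", "completed"]


def instant_states(file_names):
    """Return the furthest state reached by each (instant, action) pair."""
    seen = defaultdict(set)
    for name in file_names:
        parts = name.split(".")
        # Skip directories (.aux, .temp, archived) and hoodie.properties:
        # timeline files start with a numeric instant time.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        instant, action = parts[0], parts[1]
        # Assumption: no suffix after the action means the instant completed.
        state = parts[2] if len(parts) > 2 else "completed"
        seen[(instant, action)].add(state)
    return {key: max(states, key=STATE_ORDER.index) for key, states in seen.items()}


# File names copied from the hdfs listing above.
listing = [
    ".aux",
    ".temp",
    "20210526183328.deltacommit",
    "20210526183328.deltacommit.inflight",
    "20210526183328.deltacommit.requested",
    "archived",
    "hoodie.properties",
]

if __name__ == "__main__":
    for (instant, action), state in instant_states(listing).items():
        print(instant, action, state)
```

Under that assumption, the only instant in the listing (20210526183328) has reached the completed state, which suggests the delta commit itself finished even though no data files are visible.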
   
   
   I have also removed every unrelated option, so the command now looks like this:
   
   /usr/local/spark/bin/spark-submit --conf "spark.yarn.submit.waitAppCompletion=false" \
   --conf "spark.dynamicAllocation.minExecutors=1" \
   --conf "spark.dynamicAllocation.maxExecutors=10" \
   --conf "spark.dynamicAllocation.enabled=true" \
   --conf "spark.dynamicAllocation.shuffleTracking.enabled=true" \
   --conf "spark.shuffle.service.enabled=true" \
   --conf "spark.eventLog.enabled=true" \
   --conf "spark.eventLog.dir=hdfs://xxx/eventLogging" \
   --conf "spark.executor.memoryOverhead=384" \
   --conf "spark.driver.memoryOverhead=384" \
   --conf "spark.driver.extraJavaOptions=-DsparkAappName=xxx -DlogIndex=GOLANG_JSON -DappName=data-lake-extractors-streamer -DlogFacility=stdout" \
   --packages org.apache.spark:spark-avro_2.12:2.4.7 \
   --master yarn \
   --deploy-mode cluster \
   --name xxx \
   --driver-memory 2G \
   --executor-memory 2G \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   hdfs://xxx/user/hudi/hudi-utilities-bundle_2.12-0.8.0.jar \
   --op UPSERT \
   --table-type MERGE_ON_READ \
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
   --source-ordering-field __null_ts_ms \
   --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
   --target-base-path /user/hdfs/raw_data/public/xxx/yyy \
   --target-table xxx \
   --hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-ddTHH:mm:ssZ,yyyy-MM-ddTHH:mm:ss.SSSZ" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.timezone=" \
   --hoodie-conf "hoodie.upsert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.insert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.delete.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.bulkinsert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.embed.timeline.server=true" \
   --hoodie-conf "hoodie.filesystem.view.type=EMBEDDED_KV_STORE" \
   --hoodie-conf "hoodie.deltastreamer.schemaprovider.registry.url=http://xxx/subjects/xxx-value/versions/latest" \
   --hoodie-conf "bootstrap.servers=xxx" \
   --hoodie-conf "auto.offset.reset=earliest" \
   --hoodie-conf "group.id=hudi_group" \
   --hoodie-conf "schema.registry.url=http://xxx" \
   --hoodie-conf "hoodie.datasource.write.recordkey.field=id" \
   --hoodie-conf "hoodie.datasource.write.partitionpath.field=date:TIMESTAMP" \
   --hoodie-conf "hoodie.deltastreamer.source.kafka.topic=xxx"
   

