Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/05/26 15:49:21 UTC
[GitHub] [hudi] PavelPetukhov edited a comment on issue #2959: No data stored after migrating to Hudi 0.8.0
PavelPetukhov edited a comment on issue #2959:
URL: https://github.com/apache/hudi/issues/2959#issuecomment-848885930
The .hoodie directory structure is as follows:
hdfs dfs -ls /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie
Found 7 items
drwxr-xr-x - hdfs hadoop 0 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/.aux
drwxr-xr-x - hdfs hadoop 0 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/.temp
-rw-r--r-- 3 hdfs hadoop 1201 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/20210526183328.deltacommit
-rw-r--r-- 3 hdfs hadoop 518 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/20210526183328.deltacommit.inflight
-rw-r--r-- 3 hdfs hadoop 0 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/20210526183328.deltacommit.requested
drwxr-xr-x - hdfs hadoop 0 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/archived
-rw-r--r-- 3 hdfs hadoop 391 2021-05-26 18:33 /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie/hoodie.properties
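To make the listing easier to read: as far as I understand the Hudi timeline layout, the suffix of each file under .hoodie encodes the state of an instant (`.requested`, then `.inflight`, then the bare action name once it has completed). Below is a minimal sketch that classifies the file names from the listing on that assumption. `instant_state` is a hypothetical helper written for illustration; it is not part of Hudi or its CLI.

```shell
#!/usr/bin/env bash
# Classify a Hudi timeline file name by the state its suffix encodes.
# Assumed convention (not taken from the Hudi source):
#   <instant>.<action>.requested -> requested
#   <instant>.<action>.inflight  -> inflight
#   <instant>.<action>           -> completed
instant_state() {
  case "$1" in
    *.requested)            echo "requested" ;;
    *.inflight)             echo "inflight" ;;
    *.commit|*.deltacommit) echo "completed" ;;
    *)                      echo "other" ;;
  esac
}

# File names copied from the listing above.
for f in 20210526183328.deltacommit \
         20210526183328.deltacommit.inflight \
         20210526183328.deltacommit.requested \
         hoodie.properties; do
  printf '%s -> %s\n' "$f" "$(instant_state "$f")"
done
```

Under that assumption, 20210526183328.deltacommit is classified as completed, which suggests the delta commit itself went through and the timeline is not stuck in an inflight state.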
Also, I have removed everything unrelated, so the spark-submit command now looks like this:
/usr/local/spark/bin/spark-submit --conf "spark.yarn.submit.waitAppCompletion=false" \
--conf "spark.dynamicAllocation.minExecutors=1" \
--conf "spark.dynamicAllocation.maxExecutors=10" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.dynamicAllocation.shuffleTracking.enabled=true" \
--conf "spark.shuffle.service.enabled=true" \
--conf "spark.eventLog.enabled=true" \
--conf "spark.eventLog.dir=hdfs://xxx/eventLogging" \
--conf "spark.executor.memoryOverhead=384" \
--conf "spark.driver.memoryOverhead=384" \
--conf "spark.driver.extraJavaOptions=-DsparkAappName=xxx -DlogIndex=GOLANG_JSON -DappName=data-lake-extractors-streamer -DlogFacility=stdout" \
--packages org.apache.spark:spark-avro_2.12:2.4.7 \
--master yarn \
--deploy-mode cluster \
--name xxx \
--driver-memory 2G \
--executor-memory 2G \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
hdfs://xxx/user/hudi/hudi-utilities-bundle_2.12-0.8.0.jar \
--op UPSERT \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field __null_ts_ms \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--target-base-path /user/hdfs/raw_data/public/xxx/yyy \
--target-table xxx \
--hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-ddTHH:mm:ssZ,yyyy-MM-ddTHH:mm:ss.SSSZ" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.timezone=" \
--hoodie-conf "hoodie.upsert.shuffle.parallelism=2" \
--hoodie-conf "hoodie.insert.shuffle.parallelism=2" \
--hoodie-conf "hoodie.delete.shuffle.parallelism=2" \
--hoodie-conf "hoodie.bulkinsert.shuffle.parallelism=2" \
--hoodie-conf "hoodie.embed.timeline.server=true" \
--hoodie-conf "hoodie.filesystem.view.type=EMBEDDED_KV_STORE" \
--hoodie-conf "hoodie.deltastreamer.schemaprovider.registry.url=http://xxx/subjects/xxx-value/versions/latest" \
--hoodie-conf "bootstrap.servers=xxx" \
--hoodie-conf "auto.offset.reset=earliest" \
--hoodie-conf "group.id=hudi_group" \
--hoodie-conf "schema.registry.url=http://xxx" \
--hoodie-conf "hoodie.datasource.write.recordkey.field=id" \
--hoodie-conf "hoodie.datasource.write.partitionpath.field=date:TIMESTAMP" \
--hoodie-conf "hoodie.deltastreamer.source.kafka.topic=xxx"