Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/13 10:15:17 UTC

[GitHub] [hudi] Tandoy opened a new issue #3471: [SUPPORT]Failed to ingest Kafka data using HoodieDeltaStreamer

Tandoy opened a new issue #3471:
URL: https://github.com/apache/hudi/issues/3471


   **Steps to reproduce the behavior:**
   spark-submit --master yarn   \
   --driver-memory 1G \
   --num-executors 2 \
   --executor-memory 1G \
   --executor-cores 4 \
   --deploy-mode cluster \
   --conf spark.yarn.executor.memoryOverhead=512 \
   --conf spark.yarn.driver.memoryOverhead=512 \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls /home/appuser/tangzhi/hudi-0.8/hudi-release-0.8.0/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.8.0.jar` \
   --props file:///opt/apps/hudi/hudi-utilities/src/test/resources/delta-streamer-config/kafka.properties \
   --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
   --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
   --target-base-path hdfs://dxbigdata101:8020/user/hudi/test/data/hudi_test_occ \
   --op UPSERT \
   --target-table hudi_test_occ \
   --table-type COPY_ON_WRITE \
   --source-ordering-field uid \
   --source-limit 5000000
   
   **Expected behavior:**
   HoodieDeltaStreamer should ingest the Kafka topic data and write it into the target Hudi table.
   
   **Environment Description:**
   Hudi version : 0.8
   Spark version : 2.4.0.cloudera2
   Hadoop version : 2.6.0-cdh5.13.3
   Hive version : 1.1.0-cdh5.13.3
   Storage (HDFS/S3/GCS..) : HDFS
   Running on Docker? (yes/no) : no
   
   **kafka.properties**
   hoodie.upsert.shuffle.parallelism=2
   hoodie.insert.shuffle.parallelism=2
   hoodie.bulkinsert.shuffle.parallelism=2
   hoodie.datasource.write.recordkey.field=uid
   hoodie.datasource.write.partitionpath.field=ts
   hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs://dxbigdata101:8020/user/hudi/test/data/schema.avsc
   hoodie.deltastreamer.schemaprovider.target.schema.file=hdfs://dxbigdata101:8020/user/hudi/test/data/schema.avsc
   hoodie.deltastreamer.source.kafka.topic=hudi_test_occ
   group.id=occ
   bootstrap.servers=dxbigdata103:9092
   auto.offset.reset=earliest
   hoodie.parquet.max.file.size=134217728
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
   hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
   hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd HH:mm:ss
   hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
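
   For reference, with this TimestampBasedKeyGenerator configuration a record whose ts field is, say, "2021-08-13 18:10:07" (parsed with the input format yyyy-MM-dd HH:mm:ss) would land under the partition path 2021/08/13 (the output format yyyy/MM/dd).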
   
   **schema:**
   {
     "type":"record",
     "name":"gmall_event",
     "fields":[{
        "name": "area",
        "type": "string"
     }, {
        "name": "uid",
        "type": "long"
     }, {
        "name": "itemid",
        "type": "string"
     },{
        "name": "npgid",
        "type": "string"
     },{
        "name": "evid",
        "type": "string"
     },{
        "name": "os",
        "type": "string"
     },{
        "name": "pgid",
        "type": "string"
     },{
        "name": "appid",
        "type": "string"
     },{
        "name": "mid",
        "type": "string"
     }, {
        "name": "type",
        "type": "string"
     }, {
        "name": "ts",
        "type":"string"
     }
   ]}
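
   For illustration, a hypothetical JSON message matching the schema above can be pushed to the topic with the console producer (broker and topic taken from kafka.properties; the field values are made up):

   echo '{"area":"110000","uid":123456,"itemid":"i001","npgid":"np01","evid":"ev01","os":"android","pgid":"pg01","appid":"app01","mid":"m001","type":"click","ts":"2021-08-13 18:10:07"}' | \
   kafka-console-producer --broker-list dxbigdata103:9092 --topic hudi_test_occ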
   
   **Additional context:**
   The Spark job completed without reporting any error, but the target base path contains only the .hoodie directory; no partition/data directories were created. The topic data can, however, be consumed correctly from the shell.
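
   A quick way to confirm the symptom (paths and broker from the configuration above): list the target base path, which shows only .hoodie, and read the topic from the beginning with the console consumer:

   hdfs dfs -ls hdfs://dxbigdata101:8020/user/hudi/test/data/hudi_test_occ
   kafka-console-consumer --bootstrap-server dxbigdata103:9092 --topic hudi_test_occ --from-beginning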
   
   ![Snipaste_2021-08-13_18-10-07](https://user-images.githubusercontent.com/56899730/129341726-458d3736-a7cb-473d-8aa4-08a8e1cb2b1e.PNG)
   
   ![Snipaste_2021-08-13_18-09-38](https://user-images.githubusercontent.com/56899730/129341736-e5796641-ca69-49fc-868d-acae39068d7c.PNG)
   
   ![Snipaste_2021-08-13_18-12-38](https://user-images.githubusercontent.com/56899730/129342063-f3f62100-fed5-4b31-b1c1-569eabe1ce61.PNG)





[GitHub] [hudi] Tandoy commented on issue #3471: [SUPPORT]Failed to ingest Kafka data using HoodieDeltaStreamer

Posted by GitBox <gi...@apache.org>.
Tandoy commented on issue #3471:
URL: https://github.com/apache/hudi/issues/3471#issuecomment-900116315


   I checked the source code and found that the KafkaOffsetGen class reads the property auto.reset.offsets (defaulting to latest), while the official configuration file uses auto.offset.reset. After renaming the configuration item I was able to ingest the Kafka data from the earliest offset.
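
   A minimal sketch of that fix, assuming the property names quoted above and the kafka.properties path from the spark-submit command:

   # Rename the documented key to the one KafkaOffsetGen actually reads; the value stays 'earliest'.
   sed -i 's/^auto\.offset\.reset=earliest$/auto.reset.offsets=earliest/' \
       /opt/apps/hudi/hudi-utilities/src/test/resources/delta-streamer-config/kafka.properties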





[GitHub] [hudi] Tandoy commented on issue #3471: [SUPPORT]Failed to ingest Kafka data using HoodieDeltaStreamer

Posted by GitBox <gi...@apache.org>.
Tandoy commented on issue #3471:
URL: https://github.com/apache/hudi/issues/3471#issuecomment-898546283


   When I went back to Kafka and produced new data on the topic, HoodieDeltaStreamer picked up the latest Kafka messages, but the pre-existing data was never ingested, even though I had set auto.offset.reset=earliest.





[GitHub] [hudi] giaosudau commented on issue #3471: [SUPPORT]Failed to ingest Kafka data using HoodieDeltaStreamer

Posted by GitBox <gi...@apache.org>.
giaosudau commented on issue #3471:
URL: https://github.com/apache/hudi/issues/3471#issuecomment-898869132


   You should check the .hoodie commit files; the Kafka topic offsets are stored inside them, in the commit metadata. To change the starting point, either pass a new checkpoint to the delta streamer (the --checkpoint option) or edit the offset stored in the commit file.
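
   For example (paths from this issue; the metadata key name is taken from recent Hudi releases and may differ in 0.8, so treat it as an assumption):

   # Show the Kafka checkpoint stored in the commits' extraMetadata.
   hdfs dfs -cat hdfs://dxbigdata101:8020/user/hudi/test/data/hudi_test_occ/.hoodie/*.commit | \
   grep 'deltastreamer.checkpoint'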
   





[GitHub] [hudi] Tandoy closed issue #3471: [SUPPORT]Failed to ingest Kafka data using HoodieDeltaStreamer

Posted by GitBox <gi...@apache.org>.
Tandoy closed issue #3471:
URL: https://github.com/apache/hudi/issues/3471


   

