Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/11/24 01:33:09 UTC
[GitHub] [hudi] giaosudau edited a comment on pull request #2208: [HUDI-1040] Make Hudi support Spark 3

giaosudau edited a comment on pull request #2208:
URL: https://github.com/apache/hudi/pull/2208#issuecomment-732523090


   Hi everyone,

   I am testing the Spark 3 build from https://github.com/apache/hudi/pull/2208.
   Running HoodieDeltaStreamer with SqlQueryBasedTransformer on Debezium data
   crashes the JVM with a SIGSEGV. The command and properties I used are below,
   followed by the crash report.
   ``` 
   spark-submit \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   --driver-memory 2g \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --packages org.apache.spark:spark-avro_2.12:3.0.1 \
   ~/workspace/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.6.1-SNAPSHOT.jar \
   --table-type MERGE_ON_READ \
   --source-ordering-field ts_ms \
   --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
   --target-base-path /Users/users/Downloads/roi/debezium/by_test/ \
   --target-table users \
   --props ./hudi_base.properties \
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer
   
   # Contents of ./hudi_base.properties
   hoodie.upsert.shuffle.parallelism=2
   hoodie.insert.shuffle.parallelism=2
   hoodie.bulkinsert.shuffle.parallelism=2
   # Key fields, for kafka example
   hoodie.datasource.write.storage.type=MERGE_ON_READ
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.partitionpath.field=ts_ms
   hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
   # schema provider configs
   hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/dbz1.by_test.users-value/versions/latest
   # Kafka props
   hoodie.deltastreamer.source.kafka.topic=dbz1.by_test.users
   metadata.broker.list=localhost:9092
   bootstrap.servers=localhost:9092
   auto.offset.reset=earliest
   schema.registry.url=http://localhost:8081
   hoodie.deltastreamer.transformer.sql=SELECT ts_ms, op, after.* FROM <SRC> WHERE op IN ('u', 'c')
   ```
   
   ```
   #
   # A fatal error has been detected by the Java Runtime Environment:
   #
   #  SIGSEGV (0xb) at pc=0x000000010f4cbad0, pid=33960, tid=0x0000000000013e03
   #
   # JRE version: OpenJDK Runtime Environment (8.0_265-b01) (build 1.8.0_265-b01)
   # Java VM: OpenJDK 64-Bit Server VM (25.265-b01 mixed mode bsd-amd64 compressed oops)
   # Problematic frame:
   # V  [libjvm.dylib+0xcbad0]
   ```
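
To make the intent of the transformer query above concrete, here is a rough plain-Python sketch of what it is meant to do: keep only create (`c`) and update (`u`) events and flatten the `after` struct into top-level columns. This is not Hudi or Spark code. The `transform` helper and the sample events are made up for illustration, and real Debezium records carry more fields than shown.

```python
# Plain-Python sketch of the intended effect of:
#   SELECT ts_ms, op, after.* FROM <SRC> WHERE op IN ('u', 'c')
# The helper name and sample events are hypothetical, for illustration only.

def transform(events):
    """Keep creates ('c') and updates ('u'), then flatten the `after` struct."""
    rows = []
    for event in events:
        if event["op"] in ("u", "c"):
            row = {"ts_ms": event["ts_ms"], "op": event["op"]}
            row.update(event["after"])  # after.* expands to the row's columns
            rows.append(row)
    return rows


sample_events = [
    {"ts_ms": 1606181589000, "op": "c", "after": {"id": 1, "name": "alice"}},
    {"ts_ms": 1606181590000, "op": "u", "after": {"id": 1, "name": "alice2"}},
    {"ts_ms": 1606181591000, "op": "d", "after": None},  # delete: filtered out
]

for row in transform(sample_events):
    print(row)
```

In this sketch the delete event is dropped, so only the two rows with `id` 1 reach the writer, which matches the `recordkey.field=id` setting in the properties.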


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org