Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/19 19:37:49 UTC

[GitHub] [hudi] harishraju-govindaraju opened a new issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

harishraju-govindaraju opened a new issue #4641:
URL: https://github.com/apache/hudi/issues/4641


   
   **Describe the problem you faced**
   I started an EMR cluster and am trying to run Hudi's DeltaStreamer. However, I get this error:
   **Failed to load org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.
   java.lang.ClassNotFoundException: org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer**
   
   I was following the steps in this documentation:
   
   https://hudi.apache.org/blog/2021/08/23/s3-events-source
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   
   # To start S3EventsSource
   
   spark-submit \
   --jars "/home/hadoop/hudi-utilities-bundle_2.11-0.9.0.jar,/usr/lib/spark/external/lib/spark-avro.jar,/home/hadoop/aws-java-sdk-sqs-1.12.22.jar" \
   --master yarn --deploy-mode client \
   --class "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer" /home/hadoop/hudi-packages/hudi-utilities-bundle_2.11-0.9.0-SNAPSHOT.jar \
   --table-type COPY_ON_WRITE --source-ordering-field eventTime \
   --target-base-path s3://s3-eip-dev-uea1-hudipoc-001/hudi-trusted/metadata/ \
   --target-table s3_meta_table  --continuous \
   --min-sync-interval-seconds 10 \
   --hoodie-conf hoodie.datasource.write.recordkey.field="s3.object.key,eventName" \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=s3.bucket.name --enable-hive-sync \
   --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
   --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
   --hoodie-conf hoodie.datasource.hive_sync.database=default \
   --hoodie-conf hoodie.datasource.hive_sync.table=s3_meta_table \
   --hoodie-conf hoodie.datasource.hive_sync.partition_fields=bucket \
   --source-class org.apache.hudi.utilities.sources.S3EventsSource \
   --hoodie-conf hoodie.deltastreamer.source.queue.url=https://sqs.us-east-1.amazonaws.com/118897059965/sqshudi
   --hoodie-conf hoodie.deltastreamer.s3.source.queue.region=us-east-1
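
A `ClassNotFoundException` for the main class usually means the application jar handed to spark-submit was not found at that path or does not contain the class. The command above names the utilities bundle twice with different paths (`/home/hadoop/hudi-utilities-bundle_2.11-0.9.0.jar` under `--jars` and `/home/hadoop/hudi-packages/hudi-utilities-bundle_2.11-0.9.0-SNAPSHOT.jar` as the application jar), and the line with the queue URL has no trailing backslash, so the last `--hoodie-conf` would not reach spark-submit if the command were run exactly as pasted. A minimal sketch for checking whether a local jar really contains the class follows; the helper `jar_contains_class` is hypothetical and not part of Hudi or Spark.

```python
import zipfile
from pathlib import Path

# Main class taken from the spark-submit command above.
MAIN_CLASS = "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer"


def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the local jar at jar_path has an entry for class_name.

    Hypothetical helper for illustration; it only inspects local files.
    """
    path = Path(jar_path)
    if not path.is_file():
        # A jar that is not at the given path cannot provide the class.
        return False
    # A jar is a zip archive; class a.b.C is stored as a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    try:
        with zipfile.ZipFile(path) as jar:
            return entry in jar.namelist()
    except zipfile.BadZipFile:
        # The file exists but is not a readable zip/jar archive.
        return False
```

Calling `jar_contains_class(path, MAIN_CLASS)` for each of the two bundle paths would show which one, if either, can serve as the application jar.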
   
   **Expected behavior**
   
   I wanted to run this spark submit successfully.
   
   **Environment Description**
   
   * Hudi version :
   
   * Hive version : 2.3.7
   
   * EMR version : 5.33.1
   
   * Spark version : 2.4.7
   
   * Flink version : 1.12.1
   
   
   * Storage (HDFS/S3/GCS..) :  S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] harishraju-govindaraju commented on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

Posted by GitBox <gi...@apache.org>.

harishraju-govindaraju commented on issue #4641:
URL: https://github.com/apache/hudi/issues/4641#issuecomment-1016880242


   Folks,
   
   I had to change the jar locations to S3 paths, and that got me past the error. However, I am now facing another one. I am running DeltaStreamer for the first time, and I assumed that the first run would create the Hudi table. Instead, I get an error saying the Hoodie table is not found. Does that mean I cannot use DeltaStreamer for initial loads?
   
   Exception in thread "main" org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path s3://ztrusted1/default/hudi-table1/.hoodie
   
   Here is my spark-submit command. Please help.
   
   spark-submit \
   --jars "s3://zcustomjar/spark-avro_2.12-3.1.2.jar,s3://zcustomjar/hudi-spark-bundle_2.11-0.5.3-rc2.jar" \
   --deploy-mode "client" \
   --class "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer" /usr/lib/hudi/hudi-utilities-bundle.jar \
   --table-type COPY_ON_WRITE \
   --source-ordering-field id \
   --target-base-path s3://ztrusted1/default/hudi-table1 --target-table hudi-table1 \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=id \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://zlanding1 \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=compcode \
   --hoodie-conf hoodie.datasource.write.operation=insert
   
   
   


[GitHub] [hudi] harishraju-govindaraju edited a comment on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

Posted by GitBox <gi...@apache.org>.

harishraju-govindaraju edited a comment on issue #4641:
URL: https://github.com/apache/hudi/issues/4641#issuecomment-1017138248


   I tried defining a proper schema but still get the same error. Any help is much appreciated, as we are planning to use DeltaStreamer in production.
   
   Caused by: org.apache.hudi.exception.HoodieIOException: **Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')**
    at [Source: (String)"Objavro.schema�{"type":"record","name":"topLevelRecord","fields":[{"name":"id","type":["string","null"]},{"name":"creation_date","type":["string","null"]},{"name":"last_update_time","type":["string","null"]},{"name":"quantity","type":["string","null"]},{"name":"compcode","type":["string","null"]}]}0org.apache.spark.version"; line: 1, column: 11]
   


[GitHub] [hudi] harishraju-govindaraju edited a comment on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

Posted by GitBox <gi...@apache.org>.

harishraju-govindaraju edited a comment on issue #4641:
URL: https://github.com/apache/hudi/issues/4641#issuecomment-1017122913


   Hello @nsivabalan,
   
   Thanks for promptly responding to my question. 
   
   I cleared the folder and reran the spark-submit command below. The .hoodie folder was created, but the job ended with an error and wrote no data files. 
   
    **_Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
    at [Source: (String)"Objavro.schema�{"type":"record","name":"topLevelRecord","fields":[{"name":"id","type":["string","null"]},{"name":"creation_date","type":["string","null"]},{"name":"last_update_time","type":["string","null"]},{"name":"quantity","type":["string","null"]},{"name":"compcode","type":["string","null"]}]}0org.apache.spark.version"; line: 1, column: 11]_**
   
   spark-submit \
   --jars "s3://zcustomjar/spark-avro_2.11-2.4.4.jar" \
   --deploy-mode "client" \
   --class "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer"  /usr/lib/hudi/hudi-utilities-bundle.jar \
   --schemaprovider-class "org.apache.hudi.utilities.schema.FilebasedSchemaProvider" \
   --table-type COPY_ON_WRITE \
   --source-ordering-field id \
   --target-base-path s3://ztrusted1/default/hudi-table1/ --target-table hudi-table1 \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=id \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://zlanding1/input1/ \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=compcode \
   --hoodie-conf hoodie.datasource.write.operation=insert \
   --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3://zcustomjar/source2.avsc \
   --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=s3://zcustomjar/target.avsc \
   
   
   I created the .avsc schema file manually in Notepad. Not sure if that is a problem. 
   
   {
     "type" : "record",
     "name" : "triprec",
     "fields" : [
     {
       "name" : "id",
       "type" : "string"
     }, {
       "name" : "creation_date",
       "type" : "string"
     }, {
       "name" : "last_update_time",
       "type" : "string"
     }, {
       "name" : "quantity",
       "type" : "string"
     }, {
       "name" : "compcode",
       "type" : "string"
     }]
   }
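
The token `Objavro` in the error is a hint about what was being parsed. An Avro object container file begins with the four magic bytes `Obj` followed by `0x01`, and its header metadata commonly starts with the key `avro.schema`, so a binary Avro file read as text looks like `Obj`, some non-printable bytes, then `avro.schema`. The schema embedded in the error (`topLevelRecord` with nullable fields, followed by an `org.apache.spark.version` key) also differs from the hand-written `triprec` schema, which suggests the JSON parser was handed a Spark-written Avro data file rather than the `.avsc`. To my knowledge, DeltaStreamer's `--source-class` defaults to a JSON DFS source when it is not set, and the command above does not set it; the thread does not confirm that this was the cause. A minimal sketch for checking what kind of file sits at a path (the function name is hypothetical, and it reads a local copy of the file):

```python
AVRO_MAGIC = b"Obj\x01"       # first four bytes of an Avro object container file
UTF8_BOM = b"\xef\xbb\xbf"    # some editors prepend this marker to text files


def describe_file(path: str) -> str:
    """Guess whether a file is an Avro container, JSON text, or something else.

    Hypothetical helper for illustration; it looks only at the first bytes.
    """
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(AVRO_MAGIC):
        return "avro container"
    if head.startswith(UTF8_BOM):
        # Skip the byte-order mark before looking for JSON.
        head = head[len(UTF8_BOM):]
    # JSON documents such as an .avsc schema start with an object or array.
    if head.lstrip()[:1] in (b"{", b"["):
        return "json text"
    return "unknown"
```

Running this on one file from the source root and on the `.avsc` file would show which of them starts with the Avro magic bytes.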
   


[GitHub] [hudi] nsivabalan commented on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4641:
URL: https://github.com/apache/hudi/issues/4641#issuecomment-1024372275


   got it. 


[GitHub] [hudi] harishraju-govindaraju closed issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

Posted by GitBox <gi...@apache.org>.

harishraju-govindaraju closed issue #4641:
URL: https://github.com/apache/hudi/issues/4641


   


[GitHub] [hudi] harishraju-govindaraju commented on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

Posted by GitBox <gi...@apache.org>.

harishraju-govindaraju commented on issue #4641:
URL: https://github.com/apache/hudi/issues/4641#issuecomment-1024294748


   I was able to run spark-submit. Something was wrong with my parameters.


[GitHub] [hudi] nsivabalan commented on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4641:
URL: https://github.com/apache/hudi/issues/4641#issuecomment-1018534907


   No, you don't need to set any schema explicitly. 
   I went through all the configs from the description. 
   
   I guess there could be a typo:
   hoodie.deltastreamer.s3.source.queue.url is the right config. I see you are setting hoodie.deltastreamer.source.queue.url.
   
   CC @codope 
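
The key mismatch pointed out above can be checked mechanically before submitting. The following is a minimal sketch, not a Hudi feature: the helper `find_conf_typos` and the `KNOWN_TYPOS` table are hypothetical, and the only key pair in the table is the one named in this reply.

```python
# Wrong key (as set in the issue description) mapped to the key
# that the reply above says is the right config.
KNOWN_TYPOS = {
    "hoodie.deltastreamer.source.queue.url":
        "hoodie.deltastreamer.s3.source.queue.url",
}


def find_conf_typos(args):
    """Scan spark-submit style args for --hoodie-conf keys in KNOWN_TYPOS.

    Returns a list of (wrong_key, suggested_key) tuples in argument order.
    """
    findings = []
    # Walk over (flag, value) pairs of neighbouring arguments.
    for flag, value in zip(args, args[1:]):
        if flag != "--hoodie-conf":
            continue
        key = value.split("=", 1)[0]
        if key in KNOWN_TYPOS:
            findings.append((key, KNOWN_TYPOS[key]))
    return findings


# Arguments copied from the tail of the command in the issue description.
example_args = [
    "--source-class", "org.apache.hudi.utilities.sources.S3EventsSource",
    "--hoodie-conf",
    "hoodie.deltastreamer.source.queue.url=https://sqs.us-east-1.amazonaws.com/118897059965/sqshudi",
    "--hoodie-conf", "hoodie.deltastreamer.s3.source.queue.region=us-east-1",
]

for wrong, right in find_conf_typos(example_args):
    print(f"{wrong} -> did you mean {right}?")
```

The check only knows the keys listed in `KNOWN_TYPOS`; it does not validate other settings.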
   
   


[GitHub] [hudi] nsivabalan commented on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #4641:
URL: https://github.com/apache/hudi/issues/4641#issuecomment-1017017630


   No, you should be able to start DeltaStreamer on a clean directory. 
   Do you have the full stacktrace? Also, can you try cleaning up the target directory completely and rerunning DeltaStreamer?
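
The exception reported in this thread names `<base path>/.hoodie`, the metadata folder of a Hudi table, and another comment reports that `.hoodie` was created once the folder had been cleared. One plausible reading, not confirmed in this thread, is that the target path already held something but no `.hoodie` folder. A small sketch of that distinction, given a listing of object keys under the base path (how the listing is obtained, for example with an S3 client, is left out; the function name and the labels are hypothetical):

```python
def classify_target_path(relative_keys):
    """Classify a target base path from the object keys found under it.

    relative_keys: iterable of keys relative to the base path, for example
    [".hoodie/hoodie.properties", "compcode=A/file.parquet"].
    """
    # Ignore empty entries and bare folder markers such as "/".
    keys = [k.lstrip("/") for k in relative_keys if k.strip("/")]
    if not keys:
        return "clean"
    if any(k == ".hoodie" or k.startswith(".hoodie/") for k in keys):
        return "has .hoodie"
    return "leftover files without .hoodie"
```

Only the third outcome would match the reading above; a "clean" path is the state the advice asks for.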
   

