Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/10/20 09:54:58 UTC

[GitHub] [hudi] Kavin88 opened a new issue #3831: Deltastreamer through Pyspark/livy

Kavin88 opened a new issue #3831:
URL: https://github.com/apache/hudi/issues/3831


   spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.sql.shuffle.partitions=100 \
    --driver-class-path $HADOOP_CONF_DIR \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    --table-type MERGE_ON_READ \
    --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
    --source-ordering-field tst  \
    --target-base-path /user/hive/warehouse/stock_ticks_mor \
    --target-table test \
    --props /var/demo/config/kafka-source.properties 
   
   1. Can deltastreamer be used only as a CLI utility?
   2. If it can be integrated into pyspark code like the datasource writer, how do I pass deltastreamer-specific parameters such as --props, --source-class and --continuous through the hudi config options?
   3. Similarly, is it possible to pass the above parameters (from point 2) through Livy for spark submission?


[GitHub] [hudi] xushiyan commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-948120800


   1. deltastreamer is a Spark application that is supposed to run on a cluster, so I'm not sure how it fits as a CLI utility. If you just want a CLI command to trigger the job submission, there is nothing stopping you from doing that.
   2. Most deltastreamer configs translate to hudi options internally; for example, --source-ordering-field maps to the precombine field option. I'd suggest finding all the hudi configs your application needs based on deltastreamer's params, building a map of hudi write options from scratch, and passing it to the datasource writer (see the sketch after this list). The extra work is that you may need to orchestrate your datasource writer yourself, e.g., schedule it periodically and trigger compaction in a separate process. Also, not all deltastreamer params map to hudi write options: --continuous, for instance, is an orchestration mode, not a writer option. So deltastreamer sits at a higher level than datasource writing; you can't simply swap one for the other.
   3. see 2)
   Hope this helps.
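   As a rough illustration of point 2, here is a minimal PySpark sketch of that mapping. The option keys are standard Hudi write configs, but the input read, record key field, and target path below are placeholder assumptions rather than values from this thread:
   
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("hudi-datasource-writer").getOrCreate()
   
   # Placeholder source read; with the datasource writer you bring your own
   # Kafka/file ingestion instead of deltastreamer's JsonKafkaSource.
   input_df = spark.read.json("s3://my-bucket/incoming/")
   
   # Rough equivalents of some deltastreamer params as Hudi write options:
   #   --target-table          -> hoodie.table.name
   #   --source-ordering-field -> hoodie.datasource.write.precombine.field
   #   --table-type            -> hoodie.datasource.write.storage.type
   hudi_options = {
       "hoodie.table.name": "test",
       "hoodie.datasource.write.recordkey.field": "id",  # placeholder key field
       "hoodie.datasource.write.precombine.field": "tst",
       "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
   }
   
   (input_df.write.format("org.apache.hudi")
       .options(**hudi_options)
       .option("hoodie.datasource.write.operation", "upsert")
       .mode("append")
       .save("/user/hive/warehouse/stock_ticks_mor"))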


[GitHub] [hudi] Kavin88 edited a comment on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
Kavin88 edited a comment on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-948232383


   @xushiyan 1. As of now, I am directly doing the spark-submit on the EMR cluster for the deltastreamer run. I want to understand whether deltastreamer can be used the same way as the hudi datasource writer. The params we would pass to the datasource writer in pyspark are given below. I cannot figure out how to pass the deltastreamer params in python/spark code or through a Livy submit, i.e., how to pass --continuous, the source class name, the source ordering field, etc. in the hudiOptions below. Is this viable?
   
   hudiOptions = {
       "hoodie.table.name": "hudi_test",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "last_update_time",
       "hoodie.upsert.shuffle.parallelism": 1,
       "hoodie.insert.shuffle.parallelism": 1,
       "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
   }
   
   inputdf.write.format("org.apache.hudi") \
       .option("hoodie.datasource.write.operation", "insert") \
       .options(**hudiOptions) \
       .mode("overwrite") \
       .save("storagepath")
   
   


[GitHub] [hudi] xushiyan commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-974689954


   @Kavin88 so basically this is a feature request: you wish to pass these deltastreamer-specific arguments via options. It is indeed possible to create a set of configs, say `hoodie.deltastreamer.xxx`, to achieve this. But first I want to understand why it is not possible for you to submit the args via the Livy API. From what I read, https://community.cloudera.com/t5/Community-Articles/How-to-Submit-Spark-Application-through-Livy-REST-API/ta-p/247502, Livy does allow you to specify args to the Spark app.
   
   > My thought is to implement this through python code with sparksession and do the livy submit calling the pyfile.
   
   Not quite sure what this pyfile is supposed to contain. Can you elaborate?
   
   


[GitHub] [hudi] xushiyan commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-974690908


   @kywe665 it would be useful to have an example showing how to submit HoodieDeltaStreamer via the Apache Livy APIs.


[GitHub] [hudi] xushiyan edited a comment on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
xushiyan edited a comment on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-974689954


   @Kavin88 so basically this is a feature request: you wish to pass these deltastreamer-specific arguments via options. It is indeed possible to create a set of configs, say `hoodie.deltastreamer.xxx`, to achieve this. But first I want to understand why it is not possible for you to submit the args via the Livy API. From what I read, https://community.cloudera.com/t5/Community-Articles/How-to-Submit-Spark-Application-through-Livy-REST-API/ta-p/247502, Livy does allow you to specify args to the Spark app.
   
   > My thought is to implement this through python code with sparksession and do the livy submit calling the pyfile.
   
   Not quite sure what this pyfile is supposed to contain. Is it about generating 100 deltastreamer jobs in code and calling the Livy API to submit them? In that case I still don't see why you can't pass the arguments there and have to go via hudi options. Can you elaborate?
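   For reference, a minimal sketch of submitting HoodieDeltaStreamer as a Livy batch from Python could look like the following; the Livy endpoint and the bundle jar location are assumptions to adapt, while the remaining args come from the spark-submit command earlier in this thread:
   
   import requests
   
   LIVY_URL = "http://livy-host:8998/batches"  # assumed Livy endpoint
   
   payload = {
       # Assumed location of the hudi-utilities bundle on the cluster
       "file": "hdfs:///jars/hudi-utilities-bundle.jar",
       "className": "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
       # deltastreamer params are passed as plain application args, same as with spark-submit
       "args": [
           "--table-type", "MERGE_ON_READ",
           "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
           "--source-ordering-field", "tst",
           "--target-base-path", "/user/hive/warehouse/stock_ticks_mor",
           "--target-table", "test",
           "--props", "/var/demo/config/kafka-source.properties",
           "--continuous",
       ],
       "conf": {"spark.sql.shuffle.partitions": "100"},
   }
   
   resp = requests.post(LIVY_URL, json=payload)
   resp.raise_for_status()
   print(resp.json())  # Livy returns the batch id and state for polling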
   
   


[GitHub] [hudi] xushiyan closed issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
xushiyan closed issue #3831:
URL: https://github.com/apache/hudi/issues/3831


   


[GitHub] [hudi] nsivabalan commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-1018604367


   @Kavin88 : did you get a chance to try out the external config file?


[GitHub] [hudi] nsivabalan commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-991848661


   @Kavin88 : with 0.10.0, hudi has added support for setting configs in an external props file: https://hudi.apache.org/releases/release-0.10.0#external-config-file-support
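   For example, a global properties file (hudi-defaults.conf, looked up under /etc/hudi/conf or the directory pointed to by HUDI_CONF_DIR per the release notes) holds plain key=value pairs; the entries below are placeholders for illustration, not a recommended set:
   
   # hudi-defaults.conf (example entries; adapt keys/values to your table)
   hoodie.datasource.write.table.type=MERGE_ON_READ
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.precombine.field=last_update_time
   hoodie.upsert.shuffle.parallelism=2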
   


[GitHub] [hudi] Kavin88 commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
Kavin88 commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-948232383


   @xushiyan 1. As of now, I am directly doing the spark-submit on the EMR cluster for the deltastreamer run. I wanted to understand whether deltastreamer can be used the same way as the hudi datasource writer. The params we would pass to the datasource writer in pyspark are given below. I cannot figure out how to pass the deltastreamer params in python/spark code or through a Livy submit, i.e., how to pass --continuous, the source class name, the source ordering field, etc. in the hudiOptions below. Is this viable?
   
   hudiOptions = {
       "hoodie.table.name": "hudi_test",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "last_update_time",
       "hoodie.upsert.shuffle.parallelism": 1,
       "hoodie.insert.shuffle.parallelism": 1,
       "hoodie.consistency.check.enabled": True,
       "hoodie.index.type": "BLOOM",
       "hoodie.index.bloom.num_entries": 60000,
       "hoodie.index.bloom.fpp": 0.000000001,
       "hoodie.cleaner.commits.retained": 2,
       "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
   }
   
   inputdf.write.format("org.apache.hudi") \
       .option("hoodie.datasource.write.operation", "insert") \
       .options(**hudiOptions) \
       .mode("overwrite") \
       .save("storagepath")
   
   


[GitHub] [hudi] Kavin88 commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
Kavin88 commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-964268810


   I need to ingest 100 tables through deltastreamer, and the Spark job would be running in continuous mode. My thought is to implement this through python code with a SparkSession and do the Livy submit calling the pyfile. I could not find anywhere how to integrate the --props and --source-class parameters directly in python code rather than passing them through spark-submit. I am planning to achieve this with generic code and change only the param values for each table.


[GitHub] [hudi] nsivabalan commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-1018606629


   Also, if you were able to get it working w/ Livy, consider adding a blog or a write up on how you went about it. Others in the community might benefit from your work. We can help with the write up if need be. 
   


[GitHub] [hudi] Kavin88 edited a comment on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
Kavin88 edited a comment on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-964268810


   I need to ingest 100 tables through deltastreamer (not the MultiTableDeltaStreamer), and the Spark job would be running in continuous mode. My thought is to implement this through python code with a SparkSession and do the Livy submit calling the pyfile. I could not find anywhere how to integrate the --props and --source-class parameters directly in python code rather than passing them through spark-submit. I am planning to achieve this with generic code and change only the param values for each table.


[GitHub] [hudi] xushiyan commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-1025287708


   @stackls Closing this due to inactivity. If you ever come up with some work to share about Deltastreamer + Livy, we would be happy to see it and help promote it to the community. Thanks.


[GitHub] [hudi] nsivabalan commented on issue #3831: Deltastreamer through Pyspark/livy

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3831:
URL: https://github.com/apache/hudi/issues/3831#issuecomment-961833569


   My understanding is that you run the pyspark job via spark-submit?
   https://www.tutorialkart.com/apache-spark/submit-spark-application-python-example/
   Or am I getting your requirement wrong?
   

