Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/11/10 04:49:54 UTC

[GitHub] [hudi] RajasekarSribalan opened a new issue #2238: [SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating

RajasekarSribalan opened a new issue #2238:
URL: https://github.com/apache/hudi/issues/2238


   Hi all, I have a query regarding CDC using Hudi.
   
   I am using the Spark Datasource API for upserts and deletes on Hudi. What is the best way of doing deletes in Hudi?
   Our code flow is:
   read Kafka -> persist DF in memory -> filter upserts -> write to Hudi -> filter deletes -> write to Hudi.
   
   Is this the right way of handling both upserts and deletes from incoming streams? The problem with this approach is that Hudi does indexing twice for a single batch of records, because we upsert and delete separately. I would like your suggestions for improving our pipeline.
   Can we use "_hoodie_is_deleted" in the Spark Datasource API? We could append a new column, _hoodie_is_deleted, set to true for delete records and false for insert/update records. If we use "_hoodie_is_deleted", will Hudi hard delete the row, or does it set it to null? Please confirm.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar closed issue #2238: [SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #2238:
URL: https://github.com/apache/hudi/issues/2238


   





[GitHub] [hudi] helianthuslulu commented on issue #2238: [SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating

Posted by GitBox <gi...@apache.org>.
helianthuslulu commented on issue #2238:
URL: https://github.com/apache/hudi/issues/2238#issuecomment-730760611


   I also ran into a problem with **"_hoodie_is_deleted"**. My Hudi version is 0.5.2:
   **1) Kafka data sample:**
   "rowkey1","value1","value2",true
   "rowkey2","value1","value2",false
   **2) Spark DataFrame schema:**
   `val structSchema: StructType = StructType(
   List(
       // The original listed "rowkey" three times; the two value column names are assumed from the data sample.
       StructField("rowkey", StringType, true),
       StructField("value1", StringType, true),
       StructField("value2", StringType, true),
       StructField("_hoodie_is_deleted", BooleanType, true)
   )
   )`
   3) Query methods:
   ①spark-shell
   ②spark-sql
   ③beeline
   ④code:
   val tripsSnapshotDF = spark.
     read.
     format("hudi").
     load(basePath + "/*/*/*/*")
   //load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
   tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
   spark.sql("select * from  hudi_trips_snapshot limit 20").show()
   **4) Query result:
   The result data includes rows with "_hoodie_is_deleted=true".**
   
   Can you please give me some suggestions? @bvaradar
   





[GitHub] [hudi] nsivabalan commented on issue #2238: [SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2238:
URL: https://github.com/apache/hudi/issues/2238#issuecomment-730837343


   @RajasekarSribalan : Sorry, I missed this. Yes, you are right: if you use the "UPSERT" operation, the "_hoodie_is_deleted" value is used to distinguish records to be deleted from records to be upserted.
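
   The single-write approach confirmed here could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the `op` column (with "D" marking a delete), the table name, and the field names `rowkey`, `partitionpath`, and `ts` are all hypothetical, and Spark plus the Hudi Spark bundle must be on the classpath. The idea is to derive `_hoodie_is_deleted` from the change record and write the whole batch once with the upsert operation, so the index lookup should run once per batch rather than twice.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

object HudiUpsertWithDeletes {

  // Hudi datasource write options for a single upsert pass.
  // All argument values are supplied by the caller; none come from the thread.
  def hudiOptions(tableName: String,
                  recordKeyField: String,
                  partitionPathField: String,
                  precombineField: String): Map[String, String] = Map(
    "hoodie.table.name" -> tableName,
    "hoodie.datasource.write.operation" -> "upsert",
    "hoodie.datasource.write.recordkey.field" -> recordKeyField,
    "hoodie.datasource.write.partitionpath.field" -> partitionPathField,
    "hoodie.datasource.write.precombine.field" -> precombineField
  )

  // Assumption: the incoming CDC batch has a string column `op`,
  // where "D" marks a delete and anything else is an insert/update.
  def withDeleteFlag(batch: DataFrame): DataFrame =
    batch.withColumn("_hoodie_is_deleted", col("op") === "D")

  // Writes upserts and deletes together in one pass.
  def writeBatch(batch: DataFrame, basePath: String): Unit =
    withDeleteFlag(batch)
      .write
      .format("hudi")
      .options(hudiOptions("cdc_table", "rowkey", "partitionpath", "ts"))
      .mode(SaveMode.Append)
      .save(basePath)
}
```

   With this shape, the separate "filter upserts" and "filter deletes" steps in the original flow collapse into one `writeBatch` call per Kafka batch. Whether the flag is honored depends on the Hudi version and payload class in use, so it is worth verifying against a small test table first.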





[GitHub] [hudi] nsivabalan commented on issue #2238: [SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2238:
URL: https://github.com/apache/hudi/issues/2238#issuecomment-743690952


   @RajasekarSribalan : Can we close this issue if you don't have anything else?





[GitHub] [hudi] bvaradar commented on issue #2238: [SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2238:
URL: https://github.com/apache/hudi/issues/2238#issuecomment-744702995


   @RajasekarSribalan : Please open a new issue if you need any further clarification on this.
   
   Thanks,
   Balaji.V





[GitHub] [hudi] bvaradar commented on issue #2238: [SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2238:
URL: https://github.com/apache/hudi/issues/2238#issuecomment-724872746


   @nsivabalan : Can you please take a look at this?





[GitHub] [hudi] nsivabalan commented on issue #2238: [SUPPORT] _hoodie_is_deleted support for Spark Datasource API in hudi 0.5.2-incubating

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2238:
URL: https://github.com/apache/hudi/issues/2238#issuecomment-730837495


   @helianthuslulu : Sorry, I don't quite get your question. Would you mind explaining it once again?

