Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/03 17:13:59 UTC

[GitHub] [hudi] ashah-lightbox opened a new issue, #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

ashah-lightbox opened a new issue, #5492:
URL: https://github.com/apache/hudi/issues/5492

   
   
   **Describe the problem you faced**
   
   I tried `_hoodie_is_deleted` in a PySpark notebook on EMR and it works as expected. My EMR example is here:
   https://gist.github.com/ashays83/6beaf642bd55b4c46292b8f382d0088b
   
   I then tried `_hoodie_is_deleted` with the Hudi Spark datasource on Docker and got the results here:
   https://gist.github.com/ashays83/af64d3c3795534e40c3b003b0796f349
   
   As you can see, on EMR, upserting the updated records keeps all the records and sets `_hoodie_is_deleted` to null for the records where no value is specified.
   
   With the Spark datasource on Docker, however, I don't see the same behavior. It keeps only the records that have `_hoodie_is_deleted` set to false, and all other records are deleted.
   
   I would like to understand why it behaves differently in the two environments.
   
   **Expected behavior**
   
   The Hudi Spark datasource on Docker should produce the same result as on EMR.
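
To make the expected behavior concrete, here is a minimal pure-Python sketch of the delete-marker semantics as the Hudi documentation describes them: a record whose `_hoodie_is_deleted` value is true is removed, and a record whose value is false or null is upserted. This is a simplified model for illustration only, not Hudi's implementation; the function `apply_upsert`, the `id` key, and the sample records are all hypothetical.

```python
from typing import Dict, List, Optional

DELETE_MARKER = "_hoodie_is_deleted"


def apply_upsert(table: Dict[int, dict], batch: List[dict], key: str) -> Dict[int, dict]:
    """Return a new table after applying `batch` as an upsert keyed on `key`."""
    result = dict(table)
    for record in batch:
        marker: Optional[bool] = record.get(DELETE_MARKER)
        if marker is True:
            # Explicit delete marker: drop the key if it exists.
            result.pop(record[key], None)
        else:
            # False or None (null): insert the record or overwrite the old one.
            result[record[key]] = record
    return result


# Hypothetical sample data, not taken from the gists in this issue.
existing = {
    1: {"id": 1, "value": "a", DELETE_MARKER: None},
    2: {"id": 2, "value": "b", DELETE_MARKER: None},
    3: {"id": 3, "value": "c", DELETE_MARKER: None},
}
incoming = [
    {"id": 1, "value": "a2", DELETE_MARKER: None},  # null marker: updated
    {"id": 2, "value": "b2", DELETE_MARKER: True},  # true marker: deleted
    {"id": 4, "value": "d", DELETE_MARKER: False},  # false marker: inserted
]

after = apply_upsert(existing, incoming, key="id")
print(sorted(after))
```

Under this model, a record with a null marker survives the upsert, which matches the EMR result described above. The Docker result, where null-marker records disappear, is the behavior that needs explaining.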
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] ashah-lightbox commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
ashah-lightbox commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1126195871

   @nsivabalan this is the schema before inserting new records into the Hudi table. 
   `>>> df1.schema
   StructType(List(StructField(FIPSCode,LongType,true),StructField(LPSDistinctPropertyID,LongType,true),StructField(LPSAssessmentDataReleaseDate,LongType,true),StructField(AssessorsParcelNumber,StringType,true),StructField(_hoodie_is_deleted,BooleanType,true)))
   `
   After this I run:
   
    `>>> df1.write \
   ...     .format("org.apache.hudi") \
   ...     .options(**hudi_options) \
   ...     .mode("append") \
   ...     .save(basePath)`
   
   Does AWS ship different or updated jars?




[GitHub] [hudi] nsivabalan commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1124400786

   Hmm, interesting. This is the first time I have heard of someone seeing different behavior on EMR and Docker when using the same script with the Spark datasource. 
   Can you paste the contents of the commit metadata from when you ingested via Docker and via EMR? Let's see if we can spot any difference. 
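
One way to compare the two commits is to reduce each commit metadata document to a few totals and diff them. The sketch below assumes the commit file under `.hoodie/` is plain JSON with a `partitionToWriteStats` map and per-file counters named `numWrites`, `numUpdateWrites`, and `numDeletes`, plus a top-level `operationType`. That is my understanding of the 0.x layout, so verify the field names against your own files. The helper names and the sample documents are hypothetical.

```python
import json
from typing import Dict, Tuple


def load_commit(path: str) -> dict:
    """Read a commit metadata file, assuming it is stored as JSON."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def summarize_commit(metadata: dict) -> Dict[str, object]:
    """Collapse a commit metadata document into a few comparable totals."""
    # Flatten the per-partition lists of write stats into one list.
    stats = [
        stat
        for partition_stats in metadata.get("partitionToWriteStats", {}).values()
        for stat in partition_stats
    ]
    return {
        "operationType": metadata.get("operationType"),
        "numWrites": sum(s.get("numWrites", 0) for s in stats),
        "numUpdateWrites": sum(s.get("numUpdateWrites", 0) for s in stats),
        "numDeletes": sum(s.get("numDeletes", 0) for s in stats),
    }


def diff_summaries(left: dict, right: dict) -> Dict[str, Tuple[object, object]]:
    """Return only the summary entries whose values differ."""
    return {k: (left[k], right[k]) for k in left if left[k] != right[k]}


# Made-up sample documents, NOT output from the runs in this issue.
emr_commit = {
    "operationType": "UPSERT",
    "partitionToWriteStats": {
        "": [{"numWrites": 5, "numUpdateWrites": 2, "numDeletes": 0}]
    },
}
docker_commit = {
    "operationType": "UPSERT",
    "partitionToWriteStats": {
        "": [{"numWrites": 2, "numUpdateWrites": 2, "numDeletes": 3}]
    },
}

differences = diff_summaries(summarize_commit(emr_commit), summarize_commit(docker_commit))
print(differences)
```

With real files, `load_commit` would be pointed at the commit file for the upsert on each side. A nonzero delete count on only one side would show that the writer itself treated the records as deletes.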
    




[GitHub] [hudi] codope commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
codope commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1348767380

   Closing due to inactivity




[GitHub] [hudi] nsivabalan commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1302887335

   @ashah-lightbox: gentle ping. Any updates, please? If you got the issue resolved, can we close it out? 
   




[GitHub] [hudi] nsivabalan commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1124401057

   With Docker, can you ensure the operation is "upsert" and the save mode is "Append"? 
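
A quick sanity check along these lines could be run in both environments before the write. It is a sketch only: `check_write_settings` is a hypothetical helper, and the option values in the example are placeholders rather than the settings from the gists. The key `hoodie.datasource.write.operation` is the Hudi datasource option for the write operation, and the sketch treats a missing key as `upsert`, which is the documented default.

```python
from typing import Dict, List

OPERATION_KEY = "hoodie.datasource.write.operation"


def check_write_settings(options: Dict[str, str], save_mode: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Assumption: an unset operation falls back to Hudi's default, "upsert".
    operation = options.get(OPERATION_KEY, "upsert")
    if operation != "upsert":
        problems.append(f"operation is {operation!r}, expected 'upsert'")
    # Spark accepts the save mode string case-insensitively.
    if save_mode.lower() != "append":
        problems.append(f"save mode is {save_mode!r}, expected 'append'")
    return problems


# Placeholder options; the table and field names are made up for illustration.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    OPERATION_KEY: "upsert",
}

ok = check_write_settings(hudi_options, "Append")
bad = check_write_settings({OPERATION_KEY: "insert"}, "overwrite")
print(ok)
print(bad)
```

The same `hudi_options` dictionary and mode string would then be passed to `.options(**hudi_options)` and `.mode(...)` on the DataFrame writer, so the check covers exactly what is written.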
   




[GitHub] [hudi] ashah-lightbox commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
ashah-lightbox commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1125080723

   Hey @nsivabalan, as you can see in the links above, I used operation 'upsert' and save mode 'append' when performing the update.




[GitHub] [hudi] nsivabalan commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1244809050

   Also, I see you used two different scripts in the two cases. 
   Can you also try your EMR script https://gist.github.com/ashays83/6beaf642bd55b4c46292b8f382d0088b in Docker and share what you see? 




[GitHub] [hudi] nsivabalan commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1244808566

   @ashah-lightbox: sorry to have dropped the ball on this. Can you help me understand what you mean by Docker (case 2)? Is it HDFS that you are using within Docker? 
   When you say EMR (case 1), does it mean the Spark datasource writer on EMR? And are you using EMR's version of Hudi? In Docker, are you using OSS Hudi bundles? 
   




[GitHub] [hudi] nsivabalan commented on issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5492:
URL: https://github.com/apache/hudi/issues/5492#issuecomment-1125508485

   My hunch is that setting nullable did not work as expected. Can you run `df1.printSchema()` before ingesting to Hudi and confirm that nullable for `_hoodie_is_deleted` is set to true in both cases? 
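
The nullability check can also be done programmatically so the two environments are compared the same way. The sketch below works on the dictionary form of a Spark schema, the shape that `df1.schema.jsonValue()` returns in PySpark, so it runs without a Spark session. `field_nullability` is a hypothetical helper, and the sample schema is written by hand from the field names in this issue, not captured from either environment.

```python
from typing import Optional


def field_nullability(schema_json: dict, field_name: str) -> Optional[bool]:
    """Return the nullable flag of a top-level field, or None if it is absent."""
    for field in schema_json.get("fields", []):
        if field["name"] == field_name:
            return field["nullable"]
    return None


# Hand-written schema in the struct-JSON shape; an illustration only.
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "FIPSCode", "type": "long", "nullable": True, "metadata": {}},
        {"name": "AssessorsParcelNumber", "type": "string", "nullable": True, "metadata": {}},
        {"name": "_hoodie_is_deleted", "type": "boolean", "nullable": True, "metadata": {}},
    ],
}

marker_nullable = field_nullability(sample_schema, "_hoodie_is_deleted")
missing = field_nullability(sample_schema, "_hoodie_is_delete")
print(marker_nullable, missing)
```

A result of `None` means the column is not in the DataFrame under that exact name, which is worth ruling out because the name with the trailing "d" is the one Hudi looks for.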
   




[GitHub] [hudi] codope closed issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.

Posted by GitBox <gi...@apache.org>.
codope closed issue #5492: _hoodie_is_delete works differently on hudi spark datasource on docker compare to hudi on emr.
URL: https://github.com/apache/hudi/issues/5492

