Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/05 00:47:26 UTC

[GitHub] [hudi] nandini57 opened a new issue #1705: Tracking Hudi Data along transaction time and business time

nandini57 opened a new issue #1705:
URL: https://github.com/apache/hudi/issues/1705


   Hi Guys,
   
   Need your ideas on this topic. We want to track Hudi data along both business and processing time, i.e. bitemporally. The link below has the literature we follow.
   
   Due to the unique record constraint, we are not allowed to keep the same primary key(s) twice, and the commit timeline is uni-dimensional as well. We are thinking along the lines of record keys, but nothing concrete yet.
   
   Please help with ideas on what else we can try, or whether this is even possible.
   
   http://www2.cs.arizona.edu/~rts/tdbbook.pdf
   


----------------------------------------------------------------



[GitHub] [hudi] nandini57 commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-643692098


   Yes, that's what I did last week. I added an IS_ACTIVE column, so that there is a single insert. If it's okay with you guys, can this approach be added to hudi-examples for bitemporal data? It should help people in fintech who are evaluating Hudi with similar requirements.
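
   A minimal sketch of what that flag-based variant could look like, assuming the getBaseWriter helper and the matchedDF/mergedDF datasets from the merge code shared later in this thread (the exact wiring here is an assumption, not code from this issue):

    // Sketch: instead of a delete commit followed by an insert commit, flip
    // IS_ACTIVE to false on the superseded rows, union in the replacement rows,
    // and write everything as a single upsert commit.
    Dataset<Row> deactivated = matchedDF
            .withColumn("IS_ACTIVE", functions.lit(false))
            .drop("FROM_Z_NEW", "THRU_Z_NEW", "OUT_Z_NEW", "IN_Z_NEW");
    Dataset<Row> replacement = mergedDF.withColumn("IS_ACTIVE", functions.lit(true));

    getBaseWriter(deactivated.unionByName(replacement), tableType, tableName)
            .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL())
            .mode(SaveMode.Append)
            .save(tablePath);

    // Readers then keep only the live rows, e.g.
    // spark.read().format("org.apache.hudi").load(tablePath + "/*").filter("IS_ACTIVE = true")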


----------------------------------------------------------------



[GitHub] [hudi] vinothchandar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-660556777


   Thanks for the persistence. https://issues.apache.org/jira/browse/HUDI-1112 is all yours; please grab it when ready.


----------------------------------------------------------------



[GitHub] [hudi] nandini57 commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-640243834


   haha. Sharing a superb short version which I tried implementing from a functional point of view.
   Check Chapter 10:
   https://goldmansachs.github.io/reladomo-kata/reladomo-tour-docs/tour-guide.html#N408B5
   
   Here's my version of it with Hudi:
   1. A custom merge, implemented as DELETE + INSERT operations. Each op needs to be wrapped in a try/catch, rolling back the commit if it fails. I don't get this facility with plain HDFS saves, so Hudi is useful here. Correct me if wrong.
   2. Timestamped record keys, so that the merge doesn't uproot data which is already saved. I need full control over previously saved data to support bitemporality.
   
   Need your views on implementing it like this.
   
    import java.io.IOException;
    import org.apache.hudi.DataSourceWriteOptions;
    import org.apache.hudi.common.model.EmptyHoodieRecordPayload;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.functions;

    public static void merge(Dataset<Row> baseDF, Dataset<Row> adjustedDF, SparkSession spark,
                             String tableType, String tableName, String tablePath, String partitionPath) throws IOException {

        baseDF.createOrReplaceTempView("base");
        adjustedDF.createOrReplaceTempView("adjustments");

        // Find the currently-open rows (OUT_Z = '99991231') that the adjustments touch.
        // The join condition now lives in an ON clause instead of the WHERE.
        String findMatchSql = "select b.*, a.FROM_Z as FROM_Z_NEW, a.THRU_Z as THRU_Z_NEW, "
                + "a.OUT_Z as OUT_Z_NEW, a.IN_Z as IN_Z_NEW "
                + "from base b inner join adjustments a on b.accountId = a.accountId "
                + "where b.OUT_Z = '99991231'";
        Dataset<Row> matchedDF = spark.sql(findMatchSql).cache(); // cache required, as the DF is reused below

        matchedDF.show(false);

        Dataset<Row> toBeDeletedDF = matchedDF.drop("FROM_Z_NEW", "THRU_Z_NEW", "OUT_Z_NEW", "IN_Z_NEW");

        // To be invalidated - insert: close the system dimension by setting OUT_Z to the adjustment's IN_Z.
        Dataset<Row> invalidatedDF = matchedDF.drop("OUT_Z").withColumnRenamed("IN_Z_NEW", "OUT_Z")
                .drop("FROM_Z_NEW", "OUT_Z_NEW", "THRU_Z_NEW");

        invalidatedDF.show();

        // New world view - insert: trim the business dimension by setting THRU_Z to the adjustment's FROM_Z.
        Dataset<Row> newWorldViewDF = matchedDF.drop("THRU_Z", "IN_Z")
                .withColumnRenamed("IN_Z_NEW", "IN_Z").withColumnRenamed("FROM_Z_NEW", "THRU_Z")
                .drop("OUT_Z_NEW", "THRU_Z_NEW");

        newWorldViewDF.show();

        // Commit 1: delete the superseded rows. Rollback to be added if it fails; same for the insert below.
        getBaseWriter(toBeDeletedDF, tableType, tableName)
                .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.DELETE_OPERATION_OPT_VAL())
                .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(), EmptyHoodieRecordPayload.class.getName())
                .mode(SaveMode.Append)
                .save(tablePath);

        // Re-key the rewritten rows so they don't collide with record keys already stored.
        Dataset<Row> mergedDF = invalidatedDF.drop("recordKey")
                .withColumn("recordKey", functions.concat_ws("_", functions.lit(recordKeyGen()), functions.col("accountId")))
                .unionByName(newWorldViewDF.drop("recordKey")
                        .withColumn("recordKey", functions.concat_ws("_", functions.lit(recordKeyGen()), functions.col("accountId")))
                        .unionByName(adjustedDF));

        mergedDF.createOrReplaceTempView("mergedDF");

        // Drop degenerate rows whose business window is empty.
        mergedDF = spark.sql("select * from mergedDF where FROM_Z < THRU_Z");

        // Commit 2: insert the invalidated, new-world-view and adjustment rows.
        getBaseWriter(mergedDF, tableType, tableName)
                .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL())
                .mode(SaveMode.Append)
                .save(tablePath);
    }
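
   The snippet assumes two helpers that aren't shown in the comment. A plausible shape for them, inferred from the options the merge relies on (the names, the precombine field and the defaults here are guesses, not the author's actual code):

    import org.apache.hudi.DataSourceWriteOptions;
    import org.apache.hudi.config.HoodieWriteConfig;
    import org.apache.spark.sql.DataFrameWriter;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Hypothetical: a writer preconfigured with the table's key, precombine and
    // table-name options, so each call site only sets the operation.
    private static DataFrameWriter<Row> getBaseWriter(Dataset<Row> df, String tableType, String tableName) {
        return df.write().format("org.apache.hudi")
                .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), tableType)
                .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "recordKey")
                .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partitionPath")
                .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "IN_Z")
                .option(HoodieWriteConfig.TABLE_NAME, tableName);
    }

    // Hypothetical: any value unique per merge run keeps the re-inserted keys
    // from colliding with keys already stored; a timestamp is the simplest choice.
    private static String recordKeyGen() {
        return String.valueOf(System.currentTimeMillis());
    }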
   
   
   
   
   
   
   


----------------------------------------------------------------



[GitHub] [hudi] nandini57 commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-643062483


   Hi @bvaradar @vinothchandar, do you see any problem with this approach, or any points to consider?


----------------------------------------------------------------



[GitHub] [hudi] nandini57 edited a comment on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
nandini57 edited a comment on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-640599130


   Yes Balaji. Each record can have 4 columns: IN_Z and OUT_Z (system dimension), FROM_Z and THRU_Z (business dimension). If you see the code above, I am creating different unique keys and splitting the merge operation into delete + insert.
   
   A Hudi merge creates one commit timestamp, but the delete and insert here will create 2 commit timestamps, and if either or both operations fail, I need to roll back the commits.
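
   For reference, a sketch of how that rollback could be wired against the 0.5.x client API. HoodieDataSourceHelpers.latestCommit/listCommitsSince and HoodieWriteClient.rollback exist in that line, but the surrounding plumbing (and the exact import paths, which vary across 0.5.x releases) should be treated as assumptions, not code from this issue:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hudi.HoodieDataSourceHelpers;
    import org.apache.hudi.client.HoodieWriteClient;
    import org.apache.hudi.config.HoodieWriteConfig;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    // Sketch: remember the last commit before the merge; on failure, roll back
    // every instant created after it (newest first) so a half-done delete+insert
    // never stays visible.
    public static void mergeWithRollback(SparkSession spark, String tablePath, String tableName,
                                         Runnable deleteOp, Runnable insertOp) throws Exception {
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        String lastGoodCommit = HoodieDataSourceHelpers.latestCommit(fs, tablePath);
        try {
            deleteOp.run();  // commit 1: delete the superseded rows
            insertOp.run();  // commit 2: insert invalidated + new-world-view rows
        } catch (Exception e) {
            HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
                    .withPath(tablePath).forTable(tableName).build();
            HoodieWriteClient client = new HoodieWriteClient(
                    new JavaSparkContext(spark.sparkContext()), cfg);
            List<String> newInstants = new ArrayList<>(
                    HoodieDataSourceHelpers.listCommitsSince(fs, tablePath, lastGoodCommit));
            Collections.reverse(newInstants); // roll back the latest instant first
            for (String instant : newInstants) {
                client.rollback(instant);
            }
            client.close();
            throw e;
        }
    }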


----------------------------------------------------------------



[GitHub] [hudi] nandini57 commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-644734485


   Sure. Need to work out permissions with my firm and will circle back with a PR. Thanks guys!


----------------------------------------------------------------



[GitHub] [hudi] vinothchandar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-644727822


   @nandini57 +1 that'd be awesome.. 


----------------------------------------------------------------



[GitHub] [hudi] bvaradar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-643693932


   @nandini57: Yes, sounds like a good idea. Can you open a PR related to this?


----------------------------------------------------------------



[GitHub] [hudi] bvaradar closed issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1705:
URL: https://github.com/apache/hudi/issues/1705


   


----------------------------------------------------------------



[GitHub] [hudi] bvaradar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-640382104


   I may not be following the problem completely, but this sounds like event time and arrival time to me (w.r.t. databases). Is it possible to have a different unique key and have separate columns for business and transaction time?
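
   For illustration, a sketch of the shape being suggested, using the column names that appear in the code earlier in this thread (df, tableName, tablePath and the key layout are assumptions):

    // Illustrative only: business time (FROM_Z/THRU_Z) and transaction time
    // (IN_Z/OUT_Z) are ordinary columns, and the record key is widened so that
    // several versions of the same accountId can coexist in the table.
    Dataset<Row> keyed = df.withColumn("recordKey",
            functions.concat_ws("_", functions.col("accountId"), functions.col("IN_Z")));

    keyed.write().format("org.apache.hudi")
            .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "recordKey")
            .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "IN_Z")
            .option(HoodieWriteConfig.TABLE_NAME, tableName)
            .mode(SaveMode.Append)
            .save(tablePath);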


----------------------------------------------------------------



[GitHub] [hudi] bvaradar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-660424074


   Any luck @nandini57? I can also create a JIRA if you were able to work through the hoops. Let me know.


----------------------------------------------------------------



[GitHub] [hudi] nandini57 commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-653272647


   Working through the hoops in my org, @vinothchandar. Will update as soon as possible.


----------------------------------------------------------------



[GitHub] [hudi] bvaradar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-643687211


   @nandini57: This looks OK to me. To avoid 2 different commits (delete and insert), you can try keeping an additional column for marking invalid rows and then have a separate single commit to delete them.
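
   A sketch of that suggestion, reusing the getBaseWriter helper from the merge code above (the IS_VALID column name and the read glob are assumptions):

    // Sketch: rows are written with a validity flag instead of being deleted
    // inline; the physical delete then happens later as its own single commit.
    Dataset<Row> invalidRows = spark.read().format("org.apache.hudi")
            .load(tablePath + "/*")          // glob depth depends on the partitioning
            .filter("IS_VALID = false");
    // (the _hoodie_* meta columns may need to be dropped before writing)

    getBaseWriter(invalidRows, tableType, tableName)
            .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.DELETE_OPERATION_OPT_VAL())
            .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(), EmptyHoodieRecordPayload.class.getName())
            .mode(SaveMode.Append)
            .save(tablePath);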


----------------------------------------------------------------



[GitHub] [hudi] vinothchandar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-653263618


   @nandini57 any updates? Still very interested in this.. :)


----------------------------------------------------------------



[GitHub] [hudi] vinothchandar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-640158649


   That's a rather large book :) .. can you please explain your use case in more detail?


----------------------------------------------------------------



[GitHub] [hudi] nandini57 commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-668908683


   Sure, thank you.


----------------------------------------------------------------



[GitHub] [hudi] nandini57 commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
nandini57 commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-660494259


   Still waiting :( Can you please create a JIRA and assign it to me? It will possibly take 2 more weeks to get clearance.


----------------------------------------------------------------



[GitHub] [hudi] bvaradar commented on issue #1705: Tracking Hudi Data along transaction time and business time

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-668657866


   Closing this; we can track it through JIRA. @nandini57: Please grab the JIRA if you are planning to work on it.


----------------------------------------------------------------





