
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/12 11:57:37 UTC

[GitHub] [hudi] tooptoop4 opened a new issue #1955: [SUPPORT] DMS partition treated as part of pk

tooptoop4 opened a new issue #1955:
URL: https://github.com/apache/hudi/issues/1955


   /home/ec2-user/spark_home/bin/spark-submit \
     --conf "spark.hadoop.fs.s3a.proxy.host=redact" \
     --conf "spark.hadoop.fs.s3a.proxy.port=redact" \
     --conf "spark.driver.extraClassPath=/home/ec2-user/json-20090211.jar" \
     --conf "spark.executor.extraClassPath=/home/ec2-user/json-20090211.jar" \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --jars "/home/ec2-user/spark-avro_2.11-2.4.6.jar" \
     --master spark://redact:7077 \
     --deploy-mode client \
     /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar \
     --table-type COPY_ON_WRITE \
     --source-ordering-field TimeCreated \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --enable-hive-sync \
     --hoodie-conf hoodie.datasource.hive_sync.database=redact \
     --hoodie-conf hoodie.datasource.hive_sync.table=dmstest_multpk3 \
     --hoodie-conf hoodie.datasource.hive_sync.partition_fields="sys_user" \
     --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
     --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
     --target-base-path s3a://redact/my2/multpk3 \
     --target-table dmstest_multpk3 \
     --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
     --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
     --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
     --hoodie-conf hoodie.datasource.write.recordkey.field=version_no,group_company \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=sys_user \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/redact > multpk3.log
   
   I do have https://github.com/apache/hudi/pull/1898 patched into this jar.
   
   Instead of getting one row per version_no,group_company combination, I am getting multiple rows per combination; in fact, I am getting one row per version_no,group_company,sys_user combination.
   
   How can I make it not treat the partition field as part of the PK?
   
   That is, for each version_no,group_company combination, I want to get the latest row by TimeCreated (the source-ordering-field) and then partition on whatever sys_user that latest row has.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] bvaradar commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-691771582


   Added Jira for doc update : https://issues.apache.org/jira/browse/HUDI-1279



[GitHub] [hudi] bvaradar closed issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

bvaradar closed issue #1955:
URL: https://github.com/apache/hudi/issues/1955


   



[GitHub] [hudi] bvaradar commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-679235853


   @nsivabalan : Please take a look when you get a chance.



[GitHub] [hudi] nsivabalan commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-679352339


   @tooptoop4: can you clarify what you mean by this?
   ```
   ie for each version_no,group_company combo, i want to get the latest row by TimeCreated (ie the source-ordering-field) and then partition on whatever sys_user that latest row has.
   ```
   But in general, yes: if you use a global index with the update-partition-path config (hoodie.bloom.index.update.partition.path) set, you should not see any duplicates in your entire Hudi dataset.
   
   I can try to illustrate with an example. Let's say each row consists of only 4 values: v_no (version_no), cmp (group_company), time_cr (TimeCreated), and sys_user.
   With a regular index, the combination of record key and partition path forms the unique key.
   
   If you are using a regular index and ingest
   v_1, c_1, t_1, u_1
   v_2, c_1, t_1, u_1
   v_1, c_1, t_1, u_2
   v_1, c_1, t_1, u_3
   
   This will result in 2 rows going to partition u_1, 1 row to partition u_2, and 1 row to partition u_3.
   
   In the 2nd batch of updates, let's say you ingest a few more rows:
   v_1, c_1, t_2, u_1
   v_3, c_1, t_2, u_1
   v_1, c_2, t_2, u_2
   v_1, c_3, t_2, u_3
   
   Here is the result:
   u_1:
   v_1, c_1, t_2, u_1 (updated with latest value)
   v_2, c_1, t_1, u_1
   v_3, c_1, t_2, u_1 (insert from 2nd batch)
   u_2:
   v_1, c_1, t_1, u_2
   v_1, c_2, t_2, u_2 (insert from 2nd batch, since (v_1, c_2) is a new key in u_2)
   u_3:
   v_1, c_1, t_1, u_3
   v_1, c_3, t_2, u_3 (insert from 2nd batch)
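
The regular-index walk-through above can be replayed with a small toy model. This is a minimal sketch and not Hudi code: it assumes that a row's identity is the pair (partition path, record key), that an incoming row simply overwrites an existing row with the same identity, and it ignores ordering-field handling. The function name `upsert_partition_scoped` is made up for illustration.

```python
# Toy model of a regular (partition-scoped) index; not Hudi code.
# A row is (v_no, cmp, time_cr, sys_user); the record key is (v_no, cmp)
# and sys_user is the partition path.

def upsert_partition_scoped(table, batch):
    """Upsert rows, treating (partition path, record key) as the identity."""
    for v_no, cmp, time_cr, sys_user in batch:
        # Assumption: the incoming row always replaces a matching row.
        table[(sys_user, (v_no, cmp))] = (v_no, cmp, time_cr, sys_user)
    return table


table = {}
upsert_partition_scoped(table, [
    ("v_1", "c_1", "t_1", "u_1"),
    ("v_2", "c_1", "t_1", "u_1"),
    ("v_1", "c_1", "t_1", "u_2"),
    ("v_1", "c_1", "t_1", "u_3"),
])
upsert_partition_scoped(table, [
    ("v_1", "c_1", "t_2", "u_1"),
    ("v_3", "c_1", "t_2", "u_1"),
    ("v_1", "c_2", "t_2", "u_2"),
    ("v_1", "c_3", "t_2", "u_3"),
])

# Print the table grouped by partition, then by record key.
for (partition, key), row in sorted(table.items()):
    print(partition, row)
```

In this model the record key (v_1, c_1) ends up in all three partitions, which is the one-row-per-sys_user behaviour described in the original question. Because (v_1, c_2) is a new key in u_2, the first-batch (v_1, c_1) row in u_2 is left untouched.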
   
   With a global index, record keys alone are unique across the whole table.
   Let's see an example with global bloom, but with the update-partition-path config not set.
   
   If the 1st batch of ingest contains
   v_1, c_1, t_1, u_1
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2
   v_3, c_1, t_1, u_3
   
   the result will be:
   
   v_1, c_1, t_1, u_1 
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2 
   v_3, c_1, t_1, u_3
   
   And the 2nd batch of ingest contains
   v_1, c_1, t_2, u_1 (updating with latest time)
   v_1, c_2, t_2, u_2 (moving v_1,c_2 from u_1 to u_2). The expectation is that this will update the record in u_1 only, since the config is not set, and hence the new partition path (u_2) will be ignored.
   v_2, c_2, t_2, u_2 (new insert)
   v_1, c_3, t_2, u_3 (new insert)
   
   So, the result will be
   v_1, c_1, t_2, u_1 (updated with latest time)
   v_1, c_2, t_2, u_1 (updated with latest time, even though the incoming record was sent to u_2)
   v_2, c_1, t_1, u_2 
   v_2, c_2, t_2, u_2 (new insert)
   v_3, c_1, t_1, u_3
   v_1, c_3, t_2, u_3 (new insert)
   
   We can go through the same example with the config set.
   
   Result from the first batch:
   v_1, c_1, t_1, u_1 
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2 
   v_3, c_1, t_1, u_3
   
   And the 2nd batch of ingest contains
   v_1, c_1, t_2, u_1 (updating with latest time)
   v_1, c_2, t_2, u_2 (moving v_1,c_2 from u_1 to u_2). The expectation is that this will insert a new record into u_2 and delete the corresponding record from u_1, since the config is set.
   v_2, c_2, t_2, u_2 (new insert)
   v_1, c_3, t_2, u_3 (new insert)
   
   So, the result will be
   v_1, c_1, t_2, u_1 (updated with latest time)
   v_1, c_2, t_2, u_2 (updated with latest time; the old record in u_1 is deleted)
   v_2, c_1, t_1, u_2 
   v_2, c_2, t_2, u_2 (new insert)
   v_3, c_1, t_1, u_3
   v_1, c_3, t_2, u_3 (new insert)
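
Both global-index scenarios above can be replayed with a toy model as well. Again, this is only a sketch and not Hudi's implementation: it assumes the record key alone identifies a row, and the `update_partition_path` argument is a hypothetical stand-in for the update-partition-path config. As in the example, the fourth value shows the partition the row ends up in.

```python
# Toy model of a global index; not Hudi code.
# A row is (v_no, cmp, time_cr, sys_user); the record key is (v_no, cmp).

def upsert_global(table, batch, update_partition_path):
    """Upsert rows, treating the record key alone as the identity."""
    for v_no, cmp, time_cr, sys_user in batch:
        key = (v_no, cmp)
        if key in table and not update_partition_path:
            # Config not set: keep the row in its existing partition and
            # ignore the partition path of the incoming record.
            sys_user = table[key][3]
        # Config set (or new key): the row lands in the incoming partition;
        # overwriting the dict entry models deleting the old copy.
        table[key] = (v_no, cmp, time_cr, sys_user)
    return table


BATCH_1 = [
    ("v_1", "c_1", "t_1", "u_1"),
    ("v_1", "c_2", "t_1", "u_1"),
    ("v_2", "c_1", "t_1", "u_2"),
    ("v_3", "c_1", "t_1", "u_3"),
]
BATCH_2 = [
    ("v_1", "c_1", "t_2", "u_1"),
    ("v_1", "c_2", "t_2", "u_2"),
    ("v_2", "c_2", "t_2", "u_2"),
    ("v_1", "c_3", "t_2", "u_3"),
]


def run(update_partition_path):
    """Apply both batches to an empty table under the given setting."""
    table = {}
    upsert_global(table, BATCH_1, update_partition_path)
    upsert_global(table, BATCH_2, update_partition_path)
    return table


not_set = run(update_partition_path=False)
is_set = run(update_partition_path=True)

print(not_set[("v_1", "c_2")])  # row stays in u_1
print(is_set[("v_1", "c_2")])   # row moves to u_2
```

Either way the table holds exactly one row per record key; the setting only decides which partition the moved key lives in.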
   
   
   
   
   



[GitHub] [hudi] nsivabalan commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-681051908


   @tooptoop4: Hudi is more akin to Hive/Presto, where partitioning is enabled by default. But we will see how to fix the phrasing. By the way, can you point me to the page where you saw this?



[GitHub] [hudi] tooptoop4 edited a comment on issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

tooptoop4 edited a comment on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-679356547


   @nsivabalan Perfect, that is what I expected. Perhaps the default should be a global index, or the documentation should be updated?
   
   Coming from an RDBMS background, where the PK is unique at the table level rather than at the partition level, it is not clear from the configs below that Hudi's default is different, and I'm sure this will trip up many newcomers to Hudi:
   
   "RECORDKEY_FIELD_OPT_KEY (Required): **Primary key** field(s). Nested fields can be specified using the dot notation eg: a.b.c. When using multiple columns as primary key use comma separated notation, eg: "col1,col2,col3,etc". Single or multiple columns as primary key specified by KEYGENERATOR_CLASS_OPT_KEY property.
   Default value: "uuid"
   
   PARTITIONPATH_FIELD_OPT_KEY (Required): Columns to be used for **partitioning** the table. To prevent partitioning, provide empty string as value eg: "". Specify partitioning/no partitioning using KEYGENERATOR_CLASS_OPT_KEY. If synchronizing to hive, also specify using HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY.
   Default value: "partitionpath""
   



[GitHub] [hudi] bhasudha commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-673924758


   For your first question, since you are partitioning by `sys_user` you will likely get multiple rows (1 per sys_user). This is because the record keys are unique within a partition path and need not be unique across partition paths. 
   
   @nsivabalan could you help with the second part, where global bloom is used?
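
A quick way to see this effect in data is to group rows by record key and list the partitions each key appears in. The sketch below uses plain Python on made-up rows shaped like the issue's table (version_no, group_company, TimeCreated, sys_user); the sample values and the helper name `partitions_per_key` are assumptions for illustration.

```python
from collections import defaultdict


def partitions_per_key(rows):
    """Map each record key (version_no, group_company) to its partitions."""
    seen = defaultdict(set)
    for version_no, group_company, _time_created, sys_user in rows:
        seen[(version_no, group_company)].add(sys_user)
    return seen


# Hypothetical sample rows, not taken from the issue.
rows = [
    (1, "acme", "2020-08-01", "alice"),
    (1, "acme", "2020-08-02", "bob"),
    (2, "acme", "2020-08-01", "alice"),
]

# Keys that live in more than one partition look like duplicates to
# someone expecting table-level primary-key uniqueness.
spread = {k: sorted(p) for k, p in partitions_per_key(rows).items() if len(p) > 1}
print(spread)
```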



[GitHub] [hudi] tooptoop4 commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

tooptoop4 commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-681106051


   @nsivabalan https://hudi.apache.org/docs/writing_data.html



[GitHub] [hudi] tooptoop4 commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

Posted by GitBox <gi...@apache.org>.

tooptoop4 commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-672853435


   --hoodie-conf hoodie.index.type=GLOBAL_BLOOM --hoodie-conf hoodie.bloom.index.update.partition.path=true
   
   Adding those 2 flags seems to get the right row count in the target, but can you confirm it will take the latest row by source-ordering-field?
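
One way to check this independently of Hudi is to compute, from the source rows, the row you expect to win for each key and compare it with what lands in the target. The sketch below only states that expectation (the highest TimeCreated per version_no,group_company); it does not claim this is what the payload class does. The sample rows and the helper name `latest_per_key` are made up.

```python
def latest_per_key(rows):
    """Return the row with the greatest TimeCreated for each record key."""
    latest = {}
    for row in rows:
        version_no, group_company, time_created, _sys_user = row
        key = (version_no, group_company)
        # ISO-8601 strings in the same format compare chronologically.
        if key not in latest or time_created > latest[key][2]:
            latest[key] = row
    return latest


# Hypothetical source rows: (version_no, group_company, TimeCreated, sys_user).
source_rows = [
    (1, "acme", "2020-08-01T10:00:00", "alice"),
    (1, "acme", "2020-08-03T09:30:00", "bob"),
    (1, "acme", "2020-08-02T12:00:00", "carol"),
]

expected = latest_per_key(source_rows)
print(expected[(1, "acme")])
```

Here the expected survivor is the row with sys_user bob, so under the desired behaviour the target would hold exactly one row for that key, in the bob partition.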

