Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/10 19:06:19 UTC

[GitHub] [hudi] pratyakshsharma opened a new pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

pratyakshsharma opened a new pull request #1816:
URL: https://github.com/apache/hudi/pull/1816


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds a quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick one of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
     - [ ] For large changes, please consider breaking them into sub-tasks under an umbrella JIRA.





[GitHub] [hudi] pratyakshsharma edited a comment on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
pratyakshsharma edited a comment on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-662962138


   > @pratyakshsharma Thanks for the updates and sorry for the late response. For users not on the latest master, they still need to use NonpartitionedKeyGenerator, so I think it is valuable to mention it.
   
   In that case, we should mention the existing solutions for the other key generation cases as well, like SimpleKeyGenerator, ComplexKeyGenerator etc. Let me make the changes and update this PR. 





[GitHub] [hudi] nsuthar-lumiq removed a comment on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
nsuthar-lumiq removed a comment on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-665450530


   @pratyakshsharma could you please share the documentation that has an example of composite key usage? We are not able to figure out how to use it. Also, does it support PySpark?





[GitHub] [hudi] leesf commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
leesf commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-660992766


   @pratyakshsharma Thanks for the updates and sorry for the late response. For users not on the latest master, they still need to use `NonpartitionedKeyGenerator`, so I think it is valuable to mention it. 





[GitHub] [hudi] pratyakshsharma edited a comment on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
pratyakshsharma edited a comment on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-666426275


   @NikhilSuthar To check the usage, you can go to master and look at the TestComplexKeyGenerator and TestCustomKeyGenerator classes. That should help. To answer your other question, right now the entire codebase is in Java and Scala. 
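   
   A rough sketch of composite key usage through the Spark datasource is shown below. It is not taken from the PR; `inputDF`, `basePath`, the table name and the `org.apache.hudi.keygen` package location are assumptions and may differ across Hudi versions.
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // Assumed inputs: a DataFrame `inputDF` with columns field1, field2, country, date, ts
   // and a target path `basePath`.
   inputDF.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "hudi_composite_key_demo")
     // Comma separated fields produce a composite record key, e.g. field1:value1,field2:value2
     .option("hoodie.datasource.write.recordkey.field", "field1,field2")
     .option("hoodie.datasource.write.partitionpath.field", "country,date")
     // ComplexKeyGenerator handles composite record keys and composite partition paths
     .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .mode(SaveMode.Append)
     .save(basePath)
   ```
   
   Since these are plain string options on the DataFrameWriter and the key generators run on the JVM, the same configuration should also be usable from PySpark.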





[GitHub] [hudi] NikhilSuthar commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
NikhilSuthar commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-665452095


   @pratyakshsharma could you please share the documentation that has an example of composite key usage? We are not able to figure out how to use it. Also, does it support PySpark?





[GitHub] [hudi] pratyakshsharma commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-666426275


   @NikhilSuthar To check the usage, you can go to master and look at the TestComplexKeyGenerator and TestCustomKeyGenerator classes. That should help. To answer your other question, right now the entire codebase is in Java. 





[GitHub] [hudi] bhasudha merged pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
bhasudha merged pull request #1816:
URL: https://github.com/apache/hudi/pull/1816


   





[GitHub] [hudi] pratyakshsharma commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-657081370


   > hi @pratyakshsharma really great docs, should we also point out the hoodie.datasource.write.keygenerator.class config in this section?
   
   Will add a note that the property should be set to the CustomKeyGenerator class. 
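   
   As a hedged illustration of that note (the fully qualified class name and the option map below are assumptions, not part of the PR):
   
   ```scala
   // Only the key generation related options are shown; table name, precombine field
   // and the rest of the write options would still be needed for a real write.
   val keyGenOptions = Map(
     "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.CustomKeyGenerator", // assumed package
     "hoodie.datasource.write.recordkey.field"     -> "field1,field2",
     "hoodie.datasource.write.partitionpath.field" -> "country:SIMPLE,date:TIMESTAMP"
   )
   
   // Usage: inputDF.write.format("org.apache.hudi").options(keyGenOptions)...
   ```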





[GitHub] [hudi] pratyakshsharma commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-662962138


   > @pratyakshsharma Thanks for the updates and sorry for the late response. For users not on the latest master, they still need to use NonpartitionedKeyGenerator, so I think it is valuable to mention it.
   
   In that case, we should mention the existing solutions for the other key generation cases as well, like SimpleKeyGenerator, ComplexKeyGenerator etc. I was thinking that once CustomKeyGenerator gets released as part of the 0.6.0 release, we could mention something like this in the docs - 
   
   "For those using Hudi versions older than 0.6.0, you can use the following key generators - 
   
   1. Simple record key and custom timestamp based partition path - TimestampBasedKeyGenerator
   2. Composite record keys and composite partition paths - ComplexKeyGenerator
   3. Non partitioned table - NonpartitionedKeyGenerator"
   
   Let me know your thoughts on this @leesf 
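   
   A hedged sketch of how those pre-0.6.0 choices map onto the key generator config (class package locations changed between releases, TimestampBasedKeyGenerator in particular, so the fully qualified names below are assumptions to check against the release in use):
   
   ```scala
   // Hypothetical helper mapping the pre-0.6.0 use cases above to key generator classes.
   def legacyKeyGeneratorFor(useCase: String): String = useCase match {
     case "timestamp-partition" => "org.apache.hudi.keygen.TimestampBasedKeyGenerator"  // simple key + timestamp based partition path
     case "composite"           => "org.apache.hudi.keygen.ComplexKeyGenerator"         // composite record keys and partition paths
     case "non-partitioned"     => "org.apache.hudi.keygen.NonpartitionedKeyGenerator"  // non partitioned table
     case _                     => "org.apache.hudi.keygen.SimpleKeyGenerator"          // simple key + simple partition path
   }
   
   // e.g. .option("hoodie.datasource.write.keygenerator.class", legacyKeyGeneratorFor("composite"))
   ```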





[GitHub] [hudi] pratyakshsharma commented on a change in pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#discussion_r453205672



##########
File path: docs/_docs/2_2_writing_data.md
##########
@@ -28,6 +28,58 @@ can be chosen/changed across each commit/deltacommit issued against the table.
  of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do. 
 
 
+## Key Generation
+
+Hudi maintains hoodie keys (record key + partition path) for uniquely identifying a particular record. Hudi currently supports different combinations of record keys and partition paths as below - 
+
+ - Simple record key (consisting of only one field) and simple partition path (with optional hive style partitioning)
+ - Simple record key and custom timestamp based partition path (with optional hive style partitioning)
+ - Composite record keys (combination of multiple fields) and composite partition paths
+ - Composite record keys and timestamp based partition paths (composite also supported)
+ - Non partitioned table
+
+The `CustomKeyGenerator` class (part of the hudi-spark module) provides support for generating hoodie keys of all the types listed above. All you need to do is supply appropriate values for the following properties to create your desired keys - 
+
+```java
+hoodie.datasource.write.recordkey.field
+hoodie.datasource.write.partitionpath.field
+hoodie.datasource.write.keygenerator.class
+```
+
+To have composite record keys, you need to provide comma separated fields like
+```java
+hoodie.datasource.write.recordkey.field=field1,field2
+```
+
+This will create your record key in the format `field1:value1,field2:value2` and so on; for simple record keys, you specify only one field. The `CustomKeyGenerator` class defines an enum `PartitionKeyType` for configuring partition paths. It can take two possible values - SIMPLE and TIMESTAMP. 
+For partitioned tables, the value of the `hoodie.datasource.write.partitionpath.field` property needs to be provided in the format `field1:PartitionKeyType1,field2:PartitionKeyType2` and so on. For example, if you want to create the partition path using two fields `country` and `date`, where the latter has timestamp based values that need to be customised into a given format, you can specify the following 
+
+```java
+hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:TIMESTAMP
+``` 
+This will create the partition path in the format `<country_name>/<date>` or `country=<country_name>/date=<date>` depending on whether you want hive style partitioning or not.
+
+The `TimestampBasedKeyGenerator` class defines the following properties, which can be used to customise timestamp based partition paths
+
+```java
+hoodie.deltastreamer.keygen.timebased.timestamp.type
+  This defines the type of value that your field contains. It can be either in string format or in epoch format.
+hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit
+  This defines the granularity of your field, whether it contains the values in seconds or milliseconds
+hoodie.deltastreamer.keygen.timebased.input.dateformat
+  This defines the custom format in which the values are present in your field, for example yyyy/MM/dd
+hoodie.deltastreamer.keygen.timebased.output.dateformat
+  This defines the custom format in which you want the partition paths to be created, for example dt=yyyyMMdd
+hoodie.deltastreamer.keygen.timebased.timezone
+  This defines the timezone that the timestamp based values belong to
+```
+
+Finally, if you want to have a non partitioned table, you can simply leave the property blank, like

Review comment:
       Maybe I can add that `hoodie.datasource.write.keygenerator.class` should be set to the CustomKeyGenerator class for all cases. Even a non partitioned table can be handled with CustomKeyGenerator alone. 
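   
   A sketch tying the properties from the hunk above together for the `country:SIMPLE,date:TIMESTAMP` example (illustrative only; the timestamp property values, the class package and the `inputDF`/`basePath` placeholders are assumptions):
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   inputDF.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "hudi_keygen_demo")
     .option("hoodie.datasource.write.recordkey.field", "field1,field2")
     // SIMPLE and TIMESTAMP are the two PartitionKeyType values described in the hunk above
     .option("hoodie.datasource.write.partitionpath.field", "country:SIMPLE,date:TIMESTAMP")
     .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
     // TimestampBasedKeyGenerator properties for the `date` field; values here are illustrative
     .option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING")
     .option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy/MM/dd")
     .option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyyMMdd")
     .option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .mode(SaveMode.Append)
     .save(basePath)
   ```
   
   With hive style partitioning enabled this should produce paths like `country=<country_name>/date=<yyyyMMdd>`, otherwise `<country_name>/<yyyyMMdd>`.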







[GitHub] [hudi] leesf commented on a change in pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#discussion_r453204744



##########
File path: docs/_docs/2_2_writing_data.md
##########
@@ -28,6 +28,58 @@ can be chosen/changed across each commit/deltacommit issued against the table.
  of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do. 
 
 
+## Key Generation
+
+Hudi maintains hoodie keys (record key + partition path) for uniquely identifying a particular record. Hudi currently supports different combinations of record keys and partition paths as below - 
+
+ - Simple record key (consisting of only one field) and simple partition path (with optional hive style partitioning)
+ - Simple record key and custom timestamp based partition path (with optional hive style partitioning)
+ - Composite record keys (combination of multiple fields) and composite partition paths
+ - Composite record keys and timestamp based partition paths (composite also supported)
+ - Non partitioned table
+
+The `CustomKeyGenerator` class (part of the hudi-spark module) provides support for generating hoodie keys of all the types listed above. All you need to do is supply appropriate values for the following properties to create your desired keys - 
+
+```java
+hoodie.datasource.write.recordkey.field
+hoodie.datasource.write.partitionpath.field
+hoodie.datasource.write.keygenerator.class
+```
+
+To have composite record keys, you need to provide comma separated fields like
+```java
+hoodie.datasource.write.recordkey.field=field1,field2
+```
+
+This will create your record key in the format `field1:value1,field2:value2` and so on; for simple record keys, you specify only one field. The `CustomKeyGenerator` class defines an enum `PartitionKeyType` for configuring partition paths. It can take two possible values - SIMPLE and TIMESTAMP. 
+For partitioned tables, the value of the `hoodie.datasource.write.partitionpath.field` property needs to be provided in the format `field1:PartitionKeyType1,field2:PartitionKeyType2` and so on. For example, if you want to create the partition path using two fields `country` and `date`, where the latter has timestamp based values that need to be customised into a given format, you can specify the following 
+
+```java
+hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:TIMESTAMP
+``` 
+This will create the partition path in the format `<country_name>/<date>` or `country=<country_name>/date=<date>` depending on whether you want hive style partitioning or not.
+
+The `TimestampBasedKeyGenerator` class defines the following properties, which can be used to customise timestamp based partition paths
+
+```java
+hoodie.deltastreamer.keygen.timebased.timestamp.type
+  This defines the type of value that your field contains. It can be either in string format or in epoch format.
+hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit
+  This defines the granularity of your field, whether it contains the values in seconds or milliseconds
+hoodie.deltastreamer.keygen.timebased.input.dateformat
+  This defines the custom format in which the values are present in your field, for example yyyy/MM/dd
+hoodie.deltastreamer.keygen.timebased.output.dateformat
+  This defines the custom format in which you want the partition paths to be created, for example dt=yyyyMMdd
+hoodie.deltastreamer.keygen.timebased.timezone
+  This defines the timezone that the timestamp based values belong to
+```
+
+Finally, if you want to have a non partitioned table, you can simply leave the property blank, like

Review comment:
       should we also point out that `hoodie.datasource.write.keygenerator.class` should be set to `NonpartitionedKeyGenerator`?
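   
   For reference, a minimal hedged sketch of that suggestion (the fully qualified class name is an assumption and not part of the diff):
   
   ```scala
   // Non partitioned table via the dedicated key generator, for Hudi versions
   // that predate CustomKeyGenerator; the partition path field is not used.
   val nonPartitionedOptions = Map(
     "hoodie.datasource.write.recordkey.field"     -> "field1",
     "hoodie.datasource.write.partitionpath.field" -> "",
     "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
   )
   ```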







[GitHub] [hudi] NikhilSuthar edited a comment on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
NikhilSuthar edited a comment on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-665452095


   @pratyakshsharma could you please share the documentation that has an example of composite key usage? We are not able to figure out how to use it. Also, does it support PySpark?





[GitHub] [hudi] pratyakshsharma commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-662977488


   @leesf please take a pass. 





[GitHub] [hudi] bhasudha commented on a change in pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#discussion_r469733389



##########
File path: docs/_docs/2_2_writing_data.md
##########
@@ -28,6 +28,58 @@ can be chosen/changed across each commit/deltacommit issued against the table.
  of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do. 
 
 
+## Key Generation
+
+Hudi maintains hoodie keys (record key + partition path) for uniquely identifying a particular record. Hudi currently supports different combinations of record keys and partition paths as below - 
+
+ - Simple record key (consisting of only one field) and simple partition path (with optional hive style partitioning)
+ - Simple record key and custom timestamp based partition path (with optional hive style partitioning)
+ - Composite record keys (combination of multiple fields) and composite partition paths
+ - Composite record keys and timestamp based partition paths (composite also supported)
+ - Non partitioned table
+
+The `CustomKeyGenerator` class (part of the hudi-spark module) provides support for generating hoodie keys of all the types listed above. All you need to do is supply appropriate values for the following properties to create your desired keys - 
+
+```java
+hoodie.datasource.write.recordkey.field
+hoodie.datasource.write.partitionpath.field
+hoodie.datasource.write.keygenerator.class
+```
+
+To have composite record keys, you need to provide comma separated fields like
+```java
+hoodie.datasource.write.recordkey.field=field1,field2
+```
+
+This will create your record key in the format `field1:value1,field2:value2` and so on; for simple record keys, you specify only one field. The `CustomKeyGenerator` class defines an enum `PartitionKeyType` for configuring partition paths. It can take two possible values - SIMPLE and TIMESTAMP. 
+For partitioned tables, the value of the `hoodie.datasource.write.partitionpath.field` property needs to be provided in the format `field1:PartitionKeyType1,field2:PartitionKeyType2` and so on. For example, if you want to create the partition path using two fields `country` and `date`, where the latter has timestamp based values that need to be customised into a given format, you can specify the following 
+
+```java
+hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:TIMESTAMP
+``` 
+This will create the partition path in the format `<country_name>/<date>` or `country=<country_name>/date=<date>` depending on whether you want hive style partitioning or not.
+
+The `TimestampBasedKeyGenerator` class defines the following properties, which can be used to customise timestamp based partition paths
+
+```java
+hoodie.deltastreamer.keygen.timebased.timestamp.type
+  This defines the type of value that your field contains. It can be either in string format or in epoch format.
+hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit
+  This defines the granularity of your field, whether it contains the values in seconds or milliseconds
+hoodie.deltastreamer.keygen.timebased.input.dateformat
+  This defines the custom format in which the values are present in your field, for example yyyy/MM/dd
+hoodie.deltastreamer.keygen.timebased.output.dateformat
+  This defines the custom format in which you want the partition paths to be created, for example dt=yyyyMMdd
+hoodie.deltastreamer.keygen.timebased.timezone
+  This defines the timezone that the timestamp based values belong to
+```
+
+Finally, if you want to have a non partitioned table, you can simply leave the property blank, like

Review comment:
       > Maybe I can add that `hoodie.datasource.write.keygenerator.class` should be set to the CustomKeyGenerator class for all cases. Even a non partitioned table can be handled with CustomKeyGenerator alone.
   
   I think that should be okay. Just to avoid confusion, you can add that `CustomKeyGenerator` can also handle a non partitioned dataset when the key generator class is set to `CustomKeyGenerator` and `hoodie.datasource.write.partitionpath.field=` is left empty.
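   
   A hedged sketch of that combination (the class location and the `inputDF`/`basePath` placeholders are assumptions):
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // Non partitioned table handled by CustomKeyGenerator: the partition path property is left empty.
   inputDF.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "hudi_non_partitioned_demo")
     .option("hoodie.datasource.write.recordkey.field", "field1")
     .option("hoodie.datasource.write.partitionpath.field", "")
     .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .mode(SaveMode.Append)
     .save(basePath)
   ```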







[GitHub] [hudi] nsuthar-lumiq commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

Posted by GitBox <gi...@apache.org>.
nsuthar-lumiq commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-665450530


   @pratyakshsharma could you please share the documentation that has an example of composite key usage? We are not able to figure out how to use it. Also, does it support PySpark?

