
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/27 03:39:56 UTC

[GitHub] [hudi] nsivabalan opened a new pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

nsivabalan opened a new pull request #3549:
URL: https://github.com/apache/hudi/pull/3549


   ## What is the purpose of the pull request
   
   Blog on bulk_insert sort modes
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r703595908



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports a `bulk_insert` operation in addition to "insert" and "upsert" to ingest data into a hudi table. 
+There are different sort modes that one could employ while using bulk_insert. This blog will talk about 
+different sort modes available out of the box, and how each compares with others. 
+<!--truncate-->
+
+Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
+to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
+aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like 
+small files are not managed with bulk_insert. 
+
+Bulk insert offers 3 different sort modes to cater to different needs of users, based on the following principles.
+
+- Sorting will give us good compression and upsert performance, if data is laid out well. Especially if your record keys 
+  have some sort of ordering (timestamp, etc) characteristics, sorting will assist in trimming down a lot of files 
+  during upsert. If data is sorted by frequently queried columns, queries will leverage parquet predicate pushdown 
+  to trim down the data to ensure lower latency as well.
+  
+- Additionally, parquet writing is quite a memory intensive operation. When writing large volumes of data into a table 
+  that is also partitioned into 1000s of partitions, without sorting of any kind, the writer may have to keep 1000s of 
+  parquet writers open simultaneously incurring unsustainable memory pressure and eventually leading to crashes.
+  
+- It's also desirable to start with the smallest amount of files possible when bulk importing data, as to avoid 
+  metadata overhead later on for writers and queries. 
+
+3 Sort modes supported out of the box are: `PARTITION_SORT`,`GLOBAL_SORT` and `NONE`.
+
+## Configurations 
+One can set the config [“hoodie.bulkinsert.sort.mode”](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to either 
+of the three values, namely NONE, GLOBAL_SORT and PARTITION_SORT. Default sort mode is `GLOBAL_SORT`.
+
+## Different Sort Modes
+
+### Global Sort
+
+As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files 
+pruned using key ranges, during index lookups for subsequent upserts. This is because each file has non-overlapping 
+min, max values for keys, which really helps, when the key has some ordering characteristics such as a time based prefix.
+Given we are writing to a single parquet file on a single output partition path on storage at any given time, this mode
+greatly helps control memory pressure during large partitioned writes. Also due to global sorting, each small table 
+partition path will be written from atmost two spark partition and thus contain just 2 files. 
+This is the default sort mode with bulk_insert operation in Hudi. 
+
+### Partition sort
+In this sort mode, records within a given spark partition will be sorted. But there are chances that a given spark partition 
+can contain records from different table partitions. And so, even though we sort within each spark partitions, this sort
+mode could result in large number of files at the end of bulk_insert, since records for a given table partition could 
+be spread across many spark partitions. During actual write by the writers, we may not have much open files 
+simultaneously, since we close out the file before moving to next file (as records are sorted within a spark partition) 
+and hence may not have much memory pressure. 
+
+### None
+
+In this mode, no transformation such as sorting is done to the user records and delegated to the writers as is. So, 
+when writing large volumes of data into a table partitioned into 1000s of partitions, the writer may have to keep 1000s of
+parquet writers open simultaneously incurring unsustainable memory pressure and eventually leading to crashes. Also, 
+min max ranges for a given file could be very wide (unsorted records) and hence subsequent upserts may read 
+bloom filters from lot of files during index lookup. Since records are not sorted, and each writer could get records 
+across N number of table partitions, this sort mode could result in a huge number of files at the end of bulk import. 
+This could also impact your upsert or query performance due to large number of small files. 
+
+## User defined partitioner
+
+If none of the above built-in sort modes suffice, users can also choose to implement their own 
+[partitioner](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java)
+and plug it in with bulk insert as needed.
+
+## Bulk insert with different sort modes
+Here is a microbenchmark to show the performance difference between different sort modes.
+
+![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/sort-modes.png) <br/>

Review comment:
       Have fixed the benchmarks. PTAL. With the row writer, I don't see a lot of overhead w/ sorting. With the write client, there was definitely quite a bit of overhead.
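
For anyone reproducing the comparison, the sketch below collects the write options such a run would need. Only `hoodie.bulkinsert.sort.mode` and its three values come from the blog text. The other keys are Hudi Spark datasource options recalled from the Hudi configuration docs and should be checked against the version in use, and `bulk_insert_options` is a hypothetical helper name.

```python
# The three sort modes named in the blog post.
VALID_SORT_MODES = ("NONE", "GLOBAL_SORT", "PARTITION_SORT")


def bulk_insert_options(table_name, record_key, partition_path,
                        sort_mode="GLOBAL_SORT", row_writer=False):
    """Build a dict of write options for a bulk_insert run (hypothetical helper).

    Apart from "hoodie.bulkinsert.sort.mode", the keys are assumptions taken
    from memory of the Hudi docs, not from this thread.
    """
    if sort_mode not in VALID_SORT_MODES:
        raise ValueError(f"unknown sort mode: {sort_mode!r}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        "hoodie.bulkinsert.sort.mode": sort_mode,
        # Assumed switch between the row writer and the write client path.
        "hoodie.datasource.write.row.writer.enable": str(row_writer).lower(),
    }


opts = bulk_insert_options("trips", "uuid", "date", sort_mode="NONE")
```

The resulting dict could be passed to a Spark write along the lines of `df.write.format("hudi").options(**opts).mode("append").save(base_path)`, where `df` and `base_path` are placeholders.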





[GitHub] [hudi] pratyakshsharma commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r697830290



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected

Review comment:
       nit: loading to data -> loading of data.





[GitHub] [hudi] nsivabalan commented on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-907294772


   A few months back, when I ran some benchmarks, global sorting w/ bulk_insert took more time than no sorting, which made sense. But this time, it wasn't the way I anticipated.



[GitHub] [hudi] nsivabalan commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r703595908



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+## Bulk insert with different sort modes
+Here is a microbenchmark to show the performance difference between different sort modes.
+
+![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/sort-modes.png) <br/>

Review comment:
       Have fixed the benchmarks. PTAL. With the row writer, I don't see a lot of overhead w/ sorting. With the write client, there was definitely some overhead.





[GitHub] [hudi] nsivabalan commented on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-907543799


   @vinothchandar: this patch is also good to review. Updated based on our discussion.



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r707523359



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
+to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
+aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like 

Review comment:
       I still think there is some confusion here. I went through the entire flow of DeltaStreamer. As per the below 2 lines - 
   1. https://github.com/apache/hudi/blob/5d60491f5b76ef0f77174d71567d0673d9315bcd/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L469
   2. https://github.com/apache/hudi/blob/5d60491f5b76ef0f77174d71567d0673d9315bcd/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L597
   
   Both types of deduplication happen for the INSERT as well as the BULK_INSERT case. Please correct me if I am still getting it wrong, @nsivabalan.





[GitHub] [hudi] nsivabalan commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r703639893



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+### Global Sort
+
+As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files 
+pruned using key ranges, during index lookups for subsequent upserts. This is because each file has non-overlapping 
+min, max values for keys, which really helps, when the key has some ordering characteristics such as a time based prefix.
+Given we are writing to a single parquet file on a single output partition path on storage at any given time, this mode
+greatly helps control memory pressure during large partitioned writes. Also due to global sorting, each small table 
+partition path will be written from atmost two spark partition and thus contain just 2 files. 

Review comment:
       Yeah, that's why we call it out as a small table partition; the assumption is that the data for a single partition is < the max file size as per config.
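
The "at most two files" bound under that assumption can be checked with a little arithmetic. The sketch below is a toy model, not Hudi code: it assumes the globally sorted records are cut into contiguous Spark partitions of a fixed record count (real range partitioning uses sampled boundaries), that each Spark partition writes one file per table partition it touches, and that no file is split on size. `files_per_table_partition` is a hypothetical name.

```python
def files_per_table_partition(partition_sizes, records_per_spark_partition):
    """Count files written per table partition under the toy global-sort model.

    partition_sizes: dict mapping table partition path -> number of records.
    records_per_spark_partition: fixed slice size (an assumption of this model).
    """
    files = {}
    offset = 0  # position of the next record in the globally sorted order
    for path in sorted(partition_sizes):
        size = partition_sizes[path]
        if size == 0:
            continue
        first_slice = offset // records_per_spark_partition
        last_slice = (offset + size - 1) // records_per_spark_partition
        # One file per Spark partition (slice) that holds some of these records.
        files[path] = last_slice - first_slice + 1
        offset += size
    return files


small = files_per_table_partition(
    {"2021/08/01": 30, "2021/08/02": 30, "2021/08/03": 30}, 50)
large = files_per_table_partition({"2021/08/01": 120}, 50)
```

In this model a table partition no larger than one slice can straddle at most one slice boundary, so it ends up in one or two files. A partition larger than a slice, like the one in `large`, can span more.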





[GitHub] [hudi] nsivabalan commented on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-906900537


   <img width="1193" alt="Screen Shot 2021-08-26 at 11 35 23 PM" src="https://user-images.githubusercontent.com/513218/131068276-104629ae-cdc3-402b-b9b9-6e47cae0e163.png">
   <img width="1203" alt="Screen Shot 2021-08-26 at 11 35 34 PM" src="https://user-images.githubusercontent.com/513218/131068278-8fec2265-e187-498b-938b-9be47cb2b1de.png">
   <img width="1191" alt="Screen Shot 2021-08-26 at 11 35 43 PM" src="https://user-images.githubusercontent.com/513218/131068279-c790e212-4dc4-4ad5-a10b-a2b0d3a31510.png">
   <img width="1191" alt="Screen Shot 2021-08-26 at 11 36 06 PM" src="https://user-images.githubusercontent.com/513218/131068281-f73c3183-c415-4da3-ba0f-959915c7b8ec.png">
   <img width="1193" alt="Screen Shot 2021-08-26 at 11 36 14 PM" src="https://user-images.githubusercontent.com/513218/131068283-9c7331ba-2bcb-44e1-acda-a86c117bf92c.png">
   



[GitHub] [hudi] vinothchandar commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r700624615



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+## Bulk insert with different sort modes
+Here is a microbenchmark to show the performance difference between different sort modes.
+
+![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/sort-modes.png) <br/>

Review comment:
       It's a bit misleading how sorting adds very little overhead for bulk_insert. Thoughts?





[GitHub] [hudi] pratyakshsharma commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r697831983



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+### Global Sort
+
+As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files 
+pruned using key ranges, during index lookups for subsequent upserts. This is because each file has non-overlapping 
+min, max values for keys, which really helps, when the key has some ordering characteristics such as a time based prefix.
+Given we are writing to a single parquet file on a single output partition path on storage at any given time, this mode
+greatly helps control memory pressure during large partitioned writes. Also due to global sorting, each small table 
+partition path will be written from atmost two spark partition and thus contain just 2 files. 

Review comment:
       Can we add a link here that explains this (memory pressure control and at most 2 spark partitions) in detail? Frankly, this is new to me, and that might be the case for other users as well. 
   
   Essentially, I want to understand how sorting helps achieve all this. Maybe adding a visual representation would help here. 





[GitHub] [hudi] nsivabalan commented on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-923297572


   <img width="598" alt="bulk_insert_sort_modes" src="https://user-images.githubusercontent.com/513218/134074531-21133411-9691-4e26-a183-9c367dd1261b.png">
   



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r697831983



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports a `bulk_insert` operation in addition to "insert" and "upsert" to ingest data into a hudi table. 
+There are different sort modes that one could employ while using bulk_insert. This blog will talk about 
+different sort modes available out of the box, and how each compares with others. 
+<!--truncate-->
+
+Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
+to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
+aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like 
+small files are not managed with bulk_insert. 
+
+Bulk insert offers 3 different sort modes to cater to different needs of users, based on the following principles.
+
+- Sorting will give us good compression and upsert performance, if data is laid out well. Especially if your record keys 
+  have some sort of ordering (timestamp, etc) characteristics, sorting will assist in trimming down a lot of files 
+  during upsert. If data is sorted by frequently queried columns, queries will leverage parquet predicate pushdown 
+  to trim down the data to ensure lower latency as well.
+  
+- Additionally, parquet writing is quite a memory intensive operation. When writing large volumes of data into a table 
+  that is also partitioned into 1000s of partitions, without sorting of any kind, the writer may have to keep 1000s of 
+  parquet writers open simultaneously incurring unsustainable memory pressure and eventually leading to crashes.
+  
+- It's also desirable to start with the smallest amount of files possible when bulk importing data, as to avoid 
+  metadata overhead later on for writers and queries. 
+
+3 Sort modes supported out of the box are: `PARTITION_SORT`,`GLOBAL_SORT` and `NONE`.
+
+## Configurations 
+One can set the config [“hoodie.bulkinsert.sort.mode”](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to either 
+of the three values, namely NONE, GLOBAL_SORT and PARTITION_SORT. Default sort mode is `GLOBAL_SORT`.
+
+## Different Sort Modes
+
+### Global Sort
+
+As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files 
+pruned using key ranges, during index lookups for subsequent upserts. This is because each file has non-overlapping 
+min, max values for keys, which really helps, when the key has some ordering characteristics such as a time based prefix.
+Given we are writing to a single parquet file on a single output partition path on storage at any given time, this mode
+greatly helps control memory pressure during large partitioned writes. Also due to global sorting, each small table 
+partition path will be written from atmost two spark partition and thus contain just 2 files. 

Review comment:
       I guess adding some visual representation might help here. 
   
   Also, can you explain how a partition path will be written from at most 2 spark partitions? That depends on the file size and the amount of data present in a particular spark partition, right?
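
   One rough way to reason about the question above is a toy model, sketched here under stated assumptions: records are sorted by (table partition path, record key) and then cut into equal-sized contiguous ranges, one per spark partition. Real Spark range partitioning samples the data, so actual ranges are only approximately equal. The function name and the sample data are hypothetical and are not Hudi code.

```python
from collections import defaultdict


def spark_partitions_per_table_partition(records, num_spark_partitions):
    """Count how many Spark partitions write into each table partition path.

    records: list of (table_partition_path, record_key) tuples.
    Idealized model: records are sorted globally and then cut into
    equal-sized contiguous ranges, one range per Spark partition.
    """
    ordered = sorted(records)  # sort by partition path, then record key
    chunk = -(-len(ordered) // num_spark_partitions)  # ceiling division
    touched = defaultdict(set)
    for position, (path, _key) in enumerate(ordered):
        touched[path].add(position // chunk)
    return {path: len(ids) for path, ids in touched.items()}


# 12 small table partitions (7 records each) written by 5 Spark partitions.
small = [(f"2021/08/{day:02d}", f"key-{n}") for day in range(1, 13) for n in range(7)]
small_counts = spark_partitions_per_table_partition(small, 5)

# One large table partition (40 records) written by 4 Spark partitions.
large = [("2021/08/01", f"key-{n:02d}") for n in range(40)]
large_counts = spark_partitions_per_table_partition(large, 4)
```

   In this model a table partition that holds fewer records than one spark partition's range can straddle at most one range boundary, so it is written from at most two spark partitions. A table partition larger than a range spans more, which is why the claim is limited to small table partitions.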





[GitHub] [hudi] nsivabalan removed a comment on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan removed a comment on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-906900537







[GitHub] [hudi] nsivabalan commented on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-907522516


   I have addressed the feedback. 



[GitHub] [hudi] nsivabalan commented on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-907543700


   <img width="1204" alt="Screen Shot 2021-08-27 at 9 09 47 PM" src="https://user-images.githubusercontent.com/513218/131201491-63793fb6-06a2-437d-b43d-85d1ce7c74fe.png">
   <img width="1201" alt="Screen Shot 2021-08-27 at 9 09 59 PM" src="https://user-images.githubusercontent.com/513218/131201493-f951a69d-13ec-4cca-afe8-46d710e9b2dc.png">
   <img width="1188" alt="Screen Shot 2021-08-27 at 9 10 07 PM" src="https://user-images.githubusercontent.com/513218/131201494-82ecc14c-950b-4789-beb1-2c3498e8b48d.png">
   <img width="1194" alt="Screen Shot 2021-08-27 at 9 10 14 PM" src="https://user-images.githubusercontent.com/513218/131201496-61a6728a-c46a-4bbd-89e0-d014e837c6cd.png">
   



[GitHub] [hudi] nsivabalan removed a comment on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan removed a comment on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-907522666


   <img width="1198" alt="Screen Shot 2021-08-27 at 7 16 52 PM" src="https://user-images.githubusercontent.com/513218/131197470-bc085694-f322-4865-962a-72b4b7f22ea8.png">
   <img width="1204" alt="Screen Shot 2021-08-27 at 7 17 02 PM" src="https://user-images.githubusercontent.com/513218/131197474-c25c74c6-9136-459e-a2ca-d88543d6e8ff.png">
   <img width="1196" alt="Screen Shot 2021-08-27 at 7 17 12 PM" src="https://user-images.githubusercontent.com/513218/131197475-82f7835d-0547-4492-a43e-71d5c845abaf.png">
   <img width="1206" alt="Screen Shot 2021-08-27 at 7 17 20 PM" src="https://user-images.githubusercontent.com/513218/131197479-fd9ec438-5c6f-46ea-bc38-7218c93f0dcb.png">
   



[GitHub] [hudi] nsivabalan commented on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-907522666


   <img width="1198" alt="Screen Shot 2021-08-27 at 7 16 52 PM" src="https://user-images.githubusercontent.com/513218/131197470-bc085694-f322-4865-962a-72b4b7f22ea8.png">
   <img width="1204" alt="Screen Shot 2021-08-27 at 7 17 02 PM" src="https://user-images.githubusercontent.com/513218/131197474-c25c74c6-9136-459e-a2ca-d88543d6e8ff.png">
   <img width="1196" alt="Screen Shot 2021-08-27 at 7 17 12 PM" src="https://user-images.githubusercontent.com/513218/131197475-82f7835d-0547-4492-a43e-71d5c845abaf.png">
   <img width="1206" alt="Screen Shot 2021-08-27 at 7 17 20 PM" src="https://user-images.githubusercontent.com/513218/131197479-fd9ec438-5c6f-46ea-bc38-7218c93f0dcb.png">
   



[GitHub] [hudi] pratyakshsharma commented on pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#issuecomment-907591853


   Thank you for writing this blog @nsivabalan . Quite useful! :) 



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r697831983



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports a `bulk_insert` operation in addition to "insert" and "upsert" to ingest data into a hudi table. 
+There are different sort modes that one could employ while using bulk_insert. This blog will talk about 
+different sort modes available out of the box, and how each compares with others. 
+<!--truncate-->
+
+Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
+to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
+aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like 
+small files are not managed with bulk_insert. 
+
+Bulk insert offers 3 different sort modes to cater to different needs of users, based on the following principles.
+
+- Sorting will give us good compression and upsert performance, if data is laid out well. Especially if your record keys 
+  have some sort of ordering (timestamp, etc) characteristics, sorting will assist in trimming down a lot of files 
+  during upsert. If data is sorted by frequently queried columns, queries will leverage parquet predicate pushdown 
+  to trim down the data to ensure lower latency as well.
+  
+- Additionally, parquet writing is quite a memory intensive operation. When writing large volumes of data into a table 
+  that is also partitioned into 1000s of partitions, without sorting of any kind, the writer may have to keep 1000s of 
+  parquet writers open simultaneously incurring unsustainable memory pressure and eventually leading to crashes.
+  
+- It's also desirable to start with the smallest amount of files possible when bulk importing data, as to avoid 
+  metadata overhead later on for writers and queries. 
+
+3 Sort modes supported out of the box are: `PARTITION_SORT`,`GLOBAL_SORT` and `NONE`.
+
+## Configurations 
+One can set the config [“hoodie.bulkinsert.sort.mode”](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to either 
+of the three values, namely NONE, GLOBAL_SORT and PARTITION_SORT. Default sort mode is `GLOBAL_SORT`.
+
+## Different Sort Modes
+
+### Global Sort
+
+As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files 
+pruned using key ranges, during index lookups for subsequent upserts. This is because each file has non-overlapping 
+min, max values for keys, which really helps, when the key has some ordering characteristics such as a time based prefix.
+Given we are writing to a single parquet file on a single output partition path on storage at any given time, this mode
+greatly helps control memory pressure during large partitioned writes. Also due to global sorting, each small table 
+partition path will be written from atmost two spark partition and thus contain just 2 files. 

Review comment:
       Can we add a link here that explains this (memory pressure control and at most 2 spark partitions) in detail? Frankly, this is new to me, and that might be the case for other users as well. 
   
   Essentially, I want to understand how sorting helps achieve all this. 





[GitHub] [hudi] pratyakshsharma commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r697830937



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports a `bulk_insert` operation in addition to "insert" and "upsert" to ingest data into a hudi table. 
+There are different sort modes that one could employ while using bulk_insert. This blog will talk about 
+different sort modes available out of the box, and how each compares with others. 
+<!--truncate-->
+
+Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
+to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
+aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like 

Review comment:
       Please correct me if I am wrong. AFAIK, records are also looked up for deduplication in the case of bulk insert, just as with the insert operation. Am I missing something here?





[GitHub] [hudi] pratyakshsharma commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

pratyakshsharma commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r697831983



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports a `bulk_insert` operation in addition to "insert" and "upsert" to ingest data into a hudi table. 
+There are different sort modes that one could employ while using bulk_insert. This blog will talk about 
+different sort modes available out of the box, and how each compares with others. 
+<!--truncate-->
+
+Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
+to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
+aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like 
+small files are not managed with bulk_insert. 
+
+Bulk insert offers 3 different sort modes to cater to different needs of users, based on the following principles.
+
+- Sorting will give us good compression and upsert performance, if data is laid out well. Especially if your record keys 
+  have some sort of ordering (timestamp, etc) characteristics, sorting will assist in trimming down a lot of files 
+  during upsert. If data is sorted by frequently queried columns, queries will leverage parquet predicate pushdown 
+  to trim down the data to ensure lower latency as well.
+  
+- Additionally, parquet writing is quite a memory intensive operation. When writing large volumes of data into a table 
+  that is also partitioned into 1000s of partitions, without sorting of any kind, the writer may have to keep 1000s of 
+  parquet writers open simultaneously incurring unsustainable memory pressure and eventually leading to crashes.
+  
+- It's also desirable to start with the smallest amount of files possible when bulk importing data, as to avoid 
+  metadata overhead later on for writers and queries. 
+
+3 Sort modes supported out of the box are: `PARTITION_SORT`,`GLOBAL_SORT` and `NONE`.
+
+## Configurations 
+One can set the config [“hoodie.bulkinsert.sort.mode”](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to either 
+of the three values, namely NONE, GLOBAL_SORT and PARTITION_SORT. Default sort mode is `GLOBAL_SORT`.
+
+## Different Sort Modes
+
+### Global Sort
+
+As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files 
+pruned using key ranges, during index lookups for subsequent upserts. This is because each file has non-overlapping 
+min, max values for keys, which really helps, when the key has some ordering characteristics such as a time based prefix.
+Given we are writing to a single parquet file on a single output partition path on storage at any given time, this mode
+greatly helps control memory pressure during large partitioned writes. Also due to global sorting, each small table 
+partition path will be written from atmost two spark partition and thus contain just 2 files. 

Review comment:
       Can we add a link here that explains this (memory pressure control and at most 2 spark partitions) in detail? Frankly, this is new to me, and that might be the case for other users as well.





[GitHub] [hudi] vinothchandar commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r697582187



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,79 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports an operation called “bulk_insert” in addition to "insert" and "upsert" to ingest data into a hudi table. 
+There are different sort modes that one could employ while using bulk_insert operation. This blog will talk about 
+different sort modes available out of the box, and how each compares with others. 
+<!--truncate-->
+
+Apache Hudi supports an operation called “bulk_insert” to assist in initial loading to data to hudi. This is expected
+to be faster when compared to using “insert” or “upsert” operation types. Bulk insert differs from insert in one
+aspect. Small files optimization is not available with “bulk insert”, where as “insert” does small file management. So, 
+existing records/files are never looked up with bulk_insert operation, thus making it faster compared to other write operations. 
+
+Bulk insert offers 3 different sort modes to cater to different needs of users. In general, sorting will give us
+good compression and upsert performance if data is layed out well. Especially if your record keys have some sort of
+ordering (timestamp, etc) characteristics, sorting will assist in trimming down a lot of files during upsert. If data is 
+sorted using some of the mostly queried columns, queries will leverage parquet predicate pushdown to trim down the data 
+to ensure lower latency as well.
+
+3 Sort modes supported out of the box are: PARTITION_SORT, GLOBAL_SORT and NONE
+
+## Configurations 
+Config to set for sort mode is 
+[“hoodie.bulkinsert.sort.mode”](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) and different 
+values are: NONE, GLOBAL_SORT and PARTITION_SORT.
+
+## Partition sort
+
+Records are sorted within each spark partition and then written to hudi. This expected to be faster than Global sort
+and if one does not need global sort, should resort to this sort mode. Be wary of the parallelism used, since if there
+are too many spark partitions assigned to write to the same hudi partition, it could end up creating a lot of small files.
+
+## Global Sort
+
+As the name suggests, all records are sorted globally before being written. This is the default sort mode with
+“bulk_insert” operation. Since records are globally sorted in this mode, if record keys have some ordering characteristics,
+this will benefit a lot during upsert to trim down a lot of files.
+
+## None
+
+For CDC kind of use-cases, record keys are mostly random. So, sorting may not give any real benefit as such. For
+such use-cases, you can choose to not do any sorting only.
+
+## User defined partitioner
+
+Users can also implement their own [partitioner](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java)
+and plug it in with bulk insert as per necessity.
+
+## Future: Sort merge
+
+In future, hudi also plans to add sort merge for updates/inserts going to the same data file. This will benefit users
+who wants to maintain some ordering among records for faster write and query latency.
+
+## Bulk insert with different sort modes
+Here is a microbenchmark to show the performance difference between different sort modes.
+
+![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/bulkinsert-sort-modes.png)
+_Figure: Shows performance of different bulk insert variants_
+
+Two set of datasets were used, one with random keys and another one with timestamp based record keys. This benchmark 
+had 10M entries being bulk inserted to hudi using different sort modes. 
+
+## Upsert followed by bulk insert
+But the real impact of sorting will be realized by a following upsert as records need to be looked up during upsert. If
+records are sorted nicely, upsert operation could filter out lot of files using range pruning with bloom index. If not, 
+all data files need to be looked into to search for incoming records. 
+
+![Upsert followed by bulk_insert with different sort modes](/assets/images/blog/bulkinsert-sort-modes/upsert-sort-modes.png)
+
+As you could see, when data is globally sorted, upserts will have lower latency since lot of data files could be filtered out.

Review comment:
       We need more motivation here. At a high level, are we saying: pay more once during the initial write (say 2x) and reap the benefits on every subsequent upsert? We need a more compelling case IMO.
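
   The trade-off in the comment above can be framed as a break-even count. The sketch below is only an illustration: the function name is hypothetical and the numbers are made up, not taken from the benchmark in the blog.

```python
import math


def break_even_upserts(extra_sort_cost, saving_per_upsert):
    """Number of upserts after which a one-time sort cost is recovered."""
    if saving_per_upsert <= 0:
        raise ValueError("sorting never pays off without a per-upsert saving")
    return math.ceil(extra_sort_cost / saving_per_upsert)


# Made-up numbers for illustration only, not measurements from the benchmark:
# sorting adds 120 s to the bulk_insert and each later upsert gets 15 s faster.
upserts_needed = break_even_upserts(extra_sort_cost=120, saving_per_upsert=15)
```

   With real measurements plugged in, a small break-even count would be the compelling case: the one-time sort pays for itself after only a few upserts.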

##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,79 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports an operation called “bulk_insert” in addition to "insert" and "upsert" to ingest data into a hudi table. 
+There are different sort modes that one could employ while using bulk_insert operation. This blog will talk about 
+different sort modes available out of the box, and how each compares with others. 
+<!--truncate-->
+
+Apache Hudi supports an operation called “bulk_insert” to assist in initial loading to data to hudi. This is expected
+to be faster when compared to using “insert” or “upsert” operation types. Bulk insert differs from insert in one
+aspect. Small files optimization is not available with “bulk insert”, where as “insert” does small file management. So, 
+existing records/files are never looked up with bulk_insert operation, thus making it faster compared to other write operations. 
+
+Bulk insert offers 3 different sort modes to cater to different needs of users. In general, sorting will give us
+good compression and upsert performance if data is layed out well. Especially if your record keys have some sort of
+ordering (timestamp, etc) characteristics, sorting will assist in trimming down a lot of files during upsert. If data is 
+sorted using some of the mostly queried columns, queries will leverage parquet predicate pushdown to trim down the data 
+to ensure lower latency as well.
+
+3 Sort modes supported out of the box are: PARTITION_SORT, GLOBAL_SORT and NONE
+
+## Configurations 
+Config to set for sort mode is 
+[“hoodie.bulkinsert.sort.mode”](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) and different 
+values are: NONE, GLOBAL_SORT and PARTITION_SORT.
+
+## Partition sort
+
+Records are sorted within each spark partition and then written to hudi. This expected to be faster than Global sort
+and if one does not need global sort, should resort to this sort mode. Be wary of the parallelism used, since if there
+are too many spark partitions assigned to write to the same hudi partition, it could end up creating a lot of small files.
+
+## Global Sort
+
+As the name suggests, all records are sorted globally before being written. This is the default sort mode with
+“bulk_insert” operation. Since records are globally sorted in this mode, if record keys have some ordering characteristics,
+this will benefit a lot during upsert to trim down a lot of files.
+
+## None
+
+For CDC kind of use-cases, record keys are mostly random. So, sorting may not give any real benefit as such. For
+such use-cases, you can choose to not do any sorting only.
+
+## User defined partitioner
+
+Users can also implement their own [partitioner](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java)
+and plug it in with bulk insert as per necessity.
+
+## Future: Sort merge
+
+In future, hudi also plans to add sort merge for updates/inserts going to the same data file. This will benefit users
+who wants to maintain some ordering among records for faster write and query latency.
+
+## Bulk insert with different sort modes
+Here is a microbenchmark to show the performance difference between different sort modes.
+
+![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/bulkinsert-sort-modes.png)

Review comment:
       These figures are good, but I think we can group the numbers by sort mode rather than by key type, so it's easy to see the differences across the sort modes (the main focus of this blog).





[GitHub] [hudi] nsivabalan commented on a change in pull request #3549: [HUDI-2369] Blog on bulk_insert sort modes

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r703597160



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports a `bulk_insert` operation in addition to "insert" and "upsert" to ingest data into a hudi table. 
+There are different sort modes that one could employ while using bulk_insert. This blog will talk about 
+different sort modes available out of the box, and how each compares with others. 
+<!--truncate-->
+
+Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
+to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
+aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like 

Review comment:
       I guess two configs got mixed up here. One is dedup (combine before insert) and the other is Insert_Drop_Dupes. Dedup only deduplicates within the incoming batch of records. Insert_Drop_Dupes drops incoming records that are already in storage. With the row writer path, we don't support Insert_Drop_Dupes. 
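
   The difference between the two behaviours can be sketched in plain Python. This is a conceptual illustration only, not Hudi code: the function names, the record layout, and the rule that the largest `ts` wins are assumptions made for the example.

```python
def combine_before_insert(batch):
    """Dedup within the incoming batch: keep one record per key.

    Assumption for this sketch: the record with the largest `ts` wins.
    """
    latest = {}
    for record in batch:
        current = latest.get(record["key"])
        if current is None or record["ts"] > current["ts"]:
            latest[record["key"]] = record
    return list(latest.values())


def insert_drop_dupes(batch, keys_in_storage):
    """Drop incoming records whose key already exists in storage."""
    return [record for record in batch if record["key"] not in keys_in_storage]


batch = [
    {"key": "a", "ts": 1, "value": "old"},
    {"key": "a", "ts": 2, "value": "new"},
    {"key": "b", "ts": 1, "value": "only"},
]
keys_in_storage = {"b"}  # hypothetical: key "b" was written by an earlier commit

deduped = combine_before_insert(batch)
not_in_storage = insert_drop_dupes(batch, keys_in_storage)
```

   Only the second function needs to know what is already stored, which is the lookup that bulk_insert avoids. The first one works on the incoming batch alone.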



