Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/12 16:47:48 UTC

[GitHub] [hudi] yihua commented on a diff in pull request #6647: [DOCS] Add more FAQs

yihua commented on code in PR #6647:
URL: https://github.com/apache/hudi/pull/6647#discussion_r967553740


##########
website/docs/faq.md:
##########
@@ -581,6 +585,22 @@ After the second write:
 |  20220622204044318|20220622204044318...|                 1|                      |890aafc0-d897-44d...|hudi.apache.com|  1|   1|
 |  20220622204208997|20220622204208997...|                 2|                      |890aafc0-d897-44d...|             null|  1|   2|
 
+
+### I see two different records for the same record key value, each record key with a different timestamp format. How is this possible?
+
+This is a known issue when the row writer is enabled for the bulk_insert operation. When you do a bulk_insert followed by another
+write operation such as upsert/insert, this might be observed for timestamp fields specifically. For example, bulk_insert might produce
+timestamp `2016-12-29 09:54:00.0` for the record key, whereas a non-bulk_insert write operation might produce a long value like
+`1483023240000000` for the record key, thus creating two different records. To fix this, starting with 0.10.1, a new config [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled)
+is introduced to bring consistency irrespective of whether row writing is enabled or not. However, for the sake of
+backwards compatibility and not breaking existing pipelines, this config is not set to true by default and will have to be enabled explicitly.

Review Comment:
   nit: `not set to true` -> `set to false`
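
   For illustration, a minimal Spark (Scala) sketch of explicitly enabling this config; the SparkSession setup, source path, table name, and field names are hypothetical, only the Hudi option keys are real:

   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical example: enable consistent logical timestamp handling so that
   // bulk_insert (row writer path) and later upserts/inserts generate the same
   // record key representation for timestamp fields.
   val spark = SparkSession.builder().appName("hudi-consistent-ts-sketch").getOrCreate()
   val events = spark.read.parquet("/tmp/source/events") // source with a timestamp column `event_ts`

   events.write.format("hudi").
     option("hoodie.datasource.write.operation", "bulk_insert").
     option("hoodie.datasource.write.row.writer.enable", "true").      // row writer path used by bulk_insert
     option("hoodie.datasource.write.recordkey.field", "event_ts").
     option("hoodie.datasource.write.precombine.field", "event_ts").
     option("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled", "true"). // opt in explicitly; defaults to false
     option("hoodie.table.name", "events_hudi").
     mode("append").
     save("/tmp/hudi/events_hudi")
   ```

   Subsequent upsert/insert writes to the same table would carry the same flag so that both code paths produce the record key consistently.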



##########
website/docs/faq.md:
##########
@@ -581,6 +585,22 @@ After the second write:
 |  20220622204044318|20220622204044318...|                 1|                      |890aafc0-d897-44d...|hudi.apache.com|  1|   1|
 |  20220622204208997|20220622204208997...|                 2|                      |890aafc0-d897-44d...|             null|  1|   2|
 
+
+### I see two different records for the same record key value, each record key with a different timestamp format. How is this possible?
+
+This is a known issue when the row writer is enabled for the bulk_insert operation. When you do a bulk_insert followed by another
+write operation such as upsert/insert, this might be observed for timestamp fields specifically. For example, bulk_insert might produce
+timestamp `2016-12-29 09:54:00.0` for the record key, whereas a non-bulk_insert write operation might produce a long value like
+`1483023240000000` for the record key, thus creating two different records. To fix this, starting with 0.10.1, a new config [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled)
+is introduced to bring consistency irrespective of whether row writing is enabled or not. However, for the sake of
+backwards compatibility and not breaking existing pipelines, this config is not set to true by default and will have to be enabled explicitly.
+
+
+### Can I switch from one index type to another without having to rewrite the entire table?
+
+It should be okay to switch between Bloom index and Simple index as long as they are not global. 
+Moving from global to non global and vice versa may not work. Also switching between hbase and regular bloom might not work.

Review Comment:
   `non global` -> `non-global`



##########
website/docs/faq.md:
##########
@@ -287,13 +287,17 @@ Depending on how you write to Hudi these are the possible options currently.
    - Offline compaction can be achieved by setting ```compaction.async.enabled``` to ```false``` and periodically running [Flink offline Compactor](https://hudi.apache.org/docs/next/compaction/#flink-offline-compaction). When running the offline compactor, one needs to ensure there are no active writes to the table.
    - Third option (highly recommended over the second one) is to schedule the compactions from the regular ingestion job and executing the compaction plans from an offline job. To achieve this set ```compaction.async.enabled``` to ```false```, ```compaction.schedule.enabled``` to ```true``` and then run the [Flink offline Compactor](https://hudi.apache.org/docs/next/compaction/#flink-offline-compaction) periodically to execute the plans.
 
+### How to disable all table services in case of multiple writers?
+
+[hoodie.table.services.enabled](https://hudi.apache.org/docs/configurations/#hoodietableservicesenabled) is an umbrella config that can be used to turn off all table services at once without having to individually disable them. This is handy in use cases where there are multiple writers doing ingestion. While one of tha main pipelines can take care of the table services, other ingestion pipelines can disable them to avoid frequent triggering of cleaning/clustering etc. This does not apply to single writer scenarios.

Review Comment:
   nit: typo: `tha` -> `the`
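
   For reference, a hedged sketch of what the secondary writer side of this might look like with the Spark datasource; the SparkSession setup, source path, and table/field names are hypothetical, only the option keys are real:

   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical secondary ingestion pipeline: it only writes data and leaves
   // cleaning, clustering, compaction and archival to the primary pipeline.
   val spark = SparkSession.builder().appName("hudi-secondary-writer-sketch").getOrCreate()
   val updates = spark.read.parquet("/tmp/source/updates")

   updates.write.format("hudi").
     option("hoodie.datasource.write.operation", "upsert").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.table.services.enabled", "false").   // umbrella switch: disables all table services for this writer
     option("hoodie.table.name", "events_hudi").
     mode("append").
     save("/tmp/hudi/events_hudi")
   ```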



##########
website/docs/faq.md:
##########
@@ -287,13 +287,17 @@ Depending on how you write to Hudi these are the possible options currently.
    - Offline compaction can be achieved by setting ```compaction.async.enabled``` to ```false``` and periodically running [Flink offline Compactor](https://hudi.apache.org/docs/next/compaction/#flink-offline-compaction). When running the offline compactor, one needs to ensure there are no active writes to the table.
    - Third option (highly recommended over the second one) is to schedule the compactions from the regular ingestion job and executing the compaction plans from an offline job. To achieve this set ```compaction.async.enabled``` to ```false```, ```compaction.schedule.enabled``` to ```true``` and then run the [Flink offline Compactor](https://hudi.apache.org/docs/next/compaction/#flink-offline-compaction) periodically to execute the plans.
 
+### How to disable all table services in case of multiple writers?
+
+[hoodie.table.services.enabled](https://hudi.apache.org/docs/configurations/#hoodietableservicesenabled) is an umbrella config that can be used to turn off all table services at once without having to individually disable them. This is handy in use cases where there are multiple writers doing ingestion. While one of tha main pipelines can take care of the table services, other ingestion pipelines can disable them to avoid frequent triggering of cleaning/clustering etc. This does not apply to single writer scenarios.
+
 ### What performance/ingest latency can I expect for Hudi writing?
 
 The speed at which you can write into Hudi depends on the [write operation](https://hudi.apache.org/docs/write_operations) and some trade-offs you make along the way, like file sizing. Just like how databases incur overhead over direct/raw file I/O on disks, Hudi operations may have overhead from supporting database-like features compared to reading/writing raw DFS files. That said, Hudi implements advanced techniques from database literature to keep these minimal. Users are encouraged to keep this perspective when trying to reason about Hudi performance. As the saying goes: there is no free lunch (not yet at least).
 
 | Storage Type | Type of workload | Performance | Tips |
 |-------|--------|--------|--------|
-| copy on write | bulk_insert | Should match vanilla spark writing + an additional sort to properly size files | properly size [bulk insert parallelism](https://hudi.apache.org/docs/configurations#hoodiebulkinsertshuffleparallelism) to get right number of files. use insert if you want this auto tuned |
+| copy on write | bulk_insert | Should match vanilla spark writing + an additional sort to properly size files | Properly size [bulk insert parallelism](https://hudi.apache.org/docs/configurations#hoodiebulkinsertshuffleparallelism) to get the right number of files. Use insert if you want this auto tuned. Configure [hoodie.bulkinsert.sort.mode](https://hudi.apache.org/docs/configurations#hoodiebulkinsertsortmode) for better file sizes at the cost of memory. The default value NONE offers the fastest performance and mtches `spark.write.parquet()` in terms of number of files and overheads. |

Review Comment:
   typo: `mtches` -> `matches`
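
   For illustration, a hedged sketch of a bulk_insert tuned via the two configs mentioned above (shuffle parallelism and sort mode); the parallelism value, paths, and names are hypothetical:

   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical initial load: shuffle parallelism roughly controls the number of
   // files produced, and a sort mode other than NONE trades speed/memory for
   // better-sized files.
   val spark = SparkSession.builder().appName("hudi-bulk-insert-sketch").getOrCreate()
   val initialLoad = spark.read.parquet("/tmp/source/initial_load")

   initialLoad.write.format("hudi").
     option("hoodie.datasource.write.operation", "bulk_insert").
     option("hoodie.bulkinsert.shuffle.parallelism", "200").   // sized to the target file count
     option("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT").     // NONE is fastest; sorting helps file sizing
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.table.name", "events_hudi").
     mode("append").
     save("/tmp/hudi/events_hudi")
   ```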



##########
website/docs/faq.md:
##########
@@ -581,6 +585,22 @@ After the second write:
 |  20220622204044318|20220622204044318...|                 1|                      |890aafc0-d897-44d...|hudi.apache.com|  1|   1|
 |  20220622204208997|20220622204208997...|                 2|                      |890aafc0-d897-44d...|             null|  1|   2|
 
+
+### I see two different records for the same record key value, each record key with a different timestamp format. How is this possible?
+
+This is a known issue when the row writer is enabled for the bulk_insert operation. When you do a bulk_insert followed by another
+write operation such as upsert/insert, this might be observed for timestamp fields specifically. For example, bulk_insert might produce
+timestamp `2016-12-29 09:54:00.0` for the record key, whereas a non-bulk_insert write operation might produce a long value like
+`1483023240000000` for the record key, thus creating two different records. To fix this, starting with 0.10.1, a new config [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled)
+is introduced to bring consistency irrespective of whether row writing is enabled or not. However, for the sake of
+backwards compatibility and not breaking existing pipelines, this config is not set to true by default and will have to be enabled explicitly.
+
+
+### Can I switch from one index type to another without having to rewrite the entire table?
+
+It should be okay to switch between Bloom index and Simple index as long as they are not global. 
+Moving from global to non global and vice versa may not work. Also switching between hbase and regular bloom might not work.

Review Comment:
   `hbase` -> `HBase (global index)`
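
   As a hedged sketch, switching between the two non-global index types on a later write is just a matter of changing `hoodie.index.type`; the paths and names below are hypothetical:

   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical: a table previously written with the default BLOOM index is now
   // upserted with the SIMPLE index. Both are non-global, so the table does not
   // need to be rewritten; switching to/from global or HBase index types is where
   // issues may arise.
   val spark = SparkSession.builder().appName("hudi-index-switch-sketch").getOrCreate()
   val moreUpdates = spark.read.parquet("/tmp/source/more_updates")

   moreUpdates.write.format("hudi").
     option("hoodie.datasource.write.operation", "upsert").
     option("hoodie.index.type", "SIMPLE").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.table.name", "events_hudi").
     mode("append").
     save("/tmp/hudi/events_hudi")
   ```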


