Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/02 22:52:59 UTC

[GitHub] [hudi] yihua commented on a change in pull request #3588: [MINOR] Fix wording and table in the marker blog

yihua commented on a change in pull request #3588:
URL: https://github.com/apache/hudi/pull/3588#discussion_r701473277



##########
File path: website/blog/2021-08-18-improving-marker-mechanism.md
##########
@@ -47,26 +47,26 @@ Note that the worker thread always checks whether the marker has already been cr
 
 ## Marker-related write options
 
-We introduce the following new marker-related write options in `0.9.0` release, to configure the marker mechanism.
+We introduce the following new marker-related write options in `0.9.0` release, to configure the marker mechanism.  Note that the timeline-server-based marker mechanism is not yet supported for HDFS in `0.9.0` release, and we plan to support the timeline-server-based marker mechanism for HDFS in the future.
 
 | Property Name |   Default   |     Meaning    |        
 | ------------- | ----------- | :-------------:| 
-| `hoodie.write.markers.type`     | direct | Marker type to use.  Two modes are supported: (1) `direct`: individual marker file corresponding to each data file is directly created by the writer; (2) `timeline_server_based`: marker operations are all handled at the timeline service which serves as a proxy.  New marker entries are batch processed and stored in a limited number of underlying files for efficiency. |
+| `hoodie.write.markers.type`     | direct | Marker type to use.  Two modes are supported: (1) `direct`: individual marker file corresponding to each data file is directly created by the executor; (2) `timeline_server_based`: marker operations are all handled at the timeline service which serves as a proxy.  New marker entries are batch processed and stored in a limited number of underlying files for efficiency. |
 | `hoodie.markers.timeline_server_based.batch.num_threads`     | 20 | Number of threads to use for batch processing marker creation requests at the timeline server. | 
 | `hoodie.markers.timeline_server_based.batch.interval_ms` | 50 | The batch interval in milliseconds for marker creation batch processing. |
 
 ## Performance
 
-We evaluate the write performance over both direct and timeline-server-based marker mechanisms by bulk-inserting a large dataset using Amazon EMR with Spark and S3. The input data is around 100GB.  We configure the write operation to generate a large number of data files concurrently by setting the max parquet file size to be 1MB and parallelism to be 240. As we noted before, while the latency of direct marker mechanism is acceptable for incremental writes with smaller number of data files written, it increases dramatically for large bulk inserts/writes which produce much more data files.
+We evaluate the write performance over both direct and timeline-server-based marker mechanisms by bulk-inserting a large dataset using Amazon EMR with Spark and S3. The input data is around 100GB.  We configure the write operation to generate a large number of data files concurrently by setting the max parquet file size to be 1MB and parallelism to be 240.  Note that it is unlikely to set max parquet file size to 1MB in production and such a setup is only to evaluate the performance regarding the marker mechanisms. As we noted before, while the latency of direct marker mechanism is acceptable for incremental writes with smaller number of data files written, it increases dramatically for large bulk inserts/writes which produce much more data files.
 
 As shown below, the timeline-server-based marker mechanism generates much fewer files storing markers because of the batch processing, leading to much less time on marker-related I/O operations, thus achieving 31% lower write completion time compared to the direct marker file mechanism.
 
 | Marker Type |   Total Files   |  Num data files written | Files created for markers | Marker deletion time | Bulk Insert Time (including marker deletion) |
 | ----------- | --------- | :---------: | :---------: | :---------: | :---------: | 
-| Direct | 165K | 1k | 165k | 5.4secs | - |
+| Direct | 1k | 1k | 1k | 5.4secs | - |

Review comment:
       Got it.  Fixed.
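
       For reference, a minimal sketch of how the three options in the table above could be assembled on the writer side. Only the property names, the two marker types, and the defaults (20 threads, 50 ms) come from the hunk; the helper `marker_write_options`, its `base_path` parameter, and the example bucket path are hypothetical and only for illustration. The HDFS check mirrors the note added in this change about `0.9.0`; it is not something Hudi is claimed to enforce this way.

```python
# Hypothetical helper: only the property names, marker types, and defaults
# are taken from the table in the hunk; everything else is illustrative.
MARKER_TYPES = ("direct", "timeline_server_based")


def marker_write_options(marker_type="direct", num_threads=20,
                         interval_ms=50, base_path=None):
    """Build the marker-related write options as a dict of strings."""
    if marker_type not in MARKER_TYPES:
        raise ValueError(f"unknown marker type: {marker_type!r}")
    # Per the note in the hunk, timeline-server-based markers are not yet
    # supported for HDFS in 0.9.0, so this sketch rejects that combination.
    if (marker_type == "timeline_server_based"
            and base_path is not None
            and base_path.startswith("hdfs://")):
        raise ValueError(
            "timeline_server_based markers are not supported on HDFS in 0.9.0")
    options = {"hoodie.write.markers.type": marker_type}
    # The batch settings only matter when the timeline server handles markers.
    if marker_type == "timeline_server_based":
        options["hoodie.markers.timeline_server_based.batch.num_threads"] = str(num_threads)
        options["hoodie.markers.timeline_server_based.batch.interval_ms"] = str(interval_ms)
    return options


# Example path is made up; any non-HDFS base path passes the check.
opts = marker_write_options("timeline_server_based",
                            base_path="s3://example-bucket/example-table")
for key, value in sorted(opts.items()):
    print(f"{key}={value}")
```

       With PySpark, a dict like this could presumably be passed along with the other required Hudi write options, e.g. `df.write.format("hudi").options(**opts)`.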




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org