Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/03/22 17:24:30 UTC

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #610: Major cleanup of docs structure/content

vinothchandar commented on a change in pull request #610: Major cleanup of docs structure/content
URL: https://github.com/apache/incubator-hudi/pull/610#discussion_r268267284

##########
File path: docs/concepts.md
##########
@@ -35,91 +46,102 @@ When there is late arriving data (data intended for 9:00 arriving >1 hr late at
With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume
only the changed files without say scanning all the time buckets > 07:00.

-## Terminologies
+## File management
+Hudi organizes a dataset into a directory structure under a `basepath` on DFS. A dataset is broken up into partitions, which are folders containing data files for that partition,
+very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.

- * `Hudi Dataset`
- A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
- * `Commit`
- A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically
- published.
- * `Commit Timeline`
- Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime.
- * `File Slice`
- Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id.
- Hudi uses MVCC to provide atomicity and isolation of readers from a writer. This means that a logical fileId will
- have many physical versions of it. Each of these physical version of a file represents a complete view of the
- file as of a commit and is called File Slice
- * `File Group`
- A file-group is a file-slice timeline. It is a list of file-slices in commit order. It is identified by `file id`
+Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several
+`file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time,
+along with a set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced.
+Hudi adopts an MVCC design, where the compaction action merges logs and base files to produce new file slices and the cleaning action gets rid of
+unused/older file slices to reclaim space on DFS.

+Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism.
+This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the
+mapped file group contains all versions of a group of records.

-## Storage Types
+## Storage Types & Views
+Hudi storage types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e. how data is written).
+This is not to be confused with the notion of `views`, which are merely how the underlying data is exposed to the queries (i.e. how data is read).
+
+| Storage Type | Supported Views |
+|-------------- |------------------|
+| Copy On Write | Read Optimized + Incremental |
+| Merge On Read | Read Optimized + Incremental + Near Real-time |

-Hudi storage types capture how data is indexed & laid out on the filesystem, and how the above primitives and timeline activities are implemented on top of
-such organization (i.e how data is written). This is not to be confused with the notion of Read Optimized & Near-Real time tables, which are merely how the underlying data is exposed
-to the queries (i.e how data is read).
+### Storage Types
+Hudi supports the following storage types.

-Hudi (will) supports the following storage types.
+ - [Copy On Write](#copy-on-write-storage) : Stores data using solely columnar file formats (e.g. parquet). Updates simply version & rewrite the files by performing a synchronous merge during ingestion.

Review comment:
Fixed the rest, except the MVCC comment at the end. MVCC does not always mean making a copy; it can also mean logging.
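
To make the copy-versus-logging distinction concrete, here is a minimal sketch of a file group holding file slices. It is illustrative only: `FileSlice`, `FileGroup`, and every method name below are hypothetical and are not Hudi's actual classes or API. Instant times are assumed to be zero-padded strings so that plain string comparison orders them.

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    """One version of a file group: a base file plus the log written after it."""
    base_instant: str                        # instant time the base file was produced
    base_records: dict                       # stands in for the columnar base file
    log: list = field(default_factory=list)  # (instant, key, value) entries appended later

    def read(self, as_of=None):
        """Merge base and log entries, optionally only up to a given instant."""
        merged = dict(self.base_records)
        for instant, key, value in self.log:
            if as_of is None or instant <= as_of:
                merged[key] = value
        return merged


@dataclass
class FileGroup:
    """All file slices that share one file id, in instant order."""
    file_id: str
    slices: list = field(default_factory=list)

    def upsert_by_copy(self, instant, updates):
        """Versioning by copying: rewrite the latest data into a new slice."""
        latest = self.slices[-1].read() if self.slices else {}
        self.slices.append(FileSlice(instant, {**latest, **updates}))

    def upsert_by_logging(self, instant, updates):
        """Versioning by logging: append to the latest slice without rewriting it."""
        for key, value in updates.items():
            self.slices[-1].log.append((instant, key, value))

    def compact(self, instant):
        """Merge base and log of the latest slice into a new slice."""
        self.slices.append(FileSlice(instant, self.slices[-1].read()))


group = FileGroup("fg-1")
group.upsert_by_copy("001", {"k1": "a"})                # new base file
group.upsert_by_logging("002", {"k1": "b", "k2": "c"})  # log only, base untouched
old_view = group.slices[-1].read(as_of="001")           # reader pinned to instant 001
new_view = group.slices[-1].read()                      # reader at the latest instant
group.compact("003")                                    # logs folded into a new base
```

In this sketch both paths give readers a consistent version as of an instant: the copy path by producing a new slice, the logging path by filtering log entries on instant time.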

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services