Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/10/30 17:26:33 UTC

[GitHub] [incubator-hudi] yihua commented on a change in pull request #985: [HUDI-275] Translate the Querying Data page into Chinese documentation

yihua commented on a change in pull request #985: [HUDI-275] Translate the Querying Data page into Chinese documentation
URL: https://github.com/apache/incubator-hudi/pull/985#discussion_r340756611
 
 

 ##########
 File path: docs/querying_data.cn.md
 ##########
 @@ -1,122 +1,122 @@
 ---
-title: Querying Hudi Datasets
+title: 查询 Hudi 数据集
 keywords: hudi, hive, spark, sql, presto
 sidebar: mydoc_sidebar
 permalink: querying_data.html
 toc: false
-summary: In this page, we go over how to enable SQL queries on Hudi built tables.
+summary: 在这一页里,我们介绍了如何在Hudi构建的表上启用SQL查询。
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](concepts.html#views). 
-Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto.
+从概念上讲,Hudi物理存储一次数据到DFS上,同时在其上提供三个逻辑视图,如[之前](concepts.html#views)所述。
+数据集同步到Hive Metastore后,它将提供由Hudi的自定义输入格式支持的Hive外部表。一旦提供了适当的Hudi捆绑包,
+就可以通过Hive、Spark和Presto之类的常用查询引擎来查询数据集。
 
-Specifically, there are two Hive tables named off [table name](configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
-For e.g, if `table name = hudi_tbl`, then we get  
+具体来说,在写入过程中传递了两个由[table name](configurations.html#TABLE_NAME_OPT_KEY)命名的Hive表。
+例如,如果`table name = hudi_tbl`,我们得到
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_tbl` 实现了由 `HoodieParquetInputFormat` 支持的数据集的读优化视图,从而提供了纯列式数据。
+ - `hudi_tbl_rt` 实现了由 `HoodieParquetRealtimeInputFormat` 支持的数据集的实时视图,从而提供了基础数据和日志数据的合并视图。
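
As an illustration (not part of the page being translated), once the dataset is synced to the Hive metastore the two views can be queried side by side; the column names below are hypothetical:

```sql
-- read optimized view: purely columnar data from compacted base files
select symbol, max(ts) from hudi_tbl group by symbol;

-- real time view: merged view of base and log data (fresher, potentially slower)
select symbol, max(ts) from hudi_tbl_rt group by symbol;
```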
 
-As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
-since a specified instant time. This, together with upserts, are particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out deltas](writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above, 
-with special configurations that indicates to query planning that only incremental data needs to be fetched out of the dataset. 
+如概念部分所述,[增量处理](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)所需要的
+一个关键原语是`增量拉取`(以从数据集中获取更改流/日志)。您可以增量提取Hudi数据集,这意味着自指定的即时时间起,
+您可以只获得全部更新和新行。 这与插入更新一起使用,对于构建某些数据管道尤其有用,包括将1个或多个源Hudi表(数据流/事实)以增量方式拉出(流/事实)
+并与其他表(数据集/维度)结合以[写出增量](writing_data.html)到目标Hudi数据集。增量视图是通过查询上表之一实现的,具有特殊配置,
+该特殊配置指示查询计划仅需要从数据集中获取增量数据。
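
One concrete form such "special configurations" can take in Hive is a set of per-table session properties; a sketch only, since the property names (which follow the `hoodie.<table>.consume.*` convention) and the example instant time/columns should be checked against the Hudi version in use:

```sql
-- ask query planning for incremental data only, starting after a given instant time
set hoodie.hudi_tbl.consume.mode=INCREMENTAL;
set hoodie.hudi_tbl.consume.start.timestamp=20191030142900;
set hoodie.hudi_tbl.consume.max.commits=3;

-- the predicate on the commit-time metadata field keeps older records out of the result
select `_hoodie_commit_time`, col1 from hudi_tbl
where `_hoodie_commit_time` > '20191030142900';
```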
 
-In sections, below we will discuss in detail how to access all the 3 views on each query engine.
+接下来,我们将详细讨论在每个查询引擎上如何访问所有三个视图。
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
-in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format 
-classes with its dependencies are available for query planning & execution. 
-
-### Read Optimized table {#hive-ro-view}
-In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the  fully qualified path name of the 
-inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set 
-to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
-
-### Real time table {#hive-rt-view}
-In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that
-queries can pick up the custom RecordReader as well.
-
-### Incremental Pulling {#hive-incr-pull}
-
-`HiveIncrementalPuller` allows incrementally extracting changes from large fact/dimension tables via HiveQL, combining the benefits of Hive (reliably process complex SQL queries) and 
-incremental primitives (speed up query by pulling tables incrementally instead of scanning fully). The tool uses Hive JDBC to run the hive query and saves its results in a temp table.
-that can later be upserted. Upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table.
-e.g: `/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`.The Delta Hive table registered will be of the form `{tmpdb}.{source_table}_{last_commit_included}`.
-
-The following are the configuration options for HiveIncrementalPuller
-
-| **Config** | **Description** | **Default** |
-|hiveUrl| Hive Server 2 URL to connect to |  |
-|hiveUser| Hive Server 2 Username |  |
-|hivePass| Hive Server 2 Password |  |
-|queue| YARN Queue name |  |
-|tmp| Directory where the temporary delta data is stored in DFS. The directory structure will follow conventions. Please see the below section.  |  |
-|extractSQLFile| The SQL to execute on the source table to extract the data. The data extracted will be all the rows that changed since a particular point in time. |  |
-|sourceTable| Source Table Name. Needed to set hive environment properties. |  |
-|targetTable| Target Table Name. Needed for the intermediate storage directory structure.  |  |
-|sourceDataPath| Source DFS Base Path. This is where the Hudi metadata will be read. |  |
-|targetDataPath| Target DFS Base path. This is needed to compute the fromCommitTime. This is not needed if fromCommitTime is specified explicitly. |  |
-|tmpdb| The database to which the intermediate temp delta table will be created | hoodie_temp |
-|fromCommitTime| This is the most important parameter. This is the point in time from which the changed records are pulled from.  |  |
-|maxCommits| Number of commits to include in the pull. Setting this to -1 will include all the commits from fromCommitTime. Setting this to a value > 0, will include records that ONLY changed in the specified number of commits after fromCommitTime. This may be needed if you need to catch up say 2 commits at a time. | 3 |
-|help| Utility Help |  |
-
-
-Setting fromCommitTime=0 and maxCommits=-1 will pull in the entire source dataset and can be used to initiate backfills. If the target dataset is a Hudi dataset,
-then the utility can determine if the target dataset has no commits or is behind more than 24 hour (this is configurable),
-it will automatically use the backfill configuration, since applying the last 24 hours incrementally could take more time than doing a backfill. The current limitation of the tool
-is the lack of support for self-joining the same table in mixed mode (normal and incremental modes).
-
-**NOTE on Hive queries that are executed using Fetch task:**
-Since Fetch tasks invoke InputFormat.listStatus() per partition, Hoodie metadata can be listed in
-every such listStatus() call. In order to avoid this, it might be useful to disable fetch tasks
-using the hive session property for incremental queries: `set hive.fetch.task.conversion=none;` This
-would ensure Map Reduce execution is chosen for a Hive query, which combines partitions (comma
-separated) and calls InputFormat.listStatus() only once with all those partitions.
+为了使Hive能够识别Hudi数据集并正确查询,
+HiveServer2需要在其[辅助jars路径](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)中提供`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`。 
+这将确保输入格式类及其依赖项可用于查询计划和执行。
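
A hedged aside (not from the page itself): depending on the deployment, the bundle can also be attached to a single session for quick experiments, although the documented, reliable approach remains the HiveServer2 aux jars path; the jar location below is a placeholder:

```sql
-- session-level alternative for ad-hoc testing only; replace the path/version with your build
add jar /path/to/hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar;
```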
+
+### 读优化表 {#hive-ro-view}
+除了上述设置之外,对于beeline cli访问,还需要将`hive.input.format`变量设置为`org.apache.hudi.hadoop.HoodieParquetInputFormat`输入格式的完全限定路径名。
+对于Tez,还需要将`hive.tez.input.format`设置为`org.apache.hadoop.hive.ql.io.HiveInputFormat`。
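
In a beeline session, the settings described above amount to the following (`hudi_tbl` is the example table name from earlier in the page):

```sql
set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
-- additionally, when running on Tez:
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

select count(*) from hudi_tbl;
```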
+
+### 实时表 {#hive-rt-view}
+除了在HiveServer2上安装Hive捆绑jars之外,还需要将其放在整个集群的hadoop/hive安装中,这样查询也可以使用自定义RecordReader。
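
Once the bundle jar is visible across the cluster, the real time table is queried like any other Hive table; a minimal sketch using the example table name from above:

```sql
-- merged view of base and log data
select * from hudi_tbl_rt limit 10;
```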
+
+### 增量拉取 {#hive-incr-pull}
+
+`HiveIncrementalPuller`允许通过HiveQL从大型事实/维表中增量提取更改,
+结合了Hive(可靠地处理复杂的SQL查询)和增量原语的好处(通过增量拉取而不是完全扫描来加快查询速度)。
+该工具使用Hive JDBC运行hive查询并将其结果保存在临时表中,这个表可以被插入更新。
+Upsert实用程序(`HoodieDeltaStreamer`)具有目录结构所需的所有状态,以了解目标表上的提交时间应为多少。
+例如:`/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`。
+已注册的Delta Hive表的格式为`{tmpdb}.{source_table}_{last_commit_included}`。
+
+以下是HiveIncrementalPuller的配置选项
+
+| **配置** | **描述** | **默认值** |
+|hiveUrl| 要连接的Hive Server 2的URL |  |
+|hiveUser| Hive Server 2 用户名 |  |
+|hivePass| Hive Server 2 密码 |  |
+|queue| YARN 队列名称 |  |
+|tmp| DFS中存储临时增量数据的目录。目录结构将遵循约定。请参阅以下部分。  |  |
+|extractSQLFile| 在源表上要执行的提取数据的SQL。提取的数据将是自特定时间点以来已更改的所有行。 |  |
+|sourceTable| 源表名称。在Hive环境属性中需要设置。 |  |
+|targetTable| 目标表名称。中间存储目录结构需要。  |  |
+|sourceDataPath| 源DFS基本路径。这是读取Hudi元数据的地方。 |  |
+|targetDataPath| 目标DFS基本路径。 这是计算fromCommitTime所必需的。 如果显式指定了fromCommitTime,则不需要设置这个参数。 |  |
+|tmpdb| 用来创建中间临时增量表的数据库 | hoodie_temp |
+|fromCommitTime| 这是最重要的参数。 这是从中提取更改的记录的时间点。 |  |
+|maxCommits| 要包含在拉取中的提交数。将此设置为-1将包括从fromCommitTime开始的所有提交。将此设置为大于0的值,将包括在fromCommitTime之后仅更改指定提交次数的记录。如果您需要一次赶上两次提交,则可能需要这样做。| 3 |
+|help| 实用程序帮助 |  |
+
+
+设置fromCommitTime=0和maxCommits=-1将提取整个源数据集,可用于启动回填。
 
 Review comment:
  I see that "Backfill" has been translated here as "回填". Or should we leave it untranslated and just use "Backfill"?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services