Posted to commits@hudi.apache.org by vi...@apache.org on 2019/10/31 15:00:45 UTC

[incubator-hudi] branch asf-site updated: [HUDI-275] Translate the Querying Data page into Chinese documentation (#985)

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 4b3b197  [HUDI-275] Translate the Querying Data page into Chinese documentation (#985)
4b3b197 is described below

commit 4b3b197b8a6e983f20067ed3ef00694e19edf9f9
Author: Y Ethan Guo <et...@gmail.com>
AuthorDate: Thu Oct 31 08:00:37 2019 -0700

    [HUDI-275] Translate the Querying Data page into Chinese documentation (#985)
---
 docs/querying_data.cn.md | 174 +++++++++++++++++++++++------------------------
 1 file changed, 87 insertions(+), 87 deletions(-)

diff --git a/docs/querying_data.cn.md b/docs/querying_data.cn.md
index 1653b08..c690385 100644
--- a/docs/querying_data.cn.md
+++ b/docs/querying_data.cn.md
@@ -1,102 +1,102 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Datasets
 keywords: hudi, hive, spark, sql, presto
 sidebar: mydoc_sidebar
 permalink: querying_data.html
 toc: false
-summary: In this page, we go over how to enable SQL queries on Hudi built tables.
+summary: In this page, we go over how to enable SQL queries on tables built with Hudi.
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 logical views on top, as explained [before](concepts.html#views). 
-Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines like Hive, Spark and Presto.
+Conceptually, Hudi physically stores data once on DFS, while providing three logical views on top of it, as explained [before](concepts.html#views).
+Once the dataset is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom input formats. Once the proper Hudi
+bundle has been provided, the dataset can be queried by popular query engines such as Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table name](configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
-For e.g, if `table name = hudi_tbl`, then we get  
+Specifically, two Hive tables named after the [table name](configurations.html#TABLE_NAME_OPT_KEY) passed during the write are created.
+For example, if `table name = hudi_tbl`, we get
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_tbl` realizes the read optimized view of the dataset backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+ - `hudi_tbl_rt` realizes the real time view of the dataset backed by `HoodieParquetRealtimeInputFormat`, exposing a merged view of the base and log data.
 
-As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows 
-since a specified instant time. This, together with upserts, is particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out deltas](writing_data.html) to a target Hudi dataset. Incremental view is realized by querying one of the tables above,
-with special configurations that indicate to query planning that only incremental data needs to be fetched out of the dataset.
+As discussed in the concepts section, the one key primitive needed for [incremental processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
+is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the rows updated or added
+since a specified instant time. This, together with upserts, is particularly useful for building data pipelines where one or more source Hudi tables (streams/facts) are incrementally pulled
+and joined with other tables (datasets/dimensions) to [write out deltas](writing_data.html) to a target Hudi dataset. The incremental view is realized by querying one of the tables above
+with special configurations that indicate to query planning that only incremental data needs to be fetched out of the dataset.
 
-In the sections below, we will discuss in detail how to access all the 3 views on each query engine.
+Next, we will discuss in detail how to access all three views on each query engine.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
-in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format 
-classes with its dependencies are available for query planning & execution. 
-
-### Read Optimized table {#hive-ro-view}
-In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the  fully qualified path name of the 
-inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set 
-to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
-
-### Real time table {#hive-rt-view}
-In addition to installing the hive bundle jar on the HiveServer2, it needs to be put on the hadoop/hive installation across the cluster, so that
-queries can pick up the custom RecordReader as well.
-
-### Incremental Pulling {#hive-incr-pull}
-
-`HiveIncrementalPuller` allows incrementally extracting changes from large fact/dimension tables via HiveQL, combining the benefits of Hive (reliably process complex SQL queries) and 
-incremental primitives (speed up query by pulling tables incrementally instead of scanning fully). The tool uses Hive JDBC to run the hive query and saves its results in a temp table
-that can later be upserted. Upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table.
-e.g: `/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`. The Delta Hive table registered will be of the form `{tmpdb}.{source_table}_{last_commit_included}`.
-
-The following are the configuration options for HiveIncrementalPuller
-
-| **Config** | **Description** | **Default** |
-|hiveUrl| Hive Server 2 URL to connect to |  |
-|hiveUser| Hive Server 2 Username |  |
-|hivePass| Hive Server 2 Password |  |
-|queue| YARN Queue name |  |
-|tmp| Directory where the temporary delta data is stored in DFS. The directory structure will follow conventions. Please see the below section.  |  |
-|extractSQLFile| The SQL to execute on the source table to extract the data. The data extracted will be all the rows that changed since a particular point in time. |  |
-|sourceTable| Source Table Name. Needed to set hive environment properties. |  |
-|targetTable| Target Table Name. Needed for the intermediate storage directory structure.  |  |
-|sourceDataPath| Source DFS Base Path. This is where the Hudi metadata will be read. |  |
-|targetDataPath| Target DFS Base path. This is needed to compute the fromCommitTime. This is not needed if fromCommitTime is specified explicitly. |  |
-|tmpdb| The database to which the intermediate temp delta table will be created | hoodie_temp |
-|fromCommitTime| This is the most important parameter. This is the point in time from which the changed records are pulled from.  |  |
-|maxCommits| Number of commits to include in the pull. Setting this to -1 will include all the commits from fromCommitTime. Setting this to a value > 0, will include records that ONLY changed in the specified number of commits after fromCommitTime. This may be needed if you need to catch up say 2 commits at a time. | 3 |
-|help| Utility Help |  |
-
-
-Setting fromCommitTime=0 and maxCommits=-1 will pull in the entire source dataset and can be used to initiate backfills. If the target dataset is a Hudi dataset,
-then the utility can determine whether the target dataset has no commits or is behind by more than 24 hours (this is configurable), and
-it will automatically use the backfill configuration, since applying the last 24 hours incrementally could take more time than doing a backfill. The current limitation of the tool
-is the lack of support for self-joining the same table in mixed mode (normal and incremental modes).
-
-**NOTE on Hive queries that are executed using Fetch task:**
-Since Fetch tasks invoke InputFormat.listStatus() per partition, Hoodie metadata can be listed in
-every such listStatus() call. In order to avoid this, it might be useful to disable fetch tasks
-using the hive session property for incremental queries: `set hive.fetch.task.conversion=none;` This
-would ensure Map Reduce execution is chosen for a Hive query, which combines partitions (comma
-separated) and calls InputFormat.listStatus() only once with all those partitions.
+In order for Hive to recognize Hudi datasets and query them correctly, HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
+in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
+This will ensure the input format classes and their dependencies are available for query planning and execution.
+
+### Read Optimized table {#hive-ro-view}
+In addition to the setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully qualified name of the input format, `org.apache.hudi.hadoop.HoodieParquetInputFormat`.
+For Tez, `hive.tez.input.format` additionally needs to be set to `org.apache.hadoop.hive.ql.io.HiveInputFormat`.
+
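+The following is a minimal Java sketch of applying these settings over the Hive JDBC driver (the same `set` commands can be issued verbatim from the beeline CLI). The HiveServer2 host, port, credentials and the `hudi_tbl` table name are placeholders, and the Hive JDBC driver is assumed to be on the classpath.
+
+```
+import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.ResultSet;
+import java.sql.Statement;
+
+public class HudiReadOptimizedQuery {
+  public static void main(String[] args) throws Exception {
+    // Placeholder HiveServer2 endpoint; adjust host, port, database and credentials.
+    String jdbcUrl = "jdbc:hive2://hiveserver2-host:10000/default";
+    try (Connection conn = DriverManager.getConnection(jdbcUrl, "hive", "");
+         Statement stmt = conn.createStatement()) {
+      // Point Hive at Hudi's input format so the read optimized view is used.
+      stmt.execute("set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat");
+      // Additionally required when the execution engine is Tez.
+      stmt.execute("set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat");
+      try (ResultSet rs = stmt.executeQuery("select count(*) from hudi_tbl")) {
+        while (rs.next()) {
+          System.out.println(rs.getLong(1));
+        }
+      }
+    }
+  }
+}
+```
+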
+### Real time table {#hive-rt-view}
+In addition to installing the Hive bundle jar on the HiveServer2, it also needs to be put on the hadoop/hive installation across the cluster, so that queries can pick up the custom RecordReader as well.
+
+### Incremental Pulling {#hive-incr-pull}
+
+`HiveIncrementalPuller` allows incrementally extracting changes from large fact/dimension tables via HiveQL,
+combining the benefits of Hive (reliably processing complex SQL queries) and incremental primitives (speeding up queries by pulling tables incrementally instead of scanning them fully).
+The tool uses Hive JDBC to run the Hive query and saves its results in a temp table that can later be upserted.
+The upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what the commit time on the target table should be,
+e.g.: `/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`.
+The registered Delta Hive table will be of the form `{tmpdb}.{source_table}_{last_commit_included}`.
+
+The following are the configuration options for HiveIncrementalPuller:
+
+| **Config** | **Description** | **Default** |
+|---|---|---|
+|hiveUrl| Hive Server 2 URL to connect to |  |
+|hiveUser| Hive Server 2 username |  |
+|hivePass| Hive Server 2 password |  |
+|queue| YARN queue name |  |
+|tmp| Directory in DFS where the temporary delta data is stored. The directory structure follows conventions; please see the section below. |  |
+|extractSQLFile| The SQL to execute on the source table to extract the data. The data extracted will be all the rows that changed since a particular point in time. |  |
+|sourceTable| Source table name. Needed to set Hive environment properties. |  |
+|targetTable| Target table name. Needed for the intermediate storage directory structure. |  |
+|sourceDataPath| Source DFS base path. This is where the Hudi metadata will be read. |  |
+|targetDataPath| Target DFS base path. This is needed to compute the fromCommitTime. It is not needed if fromCommitTime is specified explicitly. |  |
+|tmpdb| The database in which the intermediate temp delta table will be created | hoodie_temp |
+|fromCommitTime| This is the most important parameter. This is the point in time from which the changed records are pulled. |  |
+|maxCommits| Number of commits to include in the pull. Setting this to -1 will include all the commits from fromCommitTime. Setting this to a value > 0 will include records that ONLY changed in the specified number of commits after fromCommitTime. This may be needed if you need to catch up, say, 2 commits at a time. | 3 |
+|help| Utility help |  |
+
+
+Setting fromCommitTime=0 and maxCommits=-1 will pull in the entire source dataset, and can be used to initiate backfills.
+If the target dataset is a Hudi dataset, the utility can determine whether the target dataset has no commits or is behind by more than 24 hours (this is configurable),
+and it will automatically use the backfill configuration, since applying the last 24 hours of changes incrementally could take more time than doing a backfill.
+The current limitation of the tool is the lack of support for self-joining the same table in mixed mode (normal and incremental modes).
+
+**NOTE on Hive queries that are executed using Fetch tasks:**
+Since Fetch tasks invoke InputFormat.listStatus() per partition, Hoodie metadata can be listed in every such listStatus() call.
+In order to avoid this, it might be useful to disable Fetch tasks for incremental queries using the Hive session property
+`set hive.fetch.task.conversion=none;`. This ensures Map Reduce execution is chosen for the Hive query,
+which combines the partitions (comma separated) and calls InputFormat.listStatus() only once with all those partitions.
 
 ## Spark
 
-Spark provides much easier deployment & management of Hudi jars and bundles into jobs/notebooks. At a high level, there are two ways to access Hudi datasets in Spark.
+Spark makes it much easier to deploy and manage Hudi jars and bundles into jobs/notebooks. In short, there are two ways to access Hudi datasets through Spark.
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three views, including the real time view, relying on the custom Hudi input formats again like Hive.
+ - **Hudi DataSource**: Supports the read optimized view and incremental pulling, similar to how standard datasources (e.g. `spark.read.parquet`) work.
+ - **Reading as Hive tables**: Supports all three views, including the real time view, relying on the custom Hudi input formats again, just like Hive.
  
 In general, your spark job needs a dependency on `hudi-spark`, or the `hudi-spark-bundle-x.y.z.jar` needs to be on the class path of driver & executors (hint: use `--jars` argument)
+In general, your Spark job needs a dependency on `hudi-spark`, or the `hudi-spark-bundle-x.y.z.jar`
+needs to be on the class path of the driver and executors (hint: use the `--jars` argument).
  
-### Read Optimized table {#spark-ro-view}
+### Read Optimized table {#spark-ro-view}
 
-To read RO table as a Hive table using SparkSQL, simply push a path filter into sparkContext as follows. 
-This method retains Spark built-in optimizations for reading Parquet files like vectorized reading on Hudi tables.
+To read the RO table as a Hive table using SparkSQL, simply push a path filter into the sparkContext as follows.
+For Hudi tables, this method retains Spark's built-in optimizations for reading Parquet files, such as vectorized reading.
 
 ```
 spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
 ```
 
-If you prefer to glob paths on DFS via the datasource, you can simply do something like below to get a Spark dataframe to work with. 
+If you prefer to glob paths on DFS via the datasource, you can simply do something like the following to get a Spark dataframe to work with.
 
 ```
 Dataset<Row> hoodieROViewDF = spark.read().format("org.apache.hudi")
@@ -104,9 +104,9 @@ Dataset<Row> hoodieROViewDF = spark.read().format("org.apache.hudi")
 .load("/glob/path/pattern");
 ```
  
-### Real time table {#spark-rt-view}
-Currently, real time table can only be queried as a Hive table in Spark. In order to do this, set `spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback 
-to using the Hive Serde to read the data (planning/executions is still Spark). 
+### Real time table {#spark-rt-view}
+Currently, the real time table can only be queried as a Hive table in Spark. In order to do this, set `spark.sql.hive.convertMetastoreParquet=false`,
+forcing Spark to fall back to using the Hive Serde to read the data (planning/execution is still Spark).
 
 ```
 $ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf  --packages com.databricks:spark-avro_2.11:4.0.0 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g  --master yarn-client
@@ -114,9 +114,9 @@ $ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /e
 scala> sqlContext.sql("select count(*) from hudi_rt where datestr = '2016-10-02'").show()
 ```
 
-### Incremental Pulling {#spark-incr-pull}
-The `hudi-spark` module offers the DataSource API, a more elegant way to pull data from Hudi dataset and process it via Spark.
-A sample incremental pull, that will obtain all records written since `beginInstantTime`, looks like below.
+### Incremental Pulling {#spark-incr-pull}
+The `hudi-spark` module offers the DataSource API, a more elegant way to pull data from a Hudi dataset and process it via Spark.
+A sample incremental pull, which will obtain all records written since `beginInstantTime`, looks like the following.
 
 ```
  Dataset<Row> hoodieIncViewDF = spark.read()
@@ -128,17 +128,17 @@ A sample incremental pull, that will obtain all records written since `beginInst
      .load(tablePath); // For incremental view, pass in the root/base path of dataset
 ```
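+A typical follow-up, sketched below with a hypothetical view name, is to register the incremental frame as a temporary view and continue with Spark SQL over just the changed records.
+
+```
+// Sketch only: "hudi_incr_view" is a placeholder name for the temporary view.
+hoodieIncViewDF.createOrReplaceTempView("hudi_incr_view");
+// Any Spark SQL now runs over only the records written since beginInstantTime.
+spark.sql("select count(*) from hudi_incr_view").show();
+```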
 
-Please refer to [configurations](configurations.html#spark-datasource) section, to view all datasource options.
+Please refer to the [configurations](configurations.html#spark-datasource) section to view all datasource options.
 
-Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
+Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing (see the sketch after the table below).
 
-| **API** | **Description** |
-| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hudi's own index for faster lookup |
-| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
-| checkExists(keys) | Check if the provided keys exist in a Hudi dataset |
+| **API** | **Description** |
+|---|---|
+| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hudi's own index for a faster lookup |
+| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
+| checkExists(keys) | Check if the provided keys exist in a Hudi dataset |
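+
+Below is a rough Java sketch of how these calls fit together. It is only illustrative: the method names follow the table above, but the package, constructor and exact signatures are assumptions, so please consult the `HoodieReadClient` javadocs for your Hudi version.
+
+```
+// Illustrative sketch only: constructor and signatures are assumptions; check the javadocs.
+JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
+HoodieReadClient readClient = new HoodieReadClient(jsc, "/path/to/hudi/dataset"); // base path of the dataset
+
+// Keys are (recordKey, partitionPath) pairs; the values below are placeholders.
+JavaRDD<HoodieKey> keys = jsc.parallelize(Collections.singletonList(new HoodieKey("rowKey1", "2019/10/31")));
+
+Dataset<Row> rows = readClient.read(keys); // read(keys): look up rows via Hudi's own index
+readClient.checkExists(keys);              // checkExists(keys): do these keys exist in the dataset?
+// filterExists(incomingRecords): pass an RDD[HoodieRecord] to drop records that already exist (de-duplication).
+```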
 
 
 ## Presto
 
-Presto is a popular query engine, providing interactive query performance. Hudi RO tables can be queried seamlessly in Presto.
-This requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+Presto is a popular query engine, providing interactive query performance. Hudi RO tables can be queried seamlessly in Presto.
+This requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
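+
+Once the bundle is in place, the read optimized Hive table registered for the dataset can be queried like any other Hive table. The Java snippet below is a minimal sketch using the Presto JDBC driver; the coordinator host, port, catalog/schema, user and the `hudi_tbl` table name are placeholders, and the presto-jdbc jar is assumed to be on the classpath.
+
+```
+import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.ResultSet;
+import java.sql.Statement;
+import java.util.Properties;
+
+public class HudiPrestoQuery {
+  public static void main(String[] args) throws Exception {
+    // Placeholder Presto coordinator and catalog/schema; adjust for your deployment.
+    String jdbcUrl = "jdbc:presto://presto-coordinator:8080/hive/default";
+    Properties props = new Properties();
+    props.setProperty("user", "hudi-reader");
+    try (Connection conn = DriverManager.getConnection(jdbcUrl, props);
+         Statement stmt = conn.createStatement();
+         // Query the read optimized Hive table registered for the Hudi dataset.
+         ResultSet rs = stmt.executeQuery("select count(*) from hudi_tbl")) {
+      while (rs.next()) {
+        System.out.println(rs.getLong(1));
+      }
+    }
+  }
+}
+```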