You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hawq.apache.org by yo...@apache.org on 2016/08/19 17:48:21 UTC

[5/5] incubator-hawq-docs git commit: Clarify PXF segment control

Clarify PXF segment control


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/1f6714a3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/1f6714a3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/1f6714a3

Branch: refs/heads/develop
Commit: 1f6714a318255596fb6dbd15d3a49866d753294b
Parents: 42fa1bc
Author: Jane Beckman <jb...@pivotal.io>
Authored: Thu Aug 18 16:55:24 2016 -0700
Committer: David Yozie <yo...@apache.org>
Committed: Fri Aug 19 10:48:04 2016 -0700

----------------------------------------------------------------------
 bestpractices/general_bestpractices.html.md.erb | 1 +
 ddl/ddl-table.html.md.erb                       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/1f6714a3/bestpractices/general_bestpractices.html.md.erb
----------------------------------------------------------------------
diff --git a/bestpractices/general_bestpractices.html.md.erb b/bestpractices/general_bestpractices.html.md.erb
index 3d991cd..6c663c3 100644
--- a/bestpractices/general_bestpractices.html.md.erb
+++ b/bestpractices/general_bestpractices.html.md.erb
@@ -17,6 +17,7 @@ When using HAWQ, adhere to the following guidelines for best results:
     -   **Available resources**. Resources available at query time. If more resources are available in the resource queue, the resources will be used.
     -   **Hash table and bucket number**. If the query involves only hash-distributed tables, and the bucket number (bucketnum) configured for all the hash tables is either the same bucket number for all tables or the table size for random tables is no more than 1.5 times larger than the size of hash tables for the hash tables, then the query's parallelism is fixed (equal to the hash table bucket number). Otherwise, the number of virtual segments depends on the query's cost and hash-distributed table queries will behave like queries on randomly distributed tables.
     -   **Query Type**: For queries with some user-defined functions or for external tables where calculating resource costs is difficult , then the number of virtual segments is controlled by `hawq_rm_nvseg_perquery_limit `and `hawq_rm_nvseg_perquery_perseg_limit` parameters, as well as by the ON clause and the location list of external tables. If the query has a hash result table (e.g. `INSERT into hash_table`) then the number of virtual segment number must be equal to the bucket number of the resulting hash table, If the query is performed in utility mode, such as for `COPY` and `ANALYZE` operations, the virtual segment number is calculated by different policies, which will be explained later in this section.
+    -   **PXF**: PXF external tables use the `default_hash_table_bucket_number` parameter, not the `hawq_rm_nvseg_perquery_perseg_limit` parameter, to control the number of virtual segments. 
 
     See [Query Performance](../query/query-performance.html#topic38) for more details.
 

http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/1f6714a3/ddl/ddl-table.html.md.erb
----------------------------------------------------------------------
diff --git a/ddl/ddl-table.html.md.erb b/ddl/ddl-table.html.md.erb
index a29409c..7120031 100644
--- a/ddl/ddl-table.html.md.erb
+++ b/ddl/ddl-table.html.md.erb
@@ -68,7 +68,7 @@ All HAWQ tables are distributed. The default is `DISTRIBUTED RANDOMLY` \(round-r
 
 Randomly distributed tables have benefits over hash distributed tables. For example, after expansion, HAWQ's elasticity feature lets it automatically use more resources without needing to redistribute the data. For extremely large tables, redistribution is very expensive. Also, data locality for randomly distributed tables is better, especially after the underlying HDFS redistributes its data during rebalancing or because of data node failures. This is quite common when the cluster is large.
 
-However, hash distributed tables can be faster than randomly distributed tables. For example, for TPCH queries, where there are several queries, HASH distributed tables can have performance benefits. Choose a distribution policy that best suits your application scenario. When you `CREATE TABLE`, you can also specify the `bucketnum` option. The `bucketnum` determines the number of hash buckets used in creating a hash-distributed table or for pxf external table intermediate processing. The number of buckets also affects how many virtual segments will be created when processing this data. The bucketnumber of a gpfdist external table is the number of gpfdist location, and the bucketnumber of a command external table is `ON #num`.
+However, hash distributed tables can be faster than randomly distributed tables. For example, for TPCH queries, where there are several queries, HASH distributed tables can have performance benefits. Choose a distribution policy that best suits your application scenario. When you `CREATE TABLE`, you can also specify the `bucketnum` option. The `bucketnum` determines the number of hash buckets used in creating a hash-distributed table or for PXF external table intermediate processing. The number of buckets also affects how many virtual segments will be created when processing this data. The bucketnumber of a gpfdist external table is the number of gpfdist location, and the bucketnumber of a command external table is `ON #num`. PXF external tables use the `default_hash_table_bucket_number` parameter to control virtual segments. 
 
 HAWQ's elastic execution runtime is based on virtual segments, which are allocated on demand, based on the cost of the query. Each node uses one physical segment and a number of dynamically allocated virtual segments distributed to different hosts, thus simplifying performance tuning. Large queries use large numbers of virtual segments, while smaller queries use fewer virtual segments. Tables do not need to be redistributed when nodes are added or removed.