Posted to commits@hawq.apache.org by yo...@apache.org on 2016/08/19 17:48:17 UTC

[1/5] incubator-hawq-docs git commit: pxf/hive reorganize syntax example and chg some params [#128450965]

Repository: incubator-hawq-docs
Updated Branches:
  refs/heads/develop 6e9f482ad -> 1f6714a31


pxf/hive reorganize syntax example and chg some params [#128450965]


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/4dfb8cd5
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/4dfb8cd5
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/4dfb8cd5

Branch: refs/heads/develop
Commit: 4dfb8cd51445dc9c3cc8f097b46cb0863f1fa596
Parents: 6e9f482
Author: Lisa Owen <lo...@pivotal.io>
Authored: Wed Aug 17 08:42:22 2016 -0700
Committer: David Yozie <yo...@apache.org>
Committed: Fri Aug 19 10:47:35 2016 -0700

----------------------------------------------------------------------
 pxf/HivePXF.html.md.erb | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/4dfb8cd5/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HivePXF.html.md.erb b/pxf/HivePXF.html.md.erb
index a7160d3..efac11e 100644
--- a/pxf/HivePXF.html.md.erb
+++ b/pxf/HivePXF.html.md.erb
@@ -72,9 +72,8 @@ PXF has three built-in profiles for Hive tables:
 -   HiveRC
 -   HiveText
 
-The Hive profile works with any Hive storage type. Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
-
-The following example creates a readable HAWQ external table representing a Hive table named `/user/eddie/test` using the PXF Hive profile:
+The Hive profile works with any Hive storage type. 
+The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
 
 ``` shell
 $ psql -d postgres
@@ -82,10 +81,14 @@ $ psql -d postgres
 
 ``` sql
 postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
-LOCATION ('pxf://namenode:51200/hive-db-name.test?PROFILE=Hive')
+LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
 FORMAT 'custom' (formatter='pxfwritable_import');
 ```
 
+
+Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
+
+
 ### <a id="topic_b4v_g3n_25"></a>Hive Complex Types
 
 PXF tables support Hive data types that are not primitive types. The supported Hive complex data types are array, struct, map, and union. This Hive `CREATE TABLE` statement, for example, creates a table with each of these complex data types:

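The DELIMITER requirement that this hunk moves below the example could be illustrated with a HiveText sketch of its own. The following is hypothetical and not part of the commit: the table name, host, and comma delimiter are assumptions, following the pattern of the Hive-profile example in the diff:

``` sql
postgres=# CREATE EXTERNAL TABLE hivetext_test(id int, name text)
LOCATION ('pxf://namenode:51200/inventory.accessories_text?PROFILE=HiveText&DELIMITER=\x2c')
FORMAT 'TEXT' (delimiter=E',');
```

With the HiveText profile the delimiter appears twice, as the DELIMITER option in the LOCATION clause and as the delimiter in the FORMAT clause, and the two must agree.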

[3/5] incubator-hawq-docs git commit: Updates [#128508767]

Posted by yo...@apache.org.
Updates [#128508767]


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/2349cea0
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/2349cea0
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/2349cea0

Branch: refs/heads/develop
Commit: 2349cea029c3c8638c3f5cf8842b8ea9ca65c426
Parents: e464585
Author: Jane Beckman <jb...@pivotal.io>
Authored: Wed Aug 17 15:18:09 2016 -0700
Committer: David Yozie <yo...@apache.org>
Committed: Fri Aug 19 10:47:48 2016 -0700

----------------------------------------------------------------------
 reference/cli/admin_utilities/hawqstate.html.md.erb | 8 --------
 1 file changed, 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/2349cea0/reference/cli/admin_utilities/hawqstate.html.md.erb
----------------------------------------------------------------------
diff --git a/reference/cli/admin_utilities/hawqstate.html.md.erb b/reference/cli/admin_utilities/hawqstate.html.md.erb
index d272892..3927442 100644
--- a/reference/cli/admin_utilities/hawqstate.html.md.erb
+++ b/reference/cli/admin_utilities/hawqstate.html.md.erb
@@ -9,10 +9,8 @@ Shows the status of a running HAWQ system.
 ``` pre
 hawq state 
      [-b]
-     [-d <master_data_dir> | --datadir <master_data_dir>]
      [-l <logfile_directory> | --logdir <logfile_directory>]
      [(-v | --verbose) | (-q | --quiet)]  
-     [--hawqhome <hawq_home_dir>]
      
 hawq state [-h | --help]
 ```
@@ -32,12 +30,6 @@ The `hawq state` utility displays information about a running HAWQ instance. A H
 <dt>-b (brief status)  </dt>
 <dd>Display a brief summary of the state of the HAWQ system. This is the default mode.</dd>
 
-<dt>-d, -\\\-datadir \<master\_data\_dir\>  </dt>
-<dd>Status of the master data directory.</dd>
-
-<dt>-\\\-hawqhome \<hawq\_home\_dir\>  </dt>
-<dd>Display details of the designated home data directory if`$GPHOME` is not defined.` $GPHOME` is used by default in a standard installation.</dd>
-
 <dt>-l, -\\\-logdir \<logfile\_directory\>  </dt>
 <dd>Specifies the directory to check for logfiles. The default is `$GPHOME/hawqAdminLogs`. 
 


[4/5] incubator-hawq-docs git commit: Removes heap table statement, updates [#128180963]

Posted by yo...@apache.org.
Removes heap table statement, updates [#128180963]


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/42fa1bc9
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/42fa1bc9
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/42fa1bc9

Branch: refs/heads/develop
Commit: 42fa1bc9363fc6104fe575e033129a1d5701c185
Parents: 2349cea
Author: Jane Beckman <jb...@pivotal.io>
Authored: Thu Aug 18 11:39:47 2016 -0700
Committer: David Yozie <yo...@apache.org>
Committed: Fri Aug 19 10:47:57 2016 -0700

----------------------------------------------------------------------
 reference/sql/CREATE-TABLE.html.md.erb | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/42fa1bc9/reference/sql/CREATE-TABLE.html.md.erb
----------------------------------------------------------------------
diff --git a/reference/sql/CREATE-TABLE.html.md.erb b/reference/sql/CREATE-TABLE.html.md.erb
index 5d1098b..99ff35e 100644
--- a/reference/sql/CREATE-TABLE.html.md.erb
+++ b/reference/sql/CREATE-TABLE.html.md.erb
@@ -228,7 +228,7 @@ The following storage options are available:
 
 **bucketnum** — Set to the number of hash buckets to be used in creating a hash-distributed table, specified as an integer greater than 0 and no more than the value of `default_hash_table_bucket_number`. The default when the table is created is 6 times the segment count. However, explicitly setting the bucket number when creating a hash table is recommended.
 
-**ORIENTATION** — Set to `row` (the default) for row-oriented storage, or parquet. The parquet column-oriented format can be more efficient for large-scale queries. This option is only valid if `APPENDONLY=TRUE`. Heap-storage tables can only be row-oriented.
+**ORIENTATION** — Set to `row` (the default) for row-oriented storage, or parquet. The parquet column-oriented format can be more efficient for large-scale queries. This option is only valid if `APPENDONLY=TRUE`.
 
 **COMPRESSTYPE** — Set to `ZLIB`, `SNAPPY`, or `GZIP` to specify the type of compression used. `ZLIB` provides more compact compression ratios at lower speeds. Parquet tables support `SNAPPY` and `GZIP` compression. Append-only tables support `SNAPPY` and `ZLIB` compression. This option is valid only if `APPENDONLY=TRUE`.
 
@@ -328,8 +328,8 @@ Using `SNAPPY` compression with parquet files is recommended for best performanc
 
 **Memory occupation**: When inserting or loading data to a parquet table, the whole rowgroup is stored in physical memory until the size exceeds the threshold or the end of the `INSERT` operation. Once either occurs, the entire rowgroup is flushed to disk. Also, at the beginning of the `INSERT` operation, each column is pre-allocated a page buffer. The column pre-allocated page buffer size should be `min(pageSizeLimit, rowgroupSizeLimit/estimatedColumnWidth/estimatedRecordWidth)` for the first rowgroup. For the following rowgroups, it should be `min(pageSizeLimit, actualColumnChunkSize in last rowgroup * 1.05)`, of which 1.05 is the estimated scaling factor. When reading data from a parquet table, the requested columns of the row group are loaded into memory. Memory is allocated 8 MB by default. Ensure that memory occupation does not exceed physical memory when setting `ROWGROUPSIZE` or `PAGESIZE`, otherwise you may encounter an out of memory error.
 
-**Batch vs. individual inserts**
-Only batch loading should be used with parquet files. Repeated individual inserts can result in bloated footers.
+**Bulk vs. trickle loads**
+Only bulk loads are recommended for use with parquet tables. Trickle loads can result in bloated footers and larger data files.
 
 ## <a id="parquetexamples"></a>Parquet Examples
 
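The parquet storage options discussed in the changed file combine roughly as in the following sketch (a hypothetical example; the table name, column names, and size values are assumptions, with `ROWGROUPSIZE` and `PAGESIZE` given in bytes):

``` sql
postgres=# CREATE TABLE sales_parquet (id int, total float)
WITH (APPENDONLY=TRUE, ORIENTATION=PARQUET, COMPRESSTYPE=SNAPPY,
      ROWGROUPSIZE=8388608, PAGESIZE=1048576)
DISTRIBUTED RANDOMLY;
postgres=# INSERT INTO sales_parquet SELECT * FROM sales_staging;
```

A single bulk `INSERT ... SELECT` such as this follows the bulk-load recommendation above; repeated single-row inserts would produce many small footers.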


[5/5] incubator-hawq-docs git commit: Clarify PXF segment control

Posted by yo...@apache.org.
Clarify PXF segment control


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/1f6714a3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/1f6714a3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/1f6714a3

Branch: refs/heads/develop
Commit: 1f6714a318255596fb6dbd15d3a49866d753294b
Parents: 42fa1bc
Author: Jane Beckman <jb...@pivotal.io>
Authored: Thu Aug 18 16:55:24 2016 -0700
Committer: David Yozie <yo...@apache.org>
Committed: Fri Aug 19 10:48:04 2016 -0700

----------------------------------------------------------------------
 bestpractices/general_bestpractices.html.md.erb | 1 +
 ddl/ddl-table.html.md.erb                       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/1f6714a3/bestpractices/general_bestpractices.html.md.erb
----------------------------------------------------------------------
diff --git a/bestpractices/general_bestpractices.html.md.erb b/bestpractices/general_bestpractices.html.md.erb
index 3d991cd..6c663c3 100644
--- a/bestpractices/general_bestpractices.html.md.erb
+++ b/bestpractices/general_bestpractices.html.md.erb
@@ -17,6 +17,7 @@ When using HAWQ, adhere to the following guidelines for best results:
     -   **Available resources**. Resources available at query time. If more resources are available in the resource queue, the resources will be used.
    -   **Hash table and bucket number**. If the query involves only hash-distributed tables, and the bucket number (`bucketnum`) configured for all the hash tables is the same, or if the table size for random tables is no more than 1.5 times larger than the size of the hash tables, then the query's parallelism is fixed (equal to the hash table bucket number). Otherwise, the number of virtual segments depends on the query's cost, and hash-distributed table queries will behave like queries on randomly distributed tables.
    -   **Query Type**: For queries with some user-defined functions, or for external tables where calculating resource costs is difficult, the number of virtual segments is controlled by the `hawq_rm_nvseg_perquery_limit` and `hawq_rm_nvseg_perquery_perseg_limit` parameters, as well as by the ON clause and the location list of external tables. If the query has a hash result table (e.g. `INSERT INTO hash_table`), then the number of virtual segments must equal the bucket number of the resulting hash table. If the query is performed in utility mode, such as for `COPY` and `ANALYZE` operations, the virtual segment number is calculated by different policies, which will be explained later in this section.
+    -   **PXF**: PXF external tables use the `default_hash_table_bucket_number` parameter, not the `hawq_rm_nvseg_perquery_perseg_limit` parameter, to control the number of virtual segments. 
 
     See [Query Performance](../query/query-performance.html#topic38) for more details.
 
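The parameters named in the added bullet can be inspected from a session; a minimal sketch (the values returned will vary by installation):

``` sql
postgres=# SHOW default_hash_table_bucket_number;
postgres=# SHOW hawq_rm_nvseg_perquery_perseg_limit;
```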

http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/1f6714a3/ddl/ddl-table.html.md.erb
----------------------------------------------------------------------
diff --git a/ddl/ddl-table.html.md.erb b/ddl/ddl-table.html.md.erb
index a29409c..7120031 100644
--- a/ddl/ddl-table.html.md.erb
+++ b/ddl/ddl-table.html.md.erb
@@ -68,7 +68,7 @@ All HAWQ tables are distributed. The default is `DISTRIBUTED RANDOMLY` \(round-r
 
 Randomly distributed tables have benefits over hash distributed tables. For example, after expansion, HAWQ's elasticity feature lets it automatically use more resources without needing to redistribute the data. For extremely large tables, redistribution is very expensive. Also, data locality for randomly distributed tables is better, especially after the underlying HDFS redistributes its data during rebalancing or because of data node failures. This is quite common when the cluster is large.
 
-However, hash distributed tables can be faster than randomly distributed tables. For example, for TPCH queries, where there are several queries, HASH distributed tables can have performance benefits. Choose a distribution policy that best suits your application scenario. When you `CREATE TABLE`, you can also specify the `bucketnum` option. The `bucketnum` determines the number of hash buckets used in creating a hash-distributed table or for pxf external table intermediate processing. The number of buckets also affects how many virtual segments will be created when processing this data. The bucketnumber of a gpfdist external table is the number of gpfdist location, and the bucketnumber of a command external table is `ON #num`.
+However, hash distributed tables can be faster than randomly distributed tables. For example, for certain TPC-H queries, hash-distributed tables can have performance benefits. Choose a distribution policy that best suits your application scenario. When you `CREATE TABLE`, you can also specify the `bucketnum` option. The `bucketnum` determines the number of hash buckets used in creating a hash-distributed table or for PXF external table intermediate processing. The number of buckets also affects how many virtual segments will be created when processing this data. The bucket number of a `gpfdist` external table is the number of `gpfdist` locations, and the bucket number of a command external table is set by `ON #num`. PXF external tables use the `default_hash_table_bucket_number` parameter to control virtual segments.
 
 HAWQ's elastic execution runtime is based on virtual segments, which are allocated on demand, based on the cost of the query. Each node uses one physical segment and a number of dynamically allocated virtual segments distributed to different hosts, thus simplifying performance tuning. Large queries use large numbers of virtual segments, while smaller queries use fewer virtual segments. Tables do not need to be redistributed when nodes are added or removed.
 
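The `bucketnum` option described in this hunk can be sketched as follows (a hypothetical example; the table and column names are assumptions):

``` sql
postgres=# CREATE TABLE orders (order_id int, amount float)
WITH (bucketnum=12)
DISTRIBUTED BY (order_id);
```

Per the text above, 12 must be no more than `default_hash_table_bucket_number`, and queries touching only this table would run with a fixed parallelism of 12 virtual segments.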


[2/5] incubator-hawq-docs git commit: enhance pxf/hive database info [#128450965]

Posted by yo...@apache.org.
enhance pxf/hive database info [#128450965]


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/e464585d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/e464585d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/e464585d

Branch: refs/heads/develop
Commit: e464585d7ec821c375c89e822aefac90ba429173
Parents: 4dfb8cd
Author: Lisa Owen <lo...@pivotal.io>
Authored: Wed Aug 17 13:37:51 2016 -0700
Committer: David Yozie <yo...@apache.org>
Committed: Fri Aug 19 10:47:43 2016 -0700

----------------------------------------------------------------------
 pxf/HivePXF.html.md.erb | 3 +++
 1 file changed, 3 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/e464585d/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HivePXF.html.md.erb b/pxf/HivePXF.html.md.erb
index efac11e..db3e53c 100644
--- a/pxf/HivePXF.html.md.erb
+++ b/pxf/HivePXF.html.md.erb
@@ -64,6 +64,9 @@ where `<pxf parameters>` is:
  | PROFILE=profile-name
 ```
 
+
+If `hive-db-name` is omitted, PXF defaults to the Hive `default` database.
+
 **Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
 
 PXF has three built-in profiles for Hive tables:
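The default-database behavior added in this hunk can be sketched as follows (hypothetical: the `accessories` table is assumed to live in Hive's `default` database):

``` sql
postgres=# CREATE EXTERNAL TABLE hivetest_default(id int, newid int)
LOCATION ('pxf://namenode:51200/accessories?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');
```

Because the path omits a `hive-db-name` prefix, PXF reads `accessories` from the Hive `default` database.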