You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hawq.apache.org by yo...@apache.org on 2017/04/21 20:43:16 UTC
incubator-hawq-docs git commit: HAWQ-1432 - new topic for locating a
HAWQ table's HDFS files (closes #118)
Repository: incubator-hawq-docs
Updated Branches:
refs/heads/develop d687d9755 -> 2e6e0f386
HAWQ-1432 - new topic for locating a HAWQ table's HDFS files (closes #118)
Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/2e6e0f38
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/2e6e0f38
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/2e6e0f38
Branch: refs/heads/develop
Commit: 2e6e0f386801a529f66fb0e0b20e50bebed020a9
Parents: d687d97
Author: Lisa Owen <lo...@pivotal.io>
Authored: Fri Apr 21 13:43:12 2017 -0700
Committer: David Yozie <yo...@apache.org>
Committed: Fri Apr 21 13:43:12 2017 -0700
----------------------------------------------------------------------
.../source/subnavs/apache-hawq-nav.erb | 3 +
markdown/ddl/locate-table-hdfs.html.md.erb | 160 +++++++++++++++++++
2 files changed, 163 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/2e6e0f38/book/master_middleman/source/subnavs/apache-hawq-nav.erb
----------------------------------------------------------------------
diff --git a/book/master_middleman/source/subnavs/apache-hawq-nav.erb b/book/master_middleman/source/subnavs/apache-hawq-nav.erb
index 489f0c4..a32c9ef 100644
--- a/book/master_middleman/source/subnavs/apache-hawq-nav.erb
+++ b/book/master_middleman/source/subnavs/apache-hawq-nav.erb
@@ -214,6 +214,9 @@
<a href="/docs/userguide/2.2.0.0-incubating/ddl/ddl-table.html">Creating and Managing Tables</a>
</li>
<li>
+ <a href="/docs/userguide/2.2.0.0-incubating/ddl/locate-table-hdfs.html"> Identifying HAWQ Table HDFS Files</a>
+ </li>
+ <li>
<a href="/docs/userguide/2.2.0.0-incubating/ddl/ddl-storage.html">Choosing the Table Storage Model</a>
</li>
<li>
http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/2e6e0f38/markdown/ddl/locate-table-hdfs.html.md.erb
----------------------------------------------------------------------
diff --git a/markdown/ddl/locate-table-hdfs.html.md.erb b/markdown/ddl/locate-table-hdfs.html.md.erb
new file mode 100644
index 0000000..a38aa1a
--- /dev/null
+++ b/markdown/ddl/locate-table-hdfs.html.md.erb
@@ -0,0 +1,160 @@
+---
+title: Identifying HAWQ Table HDFS Files
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+You can determine the HDFS location of the data file(s) associated with a specific HAWQ table using the HAWQ filespace HDFS location, the table identifier, and the identifiers for the tablespace and database in which the table resides.
+
+The number of HDFS data files associated with a HAWQ table is determined by the distribution mechanism (hash or random) identified when the table was first created or altered.
+
+Only an HDFS or HAWQ superuser may access HAWQ table HDFS files.
+
+## <a id="idhdfsloc"></a> HDFS Location
+
+The format of the HDFS file path for a HAWQ table is:
+
+``` pre
+hdfs://<name-node>:<port>/<hawq-filespace-dir>/<tablespace-oid>/<database-oid>/<table-relfilenode>/<file-number>
+```
+
+The HDFS file path components are described in the table below.
+
+| Path Component | Description |
+|---------------------|----------------------------|
+| \<name-node\> | The HDFS NameNode host. |
+| \<port\> | The HDFS NameNode port. |
+| \<hawq-filespace-dir\> | The HDFS directory location of the HAWQ filespace. The default HAWQ filespace HDFS directory is `hawq_default`. |
+| \<tablespace-oid\> | The tablespace object identifier. The default HAWQ tablespace identifier is `16385`. |
+| \<database-oid\> | The database object identifier. |
+| \<table-relfilenode\> | The table object identifier. |
+| \<file-number\> | The file number. |
+
+**Note**: The HAWQ filespace name and its HDFS directory location must be identified when you create a new HAWQ filespace. You must know both to locate the HDFS files for a specific HAWQ table.
+
+The \<name-node\>:\<port\>/\<default-hawq-filespace-dir\> together comprise the `hawq_dfs_url` server configuration parameter. To display the value of the HAWQ default filespace URL:
+
+``` shell
+gpadmin@master$ hawq config -s hawq_dfs_url
+GUC : hawq_dfs_url
+Value : <name-node>:8020/hawq_default
+```
+
+or view the **HAWQ** service **Configs > Advanced**, **General** pane, in your Ambari console.
+
+You can determine the tablespace, database, and table object identifiers through HAWQ catalog queries. See the [Example](#ex_hdfslochash) below.
+
+
+## <a id="idnumfiles"></a> Number of Data Files
+
+The number of data files that are created for a HAWQ table differs for hash-distributed and randomly-distributed HAWQ tables.
+
+Hash-distributed HAWQ tables use a fixed number of virtual segments (vsegs). This number is determined by the `default_hash_table_bucket_number` server configuration parameter setting, or the `BUCKETNUM` value you provide in the `CREATE TABLE` call. The number of HDFS files that HAWQ creates for a hash-distributed table also depends on the maximum number of concurrent inserts that have been executed against the table. The number of HDFS files is always the `default_hash_table_bucket_number` or `BUCKETNUM` value multiplied by the maximum number of concurrent inserts.
+
+The number of HDFS files generated for a randomly-distributed HAWQ table varies depending on the total number of virtual segments that have written data to the table.
+
+
+## <a id="ex_hdfslochash"></a> Example: Locating HDFS Files for a HAWQ Table
+
+Perform the following steps to identify the HDFS location of the data files associated with a hash-distributed HAWQ table. The SQL queries used in this example are applicable to randomly-distributed HAWQ tables as well.
+
+**Note**: Your HAWQ catalog object identifier query results may differ.
+
+1. Start the `psql` subsystem:
+
+ ``` shell
+ gpadmin@master$ psql -d testdb
+ ```
+
+2. Create a hash-distributed table with 4 buckets and insert some data:
+
+ ``` sql
+ testdb=# CREATE TABLE hash_tbl (id int) WITH (BUCKETNUM=4) DISTRIBUTED BY (id);
+ CREATE TABLE
+ testdb=# INSERT INTO hash_tbl SELECT i FROM generate_series(1,100) AS i;
+ INSERT 0 100
+ ```
+
+4. Determine the tablespace identifier for your filespace. You must know both the filespace and tablespace names. For example:
+
+ ``` sql
+ testdb=# SELECT fsname, spcname AS tablespace_name, tablespace_oid
+ FROM pg_filespace, gp_persistent_tablespace_node, pg_tablespace
+ WHERE pg_tablespace.spcfsoid = gp_persistent_tablespace_node.filespace_oid
+ AND pg_filespace.oid = pg_tablespace.spcfsoid
+ AND fsname !~ '^pg_' ORDER BY 1;
+ fsname | tablespace_name | tablespace_oid
+ ------------+-----------------+----------------
+ dfs_system | dfs_default | 16385
+ tryfs | try_tablespace | 16619
+ (2 rows)
+ ```
+
+ The default HAWQ filespace name is `dfs_system`. The tablespace identifier associated with the default HAWQ tablespace named `dfs_default` is `16385`. Make note of this identifier.
+
+ The example above includes a second HAWQ filespace named `tryfs`. The tablespace identifier associated with the tablespace named `try_tablespace` is `16619`.
+
+3. Determine the object identifier of the database `testdb`:
+
+ ``` sql
+ testdb=# SELECT oid FROM pg_database WHERE datname = 'testdb';
+ oid
+ -------
+ 16508
+ (1 row)
+ ```
+
+ Make note of this identifier.
+
+4. Tables of the same name may reside in different schemas. The catalog query you use to determine the identifier for the `hash_tbl` table also includes the schema name (`public`):
+
+ ``` sql
+ testdb=# SELECT relname, relfilenode, nspname, relnamespace
+ FROM pg_class, pg_namespace
+ WHERE relname = 'hash_tbl' AND nspname = 'public' AND relnamespace=pg_namespace.oid;
+ relname | relfilenode | nspname | relnamespace
+ ----------+-------------+-----------+--------------
+ hash_tbl | 55784 | public | 2200
+ (1 row)
+ ```
+
+ Make note of the `relfilenode` value for `hash_tbl`.
+
+4. Construct an HDFS file path for `hash_tbl`. For example, using the HDFS directory location of the HAWQ default filespace:
+
+ ``` pre
+ hdfs://<name-node>:<port>/<hawq-filespace-name>/<tablespace-oid>/<database-oid>/<table-relfilenode>/<file-number>
+ hdfs://<name-node>:8020/hawq_default/16385/16508/55784
+ ```
+
+ Substitute your HDFS NameNode for \<name-node\>.
+
+4. Locate the HDFS file(s):
+
+ ``` shell
+ gpadmin@master$ hdfs dfs -ls hdfs://<name-node>:8020/hawq_default/16385/16508/55784
+ Found 6 items
+ -rw------- 3 gpadmin gpadmin 176 2017-04-17 15:24 hdfs://name-node:8020/hawq_default/16385/16508/55784/1
+ -rw------- 3 gpadmin gpadmin 168 2017-04-17 15:24 hdfs://name-node:8020/hawq_default/16385/16508/55784/2
+ -rw------- 3 gpadmin gpadmin 192 2017-04-17 15:24 hdfs://name-node:8020/hawq_default/16385/16508/55784/3
+ -rw------- 3 gpadmin gpadmin 168 2017-04-17 15:24 hdfs://name-node:8020/hawq_default/16385/16508/55784/4
+ ```
+
+ As expected, `hash_tbl` is comprised of 4 HDFS data files, a multiple of the `BUCKETNUM` you specified when creating the table in Step 2.