Posted to commits@impala.apache.org by ar...@apache.org on 2019/02/07 21:48:43 UTC
[impala] 01/03: [DOCS] Format fixes in the TABLESAMPLE doc
This is an automated email from the ASF dual-hosted git repository.
arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit 9a3633c75f84601e7b8d05522daa6a9e0cf66a53
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Thu Feb 7 11:38:43 2019 -0800
[DOCS] Format fixes in the TABLESAMPLE doc
Change-Id: I2304e422636d03d7d1109d3f0a551d6f04ce1a63
Reviewed-on: http://gerrit.cloudera.org:8080/12392
Reviewed-by: Alex Rodoni <ar...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
docs/topics/impala_tablesample.xml | 362 +++++++++++++++++--------------------
1 file changed, 165 insertions(+), 197 deletions(-)
diff --git a/docs/topics/impala_tablesample.xml b/docs/topics/impala_tablesample.xml
index e5123cb..67372fa 100644
--- a/docs/topics/impala_tablesample.xml
+++ b/docs/topics/impala_tablesample.xml
@@ -21,6 +21,7 @@ under the License.
<concept id="tablesample" rev="IMPALA-5309">
<title>TABLESAMPLE Clause</title>
+
<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -34,263 +35,250 @@ under the License.
<conbody>
<p>
- Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need
- to explore the data distribution within the table, the table is very large,
- and it is impractical or unnecessary to process all the data from the table
- or selected partitions.
+ Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need to explore the
+ data distribution within the table, the table is very large, and it is impractical or
+ unnecessary to process all the data from the table or selected partitions.
</p>
<p>
- The clause makes the query process a randomized set of data files from the
- table, so that the total volume of data is greater than or equal to the specified
- percentage of data bytes within that table. (Or the data bytes within the set of
- partitions that remain after partition pruning is performed.)
+ The clause makes the query process a randomized set of data files from the table, so that
+ the total volume of data is greater than or equal to the specified percentage of data
+ bytes within that table. (Or the data bytes within the set of partitions that remain after
+ partition pruning is performed.)
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>
- <ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph>
+<ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph>
</codeblock>
<p>
- The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table alias.
+ The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table
+ alias.
</p>
<p>
- The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently,
- Impala only supports a single sampling method named <codeph>SYSTEM</codeph>.
+ The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently, Impala only
+ supports a single sampling method named <codeph>SYSTEM</codeph>.
</p>
<p>
- The <varname>percentage</varname> argument is an integer literal from 0 to 100.
- A percentage of 0 produces an empty result set for a particular table reference,
- while a percentage of 100 uses the entire contents. Because the sampling works by
- selecting a random set of data files, the proportion of sampled data from the
- table may be greater than the specified percentage, based on the number and sizes
- of the underlying data files. See the usage notes for details.
+ The <varname>percentage</varname> argument is an integer literal from 0 to 100. A
+ percentage of 0 produces an empty result set for a particular table reference, while a
+ percentage of 100 uses the entire contents. Because the sampling works by selecting a
+ random set of data files, the proportion of sampled data from the table may be greater
+ than the specified percentage, based on the number and sizes of the underlying data files.
+ See the usage notes for details.
</p>
<p>
- The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary
- positive integer seed value that ensures that when the query is run again, the
- sampling selects the same set of data files each time. <codeph>REPEATABLE</codeph>
- does not have a default value. If you omit the <codeph>REPEATABLE</codeph> keyword,
- the random seed is derived from the current time.
+ The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary positive
+ integer seed value that ensures that when the query is run again, the sampling selects the
+ same set of data files each time. <codeph>REPEATABLE</codeph> does not have a default
+ value. If you omit the <codeph>REPEATABLE</codeph> keyword, the random seed is derived
+ from the current time.
</p>
<p conref="../shared/impala_common.xml#common/added_in_290"/>
<p rev="2.12.0 IMPALA-5310">
- See <keyword keyref="compute_stats"/> for the
- <codeph>TABLESAMPLE</codeph> clause used in the <codeph>COMPUTE
- STATS</codeph> statement.
+ See <keyword keyref="compute_stats"/> for the <codeph>TABLESAMPLE</codeph> clause used in
+ the <codeph>COMPUTE STATS</codeph> statement.
</p>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
- You might use this clause with aggregation queries, such as finding
- the approximate average, minimum, or maximum where exact precision
- is not required. You can use these findings to plan the most effective
- strategy for constructing queries against the full table or designing
- a partitioning strategy for the data.
+ You might use this clause with aggregation queries, such as finding the approximate
+ average, minimum, or maximum where exact precision is not required. You can use these
+ findings to plan the most effective strategy for constructing queries against the full
+ table or designing a partitioning strategy for the data.
</p>
<p>
- Some other database systems have a <codeph>TABLESAMPLE</codeph> clause.
- The Impala syntax for this clause is modeled on the syntax for popular
- relational databases, not the Hive <codeph>TABLESAMPLE</codeph> clause.
- For example, there is no <codeph>BUCKETS</codeph> keyword as in HiveQL.
+ Some other database systems have a <codeph>TABLESAMPLE</codeph> clause. The Impala syntax
+ for this clause is modeled on the syntax for popular relational databases, not the Hive
+ <codeph>TABLESAMPLE</codeph> clause. For example, there is no <codeph>BUCKETS</codeph>
+ keyword as in HiveQL.
</p>
<p>
- The precision of the <varname>percentage</varname> threshold depends on
- the number and sizes of the underlying data files. Impala brings in
- additional data files, one at a time, until the number of bytes exceeds
- the specified percentage based on the total number of bytes for the
- entire set of table data. The precision of the percentage threshold is higher
- when the table contains many data files with consistent sizes. See the
- code listings later in this section for examples.
+ The precision of the <varname>percentage</varname> threshold depends on the number and
+ sizes of the underlying data files. Impala brings in additional data files, one at a time,
+ until the number of bytes exceeds the specified percentage based on the total number of
+ bytes for the entire set of table data. The precision of the percentage threshold is
+ higher when the table contains many data files with consistent sizes. See the code
+ listings later in this section for examples.
</p>
<p>
- When you estimate characteristics of the data distribution based on sampling
- a percentage of the table data, be aware that the data might be unevenly distributed
- between different files. Do not assume that the percentage figure reflects the
- percentage of rows in the table. For example, one file might contain all blank values
- for a <codeph>STRING</codeph> column, while another file contains long strings
- in that column; therefore, one file could contain many more rows than another.
- Likewise, a table created with the <codeph>SORT BY</codeph> clause might
- contain narrow ranges of values for the sort columns, making it impractical to
- extrapolate the number of distinct values for those columns based on sampling
- only some of the data files.
+ When you estimate characteristics of the data distribution based on sampling a percentage
+ of the table data, be aware that the data might be unevenly distributed between different
+ files. Do not assume that the percentage figure reflects the percentage of rows in the
+ table. For example, one file might contain all blank values for a <codeph>STRING</codeph>
+ column, while another file contains long strings in that column; therefore, one file could
+ contain many more rows than another. Likewise, a table created with the <codeph>SORT
+ BY</codeph> clause might contain narrow ranges of values for the sort columns, making it
+ impractical to extrapolate the number of distinct values for those columns based on
+ sampling only some of the data files.
</p>
<p>
- Because a sample of the table data might not contain all values for a particular
- column, if the <codeph>TABLESAMPLE</codeph> is used in a join query, the
- key relationships between the tables might produce incomplete result sets
- compared to joins using all the table data. For example, if you join 50%
- of table A with 50% of table B, some values in the join columns might
- not match between the two tables, even though overall there is a 1:1
+ Because a sample of the table data might not contain all values for a particular column,
+ if the <codeph>TABLESAMPLE</codeph> is used in a join query, the key relationships between
+ the tables might produce incomplete result sets compared to joins using all the table
+ data. For example, if you join 50% of table A with 50% of table B, some values in the join
+ columns might not match between the two tables, even though overall there is a 1:1
relationship between the tables.
</p>
<p>
- The <codeph>REPEATABLE</codeph> keyword makes identical queries use a
- consistent set of data files when the query is repeated. You specify an
- arbitrary integer key that acts as a seed value when Impala randomly
- selects the set of data files to use in the query. This technique
- lets you verify correctness, examine performance, and so on for queries
- using the <codeph>TABLESAMPLE</codeph> clause without the sampled data
- being different each time. The repeatable aspect is reset (that is, the
- set of selected data files may change) any time the contents of the table
- change. The statements or operations that can make sampling results
- non-repeatable are:
+ The <codeph>REPEATABLE</codeph> keyword makes identical queries use a consistent set of
+ data files when the query is repeated. You specify an arbitrary integer key that acts as a
+ seed value when Impala randomly selects the set of data files to use in the query. This
+ technique lets you verify correctness, examine performance, and so on for queries using
+ the <codeph>TABLESAMPLE</codeph> clause without the sampled data being different each
+ time. The repeatable aspect is reset (that is, the set of selected data files may change)
+ any time the contents of the table change. The statements or operations that can make
+ sampling results non-repeatable are:
</p>
<ul>
<li>
<codeph>INSERT</codeph>.
</li>
+
<li>
<codeph>TRUNCATE TABLE</codeph>.
</li>
+
<li>
<codeph>LOAD DATA</codeph>.
</li>
+
<li>
- <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph>
- after files are added or removed by a non-Impala mechanism.
- </li>
- <li>
+ <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> after files are added
+ or removed by a non-Impala mechanism.
</li>
+
+ <li></li>
</ul>
<p>
- This clause is similar in some ways to the <codeph>LIMIT</codeph> clause,
- because both serve to limit the size of the intermediate data and final
- result set. <codeph>LIMIT 0</codeph> is more efficient than
- <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying that a query can execute
- without producing any results. <codeph>TABLESAMPLE SYSTEM(<varname>n</varname>)</codeph>
- often makes query processing more efficient than using a <codeph>LIMIT</codeph> clause
- by itself, because all phases of query execution use less data overall.
- If the intent is to retrieve some representative values from the table
- in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>,
- <codeph>ORDER BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
+ This clause is similar in some ways to the <codeph>LIMIT</codeph> clause, because both
+ serve to limit the size of the intermediate data and final result set. <codeph>LIMIT
+ 0</codeph> is more efficient than <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying
+ that a query can execute without producing any results. <codeph>TABLESAMPLE
+ SYSTEM(<varname>n</varname>)</codeph> often makes query processing more efficient than
+ using a <codeph>LIMIT</codeph> clause by itself, because all phases of query execution use
+ less data overall. If the intent is to retrieve some representative values from the table
+ in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>, <codeph>ORDER
+ BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
</p>
<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
+
<p>
- When you query a partitioned table, any partition pruning happens
- before Impala selects the data files to sample. For example, in a
- table partitioned by year, a query with <codeph>WHERE year = 2017</codeph>
- and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause would sample
- data files representing at least 10% of the bytes present in the
- 2017 partition.
+ When you query a partitioned table, any partition pruning happens before Impala selects
+ the data files to sample. For example, in a table partitioned by year, a query with
+ <codeph>WHERE year = 2017</codeph> and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause
+ would sample data files representing at least 10% of the bytes present in the 2017
+ partition.
</p>
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
+
<p>
- This clause applies to S3 tables the same way as tables
- with data files stored on HDFS.
+ This clause applies to S3 tables the same way as tables with data files stored on HDFS.
</p>
<p conref="../shared/impala_common.xml#common/adls_blurb"/>
+
<p>
- This clause applies to ADLS tables the same way as tables
- with data files stored on HDFS.
+ This clause applies to ADLS tables the same way as tables with data files stored on HDFS.
</p>
<p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
<p>
This clause does not apply to Kudu tables.
</p>
<p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+
<p>
This clause does not apply to HBase tables.
</p>
<p conref="../shared/impala_common.xml#common/performance_blurb"/>
+
<p>
- From a performance perspective, the <codeph>TABLESAMPLE</codeph>
- clause is especially valuable for exploratory queries on
- text, Avro, or other file formats other than Parquet. Text-based
- or row-oriented file formats must process substantial amounts of
- redundant data for queries that derive aggregate results such as
- <codeph>MAX()</codeph>, <codeph>MIN()</codeph>, or <codeph>AVG()</codeph>
- for a single column. Therefore, you might use <codeph>TABLESAMPLE</codeph>
- early in the ETL pipeline, when data is still in raw text format
- and has not been converted to Parquet or moved into a partitioned
- table.
+ From a performance perspective, the <codeph>TABLESAMPLE</codeph> clause is especially
+ valuable for exploratory queries on text, Avro, or other file formats other than Parquet.
+ Text-based or row-oriented file formats must process substantial amounts of redundant data
+ for queries that derive aggregate results such as <codeph>MAX()</codeph>,
+ <codeph>MIN()</codeph>, or <codeph>AVG()</codeph> for a single column. Therefore, you
+ might use <codeph>TABLESAMPLE</codeph> early in the ETL pipeline, when data is still in
+ raw text format and has not been converted to Parquet or moved into a partitioned table.
</p>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p>
- This clause applies only to tables that use a storage layer
- with underlying raw data files, such as HDFS, Amazon S3,
- or Microsoft ADLS.
+ This clause applies only to tables that use a storage layer with underlying raw data
+ files, such as HDFS, Amazon S3, or Microsoft ADLS.
</p>
<p>
- This clause does not apply to table references that represent views.
- A query that applies the <codeph>TABLESAMPLE</codeph> clause to a
- view or a subquery fails with a semantic error.
+ This clause does not apply to table references that represent views. A query that applies
+ the <codeph>TABLESAMPLE</codeph> clause to a view or a subquery fails with a semantic
+ error.
</p>
<p>
- Because the sampling works at the level of entire data files, it
- is by nature coarse-grained. It is possible to specify a small
- sample percentage but still process a substantial portion of the
- table data if the table contains relatively few data files, if
- each data file is very large, or if the data files vary substantially
- in size. Be sure that you understand the data distribution and physical
- file layout so that you can verify if the results are suitable for
- extrapolation. For example, if the table contains only a single data file,
- the <q>sample</q> will consist of all the table data regardless of
- the percentage you specify. If the table contains data files of
- 1 GiB, 1 GiB, and 1 KiB, when you specify a sampling percentage of
- 50 you would either process slightly more than 50% of the table
- (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB),
- depending on which data files were selected for sampling.
+ Because the sampling works at the level of entire data files, it is by nature
+ coarse-grained. It is possible to specify a small sample percentage but still process a
+ substantial portion of the table data if the table contains relatively few data files, if
+ each data file is very large, or if the data files vary substantially in size. Be sure
+ that you understand the data distribution and physical file layout so that you can verify
+ if the results are suitable for extrapolation. For example, if the table contains only a
+ single data file, the <q>sample</q> will consist of all the table data regardless of the
+ percentage you specify. If the table contains data files of 1 GiB, 1 GiB, and 1 KiB, when
+ you specify a sampling percentage of 50 you would either process slightly more than 50% of
+ the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), depending on which
+ data files were selected for sampling.
</p>
<p>
- If data files are added by a non-Impala mechanism, and the
- table metadata is not updated by a <codeph>REFRESH</codeph>
- or <codeph>INVALIDATE METADATA</codeph> statement, the
- <codeph>TABLESAMPLE</codeph> clause does not consider those
- new files when computing the number of bytes in the table
- or selecting which files to sample.
+ If data files are added by a non-Impala mechanism, and the table metadata is not updated
+ by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the
+ <codeph>TABLESAMPLE</codeph> clause does not consider those new files when computing the
+ number of bytes in the table or selecting which files to sample.
</p>
<p>
- If data files are removed by a non-Impala mechanism, and the
- table metadata is not updated by a <codeph>REFRESH</codeph>
- or <codeph>INVALIDATE METADATA</codeph> statement, the
- query fails if the <codeph>TABLESAMPLE</codeph> clause
- attempts to reference any of the missing files.
+ If data files are removed by a non-Impala mechanism, and the table metadata is not updated
+ by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the query
+ fails if the <codeph>TABLESAMPLE</codeph> clause attempts to reference any of the missing
+ files.
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
- The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause.
- These examples intentionally use very small data sets to illustrate how
- the number of files, size of each file, and overall size of data in the table
- interact with the percentage specified in the clause.
+ The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause. These examples
+ intentionally use very small data sets to illustrate how the number of files, size of each
+ file, and overall size of data in the table interact with the percentage specified in the
+ clause.
</p>
<p>
- These examples use an unpartitioned table, containing several files of roughly
- the same size:
+ These examples use an unpartitioned table, containing several files of roughly the same
+ size:
</p>
-<codeblock><![CDATA[
-create table sample_demo (x int, s string);
+<codeblock>create table sample_demo (x int, s string);
insert into sample_demo values (1, 'one');
insert into sample_demo values (2, 'two');
@@ -314,20 +302,16 @@ show table stats sample_demo;
| #Rows | #Files | Size | Format | Location |
+-------+--------+------+--------+-------------------------+
| -1 | 5 | 34B | TEXT | /tsample.db/sample_demo |
-+-------+--------+------+--------+-------------------------+
-</codeblock>
++-------+--------+------+--------+-------------------------+</codeblock>
<p>
- A query that samples 50% of the table must process at least
- 17 bytes of data. Based on the sizes of the data files,
- we can predict that each such query uses 3 arbitrary files.
- Any 1 or 2 files are not enough to reach 50% of the total
- data in the table (34 bytes), so the query adds more files
- until it passes the 50% threshold:
+ A query that samples 50% of the table must process at least 17 bytes of data. Based on the
+ sizes of the data files, we can predict that each such query uses 3 arbitrary files. Any 1
+ or 2 files are not enough to reach 50% of the total data in the table (34 bytes), so the
+ query adds more files until it passes the 50% threshold:
</p>
-<codeblock><![CDATA[
-select distinct x from sample_demo tablesample system(50);
+<codeblock>select distinct x from sample_demo tablesample system(50);
+---+
| x |
+---+
@@ -352,19 +336,16 @@ select distinct x from sample_demo tablesample system(50);
| 5 |
| 3 |
| 2 |
-+---+
-</codeblock>
++---+</codeblock>
<p>
- To help run reproducible experiments, the <codeph>REPEATABLE</codeph>
- clause causes Impala to choose the same set of files for each query.
- Although the data set being considered is deterministic, the order
- of results varies (in the absence of an <codeph>ORDER BY</codeph>
+ To help run reproducible experiments, the <codeph>REPEATABLE</codeph> clause causes Impala
+ to choose the same set of files for each query. Although the data set being considered is
+ deterministic, the order of results varies (in the absence of an <codeph>ORDER BY</codeph>
clause) because of the way distributed queries are processed:
</p>
-<codeblock><![CDATA[
-select distinct x from sample_demo
+<codeblock>select distinct x from sample_demo
tablesample system(50) repeatable (12345);
+---+
| x |
@@ -386,17 +367,13 @@ select distinct x from sample_demo
</codeblock>
<p>
- The following examples show how uneven data distribution affects
- which data is sampled. Adding another data file containing a long
- string value changes the threshold for 50% of the total data in
- the table:
+ The following examples show how uneven data distribution affects which data is sampled.
+ Adding another data file containing a long string value changes the threshold for 50% of
+ the total data in the table:
</p>
-<codeblock><![CDATA[
-insert into sample_demo values (1000, 'Boyhood is the longest time in li
-fe for a boy. The last term of the school-year is made of decades, not o
-f weeks, and living through them is like waiting for the millennium. Boo
-th Tarkington');
+<codeblock>insert into sample_demo values
+(1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington');
show files in sample_demo;
+---------------------+------+-----------+
@@ -419,17 +396,15 @@ show table stats sample_demo;
</codeblock>
<p>
- Even though the queries do not refer to the <codeph>S</codeph>
- column containing the long value, all the sampling queries include
- the data file containing the column value <codeph>X=1000</codeph>,
- because the query cannot reach the 50% threshold (115 bytes) without
- including that file. The large file might be considered first, in which
- case it is the only file processed by the query. Or an arbitrary
- set of other files might be considered first.
+ Even though the queries do not refer to the <codeph>S</codeph> column containing the long
+ value, all the sampling queries include the data file containing the column value
+ <codeph>X=1000</codeph>, because the query cannot reach the 50% threshold (115 bytes)
+ without including that file. The large file might be considered first, in which case it is
+ the only file processed by the query. Or an arbitrary set of other files might be
+ considered first.
</p>
-<codeblock><![CDATA[
-select distinct x from sample_demo tablesample system(50);
+<codeblock>select distinct x from sample_demo tablesample system(50);
+------+
| x |
+------+
@@ -453,17 +428,14 @@ select distinct x from sample_demo tablesample system(50);
| 4 |
| 2 |
| 1 |
-+------+
-</codeblock>
++------+</codeblock>
<p>
- The following examples demonstrate how the <codeph>TABLESAMPLE</codeph>
- clause interacts with other table aspects, such as partitioning and file
- format:
+ The following examples demonstrate how the <codeph>TABLESAMPLE</codeph> clause interacts
+ with other table aspects, such as partitioning and file format:
</p>
-<codeblock><![CDATA[
-create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;
+<codeblock>create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;
insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
@@ -493,12 +465,11 @@ show table stats tablesample_demo_partitioned;
</codeblock>
<p>
- If the query does not involve any partition pruning, the
- sampling applies to the data volume of the entire table:
+ If the query does not involve any partition pruning, the sampling applies to the data
+ volume of the entire table:
</p>
-<codeblock><![CDATA[
--- 18 rows total.
+<codeblock>-- 18 rows total.
select count(*) from sample_demo_partitions;
+----------+
| count(*) |
@@ -524,18 +495,14 @@ select count(*) from sample_demo_partitions
| count(*) |
+----------+
| 16 |
-+----------+
-</codeblock>
++----------+</codeblock>
<p>
- If the query only processes certain partitions,
- the query computes the sampling threshold based on
- the data size and set of files only from the
- relevant partitions:
+ If the query only processes certain partitions, the query computes the sampling threshold
+ based on the data size and set of files only from the relevant partitions:
</p>
-<codeblock><![CDATA[
-select count(*) from sample_demo_partitions
+<codeblock>select count(*) from sample_demo_partitions
tablesample system(50) where n = 1;
+----------+
| count(*) |
@@ -550,13 +517,14 @@ select count(*) from sample_demo_partitions
+----------+
| 2 |
+----------+
-]]>
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
+
<p>
<xref href="impala_select.xml#select"/>
</p>
</conbody>
+
</concept>
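
Reviewer note (not part of the patch): the file-level sampling rule that the reformatted topic describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Impala's implementation: the function name `sample_files`, the dict-of-sizes input, and the use of Python's `random.Random` as the seeded shuffle are all hypothetical. It models only what the text states: whole files are taken in a random order until the sampled bytes reach the requested percentage of the total, and a fixed seed reproduces the same choice while the file set is unchanged.

```python
import random

GIB = 1024 ** 3
KIB = 1024


def sample_files(file_sizes, percentage, seed=None):
    """Pick whole files in random order until the byte target is reached.

    file_sizes: dict mapping a (hypothetical) file name to its size in bytes.
    percentage: integer from 0 to 100, as in TABLESAMPLE SYSTEM(percentage).
    seed: optional integer, playing the role of REPEATABLE(seed).
    Returns (list of selected file names, total selected bytes).
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")

    total_bytes = sum(file_sizes.values())
    target_bytes = total_bytes * percentage / 100

    # Assumption: a seeded shuffle stands in for the engine's random choice.
    rng = random.Random(seed)
    names = sorted(file_sizes)  # fixed starting order so a seed is repeatable
    rng.shuffle(names)

    selected, selected_bytes = [], 0
    for name in names:
        if selected_bytes >= target_bytes:
            break  # threshold reached; no more files are brought in
        selected.append(name)
        selected_bytes += file_sizes[name]
    return selected, selected_bytes


if __name__ == "__main__":
    # The 1 GiB, 1 GiB, 1 KiB layout from the restrictions paragraph.
    files = {"a": GIB, "b": GIB, "c": KIB}
    for seed in range(5):
        chosen, size = sample_files(files, 50, seed=seed)
        print(seed, sorted(chosen), size)
```

For the three-file layout, every run selects either 1 GiB + 1 KiB or 2 GiB, which is the coarse-grained behavior the restrictions paragraph warns about. A percentage of 0 selects nothing, and a table with a single data file is always sampled in full for any nonzero percentage.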