You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by ar...@apache.org on 2019/02/07 21:48:43 UTC
[impala] 01/03: [DOCS] Format fixes in the TABLESAMPLE doc

This is an automated email from the ASF dual-hosted git repository.

arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 9a3633c75f84601e7b8d05522daa6a9e0cf66a53
Author: Alex Rodoni <ar...@cloudera.com>
AuthorDate: Thu Feb 7 11:38:43 2019 -0800

    [DOCS] Format fixes in the TABLESAMPLE doc
    
    Change-Id: I2304e422636d03d7d1109d3f0a551d6f04ce1a63
    Reviewed-on: http://gerrit.cloudera.org:8080/12392
    Reviewed-by: Alex Rodoni <ar...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 docs/topics/impala_tablesample.xml | 362 +++++++++++++++++--------------------
 1 file changed, 165 insertions(+), 197 deletions(-)

diff --git a/docs/topics/impala_tablesample.xml b/docs/topics/impala_tablesample.xml
index e5123cb..67372fa 100644
--- a/docs/topics/impala_tablesample.xml
+++ b/docs/topics/impala_tablesample.xml
@@ -21,6 +21,7 @@ under the License.
 <concept id="tablesample" rev="IMPALA-5309">
 
   <title>TABLESAMPLE Clause</title>
+
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
@@ -34,263 +35,250 @@ under the License.
   <conbody>
 
     <p>
-      Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need
-      to explore the data distribution within the table, the table is very large,
-      and it is impractical or unnecessary to process all the data from the table
-      or selected partitions.
+      Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need to explore the
+      data distribution within the table, the table is very large, and it is impractical or
+      unnecessary to process all the data from the table or selected partitions.
     </p>
 
     <p>
-      The clause makes the query process a randomized set of data files from the
-      table, so that the total volume of data is greater than or equal to the specified
-      percentage of data bytes within that table. (Or the data bytes within the set of
-      partitions that remain after partition pruning is performed.)
+      The clause makes the query process a randomized set of data files from the table, so that
+      the total volume of data is greater than or equal to the specified percentage of data
+      bytes within that table. (Or the data bytes within the set of partitions that remain after
+      partition pruning is performed.)
     </p>
 
     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
 
 <codeblock>
-  <ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph>
+<ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph>
 </codeblock>
 
     <p>
-      The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table alias.
+      The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table
+      alias.
     </p>
 
     <p>
-      The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently,
-      Impala only supports a single sampling method named <codeph>SYSTEM</codeph>.
+      The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently, Impala only
+      supports a single sampling method named <codeph>SYSTEM</codeph>.
     </p>
 
     <p>
-      The <varname>percentage</varname> argument is an integer literal from 0 to 100.
-      A percentage of 0 produces an empty result set for a particular table reference,
-      while a percentage of 100 uses the entire contents. Because the sampling works by
-      selecting a random set of data files, the proportion of sampled data from the
-      table may be greater than the specified percentage, based on the number and sizes
-      of the underlying data files. See the usage notes for details.
+      The <varname>percentage</varname> argument is an integer literal from 0 to 100. A
+      percentage of 0 produces an empty result set for a particular table reference, while a
+      percentage of 100 uses the entire contents. Because the sampling works by selecting a
+      random set of data files, the proportion of sampled data from the table may be greater
+      than the specified percentage, based on the number and sizes of the underlying data files.
+      See the usage notes for details.
     </p>
 
     <p>
-      The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary
-      positive integer seed value that ensures that when the query is run again, the
-      sampling selects the same set of data files each time. <codeph>REPEATABLE</codeph>
-      does not have a default value. If you omit the <codeph>REPEATABLE</codeph> keyword,
-      the random seed is derived from the current time.
+      The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary positive
+      integer seed value that ensures that when the query is run again, the sampling selects the
+      same set of data files each time. <codeph>REPEATABLE</codeph> does not have a default
+      value. If you omit the <codeph>REPEATABLE</codeph> keyword, the random seed is derived
+      from the current time.
     </p>
 
     <p conref="../shared/impala_common.xml#common/added_in_290"/>
 
     <p rev="2.12.0 IMPALA-5310">
-      See <keyword keyref="compute_stats"/> for the
-        <codeph>TABLESAMPLE</codeph> clause used in the <codeph>COMPUTE
-        STATS</codeph> statement.
+      See <keyword keyref="compute_stats"/> for the <codeph>TABLESAMPLE</codeph> clause used in
+      the <codeph>COMPUTE STATS</codeph> statement.
     </p>
 
     <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
 
     <p>
-      You might use this clause with aggregation queries, such as finding
-      the approximate average, minimum, or maximum where exact precision
-      is not required. You can use these findings to plan the most effective
-      strategy for constructing queries against the full table or designing
-      a partitioning strategy for the data.
+      You might use this clause with aggregation queries, such as finding the approximate
+      average, minimum, or maximum where exact precision is not required. You can use these
+      findings to plan the most effective strategy for constructing queries against the full
+      table or designing a partitioning strategy for the data.
     </p>
 
     <p>
-      Some other database systems have a <codeph>TABLESAMPLE</codeph> clause.
-      The Impala syntax for this clause is modeled on the syntax for popular
-      relational databases, not the Hive <codeph>TABLESAMPLE</codeph> clause.
-      For example, there is no <codeph>BUCKETS</codeph> keyword as in HiveQL.
+      Some other database systems have a <codeph>TABLESAMPLE</codeph> clause. The Impala syntax
+      for this clause is modeled on the syntax for popular relational databases, not the Hive
+      <codeph>TABLESAMPLE</codeph> clause. For example, there is no <codeph>BUCKETS</codeph>
+      keyword as in HiveQL.
     </p>
 
     <p>
-      The precision of the <varname>percentage</varname> threshold depends on
-      the number and sizes of the underlying data files. Impala brings in
-      additional data files, one at a time, until the number of bytes exceeds
-      the specified percentage based on the total number of bytes for the
-      entire set of table data. The precision of the percentage threshold is higher
-      when the table contains many data files with consistent sizes. See the
-      code listings later in this section for examples.
+      The precision of the <varname>percentage</varname> threshold depends on the number and
+      sizes of the underlying data files. Impala brings in additional data files, one at a time,
+      until the number of bytes exceeds the specified percentage based on the total number of
+      bytes for the entire set of table data. The precision of the percentage threshold is
+      higher when the table contains many data files with consistent sizes. See the code
+      listings later in this section for examples.
     </p>
 
     <p>
-      When you estimate characteristics of the data distribution based on sampling
-      a percentage of the table data, be aware that the data might be unevenly distributed
-      between different files. Do not assume that the percentage figure reflects the
-      percentage of rows in the table. For example, one file might contain all blank values
-      for a <codeph>STRING</codeph> column, while another file contains long strings
-      in that column; therefore, one file could contain many more rows than another.
-      Likewise, a table created with the <codeph>SORT BY</codeph> clause might
-      contain narrow ranges of values for the sort columns, making it impractical to
-      extrapolate the number of distinct values for those columns based on sampling
-      only some of the data files.
+      When you estimate characteristics of the data distribution based on sampling a percentage
+      of the table data, be aware that the data might be unevenly distributed between different
+      files. Do not assume that the percentage figure reflects the percentage of rows in the
+      table. For example, one file might contain all blank values for a <codeph>STRING</codeph>
+      column, while another file contains long strings in that column; therefore, one file could
+      contain many more rows than another. Likewise, a table created with the <codeph>SORT
+      BY</codeph> clause might contain narrow ranges of values for the sort columns, making it
+      impractical to extrapolate the number of distinct values for those columns based on
+      sampling only some of the data files.
     </p>
 
     <p>
-      Because a sample of the table data might not contain all values for a particular
-      column, if the <codeph>TABLESAMPLE</codeph> is used in a join query, the
-      key relationships between the tables might produce incomplete result sets
-      compared to joins using all the table data. For example, if you join 50%
-      of table A with 50% of table B, some values in the join columns might
-      not match between the two tables, even though overall there is a 1:1
+      Because a sample of the table data might not contain all values for a particular column,
+      if the <codeph>TABLESAMPLE</codeph> is used in a join query, the key relationships between
+      the tables might produce incomplete result sets compared to joins using all the table
+      data. For example, if you join 50% of table A with 50% of table B, some values in the join
+      columns might not match between the two tables, even though overall there is a 1:1
       relationship between the tables.
     </p>
 
     <p>
-      The <codeph>REPEATABLE</codeph> keyword makes identical queries use a
-      consistent set of data files when the query is repeated. You specify an
-      arbitrary integer key that acts as a seed value when Impala randomly
-      selects the set of data files to use in the query. This technique
-      lets you verify correctness, examine performance, and so on for queries
-      using the <codeph>TABLESAMPLE</codeph> clause without the sampled data
-      being different each time. The repeatable aspect is reset (that is, the
-      set of selected data files may change) any time the contents of the table
-      change. The statements or operations that can make sampling results
-      non-repeatable are:
+      The <codeph>REPEATABLE</codeph> keyword makes identical queries use a consistent set of
+      data files when the query is repeated. You specify an arbitrary integer key that acts as a
+      seed value when Impala randomly selects the set of data files to use in the query. This
+      technique lets you verify correctness, examine performance, and so on for queries using
+      the <codeph>TABLESAMPLE</codeph> clause without the sampled data being different each
+      time. The repeatable aspect is reset (that is, the set of selected data files may change)
+      any time the contents of the table change. The statements or operations that can make
+      sampling results non-repeatable are:
     </p>
 
     <ul>
       <li>
         <codeph>INSERT</codeph>.
       </li>
+
       <li>
         <codeph>TRUNCATE TABLE</codeph>.
       </li>
+
       <li>
         <codeph>LOAD DATA</codeph>.
       </li>
+
       <li>
-        <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph>
-        after files are added or removed by a non-Impala mechanism.
-      </li>
-      <li>
+        <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> after files are added
+        or removed by a non-Impala mechanism.
       </li>
+
+      <li></li>
     </ul>
 
     <p>
-      This clause is similar in some ways to the <codeph>LIMIT</codeph> clause,
-      because both serve to limit the size of the intermediate data and final
-      result set. <codeph>LIMIT 0</codeph> is more efficient than
-      <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying that a query can execute
-      without producing any results. <codeph>TABLESAMPLE SYSTEM(<varname>n</varname>)</codeph>
-      often makes query processing more efficient than using a <codeph>LIMIT</codeph> clause
-      by itself, because all phases of query execution use less data overall.
-      If the intent is to retrieve some representative values from the table
-      in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>,
-      <codeph>ORDER BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
+      This clause is similar in some ways to the <codeph>LIMIT</codeph> clause, because both
+      serve to limit the size of the intermediate data and final result set. <codeph>LIMIT
+      0</codeph> is more efficient than <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying
+      that a query can execute without producing any results. <codeph>TABLESAMPLE
+      SYSTEM(<varname>n</varname>)</codeph> often makes query processing more efficient than
+      using a <codeph>LIMIT</codeph> clause by itself, because all phases of query execution use
+      less data overall. If the intent is to retrieve some representative values from the table
+      in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>, <codeph>ORDER
+      BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
     </p>
 
     <p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
+
     <p>
-      When you query a partitioned table, any partition pruning happens
-      before Impala selects the data files to sample. For example, in a
-      table partitioned by year, a query with <codeph>WHERE year = 2017</codeph>
-      and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause would sample
-      data files representing at least 10% of the bytes present in the
-      2017 partition.
+      When you query a partitioned table, any partition pruning happens before Impala selects
+      the data files to sample. For example, in a table partitioned by year, a query with
+      <codeph>WHERE year = 2017</codeph> and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause
+      would sample data files representing at least 10% of the bytes present in the 2017
+      partition.
     </p>
 
     <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+
     <p>
-      This clause applies to S3 tables the same way as tables
-      with data files stored on HDFS.
+      This clause applies to S3 tables the same way as tables with data files stored on HDFS.
     </p>
 
     <p conref="../shared/impala_common.xml#common/adls_blurb"/>
+
     <p>
-      This clause applies to ADLS tables the same way as tables
-      with data files stored on HDFS.
+      This clause applies to ADLS tables the same way as tables with data files stored on HDFS.
     </p>
 
     <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
     <p>
       This clause does not apply to Kudu tables.
     </p>
 
     <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+
     <p>
       This clause does not apply to HBase tables.
     </p>
 
     <p conref="../shared/impala_common.xml#common/performance_blurb"/>
+
     <p>
-      From a performance perspective, the <codeph>TABLESAMPLE</codeph>
-      clause is especially valuable for exploratory queries on
-      text, Avro, or other file formats other than Parquet. Text-based
-      or row-oriented file formats must process substantial amounts of
-      redundant data for queries that derive aggregate results such as
-      <codeph>MAX()</codeph>, <codeph>MIN()</codeph>, or <codeph>AVG()</codeph>
-      for a single column. Therefore, you might use <codeph>TABLESAMPLE</codeph>
-      early in the ETL pipeline, when data is still in raw text format
-      and has not been converted to Parquet or moved into a partitioned
-      table.
+      From a performance perspective, the <codeph>TABLESAMPLE</codeph> clause is especially
+      valuable for exploratory queries on text, Avro, or other file formats other than Parquet.
+      Text-based or row-oriented file formats must process substantial amounts of redundant data
+      for queries that derive aggregate results such as <codeph>MAX()</codeph>,
+      <codeph>MIN()</codeph>, or <codeph>AVG()</codeph> for a single column. Therefore, you
+      might use <codeph>TABLESAMPLE</codeph> early in the ETL pipeline, when data is still in
+      raw text format and has not been converted to Parquet or moved into a partitioned table.
     </p>
 
     <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
 
     <p>
-      This clause applies only to tables that use a storage layer
-      with underlying raw data files, such as HDFS, Amazon S3,
-      or Microsoft ADLS.
+      This clause applies only to tables that use a storage layer with underlying raw data
+      files, such as HDFS, Amazon S3, or Microsoft ADLS.
     </p>
 
     <p>
-      This clause does not apply to table references that represent views.
-      A query that applies the <codeph>TABLESAMPLE</codeph> clause to a
-      view or a subquery fails with a semantic error.
+      This clause does not apply to table references that represent views. A query that applies
+      the <codeph>TABLESAMPLE</codeph> clause to a view or a subquery fails with a semantic
+      error.
     </p>
 
     <p>
-      Because the sampling works at the level of entire data files, it
-      is by nature coarse-grained. It is possible to specify a small
-      sample percentage but still process a substantial portion of the
-      table data if the table contains relatively few data files, if
-      each data file is very large, or if the data files vary substantially
-      in size. Be sure that you understand the data distribution and physical
-      file layout so that you can verify if the results are suitable for
-      extrapolation. For example, if the table contains only a single data file,
-      the <q>sample</q> will consist of all the table data regardless of
-      the percentage you specify. If the table contains data files of
-      1 GiB, 1 GiB, and 1 KiB, when you specify a sampling percentage of
-      50 you would either process slightly more than 50% of the table
-      (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB),
-      depending on which data files were selected for sampling.
+      Because the sampling works at the level of entire data files, it is by nature
+      coarse-grained. It is possible to specify a small sample percentage but still process a
+      substantial portion of the table data if the table contains relatively few data files, if
+      each data file is very large, or if the data files vary substantially in size. Be sure
+      that you understand the data distribution and physical file layout so that you can verify
+      if the results are suitable for extrapolation. For example, if the table contains only a
+      single data file, the <q>sample</q> will consist of all the table data regardless of the
+      percentage you specify. If the table contains data files of 1 GiB, 1 GiB, and 1 KiB, when
+      you specify a sampling percentage of 50 you would either process slightly more than 50% of
+      the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), depending on which
+      data files were selected for sampling.
     </p>
 
     <p>
-      If data files are added by a non-Impala mechanism, and the
-      table metadata is not updated by a <codeph>REFRESH</codeph>
-      or <codeph>INVALIDATE METADATA</codeph> statement, the
-      <codeph>TABLESAMPLE</codeph> clause does not consider those
-      new files when computing the number of bytes in the table
-      or selecting which files to sample.
+      If data files are added by a non-Impala mechanism, and the table metadata is not updated
+      by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the
+      <codeph>TABLESAMPLE</codeph> clause does not consider those new files when computing the
+      number of bytes in the table or selecting which files to sample.
     </p>
 
     <p>
-      If data files are removed by a non-Impala mechanism, and the
-      table metadata is not updated by a <codeph>REFRESH</codeph>
-      or <codeph>INVALIDATE METADATA</codeph> statement, the
-      query fails if the <codeph>TABLESAMPLE</codeph> clause
-      attempts to reference any of the missing files.
+      If data files are removed by a non-Impala mechanism, and the table metadata is not updated
+      by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the query
+      fails if the <codeph>TABLESAMPLE</codeph> clause attempts to reference any of the missing
+      files.
     </p>
 
     <p conref="../shared/impala_common.xml#common/example_blurb"/>
 
     <p>
-      The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause.
-      These examples intentionally use very small data sets to illustrate how
-      the number of files, size of each file, and overall size of data in the table
-      interact with the percentage specified in the clause.
+      The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause. These examples
+      intentionally use very small data sets to illustrate how the number of files, size of each
+      file, and overall size of data in the table interact with the percentage specified in the
+      clause.
     </p>
 
     <p>
-      These examples use an unpartitioned table, containing several files of roughly
-      the same size:
+      These examples use an unpartitioned table, containing several files of roughly the same
+      size:
     </p>
 
-<codeblock><![CDATA[
-create table sample_demo (x int, s string);
+<codeblock>create table sample_demo (x int, s string);
 
 insert into sample_demo values (1, 'one');
 insert into sample_demo values (2, 'two');
@@ -314,20 +302,16 @@ show table stats sample_demo;
 | #Rows | #Files | Size | Format | Location                |
 +-------+--------+------+--------+-------------------------+
 | -1    | 5      | 34B  | TEXT   | /tsample.db/sample_demo |
-+-------+--------+------+--------+-------------------------+
-</codeblock>
++-------+--------+------+--------+-------------------------+</codeblock>
 
     <p>
-      A query that samples 50% of the table must process at least
-      17 bytes of data. Based on the sizes of the data files,
-      we can predict that each such query uses 3 arbitrary files.
-      Any 1 or 2 files are not enough to reach 50% of the total
-      data in the table (34 bytes), so the query adds more files
-      until it passes the 50% threshold:
+      A query that samples 50% of the table must process at least 17 bytes of data. Based on the
+      sizes of the data files, we can predict that each such query uses 3 arbitrary files. Any 1
+      or 2 files are not enough to reach 50% of the total data in the table (34 bytes), so the
+      query adds more files until it passes the 50% threshold:
     </p>
 
-<codeblock><![CDATA[
-select distinct x from sample_demo tablesample system(50);
+<codeblock>select distinct x from sample_demo tablesample system(50);
 +---+
 | x |
 +---+
@@ -352,19 +336,16 @@ select distinct x from sample_demo tablesample system(50);
 | 5 |
 | 3 |
 | 2 |
-+---+
-</codeblock>
++---+</codeblock>
 
     <p>
-      To help run reproducible experiments, the <codeph>REPEATABLE</codeph>
-      clause causes Impala to choose the same set of files for each query.
-      Although the data set being considered is deterministic, the order
-      of results varies (in the absence of an <codeph>ORDER BY</codeph>
+      To help run reproducible experiments, the <codeph>REPEATABLE</codeph> clause causes Impala
+      to choose the same set of files for each query. Although the data set being considered is
+      deterministic, the order of results varies (in the absence of an <codeph>ORDER BY</codeph>
       clause) because of the way distributed queries are processed:
     </p>
 
-<codeblock><![CDATA[
-select distinct x from sample_demo
+<codeblock>select distinct x from sample_demo
   tablesample system(50) repeatable (12345);
 +---+
 | x |
@@ -386,17 +367,13 @@ select distinct x from sample_demo
 </codeblock>
 
     <p>
-      The following examples show how uneven data distribution affects
-      which data is sampled. Adding another data file containing a long
-      string value changes the threshold for 50% of the total data in
-      the table:
+      The following examples show how uneven data distribution affects which data is sampled.
+      Adding another data file containing a long string value changes the threshold for 50% of
+      the total data in the table:
     </p>
 
-<codeblock><![CDATA[
-insert into sample_demo values (1000, 'Boyhood is the longest time in li
-fe for a boy. The last term of the school-year is made of decades, not o
-f weeks, and living through them is like waiting for the millennium. Boo
-th Tarkington');
+<codeblock>insert into sample_demo values
+(1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington');
 
 show files in sample_demo;
 +---------------------+------+-----------+
@@ -419,17 +396,15 @@ show table stats sample_demo;
 </codeblock>
 
     <p>
-      Even though the queries do not refer to the <codeph>S</codeph>
-      column containing the long value, all the sampling queries include
-      the data file containing the column value <codeph>X=1000</codeph>,
-      because the query cannot reach the 50% threshold (115 bytes) without
-      including that file. The large file might be considered first, in which
-      case it is the only file processed by the query. Or an arbitrary
-      set of other files might be considered first.
+      Even though the queries do not refer to the <codeph>S</codeph> column containing the long
+      value, all the sampling queries include the data file containing the column value
+      <codeph>X=1000</codeph>, because the query cannot reach the 50% threshold (115 bytes)
+      without including that file. The large file might be considered first, in which case it is
+      the only file processed by the query. Or an arbitrary set of other files might be
+      considered first.
     </p>
 
-<codeblock><![CDATA[
-select distinct x from sample_demo tablesample system(50);
+<codeblock>select distinct x from sample_demo tablesample system(50);
 +------+
 | x    |
 +------+
@@ -453,17 +428,14 @@ select distinct x from sample_demo tablesample system(50);
 | 4    |
 | 2    |
 | 1    |
-+------+
-</codeblock>
++------+</codeblock>
 
     <p>
-      The following examples demonstrate how the <codeph>TABLESAMPLE</codeph>
-      clause interacts with other table aspects, such as partitioning and file
-      format:
+      The following examples demonstrate how the <codeph>TABLESAMPLE</codeph> clause interacts
+      with other table aspects, such as partitioning and file format:
     </p>
 
-<codeblock><![CDATA[
-create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;
+<codeblock>create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;
 
 insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
 insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
@@ -493,12 +465,11 @@ show table stats tablesample_demo_partitioned;
 </codeblock>
 
     <p>
-      If the query does not involve any partition pruning, the
-      sampling applies to the data volume of the entire table:
+      If the query does not involve any partition pruning, the sampling applies to the data
+      volume of the entire table:
     </p>
 
-<codeblock><![CDATA[
--- 18 rows total.
+<codeblock>-- 18 rows total.
 select count(*) from sample_demo_partitions;
 +----------+
 | count(*) |
@@ -524,18 +495,14 @@ select count(*) from sample_demo_partitions
 | count(*) |
 +----------+
 | 16       |
-+----------+
-</codeblock>
++----------+</codeblock>
 
     <p>
-      If the query only processes certain partitions,
-      the query computes the sampling threshold based on
-      the data size and set of files only from the
-      relevant partitions:
+      If the query only processes certain partitions, the query computes the sampling threshold
+      based on the data size and set of files only from the relevant partitions:
     </p>
 
-<codeblock><![CDATA[
-select count(*) from sample_demo_partitions
+<codeblock>select count(*) from sample_demo_partitions
   tablesample system(50) where n = 1;
 +----------+
 | count(*) |
@@ -550,13 +517,14 @@ select count(*) from sample_demo_partitions
 +----------+
 | 2        |
 +----------+
-]]>
 </codeblock>
 
     <p conref="../shared/impala_common.xml#common/related_info"/>
+
     <p>
       <xref href="impala_select.xml#select"/>
     </p>
 
   </conbody>
+
 </concept>