Posted to commits@impala.apache.org by mi...@apache.org on 2023/02/08 17:25:38 UTC

[impala] 01/04: IMPALA-10804: [DOCS] Document spill to remote storage

This is an automated email from the ASF dual-hosted git repository.

michaelsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 7dcf80b32e207c1078bed7aca1714ce59d2afe13
Author: Shajini Thayasingh <st...@cloudera.com>
AuthorDate: Fri Feb 3 10:40:43 2023 -0800

    IMPALA-10804: [DOCS] Document spill to remote storage
    
    Spill to HDFS, S3, and Ozone.
    
    Change-Id: I3efb2ffcc06cdbe69845c6dc4cf03d9f2e3dcabc
    Reviewed-on: http://gerrit.cloudera.org:8080/19472
    Reviewed-by: Yida Wu <wy...@gmail.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 docs/topics/impala_disk_space.xml | 106 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/docs/topics/impala_disk_space.xml b/docs/topics/impala_disk_space.xml
index b32502ff1..4440f7763 100644
--- a/docs/topics/impala_disk_space.xml
+++ b/docs/topics/impala_disk_space.xml
@@ -343,6 +343,112 @@ under the License.
       <p> Compression levels from 1 up to 22 (default 3) are supported for <codeph>ZSTD</codeph>.
         The lower the compression level, the faster the speed at the cost of compression ratio.</p>
     </section>
+    <section>
+      <title>Configure Impala Daemon to spill to S3</title>
+      <p>Impala occasionally needs to use persistent storage for writing intermediate files during
+        large sorts, joins, aggregations, or analytic function operations. If your workload
+        writes large volumes of intermediate data, configure the heavily spilling queries to use
+        a remote storage location rather than the local one. The advantage of using remote
+        storage for scratch space is that it is elastic and can handle any amount of
+        spilling.</p>
+      <p><b>Before you begin</b></p>
+      <p>Identify the URL of the S3 bucket to which you want Impala to write the temporary
+        data. If you use the S3 bucket that is associated with the environment, navigate to the S3
+        bucket and copy the URL. If you want to use an external S3 bucket, you must first configure
+        your environment to use the external S3 bucket with the correct read/write permissions.</p>
+      <p><b>Configuring the Start-up Option in Impala daemon</b></p>
+      <p>You can use the Impalad start-up option <codeph>scratch_dirs</codeph> to specify the
+        locations of the intermediate files. The format of the option is
+        <codeph>scratch_dirs=remote_dir, local_buffer_dir(, local_dir…)</codeph>.</p>
+      <p>With the option specified above:</p>
+      <ul>
+        <li>You can specify only one remote directory. When you configure a remote directory, you
+          must also specify a local buffer directory. However, you can use multiple local
+          directories with the remote directory. If you specify multiple local directories, the
+          first local directory is used as the local buffer directory.</li>
+        <li>If you configure both remote and local directories, the remote directory is only used
+          when the local directories are fully utilized.</li>
+        <li>The size of a remote intermediate file can affect query performance. You can set it
+          with the <codeph>remote_tmp_file_size</codeph> start-up option. The default
+          size of a remote intermediate file is 16MB while the maximum is 256MB.</li>
+      </ul>
+      <p><b>Examples</b></p>
+      <ul>
+        <li>A remote scratch dir with one local buffer dir, file size 64MB.
+          <codeblock>--scratch_dirs="s3a://remote_dir, /local_buffer_dir" --remote_tmp_file_size=64M</codeblock></li>
+        <li>A remote scratch dir with one local buffer dir, and one local dir.
+          <codeblock>--scratch_dirs="s3a://remote_dir, /local_buffer_dir, /local_dir"</codeblock></li>
+        <li>A remote scratch dir with one local buffer dir, and multiple local dirs.
+          <codeblock>--scratch_dirs="s3a://remote_dir, /local_buffer_dir, /local_dir_1, /local_dir_2"</codeblock></li>
+      </ul>
+    </section>
+    <section>
+      <title>Configure Impala Daemon to spill to HDFS</title>
+      <p>Impala occasionally needs to use persistent storage for writing intermediate files during
+        large sorts, joins, aggregations, or analytic function operations. If your workload
+        writes large volumes of intermediate data, configure the heavily spilling queries to use
+        a remote storage location rather than the local one. The advantage of using remote
+        storage for scratch space is that it is elastic and can handle any amount of
+        spilling.</p>
+      <p><b>Before you begin</b></p>
+      <ul>
+        <li>Identify the HDFS scratch directory where you want Impala to write the
+          temporary data.</li>
+        <li>Identify the port number of the HDFS scratch directory.</li>
+        <li>Configure Impala to write temporary data to disk during query processing.</li>
+      </ul>
+      <p><b>Configuring the Start-up Option in Impala daemon</b></p>
+      <p>You can use the Impalad start-up option <codeph>scratch_dirs</codeph> to specify the
+        locations of the intermediate files.</p>
+      <p>Use the following format for this start-up option:</p>
+      <codeblock>--scratch_dirs="hdfs://ip_address:port_num/path(:max_bytes)(:priority), /local_buffer_dir" --remote_tmp_file_size=xM</codeblock>
+      <ul>
+        <li>Where <codeph>hdfs://ip_address:port_num/path(:max_bytes)(:priority)</codeph> is the remote
+          directory.</li>
+        <li><codeph>port_num</codeph> is required for the HDFS scratch directory.</li>
+        <li><codeph>max_bytes</codeph> and <codeph>priority</codeph> are optional.</li>
+      </ul>
+      <p>Using the above format:</p>
+      <ul>
+        <li>You can specify only one remote directory.</li>
+        <li>When you configure a remote directory, you must also specify a local buffer
+          directory. However, you can use multiple local directories with the remote directory. If
+          you specify multiple local directories, the first local directory is used as the local
+          buffer directory.</li>
+        <li>If you configure both remote and local directories, the remote directory is only used
+          when the local directories are fully utilized.</li>
+        <li>The size of a remote intermediate file can affect query performance. You can set it
+          with the <codeph>remote_tmp_file_size</codeph> start-up option. The default size of a remote
+          intermediate file is 16MB while the maximum is 512MB.</li>
+      </ul>
+      <p><b>Examples</b></p>
+      <ul>
+        <li>An HDFS scratch dir with one local buffer dir, file size 64MB. The space of the HDFS
+          scratch dir is limited to 300G.
+          <codeblock>--scratch_dirs="hdfs://ip_address:port_num/path:300G, /local_buffer_dir" --remote_tmp_file_size=64M</codeblock></li>
+        <li>An HDFS scratch dir with one local buffer dir, and one local dir. The space of the HDFS
+          scratch dir is limited to 300G.
+          <codeblock>--scratch_dirs="hdfs://ip_address:port_num/path:300G, /local_buffer_dir, /local_dir"</codeblock></li>
+        <li>An HDFS scratch dir with one local buffer dir, and multiple local dirs. The space of the
+          HDFS scratch dir is unlimited.
+          <codeblock>--scratch_dirs="hdfs://ip_address:port_num/path, /local_buffer_dir, /local_dir_1, /local_dir_2"</codeblock></li>
+      </ul>
+      <p>Although <codeph>max_bytes</codeph> is optional, setting it is highly recommended when
+        spilling to HDFS because the HDFS cluster space is limited.</p>
+    </section>
+    <section>
+      <title>Configure Impala Daemon to spill to Ozone</title>
+      <p><b>Before you begin</b></p>
+      <ul>
+        <li>Identify the Ozone scratch directory where you want Impala to write the
+          temporary data.</li>
+        <li>Identify the port number of the Ozone scratch directory.</li>
+      </ul>
+      <p><b>Configuring the Start-up Option in Impala daemon</b></p>
+      <p>You can use the Impalad start-up option <codeph>scratch_dirs</codeph> to specify the
+        locations of the intermediate files.</p>
+      <codeblock>--scratch_dirs="ofs://ip_address:port_num(:max_bytes)(:priority), /local_buffer_dir" --remote_tmp_file_size=xM</codeblock>
+    </section>
   </conbody>
 
 </concept>
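
As an illustration of the scratch_dirs rules documented in the patch above (only one
remote directory, a mandatory local buffer directory when a remote directory is set, and
the first local directory serving as the buffer), here is a minimal Python sketch. It is
not Impala's actual flag parsing: the function name check_scratch_dirs, the ScratchLayout
type, and the prefix list are hypothetical, and the optional :max_bytes and :priority
suffixes are deliberately left unparsed and stay attached to the entry.

```python
from typing import List, NamedTuple, Optional

# Assumption: an entry is "remote" if it starts with one of the URI schemes
# that appear in the documentation's examples.
REMOTE_PREFIXES = ("s3a://", "hdfs://", "ofs://")


class ScratchLayout(NamedTuple):
    """Hypothetical result type describing how the entries would be used."""
    remote_dir: Optional[str]
    local_buffer_dir: Optional[str]
    local_dirs: List[str]


def check_scratch_dirs(value: str) -> ScratchLayout:
    """Check a scratch_dirs value against the documented rules (sketch only)."""
    # Split the comma-separated list and drop surrounding whitespace.
    entries = [e.strip() for e in value.split(",") if e.strip()]
    remote = [e for e in entries if e.startswith(REMOTE_PREFIXES)]
    local = [e for e in entries if not e.startswith(REMOTE_PREFIXES)]

    # Rule: only one remote directory can be specified.
    if len(remote) > 1:
        raise ValueError("only one remote directory can be specified")
    # Rule: a remote directory requires a local buffer directory.
    if remote and not local:
        raise ValueError("a remote directory requires a local buffer directory")

    if not remote:
        # Local-only configuration: no buffer role is needed.
        return ScratchLayout(None, None, local)
    # Rule: the first local directory is used as the local buffer directory.
    return ScratchLayout(remote[0], local[0], local[1:])
```

For example, check_scratch_dirs("s3a://remote_dir, /local_buffer_dir, /local_dir_1")
reports /local_buffer_dir as the buffer and /local_dir_1 as an additional local
directory, while a value containing only "s3a://remote_dir" raises ValueError.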