Posted to reviews@impala.apache.org by "Yida Wu (Code Review)" <ge...@cloudera.org> on 2022/06/14 18:51:01 UTC
[Impala-ASF-CR] IMPALA-11064 Optimizing Temporary File Structure for Batch Reading

Yida Wu has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/18219 )

Change subject: IMPALA-11064 Optimizing Temporary File Structure for Batch Reading
......................................................................

IMPALA-11064 Optimizing Temporary File Structure for Batch Reading

This patch optimizes the structure of temporary files to improve
batch reading performance. It is a follow-up to IMPALA-10791.

There will be two types of file structure. The first is the original
one: space for a new page is allocated from the last file allocated,
and when that file is full, a new file is created and space is
allocated from it.

The second is the new structure, in which a file contains multiple
blocks and all data in a block belongs to the same spill id. Once a
block is full, we first try to allocate a new block in the same
file; if the file is full, we try to allocate the block from a new
file.

The new structure benefits batch reading by gathering data with the
same spill id (normally the same partitioned hash join node) in the
same block, which helps when the node's data is read sequentially
from the remote filesystem.
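
The block-based layout described above can be sketched as a small
allocator. This is a simplified illustration only, not the code in
tmp-file-mgr.cc: BlockAllocator, Location and AllocatePage are
hypothetical names, sizes are plain byte counts, and the sketch always
tries the most recently created file before opening a new one.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified model of the new file structure.
struct Location {
  int file;        // Index of the temporary file.
  int64_t offset;  // Byte offset of the page within that file.
};

class BlockAllocator {
 public:
  BlockAllocator(int64_t file_size, int64_t block_size)
    : file_size_(file_size), block_size_(block_size) {}

  // Places a page of 'len' bytes into the current block of 'spill_id'.
  Location AllocatePage(int64_t spill_id, int64_t len) {
    assert(len <= block_size_);  // Assumption: a page never exceeds a block.
    auto it = open_blocks_.find(spill_id);
    if (it == open_blocks_.end() || it->second.used + len > block_size_) {
      // No block yet for this spill id, or it is full: start a new block.
      it = open_blocks_.insert_or_assign(spill_id, NewBlock()).first;
    }
    Block& b = it->second;
    Location loc{b.file, b.start + b.used};
    b.used += len;
    return loc;
  }

 private:
  struct Block {
    int file;
    int64_t start;  // Offset of the block within the file.
    int64_t used;   // Bytes already written to the block.
  };

  Block NewBlock() {
    // Try the current file first; open a new file when it is full.
    if (reserved_.empty() || reserved_.back() + block_size_ > file_size_) {
      reserved_.push_back(0);
    }
    Block b{static_cast<int>(reserved_.size()) - 1, reserved_.back(), 0};
    reserved_.back() += block_size_;
    return b;
  }

  int64_t file_size_;
  int64_t block_size_;
  std::vector<int64_t> reserved_;  // Bytes reserved for blocks in each file.
  std::unordered_map<int64_t, Block> open_blocks_;
};
```

With a 16-byte file and 8-byte blocks, pages for spill ids 1 and 2
land in separate blocks of file 0, and a page that overflows the
block of spill id 1 starts a new block in file 1.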

To use the new structure more efficiently, we also add three
features.

1. Batch reading is used only for partitioned hash join nodes.
Data is pinned back into the memory of a partitioned hash join node
sequentially, so limiting batch reading to these nodes saves
memory. Data spilled from grouping aggregation nodes is still read
page by page, because those reads can be quite random.

2. Prefetch the block to be read.
When pinning a page from a file using batch reading, we try to
prefetch a block ahead of the block currently being read (the step
number is configurable). Since batch reading is limited to
sequential reads, prefetching the block can speed up reading.

3. Auto file uploader.
Spilling data to different blocks by spill id, instead of always
appending to the end of the last file, can create more partially
written files and consume more of the local disk buffer.
To deal with this, the auto file uploader uploads a file once a
timeout period has passed since its creation or last access.

New startup options:
'remote_batch_read_max_block_size_level'
The default value is 3, which means the maximum block size is
2^3 = 8MB.

'remote_batch_read_tmp_file_size'
Specifies a separate file size for batch reading. If set to 0,
remote_tmp_file_size is used as the file size.
The default is 16MB, because in tests smaller files seemed to give
better batch reading performance.
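
As a rough illustration of how the two startup options above could be
interpreted, here is a minimal sketch. MaxBlockSizeBytes and
BatchReadFileSize are hypothetical helper names, and the sketch
assumes both file sizes are given in bytes.

```cpp
#include <cstdint>

constexpr int64_t kMB = 1024 * 1024;

// Level N is read as a maximum block size of 2^N MB.
constexpr int64_t MaxBlockSizeBytes(int level) {
  return (int64_t{1} << level) * kMB;
}

// A batch read file size of 0 falls back to remote_tmp_file_size.
constexpr int64_t BatchReadFileSize(int64_t batch_read_tmp_file_size,
                                    int64_t remote_tmp_file_size) {
  return batch_read_tmp_file_size == 0 ? remote_tmp_file_size
                                       : batch_read_tmp_file_size;
}

// The default level 3 corresponds to 8MB blocks.
static_assert(MaxBlockSizeBytes(3) == 8 * kMB, "level 3 means 8MB");
```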

New query options:
'remote_batch_read_prefetch_step'
The option specifies the step number for prefetch. For example, if
the step number is 1 and we are reading data from the block with
spill id 1 and sequence number 0, we will try to prefetch the block
with spill id 1 and sequence number 1.
The default value is 1.
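
The prefetch example above amounts to adding the step to the sequence
number while keeping the spill id. A minimal sketch, where BlockId
and PrefetchTarget are hypothetical names that do not come from the
patch:

```cpp
#include <cstdint>

// Identifies a block by its spill id and its sequence number.
struct BlockId {
  int64_t spill_id;
  int64_t seq_no;
};

// Returns the block to prefetch while 'current' is being read.
constexpr BlockId PrefetchTarget(BlockId current, int64_t step) {
  return BlockId{current.spill_id, current.seq_no + step};
}

// Reading (spill id 1, sequence 0) with step 1 prefetches (1, 1).
static_assert(PrefetchTarget(BlockId{1, 0}, 1).seq_no == 1, "step of 1");
```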

'base_spill_id_level'
The option restricts generated spill ids to a certain range. For
example, if the value is 6, we generate a random number within
2^6=64 and left shift it by 16 bits (up to 64 << 16) to get the
base spill id for the partitioned hash join builder. The spill id
assigned to the specific partitioned hash join node is then
base spill id + partition id.
The purpose is to reduce the number of temporary files created when
too many spill ids are in use at the same time. Too many files can
easily use up the local disk buffer and slow down not only the
current query but also other queries that are spilling.
The default value is 6. Setting it to -1 disables the feature, in
which case the pointer address of the builder is used as the base
spill id.
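
The spill id arithmetic described above can be sketched as follows.
MakeBaseSpillId and SpillIdFor are hypothetical names, the random
range is assumed to be [0, 2^level), and the -1 case (builder pointer
address) is not modeled.

```cpp
#include <cstdint>
#include <random>

// Draws a random value below 2^level and shifts it left by 16 bits.
// Assumes 0 <= level < 32 so the shifted value fits in an int64_t.
int64_t MakeBaseSpillId(int level, std::mt19937_64& rng) {
  std::uniform_int_distribution<int64_t> dist(0, (int64_t{1} << level) - 1);
  return dist(rng) << 16;
}

// The spill id is the base spill id plus the partition id.
int64_t SpillIdFor(int64_t base_spill_id, int partition_id) {
  return base_spill_id + partition_id;
}
```

Because the low 16 bits of the base are zero, partition ids below
2^16 never collide with a different base spill id.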

'remote_batch_buffer_file_limit'
Limits the number of local buffer files that the current query can
use for batch reading. The purpose is similar to
'base_spill_id_level': to limit the number of files used by batch
reading, because they may block spilling from nodes that are not
partitioned hash join nodes.
The default value is 0, which means the option imposes no limit and
the global limit applies, which is half of the number of local
buffer files.
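
A minimal sketch of the fallback described above, using the
hypothetical helper EffectiveBatchFileLimit. Whether a nonzero
per-query limit is also capped by the global limit is not stated
here, so the sketch does not cap it.

```cpp
#include <cstdint>

// 'query_limit' stands for remote_batch_buffer_file_limit and
// 'local_buffer_files' for the number of local buffer files.
constexpr int64_t EffectiveBatchFileLimit(int64_t query_limit,
                                          int64_t local_buffer_files) {
  // 0 means the option imposes no limit, so the global limit
  // (half of the local buffer files) applies.
  return query_limit == 0 ? local_buffer_files / 2 : query_limit;
}
```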

'auto_upload_timeout_s'
The timeout for the auto file uploader. If the timeout period has
passed since a file was created and the file has not been accessed
in that time, the uploader forces the file to be uploaded to the
remote filesystem.
This only applies to files used for batch reading.
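
The timeout check described above might look like the sketch below.
BatchFileState and ShouldForceUpload are hypothetical names, and the
use of std::chrono::steady_clock is an assumption made for
illustration.

```cpp
#include <algorithm>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Timestamps tracked for a batch reading file in this sketch.
struct BatchFileState {
  Clock::time_point created;
  Clock::time_point last_access;
};

// True once 'timeout' has elapsed since the later of creation and
// last access, i.e. the file has been idle for the whole period.
bool ShouldForceUpload(const BatchFileState& file, Clock::time_point now,
                       std::chrono::seconds timeout) {
  Clock::time_point last_activity = std::max(file.created, file.last_access);
  return now - last_activity >= timeout;
}
```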

Testing:
Ran exhaustive tests.
TODO: add test cases for batch reading.

Change-Id: If913785cac9e2dafa20013b6600c87fcaf3e2018
---
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
M be/src/runtime/io/disk-file.cc
M be/src/runtime/io/disk-file.h
M be/src/runtime/io/disk-io-mgr-test.cc
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/file-writer.h
M be/src/runtime/io/local-file-writer.cc
M be/src/runtime/io/local-file-writer.h
M be/src/runtime/io/request-context.cc
M be/src/runtime/io/request-context.h
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
M be/src/runtime/query-state.cc
M be/src/runtime/tmp-file-mgr-internal.h
M be/src/runtime/tmp-file-mgr-test.cc
M be/src/runtime/tmp-file-mgr.cc
M be/src/runtime/tmp-file-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M tests/custom_cluster/test_scratch_disk.py
30 files changed, 1,732 insertions(+), 456 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/18219/4
-- 
To view, visit http://gerrit.cloudera.org:8080/18219
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If913785cac9e2dafa20013b6600c87fcaf3e2018
Gerrit-Change-Number: 18219
Gerrit-PatchSet: 4
Gerrit-Owner: Yida Wu <wy...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>