You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2023/01/17 12:33:32 UTC

[GitHub] [doris] kaka11chen opened a new pull request, #16024: [Enhancement][icebergv2_parquet_reader] Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

kaka11chen opened a new pull request, #16024:
URL: https://github.com/apache/doris/pull/16024

   
   # Proposed changes
   
   Issue Number: close #16023
   
   ## Problem summary
   
   Describe your changes.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: 
       - [ ] Yes
       - [x] No
       - [ ] I don't know
   2. Has unit tests been added:
       - [ ] Yes
       - [x] No
       - [ ] No Need
   3. Has document been added or modified:
       - [ ] Yes
       - [ ] No
       - [x] No Need
   4. Does it need to update dependencies:
       - [ ] Yes
       - [x] No
   5. Are there any changes that cannot be rolled back:
       - [ ] Yes (If Yes, please explain WHY)
       - [x] No
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org

[GitHub] [doris] morningman merged pull request #16024: [Enhancement](icebergv2) Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

Posted by "morningman (via GitHub)" <gi...@apache.org>.

morningman merged PR #16024:
URL: https://github.com/apache/doris/pull/16024


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org

[GitHub] [doris] AshinGau commented on a diff in pull request #16024: [Enhancement][icebergv2_parquet_reader] Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

Posted by GitBox <gi...@apache.org>.

AshinGau commented on code in PR #16024:
URL: https://github.com/apache/doris/pull/16024#discussion_r1080779915


##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -262,8 +258,9 @@ Status RowGroupReader::_do_lazy_read(Block* block, size_t batch_size, size_t* re
             // generated from next batch, so the filter column is removed ahead.
             DCHECK_EQ(block->rows(), 0);
         } else {
-            Block::filter_block(block, _lazy_read_ctx.all_predicate_col_ids, filter_column_id,
-                                origin_column_num);
+            ColumnPtr filter_column = block->get_by_position(filter_column_id).column;

Review Comment:
   We should filter the predicate columns by `_filter_ptr` when there're delete rows.



##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -434,6 +438,121 @@ Status RowGroupReader::_read_empty_batch(size_t batch_size, size_t* read_rows, b
     return Status::OK();
 }
 
+Status RowGroupReader::_build_pos_delete_filter(size_t read_rows) {
+    _filter_ptr.reset(new IColumn::Filter(read_rows, 1));
+    _total_read_rows += read_rows;
+    if (!_position_delete_ctx.has_filter) {
+        return Status::OK();
+    }
+    size_t remaining_read_rows = _total_read_rows + read_rows;
+    while (_position_delete_ctx.index < _position_delete_ctx.end_index) {
+        const int64_t delete_row_index_in_row_group =
+                _position_delete_ctx.delete_rows[_position_delete_ctx.index] -
+                _position_delete_ctx.first_row_id;
+        int64_t read_range_rows = 0;
+        for (auto& range : _read_ranges) {
+            if (delete_row_index_in_row_group < range.first_row) {
+                break;
+            } else if (delete_row_index_in_row_group < range.last_row) {

Review Comment:
   Maybe we can build an inline function to calculate the line number



##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -434,6 +438,121 @@ Status RowGroupReader::_read_empty_batch(size_t batch_size, size_t* read_rows, b
     return Status::OK();
 }
 
+Status RowGroupReader::_build_pos_delete_filter(size_t read_rows) {
+    _filter_ptr.reset(new IColumn::Filter(read_rows, 1));
+    _total_read_rows += read_rows;
+    if (!_position_delete_ctx.has_filter) {
+        return Status::OK();
+    }
+    size_t remaining_read_rows = _total_read_rows + read_rows;
+    while (_position_delete_ctx.index < _position_delete_ctx.end_index) {
+        const int64_t delete_row_index_in_row_group =
+                _position_delete_ctx.delete_rows[_position_delete_ctx.index] -
+                _position_delete_ctx.first_row_id;
+        int64_t read_range_rows = 0;
+        for (auto& range : _read_ranges) {

Review Comment:
   The two-level loop is time-consuming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org

[GitHub] [doris] hello-stephen commented on pull request #16024: [Enhancement][icebergv2_parquet_reader] Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

Posted by GitBox <gi...@apache.org>.

hello-stephen commented on PR #16024:
URL: https://github.com/apache/doris/pull/16024#issuecomment-1385803533

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 35.29 seconds
    load time: 505 seconds
    storage size: 17122417517 Bytes
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230117174841_clickbench_pr_82516.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org

[GitHub] [doris] AshinGau commented on pull request #16024: [Enhancement][icebergv2_parquet_reader] Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

Posted by GitBox <gi...@apache.org>.

AshinGau commented on PR #16024:
URL: https://github.com/apache/doris/pull/16024#issuecomment-1397862041

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org

[GitHub] [doris] kaka11chen commented on a diff in pull request #16024: [Enhancement][icebergv2_parquet_reader] Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

Posted by GitBox <gi...@apache.org>.

kaka11chen commented on code in PR #16024:
URL: https://github.com/apache/doris/pull/16024#discussion_r1080878057


##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -262,8 +258,9 @@ Status RowGroupReader::_do_lazy_read(Block* block, size_t batch_size, size_t* re
             // generated from next batch, so the filter column is removed ahead.
             DCHECK_EQ(block->rows(), 0);
         } else {
-            Block::filter_block(block, _lazy_read_ctx.all_predicate_col_ids, filter_column_id,
-                                origin_column_num);
+            ColumnPtr filter_column = block->get_by_position(filter_column_id).column;

Review Comment:
   Filter by _filter_ptr in the _filter_block function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org

[GitHub] [doris] github-actions[bot] commented on a diff in pull request #16024: [Enhancement][icebergv2_parquet_reader] Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on code in PR #16024:
URL: https://github.com/apache/doris/pull/16024#discussion_r1072157760


##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -434,6 +438,103 @@
     return Status::OK();
 }
 
+Status RowGroupReader::_build_pos_delete_filter(size_t read_rows) {
+    _filter_ptr.reset(new IColumn::Filter(read_rows, 1));
+    int64_t start_row_id = _position_delete_ctx.current_row_id;
+    int64_t end_row_id = std::min(_position_delete_ctx.current_row_id + (int64_t)read_rows,
+                                  _position_delete_ctx.last_row_id);
+    while (_position_delete_ctx.index < _position_delete_ctx.end_index) {
+        const int64_t& delete_row_id = _position_delete_ctx.delete_rows[_position_delete_ctx.index];
+        if (delete_row_id < start_row_id) {
+            _position_delete_ctx.index++;
+        } else if (delete_row_id < end_row_id) {
+            int64_t index = _position_delete_ctx.delete_rows[_position_delete_ctx.index] -
+                            _position_delete_ctx.current_row_id;
+            (*_filter_ptr)[index] = 0;
+            _position_delete_ctx.index++;
+        } else { // delete_row_id >= end_row_id
+            break;
+        }
+    }
+    _position_delete_ctx.current_row_id = end_row_id;
+    return Status::OK();
+}
+
+Status RowGroupReader::_filter_block(Block* block, const ColumnPtr filter_column,
+                                     int column_to_keep, std::vector<uint32_t> columns_to_filter) {
+    if (auto* nullable_column = check_and_get_column<ColumnNullable>(*filter_column)) {
+        ColumnPtr nested_column = nullable_column->get_nested_column_ptr();
+
+        MutableColumnPtr mutable_holder =
+                nested_column->use_count() == 1
+                        ? nested_column->assume_mutable()
+                        : nested_column->clone_resized(nested_column->size());
+
+        ColumnUInt8* concrete_column = typeid_cast<ColumnUInt8*>(mutable_holder.get());
+        if (!concrete_column) {
+            return Status::InvalidArgument(
+                    "Illegal type {} of column for filter. Must be UInt8 or Nullable(UInt8).",
+                    filter_column->get_name());
+        }
+        auto* __restrict null_map = nullable_column->get_null_map_data().data();
+        IColumn::Filter& filter = concrete_column->get_data();
+        auto* __restrict filter_data = filter.data();
+
+        const size_t size = filter.size();
+        for (size_t i = 0; i < size; ++i) {
+            (*_filter_ptr)[i] &= (!null_map[i]) & filter_data[i];
+        }
+        RETURN_IF_ERROR(_filter_block_internal(block, columns_to_filter));
+    } else if (auto* const_column = check_and_get_column<ColumnConst>(*filter_column)) {
+        bool ret = const_column->get_bool(0);
+        if (!ret) {
+            for (auto& col : columns_to_filter) {
+                std::move(*block->get_by_position(col).column).assume_mutable()->clear();
+            }
+        }
+    } else {
+        const IColumn::Filter& filter =
+                assert_cast<const doris::vectorized::ColumnVector<UInt8>&>(*filter_column)
+                        .get_data();
+
+        auto* __restrict filter_data = filter.data();
+        const size_t size = filter.size();
+        for (size_t i = 0; i < size; ++i) {
+            (*_filter_ptr)[i] &= filter_data[i];
+        }
+        RETURN_IF_ERROR(_filter_block_internal(block, columns_to_filter));
+    }
+    Block::erase_useless_column(block, column_to_keep);
+    return Status::OK();
+}
+
+Status RowGroupReader::_filter_block(Block* block, int column_to_keep,
+                                     const std::vector<uint32_t>& columns_to_filter) {
+    RETURN_IF_ERROR(_filter_block_internal(block, columns_to_filter));
+    Block::erase_useless_column(block, column_to_keep);
+
+    return Status::OK();
+}
+
+Status RowGroupReader::_filter_block_internal(Block* block,
+                                              const std::vector<uint32_t>& columns_to_filter) {
+    size_t count = _filter_ptr->size() -
+                   simd::count_zero_num((int8_t*)_filter_ptr->data(), _filter_ptr->size());
+    if (count == 0) {
+        for (auto& col : columns_to_filter) {
+            std::move(*block->get_by_position(col).column).assume_mutable()->clear();

Review Comment:
   warning: std::move of the const expression has no effect; remove std::move() [performance-move-const-arg]
   
   ```suggestion
               *block->get_by_position(col).column.assume_mutable()->clear();
   ```
   



##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -434,6 +438,103 @@ Status RowGroupReader::_read_empty_batch(size_t batch_size, size_t* read_rows, b
     return Status::OK();
 }
 
+Status RowGroupReader::_build_pos_delete_filter(size_t read_rows) {
+    _filter_ptr.reset(new IColumn::Filter(read_rows, 1));
+    int64_t start_row_id = _position_delete_ctx.current_row_id;
+    int64_t end_row_id = std::min(_position_delete_ctx.current_row_id + (int64_t)read_rows,
+                                  _position_delete_ctx.last_row_id);
+    while (_position_delete_ctx.index < _position_delete_ctx.end_index) {
+        const int64_t& delete_row_id = _position_delete_ctx.delete_rows[_position_delete_ctx.index];
+        if (delete_row_id < start_row_id) {
+            _position_delete_ctx.index++;
+        } else if (delete_row_id < end_row_id) {
+            int64_t index = _position_delete_ctx.delete_rows[_position_delete_ctx.index] -
+                            _position_delete_ctx.current_row_id;
+            (*_filter_ptr)[index] = 0;
+            _position_delete_ctx.index++;
+        } else { // delete_row_id >= end_row_id
+            break;
+        }
+    }
+    _position_delete_ctx.current_row_id = end_row_id;
+    return Status::OK();
+}
+
+Status RowGroupReader::_filter_block(Block* block, const ColumnPtr filter_column,
+                                     int column_to_keep, std::vector<uint32_t> columns_to_filter) {
+    if (auto* nullable_column = check_and_get_column<ColumnNullable>(*filter_column)) {
+        ColumnPtr nested_column = nullable_column->get_nested_column_ptr();
+
+        MutableColumnPtr mutable_holder =
+                nested_column->use_count() == 1
+                        ? nested_column->assume_mutable()
+                        : nested_column->clone_resized(nested_column->size());
+
+        ColumnUInt8* concrete_column = typeid_cast<ColumnUInt8*>(mutable_holder.get());
+        if (!concrete_column) {
+            return Status::InvalidArgument(
+                    "Illegal type {} of column for filter. Must be UInt8 or Nullable(UInt8).",
+                    filter_column->get_name());
+        }
+        auto* __restrict null_map = nullable_column->get_null_map_data().data();
+        IColumn::Filter& filter = concrete_column->get_data();
+        auto* __restrict filter_data = filter.data();
+
+        const size_t size = filter.size();
+        for (size_t i = 0; i < size; ++i) {
+            (*_filter_ptr)[i] &= (!null_map[i]) & filter_data[i];
+        }
+        RETURN_IF_ERROR(_filter_block_internal(block, columns_to_filter));
+    } else if (auto* const_column = check_and_get_column<ColumnConst>(*filter_column)) {
+        bool ret = const_column->get_bool(0);
+        if (!ret) {
+            for (auto& col : columns_to_filter) {
+                std::move(*block->get_by_position(col).column).assume_mutable()->clear();

Review Comment:
   warning: std::move of the const expression has no effect; remove std::move() [performance-move-const-arg]
   
   ```suggestion
                   *block->get_by_position(col).column.assume_mutable()->clear();
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org

[GitHub] [doris] github-actions[bot] commented on pull request #16024: [Enhancement][icebergv2_parquet_reader] Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #16024:
URL: https://github.com/apache/doris/pull/16024#issuecomment-1385361201

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org

[GitHub] [doris] kaka11chen commented on a diff in pull request #16024: [Enhancement][icebergv2] Optimize the position delete file filtering mechanism in iceberg v2 parquet reader.

Posted by GitBox <gi...@apache.org>.

kaka11chen commented on code in PR #16024:
URL: https://github.com/apache/doris/pull/16024#discussion_r1082069163


##########
be/src/vec/exec/format/parquet/vparquet_group_reader.cpp:
##########
@@ -434,6 +438,121 @@ Status RowGroupReader::_read_empty_batch(size_t batch_size, size_t* read_rows, b
     return Status::OK();
 }
 
+Status RowGroupReader::_build_pos_delete_filter(size_t read_rows) {
+    _filter_ptr.reset(new IColumn::Filter(read_rows, 1));
+    _total_read_rows += read_rows;
+    if (!_position_delete_ctx.has_filter) {
+        return Status::OK();
+    }
+    size_t remaining_read_rows = _total_read_rows + read_rows;
+    while (_position_delete_ctx.index < _position_delete_ctx.end_index) {
+        const int64_t delete_row_index_in_row_group =
+                _position_delete_ctx.delete_rows[_position_delete_ctx.index] -
+                _position_delete_ctx.first_row_id;
+        int64_t read_range_rows = 0;
+        for (auto& range : _read_ranges) {

Review Comment:
   Don't search next range when there is no remaining_read_rows.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org