You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/28 19:06:47 UTC
[GitHub] [arrow] fatemehp opened a new pull request, #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size
fatemehp opened a new pull request, #14545:
URL: https://github.com/apache/arrow/pull/14545
In the current code, we will read this page because we are using > and not >= in the branch that decides to skip the rest of the page.
Also includes minor refactoring to reuse the function available_values_current_page(), and use ConsumeBufferedValues() accordingly.
Benchmark results for when batch size = 100K and number of values per page = 100K.
```
BEFORE
-------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------
REQUIRED 96831 ns 96326 ns 1000
OPTIONAL 623897 ns 621734 ns 1000
REPEATED 1006153 ns 997482 ns 1000
AFTER
-------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------
REQUIRED 2175 ns 2164 ns 1000
OPTIONAL 2743 ns 2719 ns 1000
REPEATED 2368 ns 2424 ns 1000
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1010278851
##########
cpp/src/parquet/column_reader_test.cc:
##########
@@ -303,28 +303,40 @@ TEST_F(TestPrimitiveReader, TestSkipAroundPageBoundries) {
values_.begin() + 4 * levels_per_page);
ASSERT_TRUE(vector_equal(sub_values, vresult));
+ // 3) skip_size == page_size (skip page 4 from start of the page to the end)
+ levels_skipped = reader->Skip(levels_per_page);
+ ASSERT_EQ(levels_per_page, levels_skipped);
+ // Read half a page (page 5 to 5.5)
+ reader->ReadBatch(levels_per_page / 2, dresult.data(), rresult.data(), vresult.data(),
+ &values_read);
+ sub_values.clear();
+ sub_values.insert(
+ sub_values.end(),
+ values_.begin() + static_cast<int>(5 * static_cast<double>(levels_per_page)),
+ values_.begin() + 5.5 * levels_per_page);
+ ASSERT_TRUE(vector_equal(sub_values, vresult));
+
// 3) skip_size < page_size (skip limited to a single page)
Review Comment:
Can you fix the numbering in the comments? :-)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1008368346
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
template <typename DType>
int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
int64_t values_to_skip = num_values_to_skip;
- while (HasNext() && values_to_skip > 0) {
+ // Optimization: Do not call HasNext() when values_to_skip == 0.
+ while (values_to_skip > 0 && HasNext()) {
// If the number of values to skip is more than the number of undecoded values, skip
// the Page.
- if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
- values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
- this->num_decoded_values_ = this->num_buffered_values_;
+ const int64_t available_values = this->available_values_current_page();
+ if (values_to_skip >= available_values) {
Review Comment:
This line has the main change for this request.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1008370621
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
template <typename DType>
int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
int64_t values_to_skip = num_values_to_skip;
- while (HasNext() && values_to_skip > 0) {
+ // Optimization: Do not call HasNext() when values_to_skip == 0.
+ while (values_to_skip > 0 && HasNext()) {
// If the number of values to skip is more than the number of undecoded values, skip
// the Page.
- if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
- values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
- this->num_decoded_values_ = this->num_buffered_values_;
+ const int64_t available_values = this->available_values_current_page();
+ if (values_to_skip >= available_values) {
+ values_to_skip -= available_values;
+ this->ConsumeBufferedValues(available_values);
Review Comment:
Refactoring to use this function instead of doing the manipulation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1009618889
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
template <typename DType>
int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
int64_t values_to_skip = num_values_to_skip;
- while (HasNext() && values_to_skip > 0) {
+ // Optimization: Do not call HasNext() when values_to_skip == 0.
+ while (values_to_skip > 0 && HasNext()) {
Review Comment:
Just to be clear: this is the crux of the optimization, right?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1009855139
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
template <typename DType>
int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
int64_t values_to_skip = num_values_to_skip;
- while (HasNext() && values_to_skip > 0) {
+ // Optimization: Do not call HasNext() when values_to_skip == 0.
+ while (values_to_skip > 0 && HasNext()) {
Review Comment:
No, this line is the main change. Sorry I should have separated the refactoring from this change.
if (values_to_skip >= available_values) {
Change > to >=
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] fatemehp commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
fatemehp commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297666613
@pitrou, @emkornfield regarding your comments, I am running the same benchmark that is being checked in here https://github.com/apache/arrow/pull/14523. I just cleaned up the output a bit here to be more readable.
What the benchmark does is it creates a few pages with 100K entries each. Then it calls Skip(100K) repeatedly. Before this change, we would read the 100K values and throw them away. With this change, we skip reading from the page.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1010278228
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
template <typename DType>
int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
int64_t values_to_skip = num_values_to_skip;
- while (HasNext() && values_to_skip > 0) {
+ // Optimization: Do not call HasNext() when values_to_skip == 0.
+ while (values_to_skip > 0 && HasNext()) {
// If the number of values to skip is more than the number of undecoded values, skip
// the Page.
- if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
- values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
- this->num_decoded_values_ = this->num_buffered_values_;
+ const int64_t available_values = this->available_values_current_page();
+ if (values_to_skip >= available_values) {
Review Comment:
I see, thanks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1010652197
##########
cpp/src/parquet/column_reader_test.cc:
##########
@@ -303,28 +303,40 @@ TEST_F(TestPrimitiveReader, TestSkipAroundPageBoundries) {
values_.begin() + 4 * levels_per_page);
ASSERT_TRUE(vector_equal(sub_values, vresult));
+ // 3) skip_size == page_size (skip page 4 from start of the page to the end)
+ levels_skipped = reader->Skip(levels_per_page);
+ ASSERT_EQ(levels_per_page, levels_skipped);
+ // Read half a page (page 5 to 5.5)
+ reader->ReadBatch(levels_per_page / 2, dresult.data(), rresult.data(), vresult.data(),
+ &values_read);
+ sub_values.clear();
+ sub_values.insert(
+ sub_values.end(),
+ values_.begin() + static_cast<int>(5 * static_cast<double>(levels_per_page)),
+ values_.begin() + 5.5 * levels_per_page);
+ ASSERT_TRUE(vector_equal(sub_values, vresult));
+
// 3) skip_size < page_size (skip limited to a single page)
Review Comment:
Done.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] fatemehp commented on pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
fatemehp commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1295356058
@emkornfield could you take a look?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297400916
:warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
emkornfield commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297636489
@pitrou I think https://github.com/apache/arrow/pull/14523 is the microbenchmark.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297188519
This is not so minor IMHO, can you open a JIRA for this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou merged pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
pitrou merged PR #14545:
URL: https://github.com/apache/arrow/pull/14545
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] fatemehp commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
fatemehp commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297343530
Opened a Jira ticket.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297400871
https://issues.apache.org/jira/browse/PARQUET-2209
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297352281
> Benchmark results for when batch size = 100K and number of values per page = 100K.
Can you explain what this benchmark does?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size
Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1008369931
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
template <typename DType>
int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
int64_t values_to_skip = num_values_to_skip;
- while (HasNext() && values_to_skip > 0) {
+ // Optimization: Do not call HasNext() when values_to_skip == 0.
+ while (values_to_skip > 0 && HasNext()) {
// If the number of values to skip is more than the number of undecoded values, skip
// the Page.
- if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
- values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
- this->num_decoded_values_ = this->num_buffered_values_;
+ const int64_t available_values = this->available_values_current_page();
Review Comment:
Refactoring to re-use this function instead of doing the manipulation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org