You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/28 19:06:47 UTC

[GitHub] [arrow] fatemehp opened a new pull request, #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size

fatemehp opened a new pull request, #14545:
URL: https://github.com/apache/arrow/pull/14545

   In the current code, we will read this page because we are using > and not >= in the branch that decides to skip the rest of the page.
   
   Also includes minor refactoring to reuse the function available_values_current_page(), and use ConsumeBufferedValues() accordingly.
   
   Benchmark results for when batch size = 100K and  number of values per page = 100K.
   ```
   BEFORE
   -------------------------------------------------------------------------------
   Benchmark       Time             CPU         Iterations
   -------------------------------------------------------------------------------
   REQUIRED      96831 ns        96326 ns         1000
   OPTIONAL     623897 ns       621734 ns         1000
   REPEATED    1006153 ns       997482 ns         1000
   
   AFTER
   -------------------------------------------------------------------------------
   Benchmark      Time             CPU          Iterations
   -------------------------------------------------------------------------------
   REQUIRED       2175 ns         2164 ns         1000
   OPTIONAL       2743 ns         2719 ns         1000
   REPEATED       2368 ns         2424 ns         1000
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1010278851


##########
cpp/src/parquet/column_reader_test.cc:
##########
@@ -303,28 +303,40 @@ TEST_F(TestPrimitiveReader, TestSkipAroundPageBoundries) {
       values_.begin() + 4 * levels_per_page);
   ASSERT_TRUE(vector_equal(sub_values, vresult));
 
+  // 3) skip_size == page_size (skip page 4 from start of the page to the end)
+  levels_skipped = reader->Skip(levels_per_page);
+  ASSERT_EQ(levels_per_page, levels_skipped);
+  // Read half a page (page 5 to 5.5)
+  reader->ReadBatch(levels_per_page / 2, dresult.data(), rresult.data(), vresult.data(),
+                    &values_read);
+  sub_values.clear();
+  sub_values.insert(
+      sub_values.end(),
+      values_.begin() + static_cast<int>(5 * static_cast<double>(levels_per_page)),
+      values_.begin() + 5.5 * levels_per_page);
+  ASSERT_TRUE(vector_equal(sub_values, vresult));
+
   // 3) skip_size < page_size (skip limited to a single page)

Review Comment:
   Can you fix the numbering in the comments? :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1008368346


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
 template <typename DType>
 int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
   int64_t values_to_skip = num_values_to_skip;
-  while (HasNext() && values_to_skip > 0) {
+  // Optimization: Do not call HasNext() when values_to_skip == 0.
+  while (values_to_skip > 0 && HasNext()) {
     // If the number of values to skip is more than the number of undecoded values, skip
     // the Page.
-    if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
-      values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
-      this->num_decoded_values_ = this->num_buffered_values_;
+    const int64_t available_values = this->available_values_current_page();
+    if (values_to_skip >= available_values) {

Review Comment:
   This line has the main change for this request. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1008370621


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
 template <typename DType>
 int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
   int64_t values_to_skip = num_values_to_skip;
-  while (HasNext() && values_to_skip > 0) {
+  // Optimization: Do not call HasNext() when values_to_skip == 0.
+  while (values_to_skip > 0 && HasNext()) {
     // If the number of values to skip is more than the number of undecoded values, skip
     // the Page.
-    if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
-      values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
-      this->num_decoded_values_ = this->num_buffered_values_;
+    const int64_t available_values = this->available_values_current_page();
+    if (values_to_skip >= available_values) {
+      values_to_skip -= available_values;
+      this->ConsumeBufferedValues(available_values);

Review Comment:
   Refactoring to use this function instead of doing the manipulation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1009618889


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
 template <typename DType>
 int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
   int64_t values_to_skip = num_values_to_skip;
-  while (HasNext() && values_to_skip > 0) {
+  // Optimization: Do not call HasNext() when values_to_skip == 0.
+  while (values_to_skip > 0 && HasNext()) {

Review Comment:
   Just to be clear: this is the crux of the optimization, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1009855139


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
 template <typename DType>
 int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
   int64_t values_to_skip = num_values_to_skip;
-  while (HasNext() && values_to_skip > 0) {
+  // Optimization: Do not call HasNext() when values_to_skip == 0.
+  while (values_to_skip > 0 && HasNext()) {

Review Comment:
   No, this line is the main change. Sorry I should have separated the refactoring from this change.
   if (values_to_skip >= available_values) {
   
   Change > to >=



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
fatemehp commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297666613

   @pitrou, @emkornfield  regarding your comments, I am running the same benchmark that is being checked in here https://github.com/apache/arrow/pull/14523. I just cleaned up the output a bit here to be more readable.
   
   What the benchmark does is it creates a few pages with 100K entries each. Then it calls Skip(100K) repeatedly. Before this change, we would read the 100K values and throw them away. With this change, we skip reading from the page.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1010278228


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
 template <typename DType>
 int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
   int64_t values_to_skip = num_values_to_skip;
-  while (HasNext() && values_to_skip > 0) {
+  // Optimization: Do not call HasNext() when values_to_skip == 0.
+  while (values_to_skip > 0 && HasNext()) {
     // If the number of values to skip is more than the number of undecoded values, skip
     // the Page.
-    if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
-      values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
-      this->num_decoded_values_ = this->num_buffered_values_;
+    const int64_t available_values = this->available_values_current_page();
+    if (values_to_skip >= available_values) {

Review Comment:
   I see, thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1010652197


##########
cpp/src/parquet/column_reader_test.cc:
##########
@@ -303,28 +303,40 @@ TEST_F(TestPrimitiveReader, TestSkipAroundPageBoundries) {
       values_.begin() + 4 * levels_per_page);
   ASSERT_TRUE(vector_equal(sub_values, vresult));
 
+  // 3) skip_size == page_size (skip page 4 from start of the page to the end)
+  levels_skipped = reader->Skip(levels_per_page);
+  ASSERT_EQ(levels_per_page, levels_skipped);
+  // Read half a page (page 5 to 5.5)
+  reader->ReadBatch(levels_per_page / 2, dresult.data(), rresult.data(), vresult.data(),
+                    &values_read);
+  sub_values.clear();
+  sub_values.insert(
+      sub_values.end(),
+      values_.begin() + static_cast<int>(5 * static_cast<double>(levels_per_page)),
+      values_.begin() + 5.5 * levels_per_page);
+  ASSERT_TRUE(vector_equal(sub_values, vresult));
+
   // 3) skip_size < page_size (skip limited to a single page)

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
fatemehp commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1295356058

   @emkornfield could you take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297400916

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
emkornfield commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297636489

   @pitrou I think https://github.com/apache/arrow/pull/14523 is the microbenchmark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297188519

   This is not so minor IMHO, can you open a JIRA for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
pitrou merged PR #14545:
URL: https://github.com/apache/arrow/pull/14545


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
fatemehp commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297343530

   Opened a Jira ticket.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297400871

   https://issues.apache.org/jira/browse/PARQUET-2209


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14545: PARQUET-2209: [parquet-cpp] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14545:
URL: https://github.com/apache/arrow/pull/14545#issuecomment-1297352281

   > Benchmark results for when batch size = 100K and number of values per page = 100K.
   
   Can you explain what this benchmark does?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on a diff in pull request #14545: MINOR: [PARQUET] Optimize skip for the case that number of values to skip equals page size

Posted by GitBox <gi...@apache.org>.
fatemehp commented on code in PR #14545:
URL: https://github.com/apache/arrow/pull/14545#discussion_r1008369931


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1141,12 +1146,14 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchSpaced(
 template <typename DType>
 int64_t TypedColumnReaderImpl<DType>::Skip(int64_t num_values_to_skip) {
   int64_t values_to_skip = num_values_to_skip;
-  while (HasNext() && values_to_skip > 0) {
+  // Optimization: Do not call HasNext() when values_to_skip == 0.
+  while (values_to_skip > 0 && HasNext()) {
     // If the number of values to skip is more than the number of undecoded values, skip
     // the Page.
-    if (values_to_skip > (this->num_buffered_values_ - this->num_decoded_values_)) {
-      values_to_skip -= this->num_buffered_values_ - this->num_decoded_values_;
-      this->num_decoded_values_ = this->num_buffered_values_;
+    const int64_t available_values = this->available_values_current_page();

Review Comment:
   Refactoring to re-use this function instead of doing the manipulation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org