You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "Hattonuri (via GitHub)" <gi...@apache.org> on 2023/12/03 12:09:12 UTC

[PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Hattonuri opened a new pull request, #39054:
URL: https://github.com/apache/arrow/pull/39054

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   ### Are these changes tested?
   
   Yes, parquet-arrow-test tests this and if we don't downgrade to allocations when needed segfaults
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   ### Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "Hattonuri (via GitHub)" <gi...@apache.org>.
Hattonuri commented on code in PR #39054:
URL: https://github.com/apache/arrow/pull/39054#discussion_r1413618841


##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -556,10 +556,16 @@ struct DecimalConverter<DecimalArrayType, FLBAType> {
     const int32_t byte_width =
         checked_cast<const ::arrow::FixedSizeBinaryType&>(*fixed_size_binary_array.type())
             .byte_width();
-    // allocate memory for the decimal array
-    ARROW_ASSIGN_OR_RAISE(auto data, ::arrow::AllocateBuffer(length * type_length, pool));
 
     // raw bytes that we can write to
+    std::shared_ptr<::arrow::Buffer> data;
+    // if the byte width of the FixedSizeBinaryArray is greater than or equal to the given
+    // array then we can reuse its data buffer to write the decimal array
+    if (byte_width >= type_length) {
+      data = fixed_size_binary_array.data()->buffers[1];

Review Comment:
   It is not shared by other arrays because it is only used here
   https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/parquet/arrow/reader_internal.cc#L743
   Which is used only here
   https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/parquet/arrow/reader.cc#L488-L489
   So this is function only for reading from file and we see that arrays in it are not shared



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1837461678

   :warning: GitHub issue #38881 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1853883974

   Well I think the mock tricky problem is that this api may contains multiple `::arrow::Array` because `GetBinaryBuilder` might get more than 1 buffer, so I think it might be a bit hard to provide that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #39054:
URL: https://github.com/apache/arrow/pull/39054#discussion_r1413512827


##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -556,10 +556,16 @@ struct DecimalConverter<DecimalArrayType, FLBAType> {
     const int32_t byte_width =
         checked_cast<const ::arrow::FixedSizeBinaryType&>(*fixed_size_binary_array.type())
             .byte_width();
-    // allocate memory for the decimal array
-    ARROW_ASSIGN_OR_RAISE(auto data, ::arrow::AllocateBuffer(length * type_length, pool));
 
     // raw bytes that we can write to
+    std::shared_ptr<::arrow::Buffer> data;
+    // if the byte width of the FixedSizeBinaryArray is greater than or equal to the given
+    // array then we can reuse its data buffer to write the decimal array
+    if (byte_width >= type_length) {
+      data = fixed_size_binary_array.data()->buffers[1];

Review Comment:
   This may change the buffer which is shared by other arrays. So I don't think this is a safe change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "Hattonuri (via GitHub)" <gi...@apache.org>.
Hattonuri commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1843793585

   <img width="1307" alt="image" src="https://github.com/apache/arrow/assets/53221537/5a894d29-e7a6-4742-bb2a-71e6ab9ae1f3">
   Also flamegraph now looks like this. It can be seen that RawBytesToDecimalBytes is a pretty small chunk now instead of juge in linked issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1853877554

   I don't know, perhaps get a mutable Buffer from the Parquet decoder?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "Hattonuri (via GitHub)" <gi...@apache.org>.
Hattonuri commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1852740034

   Could you please approve it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1837466639

   @pitrou @wgtmac Would something like a `scratch` better in this case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1842002193

   I think we can maintain a `Context` struct for each Arrow columnšŸ¤”, and reuse the context for each fetching?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "Hattonuri (via GitHub)" <gi...@apache.org>.
Hattonuri commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1843783701

   > I agree this looks a bit fragile even if in the current context this wouldn't be a problem.
   > 
   > @Hattonuri Did you run any benchmarks on this? Are there any tangible benefits?
   
   I run Decimal128Conversions and got these results
   `benchmark          baseline         contender  change % counters
   Decimal128Conversion 18.102M items/sec 18.877M items/sec     4.285 {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'Decimal128Conversion', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1921, 'allocs_per_iter': 2.375, 'max_bytes_used': 209536, 'total_allocated_bytes': 1931840}`
   
   But this 4.285% increase is not stable. 
   In my reading code i have lowered cpu usage from ~590% to ~530%.
   I did not find any other benchmarks with decimals


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1841102327

   I agree this looks a bit fragile even if in the current context this wouldn't be a problem.
   
   @Hattonuri Did you run any benchmarks on this? Are there any tangible benefits?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1853564328

   I think this is a bit tricky and not commonly used, the problem is that this leaks the abstraction of `buffers` handling in `parquet::RecordReader`. Transfering it to same output-buffer is also bug-prone, it might introduce some undefined behavior if memcpy is used rather than memmove, which is very hard to debugging.
   
   Maybe I'd like to survey a better interface here?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38881: [C++][Parquet] Reuse FLBA array data for decimal conversion when possible [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #39054:
URL: https://github.com/apache/arrow/pull/39054#issuecomment-1853752088

   cc @pitrou can we improve the builder for this problem( .i.e. using DecimalBuilder?) ? Or making a protocol that ensures `ArrayData` can be modified?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org