You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "rtpsw (via GitHub)" <gi...@apache.org> on 2023/05/11 08:06:32 UTC

[GitHub] [arrow] rtpsw opened a new pull request, #35541: Fix source node batch realignment

rtpsw opened a new pull request, #35541:
URL: https://github.com/apache/arrow/pull/35541

   ### Rationale for this change
   
   Currently, source node uses too high a value for realignment, which [caused a performance degradation](https://github.com/apache/arrow/issues/35498).
   
   ### What changes are included in this PR?
   
   The source node batch realigns an array to its byte width (if it is more than 1).
   
   ### Are these changes tested?
   
   The changes are tested for correctness by the existing tests. The performance is checked by regression jobs.
   
   ### Are there any user-facing changes?
   
   Only performance.
   
   **This PR contains a "Critical Fix".**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1544558231

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/9648ebda147f4392be54162295f0c572...c2a504e85aeb472980487adec848a4d7/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1544151067

   I was thinking along the lines of your work; you're clearly ahead of me on this.
   
   >  I don't think we need to worry about aligning any buffers with a width of 8 or less
   
   We do. In failures I observed internally, the cause was misalignment of buffer addresses (which were even numerically odd) that were expected to be 4- or 8-byte aligned.
   
   I think the switch-statement should be refactored into `type_traits.h`, if it ends up as part of the solution.
   
   I think the correct idea here is that alignment applies to buffers, not arrays. That is, the format of the buffer is what determines the required alignment, if any, and I guess this format is determined by the buffer index within the array and the array's storage type.
   
   @westonpace, should you or I take this forward? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1543739381

   Benchmark runs are scheduled for baseline = 1a038adad85be6f7e6949cc700dcb9b211feb44d and contender = 892e4002690c869c31a942cbaf80f6dc7830581f. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/2a1019bef15540e195ac2b45f22f71c0...841ff54b254e46fe9956742760f27342/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9648ebda147f4392be54162295f0c572...c2a504e85aeb472980487adec848a4d7/)
   [Scheduled :warning: ursa-i9-9960x is offline.] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/3ef9ee03999c4ea188de94b5488a5665...50d854b965a24aac8833e146a8fc8765/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d14230f22162403ab43f96c6fc95b7fc...82680cccb72f418485dd49d536b2ff2f/)
   Buildkite builds:
   [Scheduled] [`892e4002` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2855)
   [Scheduled] [`892e4002` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2889)
   [Scheduled] [`892e4002` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2851)
   [Scheduled] [`892e4002` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2879)
   [Finished] [`1a038ada` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2852)
   [Scheduled] [`1a038ada` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2888)
   [Scheduled] [`1a038ada` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2850)
   [Scheduled] [`1a038ada` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2878)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1543739061

   @ursabot please benchmark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1544503234

   > Sorry, I meant 8bits or less (e.g. we can safely assume that everything has at least 1 byte of alignment)
   
   Agreed. I included this condition in my recent commit here.
   
   > > I think the correct idea here is that alignment applies to buffers, not arrays.
   > 
   > Yes, unfortunately.
   
   I'd suggest considering adding something like `DataType::GetBufferAlignment(int index)` to return the required alignment of  buffer at `index` that must be a power-of-2 or otherwise be 0 for when no alignment is required. This would force any future data type to provide this info.
   
   > Sorry to steal a task
   
   OK, you owe me one :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1544507251

   Also, consider using `{Check/Ensure}Alignment(Buffer, int64_T)` from `util/align_util.cc`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1543953119

   There is [one CI job](https://github.com/apache/arrow/actions/runs/4947103153/jobs/8846041791?pr=35541) with unexpected results, which appear to be unrelated to this PR, but are worth noting. There is a [second CI job](https://github.com/apache/arrow/actions/runs/4947103153/jobs/8846041975?pr=35541) which segfaults, and it's not clear whether they are related. @westonpace, have you seen these segfaults before?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35541: Fix source node batch realignment

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1543532738

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1544400971

   > We do. In failures I observed internally, the cause was misalignment of buffer addresses (which were even numerically odd) that were expected to be 4- or 8-byte aligned.
   
   Sorry, I meant 8bits or less (e.g. we can safely assume that everything has at least 1 byte of alignment)
   
   > I think the correct idea here is that alignment applies to buffers, not arrays. That is, the format of the buffer is what determines the required alignment, if any, and I guess this format is determined by the buffer index within the array and the array's storage type.
   
   Yes, unfortunately.
   
   > @westonpace, should you or I take this forward?
   
   I think I should have time to make a PR today.  Sorry to steal a task, I should've marked yesterday that I was working on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1551637750

   > Is this obsoleted by #35565?
   
   Yes. Closing this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw closed pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw closed pull request #35541: GH-35498: [C++] Fix source node batch realignment
URL: https://github.com/apache/arrow/pull/35541


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1544027272

   Structs do not have any buffers themselves beyond the validity buffer.  The validity buffer will have a bit width of 1.  I don't think we need to worry about aligning any buffers with a width of 8 or less (e.g. only worry about it once the width is 16)
   
   One problem with this PR is that it doesn't address offsets buffers.  For example, in a string / list / binary array there is a 32-bit offsets buffer which should be aligned to 32.  These arrays won't show up as fixed width arrays.
   
   I was working on this a bit yesterday as well.  I ended up with something like this...
   
   ```
   bool CheckMallocAlignment(const ArrayData& array) {
     auto type_id = array.type->storage_id();
     if (type_id == Type::DICTIONARY) {
       // The values array will be checked separately
       type_id = ::arrow::internal::checked_pointer_cast<DictionaryType>(array.type)
                     ->index_type()
                     ->id();
     }
     switch (array.type->id()) {
       case Type::NA:
       case Type::FIXED_SIZE_LIST:
       case Type::FIXED_SIZE_BINARY:
       case Type::BOOL:
       case Type::INT8:
       case Type::UINT8:
       case Type::DECIMAL128:
       case Type::DECIMAL256:
       case Type::SPARSE_UNION:
       case Type::RUN_END_ENCODED:
       case Type::STRUCT:
         // These have no buffers or all buffers need only byte alignment
         return true;
       case Type::INT16:
       case Type::UINT16:
       case Type::HALF_FLOAT:
         return CheckValuesAlignment(array, 16);
       case Type::INT32:
       case Type::UINT32:
       case Type::FLOAT:
       case Type::STRING:
       case Type::BINARY:
       case Type::DATE32:
       case Type::TIME32:
       case Type::LIST:
       case Type::MAP:
       case Type::DENSE_UNION:
       case Type::INTERVAL_MONTHS:
         return CheckValuesAlignment(array, 32);
       case Type::INT64:
       case Type::UINT64:
       case Type::DOUBLE:
       case Type::LARGE_BINARY:
       case Type::LARGE_LIST:
       case Type::LARGE_STRING:
       case Type::DATE64:
       case Type::TIME64:
       case Type::TIMESTAMP:
       case Type::DURATION:
       case Type::INTERVAL_DAY_TIME:
         return CheckValuesAlignment(array, 64);
       case Type::INTERVAL_MONTH_DAY_NANO:
         return CheckValuesAlignment(array, 128);
       default:
         return true;
     }
   }
   ```
   
   It's not pretty.  I also tried using a visitor pattern but it was ending up to be more complicated (though arguably a bit more future proof were we to add more fixed-width types).  `CheckValuesAlignment` checks the buffer at index 1.  Fortunately, it seems that in all cases, the buffer that needs alignment seems to be at that position.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1543748201

   I see a bunch of CI job ([here](https://github.com/apache/arrow/actions/runs/4945533592/jobs/8842394608?pr=35541), [here](https://github.com/apache/arrow/actions/runs/4945533592/jobs/8842394874?pr=35541), [here](https://github.com/apache/arrow/actions/runs/4945533592/jobs/8842395019?pr=35541) and [here](https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47016479)) are failing due to OOM, presumably on realignment, but the allocation sizes are not large. More concerning is [this failure](https://github.com/apache/arrow/actions/runs/4945533592/jobs/8842394197?pr=35541), which shows an invalid alignment of 3. Presumably, the data type of the value being realigned is some kind of struct. @westonpace, what are your thoughts? Perhaps there is an API (or idiom) to get the byte width of the largest field of the data type, which I think would be the correct alignment in such a case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1543540322

   :warning: GitHub issue #35498 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1543540240

   * Closes: #35498


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35541: GH-35498: [C++] Fix source node batch realignment

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35541:
URL: https://github.com/apache/arrow/pull/35541#issuecomment-1551590574

   Is this obsoleted by #35565?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org