You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/11 22:43:57 UTC

[GitHub] [arrow] njwhite opened a new pull request #8644: [WIP] Align written buffers to specified value

njwhite opened a new pull request #8644:
URL: https://github.com/apache/arrow/pull/8644


   See mailing list discussion for now


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-729163995


   > I’m not sure I understand the philosophy here though; Arrow has a lot of Python code to provide Pandas compatibility but if you always have to copy IPC data to use it from Pandas then you’re not getting any advantage from using Arrow as the underlying data format?
   
   My thinking here is Arrow isn't just about the python implementation.  There are all sorts of tweaks that might make one language/library pair more performant, but changing the spec and ensuring all the other implementations can handle it is is a bunch of work, and over time degrades compatibility between implementation (we are already behind on some other features that are already in the spec).  So in these cases, I tend to be conservative with changing the specification. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-727068271


   From the specification
   
   > Implementations are recommended to allocate memory on aligned addresses (multiple of 8- or 64-bytes) and pad (overallocate) to a length that is a multiple of 8 or 64 bytes. **When serializing Arrow data for interprocess communication, these alignment and padding requirements are enforced.**
   
   So I guess it's a question whether we want to allow the creation of non-conforming IPC messages


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] njwhite edited a comment on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
njwhite edited a comment on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-729138006


   @emkornfield thanks, I will. Can I ask why you don’t want unaligned buffers allowable in the spec? 
   
   I’m not sure I understand the philosophy here though; Arrow has a lot of Python code to provide Pandas compatibility but if you always have to copy IPC data to use it from Pandas then you’re not getting any advantage from using Arrow as the underlying data format?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-887324797


   It looks like there is consensus that the proposed changes are not desirable, so I'm going to close this PR. Sorry @njwhite , and thank you for contributing this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] njwhite commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
njwhite commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-729050524


   @emkornfield @wesm thanks for the comments. What’s the process for updating the spec? As allowing unaligned buffers in IPC streams is backwards (but not forwards) compatible it’d be great to get it into v1.1!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8644:
URL: https://github.com/apache/arrow/pull/8644


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-726540158


   [ML Discussion](https://lists.apache.org/thread.html/rca411c6a753cec1a960f9ed3a474ea3cbf196633f951b353d580aed9%40%3Cdev.arrow.apache.org%3E).  As I said there, I don't think this is a good approach either since it doesn't conform to the specification.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-726395783


   https://issues.apache.org/jira/browse/ARROW-10573


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-727073502


   I'm -1 on allowing non-conforming IPC implementations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-729132706


   > @emkornfield @wesm thanks for the comments. What’s the process for updating the spec? As allowing unaligned buffers in IPC streams is backwards (but not forwards) compatible it’d be great to get it into v1.1!
   
   You can start a discussion thread on dev@ mailing list.  I'm not in favor of making this change to the specification, others might feel differently.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] njwhite commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
njwhite commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-729138006


   @emkornfield thanks, I will. Can I ask why you don’t want unaligned buffers allowable in the spec? 
   
   I’m not sure I understand the philosophy here though; Arrow has a lot of Python code to provide Pandas compatibility but if you always have to copy IPC data to use it from Pandas then your not getting any advantage from using Arrow as the underlying data format?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] njwhite commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
njwhite commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-727063708


   @wesm I disagree with your assertion that it's only useful in an extraordinarily narrow use case - I've added a test case `test_contiguous_buffers_mixed_types` to show a zero-copy load of mixed-datatype DataFrame. The columns do have to be sorted by dtype for now (`df[df.dtypes.sort_values().index]`) if you want to eliminate consolidation, but `pyarrow` could easily do this under the hood. 
   
   w.r.t. the spec, I see in _Implementations are **recommended** to allocate memory on aligned addresses_ [here](https://arrow.apache.org/docs/format/Columnar.html); my PR doesn't change the default behaviour of following the recommendation and 8-byte aligning the buffers. A user would have to opt in to creating unaligned files if they wanted the benefit of reading their DataFrames without copying the data.
   
   My use case is to expose Arrow files saved on disk (and probably still resident in the OS' page cache) as Pandas Dataframes with as low a latency as possible. Copying the entire DataFrame from one place in memory to another (to strip out the padding) adds latency and memory pressure, far outweighing the performance benefit of 8-byte aligned reads.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8644: [WIP] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-725703424


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #8644: ARROW-10573: [C++] Align written buffers to specified value

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #8644:
URL: https://github.com/apache/arrow/pull/8644#issuecomment-726416818


   Hm. Unaligned buffers are not compliant with the Arrow specification. This optimization is only useful in the extraordinarily narrow use case where all of the columns in the DataFrame are the same data type. Could you give a little more color on what problem you are solving? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org