Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/25 20:54:19 UTC

[GitHub] [arrow] kshitij12345 opened a new pull request, #13234: [16613] speed-up parquet.write_metadata

kshitij12345 opened a new pull request, #13234:
URL: https://github.com/apache/arrow/pull/13234

   Ref Code:
   
   <details>
   
   ```
   from io import BytesIO
   
   import pyarrow as pa
   import pyarrow.parquet as pq
   from contexttimer import Timer  # non standard lib (can be installed with pip)
   
   
   def create_example_file_meta_data():
       data = {
           "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
           "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
           "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
           "bool": pa.array([True, True, False, False], type=pa.bool_()),
       }
       table = pa.table(data)
       metadata_collector = []
       pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
       return table.schema, metadata_collector[0]
   
   schema, meta = create_example_file_meta_data()
   print("Created Example File")
   metadata_collector = [meta] * 500
   with Timer(prefix='1'):
       pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
   
   metadata_collector = [meta] * 1000
   with Timer(prefix='2'):
       pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
   
   metadata_collector = [meta] * 2000
   with Timer(prefix='3'):
       pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
   
   metadata_collector = [meta] * 4000
   with Timer(prefix='4'):
       pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
   ```
   
   </details>
   
   Before
   ```
   Created Example File
   1 took 0.615 seconds
   2 took 2.446 seconds
   3 took 9.813 seconds
   4 took 40.237 seconds
   ```
   
   After
   ```
   Created Example File
   1 took 0.009 seconds
   2 took 0.018 seconds
   3 took 0.036 seconds
   4 took 0.072 seconds
   ```
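
A quick way to read the two timing tables is to compare each run with the previous one, since the collector size doubles at every step. The sketch below only reuses the numbers printed above; the helper name `growth_ratios` is made up for illustration. A ratio near 4 per doubling is what quadratic growth would look like, and a ratio near 2 is what linear growth would look like.

```python
# Timings copied from the "Before" and "After" outputs above (seconds).
# The collector sizes were 500, 1000, 2000 and 4000, so each step doubles n.
before = [0.615, 2.446, 9.813, 40.237]
after = [0.009, 0.018, 0.036, 0.072]


def growth_ratios(timings):
    """Return the ratio of each timing to the one before it."""
    return [later / earlier for earlier, later in zip(timings, timings[1:])]


before_ratios = growth_ratios(before)
after_ratios = growth_ratios(after)

# Doubling n multiplies the "before" time by roughly 4 (quadratic-looking)
# and the "after" time by roughly 2 (linear-looking).
print("before:", [round(r, 2) for r in before_ratios])
print("after: ", [round(r, 2) for r in after_ratios])
```

These are single runs on one machine, so the ratios are indicative rather than a rigorous complexity measurement.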
   
   TODO:
   * [ ] Overload the existing Cython function rather than adding a new one.
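
One possible shape for the TODO above is a single entry point that accepts either one metadata object or a list of them. The sketch below is plain Python, not the Cython or C++ code from this PR: `FakeMetaData` and its `row_groups` attribute are hypothetical stand-ins, used only to show the argument normalization and the "collect everything, then extend once" pattern.

```python
class FakeMetaData:
    """Hypothetical stand-in for a metadata object (not pyarrow's class)."""

    def __init__(self, row_groups):
        self.row_groups = list(row_groups)

    def append_row_groups(self, others):
        """Append row groups from one object or from a list of objects."""
        # Normalize the argument so both call styles share one code path.
        if isinstance(others, FakeMetaData):
            others = [others]
        # Gather all incoming row groups first, then extend a single time,
        # instead of rebuilding the container once per appended object.
        incoming = [group for other in others for group in other.row_groups]
        self.row_groups.extend(incoming)


meta = FakeMetaData(["rg0"])
meta.append_row_groups(FakeMetaData(["rg1"]))                          # single object
meta.append_row_groups([FakeMetaData(["rg2"]), FakeMetaData(["rg3"])])  # list of objects
print(meta.row_groups)
```

Whether the real function should dispatch on the argument type like this, or keep two separate entry points, is a design choice this sketch does not settle.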


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1141334468

   Ah. That is neat :)
   Closing in favour of #13265




[GitHub] [arrow] kshitij12345 commented on pull request #13234: [16613] speed-up parquet.write_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1137841256

   cc: @AlenkaF  




[GitHub] [arrow] AlenkaF commented on pull request #13234: [16613] speed-up parquet.write_metadata

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1138142265

   Thanks @kshitij12345 for contributing!
   
   The current work looks perfect. I am interested to see how it continues: joining the two Cython functions into one that accepts either a single `FileMetaData` object or a list of them, plus the tests.
   
   Could you change the name of the PR to `ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)` so it connects to the JIRA ticket you are working on?
   
   And I will ask for the workflows to get started on this PR.




[GitHub] [arrow] pitrou commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1141263332

   I submitted a much simpler fix in https://github.com/apache/arrow/pull/13265




[GitHub] [arrow] github-actions[bot] commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1138219629

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] AlenkaF commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1138578343

   Thank you for your work! This PR looks great to me.
   @jorisvandenbossche @pitrou could you also have a look at it?




[GitHub] [arrow] github-actions[bot] commented on pull request #13234: [16613] speed-up parquet.write_metadata

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1137839116

   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title to use the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   




[GitHub] [arrow] kshitij12345 closed pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

Posted by GitBox <gi...@apache.org>.
kshitij12345 closed pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
URL: https://github.com/apache/arrow/pull/13234




[GitHub] [arrow] github-actions[bot] commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1138219601

   https://issues.apache.org/jira/browse/ARROW-16613




[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13234:
URL: https://github.com/apache/arrow/pull/13234#discussion_r882390054


##########
cpp/src/parquet/metadata.cc:
##########
@@ -664,6 +664,34 @@ class FileMetaData::FileMetaDataImpl {
     }
   }
 
+  void AppendRowGroups(const std::vector<std::shared_ptr<FileMetaData>>& others) {
+    // Figure out the total num_groups and reserve vector accordinly.

Review Comment:
   accordinly -> accordingly


