Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/25 20:54:19 UTC
[GitHub] [arrow] kshitij12345 opened a new pull request, #13234: [16613] speed-up parquet.write_metadata
kshitij12345 opened a new pull request, #13234:
URL: https://github.com/apache/arrow/pull/13234
Ref Code:
<details>
```
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq
from contexttimer import Timer  # non standard lib (can be installed with pip)


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]


schema, meta = create_example_file_meta_data()
print("Created Example File")

metadata_collector = [meta] * 500
with Timer(prefix='1'):
    pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)

metadata_collector = [meta] * 1000
with Timer(prefix='2'):
    pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)

metadata_collector = [meta] * 2000
with Timer(prefix='3'):
    pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)

metadata_collector = [meta] * 4000
with Timer(prefix='4'):
    pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
```
</details>
Before
```
Created Example File
1 took 0.615 seconds
2 took 2.446 seconds
3 took 9.813 seconds
4 took 40.237 seconds
```
After
```
Created Example File
1 took 0.009 seconds
2 took 0.018 seconds
3 took 0.036 seconds
4 took 0.072 seconds
```
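As a rough check on the scaling, the timings above can be turned into growth exponents. This is only a sketch using the eight numbers reported in the "Before" and "After" output; the helper name `scaling_exponents` is made up for illustration. An exponent near 2 means the time quadruples when the input doubles, and an exponent near 1 means it doubles.

```python
import math

# Timings in seconds, copied from the "Before" and "After" output above.
sizes = [500, 1000, 2000, 4000]
before = [0.615, 2.446, 9.813, 40.237]
after = [0.009, 0.018, 0.036, 0.072]


def scaling_exponents(sizes, timings):
    """Estimate k in time ~ n**k from each consecutive pair of measurements."""
    pairs = list(zip(sizes, timings))
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(pairs, pairs[1:])
    ]


before_exponents = scaling_exponents(sizes, before)
after_exponents = scaling_exponents(sizes, after)
print("before:", [round(k, 2) for k in before_exponents])
print("after: ", [round(k, 2) for k in after_exponents])
```

The "Before" exponents come out close to 2 and the "After" exponents close to 1, which matches the O(n^2) behaviour described in the JIRA ticket and linear behaviour with the change.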
TODO:
* [ ] Overload the existing Cython function rather than adding a new one.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kshitij12345 commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1141334468
Ah. That is neat :)
Closing in favour of #13265
[GitHub] [arrow] kshitij12345 commented on pull request #13234: [16613] speed-up parquet.write_metadata
Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1137841256
cc: @AlenkaF
[GitHub] [arrow] AlenkaF commented on pull request #13234: [16613] speed-up parquet.write_metadata
Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1138142265
Thanks @kshitij12345 for contributing!
The current work looks perfect. I am interested to see it continued: joining the two Cython functions into one that accepts either a single `FileMetaData` object or a list of them, and adding the tests.
Could you change the name of the PR to `ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)` so it connects to the JIRA ticket you are working on?
And I will ask for the workflows to get started on this PR.
[GitHub] [arrow] pitrou commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1141263332
I submitted a much simpler fix in https://github.com/apache/arrow/pull/13265
[GitHub] [arrow] github-actions[bot] commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1138219629
:warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.
[GitHub] [arrow] AlenkaF commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1138578343
Thank you for your work! This PR looks great to me.
@jorisvandenbossche @pitrou could you also have a look at it?
[GitHub] [arrow] github-actions[bot] commented on pull request #13234: [16613] speed-up parquet.write_metadata
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1137839116
Thanks for opening a pull request!
If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
Then could you also rename the pull request title to the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] kshitij12345 closed pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Posted by GitBox <gi...@apache.org>.
kshitij12345 closed pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
URL: https://github.com/apache/arrow/pull/13234
[GitHub] [arrow] github-actions[bot] commented on pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13234:
URL: https://github.com/apache/arrow/pull/13234#issuecomment-1138219601
https://issues.apache.org/jira/browse/ARROW-16613
[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13234: ARROW-16613: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13234:
URL: https://github.com/apache/arrow/pull/13234#discussion_r882390054
##########
cpp/src/parquet/metadata.cc:
##########
@@ -664,6 +664,34 @@ class FileMetaData::FileMetaDataImpl {
}
}
+ void AppendRowGroups(const std::vector<std::shared_ptr<FileMetaData>>& others) {
+ // Figure out the total num_groups and reserve vector accordinly.
Review Comment:
accordinly -> accordingly
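For readers unfamiliar with why sizing the destination once can matter, here is a toy model in Python. It is not Arrow's code and makes no claim about the actual cause of the slowdown in `write_metadata`; it only assumes, for illustration, that a one-at-a-time append rebuilds the accumulated list on every call. The function names are hypothetical.

```python
# Toy model only: it counts element copies and is not Arrow's implementation.


def copies_one_at_a_time(num_files, groups_per_file=1):
    """Append each file's row groups separately, rebuilding the list each time."""
    row_groups = []
    copies = 0
    for _ in range(num_files):
        new_groups = [object()] * groups_per_file
        # Assumption: rebuilding copies every existing element plus the new ones.
        row_groups = row_groups + new_groups
        copies += len(row_groups)
    return copies


def copies_batched(num_files, groups_per_file=1):
    """Size the destination once, then copy each row group exactly once."""
    total = num_files * groups_per_file
    row_groups = [None] * total  # loosely analogous to reserving up front
    copies = 0
    for i in range(total):
        row_groups[i] = object()
        copies += 1
    return copies


for n in (500, 1000, 2000, 4000):
    print(n, copies_one_at_a_time(n), copies_batched(n))
```

Under that assumption the one-at-a-time count grows as n(n+1)/2 while the batched count grows as n, which is the same shape as the timings reported when the pull request was opened.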