Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/16 12:14:57 UTC

[GitHub] [arrow] Dandandan opened a new pull request #9214: [Arrow][DataFusion] Mem table repartition [WIP]

Dandandan opened a new pull request #9214:
URL: https://github.com/apache/arrow/pull/9214


   I think being able to repartition an in-memory table is useful: the repartitioning only needs to be applied once, and it is quite cheap. This can be very useful for in-memory analytics.
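
To make the idea concrete, here is a minimal std-only sketch of repartitioning an in-memory table once at load time. It assumes batches are distributed round-robin; the thread does not show which mechanism the PR actually uses. The function names `repartition_round_robin` and `load_partitions` are hypothetical, the generic `T` stands in for a record batch, and only the `output_partitions` option mirrors the diff under review.

```rust
/// Distribute `batches` across `num_partitions` output partitions in
/// round-robin order. `T` stands in for a record batch (hypothetical helper).
fn repartition_round_robin<T>(batches: Vec<T>, num_partitions: usize) -> Vec<Vec<T>> {
    assert!(num_partitions > 0, "need at least one output partition");
    let mut partitions: Vec<Vec<T>> = (0..num_partitions).map(|_| Vec::new()).collect();
    for (i, batch) in batches.into_iter().enumerate() {
        // Batch i goes to partition i mod num_partitions.
        partitions[i % num_partitions].push(batch);
    }
    partitions
}

/// Repartition only when an output partition count is requested;
/// otherwise keep the input partitioning unchanged (hypothetical helper).
fn load_partitions<T>(input: Vec<Vec<T>>, output_partitions: Option<usize>) -> Vec<Vec<T>> {
    match output_partitions {
        Some(num_partitions) => {
            // Flatten the existing partitions, then redistribute the batches.
            let batches: Vec<T> = input.into_iter().flatten().collect();
            repartition_round_robin(batches, num_partitions)
        }
        None => input,
    }
}
```

Calling `load_partitions(vec![vec![1, 2, 3, 4]], Some(2))` splits the single input partition into two. Because this happens once at load time, every later query sees the new partitioning without paying for it again.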
   
   The speedup from repartitioning is large, mainly on aggregates. On my 8-core machine with 16 partitions: 6-7x on queries 1 and 12 versus a single partition, and a somewhat smaller difference on query 5, with very high CPU utilization.
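
As a rough illustration of why aggregates benefit, the sketch below computes a partial sum per partition in parallel and then combines the partial results. It uses plain OS threads from the standard library as a stand-in; this is an assumption for illustration and not DataFusion's execution model. `parallel_sum` is a hypothetical name.

```rust
use std::thread;

/// Sum each partition on its own thread, then combine the partial sums.
/// With a single partition all the work lands on one thread; with several
/// partitions the per-partition sums can run on different cores.
fn parallel_sum(partitions: Vec<Vec<i64>>) -> i64 {
    // Spawn one worker per partition; each worker owns its partition.
    let handles: Vec<thread::JoinHandle<i64>> = partitions
        .into_iter()
        .map(|partition| thread::spawn(move || partition.iter().sum::<i64>()))
        .collect();

    // Final aggregation step: combine the partial results.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("partition worker panicked"))
        .sum()
}
```

The result is the same for any partitioning of the input; only the amount of work that can proceed concurrently changes.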
   
   @jorgecarleitao this may be of interest to you, since you mentioned you are looking into multi-threading. I think this is a "high-level" way to get more parallelism. We could also apply repartitioning in some optimizer rules and/or dynamically, similar to what's described in https://issues.apache.org/jira/browse/ARROW-9464


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9214: ARROW-11268: [Arrow][DataFusion] Mem table repartition

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9214:
URL: https://github.com/apache/arrow/pull/9214#issuecomment-761556616


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=h1) Report
   > Merging [#9214](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=desc) (8ba5828) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.01%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9214      +/-   ##
   ==========================================
   - Coverage   81.61%   81.59%   -0.02%     
   ==========================================
     Files         215      215              
     Lines       51867    51877      +10     
   ==========================================
     Hits        42329    42329              
   - Misses       9538     9548      +10     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9214/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.98% <0.00%> (-5.30%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=footer). Last update [eaa7b7a...afd6528](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   





[GitHub] [arrow] alamb commented on a change in pull request #9214: ARROW-11268: [Rust][DataFusion] MemTable::load output partition support

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9214:
URL: https://github.com/apache/arrow/pull/9214#discussion_r559487823



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -66,6 +66,10 @@ struct BenchmarkOpt {
     /// Load the data into a MemTable before executing the query
     #[structopt(short = "m", long = "mem-table")]
     mem_table: bool,
+
+    /// Number of partitions to use when using MemTable

Review comment:
       ```suggestion
       /// Number of partitions to create when using MemTable as input
       ```

##########
File path: rust/datafusion/src/datasource/memory.rs
##########
@@ -126,6 +134,28 @@ impl MemTable {
             data.push(result);
         }
 
+        let exec = MemoryExec::try_new(&data, schema.clone(), None)?;
+
+        if let Some(num_partitions) = output_partitions {

Review comment:
       👍 







[GitHub] [arrow] github-actions[bot] commented on pull request #9214: ARROW-11268: [Arrow][DataFusion] Mem table repartition [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9214:
URL: https://github.com/apache/arrow/pull/9214#issuecomment-761554645


   https://issues.apache.org/jira/browse/ARROW-11268





[GitHub] [arrow] codecov-io edited a comment on pull request #9214: ARROW-11268: [Rust][DataFusion] MemTable output partition support

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9214:
URL: https://github.com/apache/arrow/pull/9214#issuecomment-761556616


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=h1) Report
   > Merging [#9214](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=desc) (9750ead) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.02%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9214      +/-   ##
   ==========================================
   - Coverage   81.61%   81.58%   -0.03%     
   ==========================================
     Files         215      215              
     Lines       51867    51882      +15     
   ==========================================
     Hits        42329    42329              
   - Misses       9538     9553      +15     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9214/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `12.09% <0.00%> (-0.10%)` | :arrow_down: |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9214/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.00% <0.00%> (-6.28%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=footer). Last update [eaa7b7a...9750ead](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   





[GitHub] [arrow] github-actions[bot] commented on pull request #9214: [Arrow][DataFusion] Mem table repartition [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9214:
URL: https://github.com/apache/arrow/pull/9214#issuecomment-761554183


   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename the pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   





[GitHub] [arrow] Dandandan commented on pull request #9214: ARROW-11268: [Rust][DataFusion] MemTable::load output partition support

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9214:
URL: https://github.com/apache/arrow/pull/9214#issuecomment-761821904


   This would also help us in the db-benchmark https://github.com/h2oai/db-benchmark/pull/182





[GitHub] [arrow] jorgecarleitao closed pull request #9214: ARROW-11268: [Rust][DataFusion] MemTable::load output partition support

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #9214:
URL: https://github.com/apache/arrow/pull/9214


   





[GitHub] [arrow] codecov-io edited a comment on pull request #9214: ARROW-11268: [Rust][DataFusion] MemTable output partition support

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9214:
URL: https://github.com/apache/arrow/pull/9214#issuecomment-761556616


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=h1) Report
   > Merging [#9214](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=desc) (34bd32f) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.02%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9214      +/-   ##
   ==========================================
   - Coverage   81.61%   81.58%   -0.03%     
   ==========================================
     Files         215      215              
     Lines       51867    51882      +15     
   ==========================================
     Hits        42329    42329              
   - Misses       9538     9553      +15     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9214/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `12.09% <0.00%> (-0.10%)` | :arrow_down: |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9214/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.00% <0.00%> (-6.28%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=footer). Last update [eaa7b7a...34bd32f](https://codecov.io/gh/apache/arrow/pull/9214?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   

