
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/16 16:34:28 UTC

[GitHub] [arrow-datafusion] sitano opened a new issue, #2928: Concurrency: changing the number of partitions does not increase concurrency

sitano opened a new issue, #2928:
URL: https://github.com/apache/arrow-datafusion/issues/2928

   **Describe the bug**
   
   Changing the number of target partitions does not improve query execution time.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Take a 10 GB CSV file.
   2. Run the CLI with 1 partition and execute the two statements below; this takes about 40 sec.
   3. -- CREATE EXTERNAL TABLE test (...) STORED AS CSV WITH HEADER ROW LOCATION 'test.csv';
   4. -- SELECT SUM(total_amount) FROM test GROUP BY VendorID;
   5. Run the same statements with 8 partitions (or 1000) on a CPU with 8 physical cores; this takes 38 sec.
   
   **Expected behavior**
   
   At least roughly linear scaling with the number of cores. With 8 partitions I would expect about 40/8 = 5 sec.
   
   **Additional context**
   ```
   let mut session_config = SessionConfig::new()
           .with_information_schema(true)
           .with_target_partitions(args.threads);
   ```
   
   Maybe my patch to the CLI is wrong...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan closed issue #2928: bug: changing the number of partitions does not increase concurrency

Posted by GitBox <gi...@apache.org>.

Dandandan closed issue #2928: bug: changing the number of partitions does not increase concurrency
URL: https://github.com/apache/arrow-datafusion/issues/2928



[GitHub] [arrow-datafusion] Dandandan commented on issue #2928: bug: changing the number of partitions does not increase concurrency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #2928:
URL: https://github.com/apache/arrow-datafusion/issues/2928#issuecomment-1187051725

   Closing this for now. Feel free to reopen if you think it's a bug.



[GitHub] [arrow-datafusion] Dandandan commented on issue #2928: bug: changing the number of partitions does not increase concurrency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #2928:
URL: https://github.com/apache/arrow-datafusion/issues/2928#issuecomment-1186244765

   When reading CSV, DataFusion reads each file sequentially, so the target partitions setting has limited effect on simple queries like this one, where reading the CSV takes most of the time. Also, target partitions defaults to the number of logical cores available on the system.
   For this query, I expect you will get faster results by splitting the CSV into 8 smaller CSVs of equal size (and, if this is for more than testing, by converting directly to Parquet as well).
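
The suggestion to split the CSV could be sketched roughly as follows, using only the Rust standard library. This is a minimal illustration and not part of DataFusion: the function name `split_csv` and the output names `part-{i}.csv` are hypothetical, and the code assumes one record per line (no quoted fields containing newlines) with a header row that should be repeated in every part.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: splits `input` into `parts` CSV files inside `out_dir`,
/// repeating the header line in each part.
/// Rows are dealt round-robin, so part sizes differ by at most one row.
/// Assumes one record per line (no quoted fields containing newlines).
fn split_csv(input: &Path, out_dir: &Path, parts: usize) -> io::Result<Vec<PathBuf>> {
    if parts == 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "parts must be > 0"));
    }

    let mut lines = BufReader::new(File::open(input)?).lines();
    let header = match lines.next() {
        Some(h) => h?,
        None => return Err(io::Error::new(io::ErrorKind::InvalidData, "empty input")),
    };

    // Open one buffered writer per part and write the header into each.
    let mut paths = Vec::with_capacity(parts);
    let mut writers = Vec::with_capacity(parts);
    for i in 0..parts {
        let path = out_dir.join(format!("part-{i}.csv"));
        let mut w = BufWriter::new(File::create(&path)?);
        writeln!(w, "{header}")?;
        paths.push(path);
        writers.push(w);
    }

    // Stream the data rows so the whole input never has to fit in memory.
    for (n, line) in lines.enumerate() {
        writeln!(writers[n % parts], "{}", line?)?;
    }
    for w in &mut writers {
        w.flush()?;
    }
    Ok(paths)
}
```

Calling `split_csv(Path::new("test.csv"), Path::new("parts"), 8)` would write eight files into an existing `parts` directory. Whether the table `LOCATION` can then point at that directory depends on the DataFusion version in use, so treat that step as an assumption to verify.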

