Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/12 02:59:10 UTC

[GitHub] [arrow-datafusion] Jeeesie opened a new issue #327: How can I make ballista distributed compute work?

Jeeesie opened a new issue #327:
URL: https://github.com/apache/arrow-datafusion/issues/327


   I want to run the benchmark query q1.sql in distributed mode. I noticed that from_proto.rs has a PhysicalPlanType::ParquetScan case, where ParquetExec::try_from_files() can be used to create several partitions.
   However, the benchmark code does not call this method; it calls read_csv() directly. Why is that, and how can I use ParquetScan?
   
   I also tried to call DataFusion's repartition() function in register_table():
   `let rr_repartition = Partitioning::RoundRobinBatch(3);
   let roundtrip_plan = LogicalPlan::Repartition {
       input: Arc::from(table.to_logical_plan()),
       partitioning_scheme: rr_repartition,
   };
   state
       .tables
       .insert(name.to_owned(), roundtrip_plan);`
   
   but I get the error:
   `General("Invalid LogicalPlan::TableScan")`
   Can you help me resolve this? My goal is to run the benchmark query q1.sql in distributed mode.
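
For context on what `Partitioning::RoundRobinBatch(3)` asks for: a round-robin scheme hands successive record batches to the output partitions in turn. The following is only a minimal sketch in plain Rust with no DataFusion dependency. The function `round_robin` and the integer stand-ins for batches are hypothetical names chosen for illustration, not DataFusion APIs, and the sketch does not address the `Invalid LogicalPlan::TableScan` error itself.

```rust
/// Distribute `batches` across `n` output partitions in round-robin order.
/// Hypothetical helper for illustration only; not part of DataFusion.
fn round_robin<T>(batches: Vec<T>, n: usize) -> Vec<Vec<T>> {
    assert!(n > 0, "need at least one output partition");
    // Start with `n` empty partitions.
    let mut partitions: Vec<Vec<T>> = (0..n).map(|_| Vec::new()).collect();
    // Batch i goes to partition i mod n.
    for (i, batch) in batches.into_iter().enumerate() {
        partitions[i % n].push(batch);
    }
    partitions
}

fn main() {
    // Seven stand-in "batches" spread over three partitions.
    let parts = round_robin((0..7).collect::<Vec<u32>>(), 3);
    for (p, batches) in parts.iter().enumerate() {
        println!("partition {}: {:?}", p, batches);
    }
}
```

With seven batches and three partitions, the first partition receives three batches and the other two receive two each, so the work is spread evenly regardless of how many input files there were.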


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jeeesie edited a comment on issue #327: How can I make ballista distributed compute work?

Posted by GitBox <gi...@apache.org>.

Jeeesie edited a comment on issue #327:
URL: https://github.com/apache/arrow-datafusion/issues/327#issuecomment-840391963


   Thanks.
   Under the data path, each schema has only one data file. As you said, one file maps to one partition, and one partition is executed by only one executor.
   So are there any distributed examples already?
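
The constraint described above can be sketched as a toy model, assuming (as stated in this thread) one partition per input file and one executor per partition. This is not Ballista's actual scheduler: `assign_files`, `busy_executors`, and the file names are hypothetical and exist only to show why a single-file table keeps a single executor busy.

```rust
/// Assign each partition (one per input file) to an executor in turn.
/// Returns, for each executor, the list of files it would scan.
/// Hypothetical model for illustration; not Ballista's scheduler.
fn assign_files(files: &[&str], executors: usize) -> Vec<Vec<String>> {
    assert!(executors > 0, "need at least one executor");
    let mut plan = vec![Vec::new(); executors];
    for (partition, file) in files.iter().enumerate() {
        plan[partition % executors].push(file.to_string());
    }
    plan
}

/// Count the executors that received at least one file.
fn busy_executors(plan: &[Vec<String>]) -> usize {
    plan.iter().filter(|files| !files.is_empty()).count()
}

fn main() {
    // One file per table: only one of three executors gets any work.
    let single = assign_files(&["lineitem.tbl"], 3);
    println!("{} of 3 executors busy", busy_executors(&single));

    // The same table split into four files can keep all three busy.
    let split = assign_files(&["part-0", "part-1", "part-2", "part-3"], 3);
    println!("{} of 3 executors busy", busy_executors(&split));
}
```

Under this model, distributing a query requires the input data itself to be split into several files (or otherwise repartitioned) before the scan.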
   
   
   > The User Guide source is here: https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide
   > 
   > The previously published version is here: https://ballistacompute.org/docs/
   > 
   > I am in the process of updating the user guide, and it will be published to the Arrow web site on the next release.
   
   



[GitHub] [arrow-datafusion] andygrove closed issue #327: How can I make ballista distributed compute work?

Posted by GitBox <gi...@apache.org>.

andygrove closed issue #327:
URL: https://github.com/apache/arrow-datafusion/issues/327


   



[GitHub] [arrow-datafusion] Jeeesie commented on issue #327: How can I make ballista distributed compute work?

Posted by GitBox <gi...@apache.org>.

Jeeesie commented on issue #327:
URL: https://github.com/apache/arrow-datafusion/issues/327#issuecomment-840391963


   Thanks.
   Under the data path, each schema has only one data file. As you said, one file maps to one partition, and one partition is executed by only one executor.
   So I wonder: are there any distributed examples already?
   
   
   
   > The User Guide source is here: https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide
   > 
   > The previously published version is here: https://ballistacompute.org/docs/
   > 
   > I am in the process of updating the user guide, and it will be published to the Arrow web site on the next release.
   
   



[GitHub] [arrow-datafusion] andygrove commented on issue #327: How can I make ballista distributed compute work?

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #327:
URL: https://github.com/apache/arrow-datafusion/issues/327#issuecomment-839763005


   The User Guide source is here: https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide
   
   The previously published version is here: https://ballistacompute.org/docs/
   
   I am in the process of updating the user guide, and it will be published to the Arrow web site on the next release.



[GitHub] [arrow-datafusion] andygrove commented on issue #327: How can I make ballista distributed compute work?

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #327:
URL: https://github.com/apache/arrow-datafusion/issues/327#issuecomment-907650313


   The benchmark crate in the repo can be used for executing fully distributed queries against partitioned data and the [README](https://github.com/apache/arrow-datafusion/tree/master/benchmarks) in there explains how to do this.

