Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/15 20:36:21 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue #349: Ballista context should get file metadata from scheduler, not from local disk

andygrove opened a new issue #349:
URL: https://github.com/apache/arrow-datafusion/issues/349


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I have a Ballista cluster running, and each scheduler and executor has access to TPC-H data locally.
   I am running the benchmark client on my desktop, and I do not have access to the data locally.
   Query planning fails with "file not found" because `BallistaContext::read_parquet` is looking for the file on the local file system when it should be getting the file metadata from a scheduler in the cluster.
   
   **Describe the solution you'd like**
   The context should send a gRPC request to the scheduler to get the necessary metadata.
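   
   A minimal sketch of the shape such an exchange could take (all names below are hypothetical and not the actual Ballista protobuf messages or Rust APIs):
   
   ```rust
   // Hypothetical sketch only: these names are illustrative and are NOT the
   // actual Ballista protobuf definitions or Rust APIs.
   
   /// Request the client would send instead of reading the file locally.
   pub struct GetFileMetadataRequest {
       /// Path as visible to the scheduler/executors (not the client).
       pub path: String,
       /// File format hint, e.g. "parquet".
       pub file_format: String,
   }
   
   /// Response carrying what the client needs for logical planning.
   pub struct GetFileMetadataResponse {
       /// Arrow schema serialized in the IPC format.
       pub arrow_schema_ipc: Vec<u8>,
       /// Total row count, if known from the Parquet footers.
       pub num_rows: Option<u64>,
   }
   
   /// Planning-time contract: metadata comes from the scheduler, not local disk.
   pub trait SchedulerMetadataClient {
       fn get_file_metadata(
           &self,
           request: GetFileMetadataRequest,
       ) -> Result<GetFileMetadataResponse, String>;
   }
   ```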
   
   **Describe alternatives you've considered**
   None
   
   **Additional context**
   None
   


[GitHub] [arrow-datafusion] yahoNanJing commented on issue #349: Ballista context should get file metadata from scheduler, not from local disk

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on issue #349:
URL: https://github.com/apache/arrow-datafusion/issues/349#issuecomment-1024820528


   Hi @andygrove, we have integrated Ballista with HDFS support. Our workaround is to make the file path self-describing. For example, a local file path should be file://tmp/..., and an HDFS file path should be hdfs://localhost:xxx:/tmp/...
   
   To make this work, we also changed the object store API a bit. I'll create a PR for this later.
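   
   A minimal sketch of that scheme-based dispatch, assuming only the file:// and hdfs:// prefixes described above (the actual object store API changes are not reproduced here):
   
   ```rust
   // Minimal sketch, assuming self-describing paths as described above.
   fn object_store_scheme(path: &str) -> &str {
       // Fall back to the local filesystem when no scheme prefix is present.
       match path.split_once("://") {
           Some((scheme, _rest)) => scheme,
           None => "file",
       }
   }
   
   fn main() {
       // The host/port in the hdfs path is an arbitrary placeholder.
       assert_eq!(object_store_scheme("file://tmp/lineitem.parquet"), "file");
       assert_eq!(object_store_scheme("hdfs://localhost:8020/tmp/lineitem.parquet"), "hdfs");
       assert_eq!(object_store_scheme("/tmp/lineitem.parquet"), "file");
   }
   ```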



[GitHub] [arrow-datafusion] rdettai commented on issue #349: Ballista context should get file metadata from scheduler, not from local disk

Posted by GitBox <gi...@apache.org>.
rdettai commented on issue #349:
URL: https://github.com/apache/arrow-datafusion/issues/349#issuecomment-912419254


   @andygrove as the client only handles the logical plan, I think it does not need to know about the list of files or the statistics; it only needs the schema:
   - with the current DataFusion implementation, we could just build a table provider without any statistics on the client, and then load the statistics once the logical plan is deserialized on the scheduler (cost-based optimizations would be ineffective on the client, but that is not a big issue as we could run them on the scheduler instead)
   - in #962 I am proposing a change that would completely move the statistics from the logical plan to the physical plan
   
   As Flight already has an endpoint to query the schema, this would avoid creating and maintaining a new one 😃 
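   
   For illustration, a minimal sketch of that split (all types below are hypothetical, not DataFusion/Ballista APIs): the client builds a provider from the schema alone, and the scheduler attaches statistics after deserializing the plan.
   
   ```rust
   // Hypothetical sketch only; none of these types are real DataFusion/Ballista APIs.
   #[derive(Debug)]
   struct Schema {
       column_names: Vec<String>,
   }
   
   #[derive(Debug, Default)]
   struct Statistics {
       num_rows: Option<usize>,
   }
   
   #[derive(Debug)]
   struct TableProvider {
       schema: Schema,
       statistics: Statistics,
   }
   
   /// Client side: only the schema is needed to build the logical plan,
   /// so no file listing or statistics loading happens here.
   fn build_provider_on_client(schema: Schema) -> TableProvider {
       TableProvider { schema, statistics: Statistics::default() }
   }
   
   /// Scheduler side: after deserializing the logical plan, fill in statistics
   /// from the files the scheduler can actually reach.
   fn attach_statistics_on_scheduler(provider: &mut TableProvider, num_rows: usize) {
       provider.statistics.num_rows = Some(num_rows);
   }
   
   fn main() {
       let schema = Schema { column_names: vec!["l_orderkey".to_string()] };
       let mut provider = build_provider_on_client(schema);
       attach_statistics_on_scheduler(&mut provider, 6_000_000);
       println!("{} column(s), rows = {:?}", provider.schema.column_names.len(), provider.statistics.num_rows);
   }
   ```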


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


