Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/20 16:35:44 UTC

[GitHub] [arrow-datafusion] tustvold opened a new issue, #2292: Add SchemaAdapterExec

tustvold opened a new issue, #2292:
URL: https://github.com/apache/arrow-datafusion/issues/2292

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Part of #2079, related to #2170
   
   Currently, schema adaptation is handled within each of the file-format-specific operators. As described in #2079, this has a number of drawbacks.
   
   **Describe the solution you'd like**
   
   I would like a `SchemaAdapterExec` that can be created with a provided `Schema` and a child `ExecutionPlan`. It would then adapt the schema of the batches returned by this inner `ExecutionPlan` to match the provided `Schema`, creating null columns as necessary.
   
   This can likely reuse the existing `SchemaAdapter`.
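
As a rough illustration of the adaptation step described above, here is a minimal sketch using only the Rust standard library. It does not use the Arrow or DataFusion types: `Batch`, `Column`, and `adapt_batch` are hypothetical stand-ins invented for this example, and every column is assumed to hold nullable 64-bit integers.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow array: every value is nullable.
type Column = Vec<Option<i64>>;

/// Hypothetical stand-in for a record batch: column names paired with data.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Column>,
    num_rows: usize,
}

/// Rebuild `batch` so that its columns follow the `target` field order.
/// Fields missing from the batch become all-null columns; batch columns
/// that are not named in `target` are dropped.
fn adapt_batch(target: &[&str], batch: &Batch) -> Batch {
    // Index the incoming columns by name for lookup.
    let by_name: HashMap<&str, &Column> = batch
        .names
        .iter()
        .map(String::as_str)
        .zip(batch.columns.iter())
        .collect();

    let columns = target
        .iter()
        .map(|name| match by_name.get(name) {
            // The batch has this field: copy its column across.
            Some(col) => (*col).clone(),
            // The batch lacks this field: synthesize a null column.
            None => vec![None; batch.num_rows],
        })
        .collect();

    Batch {
        names: target.iter().map(|s| s.to_string()).collect(),
        columns,
        num_rows: batch.num_rows,
    }
}
```

Calling `adapt_batch(&["b", "a"], &batch)` on a batch that only contains column `a` would return a batch whose first column `b` is entirely null and whose second column is the original `a`.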
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold closed issue #2292: Add SchemaAdapterExec

Posted by GitBox <gi...@apache.org>.

tustvold closed issue #2292: Add SchemaAdapterExec
URL: https://github.com/apache/arrow-datafusion/issues/2292



[GitHub] [arrow-datafusion] thinkharderdev commented on issue #2292: Add SchemaAdapterExec

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on issue #2292:
URL: https://github.com/apache/arrow-datafusion/issues/2292#issuecomment-1105155180

   > * The ParquetExec, etc... needs to provide a schema at plan time
   > * It needs to yield batches that match this schema
   > * Depending on the catalog, the individual files might not match this schema, but must be compatible with it
   
   Yeah, exactly



[GitHub] [arrow-datafusion] thinkharderdev commented on issue #2292: Add SchemaAdapterExec

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on issue #2292:
URL: https://github.com/apache/arrow-datafusion/issues/2292#issuecomment-1105010711

   I have some concerns about this. The problem is that this sort of assumes that we actually know at planning time what the schema for each individual file is in a `ListingScan`. And if you infer the schemas at planning time and merge them together to get the table schema, then that is true. But since this happens during planning and can be quite expensive, I suspect that real-world use cases will leverage some sort of metadata catalog to get the merged schema for a logical table instead of re-deriving it for each query. In that case we have no idea what the individual file schemas are.



[GitHub] [arrow-datafusion] tustvold commented on issue #2292: Add SchemaAdapterExec

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #2292:
URL: https://github.com/apache/arrow-datafusion/issues/2292#issuecomment-1105031204

   Thank you for bringing this up. To check my understanding, let me phrase it differently:
   
   * The ParquetExec, etc... needs to provide a schema at plan time
   * It needs to yield batches that match this schema
   * Depending on the catalog, the individual files might not match this schema, but must be compatible with it
   
   I agree that there doesn't appear to be a way around this without the file operator handling the schema adaptation. I will close this and update the other tickets accordingly. Thank you :+1:
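
To make the three points above concrete, here is a minimal sketch of the per-file step that the file operator would have to perform once a file is opened and its schema becomes known. It uses only the Rust standard library: `DataType`, `Field`, and `map_file_to_table` are hypothetical names for this example and are not the DataFusion API. "Compatible" is assumed here to mean that any column present in both schemas has the same type; the thread does not define it.

```rust
/// Hypothetical, reduced set of column types for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

/// Hypothetical stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// For each field of the table schema (known at plan time), find the index
/// of the matching column in the file schema (known only once the file is
/// opened). `None` marks a column the file lacks, to be filled with nulls.
/// Returns an error if a shared column has a different type, which this
/// sketch treats as incompatible.
fn map_file_to_table(
    table: &[Field],
    file: &[Field],
) -> Result<Vec<Option<usize>>, String> {
    table
        .iter()
        .map(|tf| match file.iter().position(|ff| ff.name == tf.name) {
            Some(idx) if file[idx].data_type == tf.data_type => Ok(Some(idx)),
            Some(_) => Err(format!(
                "column '{}' has an incompatible type in this file",
                tf.name
            )),
            None => Ok(None),
        })
        .collect()
}
```

Because the mapping depends on each file's own schema, it cannot be computed once at plan time when the table schema comes from a catalog, which is why the adaptation ends up inside the file operator.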



[GitHub] [arrow-datafusion] alamb commented on issue #2292: Add SchemaAdapterExec

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2292:
URL: https://github.com/apache/arrow-datafusion/issues/2292#issuecomment-1104467552

   Note that the existing `SchemaAdapter` is here: https://github.com/apache/arrow-datafusion/blob/9815ac6ecc2aee7fbbafa09c704ca81b0225221e/datafusion/core/src/physical_plan/file_format/mod.rs#L205
   
   IOx has most of the necessary code for this logic here: https://github.com/influxdata/influxdb_iox/blob/5488c257d1bbb9a9b2f6882444b9e88098e53fdc/query/src/provider/adapter.rs#L45-L80

