You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/23 22:33:50 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #4349: DataFusion Configuration Consolidation

alamb opened a new issue, #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349

   
   # Related
   https://github.com/apache/arrow-ballista/issues/479
   https://github.com/apache/arrow-datafusion/issues/4346
   
   # Introduction
   
   "Configuration" in DataFusion has a few usecases:
   * set a values from a string provided by the user (`set XX = YY` in datafusion-cli)
   * set a value from environment variables (See https://docs.rs/datafusion/14.0.0/datafusion/config/struct.ConfigOptions.html#method.from_env)
   * display as a string (e.g. `SHOW` in datafusion-cli`)
   * Documented in the datafusion docs:
   * serialize / deserialize over the network (e.g. Ballista)
   * settable programmatically (e.g. via the dataframe API)
   
   There are effectively two overlapping levels of configuration:
   * Session level (e.g. that can be reused from one query execution to the next)
   * Statement level (e.g. that is needed to plan a query and doesn't change for the duration of a statement such as "the value of now()" and target batch sizes, etc)
   
   
   # Current state of configuration in DataFusion
   
   The current state is .... inconsistent to put it mildly.
   
   The core structure is "SessionContext" which is the final glue that puts a datafusion session (e.g. tables provided, etc) and serves as the high level entrypoint into most DataFusion functionality
   
   There is then some combinationof SessionState, SessionConfig, ConfigOptions,
   
   
   SessionContext
    -- Has a SessionState
    -- SessionStartTime
    -- SessionConfig
      -- ConfigOptions
   
   
   `SessionConfig` is effectively the Session level configuration I describe above.
   
   TaskContext is  meant to be the statement level (aka per task / per query) level context. If you look hard  you can see has a copy of the SessionConfig (buried in TaskPropertoes) or also maybe is backed by KVPairs.
   
   # Desire
   I would like to have a clear configuration story that clearly separated the statement level config from the task level config and made all configuration values easy to introspect programatically
   
   Recommendations
   This is a complicated issue and I don't have a magic answer. However I have some concrete suggestions
   
   Some suggested steps:
   - [] Remove the `KFPairs` stuff in TaskProperties as that is not used by ballista https://github.com/search?q=repo%3Aapache%2Farrow-ballista%20KVPairs&type=code and has been superceded by ConfigOptions and the TaskProperties wrappee
   - [ ] Consolidate SessionConfig / ConfigOptions (my preference would be to inline ConfigOptions in SessionConfig
   
   I think consolidating SessionConfig/Config options is likely to be the most controversial / cause the most chrun but it will provide immense benefits I think.
   
   Then we can further improve from there
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1351363475

   > If a TaskContext represents a query/task
   
   I think it might be more precise to describe `TaskContext` as the state for query execution, with `SessionState` used for query planning.
   
   > the only way right now to identify task in TableProvider
   
   Correct, at the point in planning where `TableProvider::scan` is called, the task conceptually doesn't exist yet
   
   > create new SessionContext for each query
   
   FWIW this is what IOx does


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1328485457

   > 🤔 there is also ExecutionProps which is some subset of the TaskProperties 🤔
   
   We should consolidate ExecutionProps and TaskProperties also. 
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1328491946

   > * [ ]  Remove the `KFPairs` stuff in TaskProperties as that is not used by ballista https://github.com/search?q=repo%3Aapache%2Farrow-ballista%20KVPairs&type=code and has been superceded by ConfigOptions and the TaskProperties wrappee
   
   No, those KV pairs in TaskProperties are used by Ballista actually.  Please check those code snippets:
   Configurations are serialized in the form of KV pairs and sent from Ballista Scheduler to Ballista Executor.
   
   https://github.com/apache/arrow-ballista/blob/acbdf608807697e72122a249c4bf7505274906d0/ballista/executor/src/execution_loop.rs#L172-L95
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Cheappie commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
Cheappie commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1351211475

   If a TaskContext represents a query/task, then could `TableProvider` receive in scan method `TaskContext` as replacement for `SessionState` ? I would like to have a way to identify task in `TableProvider` `scan` fn. If I am not wrong, the only way right now to identify task in `TableProvider` is to create new `SessionContext` for each query and use `session_id` from `SessionState`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Cheappie commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
Cheappie commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1351414264

   Thank you @tustvold for clarification, in such case I will follow `SessionContext` per query approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1325742564

   🤔  there is also ExecutionProps which is some subset of the TaskProperties 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1328484278

   > I think consolidating SessionConfig/Config options is likely to be the most controversial / cause the most chrun but it will provide immense benefits I think (like runtime visibility into the current settings)
   
   I think consolidating SessionConfig/ConfigOptions is definitely the right direction. The debate should be how to consolidate them, either keep SessionConfig or keep ConfigOptions.  I do not have a clear preference on this either. maybe we can have a vote. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1331402197

   Here is my next contribution to clean up configuration: https://github.com/apache/arrow-datafusion/pull/4427 (slowly consolidating the configurations)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4349: DataFusion Configuration Consolidation

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1650173678

   I think this is largely complete and we don't have any additional work planned here, so closing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1328649599

   I'm OK with the change in this PR. Ballista should still work after this PR. 
   
   https://github.com/apache/arrow-datafusion/pull/4382


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4349: DataFusion Configuration Consolidation

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4349:
URL: https://github.com/apache/arrow-datafusion/issues/4349#issuecomment-1325737060

   However, one question, raised by @avantgardnerio  in https://github.com/apache/arrow-ballista/issues/479, is how to structure the interactions with the configuration code. 
   
   It turns out that `SessionConfig` has a more rust struct style and `ConfigOptions` is a more dynamic hashmap style. I honestly don't have a preference as long as the introspection (with docstrings) features of ConfigOptions are preserved. Maybe this is possible via macro magic


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #4349: DataFusion Configuration Consolidation

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #4349: DataFusion Configuration Consolidation
URL: https://github.com/apache/arrow-datafusion/issues/4349


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org