Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/26 13:41:35 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #4382: Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization

alamb opened a new pull request, #4382:
URL: https://github.com/apache/arrow-datafusion/pull/4382

   # Which issue does this PR close?
   
   re https://github.com/apache/arrow-datafusion/issues/4349
   
   # Rationale for this change
   
   Step 1 of N in unraveling the Gordian knot of DataFusion configuration.
   
   While session state does need to be serialized and deserialized as key=value pairs in `SessionContext`, there is no reason to keep those key/value pairs in the actual `TaskContext` after it is created. Having two different representations makes the eventual unification of configuration that much harder.
   
   
   
   # What changes are included in this PR?
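   1. Remove duplication in `From` impls
   2. Remove `TaskProperties` and its key/value variant; `TaskContext` now holds a `SessionConfig` directly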
   
   # Are these changes tested?
   
   Covered by existing tests.
   
   # Are there any user-facing changes?
   
   If you use `TaskProperties` directly you will be impacted, but I don't think anyone does. It is not used by Ballista: https://github.com/search?q=repo%3Aapache%2Farrow-ballista%20KVPairs&type=code
   
   @mingmwang this code was originally added by you in https://github.com/apache/arrow-datafusion/pull/1987. Can you take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4382: Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4382:
URL: https://github.com/apache/arrow-datafusion/pull/4382#issuecomment-1329589191

   I plan to merge this tomorrow unless I hear otherwise. After that I will likely start chipping away at the parquet settings that are scattered around the codebase, so that they become visible via `ConfigOptions`.




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4382: Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4382:
URL: https://github.com/apache/arrow-datafusion/pull/4382#discussion_r1032789766


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1844,55 +1836,52 @@ impl TaskContext {
         aggregate_functions: HashMap<String, Arc<AggregateUDF>>,
         runtime: Arc<RuntimeEnv>,
     ) -> Self {
+        let session_config = if task_props.is_empty() {
+            SessionConfig::new()
+        } else {
+            SessionConfig::new()
+                .with_batch_size(task_props.get(OPT_BATCH_SIZE).unwrap().parse().unwrap())
+                .with_target_partitions(
+                    task_props.get(TARGET_PARTITIONS).unwrap().parse().unwrap(),
+                )
+                .with_repartition_joins(
+                    task_props.get(REPARTITION_JOINS).unwrap().parse().unwrap(),
+                )
+                .with_repartition_aggregations(
+                    task_props
+                        .get(REPARTITION_AGGREGATIONS)
+                        .unwrap()
+                        .parse()
+                        .unwrap(),
+                )
+                .with_repartition_windows(
+                    task_props
+                        .get(REPARTITION_WINDOWS)
+                        .unwrap()
+                        .parse()
+                        .unwrap(),
+                )
+                .with_parquet_pruning(
+                    task_props.get(PARQUET_PRUNING).unwrap().parse().unwrap(),
+                )
+                .with_collect_statistics(
+                    task_props.get(COLLECT_STATISTICS).unwrap().parse().unwrap(),
+                )
+        };
+
         Self {
             task_id: Some(task_id),
             session_id,
-            properties: TaskProperties::KVPairs(task_props),
+            session_config,
             scalar_functions,
             aggregate_functions,
             runtime,
         }
     }
 
     /// Return the SessionConfig associated with the Task
-    pub fn session_config(&self) -> SessionConfig {

Review Comment:
   This conversion from key/value string properties to `SessionConfig` has moved to `TaskContext::new`.
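The eager conversion in the hunk above can be sketched outside DataFusion as follows. This is a minimal illustration, not the code in the PR: `SketchConfig`, its default values, the key-name constants, and the helpers `parse_prop` and `config_from_props` are all hypothetical stand-ins. It keeps the same shape as the diff (an empty map yields the default config, otherwise every key is looked up and parsed), but reports a missing or malformed property as an error instead of calling `unwrap()`.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// Hypothetical, simplified stand-in for DataFusion's `SessionConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchConfig {
    pub batch_size: usize,
    pub target_partitions: usize,
    pub parquet_pruning: bool,
}

impl Default for SketchConfig {
    fn default() -> Self {
        // Arbitrary illustrative defaults, not DataFusion's actual values.
        Self {
            batch_size: 8192,
            target_partitions: 4,
            parquet_pruning: true,
        }
    }
}

// Hypothetical key names; the real constants live in DataFusion.
pub const BATCH_SIZE: &str = "batch_size";
pub const TARGET_PARTITIONS: &str = "target_partitions";
pub const PARQUET_PRUNING: &str = "parquet_pruning";

/// Look up `key` and parse its value, describing what went wrong on failure.
fn parse_prop<T: FromStr>(
    props: &HashMap<String, String>,
    key: &str,
) -> Result<T, String> {
    let raw = props
        .get(key)
        .ok_or_else(|| format!("missing property '{key}'"))?;
    raw.parse::<T>()
        .map_err(|_| format!("invalid value '{raw}' for property '{key}'"))
}

/// Build the typed config once, at construction time, so that the
/// string map does not need to be stored afterwards.
pub fn config_from_props(
    props: &HashMap<String, String>,
) -> Result<SketchConfig, String> {
    // Mirrors the diff: an empty map means "use the defaults".
    if props.is_empty() {
        return Ok(SketchConfig::default());
    }
    Ok(SketchConfig {
        batch_size: parse_prop(props, BATCH_SIZE)?,
        target_partitions: parse_prop(props, TARGET_PARTITIONS)?,
        parquet_pruning: parse_prop(props, PARQUET_PRUNING)?,
    })
}
```

Whether a constructor such as `TaskContext::new` should return a `Result` or panic on bad input is a separate design decision; the sketch only shows where the string-to-typed conversion happens.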



##########
datafusion/core/src/execution/context.rs:
##########
@@ -1810,22 +1810,14 @@ impl FunctionRegistry for SessionState {
     }
 }
 
-/// Task Context Properties
-pub enum TaskProperties {
-    ///SessionConfig
-    SessionConfig(SessionConfig),

Review Comment:
   The core of the change is to use `SessionConfig` everywhere, creating it in the constructor of `TaskContext` if needed.



##########
datafusion/core/src/execution/context.rs:
##########
@@ -1914,39 +1903,22 @@ impl TaskContext {
 /// Create a new task context instance from SessionContext
 impl From<&SessionContext> for TaskContext {
     fn from(session: &SessionContext) -> Self {
-        let session_id = session.session_id.clone();
-        let (config, scalar_functions, aggregate_functions) = {
-            let session_state = session.state.read();
-            (
-                session_state.config.clone(),
-                session_state.scalar_functions.clone(),
-                session_state.aggregate_functions.clone(),
-            )
-        };
-        let runtime = session.runtime_env();
-        Self {
-            task_id: None,
-            session_id,
-            properties: TaskProperties::SessionConfig(config),
-            scalar_functions,
-            aggregate_functions,
-            runtime,
-        }
+        TaskContext::from(&*session.state.read())

Review Comment:
   This is a drive-by cleanup, as the code was duplicated in `impl From<&SessionState> for TaskContext`.
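The delegation between the two `From` impls can be illustrated with a small self-contained sketch. The types below (`SketchConfig`, `SketchState`, `SketchSession`, `SketchTask`) are hypothetical simplifications, not DataFusion's definitions, and the sketch assumes `std::sync::RwLock`, whose `read()` returns a `Result`. The diff calls `read()` without unwrapping, which suggests a different lock type in the real code.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for `SessionConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchConfig {
    pub batch_size: usize,
}

/// Hypothetical stand-in for `SessionState`.
pub struct SketchState {
    pub session_id: String,
    pub config: SketchConfig,
}

/// Hypothetical stand-in for `SessionContext`: it only wraps shared state.
pub struct SketchSession {
    pub state: Arc<RwLock<SketchState>>,
}

/// Hypothetical stand-in for `TaskContext`.
#[derive(Debug, PartialEq)]
pub struct SketchTask {
    pub task_id: Option<String>,
    pub session_id: String,
    pub config: SketchConfig,
}

/// The single place where a task is built from session state.
impl From<&SketchState> for SketchTask {
    fn from(state: &SketchState) -> Self {
        Self {
            task_id: None,
            session_id: state.session_id.clone(),
            config: state.config.clone(),
        }
    }
}

/// Delegates to the impl above instead of repeating the field copying.
impl From<&SketchSession> for SketchTask {
    fn from(session: &SketchSession) -> Self {
        // Take the read lock, then reborrow the guard as `&SketchState`.
        let guard = session.state.read().expect("state lock poisoned");
        SketchTask::from(&*guard)
    }
}
```

With this shape, adding a field to the task type means updating only the `From<&SketchState>` impl.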





[GitHub] [arrow-datafusion] mingmwang commented on pull request #4382: Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #4382:
URL: https://github.com/apache/arrow-datafusion/pull/4382#issuecomment-1328416061

   I will take a closer look at the configuration related issues today.




[GitHub] [arrow-datafusion] alamb merged pull request #4382: Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization

Posted by GitBox <gi...@apache.org>.
alamb merged PR #4382:
URL: https://github.com/apache/arrow-datafusion/pull/4382




[GitHub] [arrow-datafusion] ursabot commented on pull request #4382: Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #4382:
URL: https://github.com/apache/arrow-datafusion/pull/4382#issuecomment-1330832841

   Benchmark runs are scheduled for baseline = 02da32e2ef7bc40a89f024a3c6f1a9540412f636 and contender = 1438bc4ca329e7887ab2dd1c2697ba4038255bdd. 1438bc4ca329e7887ab2dd1c2697ba4038255bdd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1a14d1c7d5b544f79976ac26472a5034...81cac12500ee46ff8ba305a33a35283e/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/6125d0bb3cfd4142af2147e14f4a7702...93a8ccabd37f4aa78f69628d60e8a744/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6b6ee46569104275b94d47661fa554a5...cbed48370fd244ab85065dfc28f68308/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ab456070d7c94bc5b55c9b98c536c442...a973ea34cd2c454495c7ad73ae1dae82/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow-datafusion] mingmwang commented on pull request #4382: Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #4382:
URL: https://github.com/apache/arrow-datafusion/pull/4382#issuecomment-1328647978

   @alamb
   
   I'm OK with the change.  

