You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/23 14:10:41 UTC

[GitHub] [arrow] alamb opened a new pull request #8031: ARROW-9815: [Rust][DataFusion] Add a trait for looking up scalar functions by name

alamb opened a new pull request #8031:
URL: https://github.com/apache/arrow/pull/8031


   Inspired by the conversation on https://github.com/apache/arrow/pull/8018/files, I have been bothered by the use of Arc/Mutex and the resulting code complication in ExecutionContext and LogicalPlanning.
   
   The more I read the code, the more I am convinced that `ExecutionContextState` needs to be mutable *before* planning, but once planning has started the state is all read only (and any relevant state is copied / cloned into the `ExecutionPlan`) so the only thing Mutex/Arc are doing is making lurking bugs like ARROW-9815 more likely and requiring copies. 
   
   This PR proposes a modest change to add a trait for looking up scalar functions by name, and thus removes the direct use of Box / HashMaps / etc in the TypeCoercion optimizer pass.
   
   I have several other changes in mind to avoid the need for Box / Mutex entirely in ExecutionContext but I want to keep the individual PRs small.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8031: ARROW-9815: [Rust][DataFusion] Add a trait for looking up scalar functions by name

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8031:
URL: https://github.com/apache/arrow/pull/8031#discussion_r475224655



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -189,16 +190,7 @@ impl ExecutionContext {
     /// Register a scalar UDF
     pub fn register_udf(&mut self, f: ScalarFunction) {
         let mut state = self.state.lock().expect("failed to lock mutex");
-        state.scalar_functions.insert(f.name.clone(), Box::new(f));
-    }
-
-    /// Get a reference to the registered scalar functions
-    pub fn scalar_functions(&self) -> Box<HashMap<String, Box<ScalarFunction>>> {

Review comment:
       This function actually deep-copied the HashMap. After this PR it does not

##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -317,22 +309,9 @@ impl ExecutionContext {
 
     /// Optimize the logical plan by applying optimizer rules
     pub fn optimize(&self, plan: &LogicalPlan) -> Result<LogicalPlan> {
-        let rules: Vec<Box<dyn OptimizerRule>> = vec![
-            Box::new(ProjectionPushDown::new()),
-            Box::new(FilterPushDown::new()),
-            Box::new(TypeCoercionRule::new(
-                self.state
-                    .lock()
-                    .expect("failed to lock mutex")
-                    .scalar_functions
-                    .as_ref(),
-            )),
-        ];
-        let mut plan = plan.clone();
-
-        for mut rule in rules {
-            plan = rule.optimize(&plan)?;
-        }
+        let plan = ProjectionPushDown::new().optimize(&plan)?;
+        let plan = FilterPushDown::new().optimize(&plan)?;
+        let plan = TypeCoercionRule::new(self).optimize(&plan)?;

Review comment:
       Now, this code makes it clear that the TypeCoercisonRule optimizer pass is a short lived object (and can't possibly outlive the `ExecutionContext`)

##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -427,6 +406,16 @@ impl ExecutionContext {
     }
 }
 
+impl ScalarFunctionProvider for ExecutionContext {
+    fn lookup(&self, name: &str) -> Option<Arc<ScalarFunction>> {
+        self.state

Review comment:
       My long term plan is to remove the Mutex entirely from `ExecutionContext` and thus this can be made simpler

##########
File path: rust/datafusion/src/optimizer/type_coercion.rs
##########
@@ -38,17 +35,21 @@ use utils::optimize_explain;
 /// Optimizer that applies coercion rules to expressions in the logical plan.
 ///
 /// This optimizer does not alter the structure of the plan, it only changes expressions on it.
-pub struct TypeCoercionRule {
-    scalar_functions: Arc<HashMap<String, Box<ScalarFunction>>>,
+pub struct TypeCoercionRule<'a, P>
+where
+    P: ScalarFunctionProvider,
+{
+    scalar_functions: &'a P,
 }
 
-impl TypeCoercionRule {
+impl<'a, P> TypeCoercionRule<'a, P>
+where
+    P: ScalarFunctionProvider,
+{
     /// Create a new type coercion optimizer rule using meta-data about registered
     /// scalar functions
-    pub fn new(scalar_functions: &HashMap<String, Box<ScalarFunction>>) -> Self {
-        Self {
-            scalar_functions: Arc::new(scalar_functions.clone()),

Review comment:
       This used to copy the entire HashMap




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8031: ARROW-9815: [Rust][DataFusion] Add a trait for looking up scalar functions by name

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8031:
URL: https://github.com/apache/arrow/pull/8031#discussion_r475531252



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -477,9 +466,9 @@ impl ExecutionConfig {
 /// Execution context for registering data sources and executing queries
 pub struct ExecutionContextState {
     /// Data sources that are registered with the context
-    pub datasources: Box<HashMap<String, Box<dyn TableProvider + Send + Sync>>>,
+    pub datasources: HashMap<String, Box<dyn TableProvider + Send + Sync>>,
     /// Scalar functions that are registered with the context
-    pub scalar_functions: Box<HashMap<String, Box<ScalarFunction>>>,
+    pub scalar_functions: HashMap<String, Arc<ScalarFunction>>,

Review comment:
       I originally changed `Box` to `Arc` to avoid a having to deep-copy the `ScalarFunction` object around.
   
   When I tried to  use just `ScalarFunction` without wrapping it in an `Arc` ref count, that meant the interface for `ScalarFunctionRegistry` looked like:
   
   ```
       fn lookup(&self, name: &str) -> Option<&ScalarFunction> {
   ```
   
   And I remember the Rust compiler did not seem to like the idea of returning references out of traits. I don't have the exact error handy and can't remember exactly was wrong. 
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8031: ARROW-9815: [Rust][DataFusion] Add a trait for looking up scalar functions by name

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8031:
URL: https://github.com/apache/arrow/pull/8031#discussion_r475229968



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -477,9 +466,9 @@ impl ExecutionConfig {
 /// Execution context for registering data sources and executing queries
 pub struct ExecutionContextState {
     /// Data sources that are registered with the context
-    pub datasources: Box<HashMap<String, Box<dyn TableProvider + Send + Sync>>>,
+    pub datasources: HashMap<String, Box<dyn TableProvider + Send + Sync>>,
     /// Scalar functions that are registered with the context
-    pub scalar_functions: Box<HashMap<String, Box<ScalarFunction>>>,
+    pub scalar_functions: HashMap<String, Arc<ScalarFunction>>,

Review comment:
       just for the sake of my understanding (prob a very basic CS question): why shouldn't we use simply `ScalarFunction` here? Is so that we can safely share this object with others without having to share the whole `HashMap`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8031: ARROW-9815: [Rust][DataFusion] Add a trait for looking up scalar functions by name

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8031:
URL: https://github.com/apache/arrow/pull/8031#issuecomment-678779883


   FYI @jorgecarleitao  and @andygrove 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8031: ARROW-9815: [Rust][DataFusion] Add a trait for looking up scalar functions by name

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8031:
URL: https://github.com/apache/arrow/pull/8031


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8031: ARROW-9815: [Rust][DataFusion] Add a trait for looking up scalar functions by name

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8031:
URL: https://github.com/apache/arrow/pull/8031#issuecomment-678780263


   https://issues.apache.org/jira/browse/ARROW-9815


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org