You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/04 10:32:51 UTC

[GitHub] [arrow-datafusion] mingmwang opened a new pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

mingmwang opened a new pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1862.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#discussion_r820227297



##########
File path: datafusion/src/execution/runtime_env.rs
##########
@@ -25,47 +25,110 @@ use crate::{
         memory_manager::{MemoryConsumerId, MemoryManager, MemoryManagerConfig},
     },
 };
-
+use std::fmt;
 use std::fmt::{Debug, Formatter};
+
+use crate::datasource::object_store::{ObjectStore, ObjectStoreRegistry};
+use crate::execution::context::{
+    SessionContextRegistry, TaskContext, TaskContextRegistry, BATCH_SIZE,
+    PARQUET_PRUNING, REPARTITION_AGGREGATIONS, REPARTITION_JOINS, REPARTITION_WINDOWS,
+    TARGET_PARTITIONS,
+};
+use crate::prelude::{SessionConfig, SessionContext};
+use datafusion_common::DataFusionError;
+use once_cell::sync::OnceCell;
+use parking_lot::Mutex;
+use std::path::PathBuf;
 use std::sync::Arc;
 
-#[derive(Clone)]
-/// Execution runtime environment. This structure is passed to the
-/// physical plans when they are run.
+/// Global singleton RuntimeEnv
+pub static RUNTIME_ENV: OnceCell<Arc<RuntimeEnv>> = OnceCell::new();

Review comment:
       I think a single static runtime environment makes sense for Ballista but not for DataFusion (which gets used in a variety of usecases that a single runtime might not be applicable for)
   

##########
File path: datafusion/src/execution/runtime_env.rs
##########
@@ -25,47 +25,110 @@ use crate::{
         memory_manager::{MemoryConsumerId, MemoryManager, MemoryManagerConfig},
     },
 };
-
+use std::fmt;
 use std::fmt::{Debug, Formatter};
+
+use crate::datasource::object_store::{ObjectStore, ObjectStoreRegistry};
+use crate::execution::context::{
+    SessionContextRegistry, TaskContext, TaskContextRegistry, BATCH_SIZE,
+    PARQUET_PRUNING, REPARTITION_AGGREGATIONS, REPARTITION_JOINS, REPARTITION_WINDOWS,
+    TARGET_PARTITIONS,
+};
+use crate::prelude::{SessionConfig, SessionContext};
+use datafusion_common::DataFusionError;
+use once_cell::sync::OnceCell;
+use parking_lot::Mutex;
+use std::path::PathBuf;
 use std::sync::Arc;
 
-#[derive(Clone)]
-/// Execution runtime environment. This structure is passed to the
-/// physical plans when they are run.
+/// Global singleton RuntimeEnv
+pub static RUNTIME_ENV: OnceCell<Arc<RuntimeEnv>> = OnceCell::new();
+
+/// Execution runtime environment. This structure is a singleton for each Scheduler/Executor instance.
 pub struct RuntimeEnv {
-    /// Default batch size while creating new batches
-    pub batch_size: usize,
+    /// Executor Id
+    pub executor_id: Option<String>,
+    /// Local Env
+    pub is_local: bool,
     /// Runtime memory management
     pub memory_manager: Arc<MemoryManager>,
     /// Manage temporary files during query execution
     pub disk_manager: Arc<DiskManager>,
+    /// Object Store that are registered within the Scheduler's or Executors' Runtime
+    pub object_store_registry: Arc<ObjectStoreRegistry>,
+    /// DataFusion task contexts that are registered within the Executors' Runtime
+    pub task_context_registry: Option<Arc<TaskContextRegistry>>,
+    /// DataFusion session contexts that are registered within the Scheduler's Runtime
+    pub session_context_registry: Option<Arc<SessionContextRegistry>>,
 }
 
-impl Debug for RuntimeEnv {
-    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
-        write!(f, "RuntimeEnv")
+impl RuntimeEnv {
+    /// Create an executor env based on configuration
+    pub fn new_executor_env(config: RuntimeConfig, executor_id: String) -> Result<Self> {
+        let RuntimeConfig {
+            memory_manager,
+            disk_manager,
+        } = config;
+        Ok(Self {
+            executor_id: Some(executor_id),
+            is_local: false,
+            memory_manager: MemoryManager::new(memory_manager),
+            disk_manager: DiskManager::try_new(disk_manager)?,
+            object_store_registry: Arc::new(ObjectStoreRegistry::new()),
+            task_context_registry: Some(Arc::new(TaskContextRegistry::new())),
+            session_context_registry: None,
+        })
     }
-}
 
-impl RuntimeEnv {
-    /// Create env based on configuration
-    pub fn new(config: RuntimeConfig) -> Result<Self> {
+    /// Create a scheduler env based on configuration
+    pub fn new_scheduler_env(config: RuntimeConfig) -> Result<Self> {
         let RuntimeConfig {
-            batch_size,
             memory_manager,
             disk_manager,
         } = config;
+        Ok(Self {
+            executor_id: None,
+            is_local: false,
+            memory_manager: MemoryManager::new(memory_manager),
+            disk_manager: DiskManager::try_new(disk_manager)?,
+            object_store_registry: Arc::new(ObjectStoreRegistry::new()),
+            task_context_registry: None,
+            session_context_registry: Some(Arc::new(SessionContextRegistry::new())),
+        })
+    }
 
+    /// Create a local env based on configuration
+    pub fn new_local_env(config: RuntimeConfig) -> Result<Self> {
+        let RuntimeConfig {
+            memory_manager,
+            disk_manager,
+        } = config;
         Ok(Self {
-            batch_size,
+            executor_id: None,
+            is_local: true,
             memory_manager: MemoryManager::new(memory_manager),
             disk_manager: DiskManager::try_new(disk_manager)?,
+            object_store_registry: Arc::new(ObjectStoreRegistry::new()),
+            task_context_registry: None,
+            session_context_registry: Some(Arc::new(SessionContextRegistry::new())),
         })
     }
 
-    /// Get execution batch size based on config
-    pub fn batch_size(&self) -> usize {
-        self.batch_size
+    /// Return the global singleton RuntimeEnv
+    pub fn global() -> &'static Arc<RuntimeEnv> {

Review comment:
       I am concerned that if we have this function, it will become too easy for DataFusion code to accidentally use the global `RuntimeEnv` rather than the one that is passed through
   
   I would prefer that the global runtime is put in ballista (or a separate module in DataFusion)

##########
File path: benchmarks/src/bin/tpch.rs
##########
@@ -1282,10 +1289,10 @@ mod tests {
     async fn run_query(n: usize) -> Result<()> {
         // Tests running query with empty tables, to see whether they run succesfully.
 
-        let config = ExecutionConfig::new()
+        let config = SessionConfig::new()
             .with_target_partitions(1)
             .with_batch_size(10);
-        let mut ctx = ExecutionContext::with_config(config);
+        let ctx = SessionContext::with_config(config, RuntimeEnv::global());

Review comment:
       I like that the RuntimeEnv is explicitly passed here

##########
File path: datafusion/src/execution/runtime_env.rs
##########
@@ -25,47 +25,110 @@ use crate::{
         memory_manager::{MemoryConsumerId, MemoryManager, MemoryManagerConfig},
     },
 };
-
+use std::fmt;
 use std::fmt::{Debug, Formatter};
+
+use crate::datasource::object_store::{ObjectStore, ObjectStoreRegistry};
+use crate::execution::context::{
+    SessionContextRegistry, TaskContext, TaskContextRegistry, BATCH_SIZE,
+    PARQUET_PRUNING, REPARTITION_AGGREGATIONS, REPARTITION_JOINS, REPARTITION_WINDOWS,
+    TARGET_PARTITIONS,
+};
+use crate::prelude::{SessionConfig, SessionContext};
+use datafusion_common::DataFusionError;
+use once_cell::sync::OnceCell;
+use parking_lot::Mutex;
+use std::path::PathBuf;
 use std::sync::Arc;
 
-#[derive(Clone)]
-/// Execution runtime environment. This structure is passed to the
-/// physical plans when they are run.
+/// Global singleton RuntimeEnv
+pub static RUNTIME_ENV: OnceCell<Arc<RuntimeEnv>> = OnceCell::new();
+
+/// Execution runtime environment. This structure is a singleton for each Scheduler/Executor instance.
 pub struct RuntimeEnv {
-    /// Default batch size while creating new batches
-    pub batch_size: usize,
+    /// Executor Id
+    pub executor_id: Option<String>,
+    /// Local Env
+    pub is_local: bool,
     /// Runtime memory management
     pub memory_manager: Arc<MemoryManager>,
     /// Manage temporary files during query execution
     pub disk_manager: Arc<DiskManager>,
+    /// Object Store that are registered within the Scheduler's or Executors' Runtime
+    pub object_store_registry: Arc<ObjectStoreRegistry>,
+    /// DataFusion task contexts that are registered within the Executors' Runtime
+    pub task_context_registry: Option<Arc<TaskContextRegistry>>,
+    /// DataFusion session contexts that are registered within the Scheduler's Runtime
+    pub session_context_registry: Option<Arc<SessionContextRegistry>>,
 }
 
-impl Debug for RuntimeEnv {
-    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
-        write!(f, "RuntimeEnv")
+impl RuntimeEnv {
+    /// Create an executor env based on configuration
+    pub fn new_executor_env(config: RuntimeConfig, executor_id: String) -> Result<Self> {
+        let RuntimeConfig {
+            memory_manager,
+            disk_manager,
+        } = config;
+        Ok(Self {
+            executor_id: Some(executor_id),
+            is_local: false,
+            memory_manager: MemoryManager::new(memory_manager),
+            disk_manager: DiskManager::try_new(disk_manager)?,
+            object_store_registry: Arc::new(ObjectStoreRegistry::new()),
+            task_context_registry: Some(Arc::new(TaskContextRegistry::new())),
+            session_context_registry: None,
+        })
     }
-}
 
-impl RuntimeEnv {
-    /// Create env based on configuration
-    pub fn new(config: RuntimeConfig) -> Result<Self> {
+    /// Create a scheduler env based on configuration
+    pub fn new_scheduler_env(config: RuntimeConfig) -> Result<Self> {
         let RuntimeConfig {
-            batch_size,
             memory_manager,
             disk_manager,
         } = config;
+        Ok(Self {
+            executor_id: None,
+            is_local: false,
+            memory_manager: MemoryManager::new(memory_manager),
+            disk_manager: DiskManager::try_new(disk_manager)?,
+            object_store_registry: Arc::new(ObjectStoreRegistry::new()),

Review comment:
       I like the idea of adding a `object_store_registry`, `task_context_registry` and `session_conctext_registry` to the `RuntimeEnvironment` -- I think having a way to track those items is a good one. 👍 

##########
File path: benchmarks/src/bin/tpch.rs
##########
@@ -1307,7 +1314,7 @@ mod tests {
             // load expected answers from tpch-dbgen
             // read csv as all strings, trim and cast to expected type as the csv string
             // to value parser does not handle data with leading/trailing spaces
-            let mut ctx = ExecutionContext::new();
+            let ctx = SessionContext::new();

Review comment:
       I wonder what you think about calling this function `SessionContext::create_default()` or something to give a hint that it will be connected to the Global runtime environment?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1062565637


   @alamb @yjshen @thinkharderdev 
   Please help to take a look.  I do not use static singleton RuntimeEnv anymore in both DataFusion and Ballista.
   For Ballista, the SessionContextRegistry is managed by the Ballista SchedulerServer to support long running sessions.
   
   I had verified that the Ballista client can correctly set the target partitions, batch size to Scheduler and propagate to Executor/Tasks. Issue https://github.com/apache/arrow-datafusion/issues/1848 was also fixed.
   
   @yahoNanJing 
   Please help to take a look at the Ballista part changes.
   
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1061137810


   I think another consideration is 
   
   I agree with @yjshen (and I think @thinkharderdev ) that for the datafusion (use as a library) feature, a single static `RuntimeEnv` is a challenge because of how many different ways it may be deployed / embedded.  I can imagine people using DataFusion to build systems that attempt to build in isolation (by using separate `RuntimeEnv` for different subsets of users / queries) as well as systems that use other technologies (e.g. kubernetes) to implement isolation
   
   I think passing around the `RuntimeEnv` is the most flexible, if annoying, approach.
   
   I also wanted to reiterate that I liked pretty much all of the rest of this PR (like adding TaskContext, SessionContext, cleanups, etc) so perhaps we can proceed with adding in those pieces incrementall while we work out the "global" vs "non global" context thing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on a change in pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on a change in pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#discussion_r823439560



##########
File path: ballista/rust/scheduler/src/lib.rs
##########
@@ -1001,10 +1103,56 @@ mod test {
                 namespace.to_string(),
                 BallistaCodec::default(),
             );
-        let ctx = scheduler.ctx.read().await;
-        state.init(&ctx).await?;
+        state.init().await?;
         // executor should be registered
         assert_eq!(state.get_executors_metadata().await.unwrap().len(), 1);
         Ok(())
     }
 }
+
+/// A Registry holds all the datafusion session contexts
+pub struct SessionContextRegistry {

Review comment:
       Looks great! Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1062563208


   Hi @mingmwang, would you like to split this effort into several parts as @alamb suggested? This may make it easier for us to merge this huge feature gradually.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on a change in pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on a change in pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#discussion_r820349045



##########
File path: datafusion/src/execution/runtime_env.rs
##########
@@ -25,47 +25,110 @@ use crate::{
         memory_manager::{MemoryConsumerId, MemoryManager, MemoryManagerConfig},
     },
 };
-
+use std::fmt;
 use std::fmt::{Debug, Formatter};
+
+use crate::datasource::object_store::{ObjectStore, ObjectStoreRegistry};
+use crate::execution::context::{
+    SessionContextRegistry, TaskContext, TaskContextRegistry, BATCH_SIZE,
+    PARQUET_PRUNING, REPARTITION_AGGREGATIONS, REPARTITION_JOINS, REPARTITION_WINDOWS,
+    TARGET_PARTITIONS,
+};
+use crate::prelude::{SessionConfig, SessionContext};
+use datafusion_common::DataFusionError;
+use once_cell::sync::OnceCell;
+use parking_lot::Mutex;
+use std::path::PathBuf;
 use std::sync::Arc;
 
-#[derive(Clone)]
-/// Execution runtime environment. This structure is passed to the
-/// physical plans when they are run.
+/// Global singleton RuntimeEnv
+pub static RUNTIME_ENV: OnceCell<Arc<RuntimeEnv>> = OnceCell::new();

Review comment:
       > I think a single static runtime environment makes sense for Ballista but not for DataFusion (which gets used in a variety of usecases that a single runtime might not be applicable for)
   
   Hi, Alamb
   
   Thanks a lot for taking a look. One reason that I have to use the global static runtime is make the ExecutionPlans session/configuration aware. 
   
   ````
    pub trait ExecutionPlan: Debug + Send + Sync {
   
       ...................
       
       /// Return the Session id associated with the execution plan.
       fn session_id(&self) -> String;
   
       /// Return the Session configuration associated with the execution plan.
       fn session_config(&self) -> SessionConfig {
           let session_id = self.session_id();
           let runtime = RuntimeEnv::global();
           runtime.lookup_session_config(session_id.as_str())
       }
   
   
   impl RuntimeEnv {
   
       ...................
       
       /// Retrieves a copied version of `SessionConfig` by session_id
       pub fn lookup_session_config(&self, session_id: &str) -> SessionConfig {
           if self.is_local() {
               // It is possible that in a local env such as in unit tests there is no
               // SessionContext created, in this case we have to return a default SessionConfig.
               self.lookup_config(session_id)
                   .map_or(SessionConfig::new(), |c| c.lock().clone())
           } else if self.is_scheduler() {
               self.lookup_session(session_id)
                   .expect("SessionContext doesn't exist")
                   .copied_config()
           } else {
               self.config_from_task_context(session_id)
           }
    }
   
   ````
   
   I think even we just use DataFusion as a lib, it is still a valid requirement to make the ExecutionPlan configuration aware. We might add more and more configuration options to enable/disable some feature during plan time and execution runtime(System like SparkSQL/Presto they have more than hundreds of configuration options). We need an extendable way to achieve this. With the single static runtime environment,  I could make the ExecutionPlan configuration aware and runtime aware(it knows it runs in DataFusion local as a lib, or in the Ballista executor/scheduler) without pass-though anything. 
   
   Some of the systems might leverage the ThreadLocal facility in the program language to get the current context/configuration to avoid pass-through configurations. But In Rust we can not leverage ThreadLocal, they are so many async parts in the code paths.  
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1064729039


   > > Could you please guide me that who else can review this PR ?
   > 
   > The current list of maintainers is listed at https://arrow.apache.org/committers/ -- ones who may be interested in this PR include @andygrove @Dandandan @rdettai @Jimexist @yjshen and @houqp
   > 
   > > Sorry for pushing this hard, I need to close this the early the better so that I can continue the working on other parts.
   > 
   > No problem -- I want to support you and you work as I think it is valuable.
   > 
   > If you are able to break this PR into smaller pieces, as suggested on [#1924 (review)](https://github.com/apache/arrow-datafusion/pull/1924#pullrequestreview-901068370) I think you'll find sufficient review capacity to get it through.
   
   Hi, @alamb 
   
   Ok. I will break this PR into three parts.
   
   1.  Rename from ExecutionContext to SessionContext and ExecutionContextState to SessionState, Add TaskContext
   and wrap the RuntimeEnv into TaskContext and pass down TaskContext into ExecutionPlan's execute() method, fix all the
    trivial UTs.
   
   2. Changes related to the SessionContext and SessionContextState themselves, move the planner/optimizer from SessionConfig to SessionState.
   
   3. Ballista related changes, add SessionContextRegistry, make SessionContext long running.  SessionContext can created from BallistaConfig and can propagate to Executor.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1064003994


   > From what I can see the basic structure now looks good, but I don't feel I ca
   
   
   
   > Hi @mingmwang -- thank you for taking feedback and removing the global singleton.
   > 
   > Sadly, realistically I am not going to be able to find enough contiguous time to carefully review this PR as written. It changes too many fundamental structures to DataFusion (like `ExecutionContext`) in non trivial ways.
   > 
   > From what I can see the basic structure now looks good, but I don't feel I can evaluate the implications of this PR as a whole.
   > 
   > If other maintainers feel differently, I will defer to their judgement and don't oppose merging this and sorting out any fallout afterwards. However, I stand with my original suggestion to break this into smaller parts that can be evaluated separately.
   
   Hi, @alamb 
   
   Could you please guide me that who else can review this PR ?  
   Sorry for pushing this hard, I need to close this the early the better so that I can continue the working on other parts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1060124062


   As we regard the DataFusion as library and the Ballista as a standalone system, putting the global singleton into Ballista should be the right choice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#discussion_r821100225



##########
File path: datafusion/src/execution/runtime_env.rs
##########
@@ -25,47 +25,110 @@ use crate::{
         memory_manager::{MemoryConsumerId, MemoryManager, MemoryManagerConfig},
     },
 };
-
+use std::fmt;
 use std::fmt::{Debug, Formatter};
+
+use crate::datasource::object_store::{ObjectStore, ObjectStoreRegistry};
+use crate::execution::context::{
+    SessionContextRegistry, TaskContext, TaskContextRegistry, BATCH_SIZE,
+    PARQUET_PRUNING, REPARTITION_AGGREGATIONS, REPARTITION_JOINS, REPARTITION_WINDOWS,
+    TARGET_PARTITIONS,
+};
+use crate::prelude::{SessionConfig, SessionContext};
+use datafusion_common::DataFusionError;
+use once_cell::sync::OnceCell;
+use parking_lot::Mutex;
+use std::path::PathBuf;
 use std::sync::Arc;
 
-#[derive(Clone)]
-/// Execution runtime environment. This structure is passed to the
-/// physical plans when they are run.
+/// Global singleton RuntimeEnv
+pub static RUNTIME_ENV: OnceCell<Arc<RuntimeEnv>> = OnceCell::new();

Review comment:
       Yes, I agree the alternative is having to pass through some sort of context (like the `ExecutionContext`) explicitly everywhere, which may make the code harder to read. However the upside of explicitly passing through any shared state is that there are no surprising behaviors
   
   > I think even we just use DataFusion as a lib, it is still a valid requirement to make the ExecutionPlan configuration aware. We might add more and more configuration options to enable/disable some feature during plan time and execution runtime(System like SparkSQL/Presto they have more than hundreds of configuration options).
   
   Yes I agree having `ExecutionPlan` be configuration aware is important. 
   
   The idea up to this point has been to explicitly copy / plumb through this state (via  `Arc<ExecutionContextState>`)
   https://github.com/apache/arrow-datafusion/blob/7a2cbd5500c1e9447c7d71599eeccfdd5833cd4e/datafusion%2Fsrc%2Fexecution%2Fcontext.rs#L148
   
   And  the `RuntimeEnv` on the call to `ExecutionPlan::execute`
   
   https://github.com/apache/arrow-datafusion/blob/33b9357139ad918da0f45a92db37f00ffa64b0ba/datafusion/src/physical_plan/mod.rs#L226-L230
   
   
   What would you think about adding configuration options to RuntimeEnv (or something else more general)?
   https://github.com/apache/arrow-datafusion/blob/641338f726549c10c5bafee34537dc1e56cdec04/datafusion/src/execution/runtime_env.rs#L35-L42




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1062573258


   > Hi @mingmwang, would you like to split this effort into several parts as @alamb suggested? This may make it easier for us to merge this huge feature gradually.
   
   Hi @yjshen 
   The major changes are in the file context.rs.  Most of other changed files are trivial UT fixes. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang edited a comment on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang edited a comment on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1064729039


   > > Could you please guide me that who else can review this PR ?
   > 
   > The current list of maintainers is listed at https://arrow.apache.org/committers/ -- ones who may be interested in this PR include @andygrove @Dandandan @rdettai @Jimexist @yjshen and @houqp
   > 
   > > Sorry for pushing this hard, I need to close this the early the better so that I can continue the working on other parts.
   > 
   > No problem -- I want to support you and you work as I think it is valuable.
   > 
   > If you are able to break this PR into smaller pieces, as suggested on [#1924 (review)](https://github.com/apache/arrow-datafusion/pull/1924#pullrequestreview-901068370) I think you'll find sufficient review capacity to get it through.
   
   Hi, @alamb 
   
   Ok. I will break this PR into three parts.
   
   1.  Rename from ExecutionContext to SessionContext and ExecutionContextState to SessionState, Add TaskContext
   and wrap the RuntimeEnv into TaskContext and pass down TaskContext into ExecutionPlan's execute() method, fix all the trivial UTs.
   
   2. Changes related to the SessionContext and SessionContextState themselves, move the planner/optimizer from SessionConfig to SessionState.
   
   3. Ballista related changes, add SessionContextRegistry, make SessionContext long running.  SessionContext can created from BallistaConfig and can propagate to Executor.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1060279616


   @yjshen @thinkharderdev 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1069233571


   I'm closing this since we decided to separate this monolithic PR into steps and get it merged gradually. And we've already merged the first part #1987.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen closed pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
yjshen closed pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1060702963


   > I wonder if we are missing a layer here. My intuition is that a global `RuntimeEnv` might be a problem and not really support real multi-tenancy but a session-level `RuntimeEnv` may be too fine-grained. We already have the concept of a namespace in the state configurations so does it make sense to namespace the `RuntimeEnv`. This could be away to support proper resource-level multi-tenancy. That is each namespace is a proper "slice" of the total cluster resources and there is some level of isolation so that one resource-hungry query by one client (in theory at least) can not use all cluster resources. It would probably be a little complicated to do dynamic registration of namespaces but should be possible in principle (and we probably don't have to tackle this right away).
   > 
   > I have the same question as @yjshen with respect to mutli-scheduler environments. I think that will be a key consideration for Ballista in general.
   
   For cluster level resource isolations, I think it's the responsibility of the query scheduler and task scheduler. We should not put too much burden to the RunTimeEnv. Just make the RunTimeEnv as simple as possible. 
   If future, we can add user level query profile or resource profile to the session configuration, if the plan is session configuration aware, the query scheduler and task scheduler can take the resource isolations and resource requirements into consideration, no need to leverage RuntimeEnv to make decision.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1061291910


   > I think another consideration is
   > 
   > I agree with @yjshen (and I think @thinkharderdev ) that for the datafusion (use as a library) feature, a single static `RuntimeEnv` is a challenge because of how many different ways it may be deployed / embedded. I can imagine people using DataFusion to build systems that attempt to build in isolation (by using separate `RuntimeEnv` for different subsets of users / queries) as well as systems that use other technologies (e.g. kubernetes) to implement isolation
   > 
   > I think passing around the `RuntimeEnv` is the most flexible, if annoying, approach.
   > 
   > I also wanted to reiterate that I liked pretty much all of the rest of this PR (like adding TaskContext, SessionContext, cleanups, etc) so perhaps we can proceed with adding in those pieces incremental while we work out the "global" vs "non global" context thing
   
   Ok, I will remove the the static singleton RuntimeEnv in DataFusion, and go with the approach to pass through the required structure in the ExecutionPlan execute() method.  It can still achieve the current goal and make the plan's major execution path configuration aware. But in the same time it will lose some capabilities for example in other methods like display() or stats(), if we want to add some session configuration to control the plan display(short plan/long plan displaying etc) or stats collection(columnar stats, support histogram etc).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1060683009


   > I might be asking too much about this great initiative. Another plausible and low-hanging fruit in my mind is: multi-tenancy only inside the scheduler. By making executors one query entrance at a time, a "Mutex". We could have an easier route to achieve multi-tenancy.
   > 
   > Another question, just out of curiosity, is the scheduler a single point of failure in the current design? If it is, have you considered a "multi" scheduler architecture?
   > 
   > Thanks again for broadening Ballista's landscape!
   
   Actually the PR does not touch the existing scheduler part too much.  The major purpose is to refactoring the configuration and make execution plan session/configuration aware.  Of cause the Ballista scheduler server need to maintain a list of running sessions and their session state currently in a global RuntimeEnv.  But since Ballista Client also maintains its own local session context and session state for optimization and planning,  I think if we have multiple Ballista scheduler's in the system, the current design should not make the scheduler a single point of failure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1064082167


   > Could you please guide me that who else can review this PR ?
   
   The current list of maintainers is listed at https://arrow.apache.org/committers/ -- ones who may be interested in this PR include @andygrove @Dandandan @rdettai @Jimexist @yjshen and @houqp 
   
   > Sorry for pushing this hard, I need to close this the early the better so that I can continue the working on other parts.
   
   No problem -- I want to support you and you work as I think it is valuable.
   
   If you are able to break this PR into smaller pieces, as suggested on https://github.com/apache/arrow-datafusion/pull/1924#pullrequestreview-901068370 I think you'll find sufficient review capacity to get it through. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on a change in pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on a change in pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#discussion_r822539024



##########
File path: ballista/rust/scheduler/src/lib.rs
##########
@@ -116,30 +117,29 @@ pub struct SchedulerServer<T: 'static + AsLogicalPlan, U: 'static + AsExecutionP
     pub(crate) state: Arc<SchedulerState<T, U>>,
     start_time: u128,
     policy: TaskSchedulingPolicy,
-    scheduler_env: Option<SchedulerEnv>,
+    scheduler_loop: Option<SchedulerLoop>,
     executors_client: Option<ExecutorsClient>,
-    ctx: Arc<RwLock<ExecutionContext>>,
     codec: BallistaCodec<T, U>,
+    /// DataFusion session contexts that are registered within the SchedulerServer
+    session_context_registry: Arc<SessionContextRegistry>,

Review comment:
       So I think we need to preserve some mechanism to register common extensions in the `Context`. The original intention of adding `ExecutionContext` to the `SchedulerServer` was to allow creating a `SchedulerServer` which had a customized `ExecutionContext`. Looks like you've dealt with custom `ObjectStore` implementations by moving that to `RuntimeEnv` but there are serveral other extension points which may be useful such as custom planners, optimizers, udf/udaf 
   
   Maybe we can add a "default" `SessionContext` to `SessionContextRegistry` so we can create new contexts from a template? That way the template context can be initialized with custom extensions. 

##########
File path: ballista/rust/scheduler/src/lib.rs
##########
@@ -1001,10 +1103,56 @@ mod test {
                 namespace.to_string(),
                 BallistaCodec::default(),
             );
-        let ctx = scheduler.ctx.read().await;
-        state.init(&ctx).await?;
+        state.init().await?;
         // executor should be registered
         assert_eq!(state.get_executors_metadata().await.unwrap().len(), 1);
         Ok(())
     }
 }
+
+/// A Registry holds all the datafusion session contexts
+pub struct SessionContextRegistry {

Review comment:
       Here maybe we can add `base_context: Option<Arc<SessionContext>>`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1061323542


   > I think another consideration is
   > 
   > I agree with @yjshen (and I think @thinkharderdev ) that for the datafusion (use as a library) feature, a single static `RuntimeEnv` is a challenge because of how many different ways it may be deployed / embedded. I can imagine people using DataFusion to build systems that attempt to build in isolation (by using separate `RuntimeEnv` for different subsets of users / queries) as well as systems that use other technologies (e.g. kubernetes) to implement isolation
   > 
   > I think passing around the `RuntimeEnv` is the most flexible, if annoying, approach.
   > 
   > I also wanted to reiterate that I liked pretty much all of the rest of this PR (like adding TaskContext, SessionContext, cleanups, etc) so perhaps we can proceed with adding in those pieces incremental while we work out the "global" vs "non global" context thing
   
   Ok, I will remove the the static singleton RuntimeEnv in DataFusion, and go with the approach to pass through the required structure in the ExecutionPlan execute() method.  It can still achieve the current goal and make the plan's major execution path configuration aware. But in the same time it will lose some capabilities for example in other methods like display() or stats(), if we want to add some session configuration to control the plan display(short plan/long plan displaying etc) or stats collection(columnar stats, support histogram etc).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1060278909


   I think the debate is whether we should have a global RuntimeEnv for DataFusion. Originally the RuntimeEnv was not a singleton. But I would say the definition of RuntimeEnv was not clear . It stood for an 
   execution runtime environment and the structure was passed to the physical plans when they are run. Looks like It was the task/physical plan's runtime. But the original ExecutionContextState also had a member of type RuntimeEnv(not a reference), so the RuntimeEnv also stood for DataFusion/Ballista context's runtime. So does it stand for a task level runtime or session/execution context level runtime is not clear.  RuntimeEnv was originally introduced for MemoryManager and DiskManager which I think should be managed globally.  After this refactoring, I think the meaning of RuntimeEnv is more clear.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1060547632


   I wonder if we are missing a layer here. My intuition is that a global `RuntimeEnv` might be a problem and not really support real multi-tenancy but a session-level `RuntimeEnv` may be too fine-grained. We already have the concept of a namespace in the state configurations so does it make sense to namespace the `RuntimeEnv`. This could be away to support proper resource-level multi-tenancy. That is each namespace is a proper "slice" of the total cluster resources and there is some level of isolation so that one resource-hungry query by one client (in theory at least) can not use all cluster resources. It would probably be a little complicated to do dynamic registration of namespaces but should be possible in principle (and we probably don't have to tackle this right away). 
   
   I have the same question as @yjshen with respect to mutli-scheduler environments. I think that will be a key consideration for Ballista in general. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1061137810


   I think another consideration is 
   
   I agree with @yjshen (and I think @thinkharderdev ) that for the datafusion (use as a library) feature, a single static `RuntimeEnv` is a challenge because of how many different ways it may be deployed / embedded.  I can imagine people using DataFusion to build systems that attempt to build in isolation (by using separate `RuntimeEnv` for different subsets of users / queries) as well as systems that use other technologies (e.g. kubernetes) to implement isolation
   
   I think passing around the `RuntimeEnv` is the most flexible, if annoying, approach.
   
   I also wanted to reiterate that I liked pretty much all of the rest of this PR (like adding TaskContext, SessionContext, cleanups, etc) so perhaps we can proceed with adding in those pieces incremental while we work out the "global" vs "non global" context thing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on a change in pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on a change in pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#discussion_r823286069



##########
File path: ballista/rust/scheduler/src/lib.rs
##########
@@ -1001,10 +1103,56 @@ mod test {
                 namespace.to_string(),
                 BallistaCodec::default(),
             );
-        let ctx = scheduler.ctx.read().await;
-        state.init(&ctx).await?;
+        state.init().await?;
         // executor should be registered
         assert_eq!(state.get_executors_metadata().await.unwrap().len(), 1);
         Ok(())
     }
 }
+
+/// A Registry holds all the datafusion session contexts
+pub struct SessionContextRegistry {

Review comment:
       Hi @thinkharderdev 
   
   Yes, this is very good point.  In the current implementation, the SessionContext is alway created using configurations from query settings/Ballista configurations, there is no way to configure different optimizers/planners.  Maybe I can add a new struct like Base SessionContext Builder and make that an extension point. 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang commented on a change in pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang commented on a change in pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#discussion_r823384840



##########
File path: ballista/rust/scheduler/src/lib.rs
##########
@@ -1001,10 +1103,56 @@ mod test {
                 namespace.to_string(),
                 BallistaCodec::default(),
             );
-        let ctx = scheduler.ctx.read().await;
-        state.init(&ctx).await?;
+        state.init().await?;
         // executor should be registered
         assert_eq!(state.get_executors_metadata().await.unwrap().len(), 1);
         Ok(())
     }
 }
+
+/// A Registry holds all the datafusion session contexts
+pub struct SessionContextRegistry {

Review comment:
       @thinkharderdev  Done the change, please help to take a look.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1060433340


   Sorry to join the party late! Thanks for @ me.
   
   Initially, I made `RuntimeEnv` a global one for minor interface plumbing while prototyping `MemoryManager`. And I even have had the formal proposal before introducing the `ObjectStore` interface to make `ObjectStore` itself global. In my humble opinion, whether the `Env` or `Store` static or not is more like a programming flavor. When implementing a lib, explicitly plumbing is favored. While implementing a framework or a standalone program, you are free to use `global`.
   
   However, when it comes to the multi-tenancy world, the issue changed significantly in my mind. We should prompt **Runtime Isolation** to among our first-class considerations. Therefore, I'm **against** making `RuntimeEnv`, especially `MemoryManager` be global among multiple tenants in DataFusion or Ballista. We should never have tasks from different clients with different priorities or shares treated equally in one manager. We may even need to consider **physical isolation** for the most crucial jobs or queries and assign exclusive executors to guarantee that no bad things happen.
   
   I'm OK with whatever configurations are passed down because of the simple usage pattern.
   
   Therefore, please provide another abstraction layer for **resource-related** or **performance-critical** components. We can afford to use resources less optimally, but never in a way that interferes with or even hinders each other.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1060493159


   I might be asking too much about this great initiative. Another plausible and low-hanging fruit in my mind is: multi-tenancy only inside the scheduler. By making executors one query entrance at a time, a "Mutex". We could have an easier route to achieve multi-tenancy.
   
   Another question, just out of curiosity, is the scheduler a single point of failure in the current design? If it is, have you considered a "multi" scheduler architecture?
   
   Thanks again for broadening Ballista's landscape!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] mingmwang removed a comment on pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
mingmwang removed a comment on pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#issuecomment-1061323542


   > I think another consideration is
   > 
   > I agree with @yjshen (and I think @thinkharderdev ) that for the datafusion (use as a library) feature, a single static `RuntimeEnv` is a challenge because of how many different ways it may be deployed / embedded. I can imagine people using DataFusion to build systems that attempt to build in isolation (by using separate `RuntimeEnv` for different subsets of users / queries) as well as systems that use other technologies (e.g. kubernetes) to implement isolation
   > 
   > I think passing around the `RuntimeEnv` is the most flexible, if annoying, approach.
   > 
   > I also wanted to reiterate that I liked pretty much all of the rest of this PR (like adding TaskContext, SessionContext, cleanups, etc) so perhaps we can proceed with adding in those pieces incremental while we work out the "global" vs "non global" context thing
   
   Ok, I will remove the the static singleton RuntimeEnv in DataFusion, and go with the approach to pass through the required structure in the ExecutionPlan execute() method.  It can still achieve the current goal and make the plan's major execution path configuration aware. But in the same time it will lose some capabilities for example in other methods like display() or stats(), if we want to add some session configuration to control the plan display(short plan/long plan displaying etc) or stats collection(columnar stats, support histogram etc).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on a change in pull request #1924: Refactor ExecutionContext and related conf to support multi-tenancy configurations

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on a change in pull request #1924:
URL: https://github.com/apache/arrow-datafusion/pull/1924#discussion_r822547792



##########
File path: ballista/rust/scheduler/src/lib.rs
##########
@@ -116,30 +117,29 @@ pub struct SchedulerServer<T: 'static + AsLogicalPlan, U: 'static + AsExecutionP
     pub(crate) state: Arc<SchedulerState<T, U>>,
     start_time: u128,
     policy: TaskSchedulingPolicy,
-    scheduler_env: Option<SchedulerEnv>,
+    scheduler_loop: Option<SchedulerLoop>,
     executors_client: Option<ExecutorsClient>,
-    ctx: Arc<RwLock<ExecutionContext>>,
     codec: BallistaCodec<T, U>,
+    /// DataFusion session contexts that are registered within the SchedulerServer
+    session_context_registry: Arc<SessionContextRegistry>,

Review comment:
       To add on to that slightly, previous work allows the serialization of extension plans in Ballista, but in order to actually use them an `ExtensionPlanner` needs to be registered in the `DefaultQueryPlanner`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org