You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/03 04:14:22 UTC

[GitHub] [arrow-datafusion] avantgardnerio opened a new pull request, #4095: Bg listing catalog

avantgardnerio opened a new pull request, #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095

   # Which issue does this PR close?
   
   Closes #4094.
   
   # Rationale for this change
   
   Described in issue.
   
   # What changes are included in this PR?
   
   1. a `ListingSchemaProvider` is defined
   2. if appropriate `config_options` are set, the `ListingSchemaProvider` is automatically registered
   3. if the `ListingSchemaProvider` is registered, it scans the `ObjectStore` and registers the appropriate tables
   
   # Are there any user-facing changes?
   
   Then they connect, tables are there as they would expect.
   
   ![Screenshot from 2022-11-02 15-52-18](https://user-images.githubusercontent.com/3855243/199646383-9ccce786-2462-4f67-9a6c-0f44c30ba405.png)
   
   ![Screenshot from 2022-11-02 15-50-55](https://user-images.githubusercontent.com/3855243/199646401-5a3f50d7-3ec9-4140-8658-31436fe4d270.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove merged pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
andygrove merged PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302771762

   Oh, could the underscores be messing with it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302567592

   > missing is some sort of test
   
   For sure, working on it...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302772990

   Yup, a bug in ObjectStore:
   ```
   Looking for data in /home/bgardner/workspace/ballista/arrow-datafusion/datafusion/core/tests/tpch_csv exists=true
   
   
   Left:  0
   Right: 8
   <Click to see difference>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1013298691


##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,145 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A SchemaProvider that scans an ObjectStore to automatically discover tables
+pub struct ListingSchemaProvider {

Review Comment:
   It might make sense to add some documents here about the assumptions this class makes.
   
   Like that it assumes each directory in `path` corresponds to a table, for example



##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,145 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A SchemaProvider that scans an ObjectStore to automatically discover tables
+pub struct ListingSchemaProvider {
+    authority: String,
+    path: object_store::path::Path,
+    factory: Arc<dyn TableProviderFactory>,
+    store: Arc<dyn ObjectStore>,
+    tables: Arc<Mutex<HashMap<String, Arc<dyn TableProvider>>>>,
+}
+
+impl ListingSchemaProvider {
+    /// Create a new ListingSchemaProvider
+    pub fn new(
+        authority: String,

Review Comment:
   Can you please document these parameters? Specifically, I am not sure what `authority` means / is for
   
   I assume `path` is the root path for which to search for tables



##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,145 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A SchemaProvider that scans an ObjectStore to automatically discover tables
+pub struct ListingSchemaProvider {
+    authority: String,
+    path: object_store::path::Path,
+    factory: Arc<dyn TableProviderFactory>,
+    store: Arc<dyn ObjectStore>,
+    tables: Arc<Mutex<HashMap<String, Arc<dyn TableProvider>>>>,
+}
+
+impl ListingSchemaProvider {
+    /// Create a new ListingSchemaProvider
+    pub fn new(
+        authority: String,
+        path: object_store::path::Path,
+        factory: Arc<dyn TableProviderFactory>,
+        store: Arc<dyn ObjectStore>,
+    ) -> Self {
+        Self {
+            authority,
+            path,
+            factory,
+            store,
+            tables: Arc::new(Mutex::new(HashMap::new())),
+        }
+    }
+
+    /// Reload table information from ObjectStore
+    pub async fn refresh(&self) -> datafusion_common::Result<()> {
+        let entries: Vec<_> = self
+            .store
+            .list(Some(&self.path))
+            .await?
+            .try_collect()
+            .await?;
+        let base = Path::new(self.path.as_ref());
+        let mut tables = HashSet::new();
+        for file in entries.iter() {
+            let mut parent = Path::new(file.location.as_ref());
+            while let Some(p) = parent.parent() {
+                if p == base {
+                    tables.insert(parent);
+                }
+                parent = p;
+            }
+        }
+        for table in tables.iter() {
+            let file_name = table
+                .file_name()
+                .ok_or_else(|| {
+                    DataFusionError::Internal("Cannot parse file name!".to_string())
+                })?
+                .to_str()
+                .ok_or_else(|| {
+                    DataFusionError::Internal("Cannot parse file name!".to_string())
+                })?;
+            let path = table.to_str().ok_or_else(|| {
+                DataFusionError::Internal("Cannot parse file name!".to_string())
+            })?;
+            if !self.table_exist(file_name) {
+                let path = format!("{}/{}", self.authority, path);
+                let provider = self.factory.create(path.as_str()).await?;
+                let _ = self.register_table(file_name.to_string(), provider.clone())?;
+            }
+        }
+        Ok(())
+    }
+}
+
+impl SchemaProvider for ListingSchemaProvider {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn table_names(&self) -> Vec<String> {
+        self.tables
+            .lock()
+            .expect("Can't lock tables")
+            .keys()
+            .map(|it| it.to_string())
+            .collect()
+    }
+
+    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
+        self.tables
+            .lock()
+            .expect("Can't lock tables")
+            .get(name)
+            .cloned()
+    }
+
+    fn register_table(
+        &self,
+        name: String,
+        table: Arc<dyn TableProvider>,
+    ) -> datafusion_common::Result<Option<Arc<dyn TableProvider>>> {
+        self.tables
+            .lock()
+            .expect("Can't lock tables")
+            .insert(name, table.clone());
+        Ok(Some(table))
+    }
+
+    fn deregister_table(
+        &self,
+        _name: &str,
+    ) -> datafusion_common::Result<Option<Arc<dyn TableProvider>>> {
+        todo!("ListingSchemaProvider::deregister_table")

Review Comment:
   Is this meant to be left as a todo? Perhaps we can return an NotYetImplemented error here instead of panicing



##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,145 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A SchemaProvider that scans an ObjectStore to automatically discover tables
+pub struct ListingSchemaProvider {
+    authority: String,
+    path: object_store::path::Path,
+    factory: Arc<dyn TableProviderFactory>,
+    store: Arc<dyn ObjectStore>,
+    tables: Arc<Mutex<HashMap<String, Arc<dyn TableProvider>>>>,
+}
+
+impl ListingSchemaProvider {
+    /// Create a new ListingSchemaProvider
+    pub fn new(
+        authority: String,
+        path: object_store::path::Path,
+        factory: Arc<dyn TableProviderFactory>,
+        store: Arc<dyn ObjectStore>,
+    ) -> Self {
+        Self {
+            authority,
+            path,
+            factory,
+            store,
+            tables: Arc::new(Mutex::new(HashMap::new())),
+        }
+    }
+
+    /// Reload table information from ObjectStore
+    pub async fn refresh(&self) -> datafusion_common::Result<()> {
+        let entries: Vec<_> = self
+            .store
+            .list(Some(&self.path))
+            .await?
+            .try_collect()
+            .await?;
+        let base = Path::new(self.path.as_ref());
+        let mut tables = HashSet::new();
+        for file in entries.iter() {
+            let mut parent = Path::new(file.location.as_ref());
+            while let Some(p) = parent.parent() {
+                if p == base {
+                    tables.insert(parent);
+                }
+                parent = p;
+            }
+        }
+        for table in tables.iter() {
+            let file_name = table
+                .file_name()
+                .ok_or_else(|| {
+                    DataFusionError::Internal("Cannot parse file name!".to_string())
+                })?
+                .to_str()
+                .ok_or_else(|| {
+                    DataFusionError::Internal("Cannot parse file name!".to_string())
+                })?;
+            let path = table.to_str().ok_or_else(|| {
+                DataFusionError::Internal("Cannot parse file name!".to_string())
+            })?;
+            if !self.table_exist(file_name) {
+                let path = format!("{}/{}", self.authority, path);

Review Comment:
   ```suggestion
               let table_name = table.to_str().ok_or_else(|| {
                   DataFusionError::Internal("Cannot parse file name!".to_string())
               })?;
               if !self.table_exist(file_name) {
                   let table_name = format!("{}/{}", self.authority, path);
   ```



##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,145 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A SchemaProvider that scans an ObjectStore to automatically discover tables
+pub struct ListingSchemaProvider {
+    authority: String,
+    path: object_store::path::Path,
+    factory: Arc<dyn TableProviderFactory>,
+    store: Arc<dyn ObjectStore>,
+    tables: Arc<Mutex<HashMap<String, Arc<dyn TableProvider>>>>,
+}
+
+impl ListingSchemaProvider {
+    /// Create a new ListingSchemaProvider
+    pub fn new(
+        authority: String,
+        path: object_store::path::Path,
+        factory: Arc<dyn TableProviderFactory>,
+        store: Arc<dyn ObjectStore>,
+    ) -> Self {
+        Self {
+            authority,
+            path,
+            factory,
+            store,
+            tables: Arc::new(Mutex::new(HashMap::new())),
+        }
+    }
+
+    /// Reload table information from ObjectStore

Review Comment:
   I recommend documenting here when `refresh()` should be called  -- like it has to be called explicitly after construction for example



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1013363120


##########
datafusion/core/src/execution/context.rs:
##########
@@ -2145,6 +2195,40 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn with_listing_schema_provider() -> Result<()> {
+        let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
+        let url = format!("file://{}/tests/tpch-csv", path.display());
+
+        let mut table_factories: HashMap<String, Arc<dyn TableProviderFactory>> =
+            HashMap::new();
+        table_factories.insert("test".to_string(), Arc::new(TestTableFactory {}));
+        let rt_cfg = RuntimeConfig::new().with_table_factories(table_factories);
+        let runtime = Arc::new(RuntimeEnv::new(rt_cfg).unwrap());
+        let cfg = SessionConfig::new()
+            .set_str("datafusion.catalog.location", url.as_str())
+            .set_str("datafusion.catalog.type", "test");
+        let session_state = SessionState::with_config_rt(cfg, runtime);
+        let ctx = SessionContext::with_state(session_state);
+
+        let mut table_count = 0;
+        for cat_name in ctx.catalog_names().iter() {
+            let cat = ctx.catalog(cat_name).unwrap();
+            for s_name in cat.schema_names().iter() {
+                let schema = cat.schema(s_name).unwrap();
+                if let Some(listing) =
+                    schema.as_any().downcast_ref::<ListingSchemaProvider>()
+                {
+                    listing.refresh().await.unwrap();

Review Comment:
   :shrug: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1013444122


##########
benchmarks/Cargo.toml:
##########
@@ -41,13 +41,13 @@ mimalloc = { version = "0.1", optional = true, default-features = false }
 num_cpus = "1.13.0"
 object_store = "0.5.0"
 parquet = "25.0.0"
-parquet-test-utils = { path = "../parquet-test-utils/" }
+parquet-test-utils = { path = "../parquet-test-utils" }

Review Comment:
   I just took away the slash. I didn't add these. I can put the slash back if that's better form.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Bg listing catalog

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1301621964

   @andygrove and @alamb as usual, I'd appreciate a review as well as adding anyone else who is appropriate. Thanks guys :raised_hands: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302528100

   Until the fix for https://github.com/apache/arrow-datafusion/issues/4100 is merged, clippy will likely fail on this PR as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1013362371


##########
datafusion/core/src/execution/context.rs:
##########
@@ -2145,6 +2195,40 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn with_listing_schema_provider() -> Result<()> {
+        let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
+        let url = format!("file://{}/tests/tpch-csv", path.display());
+
+        let mut table_factories: HashMap<String, Arc<dyn TableProviderFactory>> =
+            HashMap::new();
+        table_factories.insert("test".to_string(), Arc::new(TestTableFactory {}));
+        let rt_cfg = RuntimeConfig::new().with_table_factories(table_factories);
+        let runtime = Arc::new(RuntimeEnv::new(rt_cfg).unwrap());
+        let cfg = SessionConfig::new()
+            .set_str("datafusion.catalog.location", url.as_str())
+            .set_str("datafusion.catalog.type", "test");
+        let session_state = SessionState::with_config_rt(cfg, runtime);
+        let ctx = SessionContext::with_state(session_state);
+
+        let mut table_count = 0;
+        for cat_name in ctx.catalog_names().iter() {
+            let cat = ctx.catalog(cat_name).unwrap();
+            for s_name in cat.schema_names().iter() {
+                let schema = cat.schema(s_name).unwrap();
+                if let Some(listing) =
+                    schema.as_any().downcast_ref::<ListingSchemaProvider>()
+                {
+                    listing.refresh().await.unwrap();
+                }
+                table_count = schema.table_names().len();
+            }
+        }
+
+        assert_eq!(table_count, 8);

Review Comment:
   @alamb I think this one test covers all the bases. Let me know if you think it could/should be improved.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1013417479


##########
benchmarks/Cargo.toml:
##########
@@ -41,13 +41,13 @@ mimalloc = { version = "0.1", optional = true, default-features = false }
 num_cpus = "1.13.0"
 object_store = "0.5.0"
 parquet = "25.0.0"
-parquet-test-utils = { path = "../parquet-test-utils/" }
+parquet-test-utils = { path = "../parquet-test-utils" }

Review Comment:
   Are these new dependencies only used in tests? If so, maybe they can move to `[dev-dependencies]`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1013312827


##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,145 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A SchemaProvider that scans an ObjectStore to automatically discover tables
+pub struct ListingSchemaProvider {
+    authority: String,
+    path: object_store::path::Path,
+    factory: Arc<dyn TableProviderFactory>,
+    store: Arc<dyn ObjectStore>,
+    tables: Arc<Mutex<HashMap<String, Arc<dyn TableProvider>>>>,
+}
+
+impl ListingSchemaProvider {
+    /// Create a new ListingSchemaProvider
+    pub fn new(
+        authority: String,
+        path: object_store::path::Path,
+        factory: Arc<dyn TableProviderFactory>,
+        store: Arc<dyn ObjectStore>,
+    ) -> Self {
+        Self {
+            authority,
+            path,
+            factory,
+            store,
+            tables: Arc::new(Mutex::new(HashMap::new())),
+        }
+    }
+
+    /// Reload table information from ObjectStore

Review Comment:
   That is the big open question with this PR :smile: . Presently, in `flight_sql.rs` in Ballista, when FlightSql clients enumerate the catalogs/schemas/tables, I am doing a checked `downcast_ref` to see if it is a `ListingSchemaProvider` and if so calling this method. Ultimately, I think we should probably extend the `SchemaProvider` API? The downcast trick certainly doesn't seem elegant.
   
   Unfortunately this is a state synchronization problem, and I'm not sure that the `ObjectStore` has APIs for file system listeners, so we will need to figure out the best times to try to synchronize the state. Every time we run a query perhaps? My worry is that this could get expensive, I can imagine (or have heard about) each delta table containing 1000 parquet files, and there could probably be 100s of tables, which means a lot of files to scan.
   
   Also unfortunately, it looks like the `ObjectStore` doesn't let us list only children of a folder - it lists all the files in the entire bucket, thus the weird recursive `.parent()` stuff.
   
   I thought I would file this PR to get the discussion going.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302755493

   If it looks anything like local, we should see
   ```
   Looking for data in /home/bgardner/workspace/ballista/arrow-datafusion/datafusion/core/tests/tpch-csv exists=true
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1014224252


##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,163 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A `SchemaProvider` that scans an `ObjectStore` to automatically discover tables
+///
+/// A subfolder relationship is assumed, i.e. given:
+/// authority = s3://host.example.com:3000
+/// path = /data/tpch
+/// factory = `DeltaTableFactory`
+///
+/// A table called "customer" will be registered for the folder:
+/// s3://host.example.com:3000/data/tpch/customer
+///
+/// assuming it contains valid deltalake data, i.e:
+/// s3://host.example.com:3000/data/tpch/customer/part-00000-xxxx.snappy.parquet
+/// s3://host.example.com:3000/data/tpch/customer/_delta_log/
+pub struct ListingSchemaProvider {
+    authority: String,
+    path: object_store::path::Path,
+    factory: Arc<dyn TableProviderFactory>,
+    store: Arc<dyn ObjectStore>,
+    tables: Arc<Mutex<HashMap<String, Arc<dyn TableProvider>>>>,
+}
+
+impl ListingSchemaProvider {
+    /// Create a new `ListingSchemaProvider`
+    ///
+    /// Arguments:

Review Comment:
   ❤️ 



##########
datafusion/core/src/config.rs:
##########
@@ -245,11 +251,22 @@ impl BuiltInConfigs {
                 rule. When set to false, any rules that produce errors will cause the query to fail.",
                 true
             ),
-             ConfigDefinition::new_u64(
-                 OPT_OPTIMIZER_MAX_PASSES,
-                 "Number of times that the optimizer will attempt to optimize the plan",
-                 3
-             )]
+            ConfigDefinition::new_u64(
+                OPT_OPTIMIZER_MAX_PASSES,
+                "Number of times that the optimizer will attempt to optimize the plan",
+                3
+            ),
+            ConfigDefinition::new_string(
+                OPT_CATALOG_LOCATION,
+                "Location scanned to load tables for `default` schema",
+                "".to_string()
+            ),
+            ConfigDefinition::new_string(
+                OPT_CATALOG_TYPE,
+                "Type of `TableProvider` to use when loading `default` schema",

Review Comment:
   ```suggestion
                   "Type of `TableProvider` to use when loading `default` schema. Defaults to None",
   ```



##########
datafusion/core/src/config.rs:
##########
@@ -245,11 +251,22 @@ impl BuiltInConfigs {
                 rule. When set to false, any rules that produce errors will cause the query to fail.",
                 true
             ),
-             ConfigDefinition::new_u64(
-                 OPT_OPTIMIZER_MAX_PASSES,
-                 "Number of times that the optimizer will attempt to optimize the plan",
-                 3
-             )]
+            ConfigDefinition::new_u64(
+                OPT_OPTIMIZER_MAX_PASSES,
+                "Number of times that the optimizer will attempt to optimize the plan",
+                3
+            ),
+            ConfigDefinition::new_string(
+                OPT_CATALOG_LOCATION,
+                "Location scanned to load tables for `default` schema",
+                "".to_string()

Review Comment:
   I wonder if using `None here would be better than an empty string?



##########
datafusion/core/src/config.rs:
##########
@@ -245,11 +251,22 @@ impl BuiltInConfigs {
                 rule. When set to false, any rules that produce errors will cause the query to fail.",
                 true
             ),
-             ConfigDefinition::new_u64(
-                 OPT_OPTIMIZER_MAX_PASSES,
-                 "Number of times that the optimizer will attempt to optimize the plan",
-                 3
-             )]
+            ConfigDefinition::new_u64(
+                OPT_OPTIMIZER_MAX_PASSES,
+                "Number of times that the optimizer will attempt to optimize the plan",
+                3
+            ),
+            ConfigDefinition::new_string(
+                OPT_CATALOG_LOCATION,
+                "Location scanned to load tables for `default` schema",

Review Comment:
   ```suggestion
                   "Location scanned to load tables for `default` schema, defaults to None",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302743566

   I don't understand why the hash collision test is failing, and I can't get it to happen locally :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1014223721


##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,145 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A SchemaProvider that scans an ObjectStore to automatically discover tables
+pub struct ListingSchemaProvider {
+    authority: String,
+    path: object_store::path::Path,
+    factory: Arc<dyn TableProviderFactory>,
+    store: Arc<dyn ObjectStore>,
+    tables: Arc<Mutex<HashMap<String, Arc<dyn TableProvider>>>>,
+}
+
+impl ListingSchemaProvider {
+    /// Create a new ListingSchemaProvider
+    pub fn new(
+        authority: String,
+        path: object_store::path::Path,
+        factory: Arc<dyn TableProviderFactory>,
+        store: Arc<dyn ObjectStore>,
+    ) -> Self {
+        Self {
+            authority,
+            path,
+            factory,
+            store,
+            tables: Arc::new(Mutex::new(HashMap::new())),
+        }
+    }
+
+    /// Reload table information from ObjectStore

Review Comment:
   I agree that there is no policy that will work well for all implementations / usecases so the refresh policy will need to be decided by whatever the upstream system is (e.g. ballista or iox, or whatever)
   
   Adding a `refresh()` method to `SchemaProvider` seems fine, though to be honest I think using `downcast_ref()` also seems fine to me
   
   Would definitely be interested in hearing other opinions on this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1303572568

   I really like this feature. It will save me time as a user once we plumb this through to the CLI and Python :+1: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on a diff in pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on code in PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#discussion_r1013316116


##########
datafusion/core/src/catalog/listing_schema.rs:
##########
@@ -0,0 +1,145 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! listing_schema contains a SchemaProvider that scans ObjectStores for tables automatically
+use crate::catalog::schema::SchemaProvider;
+use crate::datasource::datasource::TableProviderFactory;
+use crate::datasource::TableProvider;
+use datafusion_common::DataFusionError;
+use futures::TryStreamExt;
+use object_store::ObjectStore;
+use std::any::Any;
+use std::collections::{HashMap, HashSet};
+use std::path::Path;
+use std::sync::{Arc, Mutex};
+
+/// A SchemaProvider that scans an ObjectStore to automatically discover tables
+pub struct ListingSchemaProvider {
+    authority: String,
+    path: object_store::path::Path,
+    factory: Arc<dyn TableProviderFactory>,
+    store: Arc<dyn ObjectStore>,
+    tables: Arc<Mutex<HashMap<String, Arc<dyn TableProvider>>>>,
+}
+
+impl ListingSchemaProvider {
+    /// Create a new ListingSchemaProvider
+    pub fn new(
+        authority: String,

Review Comment:
   I think it is scheme + host (if applicable). I borrowed the term from the `ObjectStore` code. I'll document that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302784779

   > Yup, a bug in ObjectStore:
   
   Nope, it was my faulty test combined with hashing inconsistency that happened to get triggered by underscores in this case :facepalm: .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302824587

   Oh wow. The UI went nuts. Sorry for all the review requests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1302770927

   So the directory exists, so it seems like:
   1. either the files in it were deleted
   2. or, the ObjectStore has some problem accessing it
   ```
   Looking for data in /__w/arrow-datafusion/arrow-datafusion/datafusion/core/tests/tpch-csv exists=true
   thread 'execution::context::tests::with_listing_schema_provider' panicked at 'assertion failed: `(left == right)`
     left: `0`,
    right: `8`', datafusion/core/src/execution/context.rs:2288:9
    ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #4095: Automatically register tables if ObjectStore root is configured

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #4095:
URL: https://github.com/apache/arrow-datafusion/pull/4095#issuecomment-1303984839

   Benchmark runs are scheduled for baseline = fc669d5892954cbd2612a272314785758a7cb176 and contender = 4a67d0d2613d8c4b323e0dd5a6afe1eb1115c7ba. 4a67d0d2613d8c4b323e0dd5a6afe1eb1115c7ba is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f3e309087f1d49849e482337b794de7c...b2e8be84c49a4bbfb7990e231c997487/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/a170522cd97348448b9f891917c27f3a...c973c84b93a449278456729e863560d6/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/62e06b89795748cf968b8102ee19b171...392d724b6f90470ba9358b7ee009f93b/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ad304deffd1c4f03ac808b076685be3b...16e63a8adb1c47438688e5780de2c7cc/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org