You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/31 06:54:40 UTC

[GitHub] [arrow-datafusion] matthewmturner opened a new pull request #1715: Create ListingTableConfig which includes file format and schema inference

matthewmturner opened a new pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1705 
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1031903376


   > @alamb any idea why only one CI check is running?
   
   
   I think it is because the PR has a conflict


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r802968609



##########
File path: datafusion/tests/path_partition.rs
##########
@@ -271,7 +274,7 @@ async fn parquet_overlapping_columns() -> Result<()> {
     Ok(())
 }
 
-fn register_partitioned_aggregate_csv(
+async fn register_partitioned_aggregate_csv(

Review comment:
       This doesn't need to be `async` anymore does it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r802967720



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -420,20 +552,10 @@ mod tests {
     async fn load_table(name: &str) -> Result<Arc<dyn TableProvider>> {
         let testdata = crate::test_util::parquet_test_data();
         let filename = format!("{}/{}", testdata, name);
-        let opt = ListingOptions {
-            file_extension: DEFAULT_PARQUET_EXTENSION.to_owned(),
-            format: Arc::new(ParquetFormat::default()),
-            table_partition_cols: vec![],
-            target_partitions: 2,
-            collect_stat: true,
-        };
-        // here we resolve the schema locally
-        let schema = opt
-            .infer_schema(Arc::new(LocalFileSystem {}), &filename)
-            .await
-            .expect("Infer schema");
-        let table =
-            ListingTable::new(Arc::new(LocalFileSystem {}), filename, schema, opt);
+        let config = ListingTableConfig::new(Arc::new(LocalFileSystem {}), filename)
+            .infer()

Review comment:
       this is cool




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner edited a comment on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner edited a comment on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1034033006


   @alamb i think this is finally ready.
   
   I havent included partition column inference, but i think that can be added as a follow on PR if needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1034240793


   @alamb hopefully were good now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r802973470



##########
File path: datafusion/tests/path_partition.rs
##########
@@ -271,7 +274,7 @@ async fn parquet_overlapping_columns() -> Result<()> {
     Ok(())
 }
 
-fn register_partitioned_aggregate_csv(
+async fn register_partitioned_aggregate_csv(

Review comment:
       You are right. ive fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1034074960


   @alamb thx! as always appreciate your guidance.  
   
   is it ok for me to include this feature in post for 7.0?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1029114932


   Just as FYI I havent had time this week to work on DataFusion PRs. I plan to pick up work on this early next week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1032227792


   @alamb do you know of any Rust magic that would allow me to use an async method for `impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode`?  `#[async_trait]` doesnt work as it doesnt match the trait definition for `TryInto`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r803176255



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -40,7 +44,123 @@ use crate::datasource::{
 
 use super::helpers::{expr_applicable_for_cols, pruned_partition_list, split_files};
 
+/// Configuration for creating a 'ListingTable'  
+pub struct ListingTableConfig {
+    /// `ObjectStore` that contains the files for the `ListingTable`.
+    pub object_store: Arc<dyn ObjectStore>,
+    /// Path on the `ObjectStore` for creating `ListingTable`.
+    pub table_path: String,
+    /// Optional `SchemaRef` for the to be created `ListingTable`.
+    pub file_schema: Option<SchemaRef>,
+    /// Optional `ListingOptions` for the to be created `ListingTable`.
+    pub options: Option<ListingOptions>,
+}
+
+impl ListingTableConfig {
+    /// Creates new `ListingTableConfig`.  The `SchemaRef` and `ListingOptions` are inferred based on the suffix of the provided `table_path`.
+    pub fn new(
+        object_store: Arc<dyn ObjectStore>,
+        table_path: impl Into<String>,
+    ) -> Self {
+        Self {
+            object_store,
+            table_path: table_path.into(),
+            file_schema: None,
+            options: None,
+        }
+    }
+    /// Add `schema` to `ListingTableConfig`
+    pub fn with_schema(self, schema: SchemaRef) -> Self {
+        Self {
+            object_store: self.object_store,
+            table_path: self.table_path,
+            file_schema: Some(schema),
+            options: self.options,
+        }
+    }
+
+    /// Add `listing_options` to `ListingTableConfig`
+    pub fn with_listing_options(self, listing_options: ListingOptions) -> Self {
+        Self {
+            object_store: self.object_store,
+            table_path: self.table_path,
+            file_schema: self.file_schema,
+            options: Some(listing_options),
+        }
+    }
+
+    fn infer_format(suffix: &str) -> Result<Arc<dyn FileFormat>> {
+        match suffix {
+            "avro" => Ok(Arc::new(AvroFormat::default())),
+            "csv" => Ok(Arc::new(CsvFormat::default())),
+            "json" => Ok(Arc::new(JsonFormat::default())),
+            "parquet" => Ok(Arc::new(ParquetFormat::default())),
+            _ => Err(DataFusionError::Internal(
+                "Unable to infer file type".into(),

Review comment:
       We can make the log more specifically with the `suffix`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1034095950


   > Ok no now it has a conflict!
   
   working on it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1034033006


   @alamb i think this is finally ready (pending final CI test passing).
   
   I havent included partition column inference, but i think that can be added as a follow on PR if needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1030519703


   @alamb @houqp would you mind checking this out before I go updating all the other tests and places where this is used?  I confess i struggled a bit with the borrow checker here and there is likely more idiomatic ways to achieve the goal of this - but i think that it at least serves as a starting point.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r800124714



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -40,7 +44,70 @@ use crate::datasource::{
 
 use super::helpers::{expr_applicable_for_cols, pruned_partition_list, split_files};
 
+/// Configuration for creating a 'ListingTable'  
+pub struct ListingTableConfig {
+    pub object_store: Arc<dyn ObjectStore>,
+    pub table_path: String,
+    pub file_schema: Option<SchemaRef>,
+    pub options: Option<ListingOptions>,
+}
+
+impl ListingTableConfig {
+    /// Creates new `ListingTableConfig`.  The `SchemaRef` and `ListingOptions` are inferred based on the suffix of the provided `table_path`.
+    pub async fn new(
+        object_store: Arc<dyn ObjectStore>,
+        table_path: impl Into<String>,
+    ) -> Self {
+        Self {
+            object_store,
+            table_path: table_path.into(),
+            file_schema: None,
+            options: None,
+        }
+    }
+    pub fn with_schema(mut self, schema: SchemaRef) {

Review comment:
       Thx @houqp.  FYI, i made updates and im planning on updating the remaining errors next.  Lmk if you see anything else.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1032528711


   > @alamb do you know of any Rust magic that would allow me to use an async method for impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode to get the ballista errors resolved? 
   
   I do not know
   
   > Thinking out loud here. Im wondering if i can move the required async functionality to all be within ListingTableConfig and the user can decide if its needed. Then ListingTable::new doesnt need to be async.
   
   I think that sounds like a very good idea to me 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r803179007



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -40,7 +44,123 @@ use crate::datasource::{
 
 use super::helpers::{expr_applicable_for_cols, pruned_partition_list, split_files};
 
+/// Configuration for creating a 'ListingTable'  
+pub struct ListingTableConfig {
+    /// `ObjectStore` that contains the files for the `ListingTable`.
+    pub object_store: Arc<dyn ObjectStore>,
+    /// Path on the `ObjectStore` for creating `ListingTable`.
+    pub table_path: String,
+    /// Optional `SchemaRef` for the to be created `ListingTable`.
+    pub file_schema: Option<SchemaRef>,
+    /// Optional `ListingOptions` for the to be created `ListingTable`.
+    pub options: Option<ListingOptions>,
+}
+
+impl ListingTableConfig {
+    /// Creates new `ListingTableConfig`.  The `SchemaRef` and `ListingOptions` are inferred based on the suffix of the provided `table_path`.
+    pub fn new(
+        object_store: Arc<dyn ObjectStore>,
+        table_path: impl Into<String>,
+    ) -> Self {
+        Self {
+            object_store,
+            table_path: table_path.into(),
+            file_schema: None,
+            options: None,
+        }
+    }
+    /// Add `schema` to `ListingTableConfig`
+    pub fn with_schema(self, schema: SchemaRef) -> Self {
+        Self {
+            object_store: self.object_store,
+            table_path: self.table_path,
+            file_schema: Some(schema),
+            options: self.options,
+        }
+    }
+
+    /// Add `listing_options` to `ListingTableConfig`
+    pub fn with_listing_options(self, listing_options: ListingOptions) -> Self {
+        Self {
+            object_store: self.object_store,
+            table_path: self.table_path,
+            file_schema: self.file_schema,
+            options: Some(listing_options),
+        }
+    }
+
+    fn infer_format(suffix: &str) -> Result<Arc<dyn FileFormat>> {
+        match suffix {
+            "avro" => Ok(Arc::new(AvroFormat::default())),
+            "csv" => Ok(Arc::new(CsvFormat::default())),
+            "json" => Ok(Arc::new(JsonFormat::default())),
+            "parquet" => Ok(Arc::new(ParquetFormat::default())),
+            _ => Err(DataFusionError::Internal(
+                "Unable to infer file type".into(),

Review comment:
       Good idea. I'll update.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1033054340


   As an update, I need to handle the case of inferring file format when a partitioned file / directory is provided.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r800127070



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -40,7 +44,70 @@ use crate::datasource::{
 
 use super::helpers::{expr_applicable_for_cols, pruned_partition_list, split_files};
 
+/// Configuration for creating a 'ListingTable'  
+pub struct ListingTableConfig {
+    pub object_store: Arc<dyn ObjectStore>,
+    pub table_path: String,
+    pub file_schema: Option<SchemaRef>,
+    pub options: Option<ListingOptions>,
+}
+
+impl ListingTableConfig {
+    /// Creates new `ListingTableConfig`.  The `SchemaRef` and `ListingOptions` are inferred based on the suffix of the provided `table_path`.
+    pub async fn new(
+        object_store: Arc<dyn ObjectStore>,
+        table_path: impl Into<String>,
+    ) -> Self {
+        Self {
+            object_store,
+            table_path: table_path.into(),
+            file_schema: None,
+            options: None,
+        }
+    }
+    pub fn with_schema(mut self, schema: SchemaRef) {

Review comment:
       Looks good to me :+1: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r800698423



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -40,7 +44,80 @@ use crate::datasource::{
 
 use super::helpers::{expr_applicable_for_cols, pruned_partition_list, split_files};
 
+/// Configuration for creating a 'ListingTable'  
+pub struct ListingTableConfig {
+    pub object_store: Arc<dyn ObjectStore>,
+    pub table_path: String,
+    pub file_schema: Option<SchemaRef>,
+    pub options: Option<ListingOptions>,
+}
+
+impl ListingTableConfig {
+    /// Creates new `ListingTableConfig`.  The `SchemaRef` and `ListingOptions` are inferred based on the suffix of the provided `table_path`.
+    pub async fn new(
+        object_store: Arc<dyn ObjectStore>,
+        table_path: impl Into<String>,
+    ) -> Self {
+        Self {
+            object_store,
+            table_path: table_path.into(),
+            file_schema: None,
+            options: None,
+        }
+    }
+    pub fn with_schema(self, schema: SchemaRef) -> Self {
+        Self {
+            object_store: self.object_store,
+            table_path: self.table_path,
+            file_schema: Some(schema),
+            options: self.options,
+        }
+    }
+
+    pub fn with_listing_options(self, listing_options: ListingOptions) -> Self {
+        Self {
+            object_store: self.object_store,
+            table_path: self.table_path,
+            file_schema: self.file_schema,
+            options: Some(listing_options),
+        }
+    }
+
+    fn infer_options(&mut self) -> Result<ListingOptions> {

Review comment:
       I wonder if this is meant to be `pub` as well




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1031742701


   @alamb any idea why only one CI check is running?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r800700763



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -118,12 +195,17 @@ impl ListingTable {
     /// The provided `schema` must be resolved before creating the table
     /// and should contain the fields of the file without the table
     /// partitioning columns.
-    pub fn new(
-        object_store: Arc<dyn ObjectStore>,
-        table_path: String,
-        file_schema: SchemaRef,
-        options: ListingOptions,
-    ) -> Self {
+    pub async fn new(mut config: ListingTableConfig) -> Result<Self> {

Review comment:
       I like this API -- it would be cool if we could add some more documentation (either here or in `ListingTableConfig` explaining that the options / schema are inferred if not explicitly created
   
   Also, as this can now return an `Err` perhaps renaming to `pub async fn try_new(...)` would be a more idiomatic / clearer name




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1034093788


   Ok no now it has a conflict!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1034799533


   Thanks again for the good work @matthewmturner 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1031911269


   > 
   > I think it is because the PR has a conflict
   
   Oh, duh, that makes sense.  Sry.  Will fix.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner edited a comment on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner edited a comment on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1032227792


   @alamb do you know of any Rust magic that would allow me to use an async method for `impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode` to get the ballista errors resolved?  `#[async_trait]` doesnt work as it doesnt match the trait definition for `TryInto`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1032256824


   Thinking out loud here. Im wondering if i can move the required async functionality to all be within `ListingTableConfig` and the user can decide if its needed.  Then `ListingTable::new` doesnt need to be async.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#issuecomment-1032528711


   > @alamb do you know of any Rust magic that would allow me to use an async method for impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode to get the ballista errors resolved? 
   
   I do not know 😞 
   
   > Thinking out loud here. Im wondering if i can move the required async functionality to all be within ListingTableConfig and the user can decide if its needed. Then ListingTable::new doesnt need to be async.
   
   I think that sounds like a very good idea to me 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r799968936



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -40,7 +44,70 @@ use crate::datasource::{
 
 use super::helpers::{expr_applicable_for_cols, pruned_partition_list, split_files};
 
+/// Configuration for creating a 'ListingTable'  
+pub struct ListingTableConfig {
+    pub object_store: Arc<dyn ObjectStore>,
+    pub table_path: String,
+    pub file_schema: Option<SchemaRef>,
+    pub options: Option<ListingOptions>,
+}
+
+impl ListingTableConfig {
+    /// Creates new `ListingTableConfig`.  The `SchemaRef` and `ListingOptions` are inferred based on the suffix of the provided `table_path`.
+    pub async fn new(
+        object_store: Arc<dyn ObjectStore>,
+        table_path: impl Into<String>,
+    ) -> Self {
+        Self {
+            object_store,
+            table_path: table_path.into(),
+            file_schema: None,
+            options: None,
+        }
+    }
+    pub fn with_schema(mut self, schema: SchemaRef) {

Review comment:
       If you consume self and return Self, in these APIs, then you will be able to chain them together for a better UX:
   
   ```suggestion
       pub fn with_schema(mut self, schema: SchemaRef) -> Self {
   ```
   
   Something like this:
   
   ```rust
   let cfg = ListingTableConfig::new().with_listing_options().with_scheame();
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r799968936



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -40,7 +44,70 @@ use crate::datasource::{
 
 use super::helpers::{expr_applicable_for_cols, pruned_partition_list, split_files};
 
+/// Configuration for creating a 'ListingTable'  
+pub struct ListingTableConfig {
+    pub object_store: Arc<dyn ObjectStore>,
+    pub table_path: String,
+    pub file_schema: Option<SchemaRef>,
+    pub options: Option<ListingOptions>,
+}
+
+impl ListingTableConfig {
+    /// Creates new `ListingTableConfig`.  The `SchemaRef` and `ListingOptions` are inferred based on the suffix of the provided `table_path`.
+    pub async fn new(
+        object_store: Arc<dyn ObjectStore>,
+        table_path: impl Into<String>,
+    ) -> Self {
+        Self {
+            object_store,
+            table_path: table_path.into(),
+            file_schema: None,
+            options: None,
+        }
+    }
+    pub fn with_schema(mut self, schema: SchemaRef) {

Review comment:
       If you consume self and return Self, in these APIs, then you will be able to chain them together for a better UX:
   
   ```suggestion
       pub fn with_schema(self, schema: SchemaRef) -> Self {
   ```
   
   Something like this:
   
   ```rust
   let cfg = ListingTableConfig::new().with_listing_options().with_scheame();
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on a change in pull request #1715: Create ListingTableConfig which includes file format and schema inference

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on a change in pull request #1715:
URL: https://github.com/apache/arrow-datafusion/pull/1715#discussion_r800753188



##########
File path: datafusion/src/datasource/listing/table.rs
##########
@@ -118,12 +195,17 @@ impl ListingTable {
     /// The provided `schema` must be resolved before creating the table
     /// and should contain the fields of the file without the table
     /// partitioning columns.
-    pub fn new(
-        object_store: Arc<dyn ObjectStore>,
-        table_path: String,
-        file_schema: SchemaRef,
-        options: ListingOptions,
-    ) -> Self {
+    pub async fn new(mut config: ListingTableConfig) -> Result<Self> {

Review comment:
       yes was planning on adding docs once i was sure API was good.  and will update to `try_new`!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org