Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/29 23:04:14 UTC

[GitHub] [arrow-datafusion] matthewmturner opened a new issue #1705: Simplify creating new `ListingTable`

matthewmturner opened a new issue #1705:
URL: https://github.com/apache/arrow-datafusion/issues/1705


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I think we can simplify creating a `ListingTable` by inferring what we can from the inputs and using reasonable defaults for `ListingOptions` and `Schema`.
   
   **Describe the solution you'd like**
   I would like to update the signature to go from:
   
   ```
   ListingTable::new(object_store: Arc<dyn ObjectStore>, table_path: String, file_schema: SchemaRef, options: ListingOptions)
   ```
   
   to
   
   ```
   ListingTable::new(object_store: Arc<dyn ObjectStore>, table_path: String, file_schema: Option<SchemaRef>, options: Option<ListingOptions>)
   ```
   
   Then, we can look at the suffix of the `table_path` to infer a file type and use that for the `format` and `file_extension` parameters of `ListingOptions`.  We could use the below as defaults for the other parameters:
   
   ```
   let listing_options = ListingOptions {
               format: Arc::new(ParquetFormat::default()), // derived from ListingTable::new(table_path, ..)
               collect_stat: true,
               file_extension: "parquet".to_owned(), // derived from ListingTable::new(table_path, ..)
               target_partitions: num_cpus::get(),
               table_partition_cols: vec![],
           };
   ```
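   
   As a rough sketch of the suffix inference described above: the `InferredFormat` enum and the `infer_format` function are hypothetical names used only for illustration, not DataFusion APIs, and the list of recognized extensions is an assumption.
   
   ```rust
   use std::path::Path;
   
   /// Hypothetical stand-in for the file formats that could be inferred.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   enum InferredFormat {
       Parquet,
       Csv,
       Json,
       Avro,
   }
   
   /// Infer a format and `file_extension` from the suffix of `table_path`.
   /// Returns `None` when the path has no extension or it is not recognized.
   fn infer_format(table_path: &str) -> Option<(InferredFormat, String)> {
       // `Path::extension` yields the part after the last dot of the file name.
       let ext = Path::new(table_path)
           .extension()?
           .to_str()?
           .to_ascii_lowercase();
       let format = match ext.as_str() {
           "parquet" => InferredFormat::Parquet,
           "csv" => InferredFormat::Csv,
           "json" => InferredFormat::Json,
           "avro" => InferredFormat::Avro,
           _ => return None,
       };
       Some((format, ext))
   }
   ```
   
   With this sketch a compound suffix such as `.snappy.parquet` resolves to `parquet`, while a path without a suffix (for example a directory of files) yields `None`, so the caller would still need to pass explicit options in that case.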
   
   We could then use `listing_options` to infer a `Schema`:
   
   ```
   // object_store from ListingTable::new(object_store, table_path, ..)
   let resolved_schema = listing_options.infer_schema(object_store.clone(), filename).await?;
   ```
   
   The end result is a much simpler interface for creating tables.
   
   Old
   ```
   let filename = "data/alltypes_plain.snappy.parquet";
   
   let listing_options = ListingOptions {
       format: Arc::new(ParquetFormat::default()),
       collect_stat: true,
       file_extension: "parquet".to_owned(),
       target_partitions: num_cpus::get(),
       table_partition_cols: vec![],
   };
   
   let resolved_schema = listing_options
       .infer_schema(object_store.clone(), filename)
       .await?;
   
   let table = ListingTable::new(
       object_store,
       filename.to_owned(),
       resolved_schema,
       listing_options,
   );
   ```
   
   New
   ```
   let filename = "data/alltypes_plain.snappy.parquet";
   
   let table = ListingTable::new(
       object_store,
       filename.to_owned(),
       None,
       None,
   );
   ```
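   
   One way the `None` passed for `options` might be turned into the defaults listed earlier is sketched below. The `ListingOptions` struct here is a simplified stand-in (the `format` field is omitted), `resolve_options` is a hypothetical helper, and `std::thread::available_parallelism` is swapped in for `num_cpus::get()` so the example needs only the standard library.
   
   ```rust
   use std::thread;
   
   /// Simplified stand-in for DataFusion's `ListingOptions` (no `format` field).
   #[derive(Debug, Clone, PartialEq)]
   struct ListingOptions {
       collect_stat: bool,
       file_extension: String,
       target_partitions: usize,
       table_partition_cols: Vec<String>,
   }
   
   /// Hypothetical helper: keep the caller's options if given,
   /// otherwise build the proposed defaults for the inferred extension.
   fn resolve_options(options: Option<ListingOptions>, file_extension: &str) -> ListingOptions {
       options.unwrap_or_else(|| ListingOptions {
           collect_stat: true,
           file_extension: file_extension.to_owned(),
           // Stand-in for `num_cpus::get()`; falls back to 1 if the count is unavailable.
           target_partitions: thread::available_parallelism()
               .map(|n| n.get())
               .unwrap_or(1),
           table_partition_cols: vec![],
       })
   }
   ```
   
   Explicitly supplied options pass through unchanged, so existing callers would only need to wrap their current arguments in `Some(..)`.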
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1705: Simplify creating new `ListingTable`

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #1705:
URL: https://github.com/apache/arrow-datafusion/issues/1705#issuecomment-1025175512


   @alamb  great idea. Will do that. 





[GitHub] [arrow-datafusion] alamb commented on issue #1705: Simplify creating new `ListingTable`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1705:
URL: https://github.com/apache/arrow-datafusion/issues/1705#issuecomment-1025115375


   > @alamb @houqp @seddonm1 what do you think about this proposal?
   
   
   I think the use case of defaulting the schema and format makes a lot of sense.
   
   If you are going to change the signature of `ListingTable::new`, perhaps it would be worth using a builder / config (so that future changes don't require another signature change):
   
   Something like 
   ```rust
   struct ListingTableConfig {
     object_store: Arc<dyn ObjectStore>, 
     table_path: String, 
     file_schema: Option<SchemaRef>, 
     options: Option<ListingOptions>
   }
   
   impl ListingTableConfig {
     fn new(object_store: Arc<dyn ObjectStore>, table_path: impl Into<String>) -> Self {
       ..
     }
   
     fn with_schema(mut self, schema: SchemaRef) -> Self {
     ...
     }
   
     fn with_options(mut self, listing_options: ListingOptions) -> Self {
     ..
     }
   }
   
   ```
   
   And then your example could look like
   
   ```rust
   let config = ListingTableConfig::new(
     object_store,
     "data/alltypes_plain.snappy.parquet"
   );
     
   
   let table = ListingTable::new(config);
   ```
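   
   For reference, the elided method bodies could be filled in roughly as follows. This is only a sketch under stated assumptions: `ObjectStore`, `LocalStore`, `Schema` and `ListingOptions` are minimal stand-ins defined here so the example compiles without DataFusion, and the bodies show one plausible builder implementation rather than the code that was eventually merged.
   
   ```rust
   use std::sync::Arc;
   
   // Hypothetical stand-ins for the DataFusion types named in the struct above.
   trait ObjectStore {}
   struct LocalStore;
   impl ObjectStore for LocalStore {}
   
   #[derive(Debug, Clone, PartialEq)]
   struct Schema {
       fields: Vec<String>,
   }
   type SchemaRef = Arc<Schema>;
   
   #[derive(Debug, Clone, PartialEq)]
   struct ListingOptions {
       file_extension: String,
       collect_stat: bool,
   }
   
   #[allow(dead_code)] // `object_store` is stored but never read in this sketch
   struct ListingTableConfig {
       object_store: Arc<dyn ObjectStore>,
       table_path: String,
       file_schema: Option<SchemaRef>,
       options: Option<ListingOptions>,
   }
   
   impl ListingTableConfig {
       /// Start with only the required arguments; everything else is `None`
       /// so that it can be inferred or defaulted later.
       fn new(object_store: Arc<dyn ObjectStore>, table_path: impl Into<String>) -> Self {
           Self {
               object_store,
               table_path: table_path.into(),
               file_schema: None,
               options: None,
           }
       }
   
       /// Override schema inference with an explicit schema.
       fn with_schema(mut self, schema: SchemaRef) -> Self {
           self.file_schema = Some(schema);
           self
       }
   
       /// Override the default listing options.
       fn with_options(mut self, listing_options: ListingOptions) -> Self {
           self.options = Some(listing_options);
           self
       }
   }
   ```
   
   Because each `with_*` method takes and returns `self`, calls can be chained, and a new optional setting can be added later as another method without changing `new`.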
   
   





[GitHub] [arrow-datafusion] houqp commented on issue #1705: Simplify creating new `ListingTable`

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #1705:
URL: https://github.com/apache/arrow-datafusion/issues/1705#issuecomment-1025199602


   Both your idea and @alamb's suggestion of a builder pattern sound like a good plan to me :+1: Thank you for bringing this up.





[GitHub] [arrow-datafusion] matthewmturner commented on issue #1705: Simplify creating new `ListingTable`

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #1705:
URL: https://github.com/apache/arrow-datafusion/issues/1705#issuecomment-1025003439


   @alamb @houqp @seddonm1 what do you think about this proposal?





[GitHub] [arrow-datafusion] alamb closed issue #1705: Simplify creating new `ListingTable`

Posted by GitBox <gi...@apache.org>.
alamb closed issue #1705:
URL: https://github.com/apache/arrow-datafusion/issues/1705


   

