You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/27 04:46:06 UTC

[GitHub] [arrow-datafusion] jychen7 opened a new pull request #2099: 2061 create external table ddl table partition cols

jychen7 opened a new pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099


   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/2061
   
    # Rationale for this change
   support partition pruning using CreateExternalTable DDL
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   No change if user is using SQL.
   However, there is change in `context.read_parquet` params, receiving a ParquetReadOptions for `table_partition_cols`. It is more align with other file format, since Csv, Avro, Json already have their own ReadOptions
   
   # Manual Test in datafusion-cli
   
   create tmp sample file
   ```
   echo "1,2" > tmp/year=2022/data.csv
   echo "3,4" > tmp/year=2021/data.csv
   ```
   
   run in datafusion-cli
   ```
   ❯ CREATE EXTERNAL TABLE t1 (a INT, b INT) STORED AS CSV LOCATION 'tmp';
   0 rows in set. Query took 0.002 seconds.
   ❯ select * from t1;
   +---+---+
   | a | b |
   +---+---+
   | 3 | 4 |
   | 1 | 2 |
   +---+---+
   2 rows in set. Query took 0.012 seconds.
   ❯ CREATE EXTERNAL TABLE t2 (a INT, b INT) STORED AS CSV PARTITIONED BY (year) LOCATION 'tmp';
   0 rows in set. Query took 0.001 seconds.
   ❯ select * from t2;
   +---+---+------+
   | a | b | year |
   +---+---+------+
   | 1 | 2 | 2022 |
   | 3 | 4 | 2021 |
   +---+---+------+
   2 rows in set. Query took 0.011 seconds.
   ❯ select * from t2 where year = '2022';
   +---+---+------+
   | a | b | year |
   +---+---+------+
   | 1 | 2 | 2022 |
   +---+---+------+
   1 row in set. Query took 0.013 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r841080493



##########
File path: datafusion/core/src/sql/parser.rs
##########
@@ -192,6 +194,35 @@ impl<'a> DFParser<'a> {
         }
     }
 
+    fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
+        let mut partitions: Vec<String> = vec![];
+        if !self.parser.consume_token(&Token::LParen)
+            || self.parser.consume_token(&Token::RParen)
+        {
+            return Ok(partitions);
+        }
+
+        loop {

Review comment:
       https://github.com/apache/arrow-datafusion/pull/2099#discussion_r835851145
   
   yes, it maybe possible. Or maybe we can expose this as public method in `sqlparser` crate?
   (agree not needed for this PR)
   
   > This is a copy of the equivalent implementation in sqlparser




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r835968916



##########
File path: datafusion/src/execution/context.rs
##########
@@ -549,20 +558,19 @@ impl SessionContext {
 
     /// Registers a Parquet data source so that it can be referenced from SQL statements
     /// executed against this context.
-    pub async fn register_parquet(&mut self, name: &str, uri: &str) -> Result<()> {
-        let (target_partitions, enable_pruning) = {
+    pub async fn register_parquet(
+        &mut self,
+        name: &str,
+        uri: &str,
+        options: ParquetReadOptions<'_>,
+    ) -> Result<()> {
+        let (target_partitions, parquet_pruning) = {
             let conf = self.copied_config();
             (conf.target_partitions, conf.parquet_pruning)
         };
-        let file_format = ParquetFormat::default().with_enable_pruning(enable_pruning);
-
-        let listing_options = ListingOptions {
-            format: Arc::new(file_format),
-            collect_stat: true,
-            file_extension: DEFAULT_PARQUET_EXTENSION.to_owned(),
-            target_partitions,
-            table_partition_cols: vec![],
-        };
+        let listing_options = options
+            .parquet_pruning(parquet_pruning)

Review comment:
       this is for following test use
   
   https://github.com/apache/arrow-datafusion/blob/a09e1aeb5fa279e2a14554c3dad9dfb17d9326e7/datafusion/tests/parquet_pruning.rs#L163-L174
   
   disable parquet pruning has one use case from https://github.com/apache/arrow-datafusion/issues/723




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r835850870



##########
File path: datafusion/src/execution/options.rs
##########
@@ -21,14 +21,18 @@ use std::sync::Arc;
 
 use arrow::datatypes::{Schema, SchemaRef};
 
-use crate::datasource::file_format::json::DEFAULT_JSON_EXTENSION;
 use crate::datasource::{
-    file_format::{avro::AvroFormat, csv::CsvFormat, json::JsonFormat},
+    file_format::{
+        avro::{AvroFormat, DEFAULT_AVRO_EXTENSION},
+        csv::{CsvFormat, DEFAULT_CSV_EXTENSION},
+        json::{JsonFormat, DEFAULT_JSON_EXTENSION},
+        parquet::{ParquetFormat, DEFAULT_PARQUET_EXTENSION},
+    },
     listing::ListingOptions,
 };
 
 /// CSV file read option
-#[derive(Copy, Clone)]

Review comment:
       Arvo and Json read options do not use Copy trait. Removing it allows to use `Vec<String>`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r835851145



##########
File path: datafusion/src/sql/parser.rs
##########
@@ -192,6 +194,35 @@ impl<'a> DFParser<'a> {
         }
     }
 
+    fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {

Review comment:
       similar to `parse_columns`. I didn't reuse `parse_columns` because partitions only have identifier/name, there is no type or other options like column
   
   https://github.com/apache/arrow-datafusion/blob/f50e8000ee4de8432d7b59b63c18a8e0171a072b/datafusion/src/sql/parser.rs#L227-L229




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#issuecomment-1082497603


   will rebase, because of the whole folder re-org in #2081


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r841079008



##########
File path: datafusion/core/src/sql/parser.rs
##########
@@ -277,6 +308,12 @@ impl<'a> DFParser<'a> {
 
         let has_header = self.parse_csv_has_header();
 
+        let has_partition = self.parse_has_partition();
+        let mut table_partition_cols: Vec<String> = vec![];
+        if has_partition {
+            table_partition_cols = self.parse_partitions()?;
+        }

Review comment:
       thanks, that's nice




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#issuecomment-1086833595


   Thanks again @jychen7 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r835968916



##########
File path: datafusion/src/execution/context.rs
##########
@@ -549,20 +558,19 @@ impl SessionContext {
 
     /// Registers a Parquet data source so that it can be referenced from SQL statements
     /// executed against this context.
-    pub async fn register_parquet(&mut self, name: &str, uri: &str) -> Result<()> {
-        let (target_partitions, enable_pruning) = {
+    pub async fn register_parquet(
+        &mut self,
+        name: &str,
+        uri: &str,
+        options: ParquetReadOptions<'_>,
+    ) -> Result<()> {
+        let (target_partitions, parquet_pruning) = {
             let conf = self.copied_config();
             (conf.target_partitions, conf.parquet_pruning)
         };
-        let file_format = ParquetFormat::default().with_enable_pruning(enable_pruning);
-
-        let listing_options = ListingOptions {
-            format: Arc::new(file_format),
-            collect_stat: true,
-            file_extension: DEFAULT_PARQUET_EXTENSION.to_owned(),
-            target_partitions,
-            table_partition_cols: vec![],
-        };
+        let listing_options = options
+            .parquet_pruning(parquet_pruning)

Review comment:
       this is for following test use, though not sure when we want to disable pruning explicitly yet
   
   https://github.com/apache/arrow-datafusion/blob/a09e1aeb5fa279e2a14554c3dad9dfb17d9326e7/datafusion/tests/parquet_pruning.rs#L163-L174




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r835851028



##########
File path: datafusion/src/execution/options.rs
##########
@@ -115,32 +128,81 @@ impl<'a> CsvReadOptions<'a> {
             collect_stat: false,
             file_extension: self.file_extension.to_owned(),
             target_partitions,
+            table_partition_cols: self.table_partition_cols.clone(),
+        }
+    }
+}
+
+/// Parquet read options
+#[derive(Clone)]
+pub struct ParquetReadOptions<'a> {
+    /// File extension; only files with this extension are selected for data input.
+    /// Defaults to ".parquet".
+    pub file_extension: &'a str,
+    // Partition Columns
+    pub table_partition_cols: Vec<String>,
+}
+
+impl<'a> Default for ParquetReadOptions<'a> {
+    fn default() -> Self {
+        Self {
+            file_extension: DEFAULT_PARQUET_EXTENSION,
             table_partition_cols: vec![],
         }
     }
 }
 
+impl<'a> ParquetReadOptions<'a> {
+    /// Specify table_partition_cols for partition pruning
+    pub fn table_partition_cols(mut self, table_partition_cols: Vec<String>) -> Self {
+        self.table_partition_cols = table_partition_cols;
+        self
+    }
+
+    /// Helper to convert these user facing options to `ListingTable` options
+    pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
+        let file_format = ParquetFormat::default();
+
+        ListingOptions {
+            format: Arc::new(file_format),
+            collect_stat: true,

Review comment:
       this `true` align with the old code in `datafusion/src/logical_plan/builder.rs`
   
   https://github.com/apache/arrow-datafusion/blob/81592947e8814327ebdbd1fbc3d4a090796e37a3/datafusion/src/logical_plan/builder.rs#L284-L290




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 edited a comment on pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 edited a comment on pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#issuecomment-1082497603


   will merge master or rebase, because of the whole folder re-org in #2081


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r841202002



##########
File path: datafusion/src/sql/parser.rs
##########
@@ -192,6 +194,35 @@ impl<'a> DFParser<'a> {
         }
     }
 
+    fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {

Review comment:
       sorry missed this comment yesterday 👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r841078677



##########
File path: datafusion/core/src/execution/context.rs
##########
@@ -548,20 +560,19 @@ impl SessionContext {
 
     /// Registers a Parquet data source so that it can be referenced from SQL statements
     /// executed against this context.
-    pub async fn register_parquet(&self, name: &str, uri: &str) -> Result<()> {
-        let (target_partitions, enable_pruning) = {
+    pub async fn register_parquet(
+        &self,
+        name: &str,
+        uri: &str,
+        options: ParquetReadOptions<'_>,
+    ) -> Result<()> {
+        let (target_partitions, parquet_pruning) = {

Review comment:
       agree, https://github.com/apache/arrow-datafusion/issues/2138 is created




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r841063796



##########
File path: datafusion/core/src/execution/context.rs
##########
@@ -548,20 +560,19 @@ impl SessionContext {
 
     /// Registers a Parquet data source so that it can be referenced from SQL statements
     /// executed against this context.
-    pub async fn register_parquet(&self, name: &str, uri: &str) -> Result<()> {
-        let (target_partitions, enable_pruning) = {
+    pub async fn register_parquet(
+        &self,
+        name: &str,
+        uri: &str,
+        options: ParquetReadOptions<'_>,
+    ) -> Result<()> {
+        let (target_partitions, parquet_pruning) = {

Review comment:
       Maybe we should eventually move the `parquet_pruning` option out of `SessionContext` and into the `ParquetReadOptions` structure. As a follow on PR

##########
File path: datafusion/core/src/execution/options.rs
##########
@@ -115,7 +128,57 @@ impl<'a> CsvReadOptions<'a> {
             collect_stat: false,
             file_extension: self.file_extension.to_owned(),
             target_partitions,
+            table_partition_cols: self.table_partition_cols.clone(),
+        }
+    }
+}
+
+/// Parquet read options
+#[derive(Clone)]
+pub struct ParquetReadOptions<'a> {
+    /// File extension; only files with this extension are selected for data input.
+    /// Defaults to ".parquet".
+    pub file_extension: &'a str,
+    /// Partition Columns
+    pub table_partition_cols: Vec<String>,
+    /// Should DataFusion parquet reader using the predicate to prune data, following execution::context::SessionConfig
+    pub parquet_pruning: bool,
+}
+
+impl<'a> Default for ParquetReadOptions<'a> {
+    fn default() -> Self {
+        Self {
+            file_extension: DEFAULT_PARQUET_EXTENSION,
             table_partition_cols: vec![],
+            parquet_pruning: ParquetFormat::default().enable_pruning(),
+        }
+    }
+}
+
+impl<'a> ParquetReadOptions<'a> {
+    /// Specify parquet_pruning
+    pub fn parquet_pruning(mut self, parquet_pruning: bool) -> Self {
+        self.parquet_pruning = parquet_pruning;
+        self
+    }
+
+    /// Specify table_partition_cols for partition pruning
+    pub fn table_partition_cols(mut self, table_partition_cols: Vec<String>) -> Self {
+        self.table_partition_cols = table_partition_cols;
+        self
+    }
+
+    /// Helper to convert these user facing options to `ListingTable` options
+    pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {

Review comment:
       👍 

##########
File path: datafusion/core/src/sql/parser.rs
##########
@@ -192,6 +194,35 @@ impl<'a> DFParser<'a> {
         }
     }
 
+    fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
+        let mut partitions: Vec<String> = vec![];
+        if !self.parser.consume_token(&Token::LParen)
+            || self.parser.consume_token(&Token::RParen)
+        {
+            return Ok(partitions);
+        }
+
+        loop {

Review comment:
       Given this code to parse a comma separated list is duplicated in `parse_columns` below, perhaps we could refactor into a common function to reduce the replication  -- not needed for this PR though

##########
File path: datafusion/core/src/sql/parser.rs
##########
@@ -277,6 +308,12 @@ impl<'a> DFParser<'a> {
 
         let has_header = self.parse_csv_has_header();
 
+        let has_partition = self.parse_has_partition();
+        let mut table_partition_cols: Vec<String> = vec![];
+        if has_partition {
+            table_partition_cols = self.parse_partitions()?;
+        }

Review comment:
       ```suggestion
           let table_partition_cols = if self.parse_has_partition() {
             self.parse_partitions()?;
           } else {
             vec![]
           };
   ```
   
   If you wanted to avoid a `mut`

##########
File path: datafusion/core/src/execution/options.rs
##########
@@ -43,8 +47,10 @@ pub struct CsvReadOptions<'a> {
     /// Max number of rows to read from CSV files for schema inference if needed. Defaults to 1000.
     pub schema_infer_max_records: usize,
     /// File extension; only files with this extension are selected for data input.
-    /// Defaults to ".csv".
+    /// Defaults to DEFAULT_CSV_EXTENSION.

Review comment:
       👍 

##########
File path: datafusion/core/src/logical_plan/builder.rs
##########
@@ -274,21 +275,12 @@ impl LogicalPlanBuilder {
     pub async fn scan_parquet_with_name(
         object_store: Arc<dyn ObjectStore>,
         path: impl Into<String>,
+        options: ParquetReadOptions<'_>,
         projection: Option<Vec<usize>>,
         target_partitions: usize,
         table_name: impl Into<String>,
     ) -> Result<Self> {
-        // TODO remove hard coded enable_pruning

Review comment:
       ❤️ 

##########
File path: docs/source/user-guide/sql/ddl.md
##########
@@ -55,6 +55,21 @@ WITH HEADER ROW
 LOCATION '/path/to/aggregate_test_100.csv';
 ```
 
+If data sources are already partitioned in Hive style, `PARTITIONED BY` can be used for partition pruning.

Review comment:
       ❤️ 

##########
File path: datafusion/core/src/execution/options.rs
##########
@@ -115,7 +128,57 @@ impl<'a> CsvReadOptions<'a> {
             collect_stat: false,
             file_extension: self.file_extension.to_owned(),
             target_partitions,
+            table_partition_cols: self.table_partition_cols.clone(),
+        }
+    }
+}
+
+/// Parquet read options
+#[derive(Clone)]
+pub struct ParquetReadOptions<'a> {
+    /// File extension; only files with this extension are selected for data input.
+    /// Defaults to ".parquet".
+    pub file_extension: &'a str,
+    /// Partition Columns
+    pub table_partition_cols: Vec<String>,
+    /// Should DataFusion parquet reader using the predicate to prune data, following execution::context::SessionConfig

Review comment:
       ```suggestion
       /// Should DataFusion parquet reader using the predicate to prune data, 
       /// overridden by value on execution::context::SessionConfig
   ```
   
   Maybe after this PR is merged, we can remove the option on `SessionConfig`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on pull request #2099: feat: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#issuecomment-1086644966


   > Hopefully fixing up the merge conflicts won't be too bad
   
   thanks @alamb , if you mean merge master because of https://github.com/apache/arrow-datafusion/pull/2081, it is actually super easy (than I thought), since https://github.com/apache/arrow-datafusion/pull/2081 is just rename and this PR didn't introduce any new file


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r835850884



##########
File path: datafusion/src/execution/options.rs
##########
@@ -61,7 +67,8 @@ impl<'a> CsvReadOptions<'a> {
             schema: None,
             schema_infer_max_records: 1000,
             delimiter: b',',
-            file_extension: ".csv",
+            file_extension: DEFAULT_CSV_EXTENSION,

Review comment:
       replace with same constant




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jychen7 commented on a change in pull request #2099: 2061 create external table ddl table partition cols

Posted by GitBox <gi...@apache.org>.
jychen7 commented on a change in pull request #2099:
URL: https://github.com/apache/arrow-datafusion/pull/2099#discussion_r835968916



##########
File path: datafusion/src/execution/context.rs
##########
@@ -549,20 +558,19 @@ impl SessionContext {
 
     /// Registers a Parquet data source so that it can be referenced from SQL statements
     /// executed against this context.
-    pub async fn register_parquet(&mut self, name: &str, uri: &str) -> Result<()> {
-        let (target_partitions, enable_pruning) = {
+    pub async fn register_parquet(
+        &mut self,
+        name: &str,
+        uri: &str,
+        options: ParquetReadOptions<'_>,
+    ) -> Result<()> {
+        let (target_partitions, parquet_pruning) = {
             let conf = self.copied_config();
             (conf.target_partitions, conf.parquet_pruning)
         };
-        let file_format = ParquetFormat::default().with_enable_pruning(enable_pruning);
-
-        let listing_options = ListingOptions {
-            format: Arc::new(file_format),
-            collect_stat: true,
-            file_extension: DEFAULT_PARQUET_EXTENSION.to_owned(),
-            target_partitions,
-            table_partition_cols: vec![],
-        };
+        let listing_options = options
+            .parquet_pruning(parquet_pruning)

Review comment:
       this is for following test use
   
   https://github.com/apache/arrow-datafusion/blob/a09e1aeb5fa279e2a14554c3dad9dfb17d9326e7/datafusion/tests/parquet_pruning.rs#L163-L174
   
   disable pruning has one use case from https://github.com/apache/arrow-datafusion/issues/723




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org