Posted to github@arrow.apache.org by "Asura7969 (via GitHub)" <gi...@apache.org> on 2023/12/16 14:22:06 UTC

[PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Asura7969 opened a new pull request, #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565

   ## Which issue does this PR close?
   
   
   Closes #8524.
   
   ## Rationale for this change
   
   Makes the behavior consistent with DuckDB and Hive.
   
   ## What changes are included in this PR?
   
   Add the config option `ignore_subdirectory`.
   It controls whether files in subdirectories are ignored when scanning file paths. They are ignored by default (`true`).
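
   As a rough illustration of the behavior described above, the check can be sketched with the standard library alone. The helper name `keep_file` and its string-based signature are assumptions made for this example; this is not the code in `ListingTableUrl::contains`, and glob patterns are not handled.

   ```rust
   /// Hypothetical helper (not DataFusion's API): decides whether `file_path`
   /// belongs to a table located at `table_dir`.
   fn keep_file(table_dir: &str, file_path: &str, ignore_subdirectory: bool) -> bool {
       let dir = table_dir.trim_end_matches('/');
       // The file must sit below `table_dir`; "test_table2/x" must not match "test_table".
       let relative = match file_path.strip_prefix(dir).and_then(|r| r.strip_prefix('/')) {
           Some(r) if !r.is_empty() => r,
           _ => return false,
       };
       // A remaining '/' means the file lives in a nested directory.
       let in_subdirectory = relative.contains('/');
       !(ignore_subdirectory && in_subdirectory)
   }
   ```

   With the default (`true`), `keep_file("test_table/", "test_table/subdir/3.parquet", true)` rejects the nested file, while passing `false` accepts it.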
   
   
   ## Are these changes tested?
   
   `test_prefix_path`
   
   ## Are there any user-facing changes?
   
   The `ListingTableUrl::contains` method takes additional parameters.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1428847968


##########
datafusion/core/src/execution/context/parquet.rs:
##########
@@ -109,7 +109,7 @@ mod tests {
             .read_parquet(
                 // it was reported that when a path contains // (two consecutive separator) no files were found
                 // in this test, regardless of parquet_test_data() value, our path now contains a //
-                format!("{}/..//*/alltypes_plain*.parquet", parquet_test_data()),

Review Comment:
   Should we add regex matching to directories?🤔





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1433336655


##########
datafusion/sqllogictest/test_files/csv_files.slt:
##########
@@ -63,3 +63,60 @@ id6 value"6
 id7 value"7
 id8 value"8
 id9 value"9
+
+
+# When reading a partitioned table, the `listing_table_ignore_subdirectory` configuration will be invalid
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = false;
+
+statement ok
+CREATE EXTERNAL TABLE partition_csv_table (
+  name VARCHAR,
+  ts TIMESTAMP,
+  c_date DATE,
+)
+STORED AS CSV
+PARTITIONED BY (c_date)
+LOCATION '../core/tests/data/partitioned_table';
+
+query I
+select count(*) from partition_csv_table;
+----
+4
+
+statement ok
+DROP TABLE partition_csv_table
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = true;

Review Comment:
   When reading a partitioned table, `listing_table_ignore_subdirectory` is always treated as false, even if it is set to true.
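
   One way to picture this rule is as a function of the configured value and the table's partition columns. The function below is a hypothetical sketch written for this note, not the logic in the PR.

   ```rust
   /// Hypothetical sketch: the value the listing code would effectively act on.
   fn effective_ignore_subdirectory(configured: bool, partition_columns: &[&str]) -> bool {
       // Partition values live in nested directories such as `c_date=2018-11-13/`,
       // so a partitioned table has to descend into them regardless of the setting.
       configured && partition_columns.is_empty()
   }
   ```

   For the `partition_csv_table` case, `effective_ignore_subdirectory(true, &["c_date"])` evaluates to `false`, which matches the count of 4 in both runs.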





Re: [PR] Add new configuration item `listing_table_ignore_subdirectory` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565




Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1431835143


##########
datafusion/core/src/datasource/listing/url.rs:
##########
@@ -424,6 +433,13 @@ mod tests {
         let b = ListingTableUrl::parse("../bar/./foo/../baz").unwrap();
         assert_eq!(a, b);
         assert!(a.prefix.as_ref().ends_with("bar/baz"));
+
+        let url = ListingTableUrl::parse("../foo/*.parquet").unwrap();

Review Comment:
   I am probably missing something here, but how does this test the new code? I don't see it passing in `ignore_subdirectory`



##########
datafusion/core/src/execution/context/parquet.rs:
##########
@@ -109,7 +109,7 @@ mod tests {
             .read_parquet(
                 // it was reported that when a path contains // (two consecutive separator) no files were found
                 // in this test, regardless of parquet_test_data() value, our path now contains a //
-                format!("{}/..//*/alltypes_plain*.parquet", parquet_test_data()),

Review Comment:
   I don't fully understand this question



##########
datafusion/sqllogictest/test_files/information_schema.slt:
##########
@@ -150,6 +150,7 @@ datafusion.execution.aggregate.scalar_update_factor 10
 datafusion.execution.batch_size 8192
 datafusion.execution.coalesce_batches true
 datafusion.execution.collect_statistics false
+datafusion.execution.ignore_subdirectory true

Review Comment:
   Could we use a name that gives some context about when the `ignore_subdirectory` is actually used?
   
   For example, something like `listing_table_ignore_subdirectory`. Or maybe it is even time to create a whole new category of configuration for listing tables, such as `datafusion.listing_table.ignore_subdirectory` 🤔
   
   





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1432083474


##########
datafusion/core/src/datasource/listing/url.rs:
##########
@@ -424,6 +433,13 @@ mod tests {
         let b = ListingTableUrl::parse("../bar/./foo/../baz").unwrap();
         assert_eq!(a, b);
         assert!(a.prefix.as_ref().ends_with("bar/baz"));
+
+        let url = ListingTableUrl::parse("../foo/*.parquet").unwrap();

Review Comment:
   Yes, it's really not obvious here (the change is actually in `ListingTableUrl::contains`). I will add a sqllogictest as you suggested.





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1433252219


##########
datafusion/sqllogictest/test_files/parquet.slt:
##########
@@ -276,6 +276,118 @@ LIMIT 10;
 0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC"))
 0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC"))
 
+# test for
+query ITID
+COPY (SELECT * FROM src_table WHERE int_col > 6 LIMIT 3)
+TO 'test_files/scratch/parquet/test_table/subdir/3.parquet'
+(FORMAT PARQUET, SINGLE_FILE_OUTPUT true);
+----
+3
+
+# Test config ignore_subdirectory:
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = true;
+
+statement ok
+CREATE EXTERNAL TABLE t1_ignore_subdirectory
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION 'test_files/scratch/parquet/test_table/*.parquet';
+
+query TT
+explain select count(*) from t1_ignore_subdirectory;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: t1_ignore_subdirectory projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}
+
+statement ok
+CREATE EXTERNAL TABLE t2_ignore_subdirectory
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION 'test_files/scratch/parquet/test_table/';
+
+query TT
+explain select count(*) from t2_ignore_subdirectory;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: t2_ignore_subdirectory projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}
+
+# scan file: 0.parquet 1.parquet 2.parquet
+
+query I
+select count(*) from t1_ignore_subdirectory;
+----
+9
+
+query I
+select count(*) from t2_ignore_subdirectory;
+----
+9
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = false;
+
+statement ok
+CREATE EXTERNAL TABLE t1_with_subdirectory
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION 'test_files/scratch/parquet/test_table/*.parquet';
+
+query TT
+explain select count(*) from t1_with_subdirectory;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: t1_with_subdirectory projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/subdir/3.parquet]]}
+
+
+statement ok
+CREATE EXTERNAL TABLE t2_with_subdirectory
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION 'test_files/scratch/parquet/test_table/';
+
+query TT
+explain select count(*) from t2_with_subdirectory;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: t2_with_subdirectory projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/subdir/3.parquet]]}
+
+# scan file: 0.parquet 1.parquet 2.parquet 3.parquet
+query I
+select count(*) from t1_with_subdirectory;
+----
+12
+
+query I
+select count(*) from t2_with_subdirectory;

Review Comment:
   It is cool to see the different rows but I don't understand the need for all the different tables and explain plans
   
   I think we can get coverage by simply creating the equivalent of `t2_with_subdirectory` and showing that it returns 12 rows when
   
   ```sql
   set datafusion.execution.listing_table_ignore_subdirectory = false;
   ```
   
   And 9 when 
   
   ```sql
   set datafusion.execution.listing_table_ignore_subdirectory = true;
   ```
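
   The expected counts can be checked with a small stand-alone model. The per-file row counts for `0.parquet`, `1.parquet` and `2.parquet` are assumed to be 3 each for illustration (the thread only gives the totals, 9 and 12, and the 3 rows copied into `subdir/3.parquet`); `example_files` and `visible_rows` are hypothetical names.

   ```rust
   /// Assumed layout: (path relative to the table directory, row count).
   fn example_files() -> Vec<(&'static str, usize)> {
       vec![
           ("0.parquet", 3),
           ("1.parquet", 3),
           ("2.parquet", 3),
           ("subdir/3.parquet", 3),
       ]
   }

   /// Sums the rows of every file that the scan would keep.
   fn visible_rows(files: &[(&str, usize)], ignore_subdirectory: bool) -> usize {
       files
           .iter()
           // Drop nested files only when the flag is set.
           .filter(|(relative_path, _)| !(ignore_subdirectory && relative_path.contains('/')))
           .map(|(_, rows)| *rows)
           .sum()
   }
   ```

   Under these assumptions, `visible_rows(&example_files(), true)` is 9 and `visible_rows(&example_files(), false)` is 12.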
   



##########
datafusion/sqllogictest/test_files/csv_files.slt:
##########
@@ -63,3 +63,60 @@ id6 value"6
 id7 value"7
 id8 value"8
 id9 value"9
+
+
+# When reading a partitioned table, the `listing_table_ignore_subdirectory` configuration will be invalid
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = false;
+
+statement ok
+CREATE EXTERNAL TABLE partition_csv_table (
+  name VARCHAR,
+  ts TIMESTAMP,
+  c_date DATE,
+)
+STORED AS CSV
+PARTITIONED BY (c_date)
+LOCATION '../core/tests/data/partitioned_table';
+
+query I
+select count(*) from partition_csv_table;
+----
+4
+
+statement ok
+DROP TABLE partition_csv_table
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = true;
+
+statement ok
+CREATE EXTERNAL TABLE partition_csv_table (
+  name VARCHAR,
+  ts TIMESTAMP,
+  c_date DATE,
+)
+STORED AS CSV
+PARTITIONED BY (c_date)
+LOCATION '../core/tests/data/partitioned_table';
+
+query TT
+explain select count(*) from partition_csv_table;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: partition_csv_table projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+--------CsvExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-11-13/timestamps.csv], [WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-12-13/timestamps.csv]]}, has_header=false
+
+query I
+select count(*) from partition_csv_table;

Review Comment:
   I don't understand what this test is testing -- in both cases the table has 4 rows (aka there is no data in a subdirectory to ignore, right)?





Re: [PR] Add new configuration item `listing_table_ignore_subdirectory` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1435304695


##########
datafusion/common/src/config.rs:
##########
@@ -273,6 +273,11 @@ config_namespace! {
         /// memory consumption
         pub max_buffered_batches_per_output_file: usize, default = 2
 
+        /// When scanning file paths, whether to ignore subdirectory files,
+        /// ignored by default (true), when reading a partitioned table,
+        /// `listing_table_ignore_subdirectory` is always equal to false, even if set to true

Review Comment:
   Ah, got it -- thank you -- I will propose a clarification in a follow on PR





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1432221945


##########
datafusion/sqllogictest/test_files/information_schema.slt:
##########
@@ -150,6 +150,7 @@ datafusion.execution.aggregate.scalar_update_factor 10
 datafusion.execution.batch_size 8192
 datafusion.execution.coalesce_batches true
 datafusion.execution.collect_statistics false
+datafusion.execution.ignore_subdirectory true

Review Comment:
   I used `listing_table_ignore_subdirectory`





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1434646388


##########
datafusion/common/src/config.rs:
##########
@@ -273,6 +273,11 @@ config_namespace! {
         /// memory consumption
         pub max_buffered_batches_per_output_file: usize, default = 2
 
+        /// When scanning file paths, whether to ignore subdirectory files,
+        /// ignored by default (true), when reading a partitioned table,
+        /// `listing_table_ignore_subdirectory` is always equal to false, even if set to true

Review Comment:
   ![image](https://github.com/apache/arrow-datafusion/assets/26200914/4a69a6ff-6ecf-40ca-ae6f-8e55c17f0479)
   ```sql
   -- Read a partitioned table
   
   CREATE EXTERNAL TABLE csv_with_timestamps (
     name VARCHAR,
     ts TIMESTAMP,
     c_date DATE,
   )
   STORED AS CSV
   PARTITIONED BY (c_date)
   LOCATION '../core/tests/data/partitioned_table';
   
   set datafusion.execution.listing_table_ignore_subdirectory = true;
   
   select count(*) from csv_with_timestamps;    -- returns 4
   ```





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1434494015


##########
datafusion/common/src/config.rs:
##########
@@ -273,6 +273,11 @@ config_namespace! {
         /// memory consumption
         pub max_buffered_batches_per_output_file: usize, default = 2
 
+        /// When scanning file paths, whether to ignore subdirectory files,
+        /// ignored by default (true), when reading a partitioned table,
+        /// `listing_table_ignore_subdirectory` is always equal to false, even if set to true

Review Comment:
   I don't understand what this is trying to say 🤔 





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1433336655


##########
datafusion/sqllogictest/test_files/csv_files.slt:
##########
@@ -63,3 +63,60 @@ id6 value"6
 id7 value"7
 id8 value"8
 id9 value"9
+
+
+# When reading a partitioned table, the `listing_table_ignore_subdirectory` configuration will be invalid
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = false;
+
+statement ok
+CREATE EXTERNAL TABLE partition_csv_table (
+  name VARCHAR,
+  ts TIMESTAMP,
+  c_date DATE,
+)
+STORED AS CSV
+PARTITIONED BY (c_date)
+LOCATION '../core/tests/data/partitioned_table';
+
+query I
+select count(*) from partition_csv_table;
+----
+4
+
+statement ok
+DROP TABLE partition_csv_table
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = true;

Review Comment:
   When reading a partitioned table, listing_table_ignore_subdirectory is always equal to false, even if set to true





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1433358747


##########
datafusion/common/src/config.rs:
##########
@@ -273,6 +273,11 @@ config_namespace! {
         /// memory consumption
         pub max_buffered_batches_per_output_file: usize, default = 2
 
+        /// When scanning file paths, whether to ignore subdirectory files,
+        /// ignored by default (true), when reading a partitioned table,
+        /// `listing_table_ignore_subdirectory` is always equal to false, even if set to true

Review Comment:
   I updated the description. Does it look right to you, @alamb?






Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1434492621


##########
datafusion/sqllogictest/test_files/parquet.slt:
##########
@@ -276,6 +276,39 @@ LIMIT 10;
 0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC"))
 0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC"))
 
+# Test config listing_table_ignore_subdirectory:

Review Comment:
   👍 





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1433335906


##########
datafusion/sqllogictest/test_files/csv_files.slt:
##########
@@ -63,3 +63,60 @@ id6 value"6
 id7 value"7
 id8 value"8
 id9 value"9
+
+
+# When reading a partitioned table, the `listing_table_ignore_subdirectory` configuration will be invalid
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = false;
+
+statement ok
+CREATE EXTERNAL TABLE partition_csv_table (
+  name VARCHAR,
+  ts TIMESTAMP,
+  c_date DATE,
+)
+STORED AS CSV
+PARTITIONED BY (c_date)
+LOCATION '../core/tests/data/partitioned_table';
+
+query I
+select count(*) from partition_csv_table;
+----
+4
+
+statement ok
+DROP TABLE partition_csv_table
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = true;
+
+statement ok
+CREATE EXTERNAL TABLE partition_csv_table (
+  name VARCHAR,
+  ts TIMESTAMP,
+  c_date DATE,
+)
+STORED AS CSV
+PARTITIONED BY (c_date)
+LOCATION '../core/tests/data/partitioned_table';
+
+query TT
+explain select count(*) from partition_csv_table;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: partition_csv_table projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+--------CsvExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-11-13/timestamps.csv], [WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-12-13/timestamps.csv]]}, has_header=false
+
+query I
+select count(*) from partition_csv_table;

Review Comment:
   When reading a partitioned table, `listing_table_ignore_subdirectory` is always equal to false, even if the default is true





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1433346298


##########
datafusion/sqllogictest/test_files/parquet.slt:
##########
@@ -276,6 +276,118 @@ LIMIT 10;
 0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC"))
 0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC"))
 
+# test for
+query ITID
+COPY (SELECT * FROM src_table WHERE int_col > 6 LIMIT 3)
+TO 'test_files/scratch/parquet/test_table/subdir/3.parquet'
+(FORMAT PARQUET, SINGLE_FILE_OUTPUT true);
+----
+3
+
+# Test config ignore_subdirectory:
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = true;
+
+statement ok
+CREATE EXTERNAL TABLE t1_ignore_subdirectory
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION 'test_files/scratch/parquet/test_table/*.parquet';
+
+query TT
+explain select count(*) from t1_ignore_subdirectory;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: t1_ignore_subdirectory projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}
+
+statement ok
+CREATE EXTERNAL TABLE t2_ignore_subdirectory
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION 'test_files/scratch/parquet/test_table/';
+
+query TT
+explain select count(*) from t2_ignore_subdirectory;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: t2_ignore_subdirectory projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}
+
+# scan file: 0.parquet 1.parquet 2.parquet
+
+query I
+select count(*) from t1_ignore_subdirectory;
+----
+9
+
+query I
+select count(*) from t2_ignore_subdirectory;
+----
+9
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = false;
+
+statement ok
+CREATE EXTERNAL TABLE t1_with_subdirectory
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION 'test_files/scratch/parquet/test_table/*.parquet';
+
+query TT
+explain select count(*) from t1_with_subdirectory;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: t1_with_subdirectory projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/subdir/3.parquet]]}
+
+
+statement ok
+CREATE EXTERNAL TABLE t2_with_subdirectory
+STORED AS PARQUET
+WITH HEADER ROW
+LOCATION 'test_files/scratch/parquet/test_table/';
+
+query TT
+explain select count(*) from t2_with_subdirectory;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: t2_with_subdirectory projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/subdir/3.parquet]]}
+
+# scan file: 0.parquet 1.parquet 2.parquet 3.parquet
+query I
+select count(*) from t1_with_subdirectory;
+----
+12
+
+query I
+select count(*) from t2_with_subdirectory;

Review Comment:
   Thank you for your suggestion, I understand





Re: [PR] Add new configuration item `ignore_subdirectory` [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.
Asura7969 commented on code in PR #8565:
URL: https://github.com/apache/arrow-datafusion/pull/8565#discussion_r1433335906


##########
datafusion/sqllogictest/test_files/csv_files.slt:
##########
@@ -63,3 +63,60 @@ id6 value"6
 id7 value"7
 id8 value"8
 id9 value"9
+
+
+# When reading a partitioned table, the `listing_table_ignore_subdirectory` setting has no effect
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = false;
+
+statement ok
+CREATE EXTERNAL TABLE partition_csv_table (
+  name VARCHAR,
+  ts TIMESTAMP,
+  c_date DATE,
+)
+STORED AS CSV
+PARTITIONED BY (c_date)
+LOCATION '../core/tests/data/partitioned_table';
+
+query I
+select count(*) from partition_csv_table;
+----
+4
+
+statement ok
+DROP TABLE partition_csv_table
+
+statement ok
+set datafusion.execution.listing_table_ignore_subdirectory = true;
+
+statement ok
+CREATE EXTERNAL TABLE partition_csv_table (
+  name VARCHAR,
+  ts TIMESTAMP,
+  c_date DATE,
+)
+STORED AS CSV
+PARTITIONED BY (c_date)
+LOCATION '../core/tests/data/partitioned_table';
+
+query TT
+explain select count(*) from partition_csv_table;
+----
+logical_plan
+Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
+--TableScan: partition_csv_table projection=[]
+physical_plan
+AggregateExec: mode=Final, gby=[], aggr=[COUNT(*)]
+--CoalescePartitionsExec
+----AggregateExec: mode=Partial, gby=[], aggr=[COUNT(*)]
+------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+--------CsvExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-11-13/timestamps.csv], [WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-12-13/timestamps.csv]]}, has_header=false
+
+query I
+select count(*) from partition_csv_table;

Review Comment:
   When reading a partitioned table, `listing_table_ignore_subdirectory` is always treated as false, even if it is set to true. But this test seems a bit redundant; I will clean it up.
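
The rule described in this comment could be sketched as follows. This is a minimal, std-only illustration and not DataFusion's actual `ListingTableUrl::contains` implementation: the function name `is_listed`, its parameters, and the way the `partitioned` flag overrides the setting are all assumptions made for the example.

```rust
/// Hypothetical helper (not part of DataFusion): decides whether `file`
/// should be scanned for a listing table rooted at `table_root`.
///
/// Assumption: for a partitioned table the `ignore_subdirectory` setting is
/// treated as false, because partition values live in subdirectories such as
/// `c_date=2018-11-13/`.
fn is_listed(
    table_root: &str,
    file: &str,
    ignore_subdirectory: bool,
    partitioned: bool,
) -> bool {
    // Normalize the root so that "dir" and "dir/" behave the same.
    let root = table_root.trim_end_matches('/');

    // The file must live under the table root; "dir_other/x" must not match "dir".
    let rest = match file.strip_prefix(root).and_then(|r| r.strip_prefix('/')) {
        Some(rest) => rest,
        None => return false,
    };

    // Any remaining '/' means the file sits below a subdirectory of the root.
    let in_subdirectory = rest.contains('/');

    // Partitioned tables always descend into subdirectories.
    let effective_ignore = ignore_subdirectory && !partitioned;

    !(effective_ignore && in_subdirectory)
}
```

Under these assumptions, `is_listed("data/partitioned_table", "data/partitioned_table/c_date=2018-11-13/timestamps.csv", true, true)` accepts the file even though the setting is true, while the same call with `partitioned = false` skips it.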


