You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org> on 2023/02/19 15:39:50 UTC

[GitHub] [arrow-datafusion] saikrishna1-bidgely opened a new pull request, #4908: Allow `SessionContext::read_csv`, etc to read multiple files

saikrishna1-bidgely opened a new pull request, #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908

   closes #4909


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb merged PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1103835352


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   The generic method would have a constraint on P as `P: DataFilePaths`. When we call the method `read_csv("path", options)` we will have to use the `use ..DataFilePaths` so that rust knows the trait `DataFilePaths` is implemented for string. Is this what you meant?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1433080962

   @alamb updated the docs. Should I remove `call_read_csvs` and the related methods?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by GitBox <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1396064487

   > This looks like a nice improvement @saikrishna1-bidgely
   > 
   > I think we should add a test for this new functionality so that we don't accidentally break the new APIs going forward.
   > 
   > Also I was wondering about the signature
   > 
   > Rather than a slice, what would you think about taking something that could be turned into an iter:
   > 
   > So instead of
   > 
   > ```rust
   >     pub async fn read_parquet_with_multi_paths(
   >         &self,
   >         table_paths: &[impl AsRef<str>],
   >         options: ParquetReadOptions<'_>,
   >     ) -> Result<DataFrame> {
   > ```
   > 
   > Something more like
   > 
   > ```rust
   >     pub async fn read_parquet_with_multi_paths(
   >         &self,
   >         table_paths: impl IntoIterator<Item = &str>],
   >         options: ParquetReadOptions<'_>,
   >     ) -> Result<DataFrame> {
   > ```
   > 
   > I also think it would be ideal to figure out some way to have the same API take both a single path and an iterator -- do you think the above would work?
   
   I agree, something like `IntoIterator` works better than a simple slice.
   
   Regarding if we can take both str and iter of str in the same method, I don't think it is simple since rust doesn't have function overloading or variadic arguments. We can look into following:
   1. Enums for arguments and then do pattern matching. I'm trying to implement this.
   2. Union type for arguments. This might not be possible, see [link](https://stackoverflow.com/questions/64690981/can-trait-objects-of-type-intoiterator-be-boxed-and-held-inside-of-a-structure).
   3. Create a custom trait and then implement it for str and for a slice/vec. See [Link](https://users.rust-lang.org/t/make-a-function-accept-either-str-or-str-as-an-argument/79139/9).
   
   What do you think about having a single method which only takes a list of paths? For a single path, the callee can create a Vec.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1107752138


##########
datafusion/core/src/execution/context.rs:
##########
@@ -106,6 +106,43 @@ use super::options::{
     AvroReadOptions, CsvReadOptions, NdJsonReadOptions, ParquetReadOptions, ReadOptions,
 };
 
+/// DataFilePaths adds a method to convert strings and vector of strings to vector of URLs.
+/// This is primarily used as restriciton on the input to methods like read_csv, read_avro
+/// and the rest.
+pub trait DataFilePaths {
+    /// Parse to a list of URLs
+    fn to_urls(self) -> Result<Vec<ListingTableUrl>>;
+}
+
+impl DataFilePaths for &str {

Review Comment:
   Looks like rust won't allow it. We won't be able to implement a trait for another trait and also for another class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1089903323


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   This might be better. Less code to implement. But we have to use the `use` statement for the defined trait whenever we call `read_csv` and others.
   
   One more issue could be when the users implement a trait with the same method name. They won't be able to use both traits, only one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1436873171

   Thanks again for sticking with this @saikrishna1-bidgely 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1103993446


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   Thanks. I didn't know traits worked that way with generics. I tried an implementation and it worked!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1103780476


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   Yes, I think that would be a good idea. Perhaps @andygrove @Dandandan @yahoNanJing @liukun4515  or @tustvold  have some thoughts



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1435660256

   @alamb removed methods from `CallReadTrait` too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1436872518

   Thanks again for sticking with this @saikrishna1-bidgely 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1398259203

   > Quick question, wouldn't we want to support multiple paths anyways since we would want to use it in DataFusion-CLI?
   
   I think that is likely a separate question -- DataFusion CLI can potentially create a ListingTable directly rather than using the higher level SessionContext API as well
   
   > Also, if we are able to crack the implementation which can take both single and multiple paths in the same function, API itself should be unchanged, right?
   
   Yes, I agree 
   
   > So, I think the following are our options:
   
   I agree with your assessments -- I do think documentation on how to create a ListingTable would go a long way. It seems we are lacking in such docs now https://docs.rs/datafusion/16.0.0/datafusion/datasource/listing/struct.ListingTable.html
   
   I think documentation will help regardless of what else we chose to do -- I'll go write some now. Thank you for this good discussion
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by GitBox <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1383079469

   I'll add for `read_avro`, `read_json` and `read_parquet` too once what needs to be done gets finalised.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1397083616

   > What do you think about having a single method which only takes a list of paths? For a single path, the callee can create a slice/Vec. This would be a lot simpler to do.
   
   I was thinking about this PR and I have an alternate suggestion
   
   It seems to me that  `read_parquet`, `read_avro`, etc are wrappers to simplify the process of creating a `ListingTable`. Support for multiple paths starts complicating the API more -- what do you think about instead of adding `read_parquet_from_path`s we make it easier to see how to read multiple files using the `ListingTable` API directly?
   
   For example, I bet if we added a doc example like the following
   
   ```rust
       /// Creates a [`DataFrame`] for reading a Parquet data source from a single file or directory. 
       ///
       /// Note: if you want to read from multiple files, or control other behaviors
       /// you can use the [`ListingTable`] API directly. For example to read multiple files
       /// 
       /// ```
       /// Example here (basically copy/paste the implementation of read_parquet and support multiple files)
       /// ```
       pub async fn read_parquet(
           &self,
           table_path: impl AsRef<str>,
           options: ParquetReadOptions<'_>,
       ) -> Result<DataFrame> {
   ...
   ```
   
   We could give similar treatment to the docstrings for `read_avro` and `read_csv` (perhaps by pointing to the docs for `read_parquet` for an example of creating `ListingTable`s)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1072844902


##########
datafusion/core/src/execution/context.rs:
##########
@@ -551,12 +551,14 @@ impl SessionContext {
     }
 
     /// Creates a [`DataFrame`] for reading an Avro data source.
-    pub async fn read_avro(
+    pub async fn read_avro_with_multi_paths(

Review Comment:
   what do you think about calling this `read_avro_from_paths`? rather than `multi_paths`?



##########
datafusion/core/src/execution/context.rs:
##########
@@ -576,20 +578,31 @@ impl SessionContext {
             }
         };
 
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Json data source.
-    pub async fn read_json(
+    /// Creates a [`DataFrame`] for reading an Avro data source.
+    pub async fn read_avro(
         &self,
         table_path: impl AsRef<str>,
+        options: AvroReadOptions<'_>,
+    ) -> Result<DataFrame> {
+        return self.read_avro_with_multi_paths(&[table_path], options).await;
+    }
+
+    /// Creates a [`DataFrame`] for reading an Json data source.

Review Comment:
   Perhaps we can update the docstring as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1083343498


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1035,6 +1037,80 @@ impl FunctionRegistry for SessionContext {
     }
 }
 
+#[async_trait]
+/// Trat implementing `read_csv`, `read_avro`, `read_json` and `read_parquet`.
+/// We implement these functions for `String`, `&str`, `Vec<&str>` and `Vec<String>`.
+/// This way these functions can accept either a single path or a list of paths.
+/// 
+/// An example of using these methods is given below.
+/// ```rust
+/// let ctx = SessionContext::default();
+/// ctx.read_csv("s3://bucket-name/key/")
+/// ctx.read_csv(vec!["s3://bucket-name/key/", "s3://bucket-name/key2/", "s3://another-bucket-name/key/"])
+/// ```
+pub trait Reader<'a, T>: Sized {
+
+    /// Creates a [`DataFrame`] for reading a CSV data source.
+    ///
+    /// For more control such as reading multiple files, you can use
+    /// [`read_table`](Self::read_table) with a [`ListingTable`].
+    async fn read_csv(&self, table_paths: T, options: CsvReadOptions<'_>) -> Result<DataFrame>
+    where 'a:'async_trait
+    ;
+}
+
+#[async_trait]
+impl<'a> Reader<'a, &'a str> for SessionContext {
+    async fn read_csv(&self, table_path: &'a str, options: CsvReadOptions<'_>) -> Result<DataFrame> {
+        self.read_csv(vec![table_path], options).await
+    }
+}
+
+#[async_trait]
+impl<'a> Reader<'a, String> for SessionContext {
+    async fn read_csv(&self, table_path: String, options: CsvReadOptions<'_>) -> Result<DataFrame> {
+        self.read_csv(vec![table_path.as_str()], options).await
+    }
+}
+
+#[async_trait]
+impl<'a> Reader<'a, Vec<&'a str>> for SessionContext {
+    async fn read_csv(&self, table_paths: Vec<&'a str>, options: CsvReadOptions<'_>) -> Result<DataFrame> {

Review Comment:
   Should we implement a trait `ReadOptions`?
   ```rust
   trait ReadOptions {
       fn to_listing_options(&self, target_partitions: usize) -> ListingOptions;
       fn get_resolved_schema(&self) -> Result<SchemaRef>;
   }
   ```
   `CsvReadOptions` and other will implement this trait. This way, we can avoid repeating this code for other methods.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1406369413

   Marking as a draft so it is clear this is waiting on some updates rather than waiting on review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1436020455

   Closing/ reopening to rerun CI (for some reason several of the tests were canceled)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1431651740

   Regarding the docs, I think we should extend the example in `SessionContext` and add an example to all the `read_*` methods.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1083336098


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1035,6 +1037,65 @@ impl FunctionRegistry for SessionContext {
     }
 }
 
+#[async_trait]
+pub trait Reader<'a, T>: Sized {
+    async fn read_csv(&self, table_paths: T, options: CsvReadOptions<'_>) -> Result<DataFrame>
+    where 'a:'async_trait
+    ;
+}
+
+#[async_trait]
+impl<'a> Reader<'a, &'a str> for SessionContext {
+    async fn read_csv(&self, table_path: &'a str, options: CsvReadOptions<'_>) -> Result<DataFrame> {
+        self.read_csv(vec![table_path], options).await
+    }
+}
+
+#[async_trait]
+impl<'a> Reader<'a, String> for SessionContext {

Review Comment:
   We can't use AsRef<str> here. If we do that, we won't be able to implement for `Vec<str>`. We have to implement both `&str` and `String` separately.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1111057255


##########
datafusion/core/src/execution/context.rs:
##########
@@ -124,6 +161,7 @@ use super::options::{
 /// # #[tokio::main]
 /// # async fn main() -> Result<()> {
 /// let ctx = SessionContext::new();
+/// let df = ctx.read_csv(vec!["tests/data/example.csv", "tests/data/example.csv"], CsvReadOptions::new()).await?;

Review Comment:
   I think this particular line is just confusing -- the next line ignores the result (redefines `df`) and it registers the same file twice. Given this is supposed to be the simplest example, I think we should leave this one alone. 



##########
datafusion/core/src/execution/context.rs:
##########
@@ -685,24 +723,40 @@ impl SessionContext {
     ///
     /// For more control such as reading multiple files, you can use
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_csv(
+    ///
+    /// Example usage is given below:
+    ///
+    /// ```
+    /// use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # #[tokio::main]
+    /// # async fn main() -> Result<()> {
+    /// let ctx = SessionContext::new();
+    /// let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+    /// let df = ctx.read_csv(vec!["tests/data/example.csv", "tests/data/example.csv"], CsvReadOptions::new()).await?;

Review Comment:
   ```suggestion
       /// // You can read a single file using `read_csv`
       /// let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
       /// // you can also read multiple files:
       /// let df = ctx.read_csv(vec!["tests/data/example.csv", "tests/data/example.csv"], CsvReadOptions::new()).await?;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1435732284

   Cool! Do we need more approvals before merging?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1406369732

   Please mark it ready for review once if I made a mistake or when it is next ready


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1422088739

   Hi @alamb I was on vacation and couldn't work on this. I've updated the comments pls go through them once.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1083336098


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1035,6 +1037,65 @@ impl FunctionRegistry for SessionContext {
     }
 }
 
+#[async_trait]
+pub trait Reader<'a, T>: Sized {
+    async fn read_csv(&self, table_paths: T, options: CsvReadOptions<'_>) -> Result<DataFrame>
+    where 'a:'async_trait
+    ;
+}
+
+#[async_trait]
+impl<'a> Reader<'a, &'a str> for SessionContext {
+    async fn read_csv(&self, table_path: &'a str, options: CsvReadOptions<'_>) -> Result<DataFrame> {
+        self.read_csv(vec![table_path], options).await
+    }
+}
+
+#[async_trait]
+impl<'a> Reader<'a, String> for SessionContext {

Review Comment:
   We can't use AsRef<str> here. We have to implement both `&str` and `String` seperately.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1096915932


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   Hi @alamb, In the above example, we will need to use the `use` statement even when we are using `read_csv("s3://bucket/key", options)`. Instead I propose the following:
   1. What if we implement AsRef for Vec. Then, we only need to use the `use` statement when we are calling for multiple  paths. The default single path code will remain unchanged. The AsRef will simply concatenate all the strings in Vec with `,` delimiter. And in the `_read_type`, we will split on `,`.
   2. How about having a different method `read_csvs` for taking multiple paths. `read_csvs` is a lot cleaner than `read_csv_with_multi_paths`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1423041952

   Thanks @saikrishna1-bidgely  -- will check them out


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1089734021


##########
datafusion/core/src/dataframe.rs:
##########
@@ -62,6 +62,7 @@ use crate::prelude::SessionContext;
 /// ```
 /// # use datafusion::prelude::*;
 /// # use datafusion::error::Result;
+/// # use datafusion::execution::context::Reader;

Review Comment:
   I am somewhat worried about session context getting a new trait like this as now everyone who uses SessionContext must use now add this new `use` statement



##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   I wonder if you could leave these methods on `SessionContext` and instead use a trait for the argument and avoid changes to downstream users. 
   
   Perhaps something like:
   
   
   ```rust
   trait DataFilePaths {
     // Parse to a list of URs
     fn to_urls(self) -> Result<Vec<ListingTableUrl>>>;
   }
   
   impl DataFilePaths for &str {
     fn to_urls(self) -> Result<Vec<ListingTableUrl>>> {
       Ok(vec![ListingTableUrl::parse(self)?])
     }
   }
   
   impl DataFilePaths for Vec<&str> {
     fn to_urls(self) -> Result<Vec<ListingTableUrl>>> {
       self
               .iter()
               .map(ListingTableUrl::parse)
               .collect::<Result<Vec<ListingTableUrl>>>()
     }
   }
   
   
   
   impl SessionContext {
   ...
       pub async fn <P> read_avro(
           &self,
           table_path: P,
           options: AvroReadOptions<'_>,
       ) -> Result<DataFrame> {
           self._read_type(table_path.to_urls()>, options).await
       }
   ...
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1430279363

   @alamb @tustvold I updated the code with suggestion from @tustvold. The change is now much smaller and cleaner. Pls review the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1107349323


##########
datafusion/core/src/execution/context.rs:
##########
@@ -106,6 +106,43 @@ use super::options::{
     AvroReadOptions, CsvReadOptions, NdJsonReadOptions, ParquetReadOptions, ReadOptions,
 };
 
+/// DataFilePaths adds a method to convert strings and vector of strings to vector of URLs.
+/// This is primarily used as restriciton on the input to methods like read_csv, read_avro
+/// and the rest.
+pub trait DataFilePaths {
+    /// Parse to a list of URLs
+    fn to_urls(self) -> Result<Vec<ListingTableUrl>>;
+}
+
+impl DataFilePaths for &str {

Review Comment:
   I'm trying to implement for AsRef<str> along with an iterator. I'll update once it's ready.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] ursabot commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "ursabot (via GitHub)" <gi...@apache.org>.

ursabot commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1436894557

   Benchmark runs are scheduled for baseline = ae89960bc9f59706941be32afa730a87f7e47f1b and contender = cfbb14d9c0b8f4a86ed12d716a55c1dba3791e8b. cfbb14d9c0b8f4a86ed12d716a55c1dba3791e8b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4586b4c26d474c3ca0f3835fc689e2f0...ff22fab5fb3846c7b9c611a167d449fb/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/52aed2054edb413aa42ded27459f8efe...7924af1c8ea3441cba4db8f1561c5f27/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/eab623e20fd94a439d558729515b7332...feb147c017d24cad81d5c7b206d07189/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/da926ace54454f18a181ad24fbdc2220...5d2e925e77804fc79a6a64ab47a3a9e7/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1398306975

   Here is a proposal to at least add some more docs: https://github.com/apache/arrow-datafusion/pull/5001 -- it is not necessarily mutually exclusive with updating the signatures as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1083341999


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1035,6 +1037,80 @@ impl FunctionRegistry for SessionContext {
     }
 }
 
+#[async_trait]
+/// Trat implementing `read_csv`, `read_avro`, `read_json` and `read_parquet`.
+/// We implement these functions for `String`, `&str`, `Vec<&str>` and `Vec<String>`.
+/// This way these functions can accept either a single path or a list of paths.
+/// 
+/// An example of using these methods is given below.
+/// ```rust
+/// let ctx = SessionContext::default();
+/// ctx.read_csv("s3://bucket-name/key/")
+/// ctx.read_csv(vec!["s3://bucket-name/key/", "s3://bucket-name/key2/", "s3://another-bucket-name/key/"])
+/// ```
+pub trait Reader<'a, T>: Sized {
+
+    /// Creates a [`DataFrame`] for reading a CSV data source.
+    ///
+    /// For more control such as reading multiple files, you can use
+    /// [`read_table`](Self::read_table) with a [`ListingTable`].
+    async fn read_csv(&self, table_paths: T, options: CsvReadOptions<'_>) -> Result<DataFrame>

Review Comment:
   For now implemented only for `read_csv` but once it gets finalised for it, I'll implement for the rest of the methods.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1398775889

   @alamb I finally got a way to implement overloading using Traits.
   ```rust
   use std::vec;
   
   fn main() {
       struct PATH {
       }
       
       impl PATH {
           pub fn new() -> Self {
               Self {
               }
           }
       }
       
       pub trait Reader<T>: Sized {
           fn read_csv(&self, value: T) -> i32;
       }
       
       impl<'a> Reader<&'a str> for PATH {
           fn read_csv(&self, p: &'a str) -> i32  {
               self.read_csv(vec![p])
           }
       }
       
       impl<'a> Reader<Vec<&'a str>> for PATH {
           fn read_csv(&self, v: Vec<&'a str>) -> i32 {
               v.len() as i32
           }
       }
       let p = PATH::new();
       println!("{:?}", p.read_csv("path"));
       println!("{:?}", p.read_csv(vec!["path", "paths2"]));
   }
   ```
   This way, callees using `str` or `String` can continue using the function and we can add support for vector/iterators. I will write a more general solution for read functions.
   
   I tried to implement using Enum and Union but I wasn't able to do so and they will change the function signature.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files
URL: https://github.com/apache/arrow-datafusion/pull/4908


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1111058619


##########
datafusion/core/src/execution/context.rs:
##########
@@ -124,6 +161,7 @@ use super::options::{
 /// # #[tokio::main]
 /// # async fn main() -> Result<()> {
 /// let ctx = SessionContext::new();
+/// let df = ctx.read_csv(vec!["tests/data/example.csv", "tests/data/example.csv"], CsvReadOptions::new()).await?;

Review Comment:
   Makes sense. So, let's just extend `read_csv` example?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1436020310

   > Cool! Do we need more approvals before merging?
   
   Nope, I was just giving it some time after approval for other maintainers to have a look if they wanted. 
   
   🚀 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4908: Allow `SessionContext::read_csv`, etc to read multiple files

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1107191243


##########
datafusion/core/src/execution/context.rs:
##########
@@ -627,22 +664,18 @@ impl SessionContext {
     ///
     /// For more control such as reading multiple files, you can use
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    async fn _read_type<'a>(
+    async fn _read_type<'a, P: DataFilePaths>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: P,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths.to_urls()?;

Review Comment:
   ❤️ 



##########
datafusion/core/src/execution/context.rs:
##########
@@ -2764,6 +2797,9 @@ mod tests {
         async fn call_read_csv(&self) -> DataFrame;
         async fn call_read_avro(&self) -> DataFrame;
         async fn call_read_parquet(&self) -> DataFrame;
+        async fn call_read_csvs(&self) -> DataFrame;

Review Comment:
   Are these methods still needed?



##########
datafusion/core/src/execution/context.rs:
##########
@@ -106,6 +106,43 @@ use super::options::{
     AvroReadOptions, CsvReadOptions, NdJsonReadOptions, ParquetReadOptions, ReadOptions,
 };
 
+/// DataFilePaths adds a method to convert strings and vector of strings to vector of URLs.
+/// This is primarily used as restriciton on the input to methods like read_csv, read_avro
+/// and the rest.

Review Comment:
   ```suggestion
   /// DataFilePaths adds a method to convert strings and vector of strings to vector of [`ListingTableUrl`] URLs.
   /// This allows methods such [`SessionContext::read_csv`] and `[`SessionContext::read_avro`]
   /// to take either a single file or multiple files. 
   ```



##########
datafusion/core/src/execution/context.rs:
##########
@@ -106,6 +106,43 @@ use super::options::{
     AvroReadOptions, CsvReadOptions, NdJsonReadOptions, ParquetReadOptions, ReadOptions,
 };
 
+/// DataFilePaths adds a method to convert strings and vector of strings to vector of URLs.
+/// This is primarily used as restriciton on the input to methods like read_csv, read_avro
+/// and the rest.
+pub trait DataFilePaths {
+    /// Parse to a list of URLs
+    fn to_urls(self) -> Result<Vec<ListingTableUrl>>;
+}
+
+impl DataFilePaths for &str {

Review Comment:
   technically speaking I think this might be a breaking change as the methods used to take `impl AsRef<str>` 
   
   However, I think it is fine for this case as there is an easy workaround (call `as_ref()`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1096915932


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   Hi @alamb, I was travelling and couldn't work on this last week. In the above example, we will need to use the `use` statement even when we are using `read_csv("s3://bucket/key", options)`. Instead I propose the following:
   1. What if we implement AsRef for Vec. Then, we only need to use the `use` statement when we are calling for multiple  paths. The default single path code will remain unchanged. The AsRef will simply concatenate all the strings in Vec with `,` delimiter. And in the `_read_type`, we will split on `,`.
   2. How about having a different method `read_csvs` for taking multiple paths. `read_csvs` is a lot cleaner than `read_csv_with_multi_paths`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by GitBox <gi...@apache.org>.

saikrishna1-bidgely commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1397704446

   Quick question, wouldn't we want to support multiple paths anyways since we would want to use it in DataFusion-CLI?
   
   Also, if we are able to crack the implementation which can take both single and multiple paths in the same function, API itself should be unchanged, right?
   
   So, I think the following are our options:
   1. As you said, don't support it but provide an example in the docs.
   2. Have multiple methods.
   3. Same method takes both single and multiple paths.
   4. Change the current methods to take only multiple paths.
   
   1 is definitely the simplest but I think we should support it as it would make using DataFusion simpler.
   2 is simple but multiple functions is a downside.
   3 if possible is best case IMO.
   4 is also simpler but needs a lot of changes downstream.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #4908: added a method to read multiple locations at the same time.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1399227866

   Thank you @saikrishna1-bidgely  -- sounds like great progress. I think if we go the trait route as long as we document how to use it (basically we can make a doc example showing the use of a &str) I think we'll be good. 
   
   Thanks again!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1103627059


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   > How about having a different method read_csvs for taking multiple paths. read_csvs is a lot cleaner than read_csv_with_multi_paths.
   
   I agree that would be better
   
   I think my concern was about the fact that then there would be `read_parquets` `read_avros`, etc. which starts to sound like too many 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1103659269


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   Looks like a toss up between multiple methods and using traits. Shall we get inputs from other core members?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1103843849


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   That isn't how traits behave, you do not need to have a trait in scope in order to call a generic method with a type constraint on it - https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a629bf1bdd4b6418ab09e274f8214de5



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1103794858


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   > But we have to use the use statement for the DataFilePaths whenever we call read_csv and others.
   
   I'm surprised by this, you shouldn't need to use a trait in order to satisfy a type constraint on a generic method?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by GitBox <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1073817337


##########
datafusion/core/src/execution/context.rs:
##########
@@ -551,12 +551,14 @@ impl SessionContext {
     }
 
     /// Creates a [`DataFrame`] for reading an Avro data source.
-    pub async fn read_avro(
+    pub async fn read_avro_with_multi_paths(

Review Comment:
   I used `_with_multi_paths` since arrow uses a similar name but `_from_paths` seems cleaner. I'm fine with either.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1083513043


##########
datafusion/core/src/execution/context.rs:
##########
@@ -1035,6 +1037,80 @@ impl FunctionRegistry for SessionContext {
     }
 }
 
+#[async_trait]
+/// Trat implementing `read_csv`, `read_avro`, `read_json` and `read_parquet`.
+/// We implement these functions for `String`, `&str`, `Vec<&str>` and `Vec<String>`.
+/// This way these functions can accept either a single path or a list of paths.
+/// 
+/// An example of using these methods is given below.
+/// ```rust
+/// let ctx = SessionContext::default();
+/// ctx.read_csv("s3://bucket-name/key/")
+/// ctx.read_csv(vec!["s3://bucket-name/key/", "s3://bucket-name/key2/", "s3://another-bucket-name/key/"])
+/// ```
+pub trait Reader<'a, T>: Sized {
+
+    /// Creates a [`DataFrame`] for reading a CSV data source.
+    ///
+    /// For more control such as reading multiple files, you can use
+    /// [`read_table`](Self::read_table) with a [`ListingTable`].
+    async fn read_csv(&self, table_paths: T, options: CsvReadOptions<'_>) -> Result<DataFrame>
+    where 'a:'async_trait
+    ;
+}
+
+#[async_trait]
+impl<'a> Reader<'a, &'a str> for SessionContext {
+    async fn read_csv(&self, table_path: &'a str, options: CsvReadOptions<'_>) -> Result<DataFrame> {
+        self.read_csv(vec![table_path], options).await
+    }
+}
+
+#[async_trait]
+impl<'a> Reader<'a, String> for SessionContext {
+    async fn read_csv(&self, table_path: String, options: CsvReadOptions<'_>) -> Result<DataFrame> {
+        self.read_csv(vec![table_path.as_str()], options).await
+    }
+}
+
+#[async_trait]
+impl<'a> Reader<'a, Vec<&'a str>> for SessionContext {
+    async fn read_csv(&self, table_paths: Vec<&'a str>, options: CsvReadOptions<'_>) -> Result<DataFrame> {

Review Comment:
   Now implemented in #5025.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] saikrishna1-bidgely commented on a diff in pull request #4908: added a method to read multiple locations at the same time.

Posted by "saikrishna1-bidgely (via GitHub)" <gi...@apache.org>.

saikrishna1-bidgely commented on code in PR #4908:
URL: https://github.com/apache/arrow-datafusion/pull/4908#discussion_r1089903323


##########
datafusion/core/src/execution/context.rs:
##########
@@ -613,50 +615,29 @@ impl SessionContext {
     /// [`read_table`](Self::read_table) with a [`ListingTable`].
     async fn _read_type<'a>(
         &self,
-        table_path: impl AsRef<str>,
+        table_paths: Vec<impl AsRef<str>>,
         options: impl ReadOptions<'a>,
     ) -> Result<DataFrame> {
-        let table_path = ListingTableUrl::parse(table_path)?;
+        let table_paths = table_paths
+            .iter()
+            .map(ListingTableUrl::parse)
+            .collect::<Result<Vec<ListingTableUrl>>>()?;
         let session_config = self.copied_config();
         let listing_options = options.to_listing_options(&session_config);
         let resolved_schema = match options
-            .get_resolved_schema(&session_config, self.state(), table_path.clone())
+            .get_resolved_schema(&session_config, self.state(), table_paths[0].clone())
             .await
         {
             Ok(resolved_schema) => resolved_schema,
             Err(e) => return Err(e),
         };
-        let config = ListingTableConfig::new(table_path)
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
             .with_listing_options(listing_options)
             .with_schema(resolved_schema);
         let provider = ListingTable::try_new(config)?;
         self.read_table(Arc::new(provider))
     }
 
-    /// Creates a [`DataFrame`] for reading an Avro data source.
-    ///
-    /// For more control such as reading multiple files, you can use
-    /// [`read_table`](Self::read_table) with a [`ListingTable`].
-    pub async fn read_avro(
-        &self,
-        table_path: impl AsRef<str>,
-        options: AvroReadOptions<'_>,
-    ) -> Result<DataFrame> {
-        self._read_type(table_path, options).await
-    }

Review Comment:
   This might be better. Less code to implement. But we have to use the `use` statement for the `DataFilePaths` whenever we call `read_csv` and others.
   
   One more issue could be when the users implement a trait with the same method name. They won't be able to use both traits, only one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org