You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "2010YOUY01 (via GitHub)" <gi...@apache.org> on 2023/07/23 05:41:08 UTC

[GitHub] [arrow-datafusion] 2010YOUY01 opened a new pull request, #7058: Support file path like "~/$TEST_FILE"

2010YOUY01 opened a new pull request, #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058

   # Which issue does this PR close?
   
   NA
   
   # Rationale for this change
   
   It is more convenient to use shell expansion for paths when reading files, it is not supported now (standard path library doesn't support that)
   ```sql
   arrow-datafusion/datafusion-cli|shell-expand🌝| >>> export CSV_FILE=aggregate_test_100.csv
   arrow-datafusion/datafusion-cli|shell-expand🌝| >>> cargo run
       Finished dev [unoptimized + debuginfo] target(s) in 0.75s
        Running `target/debug/datafusion-cli`
   DataFusion CLI v27.0.0
   ❯ CREATE EXTERNAL TABLE aggr
   STORED AS CSV
   WITH HEADER ROW LOCATION '~/db_data/datafusion/data/csv/$CSV_FILE';
   0 rows in set. Query took 0.101 seconds.
   ❯ select c1, max(c12) from aggr group by c1;
   +----+--------------------+
   | c1 | MAX(aggr.c12)      |
   +----+--------------------+
   | c  | 0.991517828651004  |
   | d  | 0.9748360509016578 |
   | b  | 0.9185813970744787 |
   | a  | 0.9800193410444061 |
   | e  | 0.9965400387585364 |
   +----+--------------------+
   ```
   
   # What changes are included in this PR?
   
   Expand paths when necessary.
   The implementation directly uses the shell to expand paths, I found this way easy and also avoids external dependencies.
   
   # Are these changes tested?
   Yes
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   No
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] 2010YOUY01 commented on a diff in pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "2010YOUY01 (via GitHub)" <gi...@apache.org>.
2010YOUY01 commented on code in PR #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058#discussion_r1275386601


##########
datafusion/core/src/datasource/listing/url.rs:
##########
@@ -87,6 +89,41 @@ impl ListingTableUrl {
         }
     }
 
+    /// Perform shell-like path expansions
+    /// * Home directory expansion: "~/test.csv" expands to "/Users/user1/test.csv"
+    /// * Environment variable expansion: "$HOME/$DATA/test.csv" expands to
+    /// "/Users/user1/data/test.csv"
+    fn expand_path_prefix(path: &str) -> Result<String, DataFusionError> {
+        let home_dir = home::home_dir()
+            .unwrap()
+            .into_os_string()
+            .into_string()
+            .unwrap();
+
+        let path = path.replace('~', &home_dir);
+
+        let parts = path.split('/').collect::<Vec<_>>();
+        let mut expanded_parts = Vec::new();
+
+        for part in parts {
+            if let Some(envvar_name) = part.strip_prefix('$') {

Review Comment:
   Good point, I will add support for patterns like `/foo$bar/` or `/$foo$bar/`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058#issuecomment-1652394407

   > Is this because there might be some other unexpected behavior we might not be aware of for now?
   
   Yes, I am concerned that since DataFusion is used to build server software, that it doesn't have security vulnerabilities. Accessing the environment and home directories are classic attack vectors from my understanding
   
   > library: How about adding a Catalog configuration option like use_path_shell_expansion, and turn it off in default config (also suggest not to use it in production in doc), and turn it on in CLI by default?
   
   I think that sounds like an excellent idea 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058#issuecomment-1653433250

   Thank you for this contribution @2010YOUY01 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] 2010YOUY01 closed pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "2010YOUY01 (via GitHub)" <gi...@apache.org>.
2010YOUY01 closed pull request #7058: Support file path like "~/$TEST_FILE"
URL: https://github.com/apache/arrow-datafusion/pull/7058


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] 2010YOUY01 commented on pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "2010YOUY01 (via GitHub)" <gi...@apache.org>.
2010YOUY01 commented on PR #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058#issuecomment-1652352286

   > Looking better @2010YOUY01 -- thank you
   > 
   > What do you think about making this code only apply in `datafusion-cli` rather than the core?
   
   Thanks for the review again! @alamb 
   Is this because there might be some other unexpected behavior we might not be aware of for now?
   Since path expansion might also be useful if someone wants to write some simple testing program using DataFusion as library: How about adding a Catalog configuration option like `use_path_shell_expansion`, and turn it off in default config (also suggest not to use it in production in doc), and turn it on in CLI by default?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058#discussion_r1275356651


##########
datafusion/core/src/datasource/listing/url.rs:
##########
@@ -87,6 +89,41 @@ impl ListingTableUrl {
         }
     }
 
+    /// Perform shell-like path expansions
+    /// * Home directory expansion: "~/test.csv" expands to "/Users/user1/test.csv"
+    /// * Environment variable expansion: "$HOME/$DATA/test.csv" expands to
+    /// "/Users/user1/data/test.csv"
+    fn expand_path_prefix(path: &str) -> Result<String, DataFusionError> {
+        let home_dir = home::home_dir()
+            .unwrap()

Review Comment:
   I don't think we should use either `unwrap`  as it will cause a panic of the user doesn't have a $HOME set, I think. Instead I think the code should simply not attempt `~` substitution if it can't find the home directory expansion
   
   



##########
datafusion/core/src/datasource/listing/url.rs:
##########
@@ -87,6 +89,41 @@ impl ListingTableUrl {
         }
     }
 
+    /// Perform shell-like path expansions
+    /// * Home directory expansion: "~/test.csv" expands to "/Users/user1/test.csv"
+    /// * Environment variable expansion: "$HOME/$DATA/test.csv" expands to
+    /// "/Users/user1/data/test.csv"
+    fn expand_path_prefix(path: &str) -> Result<String, DataFusionError> {
+        let home_dir = home::home_dir()
+            .unwrap()
+            .into_os_string()
+            .into_string()
+            .unwrap();
+
+        let path = path.replace('~', &home_dir);
+
+        let parts = path.split('/').collect::<Vec<_>>();
+        let mut expanded_parts = Vec::new();
+
+        for part in parts {
+            if let Some(envvar_name) = part.strip_prefix('$') {

Review Comment:
   this only works for paths like `/$foo/` right, not `/foo$bar/`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] 2010YOUY01 commented on a diff in pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "2010YOUY01 (via GitHub)" <gi...@apache.org>.
2010YOUY01 commented on code in PR #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058#discussion_r1273862018


##########
datafusion/core/src/datasource/listing/url.rs:
##########
@@ -87,6 +88,34 @@ impl ListingTableUrl {
         }
     }
 
+    /// Perform shell-like path expansions
+    /// * Home directory expansion: "~/test.csv" expands to "/Users/user1/test.csv"
+    /// * Environment variable expansion: "$HOME/$DATA/test.csv" expands to
+    /// "/Users/user1/data/test.csv"
+    fn expand_path_prefix(prefix: &str) -> Result<String, DataFusionError> {
+        let error_msg = format!("Failed to perform shell expansion in path: {prefix}");
+
+        let expanded_dir_output = Command::new("sh")
+            .arg("-c")
+            .arg(&format!("echo {prefix}"))

Review Comment:
   Thank you! Totally agreed.
   Updated the implementation with "2."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] 2010YOUY01 commented on pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "2010YOUY01 (via GitHub)" <gi...@apache.org>.
2010YOUY01 commented on PR #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058#issuecomment-1652865045

   > > Is this because there might be some other unexpected behavior we might not be aware of for now?
   > 
   > Yes, I am concerned that since DataFusion is used to build server software, that it doesn't have security vulnerabilities. Accessing the environment and home directories are classic attack vectors from my understanding
   
   TIL 👍🏼 
   
   Also I took a look at that approach, and although it's doable, it will make related code less readable and bring in 
    an additional configuration knob.
   I think the extra complexity outweighs the benefits of this feature, so I will be closing this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #7058: Support file path like "~/$TEST_FILE"

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7058:
URL: https://github.com/apache/arrow-datafusion/pull/7058#discussion_r1273403499


##########
datafusion/core/src/datasource/listing/url.rs:
##########
@@ -87,6 +88,34 @@ impl ListingTableUrl {
         }
     }
 
+    /// Perform shell-like path expansions
+    /// * Home directory expansion: "~/test.csv" expands to "/Users/user1/test.csv"
+    /// * Environment variable expansion: "$HOME/$DATA/test.csv" expands to
+    /// "/Users/user1/data/test.csv"
+    fn expand_path_prefix(prefix: &str) -> Result<String, DataFusionError> {
+        let error_msg = format!("Failed to perform shell expansion in path: {prefix}");
+
+        let expanded_dir_output = Command::new("sh")
+            .arg("-c")
+            .arg(&format!("echo {prefix}"))

Review Comment:
   This seems like it may be subject to a shell injection attack. 
   
   For example, what if someone did 
   
   ```sql
   CREATE EXTERNAL TABLE foo LOCATED AT '"\"hi"; rm -rf /"'
   ```
   
   Could this potentially `rm -rf` the filesystem?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org