Posted to github@arrow.apache.org by "casperhart (via GitHub)" <gi...@apache.org> on 2023/05/22 01:40:08 UTC

[GitHub] [arrow-datafusion] casperhart opened a new issue, #6403: No way to specify file extension in datafusion-cli

casperhart opened a new issue, #6403:
URL: https://github.com/apache/arrow-datafusion/issues/6403

   ### Describe the bug
   
   I would like to read a `.tsv` file using the datafusion-cli. However, the file isn't recognised because the file extension is `.tsv` instead of the default `.csv`.  In vanilla datafusion, I can specify `CsvReadOptions::new().file_extension(".tsv")`, but from what I can see there is no similar option available in the datafusion-cli (correct me if I'm wrong).
   
   ### To Reproduce
   
   In bash:
   
   ```
   echo "col1, col2" > test.tsv
   echo "1, 2" >> test.tsv 
   ```
   
   In datafusion-cli:
   
   ```
   create external table test stored as csv with header row location "test.tsv";
   select * from test;
   ```
   
   gives:
   
   ```
   0 rows in set. Query took 0.001 seconds.
   ```
   
   ### Expected behavior
   
   Technically this is the expected behaviour, but it would be nice if there was a way to read the `.tsv` file and return the rows from it. 
   
   It would also be nice if the file_extension was only needed if the specified location is a directory. I.e. if I specify a file, I don't see why there's a need to separately specify the extension.
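
   To make that rule concrete, here is a minimal sketch of the behaviour I have in mind. It is not DataFusion's actual implementation: `should_read` and its parameters are hypothetical names used only for illustration, and the caller is assumed to already know whether the location is a directory.

   ```rust
   use std::path::Path;

   /// Hypothetical helper (not part of DataFusion): decides whether `candidate`
   /// should be read for a table registered at `location`.
   fn should_read(
       location: &Path,
       location_is_dir: bool,
       candidate: &Path,
       file_extension: &str,
   ) -> bool {
       if !location_is_dir {
           // A single file was named explicitly, so its extension is irrelevant.
           return candidate == location;
       }
       // For a directory, keep only files under it whose name ends with the
       // configured extension (for example ".csv" or ".tsv").
       candidate.starts_with(location)
           && candidate
               .file_name()
               .and_then(|name| name.to_str())
               .map_or(false, |name| name.ends_with(file_extension))
   }
   ```

   With this rule, `should_read(Path::new("test.tsv"), false, Path::new("test.tsv"), ".csv")` accepts the file even though the extension does not match, while a `.tsv` file inside a directory location is still skipped unless the extension is set to `.tsv`.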
   
   ### Additional context
   
   I'd like to work on this, but I don't know what the best approach is.
   
   E.g. a few ways I can think of are:
   - making this specifiable in the sql statement itself, as is the case with `delimiter x` and `with header row`
   - adding a global option `datafusion.catalog.file_extension`
   - (if possible) using a method like Hive's `tblproperties`
   
   Let me know what you think, cheers
   
   P.S. I have another issue reading .tsv files using the datafusion-cli here: https://github.com/apache/arrow-datafusion/issues/6397.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] casperhart closed issue #6403: No way to specify file extension in datafusion-cli

Posted by "casperhart (via GitHub)" <gi...@apache.org>.
casperhart closed issue #6403: No way to specify file extension in datafusion-cli
URL: https://github.com/apache/arrow-datafusion/issues/6403


[GitHub] [arrow-datafusion] alamb commented on issue #6403: No way to specify file extension in datafusion-cli

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6403:
URL: https://github.com/apache/arrow-datafusion/issues/6403#issuecomment-1558031754

   I have some thoughts about this and will provide them tomorrow


[GitHub] [arrow-datafusion] casperhart commented on issue #6403: No way to specify file extension in datafusion-cli

Posted by "casperhart (via GitHub)" <gi...@apache.org>.
casperhart commented on issue #6403:
URL: https://github.com/apache/arrow-datafusion/issues/6403#issuecomment-1560220892

   Just tried on main, works like a charm. I was running via `cargo run` yesterday but it hadn't picked up the more recent changes for some reason 🤔. Regardless, this is resolved now, thanks @alamb and @aprimadi!


[GitHub] [arrow-datafusion] alamb commented on issue #6403: No way to specify file extension in datafusion-cli

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6403:
URL: https://github.com/apache/arrow-datafusion/issues/6403#issuecomment-1559029726

   I think this sounds a lot like https://github.com/apache/arrow-datafusion/issues/1736 which was fixed by https://github.com/apache/arrow-datafusion/pull/6274 by @aprimadi  ❤️ 
   
   
   I just tried your reproducer on main and the data is selected as expected:
   
   ```
   alamb@MacBook-Pro-8:~/Software/arrow-datafusion2/datafusion-cli$ echo "col1, col2" > test.tsv
   echo "1, 2" >> test.tsv
   alamb@MacBook-Pro-8:~/Software/arrow-datafusion2/datafusion-cli$
   alamb@MacBook-Pro-8:~/Software/arrow-datafusion2/datafusion-cli$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df2 cargo run
       Finished dev [unoptimized + debuginfo] target(s) in 0.42s
        Running `/Users/alamb/Software/target-df2/debug/datafusion-cli`
   DataFusion CLI v25.0.0
   ❯ create external table test stored as csv with header row location "test.tsv";
   0 rows in set. Query took 0.026 seconds.
   ❯ select * from test;
   
   +------+-------+
   | col1 |  col2 |
   +------+-------+
   | 1    |  2    |
   +------+-------+
   ```
   
   @andygrove is in the process of finalizing the `25.0.0` release (hopefully it will be available in the next day or two), so I think this will be fixed in that release.

