Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/20 00:52:11 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue, #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

andygrove opened a new issue, #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278

   **Describe the bug**
   This is a placeholder for now. I will add more detail in the next day or so, but the basic issue is that if we call `LogicalPlanBuilder::scan_csv` with a file named `employee.csv`, we end up with a `TableScan` whose `table_name` is `employee.csv`. If we then try to find a table with this name in the catalog, the lookup searches for a table named `csv` in schema `employee`.
   
   **To Reproduce**
   I found this during some refactoring.
   
   **Expected behavior**
   Table name should be valid.
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
comphead commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1112568903

   @andygrove if the file is named `xxx.yyy.zzz.csv`, or its name contains whitespace or other special symbols, what should the table name be? Do we have a list of special symbols, so that we can replace them with `_` or similar?
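
   One way to picture the "replace them with `_`" idea is the sketch below. It assumes a deliberately simple rule (ASCII letters, digits, and underscores are kept, everything else becomes `_`); `sanitize_table_name` is a hypothetical helper and not part of DataFusion.

   ```rust
   /// Hypothetical helper (not a DataFusion API): replace every character that
   /// is not an ASCII letter, digit, or underscore with `_`, and prefix `_`
   /// when the result is empty or would start with a digit.
   fn sanitize_table_name(raw: &str) -> String {
       let mut name: String = raw
           .chars()
           .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
           .collect();
       // Assumed rule: an identifier must not be empty or begin with a digit.
       if name.chars().next().map_or(true, |c| c.is_ascii_digit()) {
           name.insert(0, '_');
       }
       name
   }

   fn main() {
       println!("{}", sanitize_table_name("xxx.yyy.zzz.csv")); // xxx_yyy_zzz_csv
       println!("{}", sanitize_table_name("my file.csv")); // my_file_csv
   }
   ```

   Note that a rule like this is lossy: `a.b` and `a_b` would map to the same name, so collisions would still need to be handled somewhere.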




[GitHub] [arrow-datafusion] andygrove commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1139208490

   I'm no longer sure that we need to do anything for this issue, so I am going to close it. @comphead feel free to re-open if you disagree.




[GitHub] [arrow-datafusion] comphead commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
comphead commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1113037505

   @andygrove @alamb 
   ```
           let plan = LogicalPlanBuilder::scan_csv(
               Arc::new(LocalFileSystem {}),
               "/tmp/xxxx/employee.csv",
               CsvReadOptions::new().schema(&schema).has_header(true),
               Some(vec![3, 4]),
               4,
           )
   ```
   leads to `TableScan: /tmp/xxxx/employee.csv projection=Some([3, 4])`
   
   This happens because the file path is used as the table name.
   ```
           Self::scan_csv_with_name(
               object_store,
               path.clone(),
               options,
               projection,
               path, // <-- the path is passed as the table name here; the same applies to Avro, Parquet, etc.
               target_partitions,
           )
   ```
   
   The proposed solution is to strip the parent directory and the extension, leaving only the file name.
   Let me know your thoughts.
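
   A minimal sketch of that proposal, using only the Rust standard library. `table_name_from_path` is a hypothetical helper name and not an existing DataFusion function.

   ```rust
   use std::path::Path;

   /// Hypothetical helper: derive a table name from a file path by dropping
   /// the parent directories and the final extension.
   fn table_name_from_path(path: &str) -> Option<String> {
       Path::new(path)
           .file_stem() // file name without its last extension
           .and_then(|stem| stem.to_str()) // None if the name is not valid UTF-8
           .map(|stem| stem.to_string())
   }

   fn main() {
       // Prints Some("employee")
       println!("{:?}", table_name_from_path("/tmp/xxxx/employee.csv"));
   }
   ```

   `Path::file_stem` removes only the last extension, so `xxx.yyy.zzz.csv` would become `xxx.yyy.zzz`, which still contains dots and would still be read as a qualified name unless it is quoted.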




[GitHub] [arrow-datafusion] comphead commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
comphead commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1112572565

   We can stick to the PostgreSQL naming convention: https://www.postgresql.org/docs/7.0/syntax525.htm
   Any thoughts?




[GitHub] [arrow-datafusion] andygrove commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1113753014

   Sorry, just catching up here. This issue came up when I was attempting to refactor DF to have the plan just refer to table sources by name, and this would require names to be valid SQL object names. I eventually gave up on the refactor because I ran into too many places where our design really doesn't support this.
   
   Fundamentally, file scans don't have table names, but we store the path in an attribute named `table_name`. Perhaps a better approach here would be to introduce a `LogicalPlan::FileScan` which would be almost identical to `LogicalPlan::TableScan` but with `file_path` instead of `table_name`? Over time we could have `FileScan` and `TableScan` diverge as needed.
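
   To make the suggestion concrete, here is a rough sketch of what such a split could look like. The enum below is a heavily simplified stand-in for DataFusion's `LogicalPlan` (the real type has many more variants and fields); the `FileScan` variant, the `file_path` field, and the `catalog_name` method are hypothetical.

   ```rust
   /// Simplified stand-in for DataFusion's `LogicalPlan`, for illustration only.
   #[derive(Debug, Clone, PartialEq)]
   enum LogicalPlan {
       /// Scan of a table registered in the catalog; the name must be a valid SQL object name.
       TableScan { table_name: String, projection: Option<Vec<usize>> },
       /// Hypothetical variant: scan of a file, identified by its path rather than a name.
       FileScan { file_path: String, projection: Option<Vec<usize>> },
   }

   impl LogicalPlan {
       /// Only catalog-backed scans expose a name that can be resolved in the catalog.
       fn catalog_name(&self) -> Option<&str> {
           match self {
               LogicalPlan::TableScan { table_name, .. } => Some(table_name.as_str()),
               LogicalPlan::FileScan { .. } => None,
           }
       }
   }

   fn main() {
       let scan = LogicalPlan::FileScan {
           file_path: "/tmp/xxxx/employee.csv".to_string(),
           projection: Some(vec![3, 4]),
       };
       // A file scan never pretends to have a catalog name.
       assert_eq!(scan.catalog_name(), None);
       println!("{:?}", scan);
   }
   ```

   With a separate variant, the path no longer has to satisfy SQL naming rules, and code that resolves names against the catalog only needs to consider `TableScan`.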




[GitHub] [arrow-datafusion] comphead commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
comphead commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1113738338

   > I haven't tried the scenario described in this report, but I would expect to be able to refer to a table named `employee.csv` using `"employee.csv"` (i.e. put it in double quotes).
   > 
   > I don't necessarily think munging the file name is what a user might expect.
   
   To solve this we need more details.
   @andygrove reported that the table name is currently set to the full file path. That happens because the table name is derived from the full file path:
   
   ```
           Self::scan_csv_with_name(
               object_store,
               path.clone(),
               options,
               projection,
               path, // <-- this argument is the table_name parameter
               target_partitions,
           )
   ```
   
   The question is how to derive a table_name correctly. 
   
   




[GitHub] [arrow-datafusion] comphead commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
comphead commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1115957047

   I'll try to implement a FileScan for file read operations. 




[GitHub] [arrow-datafusion] comphead commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
comphead commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1112189221

   I'll try to fix it.




[GitHub] [arrow-datafusion] alamb commented on issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2278:
URL: https://github.com/apache/arrow-datafusion/issues/2278#issuecomment-1113602539

   I haven't tried the scenario described in this report, but I would expect to be able to refer to a table named `employee.csv` using `"employee.csv"` (i.e. put it in double quotes).
   
   I don't necessarily think munging the file name is what a user might expect.
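
   The difference between the unquoted and quoted forms can be sketched as follows. This is a simplified, hypothetical resolver and not DataFusion's actual parsing code: it splits a table reference on `.` unless the dot is inside double quotes, and it does not handle escaped quotes (`""`).

   ```rust
   /// Hypothetical, simplified splitter for a possibly qualified table reference.
   fn split_table_reference(reference: &str) -> Vec<String> {
       let mut parts = Vec::new();
       let mut current = String::new();
       let mut in_quotes = false;
       for c in reference.chars() {
           match c {
               // A double quote toggles quoting and is not part of the name.
               '"' => in_quotes = !in_quotes,
               // An unquoted dot separates schema and table parts.
               '.' if !in_quotes => parts.push(std::mem::take(&mut current)),
               _ => current.push(c),
           }
       }
       parts.push(current);
       parts
   }

   fn main() {
       // Unquoted: read as schema `employee`, table `csv`.
       println!("{:?}", split_table_reference("employee.csv"));
       // Quoted: a single table named `employee.csv`.
       println!("{:?}", split_table_reference("\"employee.csv\""));
   }
   ```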




[GitHub] [arrow-datafusion] andygrove closed issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names

Posted by GitBox <gi...@apache.org>.
andygrove closed issue #2278: LogicalPlanBuilder::scan_csv creates scans with invalid table names
URL: https://github.com/apache/arrow-datafusion/issues/2278

