Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/15 18:15:05 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue #888: [DataFusion CLI] Support querying CSV files without providing the schema

andygrove opened a new issue #888:
URL: https://github.com/apache/arrow-datafusion/issues/888


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I would like to be able to query CSV files without having to provide the schema.
   
   **Describe the solution you'd like**
   The `CREATE EXTERNAL` table command should not require columns to be provided for CSV files. There should be an option for specifying whether the CSV file has a header row with column names.
   
   **Describe alternatives you've considered**
   N/A
   
   **Additional context**
   N/A
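
The header-row option requested above decides where column names come from. A minimal sketch of that choice in plain Rust follows. It assumes a naive comma split with no quoted fields, and both the `column_names` helper and the `column_N` fallback naming are hypothetical illustrations, not DataFusion's or arrow's actual behavior.

```rust
/// Derive column names from the first line of a CSV file.
/// Naive: splits on ',' and does not handle quoted fields.
fn column_names(first_line: &str, has_header: bool) -> Vec<String> {
    // Drop a trailing line terminator before splitting.
    let fields = first_line
        .trim_end_matches(|c: char| c == '\r' || c == '\n')
        .split(',');
    if has_header {
        // The first line carries the names.
        fields.map(|f| f.trim().to_string()).collect()
    } else {
        // Hypothetical naming scheme for header-less files: column_1, column_2, ...
        fields
            .enumerate()
            .map(|(i, _)| format!("column_{}", i + 1))
            .collect()
    }
}

fn main() {
    println!("{:?}", column_names("id,name,price", true));
    println!("{:?}", column_names("1,widget,9.99", false));
}
```

With a header, the first line is consumed as names and must be skipped when reading data; without one, the first line is already a data row.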
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] sum12 edited a comment on issue #888: [DataFusion CLI] Support querying CSV files without providing the schema

Posted by GitBox <gi...@apache.org>.

sum12 edited a comment on issue #888:
URL: https://github.com/apache/arrow-datafusion/issues/888#issuecomment-910129402


   It looks like the column definitions are required. Do we want to make those optional (and default to inference if column defs are missing), or add an extra keyword to require inference (something like `CREATE EXTERNAL TABLE WITH INFERED COLUMNS STORED .....`)?
   
   Also, if we are inferring the schema, then the API currently needs to read a set of records to actually do the inference. Do we want to control the number of rows read (the default is to read the entire file)?



[GitHub] [arrow-datafusion] alamb commented on issue #888: [DataFusion CLI] Support querying CSV files without providing the schema

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #888:
URL: https://github.com/apache/arrow-datafusion/issues/888#issuecomment-901266951


   There is already code to infer the schema when reading CSV files in arrow: https://docs.rs/arrow/5.2.0/arrow/csv/reader/struct.ReaderBuilder.html#method.infer_schema
   
   This ticket would likely involve connecting that up in DataFusion.



[GitHub] [arrow-datafusion] alamb commented on issue #888: [DataFusion CLI] Support querying CSV files without providing the schema

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #888:
URL: https://github.com/apache/arrow-datafusion/issues/888#issuecomment-916816150


   Hi @sum12  -- 
   
   > it looks like the columns definitions are required. 
   
   I am not sure about this -- I think the following works (note there are no column definitions). I was imagining we would do something similar for CSV
   
   ```sql
   CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION '/Users/alamb/Downloads/demo.parquet';
   ```
   
   > Also if we are inferring the schema then the API currently needs to read a set of records to actually do the inference. Do we want to control the number of rows read (default is to read the entire file) ?
   
   I think having a setting (in https://docs.rs/datafusion/5.0.0/datafusion/execution/context/struct.ExecutionConfig.html) would be helpful. Reading a thousand rows is probably a good default, as most CSV files that have a detectable schema at all will reveal it in their initial rows.
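
To make the row-limit trade-off discussed above concrete, here is a minimal, std-only sketch of sampling-based type inference. It is not arrow's `infer_schema` implementation: the `ColType` enum, the `infer_types` function, and the widening rules are hypothetical, and fields are split naively on commas with no quote handling.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int64,
    Float64,
    Boolean,
    Utf8,
}

/// Classify a single non-empty field by trying progressively looser parses.
fn classify(field: &str) -> ColType {
    if field.parse::<i64>().is_ok() {
        ColType::Int64
    } else if field.parse::<f64>().is_ok() {
        ColType::Float64
    } else if field.eq_ignore_ascii_case("true") || field.eq_ignore_ascii_case("false") {
        ColType::Boolean
    } else {
        ColType::Utf8
    }
}

/// Widen two observed types to one that can represent both.
fn merge(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (x, y) if x == y => x,
        (Int64, Float64) | (Float64, Int64) => Float64,
        _ => Utf8,
    }
}

/// Infer one type per column from at most `max_rows` data rows
/// (`None` means scan all rows).
fn infer_types(data: &str, has_header: bool, max_rows: Option<usize>) -> Vec<ColType> {
    let rows = data.lines().skip(if has_header { 1 } else { 0 });
    let limit = max_rows.unwrap_or(usize::MAX);
    let mut types: Vec<Option<ColType>> = Vec::new();
    for line in rows.take(limit) {
        for (i, field) in line.split(',').enumerate() {
            if i >= types.len() {
                types.resize(i + 1, None);
            }
            let field = field.trim();
            if field.is_empty() {
                continue; // treat empty fields as nulls: no type evidence
            }
            let seen = classify(field);
            types[i] = Some(match types[i] {
                Some(prev) => merge(prev, seen),
                None => seen,
            });
        }
    }
    // Columns with no evidence at all fall back to strings.
    types.into_iter().map(|t| t.unwrap_or(ColType::Utf8)).collect()
}

fn main() {
    let data = "id,price,name\n1,9.5,a\n2,10,b\nx,oops,c\n";
    println!("{:?}", infer_types(data, true, Some(2)));
    println!("{:?}", infer_types(data, true, None));
}
```

In this example, sampling only the first two rows yields `[Int64, Float64, Utf8]`, while scanning the whole input widens the first two columns to `Utf8` because of the third row. A bounded sample is cheaper but can miss late rows that contradict the inferred types, which is why the limit is worth exposing as a setting.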



[GitHub] [arrow-datafusion] alamb closed issue #888: [DataFusion CLI] Support querying CSV files without providing the schema

Posted by GitBox <gi...@apache.org>.

alamb closed issue #888:
URL: https://github.com/apache/arrow-datafusion/issues/888


   



[GitHub] [arrow-datafusion] sum12 commented on issue #888: [DataFusion CLI] Support querying CSV files without providing the schema

Posted by GitBox <gi...@apache.org>.

sum12 commented on issue #888:
URL: https://github.com/apache/arrow-datafusion/issues/888#issuecomment-910129402


   It looks like the column definitions are required. Do we want to make those optional (and default to inference if column defs are missing), or add an extra keyword to require inference (something like `CREATE EXTERNAL TABLE WITH INFERED COLUMNS STORED .....`)?



[GitHub] [arrow-datafusion] sum12 commented on issue #888: [DataFusion CLI] Support querying CSV files without providing the schema

Posted by GitBox <gi...@apache.org>.

sum12 commented on issue #888:
URL: https://github.com/apache/arrow-datafusion/issues/888#issuecomment-910126758


   I had a look at this.

