You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/23 19:45:00 UTC

[GitHub] [arrow] seddonm1 opened a new pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

seddonm1 opened a new pull request #9001:
URL: https://github.com/apache/arrow/pull/9001


   The current CSV reader cannot parse strings to types with leading/trailing white spaces as the parsers are very strict. This means being able to read and parse the [tpch-dbgen included answers](https://github.com/databricks/tpch-dbgen/tree/master/answers) files is not possible.
   
   The underlying csv crate supports a four different [behaviors for trimming strings](https://docs.rs/csv/1.1.5/csv/enum.Trim.html): 
   - `None` (default): does no trimming.
   - `Headers`: trim only header fields.
   - `Fields`: trim only field values.
   - `All`: trim both headers and field values.
   
   Rather than exposing all these options and forcing users to understand the underlying csv crate this PR simplifies this decision to boolean: `None` (false) or `All` (true) while retaining the default false behavior.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 edited a comment on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
seddonm1 edited a comment on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750456630


   I was referencing the underlying library that Spark uses: https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/csv/CsvParser.java#L116
   
   Here are the Spark CSV reader options:Lhttps://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame
   
   Of course you could go down a rabbit hole trying to support all use cases.
   
   I am more than happy to kill this PR if we make the decision that it doesn't belong here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-751319155


   Closed in favor of https://issues.apache.org/jira/browse/ARROW-11036


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 closed pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
seddonm1 closed pull request #9001:
URL: https://github.com/apache/arrow/pull/9001


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750456630


   I was referencing the underlying library that Spark uses: https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/csv/CsvParser.java#L116
   
   I am more than happy to kill this PR if we make the decision that it doesn't belong here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750464473


   It actually looks like we are missing the ability to read a csv and return the values all strings (similar to how infer schema works) without also trying to parse the values or having to provide an all DataType::Utf8 schema.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-751317950


   @Dandandan I have taken the read -> trim -> parse approach here: https://github.com/apache/arrow/pull/9015
   
   I think I will close this an open a new ticket that allows the CSVReader to infer number of columns (with named from headers if provided) but return all DataType::Utf8. Thoughts?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750447265


   https://issues.apache.org/jira/browse/ARROW-11013


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 edited a comment on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
seddonm1 edited a comment on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750456630


   I was referencing the underlying library that Spark uses: https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/csv/CsvParser.java#L116
   
   Here are the Spark CSV reader options: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame
   
   Of course you could go down a rabbit hole trying to support all use cases.
   
   I am more than happy to kill this PR if we make the decision that it doesn't belong here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750452242


   > Thanks for the PR @seddonm1!
   > Wouldn't it be better to keep the CSV reader simple and do the trim after loading the CSV (in Arrow/DataFusion)? There it could be a simple `trim(col)` too. I the use case is relatively limited where you actually want to apply trimming on _all_ columns.
   > 
   > I'm hesitant too to depend on the "more advanced" csv crate features, I think at some point it makes sense to utilize `csv_core` instead (for better performance).
   
   Yes it may be so I think this is up for discussion.
   
   I have added the `trim` function in https://github.com/apache/arrow/pull/8966 but the actual flow is read csv -> trim -> cast.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-751318921


   I think that is a great idea @seddonm1 . We can always revisit later if it turns out we really need it in the parser! Thank you!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750450724


   Thanks for the PR @seddonm1!
   Wouldn't it be better to keep the CSV reader simple and do the trim after loading the CSV (in Arrow/DataFusion)? There it could be a simple `trim(col)` too. I the use case is relatively limited where you actually want to apply trimming on _all_ columns. 
   
   I'm hesitant too to depend on the "more advanced" csv crate features, I think at some point it makes sense to utilize `csv_core` instead (for better performance).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750455303


   Are you aware of any other parser that does similar trimming like the csv crate?
   Looking at `pandas` for example it seems that it also doesn't have the option to trim whitespaces from header / values directly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9001: ARROW-11013: [Rust][DataFusion] Add trim to CsvReader

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9001:
URL: https://github.com/apache/arrow/pull/9001#issuecomment-750450724


   Thanks for the PR @seddonm1!
   Wouldn't it be better to keep the CSV reader simple and do the trim after loading the CSV (in Arrow/DataFusion)? There it could be a simple `trim(col)` too. I the use case is relatively limited where you actually want to apply trimming on _all_ columns. 
   
   I'm hesitant too to depend on the "more advanced" csv crate features, I think at some point it makes sense to utilize `csv_core` instead (for better performance).
   
   EDIT: Ah I didn't read your message carefully... Will have a look


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org