You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "parkma99 (via GitHub)" <gi...@apache.org> on 2023/07/12 05:56:15 UTC

[GitHub] [arrow-datafusion] parkma99 opened a new pull request, #6927: Enrich CSV reader config: quote & escape

parkma99 opened a new pull request, #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #6854 
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   Current add quote to CSV reader config.
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6927: Enrich CSV reader config: quote & escape

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1634845390

   Hi @parkma99  I don't really follow your question
   
   > Set escape as a Option<u8> is a chose. In this way, I perfer to change delimiter and quote to Option<u8> just like
   
   I think this sounds reasonable to me (and if it is `None` fall back to whatever the arrow default is)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6927: Enrich CSV reader config: quote & escape

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1649692509

   In an attempt to clear the PR backlog, I took the liberty of merging this PR up from main and plan to merge it when CI passes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6927: Enrich CSV reader config: quote & escape

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#discussion_r1273420825


##########
datafusion/proto/proto/datafusion.proto:
##########
@@ -88,6 +88,10 @@ message ProjectionColumns {
 message CsvFormat {
   bool has_header = 1;
   string delimiter = 2;
+  string quote = 3;
+  oneof optional_escape {

Review Comment:
   I think it is necessary to encode the fact that the escape might be `None` or might be `Some(..)` -- the other fields like `quote` are always some particular value



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6927: Enrich CSV reader config: quote & escape

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#discussion_r1272441251


##########
datafusion/proto/proto/datafusion.proto:
##########
@@ -88,6 +88,10 @@ message ProjectionColumns {
 message CsvFormat {
   bool has_header = 1;
   string delimiter = 2;
+  string quote = 3;
+  oneof optional_escape {

Review Comment:
   Is `oneof` necessary?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] parkma99 commented on pull request #6927: Enrich CSV reader config: quote & escape

Posted by "parkma99 (via GitHub)" <gi...@apache.org>.
parkma99 commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1631985801

   Hello @alamb , I read the  code in 
   https://github.com/apache/arrow-rs/blob/6e0faaf24d84cded196be3abfccd80a154a92740/arrow-csv/src/reader/mod.rs#L1860
   
   in comment says `escape` default is None.
   
   I would like add `escape` as like `delimiter`, got a question how to set default `escape`.
   
   `delimiter` default value is `,` and `quote` default value is `"`. In this line, comment says `escape` default value is None😒.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #6927: Enrich CSV reader config: quote & escape

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Weijun-H commented on pull request #6927: Enrich CSV reader config: quote & escape

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1648146208

   LGTM! @parkma99 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6927: Enrich CSV reader config: quote & escape

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1648494736

   Since this PR has been open for a while I will merge it. @parkma99  let me know if you want to address @jackwener 's comments in https://github.com/apache/arrow-datafusion/pull/6927#discussion_r1272441251 in this PR otherwise I'll merge it in


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6927: Enrich CSV reader config: quote & escape

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1640975848

   Marking as draft while feedback is addressed. Please mark it as ready for review when it is ready for another look.
   
   Thanks again @parkma99 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Weijun-H commented on pull request #6927: Enrich CSV reader config: quote & escape

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1637046032

   LGTM πŸ‘


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6927: Enrich CSV reader config: quote & escape

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#discussion_r1264681066


##########
datafusion/core/src/datasource/physical_plan/csv.rs:
##########
@@ -656,7 +682,14 @@ mod tests {
         let mut config = partitioned_csv_config(file_schema, file_groups)?;
         config.projection = Some(vec![0, 2, 4]);
 
-        let csv = CsvExec::new(config, true, b',', file_compression_type.to_owned());
+        let csv = CsvExec::new(
+            config,
+            true,
+            b',',
+            b'"',
+            None,

Review Comment:
   I wonder if we can reduce the number of arguments to `CsvExec::new()` somehow (like maybe pass in the config πŸ€” )



##########
datafusion/core/src/datasource/physical_plan/csv.rs:
##########
@@ -656,7 +682,14 @@ mod tests {
         let mut config = partitioned_csv_config(file_schema, file_groups)?;
         config.projection = Some(vec![0, 2, 4]);
 
-        let csv = CsvExec::new(config, true, b',', file_compression_type.to_owned());
+        let csv = CsvExec::new(
+            config,
+            true,
+            b',',
+            b'"',
+            None,

Review Comment:
   I wonder if we can reduce the number of arguments to `CsvExec::new()` somehow (like maybe pass in the config πŸ€” )



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] parkma99 commented on pull request #6927: Enrich CSV reader config: quote & escape

Posted by "parkma99 (via GitHub)" <gi...@apache.org>.
parkma99 commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1639130547

   Thanks @alamb  @Weijun-H, I will update code recently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org