You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "parkma99 (via GitHub)" <gi...@apache.org> on 2023/07/12 05:56:15 UTC
[GitHub] [arrow-datafusion] parkma99 opened a new pull request, #6927: Enrich CSV reader config: quote & escape
parkma99 opened a new pull request, #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927
# Which issue does this PR close?
<!--
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
-->
Closes #6854
# Rationale for this change
<!--
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
-->
# What changes are included in this PR?
Current add quote to CSV reader config.
<!--
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
-->
# Are these changes tested?
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
-->
# Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be updated before approving the PR.
-->
<!--
If there are any breaking changes to public APIs, please add the `api change` label.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #6927: Enrich CSV reader config: quote & escape
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1634845390
Hi @parkma99 I don't really follow your question
> Set escape as a Option<u8> is a chose. In this way, I perfer to change delimiter and quote to Option<u8> just like
I think this sounds reasonable to me (and if it is `None` fall back to whatever the arrow default is)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #6927: Enrich CSV reader config: quote & escape
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1649692509
In an attempt to clear the PR backlog, I took the liberty of merging this PR up from main and plan to merge it when CI passes
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6927: Enrich CSV reader config: quote & escape
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#discussion_r1273420825
##########
datafusion/proto/proto/datafusion.proto:
##########
@@ -88,6 +88,10 @@ message ProjectionColumns {
message CsvFormat {
bool has_header = 1;
string delimiter = 2;
+ string quote = 3;
+ oneof optional_escape {
Review Comment:
I think it is necessary to encode the fact that the escape might be `None` or might be `Some(..)` -- the other fields like `quote` are always some particular value
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6927: Enrich CSV reader config: quote & escape
Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#discussion_r1272441251
##########
datafusion/proto/proto/datafusion.proto:
##########
@@ -88,6 +88,10 @@ message ProjectionColumns {
message CsvFormat {
bool has_header = 1;
string delimiter = 2;
+ string quote = 3;
+ oneof optional_escape {
Review Comment:
Is `oneof` necessary?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] parkma99 commented on pull request #6927: Enrich CSV reader config: quote & escape
Posted by "parkma99 (via GitHub)" <gi...@apache.org>.
parkma99 commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1631985801
Hello @alamb , I read the code in
https://github.com/apache/arrow-rs/blob/6e0faaf24d84cded196be3abfccd80a154a92740/arrow-csv/src/reader/mod.rs#L1860
in comment says `escape` default is None.
I would like add `escape` as like `delimiter`, got a question how to set default `escape`.
`delimiter` default value is `,` and `quote` default value is `"`. In this line, comment says `escape` default value is Noneπ’.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb merged pull request #6927: Enrich CSV reader config: quote & escape
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Weijun-H commented on pull request #6927: Enrich CSV reader config: quote & escape
Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1648146208
LGTM! @parkma99
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #6927: Enrich CSV reader config: quote & escape
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1648494736
Since this PR has been open for a while I will merge it. @parkma99 let me know if you want to address @jackwener 's comments in https://github.com/apache/arrow-datafusion/pull/6927#discussion_r1272441251 in this PR otherwise I'll merge it in
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #6927: Enrich CSV reader config: quote & escape
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1640975848
Marking as draft while feedback is addressed. Please mark it as ready for review when it is ready for another look.
Thanks again @parkma99
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Weijun-H commented on pull request #6927: Enrich CSV reader config: quote & escape
Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1637046032
LGTM π
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6927: Enrich CSV reader config: quote & escape
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#discussion_r1264681066
##########
datafusion/core/src/datasource/physical_plan/csv.rs:
##########
@@ -656,7 +682,14 @@ mod tests {
let mut config = partitioned_csv_config(file_schema, file_groups)?;
config.projection = Some(vec![0, 2, 4]);
- let csv = CsvExec::new(config, true, b',', file_compression_type.to_owned());
+ let csv = CsvExec::new(
+ config,
+ true,
+ b',',
+ b'"',
+ None,
Review Comment:
I wonder if we can reduce the number of arguments to `CsvExec::new()` somehow (like maybe pass in the config π€ )
##########
datafusion/core/src/datasource/physical_plan/csv.rs:
##########
@@ -656,7 +682,14 @@ mod tests {
let mut config = partitioned_csv_config(file_schema, file_groups)?;
config.projection = Some(vec![0, 2, 4]);
- let csv = CsvExec::new(config, true, b',', file_compression_type.to_owned());
+ let csv = CsvExec::new(
+ config,
+ true,
+ b',',
+ b'"',
+ None,
Review Comment:
I wonder if we can reduce the number of arguments to `CsvExec::new()` somehow (like maybe pass in the config π€ )
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] parkma99 commented on pull request #6927: Enrich CSV reader config: quote & escape
Posted by "parkma99 (via GitHub)" <gi...@apache.org>.
parkma99 commented on PR #6927:
URL: https://github.com/apache/arrow-datafusion/pull/6927#issuecomment-1639130547
Thanks @alamb @Weijun-H, I will update code recently.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org