Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/02/08 18:50:24 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #3677: Add CSV Decoder::capacity (#3674)

tustvold opened a new pull request, #3677:
URL: https://github.com/apache/arrow-rs/pull/3677

   _draft to collect feedback_
   
   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3674
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677#discussion_r1102731412


##########
arrow-csv/src/lib.rs:
##########
@@ -17,6 +17,8 @@
 
 //! Transfer data between the Arrow memory format and CSV (comma-separated values).
 
+extern crate core;

Review Comment:
   Stupid IDE being stupid, will remove






[GitHub] [arrow-rs] alamb commented on a diff in pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677#discussion_r1102800121


##########
arrow-csv/src/reader/mod.rs:
##########
@@ -2269,4 +2274,73 @@ mod tests {
             "Csv error: Encountered invalid UTF-8 data for line 1 and field 1",
         );
     }
+

Review Comment:
   Thank you for the explanation





[GitHub] [arrow-rs] ursabot commented on pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677#issuecomment-1425948504

   Benchmark runs are scheduled for baseline = 5b1821e0564f586f6f98e5c392a0a208890055df and contender = 3e08a754ff08ebe51086b4f65beefb99769b4068. 3e08a754ff08ebe51086b4f65beefb99769b4068 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/0b451d832075462a8896ea2e4b0ba84b...f3361fac96ec4a3fb14c95c9339cb164/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9c04d2a702644220a7ed4e78e8398d29...fb1e2076fa5945209773a475e99191d8/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2abe12f079ec44cbafb34696acedfc57...ab63460e811b457a8c8cfa5c5f2716db/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b36c046ca51d433293a90a7d3d25a0a7...044c57f6ffd34609931689b97f7d15be/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677#discussion_r1102730781


##########
arrow-csv/src/reader/mod.rs:
##########
@@ -438,10 +438,10 @@ impl<R: BufRead> BufReader<R> {
         loop {
             let buf = self.reader.fill_buf()?;
             let decoded = self.decoder.decode(buf)?;
-            if decoded == 0 {
+            self.reader.consume(decoded);
+            if decoded == 0 || self.decoder.capacity() == 0 {

Review Comment:
   Nope, it will yield only if it has fully read `batch_size` rows. I.e. it will yield once it has read enough data, instead of looping around and calling `fill_buf` again to potentially read more data that it is just going to ignore (as it has already read `batch_size` rows).
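   
   To make the control flow concrete, here is a self-contained sketch of the changed loop (the `StubDecoder` type below is a stand-in invented for illustration; it only counts newline-terminated rows and is not the real arrow-csv `Decoder`):
   
   ```rust
   use std::io::{BufRead, Cursor};
   
   /// Simplified stand-in for arrow-csv's `Decoder`: tracks how many of the
   /// requested `batch_size` rows remain to be decoded. Invented for this
   /// sketch; it is not the real API.
   struct StubDecoder {
       remaining_rows: usize,
   }
   
   impl StubDecoder {
       /// Consume whole newline-terminated rows from `buf`, up to the
       /// remaining row budget, returning the number of bytes decoded
       /// (0 when `buf` is empty or the budget is exhausted).
       fn decode(&mut self, buf: &[u8]) -> usize {
           let mut consumed = 0;
           for (i, &b) in buf.iter().enumerate() {
               if self.remaining_rows == 0 {
                   break;
               }
               if b == b'\n' {
                   self.remaining_rows -= 1;
                   consumed = i + 1;
               }
           }
           consumed
       }
   
       fn capacity(&self) -> usize {
           self.remaining_rows
       }
   }
   
   fn read_batch<R: BufRead>(reader: &mut R, decoder: &mut StubDecoder) -> std::io::Result<()> {
       loop {
           let buf = reader.fill_buf()?;
           let decoded = decoder.decode(buf);
           reader.consume(decoded);
           // The fix: also stop once the decoder has buffered `batch_size`
           // rows (capacity() == 0), rather than looping back into fill_buf
           // for data that would be ignored anyway.
           if decoded == 0 || decoder.capacity() == 0 {
               break;
           }
       }
       Ok(())
   }
   
   fn main() {
       let mut reader = Cursor::new(b"a\nb\nc\nd\n".to_vec());
       let mut decoder = StubDecoder { remaining_rows: 2 };
       read_batch(&mut reader, &mut decoder).unwrap();
       // Two rows decoded; the remaining input is left unread for the next batch.
       assert_eq!(decoder.capacity(), 0);
       let mut rest = String::new();
       std::io::Read::read_to_string(&mut reader, &mut rest).unwrap();
       assert_eq!(rest, "c\nd\n");
   }
   ```
   
   With the old condition (`decoded == 0` only), the loop above would call `fill_buf` one extra time after the budget was already exhausted before breaking.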





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677#discussion_r1102733671


##########
arrow-csv/src/reader/mod.rs:
##########
@@ -2269,4 +2274,73 @@ mod tests {
             "Csv error: Encountered invalid UTF-8 data for line 1 and field 1",
         );
     }
+

Review Comment:
   > namely that data that has been read can be decoded and read out as record batches prior to sending the end of stream.
   
   Because that isn't the issue; it has never been an issue. The _problem_ was that it would try to fill the buffer again, even if it had already read `batch_size` rows.
   
   Without the change in this PR you have
   
   ```
   fill_sizes: [23, 3, 3, 0, 0]
   ```
   
   In the case of a streaming decoder, this could potentially cause it to wait for more input that it doesn't actually need, as it already has the requisite number of rows.





[GitHub] [arrow-rs] alamb commented on a diff in pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677#discussion_r1102801001


##########
arrow-csv/src/reader/mod.rs:
##########
@@ -438,10 +438,10 @@ impl<R: BufRead> BufReader<R> {
         loop {
             let buf = self.reader.fill_buf()?;
             let decoded = self.decoder.decode(buf)?;
-            if decoded == 0 {
+            self.reader.consume(decoded);
+            if decoded == 0 || self.decoder.capacity() == 0 {

Review Comment:
   ```suggestion
               // Yield only once it has fully read `batch_size` rows, instead of
               // looping around and calling `fill_buf` again to potentially read
               // more data that it would just ignore (as it has already read
               // `batch_size` rows)
               if decoded == 0 || self.decoder.capacity() == 0 {
   ```





[GitHub] [arrow-rs] alamb commented on a diff in pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677#discussion_r1102705589


##########
arrow-csv/src/reader/mod.rs:
##########
@@ -438,10 +438,10 @@ impl<R: BufRead> BufReader<R> {
         loop {
             let buf = self.reader.fill_buf()?;
             let decoded = self.decoder.decode(buf)?;
-            if decoded == 0 {
+            self.reader.consume(decoded);
+            if decoded == 0 || self.decoder.capacity() == 0 {

Review Comment:
   This has the effect of potentially creating smaller output batches, right? Basically the reader will now yield any batch it already has buffered.



##########
arrow-csv/src/reader/mod.rs:
##########
@@ -2269,4 +2274,73 @@ mod tests {
             "Csv error: Encountered invalid UTF-8 data for line 1 and field 1",
         );
     }
+

Review Comment:
   I am sorry if I am misunderstanding the rationale for this change, but I don't understand how this test mimics what is described in https://github.com/apache/arrow-rs/issues/3674 -- namely that data that has been read can be decoded and read out as record batches prior to sending the end of stream.
   
   I wonder can we write a test like
   
   1. Set the batch size to 5 (bigger than the output data)
   2. Send in "foo,bar\nbaz,foo\n"
   3. Read those two records <-- as I understand it, this is what cannot be done today
   4. Feed in "a,b\nc,d" + EOS
   5. Read the final two records
   
   I struggle to map this test to that use case -- it would be hard for me to understand the purpose of this test if I saw it without context. E.g. why is it important that there were two calls to `fill(0)`?
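   
   For illustration only, the five steps above might look like the following against a purely push-based decoder (the `PushDecoder` type and its `decode`/`flush` methods are hypothetical, invented for this sketch; they are not arrow-csv's actual `Decoder` API):
   
   ```rust
   /// Hypothetical push-style decoder, used only to illustrate the test
   /// shape sketched above. Rows are modelled as plain strings rather
   /// than record batches to keep the sketch self-contained.
   struct PushDecoder {
       batch_size: usize,
       buffered: Vec<String>,
   }
   
   impl PushDecoder {
       /// Buffer every newline-separated row in `data`.
       fn decode(&mut self, data: &str) {
           self.buffered.extend(data.lines().map(|l| l.to_string()));
       }
   
       /// Flush whatever rows are buffered, up to `batch_size`,
       /// without requiring end-of-stream first.
       fn flush(&mut self) -> Vec<String> {
           let n = self.batch_size.min(self.buffered.len());
           self.buffered.drain(..n).collect()
       }
   }
   
   fn main() {
       let mut d = PushDecoder { batch_size: 5, buffered: vec![] };
       d.decode("foo,bar\nbaz,foo\n");  // step 2: send in two rows
       assert_eq!(d.flush().len(), 2);  // step 3: read them before EOS
       d.decode("a,b\nc,d");            // step 4: feed the rest (+ EOS)
       assert_eq!(d.flush().len(), 2);  // step 5: read the final two rows
   }
   ```
   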



##########
arrow-csv/src/lib.rs:
##########
@@ -17,6 +17,8 @@
 
 //! Transfer data between the Arrow memory format and CSV (comma-separated values).
 
+extern crate core;

Review Comment:
   What is the purpose of this line?





[GitHub] [arrow-rs] tustvold merged pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3677: Add CSV Decoder::capacity (#3674)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3677:
URL: https://github.com/apache/arrow-rs/pull/3677#discussion_r1102733671


##########
arrow-csv/src/reader/mod.rs:
##########
@@ -2269,4 +2274,73 @@ mod tests {
             "Csv error: Encountered invalid UTF-8 data for line 1 and field 1",
         );
     }
+

Review Comment:
   > namely that data that has been read can be decoded and read out as record batches prior to sending the end of stream.
   
   Because that isn't the issue. The _problem_ was that it would try to fill the buffer again, even if it had already read `batch_size` rows.
   
   Without the change in this PR you have
   
   ```
   fill_sizes: [23, 3, 3, 0, 0]
   ```
   
   In the case of a streaming decoder, this could potentially cause it to wait for more input that it doesn't actually need, as it already has the requisite number of rows.
   
   > I wonder can we write a test like
   
   This has never been supported, and realistically can't be supported, as `BufRead::fill_buf` will only return an empty slice on EOS; otherwise it will block. There is no `fill_buf`-only-if-data-is-available variant that I am aware of.
   
   Edit: it would appear there is an experimental API: https://github.com/rust-lang/rust/issues/86423
   
   Edit Edit: this might actually be impossible in general: https://stackoverflow.com/questions/5431941/why-is-while-feoffile-always-wrong
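   
   For illustration, a minimal `BufRead` implementation that records the size of the slice returned by each `fill_buf` call shows where the trailing zero entries in a `fill_sizes` trace come from (`RecordingReader` is invented for this sketch and is not part of arrow-rs; unlike a socket, it never blocks, it just returns what it has):
   
   ```rust
   use std::io::{BufRead, Read, Result};
   
   /// Hypothetical test double: serves bytes from a slice and records the
   /// size of the buffer returned by each `fill_buf` call.
   struct RecordingReader<'a> {
       data: &'a [u8],
       pos: usize,
       fill_sizes: Vec<usize>,
   }
   
   impl<'a> Read for RecordingReader<'a> {
       fn read(&mut self, out: &mut [u8]) -> Result<usize> {
           let n = {
               let buf = self.fill_buf()?;
               let n = buf.len().min(out.len());
               out[..n].copy_from_slice(&buf[..n]);
               n
           };
           self.consume(n);
           Ok(n)
       }
   }
   
   impl<'a> BufRead for RecordingReader<'a> {
       fn fill_buf(&mut self) -> Result<&[u8]> {
           let remaining = &self.data[self.pos..];
           self.fill_sizes.push(remaining.len());
           Ok(remaining)
       }
       fn consume(&mut self, amt: usize) {
           self.pos += amt;
       }
   }
   
   fn main() {
       let mut r = RecordingReader { data: b"a,b\nc,d\n", pos: 0, fill_sizes: vec![] };
       // A loop that only stops on an empty slice necessarily makes one
       // extra fill_buf call after the data is exhausted, observing size 0.
       // A real streaming source would *block* at that point instead.
       loop {
           let n = r.fill_buf().map(|b| b.len()).unwrap();
           r.consume(n);
           if n == 0 {
               break;
           }
       }
       assert_eq!(r.fill_sizes, vec![8, 0]);
   }
   ```
   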




