You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/30 13:53:17 UTC

[GitHub] [arrow] pitrou opened a new pull request, #13264: ARROW-16272: [Python] Fix NativeFile.read1()

pitrou opened a new pull request, #13264:
URL: https://github.com/apache/arrow/pull/13264

   `read1(nbytes=None)` should not read the entire input stream but instead return a reasonable amount of bytes, suitable for building up an internal buffer.
   
   Should fix the performance issue when using `TextIOWrapper` or `pandas.read_csv` on a S3 input file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

pitrou commented on PR #13264:
URL: https://github.com/apache/arrow/pull/13264#issuecomment-1141187165

   @ursabot please benchmark lang=Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13264:
URL: https://github.com/apache/arrow/pull/13264#issuecomment-1141187112

   https://issues.apache.org/jira/browse/ARROW-16272


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13264:
URL: https://github.com/apache/arrow/pull/13264#discussion_r885582892


##########
python/pyarrow/io.pxi:
##########
@@ -436,14 +440,22 @@ cdef class NativeFile(_Weakrefable):
     def read1(self, nbytes=None):
         """Read and return up to n bytes.
 
-        Alias for read, needed to match the BufferedIOBase interface.
+        A short result does not imply that EOF is imminent.
 
         Parameters
         ----------
         nbytes : int
             The maximum number of bytes to read.
         """
-        return self.read(nbytes=None)
+        if nbytes is None:
+            # The expectation when passing `nbytes=None` is not to read the
+            # entire file but to issue a single underlying read call up to
+            # a reasonable size (the use case being to read a bufferable
+            # amount of bytes, such as with io.TextIOWrapper).
+            nbytes = self._default_chunk_size
+        else:
+            nbytes = min(nbytes, self._default_chunk_size)

Review Comment:
   Hmm, you're right that it may not be necessary, Python's IO stack is happily letting you read1() large sizes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou closed pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

pitrou closed pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()
URL: https://github.com/apache/arrow/pull/13264


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #13264:
URL: https://github.com/apache/arrow/pull/13264#issuecomment-1143034740

   Benchmark runs are scheduled for baseline = b85139204a7d6f9e92daf8d6987695ce133a1aa6 and contender = 31494860c23ff7fe38a748a09c58822378605477. 31494860c23ff7fe38a748a09c58822378605477 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/8e63235370a0482bab4db9811201d4cc...6e29635e9ee04b2e970138601d0c597c/)
   [Finished :arrow_down:1.09% :arrow_up:0.43%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ca29be1dc564c19900d28916492d309...86f61aee97644f30b67de9804f1fea53/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/49292ba932f14335b13b8f7274f3aa48...7578b2b5416848fea84217bd644bfe7a/)
   [Finished :arrow_down:1.34% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/0d6953fcbfdb44c7a656e6317ecdc061...265111cefac64525b238f3f0a3d0228b/)
   Buildkite builds:
   [Finished] [`31494860` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/856)
   [Finished] [`31494860` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/856)
   [Failed] [`31494860` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/846)
   [Finished] [`31494860` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/858)
   [Finished] [`b8513920` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/855)
   [Finished] [`b8513920` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/855)
   [Finished] [`b8513920` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/845)
   [Finished] [`b8513920` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/857)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a diff in pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

lidavidm commented on code in PR #13264:
URL: https://github.com/apache/arrow/pull/13264#discussion_r885567674


##########
python/pyarrow/tests/test_fs.py:
##########
@@ -1649,3 +1649,20 @@ def check_copied_files(destination_dir):
     destination_dir5.mkdir()
     copy_files(source_dir, destination_dir5, chunk_size=1, use_threads=False)
     check_copied_files(destination_dir5)
+
+
+@pytest.mark.pandas
+@pytest.mark.s3
+def test_pandas_text_reader_s3(tmpdir):
+    # ARROW-16272: Pandas' read_csv() should not exhaust a S3 input stream
+    # when a small nrows is passed.
+    import pandas as pd
+    from pyarrow.fs import S3FileSystem
+
+    fs = S3FileSystem(anonymous=True, region="us-east-2")
+    f = fs.open_input_file("ursa-qa/nyctaxi/yellow_tripdata_2010-01.csv")
+
+    df = pd.read_csv(f, nrows=2)
+    assert list(df["vendor_id"]) == ["VTS", "DDS"]
+    # Some readahead occurred, but not up to the end of file (which is ~2 GB)
+    assert f.tell() <= 256 * 1024

Review Comment:
   To me it seems S3 is unnecessary here? Or at least the 'real' S3 is unnecessary here?



##########
python/pyarrow/io.pxi:
##########
@@ -436,14 +440,22 @@ cdef class NativeFile(_Weakrefable):
     def read1(self, nbytes=None):
         """Read and return up to n bytes.
 
-        Alias for read, needed to match the BufferedIOBase interface.
+        A short result does not imply that EOF is imminent.
 
         Parameters
         ----------
         nbytes : int
             The maximum number of bytes to read.
         """
-        return self.read(nbytes=None)
+        if nbytes is None:
+            # The expectation when passing `nbytes=None` is not to read the
+            # entire file but to issue a single underlying read call up to
+            # a reasonable size (the use case being to read a bufferable
+            # amount of bytes, such as with io.TextIOWrapper).
+            nbytes = self._default_chunk_size
+        else:
+            nbytes = min(nbytes, self._default_chunk_size)

Review Comment:
   Why are we limiting the read size to the chunk size when an explicit size is passed? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13264:
URL: https://github.com/apache/arrow/pull/13264#discussion_r885005532


##########
python/pyarrow/tests/test_io.py:
##########
@@ -1201,6 +1202,46 @@ def test_native_file_TextIOWrapper(tmpdir):
         assert res == data
 
 
+def test_native_file_TextIOWrapper_perf(tmpdir):
+    # ARROW-16272: TextIOWrapper.readline() shouldn't exhaust a large
+    # Arrow input stream.
+    data = b'foo\nquux\n'
+    path = str(tmpdir / 'largefile.txt')
+    with open(path, 'wb') as f:
+        f.write(data * 100_000)
+
+    binary_file = pa.OSFile(path, mode='rb')
+    with TextIOWrapper(binary_file) as f:
+        assert binary_file.tell() == 0
+        nbytes = 20_000
+        lines = f.readlines(nbytes)
+        assert len(lines) == math.ceil(2 * nbytes / len(data))
+        assert nbytes <= binary_file.tell() <= nbytes * 2

Review Comment:
   Hmm, for some reason this test doesn't fail without the fix. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #13264:
URL: https://github.com/apache/arrow/pull/13264#issuecomment-1141187240

   Benchmark runs are scheduled for baseline = e19acbe9c698699de6a24600060ef662316bdc2a and contender = 94cee03de5e1a9a30223c1f9ccba7f9695501489. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/9e91db272191460ebc59a411ec58e614...22c63ed2c9bb4adb93dcb68ad47a6b5f/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4ceb38ba923e460fa40edc7e6d2aae34...6c9921863aa84cedac9c419d8417c9aa/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/4f3940536dc94bbd95758fd16356f098...671f7a3b2efe4557a9dfe7017aa9fc3b/)
   [Skipped :warning: Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/0f89e81a90f3451eaa381d9f8d3f33a0...3fa2fcf9a1b34b3dbe3c935b4cd1a4a8/)
   Buildkite builds:
   [Scheduled] [`94cee03d` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/846)
   [Scheduled] [`94cee03d` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/844)
   [Scheduled] [`94cee03d` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/833)
   [Scheduled] [`e19acbe9` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/845)
   [Scheduled] [`e19acbe9` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/843)
   [Scheduled] [`e19acbe9` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/832)
   [Scheduled] [`e19acbe9` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/847)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13264:
URL: https://github.com/apache/arrow/pull/13264#discussion_r885005532


##########
python/pyarrow/tests/test_io.py:
##########
@@ -1201,6 +1202,46 @@ def test_native_file_TextIOWrapper(tmpdir):
         assert res == data
 
 
+def test_native_file_TextIOWrapper_perf(tmpdir):
+    # ARROW-16272: TextIOWrapper.readline() shouldn't exhaust a large
+    # Arrow input stream.
+    data = b'foo\nquux\n'
+    path = str(tmpdir / 'largefile.txt')
+    with open(path, 'wb') as f:
+        f.write(data * 100_000)
+
+    binary_file = pa.OSFile(path, mode='rb')
+    with TextIOWrapper(binary_file) as f:
+        assert binary_file.tell() == 0
+        nbytes = 20_000
+        lines = f.readlines(nbytes)
+        assert len(lines) == math.ceil(2 * nbytes / len(data))
+        assert nbytes <= binary_file.tell() <= nbytes * 2

Review Comment:
   Hmm, for some reason this test doesn't fail without the fix. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a diff in pull request #13264: ARROW-16272: [Python] Fix NativeFile.read1()

Posted by GitBox <gi...@apache.org>.

pitrou commented on code in PR #13264:
URL: https://github.com/apache/arrow/pull/13264#discussion_r885587724


##########
python/pyarrow/tests/test_fs.py:
##########
@@ -1649,3 +1649,20 @@ def check_copied_files(destination_dir):
     destination_dir5.mkdir()
     copy_files(source_dir, destination_dir5, chunk_size=1, use_threads=False)
     check_copied_files(destination_dir5)
+
+
+@pytest.mark.pandas
+@pytest.mark.s3
+def test_pandas_text_reader_s3(tmpdir):
+    # ARROW-16272: Pandas' read_csv() should not exhaust a S3 input stream
+    # when a small nrows is passed.
+    import pandas as pd
+    from pyarrow.fs import S3FileSystem
+
+    fs = S3FileSystem(anonymous=True, region="us-east-2")
+    f = fs.open_input_file("ursa-qa/nyctaxi/yellow_tripdata_2010-01.csv")
+
+    df = pd.read_csv(f, nrows=2)
+    assert list(df["vendor_id"]) == ["VTS", "DDS"]
+    # Some readahead occurred, but not up to the end of file (which is ~2 GB)
+    assert f.tell() <= 256 * 1024

Review Comment:
   Ah, you're right, it works as well with a local file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org