Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/16 20:13:41 UTC

[GitHub] [arrow] lidavidm opened a new pull request #10074: ARROW-12428: [Python] Expose pre_buffer in pyarrow.parquet

lidavidm opened a new pull request #10074:
URL: https://github.com/apache/arrow/pull/10074


   This allows using the new option without going through the datasets API, for example to read a single file. A simple benchmark is in the JIRA. This helps close the gap between PyArrow and fsspec read performance, since fsspec performs readahead by default.
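   As a quick sketch of what the change enables (assuming pyarrow 4.0 or later, where this PR landed; the file path and data are illustrative):

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   # Write a small Parquet file so there is something to read back.
   path = os.path.join(tempfile.mkdtemp(), "example.parquet")
   pq.write_table(pa.table({"x": [1, 2, 3]}), path)

   # With this PR, pre_buffer can be passed directly to read_table,
   # instead of only being reachable through the datasets API.
   table = pq.read_table(path, pre_buffer=True)
   print(table.num_rows)
   ```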


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10074: ARROW-12428: [C++][Python] Expose pre_buffer in pyarrow.parquet

jorisvandenbossche commented on a change in pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#discussion_r615794180



##########
File path: python/pyarrow/parquet.py
##########
@@ -1674,6 +1686,12 @@ def pieces(self):
     keys and only a hive-style directory structure is supported. When
     setting `use_legacy_dataset` to False, also within-file level filtering
     and different partitioning schemes are supported.
+pre_buffer : bool, default True
+    Coalesce and issue file reads in parallel to improve performance on

Review comment:
       > I guess for someone who wants Arrow to be truly single-threaded, this may be confusing, so I'll see if I can reword this.
   
   So for such a case, you would need to set `pre_buffer=False` to have it truly single-threaded?
   I am thinking about dask, where `use_threads` is set to False, but I am not sure whether it would be a problem for them to still do I/O in parallel.







[GitHub] [arrow] jorisvandenbossche closed pull request #10074: ARROW-12428: [Python] Expose pre_buffer in pyarrow.parquet

jorisvandenbossche closed pull request #10074:
URL: https://github.com/apache/arrow/pull/10074


   





[GitHub] [arrow] github-actions[bot] commented on pull request #10074: ARROW-12428: [Python] Expose pre_buffer in pyarrow.parquet

github-actions[bot] commented on pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#issuecomment-821536190


   https://issues.apache.org/jira/browse/ARROW-12428





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10074: ARROW-12428: [Python] Expose pre_buffer in pyarrow.parquet

jorisvandenbossche commented on a change in pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#discussion_r625977041



##########
File path: python/pyarrow/parquet.py
##########
@@ -1212,13 +1217,20 @@ class ParquetDataset:
     new Arrow Dataset API). Among other things, this allows to pass
     `filters` for all columns and not only the partition keys, enables
     different partitioning schemes, etc.
+pre_buffer : bool, default True
+    Coalesce and issue file reads in parallel to improve performance on
+    high-latency filesystems (e.g. S3). If True, Arrow will use a
+    background I/O thread pool. This option is only supported for
+    use_legacy_dataset=True. If using a filesystem layer that itself

Review comment:
       ```suggestion
       use_legacy_dataset=False. If using a filesystem layer that itself
   ```
   
   ?







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10074: ARROW-12428: [C++][Python] Expose pre_buffer in pyarrow.parquet

jorisvandenbossche commented on a change in pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#discussion_r615717077



##########
File path: python/pyarrow/parquet.py
##########
@@ -1674,6 +1686,12 @@ def pieces(self):
     keys and only a hive-style directory structure is supported. When
     setting `use_legacy_dataset` to False, also within-file level filtering
     and different partitioning schemes are supported.
+pre_buffer : bool, default True
+    Coalesce and issue file reads in parallel to improve performance on

Review comment:
       Is there any impact/conflict with specifying `use_threads=False` ?

##########
File path: python/pyarrow/parquet.py
##########
@@ -1244,7 +1255,7 @@ def __init__(self, path_or_paths, filesystem=None, schema=None,
                  metadata=None, split_row_groups=False, validate_schema=True,
                  filters=None, metadata_nthreads=1, read_dictionary=None,
                  memory_map=False, buffer_size=0, partitioning="hive",
-                 use_legacy_dataset=True):
+                 use_legacy_dataset=True, pre_buffer=True):

Review comment:
       Should we maybe raise an error here if the user sets it to False? (as we simply ignore it)







[GitHub] [arrow] lidavidm commented on pull request #10074: ARROW-12428: [Python] Expose pre_buffer in pyarrow.parquet

lidavidm commented on pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#issuecomment-832112286


   I've rebased this, and tests pass. (Sorry for the long delay here.)





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10074: ARROW-12428: [Python] Expose pre_buffer in pyarrow.parquet

jorisvandenbossche commented on a change in pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#discussion_r625980584



##########
File path: python/pyarrow/parquet.py
##########
@@ -1676,6 +1690,13 @@ def pieces(self):
     keys and only a hive-style directory structure is supported. When
     setting `use_legacy_dataset` to False, also within-file level filtering
     and different partitioning schemes are supported.
+pre_buffer : bool, default True
+    Coalesce and issue file reads in parallel to improve performance on
+    high-latency filesystems (e.g. S3). If True, Arrow will use a
+    background I/O thread pool. This option is only supported for
+    use_legacy_dataset=True. If using a filesystem layer that itself

Review comment:
       ```suggestion
       use_legacy_dataset=False. If using a filesystem layer that itself
   ```
   
   here as well







[GitHub] [arrow] lidavidm commented on pull request #10074: ARROW-12428: [Python] Expose pre_buffer in pyarrow.parquet

lidavidm commented on pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#issuecomment-821548915


   The pre-buffer/Parquet logic has a bug with empty files, which I will need to fix.





[GitHub] [arrow] jorisvandenbossche commented on pull request #10074: ARROW-12428: [C++][Python] Expose pre_buffer in pyarrow.parquet

jorisvandenbossche commented on pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#issuecomment-822351950


   Thanks @lidavidm! (I know it goes against the point of an RC, but it would be nice if we could still get this into 4.0...)





[GitHub] [arrow] lidavidm commented on a change in pull request #10074: ARROW-12428: [C++][Python] Expose pre_buffer in pyarrow.parquet

lidavidm commented on a change in pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#discussion_r615814433



##########
File path: python/pyarrow/parquet.py
##########
@@ -1674,6 +1686,12 @@ def pieces(self):
     keys and only a hive-style directory structure is supported. When
     setting `use_legacy_dataset` to False, also within-file level filtering
     and different partitioning schemes are supported.
+pre_buffer : bool, default True
+    Coalesce and issue file reads in parallel to improve performance on

Review comment:
       Yes, you'd need to set both to False to get true single-threading. I added a short note about the use of threads to the docstring.







[GitHub] [arrow] lidavidm commented on a change in pull request #10074: ARROW-12428: [C++][Python] Expose pre_buffer in pyarrow.parquet

lidavidm commented on a change in pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#discussion_r615782839



##########
File path: python/pyarrow/parquet.py
##########
@@ -1244,7 +1255,7 @@ def __init__(self, path_or_paths, filesystem=None, schema=None,
                  metadata=None, split_row_groups=False, validate_schema=True,
                  filters=None, metadata_nthreads=1, read_dictionary=None,
                  memory_map=False, buffer_size=0, partitioning="hive",
-                 use_legacy_dataset=True):
+                 use_legacy_dataset=True, pre_buffer=True):

Review comment:
    I don't think so, as False is equivalent to the "not supported" behavior. I could set the default value to None and let each individual ParquetDataset implementation handle it (and error on True in the legacy case).
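   The "default to None and let each implementation decide" idea could look roughly like this; the function name and return values are hypothetical illustrations of the pattern, not the actual pyarrow code:

   ```python
   # Hypothetical sketch: pre_buffer defaults to None ("unset"), and
   # each code path decides what to do with it.
   def make_dataset(path, use_legacy_dataset=True, pre_buffer=None):
       if use_legacy_dataset:
           # Legacy path: pre-buffering is unsupported, so only an
           # explicit True is an error; None (unset) and False are fine.
           if pre_buffer:
               raise ValueError(
                   "pre_buffer is not supported with use_legacy_dataset=True")
           return ("legacy", path)
       # New (datasets-based) path: apply the documented default of True
       # when the user left the option unset.
       return ("v2", path, True if pre_buffer is None else pre_buffer)
   ```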







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10074: ARROW-12428: [C++][Python] Expose pre_buffer in pyarrow.parquet

jorisvandenbossche commented on a change in pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#discussion_r615793101



##########
File path: python/pyarrow/parquet.py
##########
@@ -1244,7 +1255,7 @@ def __init__(self, path_or_paths, filesystem=None, schema=None,
                  metadata=None, split_row_groups=False, validate_schema=True,
                  filters=None, metadata_nthreads=1, read_dictionary=None,
                  memory_map=False, buffer_size=0, partitioning="hive",
-                 use_legacy_dataset=True):
+                 use_legacy_dataset=True, pre_buffer=True):

Review comment:
    Ah, yes, since `pre_buffer=False` means not doing anything, it's indeed fine to "ignore" it.
   As you say, we could do some extra work to make it error on `pre_buffer=True` (for the legacy implementation), but since it's clearly documented as being only for the new version, I think it's fine to leave it as is.







[GitHub] [arrow] lidavidm commented on a change in pull request #10074: ARROW-12428: [C++][Python] Expose pre_buffer in pyarrow.parquet

lidavidm commented on a change in pull request #10074:
URL: https://github.com/apache/arrow/pull/10074#discussion_r615781238



##########
File path: python/pyarrow/parquet.py
##########
@@ -1674,6 +1686,12 @@ def pieces(self):
     keys and only a hive-style directory structure is supported. When
     setting `use_legacy_dataset` to False, also within-file level filtering
     and different partitioning schemes are supported.
+pre_buffer : bool, default True
+    Coalesce and issue file reads in parallel to improve performance on

Review comment:
    There's no conflict: `use_threads` controls whether decoding work is done in parallel on the CPU thread pool, while `pre_buffer` controls whether I/O is done in parallel on the I/O thread pool. I guess for someone who wants Arrow to be truly single-threaded this may be confusing, so I'll see if I can reword it.
   
   Also, #9620 has some refactoring to allow pre-buffering without requiring use of the I/O thread pool.
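   Under that explanation, a truly single-threaded read disables both pools. A hedged sketch (assuming pyarrow 4.0+; the temp file is illustrative):

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   path = os.path.join(tempfile.mkdtemp(), "single.parquet")
   pq.write_table(pa.table({"x": list(range(10))}), path)

   # use_threads=False keeps decoding off the CPU thread pool;
   # pre_buffer=False additionally keeps I/O off the background
   # I/O thread pool, giving a truly single-threaded read.
   table = pq.read_table(path, use_threads=False, pre_buffer=False)
   print(table.num_rows)
   ```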



