Posted to github@arrow.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/02/01 13:57:58 UTC

[GitHub] [arrow] Fokko opened a new issue, #33972: [Python] Remove redundant S3 call

Fokko opened a new issue, #33972:
URL: https://github.com/apache/arrow/issues/33972

   ### Describe the enhancement requested
   
   Hey all,
   
   First of all, thanks everyone for working on PyArrow! Really loving it so far. I'm currently working on PyIceberg, which loads an Iceberg table into PyArrow. For those unfamiliar with Apache Iceberg: it is a table format focused on huge tables (petabyte scale). PyIceberg makes your life easier by taking care of the statistics that boost performance, and of all the schema maintenance. For example, if you change the partitioning of an Iceberg table, you don't have to rewrite all the files right away; you can do it incrementally.
   
   Now I'm running into some performance issues, and I noticed that PyArrow is making more requests to S3 than required. I went down the rabbit hole and was able to narrow it down to:
   
   ```python
   import pyarrow.dataset as ds
   from pyarrow.fs import S3FileSystem
   ONE_MEGABYTE = 1024 * 1024
   
   client_kwargs = {
       "endpoint_override": "http://localhost:9000",
       "access_key": "admin",
       "secret_key": "password",
   }
   parquet_format = ds.ParquetFileFormat(
       use_buffered_stream=True,
       pre_buffer=True,
       buffer_size=8 * ONE_MEGABYTE
   )
   fs = S3FileSystem(**client_kwargs)
   with fs.open_input_file("warehouse/wh/nyc/taxis/data/tpep_pickup_datetime_day=2022-04-30/00003-4-89e0ad58-fb77-4512-8679-6f26d8d6ef28-00033.parquet") as fout:
       # First get the fragment
       fragment = parquet_format.make_fragment(fout, None)
       print(f"Schema: {fragment.physical_schema}")
       arrow_table = ds.Scanner.from_fragment(
           fragment=fragment
       ).to_table()
   ```
   
   I need the schema first because a column may have been renamed while the file hasn't been rewritten against the latest schema yet. The same goes for filtering: if you rename a column and the file still contains the old name, you want to leverage PyArrow's predicate pushdown so the data isn't loaded into memory at all.
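   
   For illustration, here is a rough sketch of what that looks like on top of the snippet above (the column name and threshold are made up; in practice the physical name is resolved from `fragment.physical_schema`):
   
   ```python
   # Hypothetical: the table schema now calls the column "fare", but this older
   # file still stores it under its original name "fare_amount". Projecting and
   # filtering on the physical name lets PyArrow push the predicate down into
   # the Parquet row groups instead of materializing the whole file.
   physical_name = "fare_amount"  # looked up in fragment.physical_schema
   
   arrow_table = ds.Scanner.from_fragment(
       fragment=fragment,
       columns=[physical_name],
       filter=ds.field(physical_name) > 10.0,
   ).to_table()
   ```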
   
   Looking at the MinIO logs, I can see that it makes four requests:
   
   1. A HEAD request to check whether the file exists
   2. A read of the last 64 KB of the Parquet file, to get the schema
   3. Another read of the last 64 KB of the Parquet file, to get the schema again
   4. A nice beefy 1978578 KB request to fetch the data
   
   Looking at the tests, we shouldn't fetch the footer twice:
   
   ```python
   # with default discovery, no metadata loaded
   with assert_opens([fragment.path]):
       fragment.ensure_complete_metadata()
   assert fragment.row_groups == [0, 1]
   
   # second time -> use cached / no file IO
   with assert_opens([]):
       fragment.ensure_complete_metadata()
   ```
   
   Any thoughts or advice? I've gone through the code a bit already, but my C++ is a bit rusty.
   
   ### Component(s)
   
   Python




[GitHub] [arrow] jorisvandenbossche commented on issue #33972: [Python] Remove redundant S3 call

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1416089793

   > In the short term I think you should use [`pyarrow.parquet.ParquetFile`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html).
   
   The simple `ParquetFile` interface for single files doesn't support filtering row groups with a `filter`, so that would be a step back from using `pq.read_table`? 
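   
   To make that concrete (a rough sketch, not PyIceberg's actual code; the path, column name, and threshold are made up, and `fs` is the `S3FileSystem` from the snippet in the issue):
   
   ```python
   import pyarrow.parquet as pq
   
   path = "warehouse/wh/nyc/taxis/data/some-file.parquet"  # hypothetical
   
   # read_table pushes the predicate down to the Parquet row groups for you:
   table = pq.read_table(path, filesystem=fs, filters=[("fare_amount", ">", 10.0)])
   
   # The closest equivalent with ParquetFile is pruning row groups by hand from
   # the footer statistics before reading (flat schema assumed, so the Arrow and
   # Parquet column indexes line up, and min/max statistics may be absent):
   with fs.open_input_file(path) as f:
       pf = pq.ParquetFile(f)
       col = pf.schema_arrow.get_field_index("fare_amount")
       keep = [
           i for i in range(pf.metadata.num_row_groups)
           if (stats := pf.metadata.row_group(i).column(col).statistics) is None
           or stats.max > 10.0
       ]
       table = pf.read_row_groups(keep, columns=["fare_amount"])
   ```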




[GitHub] [arrow] westonpace commented on issue #33972: [Python] Remove redundant S3 call

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1414589158

   > We need to make projections, and we need to have the schema before loading the data. For example, if you have an Iceberg table, and you do a rename on a column, then you don't want to rewrite your multi-petabyte table. Iceberg uses IDs to identify the column, and if you filter or project on that column, it will select the old column name in the files that are written before the rename.
   
   Ok, that helps.  In the short term I think you should use `pyarrow.parquet.ParquetFile`.  That's a direct binding to the parquet-cpp libs and won't use any of the dataset stuff.  We don't have a format-agnostic concept of "read the metadata but cache it for use later so you don't have to read it again".
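   
   A minimal sketch of that route, assuming the `S3FileSystem` (`fs`), the file path, and the buffer settings from the snippet in the issue (not the exact code PyIceberg would use):
   
   ```python
   import pyarrow.parquet as pq
   
   ONE_MEGABYTE = 1024 * 1024
   
   with fs.open_input_file(path) as f:
       # A single ParquetFile instance reads the footer once and reuses it for
       # both the schema lookup and the actual read.
       pf = pq.ParquetFile(f, pre_buffer=True, buffer_size=8 * ONE_MEGABYTE)
       print(f"Schema: {pf.schema_arrow}")
       arrow_table = pf.read()
   ```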
   
   Longer term, you can probably just specify a [custom evolution strategy](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/dataset.h#L254) (using parquet column IDs) and let pyarrow handle the expression conversion for you.  Sadly, this feature is not yet ready (I'm working on it when I can. :crossed_fingers: for 12.0.0)




[GitHub] [arrow] westonpace commented on issue #33972: [Python] Remove redundant S3 call

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1414504116

   You mentioned in the other issue that you want to reuse the connection. Could you clarify your larger goal a little bit? Or perhaps do you have some example code somewhere showing how you're planning to use this?
   
   For example, are you bringing in an S3 connection from outside of pyarrow, or do you start with a path? Are you reading from the same dataset multiple times, or is this a one-shot operation (or does the list of files change from call to call)?




[GitHub] [arrow] westonpace commented on issue #33972: [Python] Remove redundant S3 call

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1413788554

   The datasets feature went through considerable change a while back when it moved from being a parquet-only feature to being format-agnostic. It looks like this connection came loose in the conversion. If you just want to read one file, the approach is normally something more like:
   
   ```python
   import pyarrow.parquet as pq
   pq.read_table(path)
   ```
   
   If you're looking to read a collection of files you would normally use:
   
   ```python
   import pyarrow.dataset as ds
   ds.dataset([paths]).to_table()
   ```
   
   I suspect (though am not entirely certain) both of the above paths will only read the metadata once.
   
   However, your usage is legitimate, and it even affects the normal datasets path when you scan the dataset multiple times (because we should be caching the metadata on the first scan and reusing it on the second). So I would consider this a bug.
   
   I don't know for sure but my guess is the problem is [here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L364).  The fragment is opening a reader and should pass the metadata to the reader, if already populated.




[GitHub] [arrow] Fokko commented on issue #33972: [Python] Remove redundant S3 call

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1416165705

   > We don't have a format-agnostic concept of "read the metadata but cache it for use later so you don't have to read it again".
   
   That's not a problem, as long as it stays cached in the fragment. The read of the last bytes of the file to get the footer is rather expensive (in terms of time), so we would love to eliminate the duplicate call. I went through the code and was able to pass the metadata down from the fragment to the reader: https://github.com/apache/arrow/pull/34015
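   
   For reference, this is the behaviour the PR is after (continuing the snippet from the issue; `ensure_complete_metadata` is the existing fragment API, nothing new):
   
   ```python
   fragment = parquet_format.make_fragment(fout, None)
   fragment.ensure_complete_metadata()           # one ranged read of the footer
   print(f"Schema: {fragment.physical_schema}")  # served from the cached footer
   
   # With the metadata passed down to the reader, this scan should no longer
   # trigger a second footer read against S3.
   arrow_table = ds.Scanner.from_fragment(fragment=fragment).to_table()
   ```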
   
   > The simple ParquetFile interface for single files doesn't support filtering row groups with a filter, so that would be a step back from using `pq.read_table`?
   
   I agree, we need to have predicate pushdown 👍🏻 
   
   > Longer term, you can probably just specify a [custom evolution strategy](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/dataset.h#L254) (using parquet column IDs) and let pyarrow handle the expression conversion for you. Sadly, this feature is not yet ready (I'm working on it when I can. 🤞 for 12.0.0)
   
   Let me know when something is ready, happy to test 👍🏻 
   




[GitHub] [arrow] westonpace closed issue #33972: [Python] Remove redundant S3 call

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #33972: [Python] Remove redundant S3 call
URL: https://github.com/apache/arrow/issues/33972




[GitHub] [arrow] Fokko commented on issue #33972: [Python] Remove redundant S3 call

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1414514062

   @westonpace sure thing!
   
   We need to make projections, and we need to have the schema before loading the data. For example, if you have an Iceberg table and you rename a column, you don't want to rewrite your multi-petabyte table. Iceberg uses IDs to identify columns, so if you filter or project on a renamed column, it will select the old column name in files that were written before the rename.
   
   The current code is over here: https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L486-L522
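   
   For the curious, a rough sketch of that ID-based lookup (not the PyIceberg implementation; the helper name is made up). Parquet files written by Iceberg carry the field id in the Arrow field metadata under the `PARQUET:field_id` key, so the current column can be resolved to whatever name is stored in a particular file:
   
   ```python
   PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"
   
   def physical_name_for(field_id: int, physical_schema) -> str:
       """Find the column name used in this particular file for an Iceberg field id."""
       for field in physical_schema:
           metadata = field.metadata or {}
           if int(metadata.get(PARQUET_FIELD_ID_KEY, -1)) == field_id:
               return field.name
       raise KeyError(f"No column with field id {field_id} in this file")
   ```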
   

