Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/28 13:54:13 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6501: Python: Use PyArrow buffer

Fokko opened a new pull request, #6501:
URL: https://github.com/apache/iceberg/pull/6501

   I noticed that PyArrow issues two `GetObject` calls per Avro file, while one should be sufficient (150kb):
   
   ```
   2022-12-27T08:45:32.822 [206 Partial Content] s3.GetObject minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json 172.18.0.3        1.142ms      ↑ 169 B ↓ 14 KiB
   2022-12-27T08:45:32.913 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1        867µs       ↑ 153 B ↓ 412 B
   2022-12-27T08:45:32.925 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1        1.626ms      ↑ 159 B ↓ 4.6 KiB
   2022-12-27T08:45:32.973 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1        1.216ms      ↑ 153 B ↓ 413 B
   2022-12-27T08:45:32.989 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1        3.719ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.020 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1        3.904ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.042 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1        1.903ms      ↑ 159 B ↓ 1.7 KiB
   2022-12-27T08:45:33.104 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1        1.232ms      ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.113 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1        683µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.120 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1        975µs       ↑ 159 B ↓ 7.0 KiB
   2022-12-27T08:45:33.141 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1        383µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.144 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1        774µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.148 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1        833µs       ↑ 159 B ↓ 7.4 KiB
   2022-12-27T08:45:33.170 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1        432µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.173 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1        1.208ms      ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.178 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1        814µs       ↑ 159 B ↓ 8.2 KiB
   2022-12-27T08:45:33.202 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1        427µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.205 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1        671µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.209 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1        502µs       ↑ 159 B ↓ 7.9 KiB
   2022-12-27T08:45:33.233 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1        616µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.236 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1        955µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.240 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1        934µs       ↑ 159 B ↓ 7.4 KiB
   2022-12-27T08:45:33.262 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1        308µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.265 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1        641µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.269 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1        831µs       ↑ 159 B ↓ 7.6 KiB
   2022-12-27T08:45:33.295 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1        625µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.298 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1        828µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.302 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1        897µs       ↑ 159 B ↓ 7.8 KiB
   2022-12-27T08:45:33.324 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1        474µs       ↑ 153 B ↓ 413 B
   2022-12-27T08:45:33.326 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1        644µs       ↑ 159 B ↓ 8.5 KiB
   2022-12-27T08:45:33.330 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1        904µs       ↑ 159 B ↓ 7.1 KiB
   ```
   
   Instead, I've changed it to use PyArrow's `open_input_stream`, which is intended for sequential reading, whereas `open_input_file` is intended for random access.
   
   After this change, we can see that each file is requested just once:
   ```
   2022-12-28T13:31:46.815 [206 Partial Content] s3.GetObject minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json 172.18.0.3        983µs       ↑ 169 B ↓ 14 KiB
   2022-12-28T13:31:46.912 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1        3.509ms      ↑ 153 B ↓ 412 B
   2022-12-28T13:31:46.923 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1        2.779ms      ↑ 159 B ↓ 4.6 KiB
   2022-12-28T13:31:46.967 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1        2.011ms      ↑ 153 B ↓ 413 B
   2022-12-28T13:31:46.988 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1        8.244ms      ↑ 159 B ↓ 18 KiB
   2022-12-28T13:31:47.107 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1        404µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.110 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1        830µs       ↑ 159 B ↓ 15 KiB
   2022-12-28T13:31:47.132 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1        286µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.135 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1        760µs       ↑ 159 B ↓ 15 KiB
   2022-12-28T13:31:47.157 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1        303µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.159 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1        821µs       ↑ 159 B ↓ 16 KiB
   2022-12-28T13:31:47.187 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1        323µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.191 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1        863µs       ↑ 159 B ↓ 16 KiB
   2022-12-28T13:31:47.213 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1        602µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.216 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1        839µs       ↑ 159 B ↓ 15 KiB
   2022-12-28T13:31:47.238 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1        293µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.242 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1        898µs       ↑ 159 B ↓ 16 KiB
   2022-12-28T13:31:47.267 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1        316µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.270 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1        655µs       ↑ 159 B ↓ 16 KiB
   2022-12-28T13:31:47.295 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1        315µs       ↑ 153 B ↓ 413 B
   2022-12-28T13:31:47.298 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1        999µs       ↑ 159 B ↓ 15 KiB
   ```
   
   This also applies to the manifest files. I think this makes more sense, since we always want to read the whole file. The only caveat is that the 1 MB default buffer size is a bit arbitrary.
   
   I've verified that s3fs also works as expected:
   ```
   2022-12-28T13:52:52.647 [206 Partial Content] s3.GetObject minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json 172.18.0.3        991µs       ↑ 169 B ↓ 14 KiB
   2022-12-28T13:53:03.335 [206 Partial Content] s3.GetObject minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json 172.18.0.3        954µs       ↑ 169 B ↓ 14 KiB
   2022-12-28T13:53:03.845 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1        1.357ms      ↑ 138 B ↓ 412 B
   2022-12-28T13:53:03.857 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro 172.18.0.1        1.355ms      ↑ 153 B ↓ 4.6 KiB
   2022-12-28T13:53:03.864 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1        422µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.868 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro 172.18.0.1        707µs       ↑ 153 B ↓ 18 KiB
   2022-12-28T13:53:03.897 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1        333µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.901 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro 172.18.0.1        793µs       ↑ 153 B ↓ 15 KiB
   2022-12-28T13:53:03.921 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1        371µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.924 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro 172.18.0.1        794µs       ↑ 153 B ↓ 15 KiB
   2022-12-28T13:53:03.945 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1        332µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.948 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro 172.18.0.1        570µs       ↑ 153 B ↓ 16 KiB
   2022-12-28T13:53:03.971 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1        496µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:03.974 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro 172.18.0.1        1.266ms      ↑ 153 B ↓ 16 KiB
   2022-12-28T13:53:03.998 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1        389µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:04.001 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro 172.18.0.1        717µs       ↑ 153 B ↓ 15 KiB
   2022-12-28T13:53:04.023 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1        306µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:04.026 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro 172.18.0.1        920µs       ↑ 153 B ↓ 16 KiB
   2022-12-28T13:53:04.049 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1        397µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:04.070 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro 172.18.0.1        708µs       ↑ 153 B ↓ 16 KiB
   2022-12-28T13:53:04.092 [200 OK] s3.HeadObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1        289µs       ↑ 138 B ↓ 413 B
   2022-12-28T13:53:04.094 [206 Partial Content] s3.GetObject 127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro 172.18.0.1        733µs       ↑ 153 B ↓ 15 KiB
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #6501:
URL: https://github.com/apache/iceberg/pull/6501




[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6501:
URL: https://github.com/apache/iceberg/pull/6501#discussion_r1058668353


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -230,6 +240,14 @@ def _get_fs(self, scheme: str) -> FileSystem:
         else:
             raise ValueError(f"Unrecognized filesystem type in URI: {scheme}")
 
+    def _get_file(self, location: str) -> PyArrowFile:
+        scheme, path = self.parse_location(location)
+        fs = self._get_fs(scheme)
+
+        buffer_size = self.properties.get(BUFFER_SIZE)

Review Comment:
   Just a nit: I feel it's a bit more readable if we fall back to the 1 MB default buffer size here and then pass that through. Then the buffer size is a required integer for `PyArrowFile` and the constructor logic is simpler, IMO.
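
The suggestion above could look roughly like the following sketch. The value of the `BUFFER_SIZE` key, the `ONE_MEGABYTE` name, the `resolve_buffer_size` helper, and the `FileHandle` class are all assumptions made for illustration; the thread only mentions a `BUFFER_SIZE` property and a 1 MB default, not how they are defined.

```python
from typing import Mapping, Optional

# Hypothetical key and default: the PR names a BUFFER_SIZE constant and a
# 1 MB default, but this thread does not show their actual definitions.
BUFFER_SIZE = "buffer-size"
ONE_MEGABYTE = 1024 * 1024


def resolve_buffer_size(properties: Mapping[str, str]) -> int:
    """Return the configured buffer size in bytes, or the 1 MB default."""
    raw: Optional[str] = properties.get(BUFFER_SIZE)
    return int(raw) if raw is not None else ONE_MEGABYTE


class FileHandle:
    """Stand-in for PyArrowFile: buffer_size is always a plain int here."""

    def __init__(self, path: str, buffer_size: int) -> None:
        self.path = path
        self.buffer_size = buffer_size


# The default is resolved once at the call site, so the constructor never
# has to deal with a missing value.
handle = FileHandle("s3://bucket/manifest.avro", resolve_buffer_size({}))
```

Resolving the default in one place keeps the `Optional` handling out of the file class, which seems to be the readability gain the comment is after.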





[GitHub] [iceberg] Fokko commented on a diff in pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6501:
URL: https://github.com/apache/iceberg/pull/6501#discussion_r1060878045


##########
python/pyiceberg/io/fsspec.py:
##########
@@ -150,9 +150,12 @@ def exists(self) -> bool:
         """Checks whether the location exists"""
         return self._fs.lexists(self.location)
 
-    def open(self) -> InputStream:
+    def open(self, seekable: bool = False) -> InputStream:

Review Comment:
   Sure, done!



##########
python/pyiceberg/avro/file.py:
##########
@@ -132,7 +131,7 @@ def __enter__(self) -> AvroFile:
         Returns:
             A generator returning the AvroStructs
         """
-        self.input_stream = BufferedReader(self.input_file.open())  # type: ignore
+        self.input_stream = self.input_file.open()

Review Comment:
   Done! 👍🏻 





[GitHub] [iceberg] rdblue commented on a diff in pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6501:
URL: https://github.com/apache/iceberg/pull/6501#discussion_r1060858929


##########
python/pyiceberg/avro/file.py:
##########
@@ -132,7 +131,7 @@ def __enter__(self) -> AvroFile:
         Returns:
             A generator returning the AvroStructs
         """
-        self.input_stream = BufferedReader(self.input_file.open())  # type: ignore
+        self.input_stream = self.input_file.open()

Review Comment:
   This should pass `seekable=False`.





[GitHub] [iceberg] Fokko commented on a diff in pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6501:
URL: https://github.com/apache/iceberg/pull/6501#discussion_r1058364492


##########
python/pyiceberg/avro/file.py:
##########
@@ -183,7 +182,6 @@ def __next__(self) -> Record:
         raise StopIteration
 
     def _read_header(self) -> AvroFileHeader:
-        self.input_stream.seek(0, SEEK_SET)

Review Comment:
   Seek is not allowed on a stream, but it is okay to leave this out, since we read the header directly after opening the file and we call this only once (after that it is cached).
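
To illustrate why dropping the `seek` is safe, here is a minimal sketch using only the standard library. `ForwardOnlyStream` and `read_magic` are hypothetical names and are not part of PyIceberg; the only fact relied on is that an Avro object container file starts with the four magic bytes `Obj\x01`.

```python
import io

AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a non-seekable input stream."""

    def __init__(self, data: bytes) -> None:
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        return self._inner.readinto(b)


def read_magic(stream) -> bytes:
    """Read the magic bytes; only valid if nothing has been consumed yet."""
    magic = stream.read(len(AVRO_MAGIC))
    if magic != AVRO_MAGIC:
        raise ValueError(f"Not an Avro file, got magic: {magic!r}")
    return magic


# A freshly opened stream is already at offset 0, so no rewind is needed.
stream = ForwardOnlyStream(AVRO_MAGIC + b"...header and blocks...")
magic = read_magic(stream)
```

Calling `stream.seek(0)` on this stand-in raises `io.UnsupportedOperation`, which mirrors the constraint described in the comment: the header read has to be the first read after opening.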





[GitHub] [iceberg] Fokko commented on a diff in pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6501:
URL: https://github.com/apache/iceberg/pull/6501#discussion_r1059047328


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -230,6 +240,14 @@ def _get_fs(self, scheme: str) -> FileSystem:
         else:
             raise ValueError(f"Unrecognized filesystem type in URI: {scheme}")
 
+    def _get_file(self, location: str) -> PyArrowFile:
+        scheme, path = self.parse_location(location)
+        fs = self._get_fs(scheme)
+
+        buffer_size = self.properties.get(BUFFER_SIZE)

Review Comment:
   Makes a lot of sense, let me update that! 👍🏻 





[GitHub] [iceberg] rdblue commented on a diff in pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6501:
URL: https://github.com/apache/iceberg/pull/6501#discussion_r1059492469


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -151,7 +160,7 @@ def open(self) -> InputStream:
                 an AWS error code 15
         """
         try:
-            input_file = self._filesystem.open_input_file(self._path)
+            input_file = self._filesystem.open_input_stream(self._path, buffer_size=self._buffer_size)

Review Comment:
   I think that we need to have a flag to signal when `seek` will not be called. The protocol includes seek and we need to support it for Parquet files. However, we can hint to Arrow that we don't need to seek. How about adding `seekable=True` to `open` and allowing the caller to use this by setting `seekable=False`?
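
A rough sketch of the proposed flag follows. It is not the code from the PR: `StubFileSystem` is an in-memory stand-in that only borrows the two method names visible in the diff (`open_input_file` and `open_input_stream`), and `InputFile` is a simplified, hypothetical version of the class under review.

```python
import io
from typing import BinaryIO, Dict, List


class StubFileSystem:
    """In-memory stand-in exposing the two method names used in the diff."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files
        self.calls: List[str] = []

    def open_input_file(self, path: str) -> BinaryIO:
        self.calls.append("open_input_file")
        return io.BytesIO(self._files[path])

    def open_input_stream(self, path: str, buffer_size: int) -> BinaryIO:
        # Note: this stub still returns a seekable object; it only records
        # which method was chosen.
        self.calls.append("open_input_stream")
        return io.BufferedReader(io.BytesIO(self._files[path]), buffer_size=buffer_size)


class InputFile:
    """Simplified, hypothetical input file with a `seekable` hint on open()."""

    def __init__(self, filesystem, path: str, buffer_size: int = 1024 * 1024) -> None:
        self._filesystem = filesystem
        self._path = path
        self._buffer_size = buffer_size

    def open(self, seekable: bool = True) -> BinaryIO:
        # Random access stays the default because the protocol includes seek
        # (needed for Parquet); sequential readers can opt out explicitly.
        if seekable:
            return self._filesystem.open_input_file(self._path)
        return self._filesystem.open_input_stream(self._path, buffer_size=self._buffer_size)


fs = StubFileSystem({"manifest.avro": b"Obj\x01payload"})
input_file = InputFile(fs, "manifest.avro")
random_access = input_file.open()                # default: seekable handle
sequential = input_file.open(seekable=False)     # hint: read front to back
```

With this shape, existing callers keep their behavior, and only readers that consume a file sequentially, such as the Avro reader discussed in this thread, need to pass `seekable=False`.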





[GitHub] [iceberg] Fokko commented on a diff in pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6501:
URL: https://github.com/apache/iceberg/pull/6501#discussion_r1059512957


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -151,7 +160,7 @@ def open(self) -> InputStream:
                 an AWS error code 15
         """
         try:
-            input_file = self._filesystem.open_input_file(self._path)
+            input_file = self._filesystem.open_input_stream(self._path, buffer_size=self._buffer_size)

Review Comment:
   That's a great point. Since it is part of the protocol, we should be able to use `seek`. I've added the `seekable` option to the `.open` method.





[GitHub] [iceberg] rdblue commented on a diff in pull request #6501: Python: Use PyArrow buffer

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6501:
URL: https://github.com/apache/iceberg/pull/6501#discussion_r1060838341


##########
python/pyiceberg/io/fsspec.py:
##########
@@ -150,9 +150,12 @@ def exists(self) -> bool:
         """Checks whether the location exists"""
         return self._fs.lexists(self.location)
 
-    def open(self) -> InputStream:
+    def open(self, seekable: bool = False) -> InputStream:

Review Comment:
   I think we should default `seekable` to `True`.


