You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/17 08:35:03 UTC

[GitHub] [arrow] kshitij12345 opened a new pull request, #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

kshitij12345 opened a new pull request, #13629:
URL: https://github.com/apache/arrow/pull/13629

   Add `filesystem` support to `pq.read_metadata` and `pq.read_schema`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r945076750


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -533,6 +536,44 @@ def test_metadata_exceeds_message_size():
     metadata = pq.read_metadata(pa.BufferReader(buf))
 
 
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    fname = "data.parquet"
+    file_path = 'file:///' + os.path.join(str(tmpdir), fname)
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / fname)
+    schema = table.schema
+
+    assert pq.read_metadata(file_path).equals(metadata)
+    assert pq.read_metadata(
+        fname, filesystem=f'file:///{tmpdir}').equals(metadata)
+
+    assert pq.read_schema(file_path).equals(schema)
+    assert pq.read_schema(fname, filesystem=f'file:///{tmpdir}').equals(schema)
+
+    with util.change_cwd(tmpdir):
+        # Pass `filesystem` arg
+        assert pq.read_metadata(
+            fname, filesystem=LocalFileSystem()).equals(metadata)
+        assert pq.read_metadata(
+            fname, filesystem=LocalFileSystem.get_instance()).equals(metadata)

Review Comment:
   Thanks! I was away from keyboard for the week. Will fix it!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1190572971

   I think some of the failures are related:
   
   ```python
   =================================== FAILURES ===================================
   _______________________ test_metadata_schema_filesystem ________________________
   tmpdir = local('/tmp/pytest-of-root/pytest-0/test_metadata_schema_filesyste0')
       def test_metadata_schema_filesystem(tmpdir):
           table = pa.table({"a": [1, 2, 3]})
           # URI writing to local file.
           fname = "data.parquet"
           file_path = 'file:///' + os.path.join(str(tmpdir), fname)
           pq.write_table(table, file_path)
           # Get expected `metadata` from path.
           metadata = pq.read_metadata(tmpdir / fname)
           schema = table.schema
           assert pq.read_metadata(file_path).equals(metadata)
   >       assert pq.read_metadata(
               fname, filesystem=f'file:///{tmpdir}').equals(metadata)
   opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/parquet/test_metadata.py:553: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
   opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet/__init__.py:3425: in read_metadata
       file = ParquetFile(where, memory_map=memory_map,
   opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet/__init__.py:287: in __init__
       self.reader.open(
   pyarrow/_parquet.pyx:1225: in pyarrow._parquet.ParquetReader.open
       ???
   pyarrow/io.pxi:1674: in pyarrow.lib.get_reader
       ???
   pyarrow/io.pxi:1665: in pyarrow.lib.get_native_file
       ???
   pyarrow/io.pxi:943: in pyarrow.lib.OSFile.__cinit__
       ???
   pyarrow/io.pxi:953: in pyarrow.lib.OSFile._open_readable
       ???
   pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
   >   ???
   E   FileNotFoundError: [Errno 2] Failed to open local file 'data.parquet'. Detail: [errno 2] No such file or directory
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r941695546


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -533,6 +536,44 @@ def test_metadata_exceeds_message_size():
     metadata = pq.read_metadata(pa.BufferReader(buf))
 
 
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    fname = "data.parquet"
+    file_path = 'file:///' + os.path.join(str(tmpdir), fname)
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / fname)
+    schema = table.schema
+
+    assert pq.read_metadata(file_path).equals(metadata)
+    assert pq.read_metadata(
+        fname, filesystem=f'file:///{tmpdir}').equals(metadata)
+
+    assert pq.read_schema(file_path).equals(schema)
+    assert pq.read_schema(fname, filesystem=f'file:///{tmpdir}').equals(schema)
+
+    with util.change_cwd(tmpdir):
+        # Pass `filesystem` arg
+        assert pq.read_metadata(
+            fname, filesystem=LocalFileSystem()).equals(metadata)
+        assert pq.read_metadata(
+            fname, filesystem=LocalFileSystem.get_instance()).equals(metadata)

Review Comment:
   The new LocalFileSystem has no `get_instance()` method, so I think you can just remove this case (since the line above already includes the case of a plain LocalFileSystem())



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r925240175


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Is there already a JIRA opened for this segfault?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1186604788

   macOS CI Failure looks unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r923294237


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   I think this is great! Thank you so much for all the updates.
   
   Just one more thing: there are files being committed by accident I think? Can you remove those?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r925411503


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   I don't think so. I will open one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1216641382

   @jorisvandenbossche CI failure looks irrelevant. PTAL :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche merged pull request #13629: ARROW-16719: [Python] Add path/URI + filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche merged PR #13629:
URL: https://github.com/apache/arrow/pull/13629


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r923089353


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   ```
   num_columns: 1
   num_rows: 3
   num_row_groups: 1
   format_version: 2.6
   serialized_size: 375
   ```
   
   From the following metadata attributes,  I think `num_columns`, `num_rows`, `num_row_groups` make sense. Should we also check for the other two (I am not sure if they will stay same across versions and platforms)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r925244886


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3411,11 +3416,15 @@ def read_metadata(where, memory_map=False, decryption_properties=None):
       format_version: 2.6
       serialized_size: 561
     """
+    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
+    if filesystem is not None:
+        where = filesystem.open_input_file(where)
     return ParquetFile(where, memory_map=memory_map,

Review Comment:
   When opening the file with `open_input_file`, we should probably use it in a context manager to ensure we also close the file handle again:
   
   ```
   with filesystem.open_input_file(where) as source:
       return ParquetFile(source, ...).metadata
   ```
   
   (and the same for read_schema)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1186445923

   https://issues.apache.org/jira/browse/ARROW-16719


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1188096461

   Thank you for working on this issue @kshitij12345! LGTM +1
   
   @jorisvandenbossche can you please have a look before we merge this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r931047338


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3453,6 +3471,13 @@ def read_schema(where, memory_map=False, decryption_properties=None):
     n_legs: int64
     animal: string
     """
-    return ParquetFile(
-        where, memory_map=memory_map,
-        decryption_properties=decryption_properties).schema.to_arrow_schema()
+    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
+    source = filesystem.open_input_file(
+        where) if filesystem is not None else where
+    ctx = source if filesystem is not None else nullcontext()

Review Comment:
   I find these lines slightly difficult to follow due to using both `source` and `where` variables and the two inline ifs for the same case, would something like:
   ```
       ctx = nullcontext()
       if filesystem is not None:
           ctx = where = filesystem.open_input_file(where)
   ```
   be slightly more readable?
   In thas case we should use `where` instead of `source` on the `ParquetFile` but this was how it was originally too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r923145585


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Hm, I guess so. As `file_path` is a URI and you are reading from direct path, you are also checking that the filesystem is selected correctly.
   
   When thinking about it I wondered if it would be good to also add a test with supplying a `filesystem=fs.LocalFileSystem()` to `read_metadata` and `read_schema` (or maybe some other to catch an error)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13629: ARROW-16719: [Python] Add path/URI + filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1218228025

   Benchmark runs are scheduled for baseline = f6127fca7ade9665f31493d37929346e651ed0e4 and contender = 42ed37e3fc84465f365531e611f1bf632b599e7b. 42ed37e3fc84465f365531e611f1bf632b599e7b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/18224653c151498b8e5f72114dda15d7...5bc869a124784be6b5bd4515b6ab2b0b/)
   [Finished :arrow_down:3.44% :arrow_up:2.89%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/322aafe569ea40c5b175c29eceb8d98d...65290c5d2c0143399dc499e7b5aa1aed/)
   [Failed :arrow_down:4.38% :arrow_up:1.92%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/17da1b6eb83b445a85c9bb4b4df80842...c0723bda099e48169225d24ba35da37f/)
   [Finished :arrow_down:5.12% :arrow_up:2.42%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/84afb5fdeff142acb8bf60cc883e7fc2...cd39a3044a0049b1a082ad4cdf662a84/)
   Buildkite builds:
   [Finished] [`42ed37e3` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1323)
   [Finished] [`42ed37e3` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1339)
   [Failed] [`42ed37e3` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1321)
   [Finished] [`42ed37e3` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1338)
   [Finished] [`f6127fca` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1322)
   [Finished] [`f6127fca` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1338)
   [Failed] [`f6127fca` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1320)
   [Finished] [`f6127fca` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1337)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13629: ARROW-16719: [Python] Add path/URI + filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1218228366

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/322aafe569ea40c5b175c29eceb8d98d...65290c5d2c0143399dc499e7b5aa1aed/)
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/17da1b6eb83b445a85c9bb4b4df80842...c0723bda099e48169225d24ba35da37f/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r938631912


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -18,12 +18,15 @@
 import datetime
 import decimal
 from collections import OrderedDict
+import os
 
 import numpy as np
 import pytest
 
 import pyarrow as pa
 from pyarrow.tests.parquet.common import _check_roundtrip, make_sample_file
+from pyarrow.filesystem import LocalFileSystem, FileSystem

Review Comment:
   Sorry for only noticing this late, but we should import here from `pyarrow.fs` to use the new filesystems in the tests (`pyarrow.filesystem` is deprecated)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1190458709

   Failures look unrelated. Should I retrigger the CI?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1186445929

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r923106106


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   On second thought, isn't the current approach better to check complete meta-data retrieved from known to be correct `read_metadata(str-path)` (instead of checking parts of it)? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r931237338


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3453,6 +3471,13 @@ def read_schema(where, memory_map=False, decryption_properties=None):
     n_legs: int64
     animal: string
     """
-    return ParquetFile(
-        where, memory_map=memory_map,
-        decryption_properties=decryption_properties).schema.to_arrow_schema()
+    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
+    source = filesystem.open_input_file(
+        where) if filesystem is not None else where
+    ctx = source if filesystem is not None else nullcontext()

Review Comment:
   Makes sense. Thanks for the suggestion. Have updated the code. PTAL :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r922893873


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   I tried that but it leads to segfault. Probably worth an issue? (I don't think the program should crash)
   
   ```
   python/pyarrow/tests/parquet/test_metadata.py .........................sFatal Python error: Segmentation fault
   
   Current thread 0x00007f066e2be740 (most recent call first):
     File "/home/kshiteej/Apache/arrow/python/pyarrow/tests/parquet/test_metadata.py", line 549 in test_metadata_schema_filesystem
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/python.py", line 1761 in runtest
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 259 in <lambda>
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 338 in from_call
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 258 in call_runtest_hook
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 219 in call_and_report
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 130 in runtestprotocol
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 322 in _main
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 268 in wrap_session
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/config/__init__.py", line 164 in main
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/config/__init__.py", line 187 in console_main
     File "/home/kshiteej/.conda/envs/pyarrow-dev/bin/pytest", line 11 in <module>
   Segmentation fault (core dumped)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r923099456


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Maybe only add something similar to:
   
   ```python
   assert 'parquet-cpp' in pq.read_metadata(file_path).created_by
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r923074731


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Running this locally I can confirm a segfault. I think it happens because the table metadata is (correctly) empty:
   
   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.parquet as pq
   
   >>> table = pa.table({"a": [1, 2, 3]})
   >>> file_path = "/tmp/data.parquet"
   >>> metadata = table.schema.metadata
   
   >>> pq.read_metadata(file_path).equals(metadata)
   zsh: segmentation fault  python
   ```
   
   Which would deserve an issue (a warning should be returned without a crash).
   
   Maybe a better option to test ParquetFile metadata would be to inspect individual attributes:
   
   ```python
   >>> pq.read_metadata(file_path).num_columns == 1
   True
   >>> pq.read_metadata(file_path).num_rows == 3
   True
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r923306349


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Ah, they sneaked in 😅. Removed :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r925471055


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Thanks for opening the JIRA



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1209753146

   @kshitij12345 there are some test failures that actually seem related


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1217877867

   I pushed a small additional update to the test (mainly changing to use our internal `tempdir` fixture instead of `tmpdir`)
   
   Thanks again for the PR @kshitij12345 !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on pull request #13629: ARROW-16719: [Python] Add path/URI + filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1219115715

   @jorisvandenbossche Thank you very much :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1205221630

   Gentle ping :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r922893873


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   I tried that but it leads to segfault. Probably worth an issue? (I don't think the program should crash)
   
   <details>
   
   <summary> Crash Log </summary>
   
   ```
   python/pyarrow/tests/parquet/test_metadata.py .........................sFatal Python error: Segmentation fault
   
   Current thread 0x00007f066e2be740 (most recent call first):
     File "/home/kshiteej/Apache/arrow/python/pyarrow/tests/parquet/test_metadata.py", line 549 in test_metadata_schema_filesystem
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/python.py", line 1761 in runtest
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 259 in <lambda>
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 338 in from_call
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 258 in call_runtest_hook
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 219 in call_and_report
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 130 in runtestprotocol
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 322 in _main
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 268 in wrap_session
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/config/__init__.py", line 164 in main
     File "/home/kshiteej/.conda/envs/pyarrow-dev/lib/python3.9/site-packages/_pytest/config/__init__.py", line 187 in console_main
     File "/home/kshiteej/.conda/envs/pyarrow-dev/bin/pytest", line 11 in <module>
   Segmentation fault (core dumped)
   ```
   
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r922796401


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Is there a way to get `metadata` directly from the table?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r925432093


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Issue Link : https://issues.apache.org/jira/browse/ARROW-17142



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r922881973


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Try `table.schema.metadata`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r923283964


##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -531,3 +532,19 @@ def test_metadata_exceeds_message_size():
         buf = out.getvalue()
 
     metadata = pq.read_metadata(pa.BufferReader(buf))
+
+
+def test_metadata_schema_filesystem(tmpdir):
+    table = pa.table({"a": [1, 2, 3]})
+
+    # URI writing to local file.
+    file_path = 'file:///' + os.path.join(str(tmpdir), "data.parquet")
+
+    pq.write_table(table, file_path)
+
+    # Get expected `metadata` from path.
+    metadata = pq.read_metadata(tmpdir / '/data.parquet')

Review Comment:
   Have updated the test to cover more case. PTAL :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r925497407


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3411,11 +3416,15 @@ def read_metadata(where, memory_map=False, decryption_properties=None):
       format_version: 2.6
       serialized_size: 561
     """
+    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
+    if filesystem is not None:
+        where = filesystem.open_input_file(where)
     return ParquetFile(where, memory_map=memory_map,

Review Comment:
   Thanks for the confirmation. Have addressed the comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r925423325


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3411,11 +3416,15 @@ def read_metadata(where, memory_map=False, decryption_properties=None):
       format_version: 2.6
       serialized_size: 561
     """
+    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
+    if filesystem is not None:
+        where = filesystem.open_input_file(where)
     return ParquetFile(where, memory_map=memory_map,

Review Comment:
   I am curious as to what is the minimum supported Python version.
   
   I was planning to do something like which requires Python 3.7 or more. 
   
   ```python
   source = filesystem.open_input_file(where) if filesystem is not None else nullcontext()
   with source:
       return ParquetFile(
           source, memory_map=memory_map,
           decryption_properties=decryption_properties).schema.to_arrow_schema()
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13629:
URL: https://github.com/apache/arrow/pull/13629#discussion_r925471818


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3411,11 +3416,15 @@ def read_metadata(where, memory_map=False, decryption_properties=None):
       format_version: 2.6
       serialized_size: 561
     """
+    filesystem, where = _resolve_filesystem_and_path(where, filesystem)
+    if filesystem is not None:
+        where = filesystem.open_input_file(where)
     return ParquetFile(where, memory_map=memory_map,

Review Comment:
   Python 3.7 is our minimum supported version



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1190582598

   Thanks for catching that @AlenkaF!
   
   And Windows strikes 😓 ! I don't access to a Windows system. I think `file:///` is not handled by Windows. Do you have any recommendations or is it ok to skip that particular approach on Windows?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kshitij12345 commented on pull request #13629: ARROW-16719: [Python] Add path/URI /+ filesystem handling to parquet.read_metadata

Posted by GitBox <gi...@apache.org>.
kshitij12345 commented on PR #13629:
URL: https://github.com/apache/arrow/pull/13629#issuecomment-1196681103

   Gentle ping @jorisvandenbossche @AlenkaF 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org