Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/14 10:01:31 UTC

[GitHub] [arrow] dongjoon-hyun opened a new pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

dongjoon-hyun opened a new pull request #12153:
URL: https://github.com/apache/arrow/pull/12153


   This PR aims to add `pyarrow.orc.read_table` like `pyarrow.parquet.read_table`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787346975



##########
File path: python/pyarrow/tests/test_orc.py
##########
@@ -169,6 +171,36 @@ def test_orcfile_empty(datadir):
     assert table.schema == expected_schema
 
 
+def test_readwrite(tmpdir):
+    from pyarrow import orc
+    a = pa.array([1, None, 3, None])
+    b = pa.array([None, "Arrow", None, "ORC"])
+    table = pa.table({"int64": a, "utf8": b})
+    file = tmpdir.join("test.orc")
+    orc.write_table(table, file)
+    output_table = orc.read_table(file)
+    assert table.equals(output_table)
+

Review comment:
       Sure!





[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787338358



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       Since it needs a fix on `ORCFile` itself, can I handle it later in a separate PR, @jorisvandenbossche ?





[GitHub] [arrow] ursabot commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

ursabot commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1016768070


   Benchmark runs are scheduled for baseline = deb6e132da927abeb0be4d0b8ad5eee8d49d9980 and contender = ff4b9bea56aeb2c48f19d6137dd2fbae59d618c7. ff4b9bea56aeb2c48f19d6137dd2fbae59d618c7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a77ca33986a54ea285ee65a3b6b16679...447026ce9a564b41aee5e0c736f750d9/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fe341b04335d488b86df05884bde832e...95ab46fc61cd4a6b9516b44996240583/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/4b64ba753b334fadb7a375e00d69a123...886fcf14f8fb4bfd81b2c7488b3665e0/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r785757447



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format

Review comment:
       ```suggestion
       Read a Table from an ORC file.
   ```

##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       This last sentence doesn't actually seem to be true:
   
   ```
   In [1]: import pyarrow as pa
   
   In [2]: from pyarrow import orc
   
   In [3]: orc.write_table(pa.table({'a': [1, 2, 3]}), "test.orc")
   
   In [4]: result = orc.ORCFile("test.orc").read(columns=[])
   
   In [5]: result.num_rows
   Out[5]: 0
   
   In [6]: result.num_columns
   Out[6]: 0
   ```

##########
File path: python/pyarrow/tests/test_orc.py
##########
@@ -169,6 +171,36 @@ def test_orcfile_empty(datadir):
     assert table.schema == expected_schema
 
 
+def test_readwrite(tmpdir):
+    from pyarrow import orc
+    a = pa.array([1, None, 3, None])
+    b = pa.array([None, "Arrow", None, "ORC"])
+    table = pa.table({"int64": a, "utf8": b})
+    file = tmpdir.join("test.orc")
+    orc.write_table(table, file)
+    output_table = orc.read_table(file)
+    assert table.equals(output_table)
+

Review comment:
       Can you add one more `read_table` call here with e.g. `columns=["int64"]`, just to make sure that this keyword is properly passed through?

##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.
+    filesystem : FileSystem, default None
+        If nothing passed, paths assumed to be found in the local on-disk
+        filesystem.

Review comment:
       I know this is copied from the parquet `read_table` docstring, but I am not sure this is fully correct? I suppose passing a URI for, e.g., a remote file also works?

##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For

Review comment:
       For ORC, this implementation of `read_table` does not support reading directories. So the above should only be a single file name.





[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r788036346



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.
+    filesystem : FileSystem, default None
+        If nothing passed, paths assumed to be found in the local on-disk
+        filesystem.

Review comment:
       Got it.





[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787338358



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       ~Since it needs a fix on `ORCFile` itself, can I handle it later in a separate PR, @jorisvandenbossche ?~





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787429771



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.
+    filesystem : FileSystem, default None
+        If nothing passed, paths assumed to be found in the local on-disk
+        filesystem.

Review comment:
       We don't yet support an HTTP remote filesystem (URLs), but in general we do support URIs for implemented filesystems such as S3 (e.g. `s3://bucket/data.orc` _should_ work, I think).





[GitHub] [arrow] dongjoon-hyun commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1016079423


   Thank you for the review. I addressed your comments. Could you review this once more, @jorisvandenbossche ?



[GitHub] [arrow] github-actions[bot] commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1012977611







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787433850



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       Hmm, it's maybe better to leave it for a separate JIRA, as reading in all columns to then select none of them in a second step might be a bit surprising for users? 





[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787337725



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       Thank you for pointing that out. Indeed, I didn't notice that.





[GitHub] [arrow] ursabot edited a comment on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1016768070


   Benchmark runs are scheduled for baseline = deb6e132da927abeb0be4d0b8ad5eee8d49d9980 and contender = ff4b9bea56aeb2c48f19d6137dd2fbae59d618c7. ff4b9bea56aeb2c48f19d6137dd2fbae59d618c7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a77ca33986a54ea285ee65a3b6b16679...447026ce9a564b41aee5e0c736f750d9/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fe341b04335d488b86df05884bde832e...95ab46fc61cd4a6b9516b44996240583/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/4b64ba753b334fadb7a375e00d69a123...886fcf14f8fb4bfd81b2c7488b3665e0/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] dongjoon-hyun commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1013257090


   cc @williamhyun 



[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787343963



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.
+    filesystem : FileSystem, default None
+        If nothing passed, paths assumed to be found in the local on-disk
+        filesystem.

Review comment:
       It doesn't work, as the following screenshot shows.
   <img width="1556" alt="Screen Shot 2022-01-18 at 8 38 52 PM" src="https://user-images.githubusercontent.com/9700541/150065159-29158061-f5e7-43b0-9e48-940d982882bd.png">





[GitHub] [arrow] dongjoon-hyun commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1016083265


   Oops. Here is a correction. For the following
   - https://github.com/apache/arrow/pull/12153/files#r785785373
   
   Actually, I chose to return `all` columns like `None` for now.



[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787341715



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       Oh, it seems that I can fix it in this layer.





[GitHub] [arrow] dongjoon-hyun commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1013256876


   Could you review this please, @sunchao ?



[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787335649



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format

Review comment:
       Thanks.





[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r787341715



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       Oh, it seems that I can fix it in this layer.
   ```python
   In [4]: result = orc.ORCFile("test.orc").read(columns=[])
   
   In [5]: result.num_rows
   Out[5]: 0
   
   In [6]: result = orc.ORCFile("test.orc").read(columns=None)
   
   In [7]: result.num_rows
   Out[7]: 3
   
   In [8]: result.num_columns
   Out[8]: 1
   ```





[GitHub] [arrow] dongjoon-hyun commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1013359012


   Thank you so much for your review and approval, @sunchao !



[GitHub] [arrow] ursabot edited a comment on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1016768070


   Benchmark runs are scheduled for baseline = deb6e132da927abeb0be4d0b8ad5eee8d49d9980 and contender = ff4b9bea56aeb2c48f19d6137dd2fbae59d618c7. ff4b9bea56aeb2c48f19d6137dd2fbae59d618c7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a77ca33986a54ea285ee65a3b6b16679...447026ce9a564b41aee5e0c736f750d9/)
   [Finished :arrow_down:0.71% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fe341b04335d488b86df05884bde832e...95ab46fc61cd4a6b9516b44996240583/)
   [Finished :arrow_down:0.09% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/4b64ba753b334fadb7a375e00d69a123...886fcf14f8fb4bfd81b2c7488b3665e0/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] dongjoon-hyun removed a comment on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun removed a comment on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1016083265


   Oops. Here is a correction. For the following
   - https://github.com/apache/arrow/pull/12153/files#r785785373
   
   Actually, I chose to return all columns, the same as for `None`, for now.



[GitHub] [arrow] kszucs commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

kszucs commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1013791479


   Sure, I'm going to. Also cc @jorisvandenbossche 



[GitHub] [arrow] dongjoon-hyun commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1013771377


   Hi, @kszucs . Could you review this, please?



[GitHub] [arrow] dongjoon-hyun commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1013806937


   Thank you so much, @kszucs .



[GitHub] [arrow] ursabot edited a comment on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1016768070


   Benchmark runs are scheduled for baseline = deb6e132da927abeb0be4d0b8ad5eee8d49d9980 and contender = ff4b9bea56aeb2c48f19d6137dd2fbae59d618c7. ff4b9bea56aeb2c48f19d6137dd2fbae59d618c7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a77ca33986a54ea285ee65a3b6b16679...447026ce9a564b41aee5e0c736f750d9/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fe341b04335d488b86df05884bde832e...95ab46fc61cd4a6b9516b44996240583/)
   [Finished :arrow_down:0.09% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/4b64ba753b334fadb7a375e00d69a123...886fcf14f8fb4bfd81b2c7488b3665e0/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] kszucs closed pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

kszucs closed pull request #12153:
URL: https://github.com/apache/arrow/pull/12153


   



[GitHub] [arrow] dongjoon-hyun commented on pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#issuecomment-1016757122


   Thank you, @kszucs , @jorisvandenbossche , and @sunchao .



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r788056019



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.
+    filesystem : FileSystem, default None
+        If nothing passed, paths assumed to be found in the local on-disk
+        filesystem.

Review comment:
       FYI, for this documentation issue, I already created a JIRA -> https://issues.apache.org/jira/browse/ARROW-15364 (it's also outdated in the parquet docs)





[GitHub] [arrow] dongjoon-hyun commented on a change in pull request #12153: ARROW-15338: [Python] Add `pyarrow.orc.read_table` API

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #12153:
URL: https://github.com/apache/arrow/pull/12153#discussion_r788035820



##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       Thanks, @jorisvandenbossche . Yes, I agree with you that we need to optimize it in both `ORCFile` API and this one together for that case.

##########
File path: python/pyarrow/orc.py
##########
@@ -175,3 +176,33 @@ def write_table(table, where):
     writer = ORCWriter(where)
     writer.write(table)
     writer.close()
+
+
+def read_table(source, columns=None, filesystem=None):
+    """
+    Read a table from ORC format
+
+    Parameters
+    ----------
+    source : str, pyarrow.NativeFile, or file-like object
+        If a string passed, can be a single file name or directory name. For
+        file-like objects, only read a single file. Use pyarrow.BufferReader to
+        read a file contained in a bytes or buffer-like object.
+    columns : list
+        If not None, only these columns will be read from the file. A column
+        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
+        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
+        that the table will still have the correct num_rows set despite having
+        no columns.

Review comment:
       Thanks, @jorisvandenbossche . Yes, I agree with you that we need to optimize it in both `ORCFile` API and this one together for that case in a separate JIRA.



