Posted to jira@arrow.apache.org by "Gert Hulselmans (Jira)" <ji...@apache.org> on 2021/02/01 13:35:00 UTC

[jira] [Comment Edited] (ARROW-10344) [Python] Get all columns names (or schema) from Feather file, before loading whole Feather file

    [ https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276321#comment-17276321 ] 

Gert Hulselmans edited comment on ARROW-10344 at 2/1/21, 1:34 PM:
------------------------------------------------------------------

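[~weldingwelding] The first and last bytes of the Feather file tell you which version it is: a Feather v1 file starts and ends with the 4-byte magic `FEA1`, a Feather v2 file with the 6-byte magic `ARROW1`. For example, you can check the header with `hexdump`.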

{code:bash}
❯ hexdump -C -n 8 feather_version1.feather
00000000  46 45 41 31 00 00 00 00                           |FEA1....|
00000008

❯ hexdump -C -n 8 feather_version2.feather
00000000  41 52 52 4f 57 31 00 00                           |ARROW1..|
00000008
{code}

{code:python}
def feather_v1_or_v2(feather_file):
    with open(feather_file, 'rb') as fh_feather:
        fh_feather.seek(0, 0)
        feather_v1_magic_bytes_header = fh_feather.read(4)
        fh_feather.seek(-4, 2)
        feather_v1_magic_bytes_footer = fh_feather.read(4)

        if feather_v1_magic_bytes_header == feather_v1_magic_bytes_footer == b'FEA1':
            return 1

        fh_feather.seek(0, 0)
        feather_v2_magic_bytes_header = fh_feather.read(6)
        fh_feather.seek(-6, 2)
        feather_v2_magic_bytes_footer = fh_feather.read(6)

        if feather_v2_magic_bytes_header == feather_v2_magic_bytes_footer == b'ARROW1':
            return 2

        return None
{code}
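
A slightly more defensive variant of the same check, as a sketch only: it uses the same two magic strings as above, but looks at the file size first, because seeking backwards from the end of a file shorter than the magic can raise `OSError`. The names `MAGIC_BY_VERSION` and `detect_feather_version` are made up for this example and are not part of pyarrow.

```python
import os

# Magic strings as seen in the hexdump output above.
MAGIC_BY_VERSION = {1: b"FEA1", 2: b"ARROW1"}


def detect_feather_version(path):
    """Return 1 or 2 for a Feather v1/v2 file, or None if no magic matches."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        for version, magic in MAGIC_BY_VERSION.items():
            if size < 2 * len(magic):
                # Too small to hold both a header and a footer magic.
                continue
            fh.seek(0)
            header = fh.read(len(magic))
            fh.seek(-len(magic), os.SEEK_END)
            footer = fh.read(len(magic))
            if header == footer == magic:
                return version
    return None
```

Like the function above, this only inspects the magic bytes; it does not validate the rest of the file.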

[~jorisvandenbossche] Now that https://issues.apache.org/jira/browse/ARROW-10056 is resolved, Feather v1 support is less critical, so the IPC and dataset API workarounds are now useful for me. It would still be good to have Feather v1 support and to have the column names exposed directly in the feather submodule.
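
For reference, a minimal sketch of what the IPC workaround can look like for a Feather v2 file. It assumes pyarrow is installed and that the file really is Feather v2 (Arrow IPC file format); the helper name `feather_v2_column_names` is hypothetical and not part of pyarrow. Only the schema is read, not the column data.

```python
def feather_v2_column_names(feather_file):
    """Return the column names of a Feather v2 (Arrow IPC) file from its schema."""
    # Imported inside the function so the helper can be defined
    # even in an environment where pyarrow is not installed.
    import pyarrow as pa

    # Memory-map the file so that only the parts actually read are paged in.
    with pa.memory_map(feather_file, "r") as source:
        return pa.ipc.open_file(source).schema.names
```

The returned list could then be filtered and passed as `columns=` to `feather.read_feather` or `feather.read_table`. This does not cover Feather v1 files.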




> [Python]  Get all columns names (or schema) from Feather file, before loading whole Feather file
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10344
>                 URL: https://issues.apache.org/jira/browse/ARROW-10344
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}


