Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/11/05 11:20:00 UTC

[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata

    [ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967438#comment-16967438 ] 

Joris Van den Bossche commented on ARROW-7063:
----------------------------------------------

I also ran into this recently when looking at reports involving a huge number of columns (although that was in Python, and I see that we don't use exactly the same code as the C++ pretty printer: https://github.com/apache/arrow/blob/e0cc9c43276840579a29332aca7348bbc415c051/python/pyarrow/types.pxi#L1245-L1264).

We should probably at least truncate the metadata. Personally I would prefer truncating it (so it doesn't get annoying) over not showing it at all, as IMO it is useful to see that the table has metadata.
We could for example truncate each entry to a max of 50 characters (adding {{...}}) while still showing all entries (all keys).
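
A minimal sketch of that truncation idea, in plain Python rather than the actual C++ or pyarrow printer code. The function name {{truncate_metadata}}, its parameters and the example dict are hypothetical; the only assumption is that the metadata is a mapping of bytes keys to bytes values, as in the output further down:

```python
def truncate_metadata(metadata, max_len=50, marker=b"..."):
    """Return a copy of `metadata` with every value cut to at most `max_len` bytes.

    All keys are kept, so the reader still sees which entries exist.
    """
    truncated = {}
    for key, value in metadata.items():
        if len(value) > max_len:
            # Reserve room for the marker so the result is exactly max_len long.
            value = value[: max_len - len(marker)] + marker
        truncated[key] = value
    return truncated


# Hypothetical stand-in for schema.metadata (a dict of bytes -> bytes).
example = {
    b"pandas": b'{"index_columns": [{"kind": "range", "name": null, "start": 0}]}',
    b"short": b"ok",
}
for key, value in truncate_metadata(example).items():
    print(key.decode(), ":", value.decode())
```

Short values pass through unchanged, and long ones end in the marker, so it stays visible that something was cut.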

{quote}And IDK what to do with this {{ARROW:schema: }} business but it's clearly not readable as is.{quote}

It's the original Arrow schema in serialized format. Here is a Python example of how it is created when writing a Parquet file, and how to retrieve it again:

{code}
In [33]: import pandas as pd
    ...: import pyarrow as pa

In [34]: table = pa.table(pd.DataFrame({'a': [1, 2, 3]}))                                                                                                                                                          

In [35]: table                                                                                                                                                                                                     
Out[35]: 
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
            b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
            b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
            b'.g157495696.dirty"}'}

In [36]: import pyarrow.parquet as pq                                                                                                                                                                              

In [37]: pq.write_table(table, 'test.parquet')                                                                                                                                                                     

In [39]: schema = pq.read_schema('test.parquet')                                                                                                                                                                   

In [40]: schema                                                                                                                                                                                                    
Out[40]: 
a: int64
metadata
--------
{b'ARROW:schema': b'/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwA'
                  b'AAAEAAgACgAAAAgCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAA'
                  b'EAAAAAYAAABwYW5kYXMAANMBAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJr'
                  b'aW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAi'
                  b'c3RvcCI6IDMsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBb'
                  b'eyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFz'
                  b'X3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIs'
                  b'ICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29s'
                  b'dW1ucyI6IFt7Im5hbWUiOiAiYSIsICJmaWVsZF9uYW1lIjogImEiLCAi'
                  b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2'
                  b'NCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJh'
                  b'cnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjAuMTUuMS5kZXYyMTIr'
                  b'ZzRhZmU5ZjBlYSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMC4yNi4wLmRl'
                  b'djArNjkxLmcxNTc0OTU2OTYuZGlydHkifQABAAAAFAAAABAAFAAIAAYA'
                  b'BwAMAAAAEAAQAAAAAAABAiQAAAAUAAAABAAAAAAAAAAIAAwACAAHAAgA'
                  b'AAAAAAABQAAAAAEAAABhAAAA',
 b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
            b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
            b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
            b'.g157495696.dirty"}'}

In [44]: original_schema_encoded = schema.metadata[b'ARROW:schema']                                                                                                                                                 

In [45]: import base64                                                                                                                                                                                             

In [46]: original_schema = pa.read_schema(pa.BufferReader(base64.b64decode(original_schema_encoded)))                                                                                                                      

In [47]: original_schema                                                                                                                                                                                           
Out[47]: 
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
            b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
            b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
            b'.g157495696.dirty"}'}

{code}

> [C++] Schema print method prints too much metadata
> --------------------------------------------------
>
>                 Key: ARROW-7063
>                 URL: https://issues.apache.org/jira/browse/ARROW-7063
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, C++ - Dataset
>            Reporter: Neal Richardson
>            Priority: Minor
>              Labels: dataset, parquet
>             Fix For: 1.0.0
>
>
> I loaded some taxi data in a Dataset and printed the schema. This is what was printed:
> {code}
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> pickup_longitude: float
> pickup_latitude: float
> rate_code_id: null
> store_and_fwd_flag: string
> dropoff_longitude: float
> dropoff_latitude: float
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> total_amount: float
> -- metadata --
> pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, {"name": "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "pickup_latitude", "field_name": "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "payment_type", "field_name": "payment_type", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "fare_amount", "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "extra", "field_name": "extra", "pandas_type": "float32", 
"numpy_type": "float32", "metadata": null}, {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "tolls_amount", "field_name": "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, {"name": "total_amount", "field_name": "total_amount", "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": "0.25.3"}
> ARROW:schema: /////3gOAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAEAAgACgAAAFQKAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAsCgAABAAAAB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9hdCIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicGFzc2VuZ2VyX2NvdW50IiwgImZpZWxkX25hbWUiOiAicGFzc2VuZ2VyX2NvdW50IiwgInBhbmRhc190eXBlIjogImludDgiLCAibnVtcHlfdHlwZSI6ICJpbnQ4IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJ0cmlwX2Rpc3RhbmNlIiwgImZpZWxkX25hbWUiOiAidHJpcF9kaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicGlja3VwX2xvbmdpdHVkZSIsICJmaWVsZF9uYW1lIjogInBpY2t1cF9sb25naXR1ZGUiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInBpY2t1cF9sYXRpdHVkZSIsICJmaWVsZF9uYW1lIjogInBpY2t1cF9sYXRpdHVkZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicmF0ZV9jb2RlX2lkIiwgImZpZWxkX25hbWUiOiAicmF0ZV9jb2RlX2lkIiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJzdG9yZV9hbmRfZndkX2ZsYWciLCAiZmllbGRfbmFtZSI6ICJzdG9yZV9hbmRfZndkX2ZsYWciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7
Im5hbWUiOiAiZHJvcG9mZl9sb25naXR1ZGUiLCAiZmllbGRfbmFtZSI6ICJkcm9wb2ZmX2xvbmdpdHVkZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9sYXRpdHVkZSIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfbGF0aXR1ZGUiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInBheW1lbnRfdHlwZSIsICJmaWVsZF9uYW1lIjogInBheW1lbnRfdHlwZSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJmYXJlX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogImZhcmVfYW1vdW50IiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJleHRyYSIsICJmaWVsZF9uYW1lIjogImV4dHJhIiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJtdGFfdGF4IiwgImZpZWxkX25hbWUiOiAibXRhX3RheCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidGlwX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogInRpcF9hbW91bnQiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInRvbGxzX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogInRvbGxzX2Ftb3VudCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidG90YWxfYW1vdW50IiwgImZpZWxkX25hbWUiOiAidG90YWxfYW1vdW50IiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lvbiI6ICIwLjI1LjMifQAGAAAAcGFuZGFzAAASAAAAxAMAAHgDAABEAwAAAAMAAMgCAACMAgAAVAIAACACAADoAQAArAEAAHABAAA8AQAACAEAANgAAACoAAAAdAAAADwAAAAEAAAAlPz//wAAAQMYAAAADAAAAAQAAAAAAAAAyvz//wAAAQAMAAAAdG90YWxfYW1vdW50AAAAAMj8//8AAAEDGAAAAAwAAAAEAAAAAAAAAP78//8AAAEADAAAAHRvbGxzX2Ftb3VudAAAAAD8/P//AAABAxgAAAAMAAAABAAAAAAAAAAy/f//AAABAAoAAAB0aXBfYW1vdW50AAAs/f//AAABAxgAAAAMAAAABAAAAAAAAABi/f//AAABAAcAAABtdGFfdGF4AFj9//8AAAEDGAAAAAwAAAAEAAAA
AAAAAI79//8AAAEABQAAAGV4dHJhAAAAhP3//wAAAQMYAAAADAAAAAQAAAAAAAAAuv3//wAAAQALAAAAZmFyZV9hbW91bnQAtP3//wAAAQUUAAAADAAAAAQAAAAAAAAApP3//wwAAABwYXltZW50X3R5cGUAAAAA5P3//wAAAQMYAAAADAAAAAQAAAAAAAAAGv7//wAAAQAQAAAAZHJvcG9mZl9sYXRpdHVkZQAAAAAc/v//AAABAxgAAAAMAAAABAAAAAAAAABS/v//AAABABEAAABkcm9wb2ZmX2xvbmdpdHVkZQAAAFT+//8AAAEFFAAAAAwAAAAEAAAAAAAAAET+//8SAAAAc3RvcmVfYW5kX2Z3ZF9mbGFnAACI/v//AAABARQAAAAMAAAABAAAAAAAAAB4/v//DAAAAHJhdGVfY29kZV9pZAAAAAC4/v//AAABAxgAAAAMAAAABAAAAAAAAADu/v//AAABAA8AAABwaWNrdXBfbGF0aXR1ZGUA7P7//wAAAQMYAAAADAAAAAQAAAAAAAAAIv///wAAAQAQAAAAcGlja3VwX2xvbmdpdHVkZQAAAAAk////AAABAxgAAAAMAAAABAAAAAAAAABa////AAABAA0AAAB0cmlwX2Rpc3RhbmNlAAAAWP///wAAAQIkAAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAQgAAAAPAAAAcGFzc2VuZ2VyX2NvdW50AJj///8AAAEKGAAAAAwAAAAEAAAAAAAAAM7///8AAAMACgAAAGRyb3BvZmZfYXQAAMj///8AAAEKIAAAABQAAAAEAAAAAAAAAAAABgAIAAYABgAAAAAAAwAJAAAAcGlja3VwX2F0AAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAACQAAAHZlbmRvcl9pZAAAAA==
> {code}
> I'd argue that extra metadata, if it's not part of the Arrow format and can be whatever an application wants to put in there, should not be printed as part of the schema's ToString method. It should be viewable some way, just not always. And IDK what to do with this {{ARROW:schema: }} business but it's clearly not readable as is.


