Posted to issues@arrow.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2020/02/26 21:06:00 UTC

[jira] [Updated] (ARROW-3247) [Python] Support spark parquet array and map types

     [ https://issues.apache.org/jira/browse/ARROW-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Hulette updated ARROW-3247:
---------------------------------
    Description: 
As far as I understand, there is already some support for nested array/dict/structs in arrow. However, spark Map and List types are structured one level deeper (I believe to allow for both NULL and empty entries). Surprisingly, fastparquet can load these. I do not know the plan for arbitrary nested object support, but it should be made clear.

Schema of spark-generated file from the fastparquet test suite:
{code:java}
 - spark_schema:
| - map_op_op: MAP, OPTIONAL
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, OPTIONAL
| - map_op_req: MAP, OPTIONAL
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, REQUIRED
| - map_req_op: MAP, REQUIRED
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, OPTIONAL
| - map_req_req: MAP, REQUIRED
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, REQUIRED
| - arr_op_op: LIST, OPTIONAL
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, OPTIONAL
| - arr_op_req: LIST, OPTIONAL
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, REQUIRED
| - arr_req_op: LIST, REQUIRED
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, OPTIONAL
  - arr_req_req: LIST, REQUIRED
    - list: REPEATED
      - element: BYTE_ARRAY, UTF8, REQUIRED
{code}
(please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at [https://github.com/dask/fastparquet/issues/374] as a feature that is useful in fastparquet)
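
To illustrate why the extra level matters, here is a minimal sketch (not the fastparquet or Arrow implementation) of how a column shaped like arr_op_op could be stored as repetition/definition levels. The level numbering follows the usual Dremel-style convention for an OPTIONAL list containing a REPEATED group with an OPTIONAL element; the function names `shred` and `assemble` are hypothetical and exist only for this example.

```python
from typing import List, Optional, Tuple

# One (repetition level, definition level, value) triple per stored entry.
Triple = Tuple[int, int, Optional[str]]
Row = Optional[List[Optional[str]]]


def shred(rows: List[Row]) -> List[Triple]:
    """Encode rows of an arr_op_op-style column (OPTIONAL list of OPTIONAL strings)."""
    out: List[Triple] = []
    for row in rows:
        if row is None:
            out.append((0, 0, None))      # definition 0: the list itself is NULL
        elif not row:
            out.append((0, 1, None))      # definition 1: list present but empty
        else:
            for i, item in enumerate(row):
                rep = 0 if i == 0 else 1  # 0 starts a new row, 1 continues it
                # definition 2: element is NULL; definition 3: element has a value
                out.append((rep, 2 if item is None else 3, item))
    return out


def assemble(triples: List[Triple]) -> List[Row]:
    """Rebuild the rows from the triples produced by shred()."""
    rows: List[Row] = []
    for rep, definition, value in triples:
        if definition == 0:
            rows.append(None)
        elif definition == 1:
            rows.append([])
        else:
            if rep == 0:
                rows.append([])           # first element of a new row
            rows[-1].append(value if definition == 3 else None)
    return rows


example = [None, [], [None], ["a", None, "b"]]
encoded = shred(example)
decoded = assemble(encoded)
```

In this sketch a NULL list, an empty list, and a list holding a single NULL element each get a distinct definition level, so all three survive a round trip. Without the intermediate repeated level, the first two cases could not be told apart.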

  was:
As far as I understand, there is already some support for nested array/dict/structs in arrow. However, spark Map and List types are structured one level deeper (I believe to allow for both NULL and empty entries). Surprisingly, fastparquet can load these. I do not know the plan for arbitrary nested object support, but it should be made clear.

Schema of spark-generated file from the fastparquet test suite (please see in text mode):

 - spark_schema:
| - map_op_op: MAP, OPTIONAL
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, OPTIONAL
| - map_op_req: MAP, OPTIONAL
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, REQUIRED
| - map_req_op: MAP, REQUIRED
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, OPTIONAL
| - map_req_req: MAP, REQUIRED
|   - key_value: REPEATED
|   | - key: BYTE_ARRAY, UTF8, REQUIRED
|     - value: BYTE_ARRAY, UTF8, REQUIRED
| - arr_op_op: LIST, OPTIONAL
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, OPTIONAL
| - arr_op_req: LIST, OPTIONAL
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, REQUIRED
| - arr_req_op: LIST, REQUIRED
|   - list: REPEATED
|     - element: BYTE_ARRAY, UTF8, OPTIONAL
  - arr_req_req: LIST, REQUIRED
    - list: REPEATED
      - element: BYTE_ARRAY, UTF8, REQUIRED

(please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at https://github.com/dask/fastparquet/issues/374 as a feature that is useful in fastparquet)


> [Python] Support spark parquet array and map types
> --------------------------------------------------
>
>                 Key: ARROW-3247
>                 URL: https://issues.apache.org/jira/browse/ARROW-3247
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Martin Durant
>            Priority: Minor
>              Labels: parquet


