Posted to dev@parquet.apache.org by "Itai Incze (JIRA)" <ji...@apache.org> on 2017/03/22 16:08:41 UTC

[jira] [Comment Edited] (PARQUET-918) FromParquetSchema API crashes on nested schemas

    [ https://issues.apache.org/jira/browse/PARQUET-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936602#comment-15936602 ] 

Itai Incze edited comment on PARQUET-918 at 3/22/17 4:08 PM:
-------------------------------------------------------------

Assuming the fix is to build a sub-schema (i.e. a subtree of the full schema), I see a few ways to do it:

# Change all the functions related to building the arrow schema tree (FromParquetSchema, FieldToNode, NodeToList, StructFromGroup, FromPrimitive) to accept an indices vector, plus some needed context
# Change those same functions to accept some more generic construct, say a filter function pointer
# Construct a full arrow schema using the code available today, then prune unwanted columns

It seems to me that 1 is the most straightforward; 2 is a bit nicer and more general, but might be unnecessary and less intuitive; and 3 is less efficient but touches the least of today's code.
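
To make option 3 concrete, here is a minimal sketch of pruning a fully built schema tree down to the selected leaf columns. It is only an illustration under stated assumptions: {{Node}}, {{Leaf}}, {{Group}}, {{Prune}} and {{ToString}} are hypothetical names for a toy tree, not the Arrow or parquet-cpp API, and column indices are assumed to number the leaves in depth-first schema order.

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a schema node; NOT the Arrow or parquet-cpp API.
// A node without children is treated as a leaf (primitive) column.
struct Node {
  std::string name;
  std::vector<std::shared_ptr<Node>> children;
};

std::shared_ptr<Node> Leaf(const std::string& name) {
  auto node = std::make_shared<Node>();
  node->name = name;
  return node;
}

std::shared_ptr<Node> Group(const std::string& name,
                            std::vector<std::shared_ptr<Node>> children) {
  auto node = Leaf(name);
  node->children = std::move(children);
  return node;
}

// Depth-first pruning. `next_leaf` counts leaves in schema order, so it
// lines up with the assumed leaf column numbering. Returns nullptr when
// nothing under `node` was selected.
std::shared_ptr<Node> Prune(const std::shared_ptr<Node>& node,
                            const std::set<int>& wanted, int* next_leaf) {
  if (node->children.empty()) {
    const int index = (*next_leaf)++;
    return wanted.count(index) ? node : nullptr;
  }
  auto copy = std::make_shared<Node>();
  copy->name = node->name;
  for (const auto& child : node->children) {
    if (auto kept = Prune(child, wanted, next_leaf)) {
      copy->children.push_back(kept);
    }
  }
  // Drop groups that end up with no selected leaves.
  return copy->children.empty() ? nullptr : copy;
}

// Renders a tree as nested names, e.g. "root(a,b(c))".
std::string ToString(const std::shared_ptr<Node>& node) {
  if (!node) return "<empty>";
  std::string out = node->name;
  if (!node->children.empty()) {
    out += "(";
    for (std::size_t i = 0; i < node->children.size(); ++i) {
      if (i > 0) out += ",";
      out += ToString(node->children[i]);
    }
    out += ")";
  }
  return out;
}
```

The sketch walks the whole tree even when only a few columns are wanted, which is the inefficiency mentioned for option 3. Options 1 and 2 would instead thread the {{wanted}} set (or a predicate) through the functions that build the tree in the first place.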

Any thoughts?




> FromParquetSchema API crashes on nested schemas
> -----------------------------------------------
>
>                 Key: PARQUET-918
>                 URL: https://issues.apache.org/jira/browse/PARQUET-918
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.0.0
>            Reporter: Itai Incze
>
> {{FromParquetSchema@src/parquet/arrow/schema.cc:276}} misbehaves: the second version of the function uses its column_indices parameter as indices into the direct fields of the schema root.
> This is a problem for Parquet files with nested schemas: the process crashes when the fields vector is accessed out of bounds.
> This bug is masked by another bug in the first version of the {{FromParquetSchema}} function, which constructs a complete indices list sized by the number of schema fields (instead of the number of columns).
> The bug is triggered in many significant use-cases, for example when using the {{arrow::ReadTable}} API.
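
The mismatch described in the issue can be illustrated with a toy schema: with nesting, there are more leaf columns than root fields, so a leaf column index is not a valid index into the root's fields vector. The sketch below is an assumption-laden illustration, not parquet-cpp code; {{SchemaNode}}, {{CountLeaves}} and {{RootFieldForColumn}} are hypothetical names, and a node without children is assumed to be a leaf column.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema node; NOT the parquet-cpp API.
struct SchemaNode {
  std::string name;
  std::vector<SchemaNode> children;  // empty => leaf column
};

// Number of leaf (primitive) columns under `node`.
int CountLeaves(const SchemaNode& node) {
  if (node.children.empty()) return 1;
  int total = 0;
  for (const auto& child : node.children) total += CountLeaves(child);
  return total;
}

// Maps a leaf column index to the index of the root field containing it.
// Returns -1 when the column index is out of range.
int RootFieldForColumn(const SchemaNode& root, int column) {
  if (column < 0) return -1;
  int first_leaf = 0;  // index of the first leaf under the current root field
  for (std::size_t i = 0; i < root.children.size(); ++i) {
    const int leaves = CountLeaves(root.children[i]);
    if (column < first_leaf + leaves) return static_cast<int>(i);
    first_leaf += leaves;
  }
  return -1;
}
```

For a root with fields {{a}} and {{b}}, where {{b}} is a group of three leaves, there are two root fields but four leaf columns. Column 3 belongs to root field 1, whereas indexing the root's fields vector with 3 directly would read past its end.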


