You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by "Gabriel Reid (JIRA)" <ji...@apache.org> on 2014/11/06 17:21:34 UTC

[jira] [Assigned] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema

     [ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid reassigned CRUNCH-480:
-----------------------------------

    Assignee: Gabriel Reid

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration param required to use a user-supplied read schema that differs from the schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore the supplied requestedSchema and, instead, looks for the key avro.read.schema in the readSupportMetadata map. This is seriously kookie code in Parquet (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can never properly supply a read schema. Boooo hisssss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)