Posted to dev@crunch.apache.org by "Alex Kozlov (JIRA)" <ji...@apache.org> on 2013/12/10 21:52:09 UTC
[jira] [Updated] (CRUNCH-310) There should be a way to specify projection schema for Parquet files
[ https://issues.apache.org/jira/browse/CRUNCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Kozlov updated CRUNCH-310:
-------------------------------
Attachment: 0001-CRUNCH-310-A-fix-for-projected-schemas.txt
Here is a simple fix that works, but I don't like it. I would prefer to have something like a *Builder*:
{code}
Source<Person> source = AvroParquetFileSource.newBuilder().forClass(Person.class)
    .includeColumn("name").includeColumn("age")
    .pushdownFilter(filter).build();
{code}
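To make the idea concrete, here is a rough, self-contained sketch of how such a builder could collect column names and derive a projection schema from them. Everything in it is hypothetical: `ProjectionSchemaBuilder`, `forRecord`, `field`, and `includeColumn` are illustrative names and not part of Crunch or Parquet, and the schema is assembled as a plain JSON string for flat records with primitive types only, rather than through the Avro API.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical sketch of a projection builder; not an existing Crunch class. */
final class ProjectionSchemaBuilder {
    private final String recordName;
    // Full record layout: field name -> Avro primitive type name, in declaration order.
    private final Map<String, String> allFields = new LinkedHashMap<>();
    // Columns the caller asked for; empty means "no projection, keep everything".
    private final Set<String> included = new LinkedHashSet<>();

    private ProjectionSchemaBuilder(String recordName) {
        this.recordName = recordName;
    }

    static ProjectionSchemaBuilder forRecord(String recordName) {
        return new ProjectionSchemaBuilder(recordName);
    }

    /** Declares a field of the full (unprojected) record. */
    ProjectionSchemaBuilder field(String name, String avroType) {
        allFields.put(name, avroType);
        return this;
    }

    /** Marks a previously declared field as part of the projection. */
    ProjectionSchemaBuilder includeColumn(String name) {
        if (!allFields.containsKey(name)) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        included.add(name);
        return this;
    }

    /** Returns an Avro record schema (as JSON) holding only the included columns. */
    String build() {
        StringJoiner fields = new StringJoiner(",", "[", "]");
        for (Map.Entry<String, String> e : allFields.entrySet()) {
            if (included.isEmpty() || included.contains(e.getKey())) {
                fields.add("{\"name\":\"" + e.getKey()
                        + "\",\"type\":\"" + e.getValue() + "\"}");
            }
        }
        return "{\"type\":\"record\",\"name\":\"" + recordName
                + "\",\"fields\":" + fields + "}";
    }
}
```

Under these assumptions, declaring `name`, `age`, and `email` on a `Person` record and then calling `includeColumn("name").includeColumn("age")` yields a schema with just those two fields. A string like that could presumably be set as the `AvroReadSupport.AVRO_REQUESTED_PROJECTION` value in place of `ptype.getSchema().toString()`.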
Any ideas?
> There should be a way to specify projection schema for Parquet files
> --------------------------------------------------------------------
>
> Key: CRUNCH-310
> URL: https://issues.apache.org/jira/browse/CRUNCH-310
> Project: Crunch
> Issue Type: Improvement
> Components: IO
> Reporter: Alex Kozlov
> Priority: Critical
> Attachments: 0001-CRUNCH-310-A-fix-for-projected-schemas.txt
>
>
> Currently the projection schema is set based on the ptype:
> {code}
> private static <S> FormatBundle<AvroParquetInputFormat> getBundle(AvroType<S> ptype) {
>   return FormatBundle.forInput(AvroParquetInputFormat.class)
>       .set(AvroReadSupport.AVRO_REQUESTED_PROJECTION, ptype.getSchema().toString())
>       // ParquetRecordReader expects ParquetInputSplits, not FileSplits, so it
>       // doesn't work with CombineFileInputFormat
>       .set(RuntimeParameters.DISABLE_COMBINE_FILE, "true");
> }
> {code}
> Sometimes a user wants only a subset of the columns as the projection. We need a mechanism to supply the desired projection schema.