Posted to dev@parquet.apache.org by "Gang Wu (Jira)" <ji...@apache.org> on 2023/05/06 02:13:00 UTC

[jira] [Updated] (PARQUET-2265) AvroParquetWriter should default to data supplier model from Configuration

     [ https://issues.apache.org/jira/browse/PARQUET-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Wu updated PARQUET-2265:
-----------------------------
    Fix Version/s: 1.13.1

> AvroParquetWriter should default to data supplier model from Configuration
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-2265
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2265
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Claire McGinty
>            Assignee: Claire McGinty
>            Priority: Major
>             Fix For: 1.14.0, 1.13.1
>
>
> I recently ran into a bug where the AvroDataSupplier I specified in my Configuration wasn't respected when creating an AvroParquetWriter:
>  
> ```
> Configuration configuration = new Configuration();
> configuration.setClass(AvroWriteSupport.AVRO_DATA_SUPPLIER, myCustomDataSupplier.getClass(), AvroDataSupplier.class);
> AvroParquetWriter<MyAvroRecord> writer =
>   AvroParquetWriter.<MyAvroRecord>builder(...)
>     .withSchema(...)
>     .withConf(configuration)
>     .build();
> ```
> In this instance, the writer's attached AvroWriteSupport uses a SpecificData model rather than the value of `myCustomDataSupplier.get()`. This happens because AvroParquetWriter defaults to the SpecificData model[0] when no model is supplied to AvroParquetWriter.Builder.
> I see that AvroParquetWriter.Builder has a `.withDataModel` method, but IMO this creates confusion and redundancy, since I end up supplying the data model twice. It also means I can't build abstractions around writer creation (e.g. a `createWriterForConfiguration(Configuration conf)` type of method) without using reflection to instantiate the class returned by `conf.getClass(AvroWriteSupport.AVRO_DATA_SUPPLIER)` and obtain its data model.
> I think it would be simplest if AvroParquetWriter just defaulted to `model = null` and let AvroWriteSupport initialize it based on the Configuration[1]. What do you think? That seems to be what AvroParquetReader is currently doing[2].
>  
> [0] https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L163
> [1] https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L134
> [2] https://github.com/apache/parquet-mr/blob/9a1fbc4ee3f63284a675eeac6c62e96ffc973575/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L133
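
The resolution order proposed in the description (an explicitly supplied model wins, otherwise the supplier named in the Configuration is used) can be sketched in plain Java. This is only an illustration under stated assumptions, not the parquet-mr implementation: `Conf`, `Model`, `DataSupplier`, `DefaultSupplier`, `CustomSupplier`, `DATA_SUPPLIER_KEY` and `resolveModel` are hypothetical stand-ins for Hadoop's Configuration, the Avro data model, AvroDataSupplier and the AvroWriteSupport.AVRO_DATA_SUPPLIER key. The fallback to a "specific" default when nothing is configured is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class ModelResolutionSketch {

  // Hypothetical key; stands in for AvroWriteSupport.AVRO_DATA_SUPPLIER.
  static final String DATA_SUPPLIER_KEY = "example.avro.data.supplier";

  /** Stand-in for an Avro data model such as SpecificData. */
  static final class Model {
    final String name;
    Model(String name) { this.name = name; }
  }

  /** Stand-in for AvroDataSupplier: a no-arg class that yields a model. */
  interface DataSupplier extends Supplier<Model> {}

  /** Assumed default supplier, used when the configuration names none. */
  static final class DefaultSupplier implements DataSupplier {
    @Override public Model get() { return new Model("specific"); }
  }

  /** Plays the role of the reporter's custom supplier. */
  static final class CustomSupplier implements DataSupplier {
    @Override public Model get() { return new Model("custom"); }
  }

  /** Stand-in for a configuration object that stores class names as strings. */
  static final class Conf {
    private final Map<String, String> values = new HashMap<>();
    void setClass(String key, Class<?> cls) { values.put(key, cls.getName()); }
    String get(String key) { return values.get(key); }
  }

  /**
   * Resolves the model the way the description proposes: the builder passes
   * null unless the caller set a model, and the configured supplier is
   * consulted only in that case.
   */
  static Model resolveModel(Model explicitModel, Conf conf) {
    if (explicitModel != null) {
      return explicitModel; // an explicitly supplied model always wins
    }
    String className = conf.get(DATA_SUPPLIER_KEY);
    if (className == null) {
      return new DefaultSupplier().get(); // assumed fallback
    }
    try {
      // Instantiate the configured supplier reflectively and ask it for a model.
      DataSupplier supplier =
          (DataSupplier) Class.forName(className).getDeclaredConstructor().newInstance();
      return supplier.get();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate data supplier " + className, e);
    }
  }

  public static void main(String[] args) {
    Conf conf = new Conf();
    conf.setClass(DATA_SUPPLIER_KEY, CustomSupplier.class);
    // No explicit model is passed, so the configured supplier decides.
    System.out.println(resolveModel(null, conf).name);
  }
}
```

With this ordering, a helper along the lines of `createWriterForConfiguration(Configuration conf)` would only need to pass the configuration through, and callers who still want `.withDataModel` keep the ability to override it.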


