Posted to dev@parquet.apache.org by "Thomas Omans (JIRA)" <ji...@apache.org> on 2016/01/26 02:27:40 UTC

[jira] [Comment Edited] (PARQUET-465) Parquet-Avro does not support field removal

    [ https://issues.apache.org/jira/browse/PARQUET-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116441#comment-15116441 ] 

Thomas Omans edited comment on PARQUET-465 at 1/26/16 1:26 AM:
---------------------------------------------------------------

Ryan, 

Thank you for the quick response.

Unfortunately, adding `AvroParquetInputFormat.setRequestedProjection(avroReaderSchema)` breaks other things.  Specifically, you lose the ability to rename fields (lookups silently fail and put `null` into the field value) and to apply default values.
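
For reference, a rough sketch of the two configurations in question (assuming the usual Hadoop / parquet-avro imports; `schemaLatest` here is just a placeholder for the newest reader schema, not the real job code):

{code}
// Sketch only -- not the actual test job; schemaLatest stands in for the newest reader schema.
Job job = Job.getInstance(new Configuration());
job.setInputFormatClass(AvroParquetInputFormat.class);

// Read-schema resolution on its own: aliases (renames) and default values are respected.
AvroParquetInputFormat.setAvroReadSchema(job, schemaLatest);

// Adding a requested projection on top of the read schema is the combination that
// breaks: renamed fields silently come back as null and defaults are not applied.
AvroParquetInputFormat.setRequestedProjection(job, schemaLatest);
{code}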

I actually created a series of evolving schemas in order to see what was supported and what was not:

{code}
@namespace("com.example.avro.compatibility")
protocol Compatibility {

  // original record, only one field
  @namespace("com.example.avro.compatibility.v1")
  record CompatibilityTestRecord {
    long time;
  }

  // add new field with default value
  @namespace("com.example.avro.compatibility.v2")
  record CompatibilityTestRecord {
    long time;
    string id = "v2";
  }

  // reorder fields
  @namespace("com.example.avro.compatibility.v3")
  record CompatibilityTestRecord {
    string id = "v3";
    long time;
  }
  
  // alias field
  @namespace("com.example.avro.compatibility.v4")
  record CompatibilityTestRecord {
    string @aliases(["id"]) notId = "v4";
    long @aliases(["time"]) notTime;
  }

  // drop field
  @namespace("com.example.avro.compatibility.v5")
  record CompatibilityTestRecord {
    string @aliases(["notId"]) id = "v5";
  }

}
{code}

I wrote 5 parquet files, each containing one record written with its specific schema version, then tried to read them across versions: v2 reading v1, v3 reading v2 and v1, etc.
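
The write side involves nothing unusual -- something along these lines, once per schema version (a sketch only, not the exact harness; the file path and the `schemaV1` / `recordV1` names are placeholders):

{code}
// Sketch: write a single v1 record to its own parquet file; repeat per schema version.
Schema schemaV1 = new Schema.Parser().parse(new File("CompatibilityTestRecord.v1.avsc"));

GenericRecord recordV1 = new GenericData.Record(schemaV1);
recordV1.put("time", System.currentTimeMillis());

ParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(new Path("file:/tmp/compat-v1.parquet"), schemaV1);
writer.write(recordV1);
writer.close();
{code}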

Setting only setAvroReadSchema to the latest version of the schema makes every case pass except the final one (dropping a field), whereas setting both the read schema and the requested projection makes them all fail because defaults are not properly applied and aliases are not respected.
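
The passing combination (read schema only, no requested projection) boils down to something like the following outside of the MR job -- again just a sketch, with `schemaV5` standing in for the latest reader schema and the same placeholder path as above:

{code}
// Sketch: read the v1 file with the newest (v5) reader schema set and no projection.
Configuration conf = new Configuration();
AvroReadSupport.setAvroReadSchema(conf, schemaV5);

ParquetReader<GenericRecord> reader =
    AvroParquetReader.<GenericRecord>builder(new Path("file:/tmp/compat-v1.parquet"))
        .withConf(conf)
        .build();
GenericRecord resolved = reader.read();  // fields resolved against schemaV5
reader.close();
{code}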

Thanks again, what is there is great -- I have been reading your commits all day :)


> Parquet-Avro does not support field removal
> -------------------------------------------
>
>                 Key: PARQUET-465
>                 URL: https://issues.apache.org/jira/browse/PARQUET-465
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.8.0
>            Reporter: Thomas Omans
>
> Parquet-Avro does not support removal of fields when used with the new compatibility layer:
> Given a parquet file written with parquet-avro at v1 and the following schema:
> {code}
> record FooBar {
>   long foo;
>   string bar;
> }
> {code}
> And the following configuration settings:
> {code}
> job.getConfiguration.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false)
> AvroParquetInputFormat.setAvroReadSchema(job, avroReaderSchema)
> {code}
> A job fails when trying to read it using schema version v2:
> {code}
> record FooBar {
>   string bar;
> }
> {code}
> With the error:
> {code}
> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'foo' not found
> 	at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:159)
> {code}
> It looks like because it sees the field in the original (writer) version it assumes the new version must expect it as well, but in this case the field was simply removed. Avro schema resolution dictates that such a field just be ignored, since it is not relevant in the new version.


