Posted to dev@parquet.apache.org by Wei Yan <yw...@gmail.com> on 2015/05/05 18:31:53 UTC

Met a schema problem when using AvroParquetInputFormat

Hi,

I've hit a problem using AvroParquetInputFormat in my MapReduce job.
The input files were written with two different versions of the schema:
one field is "int" in v1 but "long" in v2. The exception:

Exception in thread "main"
parquet.schema.IncompatibleSchemaModificationException: can not merge type
optional int32 a into optional int64 a
at parquet.schema.PrimitiveType.union(PrimitiveType.java:513)
at parquet.schema.GroupType.mergeFields(GroupType.java:359)
at parquet.schema.GroupType.union(GroupType.java:341)
at parquet.schema.GroupType.mergeFields(GroupType.java:359)
at parquet.schema.MessageType.union(MessageType.java:138)
at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:497)
at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:470)
at parquet.hadoop.ParquetFileWriter.getGlobalMetaData(ParquetFileWriter.java:446)
at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:429)
at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:412)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)

I'm using Parquet 1.5, and it looks like "int" cannot be merged with "long".
I tried 1.6.0rc1 and set "parquet.strict.typing", but that didn't help.

Is there any way to solve this, such as automatically promoting "int" to
"long", instead of rewriting all the data with the same schema version?
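
For reference, the two schema versions look roughly like this on the Avro
side (the record and field names here are made up, not the real ones):

    import org.apache.avro.Schema;

    public class SchemaVersions {
      // v1 of the record: field "a" is a (nullable) int
      static final Schema V1 = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
          + "{\"name\":\"a\",\"type\":[\"null\",\"int\"],\"default\":null}]}");

      // v2 of the record: the same field widened to a long
      static final Schema V2 = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
          + "{\"name\":\"a\",\"type\":[\"null\",\"long\"],\"default\":null}]}");
    }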

thanks,
Wei

Re: Met a schema problem when using AvroParquetInputFormat

Posted by Ryan Blue <bl...@cloudera.com>.
Wei,

I think the best practice is to have an overall schema for the data that
can be satisfied by all of the currently-written file schemas. For
example, you'd read the column with a long schema, which can handle both
ints and longs in the data. Ints just get promoted when reading.
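
Here's a minimal sketch of what that could look like in the MapReduce
driver, assuming parquet-avro 1.6.x (package parquet.avro) and its
AvroParquetInputFormat.setAvroReadSchema helper; the record and field
names are hypothetical:

    import org.apache.avro.Schema;
    import org.apache.hadoop.mapreduce.Job;

    import parquet.avro.AvroParquetInputFormat;

    public class LongReadSchemaSetup {
      public static void configure(Job job) {
        // Overall read schema: "a" is declared as a (nullable) long, so files
        // whose footers say int32 get promoted as their records are read.
        Schema readSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
            + "{\"name\":\"a\",\"type\":[\"null\",\"long\"],\"default\":null}]}");

        job.setInputFormatClass(AvroParquetInputFormat.class);
        AvroParquetInputFormat.setAvroReadSchema(job, readSchema);
      }
    }

Parquet-avro then resolves that read schema against each file's own schema,
so the int files and the long files can be read in the same job.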

How would merging the schemas help? Hive should do the same resolution 
that I'm talking about here, but should use the current table definition 
to generate its expected schema. Spark SQL might be relying on this, 
which I'll follow up on with the Spark community.

rb

On 05/11/2015 10:52 AM, Wei Yan wrote:
> Thanks for the update, Ryan.
> Yes, I found this info in https://issues.apache.org/jira/browse/PARQUET-139,
> which avoids merging the schemas on the client side.
>
> As for schema merging, is there a plan to define rules for merging schemas,
> like merging an "int" field and a "long" field into a "long" field? I ask
> because we have some Parquet files written with different schemas, due to
> some **historical** reasons. Allowing this kind of merging would help a lot
> when we process the data. Besides MapReduce applications, we also hit this
> schema problem when loading the data with Hive and Spark SQL.
>
> -Wei



-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Met a schema problem when using AvroParquetInputFormat

Posted by Wei Yan <yw...@gmail.com>.
Thanks for the update, Ryan.
Yes, I found this info in https://issues.apache.org/jira/browse/PARQUET-139,
which avoids merging the schemas on the client side.

As for schema merging, is there a plan to define rules for merging schemas,
like merging an "int" field and a "long" field into a "long" field? I ask
because we have some Parquet files written with different schemas, due to
some **historical** reasons. Allowing this kind of merging would help a lot
when we process the data. Besides MapReduce applications, we also hit this
schema problem when loading the data with Hive and Spark SQL.

-Wei

On Mon, May 11, 2015 at 10:45 AM, Ryan Blue <bl...@cloudera.com> wrote:

> To follow up, I think the problem here was that we were merging two
> Parquet schemas. We don't really have rules for merging schemas and we
> don't really need them. 1.6.0 works because we resolve the expected schema
> with each file schema individually.
>
> This will still be a problem if you use client-side metadata instead of
> task-side.
>
> rb
>
>
> On 05/06/2015 08:43 PM, Alex Levenson wrote:
>
>> Glad that worked!
>>
>> On Wed, May 6, 2015 at 6:42 PM, Wei Yan <yw...@gmail.com> wrote:
>>
>>> Thanks, Alex.
>>> The new version solves the issue.
>>>
>>> -Wei
>>>
>>> On Tue, May 5, 2015 at 8:20 PM, Alex Levenson <
>>> alexlevenson@twitter.com.invalid> wrote:
>>>
>>>> 1.6.0rc1 is pretty old, have you tried with 1.6.0 ?
>>>>
>>>> On Tue, May 5, 2015 at 9:31 AM, Wei Yan <yw...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've hit a problem using AvroParquetInputFormat in my MapReduce job.
>>>>> The input files were written with two different versions of the schema:
>>>>> one field is "int" in v1 but "long" in v2. The exception:
>>>>>
>>>>> Exception in thread "main"
>>>>> parquet.schema.IncompatibleSchemaModificationException: can not merge type
>>>>> optional int32 a into optional int64 a
>>>>> at parquet.schema.PrimitiveType.union(PrimitiveType.java:513)
>>>>> at parquet.schema.GroupType.mergeFields(GroupType.java:359)
>>>>> at parquet.schema.GroupType.union(GroupType.java:341)
>>>>> at parquet.schema.GroupType.mergeFields(GroupType.java:359)
>>>>> at parquet.schema.MessageType.union(MessageType.java:138)
>>>>> at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:497)
>>>>> at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:470)
>>>>> at parquet.hadoop.ParquetFileWriter.getGlobalMetaData(ParquetFileWriter.java:446)
>>>>> at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:429)
>>>>> at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:412)
>>>>> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
>>>>>
>>>>> I'm using Parquet 1.5, and it looks like "int" cannot be merged with "long".
>>>>> I tried 1.6.0rc1 and set "parquet.strict.typing", but that didn't help.
>>>>>
>>>>> Is there any way to solve this, such as automatically promoting "int" to
>>>>> "long", instead of rewriting all the data with the same schema version?
>>>>>
>>>>> thanks,
>>>>> Wei
>>>>>
>>>>
>>>>
>>>> --
>>>> Alex Levenson
>>>> @THISWILLWORK
>>>>
>>>>
>>>
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: Met a schema problem when using AvroParquetInputFormat

Posted by Ryan Blue <bl...@cloudera.com>.
To follow up, I think the problem here was that we were merging two 
Parquet schemas. We don't really have rules for merging schemas and we 
don't really need them. 1.6.0 works because we resolve the expected 
schema with each file schema individually.

This will still be a problem if you use client-side metadata instead of 
task-side.
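
If you want to make sure of that in a job, here's a minimal sketch (this
assumes the "parquet.task.side.metadata" flag added by PARQUET-139; as far
as I know it already defaults to true in 1.6.0, so you'd only need this if
something in your setup turned it off):

    import org.apache.hadoop.conf.Configuration;

    public class TaskSideMetadata {
      public static void enable(Configuration conf) {
        // Read footers (and resolve each file's schema) in the tasks instead of
        // merging all file schemas on the client while computing splits.
        conf.setBoolean("parquet.task.side.metadata", true);
      }
    }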

rb

On 05/06/2015 08:43 PM, Alex Levenson wrote:
> Glad that worked!
>
> On Wed, May 6, 2015 at 6:42 PM, Wei Yan <yw...@gmail.com> wrote:
>
>> Thanks, Alex.
>> The new version solves the issue.
>>
>> -Wei
>>
>> On Tue, May 5, 2015 at 8:20 PM, Alex Levenson <
>> alexlevenson@twitter.com.invalid> wrote:
>>
>>> 1.6.0rc1 is pretty old, have you tried with 1.6.0 ?
>>>
>>> On Tue, May 5, 2015 at 9:31 AM, Wei Yan <yw...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've hit a problem using AvroParquetInputFormat in my MapReduce job.
>>>> The input files were written with two different versions of the schema:
>>>> one field is "int" in v1 but "long" in v2. The exception:
>>>>
>>>> Exception in thread "main"
>>>> parquet.schema.IncompatibleSchemaModificationException: can not merge type
>>>> optional int32 a into optional int64 a
>>>> at parquet.schema.PrimitiveType.union(PrimitiveType.java:513)
>>>> at parquet.schema.GroupType.mergeFields(GroupType.java:359)
>>>> at parquet.schema.GroupType.union(GroupType.java:341)
>>>> at parquet.schema.GroupType.mergeFields(GroupType.java:359)
>>>> at parquet.schema.MessageType.union(MessageType.java:138)
>>>> at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:497)
>>>> at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:470)
>>>> at parquet.hadoop.ParquetFileWriter.getGlobalMetaData(ParquetFileWriter.java:446)
>>>> at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:429)
>>>> at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:412)
>>>> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
>>>>
>>>> I'm using Parquet 1.5, and it looks like "int" cannot be merged with "long".
>>>> I tried 1.6.0rc1 and set "parquet.strict.typing", but that didn't help.
>>>>
>>>> Is there any way to solve this, such as automatically promoting "int" to
>>>> "long", instead of rewriting all the data with the same schema version?
>>>>
>>>> thanks,
>>>> Wei
>>>>
>>>
>>>
>>>
>>> --
>>> Alex Levenson
>>> @THISWILLWORK
>>>
>>
>
>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Met a schema problem when using AvroParquetInputFormat

Posted by Alex Levenson <al...@twitter.com.INVALID>.
Glad that worked!

On Wed, May 6, 2015 at 6:42 PM, Wei Yan <yw...@gmail.com> wrote:

> Thanks, Alex.
> The new version solves the issue.
>
> -Wei
>
> On Tue, May 5, 2015 at 8:20 PM, Alex Levenson <
> alexlevenson@twitter.com.invalid> wrote:
>
> > 1.6.0rc1 is pretty old, have you tried with 1.6.0 ?
> >
> > On Tue, May 5, 2015 at 9:31 AM, Wei Yan <yw...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I've hit a problem using AvroParquetInputFormat in my MapReduce job.
> > > The input files were written with two different versions of the schema:
> > > one field is "int" in v1 but "long" in v2. The exception:
> > >
> > > Exception in thread "main"
> > > parquet.schema.IncompatibleSchemaModificationException: can not merge type
> > > optional int32 a into optional int64 a
> > > at parquet.schema.PrimitiveType.union(PrimitiveType.java:513)
> > > at parquet.schema.GroupType.mergeFields(GroupType.java:359)
> > > at parquet.schema.GroupType.union(GroupType.java:341)
> > > at parquet.schema.GroupType.mergeFields(GroupType.java:359)
> > > at parquet.schema.MessageType.union(MessageType.java:138)
> > > at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:497)
> > > at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:470)
> > > at parquet.hadoop.ParquetFileWriter.getGlobalMetaData(ParquetFileWriter.java:446)
> > > at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:429)
> > > at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:412)
> > > at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
> > >
> > > I'm using Parquet 1.5, and it looks like "int" cannot be merged with "long".
> > > I tried 1.6.0rc1 and set "parquet.strict.typing", but that didn't help.
> > >
> > > Is there any way to solve this, such as automatically promoting "int" to
> > > "long", instead of rewriting all the data with the same schema version?
> > >
> > > thanks,
> > > Wei
> > >
> >
> >
> >
> > --
> > Alex Levenson
> > @THISWILLWORK
> >
>



-- 
Alex Levenson
@THISWILLWORK

Re: Met a schema problem when using AvroParquetInputFormat

Posted by Wei Yan <yw...@gmail.com>.
Thanks, Alex.
The new version solves the issue.

-Wei

On Tue, May 5, 2015 at 8:20 PM, Alex Levenson <
alexlevenson@twitter.com.invalid> wrote:

> 1.6.0rc1 is pretty old, have you tried with 1.6.0 ?
>
> On Tue, May 5, 2015 at 9:31 AM, Wei Yan <yw...@gmail.com> wrote:
>
> > Hi,
> >
> > I've hit a problem using AvroParquetInputFormat in my MapReduce job.
> > The input files were written with two different versions of the schema:
> > one field is "int" in v1 but "long" in v2. The exception:
> >
> > Exception in thread "main"
> > parquet.schema.IncompatibleSchemaModificationException: can not merge type
> > optional int32 a into optional int64 a
> > at parquet.schema.PrimitiveType.union(PrimitiveType.java:513)
> > at parquet.schema.GroupType.mergeFields(GroupType.java:359)
> > at parquet.schema.GroupType.union(GroupType.java:341)
> > at parquet.schema.GroupType.mergeFields(GroupType.java:359)
> > at parquet.schema.MessageType.union(MessageType.java:138)
> > at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:497)
> > at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:470)
> > at parquet.hadoop.ParquetFileWriter.getGlobalMetaData(ParquetFileWriter.java:446)
> > at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:429)
> > at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:412)
> > at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
> >
> > I'm using Parquet 1.5, and it looks like "int" cannot be merged with "long".
> > I tried 1.6.0rc1 and set "parquet.strict.typing", but that didn't help.
> >
> > Is there any way to solve this, such as automatically promoting "int" to
> > "long", instead of rewriting all the data with the same schema version?
> >
> > thanks,
> > Wei
> >
>
>
>
> --
> Alex Levenson
> @THISWILLWORK
>

Re: Met a schema problem when using AvroParquetInputFormat

Posted by Alex Levenson <al...@twitter.com.INVALID>.
1.6.0rc1 is pretty old, have you tried with 1.6.0 ?

On Tue, May 5, 2015 at 9:31 AM, Wei Yan <yw...@gmail.com> wrote:

> Hi,
>
> I've hit a problem using AvroParquetInputFormat in my MapReduce job.
> The input files were written with two different versions of the schema:
> one field is "int" in v1 but "long" in v2. The exception:
>
> Exception in thread "main"
> parquet.schema.IncompatibleSchemaModificationException: can not merge type
> optional int32 a into optional int64 a
> at parquet.schema.PrimitiveType.union(PrimitiveType.java:513)
> at parquet.schema.GroupType.mergeFields(GroupType.java:359)
> at parquet.schema.GroupType.union(GroupType.java:341)
> at parquet.schema.GroupType.mergeFields(GroupType.java:359)
> at parquet.schema.MessageType.union(MessageType.java:138)
> at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:497)
> at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:470)
> at parquet.hadoop.ParquetFileWriter.getGlobalMetaData(ParquetFileWriter.java:446)
> at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:429)
> at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:412)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
>
> I'm using Parquet 1.5, and it looks like "int" cannot be merged with "long".
> I tried 1.6.0rc1 and set "parquet.strict.typing", but that didn't help.
>
> Is there any way to solve this, such as automatically promoting "int" to
> "long", instead of rewriting all the data with the same schema version?
>
> thanks,
> Wei
>



-- 
Alex Levenson
@THISWILLWORK