Posted to dev@parquet.apache.org by Ronnie Huang <dr...@outlook.com> on 2019/07/17 02:39:39 UTC

[Question] Change Column Type in Parquet File

Hi Parquet Devs,

Our team is working on changing userid from int to bigint across our whole Hadoop system. It's easy for us to quickly refresh non-partitioned tables; however, many partitioned tables have huge partition files. We are trying to find a quick way to change the data type without refreshing partitions one by one. That's why I'm sending you this email.

I took a look at https://github.com/apache/parquet-format to understand the Parquet format, but I'm still confused about the metadata, so I have the following questions:

  1.  If I want to change one column's type, do I need to change it in both the file metadata and the column (chunk) metadata, or am I missing anything?
  2.  If I change one column's type from int32 to int64 directly in the file metadata and column (chunk) metadata, can the compressed data still be read correctly? If not, what is the problem? (A rough sketch of how I have been inspecting that metadata is below.)
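
For context, here is roughly how I have been looking at the file metadata and column (chunk) metadata so far. This is only a minimal pyarrow sketch; the file name is just a placeholder:

    # Minimal sketch for inspecting Parquet file/column-chunk metadata.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("part-00000.parquet")   # hypothetical file name
    print(pf.schema)                            # schema with physical types, e.g. INT32
    meta = pf.metadata                          # file metadata from the footer
    col = meta.row_group(0).column(0)           # first column chunk of the first row group
    print(col.path_in_schema, col.physical_type, col.total_compressed_size)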

Thank you so much for your time; we would appreciate it if you could reply.

Best Regards,
Ronnie



Re: [Question] Change Column Type in Parquet File

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I agree with what Tim said. Parquet is an immutable format because it is
designed for append-only file systems like HDFS or object stores like S3.
Reading existing data into a compatible type is a higher-level concern
because if you only have one file, you only have one schema. But when you
have a collection of files that you manage as a table, you have to handle
type coercion to the current table schema. Some of the systems Tim mentioned
handle this for Hive tables, and it is formalized in the [Iceberg table
spec](http://iceberg.apache.org/spec/). I recommend using one of those to
manage your collection of data files so that you get these features and
don't have to worry about it.
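
To make that concrete, an int-to-long promotion in Iceberg is a metadata-only schema update. A rough sketch through Spark SQL follows; the catalog, database, table, and column names are made up, and it assumes the Iceberg runtime and a catalog named "prod" are already configured for the session:

    from pyspark.sql import SparkSession

    # Sketch only: the Iceberg runtime jar and the "prod" catalog are assumed
    # to be configured elsewhere; the names below are placeholders.
    spark = (SparkSession.builder
             .appName("userid-int-to-bigint")
             .config("spark.sql.extensions",
                     "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
             .getOrCreate())

    # int -> long is an allowed type promotion in the Iceberg spec, so this is
    # a metadata-only change; existing int32 data files are not rewritten.
    spark.sql("ALTER TABLE prod.db.events ALTER COLUMN userid TYPE bigint")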

rb

-- 
Ryan Blue
Software Engineer
Netflix

Re: [Question] Change Column Type in Parquet File

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
You're welcome; it's always nice to hear that we're able to help out.


Re: [Question] Change Column Type in Parquet File

Posted by Ronnie Huang <dr...@outlook.com>.
Hi Tim,

You have been really helpful.

I did some testing in Impala 3.2 and Hive 2.0, and they both worked fine. Our platform team is planning to upgrade Impala and Hive to fix this. We will only need to update the metadata after the engines are upgraded.
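
Roughly what the check looked like on the Impala side, as a sketch via impyla (the host, port, and database/table names below are placeholders):

    from impala.dbapi import connect

    # Placeholder host; 21050 is the usual HiveServer2-protocol port on impalad.
    conn = connect(host="impalad.example.com", port=21050)
    cur = conn.cursor()

    # Pick up the column type change that was made through the Hive metastore.
    cur.execute("INVALIDATE METADATA mydb.events")

    # userid should now show up as bigint, and old int32 files should still read.
    cur.execute("DESCRIBE mydb.events")
    print(cur.fetchall())
    cur.execute("SELECT userid FROM mydb.events LIMIT 5")
    print(cur.fetchall())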

Thanks a lot, and I wish you a nice day.

Best Regards,
Ronnie

Re: [Question] Change Column Type in Parquet File

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
I think generally the best solution, if it's supported by the tools you're
using, is to do schema evolution by *not* rewriting the files and just
updating the metadata, and to rely on the engine that's querying the table to
promote the int32 to int64 when the Parquet file has an int32 but the Hive
schema has an int64.

For example, support has been added in Impala and Hive:
https://issues.apache.org/jira/browse/HIVE-12080,
https://issues.apache.org/jira/browse/IMPALA-6373. I'm not sure about other
engines.

Generally, Parquet is not designed to support modifying files in place; if
you want to change a file's schema, you would regenerate the file.
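
To make the metadata-only route concrete, here is a rough sketch that drives HiveServer2 from Python via PyHive; the host and the database/table/column names are placeholders, and CASCADE pushes the new type into the metadata of existing partitions as well:

    from pyhive import hive

    # Placeholder connection details.
    conn = hive.connect(host="hiveserver2.example.com", port=10000, database="mydb")
    cur = conn.cursor()

    # CASCADE also updates the column type in existing partition metadata; the
    # Parquet files themselves (still int32 on disk) are not touched.
    cur.execute("ALTER TABLE events CHANGE COLUMN userid userid BIGINT CASCADE")

    # With HIVE-12080 / IMPALA-6373 style support in the engine, old int32 files
    # are promoted to bigint when read through the new schema.
    cur.execute("SELECT userid FROM events LIMIT 5")
    print(cur.fetchall())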
