
Posted to dev@orc.apache.org by Owen O'Malley <ow...@gmail.com> on 2017/08/04 16:29:03 UTC

[DISCUSS] ORC 2.0

All,
  We've started the process of updating the encodings for ORC. These
changes are going to extend the format in ways that aren't forward
compatible. (E.g., the ORC 1.4 readers won't be able to read the new format.)

The changes that I've heard about are:
* Decimal encoding - this will likely be separated into two categories
   + precision <= 18
   + precision > 18
  In both cases the precision and scale will be fixed for the entire file
rather than per value.
* a new Float/Double encoding
* a new RLE encoding
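
As a rough illustration of why 18 is a natural cut-off for the decimal split:
every decimal with precision <= 18 has an unscaled value below 10^18, which
fits in a signed 64-bit long, so with one file-wide scale each value could be
stored as a bare long. The Java sketch below only illustrates that idea. The
class and method names are hypothetical; they are not part of the ORC API and
do not describe the encoding that was eventually chosen.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical sketch, not ORC code: splits decimals by precision, assuming
// a single fixed scale for the whole file.
final class FixedScaleDecimalSketch {
    static final int MAX_LONG_PRECISION = 18;

    // True when every value of this precision fits in a signed 64-bit long.
    static boolean fitsInLong(int precision) {
        return precision <= MAX_LONG_PRECISION;
    }

    // Rescales the value to the file-wide scale and returns the unscaled long.
    // Throws ArithmeticException if rounding would be needed or if the
    // unscaled value does not fit in a long.
    static long encodeShort(BigDecimal value, int fileScale) {
        return value.setScale(fileScale).unscaledValue().longValueExact();
    }

    // Wide decimals (precision > 18) keep an arbitrary-size unscaled value.
    static BigInteger encodeWide(BigDecimal value, int fileScale) {
        return value.setScale(fileScale).unscaledValue();
    }

    // Rebuilds the decimal from the stored long and the file-wide scale.
    static BigDecimal decodeShort(long unscaled, int fileScale) {
        return BigDecimal.valueOf(unscaled, fileScale);
    }

    public static void main(String[] args) {
        long stored = encodeShort(new BigDecimal("123.4"), 2);
        System.out.println(stored);                  // 12340
        System.out.println(decodeShort(stored, 2));  // 123.40
    }
}
```

Because the scale is stored once rather than per value, the per-value payload
in this sketch is just the unscaled integer.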

Are there other encodings that we should consider adding?

We haven't made forward incompatible changes in a while. Currently the ORC
Writer can write either:
 * Hive 0.11 ORC files
 * Hive 0.12 ORC files

So I'd like to propose that we add a new ORC 2.0 file version and all of
these changes need to be so tagged.

The new ORC writers will maintain the ability to write the old versions of
the files (Hive 0.11 ORC and Hive 0.12 ORC) as well as the ORC 2.0 files.
The new reader will automatically read all three versions.
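
The version tagging described above could be modelled roughly as follows. This
is a standalone Java sketch under stated assumptions: the enum, its labels and
its (major, minor) pairs are hypothetical stand-ins for illustration and are
not the real OrcFile.Version enum.

```java
import java.util.Optional;

// Hypothetical model of the three file versions discussed above.
enum FileVersionSketch {
    HIVE_0_11("0.11", 0, 11),
    HIVE_0_12("0.12", 0, 12),
    ORC_2_0("2.0", 2, 0);

    final String label;
    final int major;
    final int minor;

    FileVersionSketch(String label, int major, int minor) {
        this.label = label;
        this.major = major;
        this.minor = minor;
    }

    // Writer side: resolve the configured label; empty if it is unknown.
    static Optional<FileVersionSketch> byLabel(String label) {
        for (FileVersionSketch v : values()) {
            if (v.label.equals(label)) {
                return Optional.of(v);
            }
        }
        return Optional.empty();
    }

    // Reader side: a file is readable when its recorded version is known.
    static boolean canRead(int major, int minor) {
        for (FileVersionSketch v : values()) {
            if (v.major == major && v.minor == minor) {
                return true;
            }
        }
        return false;
    }

    // Only files tagged as 2.0 may use the forward-incompatible encodings.
    boolean usesNewEncodings() {
        return this == ORC_2_0;
    }

    public static void main(String[] args) {
        System.out.println(byLabel("0.12").orElse(null));  // HIVE_0_12
        System.out.println(canRead(2, 0));                 // true
    }
}
```

Keeping the new encodings behind a single version check is what lets one
writer produce all three kinds of file.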

Thoughts?

  Owen

Re: [DISCUSS] ORC 2.0

Posted by Gopal Vijayaraghavan <go...@apache.org>.

Hi,

> My intention is that we can iterate on the UNSTABLE-PRE-2.0 format without
> cross-version compatibility. It will only be used for developer testing. 

Sounds good - I tested that Hive can communicate this to ORC correctly.

set hive.exec.orc.write.format="UNSTABLE-PRE-2.0";

offers very loosely coupled connectivity for the new features being tested.

Cheers,
Gopal

Re: [DISCUSS] ORC 2.0

Posted by Owen O'Malley <ow...@gmail.com>.

Ok, I created ORC-229 https://issues.apache.org/jira/browse/ORC-229 so that
we'll have a new OrcFile.Version of UNSTABLE-PRE-2.0. If you look at the
associated pull request, you can see the comments in the code are pretty
clear that users should stay away. I also added a logged warning when the
writer uses that version.

My intention is that we can iterate on the UNSTABLE-PRE-2.0 format without
cross-version compatibility. It will only be used for developer testing. As
part of the ORC 2.0 release, we can delete that version and move to a new
2.0 version.

Thoughts?

.. Owen

On Tue, Aug 8, 2017 at 12:13 AM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> Hi,
>
> > > Let me make sure I have the backwards compatibility straight.  If a user
> > > switches to ORC 2.0, he could choose to continue writing in older formats
> > > so that his old tools could read it
> >
> >    Yes, exactly.
>
> To chime in on Owen's point, the development process has a slight wrinkle
> in it, which we avoided in the 0.11 -> 0.12 migration due to ORC being
> embedded in Hive.
>
> The feature addition is two-fold - the new features are available only
> when a user flips the writer versions.
>
> There is no feature flag for reader versions, so the readers have to keep
> up to date with the writer changes (or just fail for the "blackholed" ones,
> with good errors).
>
> Due to the split between projects, I expect to see a two-step development
> cycle, to clean up the integration pathways before the ABI is frozen in 2.0.
>
> The entire process can be gated on the writer version - during the
> development process, there will be an experimental version (1.5?) and a
> stable version.
>
> I have no interest in ever supporting an actual 1.5 version data setup in
> ORC, but for the sake of integration testing the 1.5->2.0 writer versions
> are extremely useful stepping stones towards a multi-project dependency
> like ORC.
>
> Once the integrations are all complete and the format can be frozen, ORC
> 2.0 releases can still disable the default writer version from being
> upgraded for another stable release.
>
> After the ecosystem has had all its upgrades, the default version gets
> flipped to 2.0. The ability to write 0.12 files will still remain as an
> option, and all intermediate reader versions will get dropped.
>
> That's a bit more complicated than being part of Hive and sync'ing
> releases, but I think this gives ORC the flexibility to accept
> contributions from a wide community, supporting multi-project release
> timelines, without leaving the implementation full of reader
> implementations for many writer versions.
>
> Cheers,
> Gopal
>
>
>

Re: [DISCUSS] ORC 2.0

Posted by Gopal Vijayaraghavan <go...@apache.org>.

Hi,

> > Let me make sure I have the backwards compatibility straight.  If a user
> > switches to ORC 2.0, he could choose to continue writing in older formats
> > so that his old tools could read it
>
>    Yes, exactly.

To chime in on Owen's point, the development process has a slight wrinkle in it, which we avoided in the 0.11 -> 0.12 migration due to ORC being embedded in Hive.

The feature addition is two-fold - the new features are available only when a user flips the writer versions.

There is no feature flag for reader versions, so the readers have to keep up to date with the writer changes (or just fail for the "blackholed" ones, with good errors).
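
A reader that "fails with good errors" might look something like the Java
sketch below. It is only an illustration: the class, the method, the list of
supported versions and the file names are all hypothetical, and none of this
is actual ORC reader code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a reader refusing files from unknown writer versions.
final class ReaderVersionGateSketch {
    // Versions this imaginary reader build understands.
    static final List<String> SUPPORTED = Arrays.asList("0.11", "0.12", "2.0");

    // Throws with a message naming the file, its version, and what is supported.
    static void checkReadable(String path, String fileVersion) {
        if (!SUPPORTED.contains(fileVersion)) {
            throw new UnsupportedOperationException(
                "Cannot read " + path + ": file was written as version "
                + fileVersion + ", but this reader supports only " + SUPPORTED);
        }
    }

    public static void main(String[] args) {
        checkReadable("a.orc", "0.12");  // known version, passes silently
        try {
            checkReadable("b.orc", "1.5");  // intermediate version, rejected
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the explicit message is that a user with an older reader learns
which version the file needs instead of hitting an opaque decoding failure.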

Due to the split between projects, I expect to see a two-step development cycle, to clean up the integration pathways before the ABI is frozen in 2.0.

The entire process can be gated on the writer version - during the development process, there will be an experimental version (1.5?) and a stable version.

I have no interest in ever supporting an actual 1.5 version data setup in ORC, but for the sake of integration testing the 1.5->2.0 writer versions are extremely useful stepping stones towards a multi-project dependency like ORC.

Once the integrations are all complete and the format can be frozen, ORC 2.0 releases can still disable the default writer version from being upgraded for another stable release.

After the ecosystem has had all its upgrades, the default version gets flipped to 2.0. The ability to write 0.12 files will still remain as an option, and all intermediate reader versions will get dropped.

That's a bit more complicated than being part of Hive and sync'ing releases, but I think this gives ORC the flexibility to accept contributions from a wide community, supporting multi-project release timelines, without leaving the implementation full of reader implementations for many writer versions.

Cheers,
Gopal

Re: [DISCUSS] ORC 2.0

Posted by Alan Gates <al...@gmail.com>.

It seems ok to change them in a source and binary compatible way.  +1 to
making upgrades as easy as switching out the jars at runtime.

Alan.

On Fri, Aug 4, 2017 at 3:00 PM, Dain Sundstrom <da...@iq80.com> wrote:

>
> > On Aug 4, 2017, at 2:51 PM, Owen O'Malley <ow...@gmail.com> wrote:
> >
> > On Fri, Aug 4, 2017 at 12:15 PM, Alan Gates <al...@gmail.com> wrote:
> >
> >> Let me make sure I have the backwards compatibility straight.  If a user
> >> switches to ORC 2.0, he could choose to continue writing in older formats
> >> so that his old tools could read it.  Then once all his tools are upgraded
> >> he could throw a config switch and new data would be written in the new
> >> format.  Once that switch was thrown, any pre-ORC 2.0 tools would be
> >> unusable.  Before throwing that switch, he would get none of the benefits
> >> of ORC 2.0.  Is this summary correct?
> >>
> >
> > Yes, exactly.
>
> I think the important part is not to change the APIs so tools can be
> updated by just upgrading the dep.
>
> -dain

Re: [DISCUSS] ORC 2.0

Posted by Dain Sundstrom <da...@iq80.com>.

> On Aug 4, 2017, at 2:51 PM, Owen O'Malley <ow...@gmail.com> wrote:
> 
> On Fri, Aug 4, 2017 at 12:15 PM, Alan Gates <al...@gmail.com> wrote:
> 
>> Let me make sure I have the backwards compatibility straight.  If a user
>> switches to ORC 2.0, he could choose to continue writing in older formats
>> so that his old tools could read it.  Then once all his tools are upgraded
>> he could throw a config switch and new data would be written in the new
>> format.  Once that switch was thrown, any pre-ORC 2.0 tools would be
>> unusable.  Before throwing that switch, he would get none of the benefits
>> of ORC 2.0.  Is this summary correct?
>> 
> 
> Yes, exactly.

I think the important part is not to change the APIs so tools can be updated by just upgrading the dep.

-dain

Re: [DISCUSS] ORC 2.0

Posted by Owen O'Malley <ow...@gmail.com>.

On Fri, Aug 4, 2017 at 12:15 PM, Alan Gates <al...@gmail.com> wrote:

> Let me make sure I have the backwards compatibility straight.  If a user
> switches to ORC 2.0, he could choose to continue writing in older formats
> so that his old tools could read it.  Then once all his tools are upgraded
> he could throw a config switch and new data would be written in the new
> format.  Once that switch was thrown, any pre-ORC 2.0 tools would be
> unusable.  Before throwing that switch, he would get none of the benefits
> of ORC 2.0.  Is this summary correct?
>

Yes, exactly.


>
> If so, I agree we should do this.  The list of potential benefits for
> performance and space efficiency is compelling.  And the long lag for users
> with many old tools to upgrade will never get better.
>
> Alan.
>
> On Fri, Aug 4, 2017 at 9:29 AM, Owen O'Malley <ow...@gmail.com>
> wrote:
>
> > All,
> >   We've started the process of updating the encodings for ORC. These
> > changes are going to extend the format in ways that aren't forward
> > compatible. (E.g., the ORC 1.4 readers won't be able to read the new format.)
> >
> > The changes that I've heard about are:
> > * Decimal encoding - this will likely be separated into two categories
> >    + precision <= 18
> >    + precision > 18
> >   In both cases the precision and scale will be fixed for the entire file
> > rather than per value.
> > * a new Float/Double encoding
> > * a new RLE encoding
> >
> > Are there other encodings that we should consider adding?
> >
> > We haven't made forward incompatible changes in a while. Currently the ORC
> > Writer can write either:
> >  * Hive 0.11 ORC files
> >  * Hive 0.12 ORC files
> >
> > So I'd like to propose that we add a new ORC 2.0 file version and all of
> > these changes need to be so tagged.
> >
> > The new ORC writers will maintain the ability to write the old versions of
> > the files (Hive 0.11 ORC and Hive 0.12 ORC) as well as the ORC 2.0 files.
> > The new reader will automatically read all three versions.
> >
> > Thoughts?
> >
> >   Owen
> >
>

Re: [DISCUSS] ORC 2.0

Posted by Alan Gates <al...@gmail.com>.

Let me make sure I have the backwards compatibility straight.  If a user
switches to ORC 2.0, he could choose to continue writing in older formats
so that his old tools could read it.  Then once all his tools are upgraded
he could throw a config switch and new data would be written in the new
format.  Once that switch was thrown, any pre-ORC 2.0 tools would be
unusable.  Before throwing that switch, he would get none of the benefits
of ORC 2.0.  Is this summary correct?

If so, I agree we should do this.  The list of potential benefits for
performance and space efficiency is compelling.  And the long lag for users
with many old tools to upgrade will never get better.

Alan.

On Fri, Aug 4, 2017 at 9:29 AM, Owen O'Malley <ow...@gmail.com>
wrote:

> All,
>   We've started the process of updating the encodings for ORC. These
> changes are going to extend the format in ways that aren't forward
> compatible. (E.g., the ORC 1.4 readers won't be able to read the new format.)
>
> The changes that I've heard about are:
> * Decimal encoding - this will likely be separated into two categories
>    + precision <= 18
>    + precision > 18
>   In both cases the precision and scale will be fixed for the entire file
> rather than per value.
> * a new Float/Double encoding
> * a new RLE encoding
>
> Are there other encodings that we should consider adding?
>
> We haven't made forward incompatible changes in a while. Currently the ORC
> Writer can write either:
>  * Hive 0.11 ORC files
>  * Hive 0.12 ORC files
>
> So I'd like to propose that we add a new ORC 2.0 file version and all of
> these changes need to be so tagged.
>
> The new ORC writers will maintain the ability to write the old versions of
> the files (Hive 0.11 ORC and Hive 0.12 ORC) as well as the ORC 2.0 files.
> The new reader will automatically read all three versions.
>
> Thoughts?
>
>   Owen
>

Re: [DISCUSS] ORC 2.0

Posted by Dain Sundstrom <da...@iq80.com>.

+1 to all of the ideas

If we are cool with incompatible changes…
 * Allow dictionary for VARBINARY
 * Disallow old encodings in new files (e.g., no v1)
 * Fix DATE encoding epoch
 * Rearrange stripe so index is next to footer so a single IOP can get all data
 * Change metastore properties so there is a logical mapping from column names to physical column identifiers so columns can be renamed
 * New timestamp encoding with fixed size per file, similar to decimal
 * For compression like zstd, we may want to ship a compression dictionary for a stream

Stuff we could do today
 * A flag that says if CHAR or VARCHAR contain any multi-byte characters (isAsciiOnly)
 * Max character count for CHAR or VARCHAR (so we don’t need to check length for schema changes)
 * Max length for VARBINARY (easier to estimate memory usage)
 * Truncated MIN/MAX for VARBINARY/CHAR/VARCHAR
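
Two of the statistics ideas above, sketched in Java purely for illustration
(all names here are hypothetical and nothing below is ORC code). A truncated
minimum can simply be a prefix, since a prefix never sorts after the full
value. A truncated maximum has to be bumped upward so it still bounds the
original. The sketch compares values as unsigned bytes; for CHAR/VARCHAR a
real implementation would also need to keep the bound valid UTF-8, which this
sketch does not attempt.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of truncated min/max statistics and an ASCII-only flag.
final class TruncatedStatsSketch {
    // Lower bound: keep at most maxLen leading bytes.
    static byte[] truncateMin(byte[] value, int maxLen) {
        return value.length <= maxLen ? value.clone() : Arrays.copyOf(value, maxLen);
    }

    // Upper bound: within the first maxLen bytes, increment the last byte that
    // is not 0xFF and drop everything after it. Returns null when every byte
    // in the prefix is 0xFF, i.e. no shorter upper bound exists.
    static byte[] truncateMax(byte[] value, int maxLen) {
        if (value.length <= maxLen) {
            return value.clone();
        }
        for (int i = maxLen - 1; i >= 0; i--) {
            if ((value[i] & 0xFF) != 0xFF) {
                byte[] bound = Arrays.copyOf(value, i + 1);
                bound[i]++;
                return bound;
            }
        }
        return null;
    }

    // In UTF-8, every byte of a multi-byte character has its high bit set,
    // which makes it negative as a Java byte.
    static boolean isAsciiOnly(byte[] utf8) {
        for (byte b : utf8) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] v = "apple".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(truncateMin(v, 3), StandardCharsets.US_ASCII));  // app
        System.out.println(new String(truncateMax(v, 3), StandardCharsets.US_ASCII));  // apq
    }
}
```

When the ASCII-only flag is true, byte length and character count coincide,
which is what would let a reader skip per-value length checks.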

For the new encodings, we should pick encodings that play well with vectorization, which is coming in Java 10 (Java 9 also has vastly improved auto-vectorization).

-dain

> On Aug 4, 2017, at 9:29 AM, Owen O'Malley <ow...@gmail.com> wrote:
> 
> All,
>  We've started the process of updating the encodings for ORC. These
> changes are going to extend the format in ways that aren't forward
> compatible. (E.g., the ORC 1.4 readers won't be able to read the new format.)
> 
> The changes that I've heard about are:
> * Decimal encoding - this will likely be separated into two categories
>   + precision <= 18
>   + precision > 18
>  In both cases the precision and scale will be fixed for the entire file
> rather than per value.
> * a new Float/Double encoding
> * a new RLE encoding
> 
> Are there other encodings that we should consider adding?
> 
> We haven't made forward incompatible changes in a while. Currently the ORC
> Writer can write either:
> * Hive 0.11 ORC files
> * Hive 0.12 ORC files
> 
> So I'd like to propose that we add a new ORC 2.0 file version and all of
> these changes need to be so tagged.
> 
> The new ORC writers will maintain the ability to write the old versions of
> the files (Hive 0.11 ORC and Hive 0.12 ORC) as well as the ORC 2.0 files.
> The new reader will automatically read all three versions.
> 
> Thoughts?
> 
>  Owen