Posted to dev@orc.apache.org by Matt Burgess <ma...@apache.org> on 2016/07/22 20:21:21 UTC

Complex types in hive-orc

All,

Is this the right place to ask questions about hive-orc? I know it was
split out into Apache ORC, and up until recently I have been using
Apache ORC 1.1.2 to convert Avro files to ORC files, but I was told I
need a version that works with only Hive 1.2.1.

If I should direct this to the Hive list, please let me know; otherwise:

- Are complex types (list, map, struct, union, etc.) supported in
hive-orc 1.2.1? I don't see the ListColumnVector and such types. I
can't bring in that storage-api-2.1.1-pre-orc JAR because of a
conflict with BloomFilter, etc.

- I was using VectorizedRowBatch to write my values in ORC 1.1.2, is
that the correct/recommended approach in 1.2.1? I see Apache Crunch
uses lots of MapReduce types but I would really like to limit the MR
dependencies if possible since my app will not always be on a Hadoop
node.

- Are there any examples of converting Avro to ORC outside of Hive
(but using Avro and hive-orc)? I see a couple of examples of
reading/writing ORC files but nothing with Avro. No worries if not, I
am writing one as part of this effort :)

Thank you in advance,
Matt

Re: Complex types in hive-orc

Posted by Matt Burgess <ma...@gmail.com>.
OK, looks like I'll need to go with the row-by-row API. Just to make
sure I understand correctly, is that the approach Apache Crunch is
using, with ObjectInspectors, Writables / POJOs, etc.?

https://github.com/apache/crunch/blob/master/crunch-hive/src/main/java/org/apache/crunch/types/orc/OrcUtils.java

If not, what is considered the row-by-row API (not using
VectorizedRowBatch or ColumnVectors)?

Thanks again,
Matt
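
For concreteness, the "row-by-row API" here means the Writer/ObjectInspector
path in org.apache.hadoop.hive.ql.io.orc: you hand OrcFile.createWriter() an
ObjectInspector describing your row type and call addRow() once per record.
A minimal sketch against the Hive 1.2.1 classes (the MyRow class, its fields,
and the output path are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Writer;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

    public class RowByRowWriteSketch {
      // Hypothetical row type; the reflection inspector derives the ORC struct from it.
      static class MyRow {
        int id;
        String name;
        MyRow(int id, String name) { this.id = id; this.name = name; }
      }

      public static void main(String[] args) throws Exception {
        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(
            MyRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        Writer writer = OrcFile.createWriter(new Path("/tmp/rows.orc"),
            OrcFile.writerOptions(new Configuration()).inspector(inspector));
        writer.addRow(new MyRow(1, "hello"));   // one call per record
        writer.addRow(new MyRow(2, "world"));
        writer.close();
      }
    }

Crunch's OrcUtils appears to wrap this same Writer/ObjectInspector machinery,
just with Writable-backed rows instead of a reflected POJO.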

On Fri, Jul 22, 2016 at 5:14 PM, Owen O'Malley <om...@apache.org> wrote:
> Hi Matt,
>
> On Fri, Jul 22, 2016 at 1:21 PM, Matt Burgess <ma...@apache.org> wrote:
>
>> All,
>>
>> Is this the right place to ask questions about hive-orc? I know it was
>> split out into Apache ORC, and up until recently I have been using
>> Apache ORC 1.1.2 to convert Avro files to ORC files, but I was told I
>> need a version that works with only Hive 1.2.1.
>>
>
> This works great, although most of the ORC developers read both.
>
>
>> - Are complex types (list, map, struct, union, etc.) supported in
>> hive-orc 1.2.1? I don't see the ListColumnVector and such types.
>
>
>
> Before HIVE-12159, which went into Hive 2.1, the only way to read complex
> types was to use the row by row API.
>
>
>> I
>> can't bring in that storage-api-2.1.1-pre-orc JAR because of a
>> conflict with BloomFilter, etc.
>>
>
> How bad is the breakage? Can we fix it with a patch to ORC?
>
>
>>
>> - I was using VectorizedRowBatch to write my values in ORC 1.1.2, is
>> that the correct/recommended approach in 1.2.1? I see Apache Crunch
>> uses lots of MapReduce types but I would really like to limit the MR
>> dependencies if possible since my app will not always be on a Hadoop
>> node.
>>
>
> Yes, the ORC MapReduce shim uses VectorizedRowBatches and converts them
> into WritableComparables, so it will be fastest if you use
> VectorizedRowBatch directly. Although, as you have discovered, that won't
> work if you are trying to use hive-orc 1.2.
>
>
>> - Are there any examples of converting Avro to ORC outside of Hive
>> (but using Avro and hive-orc)? I see a couple of examples of
>> reading/writing ORC files but nothing with Avro. No worries if not, I
>> am writing one as part of this effort :)
>>
>
> If you look at the benchmarking code in
> https://github.com/apache/orc/pull/43 , you'll see that I took a first stab
> at making an Avro writer that goes from ORC's TypeDescription and a
> VectorizedRowBatch.
>
> .. Owen
>
>
>>
>> Thank you in advance,
>> Matt
>>

Re: Complex types in hive-orc

Posted by Matt Burgess <ma...@gmail.com>.
Thanks for the great info! For the BloomFilter thing, the first/only thing I saw was that addBytes() also needs start and length params in 1.2.1, but later versions take just the column vector and something else as params. Not sure if there are other issues with duplicate classes and such.

Regards,
Matt


> On Jul 22, 2016, at 5:14 PM, Owen O'Malley <om...@apache.org> wrote:
> 
> Hi Matt,
> 
>> On Fri, Jul 22, 2016 at 1:21 PM, Matt Burgess <ma...@apache.org> wrote:
>> 
>> All,
>> 
>> Is this the right place to ask questions about hive-orc? I know it was
>> split out into Apache ORC, and up until recently I have been using
>> Apache ORC 1.1.2 to convert Avro files to ORC files, but I was told I
>> need a version that works with only Hive 1.2.1.
> 
> This works great, although most of the ORC developers read both.
> 
> 
>> - Are complex types (list, map, struct, union, etc.) supported in
>> hive-orc 1.2.1? I don't see the ListColumnVector and such types.
> 
> 
> 
> Before HIVE-12159, which went into Hive 2.1, the only way to read complex
> types was to use the row by row API.
> 
> 
>> I
>> can't bring in that storage-api-2.1.1-pre-orc JAR because of a
>> conflict with BloomFilter, etc.
> 
> How bad is the breakage? Can we fix it with a patch to ORC?
> 
> 
>> 
>> - I was using VectorizedRowBatch to write my values in ORC 1.1.2, is
>> that the correct/recommended approach in 1.2.1? I see Apache Crunch
>> uses lots of MapReduce types but I would really like to limit the MR
>> dependencies if possible since my app will not always be on a Hadoop
>> node.
> 
> Yes, the ORC MapReduce shim uses VectorizedRowBatches and converts them
> into WritableComparables, so it will be fastest if you use
> VectorizedRowBatch directly. Although, as you have discovered, that won't
> work if you are trying to use hive-orc 1.2.
> 
> 
>> - Are there any examples of converting Avro to ORC outside of Hive
>> (but using Avro and hive-orc)? I see a couple of examples of
>> reading/writing ORC files but nothing with Avro. No worries if not, I
>> am writing one as part of this effort :)
> 
> If you look at the benchmarking code in
> https://github.com/apache/orc/pull/43 , you'll see that I took a first stab
> at making an Avro writer that goes from ORC's TypeDescription and a
> VectorizedRowBatch.
> 
> .. Owen
> 
> 
>> 
>> Thank you in advance,
>> Matt
>> 

Re: Complex types in hive-orc

Posted by Owen O'Malley <om...@apache.org>.
Hi Matt,

On Fri, Jul 22, 2016 at 1:21 PM, Matt Burgess <ma...@apache.org> wrote:

> All,
>
> Is this the right place to ask questions about hive-orc? I know it was
> split out into Apache ORC, and up until recently I have been using
> Apache ORC 1.1.2 to convert Avro files to ORC files, but I was told I
> need a version that works with only Hive 1.2.1.
>

This works great, although most of the ORC developers read both.


> - Are complex types (list, map, struct, union, etc.) supported in
> hive-orc 1.2.1? I don't see the ListColumnVector and such types.



Before HIVE-12159, which went into Hive 2.1, the only way to read complex
types was to use the row by row API.
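
As a rough illustration of that row-by-row read path against Hive 1.2.1 (the
file path is hypothetical; complex columns come back as plain Java Lists,
Maps, OrcStruct and OrcUnion values that the file's ObjectInspector can
decode):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Reader;
    import org.apache.hadoop.hive.ql.io.orc.RecordReader;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;

    public class RowByRowReadSketch {
      public static void main(String[] args) throws Exception {
        Reader reader = OrcFile.createReader(new Path("/tmp/rows.orc"),
            OrcFile.readerOptions(new Configuration()));
        StructObjectInspector inspector =
            (StructObjectInspector) reader.getObjectInspector();
        RecordReader rows = reader.rows();
        Object row = null;
        while (rows.hasNext()) {
          row = rows.next(row);   // the previous row object is reused
          // Complex columns come back as Java Lists, Maps, OrcStruct and
          // OrcUnion values; the inspector pulls the fields out generically.
          System.out.println(inspector.getStructFieldsDataAsList(row));
        }
        rows.close();
      }
    }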


> I
> can't bring in that storage-api-2.1.1-pre-orc JAR because of a
> conflict with BloomFilter, etc.
>

How bad is the breakage? Can we fix it with a patch to ORC?


>
> - I was using VectorizedRowBatch to write my values in ORC 1.1.2, is
> that the correct/recommended approach in 1.2.1? I see Apache Crunch
> uses lots of MapReduce types but I would really like to limit the MR
> dependencies if possible since my app will not always be on a Hadoop
> node.
>

Yes, the ORC MapReduce shim uses VectorizedRowBatches and converts them
into WritableComparables, so it will be fastest if you use
VectorizedRowBatch directly. Although, as you have discovered, that won't
work if you are trying to use hive-orc 1.2.
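
For reference, the direct VectorizedRowBatch write path, roughly as shown in
the ORC core documentation, looks like the sketch below. It depends on
orc-core and the storage-api column vector classes, which is exactly the
dependency that won't coexist with hive-orc 1.2.1. The schema and output path
are placeholders.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class VectorizedWriteSketch {
      public static void main(String[] args) throws Exception {
        TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");
        Writer writer = OrcFile.createWriter(new Path("/tmp/vectorized.orc"),
            OrcFile.writerOptions(new Configuration()).setSchema(schema));
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector name = (BytesColumnVector) batch.cols[1];
        for (int r = 0; r < 10000; ++r) {
          int row = batch.size++;
          id.vector[row] = r;
          byte[] bytes = ("row-" + r).getBytes(StandardCharsets.UTF_8);
          name.setRef(row, bytes, 0, bytes.length);
          if (batch.size == batch.getMaxSize()) {   // flush full batches as you go
            writer.addRowBatch(batch);
            batch.reset();
          }
        }
        if (batch.size != 0) {                      // flush the final partial batch
          writer.addRowBatch(batch);
        }
        writer.close();
      }
    }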


> - Are there any examples of converting Avro to ORC outside of Hive
> (but using Avro and hive-orc)? I see a couple of examples of
> reading/writing ORC files but nothing with Avro. No worries if not, I
> am writing one as part of this effort :)
>

If you look at the benchmarking code in
https://github.com/apache/orc/pull/43 , you'll see that I took a first stab
at making an Avro writer that goes from ORC's TypeDescription and a
VectorizedRowBatch.
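
Since that writer targets TypeDescription and VectorizedRowBatch, an Avro
conversion that stays within the hive-orc 1.2.1 row-by-row API would instead
look roughly like the sketch below. The input file, field names, and flat
long/string schema are hypothetical; a real converter would walk the Avro
schema and build matching (possibly nested) ObjectInspectors.

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Writer;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

    public class AvroToOrcRowSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical input: Avro records with a long "id" and a string "name".
        DataFileReader<GenericRecord> avro = new DataFileReader<>(
            new File("input.avro"), new GenericDatumReader<GenericRecord>());

        // Built by hand here; a real converter would derive these inspectors
        // (including nested list/map/struct ones) from the Avro schema.
        StructObjectInspector inspector =
            ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("id", "name"),
                Arrays.<ObjectInspector>asList(
                    PrimitiveObjectInspectorFactory.javaLongObjectInspector,
                    PrimitiveObjectInspectorFactory.javaStringObjectInspector));

        Writer writer = OrcFile.createWriter(new Path("output.orc"),
            OrcFile.writerOptions(new Configuration()).inspector(inspector));

        while (avro.hasNext()) {
          GenericRecord record = avro.next();
          // Each row is just a List laid out to match the struct inspector.
          List<Object> row = Arrays.<Object>asList(
              (Long) record.get("id"),
              record.get("name").toString());   // Avro strings arrive as Utf8
          writer.addRow(row);
        }
        writer.close();
        avro.close();
      }
    }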

.. Owen


>
> Thank you in advance,
> Matt
>