Posted to dev@arrow.apache.org by Ying Zhou <yz...@gmail.com> on 2020/10/18 15:24:07 UTC

[C++] Arrow to ORC type conversion

Hi,

I’m developing the adapter that converts Arrow Arrays, ChunkedArrays, RecordBatches and Tables into ORC files, based on the ORC Specification and the Arrow Columnar Format.

Here is my current type mapping:

Type::type::NA -> nullptr
Type::type::BOOL -> liborc::TypeKind::BOOLEAN
Type::type::UINT8 -> liborc::TypeKind::BYTE
Type::type::INT8 -> liborc::TypeKind::BYTE
Type::type::UINT16 -> liborc::TypeKind::SHORT
Type::type::INT16 -> liborc::TypeKind::SHORT
Type::type::UINT32 -> liborc::TypeKind::INT
Type::type::INT32 -> liborc::TypeKind::INT
Type::type::INTERVAL_MONTH -> liborc::TypeKind::INT
Type::type::UINT64 -> liborc::TypeKind::LONG
Type::type::INT64 -> liborc::TypeKind::LONG
Type::type::INTERVAL_DAY_TIME -> liborc::TypeKind::LONG
Type::type::DURATION -> liborc::TypeKind::LONG
Type::type::HALF_FLOAT -> liborc::TypeKind::FLOAT
Type::type::FLOAT -> liborc::TypeKind::FLOAT
Type::type::DOUBLE -> liborc::TypeKind::DOUBLE
Type::type::STRING -> liborc::TypeKind::STRING
Type::type::LARGE_STRING -> liborc::TypeKind::STRING
Type::type::FIXED_SIZE_BINARY -> liborc::TypeKind::CHAR
Type::type::BINARY -> liborc::TypeKind::BINARY
Type::type::LARGE_BINARY -> liborc::TypeKind::BINARY
Type::type::DATE32 -> liborc::TypeKind::DATE
Type::type::TIMESTAMP -> liborc::TypeKind::TIMESTAMP
Type::type::TIME32 -> liborc::TypeKind::TIMESTAMP
Type::type::TIME64 -> liborc::TypeKind::TIMESTAMP
Type::type::DATE64 -> liborc::TypeKind::TIMESTAMP
Type::type::DECIMAL -> liborc::TypeKind::DECIMAL
Type::type::LIST -> liborc::TypeKind::LIST
Type::type::FIXED_SIZE_LIST -> liborc::TypeKind::LIST
Type::type::LARGE_LIST -> liborc::TypeKind::LIST
Type::type::STRUCT -> liborc::TypeKind::STRUCT
Type::type::MAP -> liborc::TypeKind::MAP
Type::type::DENSE_UNION -> liborc::TypeKind::UNION
Type::type::SPARSE_UNION -> liborc::TypeKind::UNION
Type::type::DICTIONARY -> the ORC version of its value type
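
For readers who want to experiment with the table above, here is a minimal sketch of it as a lookup function. It is not the adapter's actual code: ArrowType and OrcKind are hypothetical stand-in enums (so the sketch compiles without Arrow or ORC installed), ToOrcKind is an illustrative name, and returning std::nullopt for NA is an assumption standing in for the "nullptr" row. Only the pairings themselves come from the list.

```cpp
#include <optional>

// Hypothetical stand-ins for arrow::Type::type and liborc::TypeKind.
// Only the enumerator names are taken from the mapping in this thread.
enum class ArrowType {
  NA, BOOL, UINT8, INT8, UINT16, INT16, UINT32, INT32, INTERVAL_MONTH,
  UINT64, INT64, INTERVAL_DAY_TIME, DURATION, HALF_FLOAT, FLOAT, DOUBLE,
  STRING, LARGE_STRING, FIXED_SIZE_BINARY, BINARY, LARGE_BINARY, DATE32,
  TIMESTAMP, TIME32, TIME64, DATE64, DECIMAL, LIST, FIXED_SIZE_LIST,
  LARGE_LIST, STRUCT, MAP, DENSE_UNION, SPARSE_UNION, DICTIONARY
};

enum class OrcKind {
  BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, STRING, CHAR, BINARY,
  DATE, TIMESTAMP, DECIMAL, LIST, STRUCT, MAP, UNION
};

// Returns the proposed ORC kind, or std::nullopt where the table has no kind
// (the NA row). For DICTIONARY, the caller passes the dictionary's value type
// and the result is the ORC kind of that value type.
constexpr std::optional<OrcKind> ToOrcKind(
    ArrowType type, ArrowType dictionary_value_type = ArrowType::NA) {
  switch (type) {
    case ArrowType::NA: return std::nullopt;
    case ArrowType::BOOL: return OrcKind::BOOLEAN;
    case ArrowType::UINT8:
    case ArrowType::INT8: return OrcKind::BYTE;
    case ArrowType::UINT16:
    case ArrowType::INT16: return OrcKind::SHORT;
    case ArrowType::UINT32:
    case ArrowType::INT32:
    case ArrowType::INTERVAL_MONTH: return OrcKind::INT;
    case ArrowType::UINT64:
    case ArrowType::INT64:
    case ArrowType::INTERVAL_DAY_TIME:
    case ArrowType::DURATION: return OrcKind::LONG;
    case ArrowType::HALF_FLOAT:
    case ArrowType::FLOAT: return OrcKind::FLOAT;
    case ArrowType::DOUBLE: return OrcKind::DOUBLE;
    case ArrowType::STRING:
    case ArrowType::LARGE_STRING: return OrcKind::STRING;
    case ArrowType::FIXED_SIZE_BINARY: return OrcKind::CHAR;
    case ArrowType::BINARY:
    case ArrowType::LARGE_BINARY: return OrcKind::BINARY;
    case ArrowType::DATE32: return OrcKind::DATE;
    case ArrowType::TIMESTAMP:
    case ArrowType::TIME32:
    case ArrowType::TIME64:
    case ArrowType::DATE64: return OrcKind::TIMESTAMP;
    case ArrowType::DECIMAL: return OrcKind::DECIMAL;
    case ArrowType::LIST:
    case ArrowType::FIXED_SIZE_LIST:
    case ArrowType::LARGE_LIST: return OrcKind::LIST;
    case ArrowType::STRUCT: return OrcKind::STRUCT;
    case ArrowType::MAP: return OrcKind::MAP;
    case ArrowType::DENSE_UNION:
    case ArrowType::SPARSE_UNION: return OrcKind::UNION;
    case ArrowType::DICTIONARY:
      // Assumption: a dictionary whose value type is itself DICTIONARY is
      // treated as unmapped, which also keeps the recursion finite.
      return dictionary_value_type == ArrowType::DICTIONARY
                 ? std::nullopt
                 : ToOrcKind(dictionary_value_type);
  }
  return std::nullopt;  // Unreachable for valid enumerators.
}
```

With this sketch, ToOrcKind(ArrowType::DICTIONARY, ArrowType::LARGE_STRING) yields OrcKind::STRING, and ToOrcKind(ArrowType::NA) yields an empty optional. Because several Arrow types share one ORC kind, the function is many-to-one and cannot be inverted without extra metadata.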

I have some concerns, particularly about duration types, which don’t exist in Apache ORC and which I therefore have to convert to integers. Is my current mapping reasonable? Thanks!

Best,
Ying Zhou

Re: [C++] Arrow to ORC type conversion

Posted by "Uwe L. Korn" <ma...@uwekorn.com>.
This sounds reasonable from an Arrow perspective. You might want to CC the ORC list as well, or ask someone there to co-review your work on the adapter.

Uwe
