Posted to user@hive.apache.org by 侯宗田 <zo...@icloud.com> on 2018/05/13 14:34:51 UTC

What does the ORC SERDE do

Hello everyone,
   I know the JSON SerDe turns the fields of a row into JSON, and the CSV SerDe turns them into CSV according to its SERDEPROPERTIES. But I wonder what the ORC SerDe does when I choose STORED AS ORC. And why are there still escaper and separator settings in the ORC SERDEPROPERTIES? The same goes for RC and Parquet. I think these formats are only about how data is stored and compressed by their respective input and output formats, so I don’t know what their SerDes do. Can anyone give me a hint?

Re: What does the ORC SERDE do

Posted by Lefty Leverenz <le...@gmail.com>.
Jörn, please do update the wiki, we really need better SerDe documentation.

Getting write access is easy:

About This Wiki -- How to get permission to edit
<https://cwiki.apache.org/confluence/display/Hive/AboutThisWiki#AboutThisWiki-Howtogetpermissiontoedit>


-- Lefty



Re: What does the ORC SERDE do

Posted by Jörn Franke <jo...@gmail.com>.
AbstractSerDe has a method that returns very basic statistics about your file format (mostly the size of the data, the number of rows, etc.):

https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java

In the initialize method of your SerDe you can retrieve properties related to partitions and include this information in your file format if needed (you don’t need to create folders etc. for partitions; Hive does this).
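
As a rough illustration of the two points above, here is a minimal, self-contained Java sketch. It deliberately does not use the Hive classes: ToyStats and ToyStatsSerDe are hypothetical stand-ins, and the property keys "columns" and "partition_columns" with their separators are assumptions made for the example. Check the Hive source for what initialize actually receives.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

/** Hypothetical stand-in for the kind of basic numbers a stats object carries. */
final class ToyStats {
    long rawDataSize; // bytes serialized so far
    long rowCount;    // rows serialized so far
}

/** Hypothetical toy SerDe: reads table properties once, then tracks basic stats. */
final class ToyStatsSerDe {
    private final ToyStats stats = new ToyStats();
    private List<String> columns = Collections.emptyList();
    private List<String> partitionColumns = Collections.emptyList();

    /** Called once with the table properties before any row is processed. */
    void initialize(Properties tableProps) {
        // Key names and separators are assumptions for this sketch.
        columns = splitList(tableProps.getProperty("columns", ""), ",");
        partitionColumns = splitList(tableProps.getProperty("partition_columns", ""), "/");
    }

    private static List<String> splitList(String value, String separator) {
        if (value.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.split(separator));
    }

    /** Turns a row into a storage record and updates the statistics. */
    String serialize(List<String> row) {
        String record = String.join(",", row);
        stats.rawDataSize += record.getBytes(StandardCharsets.UTF_8).length;
        stats.rowCount++;
        return record;
    }

    ToyStats getStats() { return stats; }
    List<String> columns() { return columns; }
    List<String> partitionColumns() { return partitionColumns; }
}

final class ToyStatsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("columns", "name,price");
        props.setProperty("partition_columns", "country/year");

        ToyStatsSerDe serde = new ToyStatsSerDe();
        serde.initialize(props);
        serde.serialize(Arrays.asList("widget", "9.50"));
        serde.serialize(Arrays.asList("gadget", "12.00"));

        System.out.println("partition columns: " + serde.partitionColumns());
        System.out.println("rows: " + serde.getStats().rowCount);
        System.out.println("bytes: " + serde.getStats().rawDataSize);
    }
}
```

The SerDe only learns about the partition columns here; it never creates directories, which matches the note that Hive handles the partition folders itself.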



Re: What does the ORC SERDE do

Posted by Elliot West <te...@gmail.com>.
Hi Jörn,

I’m curious to know how the SerDe framework provides the means to deal with
partitions, table properties, and statistics. I was under the impression
that these were in the domain of the metastore, and I’ve not found anything
in the SerDe interface related to them. I would appreciate it if you could
point me in the direction of anything I’ve missed.

Thanks,

Elliot.


Re: What does the ORC SERDE do

Posted by Jörn Franke <jo...@gmail.com>.
Yes, this is what I did when writing the Hive part of the HadoopOffice / HadoopCryptoledger library. Be aware that ORC also uses some internal Hive APIs and extends existing ones (e.g. VectorizedSerde).

I don’t have access to the Hive wiki; otherwise I could update it a little bit.


Re: What does the ORC SERDE do

Posted by 侯宗田 <zo...@icloud.com>.
Thank you, it makes the concept clearer to me. I think I need to look up the source code for some details.


Re: What does the ORC SERDE do

Posted by Jörn Franke <jo...@gmail.com>.
For details you can check the source code, but a SerDe needs to translate an object into a Hive object and vice versa. Usually this is very simple (just passing the object through, or creating a HiveDecimal, etc.). It also provides an ObjectInspector that describes an object in more detail (e.g. so that it can be processed by a UDF). For example, it can tell you the precision and scale of an object. In the case of ORC it also describes how a batch of objects (vectorized) can be mapped to Hive objects and back. Furthermore, it provides statistics and the means to deal with partitions as well as table properties (!= input/output format properties).
Although it sounds complex, Hive provides most of the functionality, so implementing a SerDe is easy most of the time.
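
To make the translate-and-describe idea more tangible, here is a small, self-contained Java sketch. It is not the Hive API: ToyDecimalInspector and ToyRowSerDe are hypothetical names, the "name,price" record layout is invented for the example, and java.math.BigDecimal stands in for HiveDecimal.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified counterpart of an inspector for a decimal column. */
final class ToyDecimalInspector {
    final int precision;
    final int scale;

    ToyDecimalInspector(int precision, int scale) {
        this.precision = precision;
        this.scale = scale;
    }

    /** Reads a field of a deserialized row as a decimal with the declared scale. */
    BigDecimal get(Object field) {
        return ((BigDecimal) field).setScale(scale, RoundingMode.HALF_UP);
    }
}

/** Hypothetical toy SerDe: a row is "name,price" in storage and a List in the engine. */
final class ToyRowSerDe {
    // Describes the price column as decimal(10,2); the values are arbitrary.
    final ToyDecimalInspector priceInspector = new ToyDecimalInspector(10, 2);

    /** Storage record -> row object the engine can work with. */
    List<Object> deserialize(String record) {
        String[] parts = record.split(",", 2);
        return Arrays.<Object>asList(parts[0], new BigDecimal(parts[1]));
    }

    /** Row object -> storage record, using the inspector to read the decimal. */
    String serialize(List<Object> row) {
        return row.get(0) + "," + priceInspector.get(row.get(1)).toPlainString();
    }
}

final class ToySerDeDemo {
    public static void main(String[] args) {
        ToyRowSerDe serde = new ToyRowSerDe();
        List<Object> row = serde.deserialize("widget,9.5");
        System.out.println(row);
        System.out.println(serde.serialize(row));
        System.out.println("precision=" + serde.priceInspector.precision
                + ", scale=" + serde.priceInspector.scale);
    }
}
```

The point of the split is that generic code such as a UDF never needs to know the storage layout: it asks the inspector how to read a field and what its precision and scale are.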
