You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2017/05/15 13:08:02 UTC

Thrift object serialization

Hi to all,
in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in
this way:

HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
        new ParquetThriftInputFormat<MyThriftObj>(), Void.class,
MyThriftObj.class, job);
FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inputPath);
*DataSet<Tuple2<Void, MyThriftObj>> ds* = env.createInput(inputFormat);

Flink logs this message:

   - TypeExtractor -* class MyThriftObj contains custom serialization
   methods we do not call.*


Indeed MyThriftObj has readObject/writeObject functions and when I print
the type of ds I see:

   - Java Tuple2<Void,* GenericType<MyThriftObj>*>

Fom my experience GenericType is a performace killer...what should I do to
improve the reading/writing of MyThriftObj?

Best,
Flavio


-- 
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908

Re: Thrift object serialization

Posted by Flavio Pompermaier <po...@okkam.it>.

Ok thanks Gordon! It would be nice to have a benchmark also on this ;)

Thanks a lot for the support,
Flavio

On Tue, May 16, 2017 at 9:41 AM, Tzu-Li (Gordon) Tai <tz...@apache.org>
wrote:

> If you don’t register the TBaseSerializer for your MyThriftObj (or in
> general don’t register any serializer for the Thrift class), I think Kryo’s
> default FieldSerializer will be used for it.
>
> The TBaseSerializer basically just uses TBase for de-/serialization as you
> normally would for the Thrift classes (see [1]), so there should be some
> specific optimization going on for that compared to Kryo’s generic
> FieldSerializer. I’m not entirely sure about performance gain between the
> two as I don’t really know the details of the serialization differences,
> but I would suggest to use TBaseSerializer if they are Thrift classes.
>
> Cheers,
> Gordon
>
> [1] https://github.com/twitter/chill/blob/develop/
> chill-thrift/src/main/java/com/twitter/chill/thrift/TBaseSerializer.java
>
>
> On 16 May 2017 at 3:26:32 PM, Flavio Pompermaier (pompermaier@okkam.it)
> wrote:
>
> Hi Gordon,
> thanks for the link. Will the usage ofTBaseSerializer wrt Kryo lead to a
> performance gain?
>
> On Tue, May 16, 2017 at 7:32 AM, Tzu-Li (Gordon) Tai <tz...@apache.org>
> wrote:
>
>> Hi Flavio!
>>
>> I believe [1] has what you are looking for. Have you taken a look at that?
>>
>> Cheers,
>> Gordon
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-
>> 1.3/dev/custom_serializers.html
>>
>> On 15 May 2017 at 9:08:33 PM, Flavio Pompermaier (pompermaier@okkam.it)
>> wrote:
>>
>> Hi to all,
>> in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat
>> in this way:
>>
>> HadoopInputFormat<Void, MyThriftObj> inputFormat = new
>> HadoopInputFormat<>(
>>         new ParquetThriftInputFormat<MyThriftObj>(), Void.class,
>> MyThriftObj.class, job);
>> FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inpu
>> tPath);
>> *DataSet<Tuple2<Void, MyThriftObj>> ds* = env.createInput(inputFormat);
>>
>> Flink logs this message:
>>
>>    - TypeExtractor - *class MyThriftObj contains custom serialization
>>    methods we do not call.*
>>
>>
>> Indeed MyThriftObj has readObject/writeObject functions and when I print
>> the type of ds I see:
>>
>>    - Java Tuple2<Void, *GenericType<MyThriftObj>*>
>>
>> Fom my experience GenericType is a performace killer...what should I do
>> to improve the reading/writing of MyThriftObj?
>>
>> Best,
>> Flavio
>>
>>
>> --
>> Flavio Pompermaier
>> Development Department
>>
>> OKKAM S.r.l.
>> Tel. +(39) 0461 1823908 <+39%200461%20182%203908>
>>
>>
>
>
> --
> Flavio Pompermaier
> Development Department
>
> OKKAM S.r.l.
> Tel. +(39) 0461 1823908 <+39%200461%20182%203908>
>
>

Re: Thrift object serialization

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.

If you don’t register the TBaseSerializer for your MyThriftObj (or in general don’t register any serializer for the Thrift class), I think Kryo’s default FieldSerializer will be used for it.

The TBaseSerializer basically just uses TBase for de-/serialization as you normally would for the Thrift classes (see [1]), so there should be some specific optimization going on for that compared to Kryo’s generic FieldSerializer. I’m not entirely sure about performance gain between the two as I don’t really know the details of the serialization differences, but I would suggest to use TBaseSerializer if they are Thrift classes.

Cheers,
Gordon

[1] https://github.com/twitter/chill/blob/develop/chill-thrift/src/main/java/com/twitter/chill/thrift/TBaseSerializer.java

On 16 May 2017 at 3:26:32 PM, Flavio Pompermaier (pompermaier@okkam.it) wrote:

Hi Gordon,
thanks for the link. Will the usage ofTBaseSerializer wrt Kryo lead to a performance gain?

On Tue, May 16, 2017 at 7:32 AM, Tzu-Li (Gordon) Tai <tz...@apache.org> wrote:
Hi Flavio!

I believe [1] has what you are looking for. Have you taken a look at that?

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/custom_serializers.html

On 15 May 2017 at 9:08:33 PM, Flavio Pompermaier (pompermaier@okkam.it) wrote:

Hi to all,
in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in this way:

HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
        new ParquetThriftInputFormat<MyThriftObj>(), Void.class, MyThriftObj.class, job);
FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inputPath);
DataSet<Tuple2<Void, MyThriftObj>> ds = env.createInput(inputFormat);

Flink logs this message:
TypeExtractor - class MyThriftObj contains custom serialization methods we do not call.

Indeed MyThriftObj has readObject/writeObject functions and when I print the type of ds I see:
Java Tuple2<Void, GenericType<MyThriftObj>>
Fom my experience GenericType is a performace killer...what should I do to improve the reading/writing of MyThriftObj?

Best,
Flavio


--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908



--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908

Re: Thrift object serialization

Posted by Flavio Pompermaier <po...@okkam.it>.

Hi Gordon,
thanks for the link. Will the usage ofTBaseSerializer wrt Kryo lead to a
performance gain?

On Tue, May 16, 2017 at 7:32 AM, Tzu-Li (Gordon) Tai <tz...@apache.org>
wrote:

> Hi Flavio!
>
> I believe [1] has what you are looking for. Have you taken a look at that?
>
> Cheers,
> Gordon
>
> [1] https://ci.apache.org/projects/flink/flink-docs-
> release-1.3/dev/custom_serializers.html
>
> On 15 May 2017 at 9:08:33 PM, Flavio Pompermaier (pompermaier@okkam.it)
> wrote:
>
> Hi to all,
> in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in
> this way:
>
> HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
>         new ParquetThriftInputFormat<MyThriftObj>(), Void.class,
> MyThriftObj.class, job);
> FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(
> inputPath);
> *DataSet<Tuple2<Void, MyThriftObj>> ds* = env.createInput(inputFormat);
>
> Flink logs this message:
>
>    - TypeExtractor - *class MyThriftObj contains custom serialization
>    methods we do not call.*
>
>
> Indeed MyThriftObj has readObject/writeObject functions and when I print
> the type of ds I see:
>
>    - Java Tuple2<Void, *GenericType<MyThriftObj>*>
>
> Fom my experience GenericType is a performace killer...what should I do to
> improve the reading/writing of MyThriftObj?
>
> Best,
> Flavio
>
>
> --
> Flavio Pompermaier
> Development Department
>
> OKKAM S.r.l.
> Tel. +(39) 0461 1823908 <+39%200461%20182%203908>
>
>


-- 
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908

Re: Thrift object serialization

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.

Hi Flavio!

I believe [1] has what you are looking for. Have you taken a look at that?

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/custom_serializers.html

On 15 May 2017 at 9:08:33 PM, Flavio Pompermaier (pompermaier@okkam.it) wrote:

Hi to all,
in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in this way:

HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
        new ParquetThriftInputFormat<MyThriftObj>(), Void.class, MyThriftObj.class, job);
FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inputPath);
DataSet<Tuple2<Void, MyThriftObj>> ds = env.createInput(inputFormat);

Flink logs this message:
TypeExtractor - class MyThriftObj contains custom serialization methods we do not call.

Indeed MyThriftObj has readObject/writeObject functions and when I print the type of ds I see:
Java Tuple2<Void, GenericType<MyThriftObj>>
Fom my experience GenericType is a performace killer...what should I do to improve the reading/writing of MyThriftObj?

Best,
Flavio


--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908