Posted to issues@flink.apache.org by "Kurt Young (Jira)" <ji...@apache.org> on 2020/02/27 09:08:00 UTC
[jira] [Comment Edited] (FLINK-16296) Improve performance of
BaseRowSerializer#serialize() for GenericRow
[ https://issues.apache.org/jira/browse/FLINK-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046366#comment-17046366 ]
Kurt Young edited comment on FLINK-16296 at 2/27/20 9:07 AM:
-------------------------------------------------------------
The reason behind this was a non-obvious design choice:
{noformat}
BinaryRow is the standard binary format for all base rows. {noformat}
In other words, `BinaryRow` controls the binary format for all base rows. So the safest way for a `BaseRowSerializer` to generate a correct binary format is to convert the base row to a `BinaryRow` first and then serialize it with a byte copy.
If we want to make this optimization, we have to break our old design choice by saying:
{noformat}
We establish a standard binary format somewhere in our code base, and all base rows should comply with that standard, including BinaryRow and GenericRow.{noformat}
It may not sound like a big deal, but IMO it is quite important for developers and for future modifications.
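To make "a standard binary format somewhere in our code base" a bit more concrete, here is a minimal sketch of what such a single source of truth could look like. Everything in it is an assumption for illustration: `RowLayoutSketch` is a hypothetical class, and the layout (one null bit per field, 8-byte fixed slots, then a variable-length part) is simplified and is not Flink's actual `BinaryRow` layout.

```java
/** Hypothetical single source of truth for a simplified row layout (not Flink's real one). */
final class RowLayoutSketch {
    /** Assumed width of every fixed-length slot, in bytes. */
    static final int FIXED_SLOT_BYTES = 8;

    private RowLayoutSketch() {}

    /** One null bit per field, rounded up to whole bytes. */
    static int nullBitsSize(int arity) {
        return (arity + 7) / 8;
    }

    /** Byte offset of a field's fixed-length slot, counted from the start of the row. */
    static int fixedSlotOffset(int arity, int index) {
        if (index < 0 || index >= arity) {
            throw new IndexOutOfBoundsException("field " + index);
        }
        return nullBitsSize(arity) + FIXED_SLOT_BYTES * index;
    }

    /** Byte offset at which the variable-length part begins. */
    static int variablePartOffset(int arity) {
        return nullBitsSize(arity) + FIXED_SLOT_BYTES * arity;
    }

    /** Packs the (offset, length) of a variable-length value into one fixed slot. */
    static long packOffsetAndLength(int offset, int length) {
        return ((long) offset << 32) | (length & 0xFFFFFFFFL);
    }

    static int unpackOffset(long slot) {
        return (int) (slot >>> 32);
    }

    static int unpackLength(long slot) {
        return (int) slot;
    }
}
```

If both a binary row and a generic row serializer computed their offsets only through a class like this, neither row type would "own" the format, which is the design change described above.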
was (Author: ykt836):
The reason behind this was a non-obvious design choice:
{noformat}
BinaryRow is the standard binary format for all base rows. {noformat}
In other words, `BinaryRow` controls the binary format for all base rows. So the safest way for a `BaseRowSerializer` to generate a correct binary format is to convert the base row to a `BinaryRow` first and then serialize it with a byte copy.
If we want to make this optimization, we should break our old design choice; we should say:
{noformat}
We establish a standard binary format somewhere in our code base, and all base rows should comply with that standard, including BinaryRow and GenericRow.{noformat}
It may not sound like a big deal, but IMO it is quite important for developers and for future modifications.
> Improve performance of BaseRowSerializer#serialize() for GenericRow
> -------------------------------------------------------------------
>
> Key: FLINK-16296
> URL: https://issues.apache.org/jira/browse/FLINK-16296
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / Runtime
> Reporter: Jark Wu
> Priority: Major
>
> Currently, when serializing a {{GenericRow}} using {{BaseRowSerializer#serialize()}}, there are two memory copies. The first is GenericRow -> BinaryRow, the second is BinaryRow -> DataOutputView.
> However, in theory, we can serialize a GenericRow into the DataOutputView directly, because we already have all the column values and types. We can serialize the null-bits part for all columns, then the fixed-length part for all columns, and then the variable-length part.
> For example, when a column is a BinaryString, we can serialize its position and length, calculate the new length of the variable-length part, and then serialize the next column. If there is a generic type in the row, we fall back to the previous approach. But generic types in SQL are rare.
> This is a general improvement and can benefit every operator.
> If this can be done, then {{GenericRow}} is always the best choice for producers, and {{BinaryRow}} is always the best choice for consumers. For example, consider constructing a GenericRow or a BinaryRow from existing {{(String, Integer, Long)}} fields and serializing it to the network. The GenericRow can simply wrap the {{(String, Integer, Long)}} values and serialize them to the network directly with only one memory copy. However, a BinaryRow will copy the {{(String, Integer, Long)}} fields into a byte[] and then copy the byte[] to the network, which involves two memory copies.
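The direct path described in the issue (null bits first, then the fixed-length part, then the variable-length part) can be sketched roughly as follows. This is only an illustration under stated assumptions: `DirectRowWriteSketch` and its methods are hypothetical, the layout is a simplified one rather than Flink's real `BinaryRow` format, a plain `Object[]` stands in for `GenericRow`, `java.io.DataOutput` stands in for `DataOutputView`, and `String` values are UTF-8 encoded on the fly where a real `BinaryString` would already hold bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch; not Flink code and not Flink's actual binary layout. */
class DirectRowWriteSketch {

    /** One-copy path: write null bits, fixed part, variable part straight into out. */
    static void writeDirect(Object[] fields, DataOutput out) throws IOException {
        int arity = fields.length;
        byte[] nullBits = new byte[(arity + 7) / 8];
        byte[][] varParts = new byte[arity][];
        for (int i = 0; i < arity; i++) {
            if (fields[i] == null) {
                nullBits[i / 8] |= (byte) (1 << (i % 8));
            } else if (fields[i] instanceof String) {
                varParts[i] = ((String) fields[i]).getBytes(StandardCharsets.UTF_8);
            }
        }
        out.write(nullBits);

        // Fixed part: one 8-byte slot per field. Variable-length fields store
        // (offset << 32 | length), with the offset counted from the row start.
        int offset = nullBits.length + 8 * arity;
        for (int i = 0; i < arity; i++) {
            Object f = fields[i];
            if (f == null) {
                out.writeLong(0L);
            } else if (f instanceof Integer) {
                out.writeLong(((Integer) f).longValue());
            } else if (f instanceof Long) {
                out.writeLong(((Long) f).longValue());
            } else if (f instanceof String) {
                out.writeLong(((long) offset << 32) | varParts[i].length);
                offset += varParts[i].length;
            } else {
                // The issue proposes falling back to the old path for other types.
                throw new IllegalArgumentException("unsupported type: " + f.getClass());
            }
        }

        // Variable part: the encoded bytes, in field order.
        for (byte[] v : varParts) {
            if (v != null) {
                out.write(v);
            }
        }
    }

    /** Two-copy path: materialize the whole row as bytes first, then copy it out. */
    static void writeViaBuffer(Object[] fields, DataOutput out) throws IOException {
        ByteArrayOutputStream rowBytes = new ByteArrayOutputStream();
        writeDirect(fields, new DataOutputStream(rowBytes)); // copy 1: fields -> row bytes
        out.write(rowBytes.toByteArray());                   // copy 2: row bytes -> output
    }

    /** Convenience helper that runs either path against an in-memory sink. */
    static byte[] toBytes(Object[] fields, boolean direct) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(sink);
            if (direct) {
                writeDirect(fields, out);
            } else {
                writeViaBuffer(fields, out);
            }
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Object[] row = {"ab", 7, null};
        System.out.println(Arrays.equals(toBytes(row, true), toBytes(row, false)));
    }
}
```

Both paths are expected to produce identical bytes; the only difference is whether an intermediate row buffer is materialized, which is the extra copy the issue wants to avoid.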