Posted to issues@flink.apache.org by "Jark Wu (Jira)" <ji...@apache.org> on 2020/02/27 02:38:00 UTC
[jira] [Updated] (FLINK-16296) Improve performance of BaseRowSerializer#serialize() for GenericRow

     [ https://issues.apache.org/jira/browse/FLINK-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jark Wu updated FLINK-16296:
----------------------------
    Description: 
Currently, serializing a {{GenericRow}} with {{BaseRowSerializer#serialize()}} incurs two memory copies: the first is GenericRow -> BinaryRow, the second is BinaryRow -> DataOutputView.

However, in theory, we can serialize a GenericRow into the DataOutputView directly, because we already have all the column values and types. We can serialize the null-bit part for all columns, then the fixed-length part for all columns, and then the variable-length part.

For example, when the column is a BinaryString, we can serialize its position (pos) and length, calculate the new variable-part length, and then serialize the next column. If the row contains a generic type, serialization falls back to the previous approach, but generic types are rare in SQL.
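
The single-pass idea above can be sketched roughly as follows. This is a minimal illustration only, under several assumptions: the class {{DirectRowWriter}} and its methods are hypothetical, rows are modeled as a plain Object[] rather than a real GenericRow, only Long, Integer and String fields are supported, and the byte layout (int length prefix, null-bit words, one 8-byte slot per field, then variable-length bytes) is a simplification and not Flink's exact BinaryRow format.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not Flink's BaseRowSerializer and not its exact binary format. */
final class DirectRowWriter {

    /** Returns false if any field has an unsupported ("generic") type, so the caller can fall back. */
    static boolean canWriteDirectly(Object[] fields) {
        for (Object f : fields) {
            if (f != null && !(f instanceof Long) && !(f instanceof Integer) && !(f instanceof String)) {
                return false;
            }
        }
        return true;
    }

    /** Writes the row straight to the output without building an intermediate binary row. */
    static void write(Object[] fields, DataOutputStream out) throws IOException {
        int arity = fields.length;
        int nullWords = (arity + 63) / 64;          // one bit per column, packed into 8-byte words
        int fixedSize = nullWords * 8 + arity * 8;  // null bits + one 8-byte slot per column

        long[] nullBits = new long[nullWords];
        long[] slots = new long[arity];
        byte[][] varParts = new byte[arity][];
        int varOffset = fixedSize;                  // where the next variable-length value will start

        // Pass 1: compute null bits, fixed slots, and variable-part offsets.
        for (int i = 0; i < arity; i++) {
            Object f = fields[i];
            if (f == null) {
                nullBits[i / 64] |= 1L << (i % 64);
            } else if (f instanceof Long) {
                slots[i] = (Long) f;
            } else if (f instanceof Integer) {
                slots[i] = (Integer) f;
            } else if (f instanceof String) {
                byte[] bytes = ((String) f).getBytes(StandardCharsets.UTF_8);
                varParts[i] = bytes;
                // The fixed slot stores (offset, length) of the variable-length value.
                slots[i] = ((long) varOffset << 32) | bytes.length;
                varOffset += bytes.length;
            } else {
                throw new IllegalArgumentException("Unsupported type, use the fallback path: " + f.getClass());
            }
        }

        // Pass 2: emit length prefix, null-bit part, fixed part, then variable part.
        out.writeInt(varOffset);
        for (long word : nullBits) {
            out.writeLong(word);
        }
        for (long slot : slots) {
            out.writeLong(slot);
        }
        for (byte[] part : varParts) {
            if (part != null) {
                out.write(part);
            }
        }
    }
}
```

A caller would check {{canWriteDirectly}} first and use the existing GenericRow -> BinaryRow -> DataOutputView path whenever it returns false. In this sketch a String is still encoded to bytes once; a BinaryString that already holds its bytes would avoid even that step.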

This is a general improvement that can benefit every operator.

If this can be done, then {{GenericRow}} is always the best choice for producers, and {{BinaryRow}} is always the best choice for consumers. 


> Improve performance of BaseRowSerializer#serialize() for GenericRow
> -------------------------------------------------------------------
>
>                 Key: FLINK-16296
>                 URL: https://issues.apache.org/jira/browse/FLINK-16296
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Runtime
>            Reporter: Jark Wu
>            Priority: Major
>
> Currently, serializing a {{GenericRow}} with {{BaseRowSerializer#serialize()}} incurs two memory copies: the first is GenericRow -> BinaryRow, the second is BinaryRow -> DataOutputView.
> However, in theory, we can serialize a GenericRow into the DataOutputView directly, because we already have all the column values and types. We can serialize the null-bit part for all columns, then the fixed-length part for all columns, and then the variable-length part.
> For example, when the column is a BinaryString, we can serialize its position (pos) and length, calculate the new variable-part length, and then serialize the next column. If the row contains a generic type, serialization falls back to the previous approach, but generic types are rare in SQL.
> This is a general improvement that can benefit every operator.
> If this can be done, then {{GenericRow}} is always the best choice for producers, and {{BinaryRow}} is always the best choice for consumers. 


