You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Andrzej Swatowski (Jira)" <ji...@apache.org> on 2022/07/18 11:45:00 UTC

[jira] [Updated] (FLINK-28591) Array> is not serialized correctly

     [ https://issues.apache.org/jira/browse/FLINK-28591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Swatowski updated FLINK-28591:
--------------------------------------
    Description: 
When using Table API to insert data into array of rows, the data apparently is incorrectly serialized internally, which leads to incorrect serialization at the connectors.

E.g., a following table:
{code:java}
CREATE TABLE wrongArray (
    foo bigint,
    bar ARRAY<ROW<`foo1` STRING, `foo2` STRING>>,
    strings ARRAY<STRING>,
    intRows ARRAY<ROW<`a` INT, `b` INT>>
) WITH (
  'connector' = 'filesystem',
  'path' = 'file://path/to/somewhere',
  'format' = 'json'
) {code}
along with the following insert:
{code:java}
insert into wrongArray (
    SELECT
        1,
        array[
            ('Field1', 'Value1'),
            ('Field2', 'Value2')
        ],
        array['foo', 'bar', 'foobar'],
        array[ROW(1, 1), ROW(2, 2)]
    FROM (VALUES(1))
) {code}
gets serialized into: 
{code:java}
{
  "foo":1,
  "bar":[
    {
      "foo1":"Field2",
      "foo2":"Value2"
    },
    {
      "foo1":"Field2",
      "foo2":"Value2"
    }
  ],
  "strings":[
    "foo",
    "bar",
    "foobar"
  ],
  "intRows":[
    {
      "a":2,
      "b":2
    },
    {
      "a":2,
      "b":2
    }
  ]
}{code}
It is easy to spot that `strings` (being an Array of String) yields the correct values. However, both `bar` (an Array of Rows with two Strings) and `intRows` (an Array of Rows with two Integers) consists of duplicates of the last row in the array.
----
It is not an error connected with either a specific connector or format. I have done a bit of debugging when trying to implement my own format and it seems that `BinaryArrayData` holding the row values has wrong data saved in its `MemorySegment`, i.e. calling: 
{code:java}
for (var i = 0; i < array.size(); i++) {
  Object element = arrayDataElementGetter.getElementOrNull(array, i);
}{code}
correctly calculates offsets but yields the same result as the data is malformed in the array's `MemorySegment`.

  was:
When using Table API to insert data into array of rows, the data apparently is incorrectly serialized internally, which leads to incorrect serialization at the connectors.

E.g., a following table:
{code:java}
CREATE TABLE wrongArray (
    foo bigint,
    bar ARRAY<ROW<`foo1` STRING, `foo2` STRING>>,
    strings ARRAY<STRING>,
    intRows ARRAY<ROW<`a` INT, `b` INT>>
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///home/jovyan/issue.json',
  'format' = 'json'
) {code}
along with the following insert:
{code:java}
insert into wrongArray (
    SELECT
        1,
        array[
            ('Field1', 'Value1'),
            ('Field2', 'Value2')
        ],
        array['foo', 'bar', 'foobar'],
        array[ROW(1, 1), ROW(2, 2)]
    FROM (VALUES(1))
) {code}
gets serialized into: 
{code:java}
{
  "foo":1,
  "bar":[
    {
      "foo1":"Field2",
      "foo2":"Value2"
    },
    {
      "foo1":"Field2",
      "foo2":"Value2"
    }
  ],
  "strings":[
    "foo",
    "bar",
    "foobar"
  ],
  "intRows":[
    {
      "a":2,
      "b":2
    },
    {
      "a":2,
      "b":2
    }
  ]
}{code}
It is easy to spot that `strings` (being an Array of String) yields the correct values. However, both `bar` (an Array of Rows with two Strings) and `intRows` (an Array of Rows with two Integers) consists of duplicates of the last row in the array.
----
It is not an error connected with either a specific connector or format. I have done a bit of debugging when trying to implement my own format and it seems that `BinaryArrayData` holding the row values has wrong data saved in its `MemorySegment`, i.e. calling: 
{code:java}
for (var i = 0; i < array.size(); i++) {
  Object element = arrayDataElementGetter.getElementOrNull(array, i);
}{code}
correctly calculates offsets but yields the same result as the data is malformed in the array's `MemorySegment`.


> Array<Row<...>> is not serialized correctly
> -------------------------------------------
>
>                 Key: FLINK-28591
>                 URL: https://issues.apache.org/jira/browse/FLINK-28591
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / API, Table SQL / Planner
>    Affects Versions: 1.15.0
>            Reporter: Andrzej Swatowski
>            Priority: Major
>
> When using Table API to insert data into array of rows, the data apparently is incorrectly serialized internally, which leads to incorrect serialization at the connectors.
> E.g., a following table:
> {code:java}
> CREATE TABLE wrongArray (
>     foo bigint,
>     bar ARRAY<ROW<`foo1` STRING, `foo2` STRING>>,
>     strings ARRAY<STRING>,
>     intRows ARRAY<ROW<`a` INT, `b` INT>>
> ) WITH (
>   'connector' = 'filesystem',
>   'path' = 'file://path/to/somewhere',
>   'format' = 'json'
> ) {code}
> along with the following insert:
> {code:java}
> insert into wrongArray (
>     SELECT
>         1,
>         array[
>             ('Field1', 'Value1'),
>             ('Field2', 'Value2')
>         ],
>         array['foo', 'bar', 'foobar'],
>         array[ROW(1, 1), ROW(2, 2)]
>     FROM (VALUES(1))
> ) {code}
> gets serialized into: 
> {code:java}
> {
>   "foo":1,
>   "bar":[
>     {
>       "foo1":"Field2",
>       "foo2":"Value2"
>     },
>     {
>       "foo1":"Field2",
>       "foo2":"Value2"
>     }
>   ],
>   "strings":[
>     "foo",
>     "bar",
>     "foobar"
>   ],
>   "intRows":[
>     {
>       "a":2,
>       "b":2
>     },
>     {
>       "a":2,
>       "b":2
>     }
>   ]
> }{code}
> It is easy to spot that `strings` (being an Array of String) yields the correct values. However, both `bar` (an Array of Rows with two Strings) and `intRows` (an Array of Rows with two Integers) consists of duplicates of the last row in the array.
> ----
> It is not an error connected with either a specific connector or format. I have done a bit of debugging when trying to implement my own format and it seems that `BinaryArrayData` holding the row values has wrong data saved in its `MemorySegment`, i.e. calling: 
> {code:java}
> for (var i = 0; i < array.size(); i++) {
>   Object element = arrayDataElementGetter.getElementOrNull(array, i);
> }{code}
> correctly calculates offsets but yields the same result as the data is malformed in the array's `MemorySegment`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)