Posted to user@flink.apache.org by Илья Соин <il...@gmail.com> on 2022/02/03 14:55:53 UTC

Json deserialisation with .jsonValue vs format=json in Table API

Hi, 

I’m using the Table / SQL API. 

I have a stream of strings, where each message contains several JSON strings separated by "\n".
For example:
{"timestamp": "2021-01-01T00:00:00", "person": {"name": "Vasya"}}\n
{"timestamp": "2021-01-01T01:00:00", "person": {"name": "Max"}}

I would like to split each message by "\n", parse each string as a JSON object and get some of the fields.

AFAIK there are 2 ways to do it:

1) Write a custom deserialiser and provide it in the source table DDL, e.g.
CREATE TABLE source (
    `timestamp` STRING,
    person ROW<name STRING>
)
WITH ('format' = 'multiline-json', …);

2) Use 'format' = 'raw' and extract the needed fields using .jsonValue, i.e.

CREATE TABLE source (
    `row` STRING
);

env.from("source")
        .joinLateral(
            call(SplitFunction.class, $("row"), "\n").as("msg")
        )
        .select(
            $("msg").jsonValue("$.timestamp", DataTypes.STRING()),
            $("msg").jsonValue("$.person.name", DataTypes.STRING()).as("name")
        );
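
(For completeness, SplitFunction is just a simple table function, roughly like this:)

import org.apache.flink.table.functions.TableFunction;

// Minimal sketch: emits one row per delimiter-separated substring.
public class SplitFunction extends TableFunction<String> {
    public void eval(String str, String delimiter) {
        if (str == null) {
            return;
        }
        for (String s : str.split(delimiter)) {
            collect(s);
        }
    }
}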

In 2), will each call of .jsonValue parse the string all over again or will it reuse the same JsonNode object internally? Which option better fits my problem?

--
Best, Ilya

Re: Json deserialisation with .jsonValue vs format=json in Table API

Posted by Илья Соин <il...@gmail.com>.
Thank you, Francesco



Re: Json deserialisation with .jsonValue vs format=json in Table API

Posted by Francesco Guardiani <fr...@ververica.com>.
Hi,

I think the more stable option would be the first one, as it also gives you
more flexibility. Reading the row as a string and then parsing it in a query
definitely costs more, and makes it less straightforward to use the other
schema features of the table, such as watermark definitions, primary keys, etc.
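
For example, once the format produces typed columns, the watermark can be
declared directly in the schema. A rough sketch (the connector, options and
column names here are hypothetical):

import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.Schema;
import org.apache.flink.table.api.TableDescriptor;

// Hypothetical source definition: the format does the parsing, so the
// schema can carry the event-time column and the watermark.
tableEnv.createTemporaryTable("source", TableDescriptor.forConnector("kafka")
        .schema(Schema.newBuilder()
                .column("ts", DataTypes.TIMESTAMP(3))
                .column("person", DataTypes.ROW(DataTypes.FIELD("name", DataTypes.STRING())))
                .watermark("ts", "ts - INTERVAL '5' SECOND")
                .build())
        .format("multiline-json")
        .build());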

I guess you can implement it straightforwardly by subclassing the existing
JSON format provided by Flink, in particular
JsonRowDataDeserializationSchema.
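
An untested sketch of the idea, wrapping rather than subclassing (you would
still need a small DeserializationFormatFactory to register it under a name
like 'multiline-json'):

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.table.data.RowData;
import org.apache.flink.util.Collector;

// Splits each raw message on '\n' and delegates each line to a wrapped
// JSON deserializer, e.g. a JsonRowDataDeserializationSchema.
public class MultilineJsonDeserializationSchema implements DeserializationSchema<RowData> {

    private final DeserializationSchema<RowData> inner;

    public MultilineJsonDeserializationSchema(DeserializationSchema<RowData> inner) {
        this.inner = inner;
    }

    @Override
    public void open(InitializationContext context) throws Exception {
        inner.open(context);
    }

    @Override
    public void deserialize(byte[] message, Collector<RowData> out) throws IOException {
        for (String line : new String(message, StandardCharsets.UTF_8).split("\n")) {
            if (!line.isEmpty()) {
                out.collect(inner.deserialize(line.getBytes(StandardCharsets.UTF_8)));
            }
        }
    }

    @Override
    public RowData deserialize(byte[] message) throws IOException {
        throw new UnsupportedOperationException("Use deserialize(message, out)");
    }

    @Override
    public boolean isEndOfStream(RowData nextElement) {
        return false;
    }

    @Override
    public TypeInformation<RowData> getProducedType() {
        return inner.getProducedType();
    }
}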

A third solution would be to create a SplitFunction, like the one you
created, which directly performs the parsing, outputting rows rather than
strings. This removes the double-parsing issue, but still creates problems
when interacting with other schema features.
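
Something along these lines (untested; the function and field names are
made up to match your example, and Jackson does the single parse per line):

import java.io.IOException;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.flink.table.annotation.DataTypeHint;
import org.apache.flink.table.annotation.FunctionHint;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;

// Splits the raw message and parses each line exactly once, emitting
// typed columns instead of strings.
@FunctionHint(output = @DataTypeHint("ROW<ts STRING, name STRING>"))
public class SplitAndParseFunction extends TableFunction<Row> {

    private transient ObjectMapper mapper;

    @Override
    public void open(FunctionContext context) {
        mapper = new ObjectMapper();
    }

    public void eval(String message) {
        for (String line : message.split("\n")) {
            try {
                JsonNode node = mapper.readTree(line);
                collect(Row.of(
                        node.at("/timestamp").asText(),
                        node.at("/person/name").asText()));
            } catch (IOException e) {
                // Decide per use case: skip malformed lines or fail the job.
            }
        }
    }
}

and then:

env.from("source")
        .joinLateral(call(SplitAndParseFunction.class, $("row")))
        .select($("ts"), $("name"));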

Hope it helps,
FG
