Posted to user@flink.apache.org by Averell <lv...@gmail.com> on 2020/08/21 00:40:52 UTC

JSON to Parquet

Hello,

I have a stream where each message is a JSON string with a fairly complex
schema (multiple fields, multiple nested layers), and I need to write the
messages into Parquet files after some slight modification/enrichment.

I wonder what options are available for doing that. I'm thinking of JSON
-> Avro (GenericRecord) -> Parquet. Is that an option? I would like to be
able to change the JSON schema quickly/dynamically, with as little code
change as possible.

Thanks and regards,
Averell



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: JSON to Parquet

Posted by Averell <lv...@gmail.com>.
Hi Dawid,

Thanks for the suggestion. So, basically, I'll need to use the JSON connector
to read the JSON strings into Rows, and then write the Rows out as Parquet
records using the Parquet connector?
I have never tried the Table API before; I have been using the Streaming API
only. I will follow your suggestion now.

Thanks for your help.
Regards,
Averell




Re: JSON to Parquet

Posted by Dawid Wysakowicz <dw...@apache.org>.
Hi Averell,

If you can describe the JSON schema, I'd suggest looking into the SQL
API with the JSON [1] and Parquet [2] formats. (And I think you do need
to define your schema upfront; if I am not mistaken, Parquet must know
the common schema.)

Then you could do something like:
CREATE TABLE json (
    -- define the schema of your json data
) WITH (
  ...
 'format' = 'json',
 'json.fail-on-missing-field' = 'false',
 'json.ignore-parse-errors' = 'true'
);

CREATE TABLE parquet (
    -- define the schema of your parquet data
) WITH (
 'connector' = 'filesystem',
 'path' = '/tmp/parquet',
 'format' = 'parquet'
);

You might also want to have a look at the LIKE clause [3] to define the
schema of your parquet table if it is mostly similar to the JSON schema.
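For illustration, a LIKE-based version of the parquet table could look
roughly like the sketch below. It reuses the columns of the json table and
only overrides the connector options; treat it as an untested sketch against
the tables above, not verified DDL:

```sql
CREATE TABLE parquet WITH (
 'connector' = 'filesystem',
 'path' = '/tmp/parquet',
 'format' = 'parquet'
) LIKE json (EXCLUDING OPTIONS);
```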

INSERT INTO parquet SELECT /*transform your data*/ FROM json;
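As a concrete (hypothetical) illustration of that transform step: assuming
the json table had, say, a nested column declared as
payload ROW&lt;id BIGINT, name STRING&gt; plus a ts column, the
flattening/enrichment might look like the following. The column names here
are made up for the example; they would be whatever you define in the two
CREATE TABLE statements:

```sql
INSERT INTO parquet
SELECT
  payload.id          AS id,   -- flatten a nested field via dot access
  UPPER(payload.name) AS name, -- a trivial enrichment
  ts
FROM json;
```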

Best,

Dawid

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/json.html#how-to-create-a-table-with-json-format

[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/parquet.html#how-to-create-a-table-with-parquet-format

[3]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/create.html#create-table

On 21/08/2020 02:40, Averell wrote:
> Hello,
>
> I have a stream with each message is a JSON string with a quite complex
> schema (multiple fields, multiple nested layers), and I need to write that
> into parquet files after some slight modifications/enrichment.
>
> I wonder what options are available for me to do that. I'm thinking of JSON
> -> AVRO (GenericRecord) -> Parquet. Is that an option? I would want to be
> able to quickly/dynamically (as less code change as possible) change the
> JSON schema.
>
> Thanks and regards,
> Averell

