Posted to user@flink.apache.org by Averell <lv...@gmail.com> on 2020/08/21 00:40:52 UTC
JSON to Parquet
Hello,
I have a stream in which each message is a JSON string with a quite complex
schema (multiple fields, multiple nested layers), and I need to write that
into Parquet files after some slight modifications/enrichment.
I wonder what options are available to me for doing that. I'm thinking of
JSON -> Avro (GenericRecord) -> Parquet. Is that an option? I would want to
be able to change the JSON schema quickly/dynamically (with as little code
change as possible).
Thanks and regards,
Averell
--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: JSON to Parquet
Posted by Averell <lv...@gmail.com>.
Hi Dawid,
Thanks for the suggestion. So, basically, I'll need to use the JSON format
to get the JSON strings into Rows, and then go from Rows to Parquet records
using the parquet connector?
I have never tried the Table API before; I have been using the Streaming
API only. I will follow your suggestion now.
Thanks for your help.
Regards,
Averell
--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: JSON to Parquet
Posted by Dawid Wysakowicz <dw...@apache.org>.
Hi Averell,
If you can describe the JSON schema, I'd suggest looking into the SQL
API [1][2]. (And I think you do need to define your schema upfront; if I am
not mistaken, Parquet must know the common schema.)
Then you could do something like:
CREATE TABLE json (
  -- define the schema of your json data
) WITH (
  ...
  'format' = 'json',
  'json.fail-on-missing-field' = 'false',
  'json.ignore-parse-errors' = 'true'
);
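For a deeply nested payload like the one described, the schema section can use ROW and ARRAY types to mirror the JSON nesting. A minimal sketch, assuming a Kafka source; the field names, topic, and connector options here are all hypothetical:

```sql
-- Sketch only: field names and Kafka settings are assumptions, not from the thread.
CREATE TABLE json (
  id      BIGINT,
  payload ROW<
    name    STRING,
    address ROW<street STRING, city STRING>
  >,
  tags    ARRAY<STRING>
) WITH (
  'connector' = 'kafka',
  'topic' = 'input-topic',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json',
  'json.fail-on-missing-field' = 'false',
  'json.ignore-parse-errors' = 'true'
);
```

Nested fields are then addressed with dot notation in queries, e.g. payload.address.city.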
CREATE TABLE parquet (
  -- define the schema of your parquet data
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/parquet',
  'format' = 'parquet'
);
You might also want to have a look at the LIKE clause [3] to define the
schema of your parquet table if it is mostly similar to the json schema.
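For example, the parquet sink could derive its columns from the json table while replacing the connector options. A sketch, assuming the json table above exists; see [3] for the exact LIKE options:

```sql
-- Sketch: copies the json table's columns, drops its connector options,
-- and supplies filesystem/parquet options instead.
CREATE TABLE parquet WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/parquet',
  'format' = 'parquet'
) LIKE json (EXCLUDING OPTIONS);
```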
INSERT INTO parquet SELECT /*transform your data*/ FROM json;
Best,
Dawid
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/json.html#how-to-create-a-table-with-json-format
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/parquet.html#how-to-create-a-table-with-parquet-format
[3]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/create.html#create-table
On 21/08/2020 02:40, Averell wrote:
> Hello,
>
> I have a stream with each message is a JSON string with a quite complex
> schema (multiple fields, multiple nested layers), and I need to write that
> into parquet files after some slight modifications/enrichment.
>
> I wonder what options are available for me to do that. I'm thinking of JSON
> -> AVRO (GenericRecord) -> Parquet. Is that an option? I would want to be
> able to quickly/dynamically (as less code change as possible) change the
> JSON schema.
>
> Thanks and regards,
> Averell
>
>
>