Posted to user@spark.apache.org by Lian Jiang <ji...@gmail.com> on 2018/04/23 18:46:26 UTC

schema change for structured spark streaming using jsonl files

Hi,

I am using Spark Structured Streaming, which reads jsonl files and writes
parquet files. I am wondering what the process is when the jsonl files'
schema changes.

Suppose the jsonl files are generated in the \jsonl folder and the old
schema is {"field1": String}. My proposal is:

1. Write the jsonl files with the new schema (e.g. {"field1": String,
"field2": Int}) into another folder, \jsonl2.
2. Let the Spark job finish handling all data in \jsonl, then stop the
streaming job.
3. Use a Spark script to convert the parquet files from the old schema to
the new schema (e.g. add a new column with some default value for
"field2"), as sketched below.
4. Upgrade and restart the Spark streaming job to handle the new-schema
jsonl files and parquet files.

Is this the correct (or best) process? Thanks for any clue.
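
To make step 3 concrete, here is a minimal Scala sketch of the backfill;
the paths (/data/parquet, /data/parquet_v2) and the default of 0 for
"field2" are illustrative assumptions, not something fixed by this thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("parquet-backfill").getOrCreate()

    // Read the parquet written under the old schema ({"field1": String}).
    val oldData = spark.read.parquet("/data/parquet")   // hypothetical path

    // Add the new column with a default so old rows match the new schema.
    val migrated = oldData.withColumn("field2", lit(0))

    // Write to a fresh location and swap directories once the job succeeds;
    // overwriting the directory you are reading from is unsafe.
    migrated.write.parquet("/data/parquet_v2")          // hypothetical path

And a sketch of step 4, restarting the stream against the \jsonl2 folder
with the new schema declared explicitly (file sources need a user-supplied
schema unless spark.sql.streaming.schemaInference is enabled); the
checkpoint location is again an assumed name:

    import org.apache.spark.sql.types._

    val newSchema = StructType(Seq(
      StructField("field1", StringType),
      StructField("field2", IntegerType)
    ))

    val query = spark.readStream
      .schema(newSchema)
      .json("/jsonl2")                          // the new input folder from step 1
      .writeStream
      .format("parquet")
      .option("path", "/data/parquet_v2")       // hypothetical output path
      .option("checkpointLocation", "/chk_v2")  // fresh checkpoint for the upgraded query
      .start()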

Re: schema change for structured spark streaming using jsonl files

Posted by Michael Segel <ms...@hotmail.com>.
Hi,

This is going to sound complicated.

Taken as an individual JSON document, it is structured, because each document is a self-contained schema doc. However, there isn't a persistent schema that has to be consistent across multiple documents, so you can consider it semi-structured.

If you're parsing the JSON document and storing different attributes in separate columns, you will have a major issue, because it's possible for a JSON document to contain a new element that isn't in your Parquet schema.
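
A tiny sketch of that failure mode in Scala, assuming an active
SparkSession named spark (the sample line is made up): parsing with a
fixed schema silently drops any element the schema does not mention.

    import org.apache.spark.sql.types._
    import spark.implicits._   // for .toDS()

    val oldSchema = StructType(Seq(StructField("field1", StringType)))

    // A new-style document carries "field2", but parsing with the old
    // schema keeps "field1" and silently discards the rest.
    val lines = Seq("""{"field1": "a", "field2": 7}""").toDS()
    spark.read.schema(oldSchema).json(lines).show()
    // only the "field1" column appears in the output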

If you are going from JSON to Parquet, you will probably be better off storing a serialized version of the JSON doc and then storing selected attributes in separate columns.
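
A minimal sketch of that layout, again assuming a SparkSession named
spark; the paths, column names, and the int cast are illustrative:

    import org.apache.spark.sql.functions.{col, get_json_object}

    // Read each line as plain text so elements missing from any schema are
    // never lost; the text source yields a single column named "value".
    val raw = spark.readStream.text("/jsonl")

    // Keep the serialized document and promote selected attributes to
    // columns; get_json_object returns null when a path is absent, so old
    // and new documents can live side by side in the same parquet output.
    val parsed = raw
      .withColumn("field1", get_json_object(col("value"), "$.field1"))
      .withColumn("field2", get_json_object(col("value"), "$.field2").cast("int"))
      .withColumnRenamed("value", "raw_json")

    val query = parsed.writeStream
      .format("parquet")
      .option("path", "/data/parquet")      // hypothetical output path
      .option("checkpointLocation", "/chk")
      .start()

A later schema change then only touches how attributes are promoted; the
raw_json column already holds everything.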

HTH

-Mike


Re: schema change for structured spark streaming using jsonl files

Posted by Lian Jiang <ji...@gmail.com>.
Thanks for any help!
