Posted to user@spark.apache.org by Lian Jiang <ji...@gmail.com> on 2019/02/09 19:25:39 UTC

structured streaming handling validation and json flattening

Hi,

We have a structured streaming job that converts JSON records into Parquet
files. We want to validate the JSON records: if a record is not valid, we
want to log a message and refuse to write it to the Parquet output. The
JSON also contains nested JSON objects, and we want to flatten those nested
objects into separate Parquet files using the same streaming job. My
questions are:

1. How do we validate the JSON records in a structured streaming job?
2. How do we flatten the nested JSON objects in a structured streaming job?
(See the sketch below this list for what I mean by "flatten".)
3. Is it possible to use one structured streaming job to validate the JSON,
convert it into one Parquet output, and convert the nested JSON objects
into other Parquet outputs?
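
To make question 2 concrete, the flattening I have in mind is the batch
equivalent of something like the sketch below. The schema, the field names
(id, items, name, qty) and the paths are made up purely for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder.appName("flatten-illustration").getOrCreate()

// Batch illustration only; the question is how to do the equivalent inside
// a structured streaming job. Assumed input shape, one document per line:
// {"id": "...", "items": [{"name": "...", "qty": 1}, ...]}
val nested = spark.read.json("/path/to/sample.json")

// One output row per nested item, alongside the parent id.
val flat = nested
  .select(col("id"), explode(col("items")).as("item"))
  .select(col("id"), col("item.name").as("name"), col("item.qty").as("qty"))

flat.write.parquet("/path/to/flattened")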

I think the older DStream-based Spark Streaming could achieve these goals,
but Structured Streaming is what the Spark community recommends.

Appreciate your feedback!

Re: structured streaming handling validation and json flattening

Posted by Phillip Henry <lo...@gmail.com>.
Hi,

I'm in a somewhat similar situation. Here's what I do (it seems to be
working so far):

1. Stream in the JSON as a plain string.
2. Feed this string into a JSON library to validate it (I use Circe).
3. Using the same library, parse the JSON and extract fields X, Y and Z.
4. Create a dataset with fields X, Y, Z and the JSON as a String.
5. Write this dataset to HDFS as Parquet partitioned on X and sorted on Y.

Obviously, this is not exactly the same as your use case (for instance, I
have no idea what your requirements are regarding "flattening the nested
JSONs"). Also, I extract only a few fields that I use as columns in the
resulting Dataset but then store the rest of the JSON as a string. However,
the principle should be the same for you.
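
A rough sketch of those steps with Spark and Circe, where the field names
x, y and z, the paths and the checkpoint location are illustrative
assumptions rather than anything from this thread:

import io.circe.parser.parse
import org.apache.spark.sql.SparkSession

// One output row per valid input line; the raw JSON is kept as a string.
case class Record(x: String, y: Long, z: String, raw: String)

val spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()
import spark.implicits._

// 1. Stream the input in as plain strings.
val lines = spark.readStream.text("/path/to/input").as[String]

// 2-4. Validate with Circe, extract x, y and z, and build the dataset.
// Lines that fail to parse or that lack a field are dropped here; a real
// job would probably also log them or send them to a dead-letter location.
val records = lines.flatMap { line =>
  val maybeRecord = for {
    json <- parse(line).toOption
    c     = json.hcursor
    x    <- c.get[String]("x").toOption
    y    <- c.get[Long]("y").toOption
    z    <- c.get[String]("z").toOption
  } yield Record(x, y, z, line)
  maybeRecord.toSeq
}

// 5. Write to Parquet partitioned on x. The within-partition sort on y is
// omitted here: sorting is not supported on a plain streaming Dataset.
records.writeStream
  .format("parquet")
  .option("path", "/path/to/output")
  .option("checkpointLocation", "/path/to/checkpoint")
  .partitionBy("x")
  .start()

The flatMap is also the natural place to hang any extra validation rules
beyond "does it parse".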

HTH.

Phillip

Re: structured streaming handling validation and json flattening

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi Lian,

"What have you tried?" would be a good starting point. Any help on this?

How do you read the JSONs? readStream.json? You could use readStream.text
followed by filter to include/exclude good/bad JSONs.
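
A minimal sketch of that idea (the input path is an illustrative
assumption; the validity check uses Circe here only because it came up
earlier in this thread, any JSON parser would do):

import io.circe.parser.parse
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("good-vs-bad-json").getOrCreate()
import spark.implicits._

// readStream.text yields one string per input line.
val lines = spark.readStream.text("/path/to/input").as[String]

// Keep only the lines that parse as JSON; flipping the predicate gives the
// bad records, e.g. for logging or a dead-letter sink.
val good = lines.filter(line => parse(line).isRight)
val bad  = lines.filter(line => parse(line).isLeft)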

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

