Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2019/11/29 11:09:26 UTC

Read multiline JSON/XML

Hi to all,
is there any out-of-the-box option to read multiline JSON or XML like in
Spark?
It would be awesome to have something like

spark.read.option("multiline", true).json("/path/to/user.json")

Best,
Flavio

Re: Read multiline JSON/XML

Posted by Suneel Marthi <sm...@apache.org>.

For XML, you could look at Mahout's XMLInputFormat (if you are using a Hadoop
InputFormat).
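
To illustrate the general idea behind a tag-scanning XML reader, here is a
minimal sketch. It is not Mahout's code: the function name iter_xml_records and
its parameters are hypothetical, and it assumes the start tag carries no
attributes and that records of the same name are not nested.

```python
def iter_xml_records(data: bytes, start_tag: bytes, end_tag: bytes):
    """Yield each byte span from start_tag through the matching end_tag."""
    position = 0
    while True:
        begin = data.find(start_tag, position)
        if begin == -1:
            return  # no further record starts in the data
        finish = data.find(end_tag, begin + len(start_tag))
        if finish == -1:
            return  # unterminated record: stop instead of emitting a fragment
        finish += len(end_tag)
        # A record may span many lines; only the tags delimit it.
        yield data[begin:finish]
        position = finish


xml = b"<users>\n<user>\n  <name>a</name>\n</user>\n<user><name>b</name></user>\n</users>"
records = list(iter_xml_records(xml, b"<user>", b"</user>"))
```

Because the tags rather than the newlines delimit a record, a record that is
pretty-printed over several lines comes back as one unit.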

Re: Read multiline JSON/XML

Posted by vino yang <ya...@gmail.com>.

Also, my apologies to Flavio!

Best,
Vino

vino yang <ya...@gmail.com> wrote on Mon, Dec 2, 2019 at 10:29 AM:

> Hi Chesnay,
>
> Sorry, yes, I missed the word "like". I mistakenly thought he wanted to
> ask how to use Spark to accomplish this job.
>
> Best,
> Vino

Re: Read multiline JSON/XML

Posted by Chesnay Schepler <ch...@apache.org>.

Why vino?

He's specifically asking whether Flink offers something _like_ spark.

Re: Read multiline JSON/XML

Posted by vino yang <ya...@gmail.com>.

Hi Flavio,

IMO, it would be more effective to ask this question on the Spark user
mailing list.

WDYT?

Best,
Vino

Re: Read multiline JSON/XML

Posted by Flavio Pompermaier <po...@okkam.it>.

Processing the files in parallel would be enough. Parallelism within a single
file would be awesome, but it's only a bonus.

Re: Read multiline JSON/XML

Posted by Arvid Heise <ar...@ververica.com>.

A while ago, I implemented XML and Json input formats. However, having
proper split support for structured formats without sync markers is not
that easy. Any split that has a random start offset needs to figure out the
start of the next record on its own, which is fragile by definition.
That's why supporting jsonl files is much easier; you just need to look for
the next newline. For the same reason, supporting json or xml in Kafka is
fairly straightforward: records are already split.
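
As a rough sketch of why the newline-delimited case is easy (not the input
format mentioned above; the function name read_jsonl_split and the byte-range
convention are assumptions made for illustration): a split that starts at an
arbitrary offset only has to skip to the next newline, and every record then
belongs to exactly one split.

```python
import io
import json


def read_jsonl_split(data: bytes, start: int, end: int):
    """Yield the JSON records whose first byte lies in the range [start, end)."""
    stream = io.BytesIO(data)
    if start > 0:
        # Step back one byte and discard through the next newline. If a record
        # begins exactly at `start`, only the preceding "\n" is discarded;
        # otherwise the partial line is skipped (the previous split reads it).
        stream.seek(start - 1)
        stream.readline()
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break  # end of data
        if line.strip():
            yield json.loads(line)


data = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n'
# Split boundaries deliberately fall in the middle of records.
splits = [(0, 7), (7, 19), (19, len(data))]
per_split = [list(read_jsonl_split(data, s, e)) for s, e in splits]
```

No comparable resynchronization point exists inside a pretty-printed JSON or
XML document, which is why a split starting at a random offset is fragile
there.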

It would be easier to support XML and JSON if we could get rid of splits. @Flavio
would you expect to get inner file parallelism or would you be fine with
processing only the files in parallel?
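
A minimal sketch of the second option, processing only the files in parallel,
in plain Python rather than Flink: each file is parsed as one unsplittable
multiline JSON document, and the parallelism comes solely from handling several
files at once. The names parse_whole_file and read_files_in_parallel are
hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_whole_file(path):
    """Parse one file as a single JSON document; the file is never split."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def read_files_in_parallel(paths, max_workers=4):
    """Parse many JSON files concurrently, one task per file."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as `paths`.
        return list(pool.map(parse_whole_file, paths))
```

The trade-off is that a single very large file is still read by one worker, so
this only helps when the input consists of many files.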

Best,

Arvid

Re: Read multiline JSON/XML

Posted by Chesnay Schepler <ch...@apache.org>.

I know that at least the Table API 
<https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#csv-format> 
can read json, but I don't know how well this translates into other APIs.
