Posted to user@flink.apache.org by madan <ma...@gmail.com> on 2019/02/04 17:12:04 UTC

Regarding json/xml/csv file splitting

Hi,

Can someone please tell me how to split a JSON/XML data file? Since these
formats are structured (i.e., they have a parent/child hierarchy), is it
possible to split the file and process it in parallel with 2 or more
instances of the source operator?
Also, please confirm whether my understanding of CSV splitting, described
below, is correct.

When the parallelism is greater than 1, the file is split into roughly equal
parts, and each operator instance is given the start position of its file
partition. The start position of a partition may fall in the middle of a
delimited line, as shown below. When reading starts, the initial partial
record is ignored by the respective operator instance, which then reads the
full records that follow, i.e.,
# Operator1 reads the emp1 and emp2 records (it reads emp2 because that
record's starting character position fell within its reading range)
# Operator2 ignores the partial emp2 record and reads emp3 and emp4
# Operator3 ignores the partial emp4 record and reads emp5
The record delimiter is used to skip the partial record and to identify the
start of a new record (a rough sketch of this logic follows the figure below).

[image: csv_reader_positions.jpg]
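
Here is a rough standalone sketch of that skip-to-delimiter behavior (plain
Java, for illustration only - not Flink's actual DelimitedInputFormat code;
the file name and split offsets are made up):

import java.io.IOException;
import java.io.RandomAccessFile;

// Illustration only: reads one byte-range split of a newline-delimited file.
// A record belongs to the split in which it starts, so the reader skips the
// partial record at the start (unless the split begins at offset 0) and may
// read past the split end to finish the record that straddles it.
public class SplitReaderSketch {

    public static void readSplit(String path, long splitStart, long splitLength) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long pos = splitStart;
            file.seek(pos);

            // Not at the beginning of the file: the partial first record belongs
            // to the previous split, so skip ahead to the next delimiter.
            // (Real readers also handle the edge case where a split starts exactly
            // at a record boundary, so nothing is read twice or dropped.)
            if (splitStart > 0) {
                int b;
                while ((b = file.read()) != -1 && b != '\n') {
                    pos++;
                }
                pos++; // now positioned just after the delimiter
            }

            long splitEnd = splitStart + splitLength;
            // Keep reading while the next record *starts* inside this split;
            // the last record may extend beyond splitEnd.
            while (pos < splitEnd && pos < file.length()) {
                StringBuilder record = new StringBuilder();
                int b;
                while ((b = file.read()) != -1 && b != '\n') {
                    record.append((char) b);
                    pos++;
                }
                pos++; // account for the delimiter (or end of file)
                System.out.println("record: " + record);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical 3 KB file read as the second of three 1 KB splits.
        readSplit("employees.csv", 1024, 1024);
    }
}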



-- 
Thank you,
Madan.

Re: Regarding json/xml/csv file splitting

Posted by Ken Krugler <kk...@transpac.com>.
Normally, parallel processing of text input files is handled via Hadoop's TextInputFormat, which supports splitting files on line boundaries at (roughly) HDFS block boundaries.

There are various XML Hadoop InputFormats available, which try to sync up with splittable locations. The one I’ve used in the past <https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java> is part of the Mahout project.
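
Wiring a Hadoop InputFormat like that into a Flink job would look roughly like the following. This is a sketch from memory using the flink-hadoop-compatibility HadoopInputFormat wrapper; the <employee> tags, the path, and the xmlinput.start/xmlinput.end configuration keys are what I recall from Mahout's XmlInputFormat, so double-check against the linked source. The same wrapper works for TextInputFormat and other Hadoop input formats.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlSplitJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        // XmlInputFormat splits on a start/end tag pair; the config key names
        // below are from memory, so verify them against the Mahout source.
        job.getConfiguration().set("xmlinput.start", "<employee>");
        job.getConfiguration().set("xmlinput.end", "</employee>");
        FileInputFormat.addInputPath(job, new Path("hdfs:///data/employees.xml")); // made-up path

        // Wrap the Hadoop InputFormat so Flink can generate splits and read them in parallel.
        HadoopInputFormat<LongWritable, Text> xmlFormat =
            new HadoopInputFormat<>(new XmlInputFormat(), LongWritable.class, Text.class, job);

        DataSet<Tuple2<LongWritable, Text>> records = env.createInput(xmlFormat);

        // Each Text value is one <employee>...</employee> fragment; parse it in a map function.
        records.map(t -> t.f1.toString()).print();
    }
}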

If each JSON record is on its own line, then you can just use a regular source, and parse each line in a subsequent map function.
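
E.g. something like this untested sketch (the path and the "name" field are made up), using readTextFile as the source and Jackson in the map function:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JsonLinesJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(3); // the file is split across three parallel readers

        // readTextFile splits on line boundaries, so every element is one complete JSON record.
        DataStream<String> lines = env.readTextFile("hdfs:///data/employees.jsonl"); // made-up path

        DataStream<String> names = lines.map(new RichMapFunction<String, String>() {
            private transient ObjectMapper mapper;

            @Override
            public void open(Configuration parameters) {
                mapper = new ObjectMapper(); // one parser instance per parallel subtask
            }

            @Override
            public String map(String line) throws Exception {
                JsonNode record = mapper.readTree(line);
                return record.get("name").asText(); // "name" is a hypothetical field
            }
        });

        names.print();
        env.execute("json-lines-example");
    }
}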

Otherwise you can still create a custom input format, as long as there’s some unique JSON that identifies the beginning/end of each record.

See https://stackoverflow.com/questions/18593595/custom-inputformat-for-reading-json-in-hadoop
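
With Flink you could also do this without dropping down to Hadoop, by extending DelimitedInputFormat with a record delimiter that only occurs between records. A rough, untested sketch - it assumes every record is separated from the next by a blank line; if your data instead has a unique start/end token, the same idea applies with that token as the delimiter:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.io.DelimitedInputFormat;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

// Sketch of a custom input format for multi-line JSON where records are
// separated by a blank line. DelimitedInputFormat takes care of the split
// handling (skipping the partial record at a split start, reading past the
// split end), exactly as with line-delimited text.
public class JsonRecordInputFormat extends DelimitedInputFormat<String> {

    public JsonRecordInputFormat() {
        // The delimiter must be a byte sequence that only appears between records.
        setDelimiter("\n\n".getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public String readRecord(String reuse, byte[] bytes, int offset, int numBytes) throws IOException {
        // The delimiter itself is not included in the passed bytes.
        return new String(bytes, offset, numBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> jsonRecords =
            env.readFile(new JsonRecordInputFormat(), "hdfs:///data/employees.json"); // made-up path
        jsonRecords.print(); // parse each String with a JSON parser downstream
    }
}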

And failing that, you can always build a list of file paths as your input, and then in your map function explicitly open/read each file and process it as you would any JSON file. In a past project where we had a similar requirement, the only interesting challenge was building N lists of files (for N mappers) where the sum of file sizes was roughly equal for each parallel map parser, as there was significant skew in the file sizes.
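
The balancing itself can be as simple as a greedy "largest file into the currently smallest list" pass, e.g.:

import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Greedy balancing: sort files by size (largest first) and always add the next
// file to the bucket with the smallest total size so far. Not optimal, but it
// keeps the per-reader workload roughly equal even with skewed file sizes.
public class BalancedFileLists {

    static class Bucket {
        final List<String> paths = new ArrayList<>();
        long totalBytes = 0;
    }

    public static List<List<String>> split(List<File> files, int n) {
        PriorityQueue<Bucket> buckets =
            new PriorityQueue<>(n, Comparator.comparingLong(b -> b.totalBytes));
        for (int i = 0; i < n; i++) {
            buckets.add(new Bucket());
        }

        files.sort(Comparator.comparingLong(File::length).reversed());
        for (File f : files) {
            Bucket smallest = buckets.poll();      // bucket with the fewest bytes so far
            smallest.paths.add(f.getPath());
            smallest.totalBytes += f.length();
            buckets.add(smallest);                 // re-insert with updated size
        }

        List<List<String>> result = new ArrayList<>();
        for (Bucket b : buckets) {
            result.add(b.paths);
        }
        return result;
    }
}

Each of the N lists then becomes the input for one parallel map instance, which opens and parses its own files.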

— Ken



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra