Posted to user@flink.apache.org by Billy Bain <bi...@gmail.com> on 2020/12/28 13:40:42 UTC
read a tarred + gzipped file flink 1.12
We have an input file that is tarred and gzipped to 12 GB. It is about
50 GB uncompressed.
With readTextFile(), I see Flink decompress the gzip layer, but it doesn't
seem to handle the tar layer. The archive contains just a single file. (We
don't control the input format.)
foo.tar.gz 12 GB
foo.tar 50 GB
Once foo.tar is untarred, the result is valid JSONL.
When reading, we get this exception:
Caused by: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'playstore': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: UNKNOWN; line: 1, column: 10]
 at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)
The JSON parser is being handed the tar header and rightly rejects it as
invalid JSON.
Is it possible to untar this file using Flink?
--
Wayne D. Young
aka Billy Bob Bain
billybobbain@gmail.com
Re: read a tarred + gzipped file flink 1.12
Posted by Arvid Heise <ar...@ververica.com>.
Hi Billy,
I suspect that it's not possible in Flink as is. A tar file acts as a
directory containing an arbitrary number of files. Afaik, Flink assumes
that all compressed files are just single files, like gz without tar. That
is effectively true in your case, but then the tar layer doesn't add much.
Since you cannot control the input, you have two options:
* External process that unpacks the file and then calls Flink.
* Implement your own input format similar to [1].
[1]
https://stackoverflow.com/questions/49122170/zip-compressed-input-for-apache-flink
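
Either option needs code that strips the tar framing after the gzip layer. The following is a minimal sketch of that step using only the JDK; it is not Flink API, and the names TarGzLines and forEachLine are hypothetical. It relies on the basic tar layout: 512-byte header blocks, the entry size as an octal string at offset 124, the type flag at offset 156, and data padded to a multiple of 512 bytes. It assumes UTF-8 text and does not interpret pax extended headers or GNU long names, so a complete parser such as TarArchiveInputStream from Apache Commons Compress is likely the safer choice for real input.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper (not part of Flink): streams lines out of a .tar.gz. */
public class TarGzLines {
    private static final int BLOCK = 512;

    /** Calls handler once per text line of every regular file in the archive. */
    public static void forEachLine(InputStream tarGz, Consumer<String> handler) throws IOException {
        InputStream tar = new GZIPInputStream(tarGz);
        byte[] header = new byte[BLOCK];
        while (readBlock(tar, header) && !isAllZero(header)) { // zero block = end of archive
            long size = parseSize(header, 124, 12);
            byte type = header[156];
            if (type == '0' || type == 0) { // regular file
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new Bounded(tar, size), StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    handler.accept(line);
                }
            } else {
                skipFully(tar, size); // directories, links, extended headers: ignored here
            }
            skipFully(tar, (BLOCK - size % BLOCK) % BLOCK); // padding up to the next block
        }
    }

    /** Size field: octal text, or GNU base-256 when the first byte has its high bit set. */
    static long parseSize(byte[] h, int off, int len) {
        if ((h[off] & 0x80) != 0) {
            long v = h[off] & 0x7F;
            for (int i = 1; i < len; i++) v = (v << 8) | (h[off + i] & 0xFF);
            return v;
        }
        String s = new String(h, off, len, StandardCharsets.US_ASCII).replace('\0', ' ').trim();
        return s.isEmpty() ? 0 : Long.parseLong(s, 8);
    }

    private static boolean readBlock(InputStream in, byte[] block) throws IOException {
        int off = 0;
        while (off < block.length) {
            int n = in.read(block, off, block.length - off);
            if (n < 0) {
                if (off == 0) return false; // clean end of stream
                throw new EOFException("truncated tar header");
            }
            off += n;
        }
        return true;
    }

    private static boolean isAllZero(byte[] block) {
        for (byte b : block) if (b != 0) return false;
        return true;
    }

    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) { // skip() may make no progress; fall back to read()
                if (in.read() < 0) throw new EOFException("truncated tar stream");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    /** Exposes at most `remaining` bytes of the underlying stream and never closes it. */
    private static final class Bounded extends InputStream {
        private final InputStream in;
        private long remaining;
        Bounded(InputStream in, long size) { this.in = in; this.remaining = size; }
        @Override public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = in.read();
            if (b >= 0) remaining--;
            return b;
        }
        @Override public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) return -1;
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }
    }
}
```

An external unpacking step could call forEachLine on a FileInputStream for foo.tar.gz and write the lines to a plain file for readTextFile(). A custom input format could wrap its input stream the same way. Because gzip is not splittable, either route reads the archive with a single reader.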
--
Arvid Heise | Senior Java Developer