Posted to user@flink.apache.org by Billy Bain <bi...@gmail.com> on 2020/12/28 13:40:42 UTC

read a tarred + gzipped file flink 1.12

We have an input file that is tarred and compressed to 12gb. It is about
50gb uncompressed.

With readTextFile(), I see Flink decompress the gzip layer, but it doesn't
seem to handle the tar layer. The archive contains just a single file. (We
don't control the input format.)

foo.tar.gz 12gb
foo.tar  50gb
then untar it and it is valid jsonl

When reading, we get this exception:

Caused by:
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException:
Unrecognized token 'playstore': was expecting (JSON String, Number, Array,
Object or token 'null', 'true' or 'false')
 at [Source: UNKNOWN; line: 1, column: 10]
at
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)

The process is seeing the header in the tar format and rightly complaining
about the JSON format.

Is it possible to untar this file using Flink?

-- 
Wayne D. Young
aka Billy Bob Bain
billybobbain@gmail.com

Re: read a tarred + gzipped file flink 1.12

Posted by Arvid Heise <ar...@ververica.com>.
Hi Billy,

I suspect that it's not possible in Flink as is. A tar file acts as a
directory containing an arbitrary number of files. Afaik, Flink assumes
that all compressed files are just single files, like gz without tar. Your
archive does hold only a single file, but then the tar part doesn't make
much sense.

Since you cannot control the input, you have two options:
* External process that unpacks the file and then calls Flink.
* Implement your own input format similar to [1].

[1]
https://stackoverflow.com/questions/49122170/zip-compressed-input-for-apache-flink
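
Either option needs the same core step: strip the gzip layer, then walk the
512-byte tar headers and hand only the file payload to the JSON parser. Below
is a minimal sketch of that step using only the JDK. The class and method
names (TarGzLines, forEachLine) are hypothetical and not part of Flink. It
assumes plain ustar/GNU headers and regular files; it does not verify header
checksums or apply pax extended-header overrides, so a library such as Apache
Commons Compress is probably the safer choice for real inputs.

```java
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: streams the lines of every regular file in a .tar.gz. */
final class TarGzLines {
    private static final int BLOCK = 512; // tar headers and padding use 512-byte blocks

    static void forEachLine(InputStream tarGz, Consumer<String> sink) throws IOException {
        DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz));
        byte[] header = new byte[BLOCK];
        while (readHeader(in, header)) {
            long size = parseSize(header);
            byte type = header[156]; // typeflag: '0' or NUL means regular file
            if (type == '0' || type == 0) {
                // Bounded stops at the end of this entry, so buffering cannot over-read.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new Bounded(in, size), StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    sink.accept(line);
                }
            } else {
                skipFully(in, size); // directories, links, extension headers: ignored here
            }
            skipFully(in, (BLOCK - size % BLOCK) % BLOCK); // payload is padded to a full block
        }
    }

    /** Reads one header block; returns false at end of stream or at an all-zero block. */
    private static boolean readHeader(DataInputStream in, byte[] header) throws IOException {
        int first = in.read();
        if (first < 0) return false;
        header[0] = (byte) first;
        in.readFully(header, 1, BLOCK - 1);
        for (byte b : header) {
            if (b != 0) return true;
        }
        return false;
    }

    /** Size field: 12 bytes at offset 124, octal text or GNU base-256 for entries over 8 GB. */
    static long parseSize(byte[] h) {
        if ((h[124] & 0x80) != 0) {
            long v = 0;
            for (int i = 125; i < 136; i++) v = (v << 8) | (h[i] & 0xFF); // high bytes assumed zero
            return v;
        }
        String octal = new String(h, 124, 12, StandardCharsets.US_ASCII).trim();
        return octal.isEmpty() ? 0 : Long.parseLong(octal, 8);
    }

    private static void skipFully(InputStream in, long n) throws IOException {
        byte[] scratch = new byte[8192];
        while (n > 0) {
            int r = in.read(scratch, 0, (int) Math.min(scratch.length, n));
            if (r < 0) throw new EOFException("truncated tar archive");
            n -= r;
        }
    }

    /** Exposes at most 'limit' bytes of the underlying stream. */
    private static final class Bounded extends FilterInputStream {
        private long remaining;
        Bounded(InputStream in, long limit) { super(in); remaining = limit; }
        @Override public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = in.read();
            if (b >= 0) remaining--;
            return b;
        }
        @Override public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) return -1;
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }
        @Override public int available() throws IOException {
            return (int) Math.min(remaining, in.available());
        }
    }
}
```

For the external-process option, the sink could simply write the lines to a
plain file that readTextFile() then picks up. For a custom input format, the
same loop would sit behind the format's record-reading methods. Either way,
gzip is not splittable, so this file will be read by a single task.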



-- 

Arvid Heise | Senior Java Developer
