Posted to user@flink.apache.org by Austin Cawley-Edwards <au...@gmail.com> on 2020/07/06 23:30:44 UTC

Decompressing Tar Files for Batch Processing

Hey all,

I need to ingest a tar file containing ~1GB of data in around 10 CSVs. The
data is fairly connected and needs some cleaning, which I'd like to do with
the Batch Table API + SQL (but have never used before). I've got a small
prototype loading the uncompressed CSVs and applying the necessary SQL,
which works well.
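
For context, a minimal sketch of that kind of setup (assuming Flink 1.11 with the Blink planner in batch mode; the table name, columns, and path are made up for illustration):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class CsvBatchSqlSketch {
    public static void main(String[] args) {
        // Blink planner in batch mode
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inBatchMode()
                .build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Register one of the uncompressed CSVs as a table
        // (schema and path are placeholders)
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id STRING," +
                "  customer_id STRING," +
                "  amount DOUBLE" +
                ") WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path' = 'file:///data/extracted/orders.csv'," +
                "  'format' = 'csv'" +
                ")");

        // Apply the cleaning SQL and print a sample of the result
        Table cleaned = tEnv.sqlQuery(
                "SELECT order_id, customer_id, amount FROM orders WHERE amount IS NOT NULL");
        cleaned.execute().print();
    }
}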

I'm wondering about the task of downloading the tar file and unzipping it
into the CSVs. Does this sound like something I can/should do in Flink, or
should I set up another process to download, unzip, and store in a
filesystem to then read with the Flink Batch job? My research is leading me
towards doing it separately, but I'd like to do it all in the same job if
there's a creative way.

Thanks!
Austin

Re: Decompressing Tar Files for Batch Processing

Posted by Austin Cawley-Edwards <au...@gmail.com>.
Hey Chesnay,

Thanks for the advice, and easy enough to do it in a separate process.

Best,
Austin

On Tue, Jul 7, 2020 at 10:29 AM Chesnay Schepler <ch...@apache.org> wrote:

> I would probably go with a separate process.
>
> Downloading the file could work with Flink if it is already present in
> some supported filesystem. Decompressing the file is supported for
> selected formats (deflate, gzip, bz2, xz), but this seems to be an
> undocumented feature, so I'm not sure how usable it is in reality.
>
> On 07/07/2020 01:30, Austin Cawley-Edwards wrote:
> > Hey all,
> >
> > I need to ingest a tar file containing ~1GB of data in around 10 CSVs.
> > The data is fairly connected and needs some cleaning, which I'd like
> > to do with the Batch Table API + SQL (but have never used before).
> > I've got a small prototype loading the uncompressed CSVs and applying
> > the necessary SQL, which works well.
> >
> > I'm wondering about the task of downloading the tar file and unzipping
> > it into the CSVs. Does this sound like something I can/should do in
> > Flink, or should I set up another process to download, unzip, and
> > store in a filesystem to then read with the Flink Batch job? My
> > research is leading me towards doing it separately but I'd like to do
> > it all in the same job if there's a creative way.
> >
> > Thanks!
> > Austin
>
>
>

Re: Decompressing Tar Files for Batch Processing

Posted by Chesnay Schepler <ch...@apache.org>.
I would probably go with a separate process.

Downloading the file could work with Flink if it is already present in 
some supported filesystem. Decompressing the file is supported for 
selected formats (deflate, gzip, bz2, xz), but this seems to be an 
undocumented feature, so I'm not sure how usable it is in reality.
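
If it helps, a rough, untested sketch of how that looks with the DataSet API (the path and column types are made up). Note that only the compression layer is handled transparently: the tar archive itself would still have to be unpacked separately, and compressed files are not splittable, so each one is read by a single task.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class GzippedCsvSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The decompressor is picked based on the file extension
        // (.deflate, .gz, .bz2, .xz), so the gzipped CSV is read directly.
        DataSet<Tuple3<String, String, Double>> records = env
                .readCsvFile("file:///data/input/orders.csv.gz")
                .types(String.class, String.class, Double.class);

        records.first(10).print();
    }
}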

On 07/07/2020 01:30, Austin Cawley-Edwards wrote:
> Hey all,
>
> I need to ingest a tar file containing ~1GB of data in around 10 CSVs. 
> The data is fairly connected and needs some cleaning, which I'd like 
> to do with the Batch Table API + SQL (but have never used before). 
> I've got a small prototype loading the uncompressed CSVs and applying 
> the necessary SQL, which works well.
>
> I'm wondering about the task of downloading the tar file and unzipping 
> it into the CSVs. Does this sound like something I can/should do in 
> Flink, or should I set up another process to download, unzip, and 
> store in a filesystem to then read with the Flink Batch job? My 
> research is leading me towards doing it separately but I'd like to do 
> it all in the same job if there's a creative way.
>
> Thanks!
> Austin



Re: Decompressing Tar Files for Batch Processing

Posted by Austin Cawley-Edwards <au...@gmail.com>.
On Tue, Jul 7, 2020 at 10:53 AM Austin Cawley-Edwards <
austin.cawley@gmail.com> wrote:

> Hey Xiaolong,
>
> Thanks for the suggestions. Just to make sure I understand, are you saying
> to run the download and decompression in the Job Manager before executing
> the job?
>
> I think another way to ensure the tar file is not downloaded more than
> once is a source w/ parallelism 1. The issue I can't get past is: after
> decompressing the tarball, how would I pass those OutputStreams for each
> entry through Flink?
>
> Best,
> Austin
>
>
>
> On Tue, Jul 7, 2020 at 5:56 AM Xiaolong Wang <xi...@smartnews.com>
> wrote:
>
>> It seems to me that it cannot be done by Flink, since the code will be
>> run across all task managers. That way, there would be multiple downloads of
>> your tar file, which is unnecessary.
>>
>> However, you can do it in your code before initializing the Flink runtime,
>> and that code will run only on the client side.
>>
>> On Tue, Jul 7, 2020 at 7:31 AM Austin Cawley-Edwards <
>> austin.cawley@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> I need to ingest a tar file containing ~1GB of data in around 10 CSVs.
>>> The data is fairly connected and needs some cleaning, which I'd like to do
>>> with the Batch Table API + SQL (but have never used before). I've got a
>>> small prototype loading the uncompressed CSVs and applying the necessary
>>> SQL, which works well.
>>>
>>> I'm wondering about the task of downloading the tar file and unzipping
>>> it into the CSVs. Does this sound like something I can/should do in Flink,
>>> or should I set up another process to download, unzip, and store in a
>>> filesystem to then read with the Flink Batch job? My research is leading me
>>> towards doing it separately but I'd like to do it all in the same job if
>>> there's a creative way.
>>>
>>> Thanks!
>>> Austin
>>>
>>
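
For the client-side route Xiaolong describes, a rough sketch (assuming a gzip-compressed tarball and Apache Commons Compress on the classpath; the URL and paths are made up) would be to download and unpack the archive before building the Flink job, then point the batch sources at the extracted CSVs:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public class DownloadAndExtractSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder locations: replace with the real URL and a path on a
        // filesystem the Flink cluster can read from.
        String tarUrl = "https://example.com/data.tar.gz";
        Path targetDir = Paths.get("/data/extracted");
        Files.createDirectories(targetDir);

        try (InputStream raw = new URL(tarUrl).openStream();
             TarArchiveInputStream tar =
                     new TarArchiveInputStream(new GzipCompressorInputStream(raw))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // A real implementation should validate entry names before resolving them.
                Path out = targetDir.resolve(entry.getName());
                Files.createDirectories(out.getParent());
                try (OutputStream os = Files.newOutputStream(out)) {
                    IOUtils.copy(tar, os);
                }
            }
        }

        // ...then build and execute the Flink batch job, pointing its CSV
        // sources at file:///data/extracted/*.csv
    }
}

Writing each tar entry out to files on a shared filesystem also side-steps the question of passing OutputStreams through Flink: the batch job simply reads the extracted CSVs.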