Posted to user@flink.apache.org by galantaa <al...@gmail.com> on 2018/03/13 14:08:01 UTC

Too many open files on Bucketing sink

Hey all,
I'm using a bucketing sink with a bucketer that creates a partition per
customer per day.
I sink the files to S3.
It's supposed to work on around 500 files at the same time (according to my
partitioning).
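
For context, a minimal sketch of what such a bucketer can look like against the 1.4-era BucketingSink API (the Event type and its getCustomerId() accessor are assumptions; the original post does not show them):

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;

// Assumed event type; the real one is not shown in the thread.
class Event implements java.io.Serializable {
    private final String customerId;
    Event(String customerId) { this.customerId = customerId; }
    String getCustomerId() { return customerId; }
}

// One bucket (directory) per customer per day, e.g.
// <base>/customer=42/date=2018-03-13/
class CustomerDayBucketer implements Bucketer<Event> {
    @Override
    public Path getBucketPath(Clock clock, Path basePath, Event element) {
        String day = new SimpleDateFormat("yyyy-MM-dd")
                .format(new Date(clock.currentTimeMillis()));
        return new Path(basePath,
                "customer=" + element.getCustomerId() + "/date=" + day);
    }
}

Note that each parallel sink subtask keeps its own in-progress part file open per active bucket, so the number of open files to expect is roughly the ~500 active customer/day buckets multiplied by the sink parallelism.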

I have a critical problem of 'Too many open files'.
I've deployed two taskmanagers, each with 16 slots. I checked how many open
files (or file descriptors) exist with 'lsof | wc -l' and it reached
over a million on each taskmanager!

After that, I decreased the number of task slots to 8 (4 on each taskmanager),
and the concurrency dropped.
Checking 'lsof | wc -l' then gave around 250k files on each machine.
I also checked how many actual files exist in my tmp dir (the sink works on the
files there before uploading them to S3): around 3,000.
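
As an aside (not from the original thread): plain 'lsof | wc -l' counts every output row, including 'mem' entries for memory-mapped jars and shared libraries, and rows duplicated across processes, so it can land far above the number of descriptors any single process actually holds. A sketch of a per-process cross-check from inside the JVM, assuming a HotSpot JVM on Linux/macOS (com.sun.management is not a portable API):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCount {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // On HotSpot on Unix-like systems, the bean exposes fd counters.
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                    + " / max: " + unix.getMaxFileDescriptorCount());
        }
    }
}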

I think that each task slot works with several threads (maybe 16?), and each
thread holds an fd for the actual file, and that's how the numbers get so
high.

Is that a known problem? Is there anything I can do?
For now, I filter just 10 customers and it works great, but I have to find a
real solution so I can stream all the data.
Maybe I could also work with a single task slot per machine, but I'm not sure
that's a good idea.
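
One mitigation that existed in the 1.4-era BucketingSink API (a sketch, not advice from the thread; the thresholds below are illustrative assumptions) is to close part files for buckets that go idle, so fds for stale customer/day buckets are released:

import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

// assuming `events` is the DataStream<Event> built earlier in the job:
BucketingSink<Event> sink =
        new BucketingSink<>("s3://my-bucket/output"); // hypothetical base path
sink.setBucketer(new CustomerDayBucketer());          // the bucketer sketched above
sink.setBatchSize(128L * 1024L * 1024L);              // roll part files at 128 MB
sink.setInactiveBucketCheckInterval(60_000L);         // scan for idle buckets every minute
sink.setInactiveBucketThreshold(5L * 60_000L);        // close part files idle for 5 minutes
events.addSink(sink);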

Thank you very much,
Alon 



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Too many open files on Bucketing sink

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

There is a similar open issue: https://issues.apache.org/jira/browse/FLINK-8707

It’s still under investigation, and it would be helpful if you could follow up in the discussion there and run the same diagnostic commands as Alexander Gardner did (mainly, attaching the output of the lsof command for the TaskManagers).

Last time I looked into it, most of the open files came from loading dependency jars for the operators. It seemed like each task/task slot was executed in a separate class loader, so the same dependency was being loaded multiple times over and over again.
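
A minimal illustration of that mechanism (a sketch, not Flink's actual code): each live URLClassLoader holds its own open file handle to the jars on its classpath, so N loaders over the same jar pin roughly N descriptors.

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class ClassLoaderFdDemo {
    public static void main(String[] args) throws Exception {
        // Pass the path of any jar on disk, e.g. a Flink dependency jar.
        URL jar = new File(args[0]).toURI().toURL();
        List<URLClassLoader> loaders = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            URLClassLoader loader = new URLClassLoader(new URL[] { jar });
            loader.findResource("anything"); // forces the jar to be opened
            loaders.add(loader);             // keep loaders alive, as running tasks would
        }
        // On Linux, 'ls /proc/<pid>/fd' now shows ~16 extra fds,
        // all pointing at the same jar file.
        System.in.read();                    // pause so the fd table can be inspected
    }
}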

Thanks, Piotrek

> On 14 Mar 2018, at 19:52, Felix Cheung <fe...@hotmail.com> wrote:
> 
> [...]

Re: Too many open files on Bucketing sink

Posted by Felix Cheung <fe...@hotmail.com>.
I have seen this before as well.

My workaround was to limit the parallelism, but that has the unfortunate effect of also limiting the number of processing tasks (and so slowing things down).

Another alternative is to use bigger buckets (and a smaller number of buckets).

Not sure if there is a good solution.
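
Sketches of those two suggestions (assuming the same hypothetical Event type as above): bucket per day only, so there are fewer, bigger buckets, and cap the sink's parallelism independently of the rest of the job.

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;

// Fewer, bigger buckets: one bucket per day for all customers.
class DayOnlyBucketer implements Bucketer<Event> {
    @Override
    public Path getBucketPath(Clock clock, Path basePath, Event element) {
        String day = new SimpleDateFormat("yyyy-MM-dd")
                .format(new Date(clock.currentTimeMillis()));
        return new Path(basePath, "date=" + day);
    }
}

Limiting only the sink's parallelism, e.g. events.addSink(sink).setParallelism(4), keeps the upstream operators at full parallelism while reducing the number of part files each taskmanager holds open.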

________________________________
From: galantaa <al...@gmail.com>
Sent: Tuesday, March 13, 2018 7:08:01 AM
To: user@flink.apache.org
Subject: Too many open files on Bucketing sink

[...]