Posted to user@kudu.apache.org by Paul Brannan <pa...@thesystech.com> on 2017/02/16 16:42:12 UTC

File descriptor limit for WAL

I wrote a quick script today to see how kudu behaves if I create many
tables.  After creating 334 tables, I started getting timeouts.  I see this
in the master log file:

W0216 11:37:48.961221 49810 catalog_manager.cc:2490] CreateTablet RPC for
tablet 9b259d5c5ff74f04820240f2159bc1a0 on TS
faaf4e9b6e5945d7a14953c4cc34f164 (telx-sb-dev2:7050) failed: IO error:
Couldn't create tablet metadata: Failed to write tablet metadata
9b259d5c5ff74f04820240f2159bc1a0: Call to mkstemp() failed on name template
/var/lib/kudu/tserver/tablet-meta/9b259d5c5ff74f04820240f2159bc1a0.tmp.XXXXXX:
Too many open files (error 24)

I decreased block_manager_max_open_files, but still got the same result.
Lsof shows that the open files are for the WAL:

kudu-tser 49648 kudu 1021u   REG        8,5 67108864   16385457
/var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
kudu-tser 49648 kudu 1022r   REG        8,5 67108864   16385457
/var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
kudu-tser 49648 kudu 1023u   REG        8,5 24000000   16385458
/var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/index.000000000

The files do not get closed until the tables are deleted, even though no
running process has any of those tables open.

Is there a setting that will reduce the number of WAL files that get
created or held open at any given point in time?
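
A table-creation loop along these lines reproduces the setup; here is a
minimal sketch using the Kudu Python client (table names, schema, and the
master address are illustrative, not the actual script):

import kudu
from kudu.client import Partitioning

# Connect to the single master (address is illustrative).
client = kudu.connect(host='kudu-master', port=7051)

# A trivial one-column schema; the real schema doesn't matter for this test.
builder = kudu.schema_builder()
builder.add_column('key', kudu.int64, nullable=False)
builder.set_primary_keys(['key'])
schema = builder.build()

# One open-ended range partition per table, replication factor 1.
partitioning = Partitioning().set_range_partition_columns(['key'])

for i in range(1000):
    client.create_table('scale_test_%04d' % i, schema, partitioning, n_replicas=1)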

Re: File descriptor limit for WAL

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Feb 24, 2017 at 12:39 PM, Adar Dembo <ad...@cloudera.com> wrote:

> It's definitely safe to increase the ulimit for open files; we
> typically test with higher values (like 32K or 64K). We don't use
> select(2) directly; any fd polling in Kudu is done via libev which I
> believe uses epoll(2) under the hood. There's one other place where we
> use ppoll() (in RPC negotiation), but no select().
>

A bit of historical curiosity: we actually had this bug a few years back
and fixed it; see commit 82cf3724077a8fb639a44dd86f04d10ecbedabf4.

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: File descriptor limit for WAL

Posted by Adar Dembo <ad...@cloudera.com>.
I think range partitioning is a fine solution for your use case,
though you should know that we're not recommending more than 4 TB of
total data (post-encoding/compression) per tserver at the moment. I
don't expect anything to break outright if you exceed that, but
startup will get slower and slower, as will operations that rewrite
tablet superblocks (such as flushes and compactions).

It's definitely safe to increase the ulimit for open files; we
typically test with higher values (like 32K or 64K). We don't use
select(2) directly; any fd polling in Kudu is done via libev which I
believe uses epoll(2) under the hood. There's one other place where we
use ppoll() (in RPC negotiation), but no select().

As the number of tablets increases, startup will become slower, and the
number of threads in the process will grow too (we start a certain
number of threads per tablet). Keep an eye out for that.
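
To check what limit the tserver process actually ends up with, a quick
sketch (generic Python, not Kudu-specific; it inspects whichever process
runs it, so it has to be launched the same way the tserver is):

import resource

# RLIMIT_NOFILE is the per-process open file descriptor limit ("ulimit -n").
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open files: soft=%d hard=%d' % (soft, hard))

# An unprivileged process may raise its soft limit up to the hard limit;
# raising the hard limit itself requires root (limits.conf or a systemd unit).
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    print('raised soft limit to %d' % hard)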

On Fri, Feb 24, 2017 at 10:04 AM, Paul Brannan
<pa...@thesystech.com> wrote:
> I'm using the debs from the cloudera-kudu ppa with little change to the
> default configuration, so one master and one tablet server.  I set
> num_replicas(1) when creating each table.  I used range partitioning with
> (if I understand correctly) one large open-ended range.  So that should have
> 334 tablet replicas.
>
> I added two more tablet servers and was able to create 1002 tables
> (exactly 3*334) before running out of file descriptors.  Using hash
> partitioning instead of range partitioning, I was able to create 500 tables
> (roughly half).  That run used 2 hash buckets, so it's what I'd expect.  I run
> into the same limit when I have a single table with many range partitions
> (997 partitions on a single partition column).
>
> My goal here is to be able to keep N months of data (on the order of 100's
> of GB per day) and to be able to drop a single date from the beginning of
> the range.  Rows are only inserted for the current date, and rows for
> previous dates are not modified.  Partitioning seems ideal for this case
> (it's mentioned as a use case in the non-covering range partitions
> document).  Is there a better solution?
>
> Does the 100-tablet limit only affect startup time?  In other words, if
> multiple-minute startup time is acceptable, then is there any other reason
> to limit each tablet server to 100 tablets?  Is it safe to increase ulimit
> for open files past 1024 (i.e. does the tablet server ever call select(2))?
>
>
> On Thu, Feb 16, 2017 at 3:33 PM, Adar Dembo <ad...@cloudera.com> wrote:
>>
>> Hi Paul,
>>
>> As you discovered, Kudu holds WAL segments open until the tablets they
>> belong to are deleted. block_manager_max_open_files won't help here;
>> that just applies to files opened for accessing data blocks, not WAL
>> segments.
>>
>> As far as WAL segments are concerned, we've previously discussed
>> "queiscing" tablets that haven't been used in some time, which would
>> involve halting their Raft consensus state machine and perhaps closing
>> their WAL segments. I can't find a JIRA for this feature, but I'm also
>> not aware of anyone working on it. If you're interested in
>> contributing to Kudu, this could be a worthwhile avenue for you to
>> explore further.
>>
>> I'm a little fuzzy on the details, but I believe that by default a
>> tablet will retain anywhere from 2 to 10 WAL segments, all of them
>> open. The exact number depends on how "caught up" the replication
>> group is; if one peer is behind, more segments may be retained in
>> order to help that peer catch up in the future. The settings that
>> control these numbers are log_min_segments_to_retain and
>> log_max_segments_to_retain.
>>
>> Out of curiosity, how many tablet replicas did your 334 tables
>> generate in total? You can deduce that by calculating, for each table,
>> the total number of partitions multiplied by the table's replication
>> factor. And across how many tservers were they all distributed? By
>> design, tservers can handle many tablets, but as usual, the
>> implementation lags the design, and at the moment we're recommending
>> no more than 100 tablets per tserver
>> (http://kudu.apache.org/docs/known_issues.html#_other_known_issues).
>>
>>
>> On Thu, Feb 16, 2017 at 8:42 AM, Paul Brannan
>> <pa...@thesystech.com> wrote:
>> > I wrote a quick script today to see how kudu behaves if I create many
>> > tables.  After creating 334 tables, I started getting timeouts.  I see this
>> > in the master log file:
>> >
>> > W0216 11:37:48.961221 49810 catalog_manager.cc:2490] CreateTablet RPC for
>> > tablet 9b259d5c5ff74f04820240f2159bc1a0 on TS
>> > faaf4e9b6e5945d7a14953c4cc34f164 (telx-sb-dev2:7050) failed: IO error:
>> > Couldn't create tablet metadata: Failed to write tablet metadata
>> > 9b259d5c5ff74f04820240f2159bc1a0: Call to mkstemp() failed on name template
>> > /var/lib/kudu/tserver/tablet-meta/9b259d5c5ff74f04820240f2159bc1a0.tmp.XXXXXX:
>> > Too many open files (error 24)
>> >
>> > I decreased block_manager_max_open_files, but still got the same result.
>> > Lsof shows that the open files are for the WAL:
>> >
>> > kudu-tser 49648 kudu 1021u   REG        8,5 67108864   16385457
>> > /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
>> > kudu-tser 49648 kudu 1022r   REG        8,5 67108864   16385457
>> > /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
>> > kudu-tser 49648 kudu 1023u   REG        8,5 24000000   16385458
>> > /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/index.000000000
>> >
>> > The files do not get closed until the tables are deleted, even though no
>> > running process has any of those tables open.
>> >
>> > Is there a setting that will reduce the number of WAL files that get created
>> > or held open at any given point in time?
>
>

Re: File descriptor limit for WAL

Posted by Paul Brannan <pa...@thesystech.com>.
I'm using the debs from the cloudera-kudu ppa with little change to the
default configuration, so one master and one tablet server.  I set
num_replicas(1) when creating each table.  I used range partitioning with
(if I understand correctly) one large open-ended range.  So that should
have 334 tablet replicas.

I added two more tablet servers and was able to create 1002 tables
(exactly 3*334) before running out of file descriptors.  Using hash
partitioning instead of range partitioning, I was able to create 500 tables
(roughly half).  That run used 2 hash buckets, so it's what I'd expect.  I
run into the same limit when I have a single table with many range
partitions (997 partitions on a single partition column).

My goal here is to be able to keep N months of data (on the order of 100's
of GB per day) and to be able to drop a single date from the beginning of
the range.  Rows are only inserted for the current date, and rows for
previous dates are not modified.  Partitioning seems ideal for this case
(it's mentioned as a use case in the non-covering range partitions
document).  Is there a better solution?

Does the 100-tablet limit only affect startup time?  In other words, if
multiple-minute startup time is acceptable, then is there any other reason
to limit each tablet server to 100 tablets?  Is it safe to increase ulimit
for open files past 1024 (i.e. does the tablet server ever call select(2))?
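
For reference, the layout described above might look roughly like this with
the Kudu Python client. This is only a sketch: the column names are made up,
the bounds are passed as dicts, and add_range_partition / drop_range_partition
on the table alterer are assumed to be available in the client version in use.

import kudu
from kudu.client import Partitioning

client = kudu.connect(host='kudu-master', port=7051)

# Schema keyed on (event_date, event_id); event_date is an integer day
# number purely for illustration.
builder = kudu.schema_builder()
builder.add_column('event_date', kudu.int64, nullable=False)
builder.add_column('event_id', kudu.int64, nullable=False)
builder.add_column('payload', kudu.string)
builder.set_primary_keys(['event_date', 'event_id'])
schema = builder.build()

# Range-partition on the date column, one partition per day.
start_day = 17000   # illustrative day number
partitioning = Partitioning().set_range_partition_columns(['event_date'])
for day in range(start_day, start_day + 31):
    partitioning.add_range_partition(lower_bound={'event_date': day},
                                     upper_bound={'event_date': day + 1})

client.create_table('events', schema, partitioning, n_replicas=1)

# Each day: add tomorrow's partition and drop the oldest one. Dropping a
# range partition deletes its tablets, which also releases their WAL files.
table = client.table('events')
alterer = client.new_table_alterer(table)
alterer.add_range_partition(lower_bound={'event_date': start_day + 31},
                            upper_bound={'event_date': start_day + 32})
alterer.drop_range_partition(lower_bound={'event_date': start_day},
                             upper_bound={'event_date': start_day + 1})
alterer.alter()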


On Thu, Feb 16, 2017 at 3:33 PM, Adar Dembo <ad...@cloudera.com> wrote:

> Hi Paul,
>
> As you discovered, Kudu holds WAL segments open until the tablets they
> belong to are deleted. block_manager_max_open_files won't help here;
> that just applies to files opened for accessing data blocks, not WAL
> segments.
>
> As far as WAL segments are concerned, we've previously discussed
> "queiscing" tablets that haven't been used in some time, which would
> involve halting their Raft consensus state machine and perhaps closing
> their WAL segments. I can't find a JIRA for this feature, but I'm also
> not aware of anyone working on it. If you're interested in
> contributing to Kudu, this could be a worthwhile avenue for you to
> explore further.
>
> I'm a little fuzzy on the details, but I believe that by default a
> tablet will retain anywhere from 2 to 10 WAL segments, all of them
> open. The exact number depends on how "caught up" the replication
> group is; if one peer is behind, more segments may be retained in
> order to help that peer catch up in the future. The settings that
> control these numbers are log_min_segments_to_retain and
> log_max_segments_to_retain.
>
> Out of curiosity, how many tablet replicas did your 334 tables
> generate in total? You can deduce that by calculating, for each table,
> the total number of partitions multiplied by the table's replication
> factor. And across how many tservers were they all distributed? By
> design, tservers can handle many tablets, but as usual, the
> implementation lags the design, and at the moment we're recommending
> no more than 100 tablets per tserver
> (http://kudu.apache.org/docs/known_issues.html#_other_known_issues).
>
>
> On Thu, Feb 16, 2017 at 8:42 AM, Paul Brannan
> <pa...@thesystech.com> wrote:
> > I wrote a quick script today to see how kudu behaves if I create many
> > tables.  After creating 334 tables, I started getting timeouts.  I see this
> > in the master log file:
> >
> > W0216 11:37:48.961221 49810 catalog_manager.cc:2490] CreateTablet RPC for
> > tablet 9b259d5c5ff74f04820240f2159bc1a0 on TS
> > faaf4e9b6e5945d7a14953c4cc34f164 (telx-sb-dev2:7050) failed: IO error:
> > Couldn't create tablet metadata: Failed to write tablet metadata
> > 9b259d5c5ff74f04820240f2159bc1a0: Call to mkstemp() failed on name template
> > /var/lib/kudu/tserver/tablet-meta/9b259d5c5ff74f04820240f2159bc1a0.tmp.XXXXXX:
> > Too many open files (error 24)
> >
> > I decreased block_manager_max_open_files, but still got the same result.
> > Lsof shows that the open files are for the WAL:
> >
> > kudu-tser 49648 kudu 1021u   REG        8,5 67108864   16385457
> > /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
> > kudu-tser 49648 kudu 1022r   REG        8,5 67108864   16385457
> > /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
> > kudu-tser 49648 kudu 1023u   REG        8,5 24000000   16385458
> > /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/index.000000000
> >
> > The files do not get closed until the tables are deleted, even though no
> > running process has any of those tables open.
> >
> > Is there a setting that will reduce the number of WAL files that get created
> > or held open at any given point in time?
>

Re: File descriptor limit for WAL

Posted by Adar Dembo <ad...@cloudera.com>.
Hi Paul,

As you discovered, Kudu holds WAL segments open until the tablets they
belong to are deleted. block_manager_max_open_files won't help here;
that just applies to files opened for accessing data blocks, not WAL
segments.

As far as WAL segments are concerned, we've previously discussed
"queiscing" tablets that haven't been used in some time, which would
involve halting their Raft consensus state machine and perhaps closing
their WAL segments. I can't find a JIRA for this feature, but I'm also
not aware of anyone working on it. If you're interested in
contributing to Kudu, this could be a worthwhile avenue for you to
explore further.

I'm a little fuzzy on the details, but I believe that by default a
tablet will retain anywhere from 2 to 10 WAL segments, all of them
open. The exact number depends on how "caught up" the replication
group is; if one peer is behind, more segments may be retained in
order to help that peer catch up in the future. The settings that
control these numbers are log_min_segments_to_retain and
log_max_segments_to_retain.

Out of curiosity, how many tablet replicas did your 334 tables
generate in total? You can deduce that by calculating, for each table,
the total number of partitions multiplied by the table's replication
factor. And across how many tservers were they all distributed? By
design, tservers can handle many tablets, but as usual, the
implementation lags the design, and at the moment we're recommending
no more than 100 tablets per tserver
(http://kudu.apache.org/docs/known_issues.html#_other_known_issues).
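
To make that arithmetic concrete, a quick back-of-the-envelope using the
numbers from this thread (the per-replica descriptor count is read off the
lsof output above; it ignores data-block and socket descriptors):

# Rough estimate of WAL-related file descriptors on one tserver.
tables = 334
partitions_per_table = 1      # one open-ended range partition per table
replication_factor = 1
tservers = 1

replicas_per_tserver = tables * partitions_per_table * replication_factor // tservers

# lsof showed two descriptors on the active WAL segment plus one on its index
# file, so at least three per replica (more if extra segments are retained).
fds_per_replica = 3
wal_fds = replicas_per_tserver * fds_per_replica

print('%d replicas -> ~%d WAL fds (default ulimit is often 1024)'
      % (replicas_per_tserver, wal_fds))
# 334 replicas -> ~1002 fds, which is why ~334 tables hit the limit here.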


On Thu, Feb 16, 2017 at 8:42 AM, Paul Brannan
<pa...@thesystech.com> wrote:
> I wrote a quick script today to see how kudu behaves if I create many
> tables.  After creating 334 tables, I started getting timeouts.  I see this
> in the master log file:
>
> W0216 11:37:48.961221 49810 catalog_manager.cc:2490] CreateTablet RPC for
> tablet 9b259d5c5ff74f04820240f2159bc1a0 on TS
> faaf4e9b6e5945d7a14953c4cc34f164 (telx-sb-dev2:7050) failed: IO error:
> Couldn't create tablet metadata: Failed to write tablet metadata
> 9b259d5c5ff74f04820240f2159bc1a0: Call to mkstemp() failed on name template
> /var/lib/kudu/tserver/tablet-meta/9b259d5c5ff74f04820240f2159bc1a0.tmp.XXXXXX:
> Too many open files (error 24)
>
> I decreased block_manager_max_open_files, but still got the same result.
> Lsof shows that the open files are for the WAL:
>
> kudu-tser 49648 kudu 1021u   REG        8,5 67108864   16385457
> /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
> kudu-tser 49648 kudu 1022r   REG        8,5 67108864   16385457
> /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
> kudu-tser 49648 kudu 1023u   REG        8,5 24000000   16385458
> /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/index.000000000
>
> The files do not get closed until the tables are deleted, even though no
> running process has any of those tables open.
>
> Is there a setting that will reduce the number of WAL files that get created
> or held open at any given point in time?