Posted to user@accumulo.apache.org by "Ligade, Shailesh [USA]" <Li...@bah.com> on 2022/01/31 16:15:35 UTC

tablets per tablet server for accumulo 1.10.0

Hello,

table.split.threshold is set to the default 1G (except for metadata and root, which are set to 64M).
What can cause the tablets-per-tablet-server count to go high? Within a week, that count jumped from 5k/tablet server to 23k/tablet server, even though the total size in hdfs has not changed.
Is a high count a cause for concern?
We didn't apply any splits. I did a dumpConfig and checked all my tables and didn't see splits there either.

Is there a way to find tablet size in hdfs? When I look at hdfs /accumulo/table/x/ I see some empty folders, meaning not all folders have rf files. Is that normal?

Thanks in advance!

-S

RE: tablets per tablet server for accumulo 1.10.0

Posted by dev1 <de...@etcoleman.com>.
Roughly (I don’t have the exact command syntax at hand): I build a file of commands that is then executed by passing it to the shell. To build the command file:

Use the getsplits command with the number of batches that I want – that can roughly be calculated as # current tablets / (# tservers * # compaction slots * comfort factor). You can specify an output file or tee the command output, something like


  *   getsplits -t tablename -m 20 -o /tmp/my_splits.txt


This would give you the splits for 20 rounds. Using those splits, the compact command file then looks like:

compact -w -t tablename -e [first split]
compact -w -t tablename -b [first split] -e [second split]
…
compact -w -t tablename -b [last split]

To do a merge, interleave the merge commands:

compact -w -t tablename -e [first split]
merge -w -t tablename --size 5G -e [first split]
compact -w -t tablename -b [first split] -e [second split]
merge -w -t tablename --size 5G -b [first split] -e [second split]
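If it helps, here is a rough sketch of a script that turns the getsplits output into that interleaved command file. The table name, file paths, and 5G target are assumptions to adjust for your cluster; splits containing spaces or binary characters would need extra quoting, or the base64 option that getsplits provides.

  #!/usr/bin/env bash
  # Sketch: build an interleaved compact/merge command file from a splits file.
  TABLE=tablename                    # assumed table name
  SPLITS=/tmp/my_splits.txt          # output of: getsplits -t tablename -m 20 -o /tmp/my_splits.txt
  OUT=/tmp/compact_merge.txt         # command file to feed to the shell

  prev=""
  {
    while IFS= read -r split; do
      [ -z "$split" ] && continue
      if [ -z "$prev" ]; then
        # First range: from the beginning of the table up to the first split.
        echo "compact -w -t $TABLE -e $split"
        echo "merge -w -t $TABLE --size 5G -e $split"
      else
        echo "compact -w -t $TABLE -b $prev -e $split"
        echo "merge -w -t $TABLE --size 5G -b $prev -e $split"
      fi
      prev="$split"
    done < "$SPLITS"
    # Last range: from the final split to the end of the table.
    if [ -n "$prev" ]; then
      echo "compact -w -t $TABLE -b $prev"
      echo "merge -w -t $TABLE --size 5G -b $prev"
    fi
  } > "$OUT"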

Then just issue the shell command with (login info) -f filename (-e executes a single command; -f executes each command in a file).

The -w switch makes each command wait until the operation completes before the shell moves on to the next one.

The comfort factor is some multiple that increases the number of tablets in each round. This will over-subscribe the compaction slots – but some compactions finish quickly for small tablets, so the over-subscription drops off quickly. It is a balancing act: you want fewer rounds, but you also want to limit the over-subscription period.
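To put numbers on it (tserver count and comfort factor assumed): with your 72,000 tablets, 20 tservers, 3 compaction slots per server, and a comfort factor of 2:

  rounds = 72000 / (20 * 3 * 2) = 600
  tablets per round = 72000 / 600 = 120, i.e. about 6 per server against 3 slots (2x over-subscribed)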

You may want to increase the # of compaction slots available – depending on your hardware and load. I think the default is 3; 6 is not unreasonable.
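If I remember correctly, on 1.10 the slot count is the tserver.compaction.major.concurrent.max property, so bumping it would look something like:

  config -s tserver.compaction.major.concurrent.max=6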

Using the compact / merge commands with just an end row (the first command) and with just a begin row (the last command) ensures that all ranges are covered – don’t mix them up, or you will compact everything.

A few tablets can take much longer if the row ids are not evenly distributed – the time each round takes will be the time of its longest compaction. With larger but fewer rounds, you increase the chance that more of the long poles land in the same round and run in parallel, which shortens the total time needed to complete. Doing it in rounds does take longer overall, because each round may have a long pole that is essentially compacted serially.

Ed Coleman



Re: tablets per tablet server for accumulo 1.10.0

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.
Thank you,

Will a range compaction (compact -t <> --begin-row <> --end-row <>) be faster than just compact -t <>? My worry is that if I somehow issue 72k compact commands at once, it will kill the system.
On that note, what is the best way to issue these compact commands, especially since there are so many of them? I saw that accumulo shell -u<> -p<> -e 'compact ..., compact ..., compact ...' will work, but I don't know how many I can tack onto one shell command. Is there a better way of doing all this? I want to be as gentle to my production system and yet as fast as possible – I don't want to spend days doing compact/merge 🙁

Thanks

-S


RE: tablets per tablet server for accumulo 1.10.0

Posted by dev1 <de...@etcoleman.com>.
Before.  That has the benefit that file sizes are reduced (if data is eligible for age-off) and the merge operates on current file sizes.


Re: tablets per tablet server for accumulo 1.10.0

Posted by Michael Wall <mj...@apache.org>.
In my experience, merging goes faster if you compact the ranges to be
merged first.


Re: tablets per tablet server for accumulo 1.10.0

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.
Thank you for explanation!

Once I ran getsplits it was clear that splits were the culprit, so I need to do a merge as well as bump the threshold to a higher number, as you suggested.

If I have to perform a major compaction, should I do it before the merge or after?

Thanks again,

-S



RE: tablets per tablet server for accumulo 1.10.0

Posted by dev1 <de...@etcoleman.com>.
You can get the hdfs size using standard hdfs commands - count or ls.  As long as you have not cloned the table, the size of the hdfs files and the space occupied by the table are equivalent.
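For example (using the /accumulo/table/x path from your message):

  tables -l                             # in the Accumulo shell: list table name -> table id
  hdfs dfs -du -s -h /accumulo/table/x  # total bytes under the table's directory
  hdfs dfs -count /accumulo/table/x     # directory, file, and byte counts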

You can also get a sense of the referenced files by examining the metadata table - the file: column will give you just the referenced files. Looking at the directories: b-xxxxxxx directories are from a bulk import, and t-xxxxxxx directories hold files assigned to tablets.  Bulk import file names start with I-xxxxxx; files from compactions will be A-xxxxxx if from a full compaction or C-xxxxxxx from a partial major compaction, and F-xxxxxx is the result of a flush (minor compaction). You can look at the entries for the files - the numbers in the value are the file size and the number of entries.
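For example, assuming your table id is x, something like this in the shell shows the file entries for each tablet:

  scan -t accumulo.metadata -b x; -e x< -c file

Each returned entry looks roughly like x;<end row> file:<path to rf file> []  <size>,<entries>.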

How do you ingest? Bulk or continuous?  On a bulk ingest, the imported files end up in /accumulo/table/x/b-xxxxx and are then assigned to tablets - the directories for the tablets will be created, but will be "empty" until a compaction occurs.  A compaction copies the data from the files referenced by a tablet into a new file that is placed in the corresponding /accumulo/table/x/t-xxxxxx directory.  When a bulk imported file is no longer referenced by any tablet it will get garbage collected; until then the file will exist and inflate the apparent space used by the table. The compaction will also remove any data that is past the TTL for the records.

Do you ever run a compaction?  With a very large number of tablets, you may want to run the compaction in parts so that you don't end up occupying all of the compaction slots for a long time.

Are you using keys (row ids) that are always increasing? A typical example would be a date.  Say some of your row ids are yyyy-mm-dd-hh and there is a 10 day TTL.  What will happen is that new data will continue to create new tablets, and on compaction the old tablets will age off and have 0 size.  You can remove the "unused splits" by running a merge.  Anything that creates new row ids that are ordered can do this - new splits keep becoming necessary and the old splits eventually become unnecessary; if the row ids are distributed across the splits, this will not happen. It is not necessarily a problem if this is what your data looks like, just something that you may want to manage with merges.
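For example, with yyyy-mm-dd-hh row ids where everything before 2022-01-20 has aged off (dates assumed), something like this collapses the empty leading tablets into one:

  merge -w -t tablename -e 2022-01-20-00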

There is usually not much benefit in having a large number of tablets for a single table on a server.  You can reduce the number of tablets required by setting the split threshold to a larger number and then running a merge.  This can be done in sections, and you should run a compaction on each section first.

If you have recently compacted, you can figure out the rough number of tablets necessary by taking hdfs size / split threshold = number of tablets.  If you increase the split threshold you will need fewer tablets.  You may also consider setting a split threshold that is larger than your target - say you decide that 5G is a good target: setting the threshold to 8G during the merge and then lowering it to 5G when completed will cause the table to split, and it could give you a better distribution of data in the splits.
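As a worked example (sizes assumed): a table holding 3T in hdfs at a 5G threshold needs about 3072 / 5 = ~614 tablets. The threshold juggling would look like:

  config -t tablename -s table.split.threshold=8G   # before / during the merge
  (run the compact / merge sections)
  config -t tablename -s table.split.threshold=5G   # afterwards - the table re-splits toward 5G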

This can be done while things are running, but it will be a heavy IO load (on the files and on the hdfs namenode) and can take a very long time. What can be useful is to use the getsplits command with the max-splits option and create a script that compacts, then merges, a section - using the splits as the start / end rows for the compact and merge commands.

Ed Coleman
