You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by S L <sl...@gmail.com> on 2017/07/19 05:48:09 UTC

How does changing TTL affect compaction

Just wondering how changing TTL on a table affects minor and major
compaction when it's trying to combine/delete things.

There is a rowkey that shows up when my program does a scan.  According to
the TTL, it shouldn't be there but it is.  It gives me
RetriesExhaustedException.  Looking at the table description, the TTL is 30
days.  We initially had it at 14 days, then at some point we changed it to
30.

Table description
describe 'dbi_based_data'
Table dbi_based_data is ENABLED


dbi_based_data

COLUMN FAMILIES DESCRIPTION


{NAME => 'raw_data', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY',
MIN_VERSIONS => '0', T

TL => '2592000 SECONDS (30 DAYS)', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE
=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}

1 row(s) in 0.5310 seconds

The rowkey exception shows up as follows in the hadoop output:

Error: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
after attempts=36, exceptions: Tue Jul 18 13:49:59 PDT 2017, null,
java.net.SocketTimeoutException: callTimeout=75000, callDuration=75301: row
'170_p3431.db160044.hkg1.dbi_1476454680' on table 'dbi_based_data' at
region=dbi_based_data,170_p3431.db160044.hkg1.dbi_1476454680,1490311699762.e00f8737bfc5010b6bd9be11520ca1d3.,
hostname=hslave35118.ams9.mydomain.com,60020,1483570489305, seqNum=16311452

I went to the region server above to look at the logs
(/var/log/hbase/hbase*.out).  I didn't find anything with that rowkey and I
can see it is performing major_compaction on other elements.

hbase-cmf-hbase-REGIONSERVER-hslave35188.ams9.mydomain.com.log.out.7:2017-07-14
22:25:58,692 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed
major compaction of 4 (all) file(s) in raw_data of
dbi_based_data,109_p3400.db160093.sjc5.dbi_1493088600,1495713795634.358bee5b1db95d10e7533c7fe0365c0e.
into 51a8c31e32b8445eb781bdc580c2d5d1(size=4.6 G), total size for store is
4.6 G. This selection was in queue for 6mins, 21sec, and took 2mins, 21sec
to execute.

Two questions:
1)   Can this change in TTL make some keys not get deleted due to timing or
whatever other reason?  If so, how do I delete these undeleteable keys.

2) What is the name of the java file that outputs the exception message
above?  What do the different fields from the output mean?
Tue Jul 18 13:49:59 PDT 2017, null, java.net.SocketTimeoutException:
callTimeout=75000, callDuration=75301: row
'170_p3431.db160044.hkg1.dbi_1476454680' on table 'dbi_based_data' at
region=dbi_based_data,170_p3431.db160044.hkg1.dbi_1476454680,1490311699762.
e00f8737bfc5010b6bd9be11520ca1d3., hostname=hslave35118.ams9.mydomain.com
,60020,1483570489305, seqNum=16311452


I've color coded the fields that I don't know what they mean.  Hopefully
you can see the color.
a) What does "null" mean (light blue)?
b) The yellow is the rowkey but the red is some timestamp that I don't
know.  The red is from March 2017 so it's way past the 30 day TTL.
c) Don't know what the purple alphanumeric string is but it might be the
region name?
d) The orange timestamp again is past the 30day TTL and from Jan 2017 but
again, I don't know what it is from/for.

Thanks in advance for your help.

Re: How does changing TTL affect compaction

Posted by S L <sl...@gmail.com>.

Thanks Ted for answering all my questions this week.

I don't remember when the change was made.  I want to say it occurred 4
months ago.  There are a lot of these RetriesExhaustedException errors so I
can't say what percentage of these "bad" rowkeys are before/after the
estimated change.  I don't think the cluster has been restarted after this
change though, although, I could be wrong.

Good catch on the mismatch between the server log and the Exception log.  I
logged into the wrong server and couldn't find the string but I wanted to
show the cluster was running major compaction and just randomly showed a
snippet from the log.  Here is the log from the correct server showing it
trying to delete the rowkey

>>>>>  from region server log snippet
hbase-cmf-hbase-REGIONSERVER-hslave35118.ams9.mydomain.com.log.out.1:2017-07-18
16:39:06,078 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed
major compaction of 5 (all) file(s) in raw_data of
dbi_based_data,170_p3431.db160044.hkg1.dbi_1476454680,1490311699762.e00f8737bfc5010b6bd9be11520ca1d3.
into 7df0e1450cec483da48f93c87f1c044f(size=3.3 G), total size for store is
3.3 G. This selection was in queue for 0sec, and took 1mins, 47sec to
execute.

hbase-cmf-hbase-REGIONSERVER-hslave35118.ams9.mydomain.com.log.out.6:2017-07-14
22:49:48,808 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed
major compaction of 4 (all) file(s) in raw_data of
dbi_based_data,170_p3431.db160044.hkg1.dbi_1476454680,1490311699762.e00f8737bfc5010b6bd9be11520ca1d3.
into bc26bfcd2e074bd4a73901500e43840c(size=3.3 G), total size for store is
3.3 G. This selection was in queue for 30mins, 20sec, and took 2mins, 12sec
to execute

>>>>>> end of region server log.


Regarding the parts from the Exception log, I'll look at the files you
advised me to look at and if I still have questions, I'll send another
question to the user list.

Thanks again.


On Wed, Jul 19, 2017 at 8:22 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. at some point we changed it to 30.
>
> Do you remember when the change was made ?
> After the change, was there server restart ?
>
> You showed part of log for 35188 but the RetriesExhaustedException was for
> 35118. Was this due to redaction or were you checking a different server ?
>
> The RetriesExhaustedException happend at Jul 18 13:49:59 PDT 2017 which
> didn't match the server log snippet (2017-07-14
> 22:25:58,692)
>
> The coloring didn't go thru.
>
> For SocketTimeoutException, please
> see hbase-client/src/main/java/org/apache/hadoop/hbase/
> client/RpcRetryingCaller.java
> around line 169 (in branch-1.3).
>
> For the remaining exception message,
> see getExceptionMessageAdditionalDetail()
> in hbase-client/src/main/java/org/apache/hadoop/hbase/
> client/RegionServerCallable.java
>
> On Tue, Jul 18, 2017 at 10:48 PM, S L <sl...@gmail.com> wrote:
>
> > Just wondering how changing TTL on a table affects minor and major
> > compaction when it's trying to combine/delete things.
> >
> > There is a rowkey that shows up when my program does a scan.  According
> to
> > the TTL, it shouldn't be there but it is.  It gives me
> > RetriesExhaustedException.  Looking at the table description, the TTL is
> 30
> > days.  We initially had it at 14 days, then at some point we changed it
> to
> > 30.
> >
> > Table description
> > describe 'dbi_based_data'
> > Table dbi_based_data is ENABLED
> >
> >
> > dbi_based_data
> >
> > COLUMN FAMILIES DESCRIPTION
> >
> >
> > {NAME => 'raw_data', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
> > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY',
> > MIN_VERSIONS => '0', T
> >
> > TL => '2592000 SECONDS (30 DAYS)', KEEP_DELETED_CELLS => 'FALSE',
> BLOCKSIZE
> > => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
> >
> > 1 row(s) in 0.5310 seconds
> >
> > The rowkey exception shows up as follows in the hadoop output:
> >
> > Error: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > after attempts=36, exceptions: Tue Jul 18 13:49:59 PDT 2017, null,
> > java.net.SocketTimeoutException: callTimeout=75000, callDuration=75301:
> > row
> > '170_p3431.db160044.hkg1.dbi_1476454680' on table 'dbi_based_data' at
> > region=dbi_based_data,170_p3431.db160044.hkg1.dbi_
> > 1476454680,1490311699762.e00f8737bfc5010b6bd9be11520ca1d3.,
> > hostname=hslave35118.ams9.mydomain.com,60020,1483570489305,
> > seqNum=16311452
> >
> > I went to the region server above to look at the logs
> > (/var/log/hbase/hbase*.out).  I didn't find anything with that rowkey
> and I
> > can see it is performing major_compaction on other elements.
> >
> > hbase-cmf-hbase-REGIONSERVER-hslave35188.ams9.mydomain.com.
> > log.out.7:2017-07-14
> > 22:25:58,692 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed
> > major compaction of 4 (all) file(s) in raw_data of
> > dbi_based_data,109_p3400.db160093.sjc5.dbi_1493088600,1495713795634.
> > 358bee5b1db95d10e7533c7fe0365c0e.
> > into 51a8c31e32b8445eb781bdc580c2d5d1(size=4.6 G), total size for store
> is
> > 4.6 G. This selection was in queue for 6mins, 21sec, and took 2mins,
> 21sec
> > to execute.
> >
> > Two questions:
> > 1)   Can this change in TTL make some keys not get deleted due to timing
> or
> > whatever other reason?  If so, how do I delete these undeleteable keys.
> >
> > 2) What is the name of the java file that outputs the exception message
> > above?  What do the different fields from the output mean?
> > Tue Jul 18 13:49:59 PDT 2017, null, java.net.SocketTimeoutException:
> > callTimeout=75000, callDuration=75301: row
> > '170_p3431.db160044.hkg1.dbi_1476454680' on table 'dbi_based_data' at
> > region=dbi_based_data,170_p3431.db160044.hkg1.dbi_
> > 1476454680,1490311699762.
> > e00f8737bfc5010b6bd9be11520ca1d3., hostname=hslave35118.ams9.
> mydomain.com
> > ,60020,1483570489305, seqNum=16311452
> >
> >
> > I've color coded the fields that I don't know what they mean.  Hopefully
> > you can see the color.
> > a) What does "null" mean (light blue)?
> > b) The yellow is the rowkey but the red is some timestamp that I don't
> > know.  The red is from March 2017 so it's way past the 30 day TTL.
> > c) Don't know what the purple alphanumeric string is but it might be the
> > region name?
> > d) The orange timestamp again is past the 30day TTL and from Jan 2017 but
> > again, I don't know what it is from/for.
> >
> > Thanks in advance for your help.
> >
>

Re: How does changing TTL affect compaction

Posted by Ted Yu <yu...@gmail.com>.

bq. at some point we changed it to 30.

Do you remember when the change was made ?
After the change, was there server restart ?

You showed part of log for 35188 but the RetriesExhaustedException was for
35118. Was this due to redaction or were you checking a different server ?

The RetriesExhaustedException happend at Jul 18 13:49:59 PDT 2017 which
didn't match the server log snippet (2017-07-14
22:25:58,692)

The coloring didn't go thru.

For SocketTimeoutException, please
see hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCaller.java
around line 169 (in branch-1.3).

For the remaining exception message,
see getExceptionMessageAdditionalDetail()
in hbase-client/src/main/java/org/apache/hadoop/hbase/client/RegionServerCallable.java

On Tue, Jul 18, 2017 at 10:48 PM, S L <sl...@gmail.com> wrote:

> Just wondering how changing TTL on a table affects minor and major
> compaction when it's trying to combine/delete things.
>
> There is a rowkey that shows up when my program does a scan.  According to
> the TTL, it shouldn't be there but it is.  It gives me
> RetriesExhaustedException.  Looking at the table description, the TTL is 30
> days.  We initially had it at 14 days, then at some point we changed it to
> 30.
>
> Table description
> describe 'dbi_based_data'
> Table dbi_based_data is ENABLED
>
>
> dbi_based_data
>
> COLUMN FAMILIES DESCRIPTION
>
>
> {NAME => 'raw_data', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
> REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY',
> MIN_VERSIONS => '0', T
>
> TL => '2592000 SECONDS (30 DAYS)', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE
> => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
>
> 1 row(s) in 0.5310 seconds
>
> The rowkey exception shows up as follows in the hadoop output:
>
> Error: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> after attempts=36, exceptions: Tue Jul 18 13:49:59 PDT 2017, null,
> java.net.SocketTimeoutException: callTimeout=75000, callDuration=75301:
> row
> '170_p3431.db160044.hkg1.dbi_1476454680' on table 'dbi_based_data' at
> region=dbi_based_data,170_p3431.db160044.hkg1.dbi_
> 1476454680,1490311699762.e00f8737bfc5010b6bd9be11520ca1d3.,
> hostname=hslave35118.ams9.mydomain.com,60020,1483570489305,
> seqNum=16311452
>
> I went to the region server above to look at the logs
> (/var/log/hbase/hbase*.out).  I didn't find anything with that rowkey and I
> can see it is performing major_compaction on other elements.
>
> hbase-cmf-hbase-REGIONSERVER-hslave35188.ams9.mydomain.com.
> log.out.7:2017-07-14
> 22:25:58,692 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed
> major compaction of 4 (all) file(s) in raw_data of
> dbi_based_data,109_p3400.db160093.sjc5.dbi_1493088600,1495713795634.
> 358bee5b1db95d10e7533c7fe0365c0e.
> into 51a8c31e32b8445eb781bdc580c2d5d1(size=4.6 G), total size for store is
> 4.6 G. This selection was in queue for 6mins, 21sec, and took 2mins, 21sec
> to execute.
>
> Two questions:
> 1)   Can this change in TTL make some keys not get deleted due to timing or
> whatever other reason?  If so, how do I delete these undeleteable keys.
>
> 2) What is the name of the java file that outputs the exception message
> above?  What do the different fields from the output mean?
> Tue Jul 18 13:49:59 PDT 2017, null, java.net.SocketTimeoutException:
> callTimeout=75000, callDuration=75301: row
> '170_p3431.db160044.hkg1.dbi_1476454680' on table 'dbi_based_data' at
> region=dbi_based_data,170_p3431.db160044.hkg1.dbi_
> 1476454680,1490311699762.
> e00f8737bfc5010b6bd9be11520ca1d3., hostname=hslave35118.ams9.mydomain.com
> ,60020,1483570489305, seqNum=16311452
>
>
> I've color coded the fields that I don't know what they mean.  Hopefully
> you can see the color.
> a) What does "null" mean (light blue)?
> b) The yellow is the rowkey but the red is some timestamp that I don't
> know.  The red is from March 2017 so it's way past the 30 day TTL.
> c) Don't know what the purple alphanumeric string is but it might be the
> region name?
> d) The orange timestamp again is past the 30day TTL and from Jan 2017 but
> again, I don't know what it is from/for.
>
> Thanks in advance for your help.
>