Posted to user@cassandra.apache.org by Jimmy Lin <y2...@gmail.com> on 2016/02/25 07:13:19 UTC

how to read parent_repair_history table?

Hi all,
A few questions about how to read the
system_distributed.parent_repair_history CF, which I would like to use to
find out our repair status:

-
Is every invocation of nodetool repair recorded as one entry in the
parent_repair_history CF, regardless of whether it is a cross-DC repair, a
local node repair, or uses other options?

-
Is a repair job done only if the finished_at column contains a value? And is
a repair job successful only if there is no value in exception_message or
exception_stacktrace?
What is the purpose of the successful_ranges column? Do I have to check that
they all match requested_ranges to ensure a successful run?

-
Ultimately, how do I find out the overall repair health/status of a given
cluster?
By scanning through parent_repair_history and making sure all the known
keyspaces have had a good repair run in recent days?

---------------
CREATE TABLE system_distributed.parent_repair_history (
    parent_id timeuuid PRIMARY KEY,
    columnfamily_names set<text>,
    exception_message text,
    exception_stacktrace text,
    finished_at timestamp,
    keyspace_name text,
    requested_ranges set<text>,
    started_at timestamp,
    successful_ranges set<text>
)

Re: how to read parent_repair_history table?

Posted by Jimmy Lin <y2...@gmail.com>.

Hi Anuj,

I never thought of using JMX notifications as a way to check.
Part of my concern is that it requires a live connection or application to keep the notifications flowing in, while the DB approach lets you look up current or past jobs whenever you want.
Thanks

Sent from my iPhone

> On Feb 25, 2016, at 9:25 AM, Anuj Wadehra <an...@yahoo.co.in> wrote:
> 
> Hi Jimmy,
> 
> We are on 2.0.x. We are planning to use JMX notifications to get repair status. To repair the database, we call the forceTableRepairPrimaryRange JMX operation from our Java client application on each node. You can call other, newer JMX methods for repair.
>
> I would be keen to know the pros/cons of handling repair status via JMX notifications vs. via database tables.
>
> We are planning to implement it as follows:
>
> 1. Before repairing each keyspace via JMX, register two listeners: one for StorageService MBean notifications about repair status, and a connection listener for detecting connection failures and lost JMX notifications.
>
> 2. If 256 successful session notifications are received, we consider the keyspace repair successful. We have 256 ranges on each node.
>
> 3. If there are connection closed notifications, we will re-register the MBean listener and retry the repair once.
>
> 4. If there are lost notifications, we retry the repair once before failing it.
> 
> 
> 
> Thanks
> Anuj
> 
> 
> Sent from Yahoo Mail on Android
> 

Re: how to read parent_repair_history table?

Posted by Anuj Wadehra <an...@yahoo.co.in>.

Hi Jimmy,

We are on 2.0.x. We are planning to use JMX notifications to get repair status. To repair the database, we call the forceTableRepairPrimaryRange JMX operation from our Java client application on each node. You can call other, newer JMX methods for repair.

I would be keen to know the pros/cons of handling repair status via JMX notifications vs. via database tables.

We are planning to implement it as follows:

1. Before repairing each keyspace via JMX, register two listeners: one for StorageService MBean notifications about repair status, and a connection listener for detecting connection failures and lost JMX notifications.
2. If 256 successful session notifications are received, we consider the keyspace repair successful. We have 256 ranges on each node.
3. If there are connection closed notifications, we will re-register the MBean listener and retry the repair once.
4. If there are Lost Notifications we retry the repair once before failing it.
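
The bookkeeping behind steps 2 to 4 can be sketched independently of JMX. The following is only a minimal model in Python, not the Java client described above: the class name RepairAttempt and its method names are hypothetical, and the real listener registration and the call to the repair operation are left out entirely.

```python
class RepairAttempt:
    """Tracks one keyspace repair: counts successful sessions, allows one retry."""

    def __init__(self, expected_sessions=256, max_retries=1):
        self.expected_sessions = expected_sessions  # one session per range on the node
        self.max_retries = max_retries
        self.successes = 0
        self.retries = 0
        self.failed = False

    def on_session_success(self):
        """Call once per 'session succeeded' notification."""
        self.successes += 1

    def on_disruption(self):
        """Call on a closed connection or on lost notifications.

        Returns True if the repair should be retried (after re-registering
        the listener), False if the retry budget is spent and the repair
        should be reported as failed.
        """
        if self.retries < self.max_retries:
            self.retries += 1
            self.successes = 0  # counts from the interrupted run cannot be trusted
            return True
        self.failed = True
        return False

    @property
    def succeeded(self):
        return not self.failed and self.successes >= self.expected_sessions
```

Resetting the counter on a retry is a design choice of this sketch: once notifications may have been lost, the count from the interrupted run no longer proves anything.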

Thanks
Anuj

Sent from Yahoo Mail on Android

On Thu, 25 Feb, 2016 at 7:18 pm, Paulo Motta <pa...@gmail.com> wrote:

Hello Jimmy,

The parent_repair_history table keeps track of start and finish information of a repair session. The other table, repair_history, keeps track of repair status as it progresses. So, you must first query the parent_repair_history table to check whether a repair started and finished, as well as its duration, and inspect the repair_history table to troubleshoot more specific details of a given repair session.

Answering your questions below:

> Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF, regardless of whether it is a cross-DC repair, a local node repair, or uses other options?

Actually two entries, one for start and one for finish.

> Is a repair job done only if the finished_at column contains a value? And is a repair job successful only if there is no value in exception_message or exception_stacktrace?

Correct.

> What is the purpose of the successful_ranges column? Do I have to check that they all match requested_ranges to ensure a successful run?

Correct.

> Ultimately, how do I find out the overall repair health/status of a given cluster?

Check whether repair is being executed on all nodes within gc_grace_seconds, and tune that value or troubleshoot problems otherwise.

> By scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days?

Sounds good.

You can check https://issues.apache.org/jira/browse/CASSANDRA-5839 for more information.
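
The checks confirmed in the answers above can be sketched as a small function. This is only an illustration in Python: it assumes a parent_repair_history row has already been fetched and is available as a dict keyed by the column names from the schema in the original post. The function name and the four outcome labels are hypothetical.

```python
def repair_outcome(row):
    """Classify one parent_repair_history row (a dict keyed by column name)."""
    if row.get("finished_at") is None:
        # started_at was written but finished_at never was: still running,
        # or the run died before it could record its end.
        return "unfinished"
    if row.get("exception_message") or row.get("exception_stacktrace"):
        return "failed"
    requested = set(row.get("requested_ranges") or ())
    successful = set(row.get("successful_ranges") or ())
    # A clean run should have repaired every range it was asked to repair.
    if requested and requested == successful:
        return "succeeded"
    return "incomplete"
```

A health check could then scan the table and alert on any keyspace whose most recent row is not "succeeded".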


Re: how to read parent_repair_history table?

Posted by Jimmy Lin <y2...@gmail.com>.

Hi Paulo,
that is right, I forgot there is another table that actually tracks the
rest of the details of the repairs.
Thanks for the pointers, I will explore more with that info.

I am actually surprised that not much documentation out there talks about
these two tables, or about other tools or utilities that harvest this data. (?)

Thanks



On Thu, Feb 25, 2016 at 1:38 PM, Paulo Motta <pa...@gmail.com>
wrote:

> > How does it work when a repair job targets only the local DC vs. all DCs?
> > Are there any columns or flags that tell the difference? Or does it
> > actually matter?
>
> You cannot easily find out from the parent_repair_history table whether a
> repair is local-only or multi-DC. I created
> https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more
> information to that table. Since that table only has the id as primary key,
> you'd need to do a full scan to perform checks on it, or keep track of the
> parent session id when submitting the repair and query by primary key.
>
> What you could probably do to health-check that your nodes are repaired on
> time is to run, for each table:
>
> SELECT * FROM system_distributed.repair_history
> WHERE keyspace_name = 'ks' AND columnfamily_name = 'cf'
> AND id > minTimeuuid('<now - gc_grace_seconds/2>');
>
> where the minTimeuuid argument is a timestamp literal for now minus
> gc_grace_seconds/2, computed on the client.
>
> Then verify for each node whether all of its ranges have been repaired in
> this period, and send an alert otherwise. You can find out a node's ranges
> by querying JMX via StorageServiceMBean.getRangeToEndpointMap.
>
> To make this task a bit simpler you could probably add a secondary index
> to the participants column of the repair_history table with:
>
> CREATE INDEX myindex ON system_distributed.repair_history (participants);
>
> and check each node's status individually with:
>
> SELECT * FROM system_distributed.repair_history
> WHERE keyspace_name = 'ks' AND columnfamily_name = 'cf'
> AND id > minTimeuuid('<now - gc_grace_seconds/2>')
> AND participants CONTAINS 'node_IP';
>
>
>
> 2016-02-25 16:22 GMT-03:00 Jimmy Lin <y2...@gmail.com>:
>
>> hi Paulo,
>>
>> one more follow up ... :)
>>
>>  I noticed these tables are supposed to be replicated to all nodes in the cluster, and are not node-specific.
>>
>> How does it work when a repair job targets only the local DC vs. all DCs? Are there any columns or flags that tell the difference?
>> Or does it actually matter?
>>
>>  thanks
>>
>>
>>
>>
>> Sent from my iPhone
>>
>> On Feb 25, 2016, at 10:37 AM, Paulo Motta <pa...@gmail.com>
>> wrote:
>>
>> > Why will each repair job execution have 2 entries? I thought it would
>> > be one entry, beginning with the started_at column filled, and when it
>> > completes, the finished_at column will be filled.
>>
>> That's correct, I was mistaken!
>>
>> > Also, if my cluster has more than 1 keyspace, then given the way this
>> > table is structured, it will have multiple entries, one for each
>> > keyspace_name value, no? Thanks
>>
>> right, because repair sessions in different keyspaces will have different
>> repair session ids.
>>
>> 2016-02-25 15:04 GMT-03:00 Jimmy Lin <y2...@gmail.com>:
>>
>>> hi Paulo,
>>>
>>> follow up on the # of entries question...
>>>
>>>  Why will each repair job execution have 2 entries?
>>> I thought it would be one entry, beginning with the started_at column filled, and when it completes, the finished_at column will be filled.
>>>
>>> Also, if my cluster has more than 1 keyspace, then given the way this table is structured, it will have multiple entries, one for each keyspace_name value, no?
>>>
>>> thanks
>>>
>>>
>>>
>>> Sent from my iPhone
>>>

Re: how to read parent_repair_history table?

Posted by Paulo Motta <pa...@gmail.com>.

> Is there any better way to find out a node's token ranges? I see that the
> system.peers column family seems to include range information, so that is
> promising, but when I look at both the DataStax Java driver and Python
> driver, their APIs both require a keyspace name and host name. I wonder why?

Range information is per keyspace, since it is related to the chosen topology
(replication factor, multiple DCs, etc.). You can construct ranges from
system.peers, but you will still need to look at keyspace info, so that will
be much more complicated. So I guess the driver's getTokenRanges is your
best bet.

> And just to be sure, the participants column in the repair_history table
> represents the node being repaired and not the node being used to
> compare the data, correct?

participants include all participants of the repair, including the
coordinator.
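
To illustrate why tokens alone are not enough, here is a minimal Python sketch of the first step only: turning per-node tokens (such as those that could be read from system.peers and system.local) into primary ranges on the ring. The function name and the input shape are hypothetical, and the keyspace-dependent part, working out which replicas also own each range, is deliberately left out.

```python
def primary_ranges(tokens_by_node):
    """Map each node to the (start, end] ranges that end at its own tokens.

    tokens_by_node: dict of node address -> iterable of integer tokens.
    """
    # Sort every (token, owner) pair to get the ring order.
    ring = sorted(
        (token, node)
        for node, tokens in tokens_by_node.items()
        for token in tokens
    )
    ranges = {node: [] for node in tokens_by_node}
    for i, (token, node) in enumerate(ring):
        # For i == 0 the index -1 wraps around to the last token on the ring.
        start = ring[i - 1][0]
        ranges[node].append((start, token))
    return ranges
```

For example, with tokens {"10.0.0.1": [0, 100], "10.0.0.2": [50]}, node 10.0.0.2 gets the single range (0, 50], while 10.0.0.1 gets (50, 100] and the wrap-around range (100, 0]. Which other nodes replicate those ranges depends on each keyspace's replication settings, which is the part the drivers handle when given a keyspace name.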

2016-03-01 0:44 GMT-03:00 Jimmy Lin <y2...@gmail.com>:

> Is there any better way to find out a node's token ranges? I see that the
> system.peers column family seems to include range information, so that is
> promising, but when I look at both the DataStax Java driver and Python
> driver, their APIs both require a keyspace name and host name. I wonder why?
>
>
> http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/Metadata.html#getTokenRanges-java.lang.String-com.datastax.driver.core.Host-
>
>
> And just to be sure, the participants column in the repair_history table
> represents the node being repaired and not the node being used to
> compare the data, correct?

Re: how to read parent_repair_history table?

Posted by Jimmy Lin <y2...@gmail.com>.

Is there any better way to find out a node's token ranges? I see that the
system.peers column family seems to include range information, so that is
promising, but when I look at both the DataStax Java driver and Python
driver, their APIs both require a keyspace name and host name. I wonder why?


http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/Metadata.html#getTokenRanges-java.lang.String-com.datastax.driver.core.Host-


And just to be sure, the participants column in the repair_history table
represents the node being repaired and not the node being used to
compare the data, correct?



Re: how to read parent_repair_history table?

Posted by Paulo Motta <pa...@gmail.com>.

> how does it work when repair job targeting only local vs all DC? is there
any columns or flag i can tell the difference? or does it actualy matter?

You cannot easily find out from the parent_repair_history table whether a
repair is local-only or multi-DC. I created
https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more
information to that table. Since that table only has the id as primary key,
you'd need to do a full scan to perform checks on it, or keep track of the
parent session id when submitting the repair and query by primary key.

What you could probably do to check that your nodes are being repaired on
time is to run, for each table:

select * from repair_history where keyspace_name = 'ks'
and columnfamily_name = 'cf'
and id > mintimeuuid(now() - gc_grace_seconds/2);

(the now() - gc_grace_seconds/2 expression is shorthand; substitute the
cutoff timestamp computed by your client)

Then verify for each node whether all of its ranges have been repaired in
this period, and send an alert otherwise. You can find out a node's ranges
by querying JMX via StorageServiceMBean.getRangeToEndpointMap.

To make this task a bit simpler you could probably add a secondary index on
the participants column of the repair_history table with:

CREATE INDEX myindex ON system_distributed.repair_history (participants);

and check each node's status individually with:

select * from repair_history where keyspace_name = 'ks'
and columnfamily_name = 'cf'
and id > mintimeuuid(now() - gc_grace_seconds/2)
and participants contains 'node_IP';
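
The per-node check described above could be done on the client side along the following lines. This is only a minimal Python sketch under several assumptions: the repair_history rows have already been fetched and are represented as plain dicts (range_begin, range_end, status, participants, finished_at), a status of 'SUCCESS' marks a repaired range, the node's ranges were obtained separately (for example from getRangeToEndpointMap), and ranges are compared by exact match, which would not hold for sub-range repairs. The helper names cutoff_for and unrepaired_ranges and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def cutoff_for(gc_grace_seconds, now=None):
    """Return the oldest acceptable repair time: now - gc_grace_seconds / 2."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(seconds=gc_grace_seconds / 2)


def unrepaired_ranges(node_ip, node_ranges, rows, cutoff):
    """Return the ranges of node_ip with no successful repair since cutoff.

    node_ranges: iterable of (range_begin, range_end) tuples owned by the node.
    rows: repair_history rows as dicts (assumed shape, not a driver type).
    """
    repaired = {
        (row["range_begin"], row["range_end"])
        for row in rows
        if row["status"] == "SUCCESS"          # assumed status value
        and node_ip in row["participants"]
        and row["finished_at"] >= cutoff
    }
    # Anything the node owns that is not in the repaired set needs an alert.
    return sorted(set(node_ranges) - repaired)


# Hypothetical sample data for a single table.
now = datetime(2016, 2, 25, tzinfo=timezone.utc)
cutoff = cutoff_for(gc_grace_seconds=864000, now=now)  # 10 days -> 5 day window
rows = [
    {"range_begin": "0", "range_end": "100", "status": "SUCCESS",
     "participants": {"10.0.0.1", "10.0.0.2"},
     "finished_at": now - timedelta(days=1)},
    {"range_begin": "100", "range_end": "200", "status": "FAILED",
     "participants": {"10.0.0.1", "10.0.0.3"},
     "finished_at": now - timedelta(days=1)},
]
missing = unrepaired_ranges("10.0.0.1", [("0", "100"), ("100", "200")], rows, cutoff)
print(missing)
```

Computing the cutoff in the client also gives a concrete timestamp to pass to mintimeuuid in the queries above.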




Re: how to read parent_repair_history table?

Posted by Jimmy Lin <y2...@gmail.com>.

hi Paulo,
one more follow up ... :)
I noticed these tables are supposed to be replicated to all nodes in the cluster, and are not per-node specific.
How does it work when a repair job targets only the local DC vs all DCs? Is there any column or flag that tells the difference? Or does it actually matter?
thanks




Re: how to read parent_repair_history table?

Posted by Paulo Motta <pa...@gmail.com>.

> why will each repair job execution have 2 entries? I thought it would be
> one entry, beginning with the started_at column filled, and when it
> completes, the finished_at column will be filled.

that's correct, I was mistaken!

> Also, if my cluster has more than 1 keyspace, the way this table is
> structured, it will have multiple entries, one for each keyspace_name
> value, no? thanks

right, because repair sessions in different keyspaces will have different
repair session ids.
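
Since each keyspace gets its own entries, a scan of parent_repair_history can be grouped by keyspace_name to find keyspaces that have had no clean repair recently. The following is a minimal Python sketch of that idea, assuming the rows were already fetched and are represented as plain dicts; the function name stale_keyspaces, the max_age threshold and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_keyspaces(rows, known_keyspaces, max_age, now):
    """Return keyspaces with no cleanly finished repair within max_age of now."""
    latest = {}
    for row in rows:
        finished = row.get("finished_at")
        if finished is None or row.get("exception_message"):
            continue  # not finished (yet), or finished with an error
        ks = row["keyspace_name"]
        if ks not in latest or finished > latest[ks]:
            latest[ks] = finished
    return sorted(
        ks for ks in known_keyspaces
        if ks not in latest or now - latest[ks] > max_age
    )


# Hypothetical sample data: one row per repair session and keyspace.
now = datetime(2016, 2, 25, tzinfo=timezone.utc)
rows = [
    {"keyspace_name": "ks1", "finished_at": now - timedelta(days=2),
     "exception_message": None},
    {"keyspace_name": "ks2", "finished_at": now - timedelta(days=9),
     "exception_message": None},
    {"keyspace_name": "ks3", "finished_at": now - timedelta(days=1),
     "exception_message": "Repair failed"},
    {"keyspace_name": "ks1", "finished_at": None, "exception_message": None},
]
stale = stale_keyspaces(rows, ["ks1", "ks2", "ks3"], timedelta(days=5), now)
print(stale)
```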


Re: how to read parent_repair_history table?

Posted by Jimmy Lin <y2...@gmail.com>.

hi Paulo,
follow up on the # of entries question...
Why will each repair job execution have 2 entries? I thought it would be one entry, beginning with the started_at column filled, and when it completes, the finished_at column will be filled.
Also, if my cluster has more than 1 keyspace, the way this table is structured, it will have multiple entries, one for each keyspace_name value, no? thanks



Re: how to read parent_repair_history table?

Posted by Paulo Motta <pa...@gmail.com>.

Hello Jimmy,

The parent_repair_history table keeps track of start and finish information
of a repair session. The other table, repair_history, keeps track of repair
status as it progresses. So, you must first query the parent_repair_history
table to check whether a repair started and finished, as well as its
duration, and then inspect the repair_history table to troubleshoot more
specific details of a given repair session.

Answering your questions below:

> Is every invocation of nodetool repair recorded as one entry in the
> parent_repair_history CF, regardless of whether it is across DCs, a local
> node repair, or uses other options?

Actually two entries, one for start and one for finish.

> A repair job is done only if the "finished_at" column contains a value? and
> a repair job is successfully done only if there is no value in
> exception_message or exception_stacktrace?

correct

> what is the purpose of the successful_ranges column? do i have to check
> that they all match requested_ranges to ensure a successful run?

correct

> Ultimately, how to find out the overall repair health/status in a given
> cluster?

Check if repair is being executed on all nodes within gc_grace_seconds, and
tune that value or troubleshoot problems otherwise.

> Scanning through parent_repair_history and making sure all the known
> keyspaces have had a good repair run in recent days?

Sounds good.

You can check https://issues.apache.org/jira/browse/CASSANDRA-5839 for more
information.
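
The checks discussed above (finished_at set, no exception recorded, successful_ranges matching requested_ranges) can be combined into a single classification of a parent_repair_history row. This is a minimal Python sketch, assuming the row is available as a plain dict keyed by column name; the function name repair_state, the state labels and the sample rows are hypothetical and not part of Cassandra.

```python
from datetime import datetime, timezone


def repair_state(row):
    """Classify one parent_repair_history row (given as a dict)."""
    if row.get("finished_at") is None:
        return "unfinished"  # started, but no finish has been recorded
    if row.get("exception_message") or row.get("exception_stacktrace"):
        return "failed"
    requested = set(row.get("requested_ranges") or ())
    successful = set(row.get("successful_ranges") or ())
    # Fully successful only when every requested range was repaired.
    return "ok" if requested == successful else "partial"


# Hypothetical sample rows.
finished = datetime(2016, 2, 25, 7, 0, tzinfo=timezone.utc)
rows = [
    {"finished_at": None, "requested_ranges": {"(0,100]"}},
    {"finished_at": finished, "exception_message": "Repair failed",
     "requested_ranges": {"(0,100]"}},
    {"finished_at": finished,
     "requested_ranges": {"(0,100]", "(100,200]"},
     "successful_ranges": {"(0,100]"}},
    {"finished_at": finished, "requested_ranges": {"(0,100]"},
     "successful_ranges": {"(0,100]"}},
]
states = [repair_state(row) for row in rows]
print(states)
```

An "unfinished" row may be a repair that is still running or one whose coordinator died before recording a finish, so it is worth comparing started_at against the expected repair duration.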

