Posted to dev@qpid.apache.org by wei6rong <we...@hotmail.com> on 2012/09/20 11:02:51 UTC

qpid-cluster broker recovery failed

Hi guys,

I have the following issue.
I have two nodes, 172.26.184.35 and 172.26.184.45.

On 172.26.184.35:
cd /root/qpid-0.18/cpp/src/.libs
./lt-qpidd -d --auth 0 -p 5682 --cluster-name cluster_one --load-module ./cluster.so --data-dir=/tmp/5682
./lt-qpidd -d --auth 0 -p 5683 --cluster-name cluster_two --load-module ./cluster.so --data-dir=/tmp/5683

On 172.26.184.45:
cd /root/qpid-0.18/cpp/src/.libs
./lt-qpidd -d --auth 0 -p 5682 --cluster-name cluster_two --load-module ./cluster.so --data-dir=/tmp/5682
./lt-qpidd -d --auth 0 -p 5683 --cluster-name cluster_one --load-module ./cluster.so --data-dir=/tmp/5683
 
35-->45
On 172.26.184.35:
cd /root/qpid-0.18/cpp/src/.libs
qpid-config -b 172.26.184.35:5682 --durable add queue queue-test --limit-policy=flow-to-disk --flow-stop-size=0 --flow-resume-size=0 --file-size=2 --file-count=4 --max-queue-size=10000
qpid-config -b 172.26.184.45:5682 --durable add queue queue-test --limit-policy=flow-to-disk --flow-stop-size=0 --flow-resume-size=0 --file-size=2 --file-count=4 --max-queue-size=10000
qpid-config -b 172.26.184.45:5682 --durable add exchange direct ex-test
qpid-config -b 172.26.184.45:5682 --durable bind ex-test queue-test queue-test
qpid-route queue add 172.26.184.45:5682 172.26.184.35:5682 ex-test queue-test --durable

After that, I used
qpid-cluster -C 172.26.184.35:5683 -k
to kill cluster_two.

But when I restart cluster_two, it throws NoSuchTransportException at
Broker.cpp:1006.
The output log follows:
[root@localhost .libs]# ./lt-qpidd --auth 0 -p 5682 --cluster-name cluster_two --load-module ./cluster.so --data-dir=/tmp/5682
2012-09-20 16:44:37 [Unspecified] notice Journal "TplStore": Created
2012-09-20 16:44:37 [Unspecified] notice Store module initialized; store-dir=/tmp/5682
2012-09-20 16:44:37 [HA] notice Initializing CPG
2012-09-20 16:44:37 [HA] notice Cluster store state: clean cluster-id=2ee15447-470a-4ef8-bf96-46bd29551bf7 shutdown-id=7d851924-0082-44ab-9f2f-e43bd21eaece
2012-09-20 16:44:37 [HA] notice cluster(172.26.184.45:4822 PRE_INIT) configuration change: 172.26.184.45:4822
2012-09-20 16:44:37 [HA] notice cluster(172.26.184.45:4822 PRE_INIT) Members joined: 172.26.184.45:4822
2012-09-20 16:44:37 [Unspecified] notice Journal "queue-test": Created
2012-09-20 16:44:37 [Unspecified] warning Journal "queue-test": Recovery found 4 files (different from --num-jfiles value of 8).
2012-09-20 16:44:37 [Unspecified] warning Journal "queue-test": Recovery found file size = 2 (different from --jfile-size-pgs value of 24).
2012-09-20 16:44:37 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-20 16:44:39 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-20 16:44:43 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-20 16:44:51 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-20 16:45:07 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-20 16:45:39 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)

From what I understand, Plugin::initializeAll(*this) is called at
qpid/broker/Broker.cpp:338. But when I restart cluster_two, there is link
info in my rhm store, and the recovery code at qpid/broker/Broker.cpp:287,
store->recover(recoverer);, runs before Plugin::initializeAll(*this).

So recovery throws NoSuchTransportException because at that time some
plugins are not yet initialized.

Is that possible? I would very much appreciate any advice you may offer.

Best Regards
weirong
 




--
View this message in context: http://qpid.2158936.n2.nabble.com/qpid-cluster-broker-recovery-failed-tp7582404.html
Sent from the Apache Qpid developers mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@qpid.apache.org
For additional commands, e-mail: dev-help@qpid.apache.org

Re: qpid-cluster broker recovery failed

Posted by Gordon Sim <gs...@redhat.com>.

Sorry for the delay in responding!

On 09/24/2012 03:51 AM, wei6rong wrote:
> In my test, it looks like the link never gets created; the broker keeps
> retrying and logging errors, eventually at 64-second intervals. My output
> log follows.

Yes, durable links are in fact broken in 0.18, thanks for bringing that 
to our attention.

The tcp plugin is not initialised at the time the store is recovered 
which causes the first error. The subsequent retries are on a timer 
thread which I thought should fix the problem as soon as the broker 
completes its initialisation.

However the problem is that the store recovery never completes when 
there are durable links as the broker tries to re-store them on recovery.

This appears to be a regression introduced by 
https://issues.apache.org/jira/browse/QPID-3767 that we haven't picked 
up in tests.

I have raised a new JIRA: 
https://issues.apache.org/jira/browse/QPID-4347 for this.
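
To illustrate the failure mode described above (recovered durable links being written back to the store while that store is still being recovered), here is a minimal C++ sketch of one conventional guard: skip persistence while replaying. `ToyStore`, `DurableLinks`, `declare`, `recover` and `count` are hypothetical names invented for this example. They are not the Qpid store or link management interfaces, and this is not claimed to be how QPID-4347 was actually resolved.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory store (NOT the real Qpid store interface).
class ToyStore {
  public:
    void create(const std::string& name) { records_.push_back(name); }
    const std::vector<std::string>& records() const { return records_; }
    std::size_t size() const { return records_.size(); }
  private:
    std::vector<std::string> records_;
};

// Hypothetical holder of durable links (NOT Qpid's link management code).
class DurableLinks {
  public:
    explicit DurableLinks(ToyStore& store) : store_(store), recovering_(false) {}

    // Declare a link. It is persisted only when it is genuinely new,
    // i.e. when we are not replaying records that came from the store.
    void declare(const std::string& name) {
        links_.push_back(name);
        if (!recovering_) store_.create(name);
    }

    // Replay every stored record. Iterating over a copy keeps the loop
    // bounded even if a record were appended to the store meanwhile.
    void recover() {
        recovering_ = true;
        const std::vector<std::string> snapshot = store_.records();
        for (std::size_t i = 0; i < snapshot.size(); ++i) declare(snapshot[i]);
        recovering_ = false;
    }

    std::size_t count() const { return links_.size(); }
  private:
    ToyStore& store_;
    bool recovering_;
    std::vector<std::string> links_;
};
```

With the guard in place, recovering a store that holds one link leaves exactly one record in the store; without it, every recovered link would be stored a second time during recovery.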


Re: qpid-cluster broker recovery failed

Posted by wei6rong <we...@hotmail.com>.

Gordon Sim wrote
> certainly look like they are recovered links.
> 
> The TCP support is compiled in to the broker, but the factory is only 
> registered when the plugins are initialised.
> 
> However, the store is recovered *before* the plugin initialisation. That 
> seems to me to be a bug.
> 
> Do the links eventually get created? I.e. do the errors stop eventually? 
> The link management logic should keep retrying (with a backoff strategy) 
> until they are established, and the plugins should all eventually be 
> initialised.

Hi Gordon,
Thanks for your reply.

In my test, it looks like the link never gets created; the broker keeps
retrying and logging errors, eventually at 64-second intervals. My output
log follows.

[root@localhost .libs]# ./lt-qpidd --auth 0 -p 5682 --cluster-name cluster_two --load-module ./cluster.so --data-dir=/tmp/5682
2012-09-24 10:22:25 [Unspecified] notice Journal "TplStore": Created
2012-09-24 10:22:25 [Unspecified] notice Store module initialized; store-dir=/tmp/5682
2012-09-24 10:22:25 [HA] notice Initializing CPG
2012-09-24 10:22:25 [HA] notice Cluster store state: clean cluster-id=0888a5cb-9f2d-4f37-8c51-050e42aee723 shutdown-id=9f7b9e38-1488-4c7a-951b-4a1c13f7a706
2012-09-24 10:22:25 [HA] notice cluster(172.26.184.45:24761 PRE_INIT) configuration change: 172.26.184.45:24761
2012-09-24 10:22:25 [HA] notice cluster(172.26.184.45:24761 PRE_INIT) Members joined: 172.26.184.45:24761
2012-09-24 10:22:25 [Unspecified] notice Journal "queue-test": Created
2012-09-24 10:22:25 [Unspecified] warning Journal "queue-test": Recovery found 4 files (different from --num-jfiles value of 8).
2012-09-24 10:22:25 [Unspecified] warning Journal "queue-test": Recovery found file size = 2 (different from --jfile-size-pgs value of 24).
2012-09-24 10:22:25 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-24 10:22:27 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-24 10:22:31 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-24 10:22:39 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-24 10:22:55 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-24 10:23:27 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-24 10:24:31 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
2012-09-24 10:25:35 [Broker] error Link connection to 172.26.184.35:5682 failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)
......
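
The error timestamps in the log above are 2, 4, 8, 16, 32, 64 and 64 seconds apart, which is consistent with a doubling backoff that stops growing at 64 seconds. A minimal C++ sketch that reproduces that schedule follows. The function name `retryDelays` and the parameters `initial` and `cap` are assumptions made for illustration, with default values inferred from the log; this is not the broker's actual retry code.

```cpp
#include <algorithm>
#include <vector>

// Returns the first `attempts` retry delays in seconds: start at `initial`,
// double after every failed attempt, and never exceed `cap`.
// The defaults (2 and 64) are inferred from the log timestamps only.
std::vector<int> retryDelays(int attempts, int initial = 2, int cap = 64) {
    std::vector<int> delays;
    int delay = initial;
    for (int i = 0; i < attempts; ++i) {
        delays.push_back(delay);
        delay = std::min(delay * 2, cap);  // double, but clamp at the ceiling
    }
    return delays;
}
```

Calling `retryDelays(7)` yields 2, 4, 8, 16, 32, 64, 64, matching the gaps between the eight error lines. A retry loop of this kind never gives up, so if the underlying cause is not cleared the errors continue indefinitely at the capped interval.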


Best Regards
Rong



Re: qpid-cluster broker recovery failed

Posted by Gordon Sim <gs...@redhat.com>.

On 09/21/2012 02:33 AM, wei6rong wrote:
>
> Hi,
> No. My default --module-dir
> [root@localhost daemon]# pwd
> /usr/local/lib/qpid/daemon
> contains the following libraries:
> [root@localhost daemon]# ls
> acl.la  cluster.la  ha.la  msgstore.so     replicating_listener.la  replication_exchange.la  ssl.la  watchdog.la  xml.la
> acl.so  cluster.so  ha.so  msgstore.so_16  replicating_listener.so  replication_exchange.so  ssl.so  watchdog.so  xml.so
> [root@localhost daemon]#
>
> As you can see, the failure message is "Unsupported transport type: tcp", not ssl or rdma.

Sorry! I missed that.

The errors:

> 2012-09-20 16:44:37 [Broker] error Link connection to 172.26.184.35:5682
> failed: Unsupported transport type: tcp (qpid/broker/Broker.cpp:1006)

certainly look like they are recovered links.

The TCP support is compiled in to the broker, but the factory is only 
registered when the plugins are initialised.

However, the store is recovered *before* the plugin initialisation. That 
seems to me to be a bug.

Do the links eventually get created? I.e. do the errors stop eventually? 
The link management logic should keep retrying (with a backoff strategy) 
until they are established, and the plugins should all eventually be 
initialised.
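
The ordering problem described above can be reproduced in miniature. The following C++11 sketch registers the "tcp" factory only during plugin initialisation and shows that a link recovered before that step cannot connect. `TransportRegistry`, `registerFactory` and `startBroker` are hypothetical names for this illustration, and `std::runtime_error` stands in for NoSuchTransportException; none of this is Qpid's real code. It models only the first connection attempt, not the later retries.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the broker's table of transport factories.
class TransportRegistry {
  public:
    void registerFactory(const std::string& transport) { transports_.insert(transport); }

    // Same shape as the lookup in Broker::connect: an unknown name throws.
    void connect(const std::string& transport) const {
        if (transports_.count(transport) == 0)
            throw std::runtime_error("Unsupported transport type: " + transport);
    }
  private:
    std::set<std::string> transports_;
};

// Runs "recover the store" and "initialise the plugins" in the requested
// order and reports whether the recovered tcp link managed to connect.
bool startBroker(bool recoverBeforePlugins) {
    TransportRegistry registry;
    bool linkConnected = false;

    auto initializePlugins = [&registry] { registry.registerFactory("tcp"); };
    auto recoverStore = [&registry, &linkConnected] {
        try {
            registry.connect("tcp");  // a recovered durable link reconnects
            linkConnected = true;
        } catch (const std::runtime_error&) {
            linkConnected = false;    // factory not registered yet
        }
    };

    if (recoverBeforePlugins) { recoverStore(); initializePlugins(); }
    else                      { initializePlugins(); recoverStore(); }
    return linkConnected;
}
```

`startBroker(true)` returns false (the order reported in this thread), while `startBroker(false)` returns true.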


RE: qpid-cluster broker recovery failed

Posted by wei6rong <we...@hotmail.com>.

Hi,
No. My default --module-dir
[root@localhost daemon]# pwd
/usr/local/lib/qpid/daemon
contains the following libraries:
[root@localhost daemon]# ls
acl.la  cluster.la  ha.la  msgstore.so     replicating_listener.la  replication_exchange.la  ssl.la  watchdog.la  xml.la
acl.so  cluster.so  ha.so  msgstore.so_16  replicating_listener.so  replication_exchange.so  ssl.so  watchdog.so  xml.so
[root@localhost daemon]#

As you can see, the failure message is "Unsupported transport type: tcp", not ssl or rdma. I restarted cluster_two under gdb, and at Broker.cpp:1004,
boost::shared_ptr<ProtocolFactory> pf = getProtocolFactory(transport);
I printed transport, which is tcp, and pf, which is 0x0, so NoSuchTransportException is thrown.

The following is the snippet where NoSuchTransportException is thrown:
void Broker::connect(
    const std::string& host, const std::string& port, const std::string& transport,
    boost::function2<void, int, std::string> failed,
    sys::ConnectionCodec::Factory* f)
{
    // Look up the factory registered for this transport name (here "tcp").
    boost::shared_ptr<ProtocolFactory> pf = getProtocolFactory(transport);
    if (pf) pf->connect(poller, host, port, f ? f : factory.get(), failed);
    // No factory is registered under that name, so pf is null and we throw.
    else throw NoSuchTransportException(QPID_MSG("Unsupported transport type: " << transport));
}

thanks
rong



Re: qpid-cluster broker recovery failed

Posted by Gordon Sim <gs...@redhat.com>.

On 09/20/2012 10:02 AM, wei6rong wrote:
> From what I understand, Plugin::initializeAll(*this) is called at
> qpid/broker/Broker.cpp:338. But when I restart cluster_two, there is link
> info in my rhm store, and the recovery code at qpid/broker/Broker.cpp:287,
> store->recover(recoverer);, runs before Plugin::initializeAll(*this).
>
> So recovery throws NoSuchTransportException because at that time some
> plugins are not yet initialized.
>
> Is that possible? I would very much appreciate any advice you may offer.

The NoSuchTransportException would suggest that the SSL (or less likely 
the rdma) plugin was not loaded or not configured on restart but was 
somehow being requested...

Do you have SSL enabled federation links recorded in the store?

