| title: | Linux cluster Clumembd heartbeat problem |
|
I am not sure if this is the correct place to post this. If it is not,
and you know where I should post it, could you please tell me?
(This affects the RedHat ES 3.0 clumanager software.)
I have found a problem with the clumembd daemon where the heartbeat
message is rejected by the other nodes, causing the node to be powered off.
If you have an Ethernet interface with an alias and are using multicast, the
source address of the packet may contain either the main IP address or
the alias address. If it contains the alias address, the message is
rejected by all other nodes because it now carries the wrong IP address.
The software correctly creates a socket on the main interface, and at first
the correct IP address is sent. Some time later, on the same socket, the
alias address seems to get into the packets.
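
For comparison, here is a minimal sketch of one common way to pin a multicast sending socket to a single local address on Linux. It is not taken from the clumembd source; the function name open_mcast_sender is hypothetical, and whether binding like this would avoid the behaviour described above is an untested assumption.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not from clumembd: open a UDP socket that is bound
 * to local_ip and uses the interface owning local_ip for multicast output.
 * Returns the socket descriptor, or -1 on any error. */
int open_mcast_sender(const char *local_ip)
{
    struct sockaddr_in local;
    int fd;

    assert(local_ip != NULL);

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = htons(0);          /* let the kernel pick a port */
    if (inet_pton(AF_INET, local_ip, &local.sin_addr) != 1)
        return -1;                      /* not a valid IPv4 address string */

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* Bind to the main address so the socket has an explicit local address
     * instead of leaving source selection to the kernel. */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(fd);
        return -1;
    }

    /* Send multicast through the interface that owns local_ip. */
    if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF,
                   &local.sin_addr, sizeof(local.sin_addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On the machine described here this would be called as open_mcast_sender("10.10.197.11"), and the heartbeat would then be sent with sendto() to 225.0.0.11 on the returned descriptor.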
I have extracted the relevant parts of my log file, showing the output
from the debugging lines I inserted into the code.
The computer has these interfaces:
bond0 addr 10.10.197.11
bond0:0 addr 10.10.197.6
Multicast set up:
clumembd[2]: <debug add_interface fd:4 name:bond0
clumembd[2]: <debug Interface IP is 10.10.197.11
clumembd[2]: <debug Setting up multicast 225.0.0.11 on 10.10.197.11
clumembd[2]: <debug Multicast send fd:5 (10.10.197.11)
clumembd[2]: <debug Multicast receive fd:6
Sending and receiving a message (correct behaviour):
clumembd[2]: <debug sending multicast message fd:5 ,nodeid:1
,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: <debug update_seen new msg nodeid:1 token:0x0002881d4119638e
After a while you get the following (sinp = source address, nsp = expected address):
clumembd[2]: <debug sending multicast message fd:5 ,nodeid:1
,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: <debug update_seen new msg nodeid:1 token:0x0002881d4119638e
clumembd[2]: <debug IP/NodeID mismatch: Probably another cluster on our
subnet... msg from nodeid:1 sinp:10.10.197.6 nsp:10.10.197.11
The source address is now the bond0:0 address, where previously it was the
bond0 address. The socket has not changed.
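
To make the rejection concrete, here is a minimal sketch of the kind of comparison that produces the mismatch message. It is an illustration only, not the clumembd code: the helper names source_matches_node and make_addr are hypothetical, and only the roles of sinp (source address of the received packet) and nsp (address expected for that node ID) come from the log above.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the packet's source address (sinp)
 * equals the address expected for the node (nsp), 0 otherwise. */
int source_matches_node(const struct sockaddr_in *sinp,
                        const struct sockaddr_in *nsp)
{
    assert(sinp != NULL && nsp != NULL);
    return sinp->sin_addr.s_addr == nsp->sin_addr.s_addr;
}

/* Hypothetical helper: fill a sockaddr_in from a dotted-quad string.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address. */
int make_addr(const char *ip, struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

With nsp set to 10.10.197.11, a packet arriving from 10.10.197.11 matches, while the same node's packet arriving from 10.10.197.6 does not, even though both addresses belong to bond0 on node 1.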
This looks to me like a bug in the sending routine (it uses sendto from the
standard library).
Has anyone else noticed this sort of behaviour when sending multicast messages
on an Ethernet device with multiple addresses?
Cheers
Royce
|