Juniper SRX Failover Testing Part 1


I thought it would be better to split the SRX clustering topic across multiple posts, as my first post got pretty long! So here is part 2 😀

Let's dive straight in!

Having configured the cluster in my previous post, we will now see how the failover process works. I will be using two methods for failover testing:

i) A manual failover, where I will manually fail over redundancy group 1 from node0 to node1
ii) An interface failover (hard failover), where I will shut down the ports on node0 and the cluster should fail over to node1
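
For reference, the tests below assume the redundancy-group configuration from my previous post, which looks roughly like this (priorities and preempt match the status output shown later):

set chassis cluster redundancy-group 0 node 0 priority 100
set chassis cluster redundancy-group 0 node 1 priority 1
set chassis cluster redundancy-group 1 node 0 priority 100
set chassis cluster redundancy-group 1 node 1 priority 1
set chassis cluster redundancy-group 1 preempt

See the previous post for the full cluster configuration.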

Pre Testing Checks
Before each test, I checked that the status of the chassis cluster was as expected, with Node0 as primary and Node1 as secondary:

[email protected]_SRX220_Top> show chassis cluster status        
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 5
    node0                   100         primary        no       no  
    node1                   1           secondary      no       no  

Redundancy group: 1 , Failover count: 31
    node0                   100         primary        yes      no  
    node1                   1           secondary      yes      no
As this cluster is an Active/Standby setup, all traffic should be flowing via Node0. To check this, I started a rolling ping from trust to untrust in two separate windows. As we can see, the flows are going through Node0 as expected:

[email protected]_SRX220_Top> show security flow session    
node0:
--------------------------------------------------------------------------

Session ID: 5932, Policy name: ping/4, State: Active, Timeout: 2, Valid
  In: 172.16.0.2/11 --> 192.168.0.2/10498;icmp, If: reth0.10, Pkts: 1, Bytes: 84
  Out: 192.168.0.2/10498 --> 172.16.0.2/11;icmp, If: reth1.20, Pkts: 1, Bytes: 84

Session ID: 5933, Policy name: ping/5, State: Active, Timeout: 2, Valid
  In: 192.168.0.2/3 --> 172.16.0.2/10500;icmp, If: reth1.20, Pkts: 1, Bytes: 84
  Out: 172.16.0.2/10500 --> 192.168.0.2/3;icmp, If: reth0.10, Pkts: 1, Bytes: 84

Session ID: 5934, Policy name: ping/4, State: Active, Timeout: 2, Valid
  In: 172.16.0.2/12 --> 192.168.0.2/10498;icmp, If: reth0.10, Pkts: 1, Bytes: 84
  Out: 192.168.0.2/10498 --> 172.16.0.2/12;icmp, If: reth1.20, Pkts: 1, Bytes: 84

Session ID: 5935, Policy name: ping/5, State: Active, Timeout: 2, Valid
  In: 192.168.0.2/4 --> 172.16.0.2/10500;icmp, If: reth1.20, Pkts: 1, Bytes: 84
  Out: 172.16.0.2/10500 --> 192.168.0.2/4;icmp, If: reth0.10, Pkts: 1, Bytes: 84

Session ID: 5936, Policy name: ping/4, State: Active, Timeout: 4, Valid
  In: 172.16.0.2/13 --> 192.168.0.2/10498;icmp, If: reth0.10, Pkts: 1, Bytes: 84
  Out: 192.168.0.2/10498 --> 172.16.0.2/13;icmp, If: reth1.20, Pkts: 1, Bytes: 84

Session ID: 5937, Policy name: ping/5, State: Active, Timeout: 4, Valid
  In: 192.168.0.2/5 --> 172.16.0.2/10500;icmp, If: reth1.20, Pkts: 1, Bytes: 84
  Out: 172.16.0.2/10500 --> 192.168.0.2/5;icmp, If: reth0.10, Pkts: 1, Bytes: 84
Total sessions: 6

node1:
--------------------------------------------------------------------------
Total sessions: 0
I will have rolling pings running between the trust and untrust zones in separate terminal windows; these will show if any packets are dropped during the failover.
I will be failing over both redundancy groups 0 and 1.
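
For reference, the rolling pings were kicked off with plain ping commands from the test switch, along the lines of the below (the routing-instance names are from my lab):

{master:0}
root> ping 192.168.0.2 routing-instance trust

{master:0}
root> ping 172.16.0.2 routing-instance untrust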

Test A (Manual failover)

To perform a manual failover, you will need to run the command request chassis cluster failover redundancy-group {0|1} node {0|1}

[email protected]_SRX220_Top> request chassis cluster failover redundancy-group 0 node 1    
node1:
--------------------------------------------------------------------------
Initiated manual failover for redundancy group 0

{primary:node0}
[email protected]_SRX220_Top> request chassis cluster failover redundancy-group 1 node 1    
node1:
--------------------------------------------------------------------------
Initiated manual failover for redundancy group 1

{secondary-hold:node0}

Once the commands have been run, we can see that both redundancy groups have failed over, as Node1 now has the higher priority (255). We can also see that for Redundancy Group 0, Node0 has secondary-hold status. Secondary-hold means the node is in a passive state and cannot be promoted to active/primary. The secondary-hold interval is 5 minutes, so you will have to wait until this interval has passed before you can fail Redundancy Group 0 back to Node0.
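
As a side note, the hold-down interval is configurable per redundancy group with something like the line below (for redundancy group 0 the minimum is 300 seconds, which is where the 5 minute wait comes from):

set chassis cluster redundancy-group 0 hold-down-interval 300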

{secondary-hold:node0}
[email protected]_SRX220_Top> show chassis cluster status 
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 6
    node0                   100         secondary-hold no       yes 
    node1                   255         primary        no       yes 

Redundancy group: 1 , Failover count: 32
    node0                   100         secondary      yes      yes 
    node1                   255         primary        yes      yes 

{secondary-hold:node0}

After the 5 minute interval, you can see that Node0 has moved from secondary-hold to secondary:

[email protected]_SRX220_Top> show chassis cluster status    
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 6
    node0                   100         secondary      no       yes 
    node1                   255         primary        no       yes 

Redundancy group: 1 , Failover count: 32
    node0                   100         secondary      yes      yes 
    node1                   255         primary        yes      yes 

{secondary:node0}

As we can see from the rolling pings, in total 3 packets out of 2138 were dropped, which rounds down to the 0% packet loss reported in the ping statistics. Not a noticeable drop of traffic.

{master:0}
root> ping 192.168.0.2 routing-instance trust 

--- 192.168.0.2 ping statistics ---
1071 packets transmitted, 1069 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.870/2.275/9.126/0.523 ms

---------------------------------------------------------------------------------

{master:0}
root> ping 172.16.0.2 routing-instance untrust    

--- 172.16.0.2 ping statistics ---
1067 packets transmitted, 1066 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.887/2.509/5.126/0.351 ms

Having failed over to Node1, we can clear the manual failover by using the command request chassis cluster failover reset redundancy-group 1. This resets the nodes’ priorities to their configured values. The same command can also be used if a device becomes unreachable or a redundancy group’s priority drops to zero.

{secondary:node0}
[email protected]_SRX220_Top> request chassis cluster failover reset redundancy-group 1    
node0:
--------------------------------------------------------------------------
No reset required for redundancy group 1.

node1:
--------------------------------------------------------------------------
Successfully reset manual failover for redundancy group 1

{secondary:node0}
[email protected]_SRX220_Top> request chassis cluster failover reset redundancy-group 0    
node0:
--------------------------------------------------------------------------
No reset required for redundancy group 0.

node1:
--------------------------------------------------------------------------
Successfully reset manual failover for redundancy group 0

As we have preempt enabled on Redundancy Group 1, it will automatically fail back to Node0.

[email protected]_SRX220_Top> show chassis cluster status   
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 6
    node0                   100         secondary      no       no  
    node1                   1           primary        no       no  

Redundancy group: 1 , Failover count: 33
    node0                   100         primary        yes      no  
    node1                   1           secondary      yes      no

Redundancy Group 0, on the other hand, does not support preempt, so you will need to do another manual failover to make Node0 the primary of the cluster again.

Manual Failover of Node0 output
[email protected]_SRX220_Top> request chassis cluster failover redundancy-group 0 node 0 
node0:
--------------------------------------------------------------------------
Initiated manual failover for redundancy group 0
{secondary:node0}                                   
[email protected]_SRX220_Top> show chassis cluster status    
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 7
    node0                   255         primary        no       yes 
    node1                   1           secondary      no       yes 

Redundancy group: 1 , Failover count: 33
    node0                   100         primary        yes      no  
    node1                   1           secondary      yes      no
{primary:node0}
[email protected]_SRX220_Top> request chassis cluster failover reset redundancy-group 0     
node0:
--------------------------------------------------------------------------
Successfully reset manual failover for redundancy group 0

node1:
--------------------------------------------------------------------------
No reset required for redundancy group 0.
{primary:node0}
[email protected]_SRX220_Top> show chassis cluster status                                  
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 7
    node0                   100         primary        no       no  
    node1                   1           secondary      no       no  

Redundancy group: 1 , Failover count: 33
    node0                   100         primary        yes      no  
    node1                   1           secondary      yes      no
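
Before moving on, it is worth sanity-checking that the cluster is fully healthy again; the commands below will show the state of the control/fabric links and the heartbeat counters:

show chassis cluster interfaces
show chassis cluster statistics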

My next post will look at Test B, Interface Failover. See you on the other side 😀


Keeran Marquis

Network Engineer
Keeran Marquis is a Network Engineer. His main goal is to learn everything within the networking field, pick up a little bit of scripting, be a poor man’s sysadmin, and share whatever he knows! All posts are his own views, opinions and experiences; no guarantees they will work for you, but they should point you in the right direction 🙂

3 thoughts on “Juniper SRX Failover Testing Part 1”

  1. Munish Saini

    Thanks for sharing this Info. I have tried failover for reth1 which holds my untrust connection & it works good without any packet loss, however when I failover the reth0 I see complete loss for 30 secs. I tweaked the heartbeat to minimum but still no success. I am using srx 240’s not sure if this is base behaviour of the HA setup.

  2. Jesse

    Wow! So glad I found your blog. I’m struggling at getting my SRX340 cluster and EX3400 (Virtual Chassis) setup up and running and you’ve helped a bunch. I’m wondering if you would be willing to assist me further? I’m not a network guru, unfortunately, so I’m struggling to fully understand what exactly I’m doing. If you would be able to spare some time, that would be awesome. Let me know.

    Thanks,

    Jesse
