In my previous post, I successfully failed over the redundancy groups on the cluster using the Manual Failover and Interface Failure methods. This post will look into the methods that can be used when upgrading an SRX Chassis Cluster.
You have two methods of upgrading an SRX Cluster:
Method A (Individual Node upgrades)
This method can be used for downgrading Junos as well as upgrading, and has no Junos version limitation. With this method you simply upgrade both individual nodes at the same time. As I have already uploaded the Junos image onto both nodes, I just need to run the command on BOTH Node0 and Node1 from Operational Mode:
{primary:node0} root@lab_SRX220_Top> request system software add /var/tmp/junos-srxsme-12.1X44-D45.2-domestic.tgz
{secondary:node1} root@lab_SRX220_Top> request system software add /var/tmp/junos-srxsme-12.1X44-D45.2-domestic.tgz
Once the package has been added on both, you will need to reboot both nodes simultaneously. You can use request system reboot node all from Node0.
After the reboot, you will need to update the backup image of Junos on both nodes, so that the primary and backup images are consistent.
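On these branch models this is done with a system snapshot; as a minimal sketch (assuming the dual-root partitioning used on the SRX220), you would run the following on each node once it is back up on the new release:
request system snapshot slice alternate
This copies the currently running root file system onto the alternate slice, so both boot images end up on the new version.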
Method B (In Service Software Upgrades)
Before I begin with in-service upgrades, Juniper have two types of in-service upgrade. The High-End Data Centre SRX models (SRX1400, SRX3400, SRX5600 and SRX5800) use In-Service Software Upgrade (ISSU), while the Small/Medium Branch SRX models (SRX100, SRX110, SRX220, SRX240 and SRX650) use In-Band Cluster Upgrade (ICU). Although the commands are near enough the same, the pre-upgrade requirements, service impacts and the minimum Junos firmware version that supports in-service upgrades are different.
As I'm using 2x SRX220H2 model firewalls, I will be upgrading via ICU. When I get a chance to upgrade a High-End SRX model, I will update the post with my findings :p
Even before you consider using the ISSU/ICU method, I am telling you (not just recommending!!) to check the Juniper page Limitation on ISSU and ICU. The page will confirm which versions of Junos are supported by ISSU/ICU and (more importantly) which services are not supported by ISSU/ICU. In essence, you will need to check what services you are running on your SRX cluster and whether they are supported. If they are not supported, then you are told DO NOT perform an upgrade with this method.
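To see what is actually configured on the cluster before cross-checking it against that page, a quick way (my own habit rather than anything Juniper mandates) is to review the relevant configuration stanzas:
show configuration security
show configuration services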
With that out of the way, and once you have checked that your cluster is fully supported (firmware and services) by ISSU/ICU, you can proceed with the pre-checks.
Pre-Upgrade Checks ICU
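The pre-checks here come down to confirming the cluster is healthy, that there is enough free storage, and that the image is sitting in /var/tmp. As a minimal sketch of the commands involved (my own summary rather than an exhaustive Juniper checklist):
show chassis cluster status
show system storage
file list /var/tmp/ detail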
Having confirmed all the pre-checks are good, we can proceed with the upgrade. It is important to note that during an ICU, there WILL BE A SERVICE DISRUPTION! The disruption will be approximately 30 seconds with the no-sync option. During these 30 seconds traffic will be dropped and flow sessions will be lost. You will need to keep this in mind if you are doing this upgrade in-hours, or if you need to keep an accurate record of your flow sessions for any reason.
To start the upgrade, we need to run request system software in-service-upgrade /path/to/package no-sync
{primary:node0} root@lab_SRX220_Top> request system software in-service-upgrade /var/tmp/junos-srxsme-12.1X44-D45.2-domestic.tgz no-sync
ICU Console observations
WARNING: in-service-upgrade shall reboot both the nodes in your cluster. Please ignore any subsequent reboot request message
Node1 is upgraded first:
ISSU: start downloading software package on secondary node
Pushing bundle to node1
{.......}
JUNOS 12.1X44-D45.2 will become active at next reboot
WARNING: A reboot is required to load this software correctly
WARNING: Use the 'request system reboot' command
WARNING: when software installation is complete
Saving state for rollback ...
ISSU: failover all redundancy-groups 1...n to primary node
Successfully reset all redundancy-groups priority back to configured priority.
Successfully reset all redundancy-groups priority back to configured priority.
Initiated manual failover for all redundancy-groups to node0
Redundancy-groups-0 will not failover and the primaryship remains unchanged.
ISSU: rebooting Secondary Node
Shutdown NOW! [pid 13353]
ISSU: Waiting for secondary node node1 to reboot.
ISSU: node 1 went down
ISSU: Waiting for node 1 to come up
ISSU: node 1 came up
ISSU: secondary node node1 booted up.
Shutdown NOW!
root> ping 172.16.0.2 routing-instance trust
--- 172.16.0.2 ping statistics ---
1600 packets transmitted, 1594 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.720/2.640/13.673/0.652 ms
--------------------------------------------------------------------------
root> ping 192.168.0.2 routing-instance untrust
--- 192.168.0.2 ping statistics ---
1600 packets transmitted, 1594 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.838/2.535/13.669/0.681 ms
To verify that the upgrade has been successful, we can run show version:
{secondary:node0} root@lab_SRX220_Top> show version
node0:
--------------------------------------------------------------------------
Hostname: lab_SRX220_Top
Model: srx220h2
JUNOS Software Release [12.1X44-D45.2]
node1:
--------------------------------------------------------------------------
Hostname: lab_SRX220_Top
Model: srx220h2
JUNOS Software Release [12.1X44-D45.2]
And run show chassis cluster status to check that the cluster status is as expected:
root@lab_SRX220_Top> show chassis cluster status
Cluster ID: 1
Node                  Priority     Status     Preempt  Manual failover
Redundancy group: 0 , Failover count: 0
    node0             100          secondary  no       no
    node1             1            primary    no       no
Redundancy group: 1 , Failover count: 1
    node0             100          primary    yes      no
    node1             1            secondary  yes      no
We can see that we are running the upgraded version of Junos. As expected, Redundancy Group 0 is primary on Node1 and Redundancy Group 1 is primary on Node0. As discussed in my previous post, with preempt enabled Redundancy Group 1 will automatically fail over to Node0 once it is available. We will have to do a manual failover of Redundancy Group 0 back to Node0 from Node1, and we will need to update the backup image of Junos to have a consistent primary and backup image.
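As a rough sketch of that tidy-up (assuming the standard manual failover commands rather than anything ICU-specific), it would look something like this:
request chassis cluster failover redundancy-group 0 node 0
request chassis cluster failover reset redundancy-group 0
The failover reset clears the manual failover flag once Redundancy Group 0 is back on Node0, and the backup image can then be refreshed on both nodes with request system snapshot slice alternate, as in Method A.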
If you had a case where you had to abort the ICU process, you will need to run request system software abort in-service-upgrade on the primary node. It is important to note that if you do use the abort command, you will put the cluster into an inconsistent state, where the secondary node will be running a newer version of Junos than the primary node. To recover the cluster into a consistent state, you will need to roll back the software on the secondary node and reboot it.
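As a minimal sketch of that recovery (my assumption based on standard Junos package rollback behaviour, rather than a documented ICU procedure), you would run the following on the secondary node and let it come back up on the original release:
request system software rollback
request system reboot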
**UPDATE 29/4/2015**
Luckily enough, as I was finishing up this series of posts, my colleague had finished working on the SRX1400 we have in our lab! So I was able to test the ISSU upgrade on a High-End SRX Series device. Happy Days!!!
Pre-Upgrade Checks ISSU
To start the upgrade, all the redundancy groups first need to be failed over to one active node. As I have an active/backup setup, all my redundancy groups are already primary on node0:
{primary:node0} root@lab_be-rtr0-h3> show chassis cluster status
Cluster ID: 1
Node                  Priority     Status     Preempt  Manual failover
Redundancy group: 0 , Failover count: 3
    node0             100          primary    no       no
    node1             99           secondary  no       no
Redundancy group: 1 , Failover count: 5
    node0             100          primary    yes      no
    node1             99           secondary  yes      no
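If any redundancy group were still primary on node1 at this point, it could be moved across first; a minimal sketch (using the standard manual failover commands, with the group number adjusted to suit) would be:
request chassis cluster failover redundancy-group 1 node 0
request chassis cluster failover reset redundancy-group 1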
To begin the upgrade process, we need to run request system software in-service-upgrade /path/to/package reboot
ISSU Console observations
Validation succeeded
failover all RG 1+ groups to node 0
Initiated manual failover for all redundancy-groups to node0
Redundancy-groups-0 will not failover and the primaryship remains unchanged.
ISSU: Preparing Backup RE
Pushing bundle to node1
ISSU: Backup RE Prepare Done
Waiting for node1 to reboot.
node1 booted up.
Waiting for node1 to become secondary
node1 became secondary.
Waiting for node1 to be ready for failover
ISSU: Preparing Daemons
It takes around 5-10 minutes before you see any more output confirming that the upgrade process is still going on! Again, you will need to be patient as this does take its time!
Secondary node1 ready for failover.
{.......}
Failing over all redundancy-groups to node1
ISSU: Preparing for Switchover
Initiated failover for all the redundancy groups to node1
Waiting for node1 take over all redundancy groups
root> ping 172.31.13.2 routing-instance trust
--- 172.31.13.2 ping statistics ---
1818 packets transmitted, 1817 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.769/3.080/44.226/3.536 ms
--------------------------------------------------------------------------
root> ping 192.168.13.2 routing-instance untrust
--- 192.168.13.2 ping statistics ---
1821 packets transmitted, 1820 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.831/3.071/44.524/3.244 ms
To verify that the upgrade has been successful, we can run the command show version:
{secondary:node0} root@lab_be-rtr0-h3> show version
node0:
--------------------------------------------------------------------------
Hostname: lab_be-rtr0-h3
Model: srx1400
JUNOS Software Release [12.1X46-D10.2]
node1:
--------------------------------------------------------------------------
Hostname: lab_be-rtr0-i3
Model: srx1400
JUNOS Software Release [12.1X46-D10.2]
And run show chassis cluster status to check that the cluster status is as expected:
{secondary:node0} root@lab_be-rtr0-h3> show chassis cluster status
Cluster ID: 1
Node                  Priority     Status     Preempt  Manual failover
Redundancy group: 0 , Failover count: 0
    node0             100          secondary  no       no
    node1             99           primary    no       no
Redundancy group: 1 , Failover count: 1
    node0             100          primary    yes      no
    node1             99           secondary  yes      no
We can see that we are running the upgraded version of Junos. As expected, Redundancy Group 0 is primary on Node1 and Redundancy Group 1 is primary on Node0. As discussed in my previous post, with preempt enabled Redundancy Group 1 will automatically fail over to Node0 once it is available. We will have to do a manual failover of Redundancy Group 0 back to Node0 from Node1, and we will need to update the backup image of Junos to have a consistent primary and backup image.
Phew, that was a lot of work and quite a bit to take in there!! Time for a break (a drink or 6 lol).
My next post will be the last post in the SRX Chassis Cluster Series (sad times). It will be a nice simple one on how to disable chassis cluster!
Keeran Marquis
hi
"request system reboot node all"
this command is not available on the SRX 550
request system reboot ?
Possible completions:
  <[Enter]>    Execute this command
  at           Time at which to perform the operation
  in           Number of minutes to delay before operation
  media        Boot media for next boot
  message      Message to display to all users
  |            Pipe through a command
there is no option given for node
regards
adam
hi adam
Do you have a cluster configured? You would only see the "node all" statement when you have a cluster. A quick check on a standalone SRX220 shows the same output as you're getting:
Even without using the "node all" statement, if you run request system reboot it should reboot both members as well.
Let me know if this helps
Cheers
Keeran
Hi Keeran,
I need to update two SRX 220H in a cluster, but when I tried to copy the Junos image from Node0 to Node1, the image was corrupted. I used the following command in start shell mode:
rcp -T /var/tmp/junos-srxsme-12.3X48-D30.7-domestic.tgz node0:/var/tmp
I was reading about it (I mean the best practice) but it was not possible…
After that I tried with the following:
file copy /var/tmp/junos-srxsme-12.1X47-D10.4-domestic.tgz node1:/var/tmp/
ssh: Could not resolve hostname node1: hostname nor servname provided, or not known
lost connection
error: put-file failed
error: could not send local copy of file
I also tried the following command in start shell mode:
rcp /var/tmp/junos-srxsme-12.1X47-D10.4-domestic.tgz node1: /var/tmp/
cp: /var/tmp/junos-srxsme-12.1X47-D10.4-domestic.tgz and /var/tmp/junos-srxsme-12.1X47-D10.4-domestic.tgz are identical (not copied).
I'm not sure whether "node1" there means node 1 or the hostname of the SRX… maybe this is my fault…
I read your post, I will try with this command:
request system software in-service-upgrade /path/to/package no-sync…
Do you have any suggestions?
Many thanks
Hi Keeran,
We have an SRX1400 cluster running JUNOS Software Release [12.3X48-D35.7]. I was running some failover test cases. In one of the cases, when we power off the primary node, the failover works fine, but after powering on the primary node, a few minutes later (6 minutes to be precise) the traffic stops between the zones for almost 10 minutes and then the cluster comes back to normal operation.
I also tried removing the "preempt" parameter, but the results were the same. Please let me know how to eliminate this downtime.
Regards