
Upgrading Dual Routing Engine Juniper MX Series


In one of my previous posts, I explained how you would go about upgrading a Juniper EX switch. I said whenever I got the chance to upgrade an MX Series Router, I’d get something noted down…… *raises hands* today is the day! As I’ve said in a few posts, there has been a lot of change and my team is now getting access to the Core Juniper MX Series Routers. As part of this increased access, one of our first tasks is to upgrade Junos from 12.3R5.7 to 14.1R6.4. Most, if not all, MX Series routers above the MX80 come with two Routing Engines (REs), and the two are independent of each other. This being the case, when upgrading an MX you will need to upgrade each RE individually.

This post will go over what you need to do to upgrade an MX Router. In my setup I’ll be upgrading a Juniper MX480, and I’ll be doing the upgrade via the console port on each Routing Engine.

To link the two Routing Engines together, you will need to apply configuration similar to what I used:

set groups re0 system host-name re0-mx480
set groups re0 interfaces fxp0 unit 0 family inet address x.x.x.x/x
set groups re1 system host-name re1-mx480
set groups re1 interfaces fxp0 unit 0 family inet address x.x.x.x/x
set apply-groups re1
set apply-groups re0

set chassis redundancy graceful-switchover
set routing-options nonstop-routing
set system commit synchronize
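Once this is committed, a couple of read-only commands can be used to sanity-check that the redundancy pieces actually took effect; a minimal sketch (run from the Master RE, exact output varies by platform and release):

show chassis routing-engine
show task replication

show chassis routing-engine confirms both REs are present with the expected master/backup roles, and show task replication reports the nonstop-routing replication state.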

With that all cleared up.. Let’s get cracking 🙂

Pre Works

Upload the new firmware package to wherever you normally keep them. Currently, we upload the package into the /var/tmp folder on the device in question:

[[email protected] ~]$ scp jinstall-14.1R6.4-domestic-signed.tgz re0-mx480:/var/tmp
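It’s also worth checking the package arrived intact before kicking anything off; a quick sketch of what I’d run on the RE (compare the MD5 against the value published on Juniper’s download page):

file list /var/tmp/
file checksum md5 /var/tmp/jinstall-14.1R6.4-domestic-signed.tgz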

Having just covered how to link the two REs together, for the upgrade you will need to disable graceful-switchover and nonstop-routing. Skipping this step can result in the control plane and forwarding plane running two different Junos versions, which can cause a number of issues!

deactivate chassis redundancy graceful-switchover
deactivate routing-options nonstop-routing
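These deactivations then need committing to both REs; a minimal sketch from configuration mode (the second command should show graceful-switchover flagged as inactive):

commit synchronize
run show configuration chassis redundancy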

Upgrade Process

Having disabled both graceful-switchover and nonstop-routing, log onto the Backup RE, either via console or by running request routing-engine login re1 from the Master RE. Once on the Backup RE, you will need to run the command request system software validate add /var/tmp/xxx reboot.

[email protected]> request system software validate add /var/tmp/jinstall-14.1R6.4-domestic-signed.tgz reboot
NOTE
If you’re like us and save the new firmware package to the local device, DO NOT specify which RE the package is stored on when you run the software add command. If you do specify the package’s location, once the upgrade completes on one of the REs it will delete the image from the device!
Additionally, if you had requested a session from re0 to re1 to connect to the Backup RE, once the RE reboots you will get this message and be booted off:

[email protected]>                                                                                
*** FINAL System shutdown message from [email protected] ***                 

System going down IMMEDIATELY                                                  

                                                                               
rlogin: connection closed

If you have console access, you can watch the upgrade ticking along; if you don’t, you can confirm the Backup RE is up and running with the command show chassis routing-engine, which shows the status and hardware stats for both Routing Engines.

show chassis routing-engine output
[email protected]> show chassis routing-engine    
Routing Engine status:
  Slot 0:
    Current state                  Master
    Election priority              Master (default)
    Temperature                 31 degrees C / 87 degrees F
    CPU temperature             37 degrees C / 98 degrees F
    DRAM                      3584 MB (3584 MB installed)
    Memory utilization          20 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     4 percent
      Interrupt                  0 percent
      Idle                      96 percent
    Model                          RE-S-2000
    Serial ID                      9012021718
    Start time                     2016-03-22 13:14:37 GMT
    Uptime                         3 hours, 38 minutes, 16 seconds
    Last reboot reason             Router rebooted after a normal shutdown.
    Load averages:                 1 minute   5 minute  15 minute
                                       0.01       0.01       0.00
Routing Engine status:
  Slot 1:
    Current state                  Backup
    Election priority              Backup (default)
    Temperature                 33 degrees C / 91 degrees F
    CPU temperature             38 degrees C / 100 degrees F
    DRAM                      3584 MB (4096 MB installed)
    Memory utilization          16 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     0 percent
      Interrupt                  0 percent
      Idle                     100 percent
    Model                          RE-S-2000
    Serial ID                      9012022174
    Start time                     2016-03-22 16:47:56 GMT
    Uptime                         4 minutes, 45 seconds
    Last reboot reason             Router rebooted after a normal shutdown.
    Load averages:                 1 minute   5 minute  15 minute
                                       0.34       0.47       0.23
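Before failing anything over, it’s also worth double-checking that the Backup RE actually came back up on the new code. From the Master RE (a minimal sketch, using the same command as the final verification later in this post):

show version invoke-on other-routing-engine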

Having upgraded the Backup RE, to reduce the downtime and service impact you can fail over mastership, so that the Backup becomes the new Master Routing Engine. This is a manual process: from the current Master RE, run the command request chassis routing-engine master switch. This WILL cause a brief outage as the PFE is reset and the new firmware is loaded.

[email protected]> request chassis routing-engine master switch    
warning: Traffic will be interrupted while the PFE is re-initialized
Toggle mastership between routing engines ? [yes,no] (no) yes 

Resolving mastership...
Complete. The other routing engine becomes the master.

You can confirm this by running show chassis routing-engine again on RE0:

AFTER failing over Routing Engine
[email protected]> show chassis routing-engine 
Routing Engine status:
  Slot 0:
    Current state                  Backup
    Election priority              Master (default)
    Temperature                 32 degrees C / 89 degrees F
    CPU temperature             39 degrees C / 102 degrees F
    DRAM                      3584 MB (3584 MB installed)
    Memory utilization          16 percent
    CPU utilization:
      User                       2 percent
      Background                 0 percent
      Kernel                     1 percent
      Interrupt                  0 percent
      Idle                      97 percent
    Model                          RE-S-2000
    Serial ID                      9012021718
    Start time                     2016-03-22 13:14:37 GMT
    Uptime                         3 hours, 50 minutes, 7 seconds
    Last reboot reason             Router rebooted after a normal shutdown.
    Load averages:                 1 minute   5 minute  15 minute
                                       0.21       0.07       0.02
Routing Engine status:
  Slot 1:
    Current state                  Master
    Election priority              Backup (default)
    Temperature                 33 degrees C / 91 degrees F
    CPU temperature             41 degrees C / 105 degrees F
    DRAM                      3584 MB (4096 MB installed)
    Memory utilization          22 percent
    CPU utilization:
      User                      43 percent
      Background                 0 percent
      Kernel                    28 percent
      Interrupt                  0 percent
      Idle                      29 percent
    Model                          RE-S-2000
    Serial ID                      9012022174
    Start time                     2016-03-22 16:47:56 GMT
    Uptime                         16 minutes, 42 seconds
    Last reboot reason             Router rebooted after a normal shutdown.
    Load averages:                 1 minute   5 minute  15 minute
                                       4.71       1.11       0.46

Having failed over the REs, all that’s needed is to repeat the same command as before, request system software validate add /var/tmp/xxx reboot, to install the new firmware on RE0.

[email protected]> request system software add /var/tmp/jinstall-14.1R6.4-domestic-signed.tgz reboot

Once the Routing Engine has upgraded, you need to re-enable graceful-switchover and nonstop-routing first; you will see why in a moment.

activate chassis redundancy graceful-switchover
activate routing-options nonstop-routing
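As with the deactivation earlier, these changes need pushing to both REs; a minimal sketch from configuration mode:

commit synchronize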

After committing and synchronising those changes, you will need to set RE0 back as the Master Routing Engine. This is done by running request chassis routing-engine master switch, this time from RE1.

[email protected]>request chassis routing-engine master switch 
Toggle mastership between routing engines ? [yes,no] (no) yes 

Resolving mastership...
Complete. The other routing engine becomes the master.

{backup}
[email protected]> 

Now that Graceful Switchover is enabled, when you run the command you won’t see the same warning about traffic being disrupted. This is because Graceful Switchover preserves interface and kernel information, allowing the PFE to continue forwarding packets even while one of the REs is unavailable.

NOTE
More detail on Graceful Switchover can be found in Juniper’s TechLibrary
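If you want to check that GRES is back in sync before toggling mastership again, show system switchover (which is only valid from the Backup RE) reports whether the kernel and configuration databases are ready; a minimal sketch:

show system switchover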

We will need to save a backup of the currently running and active file system by issuing the command request system snapshot on both the primary and backup REs:

[email protected]> request system snapshot 
Doing the initial labeling...
Verifying compatibility of destination media partitions...
Running newfs (899MB) on hard-disk media  / partition (ad2s1a)...
Running newfs (100MB) on hard-disk media  /config partition (ad2s1e)...
Copying '/dev/ad0s1a' to '/dev/ad2s1a' .. (this may take a few minutes)
Copying '/dev/ad0s1e' to '/dev/ad2s1e' .. (this may take a few minutes)
The following filesystems were archived: / /config

Finally, confirm the code version by running show version invoke-on all-routing-engines. I’ve used the commands show version and show version invoke-on other-routing-engine just so I can put the two outputs side by side, which looks neater in a post :p

show version output from RE0 and RE1
{master}
[email protected]> show version          
Hostname: RE0-MX480-02
Model: mx480
Junos: 14.1R6.4
JUNOS Base OS boot [14.1R6.4]
JUNOS Base OS Software Suite [14.1R6.4]
JUNOS Packet Forwarding Engine Support (M/T/EX Common) [14.1R6.4]
JUNOS Packet Forwarding Engine Support (MX Common) [14.1R6.4]
JUNOS platform Software Suite [14.1R6.4]
JUNOS Runtime Software Suite [14.1R6.4]
JUNOS Online Documentation [14.1R6.4]
JUNOS Services AACL Container package [14.1R6.4]
JUNOS AppId Services [14.1R6.4]
JUNOS Services Application Level Gateways [14.1R6.4]
JUNOS Services Captive Portal and Content Delivery Container package [14.1R6.4]
JUNOS Border Gateway Function package [14.1R6.4]
JUNOS Services HTTP Content Management package [14.1R6.4]
JUNOS IDP Services [14.1R6.4]
JUNOS Services LL-PDF Container package [14.1R6.4]
JUNOS Services Jflow Container package [14.1R6.4]
JUNOS Services MobileNext Software package [14.1R6.4]
JUNOS Services Mobile Subscriber Service Container package [14.1R6.4]
JUNOS Services PTSP Container package [14.1R6.4]
JUNOS Services NAT [14.1R6.4]
JUNOS Services RPM [14.1R6.4]           
JUNOS Services Stateful Firewall [14.1R6.4]
JUNOS Voice Services Container package [14.1R6.4]
JUNOS Services Crypto [14.1R6.4]
JUNOS Services SSL [14.1R6.4]
JUNOS Services IPSec [14.1R6.4]
JUNOS py-base-i386 [14.1R6.4]
JUNOS Kernel Software Suite [14.1R6.4]
JUNOS Crypto Software Suite [14.1R6.4]
JUNOS Routing Software Suite [14.1R6.4]
{master}
[email protected]> show version invoke-on other-routing-engine 
re1:
--------------------------------------------------------------------------
Hostname: RE1-MX480-02
Model: mx480
Junos: 14.1R6.4
JUNOS Base OS boot [14.1R6.4]
JUNOS Base OS Software Suite [14.1R6.4]
JUNOS Packet Forwarding Engine Support (M/T/EX Common) [14.1R6.4]
JUNOS Packet Forwarding Engine Support (MX Common) [14.1R6.4]
JUNOS platform Software Suite [14.1R6.4]
JUNOS Runtime Software Suite [14.1R6.4]
JUNOS Online Documentation [14.1R6.4]
JUNOS Services AACL Container package [14.1R6.4]
JUNOS Services Application Level Gateways [14.1R6.4]
JUNOS AppId Services [14.1R6.4]
JUNOS Services Captive Portal and Content Delivery Container package [14.1R6.4]
JUNOS Border Gateway Function package [14.1R6.4]
JUNOS Services HTTP Content Management package [14.1R6.4]
JUNOS Services Jflow Container package [14.1R6.4]
JUNOS IDP Services [14.1R6.4]
JUNOS Services LL-PDF Container package [14.1R6.4]
JUNOS Services MobileNext Software package [14.1R6.4]
JUNOS Services Mobile Subscriber Service Container package [14.1R6.4]
JUNOS Services NAT [14.1R6.4]           
JUNOS Services RPM [14.1R6.4]
JUNOS Services PTSP Container package [14.1R6.4]
JUNOS Services Stateful Firewall [14.1R6.4]
JUNOS Voice Services Container package [14.1R6.4]
JUNOS Services SSL [14.1R6.4]
JUNOS Services Crypto [14.1R6.4]
JUNOS Services IPSec [14.1R6.4]
JUNOS py-base-i386 [14.1R6.4]
JUNOS Kernel Software Suite [14.1R6.4]
JUNOS Crypto Software Suite [14.1R6.4]
JUNOS Routing Software Suite [14.1R6.4]

And with that, we have an upgraded Dual Routing Engine MX Series router! 😀 Yay! Now that I’ve got the access, I’ll most likely mess about with an ISSU upgrade on MX next, so keep an eye out for that one!

Reference

Configuring Dual Routing Engines MX Series
Procedure to Upgrade JUNOS on a Dual Routing Engine System
Understanding Graceful Switchover (GRES)


Configuring a Virtual Chassis on QFX5100


When configuring a 2-member Virtual Chassis using 2x QFX5100, there is a slight difference compared to the EX Series, but it’s very much similar. As this is the case, I thought I’d do a quick post (and it doubles up as documentation writing for work lol). The QFX doesn’t have dedicated VC ports or a VC module, so you’ll have to use either 10Gb SFP+ port(s) or 40Gb QSFP+ port(s) to connect the switches together. The method is the same as configuring a Virtual Chassis on an EX switch using VCEP ports; the one difference is that with the QFX you can do the entire configuration with the VCEP ports pre-connected, which wasn’t the case with the EX Series. However, similarly, it’s recommended that you have the Backup Routing Engine (RE) or Linecard powered off if you’re using the preprovisioned method like I am 🙂

Let’s get cracking 😀

As covered in one of my previous posts, Configuring Virtual Chassis on Juniper EX Series, it’s recommended that you have the following commands set before proceeding with configuring a Virtual Chassis:

set system commit synchronize
set chassis redundancy graceful-switchover
set routing-options nonstop-routing
set protocols layer2-control nonstop-bridging
Note
On the QFX, the nonstop-bridging command is under the protocols layer2-control stanza, NOT the ethernet-switching-options stanza as on the EX Series
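For comparison, a minimal sketch of the two forms (the EX command is from memory, so double-check it against your code version):

EX Series: set ethernet-switching-options nonstop-bridging
QFX5100:   set protocols layer2-control nonstop-bridging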

Once these have been committed, you can configure the virtual-chassis stanza.

[email protected]> show configuration virtual-chassis | display set 
set virtual-chassis preprovisioned
set virtual-chassis no-split-detection
set virtual-chassis member 0 role routing-engine
set virtual-chassis member 0 serial-number TA3715110057
set virtual-chassis member 1 role routing-engine
set virtual-chassis member 1 serial-number TA3715110028

Just like on the EX Series, you have to set the VC-Ports on the Master Routing Engine so it knows those ports are being used as the Virtual Chassis interconnects:

[email protected]> request virtual-chassis vc-port set pic-slot 0 port 48
[email protected]> request virtual-chassis vc-port set pic-slot 0 port 50

Having completed the entire configuration on the Master RE, you can power up the Backup RE. Once the Backup has booted, as done on the Master, you have to set the 40Gb QSFP+ ports as the VC-Ports:

[email protected]> request virtual-chassis vc-port set pic-slot 0 port 48 
[email protected]> request virtual-chassis vc-port set pic-slot 0 port 50

Once this is done, the Virtual Chassis is created; you’ll be kicked out of the Backup RE and will have to log back into the switch, where you will land on the Master Routing Engine. To verify that everything is working as expected, you can run the commands show virtual-chassis vc-port and show virtual-chassis:

show virtual-chassis vc-port and show virtual-chassis output
[email protected]> show virtual-chassis vc-port    
fpc0:
--------------------------------------------------------------------------
Interface   Type              Trunk  Status       Speed        Neighbor
or                             ID                 (mbps)       ID  Interface
PIC / Port
0/48        Configured          5    Up           40000        1   vcp-255/0/48
0/50        Configured          5    Up           40000        1   vcp-255/0/50

fpc1:
--------------------------------------------------------------------------
Interface   Type              Trunk  Status       Speed        Neighbor
or                             ID                 (mbps)       ID  Interface
PIC / Port
0/48        Configured          5    Up           40000        0   vcp-255/0/48
0/50        Configured          5    Up           40000        0   vcp-255/0/50
[email protected]> show virtual-chassis 

Preprovisioned Virtual Chassis
Virtual Chassis ID: 1165.24bd.5581
Virtual Chassis Mode: Enabled
                                                Mstr           Mixed Route Neighbor List
Member ID  Status   Serial No    Model          prio  Role      Mode  Mode ID  Interface
0 (FPC 0)  Prsnt    TA3715110057 qfx5100-48s-6q 129   Master*      N  VC   1  vcp-255/0/48
                                                                           1  vcp-255/0/50
1 (FPC 1)  Prsnt    TA3715110028 qfx5100-48s-6q 129   Backup       N  VC   0  vcp-255/0/48
                                                                           0  vcp-255/0/50

And you’re done! As I said, it’s very similar to configuring a Virtual Chassis on the EX Series, except for a couple of small changes that could throw someone off if they didn’t know!

For more in-depth detail you can check Juniper’s TechLibrary page


Virtual Chassis Upgrade with Minimal Downtime


At work we were looking to do a firmware upgrade of our Junos, going from 12.3 to 13.2X, and we have a few VC switches. The plan was to use the NSSU method so that we didn’t get any downtime. However, when testing, I would kick off the NSSU and the backup member would upgrade, reboot and come up as expected:

{master:0}
[email protected]> ...p/jinstall-ex-4200-13.2X51-D35.3-domestic-signed.tgz    
Chassis ISSU Check Done
[Dec 18 04:32:13]:ISSU: Validating Image
[Dec 18 04:32:41]:ISSU: Preparing Backup RE
[Dec 18 04:32:42]: Installing image on other FPC's along with the backup

[Dec 18 04:32:42]: Checking pending install on fpc1
[Dec 18 04:33:41]: Pushing bundle to fpc1
NOTICE: Validating configuration against mchassis-install.tgz.
NOTICE: Use the 'no-validate' option to skip this if desired.
WARNING: A reboot is required to install the software
WARNING:     Use the 'request system reboot' command immediately
[Dec 18 04:34:42]: Completed install on fpc1
[Dec 18 04:34:53]: Backup upgrade done
[Dec 18 04:34:53]: Rebooting Backup RE

Rebooting fpc1
[Dec 18 04:34:54]:ISSU: Backup RE Prepare Done
[Dec 18 04:34:54]: Waiting for Backup RE reboot

After an hour of looking at this on the master, I consoled into the backup to see whether it had booted and was up, and I clearly had an issue. I aborted the NSSU and checked to see what was going on; the backup member had upgraded and had connected to the master:

{master:0}
[email protected]> show version 
fpc0:
--------------------------------------------------------------------------
Hostname: EX4200-A
Model: ex4200-48t
JUNOS Base OS boot [12.3R5.7]
JUNOS Base OS Software Suite [12.3R5.7]
JUNOS Kernel Software Suite [12.3R5.7]
JUNOS Crypto Software Suite [12.3R5.7]
JUNOS Online Documentation [12.3R5.7]
JUNOS Enterprise Software Suite [12.3R5.7]
JUNOS Packet Forwarding Engine Enterprise Software Suite [12.3R5.7]
JUNOS Routing Software Suite [12.3R5.7]
JUNOS Web Management [12.3R5.7]
JUNOS FIPS mode utilities [12.3R5.7]

fpc1:
--------------------------------------------------------------------------
Hostname: EX4200-A
Model: ex4200-48t
JUNOS EX  Software Suite [13.2X51-D35.3]
JUNOS FIPS mode utilities [13.2X51-D35.3]
JUNOS Online Documentation [13.2X51-D35.3]
JUNOS EX 4200 Software Suite [13.2X51-D35.3]
JUNOS Web Management [13.2X51-D35.3]

I thought this was very odd, so I checked the logs to see if anything was out of the norm and saw that the VCP ports had come up; however, the attempts to bring the backup member online had timed out :/

show log messages output
[email protected]> show log messages | last 100    
Dec 18 04:44:33  EX4200-A /kernel: tcp_timer_rexmt: Dropping socket connection due to error: 65
Dec 18 04:44:36  EX4200-A last message repeated 4 times
Dec 18 05:01:30  EX4200-A chassism[1280]: cm_ff_ifd_disable: fast failover disabled for internal-0/26
Dec 18 05:01:30  EX4200-A chassism[1280]: cm_ff_ifd_disable: fast failover disabled for internal-0/27
Dec 18 05:01:30  EX4200-A vccpd[1282]: ifl vcp-0.32768 set up, ifl flags 0, flags 1
Dec 18 05:01:30  EX4200-A vccpd[1282]: interface vcp-0 came up
Dec 18 05:01:30  EX4200-A chassism[1280]: cm_ff_ifd_disable: fast failover disabled for internal-1/26
Dec 18 05:01:30  EX4200-A vccpd[1282]: ifl vcp-1.32768 set up, ifl flags 0, flags 1
Dec 18 05:01:30  EX4200-A vccpd[1282]: interface vcp-1 came up
Dec 18 05:01:30  EX4200-A chassism[1280]: cm_ff_ifd_disable: fast failover disabled for internal-1/27
Dec 18 05:01:30  EX4200-A vccpd[1282]: Member 0, interface vcp-1.32768 came up
Dec 18 05:01:30  EX4200-A vccpd[1282]: Member 0, interface vcp-0.32768 came up
Dec 18 05:01:30  EX4200-A vccpd[1282]: Member 1, interface vcp-1.32768 came up
Dec 18 05:01:30  EX4200-A vccpd[1282]: Member 1, interface vcp-0.32768 came up
Dec 18 05:01:36  EX4200-A chassism[1280]: cm_ff_vcp_port_add: fast failover received VCP port add on dev 0 port 26
Dec 18 05:01:36  EX4200-A chassism[1280]: cm_ff_vcp_port_add: fast failover received VCP port add on dev 0 port 27
Dec 18 05:01:36  EX4200-A chassism[1280]: cm_ff_vcp_port_add: fast failover received VCP port add on dev 1 port 26
Dec 18 05:01:36  EX4200-A chassism[1280]: cm_ff_vcp_port_add: fast failover received VCP port add on dev 1 port 27
Dec 18 05:01:36  EX4200-A chassism[1280]: CM_CHANGE: Member 0->0, Mode M->M, 0M 1B, GID 0, Master Unchanged, Members Changed
Dec 18 05:01:36  EX4200-A chassism[1280]: CM_CHANGE: 0M 1B
Dec 18 05:01:36  EX4200-A chassism[1280]: CM_CHANGE: Signaling license service
Dec 18 05:01:36  EX4200-A chassism[1280]: mvlan_member_change_add: member id 1 (my member id 0, my role 1)
Dec 18 05:01:36  EX4200-A chassism[1280]: mvlan_ifl_create: Creating ifl, name bme0, subunit 32770
Dec 18 05:01:36  EX4200-A chassism[1280]: mvlan_rts_ifl_op: IFL idx is 8 is created
Dec 18 05:01:39  EX4200-A chassisd[1298]: CHASSISD_VERSION_MISMATCH: Version mismatch:   chassisd message version 2   FPC 1 message version 2   local IPC version $Revision: 590540 $   remote IPC version $Revision: 653007 $
Dec 18 05:01:42  EX4200-A license-check[1331]: LICENSE: copy to /config/license from fpc0:/config/.license_priv/
Dec 18 05:01:42  EX4200-A license-check[1331]: LIBJNX_REPLICATE_RCP_ERROR: rcp -r -Ji fpc0:/config/.license_priv/ /config/license : rcp: /config/.license_priv/: No such file or directory
Dec 18 05:01:42  EX4200-A license-check[1331]: LIBJNX_REPLICATE_RCP_ERROR: rcp -r -Ji fpc1:/config/.license_priv/ /config/license : rcp: /config/.license_priv/: No such file or directory
Dec 18 05:01:42  EX4200-A license-check[1331]: copy from member 0 failed
Dec 18 05:01:42  EX4200-A license-check[1331]: LICENSE: copy to /config/license from fpc1:/config/.license_priv/
Dec 18 05:01:42  EX4200-A license-check[1331]: copy from member 1 failed
Dec 18 05:01:50  EX4200-A bdbrepd: Subscriber Management is ready for GRES
Dec 18 05:01:52  EX4200-A license-check[1331]: LICENSE: copy to /config/license from fpc0:/config/.license_priv/
Dec 18 05:01:52  EX4200-A license-check[1331]: LIBJNX_REPLICATE_RCP_ERROR: rcp -r -Ji fpc0:/config/.license_priv/ /config/license : rcp: /config/.license_priv/: No such file or directory
Dec 18 05:01:52  EX4200-A license-check[1331]: copy from member 0 failed
{...}
Dec 18 05:02:39  EX4200-A chassisd[1298]: CHASSISD_FRU_ONLINE_TIMEOUT: fpc_online_timeout: attempt to bring FPC 1 online timed out
Dec 18 05:03:39  EX4200-A chassisd[1298]: CHASSISD_FRU_ONLINE_TIMEOUT: fpc_online_timeout: attempt to bring FPC 1 online timed out
Dec 18 05:03:39  EX4200-A chassisd[1298]: CHASSISD_FRU_UNRESPONSIVE: Error for FPC 1: attempt to bring online timed out; restarted it
Dec 18 05:03:39  EX4200-A chassisd[1298]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 1 offline: Restarting unresponsive board
Dec 18 05:03:39  EX4200-A chassisd[1298]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(1)
Dec 18 05:03:39  EX4200-A chassisd[1298]: CHASSISD_SNMP_TRAP7: SNMP trap generated: FRU removal (jnxFruContentsIndex 7, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: EX4200-48T, 8 POE @ 1/*/*, jnxFruType 3, jnxFruSlot 1)
Dec 18 05:03:40  EX4200-A chassisd[1298]: CHASSISD_VERSION_MISMATCH: Version mismatch:   chassisd message version 2   FPC 1 message version 2   local IPC version $Revision: 590540 $   remote IPC version $Revision: 653007 $

It was Friday and I had a planned upgrade for the following week, so I didn’t have the time to raise a JTAC case (which I should have probably done, but that could come later). With this in mind, I thought I should be able to manually fail over the Routing Engines and upgrade each member the same way, without all of the magic of the NSSU:

NSSU Note
It took longer than expected to do this testing and I had to cancel my change. I found out that currently (as in when this was written) you can’t use NSSU to upgrade from 12.3 to any higher version. This explained why everything was breaking and giving me issues. After raising this with our Technical Account Manager at Juniper, he provided details on what versions of Junos support NSSU on the EX Series.

Soooooooo this is what this post will be about, the success or failure of manually failing over a VC with minimal downtime 🙂

Let’s get cracking!

I was using 2x EX4200 with JUNOS 12.3R5.7; it’s the same setup I had in my previous Virtual Chassis post. I used the preprovisioned method of stacking the switches, and had the following VC specific configuration applied:

show routing-options, show chassis and show virtual-chassis output
[email protected]# show routing-options 
nonstop-routing;
static {
    route 0.0.0.0/0 {
        next-hop 10.1.0.1;
        no-readvertise;
    }
}
[email protected]# show chassis 
redundancy {
    graceful-switchover;
}
[email protected]# show virtual-chassis 
preprovisioned;
no-split-detection;
member 0 {
    role routing-engine;
    serial-number BP0214340104;
}
member 1 {
    role routing-engine;
    serial-number BP0215090120;
}
fast-failover {
    ge;
    xe;
}

It’s important to make sure you have nonstop-routing, graceful-switchover and no-split-detection configured; without these you will most likely get a split-brain effect, and that’s not a good thing!

I’ve got a VM connected to both switches with an LACP bond configured:

[email protected]> show lldp neighbors 
Local Interface    Parent Interface    Chassis Id          Port info          System Name
ge-0/0/2.0         ae1.0               00:0c:29:4f:26:bb   eth1               km-vm1              
ge-1/0/2.0         ae1.0               00:0c:29:4f:26:bb   eth2               km-vm1

and I have the VM pinging its default gateway (192.31.1.1), which is the L3 interface on the switch:

[email protected]:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.31.1.1      0.0.0.0         UG    0      0        0 bond0
10.1.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.31.1.0      0.0.0.0         255.255.255.0   U     0      0        0 bond0

Now everything is sorted, let’s try some stuff!

As the VM is dual-connected to both members, I’ll shut down the interfaces and the VCP ports of the backup switch, upgrade it, and then do the same on the master switch. In essence, I’ll be breaking the VC to upgrade each switch individually. I’ll be running a continuous ping from the VM and will be able to see if any packets are dropped during this work.
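For reference, the rolling ping is nothing fancy; a minimal sketch of the sort of thing you’d run on a Linux VM (the log file name is just an example), so the loss can be totted up afterwards:

ping 192.31.1.1 | tee vc-upgrade-ping.log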

I start with the backup member. I have to disable its data interface and break the virtual chassis by disabling the VCP ports. First, I had to copy the Junos package from member 0 to member 1, as I’d have no access to member 0 once the virtual chassis had been broken.

[email protected]> file copy /tmp/jinstall-ex-4200-13.2X51-D35.3-domestic-signed.tgz fpc1:/tmp/

This will copy the package from member 0 to member 1, confirmed by entering the shell and checking the /tmp folder on member 1:

{backup:1}
[email protected]> start shell 
[email protected]:BK:1% cd /tmp/
[email protected]:BK:1% ls -la
total 234744
drwxrwxrwt   3 root  wheel           512 Dec 18 14:57 .
drwxr-xr-x  23 root  wheel           512 Dec 18 04:06 ..
-rw-r--r--   1 root  wheel            92 Dec 18 12:13 .clnpkg.LCK
-rw-r--r--   1 root  wheel            92 Dec 18 12:13 .pkg.LCK
drwxrwxr-x   2 root  operator        512 Dec 18 12:10 .snap
-rw-r--r--   1 root  wheel     120120669 Dec 18 14:58 jinstall-ex-4200-13.2X51-D35.3-domestic-signed.tgz
-rw-r--r--   1 root  wheel           393 Dec 18 12:10 partitions.spec
[email protected]:BK:1% exit

Next, disable the member 1 port, in my case ge-1/0/2, with deactivate interfaces ge-1/0/2.
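A minimal configuration-mode sketch of that step; it’s the commit that actually takes the port down:

deactivate interfaces ge-1/0/2
commit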

[email protected]# run show interfaces ge-1/0/2          
Physical interface: ge-1/0/2, Administratively down, Physical link is Down

The server dropped 3 packets, which is acceptable to most; so far so good. Next, I disabled the VCP ports on member 1 and member 0 and then consoled onto member 1.

[email protected]> request virtual-chassis vc-port set interface vcp-0 member 1 disable 
  [email protected]> request virtual-chassis vc-port set interface vcp-1 member 1 disable
  [email protected]> request virtual-chassis vc-port set interface vcp-0 disable 
  [email protected]> request virtual-chassis vc-port set interface vcp-1 disable

On member 1, it automatically took mastership and doesn’t see member 0 anymore:

{master:1}
[email protected]> show virtual-chassis status 

Preprovisioned Virtual Chassis
Virtual Chassis ID: e8a9.d27b.0f05
Virtual Chassis Mode: Enabled
                                           Mstr           Mixed Neighbor List
Member ID  Status   Serial No    Model     prio  Role      Mode ID  Interface
0 (FPC 0)  NotPrsnt BP0214340104 ex4200-48t
1 (FPC 1)  Prsnt    BP0215090120 ex4200-48t 129  Master*      N

The server is still pinging along, so now we can upgrade the backup member as if it were a standalone device. We’ll run request system software add /tmp/jinstall-ex-4200-13.2X51-D35.3-domestic-signed.tgz validate reboot

Once member 1 rebooted, I had to wait for a bit as it was looking for the master (due to the preprovisioned config) and it initially booted as a linecard; however, it changed back to master after I entered operational mode.

Next, I enabled the member 1 port with activate interfaces ge-1/0/2.

Comment
When I went to commit the change, it took an awfully long time to activate the interface; however, with a bit of patience the interface did come back up… eventually! Patience is the key!

To double-check and confirm it was up, I checked the LLDP neighbors:

[email protected]> show interfaces ge-1/0/2   
Physical interface: ge-1/0/2, Enabled, Physical link is Up
{master:1}
[email protected]> show lldp neighbors 
Local Interface    Parent Interface    Chassis Id          Port info          System Name
ge-1/0/2.0         ae1.0               00:0c:29:4f:26:bb   eth2               km-vm1

Now disable the member 0 port, in my case ge-0/0/2, with deactivate interfaces ge-0/0/2.

The server dropped 47 packets after the interface was disabled. This was most likely down to the convergence time for the LACP bond when the port went down, as shown in the log messages:

Logs
Dec 18 17:01:19  EX4200-A /kernel: Percentage memory available(19)less than threshold(20 %)- 14
Dec 18 17:01:50  EX4200-A dcd[5164]: ae0 : Warning: aggregated-ether-options link-speed no kernel value! default to  0
Dec 18 17:01:50  EX4200-A dcd[5164]: check_prot: p_ae NULL, ifdp->ifdp_type is 25 ifdp_ifname ae1
Dec 18 17:01:50  EX4200-A mgd[3713]: UI_CHILD_EXITED: Child exited: PID 5164, status 1, command '/sbin/dcd'
Dec 18 17:03:27  EX4200-A mgd[3713]: UI_COMMIT: User 'root' requested 'commit' operation (comment: none)
Dec 18 17:03:34  EX4200-A /kernel: Percentage memory available(19)less than threshold(20 %)- 15
Dec 18 17:04:07  EX4200-A dcd[5205]: ae0 : Warning: aggregated-ether-options link-speed no kernel value! default to  0
Dec 18 17:04:07  EX4200-A dcd[5205]: ae1 : Warning: aggregated-ether-options link-speed no kernel value! default to  0
Dec 18 17:04:12  EX4200-A lldpd[1326]: UI_CONFIGURATION_ERROR: Process: lldpd, path: , statement: , Configuration database open failure: Database is already open
Dec 18 17:04:13  EX4200-A mgd[3713]: UI_DBASE_LOGOUT_EVENT: User 'root' exiting configuration mode
Dec 18 17:04:52  EX4200-A dcd[1297]: ae0 : aggregated-ether-options link-speed set to kernel value of  10000000000
Dec 18 17:04:52  EX4200-A dcd[1297]: ae1 : Warning: aggregated-ether-options has no childern! link-speed set to  0
Dec 18 17:04:52  EX4200-A /kernel: ae_bundlestate_ifd_change: bundle ae1: bundle IFD minimum links not met 0 < 1
Dec 18 17:04:52  EX4200-A mib2d[1304]: SNMP_TRAP_LINK_DOWN: ifIndex 655, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1
Dec 18 17:04:52  EX4200-A /kernel: GENCFG: op 22 (Sflow) failed; err 1 (Unknown)
Dec 18 17:04:52  EX4200-A /kernel: drv_ge_misc_handler: ifd:135  new address:cc:e1:7f:2b:82:85
Dec 18 17:04:53  EX4200-A mib2d[1304]: SNMP_TRAP_LINK_DOWN: ifIndex 708, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.0
Dec 18 17:04:53  EX4200-A mib2d[1304]: SNMP_TRAP_LINK_DOWN: ifIndex 506, ifAdminStatus down(2), ifOperStatus down(2), ifName ge-0/0/2

With the server passing traffic over member 1, I could upgrade member 0, which was the same as before: request system software add /tmp/jinstall-ex-4200-13.2X51-D35.3-domestic-signed.tgz validate reboot

Same as member 1, it came back up after its reboot, but the switch took an age to find the master and just as long to commit the activation of interface ge-0/0/2! Extreme patience needed!

Confirmation that the link is up and I have an LLDP neighbor:

{master:0}
[email protected]> show lldp neighbors 
Local Interface    Parent Interface    Chassis Id          Port info          System Name
ge-0/0/2.0         ae1.0               00:0c:29:4f:26:bb   eth1               km-vm1
{master:0}
[email protected]> show interfaces ge-0/0/2   
Physical interface: ge-0/0/2, Enabled, Physical link is Up

Both members are now on the same code, as expected:

{master:0}
[email protected]> show version 
fpc0:
--------------------------------------------------------------------------
Hostname: EX4200-A
Model: ex4200-48t
JUNOS EX  Software Suite [13.2X51-D35.3]
JUNOS FIPS mode utilities [13.2X51-D35.3]
JUNOS Online Documentation [13.2X51-D35.3]
JUNOS EX 4200 Software Suite [13.2X51-D35.3]
JUNOS Web Management [13.2X51-D35.3]

{master:1}
[email protected]> show version 
fpc1:
--------------------------------------------------------------------------
Hostname: EX4200-A
Model: ex4200-48t
JUNOS EX  Software Suite [13.2X51-D35.3]
JUNOS FIPS mode utilities [13.2X51-D35.3]
JUNOS Online Documentation [13.2X51-D35.3]
JUNOS EX 4200 Software Suite [13.2X51-D35.3]
JUNOS Web Management [13.2X51-D35.3]

To get them joined back into the virtual chassis, I enabled the VCP ports on member 0 and hoped this would bring them back together with no issues (he says!!!)

{master:0}
[email protected]> request virtual-chassis vc-port set interface vcp-0    
{master:0}
[email protected]> request virtual-chassis vc-port set interface vcp-1

To finish off, I ran the command request system snapshot slice alternate all-members to make sure the backup partition image was consistent with the primary

And finally everything is complete! I confirmed the virtual chassis, firmware version and LLDP neighbors, and upgraded the backup partition! Never forget to do this!

show virtual-chassis, show version, show lldp neighbors and request system snapshot slice alternate output
[email protected]> show virtual-chassis    

Preprovisioned Virtual Chassis
Virtual Chassis ID: e8a9.d27b.0f05
Virtual Chassis Mode: Enabled
                                                Mstr           Mixed Route Neighbor List
Member ID  Status   Serial No    Model          prio  Role      Mode  Mode ID  Interface
0 (FPC 0)  Prsnt    BP0214340104 ex4200-48t     129   Master*      N  VC   1  vcp-0      
                                                                           1  vcp-1      
1 (FPC 1)  Prsnt    BP0215090120 ex4200-48t     129   Backup       N  VC   0  vcp-0      
                                                                           0  vcp-1

[email protected]> show version              
fpc0:
--------------------------------------------------------------------------
Hostname: EX4200-A
Model: ex4200-48t
JUNOS EX  Software Suite [13.2X51-D35.3]
JUNOS FIPS mode utilities [13.2X51-D35.3]
JUNOS Online Documentation [13.2X51-D35.3]
JUNOS EX 4200 Software Suite [13.2X51-D35.3]
JUNOS Web Management [13.2X51-D35.3]

fpc1:
--------------------------------------------------------------------------
Hostname: EX4200-A
Model: ex4200-48t
JUNOS EX  Software Suite [13.2X51-D35.3]
JUNOS FIPS mode utilities [13.2X51-D35.3]
JUNOS Online Documentation [13.2X51-D35.3]
JUNOS EX 4200 Software Suite [13.2X51-D35.3]
JUNOS Web Management [13.2X51-D35.3]

[email protected]> show lldp neighbors 
Local Interface    Parent Interface    Chassis Id          Port info          System Name
ge-0/0/2.0         ae1.0               00:0c:29:4f:26:bb   eth1               km-vm1              
ge-1/0/2.0         ae1.0               00:0c:29:4f:26:bb   eth2               km-vm1
[email protected]> request system snapshot slice alternate  
fpc0:
--------------------------------------------------------------------------
Formatting alternate root (/dev/da0s1a)...
Copying '/dev/da0s2a' to '/dev/da0s1a' .. (this may take a few minutes)
The following filesystems were archived: /

fpc1:
--------------------------------------------------------------------------
Formatting alternate root (/dev/da0s2a)...
Copying '/dev/da0s1a' to '/dev/da0s2a' .. (this may take a few minutes)
The following filesystems were archived: /

From the running pings:

--- 192.31.1.1 ping statistics ---
9365 packets transmitted, 9234 received, +42 errors, 1% packet loss, time 9377278ms
rtt min/avg/max/mdev = 0.771/1.162/11.807/0.370 ms, pipe 3
[email protected]:~$

There was 1% packet loss over the whole duration of the test (156 minutes), working out as a 93.77-second outage, which isn't too bad. Considering this was the first time I tried this method, I’ll be going over it again because it took far too long, but overall this method works!

I also messed about with the different types of bonding methods available:

With round-robin, or bond type 0, the switch ports were configured as two access ports and I saw high packet loss during the testing.

--- 192.31.1.1 ping statistics ---
6106 packets transmitted, 3125 received, 48% packet loss, time 6128448ms
rtt min/avg/max/mdev = 0.814/1.484/902.641/16.131 ms

This was due to the nature of the round-robin bonding method.

Round-robin policy to transmit packets in sequential order from the first available slave through the last. This mode provides load balancing and fault tolerance.

With active-backup, or bond type 1, the switch was configured as two access ports and I saw no packet loss during the testing. A slight difference when using active-backup (as expected, to be honest) is that when you check the LLDP neighbors you’ll only see one interface up at a time.

This is due to the nature of the bond-type

Only one slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The bond's MAC address is externally visible on only one port (network adapter) to avoid confusing the switch.
Ping output and LLDP difference
--- 192.31.1.1 ping statistics ---
2905 packets transmitted, 2892 received, 0% packet loss, time 2908023ms
rtt min/avg/max/mdev = 0.846/1.214/20.269/0.758 ms
[email protected]> show lldp neighbors 
Local Interface    Parent Interface    Chassis Id          Port info          System Name
ge-0/0/2.0         -                   00:0c:29:4f:26:bb   eth1               km-vm1              
vme.0              -                   00:19:06:cd:8f:80   GigabitEthernet1/0/36 oob-sw0-10.lab      
xe-0/1/0.0         ae0.0               78:fe:3d:46:2a:c0   xe-0/0/2.0         EX4500

Having got a method that worked, below are some of the methods I tried and failed with. Looking back, the two methods I tried were never going to work; however, this is why you have a lab, and it’s always good to see things for yourself and see if you can troubleshoot your way out! With all that being said, I’ve actually picked up a few things I didn’t know, so this was a good exercise!

Tester Method #1
Upgrade member 1, then see if you can fail over the routing-engine from member 0 to member 1. The issue that could arise is that the routing-engine will not fail over, as the two switches will be on different versions of code and member 1 will not join back up as the backup routing-engine.

I started the upgrade on member 1 running:

request system software add /tmp/jinstall-ex-4200-13.2X51-D35.3-domestic-signed.tgz member 1 reboot

Once the upgrade had completed, I checked the virtual chassis and, as I thought, member 1 didn’t join back into the VC as the backup routing-engine:

[email protected]> show virtual-chassis    

Preprovisioned Virtual Chassis
Virtual Chassis ID: e8a9.d27b.0f05
Virtual Chassis Mode: Enabled
                                           Mstr           Mixed Neighbor List
Member ID  Status   Serial No    Model     prio  Role      Mode ID  Interface
0 (FPC 0)  Prsnt    BP0214340104 ex4200-48t 129  Master*      N  1  vcp-0      
                                                                 1  vcp-1      
1 (FPC 1)  Inactive BP0215090120 ex4200-48t 129  Linecard     N  0  vcp-0      
                                                                 0  vcp-1

And the logs showed the code mismatch and the timeout of member 1 rejoining the VC. This method is out, but then that was expected, to be honest.

Tester Method #2
Upgrade using the NSSU method and, when it gets stuck, see if you can abort and fail over. This method sounds like it’s a bit of a hack and won’t work; however, we’re in the lab so it doesn’t matter, and if it works then yaaay!

I ran the command to kick off the NSSU:

request system software nonstop-upgrade /tmp/jinstall-ex-4200-13.2X51-D35.3-domestic-signed.tgz
Chassis ISSU Check Done
[Dec 18 08:41:50]:ISSU: Validating Image
[Dec 18 08:42:20]:ISSU: Preparing Backup RE
[Dec 18 08:42:21]: Installing image on other FPC's along with the backup

[Dec 18 08:42:21]: Checking pending install on fpc1
[Dec 18 08:43:21]: Pushing bundle to fpc1
NOTICE: Validating configuration against mchassis-install.tgz.
NOTICE: Use the 'no-validate' option to skip this if desired.
WARNING: A reboot is required to install the software
WARNING:     Use the 'request system reboot' command immediately
[Dec 18 08:44:23]: Completed install on fpc1
[Dec 18 08:44:34]: Backup upgrade done
[Dec 18 08:44:34]: Rebooting Backup RE

Rebooting fpc1
[Dec 18 08:44:34]:ISSU: Backup RE Prepare Done
[Dec 18 08:44:34]: Waiting for Backup RE reboot

Having a console on member 1, I could see that member 1 had joined the VC cluster:

{backup:1}
[email protected]> show virtual-chassis 

Preprovisioned Virtual Chassis
Virtual Chassis ID: e8a9.d27b.0f05
Virtual Chassis Mode: Enabled
                                                Mstr           Mixed Route Neighbor List
Member ID  Status   Serial No    Model          prio  Role      Mode  Mode ID  Interface
0 (FPC 0)  Prsnt    BP0214340104 ex4200-48t     129   Master       N  VC   1  vcp-0      
                                                                           1  vcp-1      
1 (FPC 1)  Prsnt    BP0215090120 ex4200-48t     129   Backup*      N  VC   0  vcp-0      
                                                                           0  vcp-1 

I aborted the NSSU on the member 0 console and tried to fail over the routing-engine. However, when I aborted the NSSU it took over an hour to get the operational prompt, and once I got there, the VC cluster had detached and was back to Master and Linecard. This makes sense now: with the switches out of the NSSU process, it just goes back to seeing two mismatched Junos versions.

{master:0}
[email protected]> show virtual-chassis 

Preprovisioned Virtual Chassis
Virtual Chassis ID: e8a9.d27b.0f05
Virtual Chassis Mode: Enabled
                                           Mstr           Mixed Neighbor List
Member ID  Status   Serial No    Model     prio  Role      Mode ID  Interface
0 (FPC 0)  Prsnt    BP0214340104 ex4200-48t 129  Master*      N  1  vcp-0      
                                                                 1  vcp-1      
1 (FPC 1)  Inactive BP0215090120 ex4200-48t 129  Linecard     N  0  vcp-0      
                                                                 0  vcp-1

This method is out (but then this was expected)

Side Notes
Note 1
If you want to change it, you can release routing-engine mastership by running the command request chassis routing-engine master release. However, using this command will cause an outage: the PFE will switch over and, without nonstop-routing and graceful-switchover in effect, an outage will happen.
Note 1.5
Additionally, whatever config changes you made while the members were separated will not be kept if you switch over the PFE. I saw that when I enabled interface ge-1/0/2 on member 1: when the PFE was switched over, it became inactive.
Rollback Firmware Upgrade
[email protected]> request system software rollback member 1 reboot 
fpc1:
--------------------------------------------------------------------------
Junos version '12.3R5.7' will become active at next reboot
Rebooting ...
shutdown: [pid 1280]
Shutdown NOW!

Then, once member 1 had rebooted, I checked to make sure it was present in the virtual chassis:

[email protected]> show virtual-chassis                                

Preprovisioned Virtual Chassis
Virtual Chassis ID: e8a9.d27b.0f05
Virtual Chassis Mode: Enabled
                                           Mstr           Mixed Neighbor List
Member ID  Status   Serial No    Model     prio  Role      Mode ID  Interface
0 (FPC 0)  Prsnt    BP0214340104 ex4200-48t 129  Master*      N  1  vcp-0      
                                                                 1  vcp-1      
1 (FPC 1)  Prsnt    BP0215090120 ex4200-48t 129  Backup       N  0  vcp-0      
                                                                 0  vcp-1

Upgrading a SRX Chassis Cluster


In my previous post, I successfully failed over the redundancy groups on the cluster using the Manual Failover and Interface Failure methods. This post will look into the methods that can be used when upgrading an SRX Chassis Cluster.

Testing Information
i) I have SCP’d the latest recommended version of Junos (12.1X44-D45.2) onto both Node0 and Node1. The package is located under the /var/tmp folder. You can get to this folder via the CLI: from Operational Mode, start shell then cd /var/tmp (see the short sketch after this list).
ii) I will have rolling pings between the trust <--> untrust zones in separate terminal windows, so I can see when the outage starts and can time its length.
iii) All commands will be run from Node0, unless stated otherwise.
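As mentioned in point i), getting at /var/tmp is just a case of dropping into the shell; a minimal sketch (df -h is also handy here for the ICU disk-space check covered later):

start shell
cd /var/tmp
ls -la
df -h
exit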

You have two methods of upgrading an SRX Cluster:

Method A (Individual Node upgrades)

Disclaimer
Using this method of chassis cluster upgrade has a SERVICE DISRUPTION of 3-5 minutes minimum. You will need to ensure that you have considered the business impact of this method of upgrade.

This method can be used for downgrading Junos as well as upgrading, and has no Junos version limitation. With this method you are simply upgrading both individual nodes at the same time. As I have already uploaded the Junos image onto both nodes, I will need to run the command on BOTH Node0 and Node1 from Operational Mode:

{primary:node0}
[email protected]_SRX220_Top> request system software add /var/tmp/junos-srxsme-12.1X44-D45.2-domestic.tgz
{secondary:node1}
[email protected]_SRX220_Top> request system software add /var/tmp/junos-srxsme-12.1X44-D45.2-domestic.tgz

Once they have been added, you will need to reboot both Nodes simultaneously. You can use request system reboot node all from Node0

After the reboot, you will need to update the backup image of Junos on both Nodes, to have a consistent primary and backup image.
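On the branch SRX boxes I’ve seen, refreshing the backup image is a snapshot run on each node once it’s back up; a minimal sketch (check the exact snapshot options for your platform and Junos version):

request system snapshot slice alternate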

Method B (In Service Software Upgrades)

Before I begin: with in-service updates, Juniper have two types of in-service upgrade. The High-End Data Centre SRX models (SRX1400, SRX3400, SRX5600 and SRX5800) use In-Service Software Upgrade (ISSU), and the Small/Medium Branch SRX models (SRX100, SRX110, SRX220, SRX240 and SRX650) use In-Band Cluster Upgrade (ICU). Although the commands are near enough the same, the pre-upgrade requirements, service impacts and the minimum Junos firmware version that supports in-service upgrades are different.

As I’m using 2x SRX220H2 firewalls, I will be upgrading via ICU. When I get the chance to upgrade a High-End SRX model, I will update the post with my findings :p

Even before you consider using the ISSU/ICU method, I am telling you (no recommendation here!!) to check the Juniper page Limitations on ISSU and ICU. The page confirms what versions of Junos are supported by ISSU/ICU and, more importantly, the services that are not supported by ISSU/ICU. In essence, you will need to check what services you are running on your SRX cluster to see if they are supported. If they are not supported, then you are told DO NOT perform an upgrade with this method.

With that out of the way, and having checked that your cluster is fully supported (firmware and services) by ISSU/ICU, you can proceed with the pre-checks 😀

Pre-Upgrade Checks ICU
Junos Version: You will need to be running Junos version 11.2R2 at a minimum. This can be checked by running show version on both Nodes.
No-sync option: ICU is available with the no-sync option only. The no-sync option disables the flow state from syncing to the second node when it boots with the new Junos image.
Downgrade Method?: You CAN NOT use ICU to downgrade Junos to a version lower than 11.2R2.
Disk Space: You will need to check the disk space available in the /var/tmp folder on the SRX. From Operational Mode, start shell then enter the command df -h and you will get the disk space available.

Having confirmed all the pre-checks are good, we can proceed with the upgrade. It is important to note that during an ICU there WILL BE A SERVICE DISRUPTION of approximately 30 seconds with the no-sync option. During these 30 seconds traffic will be dropped and flow sessions will be lost. You will need to keep this in mind if you are doing this upgrade in-hours or need a good record of your flow sessions for any reason.

To start the upgrade, we need to run request system software in-service-upgrade /path/to/package no-sync

{primary:node0}
[email protected]_SRX220_Top> request system software in-service-upgrade /var/tmp/junos-srxsme-12.1X44-D45.2-domestic.tgz no-sync
ICU Console observations
The observations below cover the reboots, the upgrade order, the Node0 to Node1 failover process and the end-host view point.
It is important to note that during the ICU process you won’t need to do any manual reboots; all the reboots are automated within the process:

WARNING: in-service-upgrade shall reboot both the nodes
         in your cluster. Please ignore any subsequent 
         reboot request message
Once the process has started Node1 is upgraded first:

ISSU: start downloading software package on secondary node
Pushing bundle to node1
{.......}
JUNOS 12.1X44-D45.2 will become active at next reboot
WARNING: A reboot is required to load this software correctly
WARNING:     Use the 'request system reboot' command
WARNING:         when software installation is complete
Saving state for rollback ...
ISSU: failover all redundancy-groups 1...n to primary node
Successfully reset all redundancy-groups priority back to configured priority.
Successfully reset all redundancy-groups priority back to configured priority.
Initiated manual failover for all redundancy-groups to node0
Redundancy-groups-0 will not failover and the primaryship remains unchanged.
ISSU: rebooting Secondary Node
Shutdown NOW!
[pid 13353]
ISSU: Waiting for secondary node node1 to reboot.
ISSU: node 1 went down
ISSU: Waiting for node 1 to come up
It takes a few minutes for node0 to reboot after node1 comes back online, so if you have console connections on both SRXs you will need to be patient before aborting the upgrade. If you have rolling pings going to each node’s fxp interface, you will know when node0 is about to reboot, as node1’s pings will return. Once node1 is up and booted, node0 will start to reboot.

ISSU: node 1 came up
ISSU: secondary node node1 booted up.
Shutdown NOW!
From hitting enter to having both firewalls upgraded took 22:45 minutes. Although the documentation said there would be an outage of 30 seconds, the rolling ping between trust <--> untrust showed effectively no packet loss: only 6 packets out of 1600 transmitted weren’t received. (Saying that, for my testing I was unable to get live flow session information.)

root> ping 172.16.0.2 routing-instance trust 
--- 172.16.0.2 ping statistics ---
1600 packets transmitted, 1594 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.720/2.640/13.673/0.652 ms
--------------------------------------------------------------------------
root> ping 192.168.0.2 routing-instance untrust
--- 192.168.0.2 ping statistics ---
1600 packets transmitted, 1594 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.838/2.535/13.669/0.681 ms
To verify that the upgrade has been successful, we can run the command show version:

{secondary:node0}
[email protected]_SRX220_Top> show version 
node0:
--------------------------------------------------------------------------
Hostname: lab_SRX220_Top
Model: srx220h2
JUNOS Software Release [12.1X44-D45.2]

node1:
--------------------------------------------------------------------------
Hostname: lab_SRX220_Top
Model: srx220h2
JUNOS Software Release [12.1X44-D45.2]

And show chassis cluster status, to see that the cluster status is as expected:

[email protected]_SRX220_Top> show chassis cluster status 
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 0
    node0                   100         secondary      no       no  
    node1                   1           primary        no       no  

Redundancy group: 1 , Failover count: 1
    node0                   100         primary        yes      no  
    node1                   1           secondary      yes      no 

We can see that we are running the upgraded version of Junos. As expected, Redundancy Group 0 is primary on Node1 and Redundancy Group 1 is primary on Node0. As discussed in my previous post, with preempt enabled Redundancy Group 1 will automatically fail over to Node0 once it is available. We will have to do a manual failover of Redundancy Group 0 back to Node0 from Node1, and we will need to update the backup image of Junos to have a consistent primary and backup image.

If you have to abort the ICU process, you will need to run request system software abort in-service-upgrade on the primary node. It is important to note that if you do use the abort command, you will put the cluster into an inconsistent state, where the secondary node will be running a newer version of Junos than the primary node. To recover the cluster into a consistent state, you will need to do the following, all on the secondary node:

Recovering from an Inconsistent State
1. Abort the upgrade: request system software abort in-service-upgrade
2. Roll back to the older version of Junos that is still on the primary node: request system software rollback node {node-id}
3. Reboot the node: request system reboot

**UPDATE 29/4/2015**
Luckily enough, as I was finishing up this series of posts, my colleague had finished working on the SRX1400 we have in our lab! So I was able to test an ISSU upgrade on a high-end SRX Series device 😀 Happy Days!!!

SRX1400 testing differences
1. The SRX1400 doesn't have any routing protocols configured, so I will not need to set up graceful restart.
2. I will be upgrading from 12.1X44-D40.2 to 12.1X46-D10.2.
3. The topology will be the same; however, the IP addressing will be different: Trust will be 192.168.13.0/24 and Untrust will be 172.31.13.0/24.
ISSU Pre-Upgrade Checks
Junos version: Check whether the version of Junos code supports ISSU by running show version on both nodes. You will need to be running Junos 9.6 or later.
Downgrade method: ISSU DOES NOT support firmware downgrade!
Routing: Juniper recommends that graceful restart for routing protocols be enabled before starting an ISSU (see the example commands after this list).
Redundancy groups: Manually fail over all redundancy groups so that only one node is active. (In my example, as I have an active/backup setup, nothing needs to change; if you have an active/active setup, you will need to make configuration changes.)
Redundancy Group 0: Once the upgrade has completed, you will need to manually fail over Redundancy Group 0 back to Node0 (see Failover on SRX cluster pt1).
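
As a rough sketch of what those last checks translate to on the CLI (assuming your routing protocols support graceful restart and that you want everything active on node0; adjust the group number and node ID to suit your setup):

set routing-options graceful-restart
commit

request chassis cluster failover redundancy-group 1 node 0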

To start the upgrade, all the redundancy groups first need to fail over to one active node. As I have an active/backup setup, all my redundancy groups are already on node0

{primary:node0}
[email protected]_be-rtr0-h3> show chassis cluster status        
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 3
    node0                   100         primary        no       no  
    node1                   99          secondary      no       no  

Redundancy group: 1 , Failover count: 5
    node0                   100         primary        yes      no  
    node1                   99          secondary      yes      no

To begin the upgrade process, we run request system software in-service-upgrade /path/to/package reboot

Important note
Unlike with the ICU upgrade process, you have to enter the reboot option to confirm that you want a reboot once the upgrade completes. If you don't use the reboot option, the command will fail. This only applies to the high-end SRX devices: SRX1400, SRX3400, SRX3600, SRX5600 and SRX5800.
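
Putting that together, the full command I ran looked something like the one below. The package filename is purely illustrative, so substitute whatever you actually uploaded to /var/tmp:

request system software in-service-upgrade /var/tmp/junos-srx1k3k-12.1X46-D10.2-domestic.tgz reboot
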
ISSU Console observations
It does take quite a while from this point before more output comes to the console on node0, so you will need to be patient.

Validation succeeded
failover all RG 1+ groups to node 0 
Initiated manual failover for all redundancy-groups to node0
Redundancy-groups-0 will not failover and the primaryship remains unchanged.
ISSU: Preparing Backup RE
Pushing bundle to node1
Once node1 is up, you will see the output below:

ISSU: Backup RE Prepare Done
Waiting for node1 to reboot.
node1 booted up.
Waiting for node1 to become secondary
node1 became secondary.
Waiting for node1 to be ready for failover
ISSU: Preparing Daemons

It takes around 5-10 minutes before you see any more output to say the upgrade process is still going on! Again, you will need to be patient as this does take its time!

Secondary node1 ready for failover.
{.......}
Failing over all redundancy-groups to node1
ISSU: Preparing for Switchover
Initiated failover for all the redundancy groups to node1
Waiting for node1 take over all redundancy groups
From hitting enter to having both firewalls upgraded took 30 minutes 18 seconds. The rolling ping between trust <--> untrust showed minimal packet loss: only 2 of the 3639 packets transmitted across both tests weren't received. (As before, I was unfortunately unable to get live flow session information.)

root> ping 172.31.13.2 routing-instance trust 
--- 172.31.13.2 ping statistics ---
1818 packets transmitted, 1817 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.769/3.080/44.226/3.536 ms
--------------------------------------------------------------------------
root> ping 192.168.13.2 routing-instance untrust 
--- 192.168.13.2 ping statistics ---
1821 packets transmitted, 1820 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.831/3.071/44.524/3.244 ms

To verify that the upgrade has been successful, we can run the command show version

{secondary:node0}
[email protected]_be-rtr0-h3> show version 
node0:
--------------------------------------------------------------------------
Hostname: lab_be-rtr0-h3
Model: srx1400
JUNOS Software Release [12.1X46-D10.2]

node1:
--------------------------------------------------------------------------
Hostname: lab_be-rtr0-i3
Model: srx1400
JUNOS Software Release [12.1X46-D10.2]

And show chassis cluster status, to check that the cluster status is as expected

{secondary:node0}
[email protected]_be-rtr0-h3> show chassis cluster status 
Cluster ID: 1 
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 0
    node0                   100         secondary      no       no  
    node1                   99          primary        no       no  

Redundancy group: 1 , Failover count: 1
    node0                   100         primary        yes      no  
    node1                   99          secondary      yes      no 

We can see that we are running the upgraded version of Junos. As expected, Redundancy Group 0 is primary on Node1 and Redundancy Group 1 is primary on Node0. As discussed in my previous post, with preempt enabled Redundancy Group 1 will automatically fail over to Node0 once it is available. We will have to manually fail over Redundancy Group 0 back to Node0 from Node1, and we will need to upgrade the backup image of Junos so that the primary and backup images are consistent.

Unexpected output
During the reboot and the manual failover of redundancy group 0 back to Node0, I got the output below on my console terminal

Message from [email protected]_be-rtr0-h3 at Apr 29 12:26:40  ...
lab_be-rtr0-h3 node0.fpc1.pic0 PFEMAN: Shutting down , PFEMAN Resync aborted! No peer info on reconnect or master rebooted?  

Message from [email protected]_be-rtr0-h3 at Apr 29 12:26:40  ...
lab_be-rtr0-h3 node0.cpp0 RDP: Remote side closed connection: rdp.(17825794:13321).(serverRouter:chassis)

[email protected]_be-rtr0-i3> Apr 29 12:27:04 init: can not access /usr/sbin/ipmid: No such file or directory

Message from [email protected]_be-rtr0-i3 at Apr 29 12:27:05  ...
lab_be-rtr0-i3 node1.cpp0 RDP: Remote side closed connection: rdp.(34603010:33793).(serverRouter:pfe) 

Message from [email protected]_be-rtr0-i3 at Apr 29 12:27:05  ...
lab_be-rtr0-i3 node1.cpp0 RDP: Remote side closed connection: rdp.(34603010:33792).(serverRouter:chassis) 

Message from [email protected]_be-rtr0-i3 at Apr 29 12:27:17  ...
lab_be-rtr0-i3 node1.cpp0 RDP: Remote side reset connection: rdp.(34603010:33794).(primaryRouter:1008) 

Message from [email protected]_be-rtr0-i3 at Apr 29 12:27:18  ...
lab_be-rtr0-i3 node1.cpp0 RDP: Remote side reset connection: rdp.(34603010:33795).(primaryRouter:1007)

I raised this with Juniper and they sent me this article. The article confirms that the error messages are expected if you are connected via the console or the fxp0 interface: “The above mentioned messages, which are generated on the console session, states that the routing-engine [control plane(RG0)] has become active on the other node….These messages are due to the following syslog user configuration: system syslog user *.”

You can stop these messages by deactivating system syslog user *.
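
If you do decide to silence them, a minimal sketch is below (the wildcard may need quoting depending on how you enter it, and note Juniper's recommendation further down about leaving this configuration in place):

deactivate system syslog user *
commit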

Note: Juniper recommends that you keep the ‘syslog user (any emergency)’ configuration and ignore these informational messages, as they can show useful information.

Phew, that was a lot of work and quite a bit to take in there!! Time for a break (a drink or 6 lol)

My next post will be the last post in the SRX Chassis Cluster Series (sad times 🙁 ). It will be a nice simple one on how to disable chassis cluster!


Correcting EX Switch booted from Backup Partition

Reading Time: 3 minutes

This is the scenario: it's 2am on a Saturday morning and you get that dreaded call from your NOC saying that there has been a power outage at one of the DCs. Luckily, it's only a single rack with a top-of-rack EX4200 that you're worried about, and your company has remote hands, so you don't have to leave your house! But you need to hop online to check that everything is ok. You connect to your EX4200 via your terminal server over the Out-of-Band network. You are greeted by this error message:

--- JUNOS 12.3R5.7 built 2013-12-18 01:32:43 UTC

***********************************************************************
**                                                                   **
**  WARNING: THIS DEVICE HAS BOOTED FROM THE BACKUP JUNOS IMAGE      **
**                                                                   **
**  It is possible that the primary copy of JUNOS failed to boot up  **
**  properly, and so this device has booted from the backup copy.    **
**                                                                   **
**  Please re-install JUNOS to recover the primary copy in case      **
**  it has been corrupted and if auto-snapshot feature is not        **
**  enabled.                                                         **
**                                                                   **
***********************************************************************

Oh lawd, you think this is going to be a big issue. Luckily, it's not as big an issue as you might think (provided your backup Junos image is the same as your primary image; if it's not, well, I don't have a clue what you could do! But I will look into this and write a post haha). You are actually able to correct this issue quite easily, but it will require a reboot.

This is normally caused by a non-graceful shutdown (see what I did there with the scenario :P), which could have corrupted the partition, but really there could be a number of reasons that I won't list!

This post will show how you can fix your primary partition and, after you reboot, how you can check that you have booted off the correct partition.

First let’s check what partitions we have:

[email protected]> show system storage partitions  
fpc0:
--------------------------------------------------------------------------
Boot Media: internal (da0)
Active Partition: da0s1a
Backup Partition: da0s2a
Currently booted from: backup (da0s2a)

Partitions information:
  Partition  Size   Mountpoint
  s1a        183M   altroot   
  s2a        183M   /         
  s3d        369M   /var/tmp  
  s3e        123M   /var      
  s4d        62M    /config   
  s4e               unused (backup config)

As we can see, the switch is running off the backup partition. Let’s get this sorted now:

1. We need to ensure that both the primary and backup partitions are consistent with each other, which in turn fixes the corrupted partition.

request system snapshot media slice alternate

2. Now the primary partition is sorted, we need to make sure we are running off it. Using the command below, after the reboot the switch will boot off the primary partition and clear the system alarms.

request system reboot slice alternate media internal

Note: This could be done as an out-of-hours change, as you can run off your backup partition without issue, but it's not ideal.

3. Having rebooted, we need to verify that everything is ok

[email protected]> show system storage partitions 
fpc0:
--------------------------------------------------------------------------
Boot Media: internal (da0)
Active Partition: da0s1a
Backup Partition: da0s2a
Currently booted from: active (da0s1a)

Partitions information:
  Partition  Size   Mountpoint
  s1a        183M   /         
  s2a        183M   altroot   
  s3d        369M   /var/tmp  
  s3e        123M   /var      
  s4d        62M    /config   
  s4e               unused (backup config)
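
As one last sanity check, the boot-from-backup alarm should also have cleared after the reboot. Depending on the platform and release it shows up under system or chassis alarms, so something like the following should now come back clean:

show system alarms
show chassis alarms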

Boom sorted 😀

This is why it is VERY IMPORTANT to make sure that when you do a Junos upgrade you update your backup image as well.
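
Keeping the two images in step is usually a one-liner on an EX: once the new code has booted cleanly off the primary partition, snapshot it over to the alternate slice and then check that both snapshots report the same release. A rough sketch (again, confirm the syntax on your platform and release before relying on it):

request system snapshot slice alternate
show system snapshot media internal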

**UPDATE 20/4/2015**

Went into the lab today and when I consoled onto a spare EX4200 I was greeted with:

--- JUNOS 11.4R1.6 built 2011-11-15 11:14:01 UTC

***********************************************************************
**                                                                   **
**  WARNING: THIS DEVICE HAS BOOTED FROM THE BACKUP JUNOS IMAGE      **
**                                                                   **
**  It is possible that the primary copy of JUNOS failed to boot up  **
**  properly, and so this device has booted from the backup copy.    **
**                                                                   **
**  Please re-install JUNOS to recover the primary copy in case      **
**  it has been corrupted.                                           **
**                                                                   **
***********************************************************************

I did a snapshot check and saw that the backup image hadn't been upgraded:

[email protected]> show system snapshot media internal 
Information for snapshot on       internal (/dev/da0s1a) (backup)
Creation date: Nov 15 13:40:51 2011
JUNOS version on snapshot:
  jbase  : 11.4R1.6
  jcrypto-ex: 11.4R1.6
  jdocs-ex: 11.4R1.6
  jkernel-ex: 11.4R1.6
  jroute-ex: 11.4R1.6
  jswitch-ex: 11.4R1.6
  jweb-ex: 11.4R1.6
  jpfe-ex42x: 11.4R1.6
Information for snapshot on       internal (/dev/da0s2a) (primary)
Creation date: Dec 18 04:06:12 2013
JUNOS version on snapshot:
  jbase  : ex-12.3R5.7
  jcrypto-ex: 12.3R5.7
  jdocs-ex: 12.3R5.7
  jkernel-ex: 12.3R5.7
  jroute-ex: 12.3R5.7
  jswitch-ex: 12.3R5.7
  jweb-ex: 12.3R5.7

I started to mess about to see if there was any way of fixing the primary without essentially doing a system downgrade (as the image was an older one). To cut a long story short, I couldn’t 🙁

Not through reading Juniper documentation, googling, or random troubleshooting while hoping for the best!

So always keep in mind: when you're doing a firmware upgrade, make sure you upgrade the backup image as well, or you will have to downgrade and then re-upgrade. That would be a painful thing to have to explain to the business during an unexpected outage.
